Building Your First Data Engineering Portfolio: A Step by Step Guide (2024)
In this video, I will show you how you can put your data engineering project on a GitHub repository and build your portfolio so that you can easily add it to your resume. One of the challenges you will face when you create a data engineering project is how to showcase it to an employer or to clients. In this video, I will give you the complete guide. Data engineering projects are not like other projects, right? You can't keep them running for a long period of time because you will get charged. So,
we need to find a way to host our project in such a way that it doesn't cost us anything. However, if someone is looking at that project, they should be able to understand what this project is all about. In this video, I will give you the complete guide on how you can host your projects on the GitHub repository and then put that portfolio on your resume. So let's get started. This is one of the projects that I've done on my YouTube channel called Uber Data Analytics. If I were to host this project on my GitHub repository, what are the different steps
that I need to take? Right, because, let's say in this project, we went through a lot of different things. To understand that, I'll show you the architecture diagram. We have the data stored in Cloud Storage, we have the compute instance running, we have BigQuery, and we have Looker Studio. Now, for this, I can keep Looker Studio running as it
is completely free, but I can't keep running this compute engine or the BigQuery because it will cost me after some time. So what I need to do is find a way to host this project in some other location and provide detailed documentation of what I went through in this project. So this is what we will do. Now, for this project, I have all of the files available, right? It is available over here. As you can see, I have my queries, I have my architecture
diagram available, I have the data model that I created in this project, and I have all of the other files such as the data that I've used, the main files such as the extract, transform, and load functions that I've written, and some Jupyter Notebooks. Now, just like this, when you create your own data engineering project, you need to create this kind of documentation. Let's say you don't have an architecture diagram; you can use tools like Lucidchart or Google Slides. It is pretty easy to make one. You need to keep your data in one location. If you have a data model, you can create a diagram of it. Basically,
what you need to do is document every single thing that you have done in that particular project, from architecture diagrams, data models, to other documentation about the data such as schemas, schema details, queries, and code that you have written. All of these things need to be in one single location so that you can easily host it on your GitHub repository. Okay, so once you have this file, I will provide the link for these files also so if you want to try it by yourself, you can do that, and you can follow along. Let's say I have all of these files
available. Now, what I have to do is go to my GitHub profile and click on Repositories. Over here, I can click on New and then create my repository. I can easily put this repository on my resume so that if anyone is interested in understanding everything about the project, they can come here and look into it. Okay, so let's name it something unique like Uber Data Engineering Project. You can also provide a description. Okay, so for the description, let's say Uber Data Engineering Pipeline using Mage AI and BigQuery. You can keep this. For this, you can also use ChatGPT, but we will come to the ChatGPT part after some time. For now, we have this much.
You have to keep your repository public, so you can just keep it public, or if you want to keep it private, that's fine, but if you want to share it with the entire world, you need to keep it public. Then, I will add the README file. This is very important because this is where you will write all of the instructions that are needed. Everything else you can keep as it is
and click on create repository. This will create the empty repository on your GitHub page. Okay, now what you need to do is understand a few things. First of all, we need to put all of the resources into this particular GitHub repository. So, we have the empty repository created. Now,
the simplest thing you have to do is upload everything that you have done for your project to this repository. The steps are pretty simple. All you have to do is click on Add file, click on Upload files, and over here you can either choose your files or drag them. So, what I will do is drag my entire folder contents. I don't want to take the README from here, so I will leave it out and drop the rest over here. This will upload all of the files from my local PC to the GitHub repository that we
just created. Once this upload is complete, you can just click on commit changes. If you want to put some comments or add some messages, you can do that, but the default message, "Add files via upload," is fine, and I will just commit. This will directly commit all of the files to my GitHub repository. Once this is done, we will look at how we can write the documentation about our project. It is very simple, not that complicated, and for that, we will be modifying this README file. You will find this README file. Click on it. Over here, right now, we just have this much,
the introduction and the description. What you need to do is click on this particular thing, let me just adjust my OBS, and click on this button, edit this file. Now, this is the most important thing. Everything you will write here about the project documentation needs to be presented in a way that if someone goes through your GitHub repository, they should understand everything about the project, from what the project is all about, what the data is, how to execute this project on their own local PC if they want to do that, what the architecture diagram is, the data model, every single thing that you have used, and the challenges that you have faced. You can put it over here. In this video, I will keep it short and try to explain to you the basic fundamentals
of writing this, but this is all about your creativity, problems, and challenges that you have faced while building this project. So, everyone has a different point of view while building a project and they face different challenges. You have to put all these things over here. So, we can start with the basics. This is a markdown file, so if you want to understand how
to work with markdown, you can just search for "GitHub markdown." If you do, you will find the "Basic writing and formatting syntax" page. You will find all of the details about how to format text, images, links, and every single thing over here. So, let's say I want to write a heading. For a heading,
I can just write a single hash and then the text. It will look something like this. If I use two hashes, it will look like this. If I use three hashes, it will look like this. So, this is the way you can format your text in the GitHub README file. We'll start with this. I'll just clear this entire thing and say the introduction, "Uber Data Analytics." I'll just
write "Modern Data Engineering Project." This is just a single heading that I've written. If you click on the preview, you will be able to see everything that you have. You can change it. So, this is the heading. Now, let's say I will create one more section inside this with two hashes. So,
I want to write a subheading. There is the main heading and inside this, I have a second-level heading. So, with two hashes, let me just zoom this if you can see it, yeah, with two hashes, I will write, let's say, I'll give the chapter name such as "Introduction." I want to introduce
my project. If I see the preview again, you will see "Uber," okay, and then I have the introduction. For the introduction, either you can write it by yourself or you can use any AI tool to make it faster. If you have the understanding of the introduction and want to go ahead, you can write it. But let's say if you were to use an AI tool, right now my ChatGPT is not loading,
so I will be using the Google Gemini. You can just copy-paste the title and then come here onto the Google Gemini. You can just write that, "I have a data engineering project on this. We are using BigQuery, Looker Studio, and Cloud Storage to build a complete project. Can you give a two to three-line introduction for this project?" Something like this. This will generate a simple introduction for you. You can just copy-paste if it looks good and you can paste it over here. This
project dives into the world of Uber analytics using modern data engineering practices on GCP. We'll use these tools and data visualization and all the other things. This is how you can just get started and improve on top of it. If you want to make changes to this, you can go ahead. If
you click on the preview, you will again see the proper preview available over here. So you will be able to see that. Now, let's say this is done. The introduction is done. The second thing is I want to give the architecture diagram. So I can again write the same subheading "Architecture." Now, for the architecture, I want to provide the image. What I will do is right-click on this repository name and open it in a new tab. All of the information about the project is available over
here. So if I click on architecture.jpg, you will see it over here. It is available in the root of the repository itself. To display the image, it's pretty simple. All I have to do is check the GitHub markdown page and click on Images: you display an image by adding an exclamation mark, wrapping the alt text in square brackets, and then putting the link to the image in parentheses. It is something like this. I can just copy this and paste it over here. This is the alt text, so I can just write "Project architecture." In case the image is not available, this text will be shown. This is where I have to provide the link. Now for the link, I can directly write maybe architecture.jpg. Let's try with that. I'll just try with this architecture.jpg,
the name. If I click on the preview, you will see it is getting shown. Let's say we make a typo in this, like instead of "architecture" I remove the H, and you will see the alt text, "Project architecture," because it is not able to find the image. It is giving me the alt text. So this is working. I can see the architecture. Now this is coming along properly. As you can see, the documentation of our project is taking shape. Now what I can do next is add more things. So let's say
I'll do "Technology Used." For this, I can simply use an ordered list. If I come here and look for lists, there are task lists and there are plain lists. The simple list works. I can just use a dash, a star, or a plus for each item, or I can just use numbers. What I will do is directly use the numbers. First, I will put the programming language. I'll just write the programming language as Python. We have used Python for scripting. For the scripting language, I have used SQL. Then we have used Google Cloud
Platform. Under Google Cloud Platform, you can just hit backspace, add a dash over here, and add one more list nested under this item. So what I will do is add BigQuery, Cloud Storage, Looker Studio, and Compute Instance. The fourth thing that we used was Mage, a modern data pipeline tool. You can also provide the link over here. So if I see the preview, do I see it properly? Okay, programming language, scripting language, GCP, and Mage. So this looks much better now. I have the architecture, I have the technology used. Now, what is the next thing? If
I were to provide the link to this Mage AI tool, I can just write "Modern Data Pipeline Tool." I can come here and write "Mage." I can copy the link of this and paste it over here. If I see the preview, it will look something like this. If I want to bold this, it is also possible. I can just select this and hit Ctrl+B. This will add double asterisks over here and you will be able to see this in the proper format. Again, you can modify it as per your creativity, but this is how you can
move forward. Let's say this is the open-source project. If someone is interested in contributing to the Mage AI, we also have the GitHub repository available. If someone is interested in contributing, you can just add "Contribute to this project here." I can add the link, and I can keep
it like this and import this. Pretty simple. If I come here, modern data pipeline tool contribution is done here. You can also put it right below. This is fine. Now, after this, what do we have? We also have the data available. We also have the data model available. We also have the scripts available. So let's talk about that now. We'll start with the dataset used because this is the first one. Dataset used. Here you can write a description of the data. We have used the Uber dataset, which is the New York Taxi one. I already have the text available, so I will just paste it
over here. You can again use ChatGPT or the Gemini available and generate it. This is what it looks like. I have the simple text about the data. Here is the dataset, so you can provide the link to the dataset. I can come here, click on this, and right-click on this and copy the link address. I can paste it over here. This is where, if someone wants to understand where the dataset is, they can come here and see the data. If I see the preview again, you will see the technology used, dataset
used. Simple. If you want to describe the data, it is also possible on your end. If you want to, like, again, you already provided the link, but if you want to share the actual dataset link, which is the New York Taxi data, it is available over here. The goal of this entire thing is to provide as much information as possible for a person who is going through this project. So I can come here
and write "Original Data Source" and paste the link. If anyone is interested in understanding this data, they can come here also. We also have the data dictionary available. This is all you will see in the data dictionary. I can come here and provide the link to the data dictionary. Data dictionary. We can give numbers to this. Original Data Source 1, Data Dictionary 2, and more info about the dataset 3. We can add three hashes here. If I go to the preview, you will see "Dataset Used" inside the dataset used. We have a subsection which is more info about the data.
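The Markdown pieces used up to this point (headings by hash count, an image with alt text, a numbered list with a nested dash list, bold text, links, and a three-hash sub-section) can be sketched as a small Python generator. This is only an illustration, not the exact README from the video: the helper names are hypothetical, every `example.com` URL is a placeholder for the real link, and the title wording and the two numbered items under the sub-section are my reading of what is typed on screen.

```python
"""Sketch: build the README sections described so far as Markdown text.

All helper names are hypothetical, and every example.com URL is a
placeholder to be replaced with the real link.
"""


def heading(text, level=1):
    """Markdown heading: one hash per level, then a space and the text."""
    return f"{'#' * level} {text}"


def image(alt_text, path):
    """Image syntax: exclamation mark, alt text in [], target in ()."""
    return f"![{alt_text}]({path})"


def link(text, url):
    """Inline link: text in [], target in ()."""
    return f"[{text}]({url})"


def bold(text):
    """Bold text is wrapped in double asterisks."""
    return f"**{text}**"


def numbered(items):
    """Numbered list lines: '1. item', '2. item', and so on."""
    return [f"{n}. {item}" for n, item in enumerate(items, start=1)]


def build_readme():
    mage = link(bold("Mage"), "https://example.com/mage")  # placeholder URL
    technology = numbered([
        "Programming language: Python",
        "Scripting language: SQL",
        # Nested dash items are indented so they sit under the numbered item.
        "Google Cloud Platform\n"
        "   - BigQuery\n"
        "   - Cloud Storage\n"
        "   - Looker Studio\n"
        "   - Compute Instance",
        f"Modern data pipeline tool: {mage}",
    ])
    more_info = numbered([
        link("Original data source", "https://example.com/source"),  # placeholder
        link("Data dictionary", "https://example.com/dictionary"),   # placeholder
    ])
    lines = [
        heading("Uber Data Analytics | Modern Data Engineering Project", 1), "",
        heading("Introduction", 2), "",
        "Two to three lines introducing the project go here.", "",
        heading("Architecture", 2), "",
        image("Project architecture", "architecture.jpg"), "",
        heading("Technology Used", 2), "",
        *technology, "",
        heading("Dataset Used", 2), "",
        "A short description of the dataset goes here.", "",
        "Here is the dataset: " + link("dataset", "https://example.com/data"), "",
        heading("More info about the dataset", 3), "",
        *more_info,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_readme())
```

Printing the result and pasting it into the README editor, then checking the preview tab, is a quick way to see how each piece renders before committing.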
Now this is looking much better because we are providing more and more information. After this, let's say we have the data. We describe the data. We have the architecture. We also have the data model available, which is this one. So I want to again explain the data. I can just copy the name of this, and we can use the same thing. First, I can use the data model. I can add the section. I
have to make this big. Data Model. For this, what I will do is add the same thing as we added here, an exclamation mark. Between the square brackets, we will have the alt text. For the alt text, we have "data model image," and then we have the parentheses, which is where we can put the name of the data model file, which is datamodel.jpg. If I see the preview, am I able to see it? Let me check. It's jpg. If I go to the preview, am I able to see it? Okay, now I'm able to see that. If I zoom out,
you will be able to see it properly. I have this thing available. Introduction, architecture, technology used, dataset used, looking much better now. One by one, we are able to create this entire thing. After this, what do we have? After this, the data model is done. Let's say after this, I want to explain the scripts. Script for project. For this, what you can do, we have three scripts
available inside the Mage files: extract, load, and transform. We can just write the name of this and provide the link. Let's say if I want to directly give the link to the scripts. Section links. I have these links available. I can also do the relative links, which is like this. We can
provide like this. First, I'll paste it over here. I'll copy the name of, let's say, first is the extract. Extract.py. I'll give the extract.py, and I have to provide the path to that, which is under mage files/extract.py. This is the link text and this is the file path. If I see the preview and click it over here, it takes me directly to the file. Pretty simple. Let me just edit it. You can also give the name properly. Extract Python file. The same thing I will do, I'll just copy-paste. Instead of copy-pasting, let's just write it again. Then we have the load.
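A relative link like this only works when the path in the parentheses matches a file in the repository, much like the image that fell back to its alt text when the file name had a typo. Here is a minimal Python sketch, assuming the three scripts sit in a folder named the way it is spoken here (`mage files`; the real folder name may differ), that builds the links and reports any path that does not exist locally. The function names are hypothetical, and percent-encoding the space in the path is my own assumption about how to keep the link target valid in Markdown.

```python
from pathlib import Path
from urllib.parse import quote

# Assumed folder and file names, taken from how they are spoken in the video.
SCRIPTS = [
    ("Extract Python file", "mage files/extract.py"),
    ("Load Python file", "mage files/load.py"),
    ("Transform Python file", "mage files/transform.py"),
]


def script_link(label, relative_path):
    """Markdown link to a file in the same repository.

    quote() percent-encodes the space so the link target stays intact.
    """
    return f"[{label}]({quote(relative_path)})"


def missing_targets(repo_root, relative_paths):
    """Return the relative paths that are not files under repo_root."""
    root = Path(repo_root)
    return [p for p in relative_paths if not (root / p).is_file()]


def scripts_section(scripts=SCRIPTS):
    """Render a 'Script for project' section as a numbered list of links."""
    lines = ["## Script for project", ""]
    for number, (label, path) in enumerate(scripts, start=1):
        lines.append(f"{number}. {script_link(label, path)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(scripts_section())
    # Run from the repository root; an empty list means every link resolves.
    print("Missing:", missing_targets(".", [path for _, path in SCRIPTS]))
```

Running the check before committing catches a mistyped path early, instead of finding a broken link in the rendered README afterwards.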
After that, we have the load. Load Python file. Mage files/load.py. The third one is transform Python file. Something like this. You can give it as per your understanding. Not that complicated. I'll put it over this. Copy the file name inside the Mage files. Just provide the path and this is done. Script is done. Now, what else is needed? We have kind of covered everything. There are other things that you can also put. The challenges that you have faced, if you want to describe the data
more, like what was the type of data, if you want to put some of the arguments that you have done, the final conclusion, you can put all of these different things as per your understanding. But you get the gist of it, right? All you have to do is understand the markdown and think about how to put more and more information into this project so that if someone is going through this project, they can understand it. At the end, let's say this project is available on my YouTube, so I can say complete video tutorial. If you have a detailed blog, if you have the video, you can also put it there. Video tutorial, and I can put the video link. I can paste it over here. Something like this. If I come here and see the preview, it looks good. What I will do is commit the changes, update the README. I can just add project doc, commit directly to the
main branch, and do the saving. Now, if you come here and someone comes from anywhere in the world to this repository, they will be able to understand the introduction, what this project is all about, its architecture, technology used, dataset used, data model, script, and complete video tutorial. If you want to understand more about this, you can just check the good repositories on GitHub and get more ideas about it. What I can do is check for,
let's say, inside GitHub.com, and you will find a lot of trending libraries. Over here, I can write, let's say, React. I'll just go to the React, and I'll see the React repository, Facebook React, and you will see how they are actually documenting the entire thing. For React, just like this, there are numerous projects available on the GitHub repository. If you want to have an understanding of how they actually document the entire thing, you can just go here. You can put the tables, you can try and experiment with the markdown file, and you can understand how to make your project a unique project. When someone comes to this project, they should be able to understand
everything from the start. This is how you can take inspiration from other repositories. I just gave you the basic understanding of how you can go about it. This will not cost you anything. It is completely free, so you can move forward and do that. This is everything about this video. I just wanted to quickly cover this part so that you can have an understanding of how to host your data engineering project on the GitHub repository and showcase it on your resume. This is everything. I hope you learned something new. If you did, then don't forget to hit the like button and subscribe to the channel if you're new here. Thank you for watching. I'll see you in the next video.
2024-08-08 02:57