Building Your First Data Engineering Portfolio: A Step by Step Guide (2024)

In this video, I will show you how you can put your data engineering project on a GitHub repository and build your portfolio so that you can easily add it to your resume. One of the challenges you will face when you create your data engineering project is how to showcase it to an employer or to clients.

Data engineering projects are not like other projects, right? You can't keep them running for a long period of time because you will get charged. So, we need to find a way to host our project in such a way that it doesn't cost us anything. However, if someone is looking at that project, they should be able to understand what it is all about. In this video, I will give you the complete guide on how you can host your projects on a GitHub repository and then put that portfolio on your resume.

So let's get started. This is one of the projects that I've done on my YouTube channel, called Uber Data Analytics. If I were to host this project on my GitHub repository, what are the different steps

that I need to take? In this project, we went through a lot of different things. I'll show you the architecture diagram so you can follow along. We have the data stored in the storage, we have the compute instance running, we have BigQuery, and we have Looker Studio. Now, for this, I can keep Looker Studio running as it

is completely free, but I can't keep running  this compute engine or the BigQuery because it   will cost me after some time. So what I need to do  is find a way to host this project in some other   location and provide detailed documentation  of what I went through in this project. So this is what we will do. Now, for this  project, I have all of the files available,   right? It is available over here. As you can  see, I have my queries, I have my architecture  

diagram available, I have the data model that I created in this project, and I have all of the other files, such as the data that I've used, the extract, transform, and load functions that I've written, and some Jupyter Notebooks. Just like this, when you create your own data engineering project, you need to create this kind of documentation. If you don't have an architecture diagram, you can use tools like Lucidchart or Google Slides; it is pretty easy to make one. You need to keep your data in one location. If you have a data model, include it as well. Basically,

what you need to do is document every single thing  that you have done in that particular project,   from architecture diagrams, data models, to other  documentation about the data such as schemas,   schema details, queries, and code that you  have written. All of these things need to   be in one single location so that you can  easily host it on your GitHub repository. Okay, so once you have this file, I will provide  the link for these files also so if you want to   try it by yourself, you can do that, and you can  follow along. Let's say I have all of these files  

available. Now, what I have to do is go to my GitHub account and click on Repositories. Over here, I can click on New and create my repository. I can easily put this repository on my resume so that anyone interested in the project can come here and look into it. Let's name it something unique like Uber Data Engineering Project. You can also provide a description; let's say Uber Data Engineering Pipeline using Mage AI and BigQuery. You can also use ChatGPT for this, but we will come to the ChatGPT part after some time. For now, we have this much.
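Before uploading anything, it can help to confirm that the project folder really contains the pieces listed earlier (diagram, data model, queries, and so on). The following is only a minimal sketch of such a check. The file names in `EXPECTED_ARTIFACTS` and the function `missing_artifacts` are hypothetical and should be replaced with whatever your own project uses.

```python
from pathlib import Path

# Hypothetical checklist: these names are illustrative and are not the
# exact files from the repository shown in the video.
EXPECTED_ARTIFACTS = [
    "README.md",
    "architecture.jpg",
    "data_model.jpg",
    "queries.sql",
]


def missing_artifacts(project_dir, expected=EXPECTED_ARTIFACTS):
    """Return the expected file names that are not present in project_dir."""
    root = Path(project_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Check the current folder and report anything still to be documented.
    for name in missing_artifacts("."):
        print(f"missing: {name}")
```

Running it inside the project folder prints one line per missing file and prints nothing when everything on the checklist is in place.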

If you want to share the repository with the entire world, you need to keep it public. You can keep it private if you prefer, but then nobody else will be able to see it.

Then, I will add the README file. This is very important because this is where you will write all of the instructions that are needed. Everything else you can keep as it is

and click on create repository. This will create  the empty repository on your GitHub page. Okay,   now what you need to do is understand a few  things. First of all, we need to put all of the   resources into this particular GitHub repository.  So, we have the empty repository created. Now,  

the simplest thing you have to do is upload everything that you have done for your project onto this repository. The steps are pretty simple. All you have to do is click on Add file, click on Upload files, and over here you can either choose your files or drag them in. So, what I will do is drag in my entire folder. I don't want to take the README from here, so I will drop everything else over here. This will upload all of the files from my local PC to the GitHub repository that we just created. Once this upload is complete, you can just click on Commit changes. If you want to add a commit message, you can do that, but I will keep the default upload message and just commit. This will directly commit all of the files to my GitHub repository.

Once this is done, we will look at how we can write the documentation for our project. It is very simple, and for that, we will be modifying this README file. You will find this README file. Click on it. Over here, right now, we just have this much,

the introduction and the description. What you  need to do is click on this particular thing, let   me just adjust my OBS, and click on this button,  edit this file. Now, this is the most important   thing. Everything you will write here about the  project documentation needs to be presented in   a way that if someone goes through your GitHub  repository, they should understand everything   about the project, from what the project is all  about, what the data is, how to execute this   project on their own local PC if they want to do  that, what the architecture diagram is, the data   model, every single thing that you have used, and  the challenges that you have faced. You can put   it over here. In this video, I will keep it short  and try to explain to you the basic fundamentals  

of writing this, but this is all about your  creativity, problems, and challenges that you have   faced while building this project. So, everyone  has a different point of view while building a   project and they face different challenges.  You have to put all these things over here. So, we can start with the basics. This is a  markdown file, so if you want to understand how  

to work with markdown, you can just search for GitHub markdown. You will see the Basic writing and formatting syntax page, where you will find all of the details about how to format text, images, links, and every single thing. So, let's say I want to write a heading. For a heading,

I can just write a single hash and then the  text. It will look something like this. If   I use two hashes, it will look like this. If I  use three hashes, it will look like this. So,   this is the way you can format your text in  the GitHub README file. We'll start with this.   I'll just clear this entire thing and say the  introduction, "Uber Data Analytics." I'll just  

write "Modern Data Engineering Project." This  is just a single heading that I've written. If   you click on the preview, you will be able to see  everything that you have. You can change it. So,   this is the heading. Now, let's say I will create  one more section inside this with two hashes. So,  

I want to write a subheading. There is the main  heading and inside this, I have a second-level   heading. So, with two hashes, let me just zoom  this if you can see it, yeah, with two hashes,   I will write, let's say, I'll give the chapter  name such as "Introduction." I want to introduce  

my project. If I see the preview again, you  will see "Uber," okay, and then I have the   introduction. For the introduction, either you can  write it by yourself or you can use any AI tool to   make it faster. If you have the understanding  of the introduction and want to go ahead,   you can write it. But let's say if you were to use  an AI tool, right now my ChatGPT is not loading,  

so I will be using the Google Gemini. You can just  copy-paste the title and then come here onto the   Google Gemini. You can just write that, "I have  a data engineering project on this. We are using   BigQuery, Looker Studio, and Cloud Storage  to build a complete project. Can you give a   two to three-line introduction for this project?"  Something like this. This will generate a simple   introduction for you. You can just copy-paste if  it looks good and you can paste it over here. This  

project dives into the world of Uber analytics  using modern data engineering practices on GCP.   We'll use these tools and data visualization and  all the other things. This is how you can just get   started and improve on top of it. If you want  to make changes to this, you can go ahead. If  

you click on the preview, you will again see the  proper preview available over here. So you will be   able to see that. Now, let's say this is done. The  introduction is done. The second thing is I want   to give the architecture diagram. So I can again  write the same subheading "Architecture." Now,   for the architecture, I want to provide the image.  What I will do is right-click on this repository   name and open it in a new tab. All of the  information about the project is available over  

here. So if I click on architecture.jpg, you will see it over here. It is available in the root of the repository itself. To display the image, it's pretty simple. All I have to do is check the GitHub markdown page and click on Images: you display an image by adding an exclamation mark, wrapping the alt text in square brackets, and then providing the link to the image in parentheses. It is something like this. I can just copy this and paste it over here. This is the alt text, so I can just write Project architecture. In case the image is not available, this text will be shown. This is where I have to provide the link. Now for the link, I can directly write architecture.jpg. Let's try with that. I'll just try with this architecture.jpg,

the name. If I click here, you will see it is  getting shown. Let's say we make some typos on   this, which is like instead of architecture, I  remove the H, and you will see the alt text as   the project architecture because it is not able  to find the image. It is giving me the alt text.   So this is working. I can see the architecture.  Now this is getting proper. As you can see, our   documentation of the project is proper. Now what  I can do next is add more things. So let's say  

I'll do "Technology Used." For this, I can simply use a list. If I come here and check what is available, there are task lists and regular lists. A simple list works. I can use a dash, a star, or a plus for each item, or I can just use numbers. What I will do is directly use numbers. First, the programming language, which is Python. We have used Python for scripting. For the scripting language, I have used SQL. Then we have used Google Cloud Platform. Inside Google Cloud Platform, you can hit backspace, add a dash over here, and nest one more list under this item. So what I will do is add BigQuery, Cloud Storage, Looker Studio, and Compute Instance. The fourth thing that we used was Mage, a modern data pipeline tool. You can also provide the link over here. If I see the preview, does it look right? Okay, programming language, scripting language, GCP, and Mage. So this looks much better now. I have the architecture, I have the technology used. Now, what is the next thing? If

I were to provide the link to this Mage AI tool, I  can just write "Modern Data Pipeline Tool." I can   come here and write "Mage." I can copy the link of  this and paste it over here. If I see the preview,   it will look something like this. If I want  to bold this, it is also possible. I can just   select this and hit Ctrl+B. This will add double  asterisks over here and you will be able to see   this in the proper format. Again, you can modify  it as per your creativity, but this is how you can  

move forward. Let's say this is the open-source  project. If someone is interested in contributing   to the Mage AI, we also have the GitHub  repository available. If someone is interested in   contributing, you can just add "Contribute to this  project here." I can add the link, and I can keep  

it like this and import this. Pretty simple. If I  come here, modern data pipeline tool contribution   is done here. You can also put it right below.  This is fine. Now, after this, what do we have?   We also have the data available. We also have the  data model available. We also have the scripts   available. So let's talk about that now. We'll  start with the dataset used because this is the   first one. Dataset used. Here you can write a  description of the data. We have used the Uber   dataset, which is the New York Taxi one. I already  have the text available, so I will just paste it  

over here. You can again use ChatGPT or the Gemini  available and generate it. This is what it looks   like. I have the simple text about the data. Here  is the dataset, so you can provide the link to   the dataset. I can come here, click on this, and  right-click on this and copy the link address. I   can paste it over here. This is where, if someone  wants to understand where the dataset is, they can   come here and see the data. If I see the preview  again, you will see the technology used, dataset  

used. Simple. If you want to describe the data, that is also possible on your end. You have already provided one link, but if you want to share the original dataset link as well, which is the New York Taxi data, it is available over here. The goal of this entire thing is to provide as much information as possible for a person who is going through this project. So I can come here

and write "Original Data Source" and paste the  link. If anyone is interested in understanding   this data, they can come here also. We also have  the data dictionary available. This is all you   will see in the data dictionary. I can come here  and provide the link to the data dictionary. Data   dictionary. We can give numbers to this. Original  Data Source 1, Data Dictionary 2, and more info   about the dataset 3. We can add three hashes  here. If I go to the preview, you will see   "Dataset Used" inside the dataset used. We have  a subsection which is more info about the data.  
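The README pieces covered so far (headings with one to three hashes, an image with alt text, a numbered list with a nested dash list, and inline links) can also be assembled in a small script. This is only a sketch of the markdown syntax, not how the README in the video was written. The helper names are made up for illustration, the `example.com` URLs are placeholders, and the `|` separator in the title is an assumption.

```python
def heading(text, level=1):
    """Return a markdown heading: level 1 -> '# text', level 2 -> '## text'."""
    if not 1 <= level <= 6:
        raise ValueError("markdown headings use levels 1 to 6")
    return f"{'#' * level} {text}"


def image(alt_text, path):
    """Return an image tag; alt_text is shown if the image cannot be loaded."""
    return f"![{alt_text}]({path})"


def link(text, url):
    """Return an inline link."""
    return f"[{text}]({url})"


def numbered_list(items):
    """Render items as a numbered list.

    Each item is either a string or a (string, [sub-items]) pair; sub-items
    become an indented dash list nested under their parent.
    """
    lines = []
    for number, item in enumerate(items, start=1):
        label, children = item if isinstance(item, tuple) else (item, [])
        lines.append(f"{number}. {label}")
        lines.extend(f"   - {child}" for child in children)
    return "\n".join(lines)


# Placeholder URLs: swap in the real dataset, dictionary, and tool links.
readme = "\n\n".join([
    heading("Uber Data Analytics | Modern Data Engineering Project"),
    heading("Introduction", 2),
    "Two or three lines describing the project.",
    heading("Architecture", 2),
    image("Project architecture", "architecture.jpg"),
    heading("Technology Used", 2),
    numbered_list([
        "Programming language: Python",
        "Scripting language: SQL",
        ("Google Cloud Platform",
         ["BigQuery", "Cloud Storage", "Looker Studio", "Compute Instance"]),
        "Mage: " + link("modern data pipeline tool", "https://example.com/mage"),
    ]),
    heading("Dataset Used", 2),
    "A short description of the dataset.",
    heading("More info about the dataset", 3),
    numbered_list([
        link("Original Data Source", "https://example.com/source"),
        link("Data Dictionary", "https://example.com/dictionary"),
    ]),
])

print(readme)
```

Pasting the printed text into the README editor and opening the preview tab should show the same sections as in the walkthrough, with the four GCP services indented under the third list item.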

Now this is looking much better because we are  providing more and more information. After this,   let's say we have the data. We describe the data.  We have the architecture. We also have the data   model available, which is this one. So I want to  again explain the data. I can just copy the name   of this, and we can use the same thing. First, I  can use the data model. I can add the section. I  

have to make this big. Data Model. For this, what  I will do is add the same thing as we added here,   exclamation. Between this, we will have the alt  text. For the alt text, we have data model image,   and then we have this, which is where  we can put the name of the data model,   which is datamodel.jpg. If I see the preview, am  I able to see it? Let me check. It's jpg. If I   come preview, am I able to see it? Okay,  now I'm able to see that. If I zoom out,  

you will be able to see it properly. I have this  thing available. Introduction, architecture,   technology used, dataset used, looking much better  now. One by one, we are able to create this entire   thing. After this, what do we have? After this,  the data model is done. Let's say after this,   I want to explain the scripts. Script for project.  For this, what you can do, we have three scripts  

available inside the Mage files: extract, load,  and transform. We can just write the name of this   and provide the link. Let's say if I want to  directly give the link to the scripts. Section   links. I have these links available. I can also  do the relative links, which is like this. We can  

provide like this. First, I'll paste it over here. I'll copy the name of the first one, which is the extract. I'll write extract.py, and I have to provide the path to it, which is mage files/extract.py. This is the link text and this is the file path. If I see the preview and click it over here, I am redirected straight to the file. Pretty simple. Let me just edit it. You can also give it a proper name: Extract Python file. I'll do the same thing for the others. Instead of copy-pasting, let's just write it again. Then we have the load.
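The relative links for the three scripts can be sketched like this. The folder name below is taken from how it is spoken in the walkthrough and may not match the real repository exactly, so treat it as an assumption. `relative_link` is a hypothetical helper. It percent-encodes the path because a space inside a markdown link target would otherwise break the link.

```python
from urllib.parse import quote


def relative_link(text, repo_path):
    """Return a markdown link to a file inside the same repository.

    The path is percent-encoded so a folder name containing a space
    still produces a working link.
    """
    return f"[{text}]({quote(repo_path)})"


# Folder and file names as heard in the walkthrough; the exact folder
# name in the real repository is an assumption.
SCRIPTS = [
    ("Extract Python file", "mage files/extract.py"),
    ("Load Python file", "mage files/load.py"),
    ("Transform Python file", "mage files/transform.py"),
]

script_section = "\n".join(
    f"{number}. {relative_link(text, path)}"
    for number, (text, path) in enumerate(SCRIPTS, start=1)
)

print(script_section)
```

Because the links are relative, they keep working if the repository is renamed or forked, which is a reasonable default for a portfolio project.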

Load Python file. Mage files/load.py. The third one is the transform Python file. Something like this. You can name them as you see fit. Not that complicated. I'll copy the file name inside the Mage files folder, provide the path, and this is done. The script section is done. Now, what else is needed? We have more or less covered everything. There are other things that you can also add: the challenges that you have faced, if you want to describe the data

more, like what was the type of data, if you want  to put some of the arguments that you have done,   the final conclusion, you can put all of these  different things as per your understanding. But   you get the gist of it, right? All you have  to do is understand the markdown and think   about how to put more and more information into  this project so that if someone is going through   this project, they can understand it. At the end,  let's say this project is available on my YouTube,   so I can say complete video tutorial. If you  have a detailed blog, if you have the video,   you can also put it there. Video tutorial, and  I can put the video link. I can paste it over   here. Something like this. If I come here and  see the preview, it looks good. What I will   do is commit the changes, update the README. I  can just add project doc, commit directly to the  

main branch, and do the saving. Now, if you come  here and someone comes from anywhere in the world   to this repository, they will be  able to understand the introduction,   what this project is all about, its architecture,  technology used, dataset used, data model, script,   and complete video tutorial. If you want to  understand more about this, you can just check   the good repositories on GitHub and get more  ideas about it. What I can do is check for,  

let's say, inside GitHub.com, and you will find a  lot of trending libraries. Over here, I can write,   let's say, React. I'll just go to the React, and  I'll see the React repository, Facebook React,   and you will see how they are actually documenting  the entire thing. For React, just like this, there   are numerous projects available on the GitHub  repository. If you want to have an understanding   of how they actually document the entire thing,  you can just go here. You can put the tables,   you can try and experiment with the markdown  file, and you can understand how to make your   project a unique project. When someone comes to  this project, they should be able to understand  

everything from the start. This is how you can  take inspiration from other repositories. I just   gave you the basic understanding of how you can  go about it. This will not cost you anything. It   is completely free, so you can move forward and  do that. This is everything about this video. I   just wanted to quickly cover this part so that you  can have an understanding of how to host your data   engineering project on the GitHub repository and  showcase it on your resume. This is everything. I   hope you learned something new. If you did, then  don't forget to hit the like button and subscribe   to the channel if you're new here. Thank you  for watching. I'll see you in the next video.

2024-08-08 02:57
