CI/CD pipeline using dbt, Docker, and Jenkins, Simply Business
So today I'll be talking about CI/CD pipelines using dbt, Docker, and Jenkins. First, a little bit of history from us at Simply Business. We've been using dbt for just over two years now, and dbt is pretty much at the center of our mission to build a single source of truth as we expose data to systems and users across the business. Thanks to the hard work of the team, we've significantly reduced the number of times people ask "why is this number different?" or "why doesn't this data match the system?". We believe in empowering users to own and explore their data, and as our user base has grown, it now includes engineers, analysts, and finance users.

With great growth also comes great responsibility, so today I'll be talking about how we combined dbt and Docker to address challenges we faced, and how our solution made it easier for users to work with dbt. We'll also talk about our journey with Jenkins, the problems we faced, and how we implemented a solution that made our deployment process more robust. I'll also touch on what's coming down the line at Simply Business in terms of this project.

The first thing we identified as needing improvement was user setup. Two years ago, our plan was to make the installation process as easy and seamless as possible. If we picture the journey of a new dbt user at Simply Business, they would start by having to install dbt locally on their machine. They also have to make sure that all the dependencies required to run dbt, Python for example, are installed correctly. This process is well documented, but as you can imagine, it's sometimes daunting for someone who isn't very technical to use the command line to run a bunch of scripts when they probably don't even understand what those scripts do. And if there's a problem midway through the installation, most of the time they won't know how to fix it; even when someone comes in to help, that person first needs to work out whether all the steps were followed and all the dependencies installed. All of this takes time just to get the tool up and running.

We also saw situations where users had difficulty upgrading dbt. For example, they didn't know whether to use pip or brew when upgrading, and sometimes they didn't realize the upgrade had also upgraded Python, in some cases to a version that wasn't compatible with dbt, so they had to go through the process of reversing that change in order to use the tool. As you can see, the process can be very smooth if people follow the documentation and know what they're doing, but it can also be a little bumpy. We thought there must be a better way to make the whole process more user-friendly and the tool easier to use. So we sat down in our special virtual room looking for an answer, and the answer we came up with was Docker.

So what is Docker? Docker is a tool designed to make it easier to create, deploy, and run applications. An important element of Docker is images: immutable files that contain all the necessary parts to run an application, such as source code, libraries, and other dependencies. Then there are containers: running instances of Docker images. One thing to note here is that containers require an image to run; they depend on that image and use it to construct the runtime environment for the application. As a reminder: images can exist without containers, but a container needs an image to exist.
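(To make that image/container relationship concrete, here is a minimal illustration; the image name dbt-infra is hypothetical, not the actual Simply Business image.)

```sh
# Build an immutable image from a Dockerfile; the image exists on its own.
docker build -t dbt-infra .

# Start a container, i.e. a running instance of that image. Several
# containers can run from the same image, each in its own isolated environment.
docker run -it dbt-infra
```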
So what are the benefits of using Docker containers? They're portable: they're available from pretty much anywhere and can quickly be shared. They're reliable: because containers operate in an isolated environment, everything remains pretty much consistent between users. And they're lightweight: they share the host machine's operating system, which means, for example, that they use less memory.

Going through our project, our objective here was to dockerize dbt by making use of a common image that could be shared between users and systems, and we planned to make use of containers to run dbt in reliable environments. How did we achieve this? We started by creating a Docker image that contains the dbt infrastructure. In this image we installed dbt and all the other dependencies, such as our dbt helper packages, Python, and, in our case, a Snowflake connector. Once the image was packaged up, we published it to AWS, where it can be accessed by our users. The benefit of using a Docker image to hold the infrastructure is that, as a team, we can control any future upgrades to dbt, or roll out new features once they're in a stable position, and we can test all of this in an isolated environment.

Now, if we imagine the infrastructure image from the previous slide sitting in AWS, let's focus on the users and how they work with dbt locally. The process is very simple, as you're going to see. Users start by running a "pull dbt" script, which connects to AWS and pulls down the latest infrastructure image, which then gets stored locally. Once they have the latest image, they run a "start dbt" script, which uses that same image to start a Docker container with the complete dbt environment. Inside the container, users can access the models, the macros, and the seeds, and do pretty much everything they would do outside it: run the models, test the models, compile the code. But remember, they're operating in an isolated environment.

So what did we achieve by doing this? We've simplified our setup process. For a user to work with dbt, all they need to do is run two scripts: one they run only when they first start using the tool or when an upgrade is required, and a second they run whenever they need to access dbt. We solved the problem. To summarize: we had an installation process that wasn't very user-friendly, and we solved it by condensing installation into a single script; then, by using containers, we made dbt portable, reliable, and lightweight, which is everything Docker containers promise. (Sketches of what the infrastructure image and the two wrapper scripts might look like follow below.)
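A minimal sketch of what an infrastructure image along these lines could look like; the base image, package versions, adapter packaging, and paths are assumptions for illustration, not Simply Business's actual Dockerfile.

```dockerfile
# Base image that ships the Python runtime dbt needs.
FROM python:3.9-slim

# Install dbt together with the Snowflake adapter (version illustrative);
# shared helper packages would be baked in at this stage too.
RUN pip install --no-cache-dir dbt-snowflake==1.3.0

# The dbt project itself is mounted here at runtime, so the image holds
# only the infrastructure, not the project code.
WORKDIR /usr/app/dbt

ENTRYPOINT ["bash"]
```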
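And the two user-facing scripts could be as small as the following, assuming the image lives in ECR (AWS's container registry); the script names, registry URL, region, and mount path are hypothetical.

```sh
#!/usr/bin/env bash
# pull_dbt (hypothetical name): log in to the registry and fetch the
# latest infrastructure image so it is stored locally.
aws ecr get-login-password --region eu-west-1 |
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-west-1.amazonaws.com
docker pull 123456789012.dkr.ecr.eu-west-1.amazonaws.com/dbt-infra:latest
```

```sh
#!/usr/bin/env bash
# start_dbt (hypothetical name): start a container from that image,
# mounting the local dbt project so models, macros, and seeds are available.
docker run -it --rm \
  -v "$(pwd)":/usr/app/dbt \
  123456789012.dkr.ecr.eu-west-1.amazonaws.com/dbt-infra:latest
```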
Well, now that we'd managed to make dbt easier to use, we went back to our special virtual room and started to think about what else we could improve, and the thing we decided to look at was our CI/CD pipeline, which runs on Jenkins. Jenkins is an open-source automation tool used to build, test, and deploy software continuously, and to give everyone a brief description of what CI/CD means: continuous integration and continuous deployment. We use Jenkins at Simply Business, so we looked at this pipeline and asked: can we make this better?

When reviewing our pipeline, the key thing we found was that broken code was sometimes being deployed to production, even with all the checks we had in place. Some of the reasons this was happening: users' tests were executed against a dev database, and users were only required to run tests on the models they changed, which meant that even if those models ran successfully locally, they could still fail when run with their dependencies. The problem with having broken code in production is that it prevents pretty much anyone else from deploying until that broken code is either fixed or reverted.

The solution? I'm sure everyone is excited to find out what the solution was: we introduced automated tests against cloned production data. How did we achieve this? At the center of our solution is a Python script. This script goes through a series of logical steps to determine, first, whether there are changes made by the user, and then whether those changes will work in production. It starts by comparing all the changes made on the user's dev branch against the main branch, using the git command git diff. At this point we have all the changes within that user's branch, which is good, but not everything is relevant for our tests. So we condense that list down to only model and seed files: in the filter we include a condition that keeps only SQL and CSV files. Now we know exactly which models and seed files have been added or changed relative to master. (A minimal sketch of this change-detection step follows below.)
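A sketch of that step, assuming the script shells out to git; the function name and branch names are illustrative, not the actual Simply Business script.

```python
import subprocess

def changed_model_and_seed_files(base_branch: str = "main") -> list[str]:
    """Return the model (.sql) and seed (.csv) files changed on this branch."""
    # List files that differ between the merge base with main and the
    # tip of the current dev branch.
    diff = subprocess.run(
        ["git", "diff", "--name-only", f"{base_branch}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    # Keep only dbt models and seeds.
    return [path for path in diff if path.endswith((".sql", ".csv"))]
```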
So that's great: we now have the list of changed files and models. Once we have that, we implement some further logic to determine which models to run, and that means working out the dependencies. If you remember, the reason we do this is that even if a changed model runs successfully on its own, it can still fail through its dependencies. To get these, we use dbt's list command, which looks at the model being changed and picks up one parent dependency and one child dependency. At this point you might be asking yourselves why we decided on one dependency either way, and the reason for that is performance: we tried it with multiple levels of dependencies and realized it affected performance, and one level works for us while still giving us the testing we need. One thing we noticed as well is that sometimes the child model depends on another parent model, so we included some further logic that checks whether the child model depends on another parent, and if it does, that parent gets added to the list. We use this list later on.

The next step, based on the list of models and dependencies we've built: the script runs a dbt run-operation that creates an empty database, dynamically named after the user's branch. In the example you can see that if we have a git branch called demo_dbt, a clone database called clone_demo_dbt gets created, so the users know where to look if they have to debug, and it's pretty easy to find. So the process first creates the clone database, which is empty, and then uses the list of objects we built previously to create a copy of each of them from production into that database. One thing we've done here is that after seven days the clone database is automatically dropped, so we don't accumulate unnecessary clone databases. We're pretty lucky at Simply Business that we're using Snowflake, which provides zero-copy clone functionality out of the box, but if you're using another database you can probably achieve this in a different way, maybe by having views on top of the production database.

The final step of this test process: we use the Python script to generate a list of dbt commands, which are then used to perform a dbt run and a dbt test against the clone database. By doing this, we pretty much make sure that whatever the user is trying to run will also work in production. (Sketches of the dependency selection, the clone step, and the command generation follow below.)
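For the dependency selection, dbt's graph operators can express "the model plus one parent and one child" directly; a sketch, with the invocation via subprocess assumed (this is not necessarily how the actual script calls dbt).

```python
import subprocess

def model_with_near_dependencies(model: str) -> list[str]:
    """Use `dbt ls` to select a model plus one level of parents and children."""
    # 1+model+1 selects the model itself, its direct parents (1+),
    # and its direct children (+1).
    result = subprocess.run(
        ["dbt", "ls", "--select", f"1+{model}+1"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```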
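For the clone step, a hedged sketch of what the run-operation macro might look like on Snowflake; the macro name, the schema, and the naming convention are assumptions, and the seven-day auto-drop would be a separate scheduled cleanup job, not shown here.

```sql
-- macros/clone_database.sql (hypothetical macro; invoked roughly as
-- dbt run-operation clone_database --args '{branch: demo_dbt, tables: [my_model]}')
{% macro clone_database(branch, tables) %}
  {# Create an empty database named after the branch. #}
  {% do run_query("create or replace database clone_" ~ branch) %}
  {# Zero-copy clone each changed object from production. #}
  {% for table in tables %}
    {% do run_query(
      "create table clone_" ~ branch ~ ".public." ~ table ~
      " clone production.public." ~ table
    ) %}
  {% endfor %}
{% endmacro %}
```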
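And finally the command generation; the --target name "clone" is a hypothetical profiles.yml target that points dbt at the clone_<branch> database rather than dev or production.

```python
import subprocess

def run_against_clone(models: list[str]) -> None:
    """Run and test the selected models against the clone database."""
    selector = " ".join(models)
    for verb in ("run", "test"):
        # check=True makes the Jenkins stage fail if dbt reports an error,
        # blocking the deploy to production.
        subprocess.run(
            ["dbt", verb, "--select", selector, "--target", "clone"],
            check=True,
        )
```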
I understand I went through a lot of steps, so putting the whole script together, this is pretty much what you get. Just to recap: we first detect the changes on the development branch; we detect the model dependencies; we clone the database objects; and we perform a dbt run and a dbt test. If that's successful, the user is then able to deploy to production; if not, the user can also go into the clone database, check the data, and try to debug that way.

So what did we achieve by doing this? Less manual testing for users. Tests now run against production data, which pretty much means that if they pass at this point, they should run in production without any issues. The risk of broken code being deployed to production is significantly reduced.

So what's next? How does the future look for us? We hope that by using Docker we'll be able to split dbt into smaller projects, which will mean the tool performs better and faster, and we can also tailor projects to specific departments: if you can imagine, finance and marketing would each have their own little dbt project to look after. We're also hoping to develop a full blue-green deployment pipeline at some point: by having two production databases running side by side, for example, we can make deployments run even faster. Thank you very much, and I'll open the floor for questions at this point.

Awesome, thank you so much, Ben. One for me: I was getting the impression that your organization is pretty large, because I was reasoning about why you'd want to solve for people getting ramped up on dbt as quickly as possible, like installing it on their computers, and it totally makes sense when you have a team of that scale. I'm curious what you were doing before. You described the pain briefly, but what did it look like prior to completing this work?

Sure. We had documentation, pretty much a step-by-step guide on what to do once you start at Simply Business, if you're required to use dbt. It's pretty useful: it guides you step by step, and you pretty much have to copy and paste the commands into the terminal. When it works, it's perfect and very straightforward. The problem starts when you don't have something installed on your machine, or you have a version that's not compatible; then the errors appear, and someone who isn't very technical, for example as we started onboarding people in finance and other departments, just doesn't know what to do. That's the reason we tried to simplify the process as much as we could. Even though engineers can probably figure out what the error is telling them and what else to check, most other folks just need to ask for help, and it's time-consuming for everyone, because they can't use the tool and we need to figure out what they've done wrong, if they've done anything wrong at all.

It sounds like that makes your life a lot easier. And as someone who came down that same path of being less technical and figuring out the command line, it's definitely something I would have appreciated. We've got another question; I'm going to hand it over to Martina to ask it out loud. Martina, go ahead and ask your question.

Yeah, I was just wondering whether this will also work for Windows users, and whether you'd have to build another Docker image, or how would that work?

It should work for Windows users. At Simply Business we're not quite there at the moment; only Mac users are using Docker at Simply Business. It's one of the key requirements: they need to have Docker installed, and access to AWS to pull the image down. So yeah, it is possible, but it depends on your infrastructure and your company. At Simply Business we're not quite there, but that's the next step, especially as we're looking to onboard more people. We did ask the question; the infosec team went away to think about how they would implement this on Windows machines, and they said it's quite possible, but they have security concerns they're going to think about how to resolve.

Thank you. Yeah, that sounds awesome though. All right, thank you. One last naive question, only because this is my first time at the London meetup: how long has your team, how long has Simply Business been using dbt?

Just over two years, so even before I joined the company they were already using dbt. For me, I've been with the company for just over a year, almost a year and a half. I'd never used dbt before, but I think the tool is amazing, especially because it empowers analysts and other users to own their own data, which frees up time for engineers to focus on other things.
2021-05-06 12:23