DataOps – How access to your data with this approach will release untapped value


Good morning, and welcome to our talk. Today we're going to look at DataOps and how accessing your data with this approach can release as much untapped value as possible.

Quick introductions: you should see me in a little bubble in the top corner. My name's Ian Russell, and I'm the Director of Operations for Software Solved. We're a bespoke software development house, and I'm responsible for the operational team and the end-to-end delivery of our software services. I also set the strategic direction for the team: the processes, tools, approaches, and methodologies we use on a day-to-day basis. Presenting with me today is John Stace, our Director of Technology. He wears a number of different hats in his role: he's responsible for technical strategy, technical architecture, infrastructure and hosting, information security, and data protection. As you can imagine, my role and John's overlap quite a bit day to day, so the tools, processes, and approaches we use have to support each other for us to deliver the end-to-end process: we need to be able to write software as well as have the environments and infrastructure to support it.

Let's set the scene for today's subject. One of the biggest challenges John and I have both faced over the last few years is around data access. It can be difficult at times to get fast, efficient access to data through the development life cycle, both internally and externally, whether on our own tech stack or in the systems we build and support for our customers, and each comes with its own challenges when you're trying to enable good business decisions. What I want to draw attention to here is the analytics cycle time.
This is the elapsed time between the proposal of a new idea (someone comes to the data team and asks a question, or asks them to build a report) and the point at which that report or idea is analysed and deployed to a position where you can make the decision.

Slow analytics cycles can be caused by a number of factors, as you can see in the picture here. Poor teamwork: teams may not be set up for this; you might not have a data team or data scientists, so you have to do the best you can with the people and systems you've got. Lack of collaboration: hand-offs between departments can be a real problem if the requirements aren't understood. Waiting for systems, or access to systems: data may be disparate, in different places, so you can't quickly get at the data sets you need. Caution over data quality: one of the big ones is not having confidence in the data passing through the system, so you're not willing to ask the right questions of it. Inflexible architecture: the system might be set up in a way that you simply can't gain access, or can't script your way to the data when you need it. And process bottlenecks: coming back to hand-offs between systems, processes, or teams, work might not flow in the right way, so it becomes disjointed and slows everything down.

What I'm going to do now is hand over to John, who will talk through how we've dealt with data requests in the past.

Hello everyone. Now that Ian has set the scene, I want to give some context to our DataOps journey by talking about how we used to deal with data access requests. We'd start by getting a request from a client; whether an internal or external client, the process was much the same.
We'd then get a business analyst involved, who would talk to that client, get all their requirements worked out, written down, and finally signed off. That would allow us to move on to the next stage, technical design: a technical person would come in, look at those business requirements, and find the right technologies to fit the need. That would also be written down and signed off, allowing us to move on to the implementation phase, where we would build out the dashboard or report or whatever it was, and get it tested internally to make sure it was fit for purpose before returning to the client for their acceptance testing.

Now, at this stage, this is the first time the client has seen the output of the whole process, so they might have feedback; we'd go back around the loop once more until they were happy and had signed off that this was the right solution for them, at which point we'd put it into production. Once it's in production is when you really find out what's working, what's not, and how they'd like it tweaked, so we'd run a standard support process, and there would usually be more feedback and more requests. It would go back around the loop, through the implementation phase again, getting everything signed off.

Why was this bad in terms of access? Well, if you've ever been involved in a discussion about agile versus waterfall, this is a classic example of all the downsides of a waterfall approach. We went through the entire scope of the requirements and functionality before delivering anything, and this took a long time. I remember one time I asked a client, "So you want some data out of this system; what do you want?" and they said, "Oh, we just want to report on everything." We often had that blank-sheet-of-paper problem, where nobody really knew what we were building beyond allowing someone to report on everything.
That is obviously quite a big scope. We also had problems because each of these projects took so long to deliver: each implementation was a custom implementation, and the technology had probably changed between projects, so we lost a lot of the efficiency of reusing the same technology and approaches each time. Different members of staff would join a project without having been part of the previous one, so there was always a loss of efficiency there too, and that just extended the time it took to deliver. Coming back to the theme of this whole conference, it meant that access to this important data was really poor: it took way too long, and key insights were often missed due to the timing. Now I'll hand back over to Ian.

So, why DataOps? John and I went away, had a number of conversations, did some research and reading, and in the end decided to focus our attention on this approach, for a number of reasons. DataOps is a combination of methodologies we already use on a day-to-day basis here. Take Agile: we use Scrum, a methodology everyone should be familiar with. It chunks up and time-boxes the development and delivery of requirements so you can increase velocity to the customer and really focus on what they're trying to achieve. Then DevOps: an approach that uses automation for continuous integration and deployment, so you're not manually pushing code to places; automation does a lot of the heavy lifting, so you can increase velocity through the different environments and hopefully get code live as quickly as possible. And finally, statistical process control (SPC).
SPC came from manufacturing, and it's a way for us to build in automated tests that check, as we move code or data through the different steps of the life cycle, that the results we're getting, the inputs and the outputs, are exactly what we're expecting, so we can build confidence and speed up as we go through the cycle.

What DataOps does is overlay onto this iterative approach: it attempts to integrate real-time analytics, with that CI/CD approach, into a collaborative team. A lot of these principles rely on teams working together in unison, so that hand-offs between pieces of work are as smooth as possible. DataOps teams attempt to measure the performance of the analytics based on the insights they deliver. We'll go into this in a bit more detail shortly, but we break it into two things: hygiene data, making sure the performance of the systems is right, which you can test throughout with SPC, through to modelling data, asking what-if questions. What we're doing is focusing the process on putting data in the right position, and orchestrating the systems and the process, to let us make decisions as we go.

How does that link to access, you might ask? As you can imagine, this process can be difficult without full control over your data. Getting systems automated, tested for correctness, and pushed through as fast as possible can be hard when you're also trying to get data into the right place, near-live and queryable. To do this we have to make sure we have a stable data environment, and the evolving needs of our customers can make that very challenging.
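The SPC idea described here, checking that a step's inputs and outputs are what we expect, can be sketched as a classic control-limit check: a metric from the current run is compared against limits derived from recent history. This is an illustrative example rather than our production tooling; the row-count metric, the sample figures, and the three-sigma threshold are assumptions.

```python
from statistics import mean, stdev

def control_limits(history, sigmas=3.0):
    """Compute SPC control limits (mean +/- N sigma) from recent runs."""
    m, s = mean(history), stdev(history)
    return m - sigmas * s, m + sigmas * s

def within_control(history, observed, sigmas=3.0):
    """True if the observed metric falls inside the control limits."""
    lo, hi = control_limits(history, sigmas)
    return lo <= observed <= hi

# Row counts from the last few pipeline runs (illustrative numbers).
history = [1020, 998, 1011, 1005, 990, 1008]
print(within_control(history, 1003))   # a normal run stays in control
print(within_control(history, 1700))   # a sudden jump trips the check
```

A check like this can gate each stage of the pipeline, so data only moves forward while its metrics stay inside the expected band.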
Looking at DataOps more closely, we have the concept of the value pipeline. It breaks the flow of data down into two different streams or pipelines: as you can see from the diagram, we have the value pipeline and the innovation pipeline, and I'll explain the differences in a second. What they're both trying to do is extract as much value as possible from the process and the systems as data flows all the way from the initial sandbox or development environment through to production. When data enters the pipeline, we try to move it through into production as quickly as possible, with several quality checks along the way; this is where SPC's checks and balances come in, increasing our confidence in the data, and with increased confidence we can hopefully increase speed. Production itself is generally a series of stages: accessing data, transforming it, modelling it, visualising it, and reporting on it. We'll look at that in more detail in a second; for now, let's focus on the pipelines themselves.

First of all, along the top, we have the value pipeline. It focuses on ensuring the hygiene of the data passing through the process is what we want it to be: we're trying to increase quality so we have increased confidence, so that decisions flow through as quickly as possible, so that when someone comes along with a question we can answer it quickly. When data passes through the pipeline into production, we get useful analytics, and value is created. Data in this pipeline is updated on a continuous basis, whereas the code is kept constant: we're testing different data, different combinations, models, or scenarios, while keeping the code stable.
With the code in a particular environment kept stable, we can check whether we get the outcome we think we should be getting. Obviously you don't want poor quality, so again this is where automated testing comes in: we put checks and balances in place to make sure we're getting the expected results from our data sets in each environment. For example, if we expect a certain load time for a screen in a system, we can put an automated test into the process to check it, so we don't have to go in manually, and we can move that code base or data set through the system much faster than we could previously.

The other pipeline you can see is the innovation pipeline. It seeks to improve analytics by implementing new ideas that yield analytical insights. Like I said, sales might come along and ask, "What happens if we increase the average size of our orders? What will that mean? Can we model that over the next five to ten years?" What the diagram illustrates is that a new feature can undergo development, we can put it through that modelling in a dedicated environment, and we can quickly deploy it into the production system, so we get to the decision point as quickly as possible. The innovation pipeline also creates a feedback loop: innovation spurs new questions, ideas, and enhanced analytics, so teams want to ask more questions of the data set on a much more regular basis, because they know they'll get the answer they need, in the format they're expecting. During the development of those new features, the code can change.
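The sales what-if above, modelling an increase in average order size over the next five to ten years, is exactly the kind of question the innovation pipeline is built to answer quickly. A minimal sketch with made-up figures; the function name, figures, and the simple compounding assumption are illustrative only.

```python
def project_revenue(orders_per_year, avg_order_value, uplift, years):
    """Project annual revenue if average order value grows by `uplift` per year."""
    projections = []
    value = avg_order_value
    for year in range(1, years + 1):
        value *= (1 + uplift)  # compound the average order value
        projections.append((year, round(orders_per_year * value, 2)))
    return projections

# What if average order value rises 5% a year for the next five years?
for year, revenue in project_revenue(orders_per_year=4000,
                                     avg_order_value=250.0,
                                     uplift=0.05, years=5):
    print(year, revenue)
```

The point of the pipeline is that a model like this can run against a near-live copy of the production data, rather than a stale extract.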
What changes here is the code and the modelling (John will go into the technology we use for this in a bit), while the actual data set is kept constant, so the way we treat this pipeline is the reverse of the value pipeline. What we're attempting to do with all of this is reduce the overall cycle time and turn ideas into innovation, and into more questions, as quickly as we possibly can, so teams have the confidence to ask those kinds of questions.

That's all good in principle; however, what John and I had to do next was take a much closer look at the DataOps process and how it fitted with our current end-to-end development process. As I mentioned, we'd already implemented DevOps: CI/CD, continuous integration and deployment. As you can see in the top diagram, it's a process of pushing from development through to build and test, where you're implementing continuous integration of data sets and systems so you can test something as close as possible to the real live system. Then, once it's gone through that level of SPC and automated testing, and you've checked that it delivers everything it needs to, comes continuous deployment: regularly using automated tools to push it into deployment and then through into live for our customers as quickly as possible.

Now, DataOps adds extra levels of complexity, and as you can see at the bottom here, there are a number of extra steps we have to take into account. A lot of this comes down to the extra environments we have to set up, manage, maintain, and orchestrate; orchestration is a key point in all of this. I'll quickly talk through some of the key points of how it differs.
First of all, rather than just development environments and the like, we have a sandbox environment. These are isolated environments that let our development teams, and the people trying to gain analytics insight, test without affecting the application. As we said, we want to be able to change the data in this process without changing the code: a stable environment where those kinds of tests can happen early in the process.

Orchestration is the other massive thing. We need to automate a number of different processes here, and what orchestration tries to do is automate the tasks that bring the data sets together in the format we need: run-time processes, data transfers, integration, the list goes on, all into a single IT-driven process over which we have as much control and management as possible. We're attempting to automate the data factory pipeline process so that we know the expected outcome: we know the format, the structure, and how the data will look when it gets there. This is also why we need specific environments for the data itself: we don't want the data set to be ever-changing, with people dipping in, because then you start getting false-positive or false-negative results when you're trying to gain insight as quickly as possible. Linking it back to the innovation pipeline, what that pipeline ends up having is a copy of the data pipeline from live that we can quickly access, quickly test, and get to that insight level as fast as possible.

Why do we do it this way? Above all, because we want to integrate data from different sources. Keep in mind that lots of systems don't have just one data source; different systems integrate with each other.
We also want to control the storage of the data, in different versions, over time. If we orchestrate that, and can expect and get the outcome we want, it's much easier to build in those levels of automated testing: we know the input, we know the output. We want centralised management of our metadata, so we not only know what information is available but also how to configure the platform's processes and tools; it becomes very complex, with lots of moving parts in these kinds of processes. We also want to manage requests, authorisation, and access to data: we don't want just anybody going in at any time. This needs to be controlled, so that only the people who should have access do have access, but when they need it they can get it nice and quickly. And then we need to apply analytical reporting, dashboarding, and monitoring techniques to track what is happening throughout the pipeline: it's great that a test runs and checks things, or that a person comes in and runs their analytics, but we want to know what is happening, and when, throughout that process.

What we'll do now is look into orchestration, because it has a number of different steps; we'll go through those steps and I'll try to relate them back to the types of data we're processing. So how do we tap into the value using this approach, using this pipeline? As I mentioned, we can break this orchestration phase down into a number of steps where we gain access to the data and get it into the right position. As you can see here, there are a number of steps data has to go through to be in the right position for us to report on it in the right way.
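As a toy illustration of the orchestration idea: each stage of the pipeline can be a function, and an orchestrator runs the stages in order while logging what happened and when, so every run is tracked end to end. The stage names echo the talk, but the implementations are placeholders; real stages would call out to source systems, ADF, and so on.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

def orchestrate(data, stages):
    """Run each pipeline stage in order, logging progress and record counts."""
    for name, stage in stages:
        logging.info("starting stage: %s", name)
        data = stage(data)
        logging.info("finished stage: %s (%d records)", name, len(data))
    return data

# Illustrative stages: access pulls in an extra record, transform normalises
# types, and model sorts ready for reporting.
stages = [
    ("access",    lambda rows: rows + [{"order": 3, "value": 75}]),
    ("transform", lambda rows: [{**r, "value": float(r["value"])} for r in rows]),
    ("model",     lambda rows: sorted(rows, key=lambda r: r["value"])),
]
result = orchestrate([{"order": 1, "value": 120}, {"order": 2, "value": 30}], stages)
```

The logging is the point here: an orchestrator that records what ran, and when, is what gives you the monitoring and tracking described above.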
First of all, access. We have orchestration at a number of different levels within the process. Orchestration at an earlier stage may be more focused on checking what we class as hygiene data: performance, accessibility, click-through, usability. This is data that already exists within the end-to-end process, and what we're trying to achieve is an environment and setup where we can put in automated testing and statistical process control to check, as I said previously, that the performance of a specific screen is correct, so we can pass it through the process as quickly as possible. Access is key here. We may have data from a number of different sources that we're trying to pull in: a sales system, a marketing system; there are lots of possible combinations of integrations. We're trying to pull that data out into position, especially at the later stage when we're in the innovation pipeline, so we can get at it in the best format possible: ingest the raw data from those sources and replicate it as close to live as possible, so we can interrogate it properly.

Next we have the transform step. Generally we'd have a process to extract, transform, and load the data into a database, a warehouse, or wherever you want to store it. That process standardises the data and sorts it, so we can validate that it's correct before we start any testing or interrogation. Then we get into data models, and this is where, in the future, things like AI and machine learning come in: you can build models against what we class as innovation data.
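A transform step like the one just described, standardise, sort, and validate before loading, might look something like this in Python. The field names and validation rules are hypothetical, purely to show the shape of the step.

```python
def transform_step(raw_rows):
    """Standardise raw records and validate them before loading."""
    cleaned = []
    for row in raw_rows:
        record = {
            "customer": row["customer"].strip().title(),  # normalise names
            "amount": round(float(row["amount"]), 2),     # normalise currency
        }
        # Validation: reject records we would not trust downstream.
        if not record["customer"] or record["amount"] < 0:
            raise ValueError(f"invalid record: {record}")
        cleaned.append(record)
    # Sort so downstream checks see a deterministic order.
    return sorted(cleaned, key=lambda r: r["customer"])

rows = transform_step([
    {"customer": "  smith ltd ", "amount": "19.99"},
    {"customer": "ACME", "amount": "120"},
])
```

Failing fast on bad records, rather than loading them, is what gives the later testing and interrogation stages data they can trust.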
As I said previously, take the example of sales or marketing asking, "What happens if we increase our average sale size over the next five years?" We can map that out and model it, then push it through into visualising and reporting. What this is all setting us up to do is get through that process at a much faster, more efficient velocity, so that when someone comes to us with those questions they don't have to wait weeks or months; it's days, or faster, to get to the real key decision they want to make. Orchestration and automation allow us to access the value of the data much faster, and with a much higher level of confidence, because we've built in those automated tests. What I'm going to do now is pass over to John, who will go through the technology we use to implement this.

Thank you, Ian. I want to talk about the technologies we've used and how they've evolved to meet the needs of the new DataOps process. Previously we used fairly traditional tools such as Excel, which everybody loves; it's the king of the data processing world, and Microsoft SQL Server with its associated tools. We'd store data in SQL Server, use SQL Server Integration Services for our extract, transform, and load (ETL) processes, and use SQL Server Reporting Services for visualisation, the output of the data request. Now, Reporting Services is fine for charts, and it does traditional paginated reporting really well, but we were looking to modernise that side of things anyway. To meet the new DataOps needs, we now follow a different set of technologies. To access data, we still use SQL when the data source is a SQL database, but we also use Python; other data processing languages are available, but we find that Python works really well.
It's good for bringing different data sources into our process. To transform that data, we started using Azure Data Factory (ADF), which I'll talk more about in a minute; it's great technology for processing data, doing the ETL, and bringing it into a format we can report on. For modelling the data we're still using SQL databases, but focused on cloud-based databases these days, and for visualisation and reporting it's pretty much all Power BI now. There's a very strong Microsoft influence across that, and these tools work really well together, but fundamentally DataOps does not depend on a specific set of tools: it's all about the processes.

I just wanted to expand on that a little. One of the advantages we get from these cloud-based services, especially Azure Data Factory, is much better control over the orchestration of these environments. The cloud environments can all be implemented with infrastructure as code, so we can create new environments, copy them, roll them out, and delete them as and when we need: when we're doing a new data request, create a new environment, use it, and get rid of it when we're finished. These cloud-based services really help with that. Coming back to orchestration, it also gives us a lot more control, because the APIs for these services let us manage what's going on, and it feeds really nicely into the CI/CD process Ian talked about earlier, again because it's cloud-based and we can use the infrastructure-as-code approach.

In terms of statistical process control, SPC, it's still fairly basic for us at the moment. We use a service within Azure called Azure Monitor, which lets us monitor the metrics of these processes in action, so we can keep an eye on them and get alerted when something starts to go wrong. The other side of things is testing.
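As an example of the kind of multi-source work Python gets used for here, this standard-library-only sketch joins records from a hypothetical sales system and marketing system on a shared customer id; the field names are invented, and in practice this sort of work would sit inside an ADF pipeline rather than a standalone script.

```python
def join_sources(sales, marketing):
    """Inner-join sales and marketing records on customer_id."""
    campaigns = {m["customer_id"]: m["campaign"] for m in marketing}
    return [
        {**s, "campaign": campaigns[s["customer_id"]]}
        for s in sales
        if s["customer_id"] in campaigns  # keep only customers in both systems
    ]

sales = [{"customer_id": 1, "total": 500}, {"customer_id": 2, "total": 90}]
marketing = [{"customer_id": 1, "campaign": "spring"}]
combined = join_sources(sales, marketing)  # only customer 1 appears in both
```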
We could talk about the difference between unit testing and integration testing; these probably fit more into the integration testing realm, but they are programmatic tests we can implement and keep re-running, regression-testing our pipelines, making sure that as we iterate through these requests, adding new ones, we're not breaking what's already there. ADF has great support for these programmatic tests. As for the kinds of tests we're doing: we measure the number of records coming in and the records going out, we run tests against the logic we're applying in the transformation stage, and we create warnings around variations in record counts and things like that. There's a lot you can do there, and ADF makes it really quite easy. In terms of the actual testing framework, because we're a .NET Microsoft house we're using NUnit, a .NET tool, to write programmatic tests, but specifically with ADF you can use things like pytest as well, if you prefer to write that sort of thing in Python.

As I said, I want to emphasise that we've focused a lot on these cloud-based services because they give us a lot of this DataOps capability out of the box, but that doesn't mean you can't work with on-prem technology: infrastructure as code exists for on-prem with the likes of Terraform. It's just really easy with these cloud tools; on-prem, you'd have to put a lot more effort into setting up the whole orchestration, the sandbox environments, and the CI/CD side of things.

Next: what does this mean for our customers? Well, as the picture says, it brings closer collaboration and productivity. We're working more closely as a team and iterating faster; our team can make proactive suggestions because they're having a lot more conversations with clients; it's much more interactive; and we have a shared understanding of those important data sets.
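The record-count and logic tests described above can be written as ordinary pytest-style assertions. The transform step and the tolerance figure here are hypothetical, just to show the shape of such a regression test.

```python
def transform_stage(rows):
    """Hypothetical pipeline step: drop rows with a missing amount."""
    return [r for r in rows if r.get("amount") is not None]

def test_record_counts_within_tolerance():
    rows_in = [{"amount": 10}, {"amount": None}, {"amount": 5}, {"amount": 7}]
    rows_out = transform_stage(rows_in)
    # Record counts: we expect to lose only a small fraction at this step.
    dropped = len(rows_in) - len(rows_out)
    assert dropped / len(rows_in) <= 0.5, "too many records dropped"
    # Logic: no surviving row has a missing amount.
    assert all(r["amount"] is not None for r in rows_out)

test_record_counts_within_tolerance()
```

Run under pytest, a suite of checks like this re-executes on every change, which is what keeps new requests from silently breaking the pipelines already in place.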
That feeds back into collaboration and proactivity, leading on to increased innovation: we can move quickly, try things out, run experiments, and if they don't work we just get rid of them, because the DataOps infrastructure we have in place makes that really easy. For our clients, again both internal and external, it really increases the speed of their data-driven decision making: they can ask us for some data insight, we can build it quickly on the infrastructure we've got, roll it out with minimum effort, make sure we're not introducing regressions, all of that good stuff, and they get the insight quickly. It also opens up the option of data-enhanced applications: more and more applications can potentially use the data we're processing. Fundamentally it comes back to increased customer satisfaction: moving quicker, getting them insights quicker, getting access to that data much more efficiently. There's no hesitation, no "oh, it's going to take months to produce this"; we get it done really efficiently, and that encourages data-driven thinking. I'll now hand back over to Ian to talk about what the future holds.

So what does the future hold for this approach? John and I are constantly having conversations about this and keeping an eye on trends in the market, to see what direction things are taking and what could affect this going forward. We've brought it down to five key things we're watching. One: the Internet of Things and ever-expanding sources of data. There are so many data sets out there that people are constantly trying to utilise or integrate with, and that's only ever going to increase, so we'll have to keep an eye on the types of data we'll be able to use in the future for analysis and insight. Two: devices.
I think I saw a stat the other day that there are about 12 different sensors on your iPhone these days. There are so many ways that devices and tools interact with users now, so the way we ingest data is ever evolving, and we'll have to keep our finger on the pulse to understand how best to ingest it and report on it. Three: data integration. There's an explosion in the different types and volumes of data, and it has eroded the assumption that you can master big data through a single platform; most people use specific, dedicated tools for specific jobs these days. So integration is going to be key, and as we've mentioned over and over in this presentation, bringing data together in a single place so you can report on it is a challenge, and it will only become a bigger one as we continue down this path. Four: AI and machine learning. It's not new, but it's ever changing and improving, and the big thing here is that there will be even more of a push in future to supplement human talent with AI and machine learning. With more data, in a greater variety of formats, to deal with, we have to take advantage of advancements in automation to augment that human talent, so it's something we're keeping a very close eye on. And five: working with subject-matter experts (SMEs) and non-technical users, in other words self-service. Lots of people within organisations now want to do reporting, analytics, and modelling without having to go to someone too techy, so we're keeping an eye on how best to work with our customers and non-technical users: setting them up, training them, and working in the way that best suits them and their organisation.

Finally, it's just a big thank you from John and me. Thanks for coming along to our talk. Does anybody have any questions?

2021-09-19 12:52

