Transforming to a Data Business (Dow Jones DNA) (Cloud Next '19)
What we're here to talk to you about today is how we're working to transform Dow Jones into a data business. The data engineering team, which Dylan is the manager of, had the opportunity to move an over-30-year archive of premium news data to the cloud, and by experimenting with cloud-based services we were able to identify usage patterns and optimize our architecture to meet them. So what we're going to talk about today is how we learned through experimentation, how we identified opportunities to either optimize for cost or optimize for customer performance, and how those lessons learned have influenced how we're transforming our business and future-proofing future data-driven development.

Dow Jones had an over-30-year archive of premium news data that existed only on on-premise infrastructure. What we discovered is that it met the needs of our existing use cases, which were predominantly around advanced research and advanced analytics, focused mostly on research librarians, while we were getting increasingly frequent requests for extractions of data, and that was really challenging on our on-premise infrastructure. So the data engineering team was charged with doing the migration to the cloud to support the existing use cases for advanced research, new use cases for machine learning and big data workflows, and to be innovative enough to be future-proof for whatever is next. So Dylan, what were some of the original challenges in making this data available to customers?

For this new class of customers, these data scientists, they required more data. We were trying to fulfill their requests, but they needed more data than we needed for the researchers, so with this on-prem database we'd have to give them the data on weekends or over the low-traffic times we had for our research product. That just wasn't working; we weren't able to fulfill the requests of the customers that needed this data. As a result we were posed a question: are we going to stay on-prem and try to scale up, or are we going to move to the cloud, where we can massively scale up? We firmly landed with the cloud.

Due to this lack of bandwidth, this lack of scale, and the smaller size of our team, we decided to move to the cloud, specifically Google Cloud, because it gave us zero-ops options. I have some listed here, like Dataflow, Dataproc, and also GCS; I'll go over those when we go over the architecture later.

So I'm just going to hit home again what the pros and cons are for moving to the cloud. You can scale massively, especially for these data science workloads where you're going to need large clusters to process all this data; for us that's over 30 years of news content from over 33,000 news sources. It also allows for smaller teams, because you have managed services that you can take advantage of. And it gives you disaster recovery capabilities that previously were not available, because on-prem we only had a limited number of locations, while with GCP, or with the cloud offerings available to us now, we can be located in different countries throughout the world. Even given the upfront costs, we still wanted to move forward to the cloud. I'll bring it over to Patricia to continue us on our journey, well, after
I step us through the architecture here. So here's our simplified architecture. What we had was a news content feed; on the right side there it's shown as a Mario warp pipe. That fed into GCP, into GCS, for storage. Every day we'd have a Dataflow job that would process all of those news events, which were updates, deletes, or additions of articles, and they would be pushed into Datastore. From there our customers interacted via an API, but for this API we had to supply a custom DSL, and that was problematic, along with the fact that it was pretty expensive off the cuff. With all that, over to Patricia.

Let's continue on. Once the 30-year archive was moved to the cloud, what we wanted to enable was really the opportunity to integrate our data with machine learning workflows as well as big data workflows.
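As a rough illustration of the kind of daily Dataflow job Dylan just described, here is a minimal Apache Beam sketch that reads newline-delimited JSON news events from GCS and writes them into Datastore. The project, bucket, kind, and field names are assumptions for illustration only, not Dow Jones' actual schema or pipeline.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import WriteToDatastore
from apache_beam.io.gcp.datastore.v1new.types import Entity, Key

PROJECT = 'my-project'                                    # hypothetical project id
EVENTS = 'gs://my-news-bucket/events/2019-04-01/*.json'   # hypothetical event path


def to_entity(line):
    """Turn one JSON news event into a Datastore entity keyed by article id."""
    event = json.loads(line)
    key = Key(['Article', event['article_id']], project=PROJECT)
    entity = Entity(key)
    entity.set_properties({
        'title': event.get('title', ''),
        'language_code': event.get('language_code', ''),
        'action': event.get('action', 'add'),  # add / update / delete
    })
    return entity


with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'ReadEvents' >> beam.io.ReadFromText(EVENTS)
     | 'ToEntities' >> beam.Map(to_entity)
     | 'WriteToDatastore' >> WriteToDatastore(PROJECT))
```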
We identified two delivery mechanisms that we wanted to enable via cloud services. The first delivery mechanism is what's called a snapshot, and that's a downloadable extraction of data, a custom data set for whatever the specific use case is, that can either be transferred to a cloud provider to work in existing workflows or be downloaded to an enterprise customer's on-premise environment behind a firewall, so if privacy is an issue we're able to meet that need. The second delivery mechanism we enabled is what's called a stream, and that's a custom filter of incoming news, a near real-time ingestion, that can either be used to keep a data store up to date or be used for things like risk target identification, brand management at scale, and due diligence at scale. These are the delivery mechanisms we wanted to investigate and make available via the DNA platform.

What we discovered early on in our initial experimentation with cloud architecture is that moving to the cloud gives us a better opportunity to track, identify, and understand our usage patterns, and that our architecture needs to be adjusted to meet those usage patterns. We identified trade-off decisions that were either to optimize cost or to optimize performance for our customer.

So what we're going to do today is use this balance as an analogy for either the optimization of cost that's enabled by cloud services or the opportunity to improve the performance for your customers by leveraging cloud services. On cost optimization: if we wanted to improve our cost, for example in the use case of a snapshot, which is a downloadable extraction, what we identified is that the usage pattern is very consistent. These are workloads that can be done in batch processing, and these are workloads that can withstand interruption. So we could make an investment in cost optimization by using preemptible instances, we could invest in cost optimization by leveraging the advantages of tiered storage, and you could do the same with sustained use discounts. You'll notice that as we lean into cost optimization, performance may degrade. So as you're optimizing either for cost or for performance, the key is not necessarily finding equilibrium; it's striking the right balance of cost to performance for that specific use case. Streams, on the other hand, which are a near real-time filter of data, we would weight heavily toward being highly available and reliable. In that case we'd really want to invest in managed services, we'd want to invest in read/write access, and we'd also want to invest in analytics capabilities. In investing in performance, the response time goes down but the costs may go up. So it's really, as I mentioned, never a matter of trying to strike equilibrium; it's a matter of trying to find the balance that meets your specific use case.
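As one concrete example of the tiered storage lever Patricia mentions, here is a small sketch using the google-cloud-storage Python client that adds lifecycle rules moving colder archive objects to cheaper storage classes over time. The bucket name and age thresholds are assumptions and would be tuned to real access patterns.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my-news-archive')   # hypothetical bucket name

# After 30 days move objects to Nearline, and after 365 days to Coldline;
# the thresholds here are arbitrary placeholders.
bucket.add_lifecycle_set_storage_class_rule('NEARLINE', age=30)
bucket.add_lifecycle_set_storage_class_rule('COLDLINE', age=365)
bucket.patch()
```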
So Dylan, once the data was made available on the cloud, can you walk us through some of the cost-to-performance trade-offs you discovered with Datastore? Yeah, so with Datastore we were getting high availability of reads and writes, and it also integrates really well with our pipelining tool, Dataflow. It also allowed us to scale a lot better than we did on-prem. Some of the cons were that it's very expensive for the read and write operations, especially for the usage patterns we were seeing. On top of that, we had to create our own DSL and educate our customers on that DSL. So there was a better way for us to move forward. Patricia? Yeah, so for the snapshots product we do require analytics capabilities, but we also require read/write capabilities.
Datastore proved able to meet the performance expectations of our customers for analytics, but it did not meet our cost needs. So Dylan, what were some of the advantages in moving towards BigQuery as our search index? Yeah, so I'll explain the architecture and then bring to light the optimizations we made that made this a better architecture than the one we had with Datastore, and I'm going to explain it from right to left, starting with the Mario pipeline of content. That's still feeding into GCS with the events we use to update our archive, and we still have the Dataflow pipeline that updates daily. The difference here is that we separated our query layer from our storage layer. For the query layer we use BigQuery, and BigQuery gives us the standard SQL syntax that we can use and allow our customers to leverage as well, and we use GCS to store our full set of content. Another reason we had to split our query layer and our storage layer was that there were some limitations with BigQuery, specifically the extraction limits that we ran into; we do not run into those with the current architecture. I'll bring that to light through the customer's journey when they submit a request for a snapshot, and that can be seen here on the right side: our customers interact with the APIs, and what they're requesting is a snapshot of data. An example request would be for all English content that we have in our archive. Once they submit that request, it kicks off a Dataflow job. That Dataflow job will go to BigQuery with that WHERE clause and grab all the IDs of the articles, the identifiers for the most current version of each article, and from that point it will do a CoGroupByKey, essentially a join, with the data we have stored on GCS. Once it has all the content identified, it outputs the snapshot to a location the customer can use to download it. That's all still done via our APIs, and our customer still has no idea that we're no longer using Datastore.

So that's the architecture, and I'll reiterate its pros and cons. The resources that we spin up with Dataflow are ephemeral, so we're not incurring that cost as time goes on; we only incur the cost while we're doing the operation, while we're doing the snapshot, for example. We're not paying a high price for updates to the archive as we were before with Datastore. On top of that, we get the query syntax that BigQuery gives us with SQL. And GCS is very cheap, so it's the perfect place to store our data when we're not using it. With this architecture, though, we still have that very expensive joining process between the query layer and the storage layer, and we also need a separate cluster for each and every operation. So there may be a better way, and you may see that further on in our talk.
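Here is a minimal Apache Beam sketch, not Dow Jones' actual job, of the join Dylan just walked through: BigQuery supplies the IDs of the latest article versions matching the customer's WHERE clause, GCS supplies the stored article bodies, and a CoGroupByKey joins the two before the snapshot is written back to GCS. Project, table, bucket, and field names are assumptions, and a query-based BigQuery read like this relies on the pipeline's temp location for its export.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The customer's filter, as it might arrive through the snapshots API.
CUSTOMER_WHERE_CLAUSE = "language_code = 'en'"

ID_QUERY = f"""
SELECT article_id
FROM `my-project.news.article_index`   -- hypothetical index table
WHERE {CUSTOMER_WHERE_CLAUSE}
"""


def key_by_id(line):
    """Key each stored article (one JSON object per line) by its id."""
    article = json.loads(line)
    return article['article_id'], article


def keep_matches(joined):
    """Emit an article only if BigQuery said its id matches the filter."""
    _, grouped = joined
    if grouped['ids']:
        for article in grouped['articles']:
            yield json.dumps(article)


with beam.Pipeline(options=PipelineOptions()) as p:
    matching_ids = (p
        | 'QueryIds' >> beam.io.ReadFromBigQuery(query=ID_QUERY, use_standard_sql=True)
        | 'IdAsKey' >> beam.Map(lambda row: (row['article_id'], True)))

    archive = (p
        | 'ReadArchive' >> beam.io.ReadFromText('gs://my-news-bucket/archive/*.json')
        | 'KeyById' >> beam.Map(key_by_id))

    ({'ids': matching_ids, 'articles': archive}
        | 'JoinOnId' >> beam.CoGroupByKey()
        | 'KeepMatches' >> beam.FlatMap(keep_matches)
        | 'WriteSnapshot' >> beam.io.WriteToText('gs://my-news-bucket/snapshots/request-1234'))
```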
In moving to using BigQuery for analytics, we were able to do big workflows relatively inexpensively; BigQuery is relatively inexpensive for analytics capabilities, but we hit some quotas and limits when it came to extraction, which is why we paired it with Google Cloud Storage, so we could use the cost optimizations of tiered storage together with the analytics capabilities of BigQuery. So Dylan, in our snapshots architecture, why was the move made from Dataflow to Dataproc? Okay, yeah, I'll explain this architecture, and you may see the earlier cons bubble up, or rather what are no longer cons with this architecture. Here we still have that Mario pipeline of content going into GCS, updating daily, and going into the content archive we have here, which is a little different, as you may see. The customer is still submitting API requests, but this time, when the customer submits an API request for all English content, a job spins up and we can use Dataproc and its SQL syntax to create the snapshot of content that will be delivered to our customer.

There were some cons with this method, but I'll go over those after I go over the pros, because this is a good thing. With this architecture we were able to share resources, so when we had two customers make a snapshot request, we could still use that same archive of content. With Dataflow you'd have to load it into the cluster each and every time; here we can mount the same drive and the same content. This also more closely aligns with our usage patterns as more and more customers join and use our platform, and because of that we only use these higher-cost resources when it comes to doing the operations. There's also no need for that joining of the query layer and the storage layer, because we're using Spark on Dataproc, which has a SQL syntax we can leverage. But with this there is still more operational overhead. Dataflow can recover from errors or any issue with the cluster; with Dataproc we had to build in some logic to restart the cluster and recover from errors. On top of that, when we moved from BigQuery to the Spark syntax, we had to create some UDFs to bridge the gap in functionality, and for those of you interested, it was related to regular expressions. I'll throw it over to Patricia to continue us on this journey.

So, we have a relatively small team; we started with two engineers and we now have five. Through the experimentation in moving to the cloud we were able to identify that the usage pattern for snapshots is very consistent and very predictable, so it was a worthwhile trade-off to take on the increased overhead of an unmanaged service, as it gave us increasing control over the expensive operations like joins and windowing. There was a cost benefit to us; it was worth it, and worth taking on the overhead.
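Here is a minimal PySpark sketch of the Dataproc approach Dylan describes: the shared archive on GCS is exposed to Spark SQL and the customer's filter is applied directly, with a small hypothetical UDF standing in for the regular-expression gap he mentions. Bucket paths, field names, and the UDF itself are assumptions for illustration.

```python
import re

from pyspark.sql import SparkSession
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.appName('snapshot-job').getOrCreate()


def regexp_contains(text, pattern):
    """Hypothetical stand-in for one of the regex UDFs mentioned above."""
    return bool(re.search(pattern, text or ''))


spark.udf.register('regexp_contains', regexp_contains, BooleanType())

# The shared archive on GCS is read once and exposed to Spark SQL; every
# snapshot request can reuse the same mounted content.
archive = spark.read.json('gs://my-news-bucket/archive/*.json')
archive.createOrReplaceTempView('articles')

# The customer's request, expressed directly as SQL over the shared view.
snapshot = spark.sql("""
    SELECT *
    FROM articles
    WHERE language_code = 'en'
      AND regexp_contains(body, '(?i)acquisition')
""")

snapshot.write.mode('overwrite').json('gs://my-news-bucket/snapshots/request-1234')
```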
So Dylan, for streams, which is a near real-time filtered stream of data, where highly available services are weighted heavier, what was the choice made: to move from Dataflow to Dataproc, or from Dataproc to Dataflow? Okay, yeah, with the streams architecture our customers are going to notice if it goes down for a few minutes or even seconds, unlike with snapshots, where they're not going to notice, because those operations take between one and two hours. Streams are very low latency, so customers are going to notice when messages are not streaming to them.

With this, we're still hooking into that stream of content coming from the Mario pipe, but it's streaming onto Pub/Sub now, and we're still leveraging the Dataflow services to do our ETL processes. The great thing about Dataflow is that you can write a job as batch or streaming and it's going to be either the same or similar, so we're taking advantage of some of the code we wrote for snapshots and using it in streams to filter and update our content. From that point we filter it into different product types, because we have content that's licensed to only certain products, and we use Dataflow to do that filtering. From there, this is where our customers interact with our system: they use our APIs as they would for a snapshot, they submit the same kind of query, so if they want all English content they'll submit that little WHERE clause, and from that point we'll spin up a Dataflow filter. Before, that was a Dataproc filter, and I'll go into the pros and cons as to why we changed that part of the architecture. Now, once the customer has created the stream and has their custom set of content streaming to them on this Pub/Sub topic, they can use our client code and get streamed all the content that gets filtered down to them.

But we had to make that move I mentioned earlier, from Dataproc to Dataflow, and that was because every time we used preemptibles in our cluster, it would take down our stream, and our customers would notice. With Dataflow we've actually had streams that have been running for multiple years. Along with that, it gives us fully managed, zero-ops data processing, and for our small team this is great because we don't have to wake up in the middle of the night and handle an issue. It also allows for more seamless autoscaling, which helps with smaller teams, and with streams there is not a heavy load, as we only receive around 800,000 messages a day, and in big data land that's not a lot. So the cons were not being able to use preemptibles, which are very cheap; if you do use Dataproc, I suggest looking into them. Also, we had to leverage an in-memory database for the queries, because we didn't have the BigQuery syntax or the Spark syntax that we could plug into and use SQL. But if I were to do this again, Dataflow now has a SQL syntax, which we have not used. I'll throw it over to you, Patricia. Thanks.
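A minimal streaming Beam sketch of the per-customer filter job Dylan describes: read article events from a product Pub/Sub topic, apply the customer's filter, and publish the matches to that customer's own stream topic. Topic names, the message format, and the hard-coded filter are assumptions standing in for the customer's WHERE clause.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


def matches_customer_filter(message_bytes):
    """Stand-in for the customer's WHERE clause, e.g. 'all English content'."""
    article = json.loads(message_bytes.decode('utf-8'))
    return article.get('language_code') == 'en'


options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadProductTopic' >> beam.io.ReadFromPubSub(
           topic='projects/my-project/topics/news-product-feed')
     | 'ApplyCustomerFilter' >> beam.Filter(matches_customer_filter)
     | 'WriteCustomerStream' >> beam.io.WriteToPubSub(
           topic='projects/my-project/topics/customer-1234-stream'))
```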
So, to summarize some of the lessons learned: the usage patterns seemed obvious, but once we released this to the world we did have some APIs that were used in ways we hadn't intended, and all of a sudden your costs skyrocket and you become very painfully aware of where you have opportunity to optimize and where you have opportunity to change. What I would recommend and advise is this: initially we were always optimizing for performant responses, and if you invest too heavily in one side of the trade-offs, you skew yourself off balance really quickly, and that becomes very apparent when you get the bill for the service. We had a use case that turned out to be something like a $70,000 use case that we were not expecting. What we really learned is that any time we do feature development, we should first identify the usage pattern and the behavior, really understand from your users how they intend to use it, which might be different from your original assumption, and then, once you do understand the usage pattern, build in such a flexible way that you can
understand the trade-off decisions you've made versus the performant experience for your customer, and the opportunities the cloud gives you for cost optimization. So Dylan, what were some of the lessons you learned in making this new product available on Google Cloud? Okay, yeah, one thing that we found useful as we were developing this new product, which we had not built before, was that we built time into our roadmap to iterate and to spike on new infrastructure when we could. Google is always developing new zero-ops tools that you can use, and other cloud providers are as well; we actually started on a different cloud provider before we moved to Google. Also, we learned that being an early adopter is a double-edged sword: you can guide the roadmap for their products, but at the same time you can use one of their products and it can tip over in a way you would not previously have expected. Also, Google keeps their quotas very tight, down to limits that prevent you from incurring a lot of cost, but at the same time that can cause an issue in the future, so it's good to know your quotas, especially when you're an early adopter.

This can be boiled down to three things when you're deciding on an architecture. The first one is cost: the larger you can make that margin on the cost of operations, the quicker you can grow your team and iterate on features. Another one is team size: there are some architectures that may not work for small teams, especially if you want a team that can develop features and you don't want to spend half the time managing infrastructure; this is why Google Dataflow and Dataproc have come in handy for us and accelerated our roadmap. Another one, which we learned trial by fire, is usage patterns: we had a new product with a new use case, and our customers were using it in ways we did not expect. We were able to respond because we had time in our roadmap, both from using these zero-ops tools and from the fact that we built time into the roadmap for iterating on our architecture. Luckily we were able to adapt and change according to these usage patterns, but it's very important to keep a very tight eye on that, and maybe even log that data somewhere so you can be alerted when your costs are spiking and it's time to address the issue. So those are the lessons I learned.
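Dylan's point about logging cost data and alerting on spikes can be approached in several ways; as one hedged sketch (not what the team describes in the talk), assuming billing export to BigQuery is enabled, a small script like the following could flag yesterday's biggest spenders. The table name and threshold are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# The table below is a placeholder for your own billing export table, and the
# threshold is arbitrary (in your billing currency).
QUERY = """
SELECT service.description AS service_name, SUM(cost) AS cost_yesterday
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`
WHERE DATE(usage_start_time) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY service_name
HAVING SUM(cost) > 500
ORDER BY cost_yesterday DESC
"""

for row in client.query(QUERY):
    print(f"Possible cost spike: {row.service_name} spent {row.cost_yesterday:.2f} yesterday")
```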
So, in making this new product available on the cloud, we meet our existing use cases of advanced research, and we've also now created a new line of business for machine learning and big data workflows. And Dylan can walk us through some of what we've also enabled in terms of applying these lessons to future-proofing, and some new and upcoming things we've got going on. Yeah, so this was just about news content, but with news content you have different entities: you have organizations, you have people, you have events, you have different industries affecting the news. So there's a lot of context that can be gleaned from different sets of data, and DNA has allowed us to build on the shoulders of this platform and create products that can generate knowledge, where you can actually get signal from these news events. We're doing that by structuring a sort of knowledge graph, the way Google does for their search engine; we're going to be doing it for news and events. We've been able to work on that and develop new context and traversal relations through our graph architecture. On top of that, if you're interested in the kind of content we've been dealing with, we have a commercial data set in BigQuery; just Google "BigQuery commercial data sets DNA" and you can take a look at the kind of news content we've been dealing with.
Yeah, the knowledge graph is kind of the next big thing we've got coming up, and we're so excited to have this premium news content available, but also available with the entities that exist and the relationships between those entities, so you can see the data in the context of the data, which is something we're really excited about.