Grid Computing With Dataflow: UBS Tries Option Pricing as an Example Use Case (Cloud Next '18)



Good morning, everybody — thank you very much for coming along and giving us some of your morning. My name is Reza Rokni; I'm one of the solution architects here at Google, currently based out of our Singapore office. Part of the solution architect role is to try and find new and interesting use cases to apply our technologies to, and one of the things I've been able to do is work with a few banks, including with Neil and Reuven, who I'll introduce now. Neil, do you want to introduce yourself? Hi, I'm Neil Boston. I'm head of IB Technology at UBS, and I also run UK technology, so I run wealth management and asset management out of the UK as well. And Reuven? My name is Reuven Lax; I'm a software engineer on Google Cloud Dataflow. So the project we've been doing is making use of Dataflow — our batch and stream processing system — to do grid computing. We've done some pretty interesting work, building out some demos and some near-realistic scenarios. But to describe what grid is, and some of the challenges and opportunities, I'll hand over to Neil. Thanks, Reza. I'll just talk through maybe three or four slides very briefly, and it's from a very personal perspective, so if you agree or disagree, please come and have a chat afterwards and we can debate some of the points. What is grid computing? To me — not to read the words out exactly — it's basically having a large cluster with a set of nodes that are going to do some kind of analytic calculation. That's generally the way I approach it. It can be deterministic stuff I've coded myself, or it can have elements of ML as well, but basically I look at it as a large cluster of calculation agents doing some set of tasks. Generally, hopefully, you can make them relatively stateless and run them in a relatively independent way, so that when you aggregate and bring the data back, that can be thought out in a relatively straightforward manner.

So, as with most things in my life, I started with three wishes — and not asking for infinitely many wishes, or some larger finite number of wishes, as one of the wishes, I guess. When I approached Reza and we were discussing this — I guess it was actually in a pub when we first started talking about it; good engineering projects start in London pubs, and this one was near Farringdon, which is very good — I started writing down what my user stories were, and they became three wishes, basically. First, I wanted to write single-threaded, simple calculation classes, functors or packages; I didn't want to have to worry about things like multi-threading, for example. Second, I wanted to stop thinking about creating my own directed acyclic graph, templated or not — I just wanted to focus on my IP and my analytics in a very simple way. Third, I'd like to be able to tweak and run my experiments in a much more seamless way, rather than having to recompile and rebuild a lot of my analytic classes. So I wanted some environment where I could basically drop my parameters in, run a set of experiments, get a set of results out, and do that repeatedly, either manually or potentially through a set of machines. So: three simple wishes.

Challenges. I've been in markets technology for 150 years now — so, a long time. Quants: I wanted them to really focus on quantitative analysis, even the maths — I just wanted my people to focus on the maths and the outcomes rather than thinking about this whole fabric. Over the years I've been in markets technology, you can spend 20% of your time thinking about the mathematics and 80% of your time thinking about how you make the thing scalable, how you make it distributed. And actually, one of the challenges as well — my background is C++ — is that you can spend a lot of your time trying to optimise your C++ code to run on your PC, which obviously has diminishing returns of scale, so again I wanted to get away from that. I wanted my mathematicians and physicists and engineers to focus on running experiments rather than thinking about the other aspects of the ecosystem — much more about the maths, the data coming back, and the outcomes, and analysing that either manually or again through machines. And I wanted to be able to do it very, very fast: if I didn't know which model to choose and I had a category of models, I wanted to get to some reasonable outcome relatively quickly by running experiments at high velocity. That was really the challenge — it's a kind of utopia, I guess, from a quant perspective.

Opportunities. Running it on demand and also event-driven — those are two sides you very much see in markets technology. I sometimes want to run things at very large scale, lots of things, which we'll see later in the experiment, but I also have things coming in at high frequency that I want to deal with as well, so you can think of it as a bit of a bifurcation in the ecosystem or fabric. I want to be able to run a universe of parallel experiments and then do some assessment of them, either through some parameterisation, through some ML piece, or again manually, just looking at these things in a reasonable way — and then run huge simulations. If you come from banking and look at the way the market has gone, especially with regulation in recent years, running large Monte Carlo simulations, for example, on entire portfolios across all the markets that the investment bank's markets division has is again a significant challenge.

I'll talk a little bit later about the organisational issues you have around some of the technology choices banks make to try and accommodate those solutions. But regardless of the simulation or experiment, I wanted a relatively stable and consistent ETL-style environment that I didn't have to worry too much about. Now I'll hand over to Reza — he can talk about Dataflow. Sure. So this is interesting, because Dataflow we use predominantly for things like ETL pipelines: we take data, we process it, we pass it on to things like BigQuery. So why were we using it for grid computing? To look at this — and this came after many conversations where we were sitting down, storyboarding and looking at architectures — you have to deconstruct what a grid is, and then it makes perfect sense why Dataflow is a very good tool for solving this problem. So what is a grid? First of all, it's data. It's a bunch of maths functions, from smart maths folks like these guys. You need lots of CPUs, and you need some way of scheduling jobs on those CPUs — you need to get data and functions to the CPUs and get them to do work. And you need storage. Let's go through those in a little more detail, and we start to see why Dataflow is particularly useful here. In terms of data, there are quite a few data sources. There are internal data sources — the trades and so on happening within the bank; there's a lot of information there. There's external data, potentially from market environments — all of this raw, underlying data that we then need to process. And then there's derived data: data where you've taken underlying information, done some processing on it, and produced new data that you need to distribute — for example, the yield curves you're creating based off things like the swaps and deposit information you had earlier on. The next piece around data is that it doesn't come in a nice static file — it's streamed, continuously being updated. Again, this starts to get interesting in terms of what Dataflow is capable of, which is doing stream processing on that data as it comes in. Next: it's functions — maths functions — and lots of CPUs. One of the interesting challenges — fun bits — we had is that a lot of the quant code written in the last 20 years is C++, and Dataflow today does not support C++ as a standard language. So we had to do some work — and I worked with Reuven — to get Dataflow to run C++ code for us, so that was a little bit of fun. I was just going to say, I guess wish 3.1 is: I don't want to JNI into the C++; I wanted to think about it in a much more natural way. So that was my wish 3.1 — a rather extreme wish. And I found out from Reuven that C++ is actually the original Flume, right? Yes — internally at Google we do run a version of Dataflow with C++; it's just that at the time we came out with Dataflow there wasn't much of a cloud market interested in C++ — most people were interested in Java or Python or other languages. So we basically retrofitted C++ running on Dataflow, even though the original Flume was all C++.
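The talk doesn't show the mechanics of this on stage. One common pattern for running native code from a Beam Java pipeline — not necessarily the exact approach the team used — is to stage the compiled C++ binary on the workers and invoke it as a subprocess from a DoFn. A minimal sketch, assuming a hypothetical /opt/pricer/price_option binary that reads one request per line on stdin and writes one result per line on stdout:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import org.apache.beam.sdk.transforms.DoFn;

/** Calls a native C++ pricing binary per element (binary path is hypothetical). */
public class NativePricerFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    // The binary must already be on the worker, e.g. baked into a custom worker
    // image or downloaded from GCS in a @Setup method.
    Process pricer = new ProcessBuilder("/opt/pricer/price_option").start();
    try (BufferedWriter in =
             new BufferedWriter(new OutputStreamWriter(pricer.getOutputStream()));
         BufferedReader out =
             new BufferedReader(new InputStreamReader(pricer.getInputStream()))) {
      in.write(c.element());    // one serialized pricing request per line
      in.newLine();
      in.flush();
      c.output(out.readLine()); // one priced result per line
    } finally {
      pricer.waitFor();
    }
  }
}
```

In practice you would start the process once per DoFn instance (in @Setup) or batch requests rather than launching a process per element; this sketch favours brevity over efficiency.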

The other piece around this is the ETL work: you're having to manage data, shape it, enrich it. Then, once you've got that data, there's the question of storing it — you need to pull from storage within the pipeline itself, and you may potentially need to output to storage as well. And these grids have huge fan-out: I can take a couple of megabytes of market information, or even, say, a gigabyte, and it generates terabytes of output, so we need some way of dealing with huge fan-out in these environments. Then there's one secret ingredient that has actually made the testing we've done so far very productive, and that secret ingredient is the movement of data. Think about the fact that we're pulling source information and doing ETL, then sending that data to the next stage which does the processing — running those maths functions — and then sending that data again to storage. These are all chunks of data movement. With Dataflow, we can actually create a DAG — a distributed, directed graph of computation — and for those connections between all of the nodes, Dataflow takes care of moving the elements for me. This is again why we started thinking Dataflow is a very good fit for this environment. And when we think about this movement of data, one thing I noticed — I don't originally come from a finance background — is that you see a lot of organisational structure coming into how the DAG is created. By that I mean: there's a team that does the ETL work and generates a file; that file is then moved to the team that does the grid work; the grid work outputs another file, which is moved to the team that does the analytics work. What that is doing is essentially imposing the organisational structure on top of how the technology should look. But that's not really what you want — what you want is one continuous operation from beginning to end, and Neil will explain this in better detail than I could. So — this schematic isn't specific to the banking industry; the way I've seen it across many industries over the years is that people tend to design technology by organisational construct. What you tend to find is that people within groups A, B and C each have a view about how their world looks, and maybe they have specific user stories giving them some niche aspect, and what they tend to do is build the ecosystem up to their own boundary conditions and then hand over the data or information to the next group. You then end up with data translation, ecosystem translation, reconciliations, and that style of problem. Again, like I said, it's not just our industry specifically — there are other industries too — but I think people start making technology choices based on their organisational construct rather than thinking about what's optimal for the solution front-to-back, or left-to-right. So one of the things I wanted to challenge a little, whether you think about it in an evolutionary or a revolutionary sense, is how we can offer something that actually crosses those boundaries naturally, with a very integrated stream-based and batch-based approach to the problem. So — a little bit more about Beam now, and a better person than me to describe it is Reuven, one of the technical leads who built Beam and the Dataflow runner for Beam. Thank you, Reza and Neil — oh, and thank you for the clicker. This use case is very interesting to me because, as Reza mentioned, I think a lot of people see Dataflow as an ETL tool — it's a very common use case to use Dataflow to get your data into files or into BigQuery. But Dataflow is actually a programming model.

And it's a programming model that fits this grid use case very well: you divide your logic into massively parallel computation with stateless functions, and then at key points you put in aggregations where you need to bring the data together — to compute a count, or to write it out to BigQuery. The key to this, as Reza said, is the data movement, which is one of those things that sounds easy — you just have to move data around — but it turns out that writing a system that can move massive amounts of data at scale, efficiently, with strong semantics — never moving an element twice, never duplicating an element — is surprisingly hard to do. Having all of that integrated together is one of the advantages you get out of Dataflow and what makes it work really well for systems like this. The way we have Dataflow designed and presented is as a three-layer system. We have APIs — currently a Java API and a Python API, with more APIs coming: there's a Go API which is usable today in early access, and there's also a SQL API, so you'll be able to write Dataflow pipelines entirely in SQL. Then there's the programming model I talked about — all these APIs just live on top of this programming model. And then there are multiple runners to run your pipeline: there's the Google-hosted runner called Google Cloud Dataflow — the fully managed runner that you just point at our cloud and it runs — but we also have open source runners, one on top of a system called Apache Flink and one on top of a system called Apache Spark. So your Dataflow pipelines are not bound to only running on Dataflow. This is interesting in the hybrid-cloud case, and we've also found in the financial services industry that this has been interesting to some people for regulatory reasons: if a job needs to be rerun on-prem because a regulator requests it, it can be done with one of these open source runners.
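As a concrete illustration of that model — this is not code from the talk — a pipeline is just a graph of element-wise steps (the stateless functions) plus aggregations, and the same code runs on whichever runner you choose. A minimal sketch in the Beam Java SDK, with a made-up input path and a stand-in pricing function:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalGridPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadRequests", TextIO.read().from("gs://my-bucket/requests/*.csv")) // hypothetical input
        .apply("Price", MapElements.into(TypeDescriptors.doubles())
            .via((String line) -> priceFromCsvLine(line)))   // stateless, element-wise "math function"
        .apply("CountResults", Count.globally())             // an aggregation step
        .apply("Format", MapElements.into(TypeDescriptors.strings())
            .via((Long n) -> "results=" + n))
        .apply("Write", TextIO.write().to("gs://my-bucket/output/summary"));

    p.run();
  }

  // Stand-in for the real quant logic.
  static double priceFromCsvLine(String line) {
    return Double.parseDouble(line.split(",")[0]);
  }
}
```

The runner is chosen at submission time (for example --runner=DataflowRunner, --runner=FlinkRunner or --runner=SparkRunner), not in the pipeline code itself.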

So what is the advantage of Cloud Dataflow as a system? Cloud Dataflow is one of the runners for this Beam API, and it presents a fully managed, no-ops, serverless solution. The SDK gives you a nice, simple API to develop these things, but then Dataflow manages security for you and manages the cluster for you — you don't have to spin up a cluster to run your pipelines. It will auto-scale your pipeline: Dataflow attempts to learn how big a cluster you need to run your pipeline, and it will actually shrink and grow this cluster over time. As you get towards the end of your job and there's less data left, Dataflow will start shrinking the cluster and try to optimise costs as much as possible. There's also a lot of integration with TensorFlow-based machine learning models: there are TensorFlow APIs you can integrate inside your Dataflow job, and there's a suite of TensorFlow-based transforms that you can run via Dataflow — a suite called TFX, which is a set of machine learning transforms for prepping and evaluating machine learning models. A classic example: I train a new machine learning model and run a simple analysis that says, okay, it's 5 percent better than my last model. Then you deploy it and find out that it was 5 percent better on average, but it was producing worse results for everybody from the state of Florida, for instance — so it's actually not a better model. It turns out evaluating these models, and making sure a model actually is better, is fairly tricky. TFX provides suites of tools to do things like that — good evaluation of models is one example — and this all runs on top of Cloud Dataflow. I mentioned that Dataflow lets you run serverless data analytics, as opposed to systems where you have to run a cluster. So what is the benefit of serverless data analytics? The little chart on the left is traditionally what you would have to do to run your big data analytics. There's a tiny piece of your work in which you actually work on your analysis and your insights — your business logic, your functions, the actual quant stuff that Neil wants his people to spend most of their time on — and then you would spend about 90% of your time on everything else: building a monitoring solution to make sure your job is running; performance tuning; figuring out how many workers you need, how much memory to give each worker, what type of worker to run on — four-core workers, two-core workers, or sixteen-core workers; making sure utilisation is up to snuff; figuring out a deployment story and a configuration story — people often spend a huge amount of time coming up with different deployment stories for these systems; resource provisioning; handling growing scale — I came up with something that worked, and suddenly six months later my input data is 50% larger and the thing I came up with no longer works, so I go back to the start and run through the whole process again; reliability; setting up alerting, making sure the jobs actually run reliably and complete as expected. The advantage of serverless is to cut out all of this pie except the analysis and insights: focus on the actual business logic you want to run in your pipeline, and let the serverless system — in this case Google Cloud Dataflow — handle all the rest for you. And finally, we've been saying that Dataflow is not just a product, it's also part of a platform, and Google Cloud Platform is the platform Dataflow is part of. A platform is not just one technology; it's a group of technologies that act as a substrate to build higher-level systems. As Reza is going to show you here, the interesting thing for this problem is not just Dataflow, but how I can use Dataflow in conjunction with BigQuery and with Google Cloud Storage, and use all these things together to provide a solution.

Dataflow has sources and sinks, as we're going to show here, to all of these other GCP stores and data sources, so it provides a great substrate to link all of these things together. You have sources and sinks; you have an API — in this case the Dataflow API — which is your way of linking everything together: read from this data, run these transforms over it, write to BigQuery, write to Bigtable, write to Spanner, write to Pub/Sub, write to whatever other sink you want. And your pipeline is declarative. This image here is actually an example of a Dataflow graph — in fact, I believe the one Reza will show you in a second, running on real data. And now, Reza. Yep — thank you. Just a couple of things about the other bits of technology we use. When I was doing the experimentation with Neil, I don't actually have access to UBS's data, obviously — I have to work in my own environment — and we wanted to make this as real as possible; I didn't just want to make synthetic data, especially for the inputs and the ETL layers. So what we were using was BigQuery. BigQuery is Google's petabyte-scale data warehouse in the cloud, fully managed and serverless. One of the key aspects I was making use of here is that it separates processing from storage, which means sharing data is very, very easy, and this meant I could get access to external data sources. In particular, Thomson Reuters were kind enough to provide me data: they have a store in their project where they're putting historical tick information and current information, including things like swaps, deposits and so on, which I need in order to build things like yield curves. For POCs they can also put up to 20 years of historical data onto BigQuery, and I get the benefit of instantly being able to use that data without having to move it between projects — so again, thank you, Thomson Reuters, for helping me do this. One of the other pieces — which we aren't showing in the demo right now but will be using in the real thing — is this: today I'm actually sending even the raw data to BigQuery, very granular data, just so I can show it to you. In a real production system we'd send the outputs, the results — still many billions of rows of information — to BigQuery, but the raw data we want to keep would be many, many terabytes, and that I would send to Bigtable. Bigtable scales linearly, and we use it internally within Google for solving exactly that kind of problem. It's perfect for this environment, because we have many thousands of cores all generating data, and if we don't want to create a bottleneck we can just use Bigtable, scale it out linearly, and have all of those cores push data directly into Bigtable at many tens of thousands — actually hundreds of thousands — of QPS.
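To make the BigQuery sharing point concrete: because storage and compute are separated, a pipeline can read a dataset shared from another project directly, with no copying. A minimal sketch — the project, dataset and table names here are invented for illustration:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

public class MarketDataReads {
  /** Reads swap quotes shared by a data provider in another GCP project.
   *  The pipeline only needs read access to that dataset; no data is moved beforehand. */
  static PCollection<TableRow> readSwapQuotes(Pipeline pipeline) {
    return pipeline.apply("ReadSwapQuotes",
        BigQueryIO.readTableRows()
            .from("provider-project:market_data.swap_quotes")); // hypothetical table
  }
}
```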

So with that, Neil is going to walk through what the pipeline is actually doing on the finance side. Yeah — so I guess, based on the 3.1 wishes, I set up a very basic experiment. I don't know how much background you have in investment banking and the markets piece, but the most basic experiment I could think of was: source market environments, or market data — that can be things like FX spot, or yield curves, or volatilities. Take a portfolio, a set of trades — in this case we just randomly created a million option trades. Take that as our underlying data, process it as you can see here, apply some transforms to the market data — you can select a set of market data; in this case we used an open source library to produce a yield curve functor, an object, from the input — and process the data. I also wanted to create scenarios. What I mean by that is: in the market you tend to have a set of points coming in, and you also want to be able to shift things around, so you can end up with a curve with ten points on it, for example, and I want to shift each point independently, and I also want to do parallel shifts as well. So I wanted to create scenarios on the fly, which I thought would be good from an experimentation point of view. Then I wanted to run a set of analytics on it — some of which I coded, which is very basic — and then analyse the experiment. So this was the most basic experiment I could think of. I could have chosen a different trade set to make it easier to price and easier to risk, but it was just personal bias and preference that I wanted to try options. So, based on the 3.1 wishes, this is the most basic experiment I thought we could start with, just to see what it would look like. For scenarios, again I just chose fairly market-standard things — non-parallel shifts, weighted shifts, independent shifts and parallel shifts — to create a much larger set of output data, more in line with what you'd see in the market. Thank you — can we switch over to the laptop, please? So, we have a good problem to have with a demo: we had timed this so it would actually be doing the scenario runs while I switched over to the laptop, but last night I was working with one of our performance engineers from Dataflow and we made it a little bit too fast. So what you won't see right now is this stage — I'll talk through what it's doing; you would have seen some numbers going across here, but it's already zipped past, so I'll walk through what it was showing. What I have here is the monitoring interface for Dataflow. The graph that I've built in code I've submitted as a job. At the point of submitting the job there are no machines running — everything is cold. By submitting the job, Dataflow starts spinning up the number of workers I've enabled, starts sending my code to the environment, and starts setting up all of the sink and source information we need. In particular, on the right-hand side we see information about the job itself — it's still running, because it's now pushing the output to BigQuery. We can see that it spun up a thousand workers, and in this instance it's using 4,000 cores from our environment. Dataflow does support autoscaling, so we could let it bring up the number of machines it needs and then bring them back down. I deliberately don't use that here, because I just wanted to make this run as fast as possible, which introduces some potential inefficiencies; for this demo we just start with 4,000 cores and keep going with the full 4,000.
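For reference, pinning the worker count like that is just a job-submission setting. A sketch of the relevant Dataflow options — the worker count mirrors the demo's description, while the machine type shown is an assumption (1,000 four-core workers would give the 4,000 cores mentioned):

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DemoOptions {
  static DataflowPipelineOptions demoOptions(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setNumWorkers(1000);                      // 1,000 workers ...
    options.setWorkerMachineType("n1-standard-4");    // ... x 4 cores = 4,000 cores (assumed type)
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.NONE); // pin the cluster size for the demo
    return options;
  }
}
```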
A little bit more information here: in my code I've been able to put in custom counters that tell me what's going on within the environment, and this is tightly integrated with things like Stackdriver, so you can have monitoring going on while your big jobs are running and see where things have got to. In this particular instance we were sending in a million trades, and the number of scenario-trade combinations comes out to 847 million — so not quite the billion, but that's our next set of experiments.
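The custom counters mentioned here are Beam metrics. A minimal sketch of how a DoFn might report them (the counter names and the fan-out helper are invented for illustration); the values then surface in the Dataflow monitoring UI and Stackdriver while the job runs:

```java
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

public class GenerateScenariosFn extends DoFn<String, String> {
  // Custom counters, visible in the Dataflow monitoring UI while the job runs.
  private final Counter tradesIn =
      Metrics.counter(GenerateScenariosFn.class, "trades_read");
  private final Counter combinationsOut =
      Metrics.counter(GenerateScenariosFn.class, "scenario_trade_combinations");

  @ProcessElement
  public void processElement(ProcessContext c) {
    tradesIn.inc();
    for (String scenario : scenariosFor(c.element())) { // hypothetical fan-out helper
      combinationsOut.inc();
      c.output(scenario);
    }
  }

  // Stand-in for the real scenario generation.
  private Iterable<String> scenariosFor(String trade) {
    return java.util.Collections.singletonList(trade);
  }
}
```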

A little more information: these are some of the option trades I'm sending in to the job. If we look at the DAG itself: I'm doing some reading of information — there's market data I'm reading from the Thomson Reuters information, and we have some swap data. If I expand this out: within the code, each one of these boxes generally corresponds to a transform, and I can collapse logical things together in my code. What you see there is several examples — the different read-swaps steps — all being put into a single transform. The way I've done this isn't how we'd do a production version; there we'd just have one read-swaps step and use code to handle the different data types, but I use it here for illustration. So this is the read stage, and if I highlight every single element here, we can actually see the inputs and outputs of each of these stages. Now I'm going to move down — sorry, I'm not used to the touchpad on this device. One thing we did as an experiment that has worked out quite well: rather than have the trades flow through the DAG, with introspection out to the data sources for all the data they need, we reversed this. All of the trades come in as a lump, as a side input. I do all the ETL and processing I need, including scenario generation, from the top, and then — because the introspection would have meant lots of RPC calls, not long ones but still expensive — by bringing the trades and all the scenarios together, I just do a really dumb, massive fan-out with a loop. It generates huge amounts of data tuples: the data I need to run, which includes all the values from the yield curves, for example — an array of some 9,000 doubles in this case. I generate all these values and just let Dataflow do what it's good at, which is moving lots of data around for me. So here we have the trades being read, coming in from the side and being added as a side input. Just to explain a little more what side inputs are: think of it like a broadcast join. If your data is of a size that will fit in the environment, then rather than doing a shuffle join, where we join all the elements together, you take the smaller table and just make it available to all the environments. Then, when my scenarios come in, all that data is already there; they just loop through, and any trades that happen to match the currency I'm looking for come out as a fan-out of the scenarios.
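A sketch of that pattern in Beam Java: the trade set becomes a side input (the "small table" of the broadcast join), and every scenario element loops over it. The PCollections, the CSV layout and the currency check are all placeholders, not the team's actual types:

```java
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class TradeScenarioFanOut {
  /** trades and scenarios are assumed to be CSV lines of the form "id,currency,...". */
  static PCollection<String> fanOut(PCollection<String> scenarios, PCollection<String> trades) {
    // Trades are small enough to broadcast to every worker as a side input.
    PCollectionView<List<String>> tradesView = trades.apply("TradesAsSideInput", View.asList());

    return scenarios.apply("FanOutTradesPerScenario",
        ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            String scenario = c.element();
            String scenarioCcy = scenario.split(",")[1];
            // Loop over the broadcast trade list; emit one work item per matching trade.
            for (String trade : c.sideInput(tradesView)) {
              if (trade.split(",")[1].equals(scenarioCcy)) {
                c.output(trade + "|" + scenario);
              }
            }
          }
        }).withSideInputs(tradesView));
  }
}
```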
So here we have the scenarios being generated. I actually do a little bit of optimisation in that, for each piece of work, we batch a hundred scenarios to be processed, which is a bit more efficient in the next stage. And here we have — this is Neil's code — this is C++ code running in Dataflow. Neil has written... I'll let him describe it, because I have no idea. Well, I guess basically — I don't know how familiar you are with options pricing — there's nothing particularly smart about it. Black and Scholes wrote a paper in 1973 on how to price options; they managed to show that it behaves like the diffusion equation, which is a partial differential equation. So all I did was write a finite-difference method — a way of discretising and solving that — but ultimately all I'm really doing in the C++ is sparse-matrix solving calculations over time. There's nothing particularly smart about it; if you look at the whole thing, it's probably the least smart piece of the whole ecosystem, I guess.
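For readers who want the equation being referred to: the Black-Scholes (1973) result is that the option value $V(S,t)$ satisfies a diffusion-type PDE, which the C++ code solves by finite differences. The standard textbook statement (not taken from the talk's slides) is:

$$\frac{\partial V}{\partial t} + \tfrac{1}{2}\,\sigma^{2} S^{2}\,\frac{\partial^{2} V}{\partial S^{2}} + r S\,\frac{\partial V}{\partial S} - r V = 0,$$

where $S$ is the spot price, $\sigma$ the volatility and $r$ the risk-free rate. A finite-difference scheme puts $V$ on a grid of spot levels and time steps and rolls back from the payoff at expiry; each implicit time step reduces to a sparse (tridiagonal) linear solve — the "sparse-matrix solving over time" described above — and it also explains why the output is a vector of values across spot levels, with the value at the current spot sitting in the middle.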

Well, that's what he thinks. That's where the real work happens, right — we used that function as a good indicator of what a real thing would look like, because it's doing real work, and this is where most of the work happens in this environment. Up until now we've been doing ETL, setting up all the scenarios, doing these joins and distributing this massive fan-out of work, which generates, as we can see from the output of this stage, 847 million results. What we do next is push this information into BigQuery, and if you notice, the number of elements multiplies quite a bit. The code that Neil has written produces a spot index, which is actually the thing we really want, but it also produces an array of information — basically, when I do the matrix calculation, which evolves the array over time to some answer, you get a vector back indexed against different spot levels, but I actually want the one in the middle, because that's the current spot level. So that's the thing we want, and that's the thing we send to BigQuery. However, one of the advantages of being able to use this data platform is that at each stage of the pipeline I can output results and go back and look at them later, because I don't care about the I/O — it's just Bigtable, it will deal with it. So what I'm doing here is also taking all of those intermediate values and dumping them into BigQuery — I did that so you can see them in BigQuery. In a real production system, the spot index is what we'd take to BigQuery, because that's what you'll do analysis on; the other values I'd write to Bigtable, so if you wanted to go back and check your results you'd just go to Bigtable, where we can do really fast index and range-scan lookups. So here I'm writing 14 billion rows of information to BigQuery at the end of this. And this is the pipeline running — it's now finished its work. That looks a bit closer to 15 billion to me, Reza. 14.9. Okay, okay — he's the engineer here, right.
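A sketch of what that output stage can look like in Beam Java — writing result rows to BigQuery at scale. The table name and schema fields here are invented for illustration; `results` stands for a PCollection of TableRow produced earlier in the pipeline:

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

public class ResultsSink {
  static void writeGridResults(PCollection<TableRow> results) {
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("trade_id").setType("INTEGER"),
        new TableFieldSchema().setName("scenario_id").setType("INTEGER"),
        new TableFieldSchema().setName("spot_index_value").setType("FLOAT")));

    results.apply("WriteGridResults",
        BigQueryIO.writeTableRows()
            .to("my-project:grid.details")   // hypothetical destination table
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
  }
}
```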

So I'm just going to show the results as they landed in BigQuery. This is another point: when this finished, there was no file that I then needed to load into my analysis system. When this finished, it's all done — I can immediately run a query with BigQuery against that data. This is actually using a different table, but just to show the principle: here we have the grid results; the details table is 14 billion rows of information, which equates to 1.6 terabytes of data loaded into this table. What I'm going to do is not particularly smart — I'm just going to say: given that the trade number was 8876, give me the max spot index across all the scenarios. When I run this, it goes through and queries all 14 billion rows of information, and I have my output. So right at the end you can start exploring your data set or generating the reports — because ultimately we're not doing all this for fun, right; the bank is actually running this and wants reports at the end of it — and we can immediately start running those reports. Another thing you can explore with — and we started to do this — is Datalab, which is a notebook where we can do Python coding against the data sources. Because all the intermediate stages of the calculations are stored in BigQuery, I just did some exploration with a notebook, and I've stored it here as a PDF. Essentially what we're doing is looking at all the information that's available for the whole pipeline. Here is some underlying data I had from the Thomson Reuters datasets; I've graphed some of it — this is what some of the underlying data looks like. I also had a look at the yield curve creation to make sure it looks okay, and here are the asset and the numeraire — the things I send on to the C++ code — both graphed. One of the interesting things here: this is the numeraire, and if you notice, most of them are at the top and there's one weird one. That was an accident by me — in one of the scenarios I typed 0.5 as the amount I wanted things to shift instead of 0.05. By looking at this I immediately know something's weird, something's wrong, and this is again something the data scientists will be able to use to look through the output of the results. Now I'm just going to go right to the bottom and let Neil explain this piece. Essentially this is looking at the result sets, and Neil was interested in starting to look at how we can show this in three-dimensional space. So — I guess the thing I was trying to get to is that I'm a big believer in just looking at a set of results as well, to see the things you can see visually. Once we went down the line of running the experiment, I asked whether there was stuff we could put on top so you could start visualising what this thing looks like against various outputs. And when we looked at this, I was interested to see — we were adding up the PVs, the present values, of a lot of these trades, and it seemed like they were relatively cancelling out. When we looked at the portfolio, we realised we hadn't randomly created buys and sells — we'd done roughly half buys and half sells, so they were tending to net out, which was just interesting in itself when we thought about the portfolio we'd created. Some of the talks I saw yesterday, from some of the other groups, really resonated with me: I was trying to get to almost having a single inquiry box where I could type a set of asks of the data — is this a correlation against this, how does this look against historical volatility, how does this look against that over time. It always seemed to me to be evolving toward a visual representation, almost a single kind of inquiry box.

And I saw a couple of demos yesterday where they had that style of approach, with almost a layer over the top as well, so you could type in quite market-specific language and it would give you back a set of results around that. I thought that marrying that style of inquiry to this style of experimentation would be quite a strong thing. So this was really just the very basic version — I tend to plot things like time against spot against PV against other things, in a very basic way, just to start looking at what the nature of the portfolio is — but given the couple of days I've been here, there already seem to be some evolutions of how we could start analysing the data in a more effective way, and asking questions in a much more fluid, experimental sense as well. Thank you — can we go back to the slides, please. Thank you. So, just in terms of creating the pipeline: in that one we showed the pre-processing stages, the running stage and the post-processing stages, and the semi-joke here is "one pipeline to rule them all" — we don't create separate processes for each of these things; we just did one, and it went from the raw data at the beginning all the way to the output, including analysis. One of the other things we wanted to do — one of Neil's first wishes was "I want to experiment" — is, now that we have this DAG to do the processing, to break apart some of the larger, more monolithic quant code and see if there are linear components in there that we could distribute as well. I'll let Neil and Reuven talk about some of this. So again, it's something I guess I should have thought about a little more intensively over my career: once we had stopped worrying so much about the ETL piece and the simulation and ecosystem pieces, we could start really thinking about where there are parts, certainly within the kind of code base we were writing, that we could look at in a more linear way from a parallel-processing perspective. And even as we started to look at that, the solution we've just been talking about had other implications I hadn't expected, where we started looking at ways we could break these problems apart differently from what we've done in the past — just as someone writing some quantitative code. This is a common thing we've seen as people move from their old systems to new systems: people have code that was written assuming it runs on one machine, and it would be highly optimised in ways that make it very difficult to parallelise — sometimes it spins up threads and then joins on those threads, or interleaves many different functions throughout the calculation. So that sometimes requires people to rethink how they wrote the code — or maybe, in a mathematical sense, to look at your functions and see whether they can be linearised as a combination of multiple other functions, so that we can parallelise all those pieces out and combine the final result later, as sketched below.
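A sketch of what that decomposition buys you in the pipeline — names and types are placeholders, not the team's code. Once each trade-scenario leg has been priced independently, recombining the linear pieces is just an aggregation step, for example summing present values per scenario:

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class CombineLegs {
  /** pricedLegs: one (scenarioId, presentValue) pair per independently priced trade-scenario leg. */
  static PCollection<KV<Long, Double>> portfolioPvPerScenario(
      PCollection<KV<Long, Double>> pricedLegs) {
    // The legs are computed in parallel across the grid; the runner brings them back
    // together here to produce one portfolio-level PV per scenario.
    return pricedLegs.apply("SumPvPerScenario", Sum.doublesPerKey());
  }
}
```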

And what that means is that when you're drawing out the graphs, the more access I, as a data engineer, have to the various functions, and the more they can be split apart, the more I can think of ways of optimising the flow and the parallelism within the system. The more I can have separate functions that I can pass data to and from, the easier this becomes for me. And this gets to another interesting point we found very useful when doing this project together, which is how we talk to each other. As you can probably tell, half the time I have no idea what Neil's talking about when he's doing the maths — it's not my area, and neither is the finance. However, when we break the problem down, for me as a data engineer, to bytes — he can tell me "I need these bytes to move from here, I need to run this function against these bytes, this output needs to move here" — when you break it down in a way that I understand, it becomes very easy for us to start communicating. And one of the things we used as a common language is protobufs. So we started talking in terms of protobufs — I'll let Reuven describe what protobufs are. A protocol buffer is a common structure and format for data that has been used at Google since probably around 2001 or 2002, so it's been used at Google for quite a while. It's very similar to other forms of structured data — some people use Avro, some people use Kryo, some people use Thrift. Protobufs are an optimised one that Google has traditionally used, and it's also open source. That made it very easy for me to start working with Neil, because I could just give him a definition of how I'm going to pass things through, or he could give me the definition, and then I know what I need to do for the processing. The other thing it opens up is the ability to break apart the functions: if all you're passing as the parameter to a function is a protobuf, and within your code you start just using protobufs to pass information around, then when you break apart the big pieces of code it becomes very easy to plug them into the DAG — because all I'm doing is sending protobufs across those nodes, using that data movement we talked about.

So with that — next steps. As we said, running C++ on Beam is not standard, so we've got an example that allows you to do some of that, and I'll be updating that code a little more after this. Please have a go, and thank you very much for your time. Thank you.
