Plaid's Journey to Global Large-Scale Real-Time Analytics With Google BigQuery (Cloud Next '19)


Thank you all for coming and listening to us. My name is Tino, I work on Google BigQuery, and we're going to talk about some interesting advanced topics, namely real-time architectures. Real-time tends to be fairly difficult, but it doesn't have to be, and we worked really hard to make it as easy as possible for you. We'll go into that in more detail further on, but first let's hear it from a customer who has gone through this journey with us over the last four or five years. Please welcome Yuki, who's the CTO of PLAID.

Thank you for taking the time for our session. This is my second English presentation in my life, so if you cannot understand, or want to know more detail about my part, please find me and talk to me after the session — I can explain in Japanese.

I am Yuki Makino, CTO of PLAID. I have been in charge of the real-time data engine for about four years, since before the first release of our service, KARTE. Today I'll show you our architecture history and the BigQuery usage patterns we use in production.

PLAID is a fast-growing startup company in Japan. We provide a real-time analysis service named KARTE. KARTE has three characteristics compared with other data services: it is real-time, large-scale, and multi-tenant. KARTE is a real-time analytics service, so it handles data in real time. It has more than 500 enterprise-scale clients, so the data volume is very large. And each tenant is isolated by dataset, to share data correctly with KARTE's clients. Some Japanese clients cannot keep their data in foreign countries, so we distribute our data across multiple global regions in BigQuery.

KARTE is not only a data visualization service but also an action service: it processes data and returns actions within one second, so it's a true real-time service. Within that one second, it processes 65,000 events per second — three billion events per day — and about one petabyte of streaming data arrives via streaming insert per month. Currently we store more than two petabytes of data in BigQuery, and we analyze more than 260 petabytes per month. We contracted a flat-rate plan with 6,000 slots: 4,000 for the US and 2,000 for Japan. As you can see from these statistics, we use BigQuery very heavily, and this is the reason why I share our experience today.

In order to understand the architecture, I'd like to give a brief introduction of KARTE. KARTE does real-time analysis of website customers and takes actions, so we call it a customer experience platform. It is for managing customer experience on a website: knowing the real-time customer state and taking the action appropriate for each individual user.

I'd like to show a more technical concept of the system. KARTE's system concept is a real-time reactive system. First, end users send events to our system, and KARTE analyzes the current user state; a user's state changes with every event the user sends. Based on the calculated state, it returns the appropriate actions in response to the event. Our clients can also see each user's state, configure actions in the application in real time, and assign actions to segments of users; the actions for those segments then start to fire instantly.

We started to use BigQuery about four years ago, soon after our first release. At the time, we inserted event logs with streaming insert, and we used BigQuery just for ad hoc analysis. In the next phase, we started to use BigQuery in the production application directly, to show analyzed data to our clients. We also started to insert analyzed user state into BigQuery, in order to aggregate user state in the application. This is the current architecture: we added new data management functions so that our clients can import and export data in BigQuery with Dataflow, and it can also export data to Spanner and Bigtable, to serve data with low latency. BigQuery is actually the center of our data ecosystem, not only for us but also for our clients. We use BigQuery in various ways to realize a real-time analytics service.

All right, so today I'd like to show five BigQuery usage patterns we use in our production system.

The first pattern is the complex query and cache pattern. This pattern is used in front of the client application, and it can be used when you want to run a time-consuming, complex query in a production system. The normal way, when you want to serve data stored in a data lake or warehouse to a production application, is to put an ETL layer in between and fetch the data from another database. Previously we used another ETL'd database to fetch analyzed data, but it was complex, not flexible, and not real-time: whenever we wanted to change the access pattern, we had to change both the ETL process and the application code that fetched the data. So we built an in-memory cache layer in front of BigQuery and access it with SQL transparently. If we want to change the access pattern, we just change the SQL the application uses. It's very flexible.

In our case, we use a hash of the SQL text as the key of the cache entries. If the hash does not exist in the cache table, the layer runs the query in BigQuery and creates a new cache entry. However, in this pattern a cache-miss query takes some time, so if you want to shorten the first query's latency, you can optionally create the cache asynchronously, in the background, in advance, and then read it immediately. This is an actual production screen where we use this first pattern: you can see real-time data in the application, and it's quick enough for a production application.
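To make the pattern concrete, here is a minimal sketch of the cache layer, assuming the google-cloud-bigquery Python client and a plain dict standing in for the real cache backend; all names are illustrative, not PLAID's actual implementation.

```python
# Minimal sketch of the query-result cache pattern: the SHA-256 of the SQL
# text is the cache key, so changing the access pattern only means changing
# the SQL the application sends. The cache backend here is a toy dict; a
# real deployment would use an in-memory store such as Redis.
import hashlib

from google.cloud import bigquery

bq = bigquery.Client()
cache = {}  # stand-in for the in-memory cache layer


def cached_query(sql: str):
    key = hashlib.sha256(sql.encode("utf-8")).hexdigest()
    if key in cache:                 # cache hit: no BigQuery round trip
        return cache[key]
    rows = [dict(row) for row in bq.query(sql).result()]  # cache miss
    cache[key] = rows                # create a new cache entry
    return rows


# Optional warming: a background job can call cached_query() with the
# known SQL statements in advance, so the first user never sees a miss.
```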
The next pattern is the ELT and simple query pattern. This pattern is also used in front of the client application, when you analyze data more interactively with BigQuery. BigQuery is powerful, so we store everything raw in BigQuery as a data lake, with schemaless JSON strings. Queries are fast enough even for this kind of raw data, but if you want to do heavy tasks — such as parsing raw JSON or joins with multi-step aggregation — a query may take more than ten seconds. If you want to analyze this schemaless data interactively in BigQuery, ELT with scheduled queries is suitable.

Scheduled queries are very useful, fast, and simple. A scheduled query saves its query results as a new table, and you don't need any additional system to do ETL or ELT. To keep the data close to real time, it periodically extracts keys and values from the raw JSON, calculates aggregate values, and merges them into a simple summary table. For example, with a scheduled query we parse JSON log data, aggregate the values grouped by user ID, and create one row containing the segments and stats of each user. From the production system, the application then runs a simple query, such as a point lookup, and BigQuery's latency for a simple query like that is very low — super fast. This is implemented only with ELT and BigQuery tables.
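Below is a hedged sketch of the ELT step. In production this would run as a BigQuery scheduled query; here the same "save results as a table" behavior is shown with a destination table on an ordinary query job. The project, table, and JSON field names are made up for the example.

```python
# Sketch of the ELT step: parse raw JSON events and save the aggregate as a
# summary table that the application can hit with cheap point lookups.
from google.cloud import bigquery

bq = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination="my-project.analytics.user_summary",       # hypothetical
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

elt_sql = """
SELECT
  JSON_EXTRACT_SCALAR(payload, '$.user_id') AS user_id,
  COUNT(*)                                  AS event_count,
  MAX(event_time)                           AS last_seen
FROM `my-project.raw.events`                -- schemaless JSON strings
GROUP BY user_id
"""

bq.query(elt_sql, job_config=job_config).result()  # rewrite the summary
```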

The next pattern is the frequently updated row pattern. This pattern is used for streaming insert of user state data, when you want to query the latest version of each row. If the average row size and cardinality are not so large, you can just insert all rows and query the latest row by using a window function such as ROW_NUMBER. But if the average row size and cardinality are large, the costs — streaming insert cost, storage cost, scan cost — become high. So we put a buffer in front of streaming insert to reduce cardinality. The buffer holds data to be inserted for a short period, let's say three minutes. Within three minutes of the first entry, if a row with the same key arrives, it simply overwrites the entry's values. Three minutes after the first entry, the entry is inserted into BigQuery with streaming insert. This buffering prevents writing every intermediate update to BigQuery. The buffer can be implemented with Dataflow or an in-memory queue; in our case we use an in-memory, scalable queue.

In our case, we use this pattern to store user state data. The average row size is about 200 kilobytes, and we set the buffering time to five minutes. That is still about 800 terabytes per month with this pattern; if we didn't have the buffer, the data size would become ten times larger. So by using this buffer, we reduce the cost by 90%.
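Here is a toy sketch of the buffering idea, assuming the google-cloud-bigquery client; KARTE's real implementation uses a scalable in-memory queue, and the table name and window length here are illustrative.

```python
# Sketch of the buffering pattern: rows arriving within the window for the
# same key overwrite each other in memory, and only the latest version is
# flushed to BigQuery with a streaming insert.
import time

from google.cloud import bigquery

bq = bigquery.Client()
TABLE = "my-project.analytics.user_state"  # hypothetical
WINDOW_SECONDS = 180                       # "let's say three minutes"

buffer = {}      # key -> latest row seen in the current window
first_seen = {}  # key -> arrival time of the first entry in the window


def upsert(key: str, row: dict) -> None:
    buffer[key] = row                      # same key: overwrite, don't append
    first_seen.setdefault(key, time.time())


def flush_expired() -> None:
    now = time.time()
    expired = [k for k, t in first_seen.items() if now - t >= WINDOW_SECONDS]
    rows = [buffer.pop(k) for k in expired]
    for k in expired:
        del first_seen[k]
    if rows:
        bq.insert_rows_json(TABLE, rows)   # one streaming insert per flush
```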
The fourth pattern is optimized for low-latency serving. This pattern is used for latency-sensitive systems: when you want to use data from BigQuery in a latency-sensitive system, a Dataflow job exports the data in BigQuery to GCP data store services, including Spanner, Bigtable, GCS, and Redis. In our case, we export the data to Spanner and Bigtable, to use it in the real-time serving layer. For example, we export membership information and recommendation information stored in BigQuery, to refresh user state or serve recommendation actions. Spanner and Bigtable are key-value based data stores, so if the scope of entries to be fetched is small, the latency becomes very low. So if you want to use data in a latency-sensitive system, it is reasonable to export the data into these key-value style data stores.

The last pattern is the direct join pattern. This pattern is used for client data management. If your customer has related table data in BigQuery, you can directly join it with your own dataset, across projects. If your data is real-time and the customer's data is real-time, then the joined data is also real-time. You can control access rights at the dataset level, so you should separate data by customer. Once a dataset is shared, you can join the data directly with SQL. No operational cost is needed for import and export, and if you use BigQuery flat-rate, zero additional cost is required for this data linkage and analysis. If you handle sensitive data, VPC Service Controls helps you: it can control access to the data in GCP services at a fine-grained level.

Direct joins can only be done if the target datasets are located in the same region. In our case, on the other hand, some clients in the Tokyo region — including government, financial services, and banking services — are restricted to storing data in Japan by company policy, and are never allowed to export sensitive data to a foreign country. In that case, we copy data from the global region to the restricted region, or insert it with streaming insert into the restricted region in advance. We operate across the US and Tokyo regions: we analyze data in the US region basically, and when we need to, we switch to the data we hold in the Tokyo region.
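For concreteness, a minimal sketch of the direct-join pattern, assuming both datasets live in the same region and one side has been shared across projects; all project, dataset, and column names are hypothetical.

```python
# Sketch of the direct-join pattern: once a customer shares a dataset with
# you (or you with them), tables in different projects can be joined in a
# single query with no import/export step at all.
from google.cloud import bigquery

bq = bigquery.Client()

join_sql = """
SELECT s.user_id, s.segment, c.member_rank
FROM `plaid-project.karte.user_state`    AS s  -- our real-time data
JOIN `customer-project.crm.memberships`  AS c  -- customer's own dataset
  ON s.user_id = c.user_id
"""
for row in bq.query(join_sql).result():
    print(row.user_id, row.segment, row.member_rank)
```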

So far, I have shared five BigQuery usage patterns that we use in our production system. This is the last slide of my part. BigQuery is the core of our analytics architecture, as I showed, and it is not only a large-scale, high-throughput data engine: it can also play an important role in a real-time system. And it is suitable for a global service — KARTE now scales globally. It is the center of the data ecosystem for us and also for our customers. Thank you.

Thank you, Yuki, and thanks to the PLAID team for coming all the way from Tokyo to share their story and what they've done over the past years. PLAID started out with Google and BigQuery four years ago, and as you saw, they didn't start with the massive architecture they have today — they started slow. And that's really the message I want to give you: you don't have to re-platform, you don't have to reorganize things. Real-time doesn't have to be hard; you can start with a very basic architecture. If you're building a brand-new architecture it doesn't have to be difficult, and if you have an existing environment you can simply put this on top of it, with very minimal disruption. This is actually probably a top-three architecture for real-time analytics at Google, and that's it: going from your application, using the BigQuery streaming API — which is awesome — straight into BigQuery, and your data arrives in real time. A lot of our customers do just that, and that's perfectly acceptable.
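A minimal sketch of that basic pattern, assuming the google-cloud-bigquery Python client; the table and fields are hypothetical.

```python
# The simplest real-time architecture: the application streams rows
# straight into BigQuery, where they become queryable within seconds.
from google.cloud import bigquery

bq = bigquery.Client()

errors = bq.insert_rows_json(
    "my-project.app.events",  # hypothetical table
    [{"user_id": "u123", "action": "click", "ts": "2019-04-10T12:00:00Z"}],
)
if errors:
    print("streaming insert failed:", errors)  # per-row error details
```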
Sequel with lots, of lots of extensions, that allow you to express your logic just in playing sequel we, built UDF's, user-defined, functions that allow you to go a little bit beyond them we. Also introduced, declarative. Bigquery. Machine learning, in sequel. And geospatial. Capability. In the last year and so all these things kind of allow you to do a lot of things on the query storage, but. As. Long as you, are bound by these, processing. Terms, but, of course you're gonna say Tino how come you know I I have, lots, of other things I want to do with my data but it's already in bigquery storage what, are my options right, so whether, you're using languages, like Python Java or go maybe, you have a Jupiter notebook. Or Zeppelin, maybe, you're using different, libraries. Like canvas or care ass or, different. You should be reprocessing frameworks, like spark or flink what are your options right the data is already not a warehouse what can you do with it well.

But you are bound by those processing options. Of course, you're going to say: "Tino, I have lots of other things I want to do with my data, but it's already in BigQuery storage — what are my options?" Whether you're using languages like Python, Java, or Go, maybe a Jupyter notebook or Zeppelin, libraries like pandas or Keras, or stream processing frameworks like Spark or Flink: the data is already in the warehouse, so what can you do with it?

There are a couple of options. One is that maybe your data warehouse allows you to put agents into the warehouse — to run third-party code in there. Or maybe you can read data from storage directly, trickling data out. Or maybe you can bulk export data, which is kind of an uncontrollable monsoon. So, to recap the options: you can trickle data out just by using a list API, paginating over the data, which runs at about 20 megabytes per second in BigQuery — so it's real-time, but it's very slow and not parallelizable. You can bulk export, which is decidedly not real-time: it requires a middleman like Google Cloud Storage, and it's probably going to take you tens of minutes to hours to move lots of data out of BigQuery, so it's not suitable for the more real-time use cases. And finally, we're not going to let you run third-party agents inside BigQuery — it's just a security no-no, something we can't do. So: no good options.
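For concreteness, here is roughly what those two workable-but-mediocre options look like in code — a hedged sketch using the google-cloud-bigquery client, with a hypothetical table and bucket.

```python
# The two pre-Storage-API options: trickling rows out with the paginated
# list API, versus a bulk export job that stages the table in GCS.
from google.cloud import bigquery

bq = bigquery.Client()
table = bq.get_table("my-project.app.events")  # hypothetical

# Option 1: trickle -- list_rows pages through the REST API (~20 MB/s).
for row in bq.list_rows(table, max_results=1000):
    pass  # process each row

# Option 2: monsoon -- extract the whole table through Cloud Storage.
job_config = bigquery.ExtractJobConfig(destination_format="AVRO")
bq.extract_table(
    table, "gs://my-bucket/events-*.avro", job_config=job_config
).result()
```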
Ultimately, when you think about it, your data warehouse storage is typically locked away: your storage sits behind a castle with a huge wall and a moat around it, and you don't have a lot of options. As a quick segue — I was going to use a real image for this slide, but I couldn't get the rights to it, so I went on Twitter last week and asked folks to draw a picture, and this is the winner of the contest. The good news for me is that this happens to be my boss's wife, so hopefully I get a promotion out of it.

But there is one workaround that we do see in the wild: if you have data in BigQuery storage, or in other data warehouses, you could keep a copy of that data in cold data lake storage. It could be GCS, it could be HDFS; you could use Parquet files or ORC files. This lets you leverage non-BigQuery processing systems on top of your storage, but it's a little complex: twice the complexity, twice the cost, twice the operational headache. We feel like we can do better.

So let me pose the question: what if you could treat your data warehouse storage like data-lake storage? What would that look like? Well, I have an answer for you: we launched something called the BigQuery Storage API to solve this exact problem. The BigQuery Storage API essentially allows you to read data from BigQuery storage at low latency — about one second to first byte. It's highly parallelizable, so if you have something distributed reading from the BigQuery Storage API, you will get some really serious throughput; we've measured multiple dozens of gigabits per second. And it's also very flexible, taking advantage of BigQuery's columnar formats. So it's really useful, and it allows you to do non-BigQuery, compute-native processing on top of BigQuery storage. And with that, I'd like to welcome Seth, who is a developer advocacy engineer. He's going to talk a little bit more about this and give us a demo as well.

Thanks, Tino. For those of you coming from other data warehouse technologies: when you're dealing with your data warehouse storage, you're usually dealing with a lot of complexity. You have to deal with file system layouts, how you map storage to nodes, and all sorts of indexing and vacuum strategies.

There's all this management overhead that goes with maintaining your data warehouse. When we built BigQuery, we thought the thing to do was to provide an expert to every user, and we do that by managing your storage on your behalf: you don't have to worry about all of these low-level issues, and you can focus on deriving insights from your data as quickly as possible.

So let's talk about what it actually means to be BigQuery storage. Our primary abstraction is the notion of a table, and when we present a table to you, we focus your attention on the things that matter for your use cases. You define schemas: you tell us what columns are available on your table and their typing information. We have a lot of type functionality, so you can have things as simple as strings, fixed-precision numbers, floating point, bytes, and geography types, and we also support complex data types like structures and arrays. You also provide information that's useful for the consumers of your tables: descriptive metadata about the tables themselves — anything you want to relay to users about what's actually contained in these columns and how the data should be used. There are access control policies, and data lifecycle policies, so you can decide when data should age out of a table: am I setting up for a world where I retain all information indefinitely, or do I want to maintain a sliding 90-day or 365-day window, where data is removed from the table automatically as it ages out, without me thinking about it? The final thing you do is provide hints to us, through the notions of partitioning and clustering, that basically tell us: these are the kinds of access patterns I'm going to use with this data. And we use that information to speed up access for you.
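Here is a minimal sketch of configuring that user-facing table surface — schema and types, a partitioning and clustering hint, and a lifecycle policy — using the google-cloud-bigquery client; the names and the 90-day window are illustrative.

```python
# Creating a table with typed schema, partition-level expiration (the
# sliding window Seth describes), and a clustering hint.
from google.cloud import bigquery

bq = bigquery.Client()

table = bigquery.Table(
    "my-project.app.events",  # hypothetical
    schema=[
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("event_time", "TIMESTAMP"),
        bigquery.SchemaField("geo", "GEOGRAPHY"),
        bigquery.SchemaField(
            "attrs", "RECORD", mode="REPEATED",  # structs and arrays
            fields=[
                bigquery.SchemaField("k", "STRING"),
                bigquery.SchemaField("v", "STRING"),
            ],
        ),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    field="event_time",
    expiration_ms=90 * 24 * 60 * 60 * 1000,  # sliding 90-day window
)
table.clustering_fields = ["user_id"]         # access-pattern hint
table = bq.create_table(table)
```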

When we look at a table, we see a much different view. Everything from that first slide is basically the user metadata, and then we do the tracking for everything else that happens in the system. BigQuery supports what's known as a snapshot isolation model, so at a given point in time we know exactly what data is available in the table and what's not, and we maintain all the information about data arriving and data leaving. We also handle all the replication on your behalf, so you never have to worry about where your data is or whether it's fully replicated. As I said earlier, there's the lifecycle information, so we know at a granular level what storage should be arriving at and leaving the table, and internally we also keep a bunch of other stats for understanding operations that mutate storage.

In addition to that metadata layer, where do we actually put the contents of your table? We've talked about this in some blog posts previously: we use a columnar format for your data, called Capacitor, which gives us very fast ways to access your data for analytic workloads. As opposed to a transactional system, where you're doing a lot of frequent mutations to individual entities, in an analytic data warehouse you're scanning a lot of rows at any given time. As this diagram shows, in addition to grouping data based on transactional operations, we also group it on the higher-level abstraction of the partitioning information. What that effectively means is that whenever we want to read a table, we end up with a set of columnar files that we need to operate on, and our query engine basically opens all of these up simultaneously.

So what we've done is try to figure out a way for you to do very similar things: essentially, we want to present you with the ability to act as a query engine. On the left, you've selected some table or partition and you want to read it into your external application; we deliver that through a set of streams, and the decoding happens on the client side.

For comparison, let's talk about what you could do before. As Tino mentioned, we have the tabledata.list mechanism, which is a paginating cursor method. It's a REST API, which means you're doing individual, discrete HTTP requests and responses for each page of data; you have the whole auth flow to get through; and the data is presented to you in a JSON format, so it's not particularly efficient at scale. By contrast, the Storage API does a couple of things differently to handle this. On the network layer, we use gRPC and a streaming RPC protocol, to ensure that the latency of just moving data back and forth is as low as possible, and we use a lot of flow control to make sure that data is continuously arriving. We use a binary serialization format: those of you who are familiar with formats like Apache Avro will be reading Avro out of these streams. We're also looking at other delivery formats, so feel free to catch me after the talk and tell me your favorite format and why, and we can look at what we want to do there.

We also have this abstract concept of a stream to deliver data, and that's our unit of parallelism. Streams make it easy to build consumer applications based on the notion of your architecture rather than our architecture: we didn't want to ship a bunch of complexity around understanding our low-level storage interfaces, so we built this abstraction for you. When you set up a session, you basically tell us what kind of data you want to read, properties of the table, how you want to access it, and — importantly — the parallelism you want to read it at. Even a single stream is generally more performant than the old method, simply due to the technology changes. As you scale up across multiple nodes, you can establish additional streams and your throughput goes up as well, and you can keep doing this.
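A minimal sketch of establishing a read session, using today's google-cloud-bigquery-storage Python client (the surface shown at this talk was the earlier v1beta1 API, so the names differ slightly); the billing project is hypothetical.

```python
# Open a read session against a public table and read one of its streams.
from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

client = bigquery_storage.BigQueryReadClient()

requested = types.ReadSession(
    table=(
        "projects/bigquery-public-data/datasets/github_repos"
        "/tables/sample_commits"
    ),
    data_format=types.DataFormat.AVRO,
)
session = client.create_read_session(
    parent="projects/my-project",   # hypothetical billing project
    read_session=requested,
    max_stream_count=4,             # requested parallelism; server may cap
)

# Each stream could be handed to a different worker; here we read the first.
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):    # client-side Avro decoding
    pass
```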

The basic bounds you have are: how much bandwidth you have available, whether you have the compute capacity to do the decoding, and the underlying parallelism of the table you're accessing. A very small table may be represented by only a single file, so you won't be able to use more than a single stream to access it. By contrast, as Yuki mentioned, they may have terabytes to petabytes of data, which could be represented by hundreds or thousands of columnar files, so your parallelization factor scales as the amount of data scales.

The other thing to keep in mind is that we don't make you reason about which files are going into which streams; we do a dynamic dispatch, so essentially all you have to worry about is the stream, and we allocate things in a safe way so you don't have to worry about duplication due to the network transport.

However, you're dealing with distributed systems, and distributed systems fail — they just do. So how do you deal with that when accessing your storage through this API? We essentially present three levels of recovery and error control. Each of those streams has an offset mechanism, so if you have a bad read, you can restart at the last point where you had a good read and continue on. If, for example, your receiving node failed and you need to restart the entire stream, that's perfectly legitimate as well. And if you had a much more catastrophic failure, you can restart the entire session, and we will deliver the data the same way we did in the original session. One thing to keep in mind: many applications are inherently single-node — a lot of people have tools that just run on a local machine — and as long as you have sufficient capacity on that machine, it's perfectly okay to fan multiple streams into that one worker. Another thing to keep in mind: there are no ordering guarantees for the elements of a stream, so you can't say that stream one is going to get rows A through whatever.

When we think about the Storage API, we want to give you the same capabilities we give our own query engine, which means projection and filtering. If you only want two or three columns out of a very wide schema, we will deliver you only those columns. That speeds you up because, obviously, we're not sending the rest over the network, and we're able to leverage our columnar format, so it's very fast for us to do this. Similarly, you can establish row filters to say: I would like to throw away certain data on the server side, before it even comes across the network, so that my consuming application doesn't have to look at it.

As I mentioned, we have a snapshot isolation mechanism — this is sometimes called time travel in our system — so we also give you the ability to say: I would like to read this table as of midnight yesterday, or as of three hours ago. Cases where you might want to do this: if you just did a big DML update, changed a bunch of your data, and went "wait, what was the state of that table before I did that?", you can walk back before that event. Similarly, if you're doing simultaneous reads of a bunch of tables and you want a coherent point in time for all of them — however many you're reading — you can set the snapshot time for all of them to be identical.
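A hedged sketch of those three capabilities — projection, a server-side row filter, and a snapshot time — again with the current Python client; the table, fields, and filter are hypothetical.

```python
# Read only two columns, filtered server-side, as of three hours ago.
import datetime

from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types
from google.protobuf import timestamp_pb2

client = bigquery_storage.BigQueryReadClient()

snapshot = timestamp_pb2.Timestamp()
snapshot.FromDatetime(
    datetime.datetime.utcnow() - datetime.timedelta(hours=3)
)

requested = types.ReadSession(
    table="projects/my-project/datasets/app/tables/events",  # hypothetical
    data_format=types.DataFormat.AVRO,
    read_options=types.ReadSession.TableReadOptions(
        selected_fields=["user_id", "event_time"],  # projection
        row_restriction="action = 'click'",         # server-side filter
    ),
    table_modifiers=types.ReadSession.TableModifiers(
        snapshot_time=snapshot,                     # time travel
    ),
)
session = client.create_read_session(
    parent="projects/my-project", read_session=requested, max_stream_count=1
)
for row in client.read_rows(session.streams[0].name).rows(session):
    pass
```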
If I could switch to the demo real quick. One thing to keep in mind is that I'm terrible at visual demos, so what we'll be doing today is comparing the use of the existing API with the new API, using some data from the public datasets program.

In this case, what we're going to be working with is something called the sample_commits table. It's basically a subset of the commit history of GitHub open-source projects, restricted to some subset of the available projects — and we do that because the raw data is 753 gigabytes, and while we could demo against it, it would be a much longer demo session. This one, by contrast, is only about two and a half gigs, a couple hundred thousand rows.

So, I have three nodes here — GCE VMs, absolutely vanilla, on which I've just put a variant of Debian. To set the stage: the top box is the existing cursor method. It's going to move, but it's not going to move particularly fast. If we switch to the Storage API with a single stream, we're doing a little better — we're certainly going to arrive at the end before the first one, so let me stop that and parallelize it some more. In my beautiful bar chart, each color is the data being delivered to a specific stream. With a parallelism factor of four, we're moving a little faster. Let me stop that, make it bigger, and add more streams. Unfortunately, text demos aren't the greatest for visuals, but low-level storage APIs don't inherently make for great visual demos — I'm sorry.

Let's do one more thing. In addition to that, we talked about the fact that you can do column projections. If we say we're only interested in the actual commit hash and the author's email address — the addresses are anonymized, so we lose the identifier but keep the domain — it goes a little faster. And the last thing we can do: this bar chart doesn't look very pretty, because we're filtering almost all of the data out. As you can see, this is the equivalent of a COUNT(*) query on the table, where we say "where the domain suffix is like" that filter — we apparently have about 6,000 matching commits in the GitHub data — and, as you can see, we're still waiting on tabledata.list to make its way to the end of the table. That's all I had for the demo; could we switch back to the slides now? I'd like to turn it back over to Tino, and I'll go sit down over here.

Wow, fantastic, thank you so much, Seth. So yes, I think between hand drawings of castles and text demos we've really raised the bar here, but the point stands: it's a really, really fast API, and as a product manager I like it a lot — hopefully as a customer, or as a user of the BigQuery Storage API, you'll like it as well. It's a beta, it's out now — go start using it. There are still some things we want to do to get it to GA, but it's already out there and it's already highly useful.

So how do you use it? One: if you have Dataproc running, or Hadoop or Spark, we have released, and are continuing to release, a number of connectors — for example Hive, MapReduce, and Spark SQL. Or you can simply create a pandas DataFrame that is powered by BigQuery storage. This is really powerful, because for one, you no longer have to follow the patterns I described before, and two, you don't need to sample your data and reduce it inside BigQuery before loading it into pandas: you simply have a DataFrame that runs on top of BigQuery storage.
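A minimal sketch of that pandas path, assuming the google-cloud-bigquery and google-cloud-bigquery-storage clients (plus pandas and pyarrow installed); the query is illustrative.

```python
# to_dataframe() can pull the result set through the Storage API instead
# of the paginated REST API, so no pre-sampling is needed.
from google.cloud import bigquery
from google.cloud import bigquery_storage

bq = bigquery.Client()
bqstorage = bigquery_storage.BigQueryReadClient()

df = (
    bq.query(
        "SELECT * FROM `bigquery-public-data.github_repos.sample_commits`"
    )
    .result()
    .to_dataframe(bqstorage_client=bqstorage)  # bulk read via Storage API
)
print(df.shape)
```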
The next one: if you're using Dataflow, which is an incredibly powerful unified batch and stream processing engine, you can read data out of BigQuery storage, do some enrichment from third-party sources — maybe Pub/Sub — and put it back into BigQuery storage using our real-time streaming API. So now you have a real-time ETL processing system. The Dataflow connector is out there; you can start using it today.
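A hedged sketch of that loop with the Apache Beam Python SDK, whose BigQuery connector gained a DIRECT_READ mode (backed by the Storage API) in SDK versions released after this talk; the tables and the enrichment step are made up.

```python
# Read from BigQuery via the Storage API, transform, and write back with
# streaming inserts -- a toy version of the real-time ETL loop.
import apache_beam as beam
from apache_beam.io.gcp.bigquery import ReadFromBigQuery, WriteToBigQuery

with beam.Pipeline() as p:
    (
        p
        | ReadFromBigQuery(
            table="my-project:app.events",                    # hypothetical
            method=ReadFromBigQuery.Method.DIRECT_READ,       # Storage API
        )
        | beam.Map(lambda row: {**row, "enriched": True})     # stand-in step
        | WriteToBigQuery(
            "my-project:app.events_enriched",                 # hypothetical
            method=WriteToBigQuery.Method.STREAMING_INSERTS,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```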

Finally, ODBC and JDBC: there are some good examples of tools that could potentially benefit from this. We've already updated the JDBC driver, and ODBC is coming very shortly, so you get a quantum leap in performance if you're using any third-party tool that relies on these drivers, or if you're programmatically connecting to BigQuery using ODBC or JDBC. And as Seth pointed out, it's an API, so you can start using it directly — just go and start building code on top of it — and we'll have client libraries and all the good stuff that you're used to relying on. We'll continue to expand this ecosystem, so if you have requirements, or interest in other things that could work on top of the BigQuery Storage API, please let us know; we're really excited to build all of that.

So I've given you a whirlwind tour of the different patterns that are possible here, and you saw some of them live with PLAID. But this is just a sampling — it's not everything we can give you. I'm really excited to tell you that this one pattern — processing your data where it sits, in BigQuery storage — is, we feel, very unique and very powerful, and we're looking forward to hearing how our customers use it. We already have a couple of users of this API — Mixpanel and other analytics companies that are similar to PLAID, marketing analytics providers — and they're using it at scale. It has opened up a new way of doing things: it's much simpler than the old ways, and it allows a cleaner architecture in general.
