What's new with the AI Lakehouse

I'm Irina Farooq, one of the product leaders for our smart analytics portfolio, and joining me today is Steve Jarrett, SVP of Data and AI at Orange. Steve has been a product and technology innovator across many storied Silicon Valley companies, including Digital Magic, Apple, and Facebook, as well as founder and CEO of a technology startup. I'm really excited for Steve to share the journey his team is on to transform Orange with data and AI on a global scale.

But before we get into the deeper detail: as you saw in this morning's keynote, generative AI represents a generational opportunity to rethink how we interact with the world, whether it's new customer experiences, employee productivity, or new category-defining products that none of us thought were even possible a few months ago. Just as many of us can't imagine a world without the internet, I'd guess that in a very short period of time we won't remember what the world was like without generative AI.

Every enterprise in the world is making plans for how to apply generative AI to its data. Many are already well on the way, but for many others it remains a tremendous challenge, because they don't have the right data foundation. Instead, their data is distributed in silos, where every single innovation project becomes a complex data engineering project: moving data around, duplicating data, and trying to connect different systems with different metadata and technologies just to make the data available to users in their tools of choice.

Google has reinvented the way customers use data and information, and our Google Cloud data and AI cloud aims to do the same for your data and your AI, with the same simplicity, security, intelligence, and scale you're used to across our other products. It starts with getting you on the right data foundation: unifying all your data irrespective of where it is, giving you the flexibility to use the best tools in the ecosystem, and infusing AI across all personas for end-to-end workflows and simplification.

So, has anybody here heard of a lakehouse? I'm just kidding. Lakehouse is a foundational industry term describing the convergence of data lakes and data warehouses. That convergence underpins the next generation of innovation around data and AI; it's less about the underlying technology and more about the productivity gains and innovation you can achieve by moving to that unified architecture.

Here at Google Cloud, we strive to give you the best platform to build your AI lakehouse, with four key components. First, unifying all your data: structured and unstructured, batch and streaming, here on Google Cloud or beyond its boundaries in a cross-cloud fashion. Second, openness and flexibility: store your data in any open format and use the best of Google's tools as well as tools from the open-source ecosystem to help you innovate as you build your platform. Third, governance, and especially data-to-AI governance, which plays a critical role in making sure you can innovate in a fast, secure, and responsible fashion. And last but definitely not least, activating your data with AI, whether that's making your data seamlessly accessible to your generative AI applications or making your end users more productive with Duet AI. This platform is allowing customers like Dun & Bradstreet to unify data lakes and data warehouses, rapidly bring a new generation of self-service, data-driven capabilities, and improve the performance of their analytics by 5x.
Let's take a look at some of the exciting new things we're announcing here at Next. One of the first foundational components of our lakehouse architecture is BigLake, a unified storage layer for all of your structured and unstructured data. Today we're announcing general availability of support for Iceberg, Hudi, and Delta in BigLake, with fine-grained security and performance acceleration. We are also announcing managed tables for Apache Iceberg, giving you high-throughput streaming ingestion, storage optimization, and DML, all while maintaining full Iceberg reader compatibility.

As you can see, BigLake is fairly unique in the industry in that it gives you full flexibility to store your data in the open formats you choose, while delivering security and performance at Google scale. What we've done is take the learnings from running planet-scale systems with BigQuery and bring those learnings around security, governance, and data management to open formats, so we remove the unglamorous tasks of data management and engineering and let you focus simply on using your data.

Another capability I'm particularly excited about is the general availability of object tables for all of your unstructured data, like images, files, and audio. Now you can seamlessly combine structured and unstructured data in your analysis, prepare your training corpus for AI applications, and integrate with Document AI for an end-to-end data-to-AI workflow. Many of our customers already see BigLake as transformational to their applications. Take, for example, Deutsche Bank, who see BigLake as transformational for their risk system, one of the largest in the world, and as one of the core technologies that will define the impact cloud can have on their applications.
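To make the managed Iceberg announcement concrete, here is a minimal sketch of creating a BigLake managed table for Apache Iceberg from Python. The project, dataset, connection, and bucket names are hypothetical placeholders, and the DDL options follow the BigLake Iceberg table syntax as we understand it at the time of this session; treat it as an illustration rather than a definitive recipe.

```python
# A minimal sketch: a BigQuery-managed table stored in open Iceberg format.
# All resource names below are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

ddl = """
CREATE TABLE my_dataset.events_iceberg (
  event_id STRING,
  event_ts TIMESTAMP,
  payload  STRING
)
WITH CONNECTION `us.my-biglake-connection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://my-lakehouse-bucket/iceberg/events'
);
"""
client.query(ddl).result()  # run the DDL and wait for completion

# DML works like any other BigQuery table, while the files stay in
# open Iceberg format that external engines can continue to read.
client.query("""
INSERT INTO my_dataset.events_iceberg (event_id, event_ts, payload)
VALUES ('evt-001', CURRENT_TIMESTAMP(), '{"source": "demo"}');
""").result()
```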
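Similarly, here is a hedged sketch of an object table over unstructured files in Cloud Storage, with hypothetical bucket, dataset, and connection names. The object table exposes one metadata row per file, queryable with ordinary SQL and joinable against structured tables.

```python
# A minimal sketch of an object table for unstructured data.
# Bucket, dataset, and connection names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

client.query("""
CREATE EXTERNAL TABLE my_dataset.product_images
WITH CONNECTION `us.my-biglake-connection`
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://my-lakehouse-bucket/images/*']
);
""").result()

# Each row describes one object: its uri, size, content_type, and so on.
rows = client.query("""
SELECT uri, size, content_type
FROM my_dataset.product_images
WHERE content_type = 'image/jpeg'
ORDER BY size DESC
LIMIT 10;
""").result()
for row in rows:
    print(row.uri, row.size)
```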
The other set of capabilities we're announcing is around streamlining your data-to-AI workflow. Today, we're making LLMs, and specifically the Vertex AI foundation models, available directly from BigQuery. While calling an LLM from BigQuery may not sound like a big deal, doing it in a fully secure, scalable, and compliant fashion, where your data remains within the boundaries of your security perimeter, is really important.

We're also announcing the general availability of the BigQuery inference engine, so you can now run large-scale inference with Google's pre-trained models, or with any of your own models in TensorFlow, ONNX, or XGBoost formats, all with familiar tools like SQL.

Moving on to embeddings, one of the most powerful concepts in machine learning: we are now bringing that capability to BigQuery in a scalable and secure fashion. Now you can build next-generation applications like semantic search and recommendation systems, as well as do LLM fine-tuning seamlessly and at scale.

And last but definitely not least in this set of capabilities is BigQuery DataFrames, for all the Python developers out there: a Python API that's scikit-learn- and pandas-compatible, at BigQuery scale, petabyte scale, all from the same environment.
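Here is a hedged sketch of what calling a Vertex AI foundation model and generating embeddings from BigQuery can look like. The connection, dataset, and table names are assumptions, and the endpoint names reflect the PaLM-era models available around the time of this session; the exact output columns of the ML functions may differ by version.

```python
# A hedged sketch: a remote model over a Vertex AI endpoint, then
# in-place inference with SQL. All names are illustrative placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# One-time setup: a remote model pointing at a Vertex AI text endpoint.
client.query("""
CREATE OR REPLACE MODEL my_dataset.text_llm
REMOTE WITH CONNECTION `us.my-vertex-connection`
OPTIONS (ENDPOINT = 'text-bison');
""").result()

# Prompts are built from table data with SQL; the data never leaves
# BigQuery's security perimeter.
client.query("""
SELECT ml_generate_text_result
FROM ML.GENERATE_TEXT(
  MODEL my_dataset.text_llm,
  (SELECT CONCAT('Summarize this support ticket: ', ticket_text) AS prompt
   FROM my_dataset.support_tickets
   LIMIT 5),
  STRUCT(256 AS max_output_tokens, 0.2 AS temperature)
);
""").result()

# Embeddings work the same way: a remote model over an embedding endpoint,
# producing one vector per row for semantic search or clustering.
client.query("""
CREATE OR REPLACE MODEL my_dataset.embedding_model
REMOTE WITH CONNECTION `us.my-vertex-connection`
OPTIONS (ENDPOINT = 'textembedding-gecko');
""").result()

client.query("""
SELECT content, text_embedding
FROM ML.GENERATE_TEXT_EMBEDDING(
  MODEL my_dataset.embedding_model,
  (SELECT ticket_text AS content FROM my_dataset.support_tickets),
  STRUCT(TRUE AS flatten_json_output)
);
""").result()
```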
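For the inference engine, here is a minimal sketch of importing your own model and running batch inference with SQL; the model path and table columns are hypothetical, and TensorFlow and XGBoost models follow the same pattern via `model_type` and `model_path`.

```python
# A hedged sketch of the BigQuery inference engine with an imported model.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Register a model trained elsewhere (here, a hypothetical ONNX artifact).
client.query("""
CREATE OR REPLACE MODEL my_dataset.churn_onnx
OPTIONS (
  model_type = 'ONNX',
  model_path = 'gs://my-lakehouse-bucket/models/churn/*'
);
""").result()

# Large-scale batch inference with plain SQL.
client.query("""
SELECT *
FROM ML.PREDICT(
  MODEL my_dataset.churn_onnx,
  (SELECT customer_id, tenure_months, monthly_spend
   FROM my_dataset.customers)
);
""").result()
```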
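And for BigQuery DataFrames, a minimal sketch of the `bigframes` package, assuming a hypothetical table: pandas-style code whose computation is compiled to SQL and pushed down to BigQuery rather than pulled into local memory (the companion `bigframes.ml` module provides the scikit-learn-style API mentioned above).

```python
# A minimal sketch of BigQuery DataFrames. Table name is a placeholder.
import bigframes.pandas as bpd

# Read a table; work is pushed down to BigQuery, not loaded locally.
df = bpd.read_gbq("my-project.my_dataset.page_views")

# Familiar pandas idioms, executed at BigQuery (petabyte) scale.
avg_by_country = (
    df[df["device"] == "mobile"]
    .groupby("country")["session_seconds"]
    .mean()
)
print(avg_by_country.head(10))
```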
As you unify your data and streamline your data-to-AI workflow, we're also helping your users collaborate more seamlessly in BigQuery Studio, a unified workspace where your data engineers, data analysts, and data scientists can work in their language or framework of choice, whether that's SQL, Python, JavaScript, or even natural language, all integrated with your SQL pipelines, and with data quality and lineage embedded to give you trust and intelligence about your data.

And lastly, as you saw on the keynote stage, we're bringing Duet AI to all the data analytics products: BigQuery, Looker, and Dataplex, with Spark, Beam, and Airflow coming soon. What this allows you to do is make your users much more productive in the context of their existing workflows. Take BigQuery, for example: I now get powerful capabilities for SQL generation and autocompletion, as well as an AI-powered chat assistant to help me understand the system's capabilities and learn the best way to use it.

As you leverage all this intelligence across your data suite, one of the most foundational things you can do is make sure you're operating on the right data-to-AI governance foundation, so your users can innovate with the right data in their tools of choice without compromising your policies and governance guardrails. With Dataplex, we have been leading the market in centralized management and governance; in fact, over 70% of our top analytics customers are using Dataplex to simplify the management and governance of their integrated platform. It provides unified metadata across your distributed data, centralized security and policy management, and data intelligence, with automatic lifecycle management, data lineage, and data quality.

Today we're excited to announce that we're extending Dataplex to Vertex AI models and datasets. Now you have a single, integrated, always-on catalog with all the metadata for your entire platform, from data to AI, which serves as a foundation for provenance, for governance, and for streamlined workflows for all of your users across their tools.

But governance doesn't stop at the boundaries of Google Cloud. We've integrated Dataplex with BigQuery Omni and BigLake for multicloud governance: in the regions where you have BigQuery Omni, you now get a centralized catalog, a centralized ability to apply policies, and end-to-end lineage as your data moves across its lifecycle.

We're also excited to announce the general availability of Dataplex data quality and profiling. We've brought a lot of intelligence and automation to solve the age-old problem where you're sitting around the conference table, somebody says "I don't trust that data point," and the whole meeting goes off the rails. It's one thing when it's a bunch of humans sitting in a meeting; it's another when machines have to make decisions, and you need to ensure they're operating on high-quality data. With Dataplex data profiling and quality, we automatically profile your data, automatically generate recommended rules, execute them, and surface the results to users in their tools of choice. So let's say I'm a data engineer working in BigQuery Studio, trying to perform some analysis or build a pipeline: in context, I can see the quality of that data and the information about it, which helps me gain trust and intelligence about that data.

Another capability we're announcing today is the extension of our data lineage capabilities. Lineage and provenance are critically important for building trust in your data, and they serve as a foundation for compliance and for impact and root-cause analysis. We've had lineage capabilities in BigQuery and in our pipelining tools like Composer for a while, and now we're extending lineage to Apache Spark, so you can see end-to-end lineage across your entire data lifecycle.

But we also know that your ecosystem doesn't stop with Google Cloud products, so today we're announcing support for the OpenLineage API. You can integrate Dataplex lineage with any open-source system, or any system that supports OpenLineage, and see it within the same intelligent graph in your current and existing workflows.
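To illustrate the data quality announcement, here is a hedged sketch of defining a Dataplex data-quality scan in code. The field names follow the `google-cloud-dataplex` Python client as best we understand it, and the project, location, table, and rule values are all placeholders.

```python
# A hedged sketch of a Dataplex data-quality scan over a BigQuery table.
from google.cloud import dataplex_v1

client = dataplex_v1.DataScanServiceClient()

scan = dataplex_v1.DataScan(
    data=dataplex_v1.DataSource(
        resource="//bigquery.googleapis.com/projects/my-project/"
                 "datasets/my_dataset/tables/customers"
    ),
    data_quality_spec=dataplex_v1.DataQualitySpec(
        rules=[
            # Completeness: customer_id must never be NULL.
            dataplex_v1.DataQualityRule(
                column="customer_id",
                dimension="COMPLETENESS",
                non_null_expectation=dataplex_v1.DataQualityRule.NonNullExpectation(),
            ),
            # Validity: monthly_spend must fall in a plausible range.
            dataplex_v1.DataQualityRule(
                column="monthly_spend",
                dimension="VALIDITY",
                range_expectation=dataplex_v1.DataQualityRule.RangeExpectation(
                    min_value="0", max_value="100000"
                ),
            ),
        ]
    ),
)

operation = client.create_data_scan(
    parent="projects/my-project/locations/us-central1",
    data_scan=scan,
    data_scan_id="customers-dq",
)
print(operation.result().name)  # results then surface in users' tools
```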
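And for the OpenLineage support, a hedged sketch of emitting a lineage run event from a custom pipeline so it can appear in the same lineage graph. It uses the open-source `openlineage-python` client; the endpoint URL, namespaces, and job and dataset names are invented for illustration.

```python
# A hedged sketch of emitting an OpenLineage run event from a custom system.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://my-lineage-endpoint:5000")

event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="on-prem-etl", name="nightly_telemetry_load"),
    producer="https://example.com/my-custom-etl",
    inputs=[Dataset(namespace="gs://my-bucket", name="raw/telemetry")],
    outputs=[Dataset(namespace="bigquery", name="my_dataset.telemetry_daily")],
)
client.emit(event)  # the event joins the shared lineage graph
```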
But all of these capabilities are just the beginning and the foundation. What we're focused on, especially with Duet AI, is solving the age-old cold-start problem: how do I even know which data I should be looking for, or what questions I can ask of my data? This is why I'm excited to announce AI capabilities in Dataplex that will automatically surface the questions you can ask of your data, based on our understanding of your metadata as well as usage patterns from popular queries that other users are running in your environment. We'll then use a continuous feedback loop, where your users tell us how useful these suggestions were as a starting point for their analysis. We're really hoping many of you will give us feedback on that capability, to help drive the next level of innovation in metadata-powered insights and in democratizing your data.

I've talked a lot about the cool new capabilities in our native products, BigQuery and Dataplex, and you've heard about Vertex AI, but we also have a longstanding commitment to open source, and on the analytics side that commitment has been tremendous, specifically to Apache Spark. We are committed to Google Cloud being the best place to run open-source Spark, with industry-leading price-performance. Over the past year we've innovated tremendously to improve the price-performance of open-source Spark, with many customers seeing 50 to 70% improvement within their existing environments.

We've also innovated on the customer experience. Serverless Spark gives you a true serverless experience for Apache Spark across all workloads. Today, data engineers and developers spend only about 40% of their time writing code, and 60% managing infrastructure and all of the glue that comes with it. We've dramatically simplified that with Serverless Spark, and today I'm excited to announce that we're extending it for data science with serverless Spark sessions, providing cost-effective and rapid development for data science workloads across Vertex AI and your own Jupyter notebooks, with partner integrations coming soon with Hex and Deepnote. As you heard on the keynote stage earlier this morning, we're also extending the Serverless Spark service to integrate with NVIDIA, for both performance acceleration and cost savings.

Everything we've talked about so far has been focused on analytics in the cloud, but we also know there's a set of leading-edge use cases that span the edge and the cloud, in retail, manufacturing, or financial services, where the amount of data you're processing at the edge is sometimes so large that you have to make a decision before you can make a round trip to the cloud, or where data sovereignty requirements mean you cannot send all of the data before you pre-process and anonymize it. This is where Google Distributed Cloud comes in. Think of Google Distributed Cloud as a true extension of Google Cloud in your own data center, with a single unified control plane and management plane, letting you run the distributed applications across data and AI that make sense for your business.

And today I'm incredibly excited to announce that we've extended our Spark offering to Google Distributed Cloud. You can now run fully managed Apache Spark on Google Distributed Cloud: you can pre-process and anonymize data, aggregate data, or truly run distributed workloads across data and AI in that environment.
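To ground the serverless Spark discussion above, here is a hedged sketch of submitting a PySpark workload to the serverless Spark service (Dataproc Serverless batches) instead of managing a cluster yourself. The project, region, bucket, and job file are hypothetical placeholders.

```python
# A hedged sketch: run PySpark serverlessly; infrastructure is provisioned
# per run. All resource names are placeholders.
from google.cloud import dataproc_v1

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-lakehouse-bucket/jobs/clean_telemetry.py",
        args=["--date=2023-08-29"],
    )
)

operation = client.create_batch(
    parent="projects/my-project/locations/us-central1",
    batch=batch,
    batch_id="clean-telemetry-demo",
)
print(operation.result().state)  # wait for the serverless run to finish
```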
With this, I am delighted to welcome Steve to the stage, who has been a key launch partner for a lot of these capabilities, to share the journey they've been on at Orange. Thank you, Steve.

Thanks, Irina, and thanks so much for the great partnership between our engineering teams. Orange is one of the world's leading telecom providers. We offer services in 26 countries, which you see here on the map, ranging from our home country, France, to countries like Egypt, Morocco, and the Democratic Republic of Congo, so the breadth of our challenges is really extreme, and I'll talk more about how we're working with Google to solve some of them. As a company, we're known for a lot of our cutting-edge contributions to telco; in fact, a lot of the power-saving modes in 5G are there because Orange has cared about those kinds of issues for many, many years, and we're contributing to future versions of these fundamental technologies, like 6G. But we also have a longstanding, fundamental commitment to pure research; in fact, in my own team we have 50 people in pure research working on data and AI topics.

The way we like to describe AI at Orange is that it gives superpowers to our employees, to our networks, and to our customer interactions, and we already have many high-impact use cases in those domains, mostly where the business has a lot of capex and opex today and where data and AI can bring a lot of efficiencies and improvements.

For example, in reinventing the customer experience, we use AI today to dramatically improve the offers we provide to people and how relevant those offers are. We're also working to improve our call centers by using generative AI to live-transcribe the call with the consumer; you can imagine, across our 26 countries, how many different languages we're faced with on a daily basis. The idea is that we can live-transcribe that call with AI, then use the AI to prompt the call center agent with suggestions on how to resolve the customer's problem. We can then use AI to summarize the call for storage, and we can auto-generate a follow-up email, which is very useful: it saves the agent time creating an email that covers the topics discussed on the call, along with suggested steps to solve the customer's problem long term.

In operating efficiency, we're using AI, specifically large language models, as an interface to a lot of the really complex data we have in the company today, and in fact we find that pairing a large language model with a knowledge base is very, very powerful.

And lastly, in our networks across all 26 countries, we're generating over a petabyte a day of just the telemetry data coming off the equipment, so being able to work with that data is a really significant problem. The idea is that we use AI to do not only predictive network maintenance but also, longer term, to identify the root cause of problems. You can imagine how many cell sites we have in the field, so if we can use AI to identify what the problem is before the truck rolls to the cell site, that can help send the right technician with the right equipment and save a lot of time.
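The "large language model paired with a knowledge base" pattern Steve describes is commonly called retrieval-augmented generation. Here is an illustrative, entirely hypothetical sketch of the idea: the `embed()`, `search()`, and `generate()` helpers stand in for whatever embedding model, vector index, and LLM a team actually uses, and are not part of any product mentioned in this session.

```python
# An illustrative sketch of pairing an LLM with a knowledge base (RAG).
# All helper functions are hypothetical placeholders.
from typing import List

def embed(text: str) -> List[float]:
    """Placeholder: call an embedding model (e.g. via Vertex AI)."""
    raise NotImplementedError

def search(query_vector: List[float], k: int = 5) -> List[str]:
    """Placeholder: nearest-neighbor lookup in a document index."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call a large language model."""
    raise NotImplementedError

def answer_with_knowledge_base(question: str) -> str:
    # 1. Retrieve grounding passages relevant to the question.
    passages = search(embed(question))
    context = "\n".join(passages)
    # 2. Ask the model to answer using only the retrieved context,
    #    keeping responses tied to the operator's own documentation.
    prompt = (
        "Answer the agent's question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```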
So we've been on this journey of trying to improve the way we work with data and AI across all of our businesses, and we started with very diverse requirements. In fact, inside each country the operating business units would often create their own data stacks, so we would have a data silo from the marketing team, the finance team, and the networks team that reflected those organizational silos. The complexity was extremely high, each of those ecosystems had its own processes, data model, and so on, and the result was that, as Irina was saying, the teams were spending all of their time just managing the data infrastructure, as opposed to focusing on generating value for the business.

We then attempted to build and integrate our own infrastructure, which was very Kubernetes-centric, and we found that it's not just another IT workload: there's lots of complexity around identity and access management, compliance, and security. We learned very quickly that we needed a really strong partner, and so we chose Google. We also found that, because of how fast the AI infrastructure business was moving, working with fairly monolithic vendors was a big problem; we needed a partner that would allow us to use much more open technologies and move much faster.

In terms of our ambition, we're working toward a much more modular, plug-and-play architecture, where we can leverage open-source AI models and open data technologies, but also managed services like the Spark offering Irina mentioned, which is very powerful for us. The fact that we can now run this on GDC Edge gives us an infrastructure on premises where we can manage the data in a way that saves us a lot of cost, because the data we're generating is so large that we need a really sophisticated way to filter it before we send it to the public cloud. And because the GDC Edge infrastructure is uniform, we can have uniform management of these systems between GCP in the public cloud and what we have on-prem, which is very powerful. Our regulatory requirements also change from country to country and over time, so having a flexible infrastructure to respond to them is extremely useful for us.

To give you a peek at what this architecture looks like: on the left-hand side you have all of our on-premises systems that relate to these different business units and their data sources, and in the middle we have the bridge of GDC Edge between our on-premises systems and the public cloud. There you see our ability to run not only things like the managed Spark instance but also AI models; we can do inference of AI models there and manage it all with Airflow. The data comes from our on-prem systems into Google Cloud, where, in a very secure zone, we build and offer data products. The data is then consumed by the different business units in the sharing layer, and when they generate data they think is useful to share with other teams, it's written back to the secure zone. At the bottom you see the data operation center: we work with Collibra to provide business metadata across all of our systems, not just the ones residing on Google, as well as a marketplace.

For us, this vision is really fundamental to our strategy. The idea that we have a marketplace where you can discover and consume data means we can dramatically open up access to data across the business, and we have data protection dashboards and alerts provided by Dataplex, including much-improved data observability, plus federated governance that helps us manage the data whether it's on-prem or in Google Cloud.

So we're on this journey toward a vision of a data democracy, and for us a data democracy is not an anarchy of data: it's a system where we have rules that are enforced using policy as code, and within that environment the user has a lot of freedom to get the job done with data. The first point is the idea I mentioned before about breaking down the silos that map our organizational silos, so we can have availability of data across these different business teams and units. The second, as I mentioned, is elegant management between what's on-prem and what's in the public cloud. The existing infrastructure we inherited in many of these countries is very, very inflexible, especially between business units, so having uniform management infrastructure spanning on-prem and public cloud is really powerful.

This dynamism also allows us to respond to things that are unexpected. For example, say we're training a new AI model and we don't know which data is predictive: the idea that we can send a large amount of that data to the public cloud for a relatively short period of time for model training, then deploy the models back down in-country to run inference on the GDC Edge infrastructure, is really, really useful. Or say we encounter an unexpected failure in the network: given that across the company we're generating over a petabyte a day of just telemetry data, in a country where we have a problem we can send a vast amount of that data to the public cloud for training, when normally a lot of it would be aggregated or even deleted over time.

The only way to do this at massive scale is to use policy as code. The fact that we can use a role-based approach, where we can ensure that users who have access to certain data are doing the right thing, and that we have the ability to observe their operations, really helps with compliance and dramatically reduces our security and privacy risks.

One thing about GDC Edge that I don't think has been mentioned yet at the show is that one of the concerns from a lot of our regulators was what data was passing between the public cloud and the appliance on premises. Google provided us with an open-source proxy, called the boundary proxy, that we can run on-prem to inspect everything running over the management plane. If we see something unexpected, we can sever the connection between the appliance and Google Cloud, and because the control plane of GDC Edge runs locally, the appliance can continue to operate for as long as we need to correct the problem and then re-establish the connection.
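Since Steve credits "policy as code" as the only way to govern access at this scale, here is a deliberately simplified, entirely hypothetical sketch of the idea: access rules are data that a program enforces and logs, rather than settings clicked in a console. The roles, datasets, and purposes below are invented for illustration and do not reflect Orange's actual policies.

```python
# An illustrative sketch of policy as code. All values are invented.
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    role: str          # e.g. "network-analyst"
    dataset: str       # e.g. "telemetry_raw"
    purpose: str       # e.g. "incident-investigation"
    max_days: int      # how long access may last

POLICIES = [
    Policy("network-analyst", "telemetry_raw", "incident-investigation", 30),
    Policy("data-scientist", "telemetry_aggregated", "model-training", 90),
]

def is_allowed(role: str, dataset: str, purpose: str, days: int) -> bool:
    """Every access decision is reproducible, reviewable, and loggable."""
    return any(
        p.role == role and p.dataset == dataset
        and p.purpose == purpose and days <= p.max_days
        for p in POLICIES
    )

assert is_allowed("network-analyst", "telemetry_raw",
                  "incident-investigation", days=14)
assert not is_allowed("data-scientist", "telemetry_raw",
                      "model-training", days=14)
```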
So the fact that we can have this uniform architecture, an environment where we can run AI model inference but also our own first-party or open-source data technologies, and where we can leverage the managed services from Google, is really powerful, and we expect to see more and more of those services over time. In fact, we also believe this will become a really thriving ecosystem, where we'll have many more services, coming from people like you in the audience, that we can pick and choose from. The idea that our on-premises data and AI infrastructure becomes this really flexible, relatively easy-to-manage environment is a breakthrough for us in terms of elegance and simplicity, and it also dramatically reduces the complexity of the training we need to provide to our teams. One of our challenges is that we have very, very good data engineering resources in all of these countries, but with antiquated architectures they were spending all of their time just keeping the lights on, as opposed to adding value. And lastly, the fact that we can use this architecture to respond rapidly to changing requirements from our regulators across all of our countries was really fundamental for us. So Irina, thanks again for helping us on our journey toward a data democracy, and for giving superpowers to everybody at Orange.

Well, thank you, Steve, and thank you for the partnership. I'll go with a few unscripted questions, maybe. It's so impressive what you and your team have been able to accomplish in such a short period of time, and we all know what these transformations are like. I would love to take credit for all the products we've built, but I know a lot of it is about people and processes internally, especially in such a large, distributed organization driving change at this rapid pace. I think all of us in the audience would love to hear: what have been the true keys to driving that change, beyond the technology we have here?

That's a great question. We did a few things. The first is we chose a training partner: we chose Coursera, we use their tools widely across the whole company, and we make them available to everybody on an unlimited basis; that was really fundamental. The second is that we're very, very transparent in the way we work with one another. For example, all of the cloud foundation work we did for GCP, we share, and we're very transparent about our metrics: for each of the projects I mentioned, we have really clear KPIs that we've defined across the companies, and we share our successes and failures openly on a very regular basis. That allows the teams to know that one team, for example our colleagues from Poland here in the front row, has done a particularly good job in a number of domains, and the fact that we're so transparent about that allows the different leaders across all 26 countries to reach out for help. So I think it's really a balance of being measurable and transparent at the same time, and that's enabled us to move a lot faster than we were in the past.
