How to Build an ML Platform | HPE Data Science Community Virtual Session #1



Perfect timing. Hi everybody, my name is Matt Armstrong-Barnes, I'm a Chief Technologist at HPE specializing in artificial intelligence. What we wanted to do today in this webinar is cover something we've been doing as a collection of meetups recently. This one is a follow-up from our face-to-face meetup, where we all got to play table tennis (ping-pong, for anyone around the world), which was really about designing a machine learning infrastructure platform. Today is a continuation of that: we want to drive into it in a bit more depth, and give anyone who couldn't join the face-to-face session an opportunity to listen to Jordan cover that material again.

So, from a meetup perspective, why are we doing these things? Artificial intelligence is an emerging field that is still massively accelerating in terms of the platforms, the technology, and how we can deploy it in a meaningful way. What we want is an opportunity to get together and learn from each other, to build a community of like-minded individuals in the data science space, people who are working on building out algorithm-based deployments, so that we can share knowledge and build out our community together. This is an event hosted by Hewlett Packard Enterprise, but it's for us as a community to get together, so we would absolutely love to hear from members of the community: what do you want to hear about, who do you want to hear talking, what subject matter do you want to cover? Are there things you're working on that you want to come forward and present? Please, please, please give us feedback. If it's working, great; if we need to change it, just let us know, and we'll feed that back in and continue to build out the sessions going forward.

We do have some ideas we want to cover over the next couple of months. We've done our first face-to-face, we're planning another one in January, and then probably another webinar after that as well, on deployment: once you've built these systems, how do you deploy them? For that we're going to bring TitanML back, and they'll be running that session, probably in early-to-mid January. Then later in January or early February we're going to do pipelines: how do we understand and make effective use of pipelining from a machine learning perspective? And we've got ideas for a fourth session, which is really about how you do the modelling from an ML perspective.

With that in mind, we have the Q&A function. I'm going to drop off video in a second, but I will still be here monitoring the Q&A, and we'll save some time at the end so I can pose any questions asked in the Q&A back to our key speaker today, who is going to do a fantastic job of covering this complex subject matter. I'll try to think of some interesting questions as well, combining them with those from the Q&A. So now I'm going to hand it over to the man, the myth, the machine learning legend that is Jordan Nanos.

Okay, thank you Matt. Good morning everybody, I'm here in beautiful Vancouver, British Columbia, and I appreciate all those in the European time zone staying on a little bit late today.
The subject for today, like Matt said, is building a machine learning platform. I'll go through a little of the history of what HPE has been doing in this space and how we got here, and then we'll talk about exactly that concept: what is a machine learning platform, and how can you go about building one today?

First of all, we are all here from HPE, Hewlett Packard Enterprise, and I wanted to introduce a little of what HPE is in terms of data and AI today. Not a lot of people know that we've acquired five AI software companies in the last six years. Those include BlueData, MapR, and Ampool, which have joined our Ezmeral organization, with our flagship product Ezmeral Unified Analytics and the Ezmeral Data Fabric that powers it from a data storage and data processing perspective. I'm also going to spend some time talking about pure-play software in the high performance computing and AI business unit: Determined AI, whose enterprise version is called the Machine Learning Development Environment, and Pachyderm, whose enterprise version is the Machine Learning Data Management software. Hopefully those joining from the Determined community are plenty familiar with at least the Determined software.

HPE's journey in this space, and in a lot of ways my own journey learning about AI in the enterprise and working with our customers, really started in 2016 when we acquired SGI. That brought in a lot of people with experience working with hardware at the scale at which a lot of AI projects get started. They introduced professional services in a meaningful way to HPE, along with a lot of knowledge around GPU servers, cluster management for HPC, technologies like Slurm and PBS, InfiniBand fabric management, things like that. That was solidified further in January of 2020 when we acquired Cray, an organization that brought us into the supercomputing space and, like I said, solidified us as the number one supercomputing company in the world. So we would say that AI is just another high performance computing workload, meaning that basically a single box is not good enough, and all sorts of organizations are starting to evaluate the use of AI in the enterprise today.

Really because of this, and I think a lot of people have seen similar slides: around this time last year ChatGPT was released, quickly on the path to 100 million users, and we started to hear about it from all sorts of organizations that had never really been AI or deep learning customers in the past. What used to be restricted to some of the more advanced industries, and the most advanced organizations within those industries, really got democratized to all sorts of different industries. We started to get requests from organizations with private and confidential data that they wanted to add to a model like ChatGPT, to get a similar experience where they could chat with their confidential data. That continues to be the killer application for the most advanced neural network architecture today, the Transformer model. You can see on this slide (thanks to the original author at the source GitHub below) that we're seeing the encoder models in the embedding space, the BERT-based models, take off in usage.
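As a quick aside on what that encoder side looks like in code, here is a minimal sketch of producing embeddings with a BERT-style model using the Hugging Face transformers library. The model name and the mean-pooling step are my own illustrative choices, not anything the talk prescribes:

```python
# A minimal sketch of the "encoder" side: turning text into embeddings
# with a BERT-style model. Model name and pooling choice are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

texts = ["private enterprise document", "chat with your confidential data"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**batch)

# Mean-pool token vectors into one embedding per sentence (one common choice).
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embeddings.shape)  # torch.Size([2, 768])
```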
But most importantly, the decoder models on the right side of the screen, the GPT-style ones (Llama 2, Bard and Gemini from Google, Claude from Anthropic, and so on), have really, really taken off in popularity. The combination of the two is driving a lot of the reasons why, again, an enterprise that had never done AI before is now bringing GPUs into the data center for the first time, buying their first InfiniBand switch, building a cluster of bare-metal servers, chasing performance in a way they haven't been for a long time.

All of that has led to a need for a machine learning platform. As we define it, machine learning platforms basically have three major components. First is the underlying infrastructure. Though this is quite similar to HPC clusters and big-data, Hadoop-like clusters, there are some distinct differences when you think about a cluster built for deep learning or machine learning, and I'll go through some of those in a second. The second piece is the platform software. The move from IaaS to PaaS services really exists in any location: on-prem, edge, colocation. The point is that users are looking for an experience where software is already installed, the cluster is up and running, and people can authenticate, log in, start developing code, and share resources. That platform software breaks down into three different aspects that we'll go through today in a lot of detail. The third piece that we've really seen is the introduction of models as a concept. Regardless of the infrastructure and platform software you happen to be using, there's some combination of open source models and public APIs: models you've downloaded from some reference like the Hugging Face model repo, an academic paper uploaded to arXiv that you're trying to replicate, some public GitHub repository, whatever reference you're using to get started, or an API you may be calling. The introduction of pre-trained models that can be used as a service is a really big change for a lot of organizations that may have been doing this ML platform thing for traditional classification and forecasting workloads, for computer vision, for recommendation engines, but are now really getting into language models, which requires the use of a lot of pre-trained models. That is quite unique.

So, what does it take to actually deliver an ML platform? Fundamentally it builds up like layers in a cake, the first being physical infrastructure. Most people, especially developers in this space, don't really consider the amount of power and cooling or rack space required for some of these high-end GPUs, but enterprise organizations really need to consider it at this point. We've seen a lot of people recently spend a bunch of money on GPUs and then try to figure out where they're going to fit in the data center, and this doesn't work. The most popular GPU in the world right now is the Nvidia H100. The SXM version of that card is 700 watts, compared to its predecessor the A100 at 400: almost doubling the power per GPU. With additional power for the latest CPUs, network cards, NVMe drives, even the fans for the big servers, these systems are consuming well above 8 kilowatts, approaching 10 kilowatts, per system in production. And that's not just the rated figure: these systems can be rated up to 12 kilowatts, but actually run at 8 to 10.
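To make that concrete, here is a back-of-envelope power budget in Python. The GPU wattages are the figures quoted above; every other component number is a rough assumption of mine for sizing, not a vendor specification:

```python
# Rough power budget for an 8-GPU H100 SXM server. Only the GPU figures
# come from the talk; the rest are illustrative assumptions.
gpus = 8 * 700           # eight H100 SXM GPUs at ~700 W each
cpus = 2 * 400           # two high-end server CPUs (assumed)
other = 2000             # NICs, NVMe drives, fans, conversion losses (assumed)

system_watts = gpus + cpus + other
print(f"~{system_watts / 1000:.1f} kW per server")          # ~8.4 kW

rack_budget_w = 10_000   # a common enterprise rack power budget (assumed)
print(f"servers per rack: {rack_budget_w // system_watts}")  # 1
```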
Consuming 8 to 10 kilowatts when it's running means that just one server can fill up an entire rack's worth of power budget in an enterprise data center today. So of course physical infrastructure is worth considering; where these GPUs are going to go is kind of the big question right now.

On top of that is where people really start to think about the traditional components of the infrastructure layer, which is to say: which GPU to actually pick. Since the H100 is kind of sold out right now, organizations are considering the L40S from Nvidia, and looking at alternatives like AMD GPUs, Intel GPUs, and other alternatives from various cloud providers out there. There's also the combination of CPU and GPU and how it's laid out: are you using an interconnect technology like NVLink, or just the PCIe bus? PCIe Gen 5 means a lot more bandwidth to some of these GPUs. Anyway, compute is a decision: how many GPUs, how they're laid out, things like that.

Storage is a big consideration. We see people storing data at very significant scale in this space. I have an example that I share: checkpoint files from a popular language model like Llama 2, even just in the 7-billion-parameter range, where you'd expect about 30 gigs written to disk, can push 100 gigs per checkpoint if you're saving everything to disk, especially as you start exploring models bigger than that 7-billion-parameter entry range. That just means an explosion in storage cost over time, and people really need to think about how much they're consuming, because hey, a fine-tuning run that goes for 45 minutes can result in 100 gigs of new data being stored. You can quickly start to blow out the existing file systems and network-attached storage that people have lying around, and start to approach the world of parallel file systems: Lustre, GPFS, our GreenLake for File product here at HPE based on VAST, and many others in the storage space. In addition, people look at object storage for that scale requirement, S3 compliance for different organizations to work from, and CSI compatibility, which I'll cover in a second.

And finally the fabric. We see a lot of people adopting InfiniBand almost as a default, from Mellanox (Nvidia now), but we also see people re-evaluating the need for it and saying: I need high bandwidth and low latency, but maybe I don't need 400 gigabits per port. There are different ways to approach aggregate bandwidth, and there are big needs for Ethernet connectivity to existing data sources in the enterprise data center, things like that.

Building on top of the compute, storage, and network fabric, we see classic system software: the operating system, cluster management, fabric and device drivers, and the different libraries that need to be pre-installed so you can build an orchestrated, cluster-level environment. For organizing these clusters you typically use Kubernetes or Slurm; those are the two most popular we've been seeing, basically for training, though for LLM deployments on-prem we see a lot of use of Kubernetes. The point here is really that there's different orchestration to do depending on the type of experience you want for your users.
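Before moving on, here is the arithmetic behind the checkpoint sizing mentioned above. The byte accounting assumes standard mixed-precision training with Adam, which is a common setup but an assumption on my part; exact breakdowns vary by framework:

```python
# Why a 7B-parameter model that is "about 30 gigs" on disk can push
# ~100 GB per checkpoint once optimizer state is saved.
params = 7e9
gb = 1e9

weights_fp32 = params * 4                   # ~28 GB: roughly the "30 gigs" figure
# Saving "everything" in mixed-precision Adam training typically adds
# fp16 working weights plus two fp32 optimizer moments on top of the
# fp32 master weights:
full_checkpoint = params * (2 + 4 + 4 + 4)

print(f"weights only   : {weights_fp32 / gb:.0f} GB")     # 28 GB
print(f"full checkpoint: {full_checkpoint / gb:.0f} GB")  # 98 GB
```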
Which is to say: do you want dedicated GPUs for individual users, or do you want a cluster that many users can schedule against? In most cases we see requests for both, and the point is that you need to be able to support whatever experience your users are looking for.

In summary, all of these considerations go together into an infrastructure service, the first layer of that ML platform (the left side on my previous slide). This means a turnkey cluster. Many organizations are coming to HPE and saying: I would like to buy a few hundred GPUs, and I'd like them delivered as a package, as a cluster, that I can just turn on, log in to, and start deploying applications to, a Slurm cluster or a Kubernetes cluster that will let people write code and run it. But many users are actually looking for a higher-level experience, which means actually deploying the ML applications, the MLOps sort of platform, that builds on top of that infrastructure service. That really means delivering a platform: actually installing applications on top of Kubernetes, or applications that connect to Slurm, so people can do three basic things. On the left side, manage their data: prep it and process it in certain ways to get it ready for development. In the middle, development, where people are actually writing code and running it. And then deployment, where people take the models they've trained and run them in production, so you can call on the model, chat with it, generate something, and go from there.

I'm going to spend more time talking about this platform software, but it wouldn't be an HPE webinar if we did not show you the different hardware options you have from our organization. On the left side we've got the traditional enterprise server, which includes up to four double-wide GPUs like the latest and greatest from Nvidia, the H100 or the L40S, Intel or AMD CPUs, and all the standard features you'd be looking for from an HPE ProLiant: iLO out-of-band management, support for a wide variety of memory and drive configurations, standard power supplies. That sort of typical enterprise server is the fundamental building block we see a lot of right now. In the middle we have the enterprise HPC and AI platform, which really means the introduction of the high-bandwidth networking and the parallel file storage, typically hundreds of these GPUs in one cluster. This is the system I was talking about that can consume up to 10 kilowatts in production; you really need to contend with that sometimes. It's also built with the Nvidia HGX board, using NVLink technology between the eight H100 GPUs in the system. And finally, on the right side, there are custom blades that we've seen used, for example, at public supercomputing facilities: the Venado system with the Department of Energy in the US, the recently announced Isambard-AI in the UK at the University of Bristol, and some others. The point of this is that we get to build packaging that's custom for these GPUs, combining the Nvidia Grace CPU with the Hopper GPUs all in one blade, and then powering it with liquid cooling at up to 400 kilowatts per rack.
This really means significant amounts of density: you can fit an entire supercomputer in the floor space of a shipping container and put it in a parking lot, as opposed to worrying about fitting these high-end GPUs in the traditional air-cooled data center.

But let's get into the meat here with respect to the platform software. I pointed out data, development, and deployment as the highest level when I was building up the layers of the cake earlier, so I wanted to tell a story about what we typically see people go through on their journey, and how these three aspects build up over time. The first thing, of course, is compute. This is the first thing people do: they buy a GPU server, or they spin up a virtual machine, or they spin up a notebook in some cloud service, and they start writing code and running it. So assuming you have a way to get a GPU and a way to get a development environment, in some cases this is where people just stop thinking about the platform. They say: I can write code and run it here in my Jupyter notebook, in my VS Code remote shell, and off I go. From here I can just install the required packages I need: pip install torch, numpy, transformers, bitsandbytes, all the latest libraries I need to actually take a pre-trained model, download it, and do something with it. Add my data for retrieval, fine-tune the model on my specific datasets, run some inferences, test out how it works. This is really enough for people to get started.
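As an illustration of that workflow, here is a minimal sketch of downloading a pre-trained model and running an inference with the Hugging Face transformers library. The model name is a small stand-in I chose so the example runs anywhere; a Llama-2-7B-class checkpoint works the same way if you have access to it:

```python
# A minimal sketch of "download a pre-trained model and do something
# with it". The model is a small stand-in; swap in whatever checkpoint
# you have access to.
# pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # illustrative; a 7B model follows the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Summarize the advantages of a shared GPU cluster:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```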
But notice I'm really talking about an individual user. What we've seen is that organizations who try to do this in their data center sometimes end up buying an entire server just for one user, or for one small team to share, and that really doesn't scale and is very inefficient. We've heard stories from organizations along the lines of: we had 45 users, so we had 45 virtual machines, each with an individual GPU. And what happens? Certain users want to train faster than what one individual GPU can provide. They want to use models that are bigger than the memory space of that one GPU, like the free T4 in Colab. And in many cases the GPUs are sitting around not doing anything, which leads to low utilization and wasted cost, while other people aren't getting the performance they need: the worst of both worlds.

So at some point comes the introduction of a scheduler. Like I said earlier, the two big examples we see are Kubernetes and Slurm, and there are big trade-offs here. HPC organizations are usually most familiar with Slurm: it gives you the most control over the network, kind of the best performance for launching jobs at scale, and certain admin practices are well defined there. But it really doesn't do well for on-demand inference. It's built for batch jobs, basically, or small allocations and interactive work, not really long-running inference workloads. And a lot of these foundation-model efforts start with people just wanting an API endpoint they can continuously deploy to, which seems more like a Kubernetes microservice. So the decision point is really: should I take Kubernetes, which is built for on-demand services that scale up and scale down, and build a batch scheduler into it, or should I take the batch scheduler, Slurm, and try to figure out how to serve models with the underlying infrastructure there?

What you really end up wanting is kind of the best of both, and that's where our Determined AI software comes in. HPE builds the enterprise product, the Machine Learning Development Environment, from the core of Determined AI, which is open source. One of the enterprise features we add is support for Slurm and PBS, for those HPC customers running big clusters with lots of GPUs (AMD, Nvidia, even CPU resources) that they want to tap into and schedule machine learning experiments on top of. Then people start to evaluate over time: they're looking to build a quick user interface, see how models are performing, and get some intuition about how the model looks and feels, as opposed to trusting some academic benchmark, some evaluation script, or the loss metrics from the training job. That alone doesn't really cover what it takes to build an LLM-powered application, right?

One thing that I've really overlooked to this point is where you actually access data. This is pretty fundamental. When I started talking about this and it was just servers with an IDE, of course you can download models and start running, but generally you're adding your data, and this can come from a wide variety of data sources in a wide variety of formats. Not everything is nicely set up as a pandas DataFrame or a tensor, but of course we want to get there: from CSV files, raw text files, stuff in Parquet format, exports from or queries against a SQL database, flat files, binary files, images, videos, audio files. There's all sorts of data sitting out there in the enterprise and available in the public domain. Accessing data is one thing, but a big thing we've seen from organizations tracking who ran what experiment over time, say from the Determined UI, is that people can track which file system directory was used for an experiment, but it's really hard to track what the dataset in that directory looked like at that time. This obviously changes, because in many cases people need to label or annotate data over time. Sure, creating a new directory each time is fine, but it can lead to a lot of inefficiency when you have to do deduplication at the file system layer, as opposed to managing it at a higher level. And data prep jobs (labeling and annotation, but also distributed SQL queries and any parallel processing that happens to the data before it's ready for training) can require a whole set of different frameworks.

From our perspective there are three pieces to data management. The first big piece is the parallel processing framework itself. We see a lot of use of Ray in this space; from the legacy Hadoop world, people are most familiar with Spark; pandas at scale means Dask; and all of this, if it runs on the GPU, is powered by RAPIDS. But one alternative we provide through Pachyderm, or through HPE's enterprise product, the Machine Learning Data Management software, is the ability to basically do parallel processing at the file level or the directory level based on a Linux glob pattern, rather than needing cluster-side software and a specific parallel processing framework, like is required for writing PySpark jobs.
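To give a feel for that model, here is a sketch of what a Pachyderm-style pipeline step can look like: plain Python operating on files the platform mounts in, with parallelism coming from the glob pattern rather than from the code. The /pfs paths follow Pachyderm's usual input/output convention, but the repo name and the "prep" logic here are hypothetical:

```python
# A sketch of a Pachyderm-style pipeline step: with a glob like /* on the
# input repo, the platform runs one copy of this script per matched file,
# in parallel, with no Spark/Ray code required. "raw_text" is a
# hypothetical repo name; /pfs/out is where outputs are collected.
import pathlib

IN_DIR = pathlib.Path("/pfs/raw_text")
OUT_DIR = pathlib.Path("/pfs/out")

for path in IN_DIR.glob("**/*.txt"):
    text = path.read_text(encoding="utf-8", errors="ignore")
    cleaned = " ".join(text.split())  # trivial "prep": normalize whitespace
    (OUT_DIR / path.name).write_text(cleaned, encoding="utf-8")
```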
With Pachyderm you kind of get native parallel processing out of the box. The requirement here is that you need to effectively automate that parallel processing as part of a pipeline, which we see a lot, because people aren't interested in manually running dozens or hundreds of custom Spark jobs at one time. They're actually looking to automate this over time, especially when it comes to new data commits, which is the third piece: tracking what version of your data, what your data looked like, over time. Having a version or lineage experience here means you can trigger pipelines to run automatically, to prep data with a known methodology built into a script, triggered when new data lands or when some change-data-capture pipeline runs so that raw data is exported. Sometimes it's on a schedule, which is the typical scenario for tools like Airflow and Kubeflow, but in many cases it's based on some condition: a certain amount of data has been committed, or some drift detector on the model has run and required a retraining. There are many different logic-based conditions that could kick off a data pipeline, require the prep job to run, and expose new data for training in the development section.

Finally, deployment on the right side. Assuming you have a trained model that's done, people start needing to run it somewhere. You see more and more optimizations in this space, as well as more and more serving frameworks. Calling out the language model stuff: vLLM and GGML, and all sorts of investments that have been going into making models run with a smaller memory footprint or run faster on existing hardware. But in the enterprise we really need a framework to actually deploy these models for inference at scale, whether that's serving them centrally, near your proprietary dataset, in your data center, on Kubernetes specifically, or at the edge, where people are looking to deploy on an embedded device, on a vehicle, on a mobile phone, or at some network endpoint. These are two different scenarios with different considerations in terms of being able to update the model over time to improve performance, and to just measure how things are going and take logs and metrics. That's the next piece here: being able to build some sort of CI experience where you can test what's going on with the model over time; optimize the model with some of the latest frameworks to actually improve its performance, make it run with lower latency, and improve throughput on less GPU memory; and, as I mentioned, collect logs and metrics. And finally, a really big place where we've seen people spending a lot of time and research recently is trustworthy AI: trying to build some level of explainability into these models, actually detecting drift and bias, and building some robustness or security layer (you hear the term "guardrails" a lot here for the LLMs) so that we can actually trust the AI-powered applications we're building with these platforms.
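To make the "API endpoint you can continuously deploy to" idea concrete, here is a generic sketch of a minimal inference service in Python. This is not TitanML's product or any specific serving framework named above, just the rough shape of such an endpoint, with a small stand-in model:

```python
# A generic sketch of a REST inference endpoint, wrapping a small causal
# LM in FastAPI. Model and route names are illustrative.
# pip install fastapi uvicorn torch transformers
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="gpt2")  # stand-in model

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 50

@app.post("/generate")
def generate(req: GenerateRequest):
    out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"completion": out[0]["generated_text"]}

# run with: uvicorn server:app --host 0.0.0.0 --port 8000
```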
So, in summary: three big aspects here, data on the left side, development in the middle, and deployment on the right side, mean that people need to contend with deploying and managing a lot of these applications at scale, so that you can build AI-powered applications for your users. This can be really complicated, so I'm going to wrap up by presenting our example platform to get started, especially for LLMs, because that is the hot topic right now.

We basically recommend that people start with the HPE Machine Learning Development Environment. There's easy integration with Jupyter notebooks and VS Code; basically any .ipynb file you've got should run in the environment out of the box. You then get the concept of an experiment, as an alternative to the notebook, which allows for priorities, resource pools, and access controls based on people's roles, so admins can control who's got access to which GPU and who can see which model trained in the registry. All of this scheduling can be done on top of Kubernetes, Slurm, or just with Docker agents: any number of Docker machines, and you can install this platform and get started. You get distributed training out of the box, with much easier launching of DeepSpeed, Horovod, and torch.distributed, integration with the Hugging Face model hub and lots of examples to build from, and out-of-the-box support for CUDA GPUs from Nvidia, ROCm from AMD, and TPUs in GCP, as well as CPU-based jobs in case there are CPU-only notebooks or experiments you want to run.

Then add in Pachyderm, or the HPE Machine Learning Data Management software. Three big things there: Pachyderm repos are to machine learning data what GitHub repos are to code; pipeline steps are driven by commits to those repos rather than being organized on a schedule, as with the alternatives; and parallel processing is done in a Python-native way, with no need to import special libraries or write code in a specific way, just launch the Python script for every file in the directory. You can organize all of this on top of Kubernetes so you can do it at massive scale, with petabyte-scale object storage on the back end and the Pachyderm file system on top, guaranteeing deduplication, scale, and support for unstructured or structured data.

And then finally: how do I get started with deployment? We've seen a lot of organizations looking for things like quantization and optimization out of the box, with support for many different runtimes and many different language models, batched options for streaming, the ability to build a quick UI out of the box, constrained outputs, and running multiple models on a single GPU or multiple GPUs for a single model. All of this is available through a nice partnership we have with an organization called TitanML, based in London. They're very much focused on language models, which is a great way to get started in this space, and that is the next webinar we'll be hosting in this community, talking a lot about model deployment and model serving, because we've seen it's such an important topic in this space, specifically for language models.

All of this builds on top of some distributed file and object storage. It doesn't have to, but we recommend the Ezmeral Data Fabric: you can get started with a terabyte or less and scale to hundreds of petabytes.
We've got programs around the world, like self-driving research programs, running with all of their data on this file system. It's software-defined, so you can basically install it on any Linux server, connect it to the Kubernetes cluster through a CSI driver, and build those clusters with GPU-accelerated infrastructure, delivered at scale and hosted on Kubernetes through HPE GreenLake. So yeah, that is the hopefully easier way to get started: the big three applications deployed, and you build an ecosystem around there. Hopefully this review of the platform software and some of the infrastructure that powers it has been interesting. We've got easy ways to get started, so if you're interested in working with us, please let us know. And thank you for your time.

[Matt, off-screen, is] asking whether people have seen a collection of the tools that you presented, and whether they're using them.

Yeah, so we got a few votes, and it looks like a wide range: some people have seen ten or more of the tools we shared today. There's a ton of tools in this space, and I think a lot of people get into it in different ways. We see people coming from a background in Hadoop, where they have a lot of familiarity with HDFS and the ecosystem of apps on top of it, like Spark, Hive, Drill, Kafka, HBase, all the way back to MapReduce and stuff like that, and they're moving from running reports, to running forecasting models, to now building with language models. They have a very different perspective and experience level, coming in writing high-performance Java code, and that's very different from somebody with experience in the HPC space writing MPI, working with Slurm clusters and InfiniBand, in some cases writing C, C++, or Fortran code. And both of those can be different again from people coming out of school who are Python-only: PyTorch, TensorFlow forever, NumPy, pandas, GPUs, right? All three of those backgrounds can be a little bit different in terms of the perspective they bring to the table, the experience they have, and what there is left to learn. So it's always useful to start with background and go from there.

I wonder whether people could drop into the chat whether they're using additional tools, ones we haven't seen presented.

Yeah, so Mato asks: it's an overwhelming amount of tools, how could a small team handle it? Well, you're asking me, so I'm a little bit biased, but I shared this slide to say it represents three tools to get started. You can really even just start with this one. What I'm describing here is the enterprise version, with the support for Slurm and the RBAC features, but maybe people from the community here will know about open-source Determined, and that's the way to get started, because it's the most familiar: this is the place where users log in, write their Python code, and run it. Then over time we start to see people have needs for automation and data management, and that's where we add the second tool, Pachyderm; the Machine Learning Data Management software is the enterprise version from HPE. Determined is open source, Apache 2.0, and runs at any scale; for Pachyderm we put a limit in at, I believe, 17 pipelines.
That means you can do a lot with the community edition, but to do these things at scale you really need support. And then deployment: there are all sorts of different ways to do it. You can do it on the GPUs you ran all the training on, or on some other system, so we recommend getting started with TitanML. That's the three applications to cover: the first thing, in the middle, is logging in and writing the code; then managing data and automating things at scale on the left side; and then actually deploying a REST API or some interface your users can use to work with the LLM. That stack is the three pieces we see people needing to get started.

Cool, okay, Ali's got a good question: how to get hands-on experience with ML platform engineering and MLOps, given the vast amount of tools; it's hard to know which one to proceed with.

Yeah, so, I hate to be a broken record, but start with this. I think it's very easy to get started with Determined: it's three Docker commands to install it on any Docker machine that has a GPU, and you get two things out of the box, a notebook and experiments, and those are really important. There are other things, but those are really important. From there, people start to see the need for other MLOps tools. How would I call the Determined API to launch a job automatically, rather than having users log in and run it themselves or in a notebook? Or a user wants to write a certain kind of code that calls on a certain library, which I need to worry about getting into my environment, certifying, and installing. Or users want to pull a bunch of models, and I need to organize them all as a cache in a file system. If the focus is the place where users log in, write code, and run it, everything else kind of follows. So that's always where I recommend people get started; from there, build out the automation stuff and then build out the deployment stuff.
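On the "call the Determined API to launch a job automatically" point, here is a hedged sketch using Determined's Python SDK. The master address, credentials, and config values are all placeholders, and the exact config keys accepted depend on your Determined version, so treat this as illustrative rather than copy-paste ready:

```python
# A hedged sketch of launching a Determined experiment programmatically
# instead of through the UI. All names and values are hypothetical.
from determined.experimental import client

client.login(master="http://determined.example.com:8080",
             user="alice", password="...")  # placeholder master/credentials

config = {
    "name": "finetune-smoke-test",
    "entrypoint": "python3 train.py",
    # Searcher fields vary across Determined versions; check your docs.
    "searcher": {"name": "single", "metric": "loss",
                 "max_length": {"batches": 100}},
    "resources": {"slots_per_trial": 2},  # two GPUs for this trial
}

exp = client.create_experiment(config=config, model_dir="./model_src")
print(f"launched experiment {exp.id}")
```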
As for ML platform engineering: today I would consider it synonymous with Kubernetes experience. (And yes, of course I'm happy to share these slides; we will definitely distribute them after the webinar, and we're also going to post this recording online so you can go back and watch it in case I went too fast.) To me, experience with these tools, or even the broader range of tools here for platform admins, doesn't strictly require Kubernetes; there's a way to do this without it. But the modern enterprise is running these things on top of Kubernetes, and people in this space, ML engineers who have been asked to install and manage all their own infrastructure, hate Kubernetes. Kubernetes is so hard, but it's the worst option except for everything else, so just kind of do it; that would be my recommendation. I always recommend the CKA certification as a great way to understand the concepts in Kubernetes: what is a namespace, what is a persistent volume, what is a CNI provider, what's a StatefulSet. Learn about all of these things in concept, but in practice there's nothing better than taking a Helm chart, which is the way to install Determined or Pachyderm on your Kubernetes cluster, and trying to run it. The config was wrong in some way, and I need to add a certain file system directory for where I'm going to store my Determined checkpoints: okay, go edit the Helm chart, go edit that YAML file, and learn about that. Something went wrong with my persistent storage layer and I need to figure out the file system itself: learn about that. Something went wrong with my network: troubleshoot that. That's the fundamentals of platform engineering, because when you think about this stuff at scale, it's the same technology the biggest customers in the world are using. OpenAI has a couple of great blogs about running Kubernetes at scale. They wrote the first one in 2018, I think, on scaling Kubernetes to 2,500 nodes; that means around 20,000 GPUs, eight GPUs in each node. Then the second one, in 2021, was on scaling Kubernetes to 7,500 nodes. You don't need to learn about any of that, frankly, to get started, but you have to imagine that OpenAI has overcome a couple of challenges along the way that you may experience in your data center, and they're using Kubernetes, so give it a go. The number one hybrid skill set we see people looking for, when they want to get started with something like this, is the person who understands a little about the Python code that runs in the notebook or in the experiment, and understands Kubernetes and how the pods actually run things in containers. That hybrid skill set is really, really important for the sort of machine learning platform engineer or machine learning architect you may come across today.

Okay, Morrow's got a question about the characteristics a storage infrastructure should have. That's a really good question; it may be worth digging into in a future webinar. I know we're planning one focused on automation and pipelines, like this Pachyderm software on the left, because we kind of just say "object storage" and that doesn't cover everything. If I look at how I use storage for one of these platforms: you need a space for users to have a home directory, basically, to write their code and run it. You need a space for trained models to be stored; this could be your Hugging Face model cache, and it could also be what we would call your model registry within the Determined interface (where do my trained models go to be stored?), or your checkpoint storage directory. Then you've got your hot datasets, which are usually shared. If you download Common Crawl and want to host it for everybody to train their language models from, you don't want every user downloading different copies of the data, and you want the ability to have that data be hot and high-performance on the NVMe tier. And finally, you need a way to run databases. Determined has a Postgres database that backs the UI so you can track who did what over time; same thing for the Pachyderm interface, and same for the different logging and metrics databases required for Pachyderm or Determined or KServe, where I need to work with Loki or Prometheus or whatever. Those are kind of traditional CSI persistent volumes that can be created. So there are different considerations. The big thing we see is that people need some sort of file system that has a CSI driver, because generally they need Kubernetes. Over time, the file system doing triple-copy replication for the hot stuff gets expensive, so they start doing S3, and that could mean S3 from the start, because some app developer knows how to put files to S3 in the cloud, so why not give them the same option on-prem?
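That "same option on-prem" usually means pointing a standard S3 client at an S3-compatible endpoint instead of AWS. Here is a minimal sketch with boto3; the endpoint, bucket, keys, and credentials are all placeholders:

```python
# A sketch of using an on-prem S3-compatible store (Ezmeral Data Fabric,
# MinIO, etc.) with a standard S3 client. All names are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.datafabric.example.com:9000",  # on-prem endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Push a training checkpoint to object storage, then list what's there.
s3.upload_file("checkpoint-0100.pt", "checkpoints", "llama2/checkpoint-0100.pt")
resp = s3.list_objects_v2(Bucket="checkpoints", Prefix="llama2/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```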
It's usually not that we need an S3 API specifically; we can usually use a file system or S3. But the efficiencies of object storage are a really big benefit when you start doing 100-gigabyte checkpoints, petabyte-scale datasets, and stuff like that. So, those are the three big things. If you can provide a storage platform with those three tiers (an easy way to spin up a database backed by a CSI-provisioned persistent volume on the Kubernetes cluster; hot datasets and the model cache on an NVMe-tier file system; and checkpoints, cold data, and the petabyte-scale stuff in S3 storage), that's a really good solution. So I'm leaning towards this product, Ezmeral Data Fabric, which provides all three of those interfaces out of the box. It has a CSI driver. It is a file system, a Linux POSIX-compliant thing you can mount as an NFS share and run ls against a directory; it's not HDFS stuff. And it's also S3: you can create buckets and get the benefits of erasure coding, all on the same hardware platform, one piece of software to install. You can do this with alternatives: our GreenLake for File product built on VAST also does a lot of this, and our Lustre product, called ClusterStor, does it too, with different considerations for connecting to the InfiniBand network. Anyway, like I said, there's a lot more I could go into detail on; hopefully that was a useful intro to the storage infrastructure.

Awesome, thanks Jordan. (I don't know if you can go on mute, please? Yeah, sorry about that. There we go. Lovely, thank you.) So, awesome as ever, the man, the myth, the machine learning legend has demonstrated that he is exactly that. I just want to say huge thanks. For anyone who has enjoyed this webinar, please remember that we're running more of these. We've got face-to-face meetups coming where we've got TitanML coming in, we're going to be talking about pipelining in more depth, and about model construction, and we can definitely add some material around the storage considerations that we need to wrap more depth around. So if you have enjoyed these, please join us for more; we're continuing to run the series. If there are ideas, feedback is always great, we love that: we can get more speakers in, more things to talk about, and we really want to make this a community-driven thing. The more you want to know, the more we can facilitate those conversations. So, awesome, thanks very much for joining us and attending this series, and we will look forward to seeing you all again, hopefully with some more friendly faces, on future episodes going forward. Thanks.


