Bringing Generative AI to Life with NVIDIA Jetson
Hello everyone, welcome and thank you for joining us today. I'm Dusty from NVIDIA and I'm really excited to be here and share some amazing advances with you for bringing the latest generative AI models and LLMs to the edge. Since the introduction of transformers and then the Ampere A100 GPU in 2020, we're all aware of the hockey stick growth that's occurred with the size of the models and their intelligence, which feels like it's approaching that of a human. It's been a huge leap forward in a
relatively short period of time, and things have really seemed to hit warp speed with the open sourcing of cornerstone foundational models like Llama and Llama 2. There's a huge community out there of researchers, developers, artists, hobbyists, all working furiously on this day and night. It's incredibly exciting, and the pace of innovation is hard-slash-impossible to keep up with.
There's a whole world out there to dive into and discover, and not just LLMs, but vision language models and multimodality, and zero-shot vision transformers alone are a complete game-changer for computer vision. Thank you to everyone out there who's contributed to this field in some way or another. Many have been at it for years or decades, and thanks to them, it feels like the future of AI just kind of, poof, arrived overnight.
Well, the time's now for us to seize the moment and bring this out into the world and do some good. Naturally, due to the extraordinary compute and memory requirements, running these huge models on consumer-grade hardware is fraught with challenges, and understandably, there's been comparatively little focus on deploying LLMs and generative models locally outside of the cloud, and even out in the field on embedded devices. Yet in the face of it all, there's a growing cohort of people doing just that. Shout out to r/LocalLLaMA and r/StableDiffusion. At the edge, Jetson makes the perfect platform for this because it has up to 64GB of unified memory with Jetson AGX Orin and 2048 CUDA cores, or 275 teraops of performance, in a small, power-efficient form factor. Why bother though?
Why not just do it all in the cloud? Well, the reasons for that are the same that they've always been with edge computing. Latency, bandwidth, privacy, security, and availability. One of the most impactful areas that underlies the other applications shown here is human-machine interaction, or the ability to converse naturally and have the robot autonomously complete tasks for you. As we'll see, you really need to be geared for latency, especially when real-time audio and vision are involved, and of course anything that's safety critical. It would also just seem good to know how to run this stuff yourself while keeping all your data local. And fortunately, there's a massive compute
stack openly available for doing just that. We've been hard at work for a while now on JetPack 6, and it's easily the biggest upgrade that we've done to the Jetson platform software architecture, which is now underpinned by an upstream Linux kernel and an OS distro that you can choose. And we've decoupled the version of CUDA from the underlying L4T BSP so that you can freely install different versions of CUDA, cuDNN, and TensorRT too. We provide optimized builds and containers for a large number of machine learning frameworks like PyTorch and TensorFlow, and now all the LLM and ViT libraries too. There are world-class pre-trained models available to download from TAO, NGC, and Hugging Face that can all be run on JetPack with unmatched performance on edge devices like Jetson. We're bringing more components and services from Metropolis to Jetson for video analytics with DeepStream, and we just released Isaac ROS 2.0 with highly optimized vision GEMs,
including SLAM and zero-copy transport between ROS nodes called NITROS for autonomous robots. JetPack 6 will be out later this month and supports Jetson Orin devices, and going forward it should make upgrades much easier. With that, let's dig into the actual models that we're going to show you how to run today. First up are open vocabulary vision transformers like CLIP, OWL-ViT, and SAM that can detect and segment practically anything that you prompt them for using natural language.
Then LLMs followed by VLMs, or vision language models, multimodal agents, and vector databases for giving them a long-term memory and the ability to be grounded with real-time data. Finally, streaming speech recognition and synthesis to tie it all together. All of this we're going to run onboard Jetson Orin.
So we've optimized several critical ViTs with TensorRT to provide real-time performance on Jetson. These have higher accuracy, they're zero-shot, and are open vocabulary, meaning they can be prompted with natural language expressing context and aren't limited to a pre-trained number of object classes. CLIP is a foundational multimodal text and image embedding model that allows you to easily compare the two once they're encoded and predict the closest matches. For example, you can supply an image along with a set of complex labels and it'll tell you which label is the most similar contextually without needing further training on object classes, meaning it's zero-shot. CLIP has been broadly adopted as an encoder or backbone among more complex ViTs and vision language models, and generates the embeddings for similarity search in vector databases. Likewise, OWL-ViT and SAM, or Segment Anything, also use CLIP underneath. OWL-ViT is for detection,
whereas SAM is for segmentation. And then over here, EfficientViT is an optimized ViT backbone that can be applied to any of these and provide further acceleration. Again, these are using TensorRT and it's available today in JetPack 5 and of course will be in JetPack 6 as well. So, let's dig into some of these. We have a demo video here showing the capabilities of OWL-ViT. So, you can see you could just type in what you wanted to detect and it will start producing those bounding boxes should it find them. Previously, you would have had to capture your own data set, annotate it, and train a model like SSD-MobileNet or YOLO
on your training data set, and it would have a limited number of object classes in it. Well, you know, OWL-ViT was based on CLIP, which just has a huge amount of images and different objects in it, so you can query it for practically anything here, and it's a real game changer not to have to train your own models for each and every last detection scenario that you want to do. This is a really impressive technology to be able to deploy in real time on Jetson, getting up to 95 frames per second on AGX Orin. So not only that, but when you combine the detections from OWL-ViT with a secondary CLIP classifier, you can do further semantic analysis on each object. So here you see within each detection ROI, it's doing further sub-detections. In brackets means you want it to use OWL-ViT, and in parentheses, which we'll see in a second, you want it to use CLIP. So this is not dissimilar to primary,
secondary detection pipelines that we've done in the past, but in this case, it's all zero-shot, open vocabulary, and much more expressive. So you can see here not only is it detecting these different objects, but it can also classify them individually. So happy face, sad face, other types of things like that. And you can perform some very powerful detections and subclassifications this way just by writing a simple query on your video stream. No code required even. So all the code for running OWL-ViT in real time is available on GitHub here.
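To make the bracket and parenthesis convention concrete, here is a minimal sketch of how such a query could be parsed into a detect/classify tree. It is not the project's actual parser: the `Node` class, the `parse` function, and the simplification that a nested group applies to every label of its enclosing group are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical mapping: [...] means "detect", (...) means "classify".
OPEN = {"[": "detect", "(": "classify"}
CLOSE = {"[": "]", "(": ")"}


@dataclass
class Node:
    op: str                                        # "detect" or "classify"
    labels: list = field(default_factory=list)     # text prompts for this stage
    children: list = field(default_factory=list)   # stages run inside each ROI


def _parse_body(text, pos, closer):
    """Read comma-separated labels and nested groups until `closer`."""
    labels, children, buf = [], [], []

    def flush():
        label = "".join(buf).strip()
        if label:
            labels.append(label)
        buf.clear()

    while pos < len(text):
        ch = text[pos]
        if ch in OPEN:
            flush()
            child = Node(OPEN[ch])
            child.labels, child.children, pos = _parse_body(text, pos + 1, CLOSE[ch])
            children.append(child)
        elif ch == closer:
            flush()
            return labels, children, pos + 1
        elif ch in "])":
            raise ValueError(f"unexpected {ch!r} at position {pos}")
        else:
            if ch == ",":
                flush()
            else:
                buf.append(ch)
            pos += 1
    if closer is not None:
        raise ValueError(f"missing {closer!r}")
    flush()
    return labels, children, pos


def parse(prompt):
    """Return the top-level groups; text outside any group is ignored."""
    _, groups, _ = _parse_body(prompt, 0, None)
    return groups


tree = parse("[a face (happy, sad)]")
print(tree[0].op, tree[0].labels)                          # detection stage
print(tree[0].children[0].op, tree[0].children[0].labels)  # per-ROI classification
```

A real pipeline would hand the `detect` labels to the detector and then run the `classify` labels on each cropped region.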
It's called the NanoOWL project because we've taken the original OWL-ViT models and optimized them with TensorRT, which is how we took it from, I think, 0.9 frames per second to 95. And there are various different backbones that you can run, different variants of OWL-ViT that have more accuracy, less accuracy, higher performance, lower performance. And this also has a very simple Python API where you just put in your image, put in your text prompts, and it spits out the bounding boxes and their classifications. So I highly recommend that you check out this project on GitHub, if nothing else that you take away today, because object detection is still, by far, the most popular computer vision algorithm run, and this can completely revolutionize that. So the segmentation analog of this is called SAM, or Segment Anything Model, and it works very similarly. Basically you just provide some control points, like click on different parts of the image that you want to segment, and it will automatically segment all of those different blobs for you no matter what they are.
It used to be that you would have to manually go and make a segmentation data set, train the model on that, and those segmentation data sets were very resource intensive to annotate. But now, you know, you can click on practically anything and it'll segment it for you. And when you combine it with tracking, there's another project called TAM, or Track Anything, and that can do Segment Anything over video for you. We also have containers for this available on
Jetson that you can take and run today. So everything that I'm going to cover in today's presentation is available online in what we call the Jetson AI Lab, which is a set of tutorials that we've put together that can easily guide you through running LLMs, ViTs, VLMs, vector databases, the whole thing. And all these tutorials are backed by pre-built Docker container images that we have for all these packages. And you can take those containers, combine them, mash them together, build your own custom versions, and build your own applications with them and deploy them that way. So it takes a lot of the guesswork and debugging out of porting what normally are GitHub projects for x86 and discrete GPUs over to ARM and aarch64, and you don't have to worry about patching all that or getting the Transformers library to run and so on.
And we have a lot of different packages here available at your disposal and some really awesome tutorials to follow. And you can chat with your own Llama models. You can run LLaVA and all the ViTs that we cover here. And also Stable Diffusion too. Stable Diffusion is really a lot of fun to do image generation with. Okay, so now we're going to jump into the large language models. Here is a chart of the LLM performance that we currently have on Jetson. This is for AGX Orin, and it shows a bunch of
different APIs that we've exhaustively built, tried, and optimized with CUDA, and it turns out the fastest ones are currently the AWQ optimization out of the MIT Song Han lab and also the MLC/TVM project out of Carnegie Mellon. So these two are by far the fastest. llama.cpp is also very popular. We use that too in the community, but MLC is about twice as fast, and you can see here we can run everything from Llama 2 7B up to the full 70B at interactive rates. And we just showed the Llama models here, but you can pick basically anything from Hugging Face that you want to run and run it through these and do quantization. So these are all running with 4-bit quantization, so they're much more memory efficient, which is how we get them to fit in the memory capacity, because the weights alone for 70B can be up to, you know, 140 gigabytes if you're doing FP16. But if you 4-bit quantize it,
you're down to like 35 gigabytes or so. So here's a video of me chatting with Llama 2 70B. Hi Llama, how are you today? I'm doing well. Thanks for asking. How about you? Is there anything you'd like to chat about or ask? I'm here to help with anything. I actually gave you a voice with text to speech and automatic speech recognition. So we are actually talking to one another. Wow. That's amazing.
I'm glad to hear that we can communicate in a more natural way. How's the voice synthesis and speech recognition working for you so far? Is there anything you'd like to chat about? Oh, it's working really well. You're running here on an NVIDIA Jetson embedded board? That's interesting. NVIDIA Jetson boards
are powerful and efficient, making them a great choice for edge AI applications. Are you using any specific software or frameworks to run the AI models on the Jetson board? And how are you finding the performance of the board so far? That's funny that you already know about it. Yes, you are running on the GPU in real time. That's impressive. So this was using Riva for the ASR and TTS, and it runs in real time on Jetson AGX. As we'll see in some later videos, the smaller models generate text way faster. But it's just a lightweight web GUI that I put together with WebSockets that's geared for low latency. And it takes the mic from your web browser,
transmits it to your Jetson locally over WebSockets, and then sends the TTS audio back to it. So obviously, being at the edge, most of these Jetson-based embedded systems have cameras or other vision sensors attached to them. And as such, everybody in the community is very interested in vision language models like LLaVA, MiniGPT-4, and a lot of these multimodal embeddings that are coming out. And how they all work is essentially they use an embedding model like CLIP that will combine text and images into one common embedding space where, you know, contextually similar concepts land close together.
So if you have a picture of a dog and the word dog in this multidimensional embedding space, those two vectors are found in a very similar location to each other, i.e. they convey the same thought or sentiment to the LLM. And then after that embedding is complete, in the case of LLaVA, which uses literally the same CLIP ViT encoder that we mentioned previously, there is a small projection layer that maps from the CLIP embedding dimensions into the Llama embedding dimensions.
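As a rough picture of that projection step, the sketch below maps a handful of image patch embeddings into the language model's embedding width and places them ahead of the text token embeddings. Everything here is a toy stand-in: the dimensions are tiny, the weights are random, and a single linear map is assumed where a real model may use a small MLP with trained weights.

```python
import random

D_CLIP, D_LLM = 8, 16   # toy sizes, far smaller than any real model


def make_projection(d_in, d_out, seed=0):
    """Random d_in x d_out matrix standing in for trained projection weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_out)] for _ in range(d_in)]


def project(patches, weight):
    """Apply one linear map to every patch embedding (no bias, for brevity)."""
    d_out = len(weight[0])
    return [
        [sum(p[i] * weight[i][j] for i in range(len(p))) for j in range(d_out)]
        for p in patches
    ]


def build_llm_input(image_patches, text_embeddings, weight):
    """Image tokens come first, followed by the text tokens."""
    return project(image_patches, weight) + text_embeddings


rng = random.Random(1)
image_patches = [[rng.random() for _ in range(D_CLIP)] for _ in range(4)]
text_embeddings = [[rng.random() for _ in range(D_LLM)] for _ in range(3)]

sequence = build_llm_input(image_patches, text_embeddings, make_projection(D_CLIP, D_LLM))
print(len(sequence), "embeddings of width", len(sequence[0]))
```

After projection the image patches look, to the language model, like ordinary token embeddings, which is why they count against the context length in the same way text does.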
And then they also fine-tune the Llama model to be able to understand those combined embeddings better. And what we found is if you use the larger CLIP model that uses 336x336 resolution instead of 224x224, it's able to extract much smaller details and even read text out of images. And there are lots of other image embeddings out there. It's a very active area of development, like ImageBind, for example, which combines way more than just images and text. That can do audio, inertial measurement units,
point clouds, all types of modalities that the LLMs can be trained on and be able to interact with. Essentially, what we're doing is enabling the LLMs with all of the different senses so they're able to assimilate a holistic worldview and a perception world model, so they can better understand things rather than just have to do it all through text. So you can see here the performance of the LLaVA models is very similar to that of the base Llama models. It is actually the exact same model architecture. It's just a smidge slower, a few tokens per second slower, because it turns out that these image embeddings are a lot of tokens, like 512 tokens for a 336x336 image embedding, or I think it's 256 tokens for a 224x224 embedding. And all of those
tokens come at the beginning of the text chat. So every one of those, you know, your text input is at least 512 tokens long as opposed to a normal text chat starting at zero tokens. So that's why it's just slightly slower, but otherwise it gets all the same performance. So with the latest LLaVA 1.5 model that came out, which some of you might have seen, it's got some
really exciting new features in it, including the ability to output constrained JSON format. So you can essentially tell it, you know, detect certain objects, or I want you to pick out this and that and give it to me in JSON, so you're able to parse it programmatically and actually do things with it. And whereas the ViTs you needed to prompt with specific things, like I want it to detect a face or hands or whatnot, in this example you can just tell it to detect everything and it will spit out bounding boxes of all objects or whatever it is that you're trying to do. And this will be very powerful for making closed-loop visual agents that are deployed in real-time embedded systems. For example, smart traffic intersection crosswalk monitors, or blind assistant devices, or baby crib monitors, or anything that is an open-ended problem that you want to extract information from without having to train all of your own models to do that. In addition to things like autonomous navigation, being able to query it on like, hey, where's this path go? Are there obstacles in my way? Are those obstacles dynamic? All types of things like that.
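Parsing that kind of JSON reply programmatically might look like the sketch below. The schema (a list of objects with a `label` string and a four-number `bbox`) is an assumption made for illustration; the actual format depends on how you prompt the model, and a model can always return something malformed, so the parser drops anything it cannot validate.

```python
import json


def parse_detections(reply):
    """Extract (label, bbox) pairs from a model reply that embeds a JSON list.

    Assumed (hypothetical) schema: [{"label": str, "bbox": [x1, y1, x2, y2]}, ...]
    """
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        items = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return []

    detections = []
    for item in items:
        if not isinstance(item, dict):
            continue
        label, box = item.get("label"), item.get("bbox")
        if (
            isinstance(label, str)
            and isinstance(box, list)
            and len(box) == 4
            and all(isinstance(v, (int, float)) for v in box)
        ):
            detections.append((label, tuple(float(v) for v in box)))
    return detections


reply = 'Sure! [{"label": "dog", "bbox": [0.1, 0.2, 0.5, 0.9]}, {"label": "cat"}]'
for label, bbox in parse_detections(reply):
    print(label, bbox)
```

The entry without a bounding box is skipped rather than raising, which keeps a closed-loop agent running when a single frame produces a bad reply.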
And this is really exciting. Another thing to point out is that this whole model, LLaVA 1.5, is a huge improvement over the previous versions of it, but it was still only trained in one day on eight A100 GPUs. So if you collect your own data set, it's actually within the realm of possibility to fine-tune one of these yourselves for your own applications.
So here's a video demo of chatting with LLaVA 1.5 on AGX Orin. What kind of fruits do you see? In the image, I see a variety of fruits, including oranges, bananas, grapes, and apples. How much do the peaches cost? The peaches cost 2.50. How much do the oranges cost? The oranges cost 1.50. What breed of dogs are in the picture? There are two dogs in the picture. One is a husky and the other is a
golden retriever. Where are they? The dogs are sitting on a rocky hillside, likely in a mountainous area. What does this sign say? The sign says, no outlet private road. What kind of trees are these? The trees in the image are pine trees.
Where's the path go? The path is in the woods, surrounded by trees and vegetation. What kind of bird is this? The bird in the image is a robin. Who has the ball? The boy in the black and red shirt has the ball. What game are they playing? They are playing soccer. Can you describe the image?
The image features a young girl riding a brown horse in a corral. The girl is wearing a red shirt and is positioned on the horse's back. The horse is galloping around the corral, showcasing its agility and strength. The scene captures the excitement and joy of the girl's horse riding experience. Here's a look at some of the example code that powers these demos.
And it's basically a lightweight wrapper that I put around MLC and AWQ, since those weren't supported in some of the other LLM libraries out there, in addition to all the multimodal embedding management stuff, which we'll talk about. But it's got a very simple API for text generation. Essentially, you load the model, and it will quantize it for you if that's not already done. You create this chat history stack and then you can append either text prompts or images to that.
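The shape of that API can be pictured with a small stand-in. None of the names below come from the actual wrapper: `ChatHistory`, `append`, and `fake_generate` are hypothetical, and the "model" simply echoes the last text prompt one word at a time to mimic streaming output.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class ChatHistory:
    """Hypothetical chat stack holding (role, kind, payload) entries."""
    entries: List[Tuple[str, str, object]] = field(default_factory=list)

    def append(self, role, payload):
        # Strings are treated as text; anything else stands in for an image.
        kind = "text" if isinstance(payload, str) else "image"
        self.entries.append((role, kind, payload))


def fake_generate(history: ChatHistory) -> Iterator[str]:
    """Stand-in for a streaming LLM: yields one token (word) at a time."""
    last_text = next(p for _, k, p in reversed(history.entries) if k == "text")
    for word in f"You said: {last_text}".split():
        yield word


history = ChatHistory()
history.append("user", b"<image bytes>")            # placeholder image payload
history.append("user", "What is in this image?")

reply_tokens = []
for token in fake_generate(history):
    reply_tokens.append(token)   # a real UI would display each token right here

# Add the bot's reply to the stack so the next turn builds on it.
history.append("bot", " ".join(reply_tokens))
print(history.entries[-1])
```

The point of the generator is that the caller can present each token as soon as it arrives instead of waiting for the full reply.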
It will automatically perform the embeddings for you, depending on what data type those inputs are, and then it generates a stream of output tokens. So everything we do here is geared for real-time streaming, so you can get that data presented to the user as soon as possible. And then you basically just add the bot response to the chat history too. Now, let's say you want to do your own chat management dialogue, you can totally do that. You can just pass in strings of text to
the model generation function. What you should be aware of is that the chat histories work best when you keep a cohesive stack going, because then you don't have to go back and constantly regenerate every single token in the chat. For example, we know Llama 2 models have a max token length of 4096. But if you were to process the full 4096 length chat every time, it would take really long.
Instead, you can keep that cached in what's called the KV cache. And if you keep that between requests, then the whole state of that chat is kept. You can run multiple chats simultaneously, but it's highly recommended to keep the chat stack flowing as opposed to going back and forth and just assimilating it all from scratch every time. And the reason for that is because
there are two stages to LLM text generation. The first is what's called prefill, where it takes your input context and has to essentially do a forward pass over every token in there. And this is a lot faster than the generation stage, but it still adds up when you're talking a full 4096 tokens here. So you can see if we're running, you know, Llama 2 70B on a full 4096 token length chat, it'll take 40 seconds to prefill that whole thing. That's before it even starts responding. But if you only prefill the latest input, you're looking at, you know, a fraction of a second. That's typically like the dot dot dot that shows up in the chat or like agent is typing.
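The arithmetic behind those two numbers is easy to sketch. The prefill rate below is derived only from the figures quoted in the talk (4096 tokens in about 40 seconds for the 70B model); the 4000-token history and 30-token new input are made-up values for illustration, and real prefill throughput is not perfectly constant across context lengths.

```python
# Derived from the talk's figures: 4096 tokens prefilled in roughly 40 seconds.
PREFILL_RATE = 4096 / 40.0   # about 102 tokens per second


def prefill_seconds(num_tokens, rate=PREFILL_RATE):
    """Estimated time to run the prefill pass over `num_tokens` tokens."""
    return num_tokens / rate


def tokens_to_prefill(cached_tokens, new_tokens, reuse_cache):
    """With a kept KV cache only the new input is processed; without it, everything."""
    return new_tokens if reuse_cache else cached_tokens + new_tokens


history_tokens, new_tokens = 4000, 30   # illustrative values, not measurements

cold = prefill_seconds(tokens_to_prefill(history_tokens, new_tokens, reuse_cache=False))
warm = prefill_seconds(tokens_to_prefill(history_tokens, new_tokens, reuse_cache=True))

print(f"without KV cache: {cold:.1f} s before the first output token")
print(f"with KV cache:    {warm:.2f} s before the first output token")
```

Under these assumptions the cached case stays well under a second, which matches the "fraction of a second" described above.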
What it's actually doing is prefilling your input into it before it can start generation. So this is why managing the KV cache between requests is actually crucially important to keep a very consistent chat flow going. Likewise, here's a look at the token generation speed and how that varies with the input length. That does decrease slightly as well once
you get up into the higher token lengths. So that's something to account for as well. So obviously a big concern with all of this is what the memory utilization requirements are, and that, along with the token generation speed, is what really drives everyone very heavily towards quantization methods. So a lot of these LLM APIs that we talked about, like llama.cpp and others, AutoGPTQ, ExLlama, they have lots of different quantization methods. You can go everywhere from two bits to eight bits. Most of the time, below 4 bits you start to see degradation in performance, but at, you know, Q4 A16 quantization I've not really seen any difference in output, which is really good because it takes Llama 2 70B from being 130 gigabytes of memory usage down to only 32 gigs, and that is much more deployable for Jetson, and likewise for the smaller Jetsons too. You can run Llama 2 7B on an Orin Nano 8 gigabyte
board or Llama 2 13B on the Orin NX 16 gigabyte. And we can see here that we have a whole lineup of different Jetson modules that you can deploy. And each of these, conveniently, has a typical model size that is well-fitted to its memory capacity. So as I mentioned, the 7B models
are a good fit for the Orin Nano 8 gigabyte. 13B models are a good fit for the Orin NX 16 gigabyte, and so on, and so you can basically mix and match the level of intelligence that your application requires along with its performance and other SWaP-C requirements, like size, weight, power, and cost of the embedded system that you're deploying, and be able to pick the Jetson module that's appropriate for you to deploy those. So a few slides ago, I showed some code that was basically a low-level API for doing text generation with the LLMs. Once you start getting more complicated and adding things like ASR and TTS and retrieval augmented generation and all these plugins, the pipelines get very complex.
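Stepping back to the sizing figures for a moment, the memory numbers quoted in this talk follow from simple arithmetic on the weights alone. The sketch below ignores the KV cache, activations, and quantization metadata, and the 75% headroom factor in `fits` is an arbitrary assumption rather than a rule from the talk. One possible reading of the two sets of figures (140 and 35 gigabytes earlier, 130 and 32 here) is decimal gigabytes versus binary gibibytes.

```python
def weight_gigabytes(params_billions, bits_per_weight):
    """Size of the weights alone, in decimal gigabytes (1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def to_gibibytes(gigabytes):
    """Convert decimal gigabytes to binary gibibytes (2**30 bytes)."""
    return gigabytes * 1e9 / 2**30


def fits(params_billions, bits_per_weight, module_gb, headroom=0.75):
    """Rough check: weights must fit in an assumed fraction of module memory."""
    return weight_gigabytes(params_billions, bits_per_weight) <= module_gb * headroom


for params, bits in [(70, 16), (70, 4), (13, 4), (7, 4)]:
    gb = weight_gigabytes(params, bits)
    print(f"{params}B at {bits}-bit: {gb:.1f} GB ({to_gibibytes(gb):.1f} GiB)")

# Module memory sizes as named in the talk.
for name, module_gb, params in [("Orin Nano", 8, 7), ("Orin NX", 16, 13), ("AGX Orin", 64, 70)]:
    print(f"{name} {module_gb} GB runs {params}B at 4-bit: {fits(params, 4, module_gb)}")
```

The same helper shows why FP16 is a non-starter for the largest model on any single module, since 140 GB of weights exceeds even 64 GB of unified memory.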
And if you're only making one bespoke application, you can absolutely code all of that in one Python application yourself. In fact, the first version of that llamaspeak demo I did was just like that. But eventually, when you start iterating and making different versions of that, let's say I want to have one that's multimodal or I want to do a closed-loop visual agent, you're going to start having a lot of boilerplate code in there. So I've written a slightly higher level API on top of that text generation function that you can implement all these different plugins in. It's very lightweight. It's very low latency.
It's meant to get out of your way and make this all easier instead of harder without sacrificing even one token per second of generation speed. And with this, you can very easily chain together all of these different text and image processing methods and use them with other APIs as well. So these are just two basic examples of what the pipeline definitions look like, and it can be a completely directed, open-ended, multi-input, multi-output graph, so you can get some pretty complicated setups going in there. So another cool aspect that this enables is what we refer to as inline plugins, or the ability for the LLM to dynamically generate code for APIs that you define to it. For example, how is it supposed to know what the time is, or do internet search queries, or know what the weather is, or perform actions like turn the robot left or right? All of those core platform functions you can define in the system prompt and explain the APIs to the LLM, and when it needs to, it'll dynamically call them. This is good to do on top of retrieval augmented generation, which we'll talk about in a second, because it doesn't just retrieve based on the user's previous input.
It can do so as it's currently generating the output and insert that into the output as it's going on. That is a good benefit of, you know, maintaining lower level access to APIs as opposed to just going back to the cloud for everything, because you need the ability to stop token generation, run the plugin, insert that into the output, and then continue the bot output, in addition to doing things like token healing or implementing guardrails and guidance functions, things like that. For all of that it is very good to have granular access to the token generation, so you can stop it right when you need to and then restart it completely asynchronously, so you don't destroy all of the low latency pipelining that you have. This is just a basic example here of when I was toying around with this to see if it actually could generate it. And this was just with Llama 2 7B using its baked-in code generation abilities. And this is like basic stuff for it. I do recommend using JSON format, even though it's more verbose and results in more tokens, especially if you have functions that have multiple parameters, because JSON allows it to keep the order of the parameters straight, so it doesn't confuse them.
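A minimal sketch of the inline plugin idea with JSON calls follows. The call format (`{"func": ..., "args": {...}}`), the `PLUGINS` table, and both example plugins are assumptions made up for illustration. For simplicity this version post-processes a token stream; a real pipeline would pause generation at the call, run the plugin, feed the result back into the context, and then resume.

```python
import json

# Hypothetical plugin table; named arguments avoid any parameter-order confusion.
PLUGINS = {
    "add": lambda a, b: a + b,
    "turn": lambda direction, degrees: f"turned {direction} by {degrees} degrees",
}


def _dispatch(raw):
    """Run one JSON-encoded call; on any problem, leave the text untouched."""
    try:
        call = json.loads(raw)
        return str(PLUGINS[call["func"]](**call.get("args", {})))
    except (ValueError, KeyError, TypeError):
        return raw


def run_inline_plugins(token_stream):
    """Copy tokens to the output, replacing each {...} call with its result.

    Limitation of this sketch: braces inside JSON strings are not handled.
    """
    output, call_buf, depth = [], [], 0
    for token in token_stream:
        for ch in token:
            if ch == "{":
                depth += 1
            if depth > 0:
                call_buf.append(ch)
                if ch == "}":
                    depth -= 1
                    if depth == 0:
                        output.append(_dispatch("".join(call_buf)))
                        call_buf.clear()
            else:
                output.append(ch)
    return "".join(output)


tokens = ["The sum is ", '{"func": "add", ', '"args": {"a": 2, "b": 3}}', "."]
print(run_inline_plugins(tokens))
```

Because the call can arrive split across several tokens, the scanner buffers characters until the braces balance before attempting to parse anything.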
If you don't really have parameters, or they're just very simple, like the example shown here, you can just do a simple Python API. And there are other open-source plugin and agent frameworks that you can use, like LangChain and Hugging Face Agents and Microsoft JARVIS, that do this as well. They're not quite as geared for low latency and low overhead, which is why I've gone and done this, but at the same time those can do very similar things too. In fact, the Hugging Face Agents API has the LLM generate Python code directly, which it then runs in a limited sandboxed interpreter. So it's able to interface directly with Python APIs, which is really cool and is not hard to do.
In this case, I prefer just to keep it JSON or text and manually parse that and call the plugins. So, you know, it's not having full access to Python. So I've mentioned retrieval augmented generation a few times, and that is big not only in enterprise applications where you want to index a huge amount of documents or PDFs and be able to have the LLM query against those, because remember the context length is a max of 4096 with Llama 2, or there are a lot of, you know, rotary position encodings that go up to 16K or 32K or even more. But there's always a limit. And you might have hundreds of thousands of pages of documentation that you want to index against. And when we start talking about multimodal, you can have huge databases of thousands or millions of images and video that you want to index against.
And not all of that can be included. So what happens is basically you take the user's input query and search your vector database for that. It uses very similar technologies to LLMs and the ViT encodings. In fact, CLIP embeddings are used in a demo that I'll show you next. But it essentially uses similarity search to determine what objects in the database are closest to your query. And it's a very similar concept to the multimodal embedding spaces and how that all works. And there are some very fast libraries out there
for this called FAISS and RAPIDS RAFT that are able to index like billions of records and retrieve them lightning fast based on your queries. And those are very good libraries to use. And I've used those on Jetson here to do a multimodal image search vector database. I basically made this demo to prove out the abilities of the CLIP transformer encoder and just to be able to understand what I could actually query for retrieval augmented generation before integrating that into the LLMs. So you can see here, you can not only query with text, but you can do image searches as well.
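The core of similarity search fits in a few lines. This brute-force sketch is only meant to show the idea and is not how the libraries named above work internally; they use optimized indexes and GPU kernels to scale to far more records. The `TinyVectorDB` class and the three-element embeddings are made up for illustration, whereas real embeddings would come from an encoder and have hundreds of dimensions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


class TinyVectorDB:
    """Brute-force cosine similarity search over stored embeddings."""

    def __init__(self):
        self.items = []   # list of (key, unit-length embedding)

    def add(self, key, embedding):
        self.items.append((key, normalize(embedding)))

    def search(self, query, k=3):
        q = normalize(query)
        scored = [(sum(a * b for a, b in zip(q, emb)), key) for key, emb in self.items]
        scored.sort(reverse=True)
        return [(key, score) for score, key in scored[:k]]


db = TinyVectorDB()
db.add("dog photo", [0.9, 0.1, 0.0])
db.add("cat photo", [0.1, 0.9, 0.0])
db.add("car photo", [0.0, 0.1, 0.9])

# The query embedding could come from either a text prompt or another image.
for key, score in db.search([1.0, 0.2, 0.0], k=2):
    print(f"{key}: {score:.3f}")
```

Because text and images share one embedding space, the same `search` call serves both text queries and image queries.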
And it's pretty advanced image search, and it's completely real-time, with real-time refresh here running on Jetson. This is indexed on the MS COCO dataset, 275K images from MS COCO. That took about, I think it was like five or six hours to index the whole thing. But the actual retrieval search only takes on the
order of like 10 or 20 milliseconds, which means it's not going to add lag to your LLM generation pipeline, which is very important because we don't really want more than a couple seconds of lag between the user's query and the response, especially if it's an ongoing verbal conversation. So here's a chart of the retrieval augmented generation search time based on how many items are in your database. And some of these databases can get really huge, especially with like corporate documents, things like that.
At the edge, I think it'll be smaller because you only have so much space available on the device. But you can see here, it's just on the order of milliseconds for most people's applications. And I also break out different embedding dimensions here too.
So some of the higher end embeddings like ImageBind represent every single image or text as a 1024 element vector that describes it in this multi-dimensional embedding space. CLIP uses 512. CLIP large uses 768. So that was the one shown in the demo here. So this scales very well. I think it would be pretty rare that you would get up to 10 million entries on an embedded device like Jetson. But hey, if you're doing lots of data aggregation, with 30 HD camera streams coming in, it's entirely possible. And it's still only on the order of a fifth of a second or less to do all that, which is completely reasonable. All right, so tying it all together with Riva,
to explain how I actually made these demos with the text-to-speech: Riva is an awesome SDK that's openly available from NVIDIA that incorporates state-of-the-art audio transformers that we've trained along with TensorRT acceleration. And it's completely streaming. It does streaming ASR and TTS. You can do 18 ASR streams in real time or 57 TTS streams in real time on Jetson AGX Orin, and nobody's really going to do that many streams at the edge unless you have like multi-microphone devices or a setup like that. But what it does mean is when you're only doing one stream, you're going to take like less than 10% of the GPU to be running all that, which is great because that means that our LLM token generation rate is only going to go down by less than 10% there. Because the LLMs, they will consume 100% of the GPU, all that you throw at them.
And Riva has lots of different ASR and TTS models that it ships. It also has neural machine translation, and I've seen some people do some really cool demos with this where you can do live translation between different languages. And it turns out that a lot of the LLMs like Llama are trained in English. There are some LLMs out there that are multilingual, but if you're working with an LLM that's trained in English and you want to be able to converse in other languages, you can use neural machine translation in your pipeline to essentially translate between the ASR and the LLM, and then the LLM output back into the language that you want before it goes to the TTS. Another really cool thing that Riva has is what's called SSML expressions for TTS. So you can speed up or slow down or change the pitch or add in like emojis or laughs or all types of cool things to make the voice sound more realistic. And overall it sounds really good for just being done on an edge device locally.
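For a sense of what an SSML expression looks like, the sketch below wraps text in the standard `prosody` element to adjust speaking rate and pitch. It only builds the markup string and does not call any TTS service. The `to_ssml` helper is hypothetical, and the specific attribute values shown are assumptions that should be checked against the documentation of whichever TTS engine you send them to.

```python
from xml.sax.saxutils import escape, quoteattr


def to_ssml(text, rate=None, pitch=None):
    """Wrap text in <speak>, adding a <prosody> element when rate or pitch is set."""
    attrs = ""
    if rate is not None:
        attrs += f" rate={quoteattr(rate)}"
    if pitch is not None:
        attrs += f" pitch={quoteattr(pitch)}"

    body = escape(text)   # keep characters like & and < from breaking the markup
    if attrs:
        body = f"<prosody{attrs}>{body}</prosody>"
    return f"<speak>{body}</speak>"


# Example values are illustrative only.
print(to_ssml("Hello from the edge!", rate="120%"))
print(to_ssml("Let me think about that.", rate="90%", pitch="low"))
```

Escaping matters here because LLM output is arbitrary text, and a stray `<` or `&` would otherwise produce invalid markup for the synthesizer.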
None of the demos that I've shown you so far rely on any cloud compute or off-board compute whatsoever. You can run these entirely without any internet connection once you download the containers or have your application built. Here is a pipeline block diagram of essentially what the interactive verbal chat management looks like. And it turns out there's a lot of nuance to live chat back and forth, mainly the ability to interrupt the LLM while it's outputting. We know these LLMs, they like to talk and they will just keep going.
And you can instruct them to be very concise in their output, but in general, they will like to ramble on a little bit. And it's important, like in the video, to be able to speak over them and have the LLM either resume, if it turns out you didn't want to query it, or, you know, stop itself when you ask it another question. And the best way that I found to do this is a multi-threaded asynchronous model where there's just a bunch of queues, everything is going into these queues and being processed, and you need the ability to interrupt and clear those queues based on things that happen. So for example, the Riva ASR, it outputs what's called the partial transcript as you're talking. Those are in the videos where you see the little bubbles come up, and it's always changing because it's constantly refining with beam search what it thinks you said. But then when you get to the end of a sentence, it does what's called the final transcript.
And that final transcript is what actually gets submitted to the LLM. But if the partial transcript starts rolling in and you speak more than like a handful of words, it will pause the LLM. And if it doesn't get any more transcript in like a second or two, the LLM can resume speaking. If it does go final, if the ASR transcript does
go final, then that previous LLM response is canceled and a new one is put in its place, which is important because you don't want it to keep spending time generating an old one when you are already on to asking your next question. So it turns out there is a lot of nuance in here, and we tried to do it with the fewest heuristics possible because that just lends itself to corner cases, and in general it's really nice and fun to be able to chat back and forth with these models. And I really highly encourage you to go on to the Jetson AI Lab, download these containers, start playing around with the different models, discover their personalities, and go from there to build your own applications. And I think before long, we'll see these out there on robots in real-world embedded systems. So yeah, let's all do it together.
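The interruption rules described a moment ago (pause on a long enough partial transcript, resume after a quiet gap, cancel and replace on a final transcript) can be sketched as a small state machine. This is not the demo's actual code: the class, its method names, and both thresholds are assumptions, with "a handful of words" taken as four and "a second or two" taken as 1.5 seconds. Timestamps are passed in explicitly so the logic is easy to test.

```python
class ChatInterrupter:
    """Tracks whether the bot should be speaking, paused, or restarted."""

    def __init__(self, min_words=4, resume_after=1.5):
        self.min_words = min_words          # assumed "handful of words"
        self.resume_after = resume_after    # assumed quiet gap, in seconds
        self.state = "idle"                 # "idle", "speaking", or "paused"
        self.current_query = None
        self.cancelled = []                 # queries whose replies were dropped
        self.last_partial_time = None

    def start_reply(self, query):
        self.current_query = query
        self.state = "speaking"

    def on_partial(self, transcript, now):
        """The user is talking: pause the bot once enough words have come in."""
        if self.state == "idle":
            return
        long_enough = len(transcript.split()) >= self.min_words
        if self.state == "paused" or long_enough:
            self.state = "paused"
            self.last_partial_time = now

    def on_final(self, transcript):
        """The user finished a sentence: drop the old reply, start a new one."""
        if self.state != "idle":
            self.cancelled.append(self.current_query)
        self.start_reply(transcript)

    def tick(self, now):
        """Called periodically: resume if the user has gone quiet."""
        if self.state == "paused" and now - self.last_partial_time >= self.resume_after:
            self.state = "speaking"


chat = ChatInterrupter()
chat.start_reply("what fruits do you see")
chat.on_partial("um", now=0.0)                      # too short, keep speaking
chat.on_partial("wait hold on a second", now=0.5)   # long enough, pause
chat.tick(now=2.5)                                  # quiet for 2 s, resume
print(chat.state)
```

In a threaded pipeline, the `paused` and cancelled transitions are also where the audio and token queues would be held or cleared.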
If you need any help or support at any time, I'm always available on GitHub, the forums, LinkedIn, my email, DustinF at NVIDIA.com, and we're all out there to help each other and keep it going. So yeah, it's great to be part of this community. And thanks again so much for joining us today. At this point, we're going to get set up for Q&A. So if you haven't already, please feel free to type in your questions. If you are watching a replay of this, yeah, feel free to ask questions another way on the forums or GitHub or anywhere.
All right. Thanks.