Google AI Release Notes - Smaller, Faster, Cheaper & The Story of Flash 8B


LOGAN KILPATRICK: Hey everyone, and welcome to "Release Notes," a new podcast that takes you behind the scenes of Google AI. I'm Logan Kilpatrick, and each episode, I'll be picking the brains of a bunch of the people behind some of the latest innovations, models, and new products coming out of Google. [MUSIC PLAYING] Ema, I'm super excited for this conversation.

You're one of my favorite people to work with at Google, mostly because you love shipping stuff, moving quickly, all that good stuff. Do you want to kick it off with just a super quick intro of what you've worked on at Google, who you are, and also today, what you're focused on? EMANUEL TAROPA: Sure. I started working at Google as an intern 19 years ago.

I worked on most things the company runs on, from file systems, ads, search backend, search quality, Vertex AI or what was before Cloud AI, Cloud Vision, news, real-time detection of events, and more recently, on everything involving large language models and generative AI launches, from the initial Bard launch all the way to Gemini 8B. And I worked on all aspects of the system, from being on call for training runs to working very directly on serving them efficiently in production, to making changes to the internal transformer architecture like quantization, model sharding and so on and so forth. And if you want to see more, you can go to LinkedIn. LOGAN KILPATRICK: I love that. No, that actually is super, super helpful context because I think it speaks to, and maybe we'll push into this later on in the conversation. But part of my takeaway of why you're so effective inside of Google is actually because you have all of that context.

You've built a lot of the infrastructure and someone might say, this is going to take six weeks or whatever. And then you're always like, no, I built that part of the system. It should only take-- like shipping this SKU config change should take 12 hours because I actually built the infrastructure to do it.

So I feel like that is one of your internal superpowers, is you have that context and you know how a bunch of these systems work. [MUSIC PLAYING] I'm curious what your broad take is on where we are with Gemini today versus a year ago, when at this point, Gemini hadn't actually even been announced yet? Have things sort of tracked the way that you would have expected from a traction standpoint? Do you think we're like finally getting the machine that makes the model spun up, all of that stuff? I'm curious where your head is. EMANUEL TAROPA: You're asking the wrong person about planning. You're very kind before.

Oh, because you worked on it, you kind of know how long it takes. My take on it is I'm overly optimistic and always late, but then somehow, it ships ahead of schedule. I mean, last year, where we were, we were in the same micro kitchen here in this sort of den-like-looking building and working very closely with a bunch of folks from our London office that were very nice and came here and worked with us. And then we did the same in January this year.

And then critical infrastructure things weren't really working and we were launching. So I was uncomfortably excited, to quote someone really famous. But then somehow, it all worked, and then we ended up having a really, really good sort of launch in December last year. And then we were able to follow very quickly in February and then March, and then sort of I/O and then Astra. And then we keep having surprises in store, so that's actually pretty awesome.

As for where we are now: I mean, again, it's a very biased opinion, because I'm in the middle of it. So I work with all of these folks and at this point, they're kind of my extended family. So again, very biased answer. So I do like where it's going.

And yeah, I'm pretty bullish. If there's one thing we did fairly well as a company, it was starting, and then just incrementally launching and making the system better. If you look at how we started with Universal Search in 2006, or Sidekick, which is search as you type, which people outside know as Suggest and stuff like that. Again, people are like, “it will never work. Oh my God, you're going to use so much compute power or so many resources.

How would you make it work?” And I haven't met an infrastructure problem we weren't able to crack. So I am pretty bullish. [MUSIC PLAYING] LOGAN KILPATRICK: I'm really interested to get your take on this system versus model situation of how much value in the future is actually going to be created by the, quote unquote, model weights themselves versus the actual system that surrounds the model? And I think my take, at least from joining Google, is like, there's actually an almost near-infinite amount of value that you can create by building the system and the infrastructure around the models.

And I feel like some of the-- I don't know what your perspective is on this, but it feels like long context is almost-- there is model innovation to enable long context, but it's actually an infrastructure problem to make it work at scale and reliably and at a cost that developers are excited about. EMANUEL TAROPA: Yeah, I mean, right on all accounts. It wasn't an easy infrastructure launch. I still am not terribly happy with how everything worked out, and we do have kinks to iron out, especially on cost model for caching, the interface to the Cache API and so on and so forth.

You know them very well, including the finer details, like kind of the examples we show, how well they work, how easy it is for people to understand. So we do realize we have a little bit of work to be done there, to be generous with the characterization. But I do think that infrastructure behind it is actually pretty solid.

And we will do what we do, which is gradually improve it. So we don't do a launch and then not do a follow-up. So I think it's very important to get good infrastructure in place. That being said, infrastructure by itself, without a really strong model, is a nice engineering accomplishment, but it doesn't really serve anyone. So users should get real value from this.

The real value comes from what the models do and how well they're integrated with the rest of our offerings, from Workspace apps, all the way to search and then Gemini. LOGAN KILPATRICK: Yeah, I completely agree. EMANUEL TAROPA: So I think both are equally important.

LOGAN KILPATRICK: Yeah, I think you're spot on. One other quick, maybe slightly controversial question before we talk about Flash-8B stuff, which I'm super excited to dive more into the details of. Back to this narrative of where we are with the machine that trains the model. I'm curious if there was one thing that we could just flip a switch and change about how the model process, like from pre-training to post-training, to all the infrastructure work happens.

If we could just make one of these things work, it would be a huge unlock for us getting models out the door, or even it could be an external-facing thing, as far as one feature that you think would really sort of unlock a lot more use cases that just isn't there yet from a capability standpoint. EMANUEL TAROPA: Yeah, that's a fair question. Look, again, can it be done better? Yes, absolutely. Can we do it without an excessive amount of time spent on details and getting stuff right and making sure it runs in production? No, not with how quickly the entire field is moving. And this is a very exciting space to be in, and like-minded people gravitate toward each other.

And that's basically the theme, right? But, if I were to fix one thing, I would be a little bit more bold with what we want to put out in our products and then explain it very clearly, like, hey, this is experimental, and we're doing absolutely everything humanly possible on our end to come up with something really useful for the end users. And then maybe apologizing a little bit in advance if there are imperfections to it. And I do think that's a very tricky balance to achieve, but it's something that we should probably do better going forward. As opposed to letting launches sit for a little bit too long.

Kind of being a bit faster with them to market. And then also, taking some informed disclaimers around them, and then polishing up a little bit the failure cases for them. LOGAN KILPATRICK: Yeah, I agree with you. I think it's also extremely tricky to sort of strike that balance in the developer world, because in a lot of cases, we're telling people, hey, actually go and build your company on this thing. But I do agree, I think there's a balance, and we've probably teetered. We're always teetering somewhere, and I think somewhere in the wrong direction, whether it's moving slower or faster, and constantly needing to reset ourselves.

So I hear you on that. EMANUEL TAROPA: Yeah, I mean like, it's complex. It's a global consumer company.

It has a huge responsibility. Like it or not, it will be a double standard, what we do versus what some other shop does. So all of these have to be considered. This is not finding excuses about why certain things take longer than desired to hit the market.

But it's just like, we're more than three people working on this thing, so we should probably be doing it faster. That's basically the takeaway. [MUSIC PLAYING] LOGAN KILPATRICK: So, Flash-8B. Developers are really excited. We landed this model after a ton of work internally, by you and by all of the GDM teams who were training the model, by everyone else across Cloud who helped get this model out the door.

Why did we do this? What was the point of us releasing this model? You were a super strong advocate. I think if it hadn't been for you pushing in all those channels that we had, we likely would not have launched this model. So where did the conviction come from? Why did we make this model to begin with? Tell us part of that story. EMANUEL TAROPA: Yeah, sure.

I mean, it started literally when we were doing the long context launch in January, February this year, and all of us were focusing on Pro, and how do we get 1.5 Pro to be the very best model out there? And then work on this capability. And then at that point, folks weren't even considering whether we should do something like Flash. But I was like, no, no, no, hang on.

Let's do Flash because this brings a lot of value and it's going to be a super strong model. And there are clear use cases for which Pro will be used. So that has a very clear market definition or market segment that really wants that particular model.

There are also very high volume use cases for which a smaller, cheaper model that's also extremely capable on those particular types of queries works very well. So then we did Flash, and then the next thing was like, well, I really want to be able to experiment with like post-training or experiment in very large QPS settings, something that's extremely simple. I alluded to this before, things that are coupled very directly in the retrieval flow or in the main system that does the inverted index search and sorting of our results. Can we actually start deploying models at that end of the spectrum? So that's basically where if you look at it, you go, well, this is a very interesting SKU. And it gives us a lot of flexibility from deploying it to users' devices, which was the intended target, to actually running it internally for data center workloads that really are hundreds of thousands of queries per second or more.

So then, from there on to hey, I wonder if it does the long context trick as well, and how good is it compared to the bigger one? Then it becomes an interesting engineering challenge. So then, a couple of people, we worked on it and we kind of got the initial configurations for it done in March. And then as you very nicely pointed out, it took a little bit of time to figure out how we wanted to place it, how we wanted to launch it. Which goes back to the same topic, that we should probably launch things a little bit faster going forward. Yeah, so that's basically where it came from.

Made sense. And then with very little marketing, it seems it's very used. So I guess, yeah. There you go.

Yay, validation. LOGAN KILPATRICK: I love that. Why-- was it just like the natural obvious thing to do to make an 8 billion parameter model? I had asked on Twitter if people had questions about this conversation and the story for 8B, and there were some questions about-- a lot of companies have 8 billion parameter models. Is there like something natural that sort of happens in the model training process at 8 billion parameters, or do you have an intuition behind why 8B seems to work reasonably well across the ecosystem? EMANUEL TAROPA: Again, you're asking the wrong person. Last year, I said, “hey, nobody's thinking that we should be putting LLMs on phones, right?” So we started the Nano series at Google. As we have a very strong ecosystem, we want to do something both for developers and for our first party partners.

So then basically, sort of like, hey, how do we take this one step further, where folks maybe haven't considered that we can actually do it on phones? And then have something that rather than serving a few hundred thousand or so requests per second, could be distributed to 2 billion users pretty much instantly. So that was the motivation for doing these smaller on-device models. We initially started GemView on device.

Initially, we had something based on the ULM architecture. But then we're like, well, we kind of already have Gemini, let's just do Gemini. Let's push this forward.

And then you have 4B and then you have 8B, so these are very natural evolutions because the device capabilities also change. And changing the hardware specifications for the phones, I mean, that will take a while. So like, well, we change the domain as we go along. We also change the type of models we're building.

So we get a lot of experience on the compressed-model end of the spectrum, if you want, with techniques for serving them efficiently on something like high-end phones. But then we also look at them and then figure out that, oh, wow, these are actually useful in some shape or form for the internal workloads we have as well, and with the usual tricks that come with them, like routing queries between models, cascading serving, and so on and so forth. So if you say, hey, did you really plan this 8B two years ago, and did you have the plan to get there? My answer is a very quick and short no. But then things move pretty quickly and then it seemed like a good point in time, so then we did it. And then other products, like the Gemini app, Overviews, Workspace and so on and so forth, found value in this particular SKU. And yes, we continue building them.

May or may not be 8B going forward. [MUSIC PLAYING] LOGAN KILPATRICK: One of the things that was impressive about this launch sequence was going back to this narrative of you pushing to get this thing out the door. The memory that sticks in my mind was just the weekly post in the temp Flash-8B channel, where you were like, are we shipping it this week? What's happening? What's getting this out the door? What's the blocker? I'm curious, maybe less specific to this launch, or we can use this launch as an anecdotal example, I think that skill of navigating the complexity of organizations to get things out the door is something that you innately have. Do you have advice for people, even just for me personally, as somebody who's newer to Google, about what has historically worked well to do? Is it like finding the right coalition of people, which I know there's a ton of people inside GDM and other places inside of Google who believed in 8B.

Do you have any sort of advice from that perspective? EMANUEL TAROPA: Yeah, I mean, like, good karma. I worked with a lot of the people for a long time. So that's one.

It's something like, you don't leave people hanging out to dry and they know they can launch stuff with you. So street cred, I guess. That's one, right? And then the second is just work at full bandwidth.

I mean, nothing off the table, basically. And then I think people respond really well, and that makes everything work pretty efficiently, even if I'm annoying sometimes, as I was. Which is, hey, why aren't we shipping? Why do we have so many people involved in saying yea or nay here? And things like that. But I do think in all transparency, it is good to call out inefficiencies both at the system level as well as the organization level.

And blameless postmortems, it's a thing we have as a company. And then use that to incrementally improve what we do. It's the thing that has worked fairly well over the past. I mean, I've known it for 19 years, and they'd done it for quite some time before I joined as an intern. And it seems like it's somewhat unique to us.

And barring that, go on a run, which we never did. LOGAN KILPATRICK: Yeah, I'm sorry. EMANUEL TAROPA: Go on a hike and a lot of imperfections get sorted out like that as well, or a long train. I mean, it's common practice, I guess.

LOGAN KILPATRICK: Yeah, when I posted that we were-- I don't know if this was Jeff who posted this, but something about the thousands of cappuccino or espresso shots that he's had with you in the blue micro kitchen over the years. And I feel like that actually makes a ton of sense. At the end of the day, I think people believe what you say, which is part of the reason why you're so effective at getting this stuff out the door. But you have to earn that credit of being someone who people respect and want to listen to.

So it's wonderful that you've earned that. Back on the narrative of getting Flash-8B out the door, you made a comment before about long context for 8B, and I think this is actually one of the most surprising things to me. And I don't know if this is just that the fundamental architecture of Gemini is such that all of the capabilities sort of carry down between models. And without sort of giving away the secret sauce of how we make the magic happen at Google, how is it possible that the model can do long context? And the answer might be we can't talk about it, but I feel like that's the most impressive thing, is that the model, even at 8 billion parameters, can do a million-token context length and actually perform pretty reasonably well on those queries. EMANUEL TAROPA: I mean, it's magic.

[LAUGHS] But we do have a fairly good set of principles that we do model architecture design and development on. We insist quite a bit that they are respected at all sizes of the models. This is not something unknown in the industry, like the scaling ladder and things like that. And then, yes, there will be capability differences between them, as the model capacity increases.

But some of them, we are able to make these transfers from larger models to the smaller ones. So I would say it's a combination of techniques that we use. And in the 8B case, we had a fairly decent spot of model capacity and transferable learnings from larger models, so we could pull this off fairly decently.

LOGAN KILPATRICK: That makes sense. EMANUEL TAROPA: So we don't develop each model in isolation, basically, from everything else, because then that will not be a recipe for success in my opinion. LOGAN KILPATRICK: Yeah, I agree with you. And I actually think a bunch of the other questions that folks asked in the comments when I posted the tweet about this conversation were around whether or not any of the innovation from 8B and things like that carries over to the Gemma models. And just as a random side comment, that is the value that Google is offering, which is we take all the learnings from scaling these models on the Gemini side, and a lot of that same innovation ends up in the Gemma open source models, which is awesome to see. [MUSIC PLAYING] Part of this, again, back to this narrative of getting this model out the door.

You were originally, and not to out you on this, but you were pushing to make this model free. The price point ended up being super reasonable from a developer perspective. But I actually think, sort of independent of the outcome of the price for the specific Flash-8B model that we've released, there's perhaps a trend that you are sort of underscoring with the push to make things free, which is that the compute cost to run these models is coming down, especially as more and more intelligence is able to be compressed into smaller models.

The sort of efficiency of scale is kicking in across serving these models, training these models. And ultimately, actually, if you look at a lot of other technologies, people aren't required to-- like, when I go use CPUs across the hundreds of different software products that I use every day, I'm actually not usually bearing a lot of that cost. At scale, CPUs are offered for free across hundreds of millions of products everywhere. And like, it actually would make sense that that same thing will happen with potentially small models, small AI models on GPUs, TPUs, et cetera.

Does that capture your point of view or am I missing some piece of it? EMANUEL TAROPA: It does. One of the very exciting things over the past two years, two and a half years, of me working flat out in this area was to see the cost compression. Or it's best to look at it as how much quality can you get in a particular energy footprint, which is how much energy do you need to serve a particular level of quality to the end user? And it's been amazing. It's been quite a ride. I saw it when we were really scaling our index from a few billion documents to hundreds of billions of documents 18 years ago. And then, I've seen it now.

So given the trend and given everything that I witnessed very directly by virtue of working on all of the details of these launches, the next thing is giving this to as many people and developers as possible for basically as low a cost as possible. So you can't beat free. So we can start there. I guess you could pay people to use your stuff, but maybe we don't get into that. But you can't beat free, right? So it massively allows people to understand what the technology does with very little commitment on their end, and you just rely on their judgment.

How do you stack up against everything else that's in the field? And it's an exciting field because there's rabid competition here, which is really, really nice. So allowing people to make a very low commitment judgment call between you and your competition, I think, is invaluable. So we should strive to do that.

And the other one is we have long context. Having a very cheap-- again, you can't beat free-- way of playing with long context and seeing what it does, and trying novel integrations with it, there will be glimmers of what it can do. There will be glimmers of, hey, this actually works really well for me, and this is how I built, like my app or my company.

And great. We've helped somebody. So that satisfaction is priceless. The other one is, hey, it kind of works on all of these cases.

I really need the more capable model and longer context for these other cases. And at that point, we also help the business because that will generate revenue. Again, that's basically what sort of drives all of us forward, is this balance between user satisfaction and the company actually running. LOGAN KILPATRICK: Yeah.

EMANUEL TAROPA: So that's basically why I'm pushing for like, hey, we've seen the cost compression. I very directly worked on orders of magnitude improvements in efficiency over the past two years. Let's just make this obvious to our end users as well, as much as possible. LOGAN KILPATRICK: Yeah, this is my obligatory insert that Google AI Studio is free. You can try the latest sort of generation models, all for free.

There is no cost. There's no payment to use AI Studio. You can try the long context window, the 1 million, 2 million context window. It's all free.

Same thing with the API. Most generous API free tier. 1,500 requests per day.
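(For readers who want to kick the tires themselves, here is a minimal sketch of what a free-tier call might look like, assuming the google-generativeai Python SDK and the gemini-1.5-flash-8b model ID; check the current docs for exact package and model names.)

    import google.generativeai as genai

    # Assumes a free-tier API key created in Google AI Studio.
    genai.configure(api_key="YOUR_API_KEY")

    # Flash-8B model ID as of this episode; newer releases may use different names.
    model = genai.GenerativeModel("gemini-1.5-flash-8b")

    # Long-context use: the whole document goes straight into the prompt.
    with open("big_document.txt") as f:
        document = f.read()

    response = model.generate_content("Summarize the following document:\n\n" + document)
    print(response.text)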

You can use long context with Flash-8B and actually kick the tires on seeing what this model is capable of. And I think removing that economic barrier for developers to get started is, to your point, super, super important, because you get to that moment of actually realizing that long context and models being natively multimodal and all the other innovation that's coming actually does unlock all these super, super interesting use cases that you actually can't build with other models in other ecosystems. So hopefully, we're at least doing a little bit of justice to this vision that you have. [MUSIC PLAYING] Do you think there's an opportunity to go even smaller? That was another question that came up. Could we do a 4B model and serve it at scale? Does the trend of continuing to compress knowledge even smaller break down at a certain point? EMANUEL TAROPA: I mean, we did.

The previous generation was kind of around 4B. I'm not sure to what extent this is known versus not known, but we kind of did do that. We did use that model very extensively internally as well. We found very novel use cases for it, which were like actually surprising, some of them, to me, as well. Like, “hey, can it do this?” Yes. Yes, it can very well.

So it simplified a lot of our stack before. For instance, for examining the generated responses. And I'm not going to say more, but rather than using 10 or 20 different systems, we can use one of these models.

And when you cross all the Ts, dot all the Is, the cost ends up being significantly lower than sort of a very sprawling infrastructure deployment to run all of these almost sidecar-like systems that we need to in order to ensure a particular quality and format of the response to the users. So yeah, we did see it. We are continuing to press on this. We shouldn't aim at having LLMs only on high-end phones, only for the people that can afford these phones. Again, sort of back to two minutes and 30 seconds ago, we should put this in the hands of as many people as possible for-- you can't beat free, but close to free.

So we strongly believe in the smaller model scale. This is not just, hey, we're only going to release tiny models. They have to come from somewhere, and the breakthroughs are very often made at the larger end of the spectrum. So like the Ultras, the Pros, lots and lots of compute intensity spent on the largest models actually transfers pretty decently to the smaller scale.

[MUSIC PLAYING] LOGAN KILPATRICK: You were a proponent of the Flash-8B name. I'm curious. I've seen a bunch of feedback that people are confused about the name, not to solely put that on you. I think the decision was sort of legacy from the Gemini 1.5 technical report of us using that name.

You also report to Jeff, whose sort of side hustle is supposedly coming up with good names for stuff. So I'm curious, if we hadn't called it Flash-8B, what is the name that you would have suggested we go with? EMANUEL TAROPA: I mean, I suggested Flash-8B. It was there for the taking and I was like, oh my God, we're going to spend a month debating the name, which more or less happened, and I'm not going to give more details here. But then we ended up with Flash-8B. Yay. Good success.

And then, what if we do one that's larger? Well, then, we can call it still 8B. Just partly joking, or maybe not joking. But I think it's a good name, right? A name is a name, right? We should use more of our models, in my opinion, to generate the names for the stuff that we want to ship out. That would be pretty neat.

Or we can use the date that we first put it into AI Studio. Generate a little competition across our products as well. Who ships fastest, and then they get to name it.

Imagine this, Gemini, I don't know, like 01-01-35, basically, for a model that we're going to release in the year 2035 or 3035, or something like that. So that's all the insight into the Flash-8B one. We had Flash, and then, what do you want to call this? Spark, Swift, Mini Flash, Mini Spark.

LOGAN KILPATRICK: You were pro calling it Sparky. I vividly remember you saying that we should call it Sparky. EMANUEL TAROPA: I threw a bunch of stuff on the wall. See what sticks. There you go.

LOGAN KILPATRICK: Yeah. EMANUEL TAROPA: Sorry. LOGAN KILPATRICK: I do like the idea that the team that ships the model first gets to decide the name. I feel like that would set up some interesting incentives for teams to move quickly.

I feel like maybe we'll adopt that with the forthcoming model launches. EMANUEL TAROPA: Yeah, that would be pretty neat, right? You get to name it with your project and the date that it's shipped, and it's there for people to see who shipped it first, right? Who took a bet on it, basically. Good or bad, right? I think this will focus quite a lot of attention to details across all of the products shipping stuff, because A, they'll get their name there. B, it should be a good launch. LOGAN KILPATRICK: This is wonderful. Just before I go to a bunch of other rapid fire questions and then get your closing thoughts, anything else from the Flash-8B narrative or story that rings in your mind? Any other hopeful thoughts for how people will use the model? Anything like that to close with? EMANUEL TAROPA: Yeah, I think one word really sticks with me, just, sort of, relentless.

Constant pressure, basically. I mean, you know it very well. On the flight to London in June, sending out these serving configs and the initial optimizations from it, while landing and while getting side glances from the cabin crew saying, hey, you really need to stow your laptop, and I'm like, the wireless still works. So these kinds of things, I mean, they're making it really fun to work here, and they also help a lot with the clip of launches. And while I am all for long term planning in particular sectors of activity, some of them move a lot faster than others, like this particular gen AI thing.

And I do think it's an exciting place to be in. So relentless, constant pressure in shipping stuff out very, very quickly. This really drives the innovation engine. So sometimes breakthroughs can be made completely independently from the main pressures of productionizing and shipping. But oftentimes, I've seen in my career, that the biggest breakthroughs are done in pressure cooker-like environments.

So I'm really looking forward to the next line of models. LOGAN KILPATRICK: So am I. I feel like the world is looking forward to it. I also think that-- I wouldn't describe it as the pressure cooker environment, though, I don't sit there every day, but I feel like the blue micro kitchen and Gradient Canopy, for anyone in DeepMind or folks who are familiar, is, I think, one of the most magical places right now.

There's just so many interesting people who are actually building the frontier. And it feels like there's something special in the espresso machine that sits there, that's sort of fueling that. That espresso machine is having an outsized impact on world GDP as far as keeping researchers and everyone fully caffeinated. EMANUEL TAROPA: And full disclosure, we did carry it from 2008, when we installed it outside, I think at that point, Marissa's office in building 43. We carried it with us across the many buildings we've been to. It's a bit jaded now, if you look at the sides of it.

So it's an older model but still, it performs really well and we love it. And yeah, it's intense, fun and intense, sort of, uncomfortably excited, I guess, all of us. That, again, describes it pretty well. I mean, I don't know how else to place it, but it can be fun and generate results at the same time and be nonstop, which is what it is. [MUSIC PLAYING] LOGAN KILPATRICK: Ema, I'm going to go to a bunch of rapid fire questions that people sent in from the audience.

So hopefully, we can get some rapid fire answers from you as well. One of the questions was, will some of the same techniques that we used for Flash-8B be used for Gemma? We talked about that. That's the general philosophy. The second question was Gemini 2.0. Anything you can tell us about Gemini 2.0?

EMANUEL TAROPA: Yeah. So on the Gemma side, yes, we were in Paris like two or three weeks ago, working very closely with the Gemma team. Again, this is like, hey, let's make this accessible to as many users as possible.

We have to do this in a way where not too much from what we're doing internally on proprietary models gets released too early, and things like that. But we do believe very firmly in having a very strong Gemma offering, and we do see that helping the ecosystem. And we're very intentional about how we want to do it, the type of changes we want to release. So yes, it's a resounding yes, absolutely. We're working very closely.

It's basically part of one team. And then we want to make sure that we have very good open source models as well, that developers can get insights into, like, hey, what else should we try? If nothing else, that brings everyone along. On 2.0, really good models and a ton

of work in serving them efficiently, which is what we're doing now. But I'll just leave it at that. LOGAN KILPATRICK: I love it. Another question that came in: is 8B a mixture-of-experts model? I don't know if we have revealed any of these details in the technical report, so we can punt this. EMANUEL TAROPA: I don't think we have and I don't think we're going to start now.

So then we can move to the next one. LOGAN KILPATRICK: I love it. Another question, why is there no 002 version of the 8B model? We shipped an 002 version of the latest Flash and Pro models, and this was one of the questions that came in. EMANUEL TAROPA: I know, we blinked.

There you go. So a very quick answer. I mean, I don't know. LOGAN KILPATRICK: Yeah, awesome. EMANUEL TAROPA: There's a certain boardroom. LOGAN KILPATRICK: I'll double check the 002.

So I think the simple answer is we haven't shipped a second version, a second stable version. But yeah, I do think it is somewhat confusing that we have 001 and 002 at the same time. So we'll keep pushing on naming models better, which is tough. EMANUEL TAROPA: Again, I'm just like upvoting my suggestion. And I think someone else had the same thing, which is we could change what we named them.

Dates seem pretty good. Dates and products seem again, pretty good. So then, they'll be like, hey, why isn't Gemini app shipping a newer release of this particular model? And then, we'll be like, hey, go talk to Gemini.

LOGAN KILPATRICK: I think I'm aligned. Let's do dated version for the next model release series. So we'll consider that decision locked in. EMANUEL TAROPA: Yeah, Sergei had the same idea as well. I think it's a great idea. We should probably push on this.

LOGAN KILPATRICK: I'm with you. Another question. Why do smaller models tend to hallucinate more? Is it just like the knowledge is compressed and they know less? EMANUEL TAROPA: Maybe. It's a good question. I do think the smaller models, if you want to tune them more intently on specific use cases, things like summarization, I think a lot of this is fairly known.

It comes with pluses and minuses. It can make it more extractive and thus, keep it more grounded, but also hurts other capabilities in the model. And it's a balance we have to strike. But think of it as the smaller the model, the larger scale application we're using it for, the more finely tuned for particular use cases they are. This is not to say that small models don't have really, really good general capabilities.

Like, pick your favorite benchmark, because there's so many of them and so many of them are already saturated. But to the extent that it's still worthwhile, pick a benchmark.

Look at what models that were like two orders of magnitude larger, or an order and a half of magnitude larger, were achieving on those benchmarks a year and a half or two years ago, and compare them with the smaller models. And yes, caveats around distillation and all this, that everyone already kind of knows by now. But it's quite amazing what has happened here. LOGAN KILPATRICK: Yeah. No, I agree with you. Two questions.

My second to last question, on a slightly creative note, if you had to use a billion tokens of Flash-8B every single day for the next year, what would you spend those tokens on? This could be your personal life. This could be solving some work problem. But it has to be a billion-token-scale problem. It can't be like, summarize emails.

EMANUEL TAROPA: Yeah, retrieval augmentation. Again, I'm biased. I'm a hardcore search infrastructure slash quality slash news person, so this is what I would use it for. Access all my personal information, help me make sense of it, help me organize it, help me use it effectively in day to day.

It's a huge inflection point in humanity. We should really make sure it's actually very useful for the end user. The amount of time it saves is ridiculous, if done right. So I think we should press on this as much as possible. Again, this is my pet project. This is what I would like to do with them.

Who's to say I'm not working on this already, and I'll leave it at that. LOGAN KILPATRICK: I love that, and the beautiful part is, it would probably only cost you a couple of dollars a day if you were actually running that at billion token scale. Ema, this has been wonderful. The final question, and then I'm excited to get off this call and get back to work. And hopefully, I'll see you in a couple of weeks. We can go for our run.

What is keeping you excited about continuing to push and get these models out the door at Google? Is it that narrative that you just mentioned of retrieval-augmented generation at scale? What is the thing that is getting you up in the morning and excited? EMANUEL TAROPA: It varies morning by morning. Did last night's export succeed or not? If it didn't, then maybe we start looking at it at 3:00 AM, right? It's the set of people I work with, and it's the step function in humanity this thing does. Peeling everything away, that's basically what makes me very motivated to work on it.

And there are so many places in the world where you can have this type of impact. So I mean, from an engineering perspective, what more can one ask for? LOGAN KILPATRICK: Yeah, I love that. Ema, this was an awesome conversation. Thank you for taking the time out of the busy 8 to 8 workday. Yeah, I'm excited.

Hopefully we'll have more cool stuff shipping. We'll have you back on. And yeah, thanks for all the hard work. EMANUEL TAROPA: Definitely holding you to the Strava run. We have to do our little route here.

We'll arrange it to fit in kid drop off and stuff like that in the morning, but we should make it happen. LOGAN KILPATRICK: We will make it happen. That's the easiest problem that we have to solve, so I appreciate it, Ema. I'll see you later. EMANUEL TAROPA: Bye. Thank you.

Bye. LOGAN KILPATRICK: So that was a wonderful conversation. I'm so glad that I got a chance to hear from Ema. Hopefully you all enjoyed the conversation too. And for me, candidly, part of the motivation to do this conversation, to start this video series, was actually seeing the disconnect in public perception around the pace in which Google is moving and what I've experienced firsthand since I joined Google.

So I'm hopeful that this was an interesting look behind the scenes, as far as what actually happens at Google, what it takes to get things out the door. And also, just another quick note, which is, it really does take a village of people to land some of these launches. I think Ema and I sort of talked about this and represented a lot of people's work, but it really does take so many hard, hard-working, amazing people inside of Google, who I'm fortunate enough to get to call teammates. Shipping stuff, and especially shipping great stuff, really is a team sport. So I'm excited that we get to continue to do this.
