Next-Gen Intel Core Ultra’s New NPU Explained | Talking Tech | Intel Technology
- Hi, and welcome to Talking Tech. I'm your host, Alejandro Hoyos, and today we're gonna be taking a close look at the Lunar Lake NPU. (energetic music)
With us we have Senior PE Darren Crews. Hey Darren, how's it going? - Hey, good, how about you? - Doing well. Good to see you. For those who are joining us for the first time, why don't you let us know what you do at Intel and what you worked on for Lunar Lake? - Yeah. Yeah, so I've actually been at Intel for quite a while, it's coming up on 21 years or so now. I've been in the NPU team for around seven years, working on NPU architecture.
And in terms of Lunar Lake, what I work on, I would say architecture kind of spans two ends of the project. So at the beginning of the project, we're usually working on what hardware we need based on the latest workload trends, so we can accelerate those workloads. And then as we get closer to production, we get actual real workloads from customers, and let's say the latest workloads in the market, and we go and try to optimize those. We work with customers and ISVs to try to integrate into their software stacks.
- When you're starting the architecture, you start planning a lot of years before. - Yeah. - So I mean, you guys have to have a great vision of, all right, how is the environment gonna look in three years? How is that? Is that challenging? - Yeah, it's hard. I mean, really no one can predict what's gonna be there in three years. You just have to look at what trends are happening today, you know, read papers and figure out what people are publishing. Actually, a lot of the networks that are popular today came out quite a few years ago.
But the problem is, at that point they're not popular yet. So you have to decide: here's a new network architecture, it looks promising, but it hasn't really picked up. So should we optimize for that, or optimize for other networks that we see? - What have you seen lately? I mean AI, it seems like it's all over the place. What are the most common workloads that you're seeing today? - Yeah, most of the workloads are around transformers.
I would say those are the new workloads that we see. So large language models, basically all the generative AI stuff. Besides large language models, there's also models like Stable Diffusion, those are image generators.
You know, while large language models get a lot of the hype around transformer models, transformers actually have a lot of other use cases. Let's say transcription, like speech to text, that's another transformer workload that uses the same base architecture as a large language model. So yeah, there's just quite a large number of workloads built around transformers. Really, transformers are kind of replacing a lot of the previous workloads that we saw using other methods like convolutional neural networks or recurrent neural networks, LSTM networks. So it's really taking over a lot of the previous network architectures that we saw. - So that's kind of the latest iteration on how to solve these workloads, using transformers? - [Darren] Yeah.
- Okay, so let's take a quick look at, I think you already told us what, so we have transformers and large language models. Can you tell us a bit more about what a large language model is? 'Cause we've seen a lot of them today in different software. - So a large language model is a transformer architecture.
A transformer actually has kind of two components to it. There's an encoder part and a decoder part. The encoder can kind of be thought of as a classifier, and the decoder is like a generator. So you can use just the encoder part for, you know, classification.
You can use the encoder and decoder parts together for translation. So let's say you're doing language-to-language translation, or speech to text, where you get speech in and text out, that would be encoder/decoder.
Or it can be decoder only, so that's the generative part. That's what large language models are using. So they're really taking those decoders and just scaling 'em up to incredibly large sizes, anywhere from a few billion parameters to hundreds of billions of parameters.
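A minimal sketch of that decoder-only loop, in Python. The `decoder_step` function here is a hypothetical stand-in for a full transformer decoder (real models work on sub-word tokens and add caching and sampling), but it shows the core idea: each generated token is appended to the context and fed back in.

```python
# Minimal sketch of decoder-only (generative) inference.
# `decoder_step` is a hypothetical stand-in for a real transformer decoder;
# given the tokens seen so far, it scores every token in the vocabulary.

from typing import Callable, List

def generate(decoder_step: Callable[[List[int]], List[float]],
             prompt_tokens: List[int],
             max_new_tokens: int) -> List[int]:
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = decoder_step(tokens)                                 # one pass over the context
        next_token = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
        tokens.append(next_token)                                     # output loops back in as new input
    return tokens
```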
- All right, so now let's go back to Lunar Lake and the neural processing unit, the NPU. I was looking through the different architectures in there and I saw that this one is called NPU 4, which really surprised me, 'cause I thought it was the second one, since to me the first one was the one that we used in Meteor Lake. But that's not the case. Can you tell us a little more about why it's number four? - Yeah, well, if we look at the whole history of the NPU, Intel acquired Movidius in 2016. They were working on products mostly for IoT, computer vision type of products. So it was drones, security cameras, that kind of thing.
And that was using classical computer vision methods, but it was starting to move over to deep learning based methods. So in 2018 we had our first NPU 1 that was released. At that point it was called Myriad X, it was a discrete SoC. So we had Myriad X in around 2018, and the follow-on was Keem Bay, that was also a discrete SoC focused on IoT workloads. But we were starting to work towards client.
Actually, almost seven years ago, we took NPU 1, Myriad X, and we started working with Microsoft to see what it would look like to integrate that into a PC. So we enabled a new software stack, kind of a new programming model that we're using today, through MCDM, that's the Microsoft Compute Driver Model, and DirectML. So those were, you know, kind of the early days of things. But we've been building upon all those previous generations of work, really around what the hardware needs to look like to support the PC ecosystem and what the software stack needs to look like. - For Lunar Lake, for the NPU, what were the major overarching goals that you guys had for it? Like, what did you want to achieve with it? - Well, I think a few things. So one thing is we have a very large increase in compute, we're at over 4X the compute.
So, expecting that the amount of AI workloads was gonna increase, we had a large increase in compute. But we really want people to be able to use these workloads all day. So we want to have good battery life, cool and quiet operation. So we wanted to make sure the efficiency was improved as well.
So where we're having a huge increase in the amount of compute, we also want to have a good increase in efficiency, so you can really use those use cases all day. So we have around a 2X efficiency increase as compared to the Meteor Lake NPU 3. - And I guess we can go a little deeper, how were these efficiencies achieved? Was it through changes in the process, changes in the microarchitecture? What did you guys end up doing? - Yeah, it's kind of a mix. I mean, you know, we have a new node, so that helps.
Also, adjusting the VF curve so we can support a higher frequency at the minimum voltage, then we can get better efficiency. We also do a lot of microarchitectural improvements. Usually for neural networks, the most common operation is multiply accumulate, so that goes to our MAC array. So we put a lot of focus into optimizing the power efficiency of the MAC array. Also, minimizing data movement is another important thing.
You wanna minimize the number of times you're reading and writing data, 'cause that can also consume a lot of power. - Right. So you were talking about moving data, and I saw that in the new architecture, like in Meteor Lake, we have a scratch pad. Let's start with that. What is a scratch pad within the NPU? - Yeah, so we have a software managed cache, as opposed to a hardware managed cache where, let's say, if you have a cache miss it would bring in the data, or if you run outta space it would evict data. But in this case, we actually manage that with our compiler.
So one thing about a neural network is that the execution is pretty much totally known ahead of time. So you can optimally pre-plan all the data movement ahead of time. We have a DMA engine, and the compiler controls the DMA engine to bring in data at the optimal time and move data back to DDR at the optimal time. So let's say all the network parameters that we need to compute on, we'll bring those in, you know, the weights. Intermediate activations will be kept in there. So that really minimizes the number of times we need to read and write from DDR, 'cause reading and writing from DDR is expensive from a power perspective.
So, you know, in the spirit of keeping data local and saving power, that's kind of the optimal way to orchestrate all the data movement. - Right, because Lunar Lake is really designed, thought towards mobile. So to conserve battery power, the scratch pad gives you very quick access to the data that you need, and you also reduce the amount of traffic to DDR and the power. 'Cause like you said, it's a pretty big power hit when you have to go back and send that information.
- Yeah, right. Yeah, I mean it's like 10 or 20 times, or maybe more, power efficient to read outta the scratch pad than DDR. So if you're gonna be reading the data a bunch of times, you wanna be reading it out of there. Also, we have higher bandwidth to the scratch pad than to DDR as well.
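As a rough illustration of that compiler-planned data movement, here is a toy double-buffering schedule: while the compute engine works on one tile resident in the scratch pad, the next tile is staged from DDR. This is a generic sketch of the technique, not the NPU compiler's actual behavior; `dma_copy_to_scratch` and `compute_tile` are hypothetical placeholders, and real hardware overlaps the copy with the compute rather than running them sequentially as plain Python does.

```python
# Toy sketch of statically planned, double-buffered data movement between
# DDR and a small scratch pad. `dma_copy_to_scratch` and `compute_tile` are
# hypothetical placeholders; real hardware overlaps the copy with the compute.

def run_layer(weight_tiles, dma_copy_to_scratch, compute_tile):
    scratch = [None, None]                                  # two scratch-pad slots
    scratch[0] = dma_copy_to_scratch(weight_tiles[0])       # prefetch the first tile
    results = []
    for i, _ in enumerate(weight_tiles):
        if i + 1 < len(weight_tiles):
            # Stage the next tile while the current one is (conceptually) computed on.
            scratch[(i + 1) % 2] = dma_copy_to_scratch(weight_tiles[i + 1])
        # Compute only ever reads from the scratch pad, never directly from DDR.
        results.append(compute_tile(scratch[i % 2]))
    return results
```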
- Yeah. You said DMA, what is DMA? - It's just a copy engine. It's an engine that we use to copy data back and forth between DDR and the scratch pad, pretty simply speaking. - And that's the one that, compared to Meteor Lake, you doubled in size? - Yeah, so in Meteor Lake we had a 64-byte-per-cycle interface to DDR.
So roughly you could say 60 or so gigabytes per second. But DDR, if you have two channels, can get you over 100 gigabytes per second. So we weren't able to fully consume the bandwidth. But there are a lot of workloads that are bandwidth bound, and certainly large language models can be very bandwidth bound after the first token.
So by doubling the size of that interface to 128 bytes per cycle, we're able to fully consume the bandwidth out of DDR. - What is the first token? - Yeah, so if you're doing a large language model, let's say you're doing a chat bot, you're gonna enter a prompt, let's say you enter 100 words or something like that. You're gonna go through the large language model and do compute on all of those first 100 words that you've put in. That's a compute bound workload, 'cause you're doing compute on all 100 words that you put in. Now, large language models are very large in terms of the amount of parameters, or weights.
Let's just take something that would fit on a client, let's say like a Llama model that's 7 billion parameters. So every time you do an inference you have to read all those parameters in. But for the first token, you have a lot of compute too.
So you have a lot of tokens you have to compute on, so you're actually gonna be compute limited more than bandwidth limited. But once you get to the second token, you actually cache a lot of the information that you've computed, and you're only gonna have one new token input, which is the last token that you just generated. The output is gonna loop around as the new input. But all the parameters we actually have to read again. So those 7 billion parameters, which could be, let's say, four gigabytes of data, we have to read in just to compute that single token.
So we move from a compute bound regime for the first token to a bandwidth bound regime after the first token. - And because you have to iterate and you have to keep loading all those parameters, that's why it goes from compute bound to more of a memory bound. - Right. - And that's why having a scratch pad or increasing the direct memory access - Yeah - It definitely improves- - Yeah, I mean the scratch pad doesn't help there, 'cause our scratch pad is a few megabytes and it's a few gigabytes of parameters, so you basically have to reread them every time. - Okay, that's good. Thank you.
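To put rough numbers on that compute-bound versus bandwidth-bound split, here is a back-of-the-envelope sketch using figures in the spirit of the conversation: a 7-billion-parameter model at roughly 4 GB of weights, about 100 GB/s of dual-channel DDR bandwidth, and 48 peak TOPS. All of these values are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope sketch: why LLM prefill (first token) is compute bound
# and per-token decode is bandwidth bound. All numbers are rough assumptions
# taken from, or in the spirit of, the conversation.

params          = 7e9        # Llama-class model, 7 billion parameters
weight_bytes    = 4e9        # ~4 GB of weight data after compression/quantization
peak_ops_per_s  = 48e12      # 48 peak TOPS
ddr_bytes_per_s = 100e9      # ~100 GB/s dual-channel DDR bandwidth

# Prefill (first token): roughly 2 ops per parameter per input token.
prompt_tokens = 2000
prefill_ops = 2 * params * prompt_tokens           # ~2.8e13 ops, around 30 trillion
prefill_time_compute = prefill_ops / peak_ops_per_s
print(f"prefill compute-limited time: {prefill_time_compute:.2f} s")

# Decode (every later token): all weights must be re-read from DDR.
decode_time_bandwidth = weight_bytes / ddr_bytes_per_s
print(f"per-token bandwidth-limited time: {decode_time_bandwidth * 1e3:.0f} ms "
      f"(~{1 / decode_time_bandwidth:.0f} tokens/s upper bound)")
```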
Let's talk about the, 'cause you guys incremented the overall architecture. So you increased the number of neural compute engines. - [Darren] Yeah. - So tell me a little bit more about that. - Yeah, I mean, simply to get more compute capability, you can increase the frequency or increase the number of compute units, and we basically did both. So we have around a 40% increase in the clock speed, and then we also tripled the number of neural compute engines that we have.
So overall it's slightly over 4X in terms of total compute capability. - [Alejandro] So that means that in Meteor Lake we had two and now we have six? - Yeah, yeah. Two neural compute engines in Meteor Lake. Each engine does 2K MACs of compute, or 2048 MACs per cycle.
So we kept that same architecture and just increased that by 3X. So we go from around 4K MACs per cycle to around 12K MACs per cycle. - [Alejandro] That's a pretty good improvement. - Yeah. - Okay, so in Lunar Lake we have six neural compute engines, and that means that we have increased the TOPS. So we have gone from 11.5 to 48.
- Yeah, right. - For those, you know, who are still learning about AI, what is TOPS? How is it defined? What is it? - Yeah, I mean TOPS is a little bit of a narrow-focus figure of merit, let's say, for an NPU. What it stands for is trillions of operations per second, so it's 10 to the 12 operations per second. So, you know, it's a pretty huge number. When we're talking about TOPS, we're focusing on multiply accumulate.
So we have a MAC array. The MAC array does multiply accumulates, and we say those are matrix operations. Pretty much all, let's say 90 to 95%, of the operations are matrix operations. What that translates into is matrix multiplication or convolution.
So that's pretty much most of the operations that are done, or 95% of the operations that are done. So the way we compute TOPS is how many matrix multiplication or convolution operations we can do per second. In Meteor Lake, for example, each NPU, sorry, each neural compute engine could do 2048 MACs per cycle.
We have two of those, so it's 4,096 MACs per cycle. And then an operation is either a multiply or an accumulate, so a MAC is actually two operations. So we basically multiply two operations times our MACs per cycle, times the frequency we can run at.
So if we calculate all that out, then we get 11.5 peak TOPS. So if we think about, okay, how does that translate into a neural network? We can take ResNet-50 for example. ResNet-50 requires about 8 billion operations to complete.
And that's basically just the multiply accumulate operations for doing convolution. So we can think about what's the theoretical maximum performance that we can deliver. Let's say we have Meteor Lake at 11.5 TOPS, and then we have ResNet-50 that's like 8 billion operations, we can probably do a little over 1000 frames per second, theoretically. And then maybe just to give a counterexample, like large language models, if we're doing Llama 7 billion, the amount of compute is actually dependent on the input token size, or the number of input tokens. I was saying it's very compute bound. If we do, let's say, 2000 input tokens, I think our amount of compute is around 30 trillion operations.
So I was saying ResNet-50 was around 8 billion operations, and if we look at the Llama first token, it's thousands of times larger. - Yeah, orders of magnitude. - It's just huge. So large language models take a significant amount of compute, so we really need those additional TOPS to be able to service those kinds of workloads.
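Pulling that TOPS arithmetic together in one place: peak TOPS is two operations per MAC, times MACs per cycle, times clock frequency. The clock values below are backed out of the stated TOPS numbers (roughly 1.4 GHz for Meteor Lake and about 40% more for Lunar Lake), so treat them as inferred assumptions rather than official specs.

```python
# Peak-TOPS arithmetic as described above: ops = 2 * MACs/cycle * frequency.
# The clock frequencies are inferred from the stated TOPS figures, not specs.

def peak_tops(macs_per_cycle: int, freq_ghz: float) -> float:
    ops_per_cycle = 2 * macs_per_cycle          # a MAC counts as multiply + accumulate
    return ops_per_cycle * freq_ghz * 1e9 / 1e12

print(peak_tops(2 * 2048, 1.4))    # Meteor Lake NPU 3: 2 engines -> ~11.5 TOPS
print(peak_tops(6 * 2048, 1.95))   # Lunar Lake NPU 4:  6 engines -> ~48 TOPS

# Theoretical ResNet-50 throughput: roughly 8 billion operations per inference.
resnet50_ops = 8e9
print(peak_tops(2 * 2048, 1.4) * 1e12 / resnet50_ops)   # ~1400 frames/s upper bound
```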
- So about TOPS, is that a good representation of the actual workload, or the performance of the workload, or the efficiency of the workload? - I mean, it's definitely a good metric. You can never go above the performance you can deliver based on those TOPS, but we can definitely go below that.
So if we look at transformers, well, there's two things. We see that some can be very bandwidth bound. Like I was saying, after the first token, if we're doing a large language model, we need a lot of bandwidth, so actually our TOPS don't help us a lot. What really helps is bandwidth, we need bandwidth. So that's one factor that's important.
Another factor is the non-multiply-accumulate operations, I would call those vector operations. So in transformers there's an operation called SoftMax, for example, that is part of attention scoring. You know, the first transformer paper is called "Attention Is All You Need". It's about how large language models essentially use attention to figure out what the context is, and there's a SoftMax operation that's part of that. And that's not a multiply accumulate operation, it's a little bit different. So you really need good performance on that.
You know, a MAC array, like a matrix multiply array, is not gonna help you compute that. If you don't have enough of that kind of compute, you're really gonna slow down the processing. And so one of the things that we did in Lunar Lake, in NPU 4, is we increased our vector compute by 4X over what we had, and that's on top of the 3X. We have three times the neural compute engines and 4X the vector compute, so it's like a 12X per-cycle improvement in compute from a vector perspective.
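For a concrete sense of why the vector units matter, here is the textbook SoftMax over one row of attention scores: it is all exponentials, a sum, and divides, elementwise vector work with no multiply accumulates for the MAC array to help with. This is the standard formulation, not NPU code.

```python
# SoftMax over one row of attention scores: exponentials, a sum, and divides,
# all elementwise vector work rather than multiply-accumulates, which is why
# a dedicated MAC array doesn't help here.

import math
from typing import List

def softmax(scores: List[float]) -> List[float]:
    m = max(scores)                             # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))   # e.g. [0.659, 0.242, 0.099], the attention weights
```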
And what we see is that on a lot of workloads we can get significantly better performance. For example, Stable Diffusion uses attention layers, and on Meteor Lake we were very limited in terms of that SoftMax calculation. But on Lunar Lake we have a significant improvement in the performance of that calculation, which helps us accelerate the overall workload more. - So that means the TOPS is kind of like, I guess we talked earlier, it's mostly based on the multiply accumulate for the matrix. - [Darren] Yeah.
- But that doesn't take into account the attention layer that we need. Which actually, like you said, gives context to the word when you're trying to figure out what the input was, what you typed, let's say "the bird is blue" or something like that. - Yeah, we have two figures of merit that we look at. One figure of merit is the peak TOPS, which is just based on your multiply accumulates. And then we also have effective TOPS: how well can you use those peak TOPS? So we're always trying to improve both the peak TOPS, which is, you know, we can add more multipliers, we can increase the frequency, but then also effective TOPS, where we're trying to accelerate other portions of the network that are keeping us from achieving our peak TOPS, which could be more vector compute, it could be more bandwidth, et cetera.
Or maybe more specialized kinds of operations. - No, okay, that makes sense. So we're talking about attention, well, the attention, I guess there's an attention block within the whole architecture. But there's also another block where you have done a lot of improvements, which is the DSP block.
Well, let's start with what DSP means, and then what's its function within the NPU? - Just to clarify, the attention block is a part of the transformer architecture. - Oh, okay. Thank you. - Yeah, so we actually made the DSP improvements to help us on the attention block. If we look at the history of Movidius, the DSP is actually part of the core IP that they developed. It's called SHAVE, which stands for Streaming Hybrid Architecture Vector Engine.
It's basically a vector engine, it's our vector compute engine. So what we had was a 128-bit width for the vector engine.
What that means is, if we're computing in FP16, we can do eight FP16 operations every cycle, since 8 times 16 is 128. We can also do four FP32 operations every cycle.
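A quick way to see the lane math here, including the wider vector that comes up next: the number of operations per cycle is just the vector width in bits divided by the element width in bits. Purely illustrative arithmetic.

```python
# Vector lanes = register width (bits) / element width (bits). Illustrative only.

def lanes(vector_bits: int, element_bits: int) -> int:
    return vector_bits // element_bits

print(lanes(128, 16), lanes(128, 32))   # NPU 3 SHAVE:  8 FP16 ops,  4 FP32 ops per cycle
print(lanes(512, 16), lanes(512, 32))   # NPU 4 SHAVE: 32 FP16 ops, 16 FP32 ops per cycle
```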
So what we did is we increased the vector width, or how many vector operations we can do per cycle, from 128 bits to 512 bits. So we can do 32 FP16 operations every cycle, or 16 FP32 operations every cycle. - That's a big improvement from what we had before. - Yeah. - All right, so this has been a great conversation.
How about we summarize all the different changes that have been implemented in the NPU, so we have a nice conclusion of the changes and also how they translate to gains in performance? - Yeah, okay. Yeah, I mean, I think I would summarize it in a few categories. So the first thing is the peak TOPS increase. That's just the raw increase in the number of multiply accumulates per cycle and the frequency we can run at. So that just gives us more compute.
The second thing is, of course, efficiency. So that really goes to the MAC array. Most of our power is going to the MAC array, so we want every multiply accumulate we do to be more efficient. The next thing is really the ratio of vector compute to matrix compute, which we found matters especially for transformer models.
Also, back in the day, with RNNs and LSTMs, there was a higher ratio of vector compute to matrix compute compared to convolutional neural networks. So we wanted to increase our ratio of vector compute. That's why we made the DSP 4X the vector length, essentially increasing the ratio of vector compute to matrix compute by 4X.
And the last thing is bandwidth. We have all these engines that can do a lot of compute, but we need to feed them. So we need to be able to fully saturate the DDR bandwidth, so we can fully take advantage of what the platform is providing us in terms of bandwidth. That's where we doubled the interface width for the DMA engine. So those are kind of all the key ingredients to really accelerate today's modern workloads. - No, that's great. Darren, thank you so much.
I learned a lot. As always, a pleasure, man. - All right, thanks. - Thank you. (energetic music)