nharada 8 days ago

The first specifically designed for inference? Wasn’t the original TPU inference only?

  • dgacmu 8 days ago

    Yup. (Source: was at brain at the time.)

    Also holy cow that was 10 years ago already? Dang.

    Amusing bit: The first TPU design was based on fully connected networks; the advent of CNNs forced some design rethinking, and then the advent of RNNs (and then transformers) did it yet again.

    So maybe it's reasonable to say that this is the first TPU designed for inference in the world where you have both a matrix multiply unit and an embedding processor.

    (Also, the first gen was purely a co-processor, whereas the later generations included their own network fabric, a trait shared by this most recent one. So it's not totally crazy to think of the first one as a very different beast.)

    • miki123211 8 days ago

      Wow, you guys needed a custom ASIC for inference before CNNs were even invented?

      What were the use cases like back then?

      • huijzer 8 days ago

        According to a Google blog post from 2016 [1], the use cases were RankBrain (to improve the relevancy of search results) and Street View. They also used it for AlphaGo. And from what I remember from my MSc thesis, they were probably also starting to use it for Translate. I can't find any TPU reference in Attention Is All You Need or in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, but I was fine-tuning BERT on a TPU at the time, in Oct 2018 [2]. If I remember correctly, the BERT example repository showed how to fit the model with a TPU inside a Colab. So I would guess that natural language research was mostly not on TPUs around 2016-2018, but then moved over to TPUs in production. I could be wrong though, and dgacmu probably knows more.

        [1]: https://cloud.google.com/blog/products/ai-machine-learning/g...

        [2]: https://github.com/rikhuijzer/improv/blob/master/runs/2018-1...

        • mmx1 8 days ago

          Yes, IIRC (please correct me if I'm wrong), translate did utilize Seastar (TPU v1) which was integer only, so not easily useful for training.

      • dekhn 8 days ago

        As an aside, Google used CPU-based machine learning (using enormous numbers of CPUs) for a long time before custom ASICs or TensorFlow even existed.

        The big ones were SmartASS (ads serving) and Sibyl (everything else serving). There was an internal debate over the value of GPUs, with a prominent engineer writing an influential doc that caused Google to continue with fat CPU nodes even when it was clear that accelerators were a good alternative. This was around the time ImageNet blew up, and some engineers were stuffing multiple GPUs into their dev boxes to demonstrate training speeds on tasks like voice recognition.

        Sibyl was a heavy user of embeddings before there was any real custom ASIC support for that; there was an add-on for TPUs called barnacore to give limited embedding support (embeddings are very useful for maximizing profit through ranking).

        • mmx1 8 days ago

          What I've heard is that the extrapolation of compute needs showed that so many additional CPU servers would be required just to keep running the existing workload types that it obviously justified dedicated hardware. Same for video encoding accelerators [1].

          [1]: https://research.google/pubs/warehouse-scale-video-accelerat...

          • dekhn 7 days ago

            yes, that was one of the externally shared narratives. The other part is that Google didn't want to be beholden to Nvidia GPUs, since they have an associated profit margin that is higher than TPUs', as well as resource constraints (the total number of GPUs shipping at any given time).

            Another part that was left out was that Google did not make truly high speed (low-latency) networking and so many of their CPU jobs had to be engineered around slow networks to maintain high utilization and training speed. Google basically ended up internally relearning the lessons that HPC and supercomputing communities had already established over decades.

        • miki123211 7 days ago

          > The big ones were SmartASS (ads serving) and Sibyl (everything else serving).

          Ah, the days when you, as a tech company employee, could call a service "SmartASS" and get away with it...

      • refulgentis 8 days ago

        https://research.google/blog/the-google-brain-team-looking-b... is a good overview

        I wasn't on Brain, but got obsessed with the Kremlinology of ML internally at Google because I wanted to know why leadership was so gung-ho on it.

        The general sense in the early days was that these things can learn anything and they'll replace fundamental units of computing. This thought process is best exhibited externally by, e.g., https://research.google/pubs/the-case-for-learned-index-stru...

        It was also a different Google, the "3 different teams working on 3 different chips" bit reminds me of lore re: how many teams were working on Android wearables until upper management settled it.

        FWIW it's a very, very different company now. Back then it was more entrepreneurial - a better version of the Wave era, where things launch themselves. An MBA would find the top-down company of 2025 even better; I find it less so - it's perfectly tuned to do what Apple or OpenAI did 6-12 months ago, but not to lead. Almost certainly a better investment, but a worse version of an average workplace, because it hasn't developed antibodies against BSing. (disclaimer: worked on Android)

        • marsten 8 days ago

          Google was changed by two things, neither of which was much fun, but both very understandable.

          One was the transition to a mature product line. In the early days it was about how do we do cool new things that will delight users: Gmail, Google Maps (Where 2), YouTube. The focus was on user growth and adoption.

          Then growth saturated and the focus turned to profitability: Getting more value out of existing users and defending the business. That shift causes you to think very differently, and it's not as fun.

          The second was changing market conditions. The web grew up, tech grew up, and the investment needed to make a competitive product skyrocketed. Google needed more wood behind fewer arrows and that meant reining in all the small teams running around doing kooky things. Again not fun, but understandable.

    • kleiba 8 days ago

      > the advent of CNNs forced some design rethinking, and then the advent of RNNs (and then transformers) did it yet again.

      Certainly, RNNs are much older than TPUs?!

      • woodson 8 days ago

        So are CNNs, but I guess their popularity heavily increased at that time, to the point where it made sense to optimize the hardware for them.

      • hyhjtgh 8 days ago

        RNNs were of course well known at the time, but they weren't putting out state-of-the-art numbers back then.

  • theptip 8 days ago

    The phrasing is very precise here, it’s the first TPU for _the age of inference_, which is a novel marketing term they have defined to refer to CoT and Deep Research.

    • yencabulator a day ago

      As a previous boss liked to say, this car is the cheapest in its price range, the roomiest in its size category, and the fastest in its speed group.

    • dang 8 days ago

      Ugh. We should have caught that.

      Can anyone suggest a better (i.e. more accurate and neutral) title, devoid of marketing tropes?

    • shitpostbot 8 days ago

      They didn't though?

      > first designed specifically for inference. For more than a decade, TPUs have powered Google’s most demanding AI training and serving workloads...

      What do they think serving is? I think this marketing copy was written by someone with no idea what they were talking about, and not reviewed by anyone who did.

      Also funny enough it kinda looks like they've scrubbed all their references to v4i, where the i stands for inference. https://gwern.net/doc/ai/scaling/hardware/2021-jouppi.pdf

  • jeffbee 8 days ago

    Yeah that made me chuckle, too. The original was indeed inference-only.

  • m00x 8 days ago

    The first one was designed as a proof of concept that it would work at all, not really to be optimal for inference workloads. It just turns out that inference is easier.

no_wizard 8 days ago

Some honest competition in the chip space of the machine learning race! Genuinely interested to see how this ends up playing out. Nvidia seemed 'untouchable' for so long in this space that it's nice to see things get shaken up.

I know they aren't selling the TPU as boxed units, but still, even as hardware that backs GCP services and whatnot, it's interesting to see how it'll shake out!

  • epolanski 8 days ago

    > Nvidia seemed 'untouchable' for so long in this space that its nice to see things get shaken up.

    Did it?

    Both Mistral's LeChat (running on Cerebras) and Google's Gemini (running on TPUs) clearly showed ages ago that Nvidia had no advantage at all in inference.

    The hundreds of billions spent on hardware till now have focused on training, but inference is in the long run gonna get the lion's share of the work.

    • wyager 8 days ago

      > but inference is in the long run gonna get the lion share of the work.

      I'm not sure - might not the equilibrium state be that we are constantly fine-tuning models with the latest data (e.g. social media firehose)?

      • NoahZuniga 8 days ago

        The head of Groq said that in his experience at Google, training was less than 10% of compute.

        • lostmsu 7 days ago

          Isn't Groq still more expensive than GPU-based providers?

throwaway48476 8 days ago

It's hard to be excited about hardware that will only exist in the cloud before being shredded.

  • crazygringo 8 days ago

    You can't get excited about lower prices for your cloud GPU workloads thanks to the competition it brings to Nvidia?

    This benefits everyone, even if you don't use Google Cloud, because of the competition it introduces.

    • 01HNNWZ0MV43FF 8 days ago

      I like owning things

      • sodality2 8 days ago

        Cloud providers will buy fewer NVDA chips, and since they're related goods, prices will drop.

        • throwaway48476 8 days ago

          Pricing is based on competitors more than demand.

      • karmasimida 8 days ago

        Oh, you own the generator for your GPU as well?

      • baobabKoodaa 8 days ago

        You will own nothing and you will be happy.

        • simondotau 8 days ago

          Seriously yes. I don’t want to own rapidly depreciating hardware if I don’t have to.

          I don’t want to own open source software and I’d prefer if culture was derived from and contributed to the public domain.

          I’m still fully on board with capitalism, but there are many instances where I’d prefer to replace physical acquisition with renting, or replacing corporate-made culture with public culture.

          • throwaway48476 8 days ago

            Nvidia cloud instances have competition. What happens when you get vendor lock in with TPU and have no exit plan? Arguably it's competition that drives value creation, not capitalism.

            • simondotau 8 days ago

              Just don’t let yourself get stuck behind vendor lock-in, or if you do, never let yourself feel trapped, because you never are.

              Whenever assessing the work involved in building an integration, always assume you’ll be doing it twice. If that sounds like too much work then you shouldn’t have outsourced to begin with.

              • throwaway48476 6 days ago

                Switching costs are a thing everywhere in the economy. Measure twice cut once. Planning on doing the work twice is asinine.

                • simondotau 14 hours ago

                  Then you shouldn’t have outsourced to begin with.

    • throwaway48476 8 days ago

      [flagged]

      • maxrmk 8 days ago

        I love to hate on Google, but I suspect this is strategic enough that they won't kill it.

        Like Graviton at AWS, it's as much a negotiation tool as it is a technical solution, letting them push harder with NVIDIA on pricing because they have a backup option.

        • mmx1 8 days ago

          Google has done stuff primarily for negotiation purposes (e.g. POWER9 chips) but TPU ain't one. It's not a backup option or presumed "inferior solution" to NVIDIA. Their entire ecosystem is TPU-first.

          • lostmsu 7 days ago

            Was Gemini 2.5 trained on TPUs though? I seem to be struggling to find that information. Wouldn't they want to mention it in every press release?

            • mmx1 7 days ago

              Pretty sure the answer is yes. I have no direct knowledge of the matter for Gemini 2.5, but in general TPUs were widely used for training at Google. Even Apple used them to train their Apple Intelligence models. It’s not some esoteric thing to train on TPU; I would consider using GPU for that inside Google esoteric.

              P.S. I found an on-the-record statement re Gemini 1.0 on TPU:

              "We trained Gemini 1.0 at scale on our AI-optimized infrastructure using Google’s in-house designed Tensor Processing Units (TPUs) v4 and v5e. And we designed it to be our most reliable and scalable model to train, and our most efficient to serve."

      • joshuamorton 8 days ago

        Google's been doing custom ML accelerators for 10 years now, and (depending on how much you're willing to stretch the definition) has been doing them in consumer hardware for soon to be five years (the Google Tensor chips in pixel phones).

  • foota 8 days ago

    Personally, I have a (non-functional) TPU sitting on my desk at home :-)

  • prometheon1 8 days ago

    You don't find news about quantum computers exciting at all? I personally disagree

  • justanotheratom 8 days ago

    exactly. I wish Groq would start selling their cards that they use internally.

    • xadhominemx 8 days ago

      They would lose money on every sale

  • p_j_w 8 days ago

    I think this article is for Wall Street, not Silicon Valley.

    • mycall 8 days ago

      Bad timing as I think Wall Street is preoccupied at the moment.

      • SquareWheel 8 days ago

        Must be part of that Preoccupy Wall Street movement.

      • asdfman123 8 days ago

        Oh, believe me, they are very much paying attention to tech stocks right now.

    • killerstorm 8 days ago

      It might also be for people who consider working for Google...

    • noitpmeder 8 days ago

      What's their use case?

      • fennokin 8 days ago

        As in for investor sentiment, not literally finance companies.

      • amelius 8 days ago

        Gambling^H^H^H^H Making markets more "efficient".

  • jeffbee 8 days ago

    [flagged]

    • CursedSilicon 8 days ago

      Please don't make low-effort bait comments. This isn't Reddit

nehalem 8 days ago

Not knowing much about special-purpose chips, I would like to understand whether chips like this would give Google a significant cost advantage over the likes of Anthropic or OpenAI when offering LLM services. Is similar technology available to Google's competitors?

  • heymijo 8 days ago

    GPUs: very good for pretraining, inefficient for inference.

    Why?

    For each new word a transformer generates it has to move the entire set of model weights from memory to compute units. For a 70 billion parameter model with 16-bit weights that requires moving approximately 140 gigabytes of data to generate just a single word.

    GPUs have off-chip memory. That means a GPU has to push data across a chip-to-memory bridge for every single word it creates. This architectural choice is an advantage for graphics processing, where large amounts of data need to be stored but not necessarily accessed as rapidly for every single computation. It's a liability in inference, where quick and frequent data access is critical.
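
    A rough back-of-the-envelope (a sketch only: the bandwidth figure below is an assumed H100-class number, and batching, quantization and speculative decoding change the picture a lot, as a reply below points out):

      # Batch-size-1 decode: every generated token streams all weights from
      # HBM, so memory bandwidth caps tokens/second.
      params      = 70e9          # 70B parameters
      bytes_per_w = 2             # 16-bit weights
      hbm_bw      = 3.35e12       # bytes/s, roughly an H100-class part (assumed)

      bytes_per_token  = params * bytes_per_w        # ~140 GB per token
      max_tokens_per_s = hbm_bw / bytes_per_token    # ~24 tokens/s upper bound
      print(bytes_per_token / 1e9, max_tokens_per_s)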

    Listening to Andrew Feldman of Cerebras [0] is what helped me grok the differences. Caveat, he is a founder/CEO of a company that sells hardware for AI inference, so the guy is talking his book.

    [0] https://www.youtube.com/watch?v=MW9vwF7TUI8&list=PLnJFlI3aIN...

    • latchkey 8 days ago

      Cerebras (and Groq) have the problem of using too much die for compute and not enough for memory. Their method of scaling is to fan out the compute across more physical space. This takes more DC space, power and cooling, which is a huge issue. Funny enough, when I talked to Cerebras at SC24, they told me their largest customers are training customers, not inference. They just market it as an inference product, which is even more confusing to me.

      I wish I could say more about what AMD is doing in this space, but keep an eye on their MI4xx line.

      • usatie 8 days ago

        Thank you for sharing this perspective — really insightful. I’ve been reading up on Groq’s architecture and was under the impression that their chips dedicate a significant portion of die area to on-chip SRAM (around 220MiB per chip, if I recall correctly), which struck me as quite generous compared to typical accelerators.

        From die shots and materials I’ve seen, it even looks like ~40% of the die might be allocated to memory [1]. Given that, I’m curious about your point on “not enough die for memory” — is it a matter of absolute capacity still being insufficient for current model sizes, or more about the area-bandwidth tradeoff being unbalanced for inference workloads? Or perhaps something else entirely?

        I’d love to understand this design tension more deeply, especially from someone with a high-level view of real-world deployments. Thanks again.

        [1] Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads — Fig. 5. Die photo of 14nm ASIC implementation of the Groq TSP. https://groq.com/wp-content/uploads/2024/02/2020-Isca.pdf

        • latchkey 8 days ago

          > is it a matter of absolute capacity still being insufficient for current model sizes

          This. Additionally, models aren't getting smaller, they are getting bigger, and to be useful to a wider range of users they also need more context to go off of, which means even more memory.

          Previously: https://news.ycombinator.com/item?id=42003823

          It could be partially the DC, but look at the rack density... to get to an equal amount of GPU compute and memory, you need 10x the rack space...

          https://www.linkedin.com/posts/andrewdfeldman_a-few-weeks-ag...

          Previously: https://news.ycombinator.com/item?id=39966620

          Now compare that to an NV72 and the direction Dell/CoreWeave/Switch are going in with the EVO containment... far better. One can imagine that AMD might do something similar.

          https://www.coreweave.com/blog/coreweave-pushes-boundaries-w...

          • usatie 6 days ago

            Thanks for the links — I went through all of them (took me a while). The point about rack density differences between SRAM-based systems like Cerebras or Groq and GPU clusters is now clear to me.

            What I’m still trying to understand is the economics.

            From this benchmark: https://artificialanalysis.ai/models/llama-4-scout/providers...

            Groq seems to offer nearly the lowest prices per million tokens and nearly the fastest end-to-end response times. That's surprising because, in my understanding, speed (latency) and cost are a trade-off.

            So I'm wondering: why can't GPU-based providers offer cheaper but slower (higher-latency) APIs? Or do you think Groq/Cerebras are pricing much below cost (loss-leader style)?

            • latchkey 6 days ago

              Loss leader. It is uber/airbnb. Book revenue, regardless of economics, and then debt finance against that. Hope one day to lock in customers, or raise prices, or sell the company.

      • heymijo 8 days ago

        > they told me their largest customers are for training, not inference

        That is curious. Things are moving so quickly right now. I typed out a few speculative sentences then went ahead and asked an LLM.

        Looks like Cerebras is responding to the market and pivoting towards a perceived strength of their product combined with the growth in inference, especially with the advent of reasoning models.

        • latchkey 8 days ago

          I wouldn't call it "pivoting" as much as "marketing".

    • ein0p 8 days ago

      Several incorrect assumptions in this take. For one thing, 16-bit is not necessary. For another, 140GB/token holds only if your batch size is 1 and your sequence length is 1 (no speculative decoding). Nobody runs LLMs like that on those GPUs - if you do, compute utilization becomes ridiculously low. With a batch greater than 1 and speculative decoding, the arithmetic intensity of the kernels is much higher, and having weights "off chip" is not that much of a concern.
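
      A toy model of that point (a sketch assuming fp8 weights, ignoring KV-cache and activation traffic): weights are read from HBM once per decode step regardless of batch size, so arithmetic intensity grows roughly linearly with the batch.

        params       = 70e9
        weight_bytes = params * 1               # fp8 weights, 1 byte each

        for batch in (1, 8, 64, 256):
            flops     = 2 * params * batch      # ~2 FLOPs per weight per sequence
            intensity = flops / weight_bytes    # FLOPs per byte of weights moved
            print(batch, intensity)             # 2, 16, 128, 512 FLOPs/byte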

    • hanska 8 days ago

      The Groq interview was good too. Seems that the thought process is that companies like Groq/Cerebras can run the inference, and companies like Nvidia can keep/focus on their highly lucrative pretraining business.

      https://www.youtube.com/watch?v=xBMRL_7msjY

  • avrionov 8 days ago

    NVIDIA operates at a ~70% margin right now. Not paying that premium and having an alternative to NVIDIA is beneficial. We just don't know by how much.

    • kccqzy 8 days ago

      I might be misremembering here, but Google's own AI models (Gemini) don't use NVIDIA hardware in any way, training or inference. Google bought a large number of NVIDIA hardware only for Google Cloud customers, not themselves.

  • xnx 8 days ago

    Google has a significant advantage over other hyperscalers because Google's AI data centers are much more compute cost efficient (capex and opex).

    • claytonjy 8 days ago

      Because of the TPUs, or due to other factors?

      What even is an AI data center? are the GPU/TPU boxes in a different building than the others?

      • summerlight 8 days ago

        Lots of other factors. I suspect this is one of the reasons why Google cannot offer the TPU hardware itself outside of their cloud service: a significant chunk of TPU efficiency can be attributed to external factors which customers cannot easily replicate.

      • xnx 8 days ago

        > Because of the TPUs, or due to other factors?

        Google does many pieces of the data center better. Google TPUs use 3D torus networking and are liquid cooled.

        > What even is an AI data center?

        Being newer, AI installations have more variations/innovation than traditional data centers. Google's competitors have not yet adopted all of Google's advances.

        > are the GPU/TPU boxes in a different building than the others?

        Not that I've read. They are definitely bringing on new data centers, but I don't know if they are initially designed for pure-AI workloads.

        • nsteel 8 days ago

          Wouldn't a 3d torus network have horrible performance with 9,216 nodes? And really horrible latency? I'd have assumed traditional spine-leaf would do better. But I must be wrong as they're claiming their latency is great here. Of course, they provide zero actual evidence of that.

          And I'll echo, what even is an AI data center, because we're still none the wiser.

          • dekhn 8 days ago

            A 3d torus is a tradeoff in terms of wiring complexity/cost and performance. When node counts get high you can't really have a pair of wires between every pair of nodes, so if you don't use a torus you usually need a stack of switches/routers aggregating traffic. Those mid-level and top-level switches/routers get very expensive (high cross-section bandwidth) and the routing can get a bit painful. A 3d torus has far fewer cables, the routing can be really simple ("hop vertically until you reach your row, then hop horizontally to reach your node"), and the wrap-around connections are nice.
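
            A minimal sketch of that dimension-ordered routing idea (the dimensions below are made up, chosen only so they multiply out to 9,216 nodes; the real pod topology differs):

              # Hop count per axis is the shorter way around the ring, which is
              # what the wrap-around links buy you.
              def torus_hops(src, dst, dims):
                  hops = 0
                  for s, d, n in zip(src, dst, dims):
                      delta = abs(d - s)
                      hops += min(delta, n - delta)
                  return hops

              dims = (16, 16, 36)                              # 16*16*36 = 9,216
              print(torus_hops((0, 0, 0), (8, 8, 18), dims))   # worst case: 34 hops
              print(torus_hops((0, 0, 0), (15, 15, 35), dims)) # wrap-around: 3 hops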

            That said, the torus approach was a gamble that most workloads would be nearest-neighbor, and allreduce needs extra work to optimize.

            An AI data center tends to have enormous power consumption and cooling capabilities, with less disk, and slightly different networking setups. But really it just means "this part of the warehouse has more ML chips than disks"

            • nsteel 8 days ago

              > most workloads would be nearest-neighbor

              Thank you very much, that is the piece of the puzzle I was missing. Naively, it still seems (to me) like far more hops for a 3d torus than for a regular multi-level switch when you've got many thousands of nodes, but I can appreciate that the routing could be much simpler. Although I would guess that in practice it requires something beyond the simplest routing solution to avoid congestion.

            • cavisne 8 days ago

              Was that gamble wrong? I thought all LLM training workloads do collectives that involve all nodes (all-gather, reduce-scatter).

              • dekhn 7 days ago

                I think the choice they made, combined with some great software and hardware engineering, allows them to continue to innovate at the highest level of ML research regardless of their specific choice within a reasonable dollar and complexity budget.

          • xadhominemx 8 days ago

            It's a data center with much higher power density. We're talking about 100 going to 1,000 kW/rack vs 20 kW/rack for a traditional data center, requiring much different cooling and power delivery.

          • xnx 8 days ago

            > what even is an AI data center

            A data center that runs significant AI training or inference loads. Non AI data centers are fairly commodity. Google's non-AI efficiency is not much better than Amazon or anyone else. Google is much more efficient at running AI workloads than anyone else.

            • dbmnt 8 days ago

              > Google's non-AI efficiency is not much better than Amazon or anyone else.

              I don't think this is true. Google has long been a leader in efficiency. Look at the power usage effectiveness (PUE). A decade ago Google announced average PUEs around 1.12 while the industry average was closer to 2.0. From what I can tell they reported a 1.1 average fleet wide last year. They've been more transparent about this than any of the other big players.

              AWS is opaque by comparison, but they report 1.2 on average. So they're close now, but that's after a decade of trying to catch up to Google.

              To suggest the rest of the industry is on the same level is not at all accurate.

              https://en.wikipedia.org/wiki/Power_usage_effectiveness

              (Amazon isn't even listed in the "Notably efficient companies" section on the Wikipedia page).

              • literalAardvark 8 days ago

                A decade ago seems like a very long time.

                We've seen the rise of OSS Kubernetes and eBPF networking since, and a lot more that I don't have on-stack rn.

                I wouldn't be surprised if everyone else had significantly closed the hardware utilization gap.

  • cavisne 8 days ago

    Nvidia has ~60% margins on their datacenter chips. So TPUs have quite a bit of headroom to save Google money without being as good as Nvidia GPUs.

    No one else has access to anything similar, Amazon is just starting to scale their Trainium chip.

    • buildbot 8 days ago

      Microsoft has the MAIA 100 as well. No comment on their scale/plans though.

  • baby_souffle 8 days ago

    There are other AI/LLM 'specific' chips out there, yes. But the thing about ASICs is that you need one for each *specific* task. Eventually we'll hit an equilibrium, but for now the stuff that Cerebras is best at is not what TPUs are best at is not what GPUs are best at…

    • monocasa 8 days ago

      I don't even know if eventually we'll hit an equilibrium.

      The end of Moore's law pretty much dictates specialization, it's just more apparent in fields without as much ossification first.

fancyfredbot 9 days ago

It looks amazing but I wish we could stop playing silly games with benchmarks. Why compare fp8 performance on Ironwood to architectures which don't support fp8 in hardware? Why leave out TPU v6 from the comparison?

Why compare fp64 flops in the El Capitan supercomputer to fp8 flops in the TPU pod when you know full well these are not comparable?

[Edit: it turns out that El Capitan is actually faster when compared like for like and the statement below underestimated how much slower fp64 is, my original comment in italics below is not accurate] (The TPU would still be faster even allowing for the fact fp64 is ~8x harder than fp8. Is it worthwhile to misleadingly claim it's 24x faster instead of honestly saying it's 3x faster? Really?)

It comes across as a bit cheap. Using misleading statements is a tactic for snake oil salesmen. This isn't snake oil so why lower yourself?

  • fancyfredbot 8 days ago

    It's even worse than I thought. El Capitan has 43,808 MI300A APUs. According to AMD, each MI300A can do 3922 TF of sparse FP8, for a total of 171 EF of sparse FP8 performance, or ~86 EF non-sparse.

    In other words El Capitan is between 2 and 4 times as fast as one of these pods, yet they claim the pod is 24x faster than El Capitan.
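
    Quick cross-check of that arithmetic (using the figures above and treating all of El Capitan as one "pod" the way the blog post does):

      apus          = 43_808              # MI300A APUs in El Capitan
      fp8_sparse_tf = 3_922               # TFLOPS per APU, sparse FP8 (AMD spec)
      fp8_dense_tf  = fp8_sparse_tf / 2   # ~1,961 TFLOPS dense

      el_cap_sparse_ef = apus * fp8_sparse_tf / 1e6   # ~171.8 EF
      el_cap_dense_ef  = apus * fp8_dense_tf  / 1e6   # ~85.9 EF
      ironwood_pod_ef  = 42.5                         # Google's quoted FP8 number

      print(el_cap_sparse_ef / ironwood_pod_ef)       # ~4.0x
      print(el_cap_dense_ef  / ironwood_pod_ef)       # ~2.0x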

  • dekhn 8 days ago

    Google shouldn't do that comparison. When I worked there I strongly emphasized to the TPU leadership to not compare their systems to supercomputers- not only were the comparisons misleading, Google absolutely does not want supercomputer users to switch to TPUs. SC users are demanding and require huge support.

    • meta_ai_x 8 days ago

      Google needs to sell to Enterprise Customers. It's a Google Cloud Event. Of course they have incentives to hype because once long-term contracts are signed you lose that customer forever. So, hype is a necessity

  • shihab 9 days ago

    I went through the article and it seems you're right about the comparison with El Capitan. These performance figures are so bafflingly misleading.

    And so unnecessary too - nobody shopping for an AI inference server cares at all about its relative performance vs an fp64 machine. This language seems designed solely to wow tech-illiterate C-suites.

  • imtringued 8 days ago

    Also, there is no such thing as a "El Capitan pod". The quoted number is for the entire supercomputer.

    My impression from this is that they are too scared to say that their TPU pod is equivalent to 60 GB200 NVL72 racks in terms of fp8 flops.

    I can only assume that they need way more than 60 racks and they want to hide this fact.

    • jeffbee 8 days ago

      A max-spec v5p deployment, at least the biggest one they'll let you rent, occupies 140 racks, for reference.

      • aaronax 8 days ago

        8960 chips in those 140 racks. $4.20/hour/chip, or about $3,066/month/chip.

        So roughly $37.6k per hour, or about $27.5 million per month.

        Get 55% off with 3 year commitment.
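
        Sanity-checking that (a sketch using the on-demand rate above; real bills vary by region and commitment):

          chips          = 8_960
          usd_per_chip_h = 4.20
          hours_per_mo   = 730

          per_hour  = chips * usd_per_chip_h            # ~$37.6k per hour
          per_month = per_hour * hours_per_mo           # ~$27.5M per month
          print(per_hour, per_month, per_month * 0.45)  # roughly the 3-year-commit price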

  • adrian_b 8 days ago

    FP64 is more like 64 times harder than FP8.

    Actually the cost is even much higher, because the cost ratio is not much less than the square of the ratio between the sizes of the significands, which in this case is 52 bits / 4 bits = 13, and the square of 13 is 169.
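
    As a rough model (a sketch: an array multiplier needs on the order of W^2 partial-product cells for a W-bit significand; this ignores exponent handling, normalization, and the fact that real FPUs aren't plain array multipliers):

      fp8_sig  = 4      # e4m3: 3 stored mantissa bits + 1 hidden bit
      fp64_sig = 53     # 52 stored mantissa bits + 1 hidden bit

      ratio = (fp64_sig / fp8_sig) ** 2
      print(ratio)      # ~176x, the same ballpark as the ~169 above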

    • christkv 8 days ago

      Memory size and bandwidth goes up a lot right?

  • zipy124 8 days ago

    Because it is a public company that aims to maximise shareholder value and thus the value of its stock. Since value is largely evaluated by perception, if you can convince people your product is better than it is, your stock valuation, at least in the short term, will be higher.

    Hence Tesla saying FSD and robo-taxis are 1 year away, the fusion companies saying fusion is closer than it is etc....

    Nvidia, AMD, Apple and Intel have all been publishing misleading graphs for decades, and even under constant criticism they continue to.

    • fancyfredbot 8 days ago

      I understand the value of perception.

      A big part of my issue here is that they've really messed up the misleading benchmarks.

      They've failed to compare to the most obvious alternative, which is Nvidia GPUs. They look like they've got something to hide, not like they're ahead.

      They've needlessly made their own current products look bad in comparison to this one, understating the long-standing advantage TPUs have given Google.

      Then they've gone and produced a misleading comparison to the wrong product (who cares about El Capitan? I can't rent that!). This is a waste of credibility. If you are going to go with misleading benchmarks then at least compare to something people care about.

      • zipy124 7 days ago

        That's fair enough, I agree with all your points :)

  • segmondy 8 days ago

    Why not? If we line up to race, you can't say "why compare a V8 to a V6 turbo or an electric engine?" It's a race; the drivetrain doesn't matter. Who gets to the finish line first?

    No one is shopping for GPUs by fp8, fp16, fp32, fp64. It's all about the cost/performance factor. 8 bits is as good as 32 bits, and great performance is even being pulled out of 4 bits...

    • fancyfredbot 8 days ago

      This is like saying I'm faster because I ran (a mile) in 8 minutes whereas it took you 15 minutes (to run two miles).

      • scottlamb 7 days ago

        I think it's more like saying I ran a mile in 8 minutes whereas it took you 15 minutes to run the same distance, but you weigh twice what I do and also can squat 600 lbs. Like, that's impressive, but it's sure not helping your running time.

        Dropping the analogy: f64 multiplication is a lot harder than f8 multiplication, but for ML tasks it's just not needed. f8 multiplication hardware is the right tool for the job.

  • charcircuit 8 days ago

    >Why compare fp8 performance in ironwood to architectures which don't support fp8 in hardware?

    Because end users want to use fp8. Why should architectural differences matter when the speed is what matters at the end of the day?

    • bobim 8 days ago

      GP bikes are faster than dirt bikes, but not on dirt. The context has some influence here.

  • cheptsov 9 days ago

    I think it's not so much misleading as it makes the problems very clear: v7 is compared to v5e. Also, notice that it's not compared to competitors, and the price isn't mentioned. Finally, I think the much bigger issue with TPUs is the software and developer experience. Without improvements there, there's close to zero chance that anyone besides a few companies will use TPUs. It's barely viable if the trend continues.

    • mupuff1234 8 days ago

      > besides a few companies will use TPU. It’s barely viable if the trend continues

      That doesn't matter much if those few companies are the biggest companies. Even with Nvidia, the majority of the revenue is generated by a handful of hyperscalers.

    • sebzim4500 8 days ago

      >Without improvements there, there’s close to zero chance that anyone besides a few companies will use TPU. It’s barely viable if the trend continues.

      I wonder whether Google sees this as a problem. In a way it just means more AI compute capacity for Google.

    • latchkey 8 days ago

      The reference to El Capitan, is a competitor.

      • cheptsov 8 days ago

        Are you suggesting NVIDIA is not a competitor?

        • latchkey 8 days ago

          You said: "notice that it’s not compared to competitors"

          The article says: "When scaled to 9,216 chips per pod for a total of 42.5 Exaflops, Ironwood supports more than 24x the compute power of the world’s largest supercomputer – El Capitan – which offers just 1.7 Exaflops per pod."

          It is literally compared to a competitor.

          • cheptsov 8 days ago

            I believe my original sentence was accurate. I was expecting the article to provide an objective comparison between TPUs and their main competitors. If you’re suggesting that El Capitan is the primary competitor, I’m not sure I agree, but I appreciate the perspective. Perhaps I was looking for other competitors, which is why I didn’t really pay attention to El Capitan.

gigel82 8 days ago

I was hoping they'd launch a Coral-style device with updated specs that can run locally and cheaply.

It would be awesome for things like homelabs (to run Frigate NVR, Immich ML tasks or the Home Assistant LLM).

_hark 8 days ago

Can anyone comment on where efficiency gains come from these days at the arch level? I.e. not process-node improvements.

Are there a few big things, many small things...? I'm curious what fruit are left hanging for fast SIMD matrix multiplication.

  • vessenes 8 days ago

    One big area over the last two years has been algorithmic improvements feeding hardware improvements. Supercomputer folks use fp64 for everything, or did. Most training was done in fp32 four years ago. As algo teams have shown that fp8 can be used for training and inference, hardware has been updated to accommodate, yielding big gains.

    NB: Hobbyist, take all with a grain of salt

    • jmalicki 8 days ago

      Unlike a lot of supercomputer algorithms, where fp error accumulates as you go, gradient-descent-based algorithms don't need as much precision: any fp errors will still show up at the next loss calculation and get corrected, which lets you make do with much lower precision.

      • cubefox 8 days ago

        Much lower indeed. Even Boolean functions (e.g. AND) are differentiable (though not exactly in the Newton/Leibniz sense) which can be used for backpropagation. They allow for an optimizer similar to stochastic gradient descent. There is a paper on it: https://arxiv.org/abs/2405.16339

        It seems to me that floating point math (matrix multiplication) will over time mostly disappear from ML chips, as Boolean operations are much faster in both training and inference. But currently chips are still optimized for FP rather than Boolean operations.
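
        For a concrete sense of what "differentiable Boolean" means, here is one common relaxation (a sketch; not necessarily the exact construction in the linked paper), treating inputs as values in [0, 1]:

          # At the 0/1 corners these reproduce the exact truth tables; in between
          # they are smooth, so gradients can flow through them during backprop.
          def soft_and(a, b): return a * b          # d/da = b
          def soft_or(a, b):  return a + b - a * b
          def soft_not(a):    return 1.0 - a

          a, b = 0.5, 0.25
          print(soft_and(a, b), soft_or(a, b), soft_not(a))   # 0.125 0.625 0.5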

  • muxamilian 8 days ago

    In-memory computing (analog or digital). Still doing SIMD matrix multiplication but using more efficient hardware: https://arxiv.org/html/2401.14428v1 https://www.nature.com/articles/s41565-020-0655-z

  • yeahwhatever10 8 days ago

    Specialization. Ie specialized for inference.

tuna74 8 days ago

How is the API story for these devices? Are the drivers mainlined in Linux? Is there a specific API you use to code for them? What does the instance you rent on Google Cloud look like, and what software does it come with?

  • cbarrick 8 days ago

    XLA (Accelerated Linear Algebra) [1] is likely the library that you'll want to use to code for these machines.

    TensorFlow, PyTorch, and Jax all support XLA on the backend.

    [1]: https://openxla.org/
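
    A minimal JAX sketch of what that looks like in practice (assuming a TPU-enabled jaxlib is installed; on a Cloud TPU VM, jax.devices() lists TPU devices and jax.jit compiles through XLA to them, otherwise it falls back to CPU/GPU):

      import jax
      import jax.numpy as jnp

      @jax.jit                        # XLA-compiles for whatever backend is present
      def ffn(x, w1, w2):
          return jnp.maximum(x @ w1, 0.0) @ w2

      key = jax.random.PRNGKey(0)
      x  = jax.random.normal(key, (8, 512))
      w1 = jax.random.normal(key, (512, 2048))
      w2 = jax.random.normal(key, (2048, 512))

      print(ffn(x, w1, w2).shape)     # (8, 512)
      print(jax.devices())            # e.g. TPU devices on a TPU VM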

  • ndesaulniers 8 days ago

    They have out of tree drivers. If they don't ship the hardware to end users, it's not clear upstream (Linux kernel) would want them.

DisjointedHunt 8 days ago

Cloud resources are trending towards consumer technology adoption numbers rather than being reserved mostly for Enterprise. This is the most exciting thing in decades!

There is going to be a GPU/accelerator shortage for the foreseeable future for running the most advanced models; Gemini 2.5 Pro is such a good example. It is probably the first model on which many developers I'd considered skeptics of extended agent use have started to saturate free token thresholds.

Grok is honestly the same, but the lack of an API is suggestive of the massive demand wall they face.

lawlessone 8 days ago

Can these be repurposed for other things? Encoding/decoding video? Graphics processing etc?

edit: >It’s a move from responsive AI models that provide real-time information for people to interpret, to models that provide the proactive generation of insights and interpretation. This is what we call the “age of inference” where AI agents will proactively retrieve and generate data to collaboratively deliver insights and answers, not just data.

Maybe I will sound like a Luddite, but I'm not sure I want this.

I'd rather AI/ML only do what I ask it to.

sait007 8 days ago

Any idea on the system topology for their 9,216-chip TPU pods? Curious to know how things are arranged at the board, rack and pod level: CPU vs TPU ratio, TPU interconnect topology (switch vs mesh), etc.

GrumpyNl 8 days ago

Why doesn't Google use its most advanced voice technology when they offer a playback version? It still sounds like the most basic text-to-speech.

  • lostmsu 7 days ago

    Maybe it's good enough. In the end, what matters is how strong the text of the response is.

qoez 8 days ago

Post just to tease us since they barely sell TPUs

AlexCoventry 8 days ago

Who manufactures their TPUs? Wondering whether they're US-made, and therefore immune to the tariff craziness.

  • mupuff1234 8 days ago

    Pretty sure it's broadcom.

    • nickysielicki 8 days ago

      Broadcom is fabless…

      • mupuff1234 8 days ago

        Oops.

        • nsteel 8 days ago

          You were right in this context. They have previously partnered with Broadcom to make their TPUs, rather than working directly with TSMC. But I read it's actually MediaTek instead for this generation.

behnamoh 8 days ago

The naming of these chips (GPUs, CPUs) is kinda badass: Ironwood, Blackwell, ThreadRipper, Epyc, etc.

  • mikrl 8 days ago

    Scroll through wikichip sometime and try to figure out the Intel μarch names.

    I always confuse Blackwell with Bakewell (tart) and my CPU is on Coffee Lake and great… now I want coffee and cake

aranw 8 days ago

I wonder if these chips might contribute towards advancements for the Coral TPU chips?

  • mianos 8 days ago

    They were pretty much abandoned years ago. The software stack support from Google only really lasted a few months, and even then the stack ran on versions of operating systems and Python that were already years old.

    The only support is via a few enthusiastic third party developers.

DeathArrow 8 days ago

Cool. But does it support CUDA?

  • nickysielicki 8 days ago

    CUDAs are the secret ingredient in the AI scaling sauce.

  • saagarjha 8 days ago

    No. Google has a JAX stack.

attentive 8 days ago

Anyone know how this compares to AWS Inferentia chips?

wg0 8 days ago

Can anyone buy them?

ein0p 8 days ago

God damn it, Google. Make a desktop version of these things.

vessenes 8 days ago

7.2 TB/s of HBM bandwidth raised my eyebrows. But then I googled, and it looks like the GB200 is 16 TB/s. In plebe land, 2 TB/s is pretty awesome.

These continue to be mostly for bragging rights and strategic safety, I think. I bet they are not on premium process nodes; if I worked at GOOG I'd probably think of these as competitive insurance vis-a-vis NVIDIA: the total costs of the chip team, software, tape-outs, and increased data center energy use probably wipe out any savings from not buying NV, but you are 100% not beholden to Jensen.

QuadrupleA 8 days ago

"Age of inference", "age of Gemini", bleh. Google's doing some amazing work but I hate how they communicate - vague, self-serving, low signal-to-noise slop.

It's like they take some interesting wood carving of communication, then sand it down to a featureless nub.

budmichstelk 8 days ago

Had coffee with some of these people. Great to know they've come such a long way.

fluidcruft 8 days ago

This isn't anything anyone can purchase, is it? Who's the audience for this announcement?

  • avrionov 8 days ago

    The audience is Google cloud customers + investors

  • llm_nerd 8 days ago

    The overwhelming majority of AI compute is used either by the few bigs in their own products, or by third parties that rent access to compute resources from those same bigs. Extremely few AI companies are buying their own GPU/TPU buildouts.

    Google says Ironwood will be available in the Google Cloud late this year, so it's relevant to just about anyone that rents AI compute, which is just about everyone in tech. Even if you have zero interest in this product, it will likely lead to downward pressure on pricing, mostly courtesy of the large memory allocations.

    • fluidcruft 8 days ago

      It just seems like John Deere putting out a press-release about about a new sparkplug that is only useful to John Deere and can maybe be used on rented John Deere harvesters when sharecropping on John Deere-owned fields using John Deere GMO crops. I just don't see what's appealing about any of it. Not only is it a walled garden, you can't even own anything and are completely dependent on the whims of John Deere to not bulldoze the entire field.

      It just seems like if you build on Tensor then sure, you can go home, but Google will keep your ball.

      • aseipp 8 days ago

        The reality is that for large-scale AI deployment there's only one criterion that matters: total cost of ownership. If TPUs are 1/30th the total perf but 1/50th the total price, then they will be bought by customers. It's basically that simple.
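
        Put as perf-per-dollar, that hypothetical works out to:

          tpu_perf, tpu_price = 1 / 30, 1 / 50    # relative to a GPU baseline of 1.0
          gpu_perf, gpu_price = 1.0, 1.0

          print((tpu_perf / tpu_price) / (gpu_perf / gpu_price))   # ~1.67x perf per dollar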

        Most places using AI hardware don't actually want to expend massive amounts of capital to procure it, shove it into racks somewhere, and then manage it over its total lifetime. Hyperscalers like Google are also far, far ahead in things like DC energy efficiency, and at really large scale those energy costs are huge and have to be factored into the TCO. The dominant cost of this stuff over its lifetime is operational expenditure. Anyone running a physical AI cluster is going to have to consider this.

        The walled garden stuff doesn't matter, because places demanding large-scale AI deployments (and actually willing to spend money on it) do not really have the same priorities as HN homelabbers who want to install inefficient 5090s so they can run Ollama.

        • fluidcruft 8 days ago

          At large scale, why shouldn't it matter whether you're beholden to Google's cloud only, versus having the option to use AWS or Oracle or Azure, etc.? There's maybe an argument to be made about the price and efficiency of Google's data centers, but Google's cloud is far from notably cheaper than the alternatives (to put it mildly), so if there are any efficiencies to be had, Google is pocketing them itself and that's a moot point. I just don't see why anyone should care about this chip except Google themselves. It would be a different story if we were talking about a chip that had the option of being available in non-Google data centers.

  • badlucklottery 8 days ago

    > Who's the audience for this announcement?

    Probably whales who can afford to rent one from Google Cloud.

    • MasterScrat 8 days ago

      An on-demand v5e-1 is $1.2/h, it's pretty accessible.

      The challenge is getting them to run efficiently, which typically involves learning JAX.

    • jeffbee 8 days ago

      People with $3 are whales now? TPU prices are similar to other cloud resources.

      • dylan604 8 days ago

        Does anyone do anything useful with a $3 spend, or is it $3 X $manyManyHours?

        • scarmig 8 days ago

          No one does anything useful with a $3 spend. That's not anything particular to TPUs, though.

          • dylan604 8 days ago

            That's my point. The touting of $3 is beyond misleading.

            • fancyfredbot 8 days ago

              You can do real work for a few hundred dollars which is hardly the exclusive domain of "whales"?

              The programmer who writes code to run on these likely costs at least 15x this amount an hour.

  • xhkkffbf 8 days ago

    People who buy their stock.

g42gregory 8 days ago

And ... where could we get one? If they won't sell it to anyone, then is this a self-congratulation story? Why do we even need to know about this? If it propagates to lower Gemini prices, fantastic. If not, then isn't it kind of irrelevant to the actual user experience?

  • lordofgibbons 8 days ago

    You can rent it on GCP in a few months

    • g42gregory 8 days ago

      Good point. At what price per GB/TOPS? It had better be lower than the existing TPUs ... That's what I care about.

  • jstummbillig 8 days ago

    Well, with stocks and all, there is more that matters in the world than "actual user experience"