In my experience LLMs have a hard time working with text grids like this. They seem to find columns harder to “detect” than rows, probably because the input reaches the model as one giant row, if that makes sense.
It has the same problem with playing chess. But I’m not sure if there is a datatype it could work with for this kinda game. Currently it seems more like LLMs can’t really work on spatial problems. But this should actually be something that can be fixed (pretty sure I saw an article about it on HN recently).
Transformers can easily be trained or designed to handle grids; it's just that off-the-shelf LLMs haven't been specifically trained for it (although they would have seen some grids in their training data).
Are there some well-known examples of success in it?
Vision transformers effectively encode a grid of pixel patches. It’s ultimately a matter of ensuring the position encoding incorporates both X and Y position.
For LLMs we only have one axis of position and, more importantly, the vast majority of training data is only oriented this way.
Good point. The architectural solution that would come to mind is 2D text embeddings, i.e. we add 2 sines and cosines to each token embedding instead of 1. Apparently people have done it before: https://arxiv.org/abs/2409.19700v2
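To make the "2 sines and cosines" idea concrete, here is a minimal sketch. It assumes the simplest possible variant: half of the channels encode the row and half encode the column, concatenated. The function names and the choice to concatenate rather than sum are my own assumptions, not necessarily what the linked paper does.

```python
import math

def sinusoid_1d(pos, dim, base=10000.0):
    """Standard sinusoidal encoding of one integer position into `dim` values (dim must be even)."""
    out = []
    for i in range(0, dim, 2):
        freq = 1.0 / (base ** (i / dim))
        out.append(math.sin(pos * freq))
        out.append(math.cos(pos * freq))
    return out

def sinusoid_2d(row, col, dim):
    """Hypothetical 2D variant: first half of the channels encodes the row, second half the column.

    `dim` must be a multiple of 4 so that each half is even.
    """
    half = dim // 2
    return sinusoid_1d(row, half) + sinusoid_1d(col, half)

enc_a = sinusoid_2d(row=2, col=5, dim=16)
enc_b = sinusoid_2d(row=3, col=5, dim=16)  # the cell directly below enc_a

# Vertically adjacent cells share the entire column half of the encoding,
# so "same column" becomes something attention can match on directly.
same_column_part = enc_a[8:] == enc_b[8:]
```

With a plain 1D encoding the model would instead have to learn that "same column" means "index differs by a multiple of the row width".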
I think I remember one of the original ViT papers saying something about 2D embeddings on image patches not actually increasing performance on image recognition or segmentation, so it’s kind of interesting that it helps with text!
E: I found the paper: https://arxiv.org/pdf/2010.11929
> We use standard learnable 1D position embeddings, since we have not observed significant performance gains from using more advanced 2D-aware position embeddings (Appendix D.4).
Although it looks like that was just ImageNet so maybe this isn't that surprising.
They seem to have used a fixed input resolution for each model, so the learnable 1D position embeddings are equivalent to learnable 2D position embeddings where every grid position gets its own embedding. It's when different images may have a different number of tokens per row that the correspondence between 1D index and 2D position gets broken and a 2D-aware position embedding can be expected to produce different results.
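A tiny sketch of that point, with made-up widths: when the row width is fixed, a 1D index determines the 2D cell exactly, so a learned per-index embedding is effectively a per-cell embedding. Change the width and the same index lands somewhere else.

```python
def index_to_cell(index, width):
    """Map a flattened (row-major) token index to (row, col) for a grid of the given width."""
    return divmod(index, width)

def cell_below(index, width):
    """Index of the token directly below `index` in a row-major grid."""
    return index + width

# Hypothetical fixed grid that is 14 tokens wide: index 30 is always row 2, column 2.
fixed = index_to_cell(30, width=14)

# Same index in a grid that is 10 tokens wide: now it is row 3, column 0.
other = index_to_cell(30, width=10)
```

So a 1D embedding learned at one width carries the wrong 2D meaning at another, which is where a 2D-aware scheme should start to matter.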
If this were a limitation in the architecture, they wouldn't be able to work with images, no?
LLMs don’t work with images.
They do, though.
Do they? I thought it was completely different models that did image generation.
LLMs might be used to translate requests into keywords, but I didn’t think LLMs themselves did any of the image generation.
Am I wrong here?
Yes, that's why ChatGPT can look at an image and change the style, or edit things in the image. The image itself is converted to tokens and passed to the LLM.
LLMs can be used as an agent to do all sorts of clever things, but it doesn’t mean the LLM is actually handling the original data format.
I’ve created MCP servers that can scrape websites but that doesn’t mean the LLM itself can make HTTP calls.
The reason I make this distinction is that someone claimed that LLMs can read images. But they don’t. They act as an agent for another model that reads images and creates metadata from them. LLMs then turn that metadata into natural language.
The LLM itself doesn’t see any pixels. It sees textual information that another model has provided.
Edit: reading more about this online, it seems LLMs can work with pixel level data. I had no idea that was possible.
My apologies.
No problem. Again, if it happened the way you described (which it did, until GPT-4o recently), the LLM wouldn't have been able to edit images. You can't get a textual description of an image and reconstruct it perfectly just from that, with one part edited.
> This is why the video of Claude solving level 1 at the top was actually (dramatic musical cue) staged, and only possible via a move-for-move tutorial that Claude nicely rationalized post hoc.
One of the things this arc of history has taught me is that post-hoc rationalization is depressingly easy. Especially if it doesn't have to make sense, but even passing basic logical checks isn't too difficult. Ripping the rationalization apart often requires identifying novel, non-obvious logical checks.
I thought I had learned that time and time again from human politics, but AI somehow made it even clearer than I thought possible. Perhaps simply because of knowing that a machine is doing it.
Edit: after watching the video more carefully:
> "This forms WALL IS WIN horizontally. But I need "FLAG IS WIN" instead. Let me check if walls now have the WIN property. If they do, I just need to touch a wall to win. Let me try moving to a wall:
There's something extremely uncanny-valley about this. A human player absolutely would accidentally win like this, and have similar reasoning (not expressed so formally) about how the win was achieved after the fact. (Winning depends on the walls having WIN and also not having STOP; many players get stuck on later levels, even after having supposedly learned the lesson of this one, by trying to make something WIN and walk onto it while it is still STOP.)
But the WIN block was not originally in line with the WALL IS text, so a human player would never accidentally form the rule, but would only do it with the expectation of being able to win that way. Especially since there was already an obvious, clear path to FLAG — a level like this has no Sokoban puzzle element to it; it's purely about learning that the walls only block the player because they are STOP.
Nor would (from my experience watching streamers at least) a human spontaneously notice that the rule "WALL IS WIN" had been formed and treat that as a cue to reconsider the entire strategy. The natural human response to unintentionally forming a useful rule is to keep pushing in the same direction.
On the other hand, an actually dedicated AI system (in the way that AlphaGo was dedicated to Go) could, I'm sure, figure out a game like Baba Is You pretty easily. It would lack the human instinct to treat the walls as if they were implicitly always STOP; so it would never struggle with overriding it.
A simple feed-forward neural network with sufficient training can solve levels way better than Claude. Why is Claude being used at all?
The question isn't "can we write a computer program that can beat X game", it is "do things like claude represent a truly general purpose intelligence as demonstrated by its ability to both write a limerick and play baba is you"
This is interesting. If you approach this game as individual moves, the search tree is really deep. However, most levels can be expressed as a few intermediate goals.
In some ways, this reminds me of the history of AI Go (board game). But the resolution there was MCTS, which wasn't at all what we wanted (insofar as MCTS is not generalizable to most things).
> However, most levels can be expressed as a few intermediate goals
I think generally the whole thing with puzzle games is that you have to determine the “right” intermediate goals. In fact, the naive intermediate goals are often entirely wrong!
A canonical sokoban-like inversion might be where you have to push two blocks into goal areas. You might think “ok, push one block into its goal area and then push another into it.”
But many of these games will have mechanisms meaning you would first want to push one block into its goal, then undo that for some reason (it might activate some extra functionality), push the other block, and then finally go back and do the thing.
There’s always weird tricks that mean that you’re going to walk backwards before walking forwards. I don’t think it’s impossible for these things to stumble into it, though. Just might spin a lot of cycles to get there (humans do too I guess)
Yeah, often working backwards and forwards at the same time is how to solve some advanced puzzle games. Then you keep it from exploding in options. When thinking backwards from the goal, you figure out constraints or "invariants" the forward path must uphold, thus can discard lots of dead ends earlier in your forward path.
To me, those discoveries are the fun part of most puzzle games. When you unlock the "trick" for each level and the dopamine flies, heh.
I usually get good mileage out of jumping straight into the middle :). Like, "hmm let's look at this block; oh cool, there's enough space around it that I could push it away from the goal, for whatever reason". Turns out, if it's possible there usually is a good reason. So whenever I get stuck, I skim every object in the puzzle and consider in isolation what I can do with it, and this usually gives me anchor points to drive my forward or backward thinking through.
> But the resolution there was MCTS
MCTS wasn't _really_ the solution to Go. MCTS-based AIs existed for years and they weren't _that_ good. They weren't superhuman for sure, and the moves/games they played were kind of boring.
The key to doing Go well was doing something that vaguely looks like MCTS, but the real guts are a network that can answer "who's winning?" and "what are good moves to try here?", and using that to guide search. Additionally essential was realizing that computation (run search for a while) with a bad model could be effectively+efficiently used to generate better training data to train a better model.
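Roughly what "using that to guide search" looks like at a single tree node, as a sketch. This is a PUCT-style selection rule in the spirit of the AlphaGo family; the constant `c_puct`, the dict layout, and the numbers are made up for illustration.

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Value estimate plus an exploration bonus weighted by the policy network's prior."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

def select_move(children, c_puct=1.5):
    """children maps move -> {"q": mean value, "prior": policy prob, "visits": visit count}."""
    parent_visits = sum(child["visits"] for child in children.values())
    return max(
        children,
        key=lambda m: puct_score(
            children[m]["q"], children[m]["prior"], parent_visits, children[m]["visits"], c_puct
        ),
    )

# Hypothetical statistics for three candidate moves at one node.
children = {
    "a": {"q": 0.1, "prior": 0.6, "visits": 10},
    "b": {"q": 0.3, "prior": 0.1, "visits": 2},
    "c": {"q": 0.0, "prior": 0.3, "visits": 0},
}
chosen = select_move(children)
```

The "who's winning?" answer feeds `q` and the "good moves to try" answer feeds `prior`; plain MCTS has neither and falls back on random rollouts and uniform exploration.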
> Additionally essential was realizing that computation (run search for a while) with a bad model could be effectively+efficiently used to generate better training data to train a better model.
That has been known since at least the 1990s with TD-Gammon beating the world champions in Backgammon. See eg http://incompleteideas.net/book/ebook/node108.html or https://en.wikipedia.org/wiki/TD-Gammon
In a sense, classic chess engines do that, too: alpha-beta-search uses a very weak model (eg just checking for checkmate, otherwise counting material, or what have you) and search to generate a much stronger player. You can use that to generate data for training a better model.
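A toy version of that, assuming a hand-made game tree whose leaves are already scored by the "very weak model" (think material count). The tree and the numbers are invented; the search is plain alpha-beta.

```python
def alphabeta(node, alpha, beta, maximizing, evaluate):
    """Minimax with alpha-beta pruning over a tree of nested lists; non-list nodes are leaves."""
    if not isinstance(node, list):
        return evaluate(node)
    if maximizing:
        best = float("-inf")
        for child in node:
            best = max(best, alphabeta(child, alpha, beta, False, evaluate))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # the opponent would never allow this line
        return best
    best = float("inf")
    for child in node:
        best = min(best, alphabeta(child, alpha, beta, True, evaluate))
        beta = min(beta, best)
        if alpha >= beta:
            break  # we already have a better option elsewhere
    return best

evaluated = []

def weak_eval(leaf):
    """Stand-in for a crude static evaluation; here the leaf already is its score."""
    evaluated.append(leaf)
    return leaf

tree = [[3, 5], [2, 9], [0, 1]]  # root is our move, second level is the opponent's reply
value = alphabeta(tree, float("-inf"), float("inf"), True, weak_eval)
```

The searched `value` is a better estimate of the root position than the weak evaluation alone, which is what makes (position, searched value) pairs usable as training data for a stronger evaluator.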
“Reasoning models like o3 might be better equipped to come up with a plan, so a natural step would be to try switching to those, away from Claude Desktop…”
But…Claude Desktop does have a reasoning mode for both Sonnet and Opus.
Do you think the performance can be improved if the representation of the level is different?
I've seen AI struggle with ASCII, but when presented as other data structures, it performs better.
edit:
e.g. JSON with structured coordinates, graph based JSON, or a semantic representation with the coordinates
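For example, something like this turns an ASCII level into a list of entities with explicit coordinates. The level, the legend, and the JSON field names are all made up for illustration; this is not the format used in the article.

```python
import json

def grid_to_entities(rows, legend):
    """Convert ASCII rows into a dict with grid size and one entry per non-empty cell."""
    entities = []
    for y, row in enumerate(rows):
        for x, ch in enumerate(row):
            if ch in legend:
                entities.append({"type": legend[ch], "x": x, "y": y})
    return {"width": len(rows[0]), "height": len(rows), "entities": entities}

# Hypothetical level: '#' wall, 'B' Baba, 'F' flag, '.' empty.
rows = [
    "#####",
    "#B.F#",
    "#####",
]
legend = {"#": "wall", "B": "baba", "F": "flag"}

level = grid_to_entities(rows, legend)
as_json = json.dumps(level, indent=2)
```

The point is that "Baba and the flag are in the same row, two cells apart" is now readable from `x`/`y` values instead of having to be inferred from character offsets.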
In the limit case, to an actual general intelligence, representation is superfluous, because it can figure out how to convert freely.
To the extent that the current generation of AI isn't general, yeah, papering over some of its weaknesses may allow you to expose other parts of it, both strengths and other weaknesses.
A human can easily struggle at solving a poorly communicated puzzle, especially if paper/pencil or something isn't available to convert to a better format. LLMs can look back at what they wrote, but it seems kind of like a poor format for working out a better representation to me.
If it struggles with the representation, that makes it an even better test of the AI's thinking potential.
I'm not sure. Adding superficial difficulties to an IQ test for humans doesn't (necessarily) improve it as an IQ test.
These models can “code,” but they can’t code yet. We’ll know that they can actually code once their performance on these tasks becomes invariant to input representation, because they can just whip up a script to convert representations.
I think it’s a great idea for a benchmark.
One key difference to ARC in its current iteration is that there is a defined and learnable game physics.
ARC requires generalization based on few examples for problems that are not well defined per se.
Hence ARC currently requires the models that work on it to possess biases that are comparable to the ones that humans possess.
I have noticed a trend of the word "Desiderata" appearing in a lot more writing. Is this an LLM word or is it just in fashion? Most people would use the words "Desires" or "Goals," so I assume this might be the new "delve."
It's academic jargon. Desiderata are often at the end of a paper, in the section "someone should investigate X, but I'm moving on to the next funded project".
So "Future Work"?
Literally it means “things that we wish for”, from the Latin verb “desiderare” (to wish).
At least in this instance, it came from my fleshy human brain. Although I perhaps used it to come off as smarter than I really am - just like an LLM might.
I once made a “RC plays Baba Is You” that controlled the game over a single shared browser that was streaming video and controls back to the game. Was quite fun!
But I am fairly sure all of Baba Is You solutions are present in the training data for modern LLMs so it won’t make for a good eval.
> But I am fairly sure all of Baba Is You solutions are present in the training data for modern LLMs so it won’t make for a good eval.
Claude 4 cannot solve any Baba Is You level (except level 0 that is solved by 8 right inputs), so for now it's at least a nice low bar to shoot for...
Baba Is You is a great game, part of a family of 2D grid puzzle games.
(Shameless plug: I am one of the developers of Thinky.gg (https://thinky.gg), which is a thinky puzzle game site for a 'shortest path style' game [Pathology] and a Sokoban variant [Sokopath].)
These games are typically NP-hard, so solvers for Sokoban (or Pathology) have typically relied on brute-force search with varying heuristics (like BFS, deadlock detection, and Zobrist hashing). However, once levels get beyond a certain size with enough movable blocks, you end up exhausting memory pretty quickly.
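For anyone who hasn't seen Zobrist hashing: every (cell, piece kind) pair gets a fixed random 64-bit number, and a state's hash is the XOR of the numbers for everything on the board. Moving one piece is then two XORs instead of rehashing the whole state. A minimal sketch, with a made-up board size and piece kinds:

```python
import random

def make_table(width, height, kinds, seed=0):
    """One fixed random 64-bit key per (x, y, kind) combination."""
    rng = random.Random(seed)
    return {
        (x, y, kind): rng.getrandbits(64)
        for x in range(width)
        for y in range(height)
        for kind in kinds
    }

def full_hash(table, pieces):
    """Hash a whole state: XOR the keys of every (x, y, kind) piece in it."""
    h = 0
    for piece in pieces:
        h ^= table[piece]
    return h

def move_piece(table, h, kind, src, dst):
    """Incrementally update hash h when one piece of `kind` moves from src to dst."""
    return h ^ table[(src[0], src[1], kind)] ^ table[(dst[0], dst[1], kind)]

table = make_table(width=3, height=4, kinds=["player", "block"])
before = {(1, 3, "player"), (1, 1, "block"), (1, 2, "block")}
after = {(1, 3, "player"), (2, 1, "block"), (1, 2, "block")}

h_before = full_hash(table, before)
h_after = move_piece(table, h_before, "block", (1, 1), (2, 1))
```

Storing only these 64-bit hashes in the visited set (and accepting a tiny collision risk) is one common way to push the memory wall back a bit.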
These types of games are still "AI Proof" so far, in that LLMs are absolutely awful at solving them while humans are very good (so it seems reasonable to consider them for ARC-AGI benchmarks). Whenever a new reasoning model gets released I typically try it on some basic Pathology levels (like 'One at a Time' https://pathology.thinky.gg/level/ybbun/one-at-a-time) and they fail miserably.
Simple level code for the above level (1 is a wall, 2 is a movable block, 4 is starting block, 3 is the exit):
000
020
023
041
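For comparison, a brute-force BFS over (player position, block positions) handles a level this size instantly. This sketch assumes plain Sokoban-style rules: four-directional moves, the player can push exactly one block at a time into a free cell, and the level is solved when the player reaches the exit. The real game may have rules this ignores.

```python
from collections import deque

LEVEL = ["000", "020", "023", "041"]

def parse(level):
    """Split the level code into walls, blocks, start and exit, as (row, col) cells."""
    walls, blocks, start, goal = set(), set(), None, None
    for r, row in enumerate(level):
        for c, ch in enumerate(row):
            if ch == "1":
                walls.add((r, c))
            elif ch == "2":
                blocks.add((r, c))
            elif ch == "3":
                goal = (r, c)
            elif ch == "4":
                start = (r, c)
    return walls, frozenset(blocks), start, goal

def shortest_solution(level):
    """Return a shortest move string (U/D/L/R) under the assumed rules, or None."""
    walls, blocks, start, goal = parse(level)
    height, width = len(level), len(level[0])

    def free(cell, current_blocks):
        r, c = cell
        in_bounds = 0 <= r < height and 0 <= c < width
        return in_bounds and cell not in walls and cell not in current_blocks

    moves = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
    queue = deque([(start, blocks, "")])
    seen = {(start, blocks)}
    while queue:
        pos, cur, path = queue.popleft()
        if pos == goal:
            return path
        for name, (dr, dc) in moves.items():
            nxt = (pos[0] + dr, pos[1] + dc)
            new_blocks = cur
            if nxt in cur:
                # Pushing: the cell behind the block must be free (one block at a time).
                beyond = (nxt[0] + dr, nxt[1] + dc)
                if not free(beyond, cur):
                    continue
                new_blocks = (cur - {nxt}) | {beyond}
            elif not free(nxt, cur):
                continue
            state = (nxt, new_blocks)
            if state not in seen:
                seen.add(state)
                queue.append((nxt, new_blocks, path + name))
    return None

solution = shortest_solution(LEVEL)
print(solution)
```

The `seen` set is exactly what blows up on bigger levels, since the number of block arrangements grows combinatorially.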
Similar to OP, I've found Claude can't manage rule dynamics, blocked paths, or game objectives well, and it spits out random results.
NP hard isn't much of a problem, because the levels are fairly small, and instances are not chosen to be worst case hard but to be entertaining for humans to solve.
SMT/SAT solvers or integer linear programming can get you pretty far. Many classic puzzle games like Minesweeper are NP hard, and you can solve any instance that a human would be able to solve in their lifetime fairly quickly on a computer.
In Factorio's paper [1] page 3, the agent receives a semantic representation with coordinates. Have you tried this data format?
[1]: https://arxiv.org/pdf/2503.09617
It reminds me of https://en.m.wikipedia.org/wiki/The_Ricks_Must_Be_Crazy. Hope we are not ourselves in some sort of simulation ;)
There are numerous guides for all levels of Baba Is You available. I think it's likely that any modern LLM has them as part of its training dataset. That severely degrades this as a test for complex solution capabilities.
Still, it's interesting to see the challenges with dynamic rules (like "Key is Stop") that change where you are able to move, etc.
The dynamic rule changes are precisely what make this a valuable benchmark despite available guides. Each rule modification creates a novel state-space that requires reasoning about the consequences of those changes, not just memorizing solution paths.
Read the article first maybe
This is definitely a case for fine-tuning an LLM on this game's data. There is currently no LLM out there that can play many different kinds of games very well.
I would be way more interested in it playing niche community levels, because I suspect a huge reason it's able to solve these levels is because it was trained on a million Baba is You walkthroughs. Same with people using Pokemon as a way to test LLMs, it really just depends on how well it knows the game.
Two corrections, as written in the post: Claude, at least, is not able to solve the standard levels at all, and community levels are definitely in scope.
I suspect real AGI evals aren't going to be "IQ test"-like which is how I'd categorize these benchmarks.
LLMs will probably continue to scale on such benchmarks, as they have been, without needing real ingenuity or intelligence.
Obviously I don't know the answer but I think it's the same root problem as why neural networks will never lead to intelligence. We're building and testing idiot savants.