2084: Text to Video Games, How LLMs Think, and the Endless Possibilities of Visual Encoding
Thoughts on Opus AI, Jarvis, the Othello transformer, and CLIP.
I was reading a post on Reddit about a platform called opus.ai. It seems to be a composition of a bunch of different models, with the combined effect that, from a few paragraphs of text description, you can create a whole video game level, fully textured and populated with unique NPCs. This is mind-blowing. It works in the same spirit as HuggingGPT (also called Jarvis), another system that came out recently, which uses an LLM as a reasoning model to call and compose other models to perform whatever task you want done. Both are what I'd call concert models: a model that serves as the reasoning engine for a whole system with various capabilities.
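To make the conductor idea concrete, here is a minimal sketch of that kind of loop. It's in the spirit of HuggingGPT/Jarvis but is not their actual code; call_llm and MODEL_ZOO are hypothetical placeholders for whatever LLM and specialist models you'd actually wire in.

```python
import json

# Hypothetical registry of specialist models: placeholders, not real APIs.
MODEL_ZOO = {
    "text-to-image": lambda prompt: f"<image generated from: {prompt}>",
    "image-captioning": lambda image: f"<caption of {image}>",
    "text-to-speech": lambda text: f"<audio of: {text}>",
}

def call_llm(prompt: str) -> str:
    """Stand-in for a call to whatever LLM plays the conductor role."""
    raise NotImplementedError("plug in your LLM of choice here")

def run_request(user_request: str) -> str:
    # 1. Planning: ask the LLM to decompose the request into tool calls.
    plan_prompt = (
        "Decompose the request into a JSON list of steps, each with a "
        f"'tool' field (one of {list(MODEL_ZOO)}) and an 'input' field.\n"
        f"Request: {user_request}"
    )
    plan = json.loads(call_llm(plan_prompt))

    # 2. Execution: run each step through the specialist model it names.
    results = [MODEL_ZOO[step["tool"]](step["input"]) for step in plan]

    # 3. Response: ask the LLM to stitch the results into a final answer.
    return call_llm(f"Summarise these results for the user: {results}")
```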
I think that, beyond supercharging video games, this idea of a concert model is going to be revolutionary. After all, how many points of interface do most people have with the things they interact with? A car has only about seven controls, a keyboard around fifty; most technological interfaces have a small number of controls, manipulated by a sophisticated reasoning machine, i.e. a human. So look around you and try to think of all the jobs with a simple interface that essentially rely on a reasoning mind behind it. Security guard, truck driver, and desk worker are all of this type, and all are in danger of being automated. It would be the height of irony if the real self-driving car turned out to be just a camera tied to a really smart multimodal LLM with about seven outputs. Screw Waymo and Tesla's twenty years of development.
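As a toy illustration of the camera-plus-LLM idea (purely hypothetical: the action set is made up and query_multimodal_llm is a stand-in, not any real driving stack), the whole "interface" could be as small as this:

```python
from enum import Enum

# A deliberately tiny control surface: roughly the "about seven outputs" of a car.
class Control(Enum):
    STEER_LEFT = "steer_left"
    STEER_RIGHT = "steer_right"
    ACCELERATE = "accelerate"
    BRAKE = "brake"
    SIGNAL_LEFT = "signal_left"
    SIGNAL_RIGHT = "signal_right"
    HORN = "horn"

def query_multimodal_llm(camera_frame: bytes, prompt: str) -> str:
    """Stand-in for a multimodal LLM call; not a real API."""
    raise NotImplementedError

def decide(camera_frame: bytes) -> Control:
    # The reasoning engine looks at the camera frame and must answer with
    # exactly one of the allowed control names.
    answer = query_multimodal_llm(
        camera_frame,
        "You are driving. Reply with exactly one of: "
        + ", ".join(c.value for c in Control),
    )
    return Control(answer.strip())
```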
Of course, visual encoding is hard. I tried to look up how GPT-4 does it, but not much is available. I was wondering, though, whether you could "finetune" a GPT model on CLIP outputs to get something that vaguely understands images. With a big enough dataset you could probably realise the dream of a visually understanding LLM, after which its use for control systems would be immediate, obvious, and very, very exciting.
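I haven't tried this, but the rough shape might be: freeze a CLIP image encoder, learn a small projection from its embedding into the LLM's token-embedding space, and train on captioned images. A minimal sketch, assuming CLIP ViT-B/32-sized embeddings (512-d) and a GPT-2-sized LM (768-d); both numbers are just plausible defaults, not a claim about any particular model:

```python
import torch
import torch.nn as nn

class ClipPrefix(nn.Module):
    """Map a frozen CLIP image embedding to a short 'prefix' of pseudo-token
    embeddings that a GPT-style language model can attend to."""

    def __init__(self, clip_dim: int = 512, lm_dim: int = 768, prefix_len: int = 10):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, lm_dim * prefix_len),
            nn.Tanh(),
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_len, lm_dim)
        return self.proj(clip_embedding).view(-1, self.prefix_len, self.lm_dim)

# Usage sketch: prepend the projected prefix to the caption's token embeddings
# and train with the usual next-token loss on the caption, keeping CLIP frozen.
prefix = ClipPrefix()(torch.randn(4, 512))  # -> shape (4, 10, 768)
```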
LLMs themselves are fascinating, though. I read an interesting paper, the Othello transformer one, about how they create a world model inside themselves to represent the abstract encodings in their data. The paper tells the story well, and I'll quote its parable in full:
A thought experiment
Consider the following thought experiment. Imagine you have a friend who enjoys the board game Othello, and often comes to your house to play. The two of you take the competition seriously and are silent during the game except to call out each move as you make it, using standard Othello notation. Now imagine that there is a crow perching outside of an open window, out of view of the Othello board. After many visits from your friend, the crow starts calling out moves of its own—and to your surprise, those moves are almost always legal given the current board.
You naturally wonder how the crow does this. Is it producing legal moves by "haphazardly stitching together" [3] superficial statistics, such as which openings are common or the fact that the names of corner squares will be called out later in the game? Or is it somehow tracking and using the state of play, even though it has never seen the board? It seems like there's no way to tell.
But one day, while cleaning the windowsill where the crow sits, you notice a grid-like arrangement of two kinds of birdseed--and it looks remarkably like the configuration of the last Othello game you played. The next time your friend comes over, the two of you look at the windowsill during a game. Sure enough, the seeds show your current position, and the crow is nudging one more seed with its beak to reflect the move you just made. Then it starts looking over the seeds, paying special attention to parts of the grid that might determine the legality of the next move. Your friend, a prankster, decides to try a trick: distracting the crow and rearranging some of the seeds to a new position. When the crow looks back at the board, it cocks its head and announces a move, one that is only legal in the new, rearranged position.
At this point, it seems fair to conclude the crow is relying on more than surface statistics. It evidently has formed a model of the game it has been hearing about, one that humans can understand and even use to steer the crow's behavior. Of course, there's a lot the crow may be missing: what makes a good move, what it means to play a game, that winning makes you happy, that you once made bad moves on purpose to cheer up your friend, and so on. We make no comment on whether the crow “understands” what it hears or is in any sense “intelligent”. We can say, however, that it has developed an interpretable (compared to in the crow’s head) and controllable (can be changed with purpose) representation of the game state.
Of course, the crow in this parable is a transformer model. The point is that these are not simply prediction engines; they are abstraction engines, which suggests a higher type of logic. All of this is super exciting for what it says about the future, and the world-model paper especially shows there is a lot more power in these models than we thought: the authors trained small probes on the network's hidden activations to read off (and even edit) the board state, and a rough sketch of that kind of probe is below. It'll be interesting, and the future is going to be weird. I might just write a blog post on that soon. Can't wait for 2084.
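To give a flavour of what that probing looks like in code, here is a toy sketch. It is not the paper's implementation (if I remember right, their probes were small nonlinear networks), and the activation shapes below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Toy probe in the spirit of the Othello world-model work: given a hidden
# activation from the sequence model, predict the state of all 64 squares.
# All shapes here are assumptions, not the paper's.
HIDDEN_DIM = 512   # dimensionality of the transformer's residual stream
N_SQUARES = 64
N_STATES = 3       # empty / current player's disc / opponent's disc

probe = nn.Linear(HIDDEN_DIM, N_SQUARES * N_STATES)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(activations: torch.Tensor, board_labels: torch.Tensor) -> float:
    """activations: (batch, HIDDEN_DIM) hidden states after some move;
    board_labels: (batch, N_SQUARES) integer square states in {0, 1, 2}."""
    logits = probe(activations).view(-1, N_SQUARES, N_STATES)
    loss = loss_fn(logits.reshape(-1, N_STATES), board_labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# If the probe recovers the board far better than chance, the model has a
# board "in its head"; nudging activations along the probe's directions is
# how you steer its next move, like rearranging the crow's birdseed.
```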