Discussed a fascinating idea for a foundation model tool at lunch today: interactive navigation in embedding space.
Right now, you prompt most generative models with human language. That works, but it’s imprecise and coarse. If you’re generating an image of an outside scene, and you want the sunlight ever so slightly brighter, you could add the phrase “and ever so slightly brighter” to the prompt, and it might work, but it’s clunky, and not great, and clearly doesn’t scale. Maybe good enough for recreational use cases; clearly not professional grade.
What you really want to do is move your target in embedding space directly, without the lossy indirection of going through human language. Ideally, you’d have a dial that mapped directly to sunlight luminance, and you could bump it up just a bit. Similar to temperature for LLMs, but for fine control over direction and distance in high-dimensional embedding space, as opposed to overall stochasticity.
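A minimal sketch of what a single dial could look like, assuming you have a latent z for the current image and some text encoder to derive a direction from; embed_text and the way the axis is built are illustrative stand-ins, not any particular model’s API:

```python
import numpy as np

def direction(embed_text, positive: str, negative: str) -> np.ndarray:
    """A crude axis: the normalized difference of two text embeddings."""
    d = embed_text(positive) - embed_text(negative)
    return d / np.linalg.norm(d)

def turn_dial(z: np.ndarray, axis: np.ndarray, amount: float) -> np.ndarray:
    """Nudge the latent a small, controlled distance along the axis."""
    return z + amount * axis

# e.g. sunlight = direction(embed_text, "bright sunlight", "dim light")
#      z = turn_dial(z, sunlight, 0.1)   # ever so slightly brighter
```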
Imagine a big mixing board at a professional music studio. You generate an initial image as a starting point, and the model analyzes it and gives you the top 100 principal components as vectors in embedding space, each grounded to the closest embedding and human word that describes it. Smiles, spikiness, wood, buttons, height above the ground, crowd density, all sorts of concepts, each with a knob you can dial up or down. They won’t be entirely independent, so cranking up smiles may also move the warmth, happiness, and sociability knobs, which is ok.
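A rough sketch of how those knobs might be computed, assuming you can sample latents in a neighborhood of the current image and you have word embeddings living in the same space (both assumptions here, not an existing API):

```python
import numpy as np

def mixing_board(neighborhood: np.ndarray,   # (n_samples, dim) latents near the image
                 vocab: list[str],
                 word_vecs: np.ndarray,       # (n_words, dim), unit-normalized
                 n_knobs: int = 100):
    """Principal directions of the local latent neighborhood, each labeled
    with the nearest word embedding."""
    centered = neighborhood - neighborhood.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # rows = components
    knobs = []
    for axis in vt[:n_knobs]:
        axis = axis / np.linalg.norm(axis)
        label = vocab[int(np.argmax(word_vecs @ axis))]  # closest word grounds the knob
        knobs.append((label, axis))
    return knobs  # e.g. [("smiles", direction), ("spikiness", direction), ...]
```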
It’s a complicated UI! Definitely not as approachable as “just type into the text box.” And typing into a text box has gotten us pretty far! But if we’re stuck with human language as our main interface to generative models, that’s extremely limiting. Professionals won’t tolerate that for long; they need more powerful, fine-grained interfaces that give them a high degree of interactive control and ability to iterate. Language prompts may have gotten us here, as they say, but they may not get us there.
AI researchers will note that this has lots of prior art in grounding and interpretability, among other areas. I’m no expert, so I’d love to hear any thoughts!

@snarfed.org Not an expert on AI by any stretch, but I have some experience using creative tools professionally that might help with your goal.
Most creative LLM tools jump to the final output.
That’s not how creatives generally work.
Take a look at how any graphic designer actually uses Photoshop. I’ll guarantee they use many layers. Deep etched objects. Masks. Transparencies. When combined, it produces the final output.
Likewise, take a look at any record producer’s DAW. Track upon track. Drums, bass, keys, samples, synths, guitars, vocals. When mixed down, it produces the final output.
Same deal with video production. Clip upon clip, when rendered, makes the final video.
Instead of training an LLM to jump to the final output, could you train it to generate a series of layers which, when stacked on top of each other, produce the final image?
Likewise, there are many examples of isolated vocal/drum tracks on YouTube.
Could you train an AI to generate a series of audio tracks which, when mixed, generate the final output?
So, for example, the tokens that generate the piano track of the audio at any given point in time are determined by what’s happening on, say, the drum and the bass tracks?
This post is getting long, so I’ll break it up into two parts.
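For reference, the “combine at the end” step described above is the easy, deterministic part; the hard part is getting a model to emit the layers or stems themselves. A toy sketch of the combination step, in plain NumPy:

```python
import numpy as np

def composite(layers: list[np.ndarray]) -> np.ndarray:
    """'Over' compositing of straight-alpha RGBA layers, bottom layer first."""
    out = layers[0].astype(float)
    for layer in layers[1:]:
        fg, bg = layer.astype(float), out
        fa, ba = fg[..., 3:4], bg[..., 3:4]
        oa = fa + ba * (1 - fa)
        rgb = np.where(oa > 0,
                       (fg[..., :3] * fa + bg[..., :3] * ba * (1 - fa)) / np.maximum(oa, 1e-8),
                       0.0)
        out = np.concatenate([rgb, oa], axis=-1)
    return out

def mixdown(stems: list[np.ndarray]) -> np.ndarray:
    """The audio equivalent: the final track is just the sum of the stems."""
    return np.sum(stems, axis=0)
```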
@snarfed.org This sounds very promising. I wonder if you could just “ask” for the dials you want (or have a few by default) and then adjust them. A mix of both, since the number of possible dials is likely infinite.
There’s also the question of objects in the scene. I’ve noticed that when I ask for an object to be moved in an image, it almost never works. If there were an intermediate layer of objects in the image, you could select each one and adjust its dials individually.
Since a vector is just a magnitude and an orientation, and you can “move” a vector in the direction and magnitude of another vector by adding them, I almost want just a magnitude slider plus some other, very esoteric UI control for specifying the orientation of the transforming vector.
Something like a gimbal with a scroll wheel on it.
I guess you’d lose a lot of the control compared to something like “one slider per dimension,” but to me this feels closer to the right way to interact with something like that.
Like maybe one of them is the basics (brightness, hue, contrast, etc.), another is broad topic things like subject, location, and action, another is emotional context, etc.
Also, this one I’m a lot less sure about, but you might be able to get information about groupings of dimensions that are highly correlated, and have buttons for selecting different, like, “orientation planes” that you could gimbal and scroll your way through.
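A rough sketch of that gimbal-plus-scroll-wheel idea, assuming the two directions spanning an “orientation plane” (u and v below) come from something like the principal-component knobs above:

```python
import numpy as np

def gimbal_move(z: np.ndarray, u: np.ndarray, v: np.ndarray,
                angle: float, magnitude: float) -> np.ndarray:
    """Move z within the plane spanned by unit directions u and v:
    the gimbal angle picks where to point, the scroll wheel how far to go."""
    v = v - (v @ u) * u            # orthonormalize so the angle behaves like a real dial
    v = v / np.linalg.norm(v)
    heading = np.cos(angle) * u + np.sin(angle) * v
    return z + magnitude * heading
```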
@snarfed.org A bunch of people played with this idea back in the StyleGAN era: start (for example) by finding a point in latent space for “young” and another for “old”, or “male” and “female” — now you’ve got your directional vector. After that, offer a slider that moves any other point along that same direction. Lots of neat demos. Would love to see a good set of sliders for writing concepts…
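That StyleGAN-era recipe, roughly, in NumPy; the arrays of labeled latents are assumed inputs from however the examples were collected:

```python
import numpy as np

def attribute_direction(latents_a: np.ndarray, latents_b: np.ndarray) -> np.ndarray:
    """e.g. latents_a = latents of images labeled 'old', latents_b = 'young'."""
    d = latents_a.mean(axis=0) - latents_b.mean(axis=0)
    return d / np.linalg.norm(d)

def slider(w: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Negative alpha pushes toward one end of the attribute, positive toward the other."""
    return w + alpha * direction
```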
Super cool! Feels very similar in spirit to the autoencoder steering Anthropic showcased.
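Mechanically, that kind of feature steering boils down to something like this hedged sketch (not Anthropic’s actual code; the sparse autoencoder’s decoder directions are assumed to be given):

```python
import numpy as np

def steer(activations: np.ndarray,    # (seq_len, d_model) residual-stream activations
          decoder_dirs: np.ndarray,   # (n_features, d_model) SAE decoder rows
          feature_idx: int,
          strength: float) -> np.ndarray:
    """Add a scaled copy of one learned feature's direction to the activations."""
    return activations + strength * decoder_dirs[feature_idx]
```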
Someone actually built this!