How AI Systems Actually Decide Which Brands to Recommend
Most conversations about AI and branding stay at the surface: “make sure AI knows about you,” “create content for AI,” “optimize for LLMs.” These are fine as slogans. They are useless as strategy. To build real strategy, you need to understand the mechanism. How does a language model actually decide, at the moment of generation, to produce your brand name instead of a competitor's?
The answer is math. Specifically, probability distributions over tokens. Every decision a language model makes is a weighted coin flip across its entire vocabulary. Understanding those coin flips is the key to understanding AI visibility.
The probability distribution
A language model generates text one token at a time. At each step, it produces a score (called a logit) for every token in its vocabulary. GPT-4's vocabulary has roughly 100,000 tokens. So at each step, the model produces 100,000 scores.
These scores are then converted into probabilities via the softmax function. Softmax does something that is intuitive in concept but brutal in practice: it exponentiates each score, then normalizes. This means small differences in logits translate into large differences in probability.
Say your brand has a logit of 5.0 and a competitor has a logit of 6.0, a gap of just 1.0 in the raw score. After softmax, the competitor's probability is roughly 2.7 times yours: the ratio is exactly e^(6.0 − 5.0) ≈ 2.72, no matter what the other tokens score. A small logit gap becomes a large probability gap. This is the softmax non-linearity, and it is the single most important concept in AI visibility.
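The softmax arithmetic above can be checked in a few lines of Python. This is a toy sketch with made-up logits (6.0 for the competitor, 5.0 for you, plus an arbitrary long tail), not any real model's numbers:

```python
import math

def softmax(logits):
    """Convert raw logit scores into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: competitor at 6.0, your brand at 5.0,
# plus a long tail of weaker candidates.
logits = [6.0, 5.0, 3.0, 2.0, 1.0]
probs = softmax(logits)

# The ratio between any two tokens depends only on their logit gap:
# e^(6.0 - 5.0) = e ≈ 2.72, regardless of the rest of the vocabulary.
ratio = probs[0] / probs[1]
print(round(ratio, 2))  # ≈ 2.72
```

Note the key property: the probability ratio between two tokens is e raised to their logit difference, so every extra point of logit compounds exponentially.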
Decoding strategies: the silent brand filters
The probability distribution is step one. Step two is what the system does with that distribution. This is called the decoding strategy, and most people in marketing have never heard of it. It is, quietly, one of the most powerful filters determining brand visibility.
Greedy decoding picks the highest-probability token every time. Only one brand wins. If you are number 2 in the distribution, you never appear. This is the most common strategy for factual queries and tool-use scenarios.
Top-k sampling takes the top k tokens (say, k=40), redistributes the probability among them, and samples. If your brand is in the top 40, you have a chance. If not, you are filtered out before sampling even begins.
Nucleus sampling (top-p) is more dynamic. It takes the smallest set of tokens whose cumulative probability exceeds a threshold (say, p=0.9). If the top 3 brands already account for 90% of the probability mass, everything else is excluded. This is why a steep distribution is so punishing: a small number of brands capture almost all the probability, and nucleus sampling zeroes out the long tail entirely.
The practical implication is blunt: you need to be in the top 3 to 4 tokens for your category. Not the top 30. Not the top 10. The top 3 to 4. Everything else gets filtered by the decoding strategy before the model even generates its answer.
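The filtering effect is easy to see in code. Below is a minimal sketch of top-p (nucleus) filtering over a hypothetical category distribution; the brand names and logits are invented for illustration:

```python
import math

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

# Hypothetical category: three strong brands, then a long tail.
brands = ["brand_a", "brand_b", "brand_c", "brand_d", "brand_e", "brand_f"]
probs = softmax([6.0, 5.5, 5.0, 2.0, 1.5, 1.0])

survivors = [brands[i] for i in top_p_filter(probs, p=0.9)]
print(survivors)  # only the top three brands survive; the tail never gets sampled
```

With these logits the top three brands hold roughly 98% of the mass, so brands four through six are cut before sampling even begins, exactly the filtering described above.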
Temperature: the visibility knob
Temperature is a parameter that scales the logits before softmax. Low temperature (0.0 to 0.3) sharpens the distribution: the winner takes almost all the probability. High temperature (0.8 to 1.2) flattens it: more tokens get a meaningful share.
This matters because different AI products use different temperatures. ChatGPT uses relatively low temperature for factual queries and higher temperature for creative tasks. Perplexity runs even lower for its search-grounded answers. Claude tends toward moderate temperature across most tasks.
The result: your brand might appear in one AI product and not another, purely because of temperature settings. A brand that sits at position 5 in the probability distribution might appear in a high-temperature generation (where the distribution is flat enough to include it) but vanish in a low-temperature one (where only the top 2 to 3 survive).
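The position-5 scenario can be simulated directly. The sketch below divides the logits by a temperature before softmax (the standard formulation) and compares a low and a moderate setting; the logits are hypothetical:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature before softmax; low T sharpens, high T flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical distribution with your brand at position 5 (logit 3.5).
logits = [6.0, 5.5, 5.0, 4.0, 3.5]

results = {}
for t in (0.2, 1.0):
    probs = softmax_with_temperature(logits, t)
    results[t] = probs[-1]
    print(f"T={t}: position-5 probability = {probs[-1]:.6f}")
```

At T=1.0 the position-5 brand holds a few percent of the mass and will occasionally be generated; at T=0.2 its probability collapses to effectively zero, so it vanishes from low-temperature products entirely.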
This is not something you can control directly. But it is something you can measure. And once you measure it, you know exactly how far you need to climb in the distribution to become reliably visible across all temperature settings.
The two engines: weights vs. retrieval
A brand can appear in an AI answer through two distinct pathways: pretrained weights and retrieval-augmented generation (RAG).
Pretrained weights are the slow, durable engine. During pretraining, the model reads billions of tokens. If your brand appears frequently enough in the right contexts, it gets encoded into the model's parameters. This is permanent (until the model is retrained) and robust (it survives across different prompts and phrasings). But it takes 6 to 18 months for new information to propagate through a training cycle, and there is no guarantee your content will be in the next training batch.
Retrieval is the fast, rented engine. When a model with RAG capabilities (Perplexity, ChatGPT with browsing, Gemini with grounding) processes a query, it first searches the web, retrieves relevant chunks, and injects them into its context window. If your page gets retrieved, your brand appears in the context, and the model can reference it in its answer.
The difference matters for strategy. Weight-based visibility is “owned.” Once it is there, it persists. Retrieval-based visibility is “rented.” You have to keep your content fresh, crawlable, and high-quality for retrieval systems to keep pulling it.
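The retrieval pathway is mechanically simple: searched-and-retrieved text is pasted into the prompt ahead of the question. The toy sketch below shows the shape of that context assembly; the function name, prompt wording, and chunk text are all invented for illustration and do not reflect any specific product's internals:

```python
def build_rag_prompt(query, retrieved_chunks):
    """Inject retrieved web chunks into the model's context before generation."""
    context = "\n\n".join(
        f"[source {i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using the sources below.\n\n"
        f"{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

# Hypothetical retrieved page text.
chunks = [
    "Acme Widgets is a vendor of industrial widgets with same-day shipping.",
    "Buyer's guide: the most-recommended widget vendors this year.",
]
prompt = build_rag_prompt("What is the best widget vendor?", chunks)
print(prompt)
```

The strategic point follows directly: if your page is not among the retrieved chunks, your brand is simply absent from the context the model reasons over, regardless of how good the page is.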
What this means for your brand
The mechanics lead to a few non-obvious conclusions:
1. Position in the distribution is everything. Being “mentioned by AI” is not binary. It is positional. A brand at position 2 in the logit ranking might have 5x the probability of position 5. The goal is not just presence; it is top-of-distribution presence.
2. Different models see you differently. Each model was trained on different data with different weights. Your brand might be position 2 in GPT-4 and position 15 in Claude. Multi-model measurement is not optional; it is the only way to see the full picture.
3. The rich get richer. If your brand is in the top 3, it gets generated. When it gets generated, users see it, search for it, write about it. That creates more training data, which pushes you higher in the next model. This feedback loop is the most powerful force in AIO, and it favors incumbents.
4. SEO metrics are a lagging indicator. Domain authority, backlinks, and keyword rankings do not directly affect token probability. They are correlated (popular sites generate more training data) but the correlation is weakening as models diversify their data sources.
The model does not have opinions. It has probabilities. Understanding those probabilities, measuring them, and deliberately shifting them is the work of AI Optimization. It is technical, it is measurable, and it is the new competitive surface for every brand that wants to exist in the AI-mediated future.
We measure token probability across 5 models with bootstrap confidence intervals. See where your brand sits in the distribution.
Run your audit →