How ChatGPT Decides What to Cite
Ever wonder why AI mentions some companies but not others? Here's what we know about how LLMs choose their sources.
When you ask ChatGPT a question, it usually doesn't search the web in real time. It draws on its training data and, in some modes, on web search results. So how does it decide what to cite?
Training Data Influence
LLMs are trained on massive text datasets. Content that appears frequently and in authoritative contexts tends to have more influence on what the model learns. Such content includes:
- Wikipedia and reference sites
- News publications
- Academic papers
- Popular documentation
If your company is mentioned in these contexts, you're more likely to appear in answers.
Real-Time Search Integration
When AI assistants do search the web, they look for:
- Relevance: Does the content directly answer the query?
- Recency: Is the information current?
- Authority signals: Is there clear authorship and sourcing?
- Extractability: Can they pull a clean quote?
The Citation Threshold
An AI assistant doesn't cite everything it knows. It tends to cite a source when:
- The claim is specific and verifiable
- The source is clearly identifiable
- The information adds credibility to the answer
This is why structured data and citation signals matter so much. You're not just helping AI find your content; you're making it easy to cite.
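To make "structured data and citation signals" concrete, here is a minimal sketch that generates schema.org Article markup as JSON-LD, covering headline, author, dates, and canonical URL. The function names and the example values are hypothetical, and whether any particular assistant reads this markup is an assumption rather than something established here.

```python
import json
from datetime import date


def build_article_jsonld(headline, author_name, published, modified, canonical_url):
    """Return a schema.org Article description as a JSON-LD string.

    Function and parameter names are illustrative, not a standard API.
    """
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        # Clear author attribution
        "author": {"@type": "Person", "name": author_name},
        # Publication and last-modified dates in ISO 8601 format
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
        # Canonical URL of the page this article lives on
        "mainEntityOfPage": canonical_url,
    }
    return json.dumps(data, indent=2)


def as_script_tag(jsonld):
    """Wrap a JSON-LD string in the script tag used to embed it in HTML."""
    return f'<script type="application/ld+json">\n{jsonld}\n</script>'


# Hypothetical example values, for illustration only.
snippet = as_script_tag(
    build_article_jsonld(
        headline="How ChatGPT Decides What to Cite",
        author_name="Jane Example",
        published=date(2024, 1, 15),
        modified=date(2024, 2, 1),
        canonical_url="https://example.com/blog/how-chatgpt-cites",
    )
)
print(snippet)
```

The resulting script tag would go in the page's HTML, so the author, dates, and canonical URL are stated once in a machine-readable form instead of being inferred from the layout.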
What You Can Control
You can't control what's in an LLM's training data. But you can control:
- Structured data on your pages
- Clear author attribution
- Publication dates
- Canonical URLs
- Extractable content (FAQs, definitions)
- Access for AI crawlers (not blocking them in robots.txt)
These are the levers of generative engine optimization (GEO).
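The last item in the list above, not blocking AI crawlers, is easy to check. The sketch below uses Python's standard `urllib.robotparser` to test a robots.txt file against a few crawler user-agent tokens. The token list is an assumption and may be incomplete or out of date, so confirm the current names in each vendor's documentation. The `blocked_crawlers` helper and the sample file are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens for some AI crawlers; verify against vendor docs.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def blocked_crawlers(robots_txt, url, crawlers=AI_CRAWLERS):
    """Return the crawlers that the given robots.txt text forbids from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [name for name in crawlers if not parser.can_fetch(name, url)]


# Hypothetical robots.txt that blocks one AI crawler and allows everyone else.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_crawlers(sample, "https://example.com/blog/post"))  # prints ['GPTBot']
```

To check a live site, `RobotFileParser` also provides `set_url()` and `read()` to fetch the file over the network. The offline form is used here so the example runs without network access.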