What kind of content gets cited by LLMs like ChatGPT and Perplexity?

In our own tracking across nine months, six formats earned citations far more reliably than anything else: original data and benchmarks the model cannot find elsewhere; direct-answer definitions that state a crisp, quotable answer in the first two to four sentences; comparison content structured as clean tables or 'X vs Y' sections; honest buying guides framed as 'best X for a specific use case'; step-by-step processes written as numbered, self-contained lists; and genuine FAQ blocks answering the exact questions buyers ask. What these share is not a topic but a structure: each contains short, factual, self-contained passages a model can lift without needing the surrounding page. Formats that failed - keyword-stuffed listicles, undifferentiated 'ultimate guides', opinion with no evidence - all buried or lacked that liftable passage.

Why does a page rank on Google but never get cited by AI assistants?

Because ranking and being cited reward different things. Google can rank a page on relevance, links and technical health even if its answer is buried three scrolls down. An assistant assembles an answer by retrieving and lifting short passages it can quote confidently, so it favours pages where a clean, factual, self-contained answer sits near the top and is easy to extract. A page can rank number one and still never be cited because the model has to read too far to find a quotable line, or because the answer is hedged, padded or wrapped in narrative. The fix is rarely more content; it is restructuring the answer you already have into an extractable passage.

What is the single highest-leverage content format for getting cited by AI?

Original data you own and no competitor can copy - your own benchmarks, test results, survey numbers or aggregated client outcomes. It was the most-cited format in our tracking by a wide margin because it gives an assistant a specific, attributable fact that exists in exactly one place: your page. Definitions and comparisons can be paraphrased from many sources, so the model has no reason to name you specifically. A number only you have published forces attribution. Even a modest original dataset - one honest benchmark from your own work - outperforms a long, well-written explainer that restates what a hundred other pages already say.

How long should a passage be to get lifted by an LLM?

Short and self-contained. In practice, the passages that got lifted in our tracking were almost always two to four sentences that answered one question completely without depending on the sentence before or after them. A liftable passage states the claim, gives the key qualifier or number, and stops. Long paragraphs that build an argument across six sentences rarely get quoted because the model cannot extract a clean unit from them. The practical rule we now give writers: after every important heading, write one tight paragraph that would still make sense if it were the only thing a reader saw. Then elaborate below it for the humans who keep reading.

Do FAQ sections actually help with AI citations?

Yes, when the questions are real and the answers are self-contained. FAQ blocks were one of our most-cited formats because each question-and-answer pair is already the structure assistants want: a specific question followed by a short, complete answer. The failure mode is fake FAQs - questions invented to stuff keywords, answered in one vague sentence. Those get ignored. The version that works uses the actual questions buyers ask an assistant, drawn from real search and sales conversations, each answered in two to five sentences that stand on their own. Pairing that with FAQPage schema helps machines parse the structure, but the structure and honesty of the answers matter more than the markup.

Does adding schema markup get your content cited by AI?

Schema helps machines understand structure, but it is not what earns the citation. In our testing, the pages that got cited were the ones with a clean, factual, extractable answer near the top; schema made that structure easier to parse but never rescued a page whose underlying answer was buried or thin. Think of schema as amplifying content that is already citable, not as a substitute for it. The highest-value markup for this work is FAQPage and Article schema, supported by a clear heading structure - but if you have to choose, spend the effort on restructuring the answer into a liftable passage first, then add schema on top of content that already deserves to be cited.

What content formats never get cited by AI, and why?

Five formats consistently failed in our tracking. Keyword-stuffed listicles ('21 tips for X') that repeat a phrase but never state a crisp answer. Undifferentiated 'ultimate guides' that restate what many pages already say, giving the model no reason to pick yours. Opinion pieces with no evidence, because an assistant will not attribute a claim it cannot verify. Thin or gated content where the substance is hidden behind a form. And narrative-heavy posts that bury the answer under a long personal build-up. The common thread is the absence of a short, factual, attributable, self-contained passage - the exact unit an assistant needs to lift and cite.

How do you measure whether your content is getting cited by AI engines?

You have to build the measurement yourself, because AI citations do not appear in Search Console. The method we use is to fix a set of 20 to 30 questions your buyers actually ask, run them across ChatGPT, Perplexity and Google's AI Overviews on a regular cadence, and record whether your brand or page is named, and which specific passage was quoted. Over time that gives you a citation rate and a share of voice against competitors, plus a direct signal of which passages get lifted - which tells you what to write more of. Without this baseline, teams routinely conclude 'AI search does nothing' while looking at a dashboard that structurally cannot show AI performance.

The Content Formats LLMs Cite (and the Ones They Ignore)

The short answer

Over nine months we tracked which of our own published pages ChatGPT, Perplexity and Google's AI Overviews actually cited - and which they ignored, even when the ignored pages ranked well on Google. Six formats earned citations reliably: original data and benchmarks, direct-answer definitions, comparison tables, honest buying guides, numbered step-by-step processes, and genuine FAQ blocks. Five formats never earned a citation no matter how much traffic they pulled: keyword-stuffed listicles, undifferentiated "ultimate guides", opinion with no evidence, thin or gated content, and narrative posts that bury the answer. The dividing line was never the topic. It was structure: cited pages contained a short, factual, self-contained passage a model could lift and attribute without reading the rest of the page. Ignored pages didn't. That single distinction now shapes how we brief every piece.

How we actually ran this

This is not a theory post. For a set of client and owned pages, we did the boring thing: we fixed a list of the questions our buyers ask an assistant, ran those prompts across ChatGPT, Perplexity and Google's AI Overviews on a repeating cadence, and logged three things each time - whether our page was cited, which passage was quoted, and whether the same page also ranked in classic Google results. We wrote up the tracking method in full in "how to track AI brand mentions across ChatGPT and Perplexity"; this piece is what we learned from the log once it had enough entries to show a pattern.
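As a rough sketch, one row of such a log could be modelled like this. The class name, field names, CSV layout and example URL are illustrative assumptions, not the actual tracking sheet; the three logged facts (cited, quoted passage, also ranks on Google) are the ones described above.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import date


@dataclass
class CitationCheck:
    """One log row: one question asked of one assistant on one day (hypothetical schema)."""
    run_date: date
    assistant: str         # e.g. "ChatGPT", "Perplexity", "AI Overviews"
    question: str          # the buyer question, asked with fixed wording
    page_url: str          # the page we hoped would be cited
    cited: bool            # was our page cited?
    quoted_passage: str    # which passage was quoted ("" if none)
    ranks_on_google: bool  # did the same page also rank in classic results?


def write_log(checks, handle):
    """Write the checks to an open text handle as CSV, header row first."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(CitationCheck)])
    writer.writeheader()
    for check in checks:
        writer.writerow(asdict(check))


# Example with placeholder values.
example = CitationCheck(
    run_date=date(2024, 1, 15),
    assistant="Perplexity",
    question="What is answer engine optimization?",
    page_url="https://example.com/aeo",
    cited=True,
    quoted_passage="Answer engine optimization is the practice of structuring content so assistants can quote it.",
    ranks_on_google=False,
)
buffer = io.StringIO()
write_log([example], buffer)
print(buffer.getvalue())
```

Keeping the question wording fixed between runs is what makes the rows comparable over time.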

The most uncomfortable finding came first. Ranking and citation were only loosely correlated. Some of our best-ranking pages - genuinely useful, well-linked, technically clean - were never cited once. Some pages that ranked modestly got quoted constantly. When we lined up the cited pages next to the ignored ones and asked what the cited group had in common, the answer wasn't topic, length, or keyword targeting. It was that every cited page had at least one passage you could copy out and drop into a conversation, where it would still be true and complete on its own. That is the whole game, and the rest of this post is what that looks like in practice.

The six formats that got cited

1. Original data and benchmarks

This was the runaway winner, and it wasn't close. Any page that published a number we generated ourselves - a benchmark, a test result, an aggregated outcome across client accounts - got cited far more often than any explainer, however good. The reason is mechanical, not mysterious. A definition can be paraphrased from a hundred sources, so an assistant has no reason to name you. A number that exists on exactly one page in the world forces attribution: if the model wants to use it, it has to point at you.

You do not need a research department. One honest benchmark from work you already did - "across the accounts we manage, X happened in Y percent of cases" - is more citable than a 3,000-word guide that restates what everyone already published. The bar is originality and honesty, not scale. This is the highest-leverage thing most brands are not doing, and it is why we push every client to publish at least one piece of proprietary data.

2. Direct-answer definitions

"What is answer engine optimization?"-style pages got cited constantly - but only the versions that answered the question in the first two to four sentences, before any history, context or throat-clearing. The pattern that worked was almost formulaic: state the definition crisply, give the one qualifier that makes it accurate, then stop and elaborate below for humans. The versions that failed opened with three paragraphs of "in today's fast-moving landscape" and buried the actual definition halfway down.

Assistants lift the top. If your definition is a clean, standalone unit near the start of the section, it gets quoted. If a reader - or a model - has to scroll to assemble the answer from scattered sentences, it doesn't. We rebuilt several pages around this single rule and watched previously ignored definitions start getting cited within weeks. It is the cheapest structural fix available, and it is the core of answer engine optimization.

3. Comparison content, structured as tables

"X vs Y" content punched above its weight, especially when the comparison was laid out as an actual table rather than prose. Buyers ask assistants comparison questions constantly - "is A or B better for my situation" - and a clean comparison table is the ideal liftable structure: rows of attributes, two columns of values, no narrative to untangle. We saw comparison tables quoted almost verbatim in AI Overviews.

The failure version was the fake comparison - a page that claims to compare two things but spends the whole time arguing for one, with no honest columns where the other option wins. Assistants seem to distrust these, and buyers do too. The comparisons that got cited were the ones honest enough to say where each option loses. If you want the deeper architecture behind this, we broke it down in "comparison-page SEO and BOFU architecture that ranks".
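To make "rows of attributes, two columns of values" concrete, here is a minimal sketch that renders a comparison as an HTML table. The function name, the option names and the row values are placeholders for illustration; any templating approach would do the same job.

```python
from html import escape


def comparison_table(option_a, option_b, rows):
    """Return an HTML table with one row per attribute and one column per option.

    `rows` is a list of (attribute, value_for_a, value_for_b) tuples.
    """
    head = (
        f"<tr><th>Attribute</th><th>{escape(option_a)}</th>"
        f"<th>{escape(option_b)}</th></tr>"
    )
    body = "".join(
        f"<tr><th>{escape(attr)}</th><td>{escape(a)}</td><td>{escape(b)}</td></tr>"
        for attr, a, b in rows
    )
    return f"<table>{head}{body}</table>"


# Placeholder comparison: note that each option "loses" on at least one row.
print(
    comparison_table(
        "Option A",
        "Option B",
        [
            ("Setup time", "Faster", "Slower"),
            ("Reporting depth", "Basic", "Detailed"),
        ],
    )
)
```

The point of the structure is that every cell is a short value a reader or a model can lift without untangling any narrative.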

4. Honest buying guides framed by use case

"Best X for a specific use case" pages got cited when they were genuinely a guide and not a thinly disguised pitch. The format assistants reward is a guide that names criteria, applies them honestly, and is willing to say "for this situation, not us." That is counterintuitive for a brand, but the citation data was unambiguous: the more honest and criteria-led the guide, the more often it got quoted, because the model treats it as a reference rather than an advertisement.

We learned to write these the way we write our city and category buyer's guides - lead with the evaluation criteria, apply them without flinching, and let the brand appear as one option assessed on the same yardstick as the rest. That restraint is exactly what makes the page citable.

5. Step-by-step processes as numbered lists

Anything framed as "how to do X" and laid out as a genuine numbered sequence - each step self-contained, each stating what to do and why - got lifted regularly, especially into AI Overviews, which love ordered lists. The structural requirement is that each step stands on its own. A step that reads "next, do the thing we discussed above" cannot be lifted; a step that reads "3. Audit faceted navigation for near-duplicate URLs, because filter combinations quietly inflate crawlable pages" can.

The failure mode was process content written as flowing narrative - the steps were in there, but tangled into paragraphs the model couldn't cleanly extract. Same information, wrong structure, no citations. Rewriting the same content as a real numbered list, with each step complete in itself, was often all it took.

6. Genuine FAQ blocks

FAQ sections were one of our most-cited formats, which surprised no one once we saw why: a question followed by a short, complete answer is already the exact unit an assistant wants. The catch is the word genuine. Fake FAQs - questions invented to stuff keywords, answered in one vague sentence - got ignored completely. The FAQs that got cited used the real questions buyers ask, drawn from search data and sales calls, each answered in two to five self-contained sentences.

Pairing that with FAQPage schema helps machines parse the structure, but the honesty and self-containment of the answers did the heavy lifting. Every page we publish now ends with a real FAQ block for exactly this reason.
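For reference, FAQPage markup is a small JSON-LD object using the schema.org `FAQPage`, `Question` and `Answer` types. The sketch below builds it from question-and-answer pairs; the helper name `faq_jsonld` and the sample pair are illustrative, and the answer text should match what is visibly on the page.

```python
import json


def faq_jsonld(pairs):
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


pairs = [
    (
        "Do FAQ sections actually help with AI citations?",
        "Yes, when the questions are real and the answers are self-contained.",
    ),
]

# Escape "</" so answer text can never close the surrounding script tag early;
# "\/" is a valid JSON escape for "/".
markup = json.dumps(faq_jsonld(pairs), indent=2).replace("</", "<\\/")
print(f'<script type="application/ld+json">\n{markup}\n</script>')
```

The markup only describes the question-and-answer pairs; it does nothing for an answer that is vague or not actually on the page.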

Figure: the split was structural, not topical. Every format above the line shares one trait - a short, self-contained passage a model can lift and attribute.

The five formats that never worked

The failures were as instructive as the wins, because they killed some content we were proud of.

Keyword-stuffed listicles. "21 tips for X" pages that pulled decent traffic but never earned a single citation. They repeat the target phrase and gesture at breadth, but no single item is a crisp, complete answer - so there is nothing to lift. Traffic without citation is the signature of this format.

Undifferentiated "ultimate guides." Long, competent posts that restate what a hundred other pages already say. They are not wrong; they are just not distinctive. An assistant synthesising an answer has no reason to name a source that adds nothing the others don't. Length is not the moat we spent years assuming it was.

Opinion with no evidence. Thought-leadership pieces making confident claims with nothing behind them. Assistants are cautious about attributing claims they cannot verify, so a strong opinion unsupported by data, examples or a named source mostly gets skipped. The fix is not softer opinions - it is attaching evidence to the strong ones.

Thin or gated content. If the substance sits behind a form or is only three thin paragraphs, there is nothing for a model to retrieve. Gating remains a legitimate lead-gen tactic, but understand the trade: gated content forfeits AI citation entirely, because the assistant never sees the good part.

Narrative that buries the answer. This one stung, because we like writing this way. Posts that open with a long personal build-up before reaching the point often ranked fine but rarely got cited - the answer was in there, three scrolls down, tangled in story. The lesson wasn't to stop telling stories. It was to state the answer first, then tell the story underneath for the humans who stay.

The rule underneath all of it

Once we stopped looking at format labels and looked at what the cited pages physically contained, the whole thing collapsed into one rule: an assistant cites a passage it can lift, verify and attribute without needing the rest of the page. Every winning format is just a different way of producing that passage. Every losing format fails to produce it, or buries it.

A liftable passage has four properties. It is short - two to four sentences. It is self-contained - it makes sense with nothing above or below it. It is factual or specific - a claim with a number, a definition, a clear comparison, not a mood. And it is attributable - ideally something only your page says, so the model must name you. Original data hits all four at once, which is why it wins. A well-structured definition hits three. A keyword-stuffed listicle hits none.
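Three of those four properties can be roughly screened by machine before an editor looks at a draft. The sketch below is a crude heuristic of our own for illustration, not how any assistant actually selects passages: the sentence splitter is naive, the list of "dependent openers" is an assumption, and a digit is only a weak proxy for specificity. Attributability (whether only your page says it) cannot be checked this way at all.

```python
import re

# Openers that usually signal the passage leans on surrounding text.
# This list is an illustrative assumption, not an exhaustive rule.
DEPENDENT_OPENERS = (
    "this ", "that ", "these ", "those ", "it ", "they ",
    "however", "as mentioned", "as discussed",
)


def split_sentences(passage):
    """Naive split on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", passage.strip())
    return [part for part in parts if part]


def liftability_report(passage):
    """Screen a passage for three of the four properties of a liftable passage."""
    sentences = split_sentences(passage)
    first = sentences[0].lower() if sentences else ""
    return {
        # Short: two to four sentences.
        "short": 2 <= len(sentences) <= 4,
        # Self-contained: does not open by pointing back at earlier text.
        "self_contained": bool(sentences) and not first.startswith(DEPENDENT_OPENERS),
        # Specific: contains at least one digit (a very rough proxy).
        "specific": bool(re.search(r"\d", passage)),
    }


draft = (
    "Citation rate is the share of tracked questions where an assistant names your page. "
    "We track it monthly across 3 assistants."
)
print(liftability_report(draft))
```

A passage that fails a check is a prompt for a human edit, not a verdict; plenty of good passages spell their numbers out in words.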

Figure: same topic, same facts. The only difference is whether the answer sits in a clean unit at the top or is buried under a build-up.

How to retrofit content you already have

You almost certainly do not need to start over. Most of the value in our own tracking came from restructuring pages that already ranked, not from writing new ones. The retrofit is fast:

1. Pull your top 20 pages by traffic and ask, for each, whether an assistant could lift a clean answer from the first screen. If the answer is buried, that page is leaking citations.
2. Add a "short answer" block near the top of each - two to four sentences that state the page's core answer completely, before any context. This alone moved previously ignored pages into the cited set for us.
3. Convert any comparison into an actual table. If a page argues "A vs B" in prose, restructure it into rows and columns with honest values on both sides.
4. Turn buried processes into numbered lists where each step is self-contained and states its own why.
5. Add a genuine FAQ block using the real questions buyers ask - not invented ones - answered in two to five standalone sentences, with FAQPage schema on top.
6. Publish one piece of original data you already have sitting in a spreadsheet. This is the single highest-leverage new asset most brands can ship this quarter.

None of this requires more words. Most of it requires fewer, arranged better. That is the whole reframe: content marketing in the AI era is an editing discipline as much as a writing one, and the technical structure underneath - clean headings, valid schema, crawlable text - is what lets machines find the good passage once you have written it.

What we'd tell our past selves

Three mistakes cost us the most time before the pattern was obvious.

We optimised for length when we should have optimised for extractability. The 3,000-word guide felt like the safe bet; it usually wasn't. A tight page with one original number beat it every time.

We treated schema as the citation lever. It isn't. Schema helps a machine parse a good answer, but it never rescued a page whose answer was buried or thin. Structure the passage first, then add markup to content that already deserves citing.

And we assumed ranking would carry us into AI answers. It doesn't reliably. We now treat "does this rank" and "can this be cited" as two separate questions with two separate checklists - which is the core distinction we unpack in "SEO vs AEO vs GEO", and the reason we run answer-engine and AI SEO work as a discipline alongside classic SEO rather than assuming one delivers the other.

Tools, KPIs and how to keep score

You cannot manage what you do not measure, and AI citations do not appear in Search Console. Build the scoreboard yourself:

Citation rate. Of a fixed set of 20 to 30 buyer questions, in what share does an assistant name your brand or page? Track it monthly across ChatGPT, Perplexity and AI Overviews.
Share of voice. For those same questions, how often are you cited versus named competitors? This is the number that tells you whether you are winning or just present.
Passage-level signal. Log which specific passage gets quoted. Over time this tells you exactly what to write more of - it is the fastest feedback loop we have found.
Rank-vs-cite gap. Flag pages that rank but are never cited. Each one is a fast retrofit waiting to happen.
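The first, second and fourth of these numbers reduce to simple arithmetic once the log exists. The sketch below assumes one month of results for one assistant, stored as a mapping from each question to the set of brands cited in the answer. The function names, the data shapes and the exact share-of-voice definition (each brand's share of all citations among the tracked brands) are our assumptions for illustration.

```python
from collections import Counter


def citation_rate(results, brand):
    """Share of tracked questions in which `brand` was named at least once."""
    if not results:
        return 0.0
    hits = sum(1 for cited in results.values() if brand in cited)
    return hits / len(results)


def share_of_voice(results, brands):
    """Each tracked brand's share of all citations earned by the tracked brands."""
    counts = Counter(
        brand for cited in results.values() for brand in cited if brand in brands
    )
    total = sum(counts.values())
    return {brand: (counts[brand] / total if total else 0.0) for brand in brands}


def rank_vs_cite_gap(pages):
    """URLs of pages that rank in classic search but were never cited."""
    return [
        url for url, stats in pages.items()
        if stats["ranks"] and stats["citations"] == 0
    ]


# Placeholder data: four questions, our brand plus one named competitor.
results = {
    "question 1": {"us", "rival_a"},
    "question 2": {"rival_a"},
    "question 3": set(),
    "question 4": {"us"},
}
pages = {
    "/ultimate-guide": {"ranks": True, "citations": 0},
    "/benchmark": {"ranks": True, "citations": 5},
    "/old-post": {"ranks": False, "citations": 0},
}

print(citation_rate(results, "us"))
print(share_of_voice(results, ["us", "rival_a"]))
print(rank_vs_cite_gap(pages))
```

With 20 to 30 questions the monthly figures will be noisy, so the trend across several months is more informative than any single reading.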

The mechanics of building this are in our AI citation tracking method, and the wider strategic picture - why brands rank yet stay invisible in assistants - is in "the AI search gap". If you want the cluster-level view of how these citable pages fit together into topical authority, we covered that in "content silos that rank and get cited by AI", and the tactical, engine-by-engine detail lives in "how to rank on ChatGPT" and "how to rank on Perplexity".

The bottom line

After nine months of watching which of our pages got cited and which got ignored, the lesson is smaller and more useful than we expected. LLMs do not cite topics, authors or word counts. They cite passages - short, self-contained, specific, attributable units they can lift and stand behind. Six formats produce those passages naturally: original data, direct-answer definitions, comparison tables, honest buying guides, numbered processes and genuine FAQs. Five formats reliably don't, no matter how much traffic they pull. The work is not to write more. It is to make sure every page you already have contains at least one passage worth quoting - and, ideally, one number only you can provide.

If you want a clear read on which of your pages are getting cited, which rank but stay invisible, and which formats to build next, that diagnosis is exactly what our team runs. Talk to us about an AI-era content and citation audit. It is the same process behind our AI SEO services and the wider SEO programmes we run for brands that want to be the answer, not just a result.