Technology · March 13, 2026

GEO: How AI Search Changes Everything About SEO

Generative Engine Optimization is replacing traditional SEO. Here's what the research says, what we've tested, and what actually works.

Google’s AI Overviews, ChatGPT search, Perplexity, Claude — these aren’t search engines. They’re answer engines. They don’t rank pages. They read them, synthesize them, and cite them. If your content isn’t structured for this, you’re invisible.

This is our working research document. We update it as we learn.

The Academic Foundation

The term Generative Engine Optimization (GEO) was formalized in a 2024 paper by researchers at IIT Delhi, Princeton, and Georgia Tech (Aggarwal et al., 2024, published at ACM SIGKDD). Their key findings:

  • Traditional SEO techniques (keyword stuffing, backlink farming) have zero positive effect on AI search visibility
  • Citing sources increased visibility by up to 40%
  • Adding statistics improved visibility by 30-40%
  • Including quotations from relevant authorities boosted visibility by 15-25%
  • The effect varies by domain: factual topics benefit most from statistics, while subjective topics benefit more from authoritative quotes

This isn’t theoretical. It’s measured against real generative engines.

What AI Crawlers Actually Do

When ChatGPT or Perplexity answers a question about “best recruitment agencies in Hong Kong”, here’s what happens:

  1. Query expansion. The model doesn’t search for your exact query. It generates multiple semantic variations: “top headhunters Hong Kong”, “executive search firms HK”, “recruitment agency Central district”. Google has a patent on this — they call it “query fan-out.”

  2. Retrieval. The system fetches candidate pages from its index. Pages with broader semantic coverage get retrieved for more query variations.

  3. Re-ranking. A specialized model (like Google’s ret-rr-skysight-v3 for AI Overviews) scores each page for relevance, factual density, and authority.

  4. Synthesis. The LLM reads the top-ranked passages and generates an answer, citing specific pages as sources.

The implication: you need to be retrieved, survive re-ranking, and be citation-worthy. Three separate filters, each with different requirements.
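The four steps above can be sketched as a toy pipeline. Everything here is illustrative: the function names, the index, and the scoring rule are assumptions, not any engine's real API.

```python
# Toy sketch of an answer-engine pipeline: fan-out, retrieve, re-rank, cite.
# All names, data, and scoring rules are illustrative assumptions.

def fan_out(query):
    # Real engines generate semantic variations with an LLM; we fake two.
    return [query, query.replace("agencies", "firms"), "headhunters " + query.split()[-1]]

def retrieve(index, queries):
    # A page is retrieved if it covers any query variation.
    return {url for q in queries for url, terms in index.items() if q in terms}

def rerank(pages, fact_density):
    # Score candidates by factual density; densest page first.
    return sorted(pages, key=lambda url: fact_density.get(url, 0), reverse=True)

index = {
    "site-a.hk/agencies": {"recruitment agencies hk", "recruitment firms hk"},
    "site-b.hk/list": {"recruitment agencies hk"},
}
fact_density = {"site-a.hk/agencies": 0.9, "site-b.hk/list": 0.2}

candidates = retrieve(index, fan_out("recruitment agencies hk"))
cited = rerank(candidates, fact_density)[0]
print(cited)  # → site-a.hk/agencies
```

Note that site-a is retrieved for two query variations (broader semantic coverage) and then wins re-ranking on fact density: the same page has to pass both filters before it can be cited.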

Four Strategies That Work

Based on the research and our own testing:

1. Semantic Footprint Expansion

Don’t target one keyword per page. Cover entire topic clusters. If you write about “recruitment agencies in Hong Kong”, you also need pages about:

  • Individual agency profiles (entity pages)
  • District-specific guides (geographic disambiguation)
  • Specialization pages (executive search vs staffing vs HR consulting)
  • Industry context (market size, fee structures, regulations)

AI search uses query fan-out. The broader your semantic coverage, the more variations you’re retrieved for.

2. Fact Density

Every page needs to contain information that isn’t available elsewhere. Not opinions. Not “top 10 lists” rewritten from other lists. New information.

Examples of high fact-density content:

  • “Michael Page Hong Kong operates from Des Voeux Road, Central. 4.9 stars across 350 Google reviews. Specializes in executive search for financial services and technology.”
  • “Average contingency recruitment fee in Hong Kong: 13-22% of first-year salary (2026 data).”

Low fact-density (what most sites do):

  • “Michael Page is one of the leading recruitment agencies in Hong Kong.”

The first version is citable. The second is noise.

3. Entity-Level Structure

AI search works with entities, not pages. An entity is a distinct thing: a company, a district, a specialization, a person.

Every entity should have:

  • Its own URL (addressable)
  • Unique descriptive content (not template-generated filler)
  • Explicit relationships to other entities (links, not just mentions)
  • Structured data (Schema.org, JSON-LD)

When an AI encounters a page about “Wan Chai District” that links to 72 individual company pages, each with unique descriptions, it recognizes this as genuine local expertise. When it encounters a page with 72 company names and addresses in a table, it recognizes this as a scraped directory.
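The "structured data" bullet above can be made concrete. The Schema.org types are real; the company data below is an illustrative placeholder, not an actual listing.

```python
import json

# Hypothetical LocalBusiness entity rendered as JSON-LD. "@context",
# "@type", and the property names are standard Schema.org vocabulary;
# the values are made-up placeholders.
entity = {
    "@context": "https://schema.org",
    "@type": "LocalBusiness",
    "name": "Example Search Partners",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Wan Chai",
        "addressRegion": "Hong Kong",
    },
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.8",
        "reviewCount": "120",
    },
    "sameAs": ["https://example.com"],
}

# Embedded in the page head as a script tag, where parsers expect it.
html_snippet = f'<script type="application/ld+json">{json.dumps(entity)}</script>'
```

Each entity page carries one such block; the `sameAs` and address properties are what lets a parser tie the page to a real-world entity rather than treating it as anonymous text.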

4. LLMs.txt

A new convention (llmstxt.org) that gives AI systems a machine-readable overview of your site’s content. Think of it as robots.txt for language models.

Instead of forcing an AI to crawl your entire site, llms.txt provides:

  • What the site is about
  • What entities it covers
  • Key facts and statistics
  • Navigation structure

We generate ours automatically from our knowledge vault. Every company, district, and article is listed with a one-line description.
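A generator for this kind of file can be sketched as follows. The heading layout (H1 title, blockquote summary, H2 sections of links) follows the llmstxt.org convention as we read it; the vault data shown is a placeholder.

```python
# Sketch: compile an llms.txt from vault metadata. Section entries are
# (name, url, one-line description) tuples; all data here is illustrative.
def build_llms_txt(title: str, summary: str,
                   sections: dict[str, list[tuple[str, str, str]]]) -> str:
    lines = [f"# {title}", "", f"> {summary}", ""]
    for section, entries in sections.items():
        lines.append(f"## {section}")
        for name, url, desc in entries:
            lines.append(f"- [{name}]({url}): {desc}")
        lines.append("")
    return "\n".join(lines)

vault = {
    "Companies": [("Acme Search", "/companies/acme",
                   "Executive search, Central. 120 reviews.")],
    "Districts": [("Wan Chai", "/districts/wan-chai", "72 agencies listed.")],
}
print(build_llms_txt("Headhunter HK",
                     "Directory of recruitment agencies in Hong Kong.", vault))
```

Running this on every deploy keeps the file in sync with the vault, so the AI-facing summary can never drift from the published pages.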

Knowledge Vault Architecture

This is our most significant finding. Most directories store data in a database and render it through templates. We use a knowledge vault — a collection of atomic Markdown notes with explicit relationships.

knowledge/headhunter/
  companies/     → 689 atomic notes
  districts/     → 18 notes
  stations/      → 97 notes
  industries/    → 22 notes
  facts/         → market data with sources
  articles/      → compiled from above

Each note connects to others via wikilinks. A company note links to its district, its specializations, and its MTR station. A district note links to all companies within it. An article about “Executive Search in Hong Kong” assembles facts from company notes, market data notes, and district notes.

Why this works for AI search:

  • Cross-linked pages create a traversable knowledge graph
  • Atomic facts are individually citable
  • Compiled articles have natural information density (they’re built from real data, not generated from a prompt)
  • The knowledge structure itself becomes a competitive moat — it’s hard to replicate without the underlying data
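The traversable graph mentioned above falls straight out of the wikilink syntax. A minimal sketch, assuming standard `[[target]]` links (the note bodies are placeholders):

```python
import re

# Sketch: extract [[wikilinks]] from atomic notes into an adjacency map.
def build_graph(notes: dict[str, str]) -> dict[str, list[str]]:
    return {name: re.findall(r"\[\[([^\]]+)\]\]", body)
            for name, body in notes.items()}

notes = {
    "acme-search": "Offices in [[wan-chai]]. Focus: [[executive-search]].",
    "wan-chai": "District page linking back to [[acme-search]].",
}
graph = build_graph(notes)
print(graph["acme-search"])  # → ['wan-chai', 'executive-search']
```

The same adjacency map drives both article compilation (follow outbound links to gather facts) and the cross-link structure the rendered site exposes to crawlers.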

LLM-First Website Strategy

We developed a universal framework for AI search optimization that applies to any website. The core principle: build for AI citation first, Google ranking second.

Seven pillars of an LLM-first site:

  1. llms.txt protocol. /llms.txt (summary) + /llms-full.txt (comprehensive). Machine-readable site map for AI crawlers. Lists entity types, counts, data freshness. Updated on every deploy.

  2. JSON-LD schema on every page. Collection pages get CollectionPage + ItemList. Business pages get LocalBusiness. Articles get Article + FAQPage. AI systems parse structured data more reliably than prose.

  3. Snippet-ready extracts. Every page gets a 40-60 word factual extract as its first paragraph. Entity-dense, no fluff, optimized for citation. Stored in the database as llm_snippet and rendered as the lead paragraph.

  4. Factual density over marketing copy. LLMs cite pages with verifiable facts: numbers, names, addresses, dates, certifications. Vague marketing language (“leading”, “trusted”, “comprehensive”) is actively harmful — it signals low information value.

  5. FAQ sections with schema. FAQPage markup on relevant pages. Questions written in natural language matching how people actually query AI systems.

  6. Content freshness signals. Last-updated dates visible on pages. Regular updates signal active maintenance. Stale content is less likely to be cited.

  7. Topical authority through depth. Deep coverage of your domain beats broad shallow coverage. Cross-link related pages to create a co-citation network.

Enrichment Quality: The Hard Part

AI-generated content is only as good as its validation. We developed strict rules for LLM enrichment:

  • Facts only, sourced from actual data. No puffery.
  • Forbidden phrases: “specializes in”, “trusted”, “reputable”, “comprehensive”, “one-stop”, “leading”. These are signals of thin content.
  • Entity-dense: names, numbers, addresses, dates, certifications.
  • Short when data is thin: one good sentence beats three padded ones.
  • LLM snippets: 40-60 words, dense factual extract. This is what AI search would quote.

Every enrichment output goes through deterministic validation: JSON schema check, word count floors, forbidden phrase detection. The LLM generates, but rules validate.
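The validation layer described above can be sketched in a few lines. The forbidden phrases are the ones listed; the word-count thresholds match the 40-60 word snippet rule, but the function name and error format are our own.

```python
# Sketch of deterministic enrichment validation: forbidden-phrase scan
# plus word-count bounds. The LLM generates; this code decides.
FORBIDDEN = ("specializes in", "trusted", "reputable",
             "comprehensive", "one-stop", "leading")

def validate_snippet(text: str, min_words: int = 40, max_words: int = 60) -> list[str]:
    """Return a list of rule violations; an empty list means the snippet passes."""
    errors = []
    words = len(text.split())
    if not (min_words <= words <= max_words):
        errors.append(f"word count {words} outside {min_words}-{max_words}")
    lowered = text.lower()
    for phrase in FORBIDDEN:
        if phrase in lowered:
            errors.append(f"forbidden phrase: {phrase!r}")
    return errors

print(validate_snippet("A leading agency."))  # fails on both rules
```

Because the checks are deterministic, a failed output can simply be regenerated and re-checked in a loop until it passes, with no human in between.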

Keyword Research: DataForSEO

We use DataForSEO’s API for keyword intelligence. Key insight: use advanced mode, not regular. Advanced returns organic results plus People Also Ask questions and related searches in a single call — essential for content strategy.

Pilot run for headhunter vertical (cost: $0.085):

  Keyword                        Monthly Volume   Competition
  recruitment agency hong kong   1,600            Medium
  headhunter hong kong           260              High
  executive search hong kong     260              Medium
  top 10 recruitment agencies    210              Medium

People Also Ask questions extracted directly map to FAQ entries and article outlines. “Who are the big five headhunters?” became an article. “How can I check if a recruitment agency is legit?” became a FAQ entry on every company page.
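A request against the advanced endpoint looks roughly like this. The endpoint path and payload fields follow the DataForSEO docs as we understand them; credentials and the keyword are placeholders, and nothing is actually sent here.

```python
import base64
import json
import urllib.request

# Hedged sketch of a DataForSEO live/advanced SERP request (our reading of
# their API; verify against current docs). We only build the request object.
API = "https://api.dataforseo.com/v3/serp/google/organic/live/advanced"

def build_request(login: str, password: str, keyword: str) -> urllib.request.Request:
    # Tasks are posted as a JSON array; one task per keyword.
    payload = [{"keyword": keyword,
                "location_name": "Hong Kong",
                "language_code": "en"}]
    token = base64.b64encode(f"{login}:{password}".encode()).decode()
    return urllib.request.Request(
        API,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Basic {token}",
                 "Content-Type": "application/json"},
    )

req = build_request("login", "secret", "recruitment agency hong kong")
# urllib.request.urlopen(req) would return organic results, People Also Ask
# questions, and related searches in a single response.
```

The point of advanced mode is that one call yields all three result types, so PAA extraction costs nothing extra on top of the keyword lookup.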

Competitor Analysis: The Market Is Empty

We analyzed existing Hong Kong directory sites. The finding was striking: all competitors have effectively zero traffic.

  Domain       Status
  gym.hk       “500+ gyms” landing page, 0 gyms inside
  beauty.hk    ~6 salon listings. SimilarWeb: no data (<20K visits)
  wedding.hk   “0+ vendors” — literally empty
  lawyer.hk    Parked, template site
  dentist.hk   Same platform as lawyer.hk

For a city of 7.5 million people searching for local services, these domains get negligible traffic. The quality threshold to rank is low because nobody has built anything real. Being indexed with genuine content equals instant category leadership.

Pillar Article Generation

Long-form SEO articles generated from keyword research + enriched company data + AI. The content engine for directory sites:

  1. Keyword research → target keywords per vertical, per district, per topic
  2. Context assembly → pull relevant companies, FAQs, stats from database
  3. AI generation → write article with structured template
  4. Upgrade pass → improve existing articles as new data becomes available
  5. Publish → article becomes a page in the knowledge vault

Quality rules: no marketing fluff. Entity-dense (real company names, addresses, specializations). Factual claims only, sourced from verified data. 800-2,000 words per article.

Result: 316 SEO articles across 6 verticals, each grounded in real company data.

Measuring AI Search Visibility

This is the hardest part. Traditional SEO has clear metrics: rankings, impressions, clicks. AI search visibility is opaque.

Current measurement approaches:

  • Log file analysis: Monitor requests from AI crawlers (GPTBot, PerplexityBot, ClaudeBot, Bingbot for Copilot)
  • Manual testing: Query AI systems directly and check if your content is cited
  • Self-hosted analytics: Umami tracks referral traffic from AI platforms without cookies
  • Share of Voice: For a set of target queries, how often are you cited vs competitors?
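The log-file approach is the most mechanical of the four. A minimal sketch, matching the crawler names listed above against combined-format access-log lines (the log entries are made up):

```python
from collections import Counter

# Sketch: count AI-crawler hits by user-agent substring in an access log.
BOTS = ("GPTBot", "PerplexityBot", "ClaudeBot", "bingbot")

def crawler_hits(log_lines) -> Counter:
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for bot in BOTS:
            if bot.lower() in lowered:
                hits[bot] += 1
    return hits

log = [
    '1.2.3.4 - - [13/Mar/2026] "GET /companies/acme HTTP/1.1" 200 "-" "GPTBot/1.0"',
    '5.6.7.8 - - [13/Mar/2026] "GET /llms.txt HTTP/1.1" 200 "-" "PerplexityBot/1.0"',
]
print(crawler_hits(log))
```

Substring matching on user agents is crude (agents can be spoofed, and reverse-DNS verification is stricter), but it is enough to see whether AI crawlers are fetching your entity pages at all.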

We don’t have definitive numbers yet. The experiment continues.

What We’ve Built

Headhunter.com.hk is our primary test site. Key metrics:

  • 1,076 companies enriched with AI-generated descriptions
  • 461 pages generated from structured knowledge vault data
  • 8 service-based specialization categories derived from actual service data
  • 18 district pages with company distributions
  • 316 SEO articles across all verticals
  • llms.txt + JSON-LD schema on every page
  • Self-hosted analytics for clean traffic measurement

The same engine powers 5 additional Hong Kong verticals (auditor.hk, renovation.hk, warehouse.hk, water.com.hk, education.headhunter.com.hk) — same codebase, different data, <2 second build per site, zero hosting cost.

References

  1. Aggarwal, P., et al. (2024). “GEO: Generative Engine Optimization.” ACM SIGKDD 2024. arxiv.org/abs/2311.09735
  2. Go Fish Digital (2026). “Generative Engine Optimization Strategies for 2026.” gofishdigital.com
  3. llmstxt.org — The /llms.txt file specification. llmstxt.org
  4. Google Patent US11769017B1 — Query fan-out for AI Overviews.
  5. Semrush (2025). “What Is LLMs.txt & Should You Use It?” semrush.com
  6. ResearchGate (2025). “GEO: The Mechanics, Strategy, and Economic Impact of the Post-Search Era.”

Last updated: March 2026. This is a living document.