SEO & Crawl Budget

How content negotiation avoids duplicate-content penalties and preserves your crawl budget.

The problem with separate endpoints

The common approach to serving markdown for LLMs is to create separate API endpoints — something like /api/products/42.md or /api/md/products/42. This creates problems:

  • Duplicate content: Two URLs serve the same information. Search engines may see this as duplicate content and penalize your site
  • Wasted crawl budget: Search engine crawlers may discover and index your markdown endpoints, wasting precious crawl budget on content that isn't meant for them
  • Sitemap bloat: You need to either exclude markdown URLs from your sitemap or accept an inflated sitemap
  • Maintenance overhead: Two sets of routes to maintain, two sets of URLs to manage

How content negotiation solves this

next-md-negotiate uses HTTP content negotiation, which means no new URLs are created. The same URL /products/42 serves both HTML and markdown depending on the Accept header:

| Aspect                 | Separate Endpoints    | Content Negotiation |
| ---------------------- | --------------------- | ------------------- |
| URLs created           | 2x (HTML + markdown)  | 1x (same URL)       |
| Duplicate content risk | High                  | None                |
| Crawl budget impact    | Doubled               | Zero                |
| Sitemap changes        | Required              | None                |
| robots.txt changes     | Recommended           | None                |
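The right-hand column above boils down to a single decision per request. A minimal sketch of the idea (illustrative only, not next-md-negotiate's actual API): one URL, two representations, chosen by the Accept header.

```typescript
type Representation = "text/html" | "text/markdown";

// Return markdown only when the client explicitly asks for it;
// anything else (text/html, */*, absent header) falls back to HTML.
function negotiate(accept: string | null): Representation {
  if (accept && accept.toLowerCase().includes("text/markdown")) {
    return "text/markdown";
  }
  return "text/html";
}

// Hypothetical route handler: the same URL serves both formats,
// so no extra URL ever exists for a crawler to discover.
function handleRequest(
  accept: string | null,
  page: { html: string; md: string }
): { contentType: Representation; body: string } {
  const contentType = negotiate(accept);
  return {
    contentType,
    body: contentType === "text/markdown" ? page.md : page.html,
  };
}
```

Because HTML is the default branch, any client that does not explicitly opt in (including every search engine crawler) gets the normal page.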

Zero impact on crawl budget

Crawl budget is the number of pages a search engine will crawl on your site within a given time period. Every URL the crawler fetches spends part of this budget.

With next-md-negotiate:

  • No new URLs exist — there is nothing extra to crawl
  • Crawlers send text/html — standard search engine crawlers (Googlebot, Bingbot, etc.) always request HTML, so they get the normal page
  • No accidental indexing — markdown is only returned when explicitly requested via the Accept header, which crawlers don't do
  • robots.txt unchanged — no new paths to block or allow

How different crawlers behave

Standard search engines

Google, Bing, Yandex, and other search engines send Accept: text/html. They will always receive your normal HTML page. Content negotiation is invisible to them — your SEO is completely unaffected.

LLM crawlers

AI crawlers like those from OpenAI, Anthropic, and Perplexity can opt in to receiving markdown by sending Accept: text/markdown. This benefits both sides:

  • For you: LLM agents consume significantly less bandwidth (markdown is typically 10-50x smaller than the equivalent HTML)
  • For LLMs: Clean, structured markdown is easier to parse and produces better results than extracting text from HTML
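From the agent's side, the entire opt-in is one request header. A minimal client-side sketch (the helper name and URL are illustrative):

```typescript
// The markdown opt-in is a single Accept header. A server that ignores it
// simply responds with HTML, so the fallback is always safe.
function markdownRequestHeaders(): Record<string, string> {
  return { Accept: "text/markdown" };
}

// Usage (URL illustrative):
//   const res = await fetch("https://example.com/products/42", {
//     headers: markdownRequestHeaders(),
//   });
```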

Standard web crawlers

Tools like wget, curl (without explicit headers), and most web scrapers send generic Accept headers (*/*). They receive HTML by default. Only explicit text/markdown requests get markdown.
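The rule above can be sketched as a small predicate (illustrative, not the library's internals): only an explicit text/markdown entry in the Accept header opts a client in, never a wildcard like */* or text/*.

```typescript
// True only when the Accept header explicitly lists text/markdown.
// Wildcards, missing headers, and HTML-only lists all map to HTML.
function wantsMarkdown(accept: string | undefined): boolean {
  if (!accept) return false;
  return accept
    .split(",")                                       // "a;q=0.9, b" -> entries
    .map((part) => part.split(";")[0].trim().toLowerCase()) // drop q-values
    .includes("text/markdown");
}
```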

Canonical URLs

Since both representations share the same URL, there is no canonical URL conflict. Your existing <link rel="canonical"> tags continue to work exactly as before. No changes to your canonical URL strategy are needed.
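For example, a typical Next.js App Router canonical declaration keeps working unchanged (in a page or layout file this would be `export const metadata`; the domain is illustrative):

```typescript
// No markdown-specific URL exists, so the canonical points at the one real URL.
const metadata = {
  alternates: {
    canonical: "https://example.com/products/42",
  },
};
```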

Best practices

  1. Keep content in sync — The markdown version should contain the same information as the HTML version. Don't hide important content in one format only
  2. Markdown should be a subset — It's fine for the markdown version to be simpler (no navigation, no sidebars, no decorative content). Focus on the core content
  3. Don't block LLM crawlers — If you want to serve markdown to AI agents, make sure your robots.txt doesn't block their user agents
  4. Use LlmHint — The LlmHint component helps AI agents discover that markdown is available, so they can re-request with the right header

Bottom line: Content negotiation is the SEO-safe way to serve markdown to LLMs. No new URLs, no duplicate content, no wasted crawl budget. Your search rankings are not affected.