# SEO & Crawl Budget

How content negotiation avoids duplicate content penalties and preserves your crawl budget.
## The problem with separate endpoints
The common approach to serving markdown for LLMs is to create separate API endpoints — something like `/api/products/42.md` or `/api/md/products/42`. This creates problems:
- **Duplicate content:** Two URLs serve the same information. Search engines may treat this as duplicate content and penalize your site
- **Wasted crawl budget:** Search engine crawlers may discover and index your markdown endpoints, spending crawl budget on content that isn't meant for them
- **Sitemap bloat:** You must either exclude markdown URLs from your sitemap or accept an inflated one
- **Maintenance overhead:** Two sets of routes to maintain, two sets of URLs to manage
## How content negotiation solves this
next-md-negotiate uses HTTP content negotiation, which means no new URLs are created. The same URL, `/products/42`, serves both HTML and markdown depending on the request's `Accept` header:
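The decision can be sketched as a small helper (a simplified illustration, not the library's actual implementation; real `Accept` parsing also handles q-values and wildcard matching such as `text/*`):

```typescript
type Representation = "html" | "markdown";

// Decide which representation to serve for a given Accept header.
// Markdown is served only when explicitly requested; everything
// else (including wildcards like */*) falls back to HTML.
function negotiate(acceptHeader: string | null): Representation {
  if (!acceptHeader) return "html"; // no header: default to HTML
  const mediaTypes = acceptHeader
    .split(",")
    .map((entry) => entry.split(";")[0].trim().toLowerCase());
  return mediaTypes.includes("text/markdown") ? "markdown" : "html";
}
```

With this logic, Googlebot's `Accept: text/html,...` and curl's default `*/*` both resolve to HTML; only a client that explicitly lists `text/markdown` gets markdown.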
| Aspect | Separate endpoints | Content negotiation |
|---|---|---|
| URLs created | 2x (HTML + markdown) | 1x (same URL) |
| Duplicate content risk | High | None |
| Crawl budget impact | Doubled | Zero |
| Sitemap changes | Required | None |
| robots.txt changes | Recommended | None |
## Zero impact on crawl budget

Crawl budget is the number of pages a search engine will crawl on your site within a given time period. Every URL it crawls consumes part of this budget.
With next-md-negotiate:
- **No new URLs exist** — there is nothing extra to crawl
- **Crawlers send `text/html`** — standard search engine crawlers (Googlebot, Bingbot, etc.) always request HTML, so they get the normal page
- **No accidental indexing** — markdown is only returned when explicitly requested via the `Accept` header, which crawlers don't do
- **`robots.txt` unchanged** — no new paths to block or allow
## How different crawlers behave
### Standard search engines
Google, Bing, Yandex, and other search engines send `Accept: text/html`. They will always receive your normal HTML page. Content negotiation is invisible to them — your SEO is completely unaffected.
### LLM crawlers
AI crawlers like those from OpenAI, Anthropic, and Perplexity can opt in to receive markdown by sending `Accept: text/markdown`. This is beneficial for both sides:
- **For you:** LLM agents consume significantly less bandwidth (markdown is typically 10-50x smaller than HTML)
- **For LLMs:** Clean, structured markdown is easier to parse and produces better results than extracting text from HTML
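The bandwidth difference shows up even on a toy page (an illustrative sketch; real pages carry far more markup, scripts, and styles, so real-world ratios are much larger than this example's):

```typescript
// Toy comparison of an HTML page vs its markdown equivalent.
const htmlPage =
  `<!DOCTYPE html><html><head><title>Widget</title>` +
  `<link rel="stylesheet" href="/styles.css">` +
  `<script src="/app.js"></script></head>` +
  `<body><nav><a href="/">Home</a></nav>` +
  `<main><h1>Widget</h1><p>A useful widget.</p></main>` +
  `<footer>Example Inc.</footer></body></html>`;
const markdownPage = `# Widget\n\nA useful widget.\n`;

// Measure the payload sizes in bytes.
const htmlBytes = new TextEncoder().encode(htmlPage).length;
const mdBytes = new TextEncoder().encode(markdownPage).length;
console.log(`HTML: ${htmlBytes} bytes, markdown: ${mdBytes} bytes`);
```

Even this minimal page, with no real styling or scripts, is several times larger as HTML than as markdown.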
### Standard web crawlers
Tools like `wget`, `curl` (without explicit headers), and most web scrapers send generic `Accept` headers (`*/*`). They receive HTML by default. Only explicit `text/markdown` requests get markdown.
## Canonical URLs
Since both representations share the same URL, there is no canonical URL conflict. Your existing `<link rel="canonical">` tags continue to work exactly as before. No changes to your canonical URL strategy are needed.
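For instance, in a Next.js App Router page the canonical can be declared through the standard `metadata` API, and it needs no changes for negotiation (a sketch; `example.com` and the `/products/[id]` route are placeholders):

```typescript
// app/products/[id]/page.tsx (sketch)
import type { Metadata } from "next";

// The canonical points at the one shared URL; the same tag is
// correct whether the response body is HTML or markdown.
export async function generateMetadata({
  params,
}: {
  params: { id: string };
}): Promise<Metadata> {
  return {
    alternates: { canonical: `https://example.com/products/${params.id}` },
  };
}
```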
## Best practices
- **Keep content in sync** — the markdown version should contain the same information as the HTML version. Don't hide important content in one format only
- **Markdown should be a subset** — it's fine for the markdown version to be simpler (no navigation, no sidebars, no decorative content). Focus on the core content
- **Don't block LLM crawlers** — if you want to serve markdown to AI agents, make sure your `robots.txt` doesn't block their user agents
- **Use LlmHint** — the `LlmHint` component helps AI agents discover that markdown is available, so they can re-request with the right header
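A hypothetical usage sketch for that last point (the component's actual import path and props may differ; check the package's README):

```typescript
// app/layout.tsx (hypothetical sketch)
import { LlmHint } from "next-md-negotiate";

export default function RootLayout({
  children,
}: {
  children: React.ReactNode;
}) {
  return (
    <html lang="en">
      <body>
        {/* Hints to AI agents that text/markdown is available
            at this same URL via the Accept header. */}
        <LlmHint />
        {children}
      </body>
    </html>
  );
}
```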
**Bottom line:** Content negotiation is the SEO-safe way to serve markdown to LLMs. No new URLs, no duplicate content, no wasted crawl budget. Your search rankings are not affected.