Markdown for Agents and Statistics

Description

Markdown for Agents and Statistics converts your WordPress content to Markdown and serves it
to AI agents and language model tools that request it via HTTP content negotiation
(Accept: text/markdown).

The Chancery Lane Project is a charity that helps organisations reduce emissions using the power of legal documents and processes. We’ve published this plugin because we believe that making content more legible for AI agents makes a meaningful difference to their energy usage: it reduces the number of tokens required to consume the content (by up to 90% compared with HTML) and minimises the server resources required to render, process and display pages at source.

How it works:

  1. Posts and taxonomy archive pages are converted to Markdown and saved as static
    files on disk inside wp-content/uploads/.
  2. When a visitor (or AI agent) requests a page with Accept: text/markdown in
    the HTTP headers, WordPress serves the pre-generated .md file directly —
    no page render required.
  3. A <link rel="alternate" type="text/markdown"> tag is added to each page’s
    <head> so agents can discover Markdown versions automatically.
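
The negotiation decision in step 2 can be sketched as a small predicate. This is a minimal illustration in Python, not the plugin's PHP code: the function names and the AI_USER_AGENTS tuple are hypothetical, and the real User-Agent list is configurable in the plugin.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical list for illustration; the plugin's own list is configurable.
AI_USER_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def accepts_markdown(accept_header: str) -> bool:
    """Return True if text/markdown is one of the media types in an Accept header."""
    for part in accept_header.split(","):
        # Drop parameters such as ";q=0.9" before comparing.
        media_type = part.split(";", 1)[0].strip().lower()
        if media_type == "text/markdown":
            return True
    return False


def wants_markdown(url: str, headers: dict[str, str]) -> bool:
    """Check the three triggers: Accept header, query parameter, User-Agent."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if accepts_markdown(lowered.get("accept", "")):
        return True
    query = parse_qs(urlsplit(url).query)
    if "md" in query.get("output_format", []):
        return True
    user_agent = lowered.get("user-agent", "").lower()
    return any(bot.lower() in user_agent for bot in AI_USER_AGENTS)
```

A request for https://example.com/your-post/?output_format=md would match on the query parameter alone, whatever headers it carries.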

Features:

  • Content negotiation (Accept: text/markdown, ?output_format=md, or known AI User-Agents)
  • Taxonomy archive support — category, tag, and custom taxonomy term pages served as Markdown post listings
  • Automatic Markdown generation on post save; taxonomy archives auto-update when any post in the term changes
  • AJAX bulk generation with live progress counter — no page timeouts on large sites
  • Per-post-type field configuration — choose which meta/ACF fields go in frontmatter or body
  • ACF support with dot notation for nested group fields (e.g. group.subfield)
  • Content fields option — use ACF fields as the body content instead of post_content
  • Manifest generation with content hashes and change tracking per post type
  • Incremental export — only re-export changed documents (--incremental)
  • Delta file (changes.json) for RAG system sync
  • Access statistics — logs AI agent requests with a dedicated stats admin page
  • Access grouping by class of agent
  • Optional frontmatter fields — hierarchy (parent/ancestors/children IDs), author display name, root-relative featured image paths
  • Topics section — appends a ## Topics section with linked taxonomy terms to the Markdown body
  • Export preview — preview generated Markdown inline in the post editor without writing to disk
  • WP-CLI commands: generate, generate-taxonomies, prune-stats, status, delete
  • Fully unit-tested

Installation

  1. Upload the plugin to /wp-content/plugins/markdown-for-agents/, or install via the WordPress Plugins screen.
  2. Activate the plugin through the Plugins screen in WordPress.
  3. Visit Settings > Markdown for Agents and choose which post types and taxonomies to generate.
  4. Enable Auto-generate on save so files stay in sync as you publish or edit content (optional).
  5. Click Generate All to create Markdown for your existing content. On large sites you can also run wp markdown-agents generate and wp markdown-agents generate-taxonomies from WP-CLI.
  6. Verify by appending ?output_format=md to any post URL (or using an AI User-Agent) to confirm Markdown is served.

FAQ

Where are the Markdown files stored?

Inside wp-content/uploads/{export_dir}/ (configurable in Settings). Post files
live under {export_dir}/{post-type}/{slug}.md. Taxonomy archive files live under
{export_dir}/taxonomy/{taxonomy}/{term-slug}.md. The directory is served by
WordPress when content negotiation is triggered.
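
The layout described above can be expressed as two path helpers. This is a sketch for orientation only: the function names are hypothetical, and wp-mfa-exports is used as the export directory because it appears in the manifest example elsewhere in this FAQ.

```python
from pathlib import PurePosixPath

# Base uploads directory, relative to the WordPress root.
UPLOADS = PurePosixPath("wp-content/uploads")


def post_file(export_dir: str, post_type: str, slug: str) -> PurePosixPath:
    """Path of a post's Markdown file: {export_dir}/{post-type}/{slug}.md."""
    return UPLOADS / export_dir / post_type / f"{slug}.md"


def term_file(export_dir: str, taxonomy: str, term_slug: str) -> PurePosixPath:
    """Path of a taxonomy archive: {export_dir}/taxonomy/{taxonomy}/{term-slug}.md."""
    return UPLOADS / export_dir / "taxonomy" / taxonomy / f"{term_slug}.md"


print(post_file("wp-mfa-exports", "post", "hello-world"))
print(term_file("wp-mfa-exports", "category", "news"))
```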

Will this slow down my site?

No. Markdown files are generated ahead of time (on post save or via manual/CLI
bulk generation). Serving them is a simple file read, much faster than rendering
a full WordPress page.

AI agents are getting HTML instead of Markdown. Why?

Almost always this is a CDN, firewall, or page cache sitting in front of
WordPress — not the plugin. On many hosts (for example Cloudflare in front of WP
Engine) the edge answers a request before it ever reaches the plugin: a full-page
cache can return the cached HTML, or a bot/WAF rule can block a known AI crawler
with a 403/429.

The reliable route is the query parameter: append ?output_format=md to any post
or archive URL. Because that is a distinct URL, caches store it separately and
firewalls treat it as an ordinary request, so it reaches the plugin even on a
hardened stack. The plugin advertises this URL automatically via a
<link rel="alternate" type="text/markdown"> tag in each page’s <head>, so
agents that read the page can discover and follow it.

The Accept: text/markdown header and User-Agent routes also work, but only if
your CDN/cache is configured to let them through (see the next question).

How do I let my CDN or cache serve Markdown to agents?

This is host/CDN configuration, not a plugin setting. Two changes help:

  • Page cache (WP Engine, LiteSpeed, Varnish, nginx): exclude agent-shaped
    requests from the full-page cache — any request whose Accept header contains
    text/markdown, whose query string contains output_format=md, or whose
    User-Agent is a known AI bot. Do not add User-Agent to the cache key; that
    fragments the cache for every visitor. Exclude from caching, do not key on it.
  • Firewall / bot rules (Cloudflare): add a skip/allow rule for the AI
    User-Agents you want to serve (for example GPTBot, ClaudeBot, PerplexityBot,
    Google-Extended). Otherwise they receive a 403/429 and get nothing.

If you skip this, nothing breaks — agents simply use the ?output_format=md URL
via discovery instead. The plugin already protects against the reverse problem:
Markdown responses are sent with Cache-Control: private, no-store and
Vary: Accept, User-Agent, so a shared cache cannot replay the Markdown to a
human browser on the same URL.
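
To see why the no-store default matters, consider a toy model of a shared cache that keys on the URL only and ignores Vary. This is a deliberately simplified sketch, not a description of how any particular CDN behaves; SharedCache and make_origin are hypothetical names.

```python
class SharedCache:
    """Toy full-page cache keyed by URL only, i.e. one that ignores Vary."""

    def __init__(self):
        self.store = {}

    def fetch(self, url, accept, origin):
        if url in self.store:
            return self.store[url]  # replayed regardless of the Accept header
        body, cache_control = origin(url, accept)
        if "no-store" not in cache_control:
            self.store[url] = body
        return body


def make_origin(markdown_cache_control):
    """Build an origin that negotiates Markdown and HTML on the same URL."""
    def origin(url, accept):
        if "text/markdown" in accept:
            return "# Markdown body", markdown_cache_control
        return "<html>HTML body</html>", "public, max-age=300"
    return origin


url = "https://example.com/your-post/"

# Cacheable Markdown: the agent's response is replayed to the browser.
leaky = SharedCache()
leaky.fetch(url, "text/markdown", make_origin("public, max-age=300"))
print(leaky.fetch(url, "text/html", make_origin("public, max-age=300")))

# Markdown sent with no-store: the browser still gets HTML.
safe = SharedCache()
safe.fetch(url, "text/markdown", make_origin("private, no-store"))
print(safe.fetch(url, "text/html", make_origin("private, no-store")))
```

In the first case the browser receives the Markdown body, which is the failure the default headers are meant to rule out.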

How can I check what an agent actually receives?

Request a page the way an agent would and inspect the response headers:

```
# Query-param route (the reliable one)
curl -sI 'https://example.com/your-post/?output_format=md'

# Accept-header route
curl -sI -H 'Accept: text/markdown' 'https://example.com/your-post/'
```

A genuine Markdown response from the plugin has Content-Type: text/markdown and
an X-Markdown-Source: markdown-for-agents header. If you instead see
Content-Type: text/html, the request was answered by a cache or firewall before
reaching the plugin (see the previous questions). Note that running these from
your own server may bypass your CDN; testing from an external network shows what
real agents experience.
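
If you check many URLs, the header inspection above can be automated. The following is a minimal sketch that classifies a response from its status code and headers; classify_response and its return strings are hypothetical, and only the header names and values come from the description above.

```python
def classify_response(status: int, headers: dict[str, str]) -> str:
    """Classify a response using the signals described in this FAQ."""
    lowered = {name.lower(): value for name, value in headers.items()}
    # Strip parameters such as "; charset=UTF-8" from the content type.
    content_type = lowered.get("content-type", "").split(";", 1)[0].strip().lower()
    source = lowered.get("x-markdown-source", "")
    if status in (403, 429):
        return "blocked by a firewall or bot rule"
    if content_type == "text/markdown" and source == "markdown-for-agents":
        return "markdown from the plugin"
    if content_type == "text/html":
        return "html, probably answered before reaching the plugin"
    return "unrecognised"


print(classify_response(200, {
    "Content-Type": "text/markdown; charset=UTF-8",
    "X-Markdown-Source": "markdown-for-agents",
}))
```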

Should I publish an llms.txt file?

llms.txt is a proposed convention for a single Markdown index of your site at
https://example.com/llms.txt, aimed at AI tools that look for a site-level
manifest. It is an emerging community convention, not an official standard, and
there is limited evidence that the major AI crawlers consume it yet — so treat it
as low-cost, optional, and complementary to the per-page discovery this plugin
already provides.

This plugin does not generate llms.txt. If you want one, publish a static file at your web root listing your
key pages with their ?output_format=md URLs, and keep it in sync with published
and retired content or it will point agents at missing pages.
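
If you do maintain such a file, a short script can build it from a list of pages. This sketch assumes the layout used in the llms.txt proposal (a title heading, a one-line summary as a blockquote, then a list of links); the function names, the "Pages" section title and the sample data are all hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def markdown_url(page_url: str) -> str:
    """Append output_format=md, preserving any existing query parameters."""
    parts = urlsplit(page_url)
    query = dict(parse_qsl(parts.query))
    query["output_format"] = "md"
    return urlunsplit(parts._replace(query=urlencode(query)))


def build_llms_txt(site_name: str, summary: str, pages: list[tuple[str, str]]) -> str:
    """Render a simple llms.txt body from (title, url) pairs."""
    lines = [f"# {site_name}", "", f"> {summary}", "", "## Pages", ""]
    lines += [f"- [{title}]({markdown_url(url)})" for title, url in pages]
    return "\n".join(lines) + "\n"


print(build_llms_txt(
    "Example Site",
    "Key pages, available as Markdown.",
    [("Your post", "https://example.com/your-post/")],
))
```

Regenerating the file from your published pages on a schedule is one way to avoid listing retired content.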

What are taxonomy archive files?

For every public taxonomy term (categories, tags, custom taxonomies) the plugin
generates a Markdown file listing all published posts in that term with links and
excerpts. These are served automatically when an AI agent requests a taxonomy
archive URL. This lets agents navigate your site structure by exploring term listings,
not just individual posts.

What is the manifest.json file?

When you generate with --with-manifest or --incremental, a manifest.json is
created inside each post-type export folder (e.g. wp-mfa-exports/post/manifest.json).
It contains a registry of all exported documents with content hashes and change
tracking (new/modified/unchanged/deleted), enabling RAG systems to identify what
changed since the last export without reprocessing all documents.

How does incremental export work?

Use wp markdown-agents generate --incremental to only re-export documents that
have changed since the last export. The plugin compares content hashes against the
previous manifest.json and skips unchanged posts. This also generates a
changes.json delta file listing new, modified, and deleted documents — your RAG
system can read this to know exactly what to re-embed.
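
The hash comparison behind incremental export can be illustrated as follows. This is a sketch of the general technique only: the real manifest.json and changes.json schemas are not reproduced here, SHA-256 is an assumed hash function, and the flat slug-to-hash dictionary is a hypothetical stand-in for the manifest.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """Hash a document body (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def diff_export(previous: dict[str, str], current_docs: dict[str, str]):
    """Compare current documents against the previous slug -> hash mapping.

    Returns the new mapping plus lists of new, modified, unchanged and
    deleted slugs.
    """
    current = {slug: content_hash(body) for slug, body in current_docs.items()}
    changes = {"new": [], "modified": [], "unchanged": [], "deleted": []}
    for slug, digest in sorted(current.items()):
        if slug not in previous:
            changes["new"].append(slug)
        elif previous[slug] != digest:
            changes["modified"].append(slug)
        else:
            changes["unchanged"].append(slug)
    changes["deleted"] = sorted(set(previous) - set(current))
    return current, changes


previous = {"about": content_hash("# About"), "old-post": content_hash("# Old")}
docs = {"about": "# About us", "hello-world": "# Hello"}
_, changes = diff_export(previous, docs)
print(changes)
```

A RAG pipeline would re-embed only the new and modified entries and drop the deleted ones from its index.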

How do I configure fields per post type?

In Settings > Markdown for Agents, each enabled post type has its own
“Field Configuration” section with two textareas:

  • Frontmatter fields — meta or ACF fields added to the YAML frontmatter.
  • Content fields — meta or ACF fields used as the body content. When set,
    post_content is automatically excluded.

Use dot notation for ACF group fields (e.g. clause_fields.clause_summary).
Plain meta keys work too (e.g. _yoast_wpseo_title). ACF relationship fields
are automatically converted to a list of post titles.
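
Dot notation amounts to walking nested groups one key at a time. The sketch below shows the idea on a plain dictionary; resolve_field is a hypothetical helper and the sample values are invented, while the field names are the ones used as examples above.

```python
from typing import Any


def resolve_field(fields: dict[str, Any], path: str, default: Any = None) -> Any:
    """Follow a dotted path such as "group.subfield" through nested dicts."""
    current: Any = fields
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Invented sample data for illustration.
fields = {
    "clause_fields": {"clause_summary": "A short summary."},
    "_yoast_wpseo_title": "Custom SEO title",
}
print(resolve_field(fields, "clause_fields.clause_summary"))
print(resolve_field(fields, "_yoast_wpseo_title"))
```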

Can I customise the Markdown output?

Yes. Several filters are available:

  • markdown_for_agents_pre_convert — filter HTML before conversion
  • markdown_for_agents_post_convert — filter Markdown after conversion
  • markdown_for_agents_frontmatter — modify frontmatter fields for a post
  • markdown_for_agents_taxonomy_frontmatter — modify frontmatter fields for a taxonomy archive
  • markdown_for_agents_serve_enabled — enable/disable serving for a specific post
  • markdown_for_agents_serve_taxonomies — enable/disable serving for taxonomy archive pages
  • markdown_for_agents_cache_headers — override the cache-related headers sent with the Markdown response
  • markdown_for_agents_file_generated — action fired after a file is written
  • markdown_for_agents_file_deleted — action fired after a file is deleted

Can I let CDNs/full-page caches cache the Markdown responses?

By default the Markdown response is sent with Cache-Control: private, no-store, max-age=0 (plus X-LiteSpeed-Cache-Control, X-Accel-Expires and Vary: Accept, User-Agent). This is deliberate: the Markdown is negotiated on the same URL as the HTML page, so a shared cache that ignores or normalises Vary could otherwise store the Markdown variant and replay it to ordinary browsers expecting HTML.

If your CDN/cache layer honours Vary correctly (or you serve Markdown from distinct URLs), you can relax this with the markdown_for_agents_cache_headers filter. Map any header to an empty string to omit it entirely:

```
add_filter( 'markdown_for_agents_cache_headers', function ( array $headers, string $filepath ) {
    // Allow shared caches to store the Markdown response for five minutes.
    $headers['Cache-Control'] = 'public, max-age=300';
    // An empty string omits the header entirely.
    $headers['X-LiteSpeed-Cache-Control'] = '';
    $headers['X-Accel-Expires']           = '';
    return $headers;
}, 10, 2 );
```

This filter governs only the cache-related headers listed above. The Content-Signal and X-Markdown-Source headers are sent separately and are unaffected (Content-Signal has its own markdown_for_agents_content_signal filter).

Override with caution: if a cache stores the Markdown variant incorrectly, it may be served to browsers.

How do I generate taxonomy archives via WP-CLI?

wp markdown-agents generate-taxonomies
wp markdown-agents generate-taxonomies --taxonomy=category
wp markdown-agents generate-taxonomies --dry-run

Contributors & Developers

“Markdown for Agents and Statistics” is open source software.

Changelog

1.5.1

  • Add markdown_for_agents_cache_headers filter so the cache-related headers on Markdown responses can be customised (e.g. to allow CDN caching where Vary is honoured). Defaults are unchanged and remain cache-bypassing.

1.5.0

  • Add a new 'skipped' grouping when generating MD files, showing files that were skipped for a good reason (password-protected, draft, etc.) rather than reporting them as failed.
  • Add a new 'Agent Class' graph on the Agent Stats page, which mirrors the Known Agents classifications to help you understand traffic patterns.
  • Better documentation for caching and generation logic.

1.4.5

  • Fix: memcache could cause problems during CLI-invoked rebuilds on large sites. Also resolves minor issues with output generated by post filters appearing in the MD output, while still allowing it in blocks where needed.

1.4.4

  • Fix: full-page caches (LiteSpeed, Varnish, nginx fastcgi_cache) could store the Markdown response under a page URL when an AI agent or ?output_format=md request hit it first, then replay the .md body to subsequent HTML browser requests. Markdown responses now send Cache-Control: private, no-store, X-LiteSpeed-Cache-Control: no-cache, X-Accel-Expires: 0, and Vary: Accept, User-Agent unconditionally.

1.4.3

  • Update to fix deleting posts on status change outside of auto-update flow

1.4.2

  • Fixed an issue where private/draft posts were generated as MD files, and added a checkbox to post edit pages to exclude posts from MD generation. Also fixes a small issue where unusual taxonomy slugs produced incorrect URLs in the Topics section of the MD body. Adds Strauss namespacing to html-to-markdown/Composer includes to avoid collisions.

1.4.1

  • Removed llms.txt index generation. The LlmsTxtGenerator class, its --with-llmstxt WP-CLI flag on wp markdown-agents generate, and the corresponding unit tests have been dropped.

1.4.0

  • Add notices and copy around generating and regenerating content on install and updates to Settings
  • Add transient to store and note when content needs regenerating

1.3.0

  • Optional hierarchy frontmatter fields (parent, ancestors, children IDs) for hierarchical post types (pages, etc.).
  • Optional author display name in frontmatter.
  • Optional root-relative paths for featured images (survives domain migrations).
  • Optional ## Topics section appended to the Markdown body with linked taxonomy terms.
  • Export preview: a “Preview Markdown” button in the post meta box renders generated Markdown inline without writing to disk.
  • New WP-CLI command: wp markdown-agents prune-stats [--days=<n>] [--yes] — removes access stats older than N days.
  • Manifest hash now covers taxonomy term slugs — incremental export correctly detects posts whose terms changed.

1.2.0

  • Taxonomy archive support — generates Markdown index files for all public taxonomy terms (categories, tags, custom taxonomies), served via content negotiation.
  • Taxonomy archives auto-regenerate when any post in the term is saved or deleted.
  • AJAX bulk generation for taxonomy archives on the Settings page with live progress counter.
  • New WP-CLI command: wp markdown-agents generate-taxonomies [--taxonomy=<slug>] [--dry-run].
  • <link rel="alternate" type="text/markdown"> tag now emitted on taxonomy archive pages.
  • New filter: markdown_for_agents_serve_taxonomies to enable/disable taxonomy archive serving globally.
  • New filter: markdown_for_agents_taxonomy_frontmatter to modify taxonomy archive frontmatter before serialisation.
  • Bulk generation buttons converted to AJAX with live counter — no more page timeouts on large sites.

1.1.0

  • Per-post-type field configuration for frontmatter and content fields.
  • ACF support with dot notation for nested group fields.
  • Content fields option — use ACF/meta fields as body content instead of post_content.
  • ACF relationship fields automatically normalised to post titles.
  • Added manifest.json generation with content hashes and change tracking.
  • New --with-manifest flag for wp markdown-agents generate.
  • Manifest is generated per post-type folder for independent change tracking.
  • Incremental export via --incremental — skips unchanged documents.
  • Delta file (changes.json) generated for RAG system integration.
  • Access statistics — logs AI agent requests; dedicated stats admin page.
  • UA detection — configurable User-Agent strings force Markdown serving.

1.0.0

  • Initial release.