How We Built a 5-Step Cross-Platform Publishing Pipeline with AI Agents

Most developer blogs start with a markdown file and end with a "Publish" button. Ours starts with an AI agent scraping GitHub Trending and ends with the same post live on four platforms — each with the correct canonical URL, proper SEO ownership, and a Supabase record tying it all together.

Here's how we built it, what broke, and what we learned.

The Problem

We publish a GitHub Trending digest three times a week. Each post needs to live on:

hongphuc5497.com — our personal site (Next.js, Vercel)
Substack — newsletter subscribers
Dev.to — developer community
Supabase — content store for the site's dynamic pages

Doing this manually meant: copy-paste to Substack, format for Dev.to, write a YAML-frontmatter note for the site, then forget to update the content database. Canonical URLs were wrong. Dates drifted. The Supabase store was perpetually out of sync.

We wanted a single command that produces all four — in the right order, with the right URLs.

The Architecture

The pipeline has five steps, executed in strict order:

Step 1: Create   → LLM agent writes the post (Hermes cron)
Step 2: Site     → hongphuc5497.com (canonical source of truth)
Step 3: Substack → canonical URL → personal site
Step 4: Dev.to   → canonical URL → personal site
Step 5: Supabase → records all 3 platform URLs

The personal site is always first. Substack and Dev.to both point their canonical_url to it. This tells Google that the personal site is the original and the platform copies are duplicates, so the personal site is the version that should rank. SEO ownership stays with us.

What We Automated

Two AI agents split the work across three roles:

Hermes runs the cron scheduler — fetches trending data, orchestrates the pipeline, delivers to Telegram
Codex builds the repo infrastructure — staging scripts, archive logic, Agent Ops protocol
The Hermes agent itself composes the digest analysis — reads the raw JSON, writes 2-3 sentence summaries per repo

The fetcher script scrapes github.com/trending directly (no API key needed), extracts repo names, star counts, languages, and descriptions, then outputs structured JSON with local timezone fields. The agent uses those fields — not UTC, not the system clock — for the post title and filename.
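A minimal sketch of how the local timezone fields might be derived. Only local_date_long is a field name taken from this post; the other field names, the fixed UTC+7 offset, the date format, and the year in the example are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+7 offset stands in for the fetcher's local timezone.
LOCAL_TZ = timezone(timedelta(hours=7))


def local_date_fields(fetched_at_utc: datetime) -> dict[str, str]:
    """Derive local-time fields from the UTC instant at which the fetch ran."""
    local = fetched_at_utc.astimezone(LOCAL_TZ)
    return {
        "fetched_at_utc": fetched_at_utc.isoformat(),    # hypothetical field
        "local_date": local.strftime("%Y-%m-%d"),        # hypothetical, for filenames
        "local_date_long": local.strftime("%B %d, %Y"),  # for the post title
    }


# A fetch at 15:36 UTC lands at 10:36 PM local time on the same calendar day.
fields = local_date_fields(datetime(2025, 6, 4, 15, 36, tzinfo=timezone.utc))
print(fields["local_date_long"])
```

Computing the date once, inside the fetcher, means every later step reads the same value instead of asking its own clock.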

What Broke

1. The Date Drift Problem

The most subtle bug: the trending data was fetched at 10:36 PM on June 4, but the cron job ran at 9:00 AM on June 5. The agent titled the digest "June 05" even though the fetcher data said "June 04."

The fix: Two-layer enforcement. The staging script now exits with code 1 when the title date doesn't match the fetcher's local_date_long field. The cron prompt explicitly warns: "STAGING WILL FAIL with exit code 1 if the title date does not match the fetcher date."

This turned a silent warning into a hard pipeline failure. The agent can't ignore it anymore.
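The hard check can be sketched roughly as follows. The function name, the substring comparison, and the date format are assumptions; the real staging script may compare dates differently.

```python
import sys


def check_title_date(title: str, fetcher: dict) -> int:
    """Return 0 when the title contains the fetcher's date, 1 otherwise."""
    expected = fetcher["local_date_long"]
    if expected in title:
        return 0
    print(
        f"STAGING FAILED: title {title!r} does not contain fetcher date {expected!r}",
        file=sys.stderr,
    )
    return 1


# Hypothetical data reproducing the drift: fetched on June 4, titled on June 5.
fetcher = {"local_date_long": "June 04, 2025"}
exit_code = check_title_date("GitHub Trending Digest: June 05, 2025", fetcher)
print(exit_code)
```

A staging script would pass that value to sys.exit(), so the cron run stops with a non-zero status instead of logging a warning nobody reads.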

2. The Supabase Namespace Shadow

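The markdown-content-store repo has a supabase/migrations/ directory (standard Supabase CLI layout). Because supabase/ has no __init__.py, Python's import system treated it as a namespace package, and that package stood in for the real supabase pip library. A namespace package only wins when no regular package of the same name is found anywhere on sys.path, so this bites when a script runs without the virtualenv's site-packages on its path.

When any publishing script ran from supabase import Client, Python found the local supabase/ directory instead of the installed library. Every content store save silently failed.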

The fix: In src/db.py, we prepend .venv/lib/python*/site-packages to sys.path before the import. The real library is found first. One file, 14 lines, zero external changes.
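A sketch of that kind of path fix, assuming a POSIX virtualenv layout. The helper name is hypothetical and this is not the actual contents of src/db.py.

```python
import sys
from pathlib import Path


def prepend_venv_site_packages(repo_root: Path) -> list[str]:
    """Move the virtualenv's site-packages to the front of sys.path."""
    found = [str(p) for p in sorted(repo_root.glob(".venv/lib/python*/site-packages"))]
    for entry in found:
        # Drop any later occurrence so the entry exists only at the front.
        if entry in sys.path:
            sys.path.remove(entry)
    sys.path[:0] = found
    return found


# Intended use at the top of a module, before the import that was shadowed:
#   prepend_venv_site_packages(Path(__file__).resolve().parent.parent)
#   from supabase import Client
```

The call has to run before the first import of the package. Once the namespace version is cached in sys.modules, changing sys.path no longer affects that name.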

3. The Multi-Line List Numbering

The Hermes agent writes numbered lists like:

1. **chopratejas/headroom** · Python · 12.6K⭐ +3.1K
   Compress tool outputs before they hit the LLM...

But when Substack's from_markdown() parser sees this, it merges the continuation line into the heading, breaking the numbering entirely. The same issue hit Dev.to.

The fix: We built a custom ProseMirror JSON builder (pm_builder.py) that correctly handles headings, inline formatting, and multi-line content. For Dev.to, we added _reformat_ordered_lists() to merge continuation lines into single-line items before hitting the API.
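The continuation-line merge can be sketched like this. It is an illustrative reimplementation under assumed rules (an item starts with digits and a period, a continuation is any indented non-blank line), not the actual _reformat_ordered_lists() code.

```python
import re

# A numbered item: digits, a period, whitespace, then content.
ITEM_START = re.compile(r"^\d+\.\s+\S")


def merge_list_continuations(markdown: str) -> str:
    """Fold indented continuation lines into the numbered item above them."""
    merged: list[str] = []
    in_item = False
    for line in markdown.splitlines():
        if ITEM_START.match(line):
            merged.append(line.rstrip())
            in_item = True
        elif in_item and line.strip() and line[0] in " \t":
            # Indented, non-blank line directly under an item: append to it.
            merged[-1] += " " + line.strip()
        else:
            merged.append(line)
            in_item = False
    return "\n".join(merged)


sample = (
    "1. **owner/repo-a** · Python\n"
    "   First summary sentence.\n"
    "\n"
    "2. **owner/repo-b** · Go\n"
    "   Second summary sentence.\n"
)
print(merge_list_continuations(sample))
```

Each item comes out as a single line, which leaves the parser nothing to mis-attach.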

The Canonical URL Chain

hongphuc5497.com/notes/{slug}     ← SOURCE OF TRUTH
        │
        ├── Substack              ← canonical → personal site
        └── Dev.to                ← canonical → personal site

Substack and Dev.to both declare the personal site as their canonical source. Search engines index the personal site. The platforms get the content but defer SEO authority.
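As a rough sketch, building the Dev.to request body with the canonical link might look like the following. The https scheme, the helper names, and the example slug are assumptions. The field names follow the Dev.to articles API as commonly documented, so check them against the current API reference before relying on them. Substack is left out because it has no official public API.

```python
SITE_NOTES_BASE = "https://hongphuc5497.com/notes"  # scheme is an assumption


def canonical_url(slug: str) -> str:
    """The personal-site URL that every platform copy should point back to."""
    return f"{SITE_NOTES_BASE}/{slug}"


def devto_article_payload(title: str, body_markdown: str, slug: str) -> dict:
    """Request body for creating a Dev.to article with the canonical link set."""
    return {
        "article": {
            "title": title,
            "body_markdown": body_markdown,
            "published": True,
            "canonical_url": canonical_url(slug),
        }
    }


payload = devto_article_payload("Example digest", "Body text", "example-digest")
print(payload["article"]["canonical_url"])
```

Deriving the URL from the slug in one function keeps every platform pointing at the same address.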

The Supabase content store records all three URLs via a published_urls table — one post, multiple platform entries. The site reads from Supabase to populate its notes page dynamically.
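To illustrate the one-post, many-URLs shape, here is a sketch that uses Python's built-in sqlite3 as a stand-in for Supabase. The posts table, all column names, and the function bodies are assumptions; only published_urls and the two function names come from this post. The example URLs are placeholders.

```python
import sqlite3

SCHEMA = """
CREATE TABLE posts (
    id    INTEGER PRIMARY KEY,
    slug  TEXT NOT NULL UNIQUE,
    title TEXT NOT NULL
);
CREATE TABLE published_urls (
    post_id  INTEGER NOT NULL REFERENCES posts(id),
    platform TEXT NOT NULL,
    url      TEXT NOT NULL,
    PRIMARY KEY (post_id, platform)
);
"""


def upsert_by_slug(db: sqlite3.Connection, slug: str, title: str) -> int:
    """Insert the post, or update its title if the slug already exists."""
    db.execute(
        "INSERT INTO posts (slug, title) VALUES (?, ?) "
        "ON CONFLICT(slug) DO UPDATE SET title = excluded.title",
        (slug, title),
    )
    return db.execute("SELECT id FROM posts WHERE slug = ?", (slug,)).fetchone()[0]


def add_platform_url(db: sqlite3.Connection, post_id: int, platform: str, url: str) -> None:
    """Record one URL per (post, platform); a re-run overwrites, never duplicates."""
    db.execute(
        "INSERT INTO published_urls (post_id, platform, url) VALUES (?, ?, ?) "
        "ON CONFLICT(post_id, platform) DO UPDATE SET url = excluded.url",
        (post_id, platform, url),
    )


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
for _ in range(2):  # simulate the pipeline running twice for the same post
    post_id = upsert_by_slug(db, "example-digest", "Example digest")
    add_platform_url(db, post_id, "site", "https://example.com/site/example-digest")
    add_platform_url(db, post_id, "substack", "https://example.com/substack/example-digest")
    add_platform_url(db, post_id, "devto", "https://example.com/devto/example-digest")

post_count = db.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
url_count = db.execute("SELECT COUNT(*) FROM published_urls").fetchone()[0]
print(post_count, url_count)
```

The uniqueness constraints carry the idempotency: the slug identifies the post, and the (post, platform) pair identifies the URL.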

The State Machine

trending.json → digest.md → stage → archive → deliver
                    │
                    ├── run.json      (date_consistent, warnings)
                    └── publish.json  (status: draft → delivered → published)

The staging script is the gatekeeper. It checks date consistency, copies artifacts into the canonical state directory, records metadata in run.json, and — if --auto-archive is set — archives the digest to docs/ with an updated index.json.

If anything fails, the pipeline stops. No partial publishes. No stale state.
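The stop-on-first-failure behavior can be sketched as a small runner. The step names and the runner itself are hypothetical, and the real pipeline may be plain shell scripts chained by exit code.

```python
from typing import Callable

Step = tuple[str, Callable[[], None]]


def run_pipeline(steps: list[Step]) -> list[str]:
    """Run steps in order; stop at the first failure and report what finished."""
    completed: list[str] = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            print(f"{name} failed: {exc}; skipping all later steps")
            break
        completed.append(name)
    return completed


ran: list[str] = []


def stage() -> None:
    # Stand-in for the staging gate rejecting a mismatched title date.
    raise ValueError("title date does not match fetcher date")


completed = run_pipeline([
    ("digest", lambda: ran.append("digest")),
    ("stage", stage),
    ("archive", lambda: ran.append("archive")),
    ("deliver", lambda: ran.append("deliver")),
])
print(completed)
```

Because archive and deliver sit after the gate, a failed stage leaves nothing half-published.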

What We Learned

Python namespace packages are silent killers. A supabase/ directory with no __init__.py can stand in for your pip install whenever the real package isn't on sys.path, and import supabase still succeeds with zero error messages. Always check import supabase; print(supabase.__path__) when imports break mysteriously.
Prompt enforcement is not enough. We told the agent "use ONLY the fetcher's date" and it still inferred from the system clock. Hard exits are the only reliable enforcement.
Canonical URLs must be set at publish time, not after. If Substack publishes without a canonical URL, you can't retroactively add one without deleting and republishing.
The personal site must be first. If you publish to Substack first, its URL becomes the de facto canonical, and you lose SEO ownership to a platform you don't control.
Content stores are boring infrastructure that saves you hours. The Supabase upsert_by_slug + add_platform_url pattern means re-runs don't create duplicates, and the site always knows where every post lives.
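The namespace-package check from the first lesson can be turned into a small diagnostic. The sketch below uses a made-up package name so it runs anywhere; in practice you would pass "supabase".

```python
import importlib
import sys
import tempfile
from pathlib import Path


def is_namespace_package(name: str) -> bool:
    """True when `name` resolves to a bare directory rather than a real package."""
    module = importlib.import_module(name)
    # Namespace packages have __path__ but no __file__ (it is None or absent).
    return hasattr(module, "__path__") and getattr(module, "__file__", None) is None


# Demo: a directory without __init__.py, like a repo-level supabase/ folder.
# "shadow_demo_pkg" is a made-up name so the demo cannot clash with a real install.
demo_root = Path(tempfile.mkdtemp())
(demo_root / "shadow_demo_pkg" / "migrations").mkdir(parents=True)
sys.path.insert(0, str(demo_root))
importlib.invalidate_caches()

print(is_namespace_package("shadow_demo_pkg"))  # bare directory
print(is_namespace_package("json"))             # regular stdlib package
```

A True result for a library you installed with pip means the import is resolving to a stray directory, not the library.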

The Result

Three times a week, an AI agent scrapes GitHub Trending, writes a 10-repo digest, and publishes it to four platforms in under 2 minutes. The personal site owns the canonical URL. The content store tracks everything. And if the dates don't match, the pipeline refuses to proceed.

No manual copy-paste. No stale Supabase records. No wrong canonical URLs.

Just an agent, a fetcher, and a very opinionated staging script.

The publishing pipeline is open source in github-digest. The content store lives in markdown-content-store. Both repos use Hermes Agent for orchestration.