If your ingestion shows all pages with identical "Coming Soon" content, your site likely has access control enabled.
During ingestion only:
- Comment out or temporarily disable your access control middleware
- Run the ingestion
- Re-enable access control after completion
How to disable (depends on your implementation):
- If you have a middleware file, comment out the access check
- If you have a layout wrapper with access gates, add a bypass for localhost
- Set NEXT_PUBLIC_ENABLE_ACCESS_CONTROL=false temporarily
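If your gate does not already read this variable, one way to wire it in is sketched below. The variable name comes from the step above; the helper name `isAccessControlEnabled` and the rule that only an explicit "false" disables the gate are assumptions made for illustration, not part of the project.

```typescript
// Hypothetical helper: decides whether the access gate should run.
// Assumption: only an explicit "false" (any casing, surrounding spaces ignored)
// turns the gate off. A missing or unexpected value keeps it on.
function isAccessControlEnabled(env: Record<string, string | undefined>): boolean {
  const raw = env["NEXT_PUBLIC_ENABLE_ACCESS_CONTROL"];
  // Optional chaining short-circuits the whole chain when the variable is unset.
  return raw?.trim().toLowerCase() !== "false";
}
```

In a middleware or layout wrapper you would call `isAccessControlEnabled(process.env)` before running the access check. Defaulting to "on" means a forgotten or mistyped variable never opens the site by accident.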
Instead of crawling via HTTP, read directly from the file system:
// Read pages directly from src/app
// Parse MDX/TSX files
// Extract metadata and content
// No HTTP crawling needed

Add to .env.local:

CRAWLER_BYPASS_TOKEN=your-secret-token

Then update your access control middleware to check for:

if (request.headers.get('X-Crawler-Bypass') === process.env.CRAWLER_BYPASS_TOKEN) {
  return NextResponse.next(); // Allow crawler through
}

For production sites with access control:
- Create a separate ingestion script that:
  - Reads MDX/markdown files directly from your content directory
  - Parses front matter and content
  - Chunks and embeds the content
  - Needs no HTTP crawling
- Run this script as part of your build process or manually when content changes
- This avoids HTTP round-trips and access control issues entirely
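A minimal sketch of such a script follows, under stated assumptions. All function names are hypothetical, the front matter parser only handles flat `key: value` pairs (a real project would likely use a proper YAML parser), chunks are grouped by paragraph with an arbitrary 1000-character limit, and the embedding step is left out because the embedding provider is not specified here.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

interface ParsedDoc {
  meta: Record<string, string>;
  body: string;
}

// Minimal front matter parser: flat `key: value` pairs between `---` lines only.
function parseFrontMatter(raw: string): ParsedDoc {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?/.exec(raw);
  if (!match) return { meta: {}, body: raw.trim() };
  const meta: Record<string, string> = {};
  for (const line of match[1].split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return { meta, body: raw.slice(match[0].length).trim() };
}

// Groups paragraphs into chunks of at most `maxChars` characters.
// A single paragraph longer than the limit is kept whole rather than split.
function chunkText(body: string, maxChars = 1000): string[] {
  const chunks: string[] = [];
  let current = "";
  const paragraphs = body.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  for (const para of paragraphs) {
    if (current && current.length + para.length + 2 > maxChars) {
      chunks.push(current);
      current = "";
    }
    current = current ? `${current}\n\n${para}` : para;
  }
  if (current) chunks.push(current);
  return chunks;
}

// Recursively lists .md and .mdx files under `dir`.
function listContentFiles(dir: string): string[] {
  const out: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) out.push(...listContentFiles(full));
    else if (/\.(md|mdx)$/.test(entry.name)) out.push(full);
  }
  return out;
}

// Reads every content file and returns chunks ready for an embedding step.
function collectChunks(contentDir: string): { source: string; title: string; text: string }[] {
  return listContentFiles(contentDir).flatMap((file) => {
    const { meta, body } = parseFrontMatter(fs.readFileSync(file, "utf8"));
    const source = path.relative(contentDir, file);
    return chunkText(body).map((text) => ({ source, title: meta["title"] ?? source, text }));
  });
}
```

Calling `collectChunks` with your content directory yields one record per chunk, which you can then pass to whatever embedding and storage step your pipeline uses. Because nothing goes through HTTP, the access control middleware never runs.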
After ingestion, check /admin/ingest/inspect:
✅ Good signs:
- Diverse content previews (not all the same)
- Different sources (/about, /store, etc.)
- Content length varies per page
❌ Bad signs:
- All chunks have identical content
- Content all shows "Coming Soon" or navigation text
- Very short content chunks (< 100 chars)
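The bad signs above can also be checked programmatically. The sketch below assumes a hypothetical `Chunk` shape with `source` and `content` fields; the actual data behind the inspect page may look different, so treat the names as placeholders.

```typescript
// Hypothetical chunk shape; adjust to whatever your ingestion actually stores.
interface Chunk {
  source: string;
  content: string;
}

// Returns a human-readable list of the bad signs found in `chunks`.
// An empty array means none of the three checks fired.
function findBadSigns(chunks: Chunk[], minLength = 100): string[] {
  const problems: string[] = [];

  // Bad sign 1: every chunk carries the same text.
  const distinct = new Set(chunks.map((c) => c.content.trim()));
  if (chunks.length > 1 && distinct.size === 1) {
    problems.push("all chunks have identical content");
  }

  // Bad sign 2: placeholder text made it into the index.
  const placeholders = chunks.filter((c) => /coming soon/i.test(c.content)).length;
  if (placeholders > 0) {
    problems.push(`${placeholders} chunk(s) contain "Coming Soon"`);
  }

  // Bad sign 3: chunks too short to hold real page content.
  const short = chunks.filter((c) => c.content.trim().length < minLength).length;
  if (short > 0) {
    problems.push(`${short} chunk(s) are shorter than ${minLength} characters`);
  }

  return problems;
}
```

The "Coming Soon" check is a simple substring match, so a page that legitimately uses that phrase would be flagged too; it is a hint to look closer, not proof of a gated crawl.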
To verify what content the crawler sees:
curl http://localhost:3000/about

If this shows "Coming Soon" instead of your actual "About" page content, that's what the crawler is seeing too.
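The same check can be scripted so it covers several pages, optionally sending the bypass header described earlier. This is a sketch only: the function names are hypothetical, it relies on the global `fetch` available in Node 18 and later, and it treats the literal text "Coming Soon" as the placeholder marker.

```typescript
// Builds request headers, adding the bypass header only when a token is given.
function buildCrawlerHeaders(bypassToken?: string): Record<string, string> {
  const headers: Record<string, string> = {};
  if (bypassToken) headers["X-Crawler-Bypass"] = bypassToken;
  return headers;
}

// True when the HTML contains the placeholder marker (case-insensitive).
function looksLikePlaceholder(html: string, marker = "Coming Soon"): boolean {
  return html.toLowerCase().includes(marker.toLowerCase());
}

// Fetches a page the way a crawler would and reports what came back.
async function checkPage(
  url: string,
  bypassToken?: string,
): Promise<{ url: string; status: number; placeholder: boolean }> {
  const res = await fetch(url, { headers: buildCrawlerHeaders(bypassToken) });
  const html = await res.text();
  return { url, status: res.status, placeholder: looksLikePlaceholder(html) };
}
```

For example, `await checkPage("http://localhost:3000/about", process.env.CRAWLER_BYPASS_TOKEN)` should report `placeholder: false` once the middleware lets the crawler through.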