This guide covers multiple ways to automatically keep the chatbot's knowledge base up-to-date.
The chatbot ingestion process:
- Generates sitemap.xml automatically from Next.js routes
- Crawls pages using sitemap for reliable discovery
- Extracts content and chunks it for embedding
- Creates embeddings using OpenAI
- Stores the embeddings in a Redis vector database
- Atomically swaps to new index (zero-downtime)
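
To make the chunking step concrete, here is a minimal sketch of splitting page text into overlapping chunks before embedding. The function name `chunkText`, the character-based sizing, and the default sizes are assumptions for illustration; the project's actual chunker may split on tokens or headings instead.

```typescript
// Hypothetical helper: split extracted page text into overlapping chunks.
// chunkSize and overlap are illustrative defaults, not the project's settings.
function chunkText(text: string, chunkSize = 800, overlap = 100): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap; // how far the window advances each time
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return chunks;
}
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, which tends to help retrieval at the cost of a few extra embeddings.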
Automatically triggers ingestion after every successful production deployment.
- Create Deploy Webhook Endpoint ✅ Already created at: /api/webhooks/vercel-deploy
- Add Environment Variables in Vercel:

      VERCEL_DEPLOY_WEBHOOK_SECRET=your-random-secret-here
      INGESTION_API_TOKEN=your-ingestion-token

- Configure Vercel Webhook:
  - Go to: Vercel Project → Settings → Git → Deploy Hooks
  - Click "Create Hook"
  - Name: Post-Deploy Ingestion
  - Webhook URL: https://react.foundation/api/webhooks/vercel-deploy
  - Secret: Same as VERCEL_DEPLOY_WEBHOOK_SECRET
  - Events: Select "Deployment Succeeded"
  - Click "Create Hook"
- How it works:

      GitHub Push
        ↓
      Vercel Deploy
        ↓
      Deployment Succeeds
        ↓
      Webhook Fires → /api/webhooks/vercel-deploy
        ↓
      Triggers Ingestion → /api/admin/ingest
        ↓
      Chatbot Updated ✅

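
The first thing the webhook endpoint has to do in this flow is check the shared secret before it triggers ingestion. The sketch below shows one way that check could look. How the secret reaches the endpoint (header, query parameter, or signature) is not specified in this guide, so `authorizeWebhook` and its parameters are hypothetical. In a Node runtime, `crypto.timingSafeEqual` is the usual choice for the comparison; the hand-rolled version here only keeps the example dependency-free.

```typescript
// Compare two secrets without returning early on the first mismatch.
function safeEqual(a: string, b: string): boolean {
  const len = Math.max(a.length, b.length);
  let diff = a.length ^ b.length; // different lengths can never match
  for (let i = 0; i < len; i++) {
    // charCodeAt returns NaN past the end of a string; `| 0` maps that to 0.
    diff |= (a.charCodeAt(i) | 0) ^ (b.charCodeAt(i) | 0);
  }
  return diff === 0;
}

// Hypothetical decision function: returns the HTTP status the webhook route
// would respond with before (or instead of) calling /api/admin/ingest.
function authorizeWebhook(
  providedSecret: string | null,
  expectedSecret: string | undefined
): 200 | 401 | 500 {
  if (!expectedSecret) return 500; // VERCEL_DEPLOY_WEBHOOK_SECRET is not set
  if (providedSecret === null) return 401; // caller sent no secret
  return safeEqual(providedSecret, expectedSecret) ? 200 : 401;
}
```

Returning 500 when the secret is unset, rather than accepting every request, avoids an endpoint that silently runs unauthenticated after a misconfigured deploy.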
Pros:
- ✅ Automatic - no manual work
- ✅ Always synced with latest deployment
- ✅ Zero-downtime updates
- ✅ Works for all content changes
Cons:
- ⚠️ Only triggers on deployment (not standalone content updates)
Runs ingestion on a schedule (e.g., daily at 2 AM).
- GitHub Workflow ✅ Already created at: .github/workflows/trigger-ingestion.yml
- Add GitHub Secret:
  - Go to: GitHub Repo → Settings → Secrets and variables → Actions
  - Add secret: INGESTION_API_TOKEN
- Schedule (edit workflow file):

      schedule:
        - cron: '0 2 * * *' # Daily at 2 AM UTC

- Manual trigger also available:
  - Go to: Actions → Trigger Chatbot Ingestion → Run workflow
  - Choose environment, max pages, use sitemap
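
The cron expression `0 2 * * *` reads as minute 0, hour 2, every day of the month, every month, every day of the week, and GitHub Actions evaluates it in UTC. As a quick way to reason about when the next run falls, the sketch below computes the next daily run time for a given hour and minute. The helper `nextDailyRunUtc` is hypothetical and only handles the "once a day" case, not general cron syntax.

```typescript
// Hypothetical helper: next occurrence of a daily HH:MM (UTC) schedule.
// Only models expressions of the form "<minute> <hour> * * *".
function nextDailyRunUtc(now: Date, hour = 2, minute = 0): Date {
  const next = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate(), hour, minute, 0, 0)
  );
  // If today's slot is already at or behind "now", roll over to tomorrow.
  if (next.getTime() <= now.getTime()) {
    next.setUTCDate(next.getUTCDate() + 1);
  }
  return next;
}
```

Note that GitHub documents scheduled workflows as best-effort, so a run can start later than the computed time during periods of high load.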
Pros:
- ✅ Reliable scheduled updates
- ✅ Manual trigger available
- ✅ Can customize per environment
- ✅ Logs viewable in GitHub Actions
Cons:
- ⚠️ Not immediate after content changes
- ⚠️ Uses GitHub Actions minutes
Scheduled ingestion using Vercel's built-in cron.
- Create cron config in vercel.json:

      {
        "crons": [
          {
            "path": "/api/cron/ingest",
            "schedule": "0 2 * * *"
          }
        ]
      }

- Create cron endpoint at src/app/api/cron/ingest/route.ts:

      import { NextResponse } from 'next/server';

      export async function GET(request: Request) {
        // Verify Vercel cron secret
        const authHeader = request.headers.get('authorization');
        if (authHeader !== `Bearer ${process.env.CRON_SECRET}`) {
          return NextResponse.json({ error: 'Unauthorized' }, { status: 401 });
        }

        // Trigger ingestion
        const response = await fetch(
          `${process.env.NEXT_PUBLIC_SITE_URL}/api/admin/ingest`,
          {
            method: 'POST',
            headers: {
              'Authorization': `Bearer ${process.env.INGESTION_API_TOKEN}`,
              'Content-Type': 'application/json',
            },
            body: JSON.stringify({
              maxPages: 100,
              useSitemap: true,
            }),
          }
        );

        // Surface ingestion failures instead of always reporting success
        if (!response.ok) {
          return NextResponse.json(
            { success: false, status: response.status },
            { status: 502 }
          );
        }

        return NextResponse.json({ success: true });
      }

- Add environment variable:

      CRON_SECRET=your-random-secret

Pros:
- ✅ Native Vercel integration
- ✅ No external dependencies
- ✅ Free on Vercel Pro
Cons:
- ⚠️ Requires Vercel Pro plan
- ⚠️ Less flexible than GitHub Actions
Run ingestion manually when needed.
1. Visit: https://react.foundation/admin/data
2. Click "Run Ingestion"
3. Wait for completion
curl -X POST https://react.foundation/api/admin/ingest \
-H "Authorization: Bearer $INGESTION_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"maxPages": 100,
"useSitemap": true,
"clearExisting": false
}'
# Create scripts/trigger-ingestion.sh
./scripts/trigger-ingestion.sh production
Pros:
- ✅ Full control
- ✅ No setup required
- ✅ Can customize per run
Cons:
- ❌ Manual work required
- ❌ Easy to forget
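
The curl request from the manual option can also be expressed as a small typed helper, which is convenient if ingestion is triggered from a script or test. This is only a sketch: `buildIngestRequest` and the `IngestOptions` type are hypothetical names, and the defaults simply mirror the values in the curl example.

```typescript
// Body fields taken from the curl example in this guide.
interface IngestOptions {
  maxPages: number;
  useSitemap: boolean;
  clearExisting: boolean;
}

interface IngestRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Hypothetical helper: build the URL and fetch options for /api/admin/ingest.
function buildIngestRequest(
  baseUrl: string,
  token: string,
  options: Partial<IngestOptions> = {}
): IngestRequest {
  const body: IngestOptions = {
    maxPages: 100,
    useSitemap: true,
    clearExisting: false,
    ...options, // per-run overrides win over the defaults
  };
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/api/admin/ingest`, // drop trailing slashes
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    },
  };
}
```

The result can be passed straight to `fetch(request.url, request.init)`; keeping the request construction separate from the network call makes it easy to check the payload without hitting the API.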
The ingestion system automatically generates and uses sitemap.xml for reliable page discovery.
Location: https://react.foundation/sitemap.xml
Generator: src/app/sitemap.ts (Next.js App Router)
Includes:
- All static routes (home, about, communities, etc.)
- Dynamic collection pages from Shopify
- Priority and change frequency metadata
- Reliable - No risk of missing pages
- Fast - Direct URL list, no link following
- Prioritized - Can skip low-priority pages
- Metadata - Uses lastModified, changefreq, priority
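
As a rough illustration of how sitemap metadata could drive page selection, the sketch below filters entries by priority and path prefix and then applies a page limit. The `SitemapEntry` shape and the `selectPages` function are hypothetical; the option names match the ingestion options used in this guide, and the fallback priority of 0.5 is the default defined by the sitemap protocol.

```typescript
// Hypothetical shape of a parsed sitemap entry (path only, for brevity).
interface SitemapEntry {
  path: string;
  priority?: number;
}

interface PageSelectionOptions {
  minPriority?: number;
  maxPages?: number;
  allowedPaths?: string[];
  excludePaths?: string[];
}

// True when `path` is the prefix itself or sits underneath it,
// so "/docs" matches "/docs/intro" but not "/docsify".
function underPrefix(path: string, prefix: string): boolean {
  const base = prefix.endsWith("/") ? prefix : prefix + "/";
  return path === prefix || path.startsWith(base);
}

function selectPages(entries: SitemapEntry[], options: PageSelectionOptions = {}): string[] {
  const { minPriority = 0, maxPages = Infinity, allowedPaths = [], excludePaths = [] } = options;
  return entries
    .filter((e) => (e.priority ?? 0.5) >= minPriority)
    .filter((e) => allowedPaths.length === 0 || allowedPaths.some((p) => underPrefix(e.path, p)))
    .filter((e) => !excludePaths.some((p) => underPrefix(e.path, p)))
    .slice(0, maxPages) // applied last so the limit counts only eligible pages
    .map((e) => e.path);
}
```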
Ingestion options:
{
useSitemap: true, // Use sitemap.xml (default: true)
minPriority: 0.5, // Only pages with priority >= 0.5
maxPages: 100, // Limit pages crawled
allowedPaths: ['/docs'], // Only these paths
excludePaths: ['/admin'] // Skip these paths
}
All chatbot responses include source links showing where information came from.
- Content chunks store source URL:

      {
        id: "chunk-123",
        source: "/communities/start", // ← URL of source page
        content: "To start a community...",
        embedding: [0.1, 0.2, ...]
      }

- Search results include source:

      const results = await searchSimilar(redis, embedding, { k: 6 });
      // Results have: id, source, score, content

- Chatbot response includes citations:

      {
        "message": "To start a React community, you'll need...",
        "citations": [
          {
            "id": "chunk-123",
            "source": "/communities/start",
            "score": 0.92
          }
        ]
      }

- Frontend displays source links below message

- ✅ Shows: Public pages (/communities, /docs, etc.)
- ❌ Hides: Admin paths (/admin/*, /api/*)
- ✅ Shows: Public context files (public-context/*.md)
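
The step from search results to displayed citations can be sketched as a filter plus a projection. The field names (`id`, `source`, `score`, `content`) come from the examples in this guide; the `toCitations` and `isPublicSource` functions and the exact prefix matching are assumptions about how the show/hide rules might be applied.

```typescript
interface SearchResult {
  id: string;
  source: string;
  score: number;
  content: string;
}

interface Citation {
  id: string;
  source: string;
  score: number;
}

// Path prefixes that should never be shown to end users.
const HIDDEN_PREFIXES = ["/admin", "/api"];

// Hypothetical rule: a source is public unless it is a hidden prefix
// or sits underneath one ("/admin/data", "/api/admin/ingest", ...).
function isPublicSource(source: string): boolean {
  return !HIDDEN_PREFIXES.some((p) => source === p || source.startsWith(p + "/"));
}

// Drop hidden sources and strip the chunk text, keeping only what the
// frontend needs to render a source link.
function toCitations(results: SearchResult[]): Citation[] {
  return results
    .filter((r) => isPublicSource(r.source))
    .map(({ id, source, score }) => ({ id, source, score }));
}
```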
Admin UI:
https://react.foundation/admin/data
API:
curl https://react.foundation/api/admin/ingest/status
Diagnostic Script:
npx tsx scripts/diagnose-chatbot-content.ts
Output:
✅ Vector index found: idx:chatbot:chunks
✅ Found 156 content chunks
✅ Community guide content found (5 results)
Server logs (Vercel):
https://vercel.com/your-project/logs
Search for:
- Ingestion started
- Crawled X pages
- Stored Y chunks
- Swapped to new index
- ✅ New docs added → Run ingestion
- ✅ Page content updated → Run ingestion
- ✅ Sitemap routes changed → Run ingestion
- ✅ More reliable than link-following
- ✅ Faster and predictable
- ✅ Skip low-priority pages with minPriority
- ✅ Set up Vercel Deploy Hooks (recommended)
- ✅ Or use GitHub Actions scheduled
- ✅ Manual as backup
npm run dev
# Visit: http://localhost:3000/admin/data
# Click "Run Ingestion"
# Verify results with diagnostic script
- Check citation sources in chatbot responses
- Use diagnostic script to verify content
- Test common queries to ensure good results
Symptoms: 500 error, no chunks created
Solutions:
- Check Redis connection
- Verify OPENAI_API_KEY is set
- Check ingestion logs in Vercel
- Ensure sitemap.xml is accessible
Symptoms: "I don't have information on..."
Solutions:
- Run diagnostic: npx tsx scripts/diagnose-chatbot-content.ts
- Check if vector store is empty
- Verify content exists in public-context/ or sitemap
- Check source attribution is correct
Symptoms: Chatbot returns old information
Solutions:
- Run manual ingestion
- Verify automation is working (check logs)
- Check that sitemap includes new pages
- Verify new index was swapped successfully
Symptoms: No ingestion after deployment
Solutions:
- Check Vercel webhook is configured correctly
- Verify VERCEL_DEPLOY_WEBHOOK_SECRET matches
- Check webhook endpoint logs
- Test webhook manually with curl
Recommended Setup:
- ✅ Primary: Vercel Deploy Hooks (automatic after every deploy)
- ✅ Backup: GitHub Actions scheduled (daily at 2 AM)
- ✅ Emergency: Manual via Admin UI
With this setup, the chatbot picks up new content after every deploy and at least once a day, with no routine manual work.