The Technical SEO Checklist for Scaling Sites Past 100k Pages
Past a certain scale, SEO isn’t content — it’s infrastructure. When your site crosses ~100,000 URLs, rankings hinge on how efficiently you allocate crawl budget, deduplicate templates, render JavaScript, and keep your index clean.
1) Know your scale problems before they snowball
What breaks first at 100k+ URLs
- Crawl waste on facets, filters, sessions, and infinite scroll variants.
- Duplicate/near-duplicate templates (color/size variants, UTM’d URLs).
- Indexing gaps from weak canonicals, stale sitemaps, soft-404s, and slow servers.
- JS rendering lags that hide content and links on initial fetch.
- Pagination myths (Google hasn’t used rel=prev/next for years).
Current baseline to hit
- Keep server errors low and response times fast; Google will crawl more when servers are healthy. Track this in Search Console’s Crawl Stats.
- Design for Core Web Vitals with a focus on INP (Interaction to Next Paint), which replaced FID as a Core Web Vital in March 2024.
2) The must-have controls (with shippable examples)
A) robots.txt that gates, not guesses
Use Allow/Disallow rules and the Sitemap: directive. Don’t rely on directives Google doesn’t support, such as crawl-delay (Googlebot ignores it).
# /robots.txt
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /search
Disallow: /*?sessionid=
Disallow: /*?sort=
Allow: /static/
Sitemap: https://www.example.com/sitemaps/sitemap-index.xml
- Use wildcards (* and $) carefully. In conflicts, Google applies the most specific rule (longest matching path); when an Allow and a Disallow tie, the least restrictive rule (Allow) wins. Test before deployment.
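For intuition, here’s a minimal Python sketch of that precedence logic. It handles only * and $ and is illustrative, not a full RFC 9309 implementation (Google has open-sourced its production parser if you need the real thing):

import re

def rule_matches(rule_path: str, url_path: str) -> bool:
    """Translate a robots.txt path (supporting * and $) into a regex and test it."""
    pattern = re.escape(rule_path).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, url_path) is not None

def is_allowed(url_path: str, rules: list[tuple[str, str]]) -> bool:
    """rules: ("allow" | "disallow", path) pairs. The longest matching rule wins;
    when Allow and Disallow tie on length, Allow (least restrictive) wins."""
    best_len, allowed = -1, True  # no matching rule means the URL is crawlable
    for kind, path in rules:
        if rule_matches(path, url_path):
            if len(path) > best_len or (len(path) == best_len and kind == "allow"):
                best_len, allowed = len(path), kind == "allow"
    return allowed

rules = [("disallow", "/*?sort="), ("allow", "/static/")]
print(is_allowed("/shoes?sort=price", rules))  # False: the Disallow pattern matches
print(is_allowed("/static/app.js", rules))     # True: the Allow rule matches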
B) XML sitemaps that scale
Split large sites into logical sitemap files (by type, directory, or date) and link them from a sitemap index. Respect the limits (at most 50,000 URLs and 50 MB uncompressed per file). Keep lastmod accurate; don’t “thrash” dates. A chunking sketch follows the example below.
<!-- /sitemaps/sitemap-index.xml -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://www.example.com/sitemaps/products-1.xml</loc></sitemap>
<sitemap><loc>https://www.example.com/sitemaps/products-2.xml</loc></sitemap>
<sitemap><loc>https://www.example.com/sitemaps/categories.xml</loc></sitemap>
<sitemap><loc>https://www.example.com/sitemaps/blog.xml</loc></sitemap>
</sitemapindex>
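The splitting itself is easy to automate. A minimal Python sketch of the chunking logic, assuming an in-memory list of URLs and hypothetical file names; a production version would stream output, gzip the files, and emit per-URL lastmod:

from xml.sax.saxutils import escape

MAX_URLS = 50_000  # per-file cap; each file must also stay under 50 MB uncompressed

def write_sitemaps(urls: list[str], prefix: str = "products") -> str:
    """Write 50k-URL sitemap chunks and return the matching sitemap index XML."""
    names = []
    for i in range(0, len(urls), MAX_URLS):
        name = f"{prefix}-{i // MAX_URLS + 1}.xml"
        entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>"
                            for u in urls[i:i + MAX_URLS])
        with open(name, "w") as f:
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
                    f"{entries}\n</urlset>\n")
        names.append(name)
    index = "\n".join(f"  <sitemap><loc>https://www.example.com/sitemaps/{n}</loc></sitemap>"
                      for n in names)
    return ('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{index}\n</sitemapindex>\n")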
C) Canonicalization and internal links
- Canonicalize variants (parameters, case, trailing slashes) to a single, clean URL and reinforce with internal links to the canonical form.
- Avoid blocking canonicalized URLs in robots.txt: Google must be able to crawl a page to see its <link rel="canonical">.
- For facets, block clearly low-value patterns in robots.txt, canonicalize near-duplicates, and expose browseable high-value combinations.
<link rel="canonical" href="https://www.example.com/shoes/mens/running">
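Canonical URL generation is worth centralizing in one helper so templates, sitemaps, and internal links all agree. A minimal Python sketch; which parameters are safe to strip is site-specific, and the set below is illustrative:

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

STRIP = {"utm_source", "utm_medium", "utm_campaign", "gclid", "sessionid", "sort", "view"}

def canonical_url(url: str) -> str:
    """Lowercase scheme/host, drop tracking and UX params, trim the trailing slash."""
    parts = urlsplit(url)
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query) if k not in STRIP])
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))

print(canonical_url("https://WWW.Example.com/shoes/mens/running/?utm_source=x&sort=price"))
# -> https://www.example.com/shoes/mens/running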
D) Structured data: ship templates, not one-offs
- Ecomm: Product + Merchant listing (where applicable).
- SaaS: SoftwareApplication + Organization + Breadcrumb.
Implement as JSON-LD and validate; a lightweight pre-deploy check follows the example below.
<script type="application/ld+json">
{
"@context":"https://schema.org",
"@type":"Product",
"name":"Trail Runner 2.0",
"sku":"TR-200",
"brand":{"@type":"Brand","name":"ExampleCo"},
"offers":{"@type":"Offer","price":"129.00","priceCurrency":"USD","availability":"InStock","url":"https://www.example.com/p/tr-200"}
}
</script>
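Because structured data ships as templates, a cheap syntax-and-fields check in CI catches most regressions before they hit production; Google’s Rich Results Test remains the source of truth. A minimal Python sketch, with an illustrative required-field set:

import json

REQUIRED = {"@context", "@type", "name", "offers"}  # illustrative; tune per schema type

def check_product_jsonld(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the template passed this basic check."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    return [f"missing field: {f}" for f in sorted(REQUIRED - data.keys())]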
E) JavaScript and rendering
Prefer SSR/SSG or hydrated HTML for critical content/links. If you must use dynamic rendering, treat it as a workaround and monitor parity.
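Parity is checkable: diff the links (and key content) in the initial server HTML against a rendered snapshot. A minimal Python sketch, assuming you already capture rendered HTML via a headless browser and store both under hypothetical snapshots/ paths:

from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links: set[str] = set()
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(href)

def links_in(html: str) -> set[str]:
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

raw = links_in(open("snapshots/pdp-raw.html").read())            # initial server HTML
rendered = links_in(open("snapshots/pdp-rendered.html").read())  # post-JS snapshot
missing = rendered - raw  # links Googlebot only discovers after rendering
print(f"{len(missing)} links depend on JS rendering")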
F) Pagination without rel=prev/next
Use strong internal links to page 1 and key pages; keep page titles/snippets unique; consider “view-all” where UX allows.
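One common pattern, sketched below for a hypothetical page 2: keep paginated URLs self-canonical (don’t canonicalize page 2 to page 1), give each page a unique title, and link pages with plain crawlable anchors:

<!-- Hypothetical page 2 of a category listing -->
<link rel="canonical" href="https://www.example.com/shoes/mens/running?page=2">
<title>Men's Running Shoes, Page 2 | ExampleCo</title>
<nav>
  <a href="/shoes/mens/running">1</a>
  <a href="/shoes/mens/running?page=3">3</a>
</nav>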
G) Meta robots vs robots.txt, and the nofollow reality
- Meta robots noindex only works if Google can crawl the page; a robots.txt Disallow blocks the fetch entirely, so Google never sees the noindex.
- nofollow is a hint, not a directive (since 2019). Don’t use it as your primary crawl-management tool.
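For completeness, the two noindex mechanisms (illustrative values):

<!-- On-page noindex, for HTML documents; the page must be crawlable for this to work -->
<meta name="robots" content="noindex">

# HTTP response header equivalent, useful for PDFs and other non-HTML assets
X-Robots-Tag: noindex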
3) Monitoring that actually scales
Weekly
- Search Console → Crawl Stats: total requests, avg response time, bytes downloaded; watch spikes in 5xx/timeout.
- Index Coverage: sudden drops in “Indexed” or rises in “Crawled – currently not indexed”.
- Diff the sitemap counts vs. indexed counts by section.
Monthly
- Log-file analysis:
- % of Googlebot hits to blocked or parameterized URLs.
- Crawl distribution by template (product, PLP, blog).
- Top 404/410/301 chains.
- Render checks at scale (crawler + HTML snapshot comparisons).
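Log sampling doesn’t need heavy tooling to start. A minimal Python sketch over combined-format access logs; it identifies Googlebot by user-agent string only (production checks should verify via reverse DNS), and the waste patterns are illustrative:

import re
from collections import Counter

LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*".*"(?P<ua>[^"]*)"\s*$')
WASTE = re.compile(r"[?&](sessionid|sort|view|q)=")  # illustrative waste patterns

def crawl_waste_report(log_path: str) -> None:
    """Print the share of Googlebot hits landing on parameter garbage, plus top templates."""
    hits, waste, by_template = 0, 0, Counter()
    for line in open(log_path):
        m = LINE.search(line)
        if not m or "Googlebot" not in m["ua"]:
            continue
        hits += 1
        if WASTE.search(m["path"]):
            waste += 1
        by_template[m["path"].split("/")[1].split("?")[0] or "home"] += 1
    if hits:
        print(f"Googlebot hits: {hits}, hitting parameter waste: {waste / hits:.1%}")
        print(by_template.most_common(10))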
When to use the Indexing API
- Only for JobPosting and BroadcastEvent (livestream) pages. It won’t accelerate general pages; use sitemaps + internal links instead.
4) Faceted navigation: a safe default pattern
- Map parameter types:
  - UX params (sort, view, session): block in robots.txt.
  - Duplicate creators (color/size filters): canonicalize to the base URL and avoid linking to them at scale.
  - Valuable combinations (category + brand): keep crawlable, link to them, and include them in sitemaps.
- Ship a robots.txt rule set that blocks obviously infinite patterns (e.g., ?page=*&sort=*) but allows representative “browseable” sets.
- Reinforce with canonicals and internal links to the clean target URLs. A parameter-classification sketch follows this list.
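A minimal Python sketch of that parameter map, with illustrative parameter sets; the right buckets depend on your catalog:

from urllib.parse import urlsplit, parse_qsl

BLOCK = {"sort", "view", "sessionid", "q"}   # UX/session params: robots.txt Disallow
CANONICALIZE = {"color", "size"}             # duplicate creators: canonical to the base URL
KEEP = {"brand"}                             # valuable facets: crawl, link, include in sitemaps

def classify(url: str) -> str:
    """Bucket a URL by its query parameters for crawl-control treatment."""
    params = {k for k, _ in parse_qsl(urlsplit(url).query)}
    if params & BLOCK:
        return "block"
    if params & CANONICALIZE:
        return "canonicalize"
    if params <= KEEP:
        return "keep"
    return "review"  # unknown params: audit before exposing them to crawlers

print(classify("/shoes?brand=acme"))         # keep
print(classify("/shoes?color=red&size=9"))   # canonicalize
print(classify("/shoes?sort=price"))         # block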
5) Production-ready snippets
robots.txt (facet-safe example)
User-agent: *
Disallow: /search
Disallow: /*?q=
Disallow: /*?view=
Disallow: /*?sessionid=
Disallow: /*?sort=
Disallow: /*&page=*  # if your parameter order varies, adjust the pattern accordingly
Allow: /static/
Sitemap: https://www.example.com/sitemaps/sitemap-index.xml
Sitemap index (by directory)
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://www.example.com/sitemaps/plp.xml</loc></sitemap>
<sitemap><loc>https://www.example.com/sitemaps/pdp-1.xml</loc></sitemap>
<sitemap><loc>https://www.example.com/sitemaps/pdp-2.xml</loc></sitemap>
<sitemap><loc>https://www.example.com/sitemaps/blog.xml</loc></sitemap>
</sitemapindex>
Canonical pattern
<link rel="canonical" href="https://www.example.com/{canonicalPath}">
SaaS structured data (SoftwareApplication)
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "SoftwareApplication",
"name": "Acme Analytics",
"applicationCategory": "BusinessApplication",
"operatingSystem": "Web",
"offers": { "@type": "Offer", "price": "99.00", "priceCurrency": "USD" }
}
</script>
Ecomm structured data (Product)
<script type="application/ld+json">
{
"@context":"https://schema.org",
"@type":"Product",
"name":"Trail Runner 2.0",
"sku":"TR-200",
"brand":{"@type":"Brand","name":"ExampleCo"},
"offers":{"@type":"Offer","price":"129.00","priceCurrency":"USD","availability":"InStock"}
}
</script>
6) Hypothetical scale-up case study (SaaS + Ecomm hybrid)
Context: 250k URLs across product listings, product detail pages, help docs, and blog. Growth stalled; new pages discovered slowly; coverage was noisy.
Fix plan
- Crawl budget: blocked session/sort/search parameters; allowed representative brand + category; moved everything else to canonical targets.
- Sitemaps: split by directory and freshness (daily PDP delta files, weekly PLP, monthly blog); added accurate lastmod.
- Rendering: SSR for PLP/PDP shells to expose links and core content in the initial HTML; monitored INP and response times.
- Index hygiene: removed reliance on rel=prev/next; strengthened internal links to page-1s and hubs.
- Monitoring: weekly Crawl Stats review; monthly log sampling to verify crawl distribution by template and cut 404s.
Outcomes (12 weeks)
- +28% faster discovery of new PDPs (time-to-first-crawl).
- −37% Googlebot hits to Disallow/parameter garbage.
- +16% organic sessions to PLPs; +9% PDP clicks attributable to better crawl allocation.
- INP stabilized < 200 ms on high-traffic templates.
7) Your operational checklist
Ship now (2–4 weeks)
- [ ] Validate robots.txt (block search/session/sort; keep sitemap directive).
- [ ] Publish sitemap index split by section; verify in Search Console.
- [ ] Standardize canonicals and internal links to one URL per asset.
- [ ] Add JSON-LD templates for Product/SoftwareApplication where relevant.
- [ ] Replace FID thinking with INP targets; ship perf budgets.
- [ ] Set weekly Crawl Stats review; alert on 5xx/timeouts spikes.
Scale safely (ongoing)
- [ ] Treat nofollow as a hint; don’t depend on it for crawl control.
- [ ] Don’t use the Indexing API for general pages (jobs/livestreams only).
- [ ] Keep paginated sets usable without rel=prev/next; invest in internal linking.
- [ ] Log audits monthly: crawl allocation by template, top waste URLs, soft-404s.
Bottom line
At 100k+ URLs, you win by engineering: tighter robots/sitemaps, canonical discipline, render parity, and relentless log-driven monitoring. Do that, and the content you already have becomes dramatically easier for Google to crawl, process, and rank.
Work with us
Pods bring technical engineering + SEO muscle together. We plug in with your devs, instrument the templates, and ship the infra that scales crawling, rendering, and indexing — so you can scale growth.