The Technical SEO Checklist for Scaling Sites Past 100k Pages


Past a certain scale, SEO isn’t content — it’s infrastructure. When your site crosses ~100,000 URLs, rankings hinge on how efficiently you allocate crawl budget, deduplicate templates, render JavaScript, and keep your index clean.

1) Know your scale problems before they snowball

What breaks first at 100k+ URLs

  • Crawl waste on facets, filters, sessions, and infinite scroll variants.
  • Duplicate/near-duplicate templates (color/size variants, UTM’d URLs).
  • Indexing gaps from weak canonicals, stale sitemaps, soft-404s, and slow servers.
  • JS rendering lags that hide content and links on initial fetch.
  • Pagination myths (Google hasn’t used rel=prev/next for years).

Current baseline to hit

  • Keep server errors low and response times fast; Google will crawl more when servers are healthy. Track this in Search Console’s Crawl Stats.
  • Design for Core Web Vitals with a focus on INP (Interaction to Next Paint), which replaced FID as the responsiveness metric in March 2024.

2) The must-have controls (with shippable examples)

A) robots.txt that gates, not guesses

Use Allow/Disallow and the Sitemap: directive. Don’t rely on unsupported directives like crawl-delay.

# /robots.txt
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /search
Disallow: /*?sessionid=
Disallow: /*?sort=
Allow: /static/
Sitemap: https://www.example.com/sitemaps/sitemap-index.xml
  • Use wildcards (* and $) carefully: Google applies the most specific (longest-matching) rule, and only falls back to the least restrictive rule when matches tie. Test before deployment (see the sketch below).
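
Before shipping rule changes, it helps to run representative URLs through a wildcard-aware parser. The sketch below assumes the third-party protego package (the parser Scrapy uses); the standard library's urllib.robotparser does not understand Google-style * and $ wildcards, so it would mis-evaluate rules like these. URLs are illustrative.

# pip install protego   (third-party dependency, assumed available)
from protego import Protego

ROBOTS_TXT = """
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /search
Disallow: /*?sessionid=
Disallow: /*?sort=
Allow: /static/
"""

rp = Protego.parse(ROBOTS_TXT)

# Illustrative URLs: facet/session noise should be blocked, money pages must stay crawlable
checks = [
    "https://www.example.com/shoes/mens/running",     # expect: ALLOW
    "https://www.example.com/shoes?sort=price_asc",   # expect: BLOCK
    "https://www.example.com/search?q=trail+runner",  # expect: BLOCK
    "https://www.example.com/static/app.css",         # expect: ALLOW
]

for url in checks:
    verdict = "ALLOW" if rp.can_fetch(url, "Googlebot") else "BLOCK"
    print(f"{verdict}  {url}")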

B) XML sitemaps that scale

Split large sites into logical sitemap files (by type, dir, or date) and link them with a sitemap index. Respect limits (50k URLs per sitemap or 50MB uncompressed). Keep lastmod accurate; don’t “thrash” dates.

<!-- /sitemaps/sitemap-index.xml -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemaps/products-1.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemaps/products-2.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemaps/categories.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemaps/blog.xml</loc></sitemap>
</sitemapindex>
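
Each child file then lists its own URLs with a truthful lastmod; keeping those dates honest is what makes the split worthwhile. A minimal sketch of one of the product files referenced above (URLs and dates are illustrative):

<!-- /sitemaps/products-1.xml -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/p/tr-200</loc>
    <lastmod>2024-05-14</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/p/tr-201</loc>
    <lastmod>2024-05-02</lastmod>
  </url>
</urlset>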

C) Canonicalization and internal links

  • Canonicalize variants (parameters, case, trailing slashes) to a single, clean URL and reinforce with internal links to the canonical form.
  • Avoid blocking canonicalized URLs in robots.txt (Google must crawl to see the <link rel="canonical">).
  • When dealing with facets, block clearly low-value patterns in robots.txt, canonicalize near-duplicates, and expose browseable high-value combinations.
<link rel="canonical" href="https://www.example.com/shoes/mens/running">

D) Structured data: ship templates, not one-offs

  • Ecomm: Product + Merchant listing (where applicable).
  • SaaS: SoftwareApplication + Organization + Breadcrumb.

Implement as JSON-LD and validate.
<script type="application/ld+json">
{
  "@context":"https://schema.org",
  "@type":"Product",
  "name":"Trail Runner 2.0",
  "sku":"TR-200",
  "brand":{"@type":"Brand","name":"ExampleCo"},
  "offers":{"@type":"Offer","price":"129.00","priceCurrency":"USD","availability":"InStock","url":"https://www.example.com/p/tr-200"}
}
</script>

E) JavaScript and rendering

Prefer SSR/SSG or hydrated HTML for critical content/links. If you must use dynamic rendering, treat it as a workaround and monitor parity.
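
One way to monitor parity is to diff the links present in the raw server response against the links in a rendered snapshot of the same page (for example, one exported by your crawler). A minimal sketch using only the Python standard library; the URL and snapshot path are hypothetical:

# Compare <a href> targets in raw HTML vs. a rendered snapshot of the same page.
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = set()
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(href)

def extract_links(html: str) -> set:
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

# Raw HTML as Googlebot first fetches it (hypothetical URL)
raw_html = urlopen("https://www.example.com/shoes/mens/running").read().decode("utf-8")

# Rendered snapshot saved by a headless crawl (hypothetical file)
with open("snapshots/shoes-mens-running.rendered.html", encoding="utf-8") as f:
    rendered_html = f.read()

missing = extract_links(rendered_html) - extract_links(raw_html)
print(f"{len(missing)} links only appear after JS rendering:")
for href in sorted(missing):
    print("  ", href)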

F) Pagination without rel=prev/next

Use strong internal links to page 1 and key pages; keep page titles/snippets unique; consider “view-all” where UX allows.
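
A concrete shape for a paginated listing page (URLs illustrative): each page self-canonicalizes, carries a distinct title, and links back to the head of the set so equity and discovery flow to it.

<!-- /shoes/mens/running?page=3 -->
<title>Men's Running Shoes | Page 3 | ExampleCo</title>
<link rel="canonical" href="https://www.example.com/shoes/mens/running?page=3">

<a href="https://www.example.com/shoes/mens/running">Men's running shoes (page 1)</a>
<a href="https://www.example.com/shoes/mens/running?page=2">Previous</a>
<a href="https://www.example.com/shoes/mens/running?page=4">Next</a>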

G) Meta robots vs robots.txt, and the nofollow reality

  • Meta robots noindex requires the page to be crawled; robots.txt Disallow prevents fetch.
  • nofollow is a hint, not a directive (since 2019). Don’t use it as your primary crawl management tool.
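
For example, to keep a crawlable URL out of the index while leaving its links discoverable, use the meta tag (or the X-Robots-Tag response header for non-HTML assets) and make sure the URL is not disallowed in robots.txt, or Googlebot will never see the directive:

<!-- Crawlable but not indexable; the URL must NOT be disallowed in robots.txt -->
<meta name="robots" content="noindex, follow">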

3) Monitoring that actually scales

Weekly

  • Search Console → Crawl Stats: total requests, avg response time, bytes downloaded; watch spikes in 5xx/timeout.
  • Index Coverage: sudden drops in “Indexed” or rises in “Crawled – currently not indexed”.
  • Diff the sitemap counts vs. indexed counts by section.

Monthly

  • Log-file analysis (a sampling sketch follows this list):
    • % of Googlebot hits to blocked or parameterized URLs.
    • Crawl distribution by template (product, PLP, blog).
    • Top 404/410/301 chains.
  • Render checks at scale (crawler + HTML snapshot comparisons).
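
A sampling pass over access logs doesn't need heavy tooling. This is a minimal sketch assuming combined-format logs and a few illustrative URL patterns; it buckets Googlebot hits so you can see where crawl budget actually goes (verify Googlebot by reverse DNS before trusting the numbers):

# Rough crawl-allocation sampling from an access log (combined log format assumed).
import re
from collections import Counter

LOG_PATH = "access.log"  # hypothetical path
LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*" (?P<status>\d{3}) .*"(?P<ua>[^"]*)"$')

buckets = Counter()
statuses = Counter()

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE.search(line)
        if not m or "Googlebot" not in m.group("ua"):
            continue
        path, status = m.group("path"), m.group("status")
        statuses[status] += 1
        # Bucket rules are illustrative; swap in your own template patterns.
        if "sessionid=" in path or "sort=" in path or path.startswith("/search"):
            buckets["blocked/parameter waste"] += 1
        elif path.startswith("/p/"):
            buckets["product (PDP)"] += 1
        elif path.startswith("/blog/"):
            buckets["blog"] += 1
        else:
            buckets["other templates"] += 1

total = sum(buckets.values()) or 1
for bucket, hits in buckets.most_common():
    print(f"{bucket:28} {hits:8}  {hits / total:6.1%}")
print("Status mix:", dict(statuses.most_common(5)))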

When to use the Indexing API

  • Only for JobPosting and BroadcastEvent (livestream) pages. It won’t accelerate general pages; use sitemaps + internal links instead.

4) Faceted navigation: a safe default pattern

  1. Map parameter types

    • UX params (sort, view, session): block in robots.txt.
    • Duplicate creators (color/size filters): canonical to base and avoid linking at scale.
    • Valuable combinations (category + brand): keep crawlable, link them, and include in sitemaps.
  2. Ship a robots rule set that blocks obviously infinite patterns (e.g., ?page=*&sort=*) but allows representative “browseable” sets.

  3. Reinforce with canonicals + internal links to the clean target URLs.

5) Production-ready snippets

robots.txt (facet-safe example)

User-agent: *
Disallow: /search
Disallow: /*?q=
Disallow: /*?view=
Disallow: /*?sessionid=
Disallow: /*?sort=
Disallow: /*&page=*    # if your param order varies, pattern accordingly
Allow: /static/
Sitemap: https://www.example.com/sitemaps/sitemap-index.xml

Sitemap index (by directory)

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemaps/plp.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemaps/pdp-1.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemaps/pdp-2.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemaps/blog.xml</loc></sitemap>
</sitemapindex>

Canonical pattern

<link rel="canonical" href="https://www.example.com/{canonicalPath}">

SaaS structured data (SoftwareApplication)

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "Acme Analytics",
  "applicationCategory": "BusinessApplication",
  "operatingSystem": "Web",
  "offers": { "@type": "Offer", "price": "99.00", "priceCurrency": "USD" }
}
</script>

Ecomm structured data (Product)

<script type="application/ld+json">
{
  "@context":"https://schema.org",
  "@type":"Product",
  "name":"Trail Runner 2.0",
  "sku":"TR-200",
  "brand":{"@type":"Brand","name":"ExampleCo"},
  "offers":{"@type":"Offer","price":"129.00","priceCurrency":"USD","availability":"InStock"}
}
</script>

6) Hypothetical scale-up case study (SaaS + Ecomm hybrid)

Context: 250k URLs across product listings, product detail pages, help docs, and blog. Growth stalled; new pages discovered slowly; coverage was noisy.

Fix plan

  • Crawl budget: blocked session/sort/search parameters; allowed representative brand + category; moved everything else to canonical targets.
  • Sitemaps: split by dir and freshness (daily PDP delta files, weekly PLP, monthly blog); added accurate lastmod.
  • Rendering: SSR for PLP/PDP shells to expose links + core content on initial HTML; monitored INP and response times.
  • Index hygiene: removed reliance on rel=prev/next; strengthened internal links to page 1s and hubs.
  • Monitoring: weekly Crawl Stats review; monthly log sampling to verify crawl distribution by template and cut 404s.

Outcomes (12 weeks)

  • +28% faster discovery of new PDPs (time-to-first-crawl).
  • −37% Googlebot hits to Disallow/parameter garbage.
  • +16% organic sessions to PLPs; +9% PDP clicks attributable to better crawl allocation.
  • INP stabilized < 200 ms on high-traffic templates.

7) Your operational checklist

Ship now (2–4 weeks)

  • [ ] Validate robots.txt (block search/session/sort; keep sitemap directive).
  • [ ] Publish sitemap index split by section; verify in Search Console.
  • [ ] Standardize canonicals and internal links to one URL per asset.
  • [ ] Add JSON-LD templates for Product/SoftwareApplication where relevant.
  • [ ] Replace FID thinking with INP targets; ship perf budgets.
  • [ ] Set weekly Crawl Stats review; alert on 5xx/timeouts spikes.

Scale safely (ongoing)

  • [ ] Treat nofollow as a hint; don’t depend on it for crawl control.
  • [ ] Don’t use the Indexing API for general pages (jobs/livestreams only).
  • [ ] Keep paginated sets usable without rel=prev/next; invest in internal linking.
  • [ ] Log-audits monthly: crawl allocation by template, top waste URLs, soft-404s.

Bottom line

At 100k+ URLs, you win by engineering: tighter robots/sitemaps, canonical discipline, render parity, and relentless log-driven monitoring. Do that, and the content you already have becomes dramatically easier for Google to crawl, process, and rank.

Work with us

Pods bring technical engineering + SEO muscle together. We plug in with your devs, instrument the templates, and ship the infra that scales crawling, rendering, and indexing — so you can scale growth.
