New programmatic pages won't get indexed — diagnose the crawl gap

Newly published programmatic pages stall in Search Console's 'Discovered — currently not indexed' or 'Crawled — currently not indexed' buckets instead of entering the index.

What you see in Search Console

After shipping a large batch of programmatic pages, Search Console's Page indexing report shows the declared URLs piling up under 'Discovered — currently not indexed' (Google saw the URL in your sitemap but has not crawled it) or 'Crawled — currently not indexed' (Google fetched it and chose not to keep it). The indexed count barely moves no matter how many URLs you submit. Manual 'Request indexing' may push a single page in, but the batch as a whole stays out. There is no penalty notification because this is not a penalty — it is Google declining to spend crawl budget and index slots on pages it does not yet judge worth keeping. The Discovered bucket usually grows first; the Crawled-not-indexed bucket grows as Google samples a few and is unimpressed.

Likely causes

Per-page quality below the threshold Google will spend an index slot on: 'Crawled — currently not indexed' is most often a soft quality signal. Google fetched the page, compared it to what is already indexed, and decided it adds nothing. Templated pages that differ only by a swapped entity, or that are thin relative to competing results, get sampled and dropped. Submitting more of them does not help, because Google's objection is to the template, not the discovery path.
Crawl budget exhausted by low-value or duplicate URLs: If your site exposes large numbers of parameter permutations, faceted-navigation combinations, or near-duplicate URLs, Googlebot spends its budget crawling noise and never reaches the pages you care about. This shows up as 'Discovered — currently not indexed' at scale: the URLs are known but uncrawled because the crawl scheduler keeps deprioritizing them behind the noise.
Weak internal linking — orphaned pages reachable only via the sitemap: A sitemap declares a URL exists; it does not signal that the URL matters. Pages reachable only through the sitemap, or buried behind deep pagination, receive almost no internal PageRank and read as unimportant. Google routinely leaves such pages in Discovered indefinitely. Pages need contextual internal links from already-indexed, authoritative pages to be prioritized for crawling.
New or low-authority domain with little crawl trust: Crawl rate scales with site authority and history. A young domain that suddenly publishes tens of thousands of URLs is asking for crawl budget it has not yet earned, so Google indexes a trickle and waits to see whether the new content earns engagement. This is the single hardest cause to fix quickly, because it resolves with time and earned signals rather than a configuration change.

Diagnostic steps

1. In Search Console's Page indexing report, separate the two buckets: 'Discovered — currently not indexed' (a crawl-priority problem) versus 'Crawled — currently not indexed' (a quality problem). The dominant bucket tells you which branch to work.
2. Use the URL Inspection tool on five stalled pages to confirm Google can fetch and render them; rule out a robots.txt block, noindex tag, or canonical pointing elsewhere before assuming a quality or budget cause.
3. Audit your crawl surface for noise: count parameter permutations, faceted combinations, and duplicate URLs, and check the server log or Crawl Stats report for how much of Googlebot's budget they consume.
4. Run pseolint on a sample of the stalled template and read thin-content and near-duplicate findings. If the template trips those rules, the Crawled-not-indexed bucket is a quality verdict you must fix at the template level.
5. Map internal links into the stalled template: confirm each page is linked from at least one already-indexed, topically-relevant page, not only from the sitemap or a footer mega-menu.
6. Trim the crawl surface (noindex or canonicalize the noise, block junk parameters) so Googlebot's budget reaches the pages that matter, then improve per-page value on the template itself.
7. Resubmit the cleaned sitemap segment and let Google rediscover at its own pace; do not mass-click Request Indexing, which does not scale and is not the signal Google rewards for large batches.
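
Step 5 can be approximated offline once you have two exports: the URLs declared in the sitemap and the internal link edges from a crawl of your own site. The sketch below is a minimal example under those assumptions. The function name find_orphans, the input files, and the tab-separated edge format are hypothetical and are not part of pseolint or Search Console.

```shell
# find_orphans DECLARED_URLS LINK_EDGES
#   DECLARED_URLS: one sitemap URL per line (hypothetical export).
#   LINK_EDGES:    "source<TAB>target" per line from a crawl of your own site
#                  (hypothetical export).
# Prints every declared URL that no other page links to.
find_orphans() {
  awk -F '\t' '
    # First file (link edges): remember each link target, ignoring self-links.
    FILENAME == ARGV[1] { if ($1 != $2) linked[$2] = 1; next }
    # Second file (declared URLs): print the ones never seen as a target.
    !($0 in linked)
  ' "$2" "$1"
}
```

Something like `find_orphans sitemap-urls.txt link-edges.tsv | wc -l` would then count the orphans. URLs need to be normalized the same way in both files (scheme, host, trailing slash) for the comparison to mean anything.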

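debug-crawlers.sh (cURL commands)

# Check response headers, robots directives, and canonical hints.
# A "Googlebot" user agent only imitates the crawler's UA string; it does not
# reproduce what Google actually fetches and renders.

# 1. Inspect HTTP response headers and X-Robots-Tag
curl -I -A "Googlebot" "https://yourdomain.com/page-path"

# 2. Check sitemap location in robots.txt
curl -s "https://yourdomain.com/robots.txt" | grep -i sitemap

# 3. Verify the canonical Link header, if one is sent, points at the page itself
#    (-o /dev/null discards the body so only headers reach grep)
curl -s -o /dev/null -D - "https://yourdomain.com/page-path" | grep -i "^link:"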
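
For step 3, a rough first pass over the server log can show how much Googlebot activity lands on parameterized URLs. The sketch below assumes a combined-format access log, where the seventh whitespace-separated field is the request path; the function name googlebot_param_share is hypothetical. It matches on the user-agent string alone, so it is an estimate, not a verified Googlebot count.

```shell
# googlebot_param_share ACCESS_LOG
#   ACCESS_LOG: web server log in combined log format (an assumption), where
#   whitespace-separated field 7 is the request path.
# Matches on the user-agent string only, so spoofed "Googlebot" traffic is
# counted too.
googlebot_param_share() {
  awk '
    /Googlebot/ {
      total++
      if (index($7, "?") > 0) noisy++   # query string present
    }
    END {
      if (total == 0) { print "no Googlebot requests found"; exit }
      printf "%d of %d Googlebot requests (%.1f%%) hit parameterized URLs\n",
             noisy, total, 100 * noisy / total
    }
  ' "$1"
}
```

A high share here is consistent with the crawl-budget cause described above, though the Crawl Stats report remains the better source for what Google itself recorded.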

Rules that detect this symptom

pseolint findings most strongly correlated with this pattern.

Thin Content Detection — How Google Catches Low-Substance Pages

Near-Duplicate Pages — SimHash, SpamBrain, and the Similarity Threshold

Template Diversity — Why HTML Structure Counts as a Spam Signal

Boilerplate Ratio — When Shared Template Text Eats Your Pages

Case study

A real-estate listings startup published 60,000 '{neighborhood} homes for sale' pages on an eight-month-old domain and watched 52,000 of them sit in 'Discovered — currently not indexed' for weeks. Crawl Stats showed Googlebot burning most of its budget on sort-and-filter parameter URLs. The team canonicalized the parameter noise, added neighborhood pages as contextual links from indexed city hub pages, and enriched the template with per-neighborhood price trends and school data instead of a swapped place-name. Indexation climbed from 13% to 61% of declared URLs over ten weeks as crawl budget was freed and the template cleared the quality bar.

Frequently asked questions

What's the difference between 'Discovered' and 'Crawled — currently not indexed'?

'Discovered — currently not indexed' means Google knows the URL exists (usually from your sitemap) but has not crawled it yet, which is a crawl-priority and budget problem. 'Crawled — currently not indexed' means Google fetched the page and decided not to index it, which is usually a soft quality verdict. The two require different fixes: budget and internal linking for the first, per-page value for the second.

Will requesting indexing in Search Console fix this at scale?

No. Request Indexing is a manual, per-URL tool with daily limits — useful for a handful of priority pages, useless for thousands. For large batches, the durable levers are improving page quality, trimming crawl-budget waste, and adding internal links so Google chooses to crawl and keep the pages on its own. Relying on manual submission is a sign the underlying signals still need work.

How long should I wait before treating non-indexation as a problem?

For an established domain, give a new batch two to four weeks before concluding the pages are stalled rather than merely queued. For a young or low-authority domain, indexation can legitimately take longer and arrive in waves. The signal that it is a real problem rather than normal lag is a flat indexed count while the Discovered or Crawled-not-indexed buckets keep growing.

Could publishing so many pages at once have hurt me?

Publishing a very large batch on a domain that has not earned proportional crawl trust often results in slow, partial indexing rather than a penalty — Google simply meters how much it takes. If the pages are also thin or near-duplicate, the large batch amplifies the quality signal and can spill into 'Crawled — currently not indexed' at scale. Shipping in smaller, higher-quality waves with strong internal links indexes more reliably than one massive drop.

What recovery looks like

Indexation recovery is gradual and compounding rather than a single step change. Once you trim crawl waste and strengthen internal links, freed budget reaches stalled pages within two to four weeks and the Discovered bucket starts draining. Quality-driven 'Crawled — currently not indexed' cases take longer — Google must recrawl, re-evaluate the improved template, and decide it now merits a slot, typically over four to ten weeks. On young domains, expect indexation to climb in waves tied to earned engagement rather than on a fixed schedule. Track the ratio of indexed to declared URLs per template week over week; a steadily rising ratio means the fixes are working, while a flat ratio past ten weeks means the template still is not clearing the quality bar.
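
The week-over-week ratio can be tracked with a small script over snapshots you record yourself. This is a minimal sketch, assuming a hand-maintained CSV with a header row and the columns week, template, declared, and indexed; that layout and the function name index_ratio are hypothetical and do not correspond to any Search Console export format.

```shell
# index_ratio SNAPSHOTS_CSV
#   SNAPSHOTS_CSV: rows of "week,template,declared,indexed" with a header line
#   (a hypothetical layout you would maintain yourself from weekly exports).
# Prints the indexed-to-declared percentage for each week and template.
index_ratio() {
  awk -F ',' '
    NR == 1 { next }   # skip the header row
    $3 > 0  { printf "%s %s %.1f%%\n", $1, $2, 100 * $4 / $3 }
  ' "$1"
}
```

Rows with zero declared URLs are skipped to avoid a division by zero. Reading the output per template, not site-wide, keeps one healthy template from masking a stalled one.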

A diagnosis in practice

Pelbrook Vets launched 5,800 pages at /clinic/{city}/{specialty} in November 2024, targeting veterinary dermatology, oncology, and rehabilitation queries across mid-sized US metros. Within three weeks of sitemap submission, Search Console's Page Indexing report showed 4,900 URLs stuck in 'Discovered — currently not indexed' and only 380 confirmed indexed. Crawl data from the Search Console URL Inspection tool showed Googlebot had not visited 87% of the new URLs at all — a crawl-budget exhaustion pattern, not a content-quality rejection. Site engineer Obi Nakagawa traced the bottleneck: the /clinic/ pages were orphaned at depth 4, reachable only through a dynamically generated sitemap, with zero internal links from the site's 1,200 existing /resource/ pages. The links/orphan-pages rule in a pseolint audit confirmed 5,740 of the 5,800 new pages had exactly 0 inbound internal links.

Nakagawa's fix involved two parallel changes: inserting a 'Nearby clinics' module on each /resource/{specialty} page that linked to the three nearest /clinic/ city pages, and adding a /clinics/ hub page that exposed the top 200 clinics sorted by state. Both changes shipped December 18, 2024. By January 31, 2025, 44 days later, the 'Discovered — currently not indexed' bucket had fallen from 4,900 to 1,100, and confirmed indexed URLs had climbed to 3,200. The remaining 1,100 were thin pages averaging 190 words; those were consolidated with a 301 redirect sweep on February 7, ending with 4,500 net indexable URLs.
