How to Handle Pagination in Web Scraping

A practical guide to handling pagination in web scraping across numbered pages, “Load more” buttons, infinite scroll, and cursor-based APIs.

Pagination is one of the first things that makes a simple scraper fail in production. A page may look like a neat list of links today, then switch to a “Load more” button, infinite scroll, or a background API tomorrow. This guide gives you a practical way to handle pagination across static and dynamic sites, so you can scrape multiple pages reliably, detect the real data source, and build scrapers that are easier to maintain when site behavior changes.

Overview

If you only remember one idea from this article, make it this: pagination is not a visual feature, it is a data-delivery pattern. The visible interface might be numbered links, a next button, endless scrolling, or a filtered search grid. Underneath, the site is usually doing one of a few repeatable things: requesting another HTML document, calling an API with a page or cursor parameter, or rendering more items into the DOM after a user action.

That distinction matters because the best scraping method depends on where the next batch of records actually comes from. If the site delivers fully rendered HTML pages, a basic HTTP client may be enough. If it uses JavaScript to fetch JSON in the background, browser automation may only be needed long enough to discover the network calls. If the site uses a cursor-based API, trying to click through every visual page in a browser can be slower and more fragile than calling the underlying endpoint directly.

For developers working on a web scraping tutorial, data pipeline, or reusable scraper framework, pagination is where ad hoc scripts often become maintainability problems. A robust approach should answer five questions early:

What triggers the next batch of results?
Where does the next batch come from: HTML, JSON, or GraphQL-like API responses?
How do you know when to stop?
How do you avoid duplicates and missing items across pages?
What is the least complex method that still works reliably?

This article focuses on those decisions rather than any single library. The patterns apply whether you use Python requests and Beautiful Soup, Playwright, Puppeteer, or another browser automation stack. If you are comparing browser automation choices, see “Playwright vs Puppeteer for Web Scraping: Which Should You Use?” If you want a broader starting point for Python workflows, see “Python Web Scraping Tutorial: Requests, Beautiful Soup, and Playwright.”

Core framework

Use this framework whenever you need to handle next-page links, “Load more” buttons, or infinite scroll on an unfamiliar site.

1. Identify the pagination type before writing extraction code

Open the page in a browser and inspect the behavior, not just the markup. Common pagination types include:

Numbered page URLs: examples often look like ?page=2 or /page/3/.
Next/previous links: the next page URL is embedded in an anchor tag.
Load more button: clicking appends additional items without a full page reload.
Infinite scroll: scrolling triggers background requests for more content.
Cursor-based APIs: responses include a token such as next_cursor rather than a page number.
Hybrid pagination: the page URL changes, but data is still fetched asynchronously.

Your first task is not “parse the product card.” Your first task is “discover how the next batch is requested.” That usually saves time.

2. Prefer the lowest-friction data source

In practice, the most stable scraper is often the one closest to the source data and furthest from presentation details.

If each page is a standard HTML document, request the URLs directly.
If the browser calls a JSON endpoint, consider scraping the endpoint instead of the rendered page.
If a browser is required for authentication or token generation, use it to establish session context, then capture the underlying requests.

Many developers jump straight to full browser automation because the page is dynamic. That can work, but it is often more expensive and less predictable than necessary. A browser is best treated as a discovery and fallback tool, not automatically the primary extraction method.

3. Define the stopping condition clearly

Pagination bugs often come from weak stop logic. Good stop conditions include:

No next URL exists.
The API returns an empty list.
The cursor is missing or null.
The page repeats the same last item as the prior batch.
You hit a known page limit, date boundary, or item count threshold.

Avoid open-ended loops driven only by “keep clicking until something breaks.” That tends to fail silently when a site changes its layout or returns a partial error page.
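The stop conditions above can be collected into one explicit check. The following is a minimal sketch, not a definitive implementation: the function name stop_reason, the "id" field on each record, and the default max_pages value are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def stop_reason(
    batch: Sequence[dict],
    next_token: Optional[str],
    previous_last_id: Optional[str],
    pages_fetched: int,
    max_pages: int = 500,  # assumed safety cap; tune per site
) -> Optional[str]:
    """Return a readable reason to stop paging, or None to continue."""
    if not batch:
        return "empty batch"
    # Assumes every record carries a stable "id" field (hypothetical name).
    if previous_last_id is not None and batch[-1]["id"] == previous_last_id:
        return "last item repeated from prior batch"
    if pages_fetched >= max_pages:
        return "page limit reached"
    if not next_token:
        return "no next URL or cursor"
    return None
```

Returning the reason as a string, rather than a bare boolean, means the same value can be logged when the job ends.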

4. Track identity, not position

When you scrape multiple pages, records can move between pages as the site updates. A product that was on page 2 in the morning may appear on page 1 later. The safest approach is to deduplicate by a stable identifier such as product ID, article URL, canonical slug, or another durable key. Do not rely on page number plus position index as the record identity.

This is especially important for search results, job boards, marketplace listings, and news pages where content order changes frequently.
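As a small illustration of identity-based deduplication, the sketch below filters records by a caller-supplied key. The field name "sku" and the sample records are invented for the example; on a real site the key would be whatever durable identifier the records expose.

```python
from typing import Callable, Hashable, Iterable, Iterator


def unique_by_key(
    records: Iterable[dict],
    key: Callable[[dict], Hashable],
) -> Iterator[dict]:
    """Yield each record the first time its key is seen, in arrival order."""
    seen = set()
    for record in records:
        identity = key(record)
        if identity in seen:
            continue  # same item surfaced again on a later page
        seen.add(identity)
        yield record


# "B2" moved from page 1 to page 2 between requests, so it appears twice.
page_1 = [{"sku": "A1", "title": "Lamp"}, {"sku": "B2", "title": "Desk"}]
page_2 = [{"sku": "B2", "title": "Desk"}, {"sku": "C3", "title": "Chair"}]
unique = list(unique_by_key(page_1 + page_2, key=lambda r: r["sku"]))
```

Note that the page number and position never enter the key, so a record that shifts between pages is still recognized.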

5. Separate pagination logic from parsing logic

In a maintainable scraper, “how do I get the next batch?” should be independent from “how do I extract fields from a record?” That separation makes it easier to update a scraper when a site changes from numbered pages to infinite scroll but keeps the same item structure.

A simple mental model is:

Fetcher: gets one batch of content.
Paginator: determines the next request.
Parser: extracts records from the batch.
Deduper: prevents repeated records.
Stop rule: decides when the job is done.
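One possible way to express this mental model in code is to pass each role in as a separate function. This is a sketch under stated assumptions: the name crawl, its parameters, and the in-memory PAGES dictionary standing in for a real site are all hypothetical.

```python
from typing import Callable, Hashable, Iterator, Optional, TypeVar

Req = TypeVar("Req")
Batch = TypeVar("Batch")


def crawl(
    first_request: Req,
    fetch: Callable[[Req], Batch],                    # Fetcher
    next_request: Callable[[Batch], Optional[Req]],   # Paginator
    parse: Callable[[Batch], list],                   # Parser
    identity: Callable[[dict], Hashable],             # Deduper key
    max_batches: int = 1000,                          # Stop rule: safety cap
) -> Iterator[dict]:
    seen = set()
    request: Optional[Req] = first_request
    batches = 0
    # Stop rule: no next request, or the safety cap is reached.
    while request is not None and batches < max_batches:
        batch = fetch(request)
        batches += 1
        for record in parse(batch):
            key = identity(record)
            if key not in seen:
                seen.add(key)
                yield record
        request = next_request(batch)


# Fake two-page site used only to exercise the loop.
PAGES = {
    "/items?page=1": {"items": [{"id": 1}, {"id": 2}], "next": "/items?page=2"},
    "/items?page=2": {"items": [{"id": 2}, {"id": 3}], "next": None},
}
records = list(
    crawl(
        "/items?page=1",
        fetch=PAGES.__getitem__,
        next_request=lambda batch: batch["next"],
        parse=lambda batch: batch["items"],
        identity=lambda record: record["id"],
    )
)
```

If the site later moves from numbered pages to a cursor, only fetch and next_request would need to change; parse and identity stay as they are.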

6. Build observability into the scraper

Even lightweight scripts benefit from logging a few pagination details:

Current page number or cursor
Items extracted per batch
Next URL or token
Total unique records collected
Reason for stopping

Those fields make debugging much easier when a site changes behavior.
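A lightweight way to capture those fields is one log line per batch plus one line when the job stops. The helper names and the key=value line format below are assumptions for illustration, not a required convention.

```python
import logging
from typing import Optional

logger = logging.getLogger("paginator")


def log_batch(
    position: str,
    item_count: int,
    next_token: Optional[str],
    total_unique: int,
) -> str:
    """Log one line per batch and return it so it can be inspected."""
    line = (
        f"position={position} items={item_count} "
        f"next={next_token or '-'} total_unique={total_unique}"
    )
    logger.info(line)
    return line


def log_stop(reason: str, total_unique: int) -> str:
    """Log why pagination ended and how many unique records were kept."""
    line = f"stopped reason={reason!r} total_unique={total_unique}"
    logger.info(line)
    return line
```

Calling logging.basicConfig(level=logging.INFO) once at startup is enough to see these lines on the console.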

Practical examples

Here are the main pagination patterns you are likely to encounter, with practical guidance for each.

Pattern 1: Static numbered pages

This is the simplest case. The site exposes a predictable URL pattern or an explicit next link in the HTML.

What to do:

Fetch the first page and inspect the pagination links.
Prefer extracting the actual next link over manually incrementing a page number if possible.
Parse items from each page and follow the next link until it disappears.

Why extracting the next link is often better: some sites skip pages, add filters to the next URL, or use non-obvious patterns. Following the site’s own next link is generally more resilient than guessing the next URL.

Good fit for: requests-based scrapers, server-rendered sites, archives, category pages, documentation indexes.
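To illustrate following the site's own next link, here is a sketch that uses only the Python standard library. It assumes the next link is marked with rel="next", which many sites do but not all; if the target uses a class name or link text instead, the matching condition would need to change. The sample HTML and URLs are invented.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Record the href of the first <a> or <link> tag with rel="next"."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if self.href is not None or tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "next" in rel_values and attributes.get("href"):
            self.href = attributes["href"]


def find_next_url(html: str, page_url: str) -> Optional[str]:
    """Return the absolute next-page URL, or None on the last page."""
    finder = NextLinkFinder()
    finder.feed(html)
    # urljoin resolves relative hrefs against the page that contained them.
    return urljoin(page_url, finder.href) if finder.href else None


sample_html = (
    '<a rel="prev" href="/archive?page=1">Prev</a>'
    '<a rel="next" href="/archive?page=3">Next</a>'
)
next_url = find_next_url(sample_html, "https://example.com/archive?page=2")
```

The loop that calls find_next_url simply fetches the returned URL and repeats until the function returns None.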

Pattern 2: Query parameter pagination

Many sites use a parameter such as ?page=2, ?offset=50, or ?start=100.

What to do:

Confirm whether the parameter is page-based or offset-based.
Check if page size can also be controlled with a parameter like limit or per_page.
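Stop when the response contains no items. A batch smaller than the expected page size usually also marks the last page, but confirm that on the target site, because some endpoints return short pages in the middle of a list.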

Key caution: offset-based pagination can become inconsistent on fast-changing datasets because insertions and deletions shift later results. If data freshness matters, use stable IDs and consider date or cursor strategies if available.
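A rough sketch of an offset-based loop follows. It assumes the server fills every page except the last, so a short or empty page is treated as the end; fetch_page is a hypothetical callable that would wrap the real HTTP request, and the in-memory DATA list stands in for the site.

```python
from typing import Callable, Iterator, List


def iter_offset_pages(
    fetch_page: Callable[[int, int], List[dict]],
    limit: int = 50,
    max_requests: int = 1000,  # assumed safety cap
) -> Iterator[dict]:
    """Yield items page by page, advancing the offset by the page size."""
    offset = 0
    for _ in range(max_requests):
        items = fetch_page(offset, limit)
        yield from items
        if len(items) < limit:  # empty or short page: treat as the end
            return
        offset += limit


# Fake data source with 120 records, used only to exercise the loop.
DATA = [{"id": i} for i in range(120)]
requested_offsets = []


def fake_fetch_page(offset: int, limit: int) -> List[dict]:
    requested_offsets.append(offset)
    return DATA[offset:offset + limit]


items = list(iter_offset_pages(fake_fetch_page, limit=50))
```

Because insertions and deletions can shift offsets mid-run, the output of a loop like this would normally still be passed through a deduplication step keyed on a stable ID.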

Pattern 3: Load more button

A “Load more” interface often calls an API and appends new items to the existing page.

What to do:

Open browser developer tools and watch the network tab while clicking the button.
Look for XHR or fetch requests returning JSON or HTML fragments.
Replicate the request directly if practical.
If direct replication is difficult, automate button clicks with a browser and wait for new items to appear.

Stop conditions:

The button disappears or is disabled.
The API returns no additional items.
The last item ID repeats.

Common trap: clicking too quickly without waiting for new content can trigger duplicate requests or stale DOM reads.

Pattern 4: Infinite scroll

Infinite scroll is usually just hidden pagination with a scroll event as the trigger. The important part is still the network behavior.

What to do:

Scroll once or twice manually and inspect the background requests.
Check whether the site uses page numbers, offsets, or cursors under the hood.
If an API is visible, scrape that instead of simulating hundreds of scroll actions.
If you must scroll, wait for the item count or a sentinel element to change before continuing.

Reliable loop idea: measure the number of extracted cards, scroll, wait, then compare counts. If the count no longer increases after a few attempts, stop.

Common trap: assuming all items currently in the DOM represent the full result set. Some frontends virtualize long lists and remove older elements from the DOM as you scroll. In that case, extracting only visible elements may miss data unless you capture the network responses or collect records incrementally as they appear.
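The count-and-compare loop described above might look like the sketch below. The two callables are placeholders: in a real run they would wrap whatever your browser automation library provides for counting matched elements and for scrolling and waiting. The FakeFeed class exists only so the loop can be exercised without a browser.

```python
from typing import Callable


def scroll_until_stable(
    count_items: Callable[[], int],
    scroll_and_wait: Callable[[], None],
    max_idle_rounds: int = 3,   # attempts allowed without growth
    max_rounds: int = 200,      # assumed hard cap on scroll actions
) -> int:
    """Scroll until the item count stops growing; return the final count."""
    count = count_items()
    idle_rounds = 0
    for _ in range(max_rounds):
        if idle_rounds >= max_idle_rounds:
            break
        scroll_and_wait()
        new_count = count_items()
        if new_count > count:
            count = new_count
            idle_rounds = 0
        else:
            idle_rounds += 1
    return count


class FakeFeed:
    """Stand-in for a page that loads `step` more items per scroll."""

    def __init__(self, total: int, step: int) -> None:
        self.total, self.step = total, step
        self.loaded = min(step, total)
        self.scrolls = 0

    def count(self) -> int:
        return self.loaded

    def scroll(self) -> None:
        self.scrolls += 1
        self.loaded = min(self.loaded + self.step, self.total)


feed = FakeFeed(total=45, step=20)
final_count = scroll_until_stable(feed.count, feed.scroll)
```

On a virtualized list the DOM count may stay flat even while new items arrive, so count_items would be better backed by the number of unique IDs collected so far than by the number of visible elements.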

Pattern 5: Cursor-based API pagination

This is common in modern web apps because a cursor avoids deep offsets and stays consistent when the dataset changes between requests. The response may include something like next_cursor, endCursor, or a continuation token.

What to do:

Send the initial request.
Extract records and the cursor token from the response.
Pass that token into the next request.
Continue until no cursor is returned.

Why it matters: cursor-based pagination is often the most stable way to scrape dynamic feeds because it tracks position in the dataset more safely than page numbers or offsets.
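The four steps above can be sketched as a short loop. The response shape used here (an "items" list plus a "next_cursor" field) is an assumption; real APIs name and nest these fields differently, which is why both keys are parameters. The RESPONSES dictionary is a fake endpoint for demonstration.

```python
from typing import Callable, Iterator, Optional


def iter_cursor_pages(
    fetch: Callable[[Optional[str]], dict],
    items_key: str = "items",          # assumed field name
    cursor_key: str = "next_cursor",   # assumed field name
    max_requests: int = 1000,          # assumed safety cap
) -> Iterator[dict]:
    """Yield records from a cursor-paginated API until no cursor is returned."""
    cursor: Optional[str] = None       # None requests the first batch
    for _ in range(max_requests):
        response = fetch(cursor)
        yield from response.get(items_key, [])
        cursor = response.get(cursor_key)
        if not cursor:                 # missing, null, or empty: done
            return


# Fake endpoint keyed by the cursor it was called with.
RESPONSES = {
    None: {"items": [{"id": "a"}, {"id": "b"}], "next_cursor": "c1"},
    "c1": {"items": [{"id": "c"}], "next_cursor": None},
}
ids = [record["id"] for record in iter_cursor_pages(RESPONSES.__getitem__)]
```

In practice fetch would send the cursor as a query parameter or request-body field, depending on what the network panel shows for the target site.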

Pattern 6: POST-based search pagination

Some sites paginate search results through form submissions or JSON POST requests rather than simple GET URLs.

What to do:

Capture the full request payload and headers.
Identify which field changes between requests: page number, offset, cursor, or filter state.
Preserve session cookies and anti-CSRF values if needed.

Common trap: reproducing the URL but not the payload, which leads to getting the same first page repeatedly.
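As an illustration of keeping the payload intact while changing only the paging field, the sketch below builds a JSON POST request with the standard library. The endpoint URL, the "page" field, the payload keys, and the X-CSRF-Token header name are all hypothetical; copy the real names from the captured request.

```python
import json
from typing import Optional
from urllib.request import Request


def build_search_request(
    url: str,
    base_payload: dict,
    page: int,
    csrf_token: Optional[str] = None,
) -> Request:
    """Build a POST request that repeats the captured payload for one page."""
    # Copy the captured payload so filters and sort order stay identical;
    # only the paging field (assumed to be "page") changes between requests.
    payload = {**base_payload, "page": page}
    headers = {"Content-Type": "application/json"}
    if csrf_token:
        headers["X-CSRF-Token"] = csrf_token  # hypothetical header name
    body = json.dumps(payload).encode("utf-8")
    return Request(url, data=body, headers=headers, method="POST")


captured = {"query": "standing desk", "sort": "newest"}
request = build_search_request(
    "https://example.com/api/search", captured, page=2, csrf_token="t0k3n"
)
```

To preserve session cookies across pages, requests like this one could be sent through a single opener built with urllib.request.build_opener and an HTTPCookieProcessor, or through a session object if you use a third-party HTTP client.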

A practical decision tree

When you encounter a new target site, this sequence works well:

Look for a visible next link or page parameter in the HTML.
If not found, inspect network requests while paging.
If data comes from an API, test direct requests.
If direct requests are blocked or require browser state, automate the browser minimally.
Add deduplication and strong stop rules before scaling.

This decision tree keeps your scraper grounded in the simplest working method.

If you are evaluating tooling for these workflows, “Best Web Scraping Tools in 2026: Features, Pricing, and Use Cases” is a useful comparison point.

Common mistakes

Most pagination failures come from a small set of design mistakes. Avoiding them will make your scrapers more dependable.

Scraping the interface instead of the data flow

If you only inspect the DOM, you may miss that the page is driven by a clean JSON endpoint. That leads to brittle selectors, unnecessary rendering costs, and slower runs.

Using page numbers when the site uses cursors

Developers sometimes force a page-based model onto a cursor-based system because it feels familiar. If the underlying data source uses continuation tokens, follow that pattern instead.

No deduplication across batches

Any scraper that handles dynamic pagination should assume overlap can happen. Without a unique key check, your data will eventually contain duplicates.

No protection against loops

Some sites return the same “next” URL repeatedly when something goes wrong. Others keep the button visible even when no more items exist. Keep a set of seen page URLs or cursor tokens and stop if one repeats unexpectedly.
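A seen-set guard can be sketched in a few lines. The function name and the NEXT mapping, which simulates a site whose last page wrongly points back to an earlier one, are made up for the example.

```python
from typing import Callable, Iterator, Optional


def follow_next_links(
    start_url: str,
    get_next_url: Callable[[str], Optional[str]],
    max_pages: int = 1000,  # assumed safety cap
) -> Iterator[str]:
    """Yield each page URL once; stop if the site returns one already seen."""
    seen = set()
    url: Optional[str] = start_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        yield url
        url = get_next_url(url)


# Simulated fault: page 3 points back to page 2 instead of ending.
NEXT = {"/p/1": "/p/2", "/p/2": "/p/3", "/p/3": "/p/2"}
visited = list(follow_next_links("/p/1", NEXT.get))
```

The same guard works for cursor tokens: store each token in the set and stop when one comes back a second time.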

Waiting for the wrong thing in browser automation

Waiting for a generic timeout is less reliable than waiting for a meaningful change such as a larger item count, a new network response, or a specific DOM mutation.

Ignoring sort order and content drift

If the target data changes while your scraper is paging through it, records may move. This is common in “latest first” feeds. For recurring jobs, consider scraping within a bounded time range or stopping once you reach a known older item.
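For a recurring job on a newest-first feed, stopping at a known older item might look like this. The sketch assumes the feed is strictly newest-first and that each item has a stable "id" field; pinned or promoted items at the top of a feed would break that assumption and need to be skipped first.

```python
from typing import Hashable, Iterable, Iterator, List, Set


def collect_new_items(
    batches: Iterable[List[dict]],
    known_ids: Set[Hashable],
) -> Iterator[dict]:
    """Yield items from a newest-first feed until a known ID is reached."""
    for batch in batches:
        for item in batch:
            if item["id"] in known_ids:
                return  # everything older was collected by a prior run
            yield item


fetched_batches = []


def fake_feed() -> Iterator[List[dict]]:
    """Lazy fake feed that records how many batches were requested."""
    for batch in ([{"id": 105}, {"id": 104}], [{"id": 103}, {"id": 102}], [{"id": 101}]):
        fetched_batches.append(batch)
        yield batch


new_items = list(collect_new_items(fake_feed(), known_ids={101, 102, 103}))
```

Because the batches are consumed lazily, the third batch is never requested once a known ID appears in the second.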

Mixing pagination, parsing, and storage in one loop

One long loop that fetches, clicks, parses, transforms, and writes data can be difficult to debug. Smaller, clearly named steps are easier to fix when the site changes.

Not testing the end of the list

Many scripts work for page 1 through 3 and fail at the end because the final page is shorter, the next button disappears, or the API returns a different shape. Always test the last-page behavior explicitly.

Overlooking legal and operational boundaries

Even when the technical approach is sound, you still need to act responsibly. Review a site’s terms, respect rate limits where appropriate, and avoid unnecessary load. Good pagination handling is not just about completeness; it is also about being deliberate and efficient.

When to revisit

Pagination logic should be treated as a part of scraper maintenance, not a one-time setup. Revisit your approach when the site’s delivery pattern changes or when your operating assumptions no longer hold.

Review the scraper if you notice any of the following:

The site changes from numbered pages to infinite scroll or “Load more.”
Your scraper starts collecting fewer items than expected.
The same records appear across multiple batches.
The browser run becomes much slower than before.
Network responses now include a cleaner endpoint than the one you originally used.
The site introduces new filtering, sorting, or authentication behavior.

A practical refresh checklist:

Open the page manually and paginate through it once.
Inspect network requests during each pagination action.
Confirm the current pagination model: page, offset, cursor, or hybrid.
Verify the stop condition still works on the final batch.
Check deduplication against a stable identifier.
Review whether a simpler method is now available.
Update logs so you can see page and cursor behavior clearly.

As a rule, revisit the scraper whenever the primary method changes or when new tools and standards make a simpler implementation possible. For example, a job that once required full browser automation may later be easier to run through a stable API call discovered in the network panel.

The most durable mindset is to think of pagination as a pattern library rather than a fixed implementation. Static pages, next links, dynamic pagination, infinite scroll, and background APIs are all variations of the same problem: retrieving the next batch safely and knowing when you are done. Once you build around that model, your scraper is much more likely to survive interface changes without a full rewrite.

For your next project, start small: identify the pagination type, capture one complete batch transition, define a stop rule, and add deduplication before you scale. That process is usually enough to turn a fragile proof of concept into a scraper you can trust.


Web Dev Toolbox Editorial
