Scale your scraper analytics with ClickHouse: ETL patterns and performance tips


webscraper
2026-01-23 12:00:00
11 min read

Design real-time scraper ingestion and ETL patterns for ClickHouse: schema, batching, streaming, and query recipes to handle high-throughput scraping in 2026.

Stop losing insight to ingestion bottlenecks — architect scraper analytics for ClickHouse at scale

If your scrapers can fetch millions of pages per day but your analytics lag by hours (or collapse under load), you’re fighting the wrong problem. High-throughput scraping workloads aren’t just about faster crawlers — they need an ingestion and ETL architecture that preserves data fidelity, deduplicates and normalizes events in flight, and exposes fast ad-hoc analysis for investigators and ML pipelines. In 2026, ClickHouse is the OLAP engine many engineering teams choose for this exact problem because of its throughput, low-latency queries, and streaming integrations. This guide gives you practical ETL patterns, schema design, batching rules and query recipes to run real-time scraper analytics on ClickHouse.

The big picture — real-time scraping pipelines in 2026

Trends through late 2025 and early 2026 accelerated adoption of cloud-first OLAP: ClickHouse Cloud maturity, broader Kafka + Change Data Capture (CDC) ecosystems, and optimized vectorized query engines mean teams can push near-real-time scraped events into analytics tables. But the gap remains operational: how to keep ClickHouse healthy and queries interactive while ingesting tens of thousands of rows per second from hundreds of scrapers.

Core pattern: make the ingestion path fault-tolerant and batched, apply lightweight cleaning and canonicalization at ingress (so heavy transformations don’t block ingestion), and maintain materialized aggregates and projections for common ad-hoc analysis.

Architecture patterns that work

1) Ingest -> Buffer -> Normalize -> Store

  1. Scrapers push raw events (JSONEachRow) to a message queue (Kafka/Redpanda) or an HTTP gateway. For troubleshooting scraper networking and local/test setups see Security & Reliability: Troubleshooting Localhost and CI Networking for Scraper Devs.
  2. ClickHouse reads from Kafka using the Kafka table engine into a staging MergeTree via a Materialized View, or writes to a Buffer table in front of the MergeTree to absorb spikes.
  3. Materialized Views perform deterministic transformations (URL canonicalization, timestamp normalization, JSON parsing) and write to the final MergeTree table optimized for analytical queries.

2) Distributed Writes for scale

For clusters, use a Distributed table to route writes to shard-local MergeTree tables. This keeps local insert rates high and avoids single-node hot spots. Configure client libraries to write directly to one or more replicas (or to the Distributed table depending on your latency/consistency tradeoffs). If you’re operating with compact gateways or distributed control planes, the field guidance for compact gateways and distributed control planes is useful for topology decisions.
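
A minimal sketch (the cluster name and shard-local table name are assumptions for illustration; the shard-local table uses the MergeTree schema shown later in this article):

-- Assumes a cluster 'scraper_cluster' defined in remote_servers and shard-local tables named scraper_events_local
CREATE TABLE scraper_events_dist AS scraper_events_local
ENGINE = Distributed(scraper_cluster, default, scraper_events_local, sipHash64(site_id));

Clients insert into scraper_events_dist and ClickHouse routes each row to the shard that owns its hash; reads against the Distributed table fan out across all shards.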

3) Two-path processing: fast lane vs. enrichment lane

  • Fast lane: accept lightweight rows (url, url_hash, site_id, scraped_at, raw_payload) with tiny transforms so you never block ingestion (a minimal table sketch follows this list).
  • Enrichment lane: background jobs read the raw rows, enrich with third-party data, heavy NLP/DOM parsing, then write back to analytic tables or summary tables. Treat enrichment like a set of governed micro-services — governance patterns for micro-apps at scale apply here.
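
A minimal sketch of the fast-lane table (the name is illustrative; it accepts exactly the lightweight row shape listed above):

CREATE TABLE scraper_events_fast
(
    site_id UInt32,
    url String,
    url_hash UInt64,
    scraped_at DateTime64(6),
    raw_payload String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(scraped_at)
ORDER BY (site_id, url_hash, scraped_at);

Enrichment workers read these rows on their own schedule and write results to the analytic tables, so slow parsing never backs up ingestion.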

Schema design: build for appends, dedupe, and multi-tenancy

Good schema design balances write performance, storage, and query latency. For scraper analytics, the following patterns work well.

CREATE TABLE scraper_events (
    site_id UInt32,
    url String,
    url_hash UInt64,          -- e.g. sipHash64/xxHash64 of the canonical URL
    scraped_at DateTime64(6),
    status UInt16,
    http_code UInt16,
    content_length UInt64,
    raw_payload String,
    content_hash UInt64,      -- e.g. xxHash64 of the normalized content
    domain String,
    path String,
    crawler_id String,
    metadata Nested(key String, value String)
)
ENGINE = ReplacingMergeTree(content_hash)
PARTITION BY toYYYYMM(scraped_at)
ORDER BY (site_id, url_hash, scraped_at)
SETTINGS index_granularity = 8192;

Key rationale:

  • ReplacingMergeTree with content_hash as its version column collapses rows that share the same sorting key during background merges, so duplicate or replayed events for the same URL and scrape time deduplicate automatically.
  • Partitioning by month (toYYYYMM) keeps partition count manageable for high-throughput sites; use daily partitions only if you delete/expire frequently.
  • ORDER BY drives read locality: grouping by site_id and url_hash gives fast point lookups and range scans. Put scraped_at as a tie-breaker for time ordering.
  • Fixed-width UInt64 hash columns are storage-efficient and fast to compare, group, and join on.

When to use CollapsingMergeTree or Versioned deduplication

If your pipeline needs true upserts (add/delete semantics), CollapsingMergeTree with a sign column or ReplacingMergeTree with a version column are common strategies. Use CollapsingMergeTree to mark deletes and keep merges deterministic for logical removal. Prefer ReplacingMergeTree for idempotent rewrites where you can compute a monotonically increasing version.
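
A minimal CollapsingMergeTree sketch (the table and its columns are illustrative, not part of the main schema):

CREATE TABLE scraper_pages
(
    site_id UInt32,
    url_hash UInt64,
    scraped_at DateTime64(6),
    sign Int8                 -- +1 = state row, -1 = cancels the matching row
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY (site_id, url_hash);

-- Logical delete: insert a cancel row with the same key and sign = -1; merges collapse the pair
INSERT INTO scraper_pages VALUES (42, sipHash64('https://example.com/a'), now64(6), -1);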

Low-cardinality and dictionary usage

For fields with limited distinct values (status, domain hostnames for a small fleet, crawler_id), use LowCardinality(String) — this reduces memory pressure on sorts and group-bys. For enrichments (country/ip owner), leverage ClickHouse external dictionaries to keep large-but-static lookup tables out of the main column-store and speed JOIN-like lookups at query time.
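
For example (the ALTERs apply to the table above; the dictionary name and attribute are assumptions used for illustration):

-- Recode low-distinct string columns; existing data is rewritten as parts merge or mutate
ALTER TABLE scraper_events MODIFY COLUMN domain LowCardinality(String);
ALTER TABLE scraper_events MODIFY COLUMN crawler_id LowCardinality(String);

-- Query-time enrichment via a hypothetical complex-key dictionary 'domain_meta'
SELECT domain, dictGet('domain_meta', 'country', tuple(domain)) AS country
FROM scraper_events
LIMIT 10;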

Ingestion and batching best practices

Small, frequent inserts kill ClickHouse performance by creating many tiny parts and causing merge churn. The goal: batch to create medium-sized parts and keep merge pressure steady.

Batch size targets

  • Aim for 1–10 MB per insert or 10k–100k rows depending on row width.
  • Use the Buffer engine to accumulate many small writes and flush in larger blocks when strict consistency is not required.
  • For Kafka consumers, process in message-batch sizes of 1k–50k messages before insert, tuned to your payload size.

Example Python insert pattern (clickhouse-driver)

from clickhouse_driver import Client

client = Client('clickhouse-host')

batch = []
for event in scraper_events_source:
    batch.append((event.site_id, event.url, event.url_hash, event.scraped_at, ...))
    if len(batch) >= 10000:  # tune based on row size
        client.execute('INSERT INTO scraper_events VALUES', batch)
        batch = []

if batch:  # flush the final partial batch after the loop
    client.execute('INSERT INTO scraper_events VALUES', batch)

Use asynchronous producers or thread pools to keep scraping unaffected by insert latency. If you must do synchronous writes from scrapers, write to a centralized HTTP gateway that batches and forwards to ClickHouse.

Using Buffer and Kafka engines

  • Buffer table: good for sudden bursts from thousands of scrapers — it batches in-memory and flushes periodically to the real MergeTree, reducing part fragmentation. For insights on buffering patterns and cache layers see layered caching guidance.
  • Kafka engine + materialized views: allows ClickHouse to act as the stream consumer and perform transformations in the same step as ingestion.

Transformations: what to do at ingest vs post-ingest

Designate lightweight, deterministic transforms for ingest (URL normalization, timestamp parsing, JSON field extraction) and push heavy work (DOM parsing, language detection, ML inference) to asynchronous workers.

Materialized View ETL example (Kafka -> staging -> final)

CREATE TABLE kafka_scrapes (key String, value String) ENGINE = Kafka(...);

-- Staging MergeTree that receives parsed rows
CREATE TABLE staging_scraper_events
(site_id UInt32, url String, url_hash UInt64, scraped_at DateTime64(6), http_code UInt16, raw_payload String)
ENGINE = MergeTree PARTITION BY toYYYYMM(scraped_at) ORDER BY (site_id, url_hash, scraped_at);

-- Consume Kafka and apply deterministic transforms on the way into staging
CREATE MATERIALIZED VIEW mv_kafka_to_staging TO staging_scraper_events AS
SELECT
  toUInt32(JSONExtractInt(value, 'site_id')) AS site_id,
  JSONExtractString(value, 'url') AS url,
  sipHash64(JSONExtractString(value, 'url')) AS url_hash,
  parseDateTimeBestEffort(JSONExtractString(value, 'scraped_at')) AS scraped_at,
  toUInt16(JSONExtractInt(value, 'http_code')) AS http_code,
  value AS raw_payload
FROM kafka_scrapes;

-- Forward staging rows into the final, read-optimized table (missing columns take their defaults)
CREATE MATERIALIZED VIEW mv_staging_to_final TO scraper_events AS
SELECT site_id, url, url_hash, scraped_at, http_code, raw_payload FROM staging_scraper_events;

Materialized views let you apply deterministic ETL at ingest time, keeping the final table read-optimized.

Query patterns for ad-hoc analysis and troubleshooting

Analysts and SREs need fast ad-hoc queries to investigate anomalies (spike in 5xx, site outages, content drift). Use projections, pre-aggregations, and approximate aggregates to keep latency low.

1) Frequent question: "What sites had the most 5xx in the last 10 minutes?"

SELECT site_id, count() AS errors
FROM scraper_events
WHERE scraped_at >= now() - INTERVAL 10 MINUTE
  AND http_code >= 500
GROUP BY site_id
ORDER BY errors DESC
LIMIT 50;

2) Sampling for rapid exploration

Use the SAMPLE clause if you have defined sampling keys and need sub-second exploratory queries on huge tables:

SELECT url, count() FROM scraper_events
SAMPLE 0.01 -- 1% sample
WHERE scraped_at >= today()
GROUP BY url
ORDER BY count() DESC
LIMIT 20;

3) Approximate distinct for cardinality-heavy fields

SELECT uniqCombined(url_hash) AS unique_pages, uniqCombined(crawler_id) AS crawlers
FROM scraper_events
WHERE scraped_at >= now() - INTERVAL 1 DAY;

Use uniqExact when exact counts are required over small time windows, but prefer uniqCombined for speed and memory efficiency in routine analytics.

4) Use projections and materialized aggregates

Create projections or materialized aggregate tables for recurring heavy group-bys (e.g., site_id by hour metrics). Projections (built-in index-like structures) are especially helpful in 2026 ClickHouse versions and can dramatically reduce query time for common cubes.
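
As a sketch, a projection that pre-aggregates per-site hourly error counts (the projection name is illustrative) can be added to the table above:

ALTER TABLE scraper_events
    ADD PROJECTION site_hourly_errors
    (
        SELECT site_id, toStartOfHour(scraped_at), count(), countIf(http_code >= 500)
        GROUP BY site_id, toStartOfHour(scraped_at)
    );

-- Build the projection for parts that already exist
ALTER TABLE scraper_events MATERIALIZE PROJECTION site_hourly_errors;

Queries that group by site_id and hour can then be served from the projection instead of scanning raw rows.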

Operational tuning and monitoring

Watch the balance between writes and merges. Key signals to monitor:

  • system.parts — active parts count per table. Too many tiny parts indicate inserts that are too small (see the query sketch after this list).
  • system.merges — queued or running merges show merge pressure.
  • Disk usage and IO wait — ClickHouse is IO-bound for large scans; use NVMe and fast networking for lowest latency. For cost/observability tooling and runbooks, check cloud cost observability reviews like top cloud cost observability tools.
  • Insert latency and rows/sec — track per-node ingestion capacity.
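
A quick check for the first two signals using the system tables (thresholds are workload-dependent):

-- Tables accumulating many small active parts point to undersized inserts
SELECT database, table, count() AS active_parts,
       sum(rows) AS total_rows, formatReadableSize(sum(bytes_on_disk)) AS on_disk
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10;

-- Merges currently in flight (a persistent backlog means inserts are outpacing merges)
SELECT table, elapsed, progress, result_part_name
FROM system.merges;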

Tuneable knobs (general guidance):

  • Increase insert batch sizes instead of increasing server memory limits.
  • Adjust merge settings only after observing system metrics and testing — premature tuning can worsen fragmentation.
  • Prefer column-level compression and LowCardinality types to reduce read amplification.

Managing storage: TTLs, retention and compaction

Scraper data grows fast. Use TTLs to drop or move old raw payloads to cheaper storage. Recommended pattern:

-- Column TTL: drop raw payloads after 30 days (values revert to the column default)
ALTER TABLE scraper_events
  MODIFY COLUMN raw_payload String TTL scraped_at + INTERVAL 30 DAY;

-- Table TTL: move whole parts to the 'cold' storage tier after 90 days
ALTER TABLE scraper_events
  MODIFY TTL scraped_at + INTERVAL 90 DAY TO DISK 'cold';

Keep detailed raw payloads for a bounded window (30–90 days), then keep compacted metadata and aggregates longer for historical analysis.

Data quality: dedupe, canonicalization, and provenance

Scraper analytics is only useful if the data is clean. Put canonicalization at the earliest safe point:

  • Compute url_hash with a canonicalization function (strip UTM params, sort query params); see the sketch after this list.
  • Store content_hash (e.g. xxHash64 of the normalized body) to detect identical payloads and prevent duplicate analysis work.
  • Keep provenance metadata (crawler_id, job_id, source_ip) to debug bot-blocking and rate-limit issues — provenance helps during incident response and outage playbooks such as Outage-Ready: A Small Business Playbook.

Pro tip: Keep the raw_payload but also store a compact parsed representation (JSON or Nested) for fast analytics. Raw data helps debugging; parsed fields speed queries.
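
A sketch of lightweight canonicalization with ClickHouse URL functions (the stripped parameters are illustrative; fuller canonicalization such as sorting query parameters is usually done in the gateway before insert):

SELECT
    cutURLParameter(cutURLParameter(url, 'utm_source'), 'utm_campaign') AS canonical_url,
    sipHash64(cutURLParameter(cutURLParameter(url, 'utm_source'), 'utm_campaign')) AS url_hash
FROM staging_scraper_events
LIMIT 5;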

Scaling beyond a single ClickHouse cluster

When a single cluster is insufficient:

  • Shard by site_id ranges and use a globally routed Distributed table.
  • Use cross-cluster replication for disaster recovery and analytics locality. For topology and gateway patterns consult the compact gateways field review at controlcenter.cloud.
  • Consider ClickHouse Cloud managed clusters to offload operational burden — the ecosystem matured rapidly through 2025 and Cloud offerings in 2026 provide autoscaling and multi-region replication.

Compliance, governance, and retention

Scraping raises legal and compliance questions. Architect your pipeline with provenance, retention policies, and access controls:

  • Enable column-level access controls and encryption at rest and in transit (a minimal grant sketch follows this list). For deep security patterns and zero-trust guidance see Security Deep Dive: Zero Trust, Homomorphic Encryption, and Access Governance.
  • Keep audit trails for deletion and retention decisions (use system.mutations to monitor DELETEs/TTL operations).
  • Apply GDPR/CCPA retention and data-minimization rules via TTL and anonymization in ClickHouse.
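
A minimal sketch of a column-level grant (the role name is an assumption):

CREATE ROLE scraper_analyst;
-- Analysts can read metadata and metrics, but not raw payloads
GRANT SELECT(site_id, url_hash, scraped_at, http_code, domain, path, crawler_id)
    ON default.scraper_events TO scraper_analyst;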

Looking ahead

Two important developments to plan for in 2026:

  • ClickHouse Cloud and managed OLAP: with larger funding rounds and ecosystem investment in 2025, managed ClickHouse has more features for autoscaling ingestion and cross-region replication — great for multi-tenant scraping platforms.
  • Vectorized and CPU-optimized query engines: modern hardware plus ClickHouse engine improvements are making aggregations and approximate analytics faster, which means more complex analytics can run in near real-time even on larger datasets.

Common pitfalls and how to avoid them

  • Tiny inserts: fix with Buffer tables or batch in the scraper or gateway. Teams operating at the edge should also review edge-first, cost-aware strategies to balance latency and cost.
  • No deduplication: use content_hash and ReplacingMergeTree or a deduplication pipeline to avoid wasted storage.
  • Blocking transforms: keep heavy ML and parsing off the ingest path.
  • Unbounded raw retention: set TTLs and move cold data to cheaper tiers.

Actionable checklist — deploy a high-throughput pipeline in 30 days

  1. Prototype an ingest path: scrapers -> Kafka -> ClickHouse Kafka engine & Materialized View -> MergeTree.
  2. Design the base table with ReplacingMergeTree(content_hash), partition by month, ORDER BY (site_id, url_hash, scraped_at).
  3. Implement URL canonicalization and url_hash at the gateway or materialized view.
  4. Tune batch size to ~1–10MB inserts and measure system.parts and system.merges.
  5. Create hourly aggregate materialized views and projections for common dashboards (error rates, throughput per crawler, unique pages/day).
  6. Set TTL for raw_payload after 30–90 days and move old data to cold storage.
  7. Monitor ingestion metrics, parts count, and merge backlog; iterate on Buffer or Kafka batch sizes if merges fall behind.

Quick reference: SQL snippets

-- Buffer table in front of MergeTree
-- Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
-- (min_bytes/max_bytes of 10 MB / 100 MB here are illustrative; tune to your workload)
CREATE TABLE buffer_scraper_events AS scraper_events
ENGINE = Buffer(default, scraper_events, 16, 10, 60, 10000, 1000000, 10485760, 104857600);

-- Materialized view for hourly aggregates
-- (assumes an aggregate target table scraper_hourly; a sketch follows these snippets)
CREATE MATERIALIZED VIEW agg_hourly TO scraper_hourly
AS SELECT
  site_id,
  toStartOfHour(scraped_at) as hour,
  count() AS requests,
  countIf(http_code >= 500) AS errors
FROM scraper_events
GROUP BY site_id, hour;
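
The hourly-aggregate view above assumes a target table; a sketch of one (SummingMergeTree sums the partial per-block aggregates the view produces as parts merge):

CREATE TABLE scraper_hourly
(
    site_id UInt32,
    hour DateTime,
    requests UInt64,
    errors UInt64
)
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(hour)
ORDER BY (site_id, hour);

Query it with sum(requests) and sum(errors) grouped by site_id and hour so rows that have not yet merged still aggregate correctly.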

Closing: measurable outcomes and next steps

Adopting these patterns yields measurable improvements: lower ingestion latency, fewer tiny parts and merge stalls, faster ad-hoc queries, and a maintainable platform for downstream ML and BI. In 2026, ClickHouse combines OLAP speed with streaming integrations that let scraper teams move from brittle ETL scripts to durable real-time ingestion pipelines.

Takeaways

  • Batch aggressively and use Buffer/Kafka to avoid part fragmentation.
  • Use ReplacingMergeTree + content_hash for idempotent deduplication.
  • Keep heavy transforms out of the ingestion path; use materialized views for deterministic lightweight ETL.
  • Precompute aggregates and projections for common ad-hoc queries to keep latency low.

Ready to scale your scraper analytics? Start with a small cluster + Kafka proof-of-concept and measure parts/merges within a week. If you want a ready-to-deploy reference implementation, check out our open-source scraper->ClickHouse starter kit (includes Terraform for ClickHouse Cloud, Kafka connectors and ETL materialized views) or schedule a hands-on architecture review.

Call to action: Try ClickHouse with a sample ingestion pipeline this week — deploy the reference repo, run your scrapers into it, and compare pre/post latency and storage metrics. If you want help tuning batch sizes and schema for your workload, contact our engineering team for a free 1-hour remediation session.


Related Topics

#ClickHouse #data-pipelines #analytics

webscraper

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
