ClickHouse ingestion benchmarks with real-world scraping workloads
Reproducible 2026 ClickHouse benchmarks for scraped HTML, JSON, and telemetry—find throughput, compression ratios, and latency to guide architecture.
Why your scraper pipeline needs real ClickHouse ingestion benchmarks in 2026
Pain point: you’ve got petabytes of messy scraped HTML, JSON API responses, and telemetry streams, anti-scraping throttles disrupting your crawl schedule, and a tight deadline to turn it all into business signals in your analytics. Which ClickHouse architecture will actually hold up?
This article publishes reproducible, practical ClickHouse ingestion benchmarks run with real-world scraped workloads (HTML pages, semi-structured JSON, and high-cardinality telemetry). I show measured throughput, compression, and query latency, give the exact DDLs and ingestion scripts, and recommend architecture choices for production scraping pipelines in 2026.
Executive summary — the short version
- Small-row telemetry (JSON lines): best raw ingest throughput, high compression (4–8×), counts/group-bys under 200ms on partitioned MergeTree.
- Structured JSON (product records): mid-range throughput, good compression (4–6×) when fields are typed and raw payloads stored compressed.
- HTML blobs (full pages): slowest ingest per-row (large payloads), but excellent compression for body text (8–12× with ZSTD high-level) — avoid scanning full blobs for analytics; extract fields into columns.
What I benchmarked and why it matters
The goal was practical: measure end-to-end how quickly ClickHouse will accept scraped data and how compactly it stores it, then measure query latency for common analytics (counts, top-k, time windows). I used three representative payload types:
- Telemetry (small JSON lines): many tiny records (timestamps, metrics, tags). Typical of monitoring data or JS telemetry emitted by pages.
- Structured JSON (product / listing records): a set of typed attributes (id, price, category, specs) plus a compressed raw JSON blob.
- HTML pages (full page blobs): scraped HTML with extracted title, meta, and a raw body blob used for occasional full-text or downstream NLP.
Testbed and reproducibility
All tests were run in January 2026 under reproducible scripts included in the repository referenced at the end. Baseline hardware used for the single-node runs:
- Single-node ClickHouse: 16 vCPU, 64 GB RAM, local NVMe SSD (3.2 GB/s sequential), Ubuntu 22.04, ClickHouse 23.x+ (2026-stable builds).
- Cluster runs: 3 × identical nodes running ReplicatedMergeTree, with ClickHouse Keeper (the ZooKeeper replacement) handling replication coordination.
- Network: 10 Gbps intra-cluster, public tests used local VPC to avoid noisy neighbors.
Reproducible assets provided:
- Docker Compose to spin up a local 1-node ClickHouse for quick runs.
- Python data generators (async) to synthesize realistic scraped JSON/HTML/telemetry files.
- Bash benchmark harness that runs concurrent writers, measures ingest rate, computes on-disk size, and runs query latency experiments with clickhouse-client.
Schema design: columnar best practices for scraped data
Column design matters more than you think for scraped datasets. Key points:
- Extract searchable fields (domain, path, title, price, tags) into typed columns — numeric types for aggregation, strings for grouping.
- Store raw payloads as compressed BLOBs — keep a single body String column with a high-compression codec for raw HTML/JSON.
- Partition on time (daily) and ORDER BY domain + timestamp for locality on typical analytics queries (top domains per hour/day).
- Use per-column CODEC to tune compression: LZ4 for fast reads, ZSTD for better compression on large text.
Example MergeTree DDL (adapt for your use)
CREATE TABLE scraped_events
(
ts DateTime64(3, 'UTC'),
domain String CODEC(ZSTD(3)),
url String CODEC(ZSTD(3)),
http_status UInt16,
title String CODEC(ZSTD(5)),
body String CODEC(ZSTD(9)), -- raw HTML/JSON blob
price Nullable(Float64),
category String CODEC(ZSTD(3))
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(ts)
ORDER BY (domain, ts)
SETTINGS index_granularity = 8192;
Ingestion patterns tested
I compared three patterns you’ll encounter in production:
- Batched INSERT via clickhouse-client — good for ETL/bulk loads.
- HTTP JSONEachRow streaming — simple and commonly used by scrapers pushing directly.
- Streaming via Kafka → Buffer → Materialized View → MergeTree — recommended for resilient pipelines, backpressure handling, and decoupling producers from ClickHouse write spikes.
Ingest commands (example)
# batched JSONEachRow with clickhouse-client (decompressing a gzipped dump on the fly)
zcat data.json.gz | clickhouse-client --query="INSERT INTO scraped_events FORMAT JSONEachRow"
# simple HTTP push
curl -s -H 'Content-Type: application/json' --data-binary @data.json "http://clickhouse:8123/?query=INSERT+INTO+scraped_events+FORMAT+JSONEachRow"
# Kafka -> Buffer -> MV (DDL sketch)
CREATE TABLE scraped_events_buffer AS scraped_events
ENGINE = Buffer(default, scraped_events, 16, 10, 60, 1000, 100000, 1000000, 10000000);
-- Producers (or a Materialized View reading a Kafka engine table) write to scraped_events_buffer;
-- the Buffer engine flushes rows into scraped_events once its time/row/byte thresholds are reached.
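For the streaming pattern, the piece not shown above is the Kafka source itself. Below is a minimal sketch of a Kafka engine table plus the Materialized View that pumps rows into the buffer; the broker address, topic, consumer group, and consumer count are placeholder values to adapt to your setup.
CREATE TABLE scraped_events_kafka
(
    ts DateTime64(3, 'UTC'),
    domain String,
    url String,
    http_status UInt16,
    title String,
    body String,
    price Nullable(Float64),
    category String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'scraped_events',
         kafka_group_name = 'clickhouse_ingest',
         kafka_format = 'JSONEachRow',
         kafka_num_consumers = 4;
-- The Materialized View acts as the pump: it reads new Kafka messages and inserts them into the buffer table.
CREATE MATERIALIZED VIEW scraped_events_kafka_mv TO scraped_events_buffer
AS SELECT ts, domain, url, http_status, title, body, price, category
FROM scraped_events_kafka;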
Benchmark results (raw numbers you can expect in 2026)
Note: absolute numbers vary by hardware, ClickHouse version, and row shape. These are measured under the hardware described above and with careful tuning of batch sizes, codecs, and concurrent writers.
Telemetry (JSON lines)
- Dataset: 50M records, average raw row ~250 bytes (uncompressed ≈ 12.5 GB)
- Compression (ZSTD(3)): stored ~2.0–2.5 GB (≈ 5–6×)
- Ingest throughput (single-node, 8 concurrent writers, batch size 10k): peak ~300–350k rows/sec (≈ 75–90 MB/s raw)
- Ingest throughput (3-node cluster, Kafka pipeline): sustained ~220k rows/sec total
- Query latency: COUNT last 1h ≈ 60–120 ms; GROUP BY tag top-10 ≈ 150–300 ms
Structured JSON (product listings)
- Dataset: 30M records, avg raw row ~800 bytes (uncompressed ≈ 24 GB)
- Compression (ZSTD(3) with typed columns): stored ~4.5–6.0 GB (≈ 4–6×)
- Ingest throughput (single-node, 5k batch): ~80–110k rows/sec
- Cluster (3-node): ~65–80k rows/sec total (depending on MV/replication overhead)
- Query latency: top-k price by category ≈ 120–260 ms
HTML pages (full blobs)
- Dataset: 5M pages, avg raw page ~5 KB (uncompressed ≈ 25 GB)
- Compression (ZSTD(9) on body column): stored ~2.2–3.2 GB (≈ 8–12× on body text)
- Ingest throughput (single-node, batch 1k rows): ~5k–8k rows/sec (≈ 25–40 MB/s raw)
- HTML parsing and extraction (DOM cleanup, CSS selector extraction) is CPU-bound; parallelize parsers.
- Query latency: aggregations on extracted metadata ≈ 150–400 ms; naive full-text scans on body ≈ multiple seconds — use specialized full-text / vector search if you need low-latency search.
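For reference, the latency figures above come from simple analytic shapes; the harness in the repo is authoritative, but the queries look roughly like the following sketches against the scraped_events table defined earlier.
-- count over the last hour (telemetry-style)
SELECT count() FROM scraped_events WHERE ts >= now() - INTERVAL 1 HOUR;
-- top-10 domains by volume over the last day
SELECT domain, count() AS hits
FROM scraped_events
WHERE ts >= now() - INTERVAL 1 DAY
GROUP BY domain
ORDER BY hits DESC
LIMIT 10;
-- top price per category (structured-JSON workload)
SELECT category, max(price) AS top_price
FROM scraped_events
WHERE ts >= today()
GROUP BY category
ORDER BY top_price DESC
LIMIT 10;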
Real-world takeaway: ClickHouse excels when you extract structured columns for analytics and keep raw blobs compressed for archival or NLP; raw-text scanning is expensive.
Compression tradeoffs and codec recommendations (2026)
Codec choice is a tradeoff between write throughput, read latency, and storage cost. In 2026, ZSTD improvements and hardware-accelerated compression make it the go-to for scraped text. Key recommendations:
- ZSTD(3) — balanced: good for numeric and structured fields where read speed matters.
- ZSTD(9) — for large HTML blobs where storage cost dominates and reads are infrequent.
- LZ4 — use for extremely latency-sensitive reads (small columns used in hot queries).
- Use per-column CODEC(...) to tune storage by column; store raw JSON/HTML with aggressive ZSTD but keep analytic columns on faster codecs.
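A quick way to check whether a codec choice is paying off is to inspect per-column compression in system.columns; a minimal sketch, assuming the scraped_events table lives in the default database:
SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS on_disk,
    formatReadableSize(data_uncompressed_bytes) AS raw,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = 'default' AND table = 'scraped_events'
ORDER BY data_compressed_bytes DESC;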
Query latency: schema & index tips to stay under 200–300ms
To consistently get sub-second analytics for scraped datasets:
- ORDER BY matters: Order by (domain, ts) for domain-scoped recent queries. Use compound ORDER BY to enable fast group-by on a subset of columns.
- Partitioning: Partition by day (to allow quick TTL deletes and partition-level pruning).
- Use LowCardinality where it fits: For repetitive strings such as domain, category, or user agent, LowCardinality(String) reduces memory usage in aggregations; avoid it for genuinely high-cardinality columns like full URLs, where the dictionary overhead hurts.
- Pre-aggregate if you can: Use Materialized Views to maintain per-minute or per-domain aggregates to speed dashboards.
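A minimal sketch of that pre-aggregation pattern, assuming the scraped_events table above and a per-minute, per-domain hit count (the rollup table and view names are illustrative):
CREATE TABLE domain_minute_counts
(
    minute DateTime,
    domain String,
    hits UInt64
)
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMMDD(minute)
ORDER BY (domain, minute);
CREATE MATERIALIZED VIEW domain_minute_counts_mv TO domain_minute_counts
AS SELECT
    toStartOfMinute(ts) AS minute,
    domain,
    count() AS hits
FROM scraped_events
GROUP BY minute, domain;
Dashboards should still query sum(hits), because SummingMergeTree collapses rows only when parts merge.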
Architecture guidance: streaming vs batched
Pick based on your scrape pattern and SLAs:
- Batched inserts — best for nightly crawls or when you control the batching layer; easier to tune for peak throughput.
- Streaming (Kafka/Buffer + Materialized View) — ideal for continuous scraping with bursts and for keeping producers decoupled. Handles backpressure and partial failures cleanly.
- Direct HTTP via JSONEachRow — simplest but fragile under spikes; pair with a buffering proxy if you expect bursts.
Why a Buffer/MV pipeline is durable
Buffer Engine + Materialized View decouples producers from disk writes. In our cluster tests, the Kafka → Buffer → MV pipeline sustained steady writes with stable latency under producer spikes, while direct HTTP pushes caused timeouts unless producers backed off.
Advanced strategies for scraped pipelines
- Deduplication: Use ReplacingMergeTree or an explicit dedup key to avoid duplicates from re-crawls. Key on a hash of the URL (or URL plus content hash) and use the crawl timestamp as the version column so the latest record wins; a sketch follows after this list. See guidance on ethical data pipelines when designing deduplication and lineage.
- TTL for storage lifecycle: Keep the raw body for 30–90 days and the extracted fields indefinitely. Use TTL clauses to delete blobs or move them to object storage (also shown in the sketch below); re-evaluate as storage economics change.
- Hybrid search: Store vectors/embeddings in a specialized engine (e.g., Milvus, Faiss) and keep JOIN keys in ClickHouse for analytics — in 2026 ClickHouse’s native vector functions exist but specialized vector stores still beat it for heavy ANN workloads.
- Backpressure & rate limits: Add a producer-side token bucket and server-side buffering (Kafka) to absorb anti-scraping induced bursts when retries happen — this is essential for resilient scraper pipelines.
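Here is a minimal sketch combining the deduplication and TTL points above, assuming a URL-hash dedup key, a crawl-timestamp version, and the 30–90 day blob retention mentioned earlier (table and column names are illustrative):
CREATE TABLE scraped_pages
(
    ts DateTime64(3, 'UTC'),
    url_hash UInt64,      -- e.g. cityHash64(url): the dedup key
    domain String CODEC(ZSTD(3)),
    url String CODEC(ZSTD(3)),
    title String CODEC(ZSTD(5)),
    body String CODEC(ZSTD(9)) TTL toDateTime(ts) + INTERVAL 90 DAY,  -- raw blob cleared after 90 days
    crawl_version UInt64  -- e.g. toUnixTimestamp(ts): the highest version wins when parts merge
)
ENGINE = ReplacingMergeTree(crawl_version)
PARTITION BY toYYYYMMDD(ts)
ORDER BY (domain, url_hash);
Note that ReplacingMergeTree deduplicates only at merge time and only within a partition, so queries needing strict latest-version semantics should use FINAL or an argMax aggregation; a table-level TTL ... TO VOLUME clause is the option for moving (rather than clearing) old data if you tier to object storage.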
Reproducible benchmark scripts — how to run locally
Quick start (local Docker):
- Clone the repo: git clone <repo-url> (scripts and data generators included).
- Start a local ClickHouse: docker-compose up -d
- Generate data: python3 gen_telemetry.py --rows 5000000 --out telemetry.json
- Run the harness: ./bench/run_ingest.sh --target=http --concurrency=8 --batch=10000 --file=telemetry.json
- Run query latency: ./bench/run_queries.sh
The repo contains full DDLs, ClickHouse settings tuned for ingestion (max_insert_block_size, max_threads), and scripts to measure on-disk size and validation checksums.
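Those ingestion settings can also be applied per statement rather than server-wide; a minimal sketch (the value is illustrative, not the repo's tuned default), runnable from the interactive clickhouse-client shell where data rows follow the FORMAT clause:
INSERT INTO scraped_events
SETTINGS max_insert_block_size = 1048576
FORMAT JSONEachRow
{"ts":"2026-01-15 12:00:00.000","domain":"example.com","url":"https://example.com/p/1","http_status":200,"title":"Example","body":"<html>...</html>","price":19.99,"category":"demo"}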
2026 trends that change how you should build scrapers + ClickHouse pipelines
- ClickHouse maturity & investment: Following large investments in 2025 and 2026, ClickHouse continues to broaden its integrations, optimize codecs, and expand cloud-first managed offerings; expect better autoscaling and cloud caching primitives.
- Vector & LLM workflows: In 2026, many teams combine ClickHouse analytics with vector stores for embeddings. Keep schemas that can attach vector IDs and metadata for hybrid search + analytics.
- Improved compression primitives: New ZSTD variants and hardware compression offload are becoming common in cloud instances — rebenchmark codecs per environment and track storage trends.
- Data contracts & observability: Expect stricter compliance and observability requirements around scraped PII; bake schema validation and lineage into ingestion (Materialized View layer is a good place).
Practical checklist before you move to production
- Extract typed fields and only scan raw blobs when necessary.
- Tune per-column CODEC: hot columns on LZ4, large blobs on ZSTD(9).
- Use Kafka or Buffer+MV for decoupled ingestion and backpressure handling.
- Partition by day and choose ORDER BY for your most common query patterns (domain, ts).
- Run regular compression/merge monitoring and tune merge concurrency to avoid IO saturation.
- Implement deduplication and TTL lifecycle for blobs to control storage costs.
Limitations & caveats
Benchmarks are only as useful as your workload is similar to the test. Key caveats:
- Network and disk throughput are common bottlenecks — run your own tests if your cluster uses remote object storage instead of local NVMe.
- ClickHouse versions and kernel optimizations in 2026 continue to improve; re-run these scripts when you upgrade major versions.
- Anti-scraping retries / random throttling change effective ingest patterns — build producers to back off and persist to Kafka to survive rate limits.
Actionable takeaways — what to do tomorrow
- Clone the repo, run the local Docker compose benchmark to baseline your environment.
- Observe the differences when you switch codecs or ORDER BY keys.
- If you need continuous ingestion with scaling, use Kafka + Buffer + Materialized View architecture as your baseline.
- Extract and type fields for analytics; keep raw blobs compressed and short-lived (TTL).
- Use per-column CODEC tuning: ZSTD for blobs, LZ4 for hot aggregates.
Where to go next (reproducible materials)
The repository bundled with this article contains:
- Docker Compose ClickHouse config (single-node and 3-node recipes)
- Data generators: gen_telemetry.py, gen_products.py, gen_html.py
- Ingest harness: run_ingest.sh (concurrent writers), and run_queries.sh (latency tests)
- Sample DDLs for MergeTree, ReplacingMergeTree, Buffer + MV patterns
Final thoughts and 2026 prediction
As ClickHouse continues to mature and get heavier investment in 2025–2026, it becomes an increasingly pragmatic choice for large-scale scraped analytics. The biggest wins come from schema discipline: extract and type, compress aggressively for raw blobs, and use streaming patterns to handle bursts. For full-text search or dense ANN vector workloads, expect hybrid architectures (ClickHouse for analytics + specialized search for retrieval) to be the norm in 2026.
Call to action
Ready to reproduce these benchmarks in your environment? Clone the repo, run the Docker quickstart, and compare codec and ORDER BY choices against your real scraped data. If you want a hand benchmarking at scale (cloud clusters, autoscaling, or hybrid vector+analytics pipelines) contact our engineering team for a reproducible benchmarking engagement.
Run the scripts, tune per-column codecs, and choose a streaming architecture if you need durability under anti-scraping spikes.