What is ClickHouse? A Developer's Guide to Column-Oriented Analytics

Learn what ClickHouse is, how columnar storage works, and why it's the fastest open-source analytics database. A practical introduction with real-world use cases.

February 17, 2026

ClickHouse is an open-source, column-oriented database management system designed for online analytical processing (OLAP). It was originally built at Yandex to power Yandex.Metrica, one of the largest web analytics platforms in the world, processing over 20 billion events per day. It was open-sourced in 2016 and has since become one of the most widely adopted analytical databases, used by Cloudflare, Uber, eBay, Deutsche Bank, and hundreds of other organizations that need to run analytical queries over massive datasets in real time.

The short version: if you need to aggregate, filter, or analyze billions of rows and get results in milliseconds to low seconds, ClickHouse is purpose-built for that.

How Columnar Storage Works

To understand why ClickHouse is fast, you first need to understand how it stores data differently from traditional databases.

Row-Oriented vs. Column-Oriented Storage

Most databases you have worked with -- PostgreSQL, MySQL, SQLite -- are row-oriented. They store each row of a table contiguously on disk. When you insert a row, all columns for that row are written together in a single block.

A column-oriented database like ClickHouse does the opposite: it stores each column separately. All values for timestamp are stored together, all values for user_id are stored together, and so on.

Here is a concrete example. Suppose you have an events table with four columns:

| timestamp           | user_id | event_type  | duration_ms |
|---------------------|---------|-------------|-------------|
| 2026-02-01 10:00:00 | 1001    | page_view   | 320         |
| 2026-02-01 10:00:01 | 1002    | click       | 45          |
| 2026-02-01 10:00:02 | 1001    | page_view   | 580         |
| 2026-02-01 10:00:03 | 1003    | purchase    | 1200        |

Row-oriented storage (PostgreSQL) writes this to disk roughly as:

[2026-02-01 10:00:00, 1001, page_view, 320]
[2026-02-01 10:00:01, 1002, click, 45]
[2026-02-01 10:00:02, 1001, page_view, 580]
[2026-02-01 10:00:03, 1003, purchase, 1200]

Column-oriented storage (ClickHouse) writes this as:

timestamp:    [2026-02-01 10:00:00, 2026-02-01 10:00:01, 2026-02-01 10:00:02, 2026-02-01 10:00:03]
user_id:      [1001, 1002, 1001, 1003]
event_type:   [page_view, click, page_view, purchase]
duration_ms:  [320, 45, 580, 1200]

Why This Matters for Analytical Queries

Now consider a typical analytical query:

SELECT event_type, AVG(duration_ms)
FROM events
WHERE timestamp >= '2026-02-01'
GROUP BY event_type;

In a row-oriented database, the engine must read every complete row from disk -- all four columns -- even though it only needs timestamp, event_type, and duration_ms. The user_id column is loaded into memory and immediately discarded. With a table that has 50 columns and hundreds of millions of rows, a query touching three columns reads roughly 17x more data than it needs.

In a column-oriented database, the engine reads only the three columns it needs; user_id is skipped entirely and never read from disk. For wide tables with many columns, this reduces I/O by an order of magnitude.

This is the fundamental reason columnar databases dominate analytical workloads. Analytical queries typically touch a few columns across many rows. Columnar storage means you only pay for the columns you actually use.
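The arithmetic behind this is easy to sketch. Here is a back-of-the-envelope model in Python (with assumed sizes -- a fixed 8 bytes per value, not ClickHouse internals) of how much data each layout reads for a query touching three of fifty columns:

```python
# Back-of-the-envelope model (assumed sizes, not ClickHouse internals):
# bytes read for a query that needs 3 columns out of a 50-column table.

NUM_ROWS = 100_000_000       # "hundreds of millions of rows"
NUM_COLUMNS = 50
BYTES_PER_VALUE = 8          # assume fixed-width 8-byte values throughout
NEEDED_COLUMNS = 3           # timestamp, event_type, duration_ms

# Row-oriented: every row is read in full, all 50 columns.
row_store_bytes = NUM_ROWS * NUM_COLUMNS * BYTES_PER_VALUE

# Column-oriented: only the 3 referenced columns are read.
column_store_bytes = NUM_ROWS * NEEDED_COLUMNS * BYTES_PER_VALUE

print(f"row store reads:    {row_store_bytes / 1e9:,.0f} GB")
print(f"column store reads: {column_store_bytes / 1e9:,.1f} GB")
print(f"reduction:          {row_store_bytes / column_store_bytes:.1f}x")
```

Compression (covered below) widens the gap further, since the skipped columns never even need to be decompressed.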

Why ClickHouse is Fast

Columnar storage is the foundation, but ClickHouse adds several layers of optimization on top of it.

Vectorized Query Execution

Traditional databases process data row by row. ClickHouse processes data in vectors -- batches of column values sized to fit in CPU caches. Instead of evaluating a WHERE clause once per row, with all the per-row interpretation overhead that implies, it evaluates it over an entire batch in a tight loop. This approach leverages SIMD (Single Instruction, Multiple Data) instructions on modern processors, which perform the same operation on multiple data points in a single CPU operation.

In practice, vectorized execution means ClickHouse can process 1-2 billion rows per second per core for simple aggregations. A 12-core machine can scan 10+ billion rows per second.
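A minimal sketch of the execution model in Python (no real SIMD here -- just block-at-a-time evaluation instead of row-at-a-time, with the filter producing a selection mask the next operator consumes):

```python
# Conceptual sketch of block-at-a-time execution. A real engine runs these
# loops over contiguous typed arrays with SIMD; the structure is the same:
# each operator consumes a whole block, not one row at a time.

BLOCK_SIZE = 8192  # ClickHouse's default granule size

def filter_block(values, threshold):
    """Evaluate `value >= threshold` over the whole block in one pass,
    producing a selection mask."""
    return [v >= threshold for v in values]

def sum_selected(values, mask):
    """Aggregate only the values the mask selected."""
    return sum(v for v, keep in zip(values, mask) if keep)

durations = [320, 45, 580, 1200]
mask = filter_block(durations, 100)
print(sum_selected(durations, mask))  # 320 + 580 + 1200 = 2100
```

The key property is that each operator's inner loop runs over a contiguous block of one column's values, which is exactly the shape SIMD hardware wants.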

Aggressive Compression

Because each column is stored separately, all values in a column block have the same data type. An array of integers compresses far better than a mixed row of integers, strings, timestamps, and floats. ClickHouse uses LZ4 compression by default (fast decompression) and supports ZSTD (better compression ratios). Typical compression ratios range from 5:1 to 20:1 depending on the data.

Better compression means less data to read from disk, which directly translates to faster queries. A table that takes 100 GB uncompressed might occupy 10-15 GB on disk. The query engine reads 10 GB instead of 100 GB and decompresses in-memory at speeds close to memory bandwidth.

Data Skipping with Primary Keys

ClickHouse tables are sorted by their primary key (defined via ORDER BY in the table definition). The engine maintains a sparse index with one entry per granule (a block of 8,192 rows by default) recording the primary key values at the granule's boundary. Because the data is sorted, a query that filters on the primary key columns lets ClickHouse consult the sparse index and skip entire granules that cannot contain matching rows.

For a time-series table ordered by (timestamp, user_id), a query filtering on a specific hour only reads the granules that fall within that hour. The rest of the table is never touched. This is not a traditional B-tree index -- it is much cheaper to maintain and works well with the sequential access patterns of analytical workloads.

Parallel and Distributed Execution

ClickHouse parallelizes query execution across all available CPU cores on a single node. In a cluster, queries can also be distributed across multiple shards, with each shard processing its portion of the data in parallel. The results are merged and returned to the client. This scales linearly with the number of nodes for most query types.

Real-World Use Cases

ClickHouse is not a general-purpose database. It excels at specific categories of workloads.

Product Analytics and Event Tracking

Tracking user events -- page views, clicks, feature usage, conversions -- is the most common ClickHouse use case. Event data is naturally append-only (events are immutable once recorded), arrives in high volume, and is queried with aggregations over time ranges. ClickHouse handles this pattern exceptionally well. Companies like PostHog, Plausible Analytics, and OpenPanel use ClickHouse as their primary analytics store.

Log Storage and Observability

ClickHouse is increasingly replacing Elasticsearch for log storage. Structured logs with fields like timestamp, level, service, message, and trace_id map well to columnar storage. Query performance for time-range filtering and aggregation is typically much better than Elasticsearch, and storage costs are substantially lower due to compression. ClickHouse does not have full-text search with relevance scoring, but for structured log filtering -- which is what most teams actually do -- it is faster and cheaper.

Time-Series Data

Metrics, sensor data, financial tick data, IoT telemetry -- any data indexed primarily by time with numeric measurements fits naturally in ClickHouse. The ORDER BY primary key is typically timestamp-first, which gives excellent query performance for time-range scans. Materialized views can pre-aggregate data into rollup tables (per-minute, per-hour) to accelerate dashboard queries.

Real-Time Dashboards and Reporting

Business intelligence dashboards that need sub-second response times over large datasets are a natural fit. ClickHouse can power customer-facing analytics dashboards where each user queries their own slice of a shared dataset. The combination of columnar storage, compression, and data skipping makes these queries fast enough for interactive use.

Ad Tech and Clickstream Analysis

Ad impressions, bid requests, and click tracking generate massive data volumes with straightforward analytical query patterns. ClickHouse's ability to ingest millions of rows per second and query them immediately makes it well-suited for real-time bidding analytics and attribution modeling.

When NOT to Use ClickHouse

ClickHouse has clear limitations, and using it for the wrong workload will lead to frustration.

Transactional Workloads (OLTP)

ClickHouse is not a replacement for PostgreSQL or MySQL for transactional data. It does not support full ACID transactions across multiple rows. It does not have row-level locking. If your application needs to read a single user record, update it, and write it back within a transaction, use a row-oriented database.

Frequent Updates and Deletes

ClickHouse is designed for append-only or append-mostly workloads. It does support ALTER TABLE ... UPDATE and ALTER TABLE ... DELETE, but these are heavy operations internally -- they rewrite entire data parts in the background (called mutations). Running frequent updates or deletes on ClickHouse will degrade performance significantly. If your data model requires frequent row-level modifications, ClickHouse is the wrong tool.

Point Lookups by Primary Key

Looking up a single row by its ID is not what ClickHouse is optimized for. While it can do it, the latency will be higher than a row-oriented database with a B-tree index. If your primary access pattern is "fetch one record by ID," use PostgreSQL or a key-value store.

Small Datasets

If your dataset is under a few million rows, ClickHouse is overkill. PostgreSQL handles analytical queries over small-to-medium datasets with adequate performance and gives you transactions, joins, and a richer ecosystem of tooling. ClickHouse's advantages become apparent at hundreds of millions to billions of rows.

Complex Joins Across Many Tables

ClickHouse supports joins, but it is not optimized for complex multi-table joins the way PostgreSQL is. Large join operations can consume significant memory. The recommended pattern in ClickHouse is to denormalize your data at insert time -- flatten your data model so that analytical queries can run against a single wide table rather than joining multiple normalized tables.

The ClickHouse Ecosystem

ClickHouse is more than a single binary. A production deployment typically involves several components.

ClickHouse Keeper

ClickHouse Keeper is a distributed coordination service that manages replica metadata and replication queues. It is ClickHouse's built-in replacement for Apache ZooKeeper. Keeper uses the Raft consensus protocol and ships as part of the ClickHouse distribution. A Keeper ensemble requires an odd number of nodes (typically 3) to maintain quorum. It is lightweight and can run alongside ClickHouse on the same nodes.

CHProxy

CHProxy is an HTTP proxy and load balancer for ClickHouse. It sits in front of your ClickHouse nodes and handles request routing, connection pooling, per-user rate limiting, and health-check-based failover. Your application connects to a single CHProxy endpoint instead of managing connections to individual ClickHouse nodes, which is a common way to expose a ClickHouse cluster to applications.

Materialized Views

Materialized views in ClickHouse are not what you might expect from PostgreSQL. In ClickHouse, a materialized view is a trigger that transforms data as it is inserted and writes the result to a separate target table. They run incrementally -- each batch of inserted rows is processed through the view and appended to the target. This makes them ideal for maintaining pre-aggregated rollup tables. For example, you can have a raw events table and a materialized view that aggregates events into per-hour counts, keeping your dashboard queries fast without scanning raw data.

clickhouse-backup

Altinity's clickhouse-backup tool handles backup and restore operations, supporting local storage and remote destinations like S3, GCS, and Azure Blob Storage. It integrates with ClickHouse's native snapshot mechanism to create consistent backups without locking tables.

Getting Started

The fastest way to experiment with ClickHouse is to run it locally:

# Run ClickHouse in Docker
docker run -d --name clickhouse \
  -p 8123:8123 -p 9000:9000 \
  clickhouse/clickhouse-server

# Connect with the client
docker exec -it clickhouse clickhouse-client

Once connected, create a table and insert some data:

CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_type LowCardinality(String),
    duration_ms UInt32
) ENGINE = MergeTree()
ORDER BY (timestamp, user_id);

INSERT INTO events VALUES
    ('2026-02-01 10:00:00', 1001, 'page_view', 320),
    ('2026-02-01 10:00:01', 1002, 'click', 45),
    ('2026-02-01 10:00:02', 1001, 'page_view', 580),
    ('2026-02-01 10:00:03', 1003, 'purchase', 1200);

SELECT event_type, count(), avg(duration_ms)
FROM events
GROUP BY event_type;

Notice LowCardinality(String) for event_type -- this is a ClickHouse-specific optimization that dictionary-encodes string columns with a small number of distinct values, significantly improving both compression and query speed.

For a single-node development setup, MergeTree() is sufficient. For production with replication, you would use ReplicatedMergeTree with a ClickHouse Keeper ensemble.

Deploying a Production Cluster with sshploy

Going from a single Docker container to a production cluster involves configuring ClickHouse Keeper for coordination, setting up replication across nodes, deploying CHProxy for load balancing, configuring backups, and setting up monitoring -- roughly 500 lines of configuration spread across multiple nodes. sshploy automates this entire process. You connect your servers via SSH, choose your cluster topology (shard count, replica count, Keeper placement), and sshploy runs the Ansible playbooks that provision everything with production-safe defaults. The full cluster, including CHProxy and monitoring, deploys in under 10 minutes.

FAQ

How does ClickHouse compare to PostgreSQL for analytics?

PostgreSQL is a general-purpose database that can handle moderate analytical queries. For datasets under 50-100 million rows with simple aggregations, PostgreSQL with proper indexes may be fast enough. Beyond that scale, ClickHouse is typically 10-100x faster for analytical queries due to columnar storage, vectorized execution, and compression. The two databases serve different roles: PostgreSQL for transactional data and complex relational queries, ClickHouse for high-volume analytical workloads. Many teams run both -- PostgreSQL as the source of truth for transactional data, with events and analytics flowing into ClickHouse.

How much data can a single ClickHouse node handle?

A single ClickHouse node on commodity hardware (12 cores, 64 GB RAM, NVMe storage) can comfortably handle 1-10 TB of compressed data and query billions of rows per second for typical aggregations. The practical limit depends on your query patterns and data shape more than raw data volume. When a single node is no longer sufficient, you add more nodes and shard your data across them.

Is ClickHouse difficult to operate in production?

ClickHouse itself is relatively stable and low-maintenance compared to some distributed systems. The main operational concerns are monitoring replication lag, ensuring Keeper quorum is healthy, managing disk space (especially during merge operations that temporarily increase disk usage), and handling schema migrations carefully in a replicated environment. If you are coming from PostgreSQL, the biggest adjustment is the different data modeling approach -- denormalization over joins, append-only patterns over updates.

Can ClickHouse replace my data warehouse?

For many teams, yes. ClickHouse can serve as a real-time data warehouse for event data, logs, and metrics. It supports SQL (with some extensions and limitations), integrates with common BI tools like Grafana, Metabase, and Superset, and can ingest data from Kafka, S3, and other sources. Where it falls short compared to traditional data warehouses like BigQuery or Snowflake is in complex multi-table join performance and managed ecosystem tooling. If your analytical workload is primarily aggregations over event-style data, ClickHouse is a strong and cost-effective choice.

What is the difference between ClickHouse and Apache Druid?

Both are column-oriented analytical databases, but they differ in architecture and trade-offs. ClickHouse uses a shared-nothing architecture where each node stores its data locally, while Druid separates compute and storage with deep storage in S3 or HDFS. ClickHouse generally offers faster query performance on a single node and simpler operations. Druid has better native support for real-time ingestion from streaming sources like Kafka and more granular data retention policies. For most teams that do not already have a Druid deployment, ClickHouse is the simpler and more performant choice.

Ready to deploy?

Skip the manual setup. sshploy handles the entire deployment for you.

Deploy a ClickHouse Cluster