How to Deploy a Database in Production: The Complete Guide

Everything you need to know about deploying databases in production. Covers architecture decisions, high availability, backups, monitoring, security, and choosing between managed and self-hosted.

February 17, 2026

Deploying a database to production is one of those tasks where the gap between "it works" and "it works reliably" is enormous. A database running on a single VPS with no backups and no monitoring is technically "in production" — until it isn't. This guide covers what production-ready actually means and walks through every layer you need to think about before trusting a database with real data.

This is a broad guide. It covers the decisions and patterns that apply regardless of which database engine you choose. If you want database-specific setup instructions, see the related guides linked at the bottom.


What "Production-Ready" Actually Means

The term gets thrown around loosely, so it helps to define it concretely. A production-ready database deployment has the following properties:

Durability: Data written to the database survives process crashes, machine reboots, and disk failures. This means durable write modes are enabled (e.g., fsync is on in PostgreSQL), and data is replicated to at least one other physical location.

Availability: The database can serve reads and writes even when individual components fail. This requires redundancy — at minimum a primary/replica pair, ideally an automatic failover mechanism.

Recoverability: If the worst happens — corrupted data, accidental deletion, a botched migration — you can restore to a known good state. This requires working, tested backups.

Observability: You know what the database is doing. Slow queries are visible. Replication lag is measured. Disk usage is tracked. You get alerted before problems become outages.

Security: The database is not accessible to the open internet. Connections are encrypted. Credentials are not hardcoded. Superuser access is restricted.

If any one of these is missing, the deployment is not production-ready, regardless of how well everything else is configured.
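In PostgreSQL terms, the durability property maps to a handful of settings. A sketch of the relevant postgresql.conf lines — these are the defaults in recent versions, listed so they can be verified rather than assumed:

```ini
# postgresql.conf — durability-related settings (defaults shown)
fsync = on                # flush WAL to disk on commit; never disable in production
synchronous_commit = on   # wait for the WAL flush before acknowledging a commit
full_page_writes = on     # protects against torn pages after a crash
wal_level = replica       # required for streaming replication and WAL archiving
```

Replication to another physical location is the second half of durability; the settings above only cover a single node's disks.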


Single Node vs. Cluster

The simplest production setup is a single database instance on a dedicated server. This is a reasonable starting point for small applications, internal tools, or workloads where brief downtime is acceptable.

The honest limitations of a single-node setup:

  • Any hardware failure takes the database offline
  • Planned maintenance (OS updates, database upgrades) requires downtime
  • Backups cannot be taken without impacting performance unless you snapshot at the storage layer
  • Reads cannot be distributed

A cluster adds redundancy and, depending on the architecture, eliminates single points of failure. The trade-off is complexity: more nodes to configure, a consensus mechanism or replication protocol to understand, and more things that can fail or drift out of sync.

The threshold where a cluster becomes necessary varies by application, but a reasonable rule of thumb: if more than a few minutes of database unavailability per month would cause meaningful business impact, a single node is not sufficient.


High Availability Patterns

High availability (HA) means the system continues operating when individual components fail. For databases, this typically involves:

Primary/Replica Replication

The most common pattern. One node accepts writes (the primary), and one or more replicas receive a continuous stream of changes. If the primary fails, a replica can be promoted.

The challenge is automation. Detecting that a primary has failed (versus is just slow), selecting a replica to promote, and updating connection routing all need to happen reliably and quickly. Manual failover is error-prone under pressure. Automated failover requires a quorum mechanism to avoid split-brain scenarios.

Patroni is the standard solution for PostgreSQL HA. It uses etcd, Consul, or ZooKeeper as a distributed consensus store to coordinate leader election, handles automatic failover, and integrates with HAProxy or pgBouncer for connection routing.
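To make the moving parts concrete, here is a minimal sketch of a single node's Patroni configuration. Host addresses, the cluster name, and the data directory are placeholders; a real config also needs authentication and bootstrap settings:

```yaml
# patroni.yml (one node) — illustrative sketch, not a complete config
scope: pg-cluster            # cluster name, shared by all nodes
name: pg1                    # unique per node

etcd3:
  hosts: 10.0.1.11:2379,10.0.1.12:2379,10.0.1.13:2379

restapi:
  listen: 0.0.0.0:8008       # health endpoints a load balancer can poll
  connect_address: 10.0.1.11:8008

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.11:5432
  data_dir: /var/lib/postgresql/16/main
```

Each node runs the same config with its own name and addresses; leader election happens through the etcd cluster.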

Multi-Master / Active-Active

Multiple nodes accept writes simultaneously. Conflicts are either prevented (by sharding write access) or resolved (by the database engine). This is significantly more complex and only warranted for specific workloads. For most OLTP applications, it introduces more problems than it solves.

Read Replicas

Not a true HA mechanism, but a scaling pattern worth including here. Read replicas offload SELECT queries from the primary, reducing contention and extending the life of a single-primary setup. They can also serve as warm standbys for failover.

Quorum and Fencing

Any automated failover system needs to handle the scenario where the primary is not dead but merely unreachable from the nodes doing the monitoring — a network partition. Without fencing (also called STONITH: Shoot The Other Node In The Head), you can end up with two nodes both believing they are the primary and accepting writes. The result is a split-brain state that requires manual reconciliation.

Patroni handles this with distributed locks in the DCS (Distributed Configuration Store). Other systems use STONITH agents that can physically power off a node via IPMI or a cloud API.


Backup Strategies

Backups are not optional. They are also not enough on their own — a backup you have never tested restoring from is not a backup, it is a file.

Full Backups

A complete copy of the database at a point in time. Simple to reason about, straightforward to restore from. The downside is size and time — for large databases, full backups are expensive to take and store.

pg_dump and pg_dumpall are the standard PostgreSQL tools. For MySQL/MariaDB, mysqldump or Percona XtraBackup are common. For consistent filesystem-level backups, stop writes or use snapshot-capable storage (e.g., LVM snapshots, ZFS, cloud block storage snapshots).
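As a minimal sketch of a nightly logical backup with simple retention — the backup directory and connection URL are assumptions for illustration, and a real setup would run this from cron or a systemd timer:

```shell
# Sketch: timestamped logical backup with 30-day retention.
# BACKUP_DIR and the connection URL are assumptions for illustration.
BACKUP_DIR=${BACKUP_DIR:-./pg_backups}
mkdir -p "$BACKUP_DIR"
STAMP=$(date +%Y%m%d_%H%M%S)
DUMP_FILE="$BACKUP_DIR/appdb_${STAMP}.dump"

# Custom format (-Fc) is compressed and restorable selectively with pg_restore
if command -v pg_dump >/dev/null 2>&1; then
  pg_dump -Fc -f "$DUMP_FILE" "${DATABASE_URL:-postgresql://localhost/appdb}" \
    || echo "backup failed; check connectivity" >&2
fi

# Prune dumps older than 30 days
find "$BACKUP_DIR" -name '*.dump' -mtime +30 -delete
```

Note that a dump taken this way is consistent as of the moment pg_dump started, but restoring it loses everything written after that moment — which is why WAL archiving (below) matters.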

Incremental Backups

Only the data changed since the last backup is stored. Much smaller and faster than full backups after the first run. Restoring requires applying the incremental changes on top of a base backup, which takes longer and adds complexity.

pgBackRest is the standard tool for PostgreSQL incremental backups. It supports parallel backup/restore, compression, and multiple storage backends (S3, GCS, Azure Blob Storage, SFTP).
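A sketch of what a pgBackRest configuration pointing at S3 might look like — bucket, region, and paths are placeholders, and credentials are omitted:

```ini
# /etc/pgbackrest/pgbackrest.conf — illustrative sketch
[global]
repo1-type=s3
repo1-s3-bucket=example-db-backups
repo1-s3-region=eu-central-1
repo1-s3-endpoint=s3.eu-central-1.amazonaws.com
repo1-path=/pgbackrest
repo1-retention-full=2        # keep two full backups plus their dependent incrementals

[main]                        # the "stanza": one per PostgreSQL cluster
pg1-path=/var/lib/postgresql/16/main
```

With this in place, `pgbackrest --stanza=main backup --type=incr` takes an incremental on top of the most recent full backup.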

Point-in-Time Recovery (PITR)

PITR allows you to restore a database to any specific moment in time, not just to the point of the last backup. This is the mechanism that saves you when someone runs DELETE FROM orders without a WHERE clause at 3pm and you need to roll back to 2:59pm.

PITR works by combining a base backup with a continuous archive of Write-Ahead Log (WAL) segments. PostgreSQL writes every change to the WAL before applying it to data files. By archiving WAL continuously and keeping a base backup, you can replay history to any point within the archived range.

pgBackRest, Barman, and WAL-G all support PITR for PostgreSQL.
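Continuing the 3pm example: with pgBackRest and PostgreSQL 12+, rolling back to 2:59pm roughly means restoring the base backup and then pointing recovery at a target time. A sketch, with the stanza name and timestamp as placeholders:

```ini
# postgresql.conf on the restored instance — PITR target sketch
restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'
recovery_target_time = '2026-02-17 14:59:00'
recovery_target_action = 'promote'   # exit recovery and accept writes at the target
```

PostgreSQL enters recovery mode when a `recovery.signal` file exists in the data directory; it then replays archived WAL up to the target time and stops.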

Backup Retention and Storage

Keep backups off the database server itself. A server failure that takes the database offline will likely also take local backups offline. Store backups in object storage (S3, Backblaze B2, etc.) or on a separate backup server.

Retention policy depends on regulatory requirements and RPO (Recovery Point Objective). A common baseline: daily backups retained for 30 days, with WAL archives enabling PITR within that window.

Test your restores. Schedule a monthly or quarterly restore drill to a separate environment. Verify row counts, spot-check data, and measure how long the restore takes.


Monitoring and Alerting

You cannot manage what you cannot measure. At minimum, track:

Database-Level Metrics

  • Connections: Active connections, idle connections, connections waiting. Near-limit connection counts are a leading indicator of application trouble.
  • Query performance: Slow query logs, average query latency, p95/p99 latency. PostgreSQL's pg_stat_statements extension is essential.
  • Replication lag: How far behind are replicas? A growing lag means the replica may not be usable for failover or may be serving stale reads.
  • Lock contention: Long-running locks block writes. pg_locks and pg_stat_activity expose this.
  • Cache hit rate: Buffer pool / shared buffer hit rate. A low hit rate suggests the working set does not fit in memory.
  • Vacuum and autovacuum activity (PostgreSQL): Dead tuple accumulation and table bloat can degrade performance significantly if autovacuum is not keeping up.
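Several of these metrics can be checked ad hoc with plain SQL before any exporter is in place. For example, on PostgreSQL 10 or later:

```sql
-- Connections by state
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;

-- Replication lag per replica, in bytes, measured on the primary
SELECT client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;

-- Cache hit rate across all databases
SELECT sum(blks_hit)::float / nullif(sum(blks_hit) + sum(blks_read), 0) AS hit_rate
FROM pg_stat_database;
```

An exporter automates exactly these kinds of queries and ships the results as time series.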

Host-Level Metrics

  • Disk usage and I/O wait
  • CPU and memory utilization
  • Network throughput

Alerting Thresholds

Define alerts with some care. Alert fatigue — too many low-priority alerts — leads to important alerts being ignored. A practical baseline:

Metric                  Warning          Critical
Disk usage              75%              90%
Replication lag         30s              5min
Connection count        80% of max       95% of max
Long-running queries    30s              5min
Failed backups          n/a              Any

Prometheus with node_exporter and a database-specific exporter (postgres_exporter, redis_exporter, clickhouse_exporter) feeding into Grafana is the standard open-source stack for this. It is well-documented, integrates with PagerDuty and Alertmanager, and has pre-built dashboards for most major databases.
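As an example, the replication-lag row of the table above could become a Prometheus rule like the following. The metric name is an assumption — it presumes a postgres_exporter query exposing lag in seconds, so verify it against the metrics your exporter actually emits:

```yaml
# Prometheus alert rules — sketch matching the thresholds above
groups:
  - name: postgres
    rules:
      - alert: ReplicationLagWarning
        expr: pg_replication_lag_seconds > 30
        for: 2m
        labels:
          severity: warning
      - alert: ReplicationLagCritical
        expr: pg_replication_lag_seconds > 300
        for: 2m
        labels:
          severity: critical
```

The `for: 2m` clause avoids paging on a momentary spike; lag must stay above the threshold for two minutes before the alert fires.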


Security

Network Access

The database should not be reachable from the public internet. Place it on a private network. If you are using cloud infrastructure, use a VPC and security groups to restrict inbound connections to application servers only.

For on-premise or bare-metal deployments, iptables or nftables rules should limit database port access to known application server IPs. UFW is a friendlier interface to iptables for simpler setups.

When using Docker, be explicit about port binding. The default Docker networking model can expose ports in ways that bypass UFW rules — bind to 127.0.0.1 or a specific private interface, not 0.0.0.0.
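In a Compose file, publishing the port on loopback only looks like this (image and port are illustrative):

```yaml
# docker-compose.yml — bind the published port to loopback, not all interfaces
services:
  db:
    image: postgres:16
    ports:
      - "127.0.0.1:5432:5432"   # omit the IP and Docker binds 0.0.0.0
```

Application containers on the same Docker network can still reach the database by service name; only the host-level binding is restricted.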

Encryption in Transit

Enable TLS for all database connections. PostgreSQL supports TLS natively; configure ssl = on in postgresql.conf and require SSL in pg_hba.conf (hostssl rules). Generate certificates using Let's Encrypt or your internal CA.

For connections within a trusted private network this is sometimes skipped, but it is worth enabling — certificate infrastructure is reusable, and it protects against traffic interception on shared or misconfigured switches.
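The PostgreSQL side of TLS enforcement is small. A sketch, with certificate paths and the application subnet as placeholders:

```ini
# postgresql.conf
ssl = on
ssl_cert_file = 'server.crt'   # paths are relative to the data directory
ssl_key_file = 'server.key'

# pg_hba.conf — accept only TLS connections from the app subnet
hostssl  appdb  app_user  10.0.1.0/24  scram-sha-256
```

A plain `host` line would also accept unencrypted connections; using only `hostssl` rules for remote access makes TLS mandatory rather than optional.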

Authentication

Use strong, unique passwords. Store them in a secrets manager (HashiCorp Vault, AWS Secrets Manager, Doppler) rather than in application config files or environment variable files checked into source control.

PostgreSQL supports multiple authentication methods in pg_hba.conf. Prefer scram-sha-256 over md5 for password hashing. Consider certificate-based authentication for application service accounts.

Create database users with least-privilege access. Application service accounts should have SELECT, INSERT, UPDATE, DELETE on specific schemas — not superuser access. Migration users can be separate accounts with CREATE and ALTER privileges.
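A sketch of those least-privilege grants in SQL — role, schema, and database names are placeholders:

```sql
-- Application role: CRUD only, no DDL, no superuser
CREATE ROLE app_user LOGIN;  -- set the password via \password or a secrets manager
GRANT CONNECT ON DATABASE appdb TO app_user;
GRANT USAGE ON SCHEMA app TO app_user;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA app TO app_user;

-- Cover tables the migration role creates later
-- (run as the migration role, or add FOR ROLE <migrator>)
ALTER DEFAULT PRIVILEGES IN SCHEMA app
  GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO app_user;
```

Without the default-privileges statement, every new table created by a migration would be invisible to the application role until explicitly granted.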

Secrets Rotation

Rotate database credentials periodically. With a secrets manager and connection pooling, this can be done without application downtime. The application fetches credentials from the secrets manager on each new connection; the old credential is revoked after connections using it drain.


Network Topology

How the database fits into your overall network architecture matters for both performance and security.

A typical topology for a small-to-medium production deployment:

Internet
    |
Load Balancer (public IP)
    |
Application Servers (private subnet)
    |
Connection Pooler (PgBouncer / HAProxy)
    |
Database Primary (private subnet, no public IP)
    |
Database Replicas (same private subnet)

The connection pooler sits between application servers and the database to manage connection overhead. PostgreSQL's connection model — one OS process per connection — means that hundreds of open connections are expensive. PgBouncer in transaction pooling mode allows thousands of application connections to multiplex over a small number of actual database connections.
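A minimal PgBouncer configuration in transaction pooling mode might look like this (addresses and pool sizes are illustrative):

```ini
; pgbouncer.ini — transaction pooling sketch
[databases]
appdb = host=10.0.1.11 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 10.0.1.20
listen_port = 6432
auth_type = scram-sha-256
pool_mode = transaction       ; connection returns to the pool at transaction end
max_client_conn = 2000        ; application-side connections
default_pool_size = 20        ; actual server connections per database/user pair
```

One caveat of transaction pooling: session-level state (prepared statements, advisory locks, SET commands) does not survive across transactions, so the application must avoid relying on it.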

HAProxy is commonly used in front of Patroni clusters to route write traffic to the current primary and optionally route read traffic to replicas. Patroni exposes health check endpoints that HAProxy can poll to determine which node is currently primary.
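A sketch of the HAProxy side — node addresses are placeholders, and the health check relies on Patroni's REST API answering 200 on `/primary` only on the current leader (verify the endpoint name against your Patroni version):

```
# haproxy.cfg — route writes to the current Patroni leader
listen postgres_primary
    bind *:5000
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg1 10.0.1.11:5432 check port 8008
    server pg2 10.0.1.12:5432 check port 8008
    server pg3 10.0.1.13:5432 check port 8008
```

On failover, the old leader stops answering the health check, its sessions are shut down, and traffic shifts to whichever node now returns 200.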


Choosing the Right Database

Different workloads have fundamentally different requirements. The three categories that cover most production use cases:

PostgreSQL for OLTP

OLTP (Online Transaction Processing) workloads: user-facing application data, relational data with complex queries, workloads where correctness and consistency are critical.

PostgreSQL is the default choice for most web applications. It has excellent support for ACID transactions, a rich extension ecosystem (PostGIS for geospatial, TimescaleDB for time series, pgvector for embeddings), strong tooling for HA and backups, and a large community.

Use it for: application databases, user data, financial transactions, anything where you need joins, foreign keys, and reliable writes.

ClickHouse for Analytics

ClickHouse is a column-oriented database designed for analytical queries over large datasets. It excels at aggregations over billions of rows that would be impractical in PostgreSQL.

Use it for: event data, logs, metrics, analytics pipelines, reporting workloads where you are aggregating over large time ranges. Not suitable as a primary application database — it lacks full ACID transactions and has limited support for single-row updates.

Redis for Caching and Ephemeral Data

Redis is an in-memory data structure store. It is extremely fast for point lookups and is the standard choice for caching, session storage, rate limiting, queues, and pub/sub.

Use it for: caching query results, session data, rate limit counters, feature flags, distributed locks, background job queues (via Redis Streams or libraries like Sidekiq/BullMQ).

Redis Sentinel provides HA for Redis with automatic failover. Redis Cluster adds horizontal sharding.


Managed vs. Self-Hosted

This is the central trade-off in production database operations.

Managed Databases (RDS, Cloud SQL, PlanetScale, Neon, etc.)

Advantages:

  • Automated backups, point-in-time restore included
  • High availability options configurable through a UI
  • Patches and minor version upgrades handled by the provider
  • Storage auto-scaling available on most platforms
  • No server infrastructure to maintain

Disadvantages:

  • Cost: managed databases carry a significant markup over equivalent raw compute, typically 2-5x
  • Less control over configuration tuning
  • Vendor lock-in for proprietary services
  • Egress costs when moving data out
  • Limited options for exotic database engines or extensions

Self-Hosted

Advantages:

  • Lower cost at scale — bare metal or dedicated VMs are substantially cheaper than managed equivalents
  • Full control over configuration, extensions, and tuning
  • No egress costs for data locality with your application
  • Works for any database engine
  • Data sovereignty: clear physical control over where data lives

Disadvantages:

  • You own the operational burden: patching, backups, HA configuration, monitoring
  • Requires expertise to set up correctly
  • Incident response falls to your team

The Practical Decision

For teams without dedicated database engineering expertise, managed databases are often the right call initially — the operational overhead is real and the cost of getting HA or backups wrong is high.

For teams that have the expertise, are running at scale, or need to control costs, self-hosting with proper tooling is well within reach. The automation exists to make self-hosted deployments reliable without requiring deep Ansible or infrastructure expertise.


Production Database Readiness Checklist

Use this before considering a database deployment production-ready.

Durability

  • Durable write modes enabled (fsync / AOF persistence)
  • Data replicated to at least one additional node
  • Backups configured and verified with a test restore
  • PITR enabled if RPO requirements demand it

Availability

  • HA mechanism configured (Patroni, Sentinel, or equivalent)
  • Automatic failover tested
  • Connection pooling in place
  • Health check endpoints monitored

Observability

  • Database metrics exported to monitoring system
  • Host metrics (disk, CPU, memory) collected
  • Alerting configured with appropriate thresholds
  • Slow query logging enabled
  • Replication lag monitored

Security

  • Database not exposed to public internet
  • Firewall rules restrict access to application servers only
  • TLS enabled for connections
  • Least-privilege database users configured
  • Credentials stored in secrets manager, not in code
  • Superuser / root access restricted

Operations

  • Runbook for common failure scenarios documented
  • On-call rotation aware of database alert channels
  • Upgrade and patching process defined
  • Backup retention policy set and verified

How sshploy Helps

sshploy automates the deployment of production database clusters to your own servers via SSH and Ansible. Rather than manually configuring each layer described in this guide, sshploy runs proven playbooks that handle the full stack — installation, HA configuration, connection pooling, monitoring, and firewall setup.

Supported deployment recipes:

PostgreSQL with Patroni — deploys a multi-node PostgreSQL cluster with Patroni for automatic failover, etcd for consensus, HAProxy for connection routing, PgBouncer for connection pooling, and pgBackRest for backups. Includes postgres_exporter configured for Prometheus.

ClickHouse Cluster — deploys a sharded, replicated ClickHouse cluster with ClickHouse Keeper for coordination and chproxy for query routing. Includes Prometheus metrics endpoints.

Redis Sentinel — deploys a Redis primary/replica topology with Sentinel for automatic failover and HAProxy for transparent connection routing.

All recipes enforce firewall rules via iptables, configure TLS where applicable, and output connection strings ready for application configuration.


Frequently Asked Questions

How many servers do I need for a production database?

At minimum, three nodes for any HA setup. With Patroni, the recommended minimum is three PostgreSQL nodes (one primary, two replicas) plus three etcd nodes — though the etcd nodes can run on the same machines as PostgreSQL for smaller deployments. Three nodes is the minimum for a majority quorum: if one node fails, the remaining two can agree on a new primary. Two nodes cannot safely elect a primary without risking split-brain.

What is the difference between replication and backups?

Replication protects against hardware failure — if one server dies, another takes over. It does not protect against data corruption or accidental deletion: a DROP TABLE executed on the primary replicates to all replicas within milliseconds. Backups are the mechanism for recovering from logical errors. You need both.

How do I handle database migrations in production without downtime?

The approach depends on the type of change. Additive changes (adding a new nullable column, creating an index concurrently) can typically be applied without locking. Destructive changes (dropping columns, altering column types) require more care: use an expand/contract pattern, where the application is first deployed to handle both old and new schema, the schema is then migrated, and the old code paths are removed in a subsequent deploy. Tools like pg_repack handle table rewrites online.
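A sketch of the expand/contract sequence for retyping a column — the table and column names are invented for illustration, and each step ships as a separate deploy:

```sql
-- Expand: add the new column alongside the old one (nullable, so no table rewrite)
ALTER TABLE orders ADD COLUMN amount_cents bigint;

-- Backfill in batches while the application writes to both columns
UPDATE orders SET amount_cents = (amount * 100)::bigint
WHERE amount_cents IS NULL AND id BETWEEN 1 AND 10000;

-- Indexes are built without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_amount_cents ON orders (amount_cents);

-- Contract: once all reads use the new column, drop the old one
ALTER TABLE orders DROP COLUMN amount;
```

Batching the backfill keeps each transaction short, which avoids long-held row locks and lets autovacuum keep up with the dead tuples the updates create.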

How much does it cost to self-host a production database cluster?

A minimal three-node PostgreSQL cluster on reasonably sized VMs (8 vCPU, 32GB RAM, 500GB NVMe SSD each) typically costs $150-400/month on major cloud providers, or $80-200/month on budget providers like Hetzner or OVH. A comparable managed database service with equivalent compute and HA would typically run $600-1200/month or more. The cost gap widens at larger instance sizes. The self-hosted cost does not include the time to operate it, which is real but can be substantially reduced with automation.

When should I add read replicas?

Add read replicas when read load is meaningfully impacting primary performance, when you need a geographically distributed read endpoint, or when you want a warm standby that can serve reads while also being available for failover. For most applications, a properly indexed single primary handles more read load than expected. Profile before scaling horizontally — adding a missing index often solves what looks like a read scaling problem.

Ready to deploy?

Skip the manual setup. sshploy handles the entire deployment for you.

Deploy PostgreSQL with Patroni