
Deployment engineering for Puma

Puma expects to be run in a deployed environment eventually. You can use it as your development server, but most people use it in their production deployments.

To that end, this document serves as a foundation of wisdom regarding deploying Puma to production while increasing happiness and decreasing downtime.

Specifying Puma

Most people will specify Puma by including gem "puma" in a Gemfile, so we'll assume this is how you're using Puma.
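
For illustration, a minimal Gemfile looks something like this (the source line is the standard RubyGems default):

    # Gemfile
    source "https://rubygems.org"

    gem "puma"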

Single vs. Cluster mode

Initially, Puma was conceived as a thread-only web server, but support for processes was added in version 2.

In general, use single mode only if:

Otherwise, you'll want to use cluster mode to utilize all available CPU resources.

To run puma in single mode (i.e., as a development environment), set the number of workers to 0; anything higher will run in cluster mode.
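
A minimal config/puma.rb sketch of that switch might look like the following; the WEB_CONCURRENCY variable name is a common convention rather than anything Puma requires:

    # config/puma.rb
    # 0 keeps Puma in single mode; any value above 0 starts that many
    # worker processes, i.e. cluster mode.
    workers Integer(ENV.fetch("WEB_CONCURRENCY", 0))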

Cluster Mode Tips

For the purposes of Puma provisioning, "CPU cores" means:

  1. On ARM, the number of physical cores.
  2. On x86, the number of logical cores, hyperthreads, or vCPUs (these terms all mean the same thing); see the snippet after this list for a quick way to check them.
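
To check those counts on a given machine from Ruby, something like this works; note that Etc.nprocessors (standard library) reports logical cores, while physical core counts need an extra gem such as concurrent-ruby, which Puma itself does not depend on:

    require "etc"

    # Logical cores (hyperthreads/vCPUs) as seen by the operating system.
    puts Etc.nprocessors

    # Physical cores need a helper gem, e.g. concurrent-ruby:
    #   require "concurrent"
    #   puts Concurrent.physical_processor_count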

Set your config using the process described in the sections that follow.

Worker utilization

How do you know if you've got enough (or too many) workers?

A good question. Due to MRI's GIL, only one thread can be executing Ruby code at a time. But since so many apps are waiting on IO from DBs, etc., they can utilize threads to use the process more efficiently.

Generally, you never want processes that are pegged all the time. That can mean there is more work to do than the process can get through, and requests will end up with additional latency. On the other hand, if you have processes that sit around doing nothing, then you're wasting resources and money.

In general, you are making a tradeoff between:

  1. CPU and memory utilization.
  2. Time spent queueing for a Puma worker to accept requests and additional latency caused by CPU contention.

If latency is important to you, you will have to accept lower utilization, and vice versa.

Container/VPS sizing

You will have to make a decision about how "big" to make each pod/VPS/server/dyno.

TL;DR: 80% of Puma apps will end up deploying "pods" of 4 workers with 5 threads each, on 4 vCPUs and 8GB of RAM.
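
As a sketch only, that typical pod translates into a config/puma.rb along these lines (the environment variable names and the preload_app! call are illustrative choices, not requirements):

    # config/puma.rb -- the "4 workers, 5 threads each" pod described above
    workers Integer(ENV.fetch("WEB_CONCURRENCY", 4))

    max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
    threads max_threads, max_threads

    # Loading the app before forking lets workers share memory via copy-on-write.
    preload_app!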

For the rest of this discussion, we'll adopt the Kubernetes term of "pods".

Should you run 2 pods with 50 workers each? 25 pods, each with 4 workers? 100 pods, with each Puma running in single mode? Each scenario represents the same total amount of capacity (100 Puma processes that can respond to requests), but there are tradeoffs to make:

Measuring utilization and queue time

A timestamp header set by an upstream proxy server (e.g., nginx or haproxy) makes it possible to measure how long requests have been waiting for a Puma thread to become available.
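
As a rough sketch, assuming the proxy sets an X-Request-Start header containing a "t=" prefix and a Unix timestamp (a common convention, not something Puma provides), a small Rack middleware can log the queue time:

    require "logger"

    # Logs how long each request waited between the upstream proxy
    # stamping it and a Puma thread picking it up.
    class RequestQueueTime
      def initialize(app, logger: Logger.new($stdout))
        @app = app
        @logger = logger
      end

      def call(env)
        raw = env["HTTP_X_REQUEST_START"] # e.g. "t=1700000000.123"
        if raw
          started_at = raw.delete_prefix("t=").to_f
          queue_ms = [(Time.now.to_f - started_at) * 1000.0, 0.0].max
          @logger.info("request_queue_ms=#{queue_ms.round(1)}")
        end
        @app.call(env)
      end
    end

In a Rack app this would be enabled with use RequestQueueTime in config.ru; the logged values can then be graphed to compare queue time against utilization.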

Should I daemonize?

The Puma 5.0 release removed daemonization. For older versions and alternatives, continue reading.

I prefer not to daemonize my servers and use something like runit or systemd to monitor them as child processes. This gives them fast response to crashes and makes it easy to figure out what is going on. Additionally, unlike unicorn, Puma does not require daemonization to do zero-downtime restarts.

I see people using daemonization because they start puma directly via a Capistrano task and thus want it to live on past the cap deploy. To these people, I say: you need to be using a process monitor. Nothing is making sure Puma stays up in this scenario! You're just waiting for something weird to happen, Puma to die, and to get paged at 3 AM. Do yourself a favor and at least use the process monitoring your OS comes with, be it sysvinit or systemd. Or branch out and use runit or, hell, even monit.

Restarting

You probably will want to deploy some new code at some point, and you'd like Puma to start running that new code. There are a few options for restarting Puma, described separately in our restart documentation.

Migrating from Unicorn

Ubuntu / Systemd (Systemctl) Installation

See systemd.md