Tech TIL

Adventures in tuning unicorn for Kubernetes

There isn’t much detailed info in the wild about folks running Ruby on Rails on Kubernetes (k8s); the Venn diagram for the respective communities doesn’t have a ton of overlap. There are some guides that cover the basics, enough to get a lab up and running, but not much at all about running a production environment. The situation is getting better slowly.

The #kubernetes-ruby Slack channel has all of ~220 people in it versus the thousands found in channels for other languages. If you scroll through the history, you’ll find that most of the questions and responses cover Day 0/1 issues – “How do we make it run?”-type problems.

So, I thought it would be worthwhile to share a bit of my own experience trying to optimize a RoR deployment in Kubernetes, if only to save another Ops-person from stumbling through the same mistakes and dead-ends that I encountered.

Some background

The initial move of our RoR app from Heroku to Kubernetes was relatively straightforward. Much of the effort went into stress-testing the auto-scaling and resource config to find behavior that felt OK-enough to start.

Part of that tweaking required making our worker/container config less dense to be cluster-friendly and to provide smooth scaling, versus the somewhat bursty scale-up/down we had seen running a high process count on Heroku dynos.

Generally, lots of small containers spread across many nodes are what you want in a clustered setting. This optimizes for resiliency and efficient container packing, and it can also make scaling up and down very smooth.

We settled on an initial config of 2 unicorn (Rails app server) processes per container with 768MB of RAM and 1000 CPU millicores – maxing out at a few hundred k8s pods. This felt wrong – it inverts the traditional unicorn optimization practice of packing more workers behind a single unicorn socket for better routing efficiency – but it seemed to perform OK, and denser configs (confusingly) appeared to perform worse within the cluster. It also jibed with the limited documentation we could find from other folks making similar migrations.
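For concreteness, here's roughly what that initial sizing looks like as a container spec fragment. The names and env-var wiring are hypothetical (this assumes unicorn.rb reads its worker count from the environment), and the request/limit split is illustrative:

```yaml
# Sketch of the initial per-pod sizing – 2 unicorn workers per pod.
containers:
- name: rails-app
  env:
  - name: UNICORN_WORKERS   # hypothetical env var consumed by unicorn.rb
    value: "2"
  resources:
    requests:
      cpu: 1000m
      memory: 768Mi
    limits:
      cpu: 1000m
      memory: 768Mi
```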

The initial goal was satisfied – get it moved with limited downtime and acceptable performance, where “acceptable” was basically “what it was doing on Heroku”. In fact, it seemed to be running a bit better than it had on Heroku.

Slower than it should be

Fast forward a year and tens of thousands more requests-per-minute. Our team decided we wanted to introduce service level indicators/objectives (SLIs/SLOs) to inform product and infrastructure work. We chose a target for latency and started tracking it. We were doing OK against the target, but not where we felt we could be (I personally wanted a bit more buffer room), so we started digging into the causes of slowness within the stack.

It immediately became apparent that we were blind to network performance across some layers of the stack. We were tracking app/database level latency and could derive some of the latency values for other tiers via logs, but the process was cumbersome and too slow for real-time iteration and config tweaking.

A co-worker noticed we were missing an X-Request-Start header in our APM telemetry. We added the config in our reverse-proxy (nginx) and discovered a higher-than-expected amount of request queuing between nginx and unicorn.
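For anyone else missing the header, the nginx side is a one-line addition in the proxy config (the t=${msec} format is the convention most APM agents expect – check your agent's docs):

```nginx
# Inside the server/location block that proxies to the app servers:
proxy_set_header X-Request-Start "t=${msec}";
```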

That kicked off a round of experiments with nginx, unicorn, and container configs. Some of these provided minor benefit. Boosting the number of app pods reduced the request queuing but also wasted a lot of resources and eventually plateaued. Increasing worker density was minimally successful. We went up to 3 workers w/ 1GB of RAM and saw better performance, but going past that yielded diminishing returns, even when increasing the pod request/limits in parallel.

Network captures weren’t very helpful. Neither were Prometheus metric scrapes (at least not to the degree I was able to make sense of the data). As soon as requests entered the k8s proxy network, we were blind to intra-cluster comms until they hit the pod on the other side of the connection. Monitoring the unicorn socket didn’t show any obvious problems, but all the symptoms signaled a bottleneck between nginx and unicorn – exactly what you would see if connections were stacking up on the unicorn socket. We couldn’t verify that was the actual issue, though.

After investing quite a bit of time going down sysctl and other rabbit holes, we decided to set the problem aside to focus on other work. The experiments had yielded some performance improvements, and everything was performing “OK” versus our SLO.

Give it the juice

One of the goals of the SLI/SLO paradigm is to take measurements as close to the customer as is practical. In that spirit, we moved our latency measurement from the nginx layer to the CDN. We had avoided the CDN measurement previously because the time_taken value in AWS CloudFront logs is problematic and sensitive to slow client connections. However, AWS recently added an origin_latency value to their logs, making this tracking more practical and consistent.

Once we made the switch, we were sad. Per the updated measurement point, we were performing much worse than expected the closer we got to the client. This kicked off a new round of investigation.

Much of the unexpected latency was due to geography and TLS-handshakes. We detailed out some of the other causes and potential mitigations, listing unicorn config improvements as one of them. I set the expectation for those improvements low given how much time we had already invested there, and how mixed the results were.

But we gave it another go.

This time around, we introduced Linkerd, a k8s service mesh, into the equation. This gave us better visibility into intra-cluster network metrics, and we were able to test and iterate in real time in our sandbox environment.

We performed some experiments swapping out unicorn with puma (We hadn’t done this previously due to concerns about thread-safety, but it was safe enough to test in the sandbox context.). Puma showed an immediate improvement versus unicorn at low traffic and quickly removed any doubt that there was a big bottleneck tied directly to unicorn.

We carved out some time to spin up our stress test infra and dug in with experiments at higher traffic levels. Puma performed well but also ran into diminishing returns pretty quickly when adding workers/threads/limits. While troubleshooting we noticed that if we set a CPU limit of 2000 millicores, the pods would never use more than 1000. Something was preventing multi-core usage.

That something turned out to be an argument that I’ve so far found in only one place in the official k8s docs, and in no example deployment config I’ve come across to date.

apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "0.5"
    args:
    - -cpus
    - "2"

“The args section of the configuration file provides arguments for the container when it starts. The -cpus "2" argument tells the Container to attempt to use 2 CPUs.”

Turns out, it doesn’t matter how many millicores you assign to a request/limit via k8s config if you don’t enable a corresponding number of cores via container argument. The reason this argument is so lightly documented in the context of k8s is that it has little to nothing to do with k8s. -cpus is related to Linux kernel cpuset and allows Docker (and other containerization tools) to configure the cgroup a container is running in with limit overrides or restrictions. I’ve never had to use it before, so I knew nothing about it.

(╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻

(╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻

(╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻

So many tables flipped… not enough tables.

With a higher worker count, higher CPU/RAM limits, AND a container CPU assignment override, unicorn actually performed better than puma (this can be the case when your app is not I/O constrained). Almost all of our request queuing went away.

We eventually settled on 8 workers per pod with 2000/4000 CPU request/limit and 2048/3584 MB RAM request/limit as a nice compromise between density and resiliency, and saw an average 50ms improvement in our p95 response time. (It’s possible we’ll tweak this further as we monitor performance over time.)
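For reference, the final sizing translated straight from those numbers into a resources fragment (illustrative only – the rest of the pod spec is omitted):

```yaml
# Final per-pod sizing – 8 unicorn workers per pod.
resources:
  requests:
    cpu: 2000m
    memory: 2048Mi
  limits:
    cpu: 4000m
    memory: 3584Mi
```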

The issue had been a unicorn socket routing bottleneck the entire time, just as had been suspected earlier on. The missing piece was the CPU override argument.

Note: the default value for -cpus is ‘unlimited’. In our case, something (like a host-level param) was overriding that.

What have we learned

There are a few things worth taking away here in my mind.

  1. Yet again, k8s is not a panacea. It will not solve all your problems. It will not make everything magically faster. In fact, in some cases, it may make performance worse.
  2. Before you run out and start installing it willy-nilly, Linkerd (or service meshes in general) will not solve all your problems either. It was helpful in this context to enable some troubleshooting, but actually caused problems during stress testing when we saturated the linkerd proxy sidecars, which in turn caused requests to fail entirely. I ended up pulling the sidecar injection during testing rather than fiddling with additional resource allocation to make it work correctly.
  3. For all the abstraction of the underlying infrastructure that k8s provides, at the end of the day, it’s still an app running on an OS. Knowledge and configuration of the underlying stack remains critical to your success. You will continually encounter surprises where your assumptions about what is and is not handled by k8s automagic are wrong.
  4. Layering of interdependent configurations can be a nightmare to troubleshoot and can make identifying (and building confidence in) root causes feel almost impossible. Every layer you add increases complexity and difficulty exponentially. Expertise in individual technologies (nginx, unicorn, Linux, k8s, etc) helps, but isn’t enough. Understanding how different configurations interact with one another across different layers in different contexts presents significant challenges.

TIL: How to live-rotate PostgreSQL credentials

OK, I didn’t actually learn this today, but it wasn’t that long ago.

Postgres creds rotation is straightforward with the exception of the PG maintainers deciding in recent years that words don’t mean anything while designing their identity model. “Users” and “Groups” used to exist in PG, but were replaced in version 8.1 with the “Role” construct.

Here’s a map to translate PG identities to a model that will make sense for anyone who is familiar with literally any other identity system.

Postgres                    Literally anything else
Role with LOGIN             User
Role with NOLOGIN           Group
Membership (IN ROLE)        Group membership

Now that we’ve established this nonsense, here’s a way of handling live creds rotation.

CREATE ROLE user_group; -- create a role, give it appropriate grants.

CREATE ROLE user_blue WITH ENCRYPTED PASSWORD 'REPLACE ME' IN ROLE user_group LOGIN; -- The currently active credential.

CREATE ROLE user_green WITH ENCRYPTED PASSWORD 'REPLACE ME AS WELL' IN ROLE user_group NOLOGIN; -- This one isn't being used yet, so disable the login.

That gets you prepped. When you’re ready to flip things:

ALTER USER user_green WITH PASSWORD 'new_password' LOGIN;

Update the creds wherever else they need updating, restart processes, confirm everything is using the new credentials, etc. Then:

ALTER USER user_blue WITH PASSWORD 'new_password_2' NOLOGIN;
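Worth noting: the actual permissions live on user_group, so rotating the blue/green login roles never touches grants. A hypothetical grant set (schema and object names made up for illustration):

```sql
-- Grants attach to the shared group role once; the rotating
-- login roles inherit them via IN ROLE membership.
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO user_group;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO user_group;
```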

Easy, peasy.


TIL: How to exclude specific Boto errors

Some background:

I previously wrote a Python Lambda function that copies AWS RDS snapshots to a different region. It had been working for months but recently started throwing this confusing error:

An error occurred (DBSnapshotAlreadyExists) when calling the CopyDBSnapshot operation: Cannot create the snapshot because a snapshot with the identifier copy-snap-of-TIMESTAMP-DB_NAME already exists.

Thinking this might be due to some timestamp shenanigans, I looked at the Cloudwatch Events trigger for the Lambda and saw that there were two triggers instead of the original one that I set up. Both were scheduled for the same time. I deleted the new one and waited until the next day to see if the error re-occurred, which it did.

Looking through the Cloudwatch logs, I saw that even though the second trigger was gone, the Lambda was still executing twice. I’ve filed a support ticket with AWS, but in the meantime I needed to silence the false positives to keep people from getting paged.

The error handling had been done as:

except botocore.exceptions.ClientError as e:    
    raise Exception("Could not issue copy command: %s" % e)

Initially, I tried:

except botocore.exceptions.ClientError as e:
    if 'DBSnapshotAlreadyExists' not in e:
        raise Exception("Could not issue copy command: %s" % e)

That blows up, because you can’t do a substring check against the exception object itself. Instead, I had to inspect the parsed error response:

except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] != 'DBSnapshotAlreadyExists':
        raise Exception("Could not issue copy command: %s" % e)

Which works – real copy failures still page us, but the duplicate-snapshot noise doesn’t.
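The filtering intent, as a standalone sketch (should_page and the sample dicts are hypothetical; the dict shape matches what botocore exposes on e.response):

```python
# Hypothetical helper mirroring the handler's logic: botocore parses
# the service error into a dict on e.response, so we branch on the
# error code rather than string-matching the message text.
def should_page(error_response, excluded=("DBSnapshotAlreadyExists",)):
    """True when the error is NOT one we've chosen to silence."""
    return error_response["Error"]["Code"] not in excluded

dup = {"Error": {"Code": "DBSnapshotAlreadyExists", "Message": "snapshot already exists"}}
real = {"Error": {"Code": "AccessDenied", "Message": "not authorized"}}
print(should_page(dup), should_page(real))  # → False True
```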

TIL: How to use NumPy

I’ve been trying to flesh out my Python knowledge and learn more about machine learning (ML) in the process. Most of my day-to-day Python use is focused on text manipulation, API calls, and JSON parsing, so leveling up on ML involves more math (specifically stats) than I’m used to.

Today I played around with the NumPy Python package a bit and figured out some simple things.

For example, if I wanted to multiply the numbers in two lists with vanilla Python, like this:

a = [1, 2, 3, 4] 
b = [4, 3, 2, 1]

print(a * b)

I’d get TypeError: can't multiply sequence by non-int of type 'list'. I’d have to write something to iterate through each list.

NumPy, on the other hand, can handle this like a champ. And this is probably the simplest thing you could use it for.

import numpy

a = [1, 2, 3, 4]
b = [4, 3, 2, 1]

new_a = numpy.array(a)
new_b = numpy.array(b)

print(new_a * new_b)

> [4 6 6 4]

NumPy really shines when you start dealing with multidimensional lists and stat work.

numpy.array([a, b])

> array([[1, 2, 3, 4],
       [4, 3, 2, 1]])

And then it’s just turtles all the way down. You can slice intersections, calculate standard deviations, and so on. It’s a handy Python package that I knew literally nothing about prior to today and a nice tool to add to the toolbox.
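To make “slice intersections” and “standard deviations” concrete, a quick sketch continuing from that two-row array:

```python
import numpy

m = numpy.array([[1, 2, 3, 4],
                 [4, 3, 2, 1]])

# Slice every row at column index 1 – something plain nested
# lists can't do in a single expression.
print(m[:, 1])    # → [2 3]

# Standard deviation across all eight values.
print(m.std())    # → 1.118033988749895
```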

TIL: How to use list comprehensions in python

In the past, if I wanted to make a new list by pulling values out of an existing list based on a condition I would have done something like:

def listItems():
    a = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
    new = []
    for num in a:
        if num % 2 != 0:
            new.append(num)
    print(new)

But I figured out that list comprehensions can dramatically compress these functions while still maintaining readability. Here’s an example of a list comprehension that would render the same output as the expanded function above:

def listComp():
    a = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
    new = [num for num in a if num % 2 != 0]
    print(new)


The syntax is a little weird because python has so little structure to it – “num for num in a…” – but it makes more sense if you’re referencing a tuple, where it would be “(1, 2) for num in a…”
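A runnable sketch of both shapes – the plain filter, plus a comprehension whose leading expression builds tuples (values chosen arbitrarily):

```python
a = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

# Same filter as above, returning instead of printing.
odds = [num for num in a if num % 2 != 0]

# The expression slot can be anything, e.g. (value, is_odd) tuples.
pairs = [(num, num % 2 != 0) for num in a[:3]]

print(odds)   # → [1, 9, 25, 49, 81]
print(pairs)  # → [(1, True), (4, False), (9, True)]
```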


TIL: How to pass MySQL queries as variables in Bash

Note: I also learned how to handle query blocks with Bash here documents. Most of what I’ve done in the past with MySQL and Bash has been limited to single-line selects, so Yay!, learning.



user_id=$(mysql -u$db_user -p$db_pass -h$db_server -sN <<GET_USER_ID
USE main;
SELECT user_id FROM users WHERE username="dave";
GET_USER_ID
)

echo "Dave's User ID is $user_id"

(The -sN flags suppress the column header and table formatting, so the variable holds just the value.)
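The same capture-a-heredoc pattern works with any command that reads stdin. A mysql-free sketch using tr, so it runs anywhere:

```shell
# Command substitution wrapping a here document; tr stands in
# for mysql here.
shouted=$(tr '[:lower:]' '[:upper:]' <<SHOUT
dave
SHOUT
)
echo "shouted: $shouted"   # → shouted: DAVE
```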

It’s the small things in life…

TIL: How to disable core dumps on AWS Linux

I ran across a situation where a previous sysadmin had enabled core dumps on a server, and the dumps were filling up the root volume. The dumps weren’t needed, so I decided to disable them. Problem is, I’ve never really dealt with core dumps because I’ve never had to, so I had to do some Googling.

Here’s the result:
# Only tested with AWS Linux but should work on RHEL, CentOS, Fedora, etc.

echo '*     hard    core    0' >> /etc/security/limits.conf
echo 'fs.suid_dumpable = 0' >> /etc/sysctl.conf
sysctl -p

The limits.conf line disables core dump creation for users, the sysctl.conf line stops setuid binaries from generating core files, and sysctl -p applies the changes to the running kernel.

TIL: How to get RedShift to access S3 buckets in a different region

While trying to get a Spark EMR job running, I encountered this error on a Spark step that copied data from RedShift to S3.

error: S3ServiceException:The bucket you are attempting to access must be addressed using the specified endpoint.

I’ve seen issues in the past with S3 buckets outside of us-east-1 needing to be targeted with region-specific endpoint URLs for REST access (e.g. s3-us-west-2.amazonaws.com vs. plain s3.amazonaws.com), but had not seen anything similar for s3:// targeted buckets.

This got me looking at Hadoop file system references, none of which were helpful, because EMR rolls its own proprietary file system for Hadoop S3 access. So Hadoop’s recommended s3a:// (which is fast and resilient – and supports self-discovering cross-region S3 access!) does not work on EMR. Your only option is s3://, which appears to be region-dumb.

The fix turns out to be simple: you just have to pass the bucket region (e.g. us-west-2) to the Spark step as a separate argument.

… simple, but annoying, because the steps worked in a pre-prod environment (in a different region), so it wasn’t immediately apparent what was causing the failure, which was buried in the logs.

TIL: How to get JBoss AS to gracefully handle database server failovers

Today I learned how to get JBoss AS (WildFly) to not lose its mind during a database server failover event (something that happens in the cloud quite often).

The config is simple, but finding comprehensive documentation is a bit challenging since most current JBoss docs require a RedHat subscription. So you’re stuck piecing things together from multiple sites that contain tidbits of what’s needed.

Note: This is for a DBaaS scenario (think AWS RDS or Azure SQL Database), where the DB load balancing is done downstream of your app server. If you’re specifying multiple servers per connection, you’ll have to do some Googling of your own.

Otherwise, you’ve probably got a datasource (DS) connection defined in standalone.xml (or elsewhere depending on your deployment) that looks sort of like this:

<datasource jndi-name="java:jboss/datasources/defaultDS" enabled="true" use-java-context="true" pool-name="defaultDS" use-ccm="true">
        <connection-url>jdbc:mysql://db-host:3306/mydb</connection-url>
        ...
</datasource>

Adding these options makes JBoss handle failovers a bit better (the host and credentials below are placeholders):

<datasource jndi-name="java:jboss/datasources/defaultDS" enabled="true" use-java-context="true" pool-name="defaultDS" use-ccm="true">
        <connection-url>jdbc:mysql://db-host:3306/mydb?autoReconnect=true</connection-url>
        <driver>mysql</driver>
        <security>
            <user-name>REPLACE_ME</user-name>
            <password>REPLACE_ME</password>
        </security>
        <validation>
            <check-valid-connection-sql>SELECT 1</check-valid-connection-sql>
            <background-validation>true</background-validation>
        </validation>
        <pool>
            <flush-strategy>IdleConnections</flush-strategy>
        </pool>
</datasource>

Check out lines 2 and 8-14.

On line 2 we’ve added the autoReconnect=true option to the connection-url. This does exactly what it says. If a database connection attempt fails, JBoss will attempt to re-establish the connection instead of sulking in a corner like it does by default. But it needs a way to know that it should reconnect…

On lines 8-11, we’ve added a connection validation. I believe some DS drivers handle this on their own, but the MySQL JDBC drivers I’ve tested appear not to. This seems to be the standard workaround from what I could find but does have the downside of issuing wasteful queries on the DB. The “background-validation” setting helps a little by issuing the validation checks in a separate thread.

This section should force JBoss to drop dead DB connections instead of letting them gum up the pipes.

Lines 12-14 help with the same problem. By default, JBoss is supposed to flush stale connections (that’s what the docs say, at least), but this doesn’t always seem to happen in practice. Using IdleConnections should take care of any failed connections that aren’t getting flushed, or EntirePool can be used if you want to be really aggressive.