Need 4 Speed II – QuantumLeap

OK, in the last episode we looked at Orion performance. A platform with only Orion is a bit incomplete: effectively, you are suffering from long-term memory loss. To fix that, you need a component that can record events through time. Enter QuantumLeap!

The Road so far

QuantumLeap has its own problems with amnesia (not the database), but we’ve heard it’s now on the mend. Because we got such fan-tastic results with Orion, we decided to hit the lab once more and see how much performance we can get out of QuantumLeap.

The Setup

The setup is more or less the same as last time, with additions: Loadtest -> Orion -> QuantumLeap -> Crate DB. Messages travel from Orion to QuantumLeap (QL) via Orion's notification mechanism.
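For readers who have not wired these two together: the notification mechanism is an NGSI v2 subscription created in Orion that points at QL. Below is a minimal sketch of such a subscription body. The hostname `quantumleap`, port 8668 and the catch-all `idPattern` are assumptions for illustration, not our actual settings.

```python
import json


def build_ql_subscription(ql_url="http://quantumleap:8668/v2/notify",
                          id_pattern=".*"):
    """Build an NGSI v2 subscription body that forwards entity changes to QL.

    ql_url and id_pattern are illustrative defaults (assumptions).
    """
    return {
        "description": "Forward entity changes to QuantumLeap",
        "subject": {
            # Match every entity id; narrow this down in a real deployment.
            "entities": [{"idPattern": id_pattern}],
        },
        "notification": {
            "http": {"url": ql_url},
            # Ask Orion to include timestamps with each notified attribute.
            "metadata": ["dateCreated", "dateModified"],
        },
    }


if __name__ == "__main__":
    # This body would be POSTed to Orion's /v2/subscriptions endpoint.
    print(json.dumps(build_ql_subscription(), indent=2))
```

With a subscription like this in place, every update the load test sends to Orion triggers one HTTP notification towards QL, which is why QL's throughput matters as much as Orion's.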

As a starting point we have QL 0.7.6 coupled to a 3.x Crate DB. First run: 191 msg/sec. 10000 sent, 6151 arrived in Crate. Oops. PIDs on QL go up to ~5000 and there are minutes of post-processing. Clearly too slow.
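To put the loss into numbers, here is a tiny sketch that turns the sent and stored counts from this first run into percentages. The function name `delivery_stats` is made up for this example.

```python
def delivery_stats(sent, stored):
    """Return how many messages were lost and the delivered/lost percentages."""
    lost = sent - stored
    return {
        "lost": lost,
        "delivered_pct": round(100.0 * stored / sent, 2),
        "lost_pct": round(100.0 * lost / sent, 2),
    }


if __name__ == "__main__":
    # Counts from the first run: 10000 sent, 6151 stored in Crate.
    print(delivery_stats(10000, 6151))
```

For the first run this works out to 3849 lost messages, or roughly 38% of everything sent.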


Optimization time! We change the notification mode on Orion to permanent. Orion PIDs go up and processing goes on for minutes: it takes one minute to process 2000 PIDs. We do not run for a long duration, but if we kept this up, something would crash on the Orion side. Looking at Crate DB... all messages on board! That is good, but clearly the buffering is happening on the Orion side, and this is not sustainable.

We revert the notification mode so we get realistic readings. Time to hit the books hard. As a tweak, we reduce logging to ERROR. Now the last 2000 PIDs are disposed of in 20 seconds! But roughly the same amount of data is lost. Clearly not a winning formula.

Into another Dimension

Another attempt. Combing through the issues, we ran into a new Docker image, one with a Redis cache. Apparently, this can cache stuff! We try it, and we get a result!

274 messages / second and all in Crate! No post-processing. At this point the VM is running out of resources. So what does that mean? If these features make it into the release, we have a 10x improvement over 0.7.5. Can we do better? Let’s find out!

Outside access

Testing from a separate machine outside the cluster will free some resources, but it will also take some. Now the chain is: Load machine -> Internets -> Umbrella -> Orion -> QL -> Crate.

At first it’s all rainbows and sunshine: 6 cores can do 200+ messages / sec. Then the performance drops horribly and response times climb to 1000+ ms. What? It’s not rate limiting. What is this?

Problem Child

We try a few things. Making sure that the load test machines are not the problem. Dropping Elasticsearch so that logging is not the problem. Scaling Umbrella to two instances so that capacity is not the problem. Separating Mongo so that Mongo access is not the problem. But we still have a problem. We try to isolate it: sending with a token, without a token, with https, without https. Suspicion mounts towards Umbrella. As a final blow, we send data directly to Orion and via Umbrella at the same time. Traffic that goes via Umbrella hiccups, while traffic that goes directly to Orion is fine. PROOF!
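The "both routes at the same time" experiment can be sketched roughly like this. It is not our actual load test tool: the `send` callables are stand-ins for real HTTP requests, and the path names, counts and sleep times are made up for illustration.

```python
import statistics
import threading
import time


def measure(send, count):
    """Call send() count times and return each call's duration in ms."""
    latencies = []
    for _ in range(count):
        start = time.perf_counter()
        send()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def compare_paths(paths, count=100):
    """Run one load loop per path concurrently and summarize latencies.

    paths maps a label to a zero-argument callable that sends one message.
    """
    results = {}

    def worker(name, send):
        results[name] = measure(send, count)

    threads = [threading.Thread(target=worker, args=item)
               for item in paths.items()]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return {
        name: {"median_ms": statistics.median(lat), "max_ms": max(lat)}
        for name, lat in results.items()
    }


if __name__ == "__main__":
    # Stand-ins for real requests; swap in HTTP calls to each route.
    report = compare_paths({
        "direct_to_orion": lambda: time.sleep(0.001),
        "via_umbrella": lambda: time.sleep(0.005),
    }, count=20)
    print(report)
```

Running both loops in the same time window matters: if only one route slows down while the other stays flat, the shared parts of the chain (Orion, QL, the database, the load machine) are unlikely to be the cause.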

This is odd. We test further with Nginx/Nginx:latest alone. Same problem. We have trouble believing that one of the most popular components in the Internets has this performance issue. We google and tweak settings (worker_processes = auto: no help; worker_connections = 16: no help; keepalive_timeout = 3: no help)!

If someone can point us in the right direction with Nginx, please let us know in the comments!

Gen T

We decided to run reference tests with Traefik. After getting to know our new friend a bit better, we get promising results. No long response times, and messages end up in the historical DB nicely. We run a one-hour test with 300+ msg / sec through Traefik -> Orion -> QL -> Postgres. No problem.

Dense! After a one-hour run at ~350 msg/sec: 1 325 868 / 1 325 868 in the DB.
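As a quick sanity check on that rate, here is a small sketch that derives the average throughput from the stored count. It assumes the run lasted exactly 3600 seconds, which is an approximation; the function name is made up for this example.

```python
def average_rate(messages, duration_s):
    """Average throughput in messages per second, rounded to one decimal."""
    return round(messages / duration_s, 1)


if __name__ == "__main__":
    # 1 325 868 messages stored; duration assumed to be exactly one hour.
    print(average_rate(1_325_868, 3600))
```

Under that assumption the average comes out at about 368 msg/sec, in the same ballpark as the rate quoted above.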

Dr. Kubernetes, I presume

Then we decided to run the same tests against one of our other deployments, which runs on Kubernetes with Debian as the Linux flavor (normally we use CentOS). This one has Umbrella. Bam! NO long response times.

So what does all this mean? Not sure, the jury is still out. With Traefik we are not doing TLS termination, not checking Bearer tokens or anything like that. Kernel parameters? A difference in OS? The search is still on. If you have a solution in mind, let us know in the comments!

P.S. Merry Christmas, all!