By Jordan Brown, October 2, 2014

Burning Money is Never Good: Putting Out Fires with Librato’s API

I started working as a full-stack engineer on CrowdFlower’s platform team in April of 2013. Since I joined, it’s been an awesome ride, with challenges ranging from building a real-time job monitoring dashboard to revamping our entire billing system.

Welcome aboard. Now get to it!

Right out of the gate, I was asked to tackle alerting for runaway job costs — a crucial component of a broader systemic bug that was causing us to burn money unnecessarily.

Throwing money away is never a good idea.

Basically, we’d overcollect judgments on a given unit of data without notifying the requester. To fix this, we needed to automate alerting when something went awry and pause jobs that breached critical thresholds on any of several metrics.

Focused on customer satisfaction, we used to absorb these costs ourselves when misconfigured customer-built jobs resulted in too many collected judgments. In some ways, this part of our platform was asleep on the job.

Automating job monitoring would ensure that customers could passively keep tabs on their jobs, enrich data more efficiently and take an important chunk out of CrowdFlower’s runaway job costs issue.

So how did we fix it?

We turned to Librato for their expertise in metrics analysis and alerting. Their product is highly scalable and easy to integrate with, which is exactly what we needed to solve this problem quickly and for the long term.

We hooked into their API to build an automated alerting framework. Upon job launch, an alert object for each job metric we care about is sent to Librato, and a corresponding copy is saved to our database. These metrics are configurable through the Alert Settings page in CrowdFlower’s UI, but are created with reasonable defaults so most users will not have to modify them.
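To make the launch step concrete, here is a minimal sketch in Python of building one alert object per metric from defaults, with optional per-job overrides like those a settings page might supply. Everything in it is an assumption for illustration: the metric names, the threshold values, the JobAlert class and the payload shape are hypothetical, and the payload is not Librato’s actual API schema.

```python
from dataclasses import dataclass, asdict

# Hypothetical metrics and default thresholds; the real ones are not
# listed in this post.
DEFAULT_THRESHOLDS = {
    "judgments_per_unit": {"warning": 5.0, "critical": 10.0},
    "cost_per_unit": {"warning": 0.50, "critical": 1.00},
}


@dataclass
class JobAlert:
    """One alert per (job, metric) pair; a copy would be saved locally."""
    job_id: int
    metric: str
    warning: float
    critical: float
    pause_on_critical: bool = True

    def name(self) -> str:
        # Hypothetical naming scheme tying the alert back to its job.
        return f"job.{self.job_id}.{self.metric}"


def build_default_alerts(job_id, overrides=None):
    """Build one alert per known metric, applying any user overrides."""
    overrides = overrides or {}
    alerts = []
    for metric, defaults in DEFAULT_THRESHOLDS.items():
        # User-supplied values win over the defaults, key by key.
        thresholds = {**defaults, **overrides.get(metric, {})}
        alerts.append(
            JobAlert(job_id, metric, thresholds["warning"], thresholds["critical"])
        )
    return alerts


def to_payload(alert):
    """Serialize an alert into a plain dict (assumed shape, not a real schema)."""
    payload = asdict(alert)
    payload["name"] = alert.name()
    return payload
```

A job launched with no overrides would get both alerts at their default thresholds, while passing something like {"cost_per_unit": {"critical": 2.0}} would change only that one value.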

Every five minutes, a scheduled script iterates through all of our running jobs and sends each job’s set of metrics to Librato for analysis. Their system determines whether any of the job’s metrics have exceeded their critical or warning thresholds and, if so, fires a webhook back to CrowdFlower with the job’s unique identifier and the metric that triggered the alert. Once we receive this webhook, we parse the information and send a warning email and/or pause the job entirely, depending on the configuration of the particular alert.
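The webhook half of that loop can be sketched roughly as follows, again in Python. This is an illustration under stated assumptions rather than our production code: the JSON field names (job_id, metric, level), the alert_settings lookup and the action names are all hypothetical, and the real webhook payload may be shaped differently.

```python
import json


def parse_webhook(body):
    """Extract the job id, metric and severity from an assumed JSON payload."""
    data = json.loads(body)
    return int(data["job_id"]), data["metric"], data["level"]


def actions_for(level, pause_on_critical=True):
    """Decide what to do for an alert at the given severity."""
    if level == "critical" and pause_on_critical:
        return ["email", "pause"]
    if level in ("warning", "critical"):
        return ["email"]
    return []  # Unknown severity: do nothing rather than guess.


def handle_webhook(body, alert_settings):
    """Look up the saved alert configuration and return the actions to take.

    alert_settings maps (job_id, metric) to that alert's saved settings,
    standing in for the copy kept in the database.
    """
    job_id, metric, level = parse_webhook(body)
    setting = alert_settings.get((job_id, metric))
    if setting is None:
        return job_id, []  # No saved alert for this job and metric.
    return job_id, actions_for(level, setting["pause_on_critical"])
```

Returning a list of actions instead of sending the email or pausing the job directly keeps the decision logic easy to test on its own; the caller would then carry out each action.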

Another benefit of using Librato is that they maintain a history of each job’s metrics over the lifetime of the job, complete with beautiful graphs that can be easily embedded into our platform.

Librato has Made a Big Difference for Us

Ultimately, integrating with their automated monitoring for CrowdFlower job metrics has allowed our platform to deliver even better results for data enrichment jobs, while ensuring the stability of both cost and quality for our customers.