Sunday, 28 August 2016

Diagnosing poorly performing Node.js applications

Someone call the developer

As a DevOps team, we feel the pain of outages or poorly performing applications in the form of late-night or weekend pages.  Because of that, we are highly motivated to keep everything up and running smoothly.  We use New Relic to monitor every component of IBM Marketplace, and we recently noticed our external API component was starting to report low Apdex scores.  My squad was on the brink of getting paged repeatedly if we did not make improvements quickly.

Poor Apdex scores = incoming pages!

After taking a look at the Web transactions view in New Relic, we saw that most of the time was being spent in Node.

Slow web transactions

Using Node.js profiling data to improve performance

Since we needed to profile Node to see where the problem areas were, we decided to use Node's built-in V8 profiler.  We also used tools to decipher the log that V8 generates, such as Node Inspector and WebStorm's V8 log viewer.  After starting the application with profiling enabled, we ran our integration test suite, which includes many tests for the slowest-responding API endpoints.  This gave us a good baseline of how our application was performing.
To use WebStorm's V8 log viewer:
  • Start Node with profiling enabled: node --prof app.js
  • Run the integration test suite, or simulate some load on your application with a tool like ApacheBench (ab).
  • Stop Node. A V8 profiling log will be generated in the application directory.
  • In WebStorm, go to Tools > V8 Profiling > Analyze V8 Profiling Log and point it at the log file that was generated.
  • Once loaded, WebStorm displays a view that shows where the application spent most of its time and lets you drill into the different parts of the code that were called to see how long each step took.
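For reference, a typical end-to-end run looks something like this (the entry point, port, and endpoint below are placeholders, not our actual application):

    # 1. Start the app with V8 profiling enabled
    node --prof app.js

    # 2. Drive load against the slow endpoints (your integration suite, or ApacheBench)
    ab -n 500 -c 10 http://localhost:3000/api/offerings

    # 3. Stop the app; a V8 log (e.g. isolate-0x<address>-v8.log) appears in the
    #    working directory, ready to load into WebStorm via Tools > V8 Profiling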
After drilling down in the Node Inspector view, we could see where the application was spending most of its time.  Using that data, we hypothesized about where we could make changes to gain the greatest improvements.  Our process was to tweak the parts of the code that the profiling reports showed were performing poorly, then run the integration test suite with profiling enabled again so we could compare the new results to the baseline.  This validated our changes with data and gave us confidence in which fixes would best improve performance.

Example improvement made using V8 profiling data

One method that was taking a high percentage of time compared to other calls was a _clone call inside node-cache, a caching component we use.  We realized that every time we retrieved an object from our cache, node-cache was cloning the object before returning it to the application.  Cloning is the safest approach because it guarantees that cached data is not overwritten, but we verified that we only read from the cache after it is set and never modify the objects it returns.  That allowed us to turn useClones off when initializing node-cache, saving the cost of cloning every object retrieved from the cache and giving us a boost in performance.
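A minimal sketch of that change (the key name and TTL below are illustrative, not our real configuration):

    const NodeCache = require('node-cache');

    // useClones defaults to true, so every get() returns a deep copy of the cached
    // value; that copy is what showed up as the hot _clone call in the V8 profile.
    // Setting it to false returns references instead, which is only safe because
    // we never mutate objects after they come out of the cache.
    const cache = new NodeCache({ stdTTL: 600, useClones: false });

    cache.set('offering:123', { name: 'Example offering' });
    const offering = cache.get('offering:123'); // returned by reference, no clone cost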

WebStorm's V8 profiling log view


By using this approach iteratively and going after the low-hanging fruit first, we were able to improve our application's performance by 34% in a matter of days.  We still have some longer-term improvements to make, but we were able to stabilize our monitoring metrics and get our application back to an acceptable performance level.

Back where we want to be!