This is a guest blog post from Xactly Software Engineer, Anvesh Checka. Stay tuned for future blogs from the Xactly Engineering Team!
A plan document is a legal document between a company and its sales representatives, used to track progress. Plan documents within Xactly eDocs & Approvals both simplify processes and increase flexibility as documents move smoothly through the various channels in your organization.
Whenever a sales representative requests a plan document, the document is formatted by the PDF Generator Service using the HTML markup it receives from the Headless Browser Service.
Xactly recently switched from PhantomJS to Headless Chrome for generating content for plan documents in Xactly DocuSign workflows. The switch helped us continue building a reliable system with significant performance and stability gains.
This article will cover:
- What inspired this transition
- Challenges faced during the process
- How the Xactly engineering team achieved this goal
A Bit of Background
At Xactly, we run automations on remote machines against production data workloads. Along the way, a series of deliberate modifications introduced maintenance challenges, such as higher memory consumption, lower processing speed, intermittent crashes, and cache invalidations.
At the same time, growing computing power has dramatically expanded what modern browsers and tooling can do.
PhantomJS had become a staple in our workflow; however, PhantomJS and other headless browsers have not kept pace with front-end ecosystems over the years. Features such as ES6/7/8, web and service workers, native and browser-specific APIs, and the shadow DOM each added their own set of challenges. Stability, too, was a major concern.
When the PhantomJS end-of-life announcement coincided with the arrival of Headless Chrome, it was welcome news for the Xactly engineering team. We began running multiple proofs of concept (PoCs) to evaluate Headless Chrome's features, monitor its stability and performance, and check its compatibility with our product feature set.
We observed that Headless Chrome is tremendously fast (given enough hardware), stable, and, just as importantly, developer-friendly. Those reasons were enough for us to flip the switch.
How Did We Pull This Off?
As part of our reliability engineering initiative, we rolled out the new implementation in about two months (including thorough testing).
In the process, we developed a checklist of the most important things to consider related to the infrastructure. These considerations include:
- Software compatibility with the operating system
- Machine setup
- Build and deployment workflows and scripts
- Service discovery
- Health checks
- Security monitoring
- Server monitoring
- Sandboxing (wherever necessary)
Our PhantomJS monolith was converted into a standalone service running Headless Chrome, with several instances of the service horizontally scaled across multiple multi-core machines behind HAProxy.
On each machine, a NodeJS app is vertically scaled by clustering Node instances and load-balancing them with PM2. We use Puppeteer as the API client for driving Chromium. Currently, we collect all logs in a local file on each machine, but we plan to migrate to the ELK stack (Elasticsearch, Logstash, Kibana) in the future.
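As an illustration, the per-machine clustering can be expressed in a PM2 ecosystem file. This is a minimal sketch; the file name, app name, paths, and memory limit below are hypothetical, not Xactly's actual configuration:

```javascript
// ecosystem.config.js -- hypothetical PM2 configuration; all names
// and values here are illustrative.
module.exports = {
  apps: [
    {
      name: "html-render-service",   // assumed service name
      script: "./server.js",
      exec_mode: "cluster",          // PM2 cluster mode: fork + load-balance
      instances: "max",              // one worker per CPU core
      max_memory_restart: "1G",      // recycle a worker that leaks past 1 GB
      out_file: "./logs/out.log",    // local file logging, as described above
      error_file: "./logs/error.log",
    },
  ],
};
```

The service would then be launched with `pm2 start ecosystem.config.js`, letting PM2 handle worker restarts and round-robin balancing across cores.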
For every request, the Node service hits an internal endpoint and generates the required HTML for the requested parameters. The HTML is then passed to the PDF Generator Service, which produces the customized PDF document.
Cold Cache and Pristine Sessions
It's important never to sacrifice security for performance. Every request runs in a brand-new Chromium instance with no shared cache, and instances are never reused for additional tabs. This eliminates any chance of data being shared between two requests while processing the HTML, thereby avoiding inadvertent access to cross-business data.
Creating and destroying a Chromium process for every incoming request imposes a performance penalty, but we accepted this tradeoff in the name of security. The cost is easily offset with more capable hardware.
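A minimal sketch of this per-request isolation using Puppeteer might look like the following. The function name and PDF options are our illustration, not Xactly's actual code, and Puppeteer is required lazily inside the function so the module loads even on machines without Chromium installed:

```javascript
// Sketch: render one document in a pristine, throwaway Chromium instance.
// renderIsolated and its options are illustrative names.
async function renderIsolated(html) {
  const puppeteer = require("puppeteer"); // lazy require (see lead-in)
  // A fresh browser per request: no shared cache, cookies, or storage.
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.setContent(html, { waitUntil: "networkidle0" });
    return await page.pdf({ format: "A4", printBackground: true });
  } finally {
    // Destroying the whole browser (not just the tab) guarantees the
    // next request starts from a cold cache.
    await browser.close();
  }
}

module.exports = { renderIsolated };
```

Reusing one browser and opening a tab per request would be cheaper, but as noted above, the launch-and-destroy cycle is what guarantees a cold cache and pristine session.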
We’ve been able to collect some basic statistics and other information about our implementation, which include the following:
- The time to generate content for a single document ranges between 4 and 10 seconds, depending on the size of the document.
- Despite spinning up a new browser instance for every request, an 8-core machine is typically able to generate more than 8000 documents in less than an hour. This was never possible using PhantomJS.
- During our load tests, we observed the following:
- Running approximately 35 active browser instances is ideal. Anything significantly above this number degrades performance due to process overhead. To keep things under control, we implemented a queuing mechanism.
- Without queuing, we ran into memory and maximum-thread limits. In these cases, we noticed a rise in the number of ghost Chrome processes, which had to be removed by a cron job at regular intervals.
- If a request takes too long to complete or fails due to an internal error, we retry it several times based on a pre-configured setting.
- We moved settings such as API timeouts, retry counts, thresholds, and more into external configuration files.
- All metrics data is pushed to an internal, in-house system. If you don't have an in-house implementation, you can use the Keymetrics application.
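The queuing and retry behavior described above can be sketched as a small concurrency limiter. The cap of 35 mirrors our load-test figure; `createLimiter` and its defaults are illustrative, not Xactly's actual implementation:

```javascript
// Sketch: cap the number of concurrently active jobs (e.g. browser
// launches) and retry failed jobs a configurable number of times.
function createLimiter(maxConcurrent = 35, maxRetries = 2) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= maxConcurrent || waiting.length === 0) return;
    active++;
    const { job, retriesLeft, resolve, reject } = waiting.shift();
    job()
      .then(resolve)
      .catch((err) => {
        if (retriesLeft > 0) {
          // Re-queue the failed job with one fewer retry remaining.
          waiting.push({ job, retriesLeft: retriesLeft - 1, resolve, reject });
        } else {
          reject(err);
        }
      })
      .finally(() => {
        active--; // free the slot, then pull the next queued job
        next();
      });
  };

  // Returns a promise that settles once the job succeeds or exhausts retries.
  return (job) =>
    new Promise((resolve, reject) => {
      waiting.push({ job, retriesLeft: maxRetries, resolve, reject });
      next();
    });
}

module.exports = { createLimiter };
```

Wrapping every render call in `limit(() => renderDocument(params))` keeps the active browser count below the threshold while excess requests wait in the queue instead of exhausting memory or threads.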
Our system continues to evolve with planned enhancements and other improvements, including the following:
- Structured log analysis through ELK
- Improved monitoring
- Enhanced reliability and performance
About the Author
My name is Anvesh Checka and I work as a Software Engineer with the Product team at Xactly. I love building user interfaces and web apps with various systems. If you have any questions or comments, feel free to tweet or send me a message.