Adventures in Measuring JavaScript
This is a post about my attempts to measure the CPU cost of JavaScript in a large codebase using thread instruction counters within Chrome's tracing tools. If you came here for a solution, I'll have to disappoint: While I managed to get something working, I have not implemented it in my organization and cannot recommend that you do either. This post explains the progress I've made, what options I've encountered on my journey, and offers some actual code you can use, too.
Performance is hard
Let's begin with the problem statement: Helping an entire organization to write performant code is tricky. Too frequently, you die a death of a thousand cuts - and while each Pull Request seems innocent, you end up with a slow and heavy application in aggregate. Ideally, you want to create a team goal, report progress against it, and allow developers to quickly see whether or not a single code change is making things better or worse.
That's fairly straightforward for dimensions like bundle size. For every PR, you have bots run `webpack`, analyze the bundle, and report back to the developer whether or not they made it bigger or smaller. As an organization, you can gamify the numbers and race towards the bottom together.
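To make that concrete, here's a rough sketch of what such a check could look like in a Node script. It's illustrative rather than our actual tooling; the file paths and the size budget below are placeholders for whatever your build produces.

```ts
import { statSync } from "node:fs";

// Hypothetical paths: wherever your CI stores the main-branch and PR builds.
const baselinePath = "dist-main/bundle.js";
const prPath = "dist-pr/bundle.js";

const baselineBytes = statSync(baselinePath).size;
const prBytes = statSync(prPath).size;
const delta = prBytes - baselineBytes;

// Report the delta back to the developer (here: just the CI log).
console.log(
  `Bundle size: ${prBytes} bytes (${delta >= 0 ? "+" : ""}${delta} bytes vs. main)`
);

// Fail the check if the bundle grew past an agreed-upon budget.
const budgetBytes = 10 * 1024;
if (delta > budgetBytes) {
  process.exitCode = 1;
}
```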
We had dreams about doing the same with the CPU cost of JavaScript. Specifically, we wanted to measure and improve the total work required for our application to load and for content to be visible. Our app is mostly CPU-bound, so optimizing individual algorithms can have an outsized impact on how snappy the application feels and actually operates. On the flip side, little changes to an `Array.map()` or a state layer can make the entire app slower – often without the developer ever noticing.
The problem: Measuring the CPU cost of JavaScript is something that we, as an industry, have seemingly not figured out.
You can't just time it
You might want to reach for time: If you have two functions, the one that's faster is the one that's better. That method works well for benchmarking a single function and tools like jsperf support developers curious enough to try out different implementations. Sadly, the time-based approach breaks down completely once you're changing a complex, large application: Due to countless browser and system optimizations, two runs of JavaScript are rarely the same. Here's what we tried:
- Use `performance.mark()` calls to measure the time it takes from navigating to our-app.com until content is visible.
- Do that `n` times to create some statistical significance.
In our testing, even on local hardware, the coefficient of variation between runs for a total of ~15 MB of uncompressed, unminified JavaScript was around 30%. In other words, any changes below 30% in either direction could simply hide in the noise of highly optimized execution. That variance increases in the cloud, where "noisy neighbor" problems exist. Once you do anything involving the network, all timing results quickly become worthless.
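For reference, the coefficient of variation is simply the standard deviation divided by the mean. A tiny helper like this one (a sketch, with hypothetical durations) shows how much room a ~30% spread leaves for a real change to hide in:

```ts
// Coefficient of variation: standard deviation relative to the mean.
function coefficientOfVariation(samples: number[]): number {
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length;
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / samples.length;
  return Math.sqrt(variance) / mean;
}

// Hypothetical durations (ms) from repeated runs of the exact same build.
const durationsMs = [640, 1180, 790, 1420, 905];
console.log(`CV: ${(coefficientOfVariation(durationsMs) * 100).toFixed(1)}%`);
```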
You can't just tell developers to "write better code"
In an engineering interview, we might sit down and stare at every single line in detail, discussing O(n) notations, carefully thinking through the performance of the code we're discussing. In a complex web app in 2023, the implications of code changes are often opaque. As a simple example, back in 2017 Visual Studio Code battled with a performance issue that had the editor use up to 20% of available CPU resources – all because the blinking text editor cursor was being animated with CSS, which was an unoptimized path deep inside Chromium. I am confident in saying that most developers would not have caught this problem in a code review. I also believe that many of us would have told a junior developer to not animate the cursor in JavaScript and to use CSS instead. I would have been surprised to discover that manually flipping visibility with JavaScript was, in fact, the more performant choice.
Surprises like these are everywhere: `document.querySelectorAll('.test-class')` was, for a long time, 99% slower (!!!) than the modern `document.getElementsByClassName('test-class')`.
Now, throw in a thick layer of React, state management, and server-driven reactivity, and I am confident in saying that developers are not equipped to detect small performance regressions or improvements on a per-PR level.
Let's count the instructions
So, if you can't simply measure the time and you cannot trust developers to intuitively know which code creates more work, what do you do? You measure the number of instructions. Instead of measuring the time it takes to reach our `performance.mark()`, we count the number of instructions the CPU executed to reach it. In other words, we ask the CPU how hard it had to work to execute the JavaScript we're testing. Below is an explanation of how to measure instruction counts for JavaScript with Chrome's own tracing.
Requirements
Computer: A Linux environment (or Linux on Windows via WSL) that supports CPU instruction counters. Due to the nature of virtual CPUs, you will need an actual physical machine and a CPU that includes a Performance Monitoring Unit (PMU). A virtual machine, including one offered by the cloud providers, will not work. I used WSL2 with Ubuntu on Windows 11 on a Lenovo ThinkPad X1 with an Intel i7-1165G7 processor.
```
WSL version: 1.2.5.0
Kernel version: 5.15.90.1
WSLg version: 1.0.51
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.2283
```
Once you have your Linux environment running, you can confirm whether or not you have support enabled by running `dmesg` and checking for a running PMU driver:
```
root@Notion-26823:~# dmesg | grep PMU
[ 0.097297] Performance Events: AnyThread deprecated, Icelake events, 32-deep LBR, full-width counters, Intel PMU driver.
```
See how it says `full-width counters` and `Intel PMU driver`? That means we're in business!
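If you'd rather script this check, say as part of a CI setup step, a small Node helper can do the same thing. This is just a sketch: it shells out to `dmesg` and looks for the `PMU driver` string from the output above, so it assumes that log format and whatever privileges `dmesg` needs on your machine.

```ts
import { execSync } from "node:child_process";

// Shell out to dmesg and look for an active PMU driver.
function hasPmuDriver(): boolean {
  const log = execSync("dmesg", { encoding: "utf8" });
  return /PMU driver/.test(log);
}

if (!hasPmuDriver()) {
  console.error("No PMU driver found - instruction counters will not work.");
  process.exitCode = 1;
}
```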
Linux Perf Tools: To actually measure the instruction counters, we'll need the Linux Perf Tools.

```bash
apt install linux-perf
```
Chrome: In order to measure the cost of JavaScript for a single execution, you'll need the browser's help to embed instruction counters into performance profiles and traces recorded with `performance.mark()`.
Let's start with the bad news: This feature no longer exists in Chrome. It was initially added to Chromium by Facebook performance engineer Andrew Comminos in 2019. In 2022, it was removed due to its niche nature and a lack of a maintainer.
In practice, that means that we'll be using Chrome 101, a version I confirmed to still have working instruction counters.
If you're using WSL, first make sure that you can run Chrome Stable:

```bash
sudo dpkg -i google-chrome-stable_current_amd64.deb
sudo apt install --fix-broken -y
sudo dpkg -i google-chrome-stable_current_amd64.deb

# and confirm that stable Chrome launches fine
google-chrome --no-sandbox
```
Then, install Chrome 101:

```bash
wget "https://www.googleapis.com/download/storage/v1/b/chromium-browser-snapshots/o/Linux_x64%2F982481%2Fchrome-linux.zip?alt=media"
```
Unzip and save this browser to a location of your choice.
Let's measure some instruction counters
For my testing script, I'm using `puppeteer`. Let's highlight a few things of note here:

- You'll need to pass in your own version of Chrome 101, otherwise `puppeteer` will use a recent version without instruction counters.
- You'll need to run Chrome with the `--no-sandbox` and `--enable-thread-instruction-count` flags.
- You do not need to specify tracing categories, but I've done so to keep my traces small. You can find the full list of available categories directly in Chromium's source code.
```ts
import puppeteer from "puppeteer";

// CommonOptions and getOutPath come from elsewhere in the testing harness (not shown):
// options carry the Chrome 101 path, the URL to test, and a user data directory;
// getOutPath resolves where the trace file should be written.
export async function run(options: CommonOptions): Promise<string> {
  const browser = await puppeteer.launch({
    executablePath: options.chrome,
    headless: "new",
    userDataDir: options.userDataDir,
    // Instruction counters only show up in the trace with this flag enabled.
    args: ["--no-sandbox", "--enable-thread-instruction-count"],
  });

  const page = await browser.newPage();
  const tracePath = getOutPath("profile.json");
  const categories = [
    "blink.console",
    "blink.user_timing",
  ];

  await page.tracing.start({ path: tracePath, categories });
  await page.goto(options.url);
  await page.tracing.stop();
  await browser.close();

  return tracePath;
}
```
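Calling it looks roughly like this; the Chrome path and URL are placeholders for wherever you unzipped Chrome 101 and whatever page you want to profile:

```ts
const tracePath = await run({
  chrome: "/path/to/chrome-linux/chrome", // the Chrome 101 build from above
  url: "file:///path/to/primes.html",     // or your application's URL
  userDataDir: "/tmp/chrome-profile",
});
console.log(`Trace written to ${tracePath}`);
```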
Let's create a profile for a simple example file, like this page diligently finding prime numbers:
```html
<!DOCTYPE html>
<html lang="en">
  <body>
    <script>
      function findPrimesInRange(start, end) {
        const primes = [];
        for (let num = start; num <= end; num++) {
          if (isPrime(num)) {
            primes.push(num);
          }
        }
        return primes;
      }

      function isPrime(n) {
        if (n <= 1) return false;
        if (n <= 3) return true;
        if (n % 2 === 0 || n % 3 === 0) return false;
        for (let i = 5; i * i <= n; i += 6) {
          if (n % i === 0 || n % (i + 2) === 0) return false;
        }
        return true;
      }

      performance.mark("start_prime");
      const start = 1;
      const end = 10000000;
      const primeNumbers = findPrimesInRange(start, end);
      performance.mark("end_prime");
    </script>
  </body>
</html>
```
Once you've created a profile, you'll end up with a giant JSON array filled with various events. Here's what some of those might look like:
{"args":{"data":{"navigationId":"D3ED92BAC05F7C827C0F2CCE164715FE"}},"cat":"blink.user_timing","name":"start_prime","ph":"R","pid":3799,"s":"t","ticount":18319665,"tid":3799,"ts":2394712721,"tts":36115},
{"args":{"data":{"navigationId":"D3ED92BAC05F7C827C0F2CCE164715FE"}},"cat":"blink.user_timing","name":"end_prime","ph":"R","pid":3799,"s":"t","ticount":12527374052,"tid":3799,"ts":2395524528,"tts":847868},
Let's decode that a little bit. You can find a detailed explanation of the Trace Event Format directly from the Chromium project, but for our purposes, the key we care about is `ticount`. At the beginning of our code, that count was at 18319665 - at the end, it's at 12527374052. After running this code five times, I come up with the following numbers:
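Extracting that delta is straightforward: read the trace, find the two `blink.user_timing` events by name, and subtract their `ticount` values. A rough sketch (the file path and mark names match the example above):

```ts
import { readFileSync } from "node:fs";

interface TraceEvent {
  cat: string;
  name: string;
  ticount?: number;
}

// Chrome traces are either a bare array of events or an object wrapping one.
const raw = JSON.parse(readFileSync("profile.json", "utf8"));
const events: TraceEvent[] = Array.isArray(raw) ? raw : raw.traceEvents;

const ticountFor = (name: string): number => {
  const event = events.find(
    (e) => e.cat === "blink.user_timing" && e.name === name
  );
  if (!event || event.ticount === undefined) {
    throw new Error(`No ticount found for mark "${name}"`);
  }
  return event.ticount;
};

const instructions = ticountFor("end_prime") - ticountFor("start_prime");
console.log(`Thread instructions between marks: ${instructions}`);
```

Run this over several traces and you can feed the deltas into the same coefficient-of-variation helper from earlier, which is how numbers like the ones below come together.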
| | Avg | Min | Max | Std. Dev. | Coefficient of Variation |
| --- | --- | --- | --- | --- | --- |
| Thread instructions | 12508261752 | 12507860881 | 12508649256 | 276496.79 | ±0.0022% |
| Milliseconds | 888 | 829.074 | 968.957 | 45.28 | ±5.0980% |
See that coefficient of variation right there? For time, it was a whopping 5% - but for thread instructions, it was 0.0022%. That's amazing - we could tell a developer very quickly and reliably whether or not their code got more or less efficient with regards to the CPU!
Where it breaks down
So, why didn't I end up implementing it? Here's a list of problems:
- Async uncounted: As I quickly discovered when running my own tool against our entire application, any asynchronous work is not counted. `fetch()` or `setTimeout()` operate without being counted. Naturally, any modern web app contains heaps of scheduled asynchronous work.
- Growing coefficient of variation: We're measuring right at the CPU here, which doesn't know why it's being told to do work. If garbage collection happens during your measurement, the counters will be off. In practice, the more code you measure, the larger your coefficient of variation. I wanted to measure our entire application's boot, which led me down a >15% CV path. At that point, all the same troubles we encountered with simply measuring time are back in the picture.
- Unmaintainable: We will not always be compatible with Chrome 101. Once that happens, we can no longer use the thread instruction counters.
- Undeployable: Since this approach only works on a physical machine, you'll need to hook a physical machine into your CI workflow. Many modern companies are entirely in the cloud and have completely lost that ability.
Fun Experiment Though
In conclusion: We still don't know how to tell developers when their JavaScript got a little better or worse.