Saturday, December 17, 2011

Beyond web developer tools: strace

[cross-posted from 2011 Performance Calendar]

Rich developer tools are available for all modern web browsers. They are typically easy to use and can provide all of the information necessary to optimize web pages. It is rare to need to go beyond the unified networking/scripting/rendering view of the Web Inspector's Timeline panel.

But they aren't always perfect: a tool may be missing information, may disagree with another tool, or may just be incorrect. For instance, a recent bug occasionally caused two Navigation Timing metrics to be incorrect in Chrome (and the Inspector).

When these rare situations arise, great engineers are able to go beyond a browser's developer tools to find out exactly what the browser is telling the operating system to do. On Linux, this source of ultimate truth is strace, a tool that can trace every system call a browser makes. Since each network and file access entails a system call, and that is where browsers spend much of their time, strace is perfect for debugging many types of browser performance issues.

What about other platforms?

In this post, I introduce strace because its syntax is clean and no setup is required. But most systems have an equivalent tool for tracing system calls. Mobile developers will be happy to hear that strace is fully supported on Android. OS X users will find that DTrace offers more powerful functionality at the expense of a less intuitive syntax (unfortunately not available on iOS). Finally, Event Tracing for Windows (ETW), while harder to set up, offers a friendly GUI.

Getting started

To use it, open a terminal and invoke strace at the command prompt. The following invocation prints every system call made while starting Google Chrome and loading google.com:

$ strace -f -ttt -T google-chrome http://www.google.com/

I've added -f to follow forks, -ttt to print the timestamp of each call and -T to print the duration of each call.

Zeroing in

If you run the command above, you'll probably be overwhelmed by the amount of stuff going on in a modern web browser. To filter down to something interesting, try using the -e argument. For examining only file or network access, try -e trace=file or -e trace=network. The man page has many more examples.
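
For example, to watch only the network-related calls during the same startup as above, the filter can simply be combined with the earlier flags:

$ strace -f -ttt -T -e trace=network google-chrome http://www.google.com/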

An example: local storage

As a concrete example, let's trace local storage performance in Chrome. First I opened a local storage quota test page. Then I retrieved the Chrome browser process's ID from Chrome's task manager (Wrench > Tools > Task Manager) and attached strace to that process using the -p switch.

$ strace -f -T -p <process id> -e trace=open,read,write

The output shows the timestamps, arguments and return value of every open, read, and write system call. The man page for each call explains the arguments and return values. The first call of interest to us is this open:

open("/home/tonyg/.config/google-chrome/Default/Local Storage/http_arty.name_0.localstorage-journal", O_RDWR|O_CREAT, 0640) = 114 <0.000391>

This shows us that Chrome has opened this file for reading and writing (and possibly created it). The name of the file is a big clue that this is where local storage is saved for arty's web page. The return value, 114, is the file descriptor which will identify it in later reads and writes. Now we can look for read and write calls which operate on fd 114, for example:

write(114, "\0\0\00020\0001\0002\0003\0004\0005\0006\0007\0008\0009\0000\0001\0002\0003\0"..., 1024 <unfinished ...>
<... write resumed> ) = 1024 <0.425476>

These two lines show a 1,024 byte write of the data beginning with the string above to the local storage file (fd 114). This write happened to take 425ms. Note that the call is split across two lines, possibly with others in between, because another thread preempted it. This is common for slower calls like this one.

We've only scratched the surface

There are options for dumping the full data read/written from the network or filesystem. Running with -c displays aggregate statistics about the time spent in the most common calls. I've also found that a bit of practical Python scripting can quickly parse these traces into a variety of useful formats.

This brief introduction hardly does this tool justice. I merely hope it provides the courage to explore deeper into the stack the next time you run into a tricky performance problem.

Sunday, August 7, 2011

Finding memory leaks

Over lunch last week Mikhail Naganov (creator of the DevTools Heap Profiler) and I were discussing how invaluable it has been to have the same insight into JavaScript memory usage that we have into applications written in languages like C++ or Java. But the heap profiler doesn't seem to get as much attention from developers as I think it deserves. There could be two explanations: either leaking memory isn't a big problem for web sites or there is a problem but developers aren't aware of it.

Are memory leaks a problem?

For traditional pages where the user is encouraged to navigate from page to page, memory leaks should almost never be a problem. However, for any page that encourages interaction, memory management must be considered. Most realize that ultimately if too much memory is consumed the page will be killed, forcing the user to reload it. However, even before all memory is exhausted performance problems arise:

  • A large JavaScript heap means garbage collections may take longer.
  • Greater system memory pressure means fewer things can be cached (both in the browser and the OS).
  • The OS may start paging or thrashing which can make the whole system feel sluggish.
These problems are of course exacerbated on mobile devices which have less RAM.

A real world walkthrough

So, to demonstrate that this is a real world problem and how easily the heap profiler can diagnose it, I set out to find a memory leak in the wild. A peek at the task manager (Wrench > Tools > Task Manager) for my open tabs showed a good candidate for investigation: Facebook is consuming 410MB!!


Pinpoint the leaky action

The first step in finding a memory leak is to isolate the action that leaks. So I loaded facebook.com in a new tab. The fresh instance used only 49MB -- another indicator the 410MB might have been due to a leak. To observe memory use over time, I opened the Inspector's Timeline panel, selected the Memory tab and pressed the record button. At rest, the page displays a typical pattern of allocation and garbage collection. This is not a leak.


While keeping an eye on the graph, I began navigating around the site. I eventually noticed that each time I clicked the Events link on the left side, memory usage would rise but never be collected. This is how the usage grows as I repeatedly click the link. A quintessential leak.


As an aside, this leak isn't a browser bug. The OS task manager shows similar memory growth when performing the same action in Firefox.

Find the leaked memory

Now that we know we have a leak, the obvious next question is what is leaking. The heap profiler's ability to compare heap snapshots is the perfect tool to answer it. To use it, I reloaded a new instance and took a snapshot by clicking the heap snapshot button at the bottom of the Profiles panel. Next, I performed the leaky action a prime number of times in hopes that it might be easy to spot. So I clicked Events 13 times and immediately took a second snapshot. To compare before and after, I highlighted the second snapshot and selected Comparison view.


The comparison view displays the difference between any two snapshots. I sorted by delta to look for any objects that grew by the same number of times I clicked: 13. Sure enough, there were 13 more UIPagelets on the heap after my clicks than before.


Expanding the UIPagelet shows us each instance. Let's look at the first.


Each instance has an _element property that points to a DOM node. Expanding that node, we can see that it is part of a detached DOM tree of 136 nodes. This means that 136 nodes are no longer visible in the page, but are being held alive by a JavaScript reference. There are legitimate reasons to do this, but it is also easy and common to do it by accident.
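
To make the pattern concrete, here is a minimal sketch of how this happens by accident (illustrative code only, not Facebook's actual source):

// Minimal sketch (not the page's actual code): a detached DOM tree
// kept alive by a stray JavaScript reference.
var registry = {};  // some long-lived object, e.g. a controller registry

function showPanel(id) {
  var panel = document.createElement('div');
  panel.innerHTML = '<ul><li>item</li></ul>';
  document.body.appendChild(panel);
  registry[id] = { _element: panel };    // reference kept "for later"
}

function hidePanel(id) {
  var panel = registry[id]._element;
  panel.parentNode.removeChild(panel);   // removed from the document...
  // ...but registry[id] still points at it, so the whole subtree becomes
  // a detached DOM tree that the garbage collector cannot reclaim.
}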


Note that all memory statistics reported by the tool reflect only the memory allocated in the JavaScript heap. This does not include native memory used by the DOM objects. So we cannot readily determine how much memory those 136 nodes are using. It all depends on their content -- for example leaking images can burn through memory very quickly.

Determine what prevents collection

After finding the leaked memory the last question is what is preventing it from being collected. To answer this we simply highlight any node and the retaining path will be shown (I typically change it to show paths to window objects instead of paths to GC roots). Here we see a very simple path. The UIPagelets are stored in a __UIControllerRegistry object.


At first I wondered if this might intentionally keep DOM nodes alive as a cache. However, that doesn't seem to be the case. A search of the source JS shows several places where items are added to the __UIControllerRegistry, but I couldn't find anywhere where they are cleaned up. So this appears to be a case where retaining the DOM nodes is purely accidental. The fix is to remove references to these nodes so they may be collected.
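
Sketched in code, the fix would look something like this (the destroyPagelet function and id parameter are hypothetical; only __UIControllerRegistry and _element appear in the actual page):

// Hypothetical sketch of the fix: drop the registry reference when a
// pagelet is torn down so its detached DOM nodes can be collected.
function destroyPagelet(id) {
  var entry = __UIControllerRegistry[id];
  if (entry && entry._element && entry._element.parentNode) {
    entry._element.parentNode.removeChild(entry._element);
  }
  delete __UIControllerRegistry[id];  // the crucial step
}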

Takeaway

The point of the post is not that Facebook has a leak. Facebook is an extremely well-engineered site, and all large apps deal with memory leaks from time to time. The point is to demonstrate how readily leaks can be diagnosed even with no knowledge of the source.

For anyone with an interactive web site, I highly recommend using your site for a few minutes with the memory timeline enabled to watch for any suspicious growth. If you have to solve any issues, the manual has excellent tutorials.

Sunday, May 29, 2011

How a web page loads

The major web browsers load web pages in basically the same way. This process is known as parsing and is described by the HTML5 specification. A high-level understanding of this process is critical to writing web pages that load efficiently.

Parsing overview

As chunks of the HTML source become available from the network (or cache, filesystem, etc), they are streamed to the HTML parser. Next, in a process known as tokenization, the parser iterates through the source generating a token for (most notably) each start tag, end tag and character outside of a tag.

For example the input source <b>hello</b> yields 7 tokens:

start-tag { name: b }
character { data: h }
character { data: e }
character { data: l }
character { data: l }
character { data: o }
end-tag { name: b }

After each token is generated it is serially passed to the next major subsystem: the tree builder. The tree builder dynamically modifies the Document's DOM tree to reflect the new token.

The 7 input tokens above yield the following DOM tree:

<html>
  <head>
  <body>
    <b>
      "hello"

Fetching subresources

A frequent operation performed by the tree builder is creating a new HTML element and inserting it into the Document. It is at the point of insertion that HTML elements which load subresources begin fetching the subresource.

Running scripts

This parsing algorithm seems to translate HTML source into a DOM tree as efficiently as possible. That is, except for one wrinkle: scripts. When the tree builder encounters an end-tag token for a script, it must serially execute the script before parsing can continue (unless the associated script start-tag has the defer or async attribute).

There are two significant preconditions which must be fulfilled before a script can execute:

  1. If the script is external its source must be fully downloaded.
  2. For any script, all stylesheets in the document must be fully downloaded.

This means the parser must often sit idle while scripts and stylesheets are downloaded.
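
For example, only the first of these includes (illustrative file name) blocks the parser at its end tag; the defer and async variants do not:

<script src="app.js"></script>        <!-- blocks parsing until downloaded and executed -->
<script src="app.js" defer></script>  <!-- executes after parsing, before DOMContentLoaded -->
<script src="app.js" async></script>  <!-- executes as soon as it has downloaded -->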

Why must parsing halt?

Well, a script may document.write something that affects further parsing, or it may query something about the DOM that would yield incorrect results if parsing had continued (for instance, the number of image elements in the DOM).
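
A contrived example of the first case (hypothetical markup):

<script>
  // The output of document.write must be inserted immediately after this
  // script, so the parser cannot race ahead while the script runs.
  document.write('<p>written while parsing</p>');
</script>
<p>This paragraph must end up after the one written above.</p>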

Why wait for stylesheets?

A script may expect to access the CSSOM directly or it may query an attribute of a DOM node which depends on the stylesheet (for example, how wide is a certain <table>).
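
For instance (a sketch; the stylesheet and id are made up):

<link rel="stylesheet" href="theme.css"> <!-- suppose it contains: #data { width: 80% } -->
<table id="data">...</table>
<script>
  // This answer is wrong unless theme.css has been downloaded and applied,
  // which is why the script must wait for pending stylesheets.
  var width = document.getElementById('data').offsetWidth;
</script>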

Is it inefficient to block parsing?

Yes. Subresource download times often have a large constant factor limited by round trip time. This means it is faster to download two resources in parallel than to download the same two in serial. More obviously, the browser is also free to do CPU work while waiting on the network. For these reasons it is critical to efficient loading of a web page that subresource fetches are initiated as soon as possible. When parsing is blocked, the tree builder is not able to insert subsequent elements into the DOM, and thus subsequent subresource downloads are not initiated even if the HTML source which includes them is already available to the parser.

Mitigating blocking

As I've blogged previously, when the parser becomes blocked, WebKit runs a lightweight parser known as the preload scanner. It mitigates the blocking problem by scanning ahead and fetching certain subresources that may be required. Other browsers employ similar techniques.

It is important to note that even with preload scanning, parsing is still blocked. Nodes cannot be added to the DOM tree. Although I haven't covered how a DOM tree becomes a render tree, layout or painting, it should be obvious that before a node is in the DOM tree it cannot be painted to the screen.

Finishing parsing

After the entire source has been parsed, all deferred scripts are executed first (each waiting for its source and any pending stylesheets to download). Their completion triggers the DOMContentLoaded event. Next, the parser waits for any pending async scripts to finish loading and executing. Finally, once all subresources have finished downloading, the window's load event fires and parsing is complete.

Takeaway

With this understanding, it becomes clear how important it is to carefully consider where and how stylesheets and scripts are included in the document. Those decisions can have a significant impact on the efficiency of the page load.

Sunday, May 15, 2011

List of ways HTML can download a resource

Recently two different projects required compiling a list of ways to trigger a download through HTML: Resource Timing and Preload Scanner optimization.

There's no centralized list in the WebKit source nor did a web search turn one up. So in hopes it may be useful to others, here's what I was able to come up with. Please let me know what I forgot (note that ways to download through CSS, JS, SVG and plugins are intentionally omitted).

  • <applet archive>
  • <audio src>
  • <body background>
  • <embed src>
  • <frame src>
  • <html manifest>
  • <iframe src>
  • <img src>
  • <input type=image src>
  • <link href>
  • <object data>
  • <script src>
  • <source src>
  • <track src>
  • <video poster>
  • <video src>

It might be interesting to compare the performance characteristics of downloads by resource type across browsers. For instance, download priority, memory cacheability, parser blocking and preload scan detection will vary.

Sunday, March 27, 2011

How (not) to trigger a layout in WebKit

As most web developers are aware, a significant amount of a script's running time may be spent performing DOM operations triggered by the script rather than executing the JS byte code itself. One such potentially costly operation is layout (aka reflow) -- the process of computing the position and size of each element in the render tree built from the DOM. The larger and more complex the DOM, the more expensive this operation may be.

An important technique for keeping a page snappy is to batch methods that manipulate the DOM separately from those that query the state. For example:

// Suboptimal, causes layout twice.
var newWidth = aDiv.offsetWidth + 10; // Read
aDiv.style.width = newWidth + 'px'; // Write
var newHeight = aDiv.offsetHeight + 10; // Read
aDiv.style.height = newHeight + 'px'; // Write

// Better, only one layout.
var newWidth = aDiv.offsetWidth + 10; // Read
var newHeight = aDiv.offsetHeight + 10; // Read
aDiv.style.width = newWidth + 'px'; // Write
aDiv.style.height = newHeight + 'px'; // Write

Stoyan Stefanov's tome on repaint, relayout and restyle provides an excellent explanation of the topic.

This often leaves developers asking the question: What triggers layout? Last week Dimitri Glazkov answered this question with this codesearch link. Trying to understand it better myself, I went through and translated it into a list of properties and methods. Here they are:

Element

clientHeight, clientLeft, clientTop, clientWidth, focus(), getBoundingClientRect(), getClientRects(), innerText, offsetHeight, offsetLeft, offsetParent, offsetTop, offsetWidth, outerText, scrollByLines(), scrollByPages(), scrollHeight, scrollIntoView(), scrollIntoViewIfNeeded(), scrollLeft, scrollTop, scrollWidth

Frame, Image

height, width

Range

getBoundingClientRect(), getClientRects()

SVGLocatable

computeCTM(), getBBox()

SVGTextContent

getCharNumAtPosition(), getComputedTextLength(), getEndPositionOfChar(), getExtentOfChar(), getNumberOfChars(), getRotationOfChar(), getStartPositionOfChar(), getSubStringLength(), selectSubString()

SVGUse

instanceRoot

window

getComputedStyle(), scrollBy(), scrollTo(), scrollX, scrollY, webkitConvertPointFromNodeToPage(), webkitConvertPointFromPageToNode()

This list is almost certainly not complete, but it is a good start. The best way to check for over-layout is to watch for the purple layout bars in the Timeline panel of Chrome or Safari's Inspector.
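
For example, a loop like this (illustrative selector) forces a synchronous layout on every iteration and shows up as a long run of purple bars:

// Illustrative: each iteration writes a style and then reads a
// layout-dependent property, forcing a synchronous layout every pass.
var rows = document.querySelectorAll('.row');
for (var i = 0; i < rows.length; i++) {
  rows[i].style.height = '20px';      // write
  var h = rows[i].offsetHeight;       // read -- triggers layout
}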

Saturday, February 5, 2011

Chrome's 10 Caches

While defining the set of page load time metrics that we think are most important for benchmarking, Mike Belshe, James Simonsen and I went through a seemingly simple exercise: enumerate the ways in which Chrome caches data. The resulting list was interesting enough to me that I thought it worthwhile to share.

When most people think of "the browser's cache" they envision a single map of HTTP requests to HTTP responses on disk (and perhaps partially in memory). This cache may arguably have the most impact on page load times, but to get to a truly stable benchmark, we identified 10 caches that need to be considered. An understanding of the various caches is also useful to web page optimization experts who seek to maximize cache hits.

  1. HTTP disk cache

    Stores HTTP responses on disk as long as their headers permit caching. Lookups are usually significantly cheaper than fetching over the network, but they are not free: a single disk seek might take 10-15ms, and that doesn't include the time to read the data from disk.

    The maximum size of the cache is calculated as a percentage of available disk space. The contents can be viewed at chrome://net-internals/#httpCache. It can be cleared manually at chrome://settings/advanced or programmatically by calling chrome.benchmarking.clearCache() when Chrome is run with the --enable-benchmarking flag set. Note that for incognito windows this cache actually resides in memory. source

  2. HTTP memory cache

    Similar to the HTTP disk cache, but entirely unrelated in code. Lookups in this cache are fast enough that they may be thought of as "free."

    This cache is limited to 32 megabytes; however, when the system is not under memory pressure, the effective limit may be higher due to its use of purgeable memory. Conversely, when multiple tabs are open, the limit may be divided among the tabs. It is cleared in the same manner as the HTTP disk cache. source

  3. DNS host cache

    Caches up to 100 DNS resolutions for up to 1 minute each. It is somewhat unfortunate that this cache needs to exist in Chrome, but OS level caching cannot be trusted across platforms.

    It can be viewed and manually cleared at chrome://net-internals/#dns. source

  4. Preconnect domain cache

    A unique and important optimization in Chrome is the ability to remember the set of domains used by all subresources referenced by a page. Upon the next visit to the page, Chrome can preemptively perform DNS resolution and even establish a TCP connection to these domains.

    This cache can be viewed at about:dns. source

  5. V8 compilation cache

    Compilation can be an expensive step in executing JavaScript. V8 stores compiled JS keyed off of a hash of the source for up to 5 generations (garbage collections). This means that two identical pieces of source code will share a cache entry regardless of how they were included. source

  6. SSL session cache

    Caches SSL sessions to disk. This saves several round trips of negotiation when connecting to HTTPS pages by allowing the connection to skip directly to the encrypted stream. Implementation and limits vary by platform; for example, when OpenSSL is used, the limit is 1,024 sessions. source

  7. TCP connections

    Establishing a TCP connection takes about one round trip time. Newer connections also start with a smaller congestion window, so they have lower effective bandwidth. For this reason Chrome keeps connections open for a period in hopes that they can be reused. This can be thought of as an in-memory cache.

    Connections may be viewed at chrome://net-internals/#sockets and cleared programmatically by calling chrome.benchmarking.closeConnections() when Chrome is run with the --enable-benchmarking flag set. source

  8. Cookies

    While not usually thought of as a cache, this is web page state which is persisted to disk. The presence of cookies can have a large impact on performance. They can bloat requests and change how the client and server behave in limitless ways.

    They can be cleared manually at chrome://settings/advanced. We are planning to add a method to chrome.benchmarking for the same. source

  9. HTML5 caches

    HTML5 introduces 3 major new ways for web pages to persist state to disk: Application Cache, Indexed Database and Web Storage. For a particular page, these stores may be viewed under the "Resources" panel of the Inspector. The entire Application Cache may also be viewed and manually cleared at chrome://appcache-internals/

  10. SDCH dictionary cache

    While currently only used by Google Search, the SDCH protocol requires a shared dictionary to be downloaded periodically. A performance hit is taken infrequently to download the dictionary which makes future requests much faster. source

I hope you found this as interesting as I did. Please let me know if I left anything out.

Edit: Will Chan points out additional caches for proxies, authentication, glyphs and backing stores. Proxy and authentication caches were intentionally omitted because they aren't relevant to our benchmark, however glyphs and backing stores are two additional things we need to consider. Thanks!

Sunday, January 30, 2011

Edit recorded web pages

Web Page Replay now supports editing recorded web pages. Credit for the idea goes to Sergey Chernyshev. To use it, record a web page and then use httparchive's edit command:

$ ./httparchive.py edit --host=example.com --path=/ ~/archive.wpr

Since all other aspects of the page are controlled, this allows you to measure the effect of tweaking a single aspect of the page without having to worry about confounding factors like rotating ads, server response time or variable network conditions.

I'm finding this invaluable for answering quick questions like how much faster a page would load if a given script were deferred, or if the order of a script and stylesheet were switched.

Friday, January 28, 2011

The WebKit PreloadScanner

WebKit's HTMLDocumentParser frequently must block parsing while waiting for a script or stylesheet to download. During this time, the HTMLPreloadScanner looks ahead in the source for subresource downloads which can be started speculatively. I've always assumed that discovering subresources sooner is a key factor in loading web pages efficiently, but until now I never had a good way to quantify it.

How effective is it?

Today I used Web Page Replay to test a build of Chromium with preload scanning disabled vs a stock build. The results were definitive. A sample of 43 URLs from Alexa's top 75 websites loaded on average in 1,086ms without the scanner and 879ms with it. That is a ~20% savings!

That number conceals some subtleties. The preload scanner has zero effect on highly optimized sites such as google.com and bing.com. In stark contrast, the preload scanner causes cnn.com, a subresource-heavy site, to load fully twice as fast.

Why does this matter?

There is a lot of room for improvement in the preload scanner. These results tell me that it is worth spending time giving it some serious love. Some ideas:

  • It doesn't detect iframes, @import stylesheets, fonts, HTML5 audio/video, and probably lots of other types of subresources.
  • When blocked in the <head>, it doesn't scan the <body>.
  • It doesn't work on XHTML pages (Wikipedia is perhaps the most prominent example).
  • The tokens it generates are not reused by the parser, so in many cases parsing is done twice.
  • The scanner runs in the UI thread. So as data arrives from the network, it may not be scanned immediately if the UI thread is blocked by JavaScript execution.
  • External stylesheets are not scanned until they are entirely downloaded. They could be scanned as data arrives, as is done for the root document.

Test setup

The test was performed with a simulated connection of 5Mbps download, 2Mbps upload, and 40ms RTT on OS X 10.6. The full data set is available.