Wednesday, November 16, 2011

HTTP Caching 101

This captures a basic learning curve around HTTP Caching, actually just a selected subset of a rather large topic. Another post driven by laziness - written in terse talking-point style, with most information either paraphrased or copied explicitly from the resources listed at the end - with a touch of editorializing on my part. My thanks to those authors upfront.

Overview


Benefits: save bandwidth, reduce latency, lighten server load
Types of caches (server's responses can be cached by these)
  • gateway cache (shared): middle-man, facade in front of a server (for any client). This caches responses from one server for many clients.
  • browser cache (private): for BACK button, images, etc. This caches responses from many servers for one particular client.
  • proxy cache (shared): forward-facing towards web, client-side. This caches responses from many servers for many clients.
Here's some decent slideware with boxes and lines illustrating the above, and with some sequence diagrams illustrating the protocols discussed below.
Mechanisms
  • Expiration, Validation and Invalidation
  • note that caching is applied only to GET and HEAD
  • ...this is yet another reason not to tunnel operations thru GET, etc abuse - requests won't make it to server w/proxy cache in play
  • and, another reason to use all verbs as intended: POST, PUT and DELETE should not ever be cached
  • in fact, POST, PUT and DELETE should result in cached content being invalidated
“pragma: no-cache” is linked to requests, and not responses. The RFC did not specify the behavior of this directive for responses. Hence, this directive does NOT instruct the browser not to cache a page. We often see this tag being misused when a page is served to the browser. Developers mistakenly set this directive expecting that the page will not be cached on the browser.

The sections of the HTTP 1.1 spec (http://www.w3.org/Protocols/rfc2616/rfc2616.html) that discuss caches include 13, 14.9, 14.19, 14.21, and 14.24-29.

Cache Control


Here's a subset of the cache-control mechanism to be discussed here:
non-cacheable:
HTTP/1.1 200 OK
Cache-Control: no-cache,no-store
user-specific private:
HTTP/1.1 200 OK
Cache-Control: private,no-cache,no-store
cacheable:
HTTP/1.1 200 OK
Cache-Control: max-age=3600,must-revalidate
Date: Sun, 16 Oct 2011 15:30:00 GMT
Here's a summary on the semantics of the header values listed above ( the first explanation section of each directive comes from http://www.mnot.net/cache_docs/#CACHE-CONTROL, with subsequent text coming from section 14.9 of the HTTP 1.1 spec):
no-cache — forces caches to submit the request to the origin server for validation before releasing a cached copy, every time. This is useful to assure that authentication is respected (in combination with public), or to maintain rigid freshness, without sacrificing all of the benefits of caching. From the spec: This allows an origin server to prevent caching even by caches that have been configured to return stale responses to client requests. (Note that it's possible to specify field-names for this header that can apply more fine-grained meaning to the intended behavior; see section 14.9.1 for details.). However, a point of interest from this article:
“no-cache” directive, according to the RFC, tells the browser that it should revalidate with the server before serving the page from the cache..."no-store" tells the browser not only not to cache the page, but also not to even store the page in its cache folder. Whenever you’re serving a sensitive page, this is the cache control directive to use. Notice that of late, “cache-control: no-cache” has also started behaving like the “no-store” directive. To be on the safer side, we recommend that you use both “no-cache” and “no-store” when serving sensitive pages.
no-store — instructs caches not to keep a copy of the representation under any conditions.
private — allows caches that are specific to one user (e.g., in a browser) to store the response; shared caches (e.g., in a proxy) may not.
max-age=[seconds] --- specifies the maximum amount of time that a representation will be considered fresh. Like Expires, it sets a freshness lifetime, but it is relative to the time of the request rather than absolute. [seconds] is the number of seconds from the time of the request you wish the representation to be fresh for. From the spec: implies that the response is cacheable (i.e., "public") unless some other, more restrictive cache directive is also present. If a response includes both an Expires header and a max-age directive, the max-age directive overrides the Expires header, even if the Expires header is more restrictive. (Note that max-age can also be specified in a request from the client; the meaning here is, from section 14.9.3, that the client is willing to accept a response whose age is no greater than the specified time in seconds. There are also max-stale and min-fresh directives that further refine the user-agent's intentions; see section 14.9.3 for details.) Note that setting max-age to zero makes the response stale immediately, so a cache should re-validate it against the server before serving it again.
must-revalidate — tells caches that they must obey any freshness information you give them about a representation. HTTP allows caches to serve stale representations under special conditions; by specifying this header, you’re telling the cache that you want it to strictly follow your rules.
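To make the directive semantics above a little more concrete, here is a minimal Python sketch of how a cache might read a Cache-Control value and decide whether it may keep a copy at all. The function names `parse_cache_control` and `may_store` are hypothetical, and the comma split is a simplification: it does not handle quoted field-name lists such as the fine-grained form of no-cache described in section 14.9.1.

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control value into {directive: argument or None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):  # simplification: ignores quoted commas
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


def may_store(directives: Dict[str, Optional[str]], shared_cache: bool) -> bool:
    """Return True if a cache of the given kind may keep a copy."""
    if "no-store" in directives:
        return False  # nobody may keep a copy
    if shared_cache and "private" in directives:
        return False  # only a single user's cache may keep it
    return True
```

For the "cacheable" example above, `parse_cache_control("max-age=3600,must-revalidate")` yields a dict with `max-age` mapped to `"3600"` and `must-revalidate` mapped to `None`.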
What follows here is a general discussion of the Expiration and Validation models.

Expiration


With expiration, you tell the cache how long a response should be considered "fresh"; once that time is up it becomes "stale". This is more efficient than validation since the cache need not go back to the origin server until the response is stale.
Expires
This is useful for long-lived static images, and things that change on regular schedule. It's not without its problems: 
  • server/client clocks must be in sync
  • times are in one-second resolution
  • you might forget to update date after expiration
  • spec says servers SHOULD NOT send an Expires date more than one year in future
  • it's not useful for dynamic content
Cache-Control (i.e. max-age, s-maxage (shared caches only))
  • to address limitations of Expires
  • relative vs absolute time frames
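As a rough illustration of the expiration model, the following Python sketch computes a freshness lifetime from response headers, preferring s-maxage (shared caches only), then max-age, then Expires minus Date. The names `freshness_lifetime` and `is_fresh` are hypothetical, headers are assumed to arrive as a plain dict with canonical capitalization, and age is measured from the Date header alone, which is simpler than the full age calculation in section 13.2.3 of the spec.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def freshness_lifetime(headers: dict, shared_cache: bool = False) -> Optional[timedelta]:
    """Return how long the response stays fresh, or None if nothing says so."""
    directives = {}
    for part in headers.get("Cache-Control", "").split(","):
        name, _, arg = part.strip().partition("=")
        if name:
            directives[name.lower()] = arg
    # s-maxage applies to shared caches only and wins there.
    if shared_cache and "s-maxage" in directives:
        return timedelta(seconds=int(directives["s-maxage"]))
    # max-age overrides Expires, even if Expires is more restrictive.
    if "max-age" in directives:
        return timedelta(seconds=int(directives["max-age"]))
    if "Expires" in headers and "Date" in headers:
        return parsedate_to_datetime(headers["Expires"]) - parsedate_to_datetime(headers["Date"])
    return None


def is_fresh(headers: dict, now: datetime, shared_cache: bool = False) -> bool:
    """Simplified check: age is just 'now' minus the Date header."""
    lifetime = freshness_lifetime(headers, shared_cache)
    if lifetime is None or "Date" not in headers:
        return False
    age = now - parsedate_to_datetime(headers["Date"])
    return age < lifetime
```

Here `now` is expected to be a timezone-aware datetime, since the GMT dates in the headers parse to aware values.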

Validation


With the validation model, the cache asks the origin server on each request if its cached response is still valid. This is probably the best option for dynamic content.
ETag/If-None-Match
You can think of ETag as a hash, fingerprint, etc. unique ID for a given page, resource, etc. Use this when it's inconvenient to store modification dates, or when the 1-second resolution of HTTP expiration model is not good enough, etc. 
An ETag can be strong or weak - in a nutshell, strong means the response can be used to compose a larger response, i.e. it's valid as a partial piece of content; whereas weak means you should not expect to use the response to reliably compose larger pages (etc.). If a request contains an If-None-Match header with a value equal to the ETag for the given resource, the response simply contains a 304-Not Modified response status with an empty body; this is useful for bandwidth savings. If any intermediary cache must go back to the origin server for validation, a useful implementation to consider on that origin server involves saving the ETag (in database, key-value store, in-memory hashmap, etc.) on a per-URL (etc.) basis, to additionally save on CPU (i.e. the ETag need not be recomputed on each request).
Note that there is a fair amount more to the ETag-based validation model than discussed here. Please consult the RFC for more details.
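A minimal sketch of the ETag/If-None-Match exchange in Python, assuming the ETag is simply a SHA-256 hash of the response body (one possible fingerprint among many). The names `make_etag` and `respond` are hypothetical, and weak validators (the `W/` prefix) are deliberately left out.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Strong ETag: a quoted hash of the exact response bytes."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a GET on a resource with this body."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match is not None:
        # The header may carry several candidate tags, or "*".
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if "*" in candidates or etag in candidates:
            return 304, headers, b""  # client copy is still valid
    return 200, headers, body
```

In a real service the computed ETag would probably be stored per URL, as suggested above, rather than rehashed on every request.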
Last-Modified/If-Modified-Since
A date-based approach to validation can be done with the Last-Modified/If-Modified-Since protocol. Similar to ETag, the server might save the modification dates for a given resource (URL, etc.) and compare that with any request header value for If-Modified-Since, again returning 304-Not Modified if the resource has not been modified since that given date. Again, savings in bandwidth can be gained here.
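The date-based check can be sketched in Python with the standard library's HTTP-date helpers. The names `last_modified_header` and `not_modified_since` are hypothetical, and `modified` is assumed to be a timezone-aware datetime kept by the server for the resource.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def last_modified_header(modified: datetime) -> str:
    """Format a timezone-aware datetime as an HTTP date (one-second resolution)."""
    return format_datetime(modified.astimezone(timezone.utc), usegmt=True)


def not_modified_since(modified: datetime, if_modified_since: Optional[str]) -> bool:
    """True if the server may answer 304 for this If-Modified-Since value."""
    if not if_modified_since:
        return False
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return False  # unparseable date: treat the header as absent
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates carry no sub-second part, so drop it before comparing.
    return modified.replace(microsecond=0) <= since
```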
Combining ETag with Last-Modified
There's a fair amount of ambiguity in blogs and web articles around what the behavior should be when requests send both types of validation headers and the server side supports both. To make things worse, the RFC 2616 is not the easiest specification to understand (in fact, there's an effort underway to improve that with a rewrite). I'm basing my interpretation on my own careful read of RFC 2616; you are invited to correct me if you believe I've got it wrong - note in the following that "server must execute the method" is similar to saying there was a "cache miss":
- etag matches, date is fresh: return status 304
- etag matches, but date is stale: origin server must execute the method and return that response
- etag does not match: origin server must execute the method and return that response, regardless of any date-based validation information in request
- etag matches, and there is no date-based header: return status 304
- no etag in request; date is fresh: return status 304
- no etag in request; date is stale: origin server must execute the method and return that response
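The six cases above collapse into a small decision function. This Python sketch encodes only the interpretation of RFC 2616 given in this post; the function name `should_return_304` and the use of `None` to mean "header absent from the request" are assumptions made for illustration. Note that the later revision of HTTP/1.1 (RFC 7232) tells servers to ignore If-Modified-Since when If-None-Match is present, so the "etag matches, but date is stale" case would come out differently there.

```python
from typing import Optional


def should_return_304(etag_matches: Optional[bool], date_fresh: Optional[bool]) -> bool:
    """Decide between 304 and executing the method.

    etag_matches: True/False for the If-None-Match comparison, None if absent.
    date_fresh:   True/False for the If-Modified-Since comparison, None if absent.
    """
    if etag_matches is None and date_fresh is None:
        return False  # unconditional request: always execute the method
    if etag_matches is False:
        return False  # ETag mismatch wins regardless of any date
    if date_fresh is False:
        return False  # date says the resource changed
    return True       # every validator that was sent agrees
```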
For reference, here are the sections of the RFC that I used to determine the above recipe.

Invalidation


The third mechanism supporting caching (beyond Expiration and Validation) is invalidation. However, here you shouldn't need to do anything special with your web application -- it should be handled implicitly by any gateway or proxy cache that is in place. What you should do is return ETag and Last-Modified information in any response for a given resource (including GET, POST, PUT, DELETE) - such that any intermediary cache can not only cache as needed (for GETs), but also evict/update/otherwise-deal-with URLs that have been changed (via POST, PUT and DELETE).
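To illustrate what an intermediary does on your behalf, here is a toy Python sketch of a cache that evicts a URL whenever a POST, PUT or DELETE passes through it. The class `TinyCache` and the `origin` callable are hypothetical, and the sketch ignores freshness, validation and headers entirely; it only shows the invalidation step.

```python
from typing import Callable, Dict

UNSAFE_METHODS = {"POST", "PUT", "DELETE"}


class TinyCache:
    """Toy gateway cache: stores GET bodies, evicts on unsafe methods."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def handle(self, method: str, url: str, origin: Callable[[str, str], bytes]) -> bytes:
        method = method.upper()
        if method in UNSAFE_METHODS:
            self._store.pop(url, None)   # invalidate any cached copy
            return origin(method, url)   # never answered from cache
        if method == "GET":
            if url not in self._store:
                self._store[url] = origin(method, url)
            return self._store[url]
        return origin(method, url)       # anything else passes straight through
```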

Design Discussion


Checking for ETag alone is not as effective a caching solution as might be accomplished with use of If-Modified-Since. This is because the cache must validate the value of the ETag before returning a 304. In the case of a dynamic system where state changes frequently or unpredictably, however, ETag might be the best approach. In other cases where it is known e.g. that state won't change until a given time, use of If-Modified-Since is likely to be the most scalable solution, since the application can offload requests to reverse proxies.
The Expires header can also be used in systems where it is known that state will change at precise times; however, this calls for client and server clocks to be in sync, and requires explicit updates to the Expires header value when expiration occurs (which, if forgotten, leads to permanent staleness and zero benefit from a cache).
Note that date-based validation is limited to one-second precision. So it's feasible that a client could fetch a resource, saving the last-modified date in the response, then change the resource (or some other client could change it) and subsequently request it again immediately, but receive a 304 response since all of this took place within the same second. As such, I'd endorse use of the ETag as the preferred caching protocol if your service expects anything resembling high concurrency and requires strict correctness.
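The one-second problem is easy to reproduce. In this Python sketch, two versions of a resource are written within the same second: their Last-Modified values come out identical, while a content-hash ETag (an assumed implementation choice, as is the helper name `validators`) still tells them apart.

```python
import hashlib
from datetime import datetime, timezone
from email.utils import format_datetime


def validators(body: bytes, modified: datetime):
    """Return (Last-Modified value, ETag value) for one version of a resource."""
    last_modified = format_datetime(modified, usegmt=True)  # drops sub-second part
    etag = '"' + hashlib.sha256(body).hexdigest() + '"'
    return last_modified, etag


# Two writes 800 ms apart, inside the same wall-clock second.
first = validators(b"version 1", datetime(2011, 11, 16, 12, 0, 0, 100000, tzinfo=timezone.utc))
second = validators(b"version 2", datetime(2011, 11, 16, 12, 0, 0, 900000, tzinfo=timezone.utc))

same_date = first[0] == second[0]   # True: the date cannot tell them apart
same_etag = first[1] == second[1]   # False: the ETag can
```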
If proxy/reverse caches are expected to be part of the ecosystem and you want to control their behavior independently, consider using these headers (see RFC 2616, section 14.9.3 for details):
s-maxage=[seconds] - similar to max-age, except that it only applies to shared caches.
proxy-revalidate - similar to must-revalidate, except that it only applies to proxy caches. From this article:
...useful when an authenticated page is being cached in the browser. You don’t want the proxy to cache and serve the page, whereas it’s fine for your browser to cache and serve the page.
From my understanding of the protocol so far, I'd also suggest adding the public header value to responses intended to be cacheable if HTTP authentication is involved:
public --- marks authenticated responses as cacheable; normally, if HTTP authentication is required, responses are automatically private. From HTTP 1.1 spec: Indicates that the response MAY be cached by any cache, even if it would normally be non-cacheable or cacheable only within a non-shared cache.
Here's additional information from HTTP 1.1 spec, section 14.8:
A user agent that wishes to authenticate itself with a server – usually, but not necessarily, after receiving a 401 response – does so by including an Authorization request-header field with the request. When a shared cache (see section 13.7) receives a request containing an Authorization field, it MUST NOT return the corresponding response as a reply to any other request, except that it MAY use the response in a subsequent request if s-maxage, must-revalidate or public headers are included in the response (see section 13.7 for details).
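The rule quoted above can be sketched as a single check in Python. The function name `shared_cache_may_reuse` is hypothetical, and the sketch covers only the Authorization rule from section 14.8; it does not look at no-store, private, or freshness.

```python
def shared_cache_may_reuse(request_headers: dict, response_cache_control: str) -> bool:
    """May a shared cache reuse this response for a later request?"""
    # Header names are case-insensitive.
    has_auth = any(name.lower() == "authorization" for name in request_headers)
    if not has_auth:
        return True  # the Authorization restriction does not apply
    directives = {
        part.strip().partition("=")[0].lower()
        for part in response_cache_control.split(",")
    }
    # Any one of these directives lifts the restriction.
    return bool(directives & {"s-maxage", "must-revalidate", "public"})
```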

Other


Here are some things you can do to make your website more cache-friendly and cache-aware. Not all of these will necessarily apply to your context, especially when dealing with dynamic content.
Note that one assumption made in this writeup is that there is only one representation for any given URL. This may not be the case, however; representations may vary as per JSON vs XML encoding, use of compression or not, special markup for a given browser (or other user agent), etc. This is where the Vary header comes into play; here's a nice explanation of this pattern.
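One way to picture what Vary does: a cache can no longer key its entries on the URL alone, and must fold in the request headers that the response names. A minimal Python sketch, with the hypothetical helper `cache_key`; the special value `Vary: *` (which effectively means "never reuse") is not handled here.

```python
def cache_key(url: str, request_headers: dict, vary: str) -> tuple:
    """Build a cache key from the URL plus the request headers named in Vary."""
    lowered = {name.lower(): value for name, value in request_headers.items()}
    names = sorted(n.strip().lower() for n in vary.split(",") if n.strip())
    # Missing headers count as empty, so "absent" is also a distinct variant.
    return (url,) + tuple((name, lowered.get(name, "")) for name in names)
```

With `Vary: Accept`, a JSON request and an XML request for the same URL produce different keys and therefore separate cache entries.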
There are many open source/commercial proxy and reverse-proxy caches available, notably Squid and Varnish (not to mention Apache, Nginx, etc.).

Resources