One way to make your websites faster is to use web caching. There is a lot of confusion about web caching: developers are often unsure how caching information is interpreted by proxies, browsers and crawlers. In this post I will cover web caching in detail.
What is Web Caching?
Caching is storing something temporarily for fast retrieval later on.
Web caching is the temporary storage of HTTP responses for fast retrieval later on.
Advantages of web caching
Web caching reduces the number of requests made to the server, so less bandwidth is consumed and the load on the web server is reduced. It can also let users view a web page when the web server is down.
Systems using web caching
Search engines, web browsers, content delivery networks (CDNs) and web proxies are some of the systems that widely cache web files. Each of these systems caches for a different purpose.
Web caching in CDNs, proxies and web browsers
The caching behavior of these systems can be controlled using HTTP caching headers or their http-equiv meta tag equivalents. Note that meta tags are only seen by clients that parse the HTML; proxies and CDNs generally act on HTTP headers alone, so headers are the more reliable option. These systems cache in order to reduce bandwidth usage and the load on the web server.
We can control the caching behavior using the Last-Modified/If-Modified-Since, ETag/If-None-Match, Cache-Control and Expires headers (or meta tags). Cache-Control was introduced in HTTP/1.1, whereas the Expires header was introduced in HTTP/1.0, so we should send both for the best client support. Similarly, Last-Modified/If-Modified-Since was introduced in HTTP/1.0 but ETag/If-None-Match was introduced in HTTP/1.1, so Last-Modified/If-Modified-Since is the pair we can rely on everywhere.
The Cache-Control header instructs these systems how to cache the response and is responsible for controlling freshness. Its main directives are:
- public: The web page can be cached by any cache and can be served to any user.
- private: A cache on these systems can serve many users or a single user. A cache that serves a single user (such as a browser cache) is called a non-shared cache; a cache that serves many users (such as a proxy or CDN) is called a shared cache. private indicates that the response may only be stored in a non-shared cache.
- no-cache: Systems will cache the response, but before serving it they must revalidate it with the server. To do so they send a GET request with an If-Modified-Since header (set to the date from Last-Modified). If the server responds with 304 then the cached version is served; if the server responds with 200 then the newly received response is served and the old response is removed from the cache. The If-Modified-Since header is only sent if the server sent a Last-Modified header with the cached response. Without it, the client revalidates without an If-Modified-Since header, so the server has no way to identify the request as a validation request and always responds with 200, which causes a full re-fetch. We can use ETag/If-None-Match instead of Last-Modified/If-Modified-Since.
- no-store: Systems do not cache the response at all.
- max-age: Specifies, in seconds, how long the response stays fresh in the cache. After that the response has expired and must be re-fetched or revalidated before it is served again.
- s-maxage: Same as max-age but applies only to shared caches such as proxy servers and CDNs.
- must-revalidate: Specifies that once the cached response has expired, the system must not serve it without first revalidating it with the server, no matter what the condition is (for example, even if the server cannot be reached). To force revalidation on every request, use no-cache instead. Revalidation is done with an If-Modified-Since request. The If-Modified-Since header is only sent if the server sent a Last-Modified header with the cached response. Without it, the client revalidates without an If-Modified-Since header, so the server has no way to identify the request as a validation request and always responds with 200, which causes a full re-fetch. We can use ETag/If-None-Match instead of Last-Modified/If-Modified-Since.
- proxy-revalidate: Same as must-revalidate but for proxy servers.
- pre-check and post-check: These values are supported by Internet Explorer only. They provide finer control over expiry time than max-age. I am skipping these two directives because they are not supported in other browsers, so it is not important to learn about them.
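To make the directive list above more concrete, here is a minimal Python sketch of how a cache might split a Cache-Control value into its directives. The function name parse_cache_control is made up for this illustration, and the parsing is deliberately simplified: it splits on commas, so quoted arguments that themselves contain commas are not handled.

```python
from typing import Dict, Optional


def parse_cache_control(header_value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control value into a {directive: argument} mapping.

    Hypothetical helper for illustration only. Directives without an
    argument (public, no-store, ...) map to None.
    """
    directives: Dict[str, Optional[str]] = {}
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition("=")
        # Directive names are case-insensitive, so normalise to lowercase.
        directives[name.strip().lower()] = value.strip().strip('"') if sep else None
    return directives


parsed = parse_cache_control("public, max-age=3600, must-revalidate")
print(parsed)
```

A cache would then look up keys such as "no-store" or "max-age" in the returned mapping to decide what to do with the response.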
The Expires header can also be used to instruct these systems how to cache the response. If Expires is set to a future date and time then the response is cached until that time and no requests are made to the server. If it is set to a past time or to -1 then these systems treat the response as already expired and do not serve it from cache. The Expires header has no way to instruct the client to revalidate a response that has not yet expired: even if we send Expires together with Last-Modified, the client will not revalidate until the expiry time has passed.
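As a small illustration, an Expires value can be generated with Python's standard library. This is only a sketch: the helper name expires_header and the one-hour lifetime are assumptions chosen for the example, and the input time must be a timezone-aware UTC datetime.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def expires_header(now: datetime, lifetime_seconds: int) -> str:
    """Return an Expires value that lies lifetime_seconds after now.

    Hypothetical helper: now must be timezone-aware and in UTC,
    otherwise format_datetime(..., usegmt=True) raises ValueError.
    """
    return format_datetime(now + timedelta(seconds=lifetime_seconds), usegmt=True)


now = datetime(2015, 8, 15, 9, 0, 0, tzinfo=timezone.utc)
value = expires_header(now, 3600)
print("Expires:", value)  # Expires: Sat, 15 Aug 2015 10:00:00 GMT
```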
Let’s see some examples of using these headers:
<meta http-equiv="Cache-Control" content="no-store" />
Here these systems will not cache the response.
Cache-Control: no-cache, must-revalidate, expires=360000000
Here these systems will cache the response, but the client will try to revalidate it before serving. Since we did not provide a Last-Modified header, the client sends the revalidation request without If-Modified-Since, so the server responds with a 200 status code, which means the page is fetched again in full. Note that expires is not a standard Cache-Control directive, so caches ignore it; expiry is set with max-age or with the separate Expires header.
Last-Modified: Thu, 15 Aug 2011 09:00:00 GMT
Here, assuming the response also carries an Expires header set to 15 Aug 2015 09:00:00 GMT, the browser will cache the document until then. The client will not revalidate the cached copy before serving it.
If the cached document expires then it is re-fetched or revalidated. It is revalidated if a Last-Modified or ETag header was provided by the server when the response was stored in the cache. So if Last-Modified (set to the date when the response was last modified) is present, the client sends an If-Modified-Since header to confirm whether the cached copy is still valid. If the server responds with 304 then the client continues using the cached copy even though it had expired. If the server responds with a 200 status code then the cached copy is removed and the new response is served to the user. ETag works the same way, with the client sending If-None-Match. An ETag is an opaque identifier for a specific version of the response, often a hash calculated from the response content. The server uses it to check whether the document has been modified and responds accordingly. If neither Last-Modified nor ETag was provided when the document was cached, then after the cache entry expires the client re-fetches the whole document.
Let’s see an example to make it clear how Last-Modified and ETag work. Suppose we have two files, one.js and two.js.
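The client side of this revalidation step can be sketched in a few lines of Python. Both function names are invented for this illustration, header names are assumed to be stored with exactly the capitalisation shown, and real caches handle many more status codes than the two covered here.

```python
from typing import Dict, Optional


def conditional_headers(cached_headers: Dict[str, str]) -> Dict[str, str]:
    """Build revalidation request headers from a cached response's validators."""
    request_headers: Dict[str, str] = {}
    if "Last-Modified" in cached_headers:
        request_headers["If-Modified-Since"] = cached_headers["Last-Modified"]
    if "ETag" in cached_headers:
        request_headers["If-None-Match"] = cached_headers["ETag"]
    # An empty result means there is no validator: a plain re-fetch is needed.
    return request_headers


def body_to_serve(status: int, cached_body: bytes, new_body: Optional[bytes]) -> bytes:
    """Pick the body to serve once the server has answered the revalidation."""
    if status == 304:
        return cached_body  # server confirmed the cached copy is still valid
    if status == 200 and new_body is not None:
        return new_body  # the cached copy is replaced by the fresh response
    raise ValueError(f"unhandled status {status}")


print(conditional_headers({"Last-Modified": "Sat, 15 Aug 2015 09:00:00 GMT"}))
```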
one.js HTTP response
Expires: Thu, 15 Aug 2060 09:00:00 GMT
Cache-Control: public, must-revalidate, expires=360000000
two.js HTTP response
Cache-Control: public, expires=360000000
For two.js we did not provide a Last-Modified header when it was cached, so once it expires in 2060 the client will download the file again instead of revalidating with If-Modified-Since. For one.js, assuming a Last-Modified header was also sent, the client will verify the cached copy with If-Modified-Since once it expires, and because of must-revalidate it may not serve the expired copy without doing so. If the server responds with 304 then the client continues using the cached copy. The expires=... token inside Cache-Control is not a standard directive and has no effect; the expiry in this example comes from the Expires header.
Sometimes web servers also cache responses so that they don’t have to read the same files from disk again and again. Sometimes browsers, CDNs or proxies need the request to be freshly executed on the server instead, in which case they can send the Pragma: no-cache HTTP request header to ask for a fresh response.
How is expiration time calculated
expirationTime = responseTime + freshnessLifetime – currentAge
The freshness lifetime is calculated based on several headers. If a “Cache-Control: max-age=N” header is specified, then the freshness lifetime is equal to N seconds. If this header is not present, which is very often the case, then we look for an “Expires” header. If an “Expires” header exists, then its value minus the value of the “Date” header determines the freshness lifetime. Finally, if neither header is present, then we look for a “Last-Modified” header. If this header is present, then the cache’s freshness lifetime is heuristically set to the value of the “Date” header minus the value of the “Last-Modified” header, with that difference divided by 10. If none of these headers is present then the response is not cached.
responseTime is the time at which the response was received according to the client.
The current age is usually close to zero, but is influenced by the presence of an Age header, which proxy caches may add to indicate the length of time a document has been sitting in their cache. The precise algorithm, which attempts to avoid errors resulting from clock skew, is described in RFC 2616 section 13.2.3.
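The calculation described in this section can be written down as a short Python sketch. The function names are made up for this example, the Date header is assumed to be present whenever max-age is absent, and malformed values (such as Expires: -1) are not handled.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Dict, Optional


def freshness_lifetime(headers: Dict[str, str]) -> Optional[timedelta]:
    """Return the freshness lifetime, or None if it cannot be calculated."""
    # 1. Cache-Control: max-age=N wins if present.
    for part in headers.get("Cache-Control", "").split(","):
        name, _, value = part.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return timedelta(seconds=int(value))
    date = parsedate_to_datetime(headers["Date"])
    # 2. Otherwise use Expires minus Date.
    if "Expires" in headers:
        return parsedate_to_datetime(headers["Expires"]) - date
    # 3. Otherwise fall back to the heuristic (Date - Last-Modified) / 10.
    if "Last-Modified" in headers:
        return (date - parsedate_to_datetime(headers["Last-Modified"])) / 10
    return None


def expiration_time(response_time: datetime, lifetime: timedelta,
                    current_age: timedelta = timedelta(0)) -> datetime:
    """expirationTime = responseTime + freshnessLifetime - currentAge"""
    return response_time + lifetime - current_age


heuristic = freshness_lifetime({
    "Date": "Sat, 15 Aug 2015 09:00:00 GMT",
    "Last-Modified": "Wed, 05 Aug 2015 09:00:00 GMT",
})
print(heuristic)  # 1 day, 0:00:00
```

In the usage example the document was last modified ten days before the response date, so the heuristic lifetime is one tenth of that: one day.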
Web caching in search engines
Search engines cache pages so that if the website is down then they can provide the cached version of the web page to the user. This caching is done by a search engine component called the indexer. Google search results have traditionally offered a link to the cached version of a web page.
We can prevent search engines from displaying the cached version of a web page in the following ways:
HTTP Response header
Search engine crawlers (the component of a search engine responsible for downloading pages) don’t cache. They decide when to revisit web pages using complicated algorithms. Still, there is a way to get crawlers to reuse what they have already fetched so that they don’t download the same content again and again.
We can keep search engines from downloading the same content or page again and again by using the Last-Modified/If-Modified-Since or ETag/If-None-Match headers.
Example of using Last-Modified/If-Modified-Since:
When a search engine makes a request, the server returns an HTTP response with a Last-Modified header. This header indicates when the file was last modified.
Now when the search engine revisits the same file, it puts an If-Modified-Since header in the HTTP request.
The server sees the If-Modified-Since header and checks whether the file has been modified since then. If it has been modified then the server returns a normal 200 success response, which can include a new Last-Modified header. If it has not been modified then the server returns a 304 Not Modified response. On receiving a 304 response, search engines consider the previously indexed information to be still fresh and valid.
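The server-side check in this exchange could look roughly like the following Python sketch. The function name respond_if_modified_since is hypothetical, and last_modified is assumed to be a timezone-aware UTC datetime truncated to whole seconds, since HTTP dates have one-second resolution.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Tuple


def respond_if_modified_since(
    request_headers: Dict[str, str], last_modified: datetime
) -> Tuple[int, Dict[str, str]]:
    """Return (status, response headers) for a possibly conditional GET."""
    response_headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    since = request_headers.get("If-Modified-Since")
    if since is not None:
        try:
            if last_modified <= parsedate_to_datetime(since):
                return 304, response_headers  # not modified since that date
        except (TypeError, ValueError):
            pass  # unparsable date: fall through to a full response
    return 200, response_headers


last_modified = datetime(2015, 8, 15, 9, 0, 0, tzinfo=timezone.utc)
status, headers = respond_if_modified_since(
    {"If-Modified-Since": "Sat, 15 Aug 2015 09:00:00 GMT"}, last_modified
)
print(status, headers)
```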
Example of using ETag/If-None-Match:
When a search engine makes a request, the server returns an HTTP response with an ETag header. The ETag is set to a value that identifies this version of the page, such as a hash of the page content.
Now when the search engine revisits the same file, it includes an If-None-Match header in the HTTP request.
The server sees the If-None-Match header and checks whether the content has changed. If it has changed then the server returns a normal 200 success response, which can include a new ETag header. If it has not changed then the server returns a 304 Not Modified response. On receiving a 304 response, search engines consider the previously indexed information to be still fresh and valid.
In this way we can control the caching behavior of both components, the crawler and the indexer. Remember that these techniques have no effect on the crawler’s revisit policy or on the priority it gives a web page.
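Here is a comparable Python sketch for the ETag variant. Using a SHA-256 hash of the body as the ETag is just one possible choice made for this illustration, and both function names are hypothetical.

```python
import hashlib
from typing import Dict, Tuple


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from the body (one possible scheme, assumed here)."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond_if_none_match(
    request_headers: Dict[str, str], body: bytes
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, response headers, payload) for a possibly conditional GET."""
    etag = make_etag(body)
    candidates = []
    for tag in request_headers.get("If-None-Match", "").split(","):
        tag = tag.strip()
        if tag.startswith("W/"):
            tag = tag[2:]  # If-None-Match compares weakly, so drop the weak prefix
        candidates.append(tag)
    if etag in candidates or "*" in candidates:
        return 304, {"ETag": etag}, b""  # unchanged: no body is sent
    return 200, {"ETag": etag}, body


first_status, first_headers, first_payload = respond_if_none_match({}, b"hello")
second_status, _, second_payload = respond_if_none_match(
    {"If-None-Match": first_headers["ETag"]}, b"hello"
)
print(first_status, second_status)  # 200 304
```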
Cache-Control and Expires headers are ignored by the search engines.
If we provide only a Last-Modified or ETag header to a browser, without Cache-Control and Expires, then the browser behaves much like a search engine crawler. So if you return a Last-Modified header in the HTTP response then browsers will cache the response and use If-Modified-Since to revalidate it, although they may first serve it from cache for a short, heuristically calculated period (see the expiration time calculation above).
Which is better, Last-Modified/If-Modified-Since or ETag/If-None-Match?
I prefer to use Last-Modified/If-Modified-Since because there are many clients that don’t support ETag/If-None-Match.
What is required for clients to cache?
The client needs to know how long to hold the response in its cache. You should provide an expiration time or revalidation permission so that clients will cache a document. If neither is provided then the client will not cache the response at all. In short, if the client can calculate the expiration time then the response will be cached.
Expiration time can be provided using Expires or max-age. Re-validation permission can be provided using no-cache, must-revalidate or Last-Modified.
The response below is never cached, because clients don’t know how long to keep the response in the cache.
The response below is cached. The client can calculate the expiration time using the Last-Modified header. If we replace Last-Modified with ETag in this response then the response will not be cached, because clients cannot calculate an expiration time from an ETag.
Last-Modified: Thu, 11 Feb 2011 10:00:00 GMT
Because these headers and techniques are so widely misunderstood, many clients and servers interpret things differently. So always be careful while building a cache-proof website.