- July 11th, 2008
I’ve recently begun to work on improving performance in MDB. Since MDB is very much a write-once-read-many setup (where the write-once is the scanning and indexing of the files into your database, and the read-many is the repeated fetching of this data while browsing through the collection), I felt it would benefit a great deal from a caching solution or two. I looked at a number of different solutions, and have decided to go with optional Memcached support. Memcached is a distributed object caching system – it runs as a server that you can connect to and store/fetch items to/from a memory cache, and the cache can be distributed among multiple Memcached servers. It pretty much works as an enormous hash table – you associate a key with some data, and that key gets hashed to find the location of the data in the table. Any data can be stored – basic data such as strings are stored as-is, and more complex data such as multi-dimensional PHP arrays are stored serialized. The server can also be set up to compress data that’s above a certain size, as long as compression yields a certain percentage of space savings.
I implemented memcached caching using a two-layer approach. The first layer is caching of data from the database. Any time a chunk of data is requested from the database and parsed into a data structure, that structure is cached in memcached. This means that when you’re running with the data cached in memcached (cache hot), it is possible to completely eliminate all database activity. There are certain limitations, though – for example, one query that cannot be cached is the check to see whether the database is still updating, using the mutex stored in the database. Caching this check would defeat the purpose of a live check. The same goes for the database status pages – the update page, the database consistency check page, and the database stats page. But aside from these exceptions, it is possible to cache all other database queries, cutting a page’s query count from 13 or so down to 1 or 2. Since most of the data fetched from the database is stored in a user-agnostic form, this cached data can be shared by multiple users. A cache entry is expired when something in it is modified. For example, adding a new tag will expire the cached list of tags, requiring it to be fetched and cached again (cache miss) on the next run.
The other layer of caching is the caching of Smarty template output. Smarty has its own caching system, but I chose to use memcached to manually cache the output of Smarty templates – more on this decision later. The output of the Smarty template – essentially, the finished HTML that gets sent to the user – is stored in memcached. This means that, much like the database caching, if you’re running cache hot it’s possible to entirely eliminate any Smarty parsing on a page load, which in turn eliminates the need to fetch any database data – hence the two-layer approach. In that case the page is served almost as fast as a static HTML page.
In general, most of the data being fetched from the database is stored in a user-agnostic format, and the templates are what customize a page to a specific user. For example, when fetching a list of files in a title from the database, it’s just stored as a chunk of data. However, when rendering the page for the user, it’s tailored to that user’s view – making some of the links downloadable if a user has access, giving the option to add and remove tags, etc. Therefore, if both the file data and the template output of a title for a user with download rights is cached, and an anonymous user comes and tries to fetch the same title page, the template output will be a cache miss since the anonymous user is supposed to see a version without download links. However, the database data will still be a cache hit, since it’s the same list of files. So this cached database data will be used to render the anonymous user’s template without download links, and this new template output will then also be cached for subsequent accesses by anonymous users. Therefore, in this situation, only one layer of caching was a cache miss (the template), and another layer was still a cache hit (the database).
This has allowed me to see enormous speedups, especially for data that is either complicated to render (a title with many, many files in it) or very frequently displayed (the list of titles on the right side that shows up on every single page). Plus, Memcached allows for a great deal of scalability as you get more and more users on the site – this wasn’t something I could test, though, as I’m the only user of my home copy of MDB.
I have some very rough benchmarks (these were just quick two-second tests, not quantitative in any way – just to get an idea of the difference):
Listing a title with 78 files in 4 folders:
Database cache miss, template cache miss (no cache at all): 0.15322113 sec
Database cache hit, template cache miss: 0.11778903 sec
Database cache hit, template cache hit: 0.00953102 sec
Speedup: about 16x faster
Rendering a tag cloud with 8 tags:
Database cache miss, template cache miss (no cache at all): 0.05347109 sec
Database cache hit, template cache hit: 0.00882792 sec
This page isn’t user specific so there’s no way the template cache would miss but the database cache would hit.
Speedup: about 6x faster
Keep in mind that my server is pretty fast (I was already getting tenth-of-a-second times or less before caching), so a slower server, or a setup where MySQL is running on a different server, would see even more speedup. Also keep in mind that on the very first startup the cache is entirely cold, so everything will need to be fetched from the database; only as the site is used more will the cache warm up. And note that after a database update (where the entire list of files may be changed), the entire cache is marked expired, so it will be entirely cold again. Updating the database is a generally infrequent operation, though, so you don’t have to worry about this too often.
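One common way to mark an entire cache expired at once (as happens after a database update) without deleting keys one by one is to fold a generation number into every key; bumping the generation makes every old entry unreachable. I’m not claiming this is exactly what MDB does internally – it’s just a sketch of the technique, with names of my own:

```python
_cache = {"generation": 0}  # dict stands in for memcached

def _versioned(key):
    # Every key is prefixed with the current generation number.
    return f"gen{_cache['generation']}:{key}"

def cache_get(key):
    return _cache.get(_versioned(key))

def cache_set(key, value):
    _cache[_versioned(key)] = value

def expire_all():
    # After a database update: bump the generation so every previously
    # cached entry becomes unreachable (effectively cache cold).
    _cache["generation"] += 1
```

Old entries linger until evicted, but no live lookup can ever see them again.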
Also, although I post benchmarks here, and it is faster, the biggest goal of memcached is really to lighten the load for scalability – lighten the database load, lighten the CPU load, serve more pages in a given time period, and so on. Enable MySQL’s query cache along with this and you’ll have a system that barely touches the CPU or the database at all.
So why did I choose not to use Smarty’s caching?
The first strike was that after enabling caching, things actually slowed down. I don’t know why; it may have been a misconfiguration on my part. It wasn’t because of a cold cache – I tried multiple times. But whatever, this actually isn’t the main reason.
Another reason is that Smarty caches to disk, not to memory. There are apparently cache handlers that let you use things such as memcached for Smarty caching, but after browsing through some of those solutions I didn’t feel like lumping a whole bunch of other people’s code in with MDB.
The third, and probably biggest, reason is that Smarty caches on a per-template basis. This means it caches the entire template, and only one template, at a time. Caching in Smarty is actually a pain to implement if your system wasn’t written to take advantage of caching from the start. My templates are a big combination of dynamic and non-dynamic data for ease of development; to cache as much as possible, many of these templates would have to be broken up into multiple sub-templates and/or rewritten, making things more difficult.
Speaking of sub-templates, they expose another problem with caching on a per-template basis. My pages are already made up of a number of templates: the left nav template, the right titlelist template, and the middle. The middle is where all the complicated stuff happens – for example, when drawing the list of files in a title so it can be dynamically collapsed by folder, each file needs a CSS class that depends on the class of the parent folder before it, meaning all files need to be looped through and given certain CSS classes/ids. I chose to make a template that shows one file in the table – the CSS classes can be figured out and displayed using the template, and then it moves on to the next file. The alternative – looping through the array that stores the files to add the extra data, then sending it to the template to be looped through again and displayed as table rows – would have almost doubled the work. The problem is that Smarty would try to cache that one row for that one file, meaning if your file database has 9631 files, you’d have 9631 cached copies of the single-row template, each with a different cache id. Why bother? Even with the cache, you’d still be iterating through files on each title page display.
With memcached being used outside of Smarty, I can append all the bits and pieces of template HTML that make up a title’s file list into a single chunk, then cache that whole chunk once. On the next load, I just pull up the full cached HTML for that title and never have to iterate at all.
But anyway, memcached support is in MDB now, and it’s available in gitphp right now. I’m still working on integrating the last couple of parts, and it still needs to go through some testing to make sure that cached data is expired properly where appropriate. After that I will probably release a new version with all the memcached updates as well as the other minor fixes I’ve done.