Posts Tagged ‘mdb’

memcached and smarty in mdb

I’ve recently begun to work on improving performance in MDB. Since MDB is very much a write-once-read-many setup (where the write-once is the scanning and indexing of the files into your database, and the read-many is the repeated fetching of this data while browsing through the collection), I felt it would benefit very much from a caching solution or two. I looked at a number of different solutions, and have decided to go with optional Memcached support. Memcached is a distributed object caching system – it runs as a server that you can connect to and store/fetch items to/from a memory cache. It can also be distributed among multiple Memached servers. It pretty much works as an enormous hash table – you associate a key with some data, and that key gets hashed to find the location of the data in the hash table. Any data can be stored – basic data such as strings are stored straight, and more complex data such as multi-dimensional php arrays are stored serialized. The server can also be set up to compress data that’s above a certain size, as long as there is a certain percent of space gain by compression.

I essentially implemented memcached caching on a two-layer approach. The first layer is caching of data from the database. Anytime a chunk of data is requested from the database and parsed into a data structure, the data is cached in memcached. This means that when you’re running with the data cached into memcached (cache hot), it is possible to completely eliminate all database activity. There are certain limitations to this though – for example, one query that cannot be cached is the check to see if the database is still updating, using the mutex stored in the database. Caching this check would kind of defeat the purpose of a live check. Also, caching any of the database status pages – the update page, the database consistency check page, and the database stats page, would also defeat the purpose of live checks. But aside from these exceptions, it is possible to cache all other database queries, cutting down a page’s query count from 13 or so to 1 or 2. Since most of the data fetched from the database is stored in a user-agnostic form, it means that this cached data can be used for multiple users. The cache data is expired when something in it is modified. For example, adding a new tag will expire the cached list of tags, requiring it to be fetched and cached again (cache miss) on the next run.

The other layer of caching that is done is the caching of Smarty template output. Smarty has its own caching system, but I chose to use memcached to manually cache the output of Smarty templates – more on this decision later. The output of the Smarty template – essentially, the finished HTML that gets sent to the user – is stored in memcached. This means that, in a similar manner as the database caching, if you’re running cache hot then it’s possible to entirely eliminate the need for any smarty parsing on a page load, which also eliminates the need to fetch any database data, hence the two layer approach. The equivalent speed in that case is almost as fast as serving a straight HTML page to the user.

In general, most of the data being fetched from the database is stored in a user-agnostic format, and the templates are what customize a page to a specific user. For example, when fetching a list of files in a title from the database, it’s just stored as a chunk of data. However, when rendering the page for the user, it’s tailored to that user’s view – making some of the links downloadable if a user has access, giving the option to add and remove tags, etc. Therefore, if both the file data and the template output of a title for a user with download rights is cached, and an anonymous user comes and tries to fetch the same title page, the template output will be a cache miss since the anonymous user is supposed to see a version without download links. However, the database data will still be a cache hit, since it’s the same list of files. So this cached database data will be used to render the anonymous user’s template without download links, and this new template output will then also be cached for subsequent accesses by anonymous users. Therefore, in this situation, only one layer of caching was a cache miss (the template), and another layer was still a cache hit (the database).

This has allowed me to see enormous speedups, especially for data that is either complicated to render (a title with many many files in it) or very frequently displayed (the list of titles on the right side that shows up on every single page). Plus, Memcache allows for a great deal of scalability as you have more and more users on the page – this wasn’t something that I could test, though, as I’m the only user using my home copy of MDB.

I have some very rough benchmarks (these were just 2-second tests, not nearly quantitative in any way. Just to get an idea of the difference):

Listing a title with 78 files in 4 folders:
Database cache miss, template cache miss (no cache at all): 0.15322113 sec
Database cache hit, template cache miss: 0.11778903 sec
Database cache hit, template cache hit: 0.00953102 sec
Speedup: about 16x faster

Rendering a tag cloud with 8 tags:
Database cache miss, template cache miss (no cache at all): 0.05347109 sec
Database cache hit, template cache hit: 0.00882792 sec
This page isn’t user specific so there’s no way the template cache would miss but the database cache would hit.
Speedup: about 6x faster

Keep in mind that my server is pretty fast (I was already getting tenth of a second times or less before caching), so a slower server, or a setup where mysql is running on a different server, would see even more speedup. Also keep in mind that on the very first startup the cache is entirely cold and therefore everything will need to be fetched from the database. Only as it’s used more will the cache warm up. Also please note that after a database update (where the entire list of files may be changed), the entire cache is marked expired so it will be entirely cache cold again. Updating the database is a generally infrequent operation, though, so you don’t have to worry about this too often.

Also, although I post benchmarks here, and it is faster, really the biggest goal of memcached is to lighten the load for scalability – lighten the database load, lighten the cpu load, serve more pages in a given time period, etc. Enable mysql’s query cache along with this and you’ll have a system that barely uses any cpu or database at all.

So why did I choose not to use smarty caching?
The first strike was that after enabling caching, things slowed down. I don’t know why, it may have been a misconfiguration on my part. It wasn’t because of a cold cache, I tried multiple times. But whatever, this actually isn’t the main reason.
Another reason is that Smarty caches compiled templates to disk, not into memory. There are apparently other cache handlers that allow you to use things such as memcached for smarty caching, but after some browsing through some solutions I didn’t feel like lumping a whole bunch of other people’s stuff with mdb.
The third, and probably biggest reason, is that smarty caches on a per-template basis. This means it caches the entire template, and only one template, at a time. Caching in smarty is actually a pain to implement if your system wasn’t written to take advantage of caching from the start. My templates are a big combination of dynamic and non-dynamic data for ease of development. To actually cache as much as possible many of these templates would have to be broken up into multiple sub-templates and/or rewritten, making things more difficult. Speaking of sub-templates, this is also another problem with caching on a per-template basis. My pages are already made up of a number of templates. There is the left nav template, the right titlelist template, and the middle. The middle is where all the complicated stuff happens – for example, when drawing the list of files in a title to be dynamically collapsible by folder, each file needs to have a class that depends on the class of the parent folder before it, meaning all files need to be looped through and given certain CSS classes/ids. I chose to make a template that shows one file in the table – therefore the css classes can be figured out, and displayed using the template, and then it’ll move on to the next file. The other way of doing it – looping through and adding extra data into the array that stores the files, before sending it to the template to be displayed (looped through again to be displayed as table rows), would have almost doubled the work it was doing. The problem is that smarty would try and cache that one row for that one file, meaning if your file database has 9631 files, you’d have 9631 cached copies of the single row template, each with a different cache id. Why bother? Even with cache, you’d still be iterating through files on each title page display. With memcache being used outside of smarty I can append all the bits and pieces of template HTML that make up a title’s file list into a single html chunk, then cache that whole title’s chunk of html once. On the next load, I’d just pull up the full cached HTML for that title and never have to iterate at all.

But anyway, memcached support is in MDB now. It’s available in gitphp right now. I’m still working on integrating the last couple parts of it, and it still needs to go through some testing to make sure that cached data is being expired properly where appropriate. But after that I will probably release a new version with all the memcache updates as well as the other minor fixes I’ve done.

MDB 0.0.6

Just over a week after the last update, here is MDB 0.0.6. MDB had the most potential for features to be added, so that’s what I focused on. Here is the rather large changelog:

  • Database update times are logged in a table
  • The updatedb page shows the output of the status test if something goes wrong
  • PHP commandline binary used in db updates is configurable
  • Allow setting a cooloff time between database updates (for example, if last update was less than X seconds ago, skip update)
  • Fixed the shell hack mutex test in some places
  • Allow using the database to store update status (rather than just testing with ps and grep) – please read the notes in the config file before enabling this!
  • Add debug option to dump a bunch of info during execution
  • Fix OPTIMIZE TABLE calls that weren’t working
  • The optimize configure option now applies to updatedb (so optimizing after updating can be optional too)
  • Updatedb status test reports when cooloff time hasn’t been met
  • Mapping sanity check has been moved to a more generic db check page that will also check consistency of dbmutex
  • Optimizing is done on dbcheck page instead of during db stats
  • Use php’s builtin uname for dbstats page instead of running external uname
  • Clean up uptime output on dbstats page
  • Avoid caching password hash in session key
  • Add user management page so admins can add and delete users, and change user privileges (from/to admin)
  • Add preferences page for users to change passwords
  • Metadata links for titles have been moved to a generic links table, rather than an AnimeNFO-specific table
  • Title list no longer depends on images that I couldn’t include – colored by CSS now
  • Use newer mootools rather than old moo.fx for javascript effects (scripts included now)
  • CSS split into themes – formatting is in a core css file and colors/styles are in separate theme files. Comes with three themes – Dark Aqua, Dark Lime, and Light
  • Users can set their preferred css theme on their preferences page
  • Some PHP warnings are cleaned up
  • Chooses an optimal title listing method (SQL queries or iteration) by the number of titles in the system
  • Title listing using SQL uses prepared statements for speedup
  • A footer now shows db status (size, titles, files, etc), the number of queries executed, and the execution time of the page
  • Footer and main page show a warning when database is upgrading
  • ADOdb caching now works (it didn’t before) – please read the notes in the config file about adodb caching’s performance hit before enabling!
  • Remove limit statements to be more portable with other databases
  • Remove file_title association table and embed the title id in the file database instead – files can only have one title
  • Fix bug with tag substring collisions (e.g. “girls” and “girls with guns” being treated as the same tag)
  • Change delete links to submit buttons that use POST – safer to avoid accidental triggering by spiders (although I don’t know why you’d spider this in the first place)
  • Database operations (updatedb, dbstats, dbcheck) are now all triggered from one admin-only database page
  • Minor optimizations of SQL queries
  • Only count directories as titles
  • Filelist on title page is now shown hierarchically indented by directory
  • Directories on title page are now collapsible by javascript. All files are shown by default, so will degrade gracefully if javascript is off – directories just won’t collapse

This version has significant architectural differences from the previous versions. Everything is included now so you may want to get rid of your own copies of moo.fx and titlelist images. The database is also structurally different. I unfortunately have not written a database upgrade script. In general though, the following changes are required:

  1. Add the dbupdate table
  2. Add the preferences table
  3. Remove the file_title table and add the new field to the files table. If you know how to do this yourself by phpmyadmin or the like then the description of the field is in the SQL file. If not, then delete the table and recreate it using the one in the SQL file. Either way, run updatedb to restore consistency to the database.
  4. Delete the animenfo table and create the links table. Unfortunately there isn’t really an easy way to migrate the data from one table to another. If you really have a lot of data in the animenfo table you want migrated, I can help you write the SQL to do so.

All other tables have remained the same, so you can keep your same users and stuff. But since some tables have changed and some tables haven’t, it may be easiest to just drop all tables in the database, import the SQL fresh, and update your db. I apologize for any inconvenience.

If you have no users and have trouble figuring out how the users table works (I know I forgot to include a default user before), I have included an SQL file that creates an admin user with username/password root/root. From there you can add more users and delete the root user if you’d like. Of course, make sure you have at least 1 admin!

The tarball is on the MDB page.

hat trick

I’ve been going through and doing some cleanup on my old php projects. Some of them had some pretty crappy coding practices, since they were written back when I was still learning PHP and SQL. For example, they had all functions lumped together in one function library – made it easier to write, but it meant that every single operation loaded all the code for every single function into php’s memory. Why load 10 functions into memory when you’re only using one of them? Also, SQL functions were mainly used for fetching records, and most of the sorting/association was done by PHP. I did it at the time because I knew PHP a lot better than I did SQL, and it meant I could avoid some of the more complicated SQL stuff such as JOINs and subqueries. However in practice it’s better to have as much of that done by the SQL server as possible, since it’s much faster and more efficient than PHP at managing data. So I managed to do major cleanup of all three of my PHP projects. Unfortunately I didn’t really get around to adding new features, which is why the three projects were only bumped minor versions, and not major versions.

The changelogs are as follows:
Codex:

  • Move config to its own directory
  • Smarty and ADOdb prefixes are now configurable
  • Break up function library into a file for each function that’s loaded on demand
  • Remove usage of obsolete HTTP_GET_VARS and HTTP_POST_VARS
  • Rewrite matchup_data to use SQL and be more efficient
  • SQL file is now included in tarball
  • Use a more sane versioning scheme (e.g. 0.3.3 instead of v03c)

Codex actually still has some speed issues – part of it is due to some more shitty code that needs to get cleaned up, part of it is due to the sheer amount of data it fetches and cross-references at any given time.

MDB:

  • Prevent SQL injections in searches
  • Move config to its own directory
  • Break up function library into a file for each function that’s loaded on demand
  • Made ADOdb prefix configurable
  • Rewrite SQL/PHP in functions unmapped, taglist, prunetags, userhistory, titlelist
  • Fixed file download function which broke in some browsers
  • Use a transaction during database update to avoid users seeing corrupt data during update

GitPHP:

  • Move config to its own dir
  • Break up function library into a file for each function that’s loaded on demand
  • Use a more sane versioning scheme (e.g. 0.0.4 instead of v01d)
  • Fix broken snapshot URL on some pages
  • Fix missing header that broke formatting on tag page
  • Fix duplicated “…” that appeared on truncated shortlogs
  • Fix sorting functions on main project list page
  • Project lists searching a directory (rather than using a predefined list in the config) now search recursively
  • Add “TXT” plaintext project list link on front page
  • Fix ref icons next to commits with tags/heads
  • Some adjustments to links on pages to match more closely with current gitweb
  • Links to files on commitdiff pages now work
  • Diff binary path is now configurable
  • Now works on Windows with msysgit! Thanks to testing and bugfixes from Slash

Slash was a big help with much of the Windows testing/bugfixes on this release, as I don’t have any windows boxes. GitPHP can now work drop-in on Windows as long as you read the config and set the paths to your binaries correctly.

Tarballs of all new releases are available on their respective project pages. Since the directory structure of these releases have changed significantly, I recommend deleting files from the previous version before extracting this version.

I have also added a bugtracker on this site for people to report bugs with the software, at http://mantis.xiphux.com. Given how many people actually use anything I write (countable on one hand), I don’t forsee it being used much, but at least it’s there if I or anyone else ever needs it.

Return top