I’ve introduced a new caching feature in GitPHP: the object cache.
The existing cache in GitPHP uses Smarty caching on a per-template basis. This means that each page has its own cache. However, each time that page needs to be regenerated, it hits the git repository for everything it needs.
This meant that effort was being duplicated in lots of places, even with the cache turned on. For example, if you visited the shortlog, you loaded up the data for 50 commits. Ok. Then you visited the log page, and you loaded up the data for the same 50 commits… that’s 50+ redundant git repository hits. Another example is changing the language: the labels are different but the data is exactly the same, yet it still needs to be loaded twice because the cached template is different.
Therefore, I created the object cache. The object cache is a secondary layer of caching that only stores immutable data from git. Git is content-addressed: every object is stored under the hash of its own contents, and the objects link to one another by hash to form a directed acyclic graph. This means that the contents of a hash – for example, a blob or a commit – will be exactly the same weeks from now as they are today. So if we store this immutable data parsed and loaded in the cache, we can use it as many times as we want later, without hitting the git repository for it.
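To make the "a hash always means the same content" point concrete, here’s a small sketch of how git derives a blob’s ID in a SHA-1 repository: it hashes a short header plus the file contents. The function name gitBlobHash is made up for this example and is not part of GitPHP.

```php
<?php
// Compute the ID git assigns to a blob in a SHA-1 repository:
// SHA-1 over the header "blob <length>\0" followed by the raw content.
function gitBlobHash(string $content): string
{
    return sha1('blob ' . strlen($content) . "\0" . $content);
}

// Same content in, same hash out, no matter when or where you run it.
echo gitBlobHash(''), "\n";  // e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (git's empty blob)
echo gitBlobHash("hello\n"), "\n";
```

Because the ID is computed from the content, different content gets a different hash, and that is what makes it safe to cache whatever sits behind a given hash.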
So continuing with the previous example, visiting the shortlog will first load up 50 commits from the repository (this is unavoidable). Then, going to the log will hit the cache for those 50 commits and never hit the repository.
Because this is intended to only cache immutable git data, there are a number of things that it doesn’t cache. For example, which commit is the HEAD isn’t cached, because that changes all the time. Likewise, the blame or history of a blob isn’t cached, because that is technically mutable: it depends on the filename used to refer to the blob, and the same blob can represent more than one file. Here’s the list of known immutable stuff that is cached:
- Commit – its author/committer/times, parent commit(s), comment, tree, containing tag
- Tag – both lightweight tags and tag objects, the object they are tagging (commit or tag), and their message
- Blob – the file contents
- Tree – its contents (sub-trees and blobs)
This option (‘objectcache’ in the config) is separate from the existing template cache option. You can choose not to use it, while continuing to use template caching. Or you can choose to only use the object cache without using the template cache, in order to eliminate most of the git repository calls while absolutely guaranteeing that you’re seeing the most up-to-date data. Or you can use both together for the maximum benefit. Combine that with the recently added Memcache support, and you’ve got yourself a pretty sweet setup.
The object cache also has its own lifetime setting (‘objectcachelifetime’ in the config). Because the data cached in the object cache is immutable, it’s safe to set the object cache’s lifetime extremely high. The default is 86400 seconds (24 hours).
I’ve included some technical details of how this works below. It’s a long read, but if you’re programming-inclined, it’s worth checking out. Part of it is because you will need to know some of this if you intend to make changes to GitPHP code. Part of it is also because it was interesting to implement.
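As a rough illustration, the two settings might look something like this in your config file. The option names come from above; the $gitphp_conf array name, the boolean value, and the one-week lifetime are assumptions made for this example, so check the sample config that ships with your version.

```php
<?php
// Hypothetical config excerpt. The $gitphp_conf array name is an assumption.
$gitphp_conf['objectcache'] = true;            // turn the object cache on
$gitphp_conf['objectcachelifetime'] = 604800;  // one week in seconds (default: 86400)
```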
Technical info
The Cache interface
The object cache uses a wrapper interface around Smarty’s cache system to store arbitrary data. This was done because I wanted to keep the cache interface the same for both the object cache and the template cache. This means that if the user is using the Memcache Smarty interface, both caching methods go to the same place. Likewise, with disk caching, data is stored in the same place. It’s simpler for the user and easier for me to maintain.
The cache system uses a single Smarty template for storing all data. The template does nothing but store and return a string of data, and each individual piece of data cached is distinguished by a unique cache id. Data is serialized into a string and inserted into the template, and is retrieved by fetching the contents of that template and deserializing the data – so the template is never actually output. This interface is in include/cache/Cache.class.php. It is a bit of a hack, but I’ve already used this trick before to cache blob headers when viewing plaintext blobs.
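If it helps to picture the idea, here’s a stripped-down sketch of the "serialize to a string, file it under a cache id" pattern. To stay self-contained it uses a plain array where the real thing uses a Smarty template, and the SketchCache class, its method names, and the cache id format are all invented for this example.

```php
<?php
// Stand-in for the Smarty-backed store: one slot per cache id, each holding a string.
class SketchCache
{
    /** @var array<string, string> cache id => serialized payload */
    private array $slots = [];

    // Serialize any value to a string and file it under its cache id.
    public function set(string $cacheId, $value): void
    {
        $this->slots[$cacheId] = serialize($value);
    }

    // Return the deserialized value, or false when the id is not cached.
    public function get(string $cacheId)
    {
        if (!isset($this->slots[$cacheId])) {
            return false;
        }
        return unserialize($this->slots[$cacheId]);
    }
}

$cache = new SketchCache();
$cache->set('project.git|commit|abc123', ['author' => 'A. Hacker', 'parents' => []]);
$data = $cache->get('project.git|commit|abc123');
echo $data['author'], "\n";  // prints "A. Hacker"
```

The sketch can’t tell a cached false apart from a miss, which is fine for illustration but something a real wrapper would need to handle.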
Loading/storing data
Objects are all retrieved using factory methods on the GitPHP_Project class in include/git/Project.class.php – GetCommit, GetTag, GetBlob, GetTree, etc. Each factory returns a deserialized cached object if one is found, otherwise a new instance of the object. The objects save/update themselves in the cache after some data is loaded from the repository (ReadData in include/git/Commit.class.php, ReadData in include/git/Blob.class.php, etc).
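Here’s a simplified sketch of that "cached copy if we have one, fresh object otherwise" factory. The SketchProject and SketchCommit classes, the Store method, and the array-backed cache are assumptions for illustration. In GitPHP the objects write themselves back to the cache from ReadData; the sketch moves that step onto the project to keep it short.

```php
<?php
class SketchCommit
{
    public string $hash;
    public ?string $title = null;   // null until loaded from the repository

    public function __construct(string $hash)
    {
        $this->hash = $hash;
    }
}

class SketchProject
{
    /** @var array<string, string> stand-in cache: key => serialized object */
    public array $cache = [];

    // Factory: prefer a cached copy, fall back to a fresh, unloaded object.
    public function GetCommit(string $hash): SketchCommit
    {
        $key = 'commit|' . $hash;
        if (isset($this->cache[$key])) {
            return unserialize($this->cache[$key]);
        }
        return new SketchCommit($hash);
    }

    // Save an object once it has read its data from the repository.
    public function Store(SketchCommit $commit): void
    {
        $this->cache['commit|' . $commit->hash] = serialize($commit);
    }
}

$project = new SketchProject();
$first = $project->GetCommit('abc123');   // cache miss: new, empty object
$first->title = 'Initial commit';         // pretend this came from git
$project->Store($first);

$second = $project->GetCommit('abc123');  // cache hit: already populated
echo $second->title, "\n";                // prints "Initial commit"
```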
Serialization
PHP objects have a magic method called __sleep that is called right before serialization. It returns a list of the names of the properties to serialize. In a lot of objects (such as Commit) this just returns all properties. In the Blob object, because we don’t want to serialize the history/blame data (since that’s mutable data), we skip those properties.
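Here’s a minimal sketch of skipping a property with __sleep. The SketchBlob class and its property names are made up for this example and don’t match the real Blob class.

```php
<?php
class SketchBlob
{
    public $hash;
    public $data;      // file contents: immutable for a given hash
    public $history;   // depends on the path used to reach the blob: not cached

    // Called by serialize(); returns the names of the properties to keep.
    public function __sleep()
    {
        return ['hash', 'data'];
    }
}

$blob = new SketchBlob();
$blob->hash = 'abc123';
$blob->data = "hello\n";
$blob->history = ['commit1', 'commit2'];

$restored = unserialize(serialize($blob));
var_dump($restored->history);  // NULL: the property was left out, so it gets its default
```

Anything left out of the list comes back with its declared default, so the object has to be ready to reload that data when asked.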
However, the equivalent deserialization function __wakeup is not used. This is because serializing object references in PHP introduced an interesting problem.
Fucking references, how do they work?
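PHP cannot serialize a reference as a pointer to an existing object. And this makes sense – there’s no guarantee that, upon deserialization, that object is in the same place in memory, or even exists at all. What serialize() does instead is embed a full copy of each referenced object in the output, so after deserialization you get detached copies, not links to the shared objects that are already loaded. Therefore, all references from an object to shared objects are effectively lost. (except for circular references within the serialized object itself… which I don’t really use)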
But the git objects in include/git depend heavily on object references. A commit points to its parent commit object. A tree points to its child tree/blob objects. Every git object points to the project it belongs to. So we need some way to preserve these references.
Fortunately, every object in GitPHP is uniquely identifiable in some way. A blob/tree/commit is identifiable by its hash. A tag is identifiable by its name. A project is identifiable by its path. This means that we can store this “unique id” and use it to restore the object reference later.
Therefore, every object needs to be able to convert all the objects it holds into “references” – not PHP’s object references, but a unique id “reference” in order to locate the right object later. So all serialized objects that contain objects within them will “reference” their objects in the __sleep method, by replacing the object with a string ID. (This works because PHP is dynamically typed, so the same property can hold either an object or a string.)
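A bare-bones sketch of that "referencing" step might look like this. The RefCommit and RefTree classes and the hashes are invented for this example. The real classes have many more properties to convert.

```php
<?php
class RefTree
{
    public $hash;

    public function __construct($hash)
    {
        $this->hash = $hash;
    }
}

class RefCommit
{
    public $hash;
    public $tree;   // either a RefTree object or, once "referenced", its hash string

    public function __construct($hash, RefTree $tree)
    {
        $this->hash = $hash;
        $this->tree = $tree;
    }

    // Swap the tree object for its unique id just before serialization.
    public function __sleep()
    {
        if ($this->tree instanceof RefTree) {
            $this->tree = $this->tree->hash;
        }
        return ['hash', 'tree'];
    }
}

$commit = new RefCommit('c0ffee', new RefTree('7ee123'));
$restored = unserialize(serialize($commit));
var_dump($restored->tree);  // string(6) "7ee123"
```

Note that in this sketch __sleep changes the live object too, so after serializing, $commit->tree is a string as well. Anything that reads the property has to cope with both forms.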
You would think that the “dereference” of the string ID back into an object would happen in __wakeup… however this is not the case. GitPHP objects use circular references at times. For example, a commit points to its tree, and a tree points to its parent commit. This becomes a problem if you try to dereference during __wakeup:
- Deserialize the commit object. __wakeup attempts to load its tree object…
- Load and deserialize the tree object. __wakeup attempts to load its parent commit…
- Deserialize the commit object. __wakeup attempts to load its tree object…
- Load and deserialize the tree object…
You get the idea – you’ve got an infinite deserialization loop.
Therefore, what happens is a just-in-time dereference. Since I wrapped every property with a getter/setter (because PHP’s method of adding behavior to properties really sucks), I can add behavior to be executed just before fetching a property. In this case, objects remain as string IDs after deserializing the object, up until you need them… at which point they get dereferenced on demand back into objects right before being returned to you. I’ve used this scheme before, since a lot of property getters load their data from the git repository on demand, right before returning the value to you.
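Here’s roughly what a just-in-time dereference looks like in a getter. This is a sketch under assumptions: LazyCommit, LazyTree, and the LazyRegistry lookup (which stands in for the project’s factory methods) are all hypothetical names.

```php
<?php
class LazyTree
{
    public $hash;

    public function __construct($hash)
    {
        $this->hash = $hash;
    }
}

// Hypothetical registry standing in for the project's factory methods.
class LazyRegistry
{
    public static $lookups = 0;

    public static function getTree($hash)
    {
        self::$lookups++;
        return new LazyTree($hash);
    }
}

class LazyCommit
{
    private $tree;   // string id after deserialization, object once dereferenced

    public function __construct($treeId)
    {
        $this->tree = $treeId;
    }

    // Getter: turn the id back into an object only when someone asks for it.
    public function getTree()
    {
        if (is_string($this->tree)) {
            $this->tree = LazyRegistry::getTree($this->tree);
        }
        return $this->tree;
    }
}

$commit = new LazyCommit('7ee123');   // as if freshly deserialized
echo LazyRegistry::$lookups, "\n";    // 0: nothing resolved yet
$tree = $commit->getTree();
$commit->getTree();
echo LazyRegistry::$lookups, "\n";    // 1: resolved once, then reused
```

Since nothing is resolved while the object is being deserialized, a commit and a tree that point at each other can’t pull each other in forever.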
It’s a bit complex, but hopefully that explanation helped. You can always check out the code as well. Let me know if you have any questions.