Author Archive

GitPHP 0.2.2

I’ve released GitPHP 0.2.2. There are a number of neat enhancements in this release; you might have seen some running on the local copy on this site – JavaScript live search of the project list, AJAX tree drilldown, choice of snapshot format, bug tracker linking, etc. Plus there are major enhancements on the backend, including the object cache described in a previous post, and built-in memcache support for all caching – just specify the memcache server(s) in the config and you’re good to go.

Full changelog:

  • Enhancements:
    • Atom feed support, thanks to Christian Weiske
    • Error pages now return proper HTTP error codes to avoid search engine indexing
    • Users can now choose the snapshot file format they want, based on what’s supported by the system. The config value still controls the default for non-JavaScript users.
    • Directories in the config file no longer require you to specify the trailing slash
    • Overriding project settings (category, owner, description, clone url, etc) can now be done for all project listing methods – directory list, file list, and array list. Previously it could only be done with the array list.
    • The tree view can now be drilled down using AJAX
    • Memcache support
    • Support for linking bug numbers in commit messages to a bug tracker
    • Search box to search projects on the project list. This is a live search if you have JavaScript enabled
    • Object cache, for caching immutable git data – more info on what this is and why you want it is here
    • JavaScript is now minified to decrease its size
    • Clone/push urls on the project page are now links, thanks to Cory Thomas
    • Project owners are now read from the git config value gitweb.owner if set, thanks to Cory Thomas
  • Translations:
    • Russian, thanks to Aidsoid
    • German, thanks to Andy Tandler
  • Bugfixes:
    • Fix issue where commit tooltips didn’t escape HTML characters correctly
    • Project ages on the project list page now use a more accurate method to get the age
    • The default tmpdir, if not specified in the config, is now read from the system/PHP config rather than being hardcoded. This also fixes an issue where the default tmpdir for Windows was incorrect
    • Fix the default git binary for Windows x64 installs
    • An error message now displays when the diff/git executable isn’t found or doesn’t work, rather than just failing silently

For those who are wondering about this post about a backwards incompatible change – the change to specify overrides for all project list formats does change the project config file (projects.conf.php) format slightly, but the code is backwards compatible and will continue to use your old project config file until you adjust it for the new changes.

As always, the release can be downloaded from the gitphp page and bugs can be reported on the bugtracker.

GitPHP phpDoc

I’ve generated GitPHP’s API documentation using phpDoc. Most of the headers in the files come out with pretty good documentation; a few of them still need to be cleaned up. I’ll try to regenerate the documentation every now and then, but it may not always be 100% in sync with the code in the repository.

It’s available at http://api.xiphux.com/gitphp.

The cache strikes back: The GitPHP object cache

I’ve introduced a new caching feature in GitPHP: the object cache.

The existing cache in GitPHP uses Smarty caching on a per-template basis. This means that each page has its own cache. However, each time that page needs to be regenerated, it hits the git repository for everything it needs.

This meant there were lots of places where effort was being duplicated, even with the cache turned on. For example, if you visited the shortlog, you loaded up the data for 50 commits. Ok. Then you visited the log page, and you loaded up the data for the same 50 commits… that’s 50+ redundant git repository hits. Another example is changing the language – various labels are different but the data is exactly the same, but it still needs to be loaded twice because the cached template is different.

Therefore, I created the object cache. The object cache is a secondary layer of caching that only stores immutable data from git. Git is content-addressed: every object is stored under a hash of its own contents, and the objects link to each other by hash to form a directed acyclic graph. This means that the contents behind a hash – for example, a blob or a commit – will be exactly the same now as they will be weeks later. So if we store this immutable data parsed and loaded in the cache, we can use it as many times as we want later, without hitting the git repository for it.
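As a rough illustration of why this data never changes: git derives a blob's ID from its contents, as the SHA-1 of a short header followed by the file data. The helper name gitBlobHash below is made up for this sketch and is not part of GitPHP.

```php
<?php
// Compute the ID git assigns to a blob: SHA-1 over "blob <byte length>\0" plus the content.
// gitBlobHash is a hypothetical helper name used only for this illustration.
function gitBlobHash(string $content): string
{
    return sha1('blob ' . strlen($content) . "\0" . $content);
}

$a = gitBlobHash("hello world\n");
$b = gitBlobHash("hello world\n");
$c = gitBlobHash("hello world!\n");

var_dump($a === $b); // identical content always maps to the same hash
var_dump($a === $c); // any change in content produces a different hash
```

Since a given hash can only ever refer to one exact sequence of bytes, anything parsed from that object can be cached under the hash without worrying about it going stale.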

So continuing with the previous example, visiting the shortlog will first load up 50 commits from the repository (this is unavoidable). Then, going to the log will hit the cache for those 50 commits and never hit the repository.

Because this is intended to only cache immutable git data, there are a number of things that it doesn’t cache. For example, which commit is the HEAD isn’t cached, because that changes all the time. Likewise, the blame or history of a blob isn’t cached, because that is technically mutable based on the filename used to refer to the blob (the same blob can represent more than one file). Here’s the list of known immutable stuff that is cached:

  • Commit – its author/committer/times, parent commit(s), comment, tree, containing tag
  • Tags – both lightweight tags and tag objects, and the object they are tagging (commit or tag), and their message
  • Blob – the file contents
  • Tree – its contents (sub-trees and blobs)

This option (‘objectcache’ in the config) is separate from the existing template cache option. You can choose not to use it, while continuing to use template caching. Or you can choose to only use the object cache without using the template cache, in order to eliminate most of the git repository calls while absolutely guaranteeing that you’re seeing the most up to date data. Or you can use both together for the maximum benefit. Combine that with the recently added Memcache support, and you’ve got yourself a pretty sweet setup.

The object cache also has its own lifetime setting (‘objectcachelifetime’ in the config). Because the data cached in the object cache is immutable, it’s safe to set the object cache’s lifetime extremely high. The default is 86400 seconds (24 hrs).

I’ve included some technical details of how this works below. It’s a long read, but if you’re programming inclined, it’s worth checking out. Part of it is because you will need to know some of this if you intend to make changes to GitPHP code. Part of it is also because it was interesting to implement :)

Technical info

The Cache interface

The object cache uses a wrapper interface around Smarty’s cache system to store arbitrary data. This was done because I wanted to keep the cache interface the same for both the object cache and the template cache. This meant that if the user was using the memcache smarty interface, then both caching methods would go to the same place. Likewise, with disk caching, data is stored in the same place. It’s simpler for the user and easier for me to maintain.

The cache system uses a single smarty template for storing all data. The template does nothing but store and return a string of data, and each individual piece of data cached is distinguished by a unique cache id. Data is stored serialized into a string and inserted into the template, and is retrieved by fetching the contents of that template and deserializing the data – so the template is never actually output. This interface is in include/cache/Cache.class.php. It is a bit of a hack, but I’ve already used this trick before to cache blob headers when viewing plaintext blobs.
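The store-as-a-string idea can be sketched roughly as follows. This is not the code in include/cache/Cache.class.php: the class name SketchCache, its methods, and the cache id format are made up here, and a plain array stands in for the Smarty template that holds the string.

```php
<?php
// Hypothetical stand-in for the cache wrapper. A real version would hand the
// serialized string to Smarty; here an in-memory array plays that role.
final class SketchCache
{
    /** @var array<string,string> cache id => serialized data */
    private $store = [];

    public function set(string $cacheId, $data): void
    {
        // Arbitrary data is flattened into a single string before storing.
        $this->store[$cacheId] = serialize($data);
    }

    public function get(string $cacheId)
    {
        if (!array_key_exists($cacheId, $this->store)) {
            return null; // cache miss
        }
        // The stored string is turned back into the original data structure.
        return unserialize($this->store[$cacheId]);
    }
}

$cache = new SketchCache();
// The cache id format below is an assumption made for this example.
$cache->set('project.git|commit|abc123', ['author' => 'someone', 'parents' => ['def456']]);

$commit  = $cache->get('project.git|commit|abc123');
$missing = $cache->get('project.git|commit|000000');
```

Because every entry is just a string under a unique id, the same backend (disk or memcache) can hold both rendered templates and arbitrary object data.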

Loading/storing data

Objects are all retrieved using a factory method off of the GitPHP_Project class in include/git/Project.class.php – GetCommit, GetTag, GetBlob, GetTree, etc. The factories will return a deserialized cached object if it is found, otherwise a new instance of the object. The objects save/update themselves in the cache after some data is loaded from the repository (ReadData in include/git/Commit.class.php, ReadData in include/git/Blob.class.php, etc).
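A minimal sketch of that flow, assuming a plain array as the cache. GetCommit and ReadData are named after the methods mentioned above, but the classes, properties, and bodies here are invented for illustration and are much simpler than the real ones.

```php
<?php
final class SketchProject
{
    public $cache = [];     // hash => serialized commit (stand-in for the object cache)
    public $repoReads = 0;  // counts simulated repository hits

    // Factory: return the cached commit if there is one, otherwise a fresh instance.
    public function GetCommit(string $hash): SketchCommit
    {
        if (isset($this->cache[$hash])) {
            $commit = unserialize($this->cache[$hash]);
        } else {
            $commit = new SketchCommit($hash);
        }
        $commit->project = $this; // re-attach the project, which is not serialized
        return $commit;
    }
}

final class SketchCommit
{
    public $project;
    private $hash;
    private $title;
    private $dataRead = false;

    public function __construct(string $hash)
    {
        $this->hash = $hash;
    }

    public function GetTitle(): string
    {
        if (!$this->dataRead) {
            $this->ReadData();
        }
        return $this->title;
    }

    private function ReadData(): void
    {
        // A real implementation would run git here; this just simulates it.
        $this->project->repoReads++;
        $this->title = 'title of ' . $this->hash;
        $this->dataRead = true;
        // Save the freshly loaded state back into the cache.
        $this->project->cache[$this->hash] = serialize($this);
    }

    public function __sleep(): array
    {
        return ['hash', 'title', 'dataRead']; // leave the project out
    }
}

$project = new SketchProject();
$first = $project->GetCommit('abc123');
$first->GetTitle();                    // loads from the "repository"
$second = $project->GetCommit('abc123');
$title = $second->GetTitle();          // served from the cache
```

In this sketch the second lookup never touches the simulated repository, so repoReads stays at 1.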

Serialization

PHP objects have a magic function called __sleep that is called right before serialization. This returns a list of properties to serialize. In a lot of objects (such as Commit) this just returns all properties. In the Blob object, because we don’t want to serialize the history/blame data (since that’s mutable data), we skip those properties.
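Here is a small sketch of skipping mutable properties in __sleep. The class SketchBlob and its property names are hypothetical and not copied from include/git/Blob.class.php.

```php
<?php
final class SketchBlob
{
    private $hash;
    private $data;
    private $history = []; // mutable: depends on the path used to reach the blob

    public function __construct(string $hash, string $data)
    {
        $this->hash = $hash;
        $this->data = $data;
    }

    public function SetHistory(array $history): void
    {
        $this->history = $history;
    }

    public function GetData(): string
    {
        return $this->data;
    }

    public function GetHistory(): array
    {
        return $this->history;
    }

    // Called by serialize(); only the properties named here are stored.
    public function __sleep(): array
    {
        return ['hash', 'data'];
    }
}

$blob = new SketchBlob('abc123', "file contents\n");
$blob->SetHistory(['commit1', 'commit2']);

$restored = unserialize(serialize($blob));
```

After the round trip, the restored blob still has its contents, while the history property falls back to its declared default (an empty array) and would have to be loaded again.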

However, the equivalent deserialization function __wakeup is not used. This is because serializing object references in PHP introduced an interesting problem.

Fucking references, how do they work?

PHP can’t serialize a reference as a pointer to an existing object. And this makes sense – there’s no guarantee that, upon deserialization, that object is in the same place in memory, or even exists at all. Left alone, serialize() embeds a copy of whatever object a property points to, so after deserialization the link to the original instance is lost. (The exception is references within the serialized object itself, such as circular references… which I don’t really use)

But the git objects in include/git depend heavily on object references. A commit points to its parent commit object. A tree points to its child tree/blob objects. Every git object points to the project it belongs to. So we need some way to preserve these references.

Fortunately, every object in GitPHP is uniquely identifiable in some way. A blob/tree/commit is identifiable by its hash. A tag is identifiable by its name. A project is identifiable by its path. This means that we can store this “unique id” and use it to restore the object reference later.

Therefore, every object needs to be able to convert all its objects into “references” – not PHP’s object references, but a unique id “reference” in order to locate the right object later. So all serialized objects that contain objects within them will “reference” their objects in the __sleep method, by replacing the object with a string ID. (Since PHP is dynamically typed, the same property can hold either an object or a string.)

You would think that the “dereference” of the string ID back into an object would happen in __wakeup… however this is not the case. GitPHP objects use circular references at times. For example, a commit points to its tree, and a tree points to its parent commit. This becomes a problem if you try to dereference during __wakeup:

  1. Deserialize the commit object. __wakeup attempts to load its tree object…
  2. Load and deserialize the tree object. __wakeup attempts to load its parent commit…
  3. Deserialize the commit object. __wakeup attempts to load its tree object…
  4. Load and deserialize the tree object…

You get the idea – you’ve got an infinite deserialization loop.

Therefore, what happens is a just-in-time dereference. Since I wrapped every property with a getter/setter (because PHP’s method of adding behavior to properties really sucks), I can add behavior to be executed just before fetching a property. In this case, objects remain as string IDs after deserialization, up until you need them… at which point they get dereferenced back into objects on demand, right before being returned to you. I’ve used this scheme before, since a lot of property getters load their data from the git repository on demand right before returning the value to you.
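The whole scheme can be sketched like this, under heavy simplification. LazyRegistry, LazyCommit, and LazyTree are hypothetical names; the registry stands in for the project factories plus the cache, and the real classes carry far more state.

```php
<?php
// Hypothetical lookup table standing in for the project factories and the cache.
final class LazyRegistry
{
    public static $stored = []; // unique id => serialized object
    public static $loads = 0;   // how many objects were deserialized

    public static function load(string $id)
    {
        self::$loads++;
        return unserialize(self::$stored[$id]);
    }
}

final class LazyCommit
{
    private $hash;
    private $tree; // LazyTree object, or its hash as a string

    public function __construct(string $hash) { $this->hash = $hash; }
    public function GetHash(): string { return $this->hash; }
    public function SetTree(LazyTree $tree): void { $this->tree = $tree; }

    public function GetTree(): LazyTree
    {
        if (is_string($this->tree)) {
            // Just-in-time dereference: string ID back into an object.
            $this->tree = LazyRegistry::load($this->tree);
        }
        return $this->tree;
    }

    public function __sleep(): array
    {
        if ($this->tree instanceof LazyTree) {
            $this->tree = $this->tree->GetHash(); // "reference" the object by ID
        }
        return ['hash', 'tree'];
    }
}

final class LazyTree
{
    private $hash;
    private $commit; // LazyCommit object, or its hash as a string

    public function __construct(string $hash) { $this->hash = $hash; }
    public function GetHash(): string { return $this->hash; }
    public function SetCommit(LazyCommit $commit): void { $this->commit = $commit; }

    public function GetCommit(): LazyCommit
    {
        if (is_string($this->commit)) {
            $this->commit = LazyRegistry::load($this->commit);
        }
        return $this->commit;
    }

    public function __sleep(): array
    {
        if ($this->commit instanceof LazyCommit) {
            $this->commit = $this->commit->GetHash();
        }
        return ['hash', 'commit'];
    }
}

// Build a commit and a tree that point at each other, then store both.
$commit = new LazyCommit('c1');
$tree = new LazyTree('t1');
$commit->SetTree($tree);
$tree->SetCommit($commit);
LazyRegistry::$stored['c1'] = serialize($commit);
LazyRegistry::$stored['t1'] = serialize($tree);

$loadedCommit = LazyRegistry::load('c1');   // the tree is still just the string "t1"
$loadedTree = $loadedCommit->GetTree();     // dereferenced only now
$backAgain = $loadedTree->GetCommit();      // and only one level at a time
```

Nothing is deserialized until a getter asks for it, so the commit/tree cycle never turns into a loop: the three calls above trigger exactly three loads.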

It’s a bit complex, but hopefully that explanation helped. You can always check out the code as well. Let me know if you have any questions.
