Decreasing use of the git exe in GitPHP
- July 11th, 2011
I’m actually not getting screwed by work this month, so I have a branch I’m working on where I’m changing the way GitPHP loads data from a project. I’m moving as much of the data loading as possible into PHP itself, by having PHP directly read the objects and packfiles in the git project rather than relying on calls to the git executable. The packfile loading code is based on Glip, but these changes do not use the Glip library itself.
The majority of the work is done. I’m going to be merging it into master relatively soon, with the intent of including the changes in the next release. Since this is such a fundamental change to the way data is loaded, I’m providing a configuration option to fall back to using the git executable (like GitPHP has always done) in case there are issues, and I’m considering providing this option on a per-project basis.
Update 7/17: This is merged into master. Any beta testers willing to test the new loading code on their repositories by running from git master would be appreciated… maybe we can knock out any major compatibility issues early.
Q: Why?
A: Several reasons:
- Compatibility. The fewer calls I have to make to the shell, the less I have to worry about the differences between Linux and Windows, and about breaking one or the other (usually Windows, since I develop on Linux).
- Performance. Every shell execution is a fork() off of the webserver’s process. When you run lots of shell commands one after another, fork()-ing the web server process each time, the delay is noticeable.
- Security. While the good security practice would be to make your git repositories read-only to the webserver user, I bet very few people actually do that. Passing user input down to a shell command is always a dangerous thing to do, and potentially exposes your server to hackers. Unfortunately, I’ve always put functionality above security when calling the git executable, and I’d like to change that.
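To make the security point concrete, here is a minimal sketch of the two usual defenses for the cases where a git call is unavoidable: validate the input against a strict pattern, and pass the arguments as a list so that no shell ever parses them. It is written in Python for brevity and is not GitPHP’s actual code; the function names `is_valid_hash` and `build_cat_file_command` are hypothetical.

```python
import re

# A full SHA-1 object name is exactly 40 lowercase hex characters.
_HASH_RE = re.compile(r"[0-9a-f]{40}")


def is_valid_hash(value):
    """Return True only for a full 40-character hex object name."""
    return _HASH_RE.fullmatch(value) is not None


def build_cat_file_command(git_dir, object_hash):
    """Build an argument list for `git cat-file -p <hash>`.

    Returning a list (rather than one command string) means it can be
    handed to subprocess.run() without shell=True, so shell
    metacharacters in the input are never interpreted.
    """
    if not is_valid_hash(object_hash):
        raise ValueError("not a valid object hash")
    return ["git", f"--git-dir={git_dir}", "cat-file", "-p", object_hash]


# Anything that is not a plain hash is rejected before git is involved.
print(is_valid_hash("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"))  # True
print(is_valid_hash("HEAD; rm -rf /"))                            # False
```

Reading the object directly from disk avoids the question entirely, which is the appeal of the approach described in this post.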
Q: Why didn’t you use Glip?
A: I experimented with using Glip. In doing so, I found a couple of things:
- Glip is incomplete. It doesn’t load all of the objects that can exist in a git project, and is also missing a number of features I would need for it to replace the git executable.
- Glip provides its own API to access objects in a git project: Blob, Commit, Tree, etc. These objects were very similar to the git object representations I had created for GitPHP, but the API was different enough to make them incompatible. In loading data using Glip, I found myself creating Glip objects, manually picking values out of their properties just to stuff them into my own properties, and then discarding the Glip objects. Not only that, but the way it stored data in properties was different enough from my git objects that I ended up doing redundant conversions. Glip would load data in git’s internal representation and convert it to PHP’s internal representation, and then I would extract those values and convert them back into git’s internal representation for use in my objects. It just felt like a lot of redundant glue code. I have no doubt that Glip would be great if you were writing a project from scratch based on it, but ripping apart and rewriting all of GitPHP’s model code is not something I’m looking to do anytime soon.
Q: Does this mean the git executable is no longer required?
A: Right now the git exe is still required.
Q: Does this mean we may not need the git executable in the future?
A: I don’t want to say never, but this is unlikely. While loading data from plain git objects is significantly faster in raw PHP, it also means that all the processing has to be done in PHP. There are a number of cases where large amounts of processing in PHP actually make it slower than just biting the bullet and calling the git executable:
- The git rev-list command (used for the shortlog/log) has a --skip option, which skips a given number of commits down in the log. This is used for the next/prev paging in the log in GitPHP. In raw PHP, we don’t have this option; we have to walk all the way down the commit log ourselves. So in PHP, if you want to get page 5 of the log (commits 401-500), you have to walk the log and load commits 1-500, and discard the first 400. So you can imagine that it gets really slow when you get down to page 21, and you have to load the first 2100 commits and discard the first 2000. And walking the log isn’t just following a single chain of parent links: it covers any commit reachable from the tip, which includes all merged branches and any of their reachable commits. The current implementation actually uses raw PHP for the early pages of the log, but when skipping a significant number of commits, it falls back on the git executable in order to keep performance reasonable.
- Searching for a commit, committer, or author requires loading every single commit in the history.
- Grepping inside files requires loading up the contents of every single file in a tree and searching every line.
- Getting the history of a file requires reading every single commit in the entire history and reading each commit’s tree to see if it touched this file. That also doesn’t take into account things like detecting renames.
- Getting the blame of a file requires getting the entire history of a file and diffing every commit with its parent to figure out the changes to the file.
- Diffing a file would require me either to write my own diff algorithm (non-trivial) or to require an external PHP extension like xdiff, which I don’t want to do.
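The log paging case from the list above can be sketched as follows. This is an illustration in Python, not GitPHP’s code: it walks a toy commit graph breadth-first (real git orders the log by commit date, which is ignored here), and the names `walk_log_page`, `use_git_executable`, and the `SKIP_THRESHOLD` value are all hypothetical. The point is only that a hand-written walk has to load every skipped commit before it reaches the page it wants.

```python
from collections import deque

SKIP_THRESHOLD = 500  # hypothetical cutoff, not GitPHP's real value


def walk_log_page(parents, tip, skip, count):
    """Return (page, loaded): the commits on the requested page and
    how many commits had to be loaded to produce it.

    `parents` maps a commit id to a list of its parent ids.
    """
    seen = {tip}
    queue = deque([tip])
    page = []
    loaded = 0
    while queue and len(page) < count:
        commit = queue.popleft()
        loaded += 1                    # each visit stands for one object read
        if loaded > skip:
            page.append(commit)
        for parent in parents.get(commit, []):
            if parent not in seen:     # a merge can reach a commit twice
                seen.add(parent)
                queue.append(parent)
    return page, loaded


def use_git_executable(skip):
    """Decide whether to hand the request to `git rev-list --skip=N`."""
    return skip > SKIP_THRESHOLD


# A linear history of 3000 commits: c0 is the tip, c2999 the root.
history = {f"c{i}": [f"c{i + 1}"] for i in range(2999)}
page, loaded = walk_log_page(history, "c0", skip=2000, count=100)
print(len(page), loaded)         # 100 2100
print(use_git_executable(2000))  # True
```

Page 21 returns 100 commits but loads 2100, which is why a deep skip is better handed to the git executable.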
Q: Are there any additional requirements?
A: You’ll need Zlib support in your PHP, since git objects are zlib-compressed.
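As a rough illustration of why zlib is needed, here is a sketch of reading a loose object. A loose object is stored under `objects/`, in a directory named after the first two hex characters of its hash and a file named after the remaining 38. Its contents are a zlib stream that decompresses to a header of the form `<type> <size>`, a NUL byte, and then the object data. The sketch is in Python and is not GitPHP’s code; the helper names are hypothetical. In PHP the decompression step corresponds to `gzuncompress()`.

```python
import hashlib
import zlib


def parse_loose_object(compressed):
    """Decompress a loose object and split it into (type, content)."""
    raw = zlib.decompress(compressed)
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.split(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode("ascii"), content


def make_loose_object(obj_type, content):
    """Build (hex name, compressed bytes) the way git stores an object."""
    header = obj_type.encode("ascii") + b" " + str(len(content)).encode("ascii")
    raw = header + b"\x00" + content
    return hashlib.sha1(raw).hexdigest(), zlib.compress(raw)


# Round trip: build a blob in memory, then read it back.
name, data = make_loose_object("blob", b"hello\n")
print(parse_loose_object(data))  # ('blob', b'hello\n')
```

Objects inside packfiles are stored differently (and may be deltas against other objects), so this covers only the loose-object case.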
Q: Are there any limitations?
A: The new loading code won’t read packfiles larger than 2GB. This is a limitation in Glip, too. It’s because PHP’s fseek() (or actually most operations in PHP) tops out at 2GB (2*1024^3 bytes) wherever PHP integers are signed 32-bit values.
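The arithmetic behind that limit, as a quick sketch in Python (the helper name `offset_fits_int32` is hypothetical): 2*1024^3 is exactly 2^31, one more than the largest value a signed 32-bit integer can hold, so a byte offset at or past the 2GB mark cannot be represented.

```python
INT32_MAX = 2**31 - 1   # largest signed 32-bit integer
TWO_GB = 2 * 1024**3    # 2147483648 bytes


def offset_fits_int32(offset):
    """Return True if a byte offset fits in a signed 32-bit integer."""
    return 0 <= offset <= INT32_MAX


print(TWO_GB == 2**31)                # True
print(offset_fits_int32(TWO_GB - 1))  # True
print(offset_fits_int32(TWO_GB))      # False
```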
I’ll update this post with any other FAQs I can think of.