Hi,
I store a git repository on Amazon S3 and notice that "jgit fetch" can be very slow, fetching lots of pack-*.idx files even when the remote is ahead of local by only a single commit. It looks like WalkFetchConnection::downloadObject essentially iterates by brute force through all remote pack-*.idx files looking for an object. Since it's difficult to GC remote dumb repositories (I think the best practice for Amazon S3 is to run git gc through s3fs-fuse), pack files accumulate over time and "fetch" becomes slow.
So what if a local repository kept a persistent cache of remote pack-*.idx files? WalkFetchConnection could try that cache before the big iteration through all remote pack files. Further, maybe before consulting the cache, WalkFetchConnection could check the local .git/objects/pack directory for index files as well.
Pack indexes are tied to their respective pack files; if a repack happened, the corresponding
cached index file would become stale.
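To make the cache idea and the staleness concern concrete, here is a minimal sketch of a local index cache that is pruned against the remote's current pack list (objects/info/packs in the dumb protocol). The class RemotePackIndexCache and all of its methods are hypothetical, not part of JGit; it assumes a repack writes its packs under new names, so a cached index whose pack is no longer listed can simply be deleted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical on-disk cache of remote pack-*.idx files; not a JGit class. */
public class RemotePackIndexCache {
    private final Path cacheDir;

    public RemotePackIndexCache(Path cacheDir) throws IOException {
        this.cacheDir = Files.createDirectories(cacheDir);
    }

    /** Returns the cached index for a pack such as "pack-1234.pack", if present. */
    public Optional<Path> lookup(String packName) {
        Path idx = cacheDir.resolve(idxName(packName));
        return Files.isRegularFile(idx) ? Optional.of(idx) : Optional.empty();
    }

    /** Stores a freshly downloaded index so later fetches can skip the download. */
    public Path store(String packName, Path downloadedIdx) throws IOException {
        Path target = cacheDir.resolve(idxName(packName));
        return Files.copy(downloadedIdx, target, StandardCopyOption.REPLACE_EXISTING);
    }

    /**
     * Deletes cached indexes whose pack the remote no longer lists
     * (for example after a repack) and returns how many were removed.
     */
    public int pruneStale(Set<String> remotePackNames) throws IOException {
        Set<String> live = remotePackNames.stream()
                .map(RemotePackIndexCache::idxName)
                .collect(Collectors.toSet());
        List<Path> cached;
        try (Stream<Path> files = Files.list(cacheDir)) {
            cached = files.collect(Collectors.toList());
        }
        int removed = 0;
        for (Path p : cached) {
            if (!live.contains(p.getFileName().toString())) {
                Files.delete(p);
                removed++;
            }
        }
        return removed;
    }

    /** Maps "pack-X.pack" to "pack-X.idx"; other names are returned unchanged. */
    static String idxName(String packName) {
        return packName.endsWith(".pack")
                ? packName.substring(0, packName.length() - ".pack".length()) + ".idx"
                : packName;
    }
}
```

A fetch would call pruneStale with the freshly read remote pack list first, then try lookup for each remaining pack and download (and store) only the indexes that are missing.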
It'd also be nice if jgit supported remote gc of dumb repositories, but that's maybe a separate optimization.
well, the dumb protocol is - dumb
Thoughts? Am I understanding things correctly and does this seem like a workable idea?
The AmazonS3 implementation is very old and predates the introduction of the DfsRepository APIs,
which are meant for storing git objects in a distributed file system that has much higher latency
than a local filesystem. Maybe a Dfs-based S3 implementation would be the better approach.
-Matthias