Sunday, August 1, 2010

GSoC 2010 Progress Report #2

I spent last week improving and debugging the repository packs code. For those unfamiliar with this feature: repository packs are two tarballs, basic.tar.gz and patches.tar.gz, containing a copy of the repository contents. They make getting a Darcs repository over the network faster, and will be created by the 'darcs optimize --http' command once it is enabled.

The main changes are:

  • when getting a repository via packs, the files are hard-linked to the Darcs global cache,
  • a small tweak to the pack format to support parallel get using the cache,
  • further development and debugging of the parallel get code,
  • optimized packing of inventories.

The last change resolves issue1889: it removes unnecessary files from the basic pack, thereby reducing its size. As a repository is used, its inventories directory can accumulate quite a large number of unneeded files. For the Darcs unstable repository, these files make up a significant part of the basic pack: the unoptimized pack takes 21MB, versus 1.7MB with the optimization.
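To illustrate the idea of packing only the inventories that are still needed, here is a minimal sketch. It assumes (this is not stated above) that each inventory file refers to at most one earlier inventory, so the needed files form a chain starting at the current one. The names `needed_inventories`, `head`, and `parent_of` are hypothetical and are not taken from the Darcs code.

```python
def needed_inventories(head, parent_of):
    """Return the inventories reachable from `head`, newest first.

    `parent_of` maps an inventory name to the earlier inventory it
    refers to, or None if it is the first one (hypothetical model).
    """
    needed = []
    current = head
    # Walk the chain; the membership check guards against a cyclic mapping.
    while current is not None and current not in needed:
        needed.append(current)
        current = parent_of.get(current)
    return needed


# Everything in the directory that is not on the chain can be left out of the pack.
directory = {"inv3": "inv2", "inv2": "inv1", "inv1": None,
             "stale_a": "inv1", "stale_b": None}
to_pack = needed_inventories("inv3", directory)
```

In this toy directory only inv3, inv2, and inv1 would go into the basic pack; stale_a and stale_b are never reached from the current inventory.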

I've also figured out what caused issue1884, in which an incomplete darcs get wrongly reported success. It turns out that Darcs.Command.Get has an interrupt handler that covers almost the entire code of darcs get and unconditionally reports success on interrupt. The fix is easy, but it conflicts with the rest of my work that is waiting for review, so it will have to wait a bit too.

By the way, this fix will enable one more optimization, because it clearly defines the point at which a lazy repository has been obtained. It turns out that to get a lazy repository it's not necessary to download the entire basic tarball: the inventory files at its end can be obtained later, lazily. With this optimization, darcs get of a lazy packed repository will download the same files as the "classical" darcs get, only faster.
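The shape of the problem and of one possible fix can be sketched in a few lines of Python. This is only an illustration, not the actual Darcs.Command.Get code: the function and parameter names (`get_repository`, `fetch_essentials`, `fetch_rest`, `report`) and the messages are hypothetical. The point is that the interrupt handler reports success only if the interrupt arrives after the part a lazy repository needs has been fetched.

```python
def get_repository(fetch_essentials, fetch_rest, report):
    """Run a two-phase get and report honestly on interrupt.

    Returns True if the result is a usable (possibly lazy) repository.
    """
    essentials_done = False
    try:
        fetch_essentials()        # what a lazy repository cannot do without
        essentials_done = True
        fetch_rest()              # everything that can be fetched lazily later
    except KeyboardInterrupt:
        if essentials_done:
            report("interrupted: repository is usable, the rest will be fetched lazily")
            return True
        # The buggy variant would report success here as well.
        report("interrupted: get is incomplete")
        return False
    report("get finished")
    return True
```

Passing the reporting function in as a parameter keeps the sketch easy to exercise with a simulated interrupt.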

While getting the packs can be much faster than getting a repository file by file, it can also be much slower if the repository files are already in the cache. However, the cache doesn't necessarily win either: for example, it may be on a network share behind a slow connection. Conversely, the "remote" repository can be at arm's length, or even on the local host. As you can see, things get a little complicated here, and there will certainly be cases where trying to be too clever and guess the best way to get the repository (file by file using the cache, or using packs) will fail.

The way I solve this problem is actually simple: why choose between two options if you can use both? So I added to the beginning of the basic pack a list of the files it contains, in reverse order (the patches pack doesn't need one, as its contents can be inferred from the inventory). Now, when you get a repository, the pack starts downloading, and once its file list has been received, those files are also fetched individually in parallel, in the listed (reverse) order. Working from opposite ends of the list, both download threads eventually discover that the file they are about to download already exists. At this point their work ends: the pack download is complete.
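The meet-in-the-middle scheme can be sketched with two threads. This is a simplified model under stated assumptions, not the Darcs implementation: `parallel_get` and the worker names are hypothetical, "obtaining" a file is reduced to recording which worker got it, and the existence check and the claim happen atomically under a lock, whereas a real download could briefly duplicate work on the file where the two workers meet.

```python
import threading


def parallel_get(pack_order):
    """Obtain every file exactly once with two workers starting at opposite ends.

    `pack_order` is the order in which files appear in the pack.
    Returns a dict mapping each file name to the worker that obtained it.
    """
    done = {}
    lock = threading.Lock()

    def worker(name, files):
        for f in files:
            with lock:
                if f in done:
                    # The other worker has already been here: nothing left to do.
                    return
                done[f] = name  # stand-in for unpacking or fetching the file

    # One worker follows the pack stream, the other walks the reversed list.
    pack = threading.Thread(target=worker, args=("pack", pack_order))
    cache = threading.Thread(target=worker, args=("cache", list(reversed(pack_order))))
    pack.start()
    cache.start()
    pack.join()
    cache.join()
    return done
```

Whatever the relative speed of the two workers, the pack worker ends up with a prefix of the list and the cache worker with the remaining suffix, so neither source has to be chosen in advance.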

The only remaining issue with packs that I know of is the download implementation. The current code for downloading files in Darcs lets you use a file only after its download is complete, which doesn't suit my way of using the packs in parallel with the cache. Since I am going to write custom downloading code in my upcoming smart server work, I think it will be easier to provide a common interface for both smart and dumb (including lazy) downloads than to try to alter the current code, which was not designed for lazy download.

Now that I've finished most of the repository packs work (well, almost; there will probably be a couple of rounds of review-amend ping-pong on the Darcs bug tracker), I'm starting to write the code for the smart server. After designing the server's interface (I'll post the specification on the Darcs wiki), I'll start by writing the client side (it will help solve the problem with downloading tarballs and let me finish my work on the repo packs sooner). I'll post again about my progress on Saturday, August 7.
