You do know that "diff" and "delta" are two different beasts, and that
the diff optimizations have no effect on deltas? :)

The problem with directory deltification lies in the length of the delta
chain and the frequency of directory lookup compared to file access. The
sad fact is that our directory storage (/and/ our API) are woefully
unsuited to their tasks. The way they're stored now (in both BDB and
FSFS back-ends) requires the whole directory to be read into memory and
hashed in order to find a single entry, and you have to do this for each
level of directories when resolving a path. It doesn't help that most of
the APIs are strictly path-based, so that, e.g., an editor drive may
repeat the same lookup any number of times.
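
To make the lookup cost concrete, here is a toy model in Python. It is
only a sketch: the store layout, the node ids and the helper names are
made up for illustration and are not Subversion's on-disk format or API.
It just counts how many directory entries get parsed to resolve one path
when every directory has to be read and hashed in full.

```python
# Toy model only: hypothetical layout, not Subversion's format or API.
# Each directory is stored as one serialized blob of "name node-id" lines.
STORE = {
    "root": "branches n1\ntrunk n2\n",
    "n1": "".join(f"b{i} x{i}\n" for i in range(35000)),  # ~35k branches
    "x7": "README f1\n",
}

def read_directory(node_id, stats):
    """Parse the whole blob into a dict, even though one entry is wanted."""
    entries = {}
    for line in STORE[node_id].splitlines():
        name, child = line.split(" ")
        entries[name] = child
        stats["entries_parsed"] += 1
    return entries

def resolve_path(path, stats):
    """Walk the path one component at a time, re-reading each directory."""
    node = "root"
    for component in path.strip("/").split("/"):
        node = read_directory(node, stats)[component]
    return node

stats = {"entries_parsed": 0}
node = resolve_path("/branches/b7/README", stats)
print(node, stats["entries_parsed"])
```

In this model every lookup under /branches pays for all 35000 entries
again, and with deltified directories each blob would first have to be
rebuilt by walking its delta chain.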

The whole concept of directory storage needs to be changed. The easiest
way would be to store directories as B-trees; however, that doesn't play
nice with versioning. On the other hand, directory structure is well
known and there's no reason to use a generic delta algorithm to store
them. I could probably come up with a better storage schema for
directories in a couple of weeks, but I don't have the time to implement
such a thing.
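
As an illustration of what "not using a generic delta algorithm" could
mean, here is a minimal sketch that treats a directory as a mapping from
entry name to node id and records only the entries that changed. This is
one possible reading, not the schema proposed above (none is spelled
out); the function names and the node-id strings are hypothetical.

```python
# Hypothetical sketch: a directory is modelled as {entry name: node id}.

def dir_delta(old, new):
    """Return (changed, removed): entries to set and names to drop."""
    changed = {name: nid for name, nid in new.items() if old.get(name) != nid}
    removed = sorted(name for name in old if name not in new)
    return changed, removed

def apply_delta(old, delta):
    """Rebuild the new directory from the old one plus a delta."""
    changed, removed = delta
    gone = set(removed)
    result = {name: nid for name, nid in old.items() if name not in gone}
    result.update(changed)
    return result

# A bubble-up commit touching one branch out of 35000.
old = {f"b{i}": f"x{i}.r1" for i in range(35000)}
new = dict(old)
new["b7"] = "x7.r2"

delta = dir_delta(old, new)
assert apply_delta(old, delta) == new
print(delta)
```

The delta here holds a single entry regardless of how the directory is
serialized, so entry ordering stops mattering. It does nothing by itself
about chain length or lookup cost, which would still need an index of
some kind.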

-- Brane

On 01.02.2011 05:29, Hyrum K Wright wrote:
> Philip and I had an interesting conversation with some users this
> evening, and I'm just archiving my brain dump here.
>
> These users have a large repository with a large number of branches in
> the /branches directory (~35k). We described the well-known
> phenomenon in which directories aren't deltified on commit, and thus
> cause the repository to have very large revisions, even when the
> actual content changes are fairly small. This is due to bubble up and
> having to re-write the entire directory list of the /branches
> directory.
>
> Philip recalled a time several years ago when he enabled directory
> deltification, but the performance was awful, and we've never released
> it. In our discussion, we mentioned that directory deltification may
> be better performing now, especially in light of the imminent merge of
> the diff-bytes-optimizations branch. In the case of a bubble-up
> directory modification, the prefix and suffix matching would simplify
> the problem space, leaving a very small diff.
>
> The only trouble with the above theory is if directory entry lists are
> stored in a hash, and are serialized in an unordered manner, thus
> negating any benefits prefix-scanning would provide (and potentially
> causing the horrific delta performance in the first place).
>
> Anyway, that was the kernel of our discussion. I haven't dug around
> in the code to determine how much of it is true or not, but if anybody
> wants something to do, this might be interesting.
>
> -Hyrum

Received on 2011-02-01 10:44:13 CET