On 22.11.2017 22:08, Evgeny Kotkov wrote:
> Stefan Sperling <stsp_at_apache.org> writes:
>> However, if rep-sharing is enabled, svn_fs_props_changed() does not work
>> as advertised because properties do not carry a SHA1 checksum with a
>> "uniquifier" which identifies the transaction they were created in.
>> The uniquifier is used by svn_fs_props_changed() to tell apart property
>> representations which share content but were created in different revisions.
>> To fix that problem, make FSFS write SHA1 checksums along with uniquifiers
>> for file properties, just as it is already done for file content.
>> A source code comment indicates that SHA1 checksums for property reps
>> were not written due to concerns over disk space. In hindsight, this was
>> a bad trade-off because it affected correctness of svn_fs_props_changed().
> Thinking about this change, it could be that writing the additional 40-byte
> SHA1 for the property representations is going to eliminate the benefit of
> sharing them in the first place.
Here some more background after digging through the commit history.
The original comment about space saving was ill-worded. It should have
said something like "storing sha1 etc. is pointless because nobody uses
it for props & dirs". Hence, it is a waste of space.
In 1.8, svn_fs_fs__dag_things_different got broken for props due to the
missing uniquifier but that went unnoticed. This has not been a deliberate
trade-off but a mere oversight of the potential implication of shared reps.
It needs to be fixed.
One reason for rep sharing is that rep changes tend to be rare and often
not "in sync" with text changes. That means, they must be pulled from
a different repo location than both, the node tree (dirs and noderevs)
and the file contents. Eliminating most of these accesses by reading
only a few shared proplists should give some speedup.
Space savings on disk are not the main focus. Those might happen for
commits with mass prop changes and a longer list of standard-props
at each node. Savings in the membuffer cache may be more significant,
though, because its entry directory has a limited size and storage may
remain unused if the average cached item size drops below 1k.
> If I recall correctly, rep sharing for properties is mostly there to gain
> from deduplicating small properties, such as svn:eol-style or svn:keywords.
> But with the additional SHA1 written in every representation string, this
> overhead is likely to take more space than the property itself.
> An alternative approach that might be worth considering here would be:
> (1) Extend the on-disk format and allow representation strings without
> SHA1, but with the uniquifier, something like this (where "-" stands
> for "no SHA1"):
> 15 0 563 7809 28ef320a82e7bd11eebdf3502d69e608 - 14-g/_5
> (The new format would be allowed starting from FSFS 8.)
Yes, that would be one option. If you are willing to provide a patch,
I would probably +1 it. Not bothering with the space savings would
be a valid choice as well, IMO.
> (2) Use the new format to allow rep sharing for properties that writes
> the uniquifier so that svn_fs_props_changed() would work correctly,
> and doesn't introduce the overhead of writing SHA1 in the representation
> string for every property.
> (3) Disable rep sharing for properties in FSFS formats < 8 that cannot
> read and write such representation strings without SHA1, but with an
I'd rather have the pre fsfs-8 repos use the full sha1+uniquifier.
That is not worse than storing the props themselves and will
preserve the benefits mentioned above.
> Barring objections and alternative suggestions, I could give a shot at
> implementing this.
Go for it. Maybe notify me once you are done b/c currently, I don't
monitor the commit activity closely.
Received on 2017-11-23 15:25:54 CET