On Jan 5, 2005, at 8:00 PM, Branko Čibej wrote:
> Brian W. Fitzpatrick wrote:
>
>> On Tue, 2005-01-04 at 15:17, Mark Benedetto King wrote:
>>
>>> On Tue, Jan 04, 2005 at 02:49:00PM -0600, Brian W. Fitzpatrick wrote:
>>>
>>>> If the node is a directory, the second through n lines will contain
>>>> a
>>>> serialized hash of entries, where the hash key is the name of the
>>>> child
>>>> component, and the corresponding value is the integer of the
>>>> filename
>>>> where the node's lock is stored.
>>>>
>>>>
>>> If we consider this data element to be a (somehow delimited) list of
>>> integers, then we can try to limit the size of the individual dirent
>>> files, which should avoid quadratic I/O behaviour (relevant when
>>> locking many files in the same directory).
>>>
>>
>> Can you explain this in more detail please? I don't understand what
>> you're saying.
>>
> foo -> 1001
> bar -> 1002
> baz -> 1003
>
> trivially maps to
>
> (foo, bar, baz) -> [1001..1003]
>
> In short, you can compact sequential lists of numbers. The trouble is,
> of course, that the delimiters are all different, so you can't
> compress those as in the case of a "normal" list of integers, where,
> e.g.,
>
> 1,2,3,5,6,7
>
> becomes
>
> (1..3),(5..7)
>
> Now if we assume that, on average, the length of a file name is about
> the same as the lenght of a decimally-encoded integer, and you locked
> all the files in a directory, you'd save around 50% space by
> compressing the list into a number range. That may seem like a lot,
> but it's still only a constant factor that vanishes in the noise when
> the quadratic component begins to make itself felt. Then there's the
> fact that opening a file is usually more expensive than reading data
> from it once it's open.
>
> In short, you avoid quadratic I/O behaviour here in exactly the same
> way as we did in the WC -- by caching the entries files. I'm not sure
> if this is a valid thing to do in the face of possibly multi-process
> access to the FS, though.
mbk explained this to me on irc this afternoon, but after further
discussion with Peter Lundblad, I think that we're going to go ahead
and use MD5 hashes instead of the incrementing integer--mainly because
it allows us to do lock lookup with a single stat (instead of opening
and reading each entries file for each path component) as well as store
only the MD5 for all of the children in a directory node entry file.
So instead of having this complex hash, we have a list of MD5 sums to
read through.
-Fitz
- application/pkcs7-signature attachment: smime.p7s
Received on Thu Jan 6 06:19:31 2005