kfogel@collab.net wrote:
>>At the moment the code uses memory roughly proportional to the total
>>lengths of all paths in the transactions.
>>
>>
>
>is both true and the cause of our problems. I'm pretty sure it's
>true, of course, it's the second half I'm not positive about :-).
>Are you sure that path lengths are relevant to total memory usage, or
>are they just lost in the noise?
>
>
A few rough estimators:
The original problem report concerned importing the NetBSD source
tree, so I unpacked the files from the NetBSD 2.0 source ISO.
The original report is for fewer files (~120,000), but we're just doing
big O here.
Noise sources:
The original report is for a memory spike from 19MB -> 44MB, so
results on the order of megabytes are possibly significant.
Hashtable array size is always a power of two; hash node size is
~20 bytes.
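
To put a rough number on the hashtable noise, here is a minimal sketch. It is not the real hash code: hash_overhead is a hypothetical helper, the 4-byte pointer size is an assumption (32-bit build), and the "smallest power of two that holds all entries" growth rule is also an assumption. Only the ~20 byte node size and the power-of-two array come from the notes above.

```python
def hash_overhead(entries, node_size=20, pointer_size=4):
    """Estimate hashtable bookkeeping bytes, excluding keys and values.

    node_size comes from the ~20 byte figure above; pointer_size and the
    sizing rule below are assumptions made for illustration only.
    """
    slots = 1
    while slots < entries:
        slots *= 2  # array size is always a power of two
    node_bytes = entries * node_size
    array_bytes = slots * pointer_size
    return slots, node_bytes, array_bytes

# ~120,000 files, as in the original report
slots, node_bytes, array_bytes = hash_overhead(120_000)
print(slots, node_bytes, array_bytes, node_bytes + array_bytes)
```

Under those assumptions the bookkeeping alone lands in the low megabytes, which is the same order as the spike being investigated.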
First metric was to run find . >/tmp/find-netbsd.
Total size: 8,139,654 bytes (wc -c)
Number of entries: 193,716 (wc -l)
Average path length: ~42 bytes
Measurements were made relative to '.'; paths in memory would be
relative to the root of the repository. Adding /trunk/ to the start of each
path would use an extra 6 chars per entry (~1.1MB in this case).
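
The same first metric can be sketched in Python. This is an illustration, not the commands used for the figures above: path_stats and the sample paths are hypothetical, and one newline is counted per entry so the totals match what wc -c sees on find's output.

```python
def path_stats(paths, prefix_extra=0):
    """Return (total_bytes, entries, average_bytes) for a list of paths.

    Each entry is counted with one trailing newline, as wc -c would.
    prefix_extra is a hypothetical per-entry adjustment, e.g. 6 for the
    /trunk/ case discussed above.
    """
    entries = len(paths)
    total = sum(len(p.encode("utf-8")) + 1 + prefix_extra for p in paths)
    average = total / entries if entries else 0.0
    return total, entries, average

# Made-up sample; the real input would be the lines of /tmp/find-netbsd.
sample = [".", "./sys", "./sys/kern", "./sys/kern/init_main.c"]
print(path_stats(sample))
print(path_stats(sample, prefix_extra=6))

# The prefix overhead quoted above: 6 extra chars on each of 193,716 entries.
print(193_716 * 6)
```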
Second metric is to strip out everything but the last name component
(sed -e 's;^.*/;;'):
Total size: 1,654,088
Number of entries: 193,716 (wc -l)
Average size: ~8.5 bytes
Third metric: size of interned path-name components (sed ... | sort | uniq)
Total size: 577,432
Number of unique strings: 51,236
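
The second and third metrics could be sketched together like this. Again an illustration only: basename_stats and the sample paths are hypothetical, the split on the last '/' stands in for the sed expression, and a set stands in for sort | uniq.

```python
def basename_stats(paths):
    """Return (total_bytes, entries, interned_bytes, unique_count).

    total_bytes counts every basename; interned_bytes counts each distinct
    basename once. One newline per string is included, as wc -c would.
    """
    # Keep only the text after the last '/', like sed -e 's;^.*/;;'
    basenames = [p.rsplit("/", 1)[-1] for p in paths]
    unique = set(basenames)  # equivalent of sort | uniq
    total = sum(len(b.encode("utf-8")) + 1 for b in basenames)
    interned = sum(len(b.encode("utf-8")) + 1 for b in unique)
    return total, len(basenames), interned, len(unique)

# Made-up sample with a repeated component name.
sample = ["./sys/Makefile", "./bin/Makefile", "./bin/ls/ls.c", "./bin/ls/Makefile"]
print(basename_stats(sample))
```

Repeated names such as Makefile are what make the interned total so much smaller than the basename total.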
Bonus metric: estimate entropy using bzip2 -9 -v
Full pathnames: 0.606 bits/byte, 92.42% saved, 8139654 in, 616601 out
Basenames: 1.221 bits/byte, 84.73% saved, 1654088 in, 252521 out
Interned: 2.807 bits/byte, 64.91% saved, 577432 in, 202629 out
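
The bits/byte figures are just 8 * (bytes out) / (bytes in). A small sketch using Python's standard bz2 module at level 9 is below; bits_from_sizes and bits_per_byte are hypothetical helper names, and output from the library may differ by a few bytes from the bzip2 command line.

```python
import bz2

def bits_from_sizes(bytes_in, bytes_out):
    """Compressed bits per input byte, from the two sizes bzip2 -v reports."""
    return 8 * bytes_out / bytes_in

def bits_per_byte(data):
    """Compress data with bz2 at level 9 and return bits per input byte."""
    compressed = bz2.compress(data, compresslevel=9)
    return bits_from_sizes(len(data), len(compressed))

# Recompute the three figures from the sizes reported above.
for name, size_in, size_out in [
    ("Full pathnames", 8139654, 616601),
    ("Basenames", 1654088, 252521),
    ("Interned", 577432, 202629),
]:
    print(name, round(bits_from_sizes(size_in, size_out), 3))

# Highly repetitive made-up input compresses to far below 1 bit/byte.
print(bits_per_byte(b"./sys/kern/init_main.c\n" * 1000))
```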
Received on Mon Dec 13 23:12:07 2004