kfogel@collab.net wrote:
>>At the moment the code uses memory roughly proportional to the total
>>lengths of all paths in the transactions.
>
>is both true and the cause of our problems.  I'm pretty sure it's
>true, of course, it's the second half I'm not positive about :-).
>Are you sure that path lengths are relevant to total memory usage, or
>are they just lost in the noise?
   
A few rough estimators:
The original problem report concerned importing the NetBSD source
tree, so I unpacked the files from the NetBSD 2.0 source ISO.
The original report involved fewer files (~120,000), but we're only
doing big-O estimates here.
Noise sources:
    The original report was for a memory spike from 19 MB -> 44 MB, so
results on the order of megabytes are possibly significant.
    The hashtable array size is always a power of two; hash node size is
~20 bytes.
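
To put a number on the hashtable noise, here is a minimal sketch. It
assumes one ~20-byte node per entry, a bucket array rounded up to the next
power of two, and 4-byte pointers (a 32-bit build). The function names and
the pointer size are my assumptions for illustration, not taken from the
actual code.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def hash_overhead_bytes(entries: int, node_size: int = 20,
                        pointer_size: int = 4) -> int:
    """Rough hashtable overhead: one node per entry plus the bucket array.

    node_size and pointer_size are assumed values, not measured ones.
    """
    buckets = next_power_of_two(entries)
    return entries * node_size + buckets * pointer_size


# ~120,000 files, as in the original report
print(hash_overhead_bytes(120_000))
```

Under those assumptions the overhead for ~120,000 entries comes out at a
few megabytes, which is the same order as the spike being investigated.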
   
First metric was to run find . >/tmp/find-netbsd.
    Total size: 8,139,654 bytes (wc -c)
    Number of entries: 193,716 (wc -l)
    Average path length: ~42 bytes
    Measurements were made relative to '.'; paths in memory would be
relative to the root of the repository. Adding /trunk/ to the start of each
path would use an extra 6 chars per entry (~1.1 MB in this case).
Second metric is to strip out everything but the last name component
(sed -e 's;^.*/;;'):
    Total size: 1,654,088 bytes
    Number of entries: 193,716 (wc -l)
    Average size: ~8.5 bytes
Third metric: size of interned path-name components (sed ... | sort | uniq)
    Total size: 577,432
    Number of unique strings: 51,236
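
For anyone who wants to repeat the three measurements on another tree, the
following is a rough Python equivalent of the find/sed/sort/uniq pipelines.
The function name path_metrics and the dictionary keys are made up for this
sketch. Sizes count one newline per entry so that they match wc -c.

```python
def path_metrics(paths):
    """Compute the three size metrics for a list of path strings."""
    # Last name component only, like: sed -e 's;^.*/;;'
    basenames = [p.rsplit("/", 1)[-1] for p in paths]
    # Each distinct component stored once, like: sort | uniq
    interned = set(basenames)

    def size(items):
        # Bytes as wc -c would count them: content plus one newline each
        return sum(len(s.encode("utf-8")) + 1 for s in items)

    return {
        "entries": len(paths),
        "full_bytes": size(paths),
        "basename_bytes": size(basenames),
        "interned_bytes": size(interned),
        "unique_names": len(interned),
    }


sample = [".", "./src", "./src/Makefile", "./lib", "./lib/Makefile"]
print(path_metrics(sample))
```

In the sample, Makefile appears twice as a basename but only once in the
interned set, which is the effect the third metric is measuring.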
Bonus metric: estimate entropy using bzip2 -9 -v
    Full pathnames: 0.606 bits/byte, 92.42% saved, 8139654 in, 616601 out
    Basenames: 1.221 bits/byte, 84.73% saved, 1654088 in, 252521 out
    Interned: 2.807 bits/byte, 64.91% saved, 577432 in, 202629 out
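
The bits/byte and "saved" figures follow directly from the in/out sizes:
bits/byte is 8 * out / in, and saved is 1 - out / in. A small sketch of
that arithmetic, plus the same estimate done with Python's bz2 module
instead of the bzip2 command line (the helper names here are hypothetical):

```python
import bz2


def ratio_stats(in_bytes: int, out_bytes: int):
    """Return (bits per input byte, percent saved) for a compression run."""
    bits_per_byte = 8 * out_bytes / in_bytes
    saved_percent = 100 * (1 - out_bytes / in_bytes)
    return round(bits_per_byte, 3), round(saved_percent, 2)


def bzip2_stats(data: bytes):
    """Compress data at level 9 and report the same two figures."""
    compressed = bz2.compress(data, compresslevel=9)
    return ratio_stats(len(data), len(compressed))


# Re-derive the reported figures from the reported sizes
print(ratio_stats(8139654, 616601))   # full pathnames
print(ratio_stats(1654088, 252521))   # basenames
print(ratio_stats(577432, 202629))    # interned
```

This is only a compressibility estimate. It says how redundant the path
data is, not how much memory any particular representation would use.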
Received on Mon Dec 13 21:38:03 2004