On 12/13/05, Jim Blandy <firstname.lastname@example.org> wrote:
> On 12/12/05, Erik Huelsmann <email@example.com> wrote:
> > Some time ago, I created a translating stream. Just a few minutes
> > ago, I committed an MD5 calculating stream. This is all part of a
> > greater plan to reduce I/O in libsvn_wc by using more streams.
> Kind of off-topic, but just to play the devil's advocate: if the
> intent is to speed up Subversion, have you measured the effects of
> your changes?
Not really, as in: I have not collected hard facts. This is partly
because different people use the program very differently. Most users
version rather small files, which probably fit in the kernel block
cache, but many people also use it to version large files of up to
hundreds of MBs.
> That is, assume that the temporary files' contents stay in the
> kernel's block cache, so the whole operation is memory-to-memory
> whether you use composed streams or temporary files.
Yes, but that assumes that filesystem read calls have a low overhead.
Of course, the overhead is lower than doing real I/O, but is it lower
than the cost of bad code locality? (I don't know, since there isn't
one special use case I'm considering here.)
Seeing how much snappier wc-propcaching made svn, I think that
(repeatedly) reading small files isn't a costless operation...
Recently, the blocksize changed from 100kB to 16kB, so chances are,
fewer source files fit into the buffer at once now.
> Making a series
> of distinct passes over the full dataset (say, as one does when using
> temporary files) gives you worse data locality, but better code
> locality.
Yes, but you still need to make those calls into the OS to read the
file. Windows provides extensive hooks to monitor file activity. If
I'm correct, that includes file reads. So, on those systems, you
can't even guarantee that a file read is a single-process operation,
meaning that a *lot* of context switches may be required *and* that
code locality may not be as good as it looks from the application POV.
> Using composed streams gives you better data locality, but
> worse code locality. If the common case is to operate on smallish
> files (and most source files aren't that big), then gaining the data
> locality wouldn't be worth it.
Yes, in this case, it won't gain much, but probably won't lose much
either. In the extremely large file case, gaining data locality means
reducing actual I/O, meaning a big gain.
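
To make the two approaches concrete, here is a minimal sketch in plain C.
It is not Subversion code: translate() (which just drops '\r' bytes),
checksum_update() (a byte sum standing in for MD5), WINDOW_SIZE and the
in-memory "temporary file" are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define WINDOW_SIZE 8  /* tiny on purpose; a real window would be kBs */

/* Stand-in "translation": drop every '\r'. Returns bytes written to out. */
static size_t translate(const unsigned char *in, size_t len,
                        unsigned char *out)
{
  size_t i, n = 0;
  for (i = 0; i < len; i++)
    if (in[i] != '\r')
      out[n++] = in[i];
  return n;
}

/* Stand-in "checksum": a running byte sum (not MD5). */
static unsigned long checksum_update(unsigned long sum,
                                     const unsigned char *buf, size_t len)
{
  size_t i;
  for (i = 0; i < len; i++)
    sum += buf[i];
  return sum;
}

/* Two passes: translate everything into an intermediate copy (the
   temporary-file stand-in), then read that copy back to checksum it. */
static unsigned long two_pass(const unsigned char *in, size_t len)
{
  unsigned char *tmp = malloc(len ? len : 1);
  unsigned long sum;
  size_t n;

  assert(tmp != NULL);
  n = translate(in, len, tmp);       /* pass 1 over the whole dataset */
  sum = checksum_update(0, tmp, n);  /* pass 2 over the whole dataset */
  free(tmp);
  return sum;
}

/* One pass: each window is translated and checksummed right away, so the
   data is touched only while it is still in the window buffer. */
static unsigned long composed(const unsigned char *in, size_t len)
{
  unsigned char window[WINDOW_SIZE];
  unsigned long sum = 0;
  size_t pos = 0;

  while (pos < len)
    {
      size_t chunk = len - pos < WINDOW_SIZE ? len - pos : WINDOW_SIZE;
      size_t n = translate(in + pos, chunk, window);
      sum = checksum_update(sum, window, n);
      pos += chunk;
    }
  return sum;
}
```

Both functions return the same value for the same input; the difference
is only in how often the full dataset is traversed and whether an
intermediate copy has to exist at all.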
> Understand, I'm not arguing that this is actually so. I'm just saying
> that I can find an argument that doesn't seem totally dumb that this
> might not be a speedup.
Oh, sure, I understand that. Though I probably sound defensive above,
I'm just exploring the mechanisms surrounding file reads and
explaining my reasoning why I expect a win.
Also, I'm not very familiar with networked filesystems, but you seem
to assume a local filesystem for working copies. Many people have to
work on home directories which reside on servers over a networked fs.
Do networked fs implementations also keep a block cache, or do they
require retransmission of the full file? (That answer too probably
depends on the circumstances...)
> It's certainly an increase in complexity;
> aren't stream filters harder to write than functions that make a pass
> over a file?
I don't consider it harder to do so, because essentially, it's the
same code.
The looping implementation:
  do
    {
      svn_read_from_file (fd, buf, &len, pool);
      <do the calculation/transformation>
    }
  while (len == SVN_WINDOW_SIZE);
The streaming implementation looks like this:
  static void
  stream_handler (my_baton_t *baton)
  {
    svn_read_from_file (baton->fd, baton->buf, &baton->len, baton->pool);
    <do the calculation/transformation>
  }
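
For anyone who wants to compile the comparison, the two shapes above can
be sketched as self-contained C. Everything here is hypothetical:
source_t and read_from_source() stand in for the file and the read call,
sum_bytes() stands in for the calculation, and none of it is the actual
Subversion stream API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WINDOW_SIZE 4  /* tiny on purpose; stands in for SVN_WINDOW_SIZE */

/* Hypothetical in-memory source standing in for a file descriptor. */
typedef struct source_t {
  const unsigned char *data;
  size_t len;
  size_t pos;
} source_t;

/* Read up to *len bytes into buf; on return *len is the count read. */
static void read_from_source(source_t *src, unsigned char *buf, size_t *len)
{
  size_t avail;

  assert(src->pos <= src->len);
  avail = src->len - src->pos;
  if (*len > avail)
    *len = avail;
  memcpy(buf, src->data + src->pos, *len);
  src->pos += *len;
}

/* The "calculation": a byte sum, standing in for MD5 or translation. */
static unsigned long sum_bytes(const unsigned char *buf, size_t len,
                               unsigned long sum)
{
  size_t i;
  for (i = 0; i < len; i++)
    sum += buf[i];
  return sum;
}

/* Looping implementation: the function owns the loop and its state. */
static unsigned long checksum_loop(source_t *src)
{
  unsigned char buf[WINDOW_SIZE];
  unsigned long sum = 0;
  size_t len;

  do
    {
      len = WINDOW_SIZE;
      read_from_source(src, buf, &len);
      sum = sum_bytes(buf, len, sum);
    }
  while (len == WINDOW_SIZE);
  return sum;
}

/* Streaming implementation: the loop body becomes a handler, and the
   state that lived in local variables moves into a baton. */
typedef struct baton_t {
  source_t *src;
  unsigned char buf[WINDOW_SIZE];
  unsigned long sum;
} baton_t;

/* Process one window; returns the number of bytes handled. */
static size_t stream_handler(baton_t *baton)
{
  size_t len = WINDOW_SIZE;

  read_from_source(baton->src, baton->buf, &len);
  baton->sum = sum_bytes(baton->buf, len, baton->sum);
  return len;
}

/* Whoever consumes the stream drives the handler until a short read. */
static unsigned long checksum_stream(source_t *src)
{
  baton_t baton;

  baton.src = src;
  baton.sum = 0;
  while (stream_handler(&baton) == WINDOW_SIZE)
    ;
  return baton.sum;
}
```

The body of stream_handler() is the body of the loop in checksum_loop();
only the place where the state lives and who drives the iteration differ.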
> Of course, if the stream filters' buffers are so large that they
> usually hold the entire file, then you've effectively replaced
> temporary files with in-memory buffers, which seems like it would have
> to be faster.
Most of the files in our source tree fit in the 16kB buffer; nearly all
files would have fit in the old buffer size (100kB). OTOH, we changed
the buffer size partly because some OSes read into buffers whose size
is a power of 2 faster than into buffers of other sizes...
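
As a quick check of the power-of-2 point, here is a small sketch. It
assumes 16kB means 16 * 1024 bytes and that the old 100kB meant either
100 * 1024 or 100 * 1000 bytes; the function names are made up for this
example.

```c
#include <assert.h>
#include <stddef.h>

/* A nonzero n is a power of two exactly when it has a single bit set,
   in which case n & (n - 1) clears that bit and yields zero. */
static int is_power_of_two(size_t n)
{
  return n != 0 && (n & (n - 1)) == 0;
}

/* Assumed byte counts for the buffer sizes mentioned above. */
static void check_buffer_sizes(void)
{
  assert(is_power_of_two((size_t)16 * 1024));    /* new size: 2^14 */
  assert(!is_power_of_two((size_t)100 * 1024));  /* old size, binary kB */
  assert(!is_power_of_two((size_t)100 * 1000));  /* old size, decimal kB */
}
```

Under either reading of "100kB", the old size is not a power of 2, while
the new one is.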
Received on Tue Dec 13 11:39:11 2005