On Tue, May 11, 2010 at 6:16 PM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
> On Tue, May 11, 2010 at 1:56 PM, Stefan Sperling <stsp_at_elego.de> wrote:
> > On Tue, May 11, 2010 at 07:43:33AM -0400, Mark Phippard wrote:
> >> On Tue, May 11, 2010 at 7:27 AM, Stefan Sperling <stsp_at_elego.de> wrote:
> >> > On Tue, May 11, 2010 at 01:36:26AM +0200, Johan Corveleyn wrote:
> >> >> As I understand your set of patches, you're mainly focusing on saving
> >> >> cpu cycles, and not on avoiding I/O where possible (unless I'm missing
> >> >> something). Maybe some of the low- or high-level algorithms in the
> >> >> back-end can be reworked a bit to reduce the amount of I/O? Or maybe
> >> >> some clever caching can avoid some file accesses?
> >> >
> >> > In general, I think trying to work around I/O slowness by loading
> >> > stuff into RAM (caching) is a bad idea. You're just taking away memory
> >> > from the OS buffer cache if you do this. A good buffer cache in the OS
> >> > should make open/close/seek fast. (So don't run a Windows server if
> >> > you can avoid it.)
> >> >
> >> > The only point where it's worth thinking about optimizing I/O
> >> > access is when you get to clustered, distributed storage, because
> >> > at that point every I/O request translates into a network packet.
> >>
> >> You had me until that last part. I think we should ALWAYS be thinking
> >> about optimizing I/O. I have little doubt that is where the biggest
> >> performance bottlenecks live (other than the network, of course). I
> >> agree that making a big cache is probably not the best way to go, but
> >> I think we should always be looking for optimizations that avoid
> >> unnecessary repeated opens and closes.
> >
> > That's true. Avoiding repeated open/close of the same file
> > is a good optimisation. Even with a good buffer cache it will
> > make a difference.
> >
> > So s/The only point/One point/ :)
>
> Yes, some form of caching may or may not be a good approach, but the
> main point is that ideally, for a given client request, every
> interesting rev file should be opened and read exactly once. Currently
> this is definitely not the case (for "svn log" it's closer to 10
> opens/closes per rev file, and around 5 times its size in bytes read;
> with packed revs it's even worse because of the extra lookup of the
> rev offset in the pack manifest file).
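
For the packed case, the manifest lookup at least looks cacheable: parse
the manifest once per pack file and keep the offsets in memory for the
rest of the request. A rough sketch in plain C (all names here are
invented for illustration; real code would sit behind the FSFS APIs and
use APR pools and svn_error_t rather than bare malloc, and I'm assuming
the manifest is one decimal offset per line):

    /* Hypothetical per-request cache of pack manifest offsets, so the
     * manifest is parsed once instead of on every rev access. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct manifest_cache_t
    {
      long first_rev;   /* first revision contained in the pack file */
      size_t count;     /* number of offsets read from the manifest */
      long *offsets;    /* offsets[i] = offset of rev (first_rev + i) */
    } manifest_cache_t;

    /* Read the manifest (assumed: one decimal offset per line) once. */
    static manifest_cache_t *
    manifest_cache_load(const char *manifest_path, long first_rev)
    {
      FILE *fp = fopen(manifest_path, "r");
      manifest_cache_t *cache;
      size_t capacity = 64;
      long offset;

      if (fp == NULL)
        return NULL;

      cache = malloc(sizeof(*cache));
      cache->first_rev = first_rev;
      cache->count = 0;
      cache->offsets = malloc(capacity * sizeof(long));

      while (fscanf(fp, "%ld", &offset) == 1)
        {
          if (cache->count == capacity)
            {
              capacity *= 2;
              cache->offsets = realloc(cache->offsets,
                                       capacity * sizeof(long));
            }
          cache->offsets[cache->count++] = offset;
        }
      fclose(fp);
      return cache;
    }

    /* Every later lookup is an array index: no manifest I/O at all. */
    static long
    manifest_cache_lookup(const manifest_cache_t *cache, long rev)
    {
      return cache->offsets[rev - cache->first_rev];
    }
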
>
> Maybe the ideal situation is currently impossible because of the
> higher-level algorithms (the order in which they retrieve the data).
> So with "some clever caching" I really meant "read the rev file (or
> the interesting parts of it) exactly once, keep that in memory for
> the 10 other accesses you need (which follow very shortly), then
> forget about it". Not a general-purpose LRU cache or anything like
> that. But I really don't know whether this is a good idea (it may be
> difficult to determine when you no longer need it, ...). Just
> guessing ...
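
That "read once, keep it for the burst of accesses, throw it away" idea
might look something like this in plain C (a sketch only, with invented
names; a real implementation would hang the cache off the request's APR
pool so it disappears automatically when the request ends):

    /* Hypothetical request-scoped cache: the first access reads the
     * whole rev file into memory; the ~10 follow-up accesses hit the
     * buffer; everything is discarded when the request ends. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct rev_entry_t
    {
      long rev;
      char *data;
      size_t len;
      struct rev_entry_t *next;
    } rev_entry_t;

    typedef struct request_cache_t
    {
      rev_entry_t *entries;  /* tiny list; a request touches few revs */
    } request_cache_t;

    /* Return the contents of a rev file, reading it at most once
     * per request. */
    static const char *
    request_cache_get_rev(request_cache_t *cache, const char *path,
                          long rev, size_t *len)
    {
      rev_entry_t *e;
      FILE *fp;
      long size;

      *len = 0;
      for (e = cache->entries; e; e = e->next)
        if (e->rev == rev)
          {
            *len = e->len;
            return e->data;        /* cache hit: no I/O */
          }

      /* Cache miss: slurp the file exactly once. */
      fp = fopen(path, "rb");
      if (fp == NULL)
        return NULL;
      fseek(fp, 0, SEEK_END);
      size = ftell(fp);
      fseek(fp, 0, SEEK_SET);

      e = malloc(sizeof(*e));
      e->rev = rev;
      e->len = (size_t)size;
      e->data = malloc(e->len);
      fread(e->data, 1, e->len, fp);
      fclose(fp);

      e->next = cache->entries;
      cache->entries = e;
      *len = e->len;
      return e->data;
    }

    /* When the request finishes, free everything: no LRU bookkeeping. */
    static void
    request_cache_clear(request_cache_t *cache)
    {
      rev_entry_t *e = cache->entries;
      while (e)
        {
          rev_entry_t *next = e->next;
          free(e->data);
          free(e);
          e = next;
        }
      cache->entries = NULL;
    }

The nice property is that there is no eviction policy to get wrong: the
lifetime of the cache is simply the lifetime of the request.
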
>
> In my book, I/O is almost always one of the slowest parts, even if the
> data is on a local 15k rpm disk or even an SSD. Since SVN with FSFS
> touches potentially thousands of little files for a single client
> request, I think it could pay off big time to reduce that I/O as much
> as possible.
>
I remember trying to cache open file handles back when I was working on
the packing code, but it turned out that multiple layers of the FSFS
stack were using the same revision file for different purposes
*simultaneously*. When we reused an open file handle, one function would
seek to a different part of the file, while a function higher up in the
stack understandably didn't expect the file pointer to have changed. It
made things... confusing.
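
To make the failure mode concrete, the shape of the bug was roughly this
(a contrived plain-C illustration, not actual FSFS code):

    /* Two logical "layers" share one FILE*, and layer A's seek silently
     * invalidates layer B's notion of the file position. */
    #include <stdio.h>

    static void
    layer_b_read_header(FILE *shared)
    {
      char buf[16];
      /* Layer B assumes the file pointer still sits at the offset it
       * established earlier... */
      fread(buf, 1, sizeof(buf), shared);
    }

    static void
    layer_a_read_rep(FILE *shared, long rep_offset)
    {
      char buf[64];
      /* ...but layer A, further down the stack, seeked away in the
       * meantime while reading a representation. */
      fseek(shared, rep_offset, SEEK_SET);
      fread(buf, 1, sizeof(buf), shared);
    }

    int main(void)
    {
      /* One handle, two consumers; the filename is illustrative. */
      FILE *shared = fopen("100", "rb");
      if (shared == NULL)
        return 1;
      fseek(shared, 0, SEEK_SET);      /* B positions at the header */
      layer_a_read_rep(shared, 4096);  /* A moves the file pointer */
      layer_b_read_header(shared);     /* B now reads from 4160, not 0! */
      fclose(shared);
      return 0;
    }

The obvious ways out are to give each consumer its own handle (which is
effectively what we had), or to make every consumer remember its own
offset and seek before each read, which is easy to get subtly wrong.
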
There may be other ways of caching this information, which would be great.
But my experience with wc-ng leads me to ask whether we can use an existing
piece of tech rather than implementing Yet Another Cache. (Caching is a
well-known problem with many good solutions; let's not reinvent the wheel.)
This conversation about I/O also reminds me of Jon Trowbridge's comments
about rewriting libsvn_fs_bigtable: the FS layer pretends that all I/O is
essentially free, which it is obviously not. To make matters worse, the
disparity between disk speeds and CPU speeds has increased in the last 10
years, making this assumption even less valid.
-Hyrum