
Re: FSv2 (was: FREE Apache Subversion Meetup...)

From: Blair Zajac <blair_at_orcaware.com>
Date: Tue, 19 Oct 2010 10:12:08 -0700

On 10/19/2010 01:31 AM, Greg Stein wrote:
> On Mon, Oct 18, 2010 at 23:51, Blair Zajac<blair_at_orcaware.com> wrote:
>> On 10/04/2010 06:45 AM, C. Michael Pilato wrote:
>>>
>>> There, you can learn more about what the Meetups tend to look like, what
>>> other Meetups are planned for this year's conference, and so on. You'll also
>>> find a link to the Subversion Meetup wiki page:
>>>
>>> http://subversion.open.collab.net/wiki/ApacheConNA2010Meetup
>>
>> That's the first mention I've seen of FSv2. What ideas are going into it?
>> What problems is it primarily meant to solve?
>
> FSv2 is a hand-wave.
>
> Personally, I see it as a broad swath of API changes to align our
> needs with the underlying storage. Trowbridge noted that our current
> API makes it *really* difficult to implement an effective backend. I'd
> also like to see a backend that allows for parallel PUTs during the
> commit process. Hyrum sees FSv2 as some kind of super-key-value
> storage with layers on top, allowing for various types of high-scaling
> mechanisms.

How would that API look? The API as it stands is pretty clear.

Background for my wish list.

We use Subversion as a backend for a versioned asset management system.
We get up to 5 commits per second from render processes generating new
assets and artists saving assets. We have interactive GUI users that do
asset lookups all the time.

While the immutability of svn has allowed us to cache revision data, and
our servers can push 4,000 lookups per second to our render farm, which
does lookups on a particular revision, interactive users that do HEAD
lookups suffer because of the high commit rate. We cache data by node-id
in memcached, but because the root node always gets a new node-id and
because the first thing interactive users do is get a list of folders of
the root node, we always get cache misses. I don't really want svn to
change the way new node-ids are assigned to parent nodes all the way to
the root.
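
To make the cache-miss pattern concrete, here is a toy model of the
behaviour described above. It is only a sketch: the `Node`, `commit` and
`list_dir` names are hypothetical, a plain dict stands in for memcached,
and the node-ids are simple counters rather than real svn node-rev ids.
The one property it borrows from the text is that every node on the path
from a change up to the root gets a new id.

```python
import itertools

_next_id = itertools.count(1)  # toy id generator, not svn's real id scheme


class Node:
    """Immutable directory node in a toy copy-on-write tree."""

    def __init__(self, children=None):
        self.node_id = next(_next_id)
        self.children = dict(children or {})


def commit(root, path):
    """Return a new root after touching `path`.

    Every node along the path is copied and so receives a new node-id;
    subtrees off the path are shared unchanged with the old root.
    """
    if not path:
        return Node(root.children)
    name, rest = path[0], path[1:]
    child = root.children[name] if name in root.children else Node()
    new_children = dict(root.children)
    new_children[name] = commit(child, rest)
    return Node(new_children)


cache = {}  # stands in for memcached, keyed by node-id


def list_dir(node):
    """Return (sorted child names, whether the cache was hit)."""
    hit = node.node_id in cache
    if not hit:
        cache[node.node_id] = sorted(node.children)
    return cache[node.node_id], hit
```

In this model a listing of the root misses after every commit, however
small, while a listing of an untouched sibling directory keeps hitting
because its node-id is carried over into the new root.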

1) Scalability to 30,000 child nodes in a single directory.

Currently, a single change to a node in a directory with 20,000 child
nodes causes a new revision file in fsfs to use around 960 kB. With a
commit rate of 1.5 commits per second in a repository, the disk usage is
very high. We introduced a hidden layer of "hash:DD" directories, 30 in
our case, that our internal Subversion server hashes path elements to.
This makes the revision files much smaller, but now when getting a list
of nodes in a directory, we have up to 30 child directories to index,
increasing lookup times.
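
For a rough sense of scale, the figures above can be multiplied out.
This is back-of-the-envelope arithmetic only, and it assumes two things
the text does not state: that "kB" means 1,000 bytes, and that every
commit touches a directory of that size (a worst case).

```python
KB = 1000  # assumption: "kB" read as decimal kilobytes


def revision_growth(bytes_per_rev, commits_per_sec):
    """Return (bytes written per second, bytes written per day)."""
    per_sec = bytes_per_rev * commits_per_sec
    per_day = per_sec * 86_400  # seconds in a day
    return per_sec, per_day


per_sec, per_day = revision_growth(960 * KB, 1.5)
print(f"{per_sec / 1e6:.2f} MB/s, {per_day / 1e9:.1f} GB/day")
# prints "1.44 MB/s, 124.4 GB/day"
```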

If we could remove the need to hash directories, then the lookup on the
root node would be much faster and interactive users would be happier.
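
The cost of the hidden layer can be sketched as follows. The hash
function here (CRC32 modulo 30) and the helper names are assumptions for
illustration; the text only says that path elements are hashed into 30
"hash:DD" directories. What the sketch shows is that one logical
directory listing turns into a read of every bucket directory.

```python
import zlib

NUM_BUCKETS = 30  # bucket count taken from the text


def bucket_for(name):
    """Map a path element to a hidden "hash:DD" directory.

    CRC32 is a stand-in; the real hash function is not given in the text.
    """
    return f"hash:{zlib.crc32(name.encode('utf-8')) % NUM_BUCKETS:02d}"


def add_entry(tree, name):
    """Store `name` under its bucket directory in a dict-of-sets tree."""
    tree.setdefault(bucket_for(name), set()).add(name)


def list_hashed(tree):
    """Return (sorted names, number of bucket directories read)."""
    names = []
    for bucket in tree.values():
        names.extend(bucket)
    return sorted(names), len(tree)
```

A flat directory would need a single read for the same listing, whereas
`list_hashed` reads up to `NUM_BUCKETS` directories and then merges the
results.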

2) I would like to ensure that the new backend supports multiple
modifications to the same node. I don't know if this was designed into
the current backend, but given that I expose svn_fs.h over RPC, clients
can make one or more modifications to the tree, so the new backend
should support this.

And while we're discussing wants.

3) Pools are painful to use. We have repository, revision and
transaction C++ objects stored in an LRU cache. They cache revision and
transaction roots for improved performance. Using the wrong pool for an
RPC method can cause memory leaks (we found one on Monday that caused a
backend server to run out of memory). Constructing and destroying pools
in the wrong order can cause the process to crash. This is hard to get
right, so using a different model would be very useful. I haven't had
the cycles to look at Hyrum's new C++ object and see how that would help.
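
The two failure modes above can be illustrated with a toy model. This is
not APR and not our server code: `Pool`, `rpc_leaky` and `rpc_scoped`
are hypothetical names, and a Python exception stands in for what would
be a crash in C. The only behaviour borrowed from APR is that destroying
a pool also destroys its subpools.

```python
class Pool:
    """Toy stand-in for an APR pool: owns allocations and child pools."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.allocations = []
        self.alive = True
        if parent is not None:
            parent.children.append(self)

    def alloc(self, obj):
        if not self.alive:
            # In C this would be a use-after-free, not a tidy exception.
            raise RuntimeError("allocation from a destroyed pool")
        self.allocations.append(obj)
        return obj

    def destroy(self):
        # Children go first, then this pool's own allocations.
        for child in list(self.children):
            child.destroy()
        self.allocations.clear()
        self.alive = False
        if self.parent is not None:
            self.parent.children.remove(self)
            self.parent = None

    def size(self):
        return len(self.allocations) + sum(c.size() for c in self.children)


def rpc_leaky(cached_pool, n):
    """Per-call temporaries land in the long-lived pool and accumulate."""
    for i in range(n):
        cached_pool.alloc(f"temp-{i}")


def rpc_scoped(cached_pool, n):
    """Per-call temporaries go into a scratch subpool destroyed on exit."""
    scratch = Pool(cached_pool)
    try:
        for i in range(n):
            scratch.alloc(f"temp-{i}")
    finally:
        scratch.destroy()
```

With a pool owned by an object sitting in an LRU cache, repeated calls
to `rpc_leaky` grow the pool until the object is evicted, while
`rpc_scoped` leaves it the same size after each call. Destroying a
parent while something still holds one of its subpools gives the
ordering problem: the next `alloc` on that subpool fails.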

Blair
Received on 2010-10-19 19:12:48 CEST

This is an archived mail posted to the Subversion Dev mailing list.
