[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

RE: repository GUIDs

From: Bill Tutt <rassilon_at_lyra.org>
Date: 2002-12-12 22:33:59 CET

> From: John Barstow [mailto:John_Barstow@gfsg.co.nz]
>
> > It feels right, but sometimes it's necessary to cluster an index on
an
> > Alternative Key or indeed even on a non-unique column rather than
the
> > Primary Key for various reasons. The reasons in this case are space
> > optimization related. Our funky base-whatever IDs are much shorter
than
> > the number of bytes a GUID has. Monotonically increasing integers
also
> > use a B+-tree index more efficiently.
>
> A few counter-points here.
>
> Monotonically increasing integers are very difficult to handle in
> distributed scenarios. There are inevitably key conflicts whenever
inserts
> occur.
>

It depends on your replication mechanism. With the envisioned
replication mechanism this doesn't turn out to be an issue. Each
repository contains their own local set of monotonically increasing IDs.

> Not all indexes need to be implemented as B+ trees. I don't have my
> purple
> book handy (the movers are supposed to bring it next week), but there
are
> a
> number of indexing schemes that work just fine for GUIDs. I'm
thinking
> bitmap indices off the top of head.
>

Indeed, R-tree type indices are certainly useful in certain
circumstances. I'm afraid you'd have to be more specific on what kind of
binary index you're thinking of. Esp. when the index I'm concerned about
is the NodeRevision PK index, and not the Repository index.

> Monotonically increasing integers as primary keys create insertion
hot-
> spots
> in many database implementations, since new records are always
inserted at
> the end.
>

Those databases really ought to be fixed.

> Index selectivity is likely to be low anyways. There are likely to be
very
> few repositories in the majority of installations, meaning that a
query
> optimizer is unlikely to use the RepositoryID index at all. It's much
> more
> likely to use criteria such as revision number, date ranges, or even
> properties to select an index, since those tend to be highly
selective.
> Even in the case of thousands of repositories, revision number is
probably
> more selective than repository ID - in such cases, you're likely to
have
> millions of records in the linked tables.
>

I'm not talking about index selectivity of the Repository table. I'm
referring to the index selectivity of the new NodeRevision PK when we
get around to widening it by a unique index on the Repository table.

In BDB terms this is always the PK. So the PK of the Repository table
should be something that's selective.
 
> "Premature optimization is the root of all evil." Don't recall who
said
> it,
> but it's true
> even for databases. Dropping a poorly selected primary key got me a
40%
> performance gain once - it was actually 40% slower to use the index
than
> to
> just scan all the records, and we had millions of records.
>

As an unrelated aside, what data store was this with? (So asks the
relational optimizer geek.)

To sum up, it really isn't that big of a deal atm. It's not like we're
widening the NodeRevisionPK yet. When we do, having the only index on
Repository be on GUID will add extra space usage to EVERY new
NodeRevision row. E.g. 16bytes * # of NodeRevisions in repository vs.
2-3 bytes * # of NodeRevisions in repository for the extra RepositoryID
integer in the NodeRevision row.

So at worst we might end up wanting to upgrade the table schema for the
Repository table. All told, that's not that hard to do either.

Bill

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Thu Dec 12 22:34:33 2002

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.