Another thought would be to pursue a FUSE-like approach similar to scord
[1][2], which implements a lightweight file system adapter that knows just
enough about the pristine store and the working copy files that it can
maintain a single copy of the pristine contents for the overwhelming
majority of files: those whose working-copy contents are identical to the
pristine base contents. And of course it would also need to perform a
transparent copy-on-write if/when any modifications (e.g. write, truncate,
etc.) are done to a working copy file.

Note: this would not be a full file system, just an adaptation layer on top
of the underlying file system that would manage the wc/pristine pairings
and trigger copy-on-write as necessary. Each file managed by this layer
would be either a direct pass-through to a file in the underlying file
system, or a reference to a pristine file.
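
To make the idea concrete, here is a rough sketch of the bookkeeping such a layer would do. It is not how scord or any FUSE binding works; the class `PristineAwareLayer` and all of its method names are made up for illustration, and a real adapter would hook the file system calls instead of exposing `read`/`append` directly.

```python
import hashlib
import os
import shutil


class PristineAwareLayer:
    """Toy model (hypothetical): each working-copy path is either a
    reference to a shared pristine file or a pass-through to a real file."""

    def __init__(self, pristine_dir, wc_dir):
        self.pristine_dir = pristine_dir
        self.wc_dir = wc_dir
        self.refs = {}  # wc-relative path -> SHA-1 of the shared pristine

    def add_pristine(self, data):
        """Store contents once, keyed by SHA-1, and return the key."""
        sha1 = hashlib.sha1(data).hexdigest()
        path = os.path.join(self.pristine_dir, sha1)
        if not os.path.exists(path):
            with open(path, "wb") as f:
                f.write(data)
        return sha1

    def checkout(self, relpath, sha1):
        """Expose an unmodified file without copying its contents."""
        self.refs[relpath] = sha1

    def _backing_path(self, relpath):
        # Unmodified files resolve to the pristine; others pass through.
        if relpath in self.refs:
            return os.path.join(self.pristine_dir, self.refs[relpath])
        return os.path.join(self.wc_dir, relpath)

    def read(self, relpath):
        with open(self._backing_path(relpath), "rb") as f:
            return f.read()

    def append(self, relpath, data):
        """Copy-on-write: detach from the pristine before modifying."""
        target = os.path.join(self.wc_dir, relpath)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        if relpath in self.refs:
            shutil.copyfile(self._backing_path(relpath), target)
            del self.refs[relpath]
        with open(target, "ab") as f:
            f.write(data)
```

Two files checked out from the same pristine share one stored copy until one of them is written to; only then does that file get its own copy in the underlying file system.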
If the full goal is to reduce pressure on the underlying file system in the
presence of many large working copies (e.g. one per branch), then duplicate
pristine contents, even with super-awesome compression, would not match the
space savings of a de-duplicated, pristine-aware, copy-on-write file system.

[1] http://svn.haxx.se/dev/archive-2007-05/0486.shtml
[2] http://scord.sourceforge.net/
regards,
markt
From: Julian Foad <julianfoad_at_btopenworld.com>
Date: Mon, 2 Apr 2012 11:16:07 +0100 (BST)
Hi Ashod.

1. Filesystem compression.

Would you like to assess the feasibility of compressing the pristine store
by re-mounting the "pristines" subdirectory as a compressed subtree in the
operating system's file system? This can be done (I believe) under Windows
with NTFS <http://support.microsoft.com/kb/307987> and under Linux with
FUSE-compress <http://code.google.com/p/fusecompress/>. Certainly the
trade-offs are different, compared with implementing compression inside
Subversion, but delegating the task to a third-party subsystem could give
us a huge advantage in terms of reducing the ongoing maintenance cost.

2. Uncompressed copies.

There has been a lot of discussion about achieving maximal compression by
exploiting properties of similarity, ordering, and so on. That is an
interesting topic. However, compression is not the only thing the pristine
store needs to do. The pristine store implementation also needs to provide
*uncompressed* copies of the files.

Some of the API consumers can and should read the data through
svn_stream_t; this is the easy part. Other API consumers -- primarily those
that invoke an external 'diff' tool -- need to be given access to a
complete uncompressed file on disk. At the moment, we just pass them the
path to the file in the pristine store. When the pristine file is
compressed, I imagine we will need to implement a cache of uncompressed
copies of the pristine files. The lifetimes of those uncompressed copies
will need to be managed, and this may require some changes to the interface
that is used to access them.

A typical problem is: user runs "svn diff", svn starts up a GUI diff tool
and passes it two paths: the path to an uncompressed copy of a pristine
file, and the path of a working-copy file. The GUI tool runs as a separate
process and the "svn" process finishes. Now the GUI diff is still running,
accessing a file in our uncompressed-pristines cache. How do we manage this
so that we don't immediately delete the uncompressed file while the GUI
diff is still displaying it, and yet also know when to clean up our cache
later?

We could of course declare that the "pristine store" software layer is only
responsible for providing streamy read access, and the management of
uncompressed copies is the responsibility of higher level code. But no
matter where we draw the boundary, that functionality has to be designed
and implemented before we can successfully use any kind of compression.

- Julian
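
As an illustration of the lifetime problem in point 2, one possible policy (not something proposed in the thread) is to leave uncompressed copies in a cache directory and sweep them only after a grace period. The sketch below assumes zlib-compressed pristines; the class `UncompressedCache`, its methods, and the grace-period idea are all hypothetical.

```python
import os
import time
import zlib


class UncompressedCache:
    """Hypothetical cache of uncompressed pristine copies on disk."""

    def __init__(self, cache_dir, grace_seconds):
        self.cache_dir = cache_dir
        self.grace_seconds = grace_seconds

    def materialize(self, sha1, compressed):
        """Return a path to an uncompressed copy, creating it if needed."""
        path = os.path.join(self.cache_dir, sha1)
        if os.path.exists(path):
            # Refresh the timestamp so a recently used copy survives sweeps.
            os.utime(path)
        else:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                f.write(zlib.decompress(compressed))
            os.replace(tmp, path)  # publish the finished file in one step
        return path

    def sweep(self, now=None):
        """Remove copies not touched within the grace period."""
        now = time.time() if now is None else now
        removed = []
        for name in sorted(os.listdir(self.cache_dir)):
            path = os.path.join(self.cache_dir, name)
            if now - os.path.getmtime(path) > self.grace_seconds:
                os.remove(path)
                removed.append(name)
        return removed
```

A time-based sweep only makes early deletion unlikely; it cannot tell whether an external diff tool still has the file open, so it does not by itself answer the question raised above.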
From: Branko Čibej <brane_at_apache.org>
Date: Sun, 01 Apr 2012 09:23:58 +0200
On 31.03.2012 23:30, Ashod Nakashian wrote:
>>> Git can keep deleted items until git-gc is invoked, should we support
>> something similar, we need to be consistent and probably support arbitrary
>> revision history, which is out of scope.
>>
>> I'm confused: how does revision history affect the pristine store?
> If the pristine store also keeps multiple revisions, then it's a whole
> different set of features than what we are aiming for (at least for
> compressed pristines).
Certainly the pristine store keeps multiple revisions of files. After
all, it's just a SHA-1 => contents dictionary, so every time you "svn
update," you'll get new revisions of files in the pristine store.
What the store doesn't do is /know/ about the revisions. Neither does
the wc.db, which only tracks reference counts for the SHA-1 keys. Every
time a file changes, its hash will change, too, a new key will be
inserted in the pristine store, and the reference count for the old key
will be decremented. I'm not sure what happens when the count reaches
zero; used to be that only "svn cleanup" would delete unreferenced
pristines, but ISTR this changed a while ago.
In any case, the pristine store shouldn't worry about revisions, only
about efficiently storing the contents. It doesn't even have to worry
about reference counting, since wc.db already does that.
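
The separation described above could be modeled roughly as follows. This is a sketch, not Subversion's actual schema or API: `PristineStore`, `WorkingCopyDb`, and their methods are invented names, and deleting zero-count pristines only in an explicit `cleanup` step is an assumption, since the real behavior is left open above.

```python
import hashlib


class PristineStore:
    """A plain SHA-1 => contents dictionary; it knows nothing about
    revisions or reference counts (hypothetical model)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        sha1 = hashlib.sha1(data).hexdigest()
        self._blobs.setdefault(sha1, data)
        return sha1

    def get(self, sha1):
        return self._blobs[sha1]

    def remove(self, sha1):
        del self._blobs[sha1]

    def __contains__(self, sha1):
        return sha1 in self._blobs


class WorkingCopyDb:
    """Stand-in for wc.db: maps paths to keys and counts references."""

    def __init__(self, store):
        self.store = store
        self.node_key = {}  # path -> SHA-1 key of its pristine
        self.refcount = {}  # SHA-1 key -> number of paths using it

    def set_contents(self, path, data):
        """Record new pristine contents for a path, e.g. after an update."""
        new_key = self.store.put(data)
        old_key = self.node_key.get(path)
        if new_key == old_key:
            return new_key
        self.node_key[path] = new_key
        self.refcount[new_key] = self.refcount.get(new_key, 0) + 1
        if old_key is not None:
            self.refcount[old_key] -= 1
        return new_key

    def cleanup(self):
        """Assumed policy: drop pristines whose count has reached zero."""
        dead = [k for k, n in self.refcount.items() if n == 0]
        for key in dead:
            self.store.remove(key)
            del self.refcount[key]
        return dead
```

Note that the store never sees paths or counts; only the database side decides when a key is no longer needed.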
-- Brane
P.S.: If we ever implement local snapshots and/or local branches, it
/still/ won't be the pristine store's problem to track whole-tree info.
This is why I like the clear separation between pristine store, which is
a simple dictionary, and wc.db, which is moderately complex.
P.P.S.: When we transition from pristine store per working copy to
pristine store per ~/.subversion directory, then the pristine store will
have to track how many working copies are using it. But that's way in
the future -- and another good reason to use a proper database for the
indexing.
Received on 2012-04-01 09:24:15 CEST
Received on 2012-04-02 20:23:16 CEST