
Re: binary file size limit? (700MB retry - VERY annoying svn issues)

From: Benjamin Pflugmann <benjamin-svn-usr_at_pflugmann.de>
Date: 2004-03-12 19:40:18 CET

Hi!

On Fri 2004-03-12 at 02:41:37 +0100, c.a.t. wrote
[...]
> X:\SVNSandbox>svnadmin create file:///X:/SVNSandbox/Repos
> # [...create Work and put a ~700MB (790,582,464) file in...]
>
> X:\SVNSandbox>svn add Work/huge700.bin
> A (bin) Work\huge700.bin
> X:\SVNSandbox>svn commit Work/huge700.bin -m "initial add huge"
> Adding (bin) Work\huge700.bin
> Transmitting file data .
> Committed revision 1.
> X:\SVNSandbox>echo %time%
> ~10 minutes total.
[...]
> X:\SVNSandbox\Work> touch huge700.bin
> *windows explorer booooom*
[...]
> -> svn dev@ team: optimization request:
> if you already -recognize-, that a file is only
> 'touched' and its MD5 still is the same, why is the timestamp
> in the entries file not updatet? but instead the MD5 is recalculated
> on -every- status/commit/update access to the file !

They already do. A quick search for "timestamp" found e.g.

  http://subversion.tigris.org/issues/show_bug.cgi?id=1523

The existence of such (fixed) bugs shows that some effort was taken to
prevent repeated calculation of checksums. But I don't doubt your
observation, just your conclusion.

So the question is: did you find a bug, or is this an intended
exception to the rule?

IIRC, the timestamp (recorded in the .svn-area) will only be rewritten
if the WC client library has taken write locks anyhow. I remember some
discussion on the dev list about that, but a quick search didn't come
up with any useful references.
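
To make the behaviour I am describing concrete, here is a minimal
sketch in Python. It is not Subversion's actual code: the Entry class,
is_modified() and the have_write_lock flag are hypothetical names, and
it assumes the only recorded data are a timestamp and an MD5 of the
pristine copy. It shows the cheap timestamp check, the expensive
checksum fallback for a touched file, and that the recorded timestamp
is only repaired when a write lock is already held.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for what the .svn area records about a file."""
    recorded_mtime: float
    base_md5: str


def md5_of(path):
    """Checksum the file in chunks so a large file need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_modified(path, entry, have_write_lock):
    """Return True if the working file differs from its recorded base."""
    mtime = os.stat(path).st_mtime
    # Cheap path: the timestamp matches, so the checksum is skipped.
    if mtime == entry.recorded_mtime:
        return False
    # Expensive path: the file was at least touched, so compare contents.
    if md5_of(path) != entry.base_md5:
        return True
    # Content is unchanged. Repair the recorded timestamp only if the
    # caller already holds a write lock (assumed: commit yes, status no).
    if have_write_lock:
        entry.recorded_mtime = mtime
    return False
```

With have_write_lock=False (the read-only case) a touched file pays
for the checksum on every call; with have_write_lock=True it pays once.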

[...]
> now for the timings of a 'svn status' when the file timestamp has
> been 'touched' (not modified)
>
> X:\SVNSandbox\Work>svn status
> #nothing printed
[...]
> ok, you can see the same 2 minute lockup
> (MD5 check/file compare actually) that happened to TSVN
> on the console too.
> status doesnt print anything because the file isnt modified.
> but (as stated above) from now on, every svn status will take +2
> minutes.

Since svn status is a read-only operation, it won't touch the
timestamp, I suppose.

[...]
> =====
> ok, lets see how we can get rid of that 2 minute delay...
>
> >svn commit
> At revision 1.

From what you wrote later, I assume you did an svn update.

> woho, svn is 'too' smart -- this command completes immediatly because
> svn detects that the local version has the same revision than the
> repository.
> it doesn't matter that the locally file is modified.

Correct. Since local modifications have no influence over whether an
update is needed or not, that's the correct behaviour, IMHO.

> so this doesn't help the 2 minute delay...
>
> =====
> ok, lets try again and see how we can get rid of that 2 minute delay...
>
> # time 1:22:00.36
> X:\SVNSandbox\Work> svn commit
> # nothing printed
> # time 1:24:10.26
>
> earlier, svn status took 2 minutes and printed that nothing has changed --
> now svn commit takes another 2 minutes.
>
> ok: after the -commit- all operations work fast again,
> so svn must have updated its internal timestamp.

That fits with what I said above: svn commit takes write locks on the
.svn area, therefore the WC lib can update the timestamps on-the-fly.

> -> svn dev@ team: optimization request:
> like with CVS, an "update" must put the modified timestamp into
> entries, if the file did not change. otherways all touched
> files (e.g. script generated files) sum up quite bad.

I am not sure I understand you correctly. Script generated files
shouldn't be checked in (usually), so they are unversioned, and the
above "time penalty" affects only versioned files.

Anyhow, forcing update to look at files it wouldn't normally look at
is not the way to go, IMHO.
I understand that you want a record-the-changed-timestamps command,
but it's not update. (Note however, that update will of course record
the timestamps of the files it does an update for.)

One command that does what you want is svn revert. But I wouldn't like
to suggest it for this purpose. First, it does more than record an
updated timestamp: it copies the pristine copy back, which takes some
time with large files. Second, revert is a potentially destructive
command and so there is a higher than usual risk in running it (like
the whining after mistyping the filename). And third, it would be
nicer if the job was done by some command you would be running anyhow
(i.e. the user shouldn't have to worry about how svn optimizes
detecting changes).

What works fine is a simple svn commit on the file. commit does a
check on the file to see which changes have to be submitted, but there
are none and so it simply updates the recorded timestamp. Of course,
that also works if you commit the whole directory containing the
file as part of a check-in you wanted to do anyway.

[...]
> lets diff...
>
> X:\SVNSandbox\Work>svn diff
> Index: huge700.bin
> Cannot display: file marked as a binary type.
> svn:mime-type = application/octet-stream
>
> hmmm nothing to say here --
> not even a few details, e.g. that the filesize changed.

That wouldn't be good, IMHO. Displaying a difference (such as length)
only in some cases suggests that when no difference is displayed, the
files are the same. If you want to add output, add something like cmp
for binaries, which gives the first position that differs.
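
As a rough illustration of the cmp-like output I mean, here is a small
Python sketch. The function name first_difference() and the chunk_size
parameter are made up for this example, and unlike cmp it reports a
0-based offset.

```python
def first_difference(path_a, path_b, chunk_size=1 << 16):
    """Return the 0-based offset of the first differing byte, or None.

    If one file is a prefix of the other, the offset is the length of
    the shorter file.
    """
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a == chunk_b:
                if not chunk_a:
                    return None  # both files ended at the same point
                offset += len(chunk_a)
                continue
            # Locate the first mismatch inside the differing chunks.
            for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                if x != y:
                    return offset + i
            # One chunk is a prefix of the other, so the difference
            # starts where the shorter one ends.
            return offset + min(len(chunk_a), len(chunk_b))
```

For a file that only had bytes appended, this returns the old file
length, which says more than "the size changed" and is never silent
when the contents differ.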

[...]
> X:\SVNSandbox\Work>svn commit -m "appended text"
> Sending huge700.bin
> Transmitting file data .
>
> uh oh.. I don't like what I see:
> Repos/db/strings is currently growing to more than double its size (!!!)
> what about the "supports binary diffs" ???

It does. If you check in a third or fourth time, you will see that it
doesn't grow much anymore. Subversion stores deltas, but it doesn't
explicitly free "holes" in the database. The free space will be reused
later or has to be freed by some administrative action on the
underlying database.
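
To show why a stored delta for "appended 16 bytes" is tiny, here is a
toy sketch in Python. It is emphatically not Subversion's delta
algorithm or storage format, which is more general; make_delta() and
apply_delta() are hypothetical names, and the encoding (shared prefix
length, shared suffix length, inserted bytes) is just the simplest
thing that demonstrates the idea.

```python
def make_delta(old, new):
    """Encode `new` against `old` as (prefix_len, suffix_len, inserted)."""
    limit = min(len(old), len(new))
    # Count the bytes both versions share at the start.
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    # Count the bytes both versions share at the end, without
    # overlapping the prefix already counted.
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return prefix, suffix, new[prefix:len(new) - suffix]


def apply_delta(old, delta):
    """Rebuild the new version from the old one plus a delta."""
    prefix, suffix, inserted = delta
    return old[:prefix] + inserted + old[len(old) - suffix:]


old = b"A" * 1000
new = old + b"0123456789abcdef"
delta = make_delta(old, new)
assert apply_delta(old, delta) == new
print(len(delta[2]))  # number of literal bytes the delta has to carry
```

Here the delta carries only the 16 appended bytes plus two lengths,
however large the original file is.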

> its sending the whole file content up to the database, taking
> again ~11 minutes to complete.

I doubt that. Try it over a network and do some sniffing (well, or
timing, if the network is slow). I bet that most of the 11 minutes is
the repository shuffling data.

Oh, and by the way, deltas (or diffs) for the network layer and deltas
for storing in the database are quite unrelated. The one will still
work if you disable the other and vice versa.

[...]
> !!
> -> svn dev@ team: binary-diff bug with large files!!? - I just appended 16
> bytes.
> !!

See above. Again, correct observation (of file size), wrong conclusion.
Just to be sure I tested it with 5 check-ins (each time adding a few
bytes) on a 50MB file. The repository still needs only about 104MB.

> svn works just fine in the common cases, but I hope
> my 'large file test' will give new ideas for optimizations
> and result in some bugfixes.

As far as I can see, everything works as intended/documented, so
probably no bugfixes. But I want to encourage you to continue testing
worst case scenarios, I am sure there is a lot to gain.

Bye,

        Benjamin.

Received on Mon Mar 15 17:42:47 2004

This is an archived mail posted to the Subversion Users mailing list.