Re: svn commit: r1698359 - in /subversion/trunk/subversion: include/svn_io.h libsvn_subr/stream.c svnadmin/svnadmin.c svnfsfs/load-index-cmd.c tests/libsvn_subr/stream-test.c

From: Evgeny Kotkov <evgeny.kotkov_at_visualsvn.com>
Date: Tue, 1 Sep 2015 21:26:11 +0300

Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com> writes:

> Yes. This is exactly why we can only use it when we have reasonable control
> over the stream's usage, i.e. we can use it in our CL tools because all the
> code that will be run is under our control. But we cannot make e.g.
> svn_stream_for_stdin() use it by default.

[...]

> The best solution seems to be to allow for explicit resource management as
> we do with other potentially "expensive" objects. r1700305 implements that.

I have several concerns about these changes (r1698359 and r1700305):

1) The new public API â€” svn_stream_wrap_buffered_read() â€” is dangerous to
use, and wrong usage causes unbounded memory consumption in seemingly
harmless situations.

   A svn_stream_mark() mark placed on this stream causes all data that's read
   afterwards to be stored in memory, even if the data was processed and
   immediately discarded. Lifetime of the buffered data is bound to the
   lifetime of the pool where the mark object lives. Hence, avoiding memory
   usage issues now requires either precise pool management or explicit control
   of where and when the svn_stream_mark_t objects are added and removed.

   For example, prior to r1700305, svnadmin load-revprops was actually causing
   unbounded memory usage. Executing this command with a dump containing large
   revisions (e.g., .iso files) was quickly consuming all the available RAM on
   the machine. This behavior was caused by a svn_stream_mark_t allocated in
   a long-living pool that was preventing deallocation of the buffered data.

   As of r1700305, the problem is mitigated by calling svn_stream_remove_mark()
   within the svn_stream_readline() implementation. However, the new stream
   is generic, and every potential user of the stream API should be aware of
   these caveats, because she could in fact be working with the new stream.

   Place a mark somewhere in the middle of what svn_repos_parse_dumpstream3()
   does â€” and you could be already consuming unbounded amounts of memory.
   As another example, the pools are not being cleaned up / destroyed during
   error flow. Say, we successfully place a stream mark, but receive an error
   afterwards, and, coincidentally, the caller thinks that it's okay to keep
   reading from the stream after this specific error. Again, the memory usage
   is potentially unbounded.

2) The new API is backward incompatible.

   The problem is not only about how we handle this in our codebase. Existing
   API users have no idea about the fact that having svn_stream_mark_t objects
   could be "expensive", and about the existence of svn_stream_remove_mark().

   Calling svn_stream_mark() can have a severe impact on the behavior of the
   stream, and can trigger memory usage issues if used improperly. The new
   stream is a generic svn_stream_t, but instances of this stream cannot cross
   the API boundaries, because if they do, the users would be required to
   adapt their code in order to avoid the possibility of unbounded memory
   consumption.

3) The behavior of pool-based reference counting is unpredictable.

   The new svn_stream_wrap_buffered_read() stream performs reference counting
   on a per-pool basis, and its behavior depends of whether the corresponding
   pools were cleared / destroyed, or not. This behavior is unpredictable and
   can lead to hard-to-diagnose problems that only happen under non-default
   circumstances or with certain data patterns.

   Furthermore, we have a history of dealing with unbounded memory usage that
   happened due to pool-based refcounting in the first-level FSFS DAG cache
   (CVE-2015-0202 [1]). As far as I recall, at some point a long-living pool
   was causing positive reference count and preventing deallocation of the
   cached objects. As I see it, the new stream does the same sort of reference
   counting for pools holding svn_stream_mark_t objects, and, similarly, if
   one of those objects is allocated in a long-living pool, it prevents the
   deallocation of the data buffered by the stream.

Given the above, I am -1 on this optimization for svnadmin load-revprops that
was implemented in r1698359 and r1700305. Please revert these changes.

As for the problem itself, if the way we currently process the input during
svnadmin load and load-revprops is causing a noticeable overhead, I think that
we should introduce -F (--file) option to both of these commands:

svnadmin load /path/to/repos -F (--file) /path/to/dump

svnadmin load-revprops /path/to/repos -F (--file) /path/to/dump

As long as file streams support both svn_stream_seek() and svn_stream_mark(),
this should avoid byte-by-byte processing of the input and get rid of the
associated overhead.

[1] https://subversion.apache.org/security/CVE-2015-0202-advisory.txt

Regards,
Evgeny Kotkov
Received on 2015-09-01 20:26:48 CEST

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ] [ by messages with attachments ]