Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com> writes:
> Yes. This is exactly why we can only use it when we have reasonable control
> over the stream's usage, i.e. we can use it in our CL tools because all the
> code that will be run is under our control. But we cannot make e.g.
> svn_stream_for_stdin() use it by default.
[...]
> The best solution seems to be to allow for explicit resource management as
> we do with other potentially "expensive" objects. r1700305 implements that.
I have several concerns about these changes (r1698359 and r1700305):
1) The new public API — svn_stream_wrap_buffered_read() — is dangerous to
use, and wrong usage causes unbounded memory consumption in seemingly
harmless situations.
A svn_stream_mark() mark placed on this stream causes all data that's read
afterwards to be stored in memory, even if the data was processed and
immediately discarded. Lifetime of the buffered data is bound to the
lifetime of the pool where the mark object lives. Hence, avoiding memory
usage issues now requires either precise pool management or explicit control
of where and when the svn_stream_mark_t objects are added and removed.
For example, prior to r1700305, svnadmin load-revprops was actually causing
unbounded memory usage. Executing this command with a dump containing large
revisions (e.g., .iso files) was quickly consuming all the available RAM on
the machine. This behavior was caused by a svn_stream_mark_t allocated in
a long-living pool that was preventing deallocation of the buffered data.
As of r1700305, the problem is mitigated by calling svn_stream_remove_mark()
within the svn_stream_readline() implementation. However, the new stream
is generic, and every potential user of the stream API should be aware of
these caveats, because she could in fact be working with the new stream.
Place a mark somewhere in the middle of what svn_repos_parse_dumpstream3()
does — and you could be already consuming unbounded amounts of memory.
As another example, the pools are not being cleaned up / destroyed during
error flow. Say, we successfully place a stream mark, but receive an error
afterwards, and, coincidentally, the caller thinks that it's okay to keep
reading from the stream after this specific error. Again, the memory usage
is potentially unbounded.
2) The new API is backward incompatible.
The problem is not only about how we handle this in our codebase. Existing
API users have no idea about the fact that having svn_stream_mark_t objects
could be "expensive", and about the existence of svn_stream_remove_mark().
Calling svn_stream_mark() can have a severe impact on the behavior of the
stream, and can trigger memory usage issues if used improperly. The new
stream is a generic svn_stream_t, but instances of this stream cannot cross
the API boundaries, because if they do, the users would be required to
adapt their code in order to avoid the possibility of unbounded memory
consumption.
3) The behavior of pool-based reference counting is unpredictable.
The new svn_stream_wrap_buffered_read() stream performs reference counting
on a per-pool basis, and its behavior depends of whether the corresponding
pools were cleared / destroyed, or not. This behavior is unpredictable and
can lead to hard-to-diagnose problems that only happen under non-default
circumstances or with certain data patterns.
Furthermore, we have a history of dealing with unbounded memory usage that
happened due to pool-based refcounting in the first-level FSFS DAG cache
(CVE-2015-0202 [1]). As far as I recall, at some point a long-living pool
was causing positive reference count and preventing deallocation of the
cached objects. As I see it, the new stream does the same sort of reference
counting for pools holding svn_stream_mark_t objects, and, similarly, if
one of those objects is allocated in a long-living pool, it prevents the
deallocation of the data buffered by the stream.
Given the above, I am -1 on this optimization for svnadmin load-revprops that
was implemented in r1698359 and r1700305. Please revert these changes.
As for the problem itself, if the way we currently process the input during
svnadmin load and load-revprops is causing a noticeable overhead, I think that
we should introduce -F (--file) option to both of these commands:
svnadmin load /path/to/repos -F (--file) /path/to/dump
svnadmin load-revprops /path/to/repos -F (--file) /path/to/dump
As long as file streams support both svn_stream_seek() and svn_stream_mark(),
this should avoid byte-by-byte processing of the input and get rid of the
associated overhead.
[1] https://subversion.apache.org/security/CVE-2015-0202-advisory.txt
Regards,
Evgeny Kotkov
Received on 2015-09-01 20:26:48 CEST