[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

[PATCH] Make FSFS use both SHA1 and MD5 checksums

From: David Glasser <glasser_at_davidglasser.net>
Date: Fri, 21 Nov 2008 12:31:54 -0800

On Fri, Nov 21, 2008 at 8:48 AM, Hyrum K. Wright
<hyrum_wright_at_mail.utexas.edu> wrote:
> Per Dave Glasser's concerns as noted in TODO-1.6, and following the example of
> BDB in r33975, I've added the ability to store both SHA1 (for rep-sharing) and
> MD5 (for backwards compat and integrity checking) to FSFS. The SHA1 stuff
> should be conditional on the FS format number, but I plan on handling that in a
> future commit.
>
> The following patch passes all tests, except for svnadmin_tests 12. This test
> is a test for failure, and with the patch it still fails, just in a different
> way. I'm tempted to tweak the expected error output of the test, but want to
> make sure that I'm not hiding more sinister problems with the patch first.
>
> There's still a few rough edges to be fixed in followup commits, but could a
> FSFS-minded person give this a look over to make sure things are right?
>
> Thanks,
> -Hyrum
>
> [[[
> Similar to r33975 with BDB, teach FSFS to store both MD5 and SHA1 checksums.
>
> * subversion/libsvn_fs_fs/fs_fs.c
> (read_rep_offsets): Parse both sha1 and md5 checksums from the node rev.
> (representation_string): Put both the sha1 and md5 checksums, if they exist
> (put the all-zero checksum if they don't).
> (read_rep_baton): Rename context and checksum to reflect type.
> (rep_read_get_baton): Initialize baton contents.
> (rep_read_contents): Update the checksum with the appropriate contents.
> (svn_fs_fs__get_file_delta_stream): More baton variable rename fixes.
> (svn_fs_fs__file_checksum): Return the md5 checksum no matter the type
> requested.
> (svn_fs_fs__rep_copy): Update for new representation checksum members.
> (rep_write_baton): Add contexts for both sha1 and md5.
> (rep_write_contents): Update both contexts.
> (rep_write_get_baton): Initialize both contexts.
> (rep_write_contents_close): Finalize both checksums.
> (write_final_rev): Update for new checksum variable names.
>
> * subversion/libsvn_fs_fs/fs.h
> (representation_t): Remove generic checksum member in favor of specifically
> named checksum member.
>
> * subversoin/libsvn_fs_fs/rep-cache.c

subversion :)

> (svn_fs_fs__get_rep_reference): Make sure only sha1 references are requested.
> (svn_fs_fs__set_rep_reference): Make sure only sha1 references are inserted.
> (svn_fs_fs__inc_rep_reuse): Update variable names.
> ]]]
>
> Index: subversion/libsvn_fs_fs/fs_fs.c
> ===================================================================
> --- subversion/libsvn_fs_fs/fs_fs.c (revision 34308)
> +++ subversion/libsvn_fs_fs/fs_fs.c (working copy)
> @@ -1644,16 +1644,24 @@ read_rep_offsets(representation_t **rep_p,
> return svn_error_create(SVN_ERR_FS_CORRUPT, NULL,
> _("Malformed text rep offset line in node-rev"));
>
> - SVN_ERR(svn_checksum_parse_hex(&rep->checksum, svn_checksum_md5, str, pool));
> + SVN_ERR(svn_checksum_parse_hex(&rep->md5_checksum, svn_checksum_md5, str,
> + pool));
>
> - /* Read the reuse count. */
> + /* The remaining fields are only used for formats >= 4, so check that. */

Throw in a ### or TODO or something if you don't fix this first.

> str = apr_strtok(NULL, " ", &last_str);
> if (str == NULL)
> - {
> - /* The reuse_count field is only in formats >= 4. */
> - return SVN_NO_ERROR;
> - }
> + return SVN_NO_ERROR;
>
> + /* Read the SHA1 hash. */
> + if (strlen(str) != (APR_SHA1_DIGESTSIZE * 2))

It might be worth commenting that the *2 is because we're encoding as
two chars per byte? Maybe that's obvious.

> + return svn_error_create(SVN_ERR_FS_CORRUPT, NULL,
> + _("Malformed text rep offset line in node-rev"));
> +
> + SVN_ERR(svn_checksum_parse_hex(&rep->sha1_checksum, svn_checksum_sha1, str,
> + pool));
> +
> + /* Read the reuse count. */
> + str = apr_strtok(NULL, " ", &last_str);
> rep->reuse_count = apr_atoi64(str);
>
> return SVN_NO_ERROR;
> @@ -1870,11 +1878,17 @@ representation_string(representation_t *rep,
> return "-1";
> else
> return apr_psprintf(pool, "%ld %" APR_OFF_T_FMT " %" SVN_FILESIZE_T_FMT
> - " %" SVN_FILESIZE_T_FMT " %s %" APR_INT64_T_FMT,
> + " %" SVN_FILESIZE_T_FMT " %s %s %" APR_INT64_T_FMT,
> rep->revision, rep->offset, rep->size,
> rep->expanded_size,
> - svn_checksum_to_cstring_display(rep->checksum,
> - pool),
> + rep->md5_checksum ?
> + svn_checksum_to_cstring_display(rep->md5_checksum,
> + pool) :
> + "00000000000000000000000000000000",

Do these constants exist elsewhere in the source? Might be cleaner to
have a #define for them. (Or just use something like "-", although
that would probably mean more special-case code.)

> + rep->sha1_checksum ?
> + svn_checksum_to_cstring_display(rep->sha1_checksum,
> + pool) :
> + "0000000000000000000000000000000000000000",
> rep->reuse_count);
> }
>
> @@ -2442,15 +2456,19 @@ struct rep_read_baton
> apr_size_t buf_pos;
> apr_size_t buf_len;
>
> - /* A checksum context for summing the data read in order to verify it. */
> - svn_checksum_ctx_t *checksum_ctx;
> + /* A checksum context for summing the data read in order to verify it.
> + Note: we don't need to use the sha1 checksum because we're only doing
> + data verification, for which md5 is perfectly safe. */
> + svn_checksum_ctx_t *md5_checksum_ctx;
> +
> svn_boolean_t checksum_finalized;
>
> /* The stored checksum of the representation we are reading, its
> length, and the amount we've read so far. Some of this
> information is redundant with rs_list and src_state, but it's
> convenient for the checksumming code to have it here. */
> - svn_checksum_t *checksum;
> + svn_checksum_t *md5_checksum;
> +
> svn_filesize_t len;
> svn_filesize_t off;
>
> @@ -2486,9 +2504,9 @@ rep_read_get_baton(struct rep_read_baton **rb_p,
> b->fs = fs;
> b->chunk_index = 0;
> b->buf = NULL;
> - b->checksum_ctx = svn_checksum_ctx_create(svn_checksum_md5, pool);
> + b->md5_checksum_ctx = svn_checksum_ctx_create(svn_checksum_md5, pool);
> b->checksum_finalized = FALSE;
> - b->checksum = svn_checksum_dup(rep->checksum, pool);
> + b->md5_checksum = svn_checksum_dup(rep->md5_checksum, pool);
> b->len = rep->expanded_size;
> b->off = 0;
> b->fulltext_cache_key = fulltext_cache_key;
> @@ -2766,22 +2784,22 @@ rep_read_contents(void *baton,
> twice. */
> if (!rb->checksum_finalized)
> {
> - SVN_ERR(svn_checksum_update(rb->checksum_ctx, buf, *len));
> + SVN_ERR(svn_checksum_update(rb->md5_checksum_ctx, buf, *len));
> rb->off += *len;
> if (rb->off == rb->len)
> {
> - svn_checksum_t *checksum;
> + svn_checksum_t *md5_checksum;
>
> rb->checksum_finalized = TRUE;
> - svn_checksum_final(&checksum, rb->checksum_ctx, rb->pool);
> - if (!svn_checksum_match(checksum, rb->checksum))
> + svn_checksum_final(&md5_checksum, rb->md5_checksum_ctx, rb->pool);
> + if (!svn_checksum_match(md5_checksum, rb->md5_checksum))
> return svn_error_createf
> (SVN_ERR_FS_CORRUPT, NULL,
> _("Checksum mismatch while reading representation:\n"
> " expected: %s\n"
> " actual: %s\n"),
> - svn_checksum_to_cstring_display(rb->checksum, rb->pool),
> - svn_checksum_to_cstring_display(checksum, rb->pool));
> + svn_checksum_to_cstring_display(rb->md5_checksum, rb->pool),
> + svn_checksum_to_cstring_display(md5_checksum, rb->pool));
> }
> }
>
> @@ -2932,7 +2950,8 @@ svn_fs_fs__get_file_delta_stream(svn_txdelta_strea
> /* Create the delta read baton. */
> struct delta_read_baton *drb = apr_pcalloc(pool, sizeof(*drb));
> drb->rs = rep_state;
> - drb->checksum = svn_checksum_dup(target->data_rep->checksum, pool);
> + drb->checksum = svn_checksum_dup(target->data_rep->md5_checksum,
> + pool);
> *stream_p = svn_txdelta_stream_create(drb, delta_read_next_window,
> delta_read_md5_digest, pool);
> return SVN_NO_ERROR;
> @@ -3235,8 +3254,10 @@ svn_fs_fs__file_checksum(svn_checksum_t **checksum
> node_revision_t *noderev,
> apr_pool_t *pool)
> {
> + /* TODO: pass a kind argument to this function, to enable it to choose
> + between rep->md5_checksum and rep->sha1_checksum. */
> if (noderev->data_rep)
> - *checksum = svn_checksum_dup(noderev->data_rep->checksum, pool);
> + *checksum = svn_checksum_dup(noderev->data_rep->md5_checksum, pool);
> else
> *checksum = NULL;
>
> @@ -3255,7 +3276,8 @@ svn_fs_fs__rep_copy(representation_t *rep,
> rep_new = apr_pcalloc(pool, sizeof(*rep_new));
>
> memcpy(rep_new, rep, sizeof(*rep_new));
> - rep_new->checksum = svn_checksum_dup(rep->checksum, pool);
> + rep_new->md5_checksum = svn_checksum_dup(rep->md5_checksum, pool);
> + rep_new->sha1_checksum = svn_checksum_dup(rep->sha1_checksum, pool);
>
> return rep_new;
> }
> @@ -4404,7 +4426,8 @@ struct rep_write_baton
> writing to it. */
> void *lockcookie;
>
> - svn_checksum_ctx_t *checksum_ctx;
> + svn_checksum_ctx_t *md5_checksum_ctx;
> + svn_checksum_ctx_t *sha1_checksum_ctx;
>
> apr_pool_t *pool;
>
> @@ -4421,7 +4444,8 @@ rep_write_contents(void *baton,
> {
> struct rep_write_baton *b = baton;
>
> - SVN_ERR(svn_checksum_update(b->checksum_ctx, data, *len));
> + SVN_ERR(svn_checksum_update(b->md5_checksum_ctx, data, *len));
> + SVN_ERR(svn_checksum_update(b->sha1_checksum_ctx, data, *len));
> b->rep_size += *len;
>
> /* If we are writing a delta, use that stream. */
> @@ -4493,7 +4517,8 @@ rep_write_get_baton(struct rep_write_baton **wb_p,
>
> b = apr_pcalloc(pool, sizeof(*b));
>
> - b->checksum_ctx = svn_checksum_ctx_create(svn_checksum_md5, pool);
> + b->sha1_checksum_ctx = svn_checksum_ctx_create(svn_checksum_sha1, pool);
> + b->md5_checksum_ctx = svn_checksum_ctx_create(svn_checksum_md5, pool);
>
> b->fs = fs;
> b->parent_pool = pool;
> @@ -4576,12 +4601,15 @@ rep_write_contents_close(void *baton)
> rep->revision = SVN_INVALID_REVNUM;
>
> /* Finalize the checksum. */
> - svn_checksum_final(&rep->checksum, b->checksum_ctx, b->parent_pool);
> + SVN_ERR(svn_checksum_final(&rep->md5_checksum, b->md5_checksum_ctx,
> + b->parent_pool));
> + SVN_ERR(svn_checksum_final(&rep->sha1_checksum, b->sha1_checksum_ctx,
> + b->parent_pool));
>
> /* Check and see if we already have a representation somewhere that's
> identical to the one we just wrote out. */
> if (ffd->format >= SVN_FS_FS__MIN_REP_SHARING_FORMAT)
> - SVN_ERR(svn_fs_fs__get_rep_reference(&old_rep, b->fs, rep->checksum,
> + SVN_ERR(svn_fs_fs__get_rep_reference(&old_rep, b->fs, rep->sha1_checksum,
> b->parent_pool));
> else
> old_rep = NULL;
> @@ -4881,7 +4909,7 @@ write_final_rev(const svn_fs_id_t **new_id_p,
> noderev->data_rep->revision = rev;
> SVN_ERR(get_file_offset(&noderev->data_rep->offset, file, pool));
> SVN_ERR(write_hash_rep(&noderev->data_rep->size,
> - &noderev->data_rep->checksum, file,
> + &noderev->data_rep->md5_checksum, file,
> str_entries, pool));
> noderev->data_rep->expanded_size = noderev->data_rep->size;
> }
> @@ -4907,7 +4935,7 @@ write_final_rev(const svn_fs_id_t **new_id_p,
> SVN_ERR(svn_fs_fs__get_proplist(&proplist, fs, noderev, pool));
> SVN_ERR(get_file_offset(&noderev->prop_rep->offset, file, pool));
> SVN_ERR(write_hash_rep(&noderev->prop_rep->size,
> - &noderev->prop_rep->checksum, file,
> + &noderev->prop_rep->md5_checksum, file,
> proplist, pool));
>
> noderev->prop_rep->txn_id = NULL;
> Index: subversion/libsvn_fs_fs/fs.h
> ===================================================================
> --- subversion/libsvn_fs_fs/fs.h (revision 34308)
> +++ subversion/libsvn_fs_fs/fs.h (working copy)
> @@ -252,14 +252,19 @@ typedef struct
> * svn_fs_fs__rep_copy. */
> typedef struct
> {
> - /* Checksum for the contents produced by this representation.
> + /* Checksums for the contents produced by this representation.
> This checksum is for the contents the rep shows to consumers,
> regardless of how the rep stores the data under the hood. It is
> independent of the storage (fulltext, delta, whatever).
>
> If checksum is NULL, then for compatibility behave as though this
> - checksum matches the expected checksum. */
> - svn_checksum_t *checksum;
> + checksum matches the expected checksum.
> +
> + The md5 checksum is always filled (with the exception being reading
> + from a repository which store checksums). The sha1 checksum is only

I don't understand the parenthetical here.

> + computed on a write, for use with rep-sharing, otherwise it is NULL. */

second comma should be semi-colon.

This paragraph is generally confusing; I think you need to
differentiate between when the svn_checksum_t's exist (vs NULL), and
when they are *computed*. Aren't the sha1_checksums non-NULL during
reads if the sha1 is written in the rep?

> + svn_checksum_t *md5_checksum;
> + svn_checksum_t *sha1_checksum;
>
> /* Revision where this representation is located. */
> svn_revnum_t revision;
> Index: subversion/libsvn_fs_fs/rep-cache.c
> ===================================================================
> --- subversion/libsvn_fs_fs/rep-cache.c (revision 34308)
> +++ subversion/libsvn_fs_fs/rep-cache.c (working copy)
> @@ -114,6 +114,12 @@ svn_fs_fs__get_rep_reference(representation_t **re
> return SVN_NO_ERROR;
> }
>
> + /* We only allow SHA1 checksums in this table. */
> + if (checksum->kind != svn_checksum_sha1)
> + return svn_error_create(SVN_ERR_BAD_CHECKSUM_KIND, NULL,
> + _("Only SHA1 checksums can be used as keys in the "
> + "rep_cache table.\n"));
> +
> if (!ffd->rep_cache.get_rep_stmt)
> SVN_ERR(svn_sqlite__prepare(&ffd->rep_cache.get_rep_stmt, ffd->rep_cache.db,
> "select revision, offset, size, expanded_size from rep_cache "
> @@ -126,7 +132,7 @@ svn_fs_fs__get_rep_reference(representation_t **re
> if (have_row)
> {
> *rep = apr_pcalloc(pool, sizeof(**rep));
> - (*rep)->checksum = svn_checksum_dup(checksum, pool);
> + (*rep)->sha1_checksum = svn_checksum_dup(checksum, pool);
> (*rep)->revision = svn_sqlite__column_revnum(ffd->rep_cache.get_rep_stmt,
> 0);
> (*rep)->offset = svn_sqlite__column_int(ffd->rep_cache.get_rep_stmt, 1);
> @@ -153,11 +159,17 @@ svn_fs_fs__set_rep_reference(svn_fs_t *fs,
> if (ffd->rep_cache.db == NULL)
> return SVN_NO_ERROR;
>
> - /* Check to see if we already have a mapping for REP->CHECKSUM. If so,
> + /* We only allow SHA1 checksums in this table. */
> + if (rep->sha1_checksum == NULL)
> + return svn_error_create(SVN_ERR_BAD_CHECKSUM_KIND, NULL,
> + _("Only SHA1 checksums can be used as keys in the "
> + "rep_cache table.\n"));
> +
> + /* Check to see if we already have a mapping for REP->SHA1_CHECKSUM. If so,
> and the value is the same one we were about to write, that's
> cool -- just do nothing. If, however, the value is *different*,
> that's a red flag! */
> - SVN_ERR(svn_fs_fs__get_rep_reference(&old_rep, fs, rep->checksum, pool));
> + SVN_ERR(svn_fs_fs__get_rep_reference(&old_rep, fs, rep->sha1_checksum, pool));
>
> if (old_rep)
> {
> @@ -174,7 +186,7 @@ svn_fs_fs__set_rep_reference(svn_fs_t *fs,
> APR_OFF_T_FMT, SVN_FILESIZE_T_FMT,
> SVN_FILESIZE_T_FMT, APR_OFF_T_FMT,
> SVN_FILESIZE_T_FMT, SVN_FILESIZE_T_FMT),
> - svn_checksum_to_cstring_display(rep->checksum, pool),
> + svn_checksum_to_cstring_display(rep->sha1_checksum, pool),
> fs->path, old_rep->revision, old_rep->offset, old_rep->size,
> old_rep->expanded_size, rep->revision, rep->offset, rep->size,
> rep->expanded_size);
> @@ -189,7 +201,8 @@ svn_fs_fs__set_rep_reference(svn_fs_t *fs,
> "values (:1, :2, :3, :4, :5, 0);", fs->pool));
>
> SVN_ERR(svn_sqlite__bind_text(ffd->rep_cache.set_rep_stmt, 1,
> - svn_checksum_to_cstring(rep->checksum, pool)));
> + svn_checksum_to_cstring(rep->sha1_checksum,
> + pool)));
> SVN_ERR(svn_sqlite__bind_int64(ffd->rep_cache.set_rep_stmt, 2, rep->revision));
> SVN_ERR(svn_sqlite__bind_int64(ffd->rep_cache.set_rep_stmt, 3, rep->offset));
> SVN_ERR(svn_sqlite__bind_int64(ffd->rep_cache.set_rep_stmt, 4, rep->size));
> @@ -219,13 +232,14 @@ svn_fs_fs__inc_rep_reuse(svn_fs_t *fs,
> fs->pool));
>
> SVN_ERR(svn_sqlite__bind_text(ffd->rep_cache.inc_select_stmt, 1,
> - svn_checksum_to_cstring(rep->checksum, pool)));
> + svn_checksum_to_cstring(rep->sha1_checksum,
> + pool)));
> SVN_ERR(svn_sqlite__step(&have_row, ffd->rep_cache.inc_select_stmt));
>
> if (!have_row)
> return svn_error_createf(SVN_ERR_FS_CORRUPT, NULL,
> _("Representation for hash '%s' not found"),
> - svn_checksum_to_cstring_display(rep->checksum,
> + svn_checksum_to_cstring_display(rep->sha1_checksum,
> pool));
>
> rep->reuse_count =
> @@ -242,7 +256,8 @@ svn_fs_fs__inc_rep_reuse(svn_fs_t *fs,
> SVN_ERR(svn_sqlite__bind_int64(ffd->rep_cache.inc_update_stmt, 1,
> rep->reuse_count));
> SVN_ERR(svn_sqlite__bind_text(ffd->rep_cache.inc_update_stmt, 2,
> - svn_checksum_to_cstring(rep->checksum, pool)));
> + svn_checksum_to_cstring(rep->sha1_checksum,
> + pool)));
> SVN_ERR(svn_sqlite__step_done(ffd->rep_cache.inc_update_stmt));
>
> return svn_sqlite__reset(ffd->rep_cache.inc_update_stmt);
>
>

Generally looks reasonable to me.

--dave

--
David Glasser | glasser@davidglasser.net | http://www.davidglasser.net/
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe_at_subversion.tigris.org
For additional commands, e-mail: dev-help_at_subversion.tigris.org
Received on 2008-11-21 21:32:09 CET

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.