Thanks for the review! Comments inline.
On Thu, Oct 2, 2008 at 11:39 PM, Daniel Shahaf <d.s_at_daniel.shahaf.name> wrote:
> Took me a few days to get to these...
>> Continued evoluation of the metadata schema.
>> - place the checksum directly into this table, rather than a "text_id"
>> field. due to operational issues, the text_id may not be a satisfiable
>> foreign key (we may not want to insert a row).
> Can you explain this scenario? When do you see TEXT_ID not a valid
> foreign key *and* CHECKSUM a valid foreign key (at the same time)?
> Before this change, the schema was "id PRIMARY KEY, checksum NOT NULL".
CHECKSUM might not be valid, but it is a known quantity when the row
is inserted into NODE. Say, during a checkout, I get all the metadata
for the file, but I haven't started to fetch it. So we create a row in
NODE. Later, as we begin the fetching of the content, then we insert a
row into TEXT_BASE.
If we kept it as an ID, then we'd have to create a row in TEXT_BASE
well before any intent of actually fetching that. Now, that said, the
TEXT_BASE.STORED_SIZE would be NULL, so we already know it is not
there or is incomplete. So that might actually be fine. (and, in which
case, we could go back to an ID style keying).
>> +CREATE TABLE TEXT_BASE (
>> /* ### the hash algorithm (MD5 or SHA-1) is encoded in this value */
>> - checksum TEXT NOT NULL,
>> - /* ### enumerated values specifying type of compression */
>> - compression INTEGER NOT NULL,
>> + checksum TEXT NOT NULL PRIMARY KEY,
> Okay. A certain text-base may appear more than once (if we change the
> hash algorithm used), but this won't break anything. (The row may be
> shared, so we'll have to create a new one rather than just modify the
> CHECKSUM column's value.)
>> + /* ### used to verify the pristine file is "proper". NULL if unknown,
>> + ### and (thus) the pristine copy is incomplete/unusable. */
>> + stored_size INTEGER,
> Do we need this?
Yes. Remember that we do not have transactions across the table and
the filesystem. My plan is that we insert the row first, *then* we
fetch the file into the filesystem. At the completion of that event,
we stash the size back into the table. The presence of the size tells
us what it *should* be (detect corruption), and it tells us that the
file should be there [now].
It seems more problematic to fetch the content first, then insert a
row. That's just kind of a gut feel... I don't have immediate reasons
> Most of the time the pristine files don't change under our feet. (When
> they do, they're corrupted.) So we can skip the size check and always
> do only the more expensive check (calculating the checksum). And if we
> skip the size check, we won't need stored_size in the database.
We can do a quick apr_stat() to see if the file is present and the
right size, and have a high confidence that it is usable.
If the stored_size is not event present, then the file is not going to
be there, or it has only been partially-fetched.
>> - refcount INTEGER NOT NULL
>> + refcount INTEGER NOT NULL
> This value can be derived from the other tables, can't it? (See? Who
> said whitespace-only changes were useless?)
I realized this while driving yesterday, yes.
Right now, I only plan to have a TEXT_BASE table, which means I do
actually need a refcount. When I later add a NODE and NODE_CHANGES
table, then I can drop the refcount and just use COUNT(*) on the other
tables. Which means that I also want to add an index on the CHECKSUM
column in those tables. (or ID if we decide to switch back)
Alternatively, keep the refcount and use SQLite triggers to maintain
it. But I think a COUNT(*) should be more than fast enough against an
>> --- trunk/subversion/libsvn_wc/wc_db.h 2008/09/30 04:39:20 33357
>> +++ trunk/subversion/libsvn_wc/wc_db.h 2008/09/30 04:58:37 33358
>> +/* ### @a new_refcount may be NULL */
>> +svn_error_t *
>> +svn_wc__db_pristine_decref(int *new_refcount,
>> + svn_wc__db_pdh_t *pdh,
>> + svn_checksum_t *checksum,
> So, sometimes 'svn_checksum_t *checksum' is the primary key of the
> operation, and sometimes it just describes a file passed in another arg.
> Personally I'd find it easier to read if it were named differently
> depending on its role -- e.g., here it might be named 'svn_checksum_t
> *text_base_id'. That conveys more information than calling it 'checksum'.
>> /* If the file is special, then we need to refer to the encapsulating
> Contrary to the comment, the boolean variable 'special' is not used in
> the function.
True. But the important part was that *if* it is a symlink, then it
was not resolved. And flags come back as svn_node_file/TRUE, and given
that it isn't "dir", then we get the dirname and go from there.
I'll clarify the comment.
>> svn_error_t *
>> svn_wc__db_pristine_dirhandle(svn_wc__db_pdh_t **pdh,
>> svn_wc__db_t *db,
>> const char *dirpath,
>> apr_pool_t *result_pool,
>> apr_pool_t *scratch_pool)
>> /* ### need to fix this up. we'll probably get called with a subdirectory
>> ### of some dirpath that we opened originally. that means we probably
>> ### won't have the subdir in the hash table. need to be able to
>> ### incrementally grow the hash of per-dir structures. */
>> *pdh = apr_hash_get(db->dir_data, dirpath, APR_HASH_KEY_STRING);
>> return SVN_NO_ERROR;
> The metadata location (per-dir, wc root, etc.) is typically per-wc, not
> per-versioned-dir. As the code is, we'll just set a key in the hash for
> every single directory in the wc, recursively. But we could store the
> information (whether the pristine dir is stored in-place or at the wc
> root) just for the wc root and let deeper dirs inherit its setting.
Whenever an operation hits a subdirectory, we don't want to scan
upwards to find the wc-root. We want to just *go* to where the data
is. Thus, we have pdh.base_dir.
I also didn't envision proactively setting subdir values, but lazily
filling them in. Today, the code will create a new pdh structure since
we're doing per-versioned-dir base_dir values. In the future, yes,
they could all share the same pdh.
And note that we can't just have db.wc_root ... any given operation
could cover *multiple* wc roots (think of svn:externals or switched
> Also, for the values, why use strings? An enum should cover the 2-3
> most common cases ("in ./.svn", "in wc root" (we know where it is),
> "in ~/.subversion").
Given a directory, you don't know where the wc-root is, without
looking for it. The ~/.subversion/ or /var/svn options are based on
the config data, so you'd have to scrawl through there if your enum
said HOME or GIVEN_PATH. Rather than scrawl each time, again: just
drop in the result.
Note that if the result is HOME or GIVEN_PATH, then all the hash
values would point to the same PDH structure. Well... at least for a
given WC. I envision an eventual possibility of different configs for
different WCs. For example, a WC on an NFS mount might be WC_ROOT,
while your personal WCs are all GIVEN_PATH.
>> svn_error_t *
>> svn_wc__db_pristine_decref(int *new_refcount,
>> svn_wc__db_pdh_t *pdh,
>> svn_checksum_t *checksum,
>> apr_pool_t *scratch_pool)
>> return svn_error__malfunction(__FILE__, __LINE__, "Not implemented.");
> Interesting spelling of the error handling.
hehe. Figured I'd go there rather than abort() :-)
Note that these functions will probably just go away. It looks like we
don't have to do separate reference count management.
To unsubscribe, e-mail: dev-unsubscribe_at_subversion.tigris.org
For additional commands, e-mail: dev-help_at_subversion.tigris.org
Received on 2008-10-03 12:50:16 CEST