Re: Serialized Changeset

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: 2005-05-31 23:59:22 CEST

Tez Kamihira wrote:
> I tried to define a new changeset format:
>
> http://scm.bluegate.org/scs.txt

Excellent. I'm glad to see this work. It is a complex problem. I hope that
my comments are useful.

> b.) Do you think this specification has enough ability to describe
> your system's changeset completely ?

Not quite. I don't think this is able to capture the information: "This file
is a new copy (with history) of the existing version-controlled file FOO."

> c.) :-( Then, what is the missing part to achieve it ? Could you
> tell it immediately ? or there exists very subtle/touchy issues
> ?

How are character sets and different line-ending conventions handled in plain
text? (I see later on that non-ASCII characters are not yet addressed by this
proposal.)

> Item Type Symbol
>
> Directory Item, symbolic Item, regular Item and null Item have
> symbolic representation, that is, 'D', 'S', 'R', '0'. These
> symbols are called Item type symbols.

These symbols are not used in this document. Only the symbols used in extended
markers are used, and they are not defined here.

> 4.1. Standard Hunk
>
> Standard Hunk Header usually has the following format:
>
> @@ -A,B +C,D @@
[...]
> (1) B or D can be omitted when it equals to 1.

B and/or D can be omitted in the traditional Unidiff format, and so SCS parsers
might want to treat them as optional so that they can parse both traditional
Unidiff and SCS hunks in the same way. However, I recommend that SCS
generators MUST write all four numbers. This is because making them optional
doesn't save much space or time, because the content is rarely exactly one
line, especially when context lines are used; and also because every optional
thing in any specification makes the parsers a bit more complicated and reduces
the possibility for backward-compatible enhancements to be made to the format
in the future.
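
To illustrate the asymmetry suggested above (lenient parser, strict generator),
here is a minimal sketch. The function names are hypothetical and not part of
the SCS proposal; the only assumption is the "@@ -A,B +C,D @@" header layout
quoted above, with a missing B or D read as 1 as in traditional Unidiff.

```python
import re

# Traditional Unidiff allows the sizes B and D to be omitted when they equal 1,
# so both ",B" and ",D" are optional in the pattern.
_HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (A, B, C, D); a missing B or D is read as 1."""
    m = _HUNK_RE.match(line)
    if m is None:
        raise ValueError("not a hunk header: %r" % line)
    a, b, c, d = m.groups()
    return (int(a), int(b) if b is not None else 1,
            int(c), int(d) if d is not None else 1)


def format_hunk_header(a, b, c, d):
    """Always write all four numbers, as recommended for SCS generators."""
    return "@@ -%d,%d +%d,%d @@" % (a, b, c, d)
```

A parser written this way accepts both "@@ -3 +3,2 @@" and "@@ -3,1 +3,2 @@",
while the generator only ever emits the second form.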

> 4.2. Binary Hunk
[...]
> ---x ./binary.bin 2005-05-10 13:05:26.000000000 +0900
> +++x ./binary.bin 2005-05-10 13:06:10.000000000 +0900
> @@ bin: strangelove, -1,3 +0,0 @@
> -
> - ... forward.bin data encoded by base64 ...
> -

Please specify exactly what the numbers A, B, C, D (pre-offset, pre-size,
post-offset, post-size) mean in a binary hunk.

>
> The pre-defined binary algorithms are as follows:

"formats", not "algorithms" (see below).

> plain (irreversible):
>
> Not calculating actual difference, the original and modified
> version of the file are simply base64 encoded and embedded to
> the Hunk. The original lines are added by '-' at the head of
> each line, and modified lines are added by '+'.
>
> gzip (irreversible):
>
> Same as plain, except that compression must be occurred before
> base64 encoding.
>
> xdelta (irreversible):
>
> Using xdelta algorithm for calculation.
>
> bsdiff (irreversible):
>
> Using bsdiff/bspatch algorithm for calculation.

Please explain why certain particular formats are allowed and no others.

Presumably you want to limit the methods to a small, fixed set, so that it is
possible to write an SCS "patch" program that can understand all possible SCS
patches.

You probably need to limit the methods to ones whose algorithms and formats are
"free" as in unencumbered by patents etc., so that everyone is able to use the
SCS format. Does that exclude using gzip?

Why are both xdelta and bsdiff allowed? Does each of them have unique, strong
advantages over the other in certain situations?

Note that you really need to specify the data formats that are allowed, rather
than the algorithms that are used to generate and decode them. Usually there
is a one-to-one mapping between data formats and algorithms, but not always.
For example, the "xdelta" algorithm generates the same data format as the
"vdelta" algorithm, so you should specify "the xdelta data format" or "the
xdelta/vdelta data format" rather than "the xdelta algorithm". It does not
matter whether an SCS-diff program uses xdelta or vdelta or any other algorithm
to generate hunks in that format. An "SCS-patch" program using xdelta will
understand patches generated by an "SCS-diff" program that used vdelta, and
vice versa.

> 4.3. Property Hunk

What is the format of a property hunk?

[...]
> Some property keys, just like file permission, are reserved. The
> list of reserved keys are as follows:
>
> permission: file permission information.

Are the reserved keys in the same name space as user-defined keys? That would
be awkward.

Note that file permission information is difficult to represent in a portable
manner, but there may be standards that you can use, such as are used in CD-ROM
file systems.

> The body of property Hunk MUST be sorted by their key order. This
> rule MUST be applied in both deleted lines part and appended lines
> part.

Character set and encoding need to be specified in order to sort the keys into
order. I would suggest that the property name (key) must be converted to
Unicode and encoded in UTF-8, if it is feasible to demand this.
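
As a small sketch of what a well-defined key order could look like: once keys
are UTF-8, one option is to sort by the raw bytes of the encoding. Choosing
byte order is my assumption; the proposal does not say which ordering is meant.

```python
def sort_property_keys(keys):
    """Sort property keys by the byte order of their UTF-8 encoding.

    Assumption: plain byte order is the intended ordering. It does not
    depend on the locale of the machine doing the sorting.
    """
    return sorted(keys, key=lambda k: k.encode("utf-8"))


keys = ["permission", "Zeta", "\u00e9clair", "alpha"]
ordered = sort_property_keys(keys)
```

UTF-8 byte order coincides with Unicode code point order, so every
implementation would agree on the result regardless of locale settings.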

> We could consider the file modification time as a kind of property,
> but this values are changed almost every time. If it would be put
> into property Hunk, almost every time SCS has property Hunk in each
> Unit. To avoid this, modification time MUST be put in the tail of
> the Unit header.

I wonder how Subversion would handle this. I suppose we would be
special-casing some of Subversion's reserved properties anyway, so we can do
this without difficulty, and we would also convert Subversion's
"svn:executable" property to/from an SCS "permission" property.

> The binary encoding rule accepts
>
> \\, \', \", \n, \r, \t, \ooo, \xhh
>
> characters. [FIXME:]

Why are escape codes provided for quotes?

Does every backslash/newline/return/tab in the value have to be represented by
its escape code (I recommend "yes"), or is the use of these escapes optional?

In Subversion, a property can have a non-text value - e.g. a JPEG picture. It
would seem odd to encode that by a sequence of "\ooo" or "\xhh" groups rather
than in base-64. However, such property values are not expected to be large,
and do not appear to be widely used, so it is probably OK and sensible to
encode the value as if it were text.
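
A minimal sketch of the "escaping is mandatory" reading follows. It assumes
that backslash, newline, return and tab are always escaped, that any other
byte outside printable ASCII is written as \xhh, and that quotes are left
alone on output but accepted in escaped form on input. None of this is
settled by the proposal; the function names are hypothetical.

```python
# Bytes that always get a short escape when writing a value.
_SIMPLE = {0x5C: "\\\\", 0x0A: "\\n", 0x0D: "\\r", 0x09: "\\t"}

# Single-character escapes accepted when reading a value.
_UNESCAPE = {"\\": 0x5C, "n": 0x0A, "r": 0x0D, "t": 0x09, "'": 0x27, '"': 0x22}


def escape_value(data: bytes) -> str:
    """Encode a property value as one line of printable ASCII."""
    out = []
    for byte in data:
        if byte in _SIMPLE:
            out.append(_SIMPLE[byte])
        elif 0x20 <= byte < 0x7F:
            out.append(chr(byte))
        else:
            out.append("\\x%02x" % byte)
    return "".join(out)


def unescape_value(text: str) -> bytes:
    """Decode the escapes listed in the proposal back into bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ord(ch))
            i += 1
            continue
        code = text[i + 1]
        if code == "x":                      # \xhh
            out.append(int(text[i + 2:i + 4], 16))
            i += 4
        elif code in "01234567":             # \ooo
            out.append(int(text[i + 1:i + 4], 8))
            i += 4
        else:                                # \\ \n \r \t \' \"
            out.append(_UNESCAPE[code])
            i += 2
    return bytes(out)
```

With mandatory escaping, an encoded value never contains a raw newline, so
each property stays on one line of the hunk. A binary value such as a JPEG
expands to roughly four characters per non-printable byte, which is the cost
mentioned above.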

> 4.4. ID Hunk
>
> Item ID would be embedded in a special Hunk which MUST be always at
> the beginning of the all Hunks. This Hunk is called ID Hunk. The
> format of ID Hunk is as follows:
>
> If this file is tagged by GNU arch's tagline method:
>
> @@ id: i_cdbdd634-3f67-438c-97d3-a63a0699d6b9 @@
>
> or by Subversion's node ID method:
>
> @@ id: [FIXME:] @@

I don't know.

> ID Hunk is optional and doesn't have to always exits, but if exists,
> the patch operation MUST be done by the ID information and MUST NOT
> by the file name in the Unit header.

Shouldn't the "patch" program check that the file name in the unit header is
correct? It depends where this patch is being applied - to a repository or to
plain files.

I'm not sure how this id is to be used. Presumably this id refers to the item
being modified or added or deleted, but what if the change is renaming or
copying an item? Can the same item (id) exist by multiple names (paths) in the
repository, and if so, which of those paths would be renamed? In a copy, does
this id refer to the old or the new item? Or is a copy not represented by a
single Unit?

> In Subversion case, so-called node ID would be embedded into the ID
> Hunk. [FIXME: Am I wrong ?]

I don't know.

[...]
> The permitted character in the Item ID is out of scope of this
> document. But [0-9a-zA-Z_]+ would be reasonable by extended regular
> expression.

That set of characters isn't sufficient for the Arch id that you used as an
example :-)
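
This is easy to check mechanically. The sketch below tests the Arch id from
the example in section 4.4 against the suggested expression; the second
pattern, with '-' added, is only my illustration of one possible fix.

```python
import re

# The character set suggested in the proposal.
ITEM_ID_RE = re.compile(r"[0-9a-zA-Z_]+")

# The Arch id used as the example in section 4.4.
arch_id = "i_cdbdd634-3f67-438c-97d3-a63a0699d6b9"

matches_suggested = ITEM_ID_RE.fullmatch(arch_id) is not None

# Hypothetical widened set that also allows '-'.
matches_widened = re.fullmatch(r"[0-9a-zA-Z_-]+", arch_id) is not None
```

The first check fails because of the hyphens in the id; the second succeeds.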

> 5. Differential Calculation
>
> Given two Items which have any Item types, we can always calculate
> their difference. SCS allows Item type transitions even between
> different Item types. We have to define general differential rule
> even between hetero-Item types.

What type of Unit is used to represent the replacement of a directory or a
symbolic link by a binary file, or vice versa? A Binary Unit or an (extended)
Unidiff Unit? :-)

[...]
> The actual differential algorithm is as follows:
>
> (1) Check Unidiff calculation is available or not.

"Check whether a Unidiff calculation is possible on these two items."

> (2) If possible, that output is just the answer.
>
> (3) If impossible, arbitrary binary differential algorithm is
> applied to them and the result is converted to base64 form.

May I suggest this wording: "(3) If impossible, the SCS diff program uses an
algorithm of its choice to generate one of the allowed binary formats specified
in section 4.2: Binary Hunk." ?

> We don't touch the condition when calculation in (1) is failed to
> avoid some complex issues around binary check algorithms.

Do you mean, "We do not specify how to determine whether a Unidiff is possible.
There is no simple answer and this decision is left up to the implementation." ?
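
To make the three steps concrete, here is a rough sketch. The text check
(no NUL bytes, valid UTF-8) is purely my assumption, since the proposal
deliberately leaves that decision open, and the fallback emits only the body
of a "plain" binary hunk as described in section 4.2. Unit and hunk headers
are left out, and the function names are hypothetical.

```python
import base64
import difflib


def is_text(data: bytes) -> bool:
    # Assumption: "Unidiff is possible" means no NUL bytes and valid UTF-8.
    if b"\0" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def b64_lines(data: bytes):
    """Base64-encode data and split it into lines of at most 76 characters."""
    return base64.encodebytes(data).decode("ascii").splitlines()


def diff_items(old: bytes, new: bytes):
    """Return ("unidiff", lines) if possible, else ("binary", hunk body)."""
    # Steps (1) and (2): use Unidiff when both versions look like text.
    if is_text(old) and is_text(new):
        a = old.decode("utf-8").splitlines(keepends=True)
        b = new.decode("utf-8").splitlines(keepends=True)
        return "unidiff", list(difflib.unified_diff(a, b))
    # Step (3): fall back to the "plain" format, old lines prefixed with '-'
    # and new lines prefixed with '+'.
    body = ["-" + line for line in b64_lines(old)]
    body += ["+" + line for line in b64_lines(new)]
    return "binary", body
```

A different implementation could make the text/binary decision differently
and still produce a valid SCS, which is why the rule for step (1) need not be
part of the format specification.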

> According to the specific Unidiff implementation, or processing
> system, the check result of (1) would be different from each
> other. So the difference SHOULD be Unidiff algorithm as much as
> possible. When an Unidiff calculation system is fixed, the following
> either result will be possible for arbitrary two Items.

The wording in this paragraph, and in some others, is a bit difficult to read,
but the meaning is clear. We can improve the language later. It is probably
best to concentrate on the technical details first.

> (a) It can be represented by both Unidiff format and binary format.
> (b) It can be represented only by binary format.
>
>
> 6. Applying Semantics
[...]
> 6.2.5. actual patching
>
> Depending on the Unit type, actual patching process is invoked.
> normal patch applying for extended Unidiff Unit, binary patch
> applying for Binary Unit, respectively. this process must be
> done by the Item ID. not by the file name.

What do you mean by "this process must be done by the Item ID. not by the file
name"? What about when this patch is applied to a plain filesystem directory
tree, or to a type of repository that does not have that type of item
identifiers? Perhaps you should say, "If the item-id hunk exists and makes
sense in the system to which the patch is being applied, then the change must
be applied to the item identified by the item-id hunk." That wording would
allow the SCS-patch program to choose whether to verify also that the item's
file name is correct.
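
A sketch of how that wording might translate into an SCS-patch program is
shown below. The function and the id-to-path mapping are hypothetical; a
plain directory tree is modelled by passing no mapping at all.

```python
def resolve_target(unit_path, item_id, id_index):
    """Pick the path a Unit should be applied to.

    unit_path: file name from the Unit header.
    item_id:   id from the ID hunk, or None if the Unit has no ID hunk.
    id_index:  mapping of item id to current path, or None when the target
               system (e.g. a plain directory tree) has no item ids.
    """
    if item_id is not None and id_index is not None:
        # The id exists and makes sense here, so it takes precedence.
        return id_index[item_id]
    # Otherwise the file name is the only information available.
    return unit_path
```

A stricter program could additionally compare the resolved path with
unit_path and warn when they differ.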

[...]
> 6.3. Error Recovery
>
> When serious errors are detected in step 6.2. the whole tree MUST be
> rollbacked to the previous state completely.

Section 6.3 specifies the behaviour of a program that uses this SCS format. I
agree that this behaviour is desirable, but specifying such behaviour is
outside the declared scope of this document.

>
>
> 7. Unit Header
>
> 7.1. Extended Marker
>
> Extended marker consists of four characters and indicates what kind
> of Item Type Transition occurred in it and whether the Unit is
> binary or not. Notice that legacy Unidiff marker had only three
> characters.
>
> The symbolic rules of Extended Marker are as follows:
>
> char 1. char 2. char 3. char 4.
>
> 1st line '-' '-' [*1] [*3]
>
> 2nd line '+' '+' [*2] [*4]

One problem with Unidiff format is that a line looking exactly like a Unit
header line can appear within the hunk body. This increases the risk of
applying a patch wrongly (if the patch has been edited by hand, for example)
and makes it more difficult to write parsers for the format, for example to
provide syntax highlighting. This deficiency could be fixed in your SCS format
by making the first character of each header line something different from "-",
"+", " ", "\".

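The ambiguity is easy to reproduce. In the sketch below, a removed line whose
content begins with "-- " becomes a hunk body line that a loose header pattern
cannot tell apart from a Unit header. The "=" prefix at the end is only a
hypothetical example of a character that can never start a body line.

```python
import re

# Loose pattern for a traditional Unidiff header line ("--- name" / "+++ name").
HEADER_RE = re.compile(r"^(---|\+\+\+) ")

# Every hunk body line starts with one of these characters.
BODY_PREFIXES = ("-", "+", " ", "\\")


def removed_line(content):
    """Hunk body line for a line deleted from the original file."""
    return "-" + content


body_line = removed_line("-- end of SQL comment")
looks_like_header = HEADER_RE.match(body_line) is not None

# A header starting with a character outside BODY_PREFIXES cannot collide.
hypothetical_header = "=== ./file.sql"
can_be_body_line = hypothetical_header.startswith(BODY_PREFIXES)
```

Here body_line is "--- end of SQL comment", which matches the header pattern,
while the hypothetical "=" header can never be mistaken for hunk content.
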
[...]
> 7.2. File Name Field Convention
>
> File names must be always started by '.' character.

I suggest: "The first path component of a file name must be '.'."
Or: "A file name must start with './' unless it is just '.' which represents
the tree's root directory."
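
The difference between the two wordings shows up with names like ".hidden" or
"../x". A short sketch, with hypothetical function names:

```python
def passes_original_rule(path):
    """Rule as written in the proposal: the name starts with '.'."""
    return path.startswith(".")


def passes_suggested_rule(path):
    """Suggested rule: './...' or exactly '.' for the tree's root."""
    return path == "." or path.startswith("./")


examples = [".", "./src/a.c", ".hidden", "../x", "src/a.c"]
results = [(p, passes_original_rule(p), passes_suggested_rule(p))
           for p in examples]
```

Both ".hidden" and "../x" satisfy the original rule but are rejected by the
suggested one, which is the case the rewording is meant to exclude.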

[...]
> When null marker is appeared in Unit header, the corresponding file
> name MUST be blanked. The time field rule described in 7.3. is still
> hold.

What? How can a non-existent item have a last modification time?

- Julian

Received on Wed Jun 1 00:00:14 2005
