Re: Serialized Changeset

From: Tez Kamihira <tez_at_kamihira.com>
Date: 2005-06-01 16:49:36 CEST

On Tue, 31 May 2005 22:59:22 +0100, Julian Foad <julianfoad@btopenworld.com> said:
>
> I'm glad to see this work.
>

Thank you.

>
> > b.) Do you think this specification has enough ability to describe
> > your system's changeset completely ?
>
> Not quite. I don't think this is able to capture the information: "This file
> is a new copy (with history) of the existing version-controlled file FOO."
>

Some Japanese folks also pointed it out. Basically, we would need some
special type of "file creation Unit", that is:

        --.-
        ++++ ./foo.txt 2005-05-10 13:06:10.000000000 +0900
        @@ -0,0 +1,n @@
        +
        + file contents
        + ...
        +

plus additional information, just like:

"I'm not Adam. My father is XXX."

somewhere. But this approach might destroy your beautiful "cheap copy"
world, because this Unit carries the file contents and so consumes
actual disk space. I also guess that this issue is closely related to

"What is the 'Item ID' in the Subversion world?"

too. Maybe I need to study the Subversion filesystem more deeply.

> > c.) :-( Then, what is the missing part to achieve it ? Could you
> > tell it immediately ? or there exists very subtle/touchy issues
> > ?
>
> How are character sets and different line-ending conventions handled in plain
> text? (I see later on that non-ASCII characters are not yet addressed by this
> proposal.)
>

I can't answer these immediately, especially the non-ASCII character
problems. I'll think more and consult some CJK guys (and ladies, of
course) as well.

> > Item Type Symbol
> >
> > Directory Item, symbolic Item, regular Item and null Item have
> > symbolic representation, that is, 'D', 'S', 'R', '0'. These
> > symbols are called Item type symbols.
>
> These symbols are not used in this document. Only the symbols used in extended
> markers are used, and they are not defined here.
>

Removed.

> > 4.1. Standard Hunk
> >
> > Standard Hunk Header usually has the following format:
> >
> > @@ -A,B +C,D @@
> [...]
> > (1) B or D can be omitted when it equals to 1.
>
> B and/or D can be omitted in the traditional Unidiff format, and so SCS parsers
> might want to treat them as optional so that they can parse both traditional
> Unidiff and SCS hunks in the same way. However, I recommend that SCS
> generators MUST write all four numbers. This is because making them optional
> doesn't save much space or time, because the content is rarely exactly one
> line, especially when context lines are used; and also because every optional
> thing in any specification makes the parsers a bit more complicated and reduces
> the possibility for backward-compatible enhancements to be made to the format
> in the future.
>

OK, I'll take them.
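
A rough sketch of that rule in Python: the reader tolerates an omitted
B or D (as traditional Unidiff allows), while the writer always emits
all four numbers. The function names are illustrative only and are not
part of the draft.

```python
import re

# B and D are optional on input (traditional Unidiff omits them when they
# equal 1), but the writer below never omits them.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (A, B, C, D); an omitted B or D is read as 1."""
    m = HUNK_HEADER.match(line)
    if m is None:
        raise ValueError("not a standard hunk header: %r" % line)
    a, b, c, d = m.groups()
    return (int(a), int(b) if b is not None else 1,
            int(c), int(d) if d is not None else 1)


def format_hunk_header(a, b, c, d):
    """Always write all four numbers, as recommended for SCS generators."""
    return "@@ -%d,%d +%d,%d @@" % (a, b, c, d)
```

Reading "@@ -3 +3,2 @@" and writing it back would then give
"@@ -3,1 +3,2 @@".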

> > 4.2. Binary Hunk
> [...]
> > ---x ./binary.bin 2005-05-10 13:05:26.000000000 +0900
> > +++x ./binary.bin 2005-05-10 13:06:10.000000000 +0900
> > @@ bin: strangelove, -1,3 +0,0 @@
> > -
> > - ... forward.bin data encoded by base64 ...
> > -
>
> Please specify exactly what the numbers A, B, C, D (pre-offset, pre-size,
> post-offset, post-size) mean in a binary hunk.
>

OK.

> >
> > The pre-defined binary algorithms are as follows:
>
> "formats", not "algorithms" (see below).
>

Indeed.

> > plain (irreversible):
> >
> > Not calculating actual difference, the original and modified
> > version of the file are simply base64 encoded and embedded to
> > the Hunk. The original lines are added by '-' at the head of
> > each line, and modified lines are added by '+'.
> >
> > gzip (irreversible):
> >
> > Same as plain, except that compression must be occurred before
> > base64 encoding.
> >
> > xdelta (irreversible):
> >
> > Using xdelta algorithm for calculation.
> >
> > bsdiff (irreversible):
> >
> > Using bsdiff/bspatch algorithm for calculation.
>
> Please explain why certain particular formats are allowed and no others.
>

No special meaning. I simply listed some examples.

> Presumably you want to limit the methods to a small, fixed set, so that it is
> possible to write an SCS "patch" program that can understand all possible SCS
> patches.
>
> You probably need to limit the methods to ones whose algorithms and formats are
> "free" as in unencumbered by patents etc., so that everyone is able to use the
> SCS format. Does that exclude using gzip?
>
> Why are both xdelta and bsdiff allowed? Does each of them have unique, strong
> advantages over the other in certain situations?
>

I don't have any special intention or strong opinions on those
points. Moreover, I'm all thumbs when it comes to license issues. Is
there anything wrong with gzip? Isn't it free software? I can find
it on my Fedora box.

>
> Note that you really need to specify the data formats that are allowed, rather
> than the algorithms that are used to generate and decode them. Usually there
> is a one-to-one mapping between data formats and algorithms, but not always.
> For example, the "xdelta" algorithm generates the same data format as the
> "vdelta" algorithm, so you should specify "the xdelta data format" or "the
> xdelta/vdelta data format" rather than "the xdelta algorithm". It does not
> matter whether SCS-diff programs uses xdelta or vdelta or any other algorithm
> to generate hunks in that format. An "SCS-patch" program using xdelta will
> understand patches generated by an "SCS-diff" program that used vdelta, and
> vice versa.
>

A very clear explanation. Thank you.

>
> > 4.3. Property Hunk
>
> What is the format of a property hunk?
>

It's related to the non-ASCII character issue. I'll design it again
in more detail.

> [...]
> > Some property keys, just like file permission, are reserved. The
> > list of reserved keys are as follows:
> >
> > permission: file permission information.
>
> Are the reserved keys in the same name space as user-defined keys? That would
> be awkward.
>

Yes. Some name space management will be needed. Someone privately
suggested an "scs:permission" style of management. I'll think a
little more.

> Note that file permission information is difficult to represent in a portable
> manner, but there may be standards that you can use, such as are used in CD-ROM
> file systems.
>

Can I get the spec ?

> > The body of property Hunk MUST be sorted by their key order. This
> > rule MUST be applied in both deleted lines part and appended lines
> > part.
>
> Character set and encoding needs to be specified in order to sort the keys into
> order. I would suggest that the property name (key) must be converted to
> Unicode and encoded in UTF-8, if it is feasible to demand this.
>

Yes. I think UTF-8 is reasonable. SCS should be based on it. Even
then, I think there would still be many terribly messy problems in the
everyday Japanese environment, but clearly that's another story.
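
If keys are UTF-8, one simple way to fix the sort order is to compare
the encoded bytes rather than use any locale-dependent collation. A
minimal sketch in Python follows; the function name is illustrative,
and byte-wise ordering is an assumption, since the draft does not yet
say which ordering is meant.

```python
def sort_property_keys(keys):
    """Sort property keys by their UTF-8 byte sequences.

    Byte-wise comparison of UTF-8 gives the same order as comparison by
    Unicode code point, so the result does not depend on the locale.
    """
    return sorted(keys, key=lambda k: k.encode("utf-8"))
```

With this rule, uppercase ASCII keys sort before lowercase ones, and
all ASCII keys sort before any CJK key.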

> > We could consider the file modification time as a kind of property,
> > but this values are changed almost every time. If it would be put
> > into property Hunk, almost every time SCS has property Hunk in each
> > Unit. To avoid this, modification time MUST be put in the tail of
> > the Unit header.
>
> I wonder how Subversion would handle this. I suppose we would be
> special-casing some of Subversion's reserved properties anyway, so we can do
> this without difficulty, and we would also convert Subversion's
> "svn:executable" property to/from an SCS "permission" property.
>

Sounds good. Because of the /principle of neutrality/, some
conversion between "svn:executable" <--> SCS "permission" is
inevitable.

> > The binary encoding rule accepts
> >
> > \\, \', \", \n, \r, \t, \ooo, \xhh
> >
> > characters. [FIXME:]
>
> Why are escape codes provided for quotes?
>

You mean we don't need \' and \"? OK, I'll remove them.

> Does every backslash/newline/return/tab in the value have to be represented by
> its escape code (I recommend "yes"), or is the use of these escapes optional?
>
> In Subversion, a property can have a non-text value - e.g. a JPEG picture. It
> would seem odd to encode that by a sequence of "\ooo" or "\xhh" groups rather
> than in base-64. However, such property values are not expected to be large,
> and do not appear to be widely used, so it is probably OK and sensible to
> encode the value as if it were text.
>

Large blobs are rare, but certainly possible. So I'll provide both a
pure hex representation (or base64?) and an escaped string
representation. The latter looks like

".*\n*.o\n*.a"

for svn:ignore, for example.
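
A sketch of the escaped string representation in Python, assuming
that every backslash, newline, return and tab must be escaped (as
recommended above) and that \' and \" are dropped. Writing all other
non-printable or non-ASCII bytes as \xhh is an assumption made here,
because the non-ASCII rules are still open. The function name is
illustrative.

```python
# Escapes kept from the draft list: \\, \n, \r, \t (quotes removed).
_SIMPLE_ESCAPES = {0x5C: "\\\\", 0x0A: "\\n", 0x0D: "\\r", 0x09: "\\t"}


def escape_value(data: bytes) -> str:
    """Encode a property value as a single-line escaped ASCII string."""
    out = []
    for byte in data:
        if byte in _SIMPLE_ESCAPES:
            out.append(_SIMPLE_ESCAPES[byte])
        elif 0x20 <= byte < 0x7F:
            out.append(chr(byte))           # printable ASCII passes through
        else:
            out.append("\\x%02x" % byte)    # everything else as \xhh
    return "".join(out)
```

Applied to a three-line svn:ignore value, this yields exactly the
one-line form shown above.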

>
> > 4.4. ID Hunk
> >
> > Item ID would be embedded in a special Hunk which MUST be always at
> > the beginning of the all Hunks. This Hunk is called ID Hunk. The
> > format of ID Hunk is as follows:
> >
> > If this file is tagged by GNU arch's tagline method:
> >
> > @@ id: i_cdbdd634-3f67-438c-97d3-a63a0699d6b9 @@
> >
> > or by Subversion's node ID method:
> >
> > @@ id: [FIXME:] @@
>
> I don't know.
>

I guess this is the most critical issue. I read

http://svn.collab.net/viewcvs/svn/trunk/subversion/libsvn_fs_base/notes/fs-history?view=markup
http://svn.collab.net/viewcvs/svn/trunk/subversion/libsvn_fs_base/notes/structure?view=markup

but I couldn't resolve my question.

> > ID Hunk is optional and doesn't have to always exits, but if exists,
> > the patch operation MUST be done by the ID information and MUST NOT
> > by the file name in the Unit header.
>
> Shouldn't the "patch" program check that the file name in the unit header is
> correct? It depends where this patch is being applied - to a repository or to
> plain files.
>
> I'm not sure how this id is to be used. Presumably this id refers to the item
> being modified or added or deleted, but what if the change is renaming or
> copying an item? Can the same item (id) exist by multiple names (paths) in the
> repository, and if so, which of those paths would be renamed? In a copy, does
> this id refer to the old or the new item? Or is a copy not represented by a
> single Unit?
>
> > In Subversion case, so-called node ID would be embedded into the ID
> > Hunk. [FIXME: Am I wrong ?]
>
> I don't know.
>

Again, I think this part is /critical/. The worst scenario is that
Subversion doesn't have any strictly corresponding concept of
"Inventory", which has a very important role in the GNU arch world.

> [...]
> > The permitted character in the Item ID is out of scope of this
> > document. But [0-9a-zA-Z_]+ would be reasonable by extended regular
> > expression.
>
> That set of characters isn't sufficient for the Arch id. that you used as an
> example :-)
>

Oops...
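
The mismatch is easy to check mechanically. In the Python sketch
below, the first pattern is the one from the draft; the second, which
also allows '-', is only a hypothetical fix and not a decision.

```python
import re

# The GNU arch id used as the example in section 4.4.
ARCH_ID = "i_cdbdd634-3f67-438c-97d3-a63a0699d6b9"

DRAFT_PATTERN = re.compile(r"[0-9a-zA-Z_]+")     # as written in the draft
WIDENED_PATTERN = re.compile(r"[0-9a-zA-Z_-]+")  # hypothetical: also allow '-'


def is_valid_id(pattern, item_id):
    """True if the whole Item ID consists of permitted characters."""
    return pattern.fullmatch(item_id) is not None
```

The draft pattern rejects the example id because of its hyphens, and
the widened one accepts it.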

>
> > 5. Differential Calculation
> >
> > Given two Items which have any Item types, we can always calculate
> > their difference. SCS allows Item type transitions even between
> > different Item types. We have to define general differential rule
> > even between hetero-Item types.
>
> What type of Unit is used to represent the replacement of a directory or a
> symbolic link by a binary file, or vice versa? A Binary Unit or an (extended)
> Unidiff Unit? :-)
>

You can always use the general Unidiff check logic. In the case you
just presented, the Unidiff check will fail, so we should use a Binary
Unit. Notice that this follows from the general rule. We don't
have to make any exceptional rule for this case.

> [...]
> > The actual differential algorithm is as follows:
> >
> > (1) Check Unidiff calculation is available or not.
>
> "Check whether a Unidiff calculation is possible on these two items."
>
> > (2) If possible, that output is just the answer.
> >
> > (3) If impossible, arbitrary binary differential algorithm is
> > applied to them and the result is converted to base64 form.
>
> May I suggest this wording: "(3) If impossible, the SCS diff program uses an
> algorithm of its choice to generate one of the allowed binary formats specified
> in section 4.2: Binary Hunk." ?
>

Yes. Thank you. The bottom line is, I don't have any solid idea about
the choice of binary format yet. Please don't take it too
seriously. :-)

> > We don't touch the condition when calculation in (1) is failed to
> > avoid some complex issues around binary check algorithms.
>
> Do you mean, "We do not specify how to determine whether a Unidiff is possible.
> There is no simple answer and this decision is left up to the implementation." ?
>

That's it. GNU diff usually seems to check whether the first 8K bytes
contain any '\0', but the actual check size depends on some system
parameter. So, as you said, I don't want
to... ah... "open Pandora's box". Yep. (What a useful J/E
dictionary...)

http://demimparati.blogs.sapo.pt/arquivo/Pandora%20Box%20Arthur%20Rackham.jpg

(I like this one best.)
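
Just to illustrate the kind of check meant above, here is a sketch in
Python of a NUL-byte probe and of steps (1) to (3) built on top of it.
It is not GNU diff's actual code. The 8192-byte probe size and both
function names are assumptions, and the SCS document itself would
leave this decision to the implementation.

```python
def looks_binary(data: bytes, probe_size: int = 8192) -> bool:
    """Heuristic: treat data as binary if a NUL byte occurs near the start."""
    return b"\0" in data[:probe_size]


def choose_unit_kind(old: bytes, new: bytes) -> str:
    """Step (1): is a Unidiff possible? If not, fall back to a Binary Unit."""
    if looks_binary(old) or looks_binary(new):
        return "binary"
    return "unidiff"
```

A NUL byte that appears only after the probe window goes unnoticed,
which is one reason the result can differ between implementations.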

> > According to the specific Unidiff implementation, or processing
> > system, the check result of (1) would be different from each
> > other. So the difference SHOULD be Unidiff algorithm as much as
> > possible. When an Unidiff calculation system is fixed, the following
> > either result will be possible for arbitrary two Items.
>
> The wording in this paragraph, and in some others, is a bit difficult to read,
> but the meaning is clear. We can improve the language later. It is probably
> best to concentrate on the technical details first.
>

Thank you. I feel confident.

> > (a) It can be represented by both Unidiff format and binary format.
> > (b) It can be represented only by binary format.
> >
> >
> > 6. Applying Semantics
> [...]
> > 6.2.5. actual patching
> >
> > Depending on the Unit type, actual patching process is invoked.
> > normal patch applying for extended Unidiff Unit, binary patch
> > applying for Binary Unit, respectively. this process must be
> > done by the Item ID. not by the file name.
>
> What do you mean by "this process must be done by the Item ID. not by the file
> name"? What about when this patch is applied to a plain filesystem directory
> tree, or to a type of repository that does not have that type of item
> identifiers? Perhaps you should say, "If the item-id hunk exists and makes
> sense in the system to which the patch is being applied, then the change must
> be applied to the item identified by the item-id hunk." That wording would
> allow the SCS-patch program to choose whether to verify also that the item's
> file name is correct.
>

Strictly speaking, you should generate a unique "Item ID" for this
Unit and then use it. Think of Common Lisp's (gensym) function,
Linux's "uuidgen(1)", or FreeBSD's "uuid(1)".

But this theme involves much touchier things, especially when
you generate an artificial SCS which does not correspond to a real
directory tree's transformation. I can't answer such pathological
cases immediately.
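
For the ordinary case, generating such an ID could be sketched in
Python as follows. The "i_" prefix only mirrors the GNU arch example
in section 4.4 and is an assumption, as is the function name.

```python
import uuid


def generate_item_id(prefix: str = "i_") -> str:
    """Return a fresh Item ID built from a random (version 4) UUID."""
    return prefix + str(uuid.uuid4())
```

Each call returns a new value of the same shape as the example id,
much like running uuidgen(1) once per Unit.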

> [...]
> > 6.3. Error Recovery
> >
> > When serious errors are detected in step 6.2. the whole tree MUST be
> > rollbacked to the previous state completely.
>
> Section 6.3 specifies the behaviour of a program that uses this SCS format. I
> agree that this behaviour is desirable, but specifying such behaviour is
> outside the declared scope of this document.
>

OK. I removed this section.

> >
> >
> > 7. Unit Header
> >
> > 7.1. Extended Marker
> >
> > Extended marker consists of four characters and indicates what kind
> > of Item Type Transition occurred in it and whether the Unit is
> > binary or not. Notice that legacy Unidiff marker had only three
> > characters.
> >
> > The symbolic rules of Extended Marker are as follows:
> >
> > char 1. char 2. char 3. char 4.
> >
> > 1st line '-' '-' [*1] [*3]
> >
> > 2nd line '+' '+' [*2] [*4]
>
> One problem with Unidiff format is that a line looking exactly like a Unit
> header line can appear within the hunk body. This increases the risk of
> applying a patch wrongly (if the patch has been edited by hand, for example)
> and makes it more difficult to write parsers for the format, for example to
> provide syntax highlighting. This deficiency could be fixed in your SCS format
> by making the first character of each header line something different from "-",
> "+", " ", "\".

Yes. But I have absolutely no idea which character is best for
avoiding every such conflict. It sounds like a typical "real world"
matter, not a theoretical one. I'd like to find the most suitable and
practical one. Comments are always welcome.
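
To make the conflict concrete, here is a small Python sketch. A
deleted line whose content begins with "--" becomes a body line
beginning with "---", which is how a header line starts. The names
are illustrative, and no replacement character is being proposed.

```python
# First characters that hunk body lines can start with.
BODY_PREFIXES = ("-", "+", " ", "\\")


def deleted_line(content: str) -> str:
    """A hunk body line for a removed line: '-' followed by the content."""
    return "-" + content


def looks_like_header(line: str, marker: str = "---") -> bool:
    """True if the line begins the way a Unit header line does."""
    return line.startswith(marker)


def can_collide(header_first_char: str) -> bool:
    """A body line can mimic a header only if both start with the same char."""
    return header_first_char in BODY_PREFIXES
```

Deleting the line "-- old signature" produces a body line that
looks_like_header() accepts, while any first character outside
BODY_PREFIXES cannot collide.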

>
> [...]
> > 7.2. File Name Field Convention
> >
> > File names must be always started by '.' character.
>
> I suggest: "The first path component of a file name must be '.'."
> Or: "A file name must start with './' unless it is just '.' which represents
> the tree's root directory."
>

I'll take the former.
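
The former rule is also simple to check. A minimal sketch in Python,
assuming '/' is the path separator (the function name is
illustrative):

```python
def is_valid_scs_path(path: str) -> bool:
    """True if the first path component of the file name is exactly '.'."""
    return path.split("/")[0] == "."
```

This accepts "./foo.txt" and the bare "." for the tree's root, and
rejects "foo.txt", "../foo.txt" and ".hidden".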

> [...]
> > When null marker is appeared in Unit header, the corresponding file
> > name MUST be blanked. The time field rule described in 7.3. is still
> > hold.
>
> What? How can a non-existent item have a last modification time?
>

oops...

Thanks for all your comments. Almost everything is quite helpful. I
made a "TODO.txt" file at

http://scm.bluegate.org/TODO.txt

for open issues.

Now "Revision" 3.

http://scm.bluegate.org/scs-3.txt

Cheers,

- Tez

Received on Wed Jun 1 16:52:48 2005
