[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Serialized Changeset

From: Tez Kamihira <tez_at_kamihira.com>
Date: 2005-05-27 16:58:36 CEST

Hi,

Here's my second subject to this community.

I tried to define a new changeset format:

    http://scm.bluegate.org/scs.txt

and have some questions.

  a.) Does your system have any sort of changeset ?
      (according to the book, it'd be "yes".)

  b.) Do you think this specification has enough ability to describe
      your system's changeset completely ?

  c.) :-( Then, what is the missing part to achieve it ? Could you
      tell it immediately ? or there exists very subtle/touchy issues
      ?

Cheers,

         - Tez

                    UNIVERSAL SERIALIZED CHANGESET
                                   
                      DATA FORMAT SPECIFICATION

                                 FOR

           GENERAL SOFTWARE CONFIGURATION MANAGEMENT SYSTEM

                               (Draft)

                               May 2005

                             prepared for

            All Software Configuration Management Hackers

                                  by

                             Tez Kamihira

Table Of Contents

  Preface

  1. Introduction

    1.1. Motive
    1.2. Fundamental Principles
    1.3. Scope
    1.4. Structure Of This Document

  2. Terminology

    2.1. Overview
    2.2. Unidiff Output Sample
    2.3. Definitions

  3. Differential Unit

    3.1. Layout
    3.2. Extended Unidiff Unit
    3.3. Binary Unit

  4. Hunk

    4.1. Standard Hunk
    4.2. Binary Hunk
    4.3. Property Hunk
    4.4. ID Hunk

  5. Differential Calculation

  6. Applying Semantics

    6.1. Notation
    6.2. Algorithm
    6.3. Error Recovery

  7. Unit Header

    7.1 Extended Marker
    7.2 File Name Field Convention
    7.3 Time Field Convention

  APPENDIX A: Some Examples

  APPENDIX B: Formal Syntax

  REFERENCES

                               PREFACE

This document tries to define a special file format so-called
"Changeset" which has been used by several Software Configuration
Management Systems (SCM) to describe the difference between arbitrary
two directory trees. One of the most famous format along the lines is
GNU arch's[1]. But unfortunately, this format is a directory structure
which consists of many structured files and subdirectories, and
clearly not suitable for handy transfer by e-mail nor similar
text-based communication channels.

In this document, mainly based on de-facto Unidiff format and an early
research by GNU arch, I'll try to define another type of file format
called "Serialized Changeset", under the conditions that (1) it
doesn't depend on any specific SCM system and (2) as a single
monolithic file format.

I hope this document will be useful for anyone who is interested in
inter-operability and easy data exchanges between any two different
SCMs, and also it will take the first step to re-understand the key
concept "Changeset" more consciously, more reflexionally, more
uniformly, and more abstractly.

                                                        Tez Kamihira
                                                   (tez@kamihira.com)

                                                            May 2005

                           1. Introduction

1.1. Motive

  In the world of freely available software, to express the total
  modification under the directory, we almost always invoke "diff"
  tool, make a well-known Unidiff format text, embed it in E-mail, and
  transfer to another person. At the receiving side, he/she invokes
  "patch" tool paired with "diff", applies the information to their
  target trees, and reproduce the proper result. Recently thanks to
  the many developers in this area, various value-added SCM systems
  have been made and we can exchange the difference of our data with
  others more correctly and more conveniently.

  Using SCM for data exchange, we can send not only file contents
  itself, but also additional data just like renamed file information,
  binary file's difference, the properties attached to the files, so
  on. Such benefits drive many people to make various "yet another
  SCM" to satisfy the case-by-case needs and actually they have given
  us a huge amount of freedom to select the most suitable SCM for the
  individual project. But on the other hand, this trend results in the
  serious data incompatibility among many SCMs.

  This document will try to define a new data format mainly based on
  the well-known Unidiff format to exchange any kind of difference
  occurred in one SCM system with another, even if the latter system
  is not the same as the former.

  I'll call the data format "Serialized Changeset" and is just noted
  as "SCS" shortly below.

  Roughly speaking, SCS is defined by the collection of top-level
  sub-structure named "Differential Unit (or simply Unit)" which then
  consists of sub level structure named "Hunk". "Hunk" is well-known
  word in Unidiff, but it'll be slightly extended for additional
  information.

1.2. Fundamental Principles

  The fundamental design concept is as following:

    Principle 1: Readability

      SCS must be easily readable by human being as much as it can,
      just like traditional Unidiff as well.

    Principle 2: Monolithity

      SCS must be represented by a single text-based monolithic file
      which is supposed to be exchanged by E-mail.

    Principle 3: Upper Compatibility

      SCS must be upper compatible with Unidiff and almost all the
      line semantics are easily guessed by every traditional Unidiff
      user.

    Principle 4: Neutrality

      SCS MUST NOT depend on any kind of specific SCM system. It
      should be completely neutral beyond such all systems.

1.3. Scope

  This specification was initially inspired by Directory-based
  Changeset defined by GNU arch revision control system[1]. But
  actually SCS concept isn't related to any kind of specific SCM
  directly. It is provided simply for describing the difference
  between arbitrary two directory trees. Adding another option to
  GNUdiff/GNUpatch and allowing it to accept SCS as well would be an
  interesting application, for example.

1.4. Structure Of This Document

  In this document, after reviewing the traditional Unidiff format and
  every term of it, SCS internal structure will be explained by
  top-down approach, that is:

     SCS -> Unit -> Hunk

  order. Then it handles actual calculation and applying algorithms.
  These are roughly corresponding to diff/patch operations in Unidiff
  world.

  In chapter 2, all terminologies used in this document will be
  properly defined. Starting from the well known Unidiff format, we
  will fix every word which has been vaguely and unconsiously used by
  traditional Unidiff, and introduce SCS specific terms as well.

  In chapter 3, the top-most SCS sub-structure called "Differential
  Unit (or simply Unit)" will be explained in detail. SCS is a
  collection of Unit which is very similar data structure with Unidiff
  one. SCS's Unit consists of two sub-types. The first one is called
  Extended Unidiff Unit which is some extended version of original
  Unidiff's. The second one is called Binary Unit which is newly
  introduced in SCS to describe the difference of general binary data.

  Every Unit is, again, a collection of finer sub-structure named
  Hunk. In chapter 4, all Hunk types will be explained. There exist
  totally four types now. Standard Hunk is just Unidiff Hunk
  itself. Rest of them are ID Hunk, Property Hunk, and Binary
  Hunk. Extended Hunks enable us to describe much more detail
  informations about the tree than simply using Unidiff.

  SCS can handle not only regular files, but also special files such
  as directories, symbolic links, binary files which are not calculate
  the difference easily by legacy Unidiff method. In chapter 5,
  specific differential rule will be explained to calculate for all
  file types uniformly.

  When applying SCS information to target tree, several SCS specific
  issues will arise, all of which we didn't meet in the legacy Unidiff
  world. In chapter 6, we will get all complex and subtle issues about
  SCS semantics straight and present specific algorithm for applying
  it to the target tree.

                           2. Terminology

2.1. Overview

  All terminologies used by SCS format will be presented in this
  chapter. Because SCS is a kind of extension of Unidiff, after a
  typical Unidiff output will be showed, each part of it will be
  defined explicitly. Extended term isn't found in traditional Unidiff
  sample, but is also defined along the lines.

  Unidiff doesn't seem to have any kind of formal specification except
  the source code itself. Notice that this chapter will be also
  devoted to make Unidiff's own concepts more clear as well as to
  introduce SCS specific ones.

2.2. Unidiff Output Sample

  Let "./a" and "./b" are two directory and there exist "foo.txt" and
  "bar.txt" under "./a". Moreover let any modifications to those files
  be put into "./b". On such premises, the output of "diff" program
  would be the following:

     1: diff -ur a/bar.txt b/bar.txt
     2: --- a/bar.txt 2005-05-23 10:58:17.583015024 +0900
     3: +++ b/bar.txt 2005-05-23 10:59:12.976593920 +0900
     4: @@ -1,5 +1,6 @@
     5: aaaaaaaaaa
     6: +bbbbbbbbbb
     7: cccccccccc
     8: dddddddddd
     9: eeeeeeeeee
    10: -ffffffffff
    11: \ No newline at end of file
    12: +ffffffffff
    13: diff -ur a/foo.txt b/foo.txt
    14: --- a/foo.txt 2005-05-23 10:34:11.199898680 +0900
    15: +++ b/foo.txt 2005-05-23 10:35:02.941032832 +0900
    16: @@ -1,7 +1,7 @@
    17: It was many and many a year ago,
    18: In a kingdom by the sea,
    19: That a maiden there lived whom you may know
    20: -By the name of ANABEL LEE;
    21: +By the name of ANNABEL LEE;
    22: And this maiden she lived with no other thought
    23: Than to love and be loved by me.
    24:
    25: @@ -34,7 +34,7 @@
    26: And neither the angels in heaven above,
    27: Nor the demons down under the sea,
    28: Can ever dissever my soul from the soul
    29: -Of the beautiful Anabel Lee.
    30: +Of the beautiful Annabel Lee.
    31:
    32: For the moon never beams without bringing me dreams
    33: Of the beautiful Annabel Lee;

  The line numbers appeared in the following sections correspond to
  the above lines.

2.3. Definitions

  Unidiff Format

    The legacy differential format generated by Unix's "diff" program
    such as GNU diff with -u option. There aren't any specific
    documents that refer to this format except their implementations.

  Serialized Changeset (SCS)

    All modifications considered as a single entity as a whole in a
    file, and it will be defined formally in the below documents.
    Also noted SCS shortly. SCS format is quite similar with the
    output of "diff -Nur". The only extended parts from Unidiff are
    basically a few special Hunks.

  Differential Unit (Unit)

    Such lines as 2-12, 14-33 are called Differential Unit or simply
    Unit. Every Differential Unit consists of a header and collection
    of Hunks defined below. It'll be noted shortly "Unit" as well.

  Differential Unit Header

    First and second lines of Differential Unit. In the above example,
    line 2, 3, 14, 15 are headers.

  Differential Unit Marker

    In the case of Unidiff, the beginning of Unit Header starts with
    "---" or "+++" substring and tells us which line is the beginning
    of actual Differential Unit. Those strings are called Unit
    Markers.

    There are other additional markers defined by SCS to describe each
    unit's nature in more detail. The basic property of each
    Differential Unit can be specified completely by the marker type.

  Standard Marker, Extended Marker

    The origial "---" and "+++" marker that appear in legacy Unidiff
    are called standard markers and others are called extended marker.
    Extended markers are newly introduced by SCS. In the above sample,
    there aren't any extended markers. They will be defined in the
    below document separately.

  Hunk, Hunk Header, Hunk Body

    Lines 4-12, 16-24, 25-33 are started by "@@"-string and we call
    them Hunks. This term has already been used by legacy Unidiff as
    well. The beginning line of the Hunk is called Hunk header and
    rest of them are called Hunk body.

    In the case of above example, it consists of two Differential
    Units. First one has only one Hunk. Second one has two.

    SCS defines a few additional Hunks. They can describe the tree
    information in more detail. See also later sections.

  Numeric values appeared in Hunk Header

    Hunk Header usually has four numeric values. They are called as
    pre-offset, pre-size, post-offset, and post-size.

  Junk Lines

    line 1 and 13 are called Junk Lines, and patch program doesn't
    actually need them. It's only for human.

  Line attribute in Hunk bodies

    Every line in Hunk body starts with ' ', '-', '+', or '\'
    character. They are called invariant line, deleted line, appended
    line, and EOLS(End of line status), respectively.

    Even if extended Hunks in SCS also start with these characters,
    the meanings would be slightly different from original ones
    appeared in legacy Unidiff.

  Standard Hunk, Extended Hunk

    The Hunk appeared in traditional Unidiff is called standard Hunk,
    and SCS specific Hunks are called extended Hunks. Extended Hunks
    are divided into three types, that is, ID Hunk, property Hunk, and
    binary Hunk. Above example has standard Hunk only.

  Property Hunk

    Some of the SCM systems that are widely used now can also manage
    meta-information or property information attached some files, as
    well as the actual contents of them. Property Hunk is provided to
    record such kind of information. The detail will be discussed
    below.
 
    There are no property Hunks in above example.

  ID Hunk

    Some of the system can track such event as file renaming. This
    mechanism assumes that some invariant true name exists behind the
    flexible file names. This universal name is called "inventory",
    "file ID", or "node ID" depending on the specific SCM. We call it
    "Item ID" in this document.

    ID Hunk is provided to record this invariant name and consists of
    one line Hunk header only. The detail will be explained at the
    other part of this document.

    There are no ID Hunks in the above example.

  Item

    Each managed entity by SCM is called Item. Most of the Items refer
    to the regular files. But sometimes they refer to special files as
    symbolic link or directory as well. The would be also managed by
    some SCMs. There are three types of Items, that is:

      directory Item: refers to some directory.
       symbolic Item: refers to some symbolic link.
        regular Item: refers to some regular file.

    That's all. This attribute is called Item type.

    Legacy Unidiff can only handle regular files, but SCS also handles
    any other Item types and their transitions.

  Item ID

    See ID Hunk.

  Item Type Transition

    Every Item could be changed to another type of Item. This
    transition is called Item Type Transition. It's very unnatural
    event to change the Item types each other, because it means, for
    example, a directory changes to a regular file, or a regular file
    changes to a symbolic link, etc,. Some SCM prohibits such changes,
    but others are permitted. Because we want to add pretty general
    abilities to SCS format not depend on any SCMs, Item type
    transition is broadly supported.

  Null Item

    Set aside normal three types of Items, we also consider the
    "non-existed Item state" virtually as well. It is called Null
    Item. Using this idea, we can distinguish whether the file simply
    truncated to size zero or actually removed. The same discussion is
    also hold between pure file creation and simple inflation from an
    empty file.

  Item Type Symbol

    Directory Item, symbolic Item, regular Item and null Item have
    symbolic representation, that is, 'D', 'S', 'R', '0'. These
    symbols are called Item type symbols.

    Item type symbols are also used in extended markers, but the
    notation is slightly different from here for human
    readability. The detail will be discussed later.

  Item Contents

    For every Item type, the contents of the Item is assumed, which is
    called Item contents. In the case of regular Item which refers to
    regular file, file contents has literal meaning, but it's defined
    for rest of the Item types as well. See the chapter of
    differential calculation in detail. The operation will be executed
    between two Item contents.

  Binary Difference

    When calculating the difference between two files, Unidiff would
    be failed by several reasons. In those cases, SCS MUST still
    record the difference by some kind of binary differential
    algorithm. It includes the special case that both original and
    modified files are simply recorded as is. The actual contents of
    the difference will be preserved in SCS by base64 encoded
    form. Binary difference will be discussed later.

  Reversibility of Difference

    The differential algorithm is either reversible or
    irreversible. Let arbitrary two files b1 and b2, and let the
    difference d. Under the condition, if b2 can be restored from b1
    and d as well as b1 can be restored from b2 and d, we say the
    algorithm is reversible. If the algorithm is not reversible, it's
    called irreversible.

    Unidiff difference is always reversible, for example.

  Binary Difference Type

    Because there are many binary differential algorithms, the name of
    the algorithm will be embedded into the Hunk header. This
    algorithm name is called the type of the binary difference.

                      3. Differential Unit (Unit)

3.1. Layout

  SCS is a sequence of several Differential Unit (or simply Unit), in
  a line. Junk lines between two Units are permitted, but it doesn't
  allowed the lines that have only zero or more blank characters.
  This is because unlike Unidiff format, SCS has an individable
  meaning as a whole.

  There are two types of Units. (Extended) Unidiff Unit and Binary
  Unit. Former may have some extended Hunks discussed below, but
  basically it's quite similar form with traditional Unidiff. Latter
  was introduced for some Item pairs by which we can not calculate
  Unidiff difference. Even in such cases, Binary Unit can still
  represent those difference.

  Because Binary Unit breaks the principle of readability and it has
  normally too many lines, they MUST be put after the all Unidiff
  Units.

  As a result, SCS is defined by a sequence of zero or more Unidiff
  Units and zero or more Binary Units in this order.

  When SCS is actually applied to target trees, the Unit applying
  order will be critical compared to legacy Unidiff cases. Because SCS
  can describe file renaming/creation/deletion as well, it can not
  always form valid operation to apply each Unit from the beginning to
  the end. The physical Unit layout order in SCS is _completely
  irrelevant_ to their proper applying order. It's decided only by the
  logical relationship among each Unit and absolutely independent from
  their physical Unit order.

  The algorithm related to Unit applying order will be discussed later.

3.2. Extended Unidiff Unit

  Extended Unidiff Unit consists of:

    (1) Extended Unit Header
    (2) ID Hunk
    (3) Sequence of Standard Hunks
    (4) Property Hunk

  in this order. (2) and (4) are optional. The last character of the
  Extended Markers are always "-" or "+".

  Here's an example of Extended Unidiff Unit:

     ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900
     ++++ ./bar.txt 2005-05-23 10:59:12.976593920 +0900
     @@ id: i_abc @@
     @@ -1,3 +1,4 @@
      aaaaaaaaaa
     +bbbbbbbbbb
      cccccccccc
      dddddddddd
     @@ properties: @@
     -zzz: xxx
     +zzz: yyy

  We MUST also accept the traditional

     ---
     +++

  style marker. It MUST be interpreted as a short cut for

     ----
     ++++

  extended marker. Almost all the Units must have this type when used
  by software development, because they basically manage text files
  only.

3.3. Binary Unit

  Binary Unit consists of:

    (1) Extended Unit Header
    (2) ID Hunk
    (3) Binary Hunk
    (4) Property Hunk

  in this order. (2) and (4) are optional. (3) MUST appears exactly
  once. The last character of the Extended Markers are always "x".

  Here's an example of Binary Unit:

     ---x ./foo.txt 2005-05-23 10:58:17.583015024 +0900
     +++x ./foo.txt 2005-05-23 10:59:12.976593920 +0900
     @@ bin: plain, -1,1 +1,1 @@
     -AAECAwQF
     +AAECAwQFBg==

                              4. Hunk

This chapter is devoted to the explanation of each Hunk type. Standard
Hunk appeared in 4.1. has already been used in traditional Unidiff
format as well and that part is only for reviewing purpose. Rest of
the sections are pure extensions by SCS.

4.1. Standard Hunk

  Standard Hunk Header usually has the following format:

    @@ -A,B +C,D @@

  A, B, C, D are called pre-offset, pre-size, post-offset, and
  post-size respectively.

  The meaning of this Hunk header is as follows. "The original part
  from line number A and has the length B was replaced by modified
  part from line number C and has the length D. The detail is
  presented by the following Hunk body.", where A and C are 1-origin
  line number. Not zero.

  A, B, C, D are usually all natural numbers, but there are various
  edge cases according to their range and relations.

    (1) B or D can be omitted when it equals to 1.

    (2) When there are no lines in original or modified text, B or D
        will be zero.

    (3) When insertion or deletion occurrs at the beginning of the
        text, A or C will be zero.

  Moreover, the following relations are hold.

    (4) B is always sum of deleted lines and invariant lines. D is
        always sum of appended lines and invariant lines.

    (5) If B is zero, this Hunk doesn't have any invariant lines nor
        deleted lines at all. In such cases, A points to just before
        the insertion point. If this position is the beginning of the
        file, A must be zero.

    (6) If D is zero, this Hunk doesn't have any invariant lines nor
        appending lines at all. In such cases, C points just before
        the deleting point. If this position is the beginning of the
        file, C must be zero.

    (7) If A is zero, then B is also zero.

    (8) If C is zero, then D is also zero.

4.2. Binary Hunk

  Binary differential algorithms are divided into two
  types. Reversible and Irreversible.

  For example, two of the well known command line binary differential
  programs are "xdelta"[5] and "bsdiff/bspatch"[6]. Both seem
  irreversible.

  In the below explanation, we assume there exists a file named
  "binary.bin", its original contents are called "binary.bin(1)", and
  modified contents are called "binary.bin(2)", the binary difference
  "binary.bin(1)" and "binary.bin(2)" is "forward.bin", and its
  reverse difference is "reverse.bin", respectively.

  Using some irreversible binary differential algorithm, say "yoyodyne
  method", we can get the reversible binary differential information by
  calculating both "forward.bin" and "reverse.bin", being encoded by
  base64, and embedding it to some special Hunk. Binary Hunk header
  has an algorithm name and quartet line offset and size information
  just like normal Unidiff Hunk. Here's an image.

  ---x ./binary.bin 2005-05-10 13:05:26.000000000 +0900
  +++x ./binary.bin 2005-05-10 13:06:10.000000000 +0900
  @@ bin: yoyodyne, -1,3 +1,3 @@
  -
  - ... forward.bin data encoded by base64 ...
  -
  +
  + ... reverse.bin data encoded by base64 ...
  +

  The semantics of '-' or '+' is slightly different from Unidiff one,
  but we decided to use them to avoid populating special characters.
  Because base64 encoded text couldn't be read by everyone anyway,
  nobody would be confused about such lines.

  On the other hand, when there exists a reversible binary
  differential algorithm, say "strangelove method", all we have to do
  is calculating "forward.bin" only:

  ---x ./binary.bin 2005-05-10 13:05:26.000000000 +0900
  +++x ./binary.bin 2005-05-10 13:06:10.000000000 +0900
  @@ bin: strangelove, -1,3 +0,0 @@
  -
  - ... forward.bin data encoded by base64 ...
  -

  In the same spirit as irreversible case, we simply add '-' character
  to the head of each base64 encoded line.

  By checking both post-offset and post-size are zero, we can see
  whether this algorithm is reversible or not.

  The pre-defined binary algorithms are as follows:

    plain (irreversible):

       Not calculating actual difference, the original and modified
       version of the file are simply base64 encoded and embedded to
       the Hunk. The original lines are added by '-' at the head of
       each line, and modified lines are added by '+'.

    gzip (irreversible):

       Same as plain, except that compression must be occurred before
       base64 encoding.

    xdelta (irreversible):

       Using xdelta algorithm for calculation.

    bsdiff (irreversible):

       Using bsdiff/bspatch algorithm for calculation.

4.3. Property Hunk

  Property Hunk is provided for recording property information or
  meta-information in each file controlled by some modern SCMs. You
  can add it to not only Unidiff Unit, but also to Binary
  Unit. property Hunk MUST be the last Hunk in the Unit, if exists.

  Some property keys, just like file permission, are reserved. The
  list of reserved keys are as follows:

    permission: file permission information.

  The body of property Hunk MUST be sorted by their key order. This
  rule MUST be applied in both deleted lines part and appended lines
  part.

  We could consider the file modification time as a kind of property,
  but this values are changed almost every time. If it would be put
  into property Hunk, almost every time SCS has property Hunk in each
  Unit. To avoid this, modification time MUST be put in the tail of
  the Unit header.

  The binary encoding rule accepts

    \\, \', \", \n, \r, \t, \ooo, \xhh

  characters. [FIXME:]

4.4. ID Hunk

  Item ID would be embedded in a special Hunk which MUST be always at
  the beginning of the all Hunks. This Hunk is called ID Hunk. The
  format of ID Hunk is as follows:

  If this file is tagged by GNU arch's tagline method:

    @@ id: i_cdbdd634-3f67-438c-97d3-a63a0699d6b9 @@

  or by Subversion's node ID method:

    @@ id: [FIXME:] @@

  ID Hunk is optional and doesn't have to always exits, but if exists,
  the patch operation MUST be done by the ID information and MUST NOT
  by the file name in the Unit header.

  In Subversion case, so-called node ID would be embedded into the ID
  Hunk. [FIXME: Am I wrong ?]

  ID Hunk has this line only. It doesn't have any Hunk body at all.

  The permitted character in the Item ID is out of scope of this
  document. But [0-9a-zA-Z_]+ would be reasonable by extended regular
  expression.

                     5. Differential Calculation

  Given two Items which have any Item types, we can always calculate
  their difference. SCS allows Item type transitions even between
  different Item types. We have to define general differential rule
  even between hetero-Item types.

  Differential Calculation must be done on the two file contents which
  are properly defined for each Item type respectively. They are as
  follows:

       Item Type Contents
    ------------------ -------------------------------------------
    directory Item or It's considered as just zero length
         null Item empty byte sequence.
    ------------------ -------------------------------------------
    symbolic Item Link target path name + LF.
    ------------------ -------------------------------------------
    regular Item Normal meaning. Data in the file themselves.
    ------------------ -------------------------------------------

  The actual differential algorithm is as follows:

    (1) Check Unidiff calculation is available or not.

    (2) If possible, that output is just the answer.

    (3) If impossible, arbitrary binary differential algorithm is
        applied to them and the result is converted to base64 form.

  We don't touch the condition when calculation in (1) is failed to
  avoid some complex issues around binary check algorithms. Notice
  that the binary differential calculation in (3) always succeeds.

  According to the specific Unidiff implementation, or processing
  system, the check result of (1) would be different from each
  other. So the difference SHOULD be Unidiff algorithm as much as
  possible. When an Unidiff calculation system is fixed, the following
  either result will be possible for arbitrary two Items.

    (a) It can be represented by both Unidiff format and binary format.
    (b) It can be represented only by binary format.

                        6. Applying Semantics

Each Unit in legacy Unidiff can't handle rename information, because
it has only closed information related to _single file_. It means that
the applying order of Unidiff Unit doesn't affect to the result of the
transformation. All of the different ordered transformation result in
the same tree.

But this property will not hold for SCS which can also describe even
file renaming, creation, and deletion as well. Take the following SCS
for example: (1) "foo.txt" was removed, and (2) "bar.txt" was renamed
to "foo.txt". Regardless to the physical order of (1) and (2) in the
SCS, those MUST be applied from (1) to (2) order. If you apply them
from (2) to (1), it's pretty obvious that both "foo.txt" and "bar.txt"
will be lost out of the target.

This chapter is devoted to the Unit applying order in SCS.

6.1. Notation

  When a Unit was changed from file f to g and it has ID Hunk valued
  I, we note it as (f, g, I). We will always use f, g, h, ... for file
  names and I, J, K ... for Item IDs. f is actually such names as
  "./foo.txt" and I would be some UUID-flavoured strings as
  "i_cdbdd634-3f67-438c-97d3-a63a0699d6b9". Each variable has index if
  necessary. Index will be noted by [] just like f[n].

  Because ID Hunk is optional, Unit wouldn't have any Item ID. In such
  cases, SCS assumes any uniquely assigned Item ID that will never
  conflict with other Item IDs.

  In the below argument, every Unit has properly assigned Item ID
  introduced by the above convention.

6.2. Algorithm

  Basically the applying algorithm in this section uses a slightly
  modified version of GNU arch's "apply_changeset" algorithm. Using
  this algorithm, we can always avoid any kind of edge cases and
  modify the target tree properly.

  Let SCS Units are

     U[1] = (f[1], g[1], I[1])

     U[2] = (f[2], g[2], I[2])

         ...

     U[n] = (f[n], g[n], I[n])

  then SCS MUST be applied by the following steps.

  6.2.1. remove deleted Items from the tree.

     Compute the subset of U[1..n] which element's g[i] is null
     Item. Remove all the files in the subset from the tree.

     This process MUST be concerned the directory structure, that is,
     ordered from the deeper to the shallower.

  6.2.2. Set aside renamed Items from the tree, once.

     Compute the subset of U[1..n] which element's f[i] and g[i]
     differ. Set aside such files out of the tree for a moment.

  6.2.3. Install Items set asided in 6.2.2.

     When each item is installed in target tree, new name must be
     used.

     This process must consider the directory structure. You should
     execute from shallower ones to deeper ones.

  6.2.4. Add appended Items to target tree

     Compute the subset of U[1..n] which element's f[i] is null
     Item. Install such files to the tree.

     This process must, again, concern the directory structure,
     that is, ordered from the shallower to the deeper.

  6.2.5. actual patching

     Depending on the Unit type, actual patching process is invoked.
     normal patch applying for extended Unidiff Unit, binary patch
     applying for Binary Unit, respectively. this process must be
     done by the Item ID. not by the file name.

     How to handle In-exact patching is out of scope of this document.

6.3. Error Recovery

  When serious errors are detected in step 6.2. the whole tree MUST be
  rollbacked to the previous state completely.

                           7. Unit Header

7.1. Extended Marker

  Extended marker consists of four characters and indicates what kind
  of Item Type Transition occurred in it and whether the Unit is
  binary or not. Notice that legacy Unidiff marker had only three
  characters.

  The symbolic rules of Extended Marker are as follows:

                   char 1. char 2. char 3. char 4.

     1st line '-' '-' [*1] [*3]

     2nd line '+' '+' [*2] [*4]

  *1
      directory Item ... 'D'
      symbolic Item ... 'S'
      regular Item ... '-'
      null Item ... '.'

  *2
      directory Item ... 'D'
      symbolic Item ... 'S'
      regular Item ... '+'
      null Item ... '.'

  *3
      when extended Unidiff Unit ... '-'
      when Binary Unit ... 'x'

  *4
      when extended Unidiff Unit ... '+'
      when Binary Unit ... 'x'

  
7.2. File Name Field Convention

  File names must be always started by '.' character. Special file
  named just one '.' is permitted and it represents the tree's root
  directory. The files just under '.' are referred as "./foo.txt" or
  so, and the "bar.txt" under the sub-directory "sub" are specified by
  "./sub/bar.txt" or so.

  Even if the file name refers to some directory, The file name MUST
  NOT have a trailing '/' character.

  Every file name doesn't have any redundant subdirectory part
  appeared in "diff -ur" output.

  When null marker is appeared in Unit header, the corresponding file
  name MUST be blanked. The time field rule described in 7.3. is still
  hold.

  [FIXME: File name MUST NOT have any space characters.]

  [FIXME: non-ascii character are out of scope of this document.]

7.3. Time Field Convention

  Time fields SHOULD be indented to match the both start columns of first
  and second header lines. It increases the readability pretty well.

                      APPENDIX A. Some Examples

Here's some examples of Differential Unit:

  a.) Regular file "./foo.txt" was modified. It's not under Item ID
      control.

     ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900
     ++++ ./foo.txt 2005-05-23 10:59:12.976593920 +0900
     @@ -1,4 +1,5 @@
      aaaaaaaaaa
     +bbbbbbbbbb
      cccccccccc
      dddddddddd
      eeeeeeeeee

  b.) All of the lines in "./foo.txt" are deleted. not under Item ID
      control. "./foo.txt" still remains as an empty file.

     ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900
     ++++ ./foo.txt 2005-05-23 10:59:12.976593920 +0900
     @@ -1,4 +0,0 @@
     -aaaaaaaaaa
     -bbbbbbbbbb
     -cccccccccc
     -dddddddddd

  c.) Same as b.), but "./foo.txt" was actually removed.

     ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900
     ++.+ 2005-05-23 10:59:12.976593920 +0900
     @@ -1,4 +0,0 @@
     -aaaaaaaaaa
     -bbbbbbbbbb
     -cccccccccc
     -dddddddddd

  d.) Regular file "./foo.txt" was renamed to bar.txt. The Item is
      managed by Item ID "i_abc".

     ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900
     ++++ ./bar.txt 2005-05-23 10:59:12.976593920 +0900
     @@ id: i_abc @@

  e.) Same as d.), but the contents also modified.

     ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900
     ++++ ./bar.txt 2005-05-23 10:59:12.976593920 +0900
     @@ id: i_abc @@
     @@ -1,3 +1,4 @@
      aaaaaaaaaa
     +bbbbbbbbbb
      cccccccccc
      dddddddddd

  f.) Same as e.), but the property value of "zzz" is also modified
      from "xxx" to "yyy".

     ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900
     ++++ ./bar.txt 2005-05-23 10:59:12.976593920 +0900
     @@ id: i_abc @@
     @@ -1,3 +1,4 @@
      aaaaaaaaaa
     +bbbbbbbbbb
      cccccccccc
      dddddddddd
     @@ properties: @@
     -zzz: xxx
     +zzz: yyy

  g.) Append 1-byte "\x06" was appended at the bottom of the binary
      file "./foo.txt" which has had the 6 byte values
      "\x00\x01\x02\x03\x04\x05". The name was unchanged and used
      "plain" binary difference type.

     ---x ./foo.txt 2005-05-23 10:58:17.583015024 +0900
     +++x ./foo.txt 2005-05-23 10:59:12.976593920 +0900
     @@ bin: plain, -1,1 +1,1 @@
     -AAECAwQF
     +AAECAwQFBg==

     (You can also use "-1 +1" instead of "-1,1 +1,1".)

  h.) Create a new symbolic link named "./foo.txt" which refers to
      "/some/directory/foo.txt".

     --.- 2005-05-23 10:58:17.583015024 +0900
     ++S+ ./foo.txt 2005-05-23 10:59:12.976593920 +0900
     @@ -0,0 +1,1 @@
     +/some/directory/foo.txt

     (You can also use "-0,0 +1" instead of "-0,0 +1,1")

                      APPENDIX B. Formal Syntax

SCS :=
     [extended Unidiff Unit]*
     [Binary Unit]*

extended Unidiff Unit :=
     Unidiff Unit header-1
     Unidiff Unit header-2
     [ID Hunk]
     [Unidiff Hunk]*
     [property Hunk]

Binary Unit :=
     Binary Unit header-1
     Binary Unit header-2
     [ID Hunk]
     Binary Hunk
     [property Hunk]

Unidiff Unit header-1 := '--' [-.DS] '- ' header-info '\n'

Unidiff Unit header-2 := '++' [+.DS] '+ ' header-info '\n'

Binary Unit header-1 := '--' [-.DS] 'x ' header-info '\n'

Binary Unit header-2 := '++' [+.DS] 'x ' header-info '\n'

header-info := file-name-description ' ' time-description '\n'

ID Hunk := '@@ id: ' [0-9a-zA-Z_]+ ' @@' '\n'

binary Hunk := '@@ bin: '
                    binary-algorithm-name
                    ', '
                    standard-Hunk-quartet
                    ' @@' '\n'
                    binary Hunk body

binary-algorithm-name := 'plain' | 'gzip' | 'yoyodyne' | ...

binary-Hunk-body :=
    <lines started by '-' and forward binary diff data encoded by base64>
    [lines started by '+' and reverse binary diff data encoded by base64]

standard-Hunk-quartet := '-' number ',' number ' +' number ',' number

file-name-description :=

time-description :=

property-Hunk := '@@ properties: @@' '\n' property-Hunk-body

[FIXME:]

                              REFERENCES

[1] GNU arch
     http://www.gnu.org/software/gnu-arch/

[2] Lord, T., "arch Meets hello-world - A Tutorial Introduction to The
     arch Revision Control System"
     http://www.gnu.org/software/gnu-arch/tutorial/arch.html
     2001, 2002, 2003, 2004.

[3] GNU diff/patch
     http://ftp.gnu.org/pub/gnu/patch/patch-2.5.4.tar.gz
     http://ftp.gnu.org/pub/gnu/diffutils/diffutils-2.8.1.tar.gz

[4] Subversion
     http://subversion.tigris.org/

[5] xdelta
     http://sourceforge.net/projects/xdelta/

[6] bsdiff/bspatch
     http://www.daemonology.net/bsdiff/

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Fri May 27 16:59:56 2005

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.