Hi,
Here's my second subject to this community.
I tried to define a new changeset format:
http://scm.bluegate.org/scs.txt
and have some questions.
a.) Does your system have any sort of changeset ?
(according to the book, it'd be "yes".)
b.) Do you think this specification has enough ability to describe
your system's changeset completely ?
c.) :-( Then, what is the missing part to achieve it ? Could you
tell it immediately ? or there exists very subtle/touchy issues
?
Cheers,
- Tez
UNIVERSAL SERIALIZED CHANGESET
DATA FORMAT SPECIFICATION
FOR
GENERAL SOFTWARE CONFIGURATION MANAGEMENT SYSTEM
(Draft)
May 2005
prepared for
All Software Configuration Management Hackers
by
Tez Kamihira
Table Of Contents
Preface
1. Introduction
1.1. Motive
1.2. Fundamental Principles
1.3. Scope
1.4. Structure Of This Document
2. Terminology
2.1. Overview
2.2. Unidiff Output Sample
2.3. Definitions
3. Differential Unit
3.1. Layout
3.2. Extended Unidiff Unit
3.3. Binary Unit
4. Hunk
4.1. Standard Hunk
4.2. Binary Hunk
4.3. Property Hunk
4.4. ID Hunk
5. Differential Calculation
6. Applying Semantics
6.1. Notation
6.2. Algorithm
6.3. Error Recovery
7. Unit Header
7.1 Extended Marker
7.2 File Name Field Convention
7.3 Time Field Convention
APPENDIX A: Some Examples
APPENDIX B: Formal Syntax
REFERENCES
PREFACE
This document tries to define a special file format so-called
"Changeset" which has been used by several Software Configuration
Management Systems (SCM) to describe the difference between arbitrary
two directory trees. One of the most famous format along the lines is
GNU arch's[1]. But unfortunately, this format is a directory structure
which consists of many structured files and subdirectories, and
clearly not suitable for handy transfer by e-mail nor similar
text-based communication channels.
In this document, mainly based on de-facto Unidiff format and an early
research by GNU arch, I'll try to define another type of file format
called "Serialized Changeset", under the conditions that (1) it
doesn't depend on any specific SCM system and (2) as a single
monolithic file format.
I hope this document will be useful for anyone who is interested in
inter-operability and easy data exchanges between any two different
SCMs, and also it will take the first step to re-understand the key
concept "Changeset" more consciously, more reflexionally, more
uniformly, and more abstractly.
Tez Kamihira
(tez@kamihira.com)
May 2005
1. Introduction
1.1. Motive
In the world of freely available software, to express the total
modification under the directory, we almost always invoke "diff"
tool, make a well-known Unidiff format text, embed it in E-mail, and
transfer to another person. At the receiving side, he/she invokes
"patch" tool paired with "diff", applies the information to their
target trees, and reproduce the proper result. Recently thanks to
the many developers in this area, various value-added SCM systems
have been made and we can exchange the difference of our data with
others more correctly and more conveniently.
Using SCM for data exchange, we can send not only file contents
itself, but also additional data just like renamed file information,
binary file's difference, the properties attached to the files, so
on. Such benefits drive many people to make various "yet another
SCM" to satisfy the case-by-case needs and actually they have given
us a huge amount of freedom to select the most suitable SCM for the
individual project. But on the other hand, this trend results in the
serious data incompatibility among many SCMs.
This document will try to define a new data format mainly based on
the well-known Unidiff format to exchange any kind of difference
occurred in one SCM system with another, even if the latter system
is not the same as the former.
I'll call the data format "Serialized Changeset" and is just noted
as "SCS" shortly below.
Roughly speaking, SCS is defined by the collection of top-level
sub-structure named "Differential Unit (or simply Unit)" which then
consists of sub level structure named "Hunk". "Hunk" is well-known
word in Unidiff, but it'll be slightly extended for additional
information.
1.2. Fundamental Principles
The fundamental design concept is as following:
Principle 1: Readability
SCS must be easily readable by human being as much as it can,
just like traditional Unidiff as well.
Principle 2: Monolithity
SCS must be represented by a single text-based monolithic file
which is supposed to be exchanged by E-mail.
Principle 3: Upper Compatibility
SCS must be upper compatible with Unidiff and almost all the
line semantics are easily guessed by every traditional Unidiff
user.
Principle 4: Neutrality
SCS MUST NOT depend on any kind of specific SCM system. It
should be completely neutral beyond such all systems.
1.3. Scope
This specification was initially inspired by Directory-based
Changeset defined by GNU arch revision control system[1]. But
actually SCS concept isn't related to any kind of specific SCM
directly. It is provided simply for describing the difference
between arbitrary two directory trees. Adding another option to
GNUdiff/GNUpatch and allowing it to accept SCS as well would be an
interesting application, for example.
1.4. Structure Of This Document
In this document, after reviewing the traditional Unidiff format and
every term of it, SCS internal structure will be explained by
top-down approach, that is:
SCS -> Unit -> Hunk
order. Then it handles actual calculation and applying algorithms.
These are roughly corresponding to diff/patch operations in Unidiff
world.
In chapter 2, all terminologies used in this document will be
properly defined. Starting from the well known Unidiff format, we
will fix every word which has been vaguely and unconsiously used by
traditional Unidiff, and introduce SCS specific terms as well.
In chapter 3, the top-most SCS sub-structure called "Differential
Unit (or simply Unit)" will be explained in detail. SCS is a
collection of Unit which is very similar data structure with Unidiff
one. SCS's Unit consists of two sub-types. The first one is called
Extended Unidiff Unit which is some extended version of original
Unidiff's. The second one is called Binary Unit which is newly
introduced in SCS to describe the difference of general binary data.
Every Unit is, again, a collection of finer sub-structure named
Hunk. In chapter 4, all Hunk types will be explained. There exist
totally four types now. Standard Hunk is just Unidiff Hunk
itself. Rest of them are ID Hunk, Property Hunk, and Binary
Hunk. Extended Hunks enable us to describe much more detail
informations about the tree than simply using Unidiff.
SCS can handle not only regular files, but also special files such
as directories, symbolic links, binary files which are not calculate
the difference easily by legacy Unidiff method. In chapter 5,
specific differential rule will be explained to calculate for all
file types uniformly.
When applying SCS information to target tree, several SCS specific
issues will arise, all of which we didn't meet in the legacy Unidiff
world. In chapter 6, we will get all complex and subtle issues about
SCS semantics straight and present specific algorithm for applying
it to the target tree.
2. Terminology
2.1. Overview
All terminologies used by SCS format will be presented in this
chapter. Because SCS is a kind of extension of Unidiff, after a
typical Unidiff output will be showed, each part of it will be
defined explicitly. Extended term isn't found in traditional Unidiff
sample, but is also defined along the lines.
Unidiff doesn't seem to have any kind of formal specification except
the source code itself. Notice that this chapter will be also
devoted to make Unidiff's own concepts more clear as well as to
introduce SCS specific ones.
2.2. Unidiff Output Sample
Let "./a" and "./b" are two directory and there exist "foo.txt" and
"bar.txt" under "./a". Moreover let any modifications to those files
be put into "./b". On such premises, the output of "diff" program
would be the following:
1: diff -ur a/bar.txt b/bar.txt
2: --- a/bar.txt 2005-05-23 10:58:17.583015024 +0900
3: +++ b/bar.txt 2005-05-23 10:59:12.976593920 +0900
4: @@ -1,5 +1,6 @@
5: aaaaaaaaaa
6: +bbbbbbbbbb
7: cccccccccc
8: dddddddddd
9: eeeeeeeeee
10: -ffffffffff
11: \ No newline at end of file
12: +ffffffffff
13: diff -ur a/foo.txt b/foo.txt
14: --- a/foo.txt 2005-05-23 10:34:11.199898680 +0900
15: +++ b/foo.txt 2005-05-23 10:35:02.941032832 +0900
16: @@ -1,7 +1,7 @@
17: It was many and many a year ago,
18: In a kingdom by the sea,
19: That a maiden there lived whom you may know
20: -By the name of ANABEL LEE;
21: +By the name of ANNABEL LEE;
22: And this maiden she lived with no other thought
23: Than to love and be loved by me.
24:
25: @@ -34,7 +34,7 @@
26: And neither the angels in heaven above,
27: Nor the demons down under the sea,
28: Can ever dissever my soul from the soul
29: -Of the beautiful Anabel Lee.
30: +Of the beautiful Annabel Lee.
31:
32: For the moon never beams without bringing me dreams
33: Of the beautiful Annabel Lee;
The line numbers appeared in the following sections correspond to
the above lines.
2.3. Definitions
Unidiff Format
The legacy differential format generated by Unix's "diff" program
such as GNU diff with -u option. There aren't any specific
documents that refer to this format except their implementations.
Serialized Changeset (SCS)
All modifications considered as a single entity as a whole in a
file, and it will be defined formally in the below documents.
Also noted SCS shortly. SCS format is quite similar with the
output of "diff -Nur". The only extended parts from Unidiff are
basically a few special Hunks.
Differential Unit (Unit)
Such lines as 2-12, 14-33 are called Differential Unit or simply
Unit. Every Differential Unit consists of a header and collection
of Hunks defined below. It'll be noted shortly "Unit" as well.
Differential Unit Header
First and second lines of Differential Unit. In the above example,
line 2, 3, 14, 15 are headers.
Differential Unit Marker
In the case of Unidiff, the beginning of Unit Header starts with
"---" or "+++" substring and tells us which line is the beginning
of actual Differential Unit. Those strings are called Unit
Markers.
There are other additional markers defined by SCS to describe each
unit's nature in more detail. The basic property of each
Differential Unit can be specified completely by the marker type.
Standard Marker, Extended Marker
The origial "---" and "+++" marker that appear in legacy Unidiff
are called standard markers and others are called extended marker.
Extended markers are newly introduced by SCS. In the above sample,
there aren't any extended markers. They will be defined in the
below document separately.
Hunk, Hunk Header, Hunk Body
Lines 4-12, 16-24, 25-33 are started by "@@"-string and we call
them Hunks. This term has already been used by legacy Unidiff as
well. The beginning line of the Hunk is called Hunk header and
rest of them are called Hunk body.
In the case of above example, it consists of two Differential
Units. First one has only one Hunk. Second one has two.
SCS defines a few additional Hunks. They can describe the tree
information in more detail. See also later sections.
Numeric values appeared in Hunk Header
Hunk Header usually has four numeric values. They are called as
pre-offset, pre-size, post-offset, and post-size.
Junk Lines
line 1 and 13 are called Junk Lines, and patch program doesn't
actually need them. It's only for human.
Line attribute in Hunk bodies
Every line in Hunk body starts with ' ', '-', '+', or '\'
character. They are called invariant line, deleted line, appended
line, and EOLS(End of line status), respectively.
Even if extended Hunks in SCS also start with these characters,
the meanings would be slightly different from original ones
appeared in legacy Unidiff.
Standard Hunk, Extended Hunk
The Hunk appeared in traditional Unidiff is called standard Hunk,
and SCS specific Hunks are called extended Hunks. Extended Hunks
are divided into three types, that is, ID Hunk, property Hunk, and
binary Hunk. Above example has standard Hunk only.
Property Hunk
Some of the SCM systems that are widely used now can also manage
meta-information or property information attached some files, as
well as the actual contents of them. Property Hunk is provided to
record such kind of information. The detail will be discussed
below.
There are no property Hunks in above example.
ID Hunk
Some of the system can track such event as file renaming. This
mechanism assumes that some invariant true name exists behind the
flexible file names. This universal name is called "inventory",
"file ID", or "node ID" depending on the specific SCM. We call it
"Item ID" in this document.
ID Hunk is provided to record this invariant name and consists of
one line Hunk header only. The detail will be explained at the
other part of this document.
There are no ID Hunks in the above example.
Item
Each managed entity by SCM is called Item. Most of the Items refer
to the regular files. But sometimes they refer to special files as
symbolic link or directory as well. The would be also managed by
some SCMs. There are three types of Items, that is:
directory Item: refers to some directory.
symbolic Item: refers to some symbolic link.
regular Item: refers to some regular file.
That's all. This attribute is called Item type.
Legacy Unidiff can only handle regular files, but SCS also handles
any other Item types and their transitions.
Item ID
See ID Hunk.
Item Type Transition
Every Item could be changed to another type of Item. This
transition is called Item Type Transition. It's very unnatural
event to change the Item types each other, because it means, for
example, a directory changes to a regular file, or a regular file
changes to a symbolic link, etc,. Some SCM prohibits such changes,
but others are permitted. Because we want to add pretty general
abilities to SCS format not depend on any SCMs, Item type
transition is broadly supported.
Null Item
Set aside normal three types of Items, we also consider the
"non-existed Item state" virtually as well. It is called Null
Item. Using this idea, we can distinguish whether the file simply
truncated to size zero or actually removed. The same discussion is
also hold between pure file creation and simple inflation from an
empty file.
Item Type Symbol
Directory Item, symbolic Item, regular Item and null Item have
symbolic representation, that is, 'D', 'S', 'R', '0'. These
symbols are called Item type symbols.
Item type symbols are also used in extended markers, but the
notation is slightly different from here for human
readability. The detail will be discussed later.
Item Contents
For every Item type, the contents of the Item is assumed, which is
called Item contents. In the case of regular Item which refers to
regular file, file contents has literal meaning, but it's defined
for rest of the Item types as well. See the chapter of
differential calculation in detail. The operation will be executed
between two Item contents.
Binary Difference
When calculating the difference between two files, Unidiff would
be failed by several reasons. In those cases, SCS MUST still
record the difference by some kind of binary differential
algorithm. It includes the special case that both original and
modified files are simply recorded as is. The actual contents of
the difference will be preserved in SCS by base64 encoded
form. Binary difference will be discussed later.
Reversibility of Difference
The differential algorithm is either reversible or
irreversible. Let arbitrary two files b1 and b2, and let the
difference d. Under the condition, if b2 can be restored from b1
and d as well as b1 can be restored from b2 and d, we say the
algorithm is reversible. If the algorithm is not reversible, it's
called irreversible.
Unidiff difference is always reversible, for example.
Binary Difference Type
Because there are many binary differential algorithms, the name of
the algorithm will be embedded into the Hunk header. This
algorithm name is called the type of the binary difference.
3. Differential Unit (Unit)
3.1. Layout
SCS is a sequence of several Differential Unit (or simply Unit), in
a line. Junk lines between two Units are permitted, but it doesn't
allowed the lines that have only zero or more blank characters.
This is because unlike Unidiff format, SCS has an individable
meaning as a whole.
There are two types of Units. (Extended) Unidiff Unit and Binary
Unit. Former may have some extended Hunks discussed below, but
basically it's quite similar form with traditional Unidiff. Latter
was introduced for some Item pairs by which we can not calculate
Unidiff difference. Even in such cases, Binary Unit can still
represent those difference.
Because Binary Unit breaks the principle of readability and it has
normally too many lines, they MUST be put after the all Unidiff
Units.
As a result, SCS is defined by a sequence of zero or more Unidiff
Units and zero or more Binary Units in this order.
When SCS is actually applied to target trees, the Unit applying
order will be critical compared to legacy Unidiff cases. Because SCS
can describe file renaming/creation/deletion as well, it can not
always form valid operation to apply each Unit from the beginning to
the end. The physical Unit layout order in SCS is _completely
irrelevant_ to their proper applying order. It's decided only by the
logical relationship among each Unit and absolutely independent from
their physical Unit order.
The algorithm related to Unit applying order will be discussed later.
3.2. Extended Unidiff Unit
Extended Unidiff Unit consists of:
(1) Extended Unit Header
(2) ID Hunk
(3) Sequence of Standard Hunks
(4) Property Hunk
in this order. (2) and (4) are optional. The last character of the
Extended Markers are always "-" or "+".
Here's an example of Extended Unidiff Unit:
---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900
++++ ./bar.txt 2005-05-23 10:59:12.976593920 +0900
@@ id: i_abc @@
@@ -1,3 +1,4 @@
aaaaaaaaaa
+bbbbbbbbbb
cccccccccc
dddddddddd
@@ properties: @@
-zzz: xxx
+zzz: yyy
We MUST also accept the traditional
---
+++
style marker. It MUST be interpreted as a short cut for
----
++++
extended marker. Almost all the Units must have this type when used
by software development, because they basically manage text files
only.
3.3. Binary Unit
Binary Unit consists of:
(1) Extended Unit Header
(2) ID Hunk
(3) Binary Hunk
(4) Property Hunk
in this order. (2) and (4) are optional. (3) MUST appears exactly
once. The last character of the Extended Markers are always "x".
Here's an example of Binary Unit:
---x ./foo.txt 2005-05-23 10:58:17.583015024 +0900
+++x ./foo.txt 2005-05-23 10:59:12.976593920 +0900
@@ bin: plain, -1,1 +1,1 @@
-AAECAwQF
+AAECAwQFBg==
4. Hunk
This chapter is devoted to the explanation of each Hunk type. Standard
Hunk appeared in 4.1. has already been used in traditional Unidiff
format as well and that part is only for reviewing purpose. Rest of
the sections are pure extensions by SCS.
4.1. Standard Hunk
Standard Hunk Header usually has the following format:
@@ -A,B +C,D @@
A, B, C, D are called pre-offset, pre-size, post-offset, and
post-size respectively.
The meaning of this Hunk header is as follows. "The original part
from line number A and has the length B was replaced by modified
part from line number C and has the length D. The detail is
presented by the following Hunk body.", where A and C are 1-origin
line number. Not zero.
A, B, C, D are usually all natural numbers, but there are various
edge cases according to their range and relations.
(1) B or D can be omitted when it equals to 1.
(2) When there are no lines in original or modified text, B or D
will be zero.
(3) When insertion or deletion occurrs at the beginning of the
text, A or C will be zero.
Moreover, the following relations are hold.
(4) B is always sum of deleted lines and invariant lines. D is
always sum of appended lines and invariant lines.
(5) If B is zero, this Hunk doesn't have any invariant lines nor
deleted lines at all. In such cases, A points to just before
the insertion point. If this position is the beginning of the
file, A must be zero.
(6) If D is zero, this Hunk doesn't have any invariant lines nor
appending lines at all. In such cases, C points just before
the deleting point. If this position is the beginning of the
file, C must be zero.
(7) If A is zero, then B is also zero.
(8) If C is zero, then D is also zero.
4.2. Binary Hunk
Binary differential algorithms are divided into two
types. Reversible and Irreversible.
For example, two of the well known command line binary differential
programs are "xdelta"[5] and "bsdiff/bspatch"[6]. Both seem
irreversible.
In the below explanation, we assume there exists a file named
"binary.bin", its original contents are called "binary.bin(1)", and
modified contents are called "binary.bin(2)", the binary difference
"binary.bin(1)" and "binary.bin(2)" is "forward.bin", and its
reverse difference is "reverse.bin", respectively.
Using some irreversible binary differential algorithm, say "yoyodyne
method", we can get the reversible binary differential information by
calculating both "forward.bin" and "reverse.bin", being encoded by
base64, and embedding it to some special Hunk. Binary Hunk header
has an algorithm name and quartet line offset and size information
just like normal Unidiff Hunk. Here's an image.
---x ./binary.bin 2005-05-10 13:05:26.000000000 +0900
+++x ./binary.bin 2005-05-10 13:06:10.000000000 +0900
@@ bin: yoyodyne, -1,3 +1,3 @@
-
- ... forward.bin data encoded by base64 ...
-
+
+ ... reverse.bin data encoded by base64 ...
+
The semantics of '-' or '+' is slightly different from Unidiff one,
but we decided to use them to avoid populating special characters.
Because base64 encoded text couldn't be read by everyone anyway,
nobody would be confused about such lines.
On the other hand, when there exists a reversible binary
differential algorithm, say "strangelove method", all we have to do
is calculating "forward.bin" only:
---x ./binary.bin 2005-05-10 13:05:26.000000000 +0900
+++x ./binary.bin 2005-05-10 13:06:10.000000000 +0900
@@ bin: strangelove, -1,3 +0,0 @@
-
- ... forward.bin data encoded by base64 ...
-
In the same spirit as irreversible case, we simply add '-' character
to the head of each base64 encoded line.
By checking both post-offset and post-size are zero, we can see
whether this algorithm is reversible or not.
The pre-defined binary algorithms are as follows:
plain (irreversible):
Not calculating actual difference, the original and modified
version of the file are simply base64 encoded and embedded to
the Hunk. The original lines are added by '-' at the head of
each line, and modified lines are added by '+'.
gzip (irreversible):
Same as plain, except that compression must be occurred before
base64 encoding.
xdelta (irreversible):
Using xdelta algorithm for calculation.
bsdiff (irreversible):
Using bsdiff/bspatch algorithm for calculation.
4.3. Property Hunk
Property Hunk is provided for recording property information or
meta-information in each file controlled by some modern SCMs. You
can add it to not only Unidiff Unit, but also to Binary
Unit. property Hunk MUST be the last Hunk in the Unit, if exists.
Some property keys, just like file permission, are reserved. The
list of reserved keys are as follows:
permission: file permission information.
The body of property Hunk MUST be sorted by their key order. This
rule MUST be applied in both deleted lines part and appended lines
part.
We could consider the file modification time as a kind of property,
but this values are changed almost every time. If it would be put
into property Hunk, almost every time SCS has property Hunk in each
Unit. To avoid this, modification time MUST be put in the tail of
the Unit header.
The binary encoding rule accepts
\\, \', \", \n, \r, \t, \ooo, \xhh
characters. [FIXME:]
4.4. ID Hunk
Item ID would be embedded in a special Hunk which MUST be always at
the beginning of the all Hunks. This Hunk is called ID Hunk. The
format of ID Hunk is as follows:
If this file is tagged by GNU arch's tagline method:
@@ id: i_cdbdd634-3f67-438c-97d3-a63a0699d6b9 @@
or by Subversion's node ID method:
@@ id: [FIXME:] @@
ID Hunk is optional and doesn't have to always exits, but if exists,
the patch operation MUST be done by the ID information and MUST NOT
by the file name in the Unit header.
In Subversion case, so-called node ID would be embedded into the ID
Hunk. [FIXME: Am I wrong ?]
ID Hunk has this line only. It doesn't have any Hunk body at all.
The permitted character in the Item ID is out of scope of this
document. But [0-9a-zA-Z_]+ would be reasonable by extended regular
expression.
5. Differential Calculation
Given two Items which have any Item types, we can always calculate
their difference. SCS allows Item type transitions even between
different Item types. We have to define general differential rule
even between hetero-Item types.
Differential Calculation must be done on the two file contents which
are properly defined for each Item type respectively. They are as
follows:
Item Type Contents
------------------ -------------------------------------------
directory Item or It's considered as just zero length
null Item empty byte sequence.
------------------ -------------------------------------------
symbolic Item Link target path name + LF.
------------------ -------------------------------------------
regular Item Normal meaning. Data in the file themselves.
------------------ -------------------------------------------
The actual differential algorithm is as follows:
(1) Check Unidiff calculation is available or not.
(2) If possible, that output is just the answer.
(3) If impossible, arbitrary binary differential algorithm is
applied to them and the result is converted to base64 form.
We don't touch the condition when calculation in (1) is failed to
avoid some complex issues around binary check algorithms. Notice
that the binary differential calculation in (3) always succeeds.
According to the specific Unidiff implementation, or processing
system, the check result of (1) would be different from each
other. So the difference SHOULD be Unidiff algorithm as much as
possible. When an Unidiff calculation system is fixed, the following
either result will be possible for arbitrary two Items.
(a) It can be represented by both Unidiff format and binary format.
(b) It can be represented only by binary format.
6. Applying Semantics
Each Unit in legacy Unidiff can't handle rename information, because
it has only closed information related to _single file_. It means that
the applying order of Unidiff Unit doesn't affect to the result of the
transformation. All of the different ordered transformation result in
the same tree.
But this property will not hold for SCS which can also describe even
file renaming, creation, and deletion as well. Take the following SCS
for example: (1) "foo.txt" was removed, and (2) "bar.txt" was renamed
to "foo.txt". Regardless to the physical order of (1) and (2) in the
SCS, those MUST be applied from (1) to (2) order. If you apply them
from (2) to (1), it's pretty obvious that both "foo.txt" and "bar.txt"
will be lost out of the target.
This chapter is devoted to the Unit applying order in SCS.
6.1. Notation
When a Unit was changed from file f to g and it has ID Hunk valued
I, we note it as (f, g, I). We will always use f, g, h, ... for file
names and I, J, K ... for Item IDs. f is actually such names as
"./foo.txt" and I would be some UUID-flavoured strings as
"i_cdbdd634-3f67-438c-97d3-a63a0699d6b9". Each variable has index if
necessary. Index will be noted by [] just like f[n].
Because ID Hunk is optional, Unit wouldn't have any Item ID. In such
cases, SCS assumes any uniquely assigned Item ID that will never
conflict with other Item IDs.
In the below argument, every Unit has properly assigned Item ID
introduced by the above convention.
6.2. Algorithm
Basically the applying algorithm in this section uses a slightly
modified version of GNU arch's "apply_changeset" algorithm. Using
this algorithm, we can always avoid any kind of edge cases and
modify the target tree properly.
Let SCS Units are
U[1] = (f[1], g[1], I[1])
U[2] = (f[2], g[2], I[2])
...
U[n] = (f[n], g[n], I[n])
then SCS MUST be applied by the following steps.
6.2.1. remove deleted Items from the tree.
Compute the subset of U[1..n] which element's g[i] is null
Item. Remove all the files in the subset from the tree.
This process MUST be concerned the directory structure, that is,
ordered from the deeper to the shallower.
6.2.2. Set aside renamed Items from the tree, once.
Compute the subset of U[1..n] which element's f[i] and g[i]
differ. Set aside such files out of the tree for a moment.
6.2.3. Install Items set asided in 6.2.2.
When each item is installed in target tree, new name must be
used.
This process must consider the directory structure. You should
execute from shallower ones to deeper ones.
6.2.4. Add appended Items to target tree
Compute the subset of U[1..n] which element's f[i] is null
Item. Install such files to the tree.
This process must, again, concern the directory structure,
that is, ordered from the shallower to the deeper.
6.2.5. actual patching
Depending on the Unit type, actual patching process is invoked.
normal patch applying for extended Unidiff Unit, binary patch
applying for Binary Unit, respectively. this process must be
done by the Item ID. not by the file name.
How to handle In-exact patching is out of scope of this document.
6.3. Error Recovery
When serious errors are detected in step 6.2. the whole tree MUST be
rollbacked to the previous state completely.
7. Unit Header
7.1. Extended Marker
Extended marker consists of four characters and indicates what kind
of Item Type Transition occurred in it and whether the Unit is
binary or not. Notice that legacy Unidiff marker had only three
characters.
The symbolic rules of Extended Marker are as follows:
char 1. char 2. char 3. char 4.
1st line '-' '-' [*1] [*3]
2nd line '+' '+' [*2] [*4]
*1
directory Item ... 'D'
symbolic Item ... 'S'
regular Item ... '-'
null Item ... '.'
*2
directory Item ... 'D'
symbolic Item ... 'S'
regular Item ... '+'
null Item ... '.'
*3
when extended Unidiff Unit ... '-'
when Binary Unit ... 'x'
*4
when extended Unidiff Unit ... '+'
when Binary Unit ... 'x'
7.2. File Name Field Convention
File names must be always started by '.' character. Special file
named just one '.' is permitted and it represents the tree's root
directory. The files just under '.' are referred as "./foo.txt" or
so, and the "bar.txt" under the sub-directory "sub" are specified by
"./sub/bar.txt" or so.
Even if the file name refers to some directory, The file name MUST
NOT have a trailing '/' character.
Every file name doesn't have any redundant subdirectory part
appeared in "diff -ur" output.
When null marker is appeared in Unit header, the corresponding file
name MUST be blanked. The time field rule described in 7.3. is still
hold.
[FIXME: File name MUST NOT have any space characters.]
[FIXME: non-ascii character are out of scope of this document.]
7.3. Time Field Convention
Time fields SHOULD be indented to match the both start columns of first
and second header lines. It increases the readability pretty well.
APPENDIX A. Some Examples
Here's some examples of Differential Unit:
a.) Regular file "./foo.txt" was modified. It's not under Item ID
control.
---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900
++++ ./foo.txt 2005-05-23 10:59:12.976593920 +0900
@@ -1,4 +1,5 @@
aaaaaaaaaa
+bbbbbbbbbb
cccccccccc
dddddddddd
eeeeeeeeee
b.) All of the lines in "./foo.txt" are deleted. not under Item ID
control. "./foo.txt" still remains as an empty file.
---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900
++++ ./foo.txt 2005-05-23 10:59:12.976593920 +0900
@@ -1,4 +0,0 @@
-aaaaaaaaaa
-bbbbbbbbbb
-cccccccccc
-dddddddddd
c.) Same as b.), but "./foo.txt" was actually removed.
---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900
++.+ 2005-05-23 10:59:12.976593920 +0900
@@ -1,4 +0,0 @@
-aaaaaaaaaa
-bbbbbbbbbb
-cccccccccc
-dddddddddd
d.) Regular file "./foo.txt" was renamed to bar.txt. The Item is
managed by Item ID "i_abc".
---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900
++++ ./bar.txt 2005-05-23 10:59:12.976593920 +0900
@@ id: i_abc @@
e.) Same as d.), but the contents also modified.
---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900
++++ ./bar.txt 2005-05-23 10:59:12.976593920 +0900
@@ id: i_abc @@
@@ -1,3 +1,4 @@
aaaaaaaaaa
+bbbbbbbbbb
cccccccccc
dddddddddd
f.) Same as e.), but the property value of "zzz" is also modified
from "xxx" to "yyy".
---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900
++++ ./bar.txt 2005-05-23 10:59:12.976593920 +0900
@@ id: i_abc @@
@@ -1,3 +1,4 @@
aaaaaaaaaa
+bbbbbbbbbb
cccccccccc
dddddddddd
@@ properties: @@
-zzz: xxx
+zzz: yyy
g.) Append 1-byte "\x06" was appended at the bottom of the binary
file "./foo.txt" which has had the 6 byte values
"\x00\x01\x02\x03\x04\x05". The name was unchanged and used
"plain" binary difference type.
---x ./foo.txt 2005-05-23 10:58:17.583015024 +0900
+++x ./foo.txt 2005-05-23 10:59:12.976593920 +0900
@@ bin: plain, -1,1 +1,1 @@
-AAECAwQF
+AAECAwQFBg==
(You can also use "-1 +1" instead of "-1,1 +1,1".)
h.) Create a new symbolic link named "./foo.txt" which refers to
"/some/directory/foo.txt".
--.- 2005-05-23 10:58:17.583015024 +0900
++S+ ./foo.txt 2005-05-23 10:59:12.976593920 +0900
@@ -0,0 +1,1 @@
+/some/directory/foo.txt
(You can also use "-0,0 +1" instead of "-0,0 +1,1")
APPENDIX B. Formal Syntax
SCS :=
[extended Unidiff Unit]*
[Binary Unit]*
extended Unidiff Unit :=
Unidiff Unit header-1
Unidiff Unit header-2
[ID Hunk]
[Unidiff Hunk]*
[property Hunk]
Binary Unit :=
Binary Unit header-1
Binary Unit header-2
[ID Hunk]
Binary Hunk
[property Hunk]
Unidiff Unit header-1 := '--' [-.DS] '- ' header-info '\n'
Unidiff Unit header-2 := '++' [+.DS] '+ ' header-info '\n'
Binary Unit header-1 := '--' [-.DS] 'x ' header-info '\n'
Binary Unit header-2 := '++' [+.DS] 'x ' header-info '\n'
header-info := file-name-description ' ' time-description '\n'
ID Hunk := '@@ id: ' [0-9a-zA-Z_]+ ' @@' '\n'
binary Hunk := '@@ bin: '
binary-algorithm-name
', '
standard-Hunk-quartet
' @@' '\n'
binary Hunk body
binary-algorithm-name := 'plain' | 'gzip' | 'yoyodyne' | ...
binary-Hunk-body :=
<lines started by '-' and forward binary diff data encoded by base64>
[lines started by '+' and reverse binary diff data encoded by base64]
standard-Hunk-quartet := '-' number ',' number ' +' number ',' number
file-name-description :=
time-description :=
property-Hunk := '@@ properties: @@' '\n' property-Hunk-body
[FIXME:]
REFERENCES
[1] GNU arch
http://www.gnu.org/software/gnu-arch/
[2] Lord, T., "arch Meets hello-world - A Tutorial Introduction to The
arch Revision Control System"
http://www.gnu.org/software/gnu-arch/tutorial/arch.html
2001, 2002, 2003, 2004.
[3] GNU diff/patch
http://ftp.gnu.org/pub/gnu/patch/patch-2.5.4.tar.gz
http://ftp.gnu.org/pub/gnu/diffutils/diffutils-2.8.1.tar.gz
[4] Subversion
http://subversion.tigris.org/
[5] xdelta
http://sourceforge.net/projects/xdelta/
[6] bsdiff/bspatch
http://www.daemonology.net/bsdiff/
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Fri May 27 16:59:56 2005