Check-in [6f1af23ebe]
Not logged in
Overview

SHA1 Hash:6f1af23ebe3a862c7288713b897223b6e59a4f3c
Date: 2007-08-26 22:22:01
User: aku
Comment:Added section numbers to delta format, labels for linking, navigation bar. Added delta encoder description (incomplete, right now only all the trivial parts). Using TeX for formulas, and mimetex for conversion.
Timelines: ancestors | descendants | both | trunk
Other Links: files | ZIP archive | manifest

Tags And Properties
Changes
[hide diffs]

Added art/encode1.tex version [bb4fa24df8]

@@ -1,1 +1,2 @@
+\LARGE A = (\sum_{i=0}^{NHASH-1} z_i) \bmod 2^{16}
 

Added art/encode2.tex version [0a86eb21bb]

@@ -1,1 +1,1 @@
-
+\LARGE B = (\sum_{i=0}^{NHASH-1} (NHASH-i)z_i) \bmod 2^{16}

Added art/encode3.tex version [e14430ac5a]

@@ -1,1 +1,1 @@
-
+\LARGE V = 2^{16}B + A

Added art/encode4.tex version [ef19f3297d]

@@ -1,1 +1,1 @@
-
+\LARGE z_0

Added art/encode5.tex version [8c67f6dc7e]

@@ -1,1 +1,1 @@
-
+\LARGE z_{new}

Added art/encode6.tex version [460b9901be]

@@ -1,1 +1,1 @@
-
+\LARGE A_{new} = (A - z_0 + z_{new}) \bmod 2^{16}

Added art/encode7.tex version [52fdbf5294]

@@ -1,1 +1,1 @@
-
+\LARGE B_{new} = (B - z_0 NHASH + A_{new}) \bmod 2^{16}

Added art/encode8.tex version [002741146e]

@@ -1,1 +1,1 @@
-
+\LARGE V_{new} = 2^{16}B_{new} + A_{new}

Added art/encode9.tex version [84df5abf42]

@@ -1,1 +1,1 @@
-
+\LARGE A_{new}

Modified src/delta.c from [10d700df60] to [4f6e9c70e9].

@@ -265,11 +265,11 @@
 ** delta that may contain some binary data.
 **
 ** Algorithm:
 **
 ** The encoder first builds a hash table to help it find matching
-** patterns in the source file.  16-byte chucks of the source file
+** patterns in the source file.  16-byte chunks of the source file
 ** sampled at evenly spaced intervals are used to populate the hash
 ** table.
 **
 ** Next we begin scanning the target file using a sliding 16-byte
 ** window.  The hash of the 16-byte window in the target is used to

Added tools/encode_math.sh version [86eb1215d8]

@@ -1,1 +1,12 @@
-
+#!/bin/sh
+bindir="`dirname "$0"`"
+topdir="`dirname "$bindir"`"
+mimetex -d "`cat "$topdir/art/encode1.tex"`" > "$topdir/www/encode1.gif"
+mimetex -d "`cat "$topdir/art/encode2.tex"`" > "$topdir/www/encode2.gif"
+mimetex -d "`cat "$topdir/art/encode3.tex"`" > "$topdir/www/encode3.gif"
+mimetex -d "`cat "$topdir/art/encode4.tex"`" > "$topdir/www/encode4.gif"
+mimetex -d "`cat "$topdir/art/encode5.tex"`" > "$topdir/www/encode5.gif"
+mimetex -d "`cat "$topdir/art/encode6.tex"`" > "$topdir/www/encode6.gif"
+mimetex -d "`cat "$topdir/art/encode7.tex"`" > "$topdir/www/encode7.gif"
+mimetex -d "`cat "$topdir/art/encode8.tex"`" > "$topdir/www/encode8.gif"
+mimetex -d "`cat "$topdir/art/encode9.tex"`" > "$topdir/www/encode9.gif"

Added www/delta_encoder_algorithm.html version [8a9f809ab9]

@@ -1,1 +1,153 @@
+<html>
+<head>
+<title>Fossil Delta Encoding Algorithm</title>
+</head>
+<body bgcolor="white">
+<p>[ <a href="index.html">Index</a> ]</p>
+<hr>
+<h1 align="center">
+Fossil Delta Encoding Algorithm
+</h1>
+
+<p>A key component for the efficient storage of multiple revisions of
+a file in fossil repositories is the use of delta-compression, i.e. to
+store only the changes between revisions instead of the whole
+file.</p>
+
+<p>This document describes the encoding algorithm used by Fossil to
+generate deltas. It is targeted at developers working on either
+<a href="index.html">fossil</a> itself, or on tools compatible with
+it. The exact format of the generated byte-sequences, while in general
+not necessary to understand encoder operation, can be found in the
+companion specification titled "<a href="delta_format.html">Fossil
+Delta Format</a>".
+</p>
+
+<p>The entire algorithm is inspired
+by <a href="http://samba.anu.edu.au/rsync/">rsync</a>.</p>
+
+<a name="argresparam"><h2>1.0 Arguments, Results, and Parameters</h2>
+
+<p>The encoder takes two byte-sequences as input, the "original", and
+the "target", and returns a single byte-sequence containing the
+"delta" which transforms the original into the target upon its
+application.</p>
+
+<p>Note that the data of a "byte-sequence" includes its length,
+i.e. the number of bytes contained in the sequence.</p>
+
+<p>The algorithm has one parameter named "NHASH", the size of the
+"sliding window" for the "rolling hash", in bytes. These two terms are
+explained in the next section. The value of this parameter has to be a
+power of two for the algorithm to work. For Fossil the value of this
+parameter is set to "16".</p>
+
+<a name="operation"><h2>2.0 Operation</h2>
+
+<p>The algorithm is split into three phases which generate
+the <a href="delta_format.html#header">header</a>,
+<a href="delta_format.html#slist">segment list</a>,
+and <a href="delta_format.html#trailer">trailer</a> of the delta, per
+its general <a href="delta_format.html#structure">structure</a>.</p>
+
+<p>The two phases generating header and trailer are not covered here
+as their implementation trivially follows directly from the
+specification of the <a href="delta_format.html">delta format</a>.</p>
+
+<p>This leaves the segment-list. Its generation is done in two phases,
+a pre-processing step operating on the "original" byte-sequence,
+followed by the processing of the "target" byte-sequence using the
+information gathered by the first step.</p>
+
+<a name="preprocessing"><h3>2.1 Preprocessing the original</h3>
+
+<p>A major part of the processing of the "target" is to find a range
+in the "original" which contains the same content as found at the
+current location in the "target".</p>
+
+<p>A naive approach to this would be to search the whole "original"
+for such content. This however is very inefficient as it would search
+the same parts of the "original" over and over. What is done instead
+is to sample the "original" at regular intervals, compute signatures
+for the sampled locations and store them in a hash table keyed by
+these signatures.</p>
+
+<p>That is what happens in this step. The following processing step
+can then the compute signature for its current location and then has
+to search only a narrow set of locations in the "original" for
+possible matches, namely those which have the same signature.</p>
+
+<p>In detail:</p>
+
+<ol>
+<li>The "original" is split into chunks of NHASH bytes. Note that a
+partial chunk of less than NHASH bytes at the end of "original" is
+ignored.
+</li>
+<li>The <a href="#rollhash">rolling hash</a> of each chunk is
+computed.
+</li>
+<li>A hashtable is filled, mapping from the hashes of the chunks to
+the list of chunk locations having this hash.
+</li>
+</ol>
+
+<a name="processing"><h3>2.1 Processing the target</h3>
+
+<b>... to be completed ... </b>
+
+<a name="exceptions"><h2>3.0 Exceptions</h2>
+
+<p>If the "original" is at most NHASH bytes long no compression of
+changes is possible, and the segment-list of the delta consists of a
+single literal which contains the entire "target".</p>
+
+
+<a name="rollhash"><h2>4.0 The rolling hash</h2>
+
+<p>The rolling hash described below and used to compute content
+signatures was chosen not only for good hashing properties, but also
+to enable the easy (incremental) recalculation of its value for a
+sliding window, i.e. where the oldest character is removed from the
+window and a new character is shifted in.<p>
+
+<a name="rhdef"><h2>4.1 Definition</h2>
+
+<p>Assuming an array Z of NHASH bytes (indexing starting at 0) the
+hash V is computed via</p>
+
+<p align=center><table><tr><td>
+<p><img src="encode1.gif" align="center"></p>
+<p><img src="encode2.gif" align="center"></p>
+<p><img src="encode3.gif" align="center"></p>
+</td></tr></table></p>
+
+where A and B are unsigned 16-bit integers (hence the <u>mod</u>), and
+V is a 32-bit unsigned integer with B as MSB, A as LSB.
+
+<a name="rhincr"><h2>4.2 Incremental recalculation</h2>
+
+<p>Assuming an array Z of NHASH bytes (indexing starting at 0) with
+hash V (and components A and B), the dropped
+byte <img src="encode4.gif" align="center">, and the new byte
+<img src="encode5.gif" align="center"> , the new hash can
+be computed incrementally via: </p>
+
+<p align=center><table><tr><td>
+<p><img src="encode6.gif" align="center"></p>
+<p><img src="encode7.gif" align="center"></p>
+<p><img src="encode8.gif" align="center"></p>
+</td></tr></table></p>
+
+<p>For A, the regular sum, it can be seen easily that this the correct
+way recomputing that component.</p>
+
+<p>For B, the weighted sum, note first that <img src="encode4.gif"
+align="center"> has the weight NHASH in the sum, so that is what has
+to be removed. Then adding in <img src="encode9.gif" align="center">
+adds one weight factor to all the other values of Z, and at last adds
+in <img src="encode5.gif" align="center"> with weight 1, also
+generating the correct new sum</p>
 
+</body>
+</html>

Modified www/delta_format.html from [403bbf0aac] to [db8d44b23e].

@@ -1,10 +1,12 @@
 <html>
 <head>
 <title>Fossil Delta Format</title>
 </head>
 <body bgcolor="white">
+<p>[ <a href="index.html">Index</a> ]</p>
+<hr>
 <h1 align="center">
 Fossil Delta Format
 </h1>
 
 <p>A key component for the efficient storage of multiple revisions of
@@ -15,21 +17,21 @@
 <p>This document describes the format used to encode such changes,
 also known as "delta". It is targeted at developers working on either
 <a href="index.html">fossil</a> itself, or on tools compatible with
 it.</p>
 
-<h2>Structure</h2>
+<a name="structure"><h2>1.0 Structure</h2>
 <img src="delta1.gif" align="left" hspace="10">
 
 <p>A delta consists of three parts, a "header", a "trailer", and a
 "segment-list" between them.</p>
 
 <p>Both header and trailer provide information about the target
 helping the decoder, and the segment-list describes how the target can
 be constructed from the original.</p>
 
-<h3>Header</h3>
+<a name="header"><h3>1.1 Header</h3>
 <img src="delta6.gif" align="left" hspace="10">
 
 <p>The header consists of a single number followed by a newline
 character (ASCII 0x0a). The number is the length of the target in
 bytes.</p>
@@ -37,11 +39,11 @@
 <p>This means that, given a delta, the decoder can compute the size of
 the target (and allocate any necessary memory based on that) by simply
 reading the first line of the delta and decoding the number found
 there. In other words, before it has to decode everything else.</p>
 
-<h3>Trailer</h3>
+<a name="trailer"><h3>1.2 Trailer</h3>
 <img src="delta5.gif" align="left" hspace="10">
 
 <p>The trailer consists of a single number followed by a semicolon (ASCII
 0x3b). This number is a checksum of the target and can be used by a
 decoder to verify that the delta applied correctly, reconstructing the
@@ -54,11 +56,11 @@
 
 <p>By putting this information at the end of the delta a decoder has
 it available immediately after the target has been reconstructed
 fully.</p>
 
-<h3>Segment-List</h3>
+<a name="slist"><h3>1.3 Segment-List</h3>
 <img src="delta2.gif" align="left" hspace="10">
 
 <p>The segment-list of a delta describes how to create the target from
 the original by a combination of inserting literal byte-sequences and
 copying ranges of bytes from the original. This is there the
@@ -67,20 +69,20 @@
 
 <p>The target is constructed from beginning to end, with the data
 generated by each instruction appended after the data of all previous
 instructions, with no gaps.</p>
 
-<h4>Insert Literal</h4>
+<a name="insertlit"><h4>1.3.1 Insert Literal</h4>
 
 <p>A literal is specified by two elements, the size of the literal in
 bytes, and the bytes of the literal itself.</p>
 
 <img src="delta4.gif" align="left" hspace="10">
 <p>The length is written first, followed by a colon character (ASCII
 0x3a), followed by the bytes of the literal.</p>
 
-<h4>Copy Range</h4>
+<a name="copyrange"><h4>1.3.2 Copy Range</h4>
 
 <p>A range to copy is specified by two numbers, the offset of the
 first byte in the original to copy, and the size of the range, in
 bytes. The size zero is special, its usage indicates that the range
 extends to the end of the original.</p>
@@ -87,11 +89,11 @@
 
 <img src="delta3.gif" align="left" hspace="10">
 <p>The length is written first, followed by an "at" character (ASCII
 0x40), then the offset, followed by a comma (ASCII 0x2c).</p>
 
-<h2>Encoding of integers</h2>
+<a name="intcoding"><h2>2.0 Encoding of integers</h2>
 
 <p>
 The format currently handles only 32 bit integer numbers. They are
 written base-64 encoded, MSB first, and without leading
 "0"-characters, except if they are significant (i.e. 0 => "0").
@@ -100,13 +102,13 @@
 <p>
 The base-64 coding is described in
 <a href="http://www.ietf.org/rfc/rfc3548.txt">RFC 3548</a>.
 </p>
 
-<h2>Examples</h2>
-
-<h3>Number encoding</h3>
+<a name="examples"><h2>3.0 Examples</h2>
+
+<a name="examplesint"><h3>3.1 Integer encoding</h3>
 
 <table border=1>
 <tr>
 <th>Value</th>
 <th>Encoding</th>
@@ -123,11 +125,11 @@
 <td>-1101438770</td>
 <td>2zMM3E</td>
 </tr>
 </table>
 
-<h3>Delta</h3>
+<a name="examplesdelta"><h3>3.2 Delta encoding</h3>
 
 <p>An example of a delta using the specified encoding is:</p>
 
 <table border=1><tr><td><pre>
 1Xb
@@ -199,11 +201,11 @@
 
 </pre></td></tr></table>
 
 
 
-<h2>Notes</h2>
+<a name="notes"><h2>Notes</h2>
 
 <ul>
 <li>Pure text files generate a pure text delta.
 </li>
 <li>Binary files generate a delta that may contain some binary data.

Added www/encode1.gif version [d75ce317b5]

cannot compute difference between binary files

Added www/encode2.gif version [e2b657c112]

cannot compute difference between binary files

Added www/encode3.gif version [1565ebf8d7]

cannot compute difference between binary files

Added www/encode4.gif version [7879afadc5]

cannot compute difference between binary files

Added www/encode5.gif version [90a35e3626]

cannot compute difference between binary files

Added www/encode6.gif version [ca683ebcfa]

cannot compute difference between binary files

Added www/encode7.gif version [5bb9e59f86]

cannot compute difference between binary files

Added www/encode8.gif version [99cb8d91c3]

cannot compute difference between binary files

Added www/encode9.gif version [7aca8807dd]

cannot compute difference between binary files

Modified www/index.html from [7006a635c1] to [976b9491be].

@@ -88,9 +88,11 @@
 helps insure project integrity.</li>
 <li>The <a href="fileformat.html">file format</a> used by every content
 file stored in the repository.</li>
 <li>The <a href="delta_format.html">format of deltas</a> used to
 efficiently store changes between file revisions.</li>
+<li>The <a href="delta_encoder_algorithm.html">encoder algorithm</a> used to
+efficiently generate deltas.</li>
 </ul>
 
 </body>
 </html>