File Annotation
Not logged in
6f1af23ebe 2007-08-26       aku: <html>
6f1af23ebe 2007-08-26       aku: <head>
6f1af23ebe 2007-08-26       aku: <title>Fossil Delta Encoding Algorithm</title>
6f1af23ebe 2007-08-26       aku: </head>
6f1af23ebe 2007-08-26       aku: <body bgcolor="white">
6f1af23ebe 2007-08-26       aku: <p>[ <a href="index.html">Index</a> ]</p>
6f1af23ebe 2007-08-26       aku: <hr>
6f1af23ebe 2007-08-26       aku: <h1 align="center">
6f1af23ebe 2007-08-26       aku: Fossil Delta Encoding Algorithm
6f1af23ebe 2007-08-26       aku: </h1>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>A key component for the efficient storage of multiple revisions of
6f1af23ebe 2007-08-26       aku: a file in fossil repositories is the use of delta-compression, i.e. to
6f1af23ebe 2007-08-26       aku: store only the changes between revisions instead of the whole
6f1af23ebe 2007-08-26       aku: file.</p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>This document describes the encoding algorithm used by Fossil to
6f1af23ebe 2007-08-26       aku: generate deltas. It is targeted at developers working on either
6f1af23ebe 2007-08-26       aku: <a href="index.html">fossil</a> itself, or on tools compatible with
6f1af23ebe 2007-08-26       aku: it. The exact format of the generated byte-sequences, while in general
6f1af23ebe 2007-08-26       aku: not necessary to understand encoder operation, can be found in the
6f1af23ebe 2007-08-26       aku: companion specification titled "<a href="delta_format.html">Fossil
6f1af23ebe 2007-08-26       aku: Delta Format</a>".
6f1af23ebe 2007-08-26       aku: </p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>The entire algorithm is inspired
6f1af23ebe 2007-08-26       aku: by <a href="http://samba.anu.edu.au/rsync/">rsync</a>.</p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <a name="argresparam"><h2>1.0 Arguments, Results, and Parameters</h2>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>The encoder takes two byte-sequences as input, the "original", and
6f1af23ebe 2007-08-26       aku: the "target", and returns a single byte-sequence containing the
6f1af23ebe 2007-08-26       aku: "delta" which transforms the original into the target upon its
6f1af23ebe 2007-08-26       aku: application.</p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>Note that the data of a "byte-sequence" includes its length,
6f1af23ebe 2007-08-26       aku: i.e. the number of bytes contained in the sequence.</p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>The algorithm has one parameter named "NHASH", the size of the
6f1af23ebe 2007-08-26       aku: "sliding window" for the "rolling hash", in bytes. These two terms are
6f1af23ebe 2007-08-26       aku: explained in the next section. The value of this parameter has to be a
6f1af23ebe 2007-08-26       aku: power of two for the algorithm to work. For Fossil the value of this
6f1af23ebe 2007-08-26       aku: parameter is set to "16".</p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <a name="operation"><h2>2.0 Operation</h2>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>The algorithm is split into three phases which generate
6f1af23ebe 2007-08-26       aku: the <a href="delta_format.html#header">header</a>,
6f1af23ebe 2007-08-26       aku: <a href="delta_format.html#slist">segment list</a>,
6f1af23ebe 2007-08-26       aku: and <a href="delta_format.html#trailer">trailer</a> of the delta, per
6f1af23ebe 2007-08-26       aku: its general <a href="delta_format.html#structure">structure</a>.</p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>The two phases generating header and trailer are not covered here
6f1af23ebe 2007-08-26       aku: as their implementation trivially follows directly from the
6f1af23ebe 2007-08-26       aku: specification of the <a href="delta_format.html">delta format</a>.</p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>This leaves the segment-list. Its generation is done in two phases,
6f1af23ebe 2007-08-26       aku: a pre-processing step operating on the "original" byte-sequence,
6f1af23ebe 2007-08-26       aku: followed by the processing of the "target" byte-sequence using the
6f1af23ebe 2007-08-26       aku: information gathered by the first step.</p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <a name="preprocessing"><h3>2.1 Preprocessing the original</h3>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>A major part of the processing of the "target" is to find a range
6f1af23ebe 2007-08-26       aku: in the "original" which contains the same content as found at the
6f1af23ebe 2007-08-26       aku: current location in the "target".</p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>A naive approach to this would be to search the whole "original"
6f1af23ebe 2007-08-26       aku: for such content. This however is very inefficient as it would search
6f1af23ebe 2007-08-26       aku: the same parts of the "original" over and over. What is done instead
6f1af23ebe 2007-08-26       aku: is to sample the "original" at regular intervals, compute signatures
6f1af23ebe 2007-08-26       aku: for the sampled locations and store them in a hash table keyed by
6f1af23ebe 2007-08-26       aku: these signatures.</p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>That is what happens in this step. The following processing step
6f1af23ebe 2007-08-26       aku: can then the compute signature for its current location and then has
6f1af23ebe 2007-08-26       aku: to search only a narrow set of locations in the "original" for
6f1af23ebe 2007-08-26       aku: possible matches, namely those which have the same signature.</p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>In detail:</p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <ol>
6f1af23ebe 2007-08-26       aku: <li>The "original" is split into chunks of NHASH bytes. Note that a
6f1af23ebe 2007-08-26       aku: partial chunk of less than NHASH bytes at the end of "original" is
6f1af23ebe 2007-08-26       aku: ignored.
6f1af23ebe 2007-08-26       aku: </li>
6f1af23ebe 2007-08-26       aku: <li>The <a href="#rollhash">rolling hash</a> of each chunk is
6f1af23ebe 2007-08-26       aku: computed.
6f1af23ebe 2007-08-26       aku: </li>
6f1af23ebe 2007-08-26       aku: <li>A hashtable is filled, mapping from the hashes of the chunks to
6f1af23ebe 2007-08-26       aku: the list of chunk locations having this hash.
6f1af23ebe 2007-08-26       aku: </li>
6f1af23ebe 2007-08-26       aku: </ol>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <a name="processing"><h3>2.1 Processing the target</h3>
6f1af23ebe 2007-08-26       aku: 
59ad045fef 2007-08-27       aku: <p>This, the main phase of the encoder, processes the target in a loop
59ad045fef 2007-08-27       aku: from beginning to end. The state of the encoder is captured by two
59ad045fef 2007-08-27       aku: locations, the "base" and the "slide". "base" points to the first byte
59ad045fef 2007-08-27       aku: of the target for which no delta output has been generated yet, and
59ad045fef 2007-08-27       aku: "slide" is the location of the window used to look in the "origin" for
59ad045fef 2007-08-27       aku: commonalities. This window is NHASH bytes long.</p>
59ad045fef 2007-08-27       aku: 
59ad045fef 2007-08-27       aku: <p>Initially both "base" and "slide" point to the beginning of the
59ad045fef 2007-08-27       aku: "target". In each iteration of the loop the encoder decides whether to
59ad045fef 2007-08-27       aku: <ul>
59ad045fef 2007-08-27       aku: <li>emit a single instruction
59ad045fef 2007-08-27       aku: to <a href="delta_format.html#copyrange">copy a range</a>, or
59ad045fef 2007-08-27       aku: </li>
59ad045fef 2007-08-27       aku: <li>emit two instructions, first
59ad045fef 2007-08-27       aku: to <a href="delta_format.html#insertlit">insert a literal</a>, then
59ad045fef 2007-08-27       aku: to <a href="delta_format.html#copyrange">copy a range</a>, or
59ad045fef 2007-08-27       aku: </li>
59ad045fef 2007-08-27       aku: <li>move the window forward one byte.
59ad045fef 2007-08-27       aku: </li>
59ad045fef 2007-08-27       aku: </ul>
59ad045fef 2007-08-27       aku: </p>
59ad045fef 2007-08-27       aku: 
59ad045fef 2007-08-27       aku: <img src="encode10.gif" align="right" hspace="10">
59ad045fef 2007-08-27       aku: <p>To make this decision the encoder first computes the hash value for
59ad045fef 2007-08-27       aku: the NHASH bytes in the window and then looks at all the locations in
59ad045fef 2007-08-27       aku: the "origin" which have the same signature. This part uses the hash
59ad045fef 2007-08-27       aku: table created by the pre-processing step to effiently find these
59ad045fef 2007-08-27       aku: locations.</p>
59ad045fef 2007-08-27       aku: 
59ad045fef 2007-08-27       aku: <p>For each of the possible candidates the encoder finds the maximal
59ad045fef 2007-08-27       aku: range of bytes common to both "origin" and "target", going forward and
59ad045fef 2007-08-27       aku: backward from "slide" in the "target", and the candidate location in
59ad045fef 2007-08-27       aku: the "origin". This search is constrained on the side of the "target"
59ad045fef 2007-08-27       aku: by the "base" (backward search), and the end of the "target" (forward
59ad045fef 2007-08-27       aku: search), and on the side of the "origin" by the beginning and end of
59ad045fef 2007-08-27       aku: the "origin", respectively.</p>
3e899ae0e5 2007-09-08       aku: 
3e899ae0e5 2007-09-08       aku: <p>There are input files for which the hash chains generated by the
3e899ae0e5 2007-09-08       aku: pre-processing step can become very long, leading to long search times
3e899ae0e5 2007-09-08       aku: and affecting the performance of the delta generator. To limit the
3e899ae0e5 2007-09-08       aku: effect such long chains can have the actual search for candidates is
3e899ae0e5 2007-09-08       aku: bounded, looking at most N candidates. Currently N is set to 250.</p>
59ad045fef 2007-08-27       aku: 
59ad045fef 2007-08-27       aku: <p>From the ranges for all the candidates the best (= largest) common
59ad045fef 2007-08-27       aku: range is taken and it is determined how many bytes are needed to
59ad045fef 2007-08-27       aku: encode the bytes between the "base" and the end of that range. If the
59ad045fef 2007-08-27       aku: range extended back to the "base" then this can be done in a single
59ad045fef 2007-08-27       aku: copy instruction. Otherwise, i.e if there is a gap between the "base"
59ad045fef 2007-08-27       aku: and the beginning of the range then two instructions are needed, one
59ad045fef 2007-08-27       aku: to insert the bytes in the gap as a literal, and a copy instruction
59ad045fef 2007-08-27       aku: for the range itself. The general situation at this point can be seen
59ad045fef 2007-08-27       aku: in the picture to the right.</p>
59ad045fef 2007-08-27       aku: 
59ad045fef 2007-08-27       aku: <p>If the number of bytes needed to encode both gap (if present), and
59ad045fef 2007-08-27       aku: range is less than the number of bytes we are encoding the encoder
59ad045fef 2007-08-27       aku: will emit the necessary instructions as described above, set "base"
59ad045fef 2007-08-27       aku: and "slide" to the end of the encoded range and start the next
59ad045fef 2007-08-27       aku: iteration at that point.</p>
59ad045fef 2007-08-27       aku: 
59ad045fef 2007-08-27       aku: <p>If, on the other hand, the encoder either did not find candidate
59ad045fef 2007-08-27       aku: locations in the origin, or the best range coming out of the search
59ad045fef 2007-08-27       aku: needed more bytes to encode the range than there were bytes in the
59ad045fef 2007-08-27       aku: range, then no instructions are emitted and the window is moved one
59ad045fef 2007-08-27       aku: byte forward. The "base" is left unchanged in that case.</p>
59ad045fef 2007-08-27       aku: 
59ad045fef 2007-08-27       aku: <p>The processing loop stops at one of two conditions:
59ad045fef 2007-08-27       aku: <ol>
59ad045fef 2007-08-27       aku: <li>The encoder decided to move the window forward, but the end of the
59ad045fef 2007-08-27       aku: window reached the end of the "target".
59ad045fef 2007-08-27       aku: </li>
59ad045fef 2007-08-27       aku: <li>After the emission of instructions the new "base" location is
59ad045fef 2007-08-27       aku: within NHASH bytes of end of the "target", i.e. there are no more than
59ad045fef 2007-08-27       aku: at most NHASH bytes left.
59ad045fef 2007-08-27       aku: </li>
59ad045fef 2007-08-27       aku: </ol>
59ad045fef 2007-08-27       aku: </p>
59ad045fef 2007-08-27       aku: 
59ad045fef 2007-08-27       aku: <p>If the processing loop left bytes unencoded, i.e. "base" not
59ad045fef 2007-08-27       aku: exactly at the end of the "target", as is possible for both end
59ad045fef 2007-08-27       aku: conditions, then one last insert instruction is emitted to put these
59ad045fef 2007-08-27       aku: bytes into the delta.<p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <a name="exceptions"><h2>3.0 Exceptions</h2>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>If the "original" is at most NHASH bytes long no compression of
6f1af23ebe 2007-08-26       aku: changes is possible, and the segment-list of the delta consists of a
6f1af23ebe 2007-08-26       aku: single literal which contains the entire "target".</p>
6f1af23ebe 2007-08-26       aku: 
59ad045fef 2007-08-27       aku: <p>This is actually equivalent to the second end condition of the
59ad045fef 2007-08-27       aku: processing loop described in the previous section, just checked before
59ad045fef 2007-08-27       aku: actually entering the loop.</p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <a name="rollhash"><h2>4.0 The rolling hash</h2>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>The rolling hash described below and used to compute content
6f1af23ebe 2007-08-26       aku: signatures was chosen not only for good hashing properties, but also
6f1af23ebe 2007-08-26       aku: to enable the easy (incremental) recalculation of its value for a
59ad045fef 2007-08-27       aku: sliding window, i.e. where the oldest byte is removed from the window
59ad045fef 2007-08-27       aku: and a new byte is shifted in.<p>
6f1af23ebe 2007-08-26       aku: 
59ad045fef 2007-08-27       aku: <a name="rhdef"><h3>4.1 Definition</h3>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>Assuming an array Z of NHASH bytes (indexing starting at 0) the
6f1af23ebe 2007-08-26       aku: hash V is computed via</p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p align=center><table><tr><td>
6f1af23ebe 2007-08-26       aku: <p><img src="encode1.gif" align="center"></p>
6f1af23ebe 2007-08-26       aku: <p><img src="encode2.gif" align="center"></p>
6f1af23ebe 2007-08-26       aku: <p><img src="encode3.gif" align="center"></p>
6f1af23ebe 2007-08-26       aku: </td></tr></table></p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: where A and B are unsigned 16-bit integers (hence the <u>mod</u>), and
6f1af23ebe 2007-08-26       aku: V is a 32-bit unsigned integer with B as MSB, A as LSB.
6f1af23ebe 2007-08-26       aku: 
59ad045fef 2007-08-27       aku: <a name="rhincr"><h3>4.2 Incremental recalculation</h3>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>Assuming an array Z of NHASH bytes (indexing starting at 0) with
6f1af23ebe 2007-08-26       aku: hash V (and components A and B), the dropped
6f1af23ebe 2007-08-26       aku: byte <img src="encode4.gif" align="center">, and the new byte
6f1af23ebe 2007-08-26       aku: <img src="encode5.gif" align="center"> , the new hash can
6f1af23ebe 2007-08-26       aku: be computed incrementally via: </p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p align=center><table><tr><td>
6f1af23ebe 2007-08-26       aku: <p><img src="encode6.gif" align="center"></p>
6f1af23ebe 2007-08-26       aku: <p><img src="encode7.gif" align="center"></p>
6f1af23ebe 2007-08-26       aku: <p><img src="encode8.gif" align="center"></p>
6f1af23ebe 2007-08-26       aku: </td></tr></table></p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>For A, the regular sum, it can be seen easily that this the correct
6f1af23ebe 2007-08-26       aku: way recomputing that component.</p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: <p>For B, the weighted sum, note first that <img src="encode4.gif"
6f1af23ebe 2007-08-26       aku: align="center"> has the weight NHASH in the sum, so that is what has
6f1af23ebe 2007-08-26       aku: to be removed. Then adding in <img src="encode9.gif" align="center">
6f1af23ebe 2007-08-26       aku: adds one weight factor to all the other values of Z, and at last adds
6f1af23ebe 2007-08-26       aku: in <img src="encode5.gif" align="center"> with weight 1, also
6f1af23ebe 2007-08-26       aku: generating the correct new sum</p>
6f1af23ebe 2007-08-26       aku: 
6f1af23ebe 2007-08-26       aku: </body>
6f1af23ebe 2007-08-26       aku: </html>