6f1af23ebe 2007-08-26 aku: <html> 6f1af23ebe 2007-08-26 aku: <head> 6f1af23ebe 2007-08-26 aku: <title>Fossil Delta Encoding Algorithm</title> 6f1af23ebe 2007-08-26 aku: </head> 6f1af23ebe 2007-08-26 aku: <body bgcolor="white"> 6f1af23ebe 2007-08-26 aku: <p>[ <a href="index.html">Index</a> ]</p> 6f1af23ebe 2007-08-26 aku: <hr> 6f1af23ebe 2007-08-26 aku: <h1 align="center"> 6f1af23ebe 2007-08-26 aku: Fossil Delta Encoding Algorithm 6f1af23ebe 2007-08-26 aku: </h1> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>A key component for the efficient storage of multiple revisions of 6f1af23ebe 2007-08-26 aku: a file in fossil repositories is the use of delta-compression, i.e. to 6f1af23ebe 2007-08-26 aku: store only the changes between revisions instead of the whole 6f1af23ebe 2007-08-26 aku: file.</p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>This document describes the encoding algorithm used by Fossil to 6f1af23ebe 2007-08-26 aku: generate deltas. It is targeted at developers working on either 6f1af23ebe 2007-08-26 aku: <a href="index.html">fossil</a> itself, or on tools compatible with 6f1af23ebe 2007-08-26 aku: it. The exact format of the generated byte-sequences, while in general 6f1af23ebe 2007-08-26 aku: not necessary to understand encoder operation, can be found in the 6f1af23ebe 2007-08-26 aku: companion specification titled "<a href="delta_format.html">Fossil 6f1af23ebe 2007-08-26 aku: Delta Format</a>". 6f1af23ebe 2007-08-26 aku: </p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>The entire algorithm is inspired 6f1af23ebe 2007-08-26 aku: by <a href="http://samba.anu.edu.au/rsync/">rsync</a>.</p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <a name="argresparam"><h2>1.0 Arguments, Results, and Parameters</h2> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>The encoder takes two byte-sequences as input, the "original", and 6f1af23ebe 2007-08-26 aku: the "target", and returns a single byte-sequence containing the 6f1af23ebe 2007-08-26 aku: "delta" which transforms the original into the target upon its 6f1af23ebe 2007-08-26 aku: application.</p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>Note that the data of a "byte-sequence" includes its length, 6f1af23ebe 2007-08-26 aku: i.e. the number of bytes contained in the sequence.</p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>The algorithm has one parameter named "NHASH", the size of the 6f1af23ebe 2007-08-26 aku: "sliding window" for the "rolling hash", in bytes. These two terms are 6f1af23ebe 2007-08-26 aku: explained in the next section. The value of this parameter has to be a 6f1af23ebe 2007-08-26 aku: power of two for the algorithm to work. For Fossil the value of this 6f1af23ebe 2007-08-26 aku: parameter is set to "16".</p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <a name="operation"><h2>2.0 Operation</h2> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>The algorithm is split into three phases which generate 6f1af23ebe 2007-08-26 aku: the <a href="delta_format.html#header">header</a>, 6f1af23ebe 2007-08-26 aku: <a href="delta_format.html#slist">segment list</a>, 6f1af23ebe 2007-08-26 aku: and <a href="delta_format.html#trailer">trailer</a> of the delta, per 6f1af23ebe 2007-08-26 aku: its general <a href="delta_format.html#structure">structure</a>.</p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>The two phases generating header and trailer are not covered here 6f1af23ebe 2007-08-26 aku: as their implementation trivially follows directly from the 6f1af23ebe 2007-08-26 aku: specification of the <a href="delta_format.html">delta format</a>.</p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>This leaves the segment-list. Its generation is done in two phases, 6f1af23ebe 2007-08-26 aku: a pre-processing step operating on the "original" byte-sequence, 6f1af23ebe 2007-08-26 aku: followed by the processing of the "target" byte-sequence using the 6f1af23ebe 2007-08-26 aku: information gathered by the first step.</p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <a name="preprocessing"><h3>2.1 Preprocessing the original</h3> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>A major part of the processing of the "target" is to find a range 6f1af23ebe 2007-08-26 aku: in the "original" which contains the same content as found at the 6f1af23ebe 2007-08-26 aku: current location in the "target".</p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>A naive approach to this would be to search the whole "original" 6f1af23ebe 2007-08-26 aku: for such content. This however is very inefficient as it would search 6f1af23ebe 2007-08-26 aku: the same parts of the "original" over and over. What is done instead 6f1af23ebe 2007-08-26 aku: is to sample the "original" at regular intervals, compute signatures 6f1af23ebe 2007-08-26 aku: for the sampled locations and store them in a hash table keyed by 6f1af23ebe 2007-08-26 aku: these signatures.</p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>That is what happens in this step. The following processing step 6f1af23ebe 2007-08-26 aku: can then the compute signature for its current location and then has 6f1af23ebe 2007-08-26 aku: to search only a narrow set of locations in the "original" for 6f1af23ebe 2007-08-26 aku: possible matches, namely those which have the same signature.</p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>In detail:</p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <ol> 6f1af23ebe 2007-08-26 aku: <li>The "original" is split into chunks of NHASH bytes. Note that a 6f1af23ebe 2007-08-26 aku: partial chunk of less than NHASH bytes at the end of "original" is 6f1af23ebe 2007-08-26 aku: ignored. 6f1af23ebe 2007-08-26 aku: </li> 6f1af23ebe 2007-08-26 aku: <li>The <a href="#rollhash">rolling hash</a> of each chunk is 6f1af23ebe 2007-08-26 aku: computed. 6f1af23ebe 2007-08-26 aku: </li> 6f1af23ebe 2007-08-26 aku: <li>A hashtable is filled, mapping from the hashes of the chunks to 6f1af23ebe 2007-08-26 aku: the list of chunk locations having this hash. 6f1af23ebe 2007-08-26 aku: </li> 6f1af23ebe 2007-08-26 aku: </ol> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <a name="processing"><h3>2.1 Processing the target</h3> 6f1af23ebe 2007-08-26 aku: 59ad045fef 2007-08-27 aku: <p>This, the main phase of the encoder, processes the target in a loop 59ad045fef 2007-08-27 aku: from beginning to end. The state of the encoder is captured by two 59ad045fef 2007-08-27 aku: locations, the "base" and the "slide". "base" points to the first byte 59ad045fef 2007-08-27 aku: of the target for which no delta output has been generated yet, and 59ad045fef 2007-08-27 aku: "slide" is the location of the window used to look in the "origin" for 59ad045fef 2007-08-27 aku: commonalities. This window is NHASH bytes long.</p> 59ad045fef 2007-08-27 aku: 59ad045fef 2007-08-27 aku: <p>Initially both "base" and "slide" point to the beginning of the 59ad045fef 2007-08-27 aku: "target". In each iteration of the loop the encoder decides whether to 59ad045fef 2007-08-27 aku: <ul> 59ad045fef 2007-08-27 aku: <li>emit a single instruction 59ad045fef 2007-08-27 aku: to <a href="delta_format.html#copyrange">copy a range</a>, or 59ad045fef 2007-08-27 aku: </li> 59ad045fef 2007-08-27 aku: <li>emit two instructions, first 59ad045fef 2007-08-27 aku: to <a href="delta_format.html#insertlit">insert a literal</a>, then 59ad045fef 2007-08-27 aku: to <a href="delta_format.html#copyrange">copy a range</a>, or 59ad045fef 2007-08-27 aku: </li> 59ad045fef 2007-08-27 aku: <li>move the window forward one byte. 59ad045fef 2007-08-27 aku: </li> 59ad045fef 2007-08-27 aku: </ul> 59ad045fef 2007-08-27 aku: </p> 59ad045fef 2007-08-27 aku: 59ad045fef 2007-08-27 aku: <img src="encode10.gif" align="right" hspace="10"> 59ad045fef 2007-08-27 aku: <p>To make this decision the encoder first computes the hash value for 59ad045fef 2007-08-27 aku: the NHASH bytes in the window and then looks at all the locations in 59ad045fef 2007-08-27 aku: the "origin" which have the same signature. This part uses the hash 59ad045fef 2007-08-27 aku: table created by the pre-processing step to effiently find these 59ad045fef 2007-08-27 aku: locations.</p> 59ad045fef 2007-08-27 aku: 59ad045fef 2007-08-27 aku: <p>For each of the possible candidates the encoder finds the maximal 59ad045fef 2007-08-27 aku: range of bytes common to both "origin" and "target", going forward and 59ad045fef 2007-08-27 aku: backward from "slide" in the "target", and the candidate location in 59ad045fef 2007-08-27 aku: the "origin". This search is constrained on the side of the "target" 59ad045fef 2007-08-27 aku: by the "base" (backward search), and the end of the "target" (forward 59ad045fef 2007-08-27 aku: search), and on the side of the "origin" by the beginning and end of 59ad045fef 2007-08-27 aku: the "origin", respectively.</p> 59ad045fef 2007-08-27 aku: 59ad045fef 2007-08-27 aku: <p>From the ranges for all the candidates the best (= largest) common 59ad045fef 2007-08-27 aku: range is taken and it is determined how many bytes are needed to 59ad045fef 2007-08-27 aku: encode the bytes between the "base" and the end of that range. If the 59ad045fef 2007-08-27 aku: range extended back to the "base" then this can be done in a single 59ad045fef 2007-08-27 aku: copy instruction. Otherwise, i.e if there is a gap between the "base" 59ad045fef 2007-08-27 aku: and the beginning of the range then two instructions are needed, one 59ad045fef 2007-08-27 aku: to insert the bytes in the gap as a literal, and a copy instruction 59ad045fef 2007-08-27 aku: for the range itself. The general situation at this point can be seen 59ad045fef 2007-08-27 aku: in the picture to the right.</p> 59ad045fef 2007-08-27 aku: 59ad045fef 2007-08-27 aku: <p>If the number of bytes needed to encode both gap (if present), and 59ad045fef 2007-08-27 aku: range is less than the number of bytes we are encoding the encoder 59ad045fef 2007-08-27 aku: will emit the necessary instructions as described above, set "base" 59ad045fef 2007-08-27 aku: and "slide" to the end of the encoded range and start the next 59ad045fef 2007-08-27 aku: iteration at that point.</p> 59ad045fef 2007-08-27 aku: 59ad045fef 2007-08-27 aku: <p>If, on the other hand, the encoder either did not find candidate 59ad045fef 2007-08-27 aku: locations in the origin, or the best range coming out of the search 59ad045fef 2007-08-27 aku: needed more bytes to encode the range than there were bytes in the 59ad045fef 2007-08-27 aku: range, then no instructions are emitted and the window is moved one 59ad045fef 2007-08-27 aku: byte forward. The "base" is left unchanged in that case.</p> 59ad045fef 2007-08-27 aku: 59ad045fef 2007-08-27 aku: <p>The processing loop stops at one of two conditions: 59ad045fef 2007-08-27 aku: <ol> 59ad045fef 2007-08-27 aku: <li>The encoder decided to move the window forward, but the end of the 59ad045fef 2007-08-27 aku: window reached the end of the "target". 59ad045fef 2007-08-27 aku: </li> 59ad045fef 2007-08-27 aku: <li>After the emission of instructions the new "base" location is 59ad045fef 2007-08-27 aku: within NHASH bytes of end of the "target", i.e. there are no more than 59ad045fef 2007-08-27 aku: at most NHASH bytes left. 59ad045fef 2007-08-27 aku: </li> 59ad045fef 2007-08-27 aku: </ol> 59ad045fef 2007-08-27 aku: </p> 59ad045fef 2007-08-27 aku: 59ad045fef 2007-08-27 aku: <p>If the processing loop left bytes unencoded, i.e. "base" not 59ad045fef 2007-08-27 aku: exactly at the end of the "target", as is possible for both end 59ad045fef 2007-08-27 aku: conditions, then one last insert instruction is emitted to put these 59ad045fef 2007-08-27 aku: bytes into the delta.<p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <a name="exceptions"><h2>3.0 Exceptions</h2> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>If the "original" is at most NHASH bytes long no compression of 6f1af23ebe 2007-08-26 aku: changes is possible, and the segment-list of the delta consists of a 6f1af23ebe 2007-08-26 aku: single literal which contains the entire "target".</p> 6f1af23ebe 2007-08-26 aku: 59ad045fef 2007-08-27 aku: <p>This is actually equivalent to the second end condition of the 59ad045fef 2007-08-27 aku: processing loop described in the previous section, just checked before 59ad045fef 2007-08-27 aku: actually entering the loop.</p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <a name="rollhash"><h2>4.0 The rolling hash</h2> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>The rolling hash described below and used to compute content 6f1af23ebe 2007-08-26 aku: signatures was chosen not only for good hashing properties, but also 6f1af23ebe 2007-08-26 aku: to enable the easy (incremental) recalculation of its value for a 59ad045fef 2007-08-27 aku: sliding window, i.e. where the oldest byte is removed from the window 59ad045fef 2007-08-27 aku: and a new byte is shifted in.<p> 6f1af23ebe 2007-08-26 aku: 59ad045fef 2007-08-27 aku: <a name="rhdef"><h3>4.1 Definition</h3> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>Assuming an array Z of NHASH bytes (indexing starting at 0) the 6f1af23ebe 2007-08-26 aku: hash V is computed via</p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p align=center><table><tr><td> 6f1af23ebe 2007-08-26 aku: <p><img src="encode1.gif" align="center"></p> 6f1af23ebe 2007-08-26 aku: <p><img src="encode2.gif" align="center"></p> 6f1af23ebe 2007-08-26 aku: <p><img src="encode3.gif" align="center"></p> 6f1af23ebe 2007-08-26 aku: </td></tr></table></p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: where A and B are unsigned 16-bit integers (hence the <u>mod</u>), and 6f1af23ebe 2007-08-26 aku: V is a 32-bit unsigned integer with B as MSB, A as LSB. 6f1af23ebe 2007-08-26 aku: 59ad045fef 2007-08-27 aku: <a name="rhincr"><h3>4.2 Incremental recalculation</h3> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>Assuming an array Z of NHASH bytes (indexing starting at 0) with 6f1af23ebe 2007-08-26 aku: hash V (and components A and B), the dropped 6f1af23ebe 2007-08-26 aku: byte <img src="encode4.gif" align="center">, and the new byte 6f1af23ebe 2007-08-26 aku: <img src="encode5.gif" align="center"> , the new hash can 6f1af23ebe 2007-08-26 aku: be computed incrementally via: </p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p align=center><table><tr><td> 6f1af23ebe 2007-08-26 aku: <p><img src="encode6.gif" align="center"></p> 6f1af23ebe 2007-08-26 aku: <p><img src="encode7.gif" align="center"></p> 6f1af23ebe 2007-08-26 aku: <p><img src="encode8.gif" align="center"></p> 6f1af23ebe 2007-08-26 aku: </td></tr></table></p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>For A, the regular sum, it can be seen easily that this the correct 6f1af23ebe 2007-08-26 aku: way recomputing that component.</p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: <p>For B, the weighted sum, note first that <img src="encode4.gif" 6f1af23ebe 2007-08-26 aku: align="center"> has the weight NHASH in the sum, so that is what has 6f1af23ebe 2007-08-26 aku: to be removed. Then adding in <img src="encode9.gif" align="center"> 6f1af23ebe 2007-08-26 aku: adds one weight factor to all the other values of Z, and at last adds 6f1af23ebe 2007-08-26 aku: in <img src="encode5.gif" align="center"> with weight 1, also 6f1af23ebe 2007-08-26 aku: generating the correct new sum</p> 6f1af23ebe 2007-08-26 aku: 6f1af23ebe 2007-08-26 aku: </body> 6f1af23ebe 2007-08-26 aku: </html>