16094f7ebc 2008-05-16 drh: <nowiki> d87ca60c58 2008-05-15 stephan: <h1 align="center"> d87ca60c58 2008-05-15 stephan: Fossil Delta Encoding Algorithm d87ca60c58 2008-05-15 stephan: </h1> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>A key component for the efficient storage of multiple revisions of d87ca60c58 2008-05-15 stephan: a file in fossil repositories is the use of delta-compression, i.e. to d87ca60c58 2008-05-15 stephan: store only the changes between revisions instead of the whole d87ca60c58 2008-05-15 stephan: file.</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>This document describes the encoding algorithm used by Fossil to d87ca60c58 2008-05-15 stephan: generate deltas. It is targeted at developers working on either 16094f7ebc 2008-05-16 drh: <a href="index.wiki">fossil</a> itself, or on tools compatible with d87ca60c58 2008-05-15 stephan: it. The exact format of the generated byte-sequences, while in general d87ca60c58 2008-05-15 stephan: not necessary to understand encoder operation, can be found in the 16094f7ebc 2008-05-16 drh: companion specification titled "<a href="delta_format.wiki">Fossil d87ca60c58 2008-05-15 stephan: Delta Format</a>". d87ca60c58 2008-05-15 stephan: </p> d87ca60c58 2008-05-15 stephan: adc0b3bfb0 2008-07-15 drh: <p>The algorithm is inspired d87ca60c58 2008-05-15 stephan: by <a href="http://samba.anu.edu.au/rsync/">rsync</a>.</p> d87ca60c58 2008-05-15 stephan: 16094f7ebc 2008-05-16 drh: <a name="argresparam"></a><h2>1.0 Arguments, Results, and Parameters</h2> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>The encoder takes two byte-sequences as input, the "original", and d87ca60c58 2008-05-15 stephan: the "target", and returns a single byte-sequence containing the d87ca60c58 2008-05-15 stephan: "delta" which transforms the original into the target upon its d87ca60c58 2008-05-15 stephan: application.</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>Note that the data of a "byte-sequence" includes its length, d87ca60c58 2008-05-15 stephan: i.e. the number of bytes contained in the sequence.</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>The algorithm has one parameter named "NHASH", the size of the d87ca60c58 2008-05-15 stephan: "sliding window" for the "rolling hash", in bytes. These two terms are d87ca60c58 2008-05-15 stephan: explained in the next section. The value of this parameter has to be a d87ca60c58 2008-05-15 stephan: power of two for the algorithm to work. For Fossil the value of this d87ca60c58 2008-05-15 stephan: parameter is set to "16".</p> d87ca60c58 2008-05-15 stephan: 16094f7ebc 2008-05-16 drh: <a name="operation"></a><h2>2.0 Operation</h2> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>The algorithm is split into three phases which generate 16094f7ebc 2008-05-16 drh: the <a href="delta_format.wiki#header">header</a>, 16094f7ebc 2008-05-16 drh: <a href="delta_format.wiki#slist">segment list</a>, 16094f7ebc 2008-05-16 drh: and <a href="delta_format.wiki#trailer">trailer</a> of the delta, per 16094f7ebc 2008-05-16 drh: its general <a href="delta_format.wiki#structure">structure</a>.</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>The two phases generating header and trailer are not covered here d87ca60c58 2008-05-15 stephan: as their implementation trivially follows directly from the 16094f7ebc 2008-05-16 drh: specification of the <a href="delta_format.wiki">delta format</a>.</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>This leaves the segment-list. Its generation is done in two phases, d87ca60c58 2008-05-15 stephan: a pre-processing step operating on the "original" byte-sequence, d87ca60c58 2008-05-15 stephan: followed by the processing of the "target" byte-sequence using the d87ca60c58 2008-05-15 stephan: information gathered by the first step.</p> d87ca60c58 2008-05-15 stephan: 16094f7ebc 2008-05-16 drh: <a name="preprocessing"></a><h3>2.1 Preprocessing the original</h3> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>A major part of the processing of the "target" is to find a range d87ca60c58 2008-05-15 stephan: in the "original" which contains the same content as found at the d87ca60c58 2008-05-15 stephan: current location in the "target".</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>A naive approach to this would be to search the whole "original" d87ca60c58 2008-05-15 stephan: for such content. This however is very inefficient as it would search d87ca60c58 2008-05-15 stephan: the same parts of the "original" over and over. What is done instead d87ca60c58 2008-05-15 stephan: is to sample the "original" at regular intervals, compute signatures d87ca60c58 2008-05-15 stephan: for the sampled locations and store them in a hash table keyed by d87ca60c58 2008-05-15 stephan: these signatures.</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>That is what happens in this step. The following processing step d87ca60c58 2008-05-15 stephan: can then the compute signature for its current location and then has d87ca60c58 2008-05-15 stephan: to search only a narrow set of locations in the "original" for d87ca60c58 2008-05-15 stephan: possible matches, namely those which have the same signature.</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>In detail:</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <ol> d87ca60c58 2008-05-15 stephan: <li>The "original" is split into chunks of NHASH bytes. Note that a d87ca60c58 2008-05-15 stephan: partial chunk of less than NHASH bytes at the end of "original" is d87ca60c58 2008-05-15 stephan: ignored. d87ca60c58 2008-05-15 stephan: </li> d87ca60c58 2008-05-15 stephan: <li>The <a href="#rollhash">rolling hash</a> of each chunk is d87ca60c58 2008-05-15 stephan: computed. d87ca60c58 2008-05-15 stephan: </li> 9eb6ea75c1 2008-11-11 kejoki: <li>A hash table is filled, mapping from the hashes of the chunks to d87ca60c58 2008-05-15 stephan: the list of chunk locations having this hash. d87ca60c58 2008-05-15 stephan: </li> d87ca60c58 2008-05-15 stephan: </ol> d87ca60c58 2008-05-15 stephan: 16094f7ebc 2008-05-16 drh: <a name="processing"></a><h3>2.1 Processing the target</h3> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>This, the main phase of the encoder, processes the target in a loop d87ca60c58 2008-05-15 stephan: from beginning to end. The state of the encoder is captured by two d87ca60c58 2008-05-15 stephan: locations, the "base" and the "slide". "base" points to the first byte d87ca60c58 2008-05-15 stephan: of the target for which no delta output has been generated yet, and d87ca60c58 2008-05-15 stephan: "slide" is the location of the window used to look in the "origin" for d87ca60c58 2008-05-15 stephan: commonalities. This window is NHASH bytes long.</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>Initially both "base" and "slide" point to the beginning of the d87ca60c58 2008-05-15 stephan: "target". In each iteration of the loop the encoder decides whether to d87ca60c58 2008-05-15 stephan: <ul> d87ca60c58 2008-05-15 stephan: <li>emit a single instruction 16094f7ebc 2008-05-16 drh: to <a href="delta_format.wiki#copyrange">copy a range</a>, or d87ca60c58 2008-05-15 stephan: </li> d87ca60c58 2008-05-15 stephan: <li>emit two instructions, first 16094f7ebc 2008-05-16 drh: to <a href="delta_format.wiki#insertlit">insert a literal</a>, then 16094f7ebc 2008-05-16 drh: to <a href="delta_format.wiki#copyrange">copy a range</a>, or d87ca60c58 2008-05-15 stephan: </li> d87ca60c58 2008-05-15 stephan: <li>move the window forward one byte. d87ca60c58 2008-05-15 stephan: </li> d87ca60c58 2008-05-15 stephan: </ul> d87ca60c58 2008-05-15 stephan: </p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <img src="encode10.gif" align="right" hspace="10"> d87ca60c58 2008-05-15 stephan: <p>To make this decision the encoder first computes the hash value for d87ca60c58 2008-05-15 stephan: the NHASH bytes in the window and then looks at all the locations in d87ca60c58 2008-05-15 stephan: the "origin" which have the same signature. This part uses the hash 9eb6ea75c1 2008-11-11 kejoki: table created by the pre-processing step to efficiently find these d87ca60c58 2008-05-15 stephan: locations.</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>For each of the possible candidates the encoder finds the maximal d87ca60c58 2008-05-15 stephan: range of bytes common to both "origin" and "target", going forward and d87ca60c58 2008-05-15 stephan: backward from "slide" in the "target", and the candidate location in d87ca60c58 2008-05-15 stephan: the "origin". This search is constrained on the side of the "target" d87ca60c58 2008-05-15 stephan: by the "base" (backward search), and the end of the "target" (forward d87ca60c58 2008-05-15 stephan: search), and on the side of the "origin" by the beginning and end of d87ca60c58 2008-05-15 stephan: the "origin", respectively.</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>There are input files for which the hash chains generated by the d87ca60c58 2008-05-15 stephan: pre-processing step can become very long, leading to long search times d87ca60c58 2008-05-15 stephan: and affecting the performance of the delta generator. To limit the d87ca60c58 2008-05-15 stephan: effect such long chains can have the actual search for candidates is d87ca60c58 2008-05-15 stephan: bounded, looking at most N candidates. Currently N is set to 250.</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>From the ranges for all the candidates the best (= largest) common d87ca60c58 2008-05-15 stephan: range is taken and it is determined how many bytes are needed to d87ca60c58 2008-05-15 stephan: encode the bytes between the "base" and the end of that range. If the d87ca60c58 2008-05-15 stephan: range extended back to the "base" then this can be done in a single d87ca60c58 2008-05-15 stephan: copy instruction. Otherwise, i.e if there is a gap between the "base" d87ca60c58 2008-05-15 stephan: and the beginning of the range then two instructions are needed, one d87ca60c58 2008-05-15 stephan: to insert the bytes in the gap as a literal, and a copy instruction d87ca60c58 2008-05-15 stephan: for the range itself. The general situation at this point can be seen d87ca60c58 2008-05-15 stephan: in the picture to the right.</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>If the number of bytes needed to encode both gap (if present), and d87ca60c58 2008-05-15 stephan: range is less than the number of bytes we are encoding the encoder d87ca60c58 2008-05-15 stephan: will emit the necessary instructions as described above, set "base" d87ca60c58 2008-05-15 stephan: and "slide" to the end of the encoded range and start the next d87ca60c58 2008-05-15 stephan: iteration at that point.</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>If, on the other hand, the encoder either did not find candidate d87ca60c58 2008-05-15 stephan: locations in the origin, or the best range coming out of the search d87ca60c58 2008-05-15 stephan: needed more bytes to encode the range than there were bytes in the d87ca60c58 2008-05-15 stephan: range, then no instructions are emitted and the window is moved one d87ca60c58 2008-05-15 stephan: byte forward. The "base" is left unchanged in that case.</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>The processing loop stops at one of two conditions: d87ca60c58 2008-05-15 stephan: <ol> d87ca60c58 2008-05-15 stephan: <li>The encoder decided to move the window forward, but the end of the d87ca60c58 2008-05-15 stephan: window reached the end of the "target". d87ca60c58 2008-05-15 stephan: </li> d87ca60c58 2008-05-15 stephan: <li>After the emission of instructions the new "base" location is d87ca60c58 2008-05-15 stephan: within NHASH bytes of end of the "target", i.e. there are no more than d87ca60c58 2008-05-15 stephan: at most NHASH bytes left. d87ca60c58 2008-05-15 stephan: </li> d87ca60c58 2008-05-15 stephan: </ol> d87ca60c58 2008-05-15 stephan: </p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>If the processing loop left bytes unencoded, i.e. "base" not d87ca60c58 2008-05-15 stephan: exactly at the end of the "target", as is possible for both end d87ca60c58 2008-05-15 stephan: conditions, then one last insert instruction is emitted to put these d87ca60c58 2008-05-15 stephan: bytes into the delta.<p> d87ca60c58 2008-05-15 stephan: 16094f7ebc 2008-05-16 drh: <a name="exceptions"></a><h2>3.0 Exceptions</h2> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>If the "original" is at most NHASH bytes long no compression of d87ca60c58 2008-05-15 stephan: changes is possible, and the segment-list of the delta consists of a d87ca60c58 2008-05-15 stephan: single literal which contains the entire "target".</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>This is actually equivalent to the second end condition of the d87ca60c58 2008-05-15 stephan: processing loop described in the previous section, just checked before d87ca60c58 2008-05-15 stephan: actually entering the loop.</p> d87ca60c58 2008-05-15 stephan: 16094f7ebc 2008-05-16 drh: <a name="rollhash"></a><h2>4.0 The rolling hash</h2> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>The rolling hash described below and used to compute content d87ca60c58 2008-05-15 stephan: signatures was chosen not only for good hashing properties, but also d87ca60c58 2008-05-15 stephan: to enable the easy (incremental) recalculation of its value for a d87ca60c58 2008-05-15 stephan: sliding window, i.e. where the oldest byte is removed from the window d87ca60c58 2008-05-15 stephan: and a new byte is shifted in.<p> d87ca60c58 2008-05-15 stephan: 16094f7ebc 2008-05-16 drh: <a name="rhdef"></a><h3>4.1 Definition</h3> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>Assuming an array Z of NHASH bytes (indexing starting at 0) the d87ca60c58 2008-05-15 stephan: hash V is computed via</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p align=center><table><tr><td> d87ca60c58 2008-05-15 stephan: <p><img src="encode1.gif" align="center"></p> d87ca60c58 2008-05-15 stephan: <p><img src="encode2.gif" align="center"></p> d87ca60c58 2008-05-15 stephan: <p><img src="encode3.gif" align="center"></p> d87ca60c58 2008-05-15 stephan: </td></tr></table></p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: where A and B are unsigned 16-bit integers (hence the <u>mod</u>), and d87ca60c58 2008-05-15 stephan: V is a 32-bit unsigned integer with B as MSB, A as LSB. d87ca60c58 2008-05-15 stephan: 16094f7ebc 2008-05-16 drh: <a name="rhincr"></a><h3>4.2 Incremental recalculation</h3> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>Assuming an array Z of NHASH bytes (indexing starting at 0) with d87ca60c58 2008-05-15 stephan: hash V (and components A and B), the dropped d87ca60c58 2008-05-15 stephan: byte <img src="encode4.gif" align="center">, and the new byte d87ca60c58 2008-05-15 stephan: <img src="encode5.gif" align="center"> , the new hash can d87ca60c58 2008-05-15 stephan: be computed incrementally via: </p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p align=center><table><tr><td> d87ca60c58 2008-05-15 stephan: <p><img src="encode6.gif" align="center"></p> d87ca60c58 2008-05-15 stephan: <p><img src="encode7.gif" align="center"></p> d87ca60c58 2008-05-15 stephan: <p><img src="encode8.gif" align="center"></p> d87ca60c58 2008-05-15 stephan: </td></tr></table></p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>For A, the regular sum, it can be seen easily that this the correct d87ca60c58 2008-05-15 stephan: way recomputing that component.</p> d87ca60c58 2008-05-15 stephan: d87ca60c58 2008-05-15 stephan: <p>For B, the weighted sum, note first that <img src="encode4.gif" d87ca60c58 2008-05-15 stephan: align="center"> has the weight NHASH in the sum, so that is what has d87ca60c58 2008-05-15 stephan: to be removed. Then adding in <img src="encode9.gif" align="center"> d87ca60c58 2008-05-15 stephan: adds one weight factor to all the other values of Z, and at last adds d87ca60c58 2008-05-15 stephan: in <img src="encode5.gif" align="center"> with weight 1, also d87ca60c58 2008-05-15 stephan: generating the correct new sum</p>