Artifact Content
Not logged in

Artifact 2064b7fe7f2f347b5b56603d4fdcb06979cf6736

File ci_fossil.txt part of check-in [10062df2fa] - Reworked my notes regarding 'reconstruct' based on my reading of content.c, checkin.c, and manifest.c by aku on 2007-08-28 05:01:23.


To perform CVS imports for fossil we need at least the ability to
parse CVS files, i.e. RCS files, with slight differences.

For the general architecture of the import facility we have two major
paths to choose between.

One is to use an external tool which processes a cvs repository and
drives fossil through its CLI to insert the found changesets.

The other is to integrate the whole facility into the fossil binary
itself.

I dislike the second choice. It may be faster, as the implementation
can use all internal functionality of fossil to perform the import,
however it will also bloat the binary with functionality not needed
most of the time. Which becomes especially obvious if more importers
are to be written, like for monotone, bazaar, mercurial, bitkeeper,
git, SVN, Arc, etc. Keeping all this out of the core fossil binary is
IMHO more beneficial in the long term, also from a maintenance point
of view. The tools can evolve separately. Especially important for CVS
as it will have to deal with lots of broken repositories, all
different.

However, nothing speaks against looking for common parts in all
possible import tools, and having these in the fossil core, as a
general backend all importer may use. Something like that has already
been proposed: The deconstruct|reconstruct methods. For us, actually
only reconstruct is important. Taking an unordered collection of files
(data, and manifests) it generates a proper fossil repository.  With
that method implemented all import tools only have to generate the
necessary collection and then leave the main work of filling the
database to fossil itself.

The disadvantage of this method is however that it will gobble up a
lot of temporary space in the filesystem to hold all unique revisions
of all files in their expanded form.

It might be worthwhile to consider an extension of 'reconstruct' which
is able to incrementally add a set of files to an exiting fossil
repository already containing revisions. In that case the import tool
can be changed to incrementally generate the collection for a
particular revision, import it, and iterate over all revisions in the
origin repository. This is of course also dependent on the origin
repository itself, how good it supports such incremental export.

This also leads to a possible method for performing the import using
only existing functionality ('reconstruct' has not been implemented
yet). Instead generating an unordered collection for each revision
generate a properly setup workspace, simply commit it. This will
require use of rm, add and update methods as well, to remove old and
enter new files, and point the fossil repository to the correct parent
revision from the new revision is derived.

The relative efficiency (in time) of these incremental methods versus
importing a complete collection of files encoding the entire origin
repository however is not clear.

----------------------------------

reconstruct

The core logic for handling content is in the file "content.c", in
particular the functions 'content_put' and 'content_deltify'. One of
the main users of these functions is in the file "checkin.c", see the
function 'commit_cmd'.

The logic is clear. The new modified files are simply stored without
delta-compression, using 'content_put'. And should fosssil have an id
for the _previous_ revision of the committed file it uses
'content_deltify' to convert the already stored data for that revision
into a delta with the just stored new revision as origin.

In other words, fossil produces reverse deltas, with leaf revisions
stored just zip-compressed (plain) and older revisions using both zip-
and delta-compression.

Of note is that the underlying logic in 'content_deltify' gives up on
delta compression if the involved files are either not large enough,
or if the achieved compression factor was not high enough. In that
case the old revision of the file is left plain.

The scheme can thus be called a 'truncated reverse delta'.

The manifest is created and committed after the modified files. It
uses the same logic as for the regular files. The new leaf is stored
plain, and storage of the parent manifest is modified to be a delta
with the current as origin.

Further note that for a checkin of a merge result oonly the primary
parent is modified in that way. The secondary parent, the one merged
into the current revision is not touched. I.e. from the storage layer
point of view this revision is still a leaf and the data is kept
stored plain, not delta-compressed.



Now the "reconstruct" can be done like so:

- Scan the files in the indicated directory, and look for a manifest.

- When the manifest has been found parse its contents and follow the
  chain of parent links to locate the root manifest (no parent).

- Import the files referenced by the root manifest, then the manifest
  itself. This can be done using a modified form of the 'commit_cmd'
  which does not have to construct a manifest on its own from vfile,
  vmerge, etc.

- After that recursively apply the import of the previous step to the
  children of the root, and so on.

For an incremental "reconstruct" the collection of files would not be
a single tree with a root, but a forest, and the roots to look for are
not manifests without parent, but with a parent which is already
present in the repository. After one such root has been found and
processed the unprocessed files have to be searched further for more
roots, and only if no such are found anymore will the remaining files
be considered as superfluous.

We can use the functions in "manifest.c" for the parsing and following
the parental chain.

Hm. But we have no direct child information. So the above algorithm
has to be modified, we have to scan all manifests before we start
importing, and we have to create a reverse index, from manifest to
children so that we can perform the import from root to leaves.