
To perform CVS imports into fossil we need at least the ability to
parse CVS files, i.e. RCS files with slight differences.
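
A rough sketch (Python; it deliberately ignores the @-quoted strings,
branches and delta texts a real parser has to handle, and the field
names are just what plain RCS uses) of pulling the basic metadata out
of an RCS ',v' file:

    import re

    # Admin section: "head 1.5;".  Revision headers: a revision number
    # on its own line followed by "date ...; author ...; state ...;".
    HEAD_RE = re.compile(r'^head\s+([0-9.]+);', re.M)
    REV_RE = re.compile(
        r'^([0-9.]+)\s*\n'
        r'date\s+([0-9.]+);\s*author\s+([^;]+);\s*state\s+([^;]*);',
        re.M)

    def scan_rcs(path):
        # Return the head revision and per-revision metadata of one ,v file.
        with open(path, 'r', errors='replace') as f:
            text = f.read()
        head = HEAD_RE.search(text)
        revisions = [{'rev': r, 'date': d,
                      'author': a.strip(), 'state': s.strip()}
                     for r, d, a, s in REV_RE.findall(text)]
        return {'head': head.group(1) if head else None,
                'revisions': revisions}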

For the general architecture of the import facility we have two major
paths to choose between.

One is to use an external tool which processes a CVS repository and
drives fossil through its CLI to insert the changesets it finds.

The other is to integrate the whole facility into the fossil binary
itself.

I dislike the second choice. It may be faster, as the implementation
can use all of fossil's internal functionality to perform the import;
however, it also bloats the binary with functionality that is not
needed most of the time. This becomes especially obvious if more
importers are written, e.g. for monotone, bazaar, mercurial,
bitkeeper, git, SVN, Arc, etc. Keeping all of this out of the core
fossil binary is IMHO more beneficial in the long term, also from a
maintenance point of view: the tools can evolve separately. That is
especially important for CVS, which will have to deal with lots of
broken repositories, all different.

However, nothing speaks against looking for common parts in all
possible import tools and having these in the fossil core, as a
general backend all importers may use. Something like that has
already been proposed: the deconstruct|reconstruct methods. For us,
actually only reconstruct is important. Taking an unordered
collection of files (data and manifests) it generates a proper fossil
repository. With that method implemented, all import tools only have
to generate the necessary collection and then leave the main work of
filling the database to fossil itself.
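
To illustrate the idea, a sketch of what an import tool would do with
such a backend. Note that 'reconstruct' is only proposed so far, so
the command name and the directory-of-artifacts calling convention
below are assumptions, not an existing interface:

    import hashlib, os, subprocess

    def write_artifact(collection_dir, data):
        # 'data' is the raw content of one artifact (a file revision or
        # a manifest); store it under its SHA1 so fossil can identify it.
        uuid = hashlib.sha1(data).hexdigest()
        with open(os.path.join(collection_dir, uuid), 'wb') as f:
            f.write(data)
        return uuid

    def reconstruct(repository, collection_dir):
        # Hypothetical invocation: build a repository from the collection.
        subprocess.check_call(
            ['fossil', 'reconstruct', repository, collection_dir])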

The disadvantage of this method is however that it will gobble up a
lot of temporary space in the filesystem to hold all unique revisions
of all files in their expanded form.

It might be worthwhile to consider an extension of 'reconstruct'
which is able to incrementally add a set of files to an existing
fossil repository already containing revisions. In that case the
import tool can be changed to incrementally generate the collection
for a particular revision, import it, and iterate over all revisions
in the origin repository. This of course also depends on the origin
repository itself, i.e. how well it supports such incremental export.
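
The incremental variant could then be driven by a loop along these
lines; the incremental 'reconstruct' is purely hypothetical, and the
origin-side methods are placeholders for whatever the exporter of the
origin system provides:

    import subprocess

    def import_incrementally(origin, repository, scratch_dir):
        # Export one revision at a time, write its artifacts into a
        # fresh collection, and fold them into the existing repository.
        for revision in origin.revisions_in_commit_order():
            collection = origin.export_artifacts(revision, scratch_dir)
            subprocess.check_call(
                ['fossil', 'reconstruct', repository, collection])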

This also leads to a possible method for performing the import using
only existing functionality ('reconstruct' has not been implemented
yet). Instead of generating an unordered collection for each revision,
generate a properly set up workspace and simply commit it. This will
also require use of the rm, add and update methods, to remove old and
enter new files, and to point the fossil repository at the correct
parent revision from which the new revision is derived.
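
Using only commands which exist today, one step of that loop might
look roughly like the following. The command names are fossil's, but
the exact options, and the helper that materializes the origin
revision in the workspace, are assumptions:

    import subprocess

    def fossil(workspace, *args):
        # Run a fossil command inside the given checkout.
        subprocess.check_call(('fossil',) + args, cwd=workspace)

    def commit_revision(workspace, revision):
        # Point the workspace at the parent the new revision derives from.
        fossil(workspace, 'update', revision.parent_uuid)
        # Bring the workspace files in sync with the origin revision;
        # 'materialize' is an assumed helper of the import tool.
        added, removed = revision.materialize(workspace)
        for path in added:
            fossil(workspace, 'add', path)
        for path in removed:
            fossil(workspace, 'rm', path)
        fossil(workspace, 'commit', '-m', revision.log_message)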

The relative efficiency (in time) of these incremental methods,
versus importing a complete collection of files encoding the entire
origin repository, is however not clear.

----------------------------------

reconstruct

It is currently not clear to me when and how fossil does
delta-compression. Does it use deltas, or reverse deltas, or a
combination of both? And when does it generate the deltas?

The trivial solution is that it uses deltas, i.e. the first revision
of a file is stored without delta compression, all future versions
are deltas from that, and the delta is generated when the new
revision is committed. The obvious disadvantage is that newer
revisions take more and more time to decompress as the set of deltas
to apply grows.
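
If that guess is right, reading the newest revision means walking the
whole chain, as in this small illustration of the cost (not fossil's
actual storage code; apply_delta stands in for whatever delta format
fossil uses):

    def reconstruct_revision(base, deltas, apply_delta):
        # Revision N is recovered by applying N-1 deltas to the stored
        # base revision; the cost grows linearly with the chain length.
        content = base
        for delta in deltas:
            content = apply_delta(content, delta)
        return content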

And during xfer it simply sends the deltas as is, making for easy
integration on the remote side.

Reconstruct, however, initially sees only an unordered collection of
files, some of which are manifests while others are data files. If it
imports them in a random order it might find that file X, which was
imported first and therefore has no delta compression, is actually
somewhere in the middle of a line of revisions and should be
delta-compressed; it then has to find the predecessor, do the
compression, etc.

So depending on how the internal logic of delta-compression is done
reconstruct might need more logic to help the lower level achieve good
compression.

For example: in a first pass determine which files are manifests and
read enough of them to determine their parent/child structure, then
in a second pass actually import them in topological order, with all
relevant non-manifest files for a manifest imported at that time
too. With that, the underlying engine would see the files in
basically the same order as generated by a regular series of commits.
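
A sketch of that ordering pass, assuming the manifests have already
been recognized and their parent links extracted (in fossil's
manifest syntax these would come from the P cards):

    from collections import deque

    def order_manifests(manifests):
        # manifests: dict manifest-uuid -> list of parent uuids.
        # Returns the uuids in topological order, parents before children.
        children = {uuid: [] for uuid in manifests}
        pending_parents = {}            # parents not yet emitted, per uuid
        for uuid, parents in manifests.items():
            known = [p for p in parents if p in manifests]
            pending_parents[uuid] = len(known)
            for p in known:
                children[p].append(uuid)
        queue = deque(u for u, n in pending_parents.items() if n == 0)
        order = []
        while queue:
            uuid = queue.popleft()
            order.append(uuid)
            for child in children[uuid]:
                pending_parents[child] -= 1
                if pending_parents[child] == 0:
                    queue.append(child)
        return order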

Problems for reconstruct: files referenced, but not present, and,
conversely, files present, but not referenced. This check can be done
as part of the second pass, aborting when a missing file is
encountered and (un)marking files as they are used, so that at the
end we know the unused files. It could also be a separate pass
between the first and the second.
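
In terms of the two passes above this check boils down to a set
comparison; the assumption here is that the references of a manifest
are the hashes of its files plus its parent manifests:

    def check_collection(present, manifests):
        # present:   set of uuids found in the collection.
        # manifests: dict manifest-uuid -> set of uuids it references.
        referenced = set()
        for uuid, refs in manifests.items():
            referenced.add(uuid)        # a manifest counts as used itself
            referenced.update(refs)
        missing = referenced - present  # referenced, but not present
        unused = present - referenced   # present, but not referenced
        return missing, unused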