Check-in [103c397e4b]
Overview

SHA1 Hash: 103c397e4b36b1ddfb4afe52d120489361cac9c8
Date: 2007-08-28 03:34:12
User: aku
Comment: Updated my work list, added first notes about 'cvs import' functionality.

Changes

Added ci_cvs.txt version [06c022ce6b]

@@ -1,1 +1,86 @@
+
+Notes about CVS import, regarding the CVS side of things.
+
+- Problem: CVS does not really track changesets, but only individual
+  revisions of files. To recover changesets it is necessary to look at
+  author, branch, timestamp information, and the commit messages. Even
+  so, this is only a heuristic, not foolproof.
+
+  Existing tool: cvsps.
+
+  It processes the output of 'cvs log' to recover changesets (a sketch
+  of the basic grouping heuristic follows this list). Problem: it sees
+  only a linear list of revisions and does not see branch points, etc.
+  It cannot use the tree structure to help in making its decisions.
+
+- Problem: CVS does not track merge-points at all. Recovery through
+  heuristics is brittle at best, looking for keywords in commit
+  messages which might indicate that a branch was merged into some
+  other branch.
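+
+A minimal sketch, in Python, of the kind of grouping heuristic cvsps
+applies: revisions are grouped by author, branch and commit message,
+and a group is split whenever the time gap grows too large or a file
+would change twice. The record fields used here are assumptions for
+illustration, not the output format of any actual tool.
+
+  import itertools
+
+  FUZZ = 300  # seconds; largest gap between two commits of a changeset
+
+  def group_changesets(revisions):
+      # revisions: list of dicts with keys 'file', 'rev', 'author',
+      # 'branch', 'message', 'time' (epoch seconds)
+      key = lambda r: (r['author'], r['branch'], r['message'])
+      changesets = []
+      for _, revs in itertools.groupby(sorted(revisions, key=key), key):
+          revs = sorted(revs, key=lambda r: r['time'])
+          current = [revs[0]]
+          for r in revs[1:]:
+              # split when the time gap is too large or the same file
+              # would appear twice in one changeset
+              if (r['time'] - current[-1]['time'] > FUZZ
+                      or any(c['file'] == r['file'] for c in current)):
+                  changesets.append(current)
+                  current = [r]
+              else:
+                  current.append(r)
+          changesets.append(current)
+      return changesets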
+
+
+Ideas regarding an algorithm to recover changesets.
+
+Key feature: Uses the per-file revision trees to help in uncovering
+the underlying changesets and global revision tree G.
+
+The per-file revision tree for a file X is in essence the global
+revision tree with all nodes not pertaining to X removed from it.
+Conversely, this allows us to build up the global revision tree from
+the per-file trees by matching nodes to each other and extending.
+
+Start with the per-file revision tree of a single file as the initial
+approximation of the global tree. All nodes of this tree refer to the
+revision of the file belonging to them, and through that to the file
+itself. At each step the global tree contains the nodes for a finite
+set of files, and all nodes in the tree refer to revisions of all
+files in the set, making the mapping total.
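+
+As a concrete illustration of that invariant (the representation is an
+assumption made for these notes, not an existing implementation), one
+node of the global tree G can be thought of as:
+
+  # 'revisions' maps every file integrated so far to the revision it
+  # has at this node; keeping this map total is the invariant above
+  node = {
+      'author':    'aku',
+      'branch':    'HEAD',
+      'message':   'Fix the frobnicator.',
+      'time':      1188271452,            # timestamp, epoch seconds
+      'children':  [],                    # child nodes in G
+      'revisions': {'ci_cvs.txt': '1.2', 'ci_fossil.txt': '1.1'},
+  }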
+
+To add a file X to the tree, take the per-file revision tree R and
+perform the following actions (a sketch of the matching and
+reference-copying steps in code follows the list):
+
+- For each node N in R use the tuple <author, branch, commit message>
+  to identify a set of nodes in G which may match N. Use the timestamp
+  to locate the node nearest in time.
+
+- This process will leave some nodes in R unmapped. If there are
+  unmapped nodes which have no neighbouring mapped nodes we have to
+  abort.
+
+  Otherwise take the nodes which have mapped neighbours. Trace the
+  edges and see which of these nodes are connected in the local
+  tree. Then look at the identified neighbours and trace their
+  connections.
+
+  If two global nodes have a direct connection, but the corresponding
+  local nodes have a multi-edge connection, insert global nodes
+  mapping to the intermediate local nodes and map them together. This
+  expands the global tree to hold the revisions added by the new file.
+
+  Otherwise, if both sides have multi-edge connections, abort. This
+  would look like a merge of two different branches, but there is no
+  such thing in CVS ... Wait ... we can sort the nodes over time and
+  fit the new nodes in between the other nodes, per the timestamps.
+  Then we have overlapping / alternating changes to one file and to
+  the others.
+
+  A last possibility is that a node is only connected to a mapped
+  parent. This may be a new branch, or again an alternating change on
+  the given line. Symbols on the revisions will help to map this.
+
+- We now have an extended global tree which incorporates the revisions
+  of the new file. However, the new nodes refer only to the new file,
+  and the old nodes may not refer to the new file yet. This has to be
+  fixed, as all nodes have to refer to all files.
+
+  Run over the tree and look at each parent/child pair. If a file is
+  referenced in the parent, but not in the child, copy the reference
+  to that file revision forward from the parent to the child. This
+  signals that the file did not change in the given revision.
+
+- After all files have been integrated in this manner we have a
+  global revision tree capturing all changesets, including the
+  unchanged files per changeset.
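+
+A rough sketch of two of the steps above, using the dict-based node
+representation from earlier: matching a local node against G via the
+<author, branch, commit message> tuple plus nearest timestamp, and the
+forward copying of file references from parent to child. The function
+names are illustrative only.
+
+  def match_node(n, global_nodes):
+      # candidates share author, branch and commit message; pick the
+      # one nearest in time, or None if the node stays unmapped
+      candidates = [g for g in global_nodes
+                    if (g['author'], g['branch'], g['message']) ==
+                       (n['author'], n['branch'], n['message'])]
+      if not candidates:
+          return None
+      return min(candidates, key=lambda g: abs(g['time'] - n['time']))
+
+  def copy_references_forward(root, filenames):
+      # walk parent/child pairs; a file revision missing in the child
+      # is copied from the parent, i.e. the file did not change there
+      stack = [root]
+      while stack:
+          parent = stack.pop()
+          for child in parent['children']:
+              for f in filenames:
+                  if f not in child['revisions'] and f in parent['revisions']:
+                      child['revisions'][f] = parent['revisions'][f]
+              stack.append(child)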
+
+
+This algorithm has to be refined to also take Attic/ files into
+account.
 

Added ci_fossil.txt version [ffe422bb7a]

@@ -1,1 +1,100 @@
 
+To perform CVS imports for fossil we need at least the ability to
+parse CVS files, i.e. RCS files with slight differences.
+
+For the general architecture of the import facility we have two major
+paths to choose between.
+
+One is to use an external tool which processes a CVS repository and
+drives fossil through its CLI to insert the recovered changesets.
+
+The other is to integrate the whole facility into the fossil binary
+itself.
+
+I dislike the second choice. It may be faster, as the implementation
+can use all internal functionality of fossil to perform the import;
+however, it will also bloat the binary with functionality not needed
+most of the time. This becomes especially obvious if more importers
+are to be written, like for monotone, bazaar, mercurial, bitkeeper,
+git, SVN, Arc, etc. Keeping all this out of the core fossil binary is
+IMHO more beneficial in the long term, also from a maintenance point
+of view. The tools can evolve separately. This is especially important
+for CVS, as it will have to deal with lots of broken repositories, all
+different.
+
+However, nothing speaks against looking for common parts in all
+possible import tools, and having these in the fossil core, as a
+general backend all importers may use. Something like that has already
+been proposed: the deconstruct|reconstruct methods. For us, actually
+only reconstruct is important. Taking an unordered collection of files
+(data and manifests) it generates a proper fossil repository. With
+that method implemented, all import tools only have to generate the
+necessary collection and then leave the main work of filling the
+database to fossil itself.
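+
+Under that scheme the import tool's job reduces to writing out the
+needed blobs. A minimal sketch, assuming only that artifacts are
+identified by their SHA1 (the directory layout is a guess, not a
+specification of 'reconstruct'):
+
+  import hashlib, os
+
+  def emit_artifact(collection_dir, content):
+      # content is the expanded file revision or manifest, as bytes;
+      # name the blob after its SHA1 and drop it into the collection
+      uuid = hashlib.sha1(content).hexdigest()
+      with open(os.path.join(collection_dir, uuid), 'wb') as f:
+          f.write(content)
+      return uuid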
+
+The disadvantage of this method is, however, that it will gobble up a
+lot of temporary space in the filesystem to hold all unique revisions
+of all files in their expanded form.
+
+It might be worthwhile to consider an extension of 'reconstruct' which
+is able to incrementally add a set of files to an existing fossil
+repository already containing revisions. In that case the import tool
+can be changed to incrementally generate the collection for a
+particular revision, import it, and iterate over all revisions in the
+origin repository. This of course also depends on the origin
+repository itself, i.e. how well it supports such incremental export.
+
+This also leads to a possible method for performing the import using
+only existing functionality ('reconstruct' has not been implemented
+yet). Instead of generating an unordered collection for each revision,
+generate a properly set up workspace and simply commit it. This will
+require use of the rm, add and update methods as well, to remove old
+and enter new files, and to point the fossil repository at the correct
+parent revision from which the new revision is derived. (A sketch of
+such a driver loop follows below.)
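+
+A sketch of such a driver loop, assuming only the fossil commands
+named above (update, rm, add, commit); the changeset object 'cs' and
+its fields are invented for illustration:
+
+  import os, subprocess
+
+  def run(workspace, *cmd):
+      subprocess.run(cmd, check=True, cwd=workspace)
+
+  def import_changeset(cs, workspace):
+      # cs is assumed to carry: the fossil uuid of the parent revision,
+      # the commit message, and the file changes of this changeset
+      run(workspace, 'fossil', 'update', cs.parent)
+      for path in cs.removed:
+          run(workspace, 'fossil', 'rm', path)
+          os.remove(os.path.join(workspace, path))
+      for path, content in cs.changed:     # added and modified files
+          with open(os.path.join(workspace, path), 'w') as f:
+              f.write(content)
+      for path in cs.added:
+          run(workspace, 'fossil', 'add', path)
+      run(workspace, 'fossil', 'commit', '-m', cs.message)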
+
+The relative efficiency (in time) of these incremental methods versus
+importing a complete collection of files encoding the entire origin
+repository is, however, not clear.
+
+----------------------------------
+
+reconstruct
+
+It is currently not clear to me when and how fossil does
+delta-compression. Does it use deltas, or reverse deltas, or a
+combination of both? And when does it generate the deltas?
+
+The trivial solution is that it uses forward deltas, i.e. the first
+revision of a file is stored without delta-compression, all future
+versions are stored as a chain of deltas from it, and each delta is
+generated when the new revision is committed. The obvious disadvantage
+is that newer revisions take more and more time to decompress as the
+set of deltas to apply grows.
+
+And during xfer it simply sends the deltas as is, making for easy
+integration on the remote side.
+
+However, reconstruct initially sees only an unordered collection of
+files, some of which may be manifests and others data files. If it
+imports them in a random order it might find that file X, which was
+imported first and therefore has no delta-compression, actually sits
+somewhere in the middle of a line of revisions and should be
+delta-compressed; it then has to find the predecessor and do the
+compression, etc.
+
+So, depending on how the internal logic of delta-compression is done,
+reconstruct might need more logic to help the lower level achieve good
+compression.
+
+For example, in a first pass determine which files are manifests, and
+read enough of them to determine their parent/child structure; then in
+a second pass actually import them, in topological order, with all
+relevant non-manifest files for a manifest imported at that time too.
+With that the underlying engine would see the files in basically the
+same order as generated by a regular series of commits.
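+
+A sketch of that two-pass ordering; how manifests are recognised and
+parsed is left abstract, and import_artifact stands in for whatever
+the lower level turns out to be:
+
+  from collections import deque
+
+  def reconstruct_in_order(artifacts, is_manifest, parents_of,
+                           import_artifact):
+      # pass 1: find the manifests and their parent/child structure;
+      # artifacts are assumed to be hashable ids, and parents_of only
+      # returns manifests that are themselves part of the collection
+      manifests = [a for a in artifacts if is_manifest(a)]
+      children = {m: [] for m in manifests}
+      indegree = {m: 0 for m in manifests}
+      for m in manifests:
+          for p in parents_of(m):
+              children[p].append(m)
+              indegree[m] += 1
+      # pass 2: import in topological order, parents before children,
+      # roughly the order a regular series of commits would produce;
+      # the data files referenced by a manifest go in at the same time
+      queue = deque(m for m in manifests if indegree[m] == 0)
+      while queue:
+          m = queue.popleft()
+          import_artifact(m)
+          for c in children[m]:
+              indegree[c] -= 1
+              if indegree[c] == 0:
+                  queue.append(c)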
+
+Problems for reconstruct: files referenced, but not present, and,
+conversely, files present, but not referenced. This can be handled as
+part of the second pass, aborting when a missing file is encountered,
+and (un)marking used files so that at the end we know the unused
+files. It could also be a separate pass between the first and second.
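+
+The check itself is then just a set comparison; a minimal sketch, with
+the extraction of the referenced ids from the manifests left abstract:
+
+  def check_collection(present, referenced):
+      # present: artifact ids found in the collection;
+      # referenced: artifact ids named by the manifests
+      missing = referenced - present   # referenced, but not present: abort
+      unused  = present - referenced   # present, but never referenced
+      return missing, unused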

Modified todo-ak.txt from [0f794bbab7] to [7f4b3bcffb].

@@ -11,15 +11,15 @@
 
 * Think about exposure of functionality as libraries, i.e. Tcl
   packages. Foundations like delta, etc. first, work up to
   higher-levels.
 
-* Document delta format, delta encoder.
-
 * Document the merge algorithm.
 
 * Document the xfer protocol.
+
+* CVS import. Testcases: Tcl, Tk, Tcllib.
 
 Questions
 
 * In the timeline seen at http://fossil-scm.hwaci.com/fossil/timeline
   the manifest uuids are links to pages providing additional