Fossil: Artifact Content

Artifact 13f4f929b286b6f5740b70414fd41f8c62493f5d

File ci_cvs.txt part of check-in [f166b0a63c] - Added first code regarding import from cvs, processing a CVSROOT/history file. Looks good, except that the history I have is incomplete, truncated at the beginning. Extended my notes with results from this experiment, thinking about a possible different method. by aku on 2007-08-31 04:57:33.
===============================================================================

First experimental codes ...

toosl/import-cvs.tcl
tools/lib/rcsparser.tcl

No actual import, right now only working on getting csets right. The
code uses CVSROOT/history as foundation, and augments that with data
from the individual RCS files (commit messages).

Statistics of a run ...
	3516 csets.

	1545 breaks on user change
	 558 breaks on file duplicate
	  13 breaks on branch/trunk change
	1402 breaks on commit message change

Time statistics ...
	3297 were processed in <= 1 seconds (93.77%)
	 217 were processed in between 2 seconds and 14 minutes.
	   1 was  processed in ~41 minutes
	   1 was  processed in ~22 hours

Time fuzz - Differences between csets range from 0 seconds to 66
days. Needs stats analysis to see if there is an obvious break. Even
so the times within csets and between csets overlap a great deal,
making time a bad criterium for cset separation, IMHO.

Leaving that topic, back to the current cset separator ...

It has a problem:
	The history file is not starting at the root!

Examples:
	The first three changesets are

	=============================/user
	M {Wed Nov 22 09:28:49 AM PST 2000} ericm 1.4 tcllib/modules/ftpd/ChangeLog
	M {Wed Nov 22 09:28:49 AM PST 2000} ericm 1.7 tcllib/modules/ftpd/ftpd.tcl
	files: 2
	delta: 0
	range: 0 seconds
	=============================/cmsg
	M {Wed Nov 29 02:14:33 PM PST 2000} ericm 1.3 tcllib/aclocal.m4
	files: 1
	delta: 
	range: 0 seconds
	=============================/cmsg
	M {Sun Feb 04 12:28:35 AM PST 2001} ericm 1.9 tcllib/modules/mime/ChangeLog
	M {Sun Feb 04 12:28:35 AM PST 2001} ericm 1.12 tcllib/modules/mime/mime.tcl
	files: 2
	delta: 0
	range: 0 seconds

All csets modify files which already have several revisions. We have
no csets from before that in the history, but these csets are in the
RCS files.

I wonder, is SF maybe removing old entries from the history when it
grows too large ?

This also affects incremental import ... I cannot assume that the
history always grows. It may shrink ... I cannot keep an offset, will
have to record the time of the last entry, or even the full entry
processed last, to allow me to skip ahead to anything not known yet.

I might have to try to implement the algorithm outlined below,
matching the revision trees of the individual RCS files to each other
to form the global tree of revisions. Maybe we can use the history to
help in the matchup, for the parts where we do have it.

Wait. This might be easier ... Take the delta information from the RCS
files and generate a fake history ... Actually, this might even allow
us to create a total history ... No, not quite, the merge entries the
actual history may contain will be missing. These we can mix in from
the actual history, as much as we have.

Still, lets try that, a fake history, and then run this script on it
to see if/where are differences.

===============================================================================


Notes about CVS import, regarding CVS.

- Problem: CVS does not really track changesets, but only individual
  revisions of files. To recover changesets it is necessary to look at
  author, branch, timestamp information, and the commit messages. Even
  so this is only heuristic, not foolproof.

  Existing tool: cvsps.

  Processes the output of 'cvs log' to recover changesets. Problem:
  Sees only a linear list of revisions, does not see branchpoints,
  etc. Cannot use the tree structure to help in making the decisions.

- Problem: CVS does not track merge-points at all. Recovery through
  heuristics is brittle at best, looking for keywords in commit
  messages which might indicate that a branch was merged with some
  other.


Ideas regarding an algorithm to recover changesets.

Key feature: Uses the per-file revision trees to help in uncovering
the underlying changesets and global revision tree G.

The per-file revision tree for a file X is in essence the global
revision tree with all nodes not pertaining to X removed from it. In
the reverse this allows us to built up the global revision tree from
the per-file trees by matching nodes to each other and extending.

Start with the per file revision tree of a single file as initial
approximation of the global tree. All nodes of this tree refer to the
revision of the file belonging to it, and through that the file
itself. At each step the global tree contains the nodes for a finite
set of files, and all nodes in the tree refer to revisions of all
files in the set, making the mapping total.

To add a file X to the tree take the per-file revision tree R and
performs the following actions:

- For each node N in R use the tuple <author, branch, commit message>
  to identify a set of nodes in G which may match N. Use the timestamp
  to locate the node nearest in time.

- This process will leave nodes in N unmapped. If there are unmapped
  nodes which have no neighbouring mapped nodes we have to
  abort.

  Otherwise take the nodes which have mapped neighbours. Trace the
  edges and see which of these nodes are connected in the local
  tree. Then look at the identified neighbours and trace their
  connections.

  If two global nodes have a direct connection, but a multi-edge
  connection in the local tree insert global nodes mapping to the
  local nodes and map them together. This expands the global tree to
  hold the revisions added by the new file.

  Otherwise, both sides have multi-edge connections then abort. This
  looks like a merge of two different branches, but there are no such
  in CVS ... Wait ... sort the nodes over time and fit the new nodes
  in between the other nodes, per the timestamps. We have overlapping
  / alternating changes to one file and others.

  A last possibility is that a node is only connected to a mapped
  parent. This may be a new branch, or again an alternating change on
  the given line. Symbols on the revisions will help to map this.

- We now have an extended global tree which incorporates the revisions
  of the new file. However new nodes will refer only to the new file,
  and old nodes may not refer to the new file. This has to be fixed,
  as all nodes have to refer to all files.

  Run over the tree and look at each parent/child pair. If a file is
  not referenced in the child, but the parent, then copy a reference
  to the file revision on the parent forward to the child. This
  signals that the file did not change in the given revision.

- After all files have been integrated in this manner we have global
  revision tree capturing all changesets, including the unchanged
  files per changeset.


This algorithm has to be refined to also take Attic/ files into
account.