f166b0a63c 2007-08-31 aku: ===============================================================================

First experimental code ...

    tools/import-cvs.tcl
    tools/lib/rcsparser.tcl

No actual import, right now only working on getting csets right. The
code uses CVSROOT/history as foundation, and augments that with data
from the individual RCS files (commit messages).

Statistics of a run ...
    3516 csets.

    1545 breaks on user change
     558 breaks on file duplicate
      13 breaks on branch/trunk change
    1402 breaks on commit message change

Time statistics ...
    3297 were processed in <= 1 second (93.77%)
     217 were processed in between 2 seconds and 14 minutes.
       1 was processed in ~41 minutes
       1 was processed in ~22 hours

Time fuzz - Differences between csets range from 0 seconds to 66
days. Needs stats analysis to see if there is an obvious break. Even
so, the times within csets and between csets overlap a great deal,
making time a bad criterion for cset separation, IMHO.

Leaving that topic, back to the current cset separator ...

It has a problem:
    The history file is not starting at the root!

Examples:
    The first three changesets are

    =============================/user
    M {Wed Nov 22 09:28:49 AM PST 2000} ericm 1.4 tcllib/modules/ftpd/ChangeLog
    M {Wed Nov 22 09:28:49 AM PST 2000} ericm 1.7 tcllib/modules/ftpd/ftpd.tcl
    files: 2
    delta: 0
    range: 0 seconds
    =============================/cmsg
    M {Wed Nov 29 02:14:33 PM PST 2000} ericm 1.3 tcllib/aclocal.m4
    files: 1
    delta:
    range: 0 seconds
    =============================/cmsg
    M {Sun Feb 04 12:28:35 AM PST 2001} ericm 1.9 tcllib/modules/mime/ChangeLog
    M {Sun Feb 04 12:28:35 AM PST 2001} ericm 1.12 tcllib/modules/mime/mime.tcl
    files: 2
    delta: 0
    range: 0 seconds

All csets modify files which already have several revisions. We have
no csets from before that in the history, but these csets are in the
RCS files.

I wonder, is SF maybe removing old entries from the history when it
grows too large?

This also affects incremental import ... I cannot assume that the
history always grows. It may shrink ...
I cannot keep an offset, will
have to record the time of the last entry, or even the full entry
processed last, to allow me to skip ahead to anything not known yet.

I might have to try to implement the algorithm outlined below,
matching the revision trees of the individual RCS files to each other
to form the global tree of revisions. Maybe we can use the history to
help in the matchup, for the parts where we do have it.

Wait. This might be easier ... Take the delta information from the RCS
files and generate a fake history ... Actually, this might even allow
us to create a total history ... No, not quite, the merge entries the
actual history may contain will be missing. These we can mix in from
the actual history, as much as we have.

Still, let's try that, a fake history, and then run this script on it
to see if/where there are differences.

===============================================================================

103c397e4b 2007-08-28 aku: Notes about CVS import, regarding CVS.

- Problem: CVS does not really track changesets, but only individual
  revisions of files. To recover changesets it is necessary to look at
  author, branch, timestamp information, and the commit messages. Even
  so this is only heuristic, not foolproof.

  Existing tool: cvsps.

  Processes the output of 'cvs log' to recover changesets. Problem:
  Sees only a linear list of revisions, does not see branchpoints,
  etc. Cannot use the tree structure to help in making the decisions.

- Problem: CVS does not track merge-points at all. Recovery through
  heuristics is brittle at best, looking for keywords in commit
  messages which might indicate that a branch was merged with some
  other.


Ideas regarding an algorithm to recover changesets.

Key feature: Uses the per-file revision trees to help in uncovering
the underlying changesets and global revision tree G.

The per-file revision tree for a file X is in essence the global
revision tree with all nodes not pertaining to X removed from it. In
the reverse this allows us to build up the global revision tree from
the per-file trees by matching nodes to each other and extending.

Start with the per-file revision tree of a single file as initial
approximation of the global tree. All nodes of this tree refer to the
revision of the file belonging to it, and through that the file
itself. At each step the global tree contains the nodes for a finite
set of files, and all nodes in the tree refer to revisions of all
files in the set, making the mapping total.
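
As a rough illustration of that starting point, the sketch below seeds a
global tree from the revision tree of one file. The names FileRevision,
GlobalNode and seed_global_tree are hypothetical and not taken from the
import tools; the per-file tree is assumed to be given as a list of
revisions which each name their parent revision.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FileRevision:
    """One node of a per-file revision tree (hypothetical layout)."""
    path: str
    number: str            # RCS revision number, e.g. "1.4"
    author: str
    branch: str
    message: str
    timestamp: int         # seconds since the epoch
    parent: Optional[str]  # revision number of the parent, None for the root


@dataclass(eq=False)
class GlobalNode:
    """One node of the global tree G: maps each known file to a revision."""
    revisions: Dict[str, str] = field(default_factory=dict)  # path -> number
    parent: Optional["GlobalNode"] = None
    children: List["GlobalNode"] = field(default_factory=list)


def seed_global_tree(revs: List[FileRevision]) -> Dict[str, GlobalNode]:
    """Use the revision tree of a single file as first approximation of G.

    Returns the global nodes keyed by the revision number of that file.
    Every node refers to one revision of the only file in the set, so the
    mapping from nodes to file revisions is total.
    """
    nodes = {r.number: GlobalNode(revisions={r.path: r.number}) for r in revs}
    for r in revs:
        if r.parent is not None:
            child, parent = nodes[r.number], nodes[r.parent]
            child.parent = parent
            parent.children.append(child)
    return nodes


# Usage: two trunk revisions of one file become two linked global nodes.
revs = [
    FileRevision("a.tcl", "1.1", "aku", "trunk", "init", 100, None),
    FileRevision("a.tcl", "1.2", "aku", "trunk", "fix", 200, "1.1"),
]
g = seed_global_tree(revs)
```

Keeping the file-to-revision mapping on each global node is what later
allows more files to be folded into the same tree.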

To add a file X to the tree take the per-file revision tree R and
perform the following actions:

- For each node N in R use the tuple <author, branch, commit message>
  to identify a set of nodes in G which may match N. Use the timestamp
  to locate the node nearest in time.

- This process will leave nodes in R unmapped. If there are unmapped
  nodes which have no neighbouring mapped nodes we have to
  abort.

  Otherwise take the nodes which have mapped neighbours. Trace the
  edges and see which of these nodes are connected in the local
  tree. Then look at the identified neighbours and trace their
  connections.

  If two global nodes have a direct connection, but a multi-edge
  connection in the local tree, insert global nodes mapping to the
  local nodes and map them together. This expands the global tree to
  hold the revisions added by the new file.

  Otherwise, if both sides have multi-edge connections, abort. This
  looks like a merge of two different branches, but there are no such
  in CVS ... Wait ... sort the nodes over time and fit the new nodes
  in between the other nodes, per the timestamps. We have overlapping
  / alternating changes to one file and others.

  A last possibility is that a node is only connected to a mapped
  parent. This may be a new branch, or again an alternating change on
  the given line. Symbols on the revisions will help to map this.

- We now have an extended global tree which incorporates the revisions
  of the new file. However new nodes will refer only to the new file,
  and old nodes may not refer to the new file. This has to be fixed,
  as all nodes have to refer to all files.

  Run over the tree and look at each parent/child pair. If a file is
  not referenced in the child, but is in the parent, then copy the
  reference to the file revision on the parent forward to the child.
  This signals that the file did not change in the given revision.

- After all files have been integrated in this manner we have a global
  revision tree capturing all changesets, including the unchanged
  files per changeset.


This algorithm has to be refined to also take Attic/ files into
account.
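
The candidate lookup of the first step and the final fill-in pass could
be sketched as follows. This is a minimal sketch under assumptions, not
the code of the import tools: the names Node, nearest_match and
fill_forward are hypothetical, global nodes are assumed to carry the
author, branch, commit message and timestamp of the changeset they stand
for, and the tree expansion, the abort cases and Attic/ files are left
out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Node:
    """Global tree node (hypothetical layout)."""
    author: str
    branch: str
    message: str
    timestamp: int                                            # epoch seconds
    revisions: Dict[str, str] = field(default_factory=dict)   # path -> revision
    children: List["Node"] = field(default_factory=list)


def nearest_match(nodes: List[Node], author: str, branch: str,
                  message: str, timestamp: int) -> Optional[Node]:
    """Among the nodes sharing <author, branch, commit message>, return the
    one nearest in time, or None if no node shares the tuple."""
    candidates = [n for n in nodes
                  if (n.author, n.branch, n.message) == (author, branch, message)]
    if not candidates:
        return None
    return min(candidates, key=lambda n: abs(n.timestamp - timestamp))


def fill_forward(root: Node) -> None:
    """Copy file references from parent to child where the child has none,
    recording that the file did not change in the child."""
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent.children:
            for path, rev in parent.revisions.items():
                # Only fills gaps; a revision already on the child is kept.
                child.revisions.setdefault(path, rev)
            stack.append(child)


# Usage: a three-node line where each child changes only one file.
root = Node("aku", "trunk", "init", 100, {"a.tcl": "1.1", "b.tcl": "1.1"})
mid = Node("aku", "trunk", "fix a", 200, {"a.tcl": "1.2"})
leaf = Node("aku", "trunk", "fix a", 900, {"b.tcl": "1.2"})
root.children.append(mid)
mid.children.append(leaf)

match = nearest_match([root, mid, leaf], "aku", "trunk", "fix a", 250)
fill_forward(root)
```

Parents are completed before their children are visited, so a reference
propagates down a whole line of unchanged revisions in a single pass.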