103c397e4b 2007-08-28 aku:

Notes about CVS import, regarding CVS.

- Problem: CVS does not really track changesets, only individual
  revisions of files. To recover changesets it is necessary to look at
  author, branch, and timestamp information, and at the commit
  messages. Even so, this is only a heuristic, not foolproof.

  Existing tool: cvsps.

  Processes the output of 'cvs log' to recover changesets. Problem:
  It sees only a linear list of revisions and does not see
  branchpoints, etc. It cannot use the tree structure to help in
  making the decisions.

- Problem: CVS does not track merge-points at all. Recovery through
  heuristics is brittle at best: it means looking for keywords in
  commit messages which might indicate that a branch was merged with
  some other.


Ideas regarding an algorithm to recover changesets.

Key feature: Uses the per-file revision trees to help in uncovering
the underlying changesets and the global revision tree G.

The per-file revision tree for a file X is in essence the global
revision tree with all nodes not pertaining to X removed from it.
Conversely, this allows us to build up the global revision tree from
the per-file trees by matching nodes to each other and extending.
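
The relation between the global tree and a per-file tree can be
illustrated with a small Python sketch. This is not part of the
original notes: the GlobalNode structure, its `changed` field, and the
project() function are hypothetical names chosen for illustration, and
the sketch assumes each global node records only the files it actually
modified.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GlobalNode:
    """One changeset in the global tree (hypothetical structure)."""
    name: str
    parent: Optional["GlobalNode"] = None
    # Files actually modified by this changeset: file name -> revision.
    changed: dict = field(default_factory=dict)


def project(nodes, filename):
    """Return {revision: parent revision or None} for one file.

    Nodes that did not change `filename` are dropped; each surviving
    node is linked to its nearest ancestor that also changed the file.
    """
    tree = {}
    for node in nodes:
        if filename not in node.changed:
            continue
        ancestor = node.parent
        while ancestor is not None and filename not in ancestor.changed:
            ancestor = ancestor.parent
        tree[node.changed[filename]] = (
            ancestor.changed[filename] if ancestor is not None else None
        )
    return tree


# Three changesets on one line of development; only the first and
# last touch a.c, so the middle node vanishes from its per-file tree.
c1 = GlobalNode("c1", None, {"a.c": "1.1", "b.c": "1.1"})
c2 = GlobalNode("c2", c1, {"b.c": "1.2"})
c3 = GlobalNode("c3", c2, {"a.c": "1.2"})

per_file = project([c1, c2, c3], "a.c")
```

Recovering changesets is the inverse of project(): only the per-file
trees are known, and G has to be reconstructed from them.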

Start with the per-file revision tree of a single file as the initial
approximation of the global tree. All nodes of this tree refer to the
revision of the file belonging to it, and through that to the file
itself. At each step the global tree contains the nodes for a finite
set of files, and all nodes in the tree refer to revisions of all
files in the set, making the mapping total.

To add a file X to the tree, take its per-file revision tree R and
perform the following actions:

- For each node N in R use the tuple <author, branch, commit message>
  to identify a set of nodes in G which may match N. Use the timestamp
  to locate the node nearest in time.

- This process will leave nodes in R unmapped. If there are unmapped
  nodes which have no neighbouring mapped nodes we have to abort.

  Otherwise take the nodes which have mapped neighbours. Trace the
  edges and see which of these nodes are connected in the local
  tree. Then look at the identified neighbours and trace their
  connections.

  If two global nodes have a direct connection, but a multi-edge
  connection in the local tree, insert global nodes mapping to the
  local nodes and map them together. This expands the global tree to
  hold the revisions added by the new file.

  Otherwise, if both sides have multi-edge connections, abort. This
  looks like a merge of two different branches, but there are no such
  merges in CVS ... Wait ... sort the nodes over time and fit the new
  nodes in between the other nodes, per the timestamps. We have
  overlapping / alternating changes to one file and others.

  A last possibility is that a node is only connected to a mapped
  parent. This may be a new branch, or again an alternating change on
  the given line. Symbols on the revisions will help to map this.

- We now have an extended global tree which incorporates the revisions
  of the new file. However, new nodes will refer only to the new file,
  and old nodes may not refer to the new file. This has to be fixed,
  as all nodes have to refer to all files.

  Run over the tree and look at each parent/child pair. If a file is
  referenced in the parent but not in the child, then copy the
  reference to the file revision on the parent forward to the child.
  This signals that the file did not change in the given revision.

- After all files have been integrated in this manner we have a global
  revision tree capturing all changesets, including the unchanged
  files per changeset.


This algorithm has to be refined to also take Attic/ files into
account.
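
The first step of the algorithm above (matching on the tuple, then
picking the node nearest in time) might look roughly like the
following Python sketch. The Rev structure, match_nodes(), and the
`max_gap` cutoff are assumptions made for illustration; the notes do
not say how far apart two timestamps may be and still count as a
match, so the 300 second default is arbitrary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rev:
    """A node in either tree (hypothetical structure)."""
    ident: str
    author: str
    branch: str
    message: str
    timestamp: int  # seconds since the epoch


def match_nodes(local, global_nodes, max_gap=300):
    """Map nodes of R onto nodes of G; return (mapping, unmapped)."""
    # Index G by the tuple <author, branch, commit message>.
    index = defaultdict(list)
    for g in global_nodes:
        index[(g.author, g.branch, g.message)].append(g)

    mapping, unmapped = {}, []
    for n in local:
        candidates = index.get((n.author, n.branch, n.message), [])
        # Among the candidates pick the node nearest in time.
        best = min(candidates,
                   key=lambda g: abs(g.timestamp - n.timestamp),
                   default=None)
        # max_gap is an assumed cutoff, not something the notes define.
        if best is not None and abs(best.timestamp - n.timestamp) <= max_gap:
            mapping[n.ident] = best.ident
        else:
            unmapped.append(n.ident)
    return mapping, unmapped


G = [Rev("g1", "aku", "trunk", "fix parser", 1000),
     Rev("g2", "aku", "trunk", "fix parser", 90000)]
R = [Rev("1.1", "aku", "trunk", "fix parser", 1010),
     Rev("1.2", "aku", "trunk", "add tests", 2000)]

mapping, unmapped = match_nodes(R, G)
```

Here revision 1.1 maps to g1, while 1.2 stays unmapped and has to be
placed by tracing its neighbours, as described in the second step. The
sketch does not prevent two local nodes from mapping to the same
global node; a real implementation would have to resolve that.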
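
The fix-up pass that makes every node refer to every file can be
sketched in Python as well. Node and fill_forward() are hypothetical
names, and the sketch assumes the nodes are supplied with every parent
listed before its children. It does not handle removed files, which is
where the Attic/ refinement mentioned above would come in.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A node of the global tree (hypothetical structure)."""
    name: str
    parent: Optional["Node"] = None
    # File name -> revision referenced by this node.
    files: dict = field(default_factory=dict)


def fill_forward(nodes_in_order):
    """Copy missing file references from each parent to its child.

    `nodes_in_order` must list every parent before its children, so
    that a parent is already complete when its children are visited.
    """
    for node in nodes_in_order:
        if node.parent is None:
            continue
        for filename, revision in node.parent.files.items():
            # Keep the child's own revision if it has one; otherwise
            # inherit the parent's, meaning "unchanged here".
            node.files.setdefault(filename, revision)


root = Node("c1", None, {"a.c": "1.1"})
mid = Node("c2", root, {"b.c": "1.1"})
tip = Node("c3", mid, {"a.c": "1.2"})

fill_forward([root, mid, tip])
```

After the pass, c2 refers to a.c 1.1 (unchanged) as well as b.c 1.1,
and c3 refers to a.c 1.2 and the inherited b.c 1.1.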