Overview
SHA1 Hash: 103c397e4b36b1ddfb4afe52d120489361cac9c8
Date: 2007-08-28 03:34:12
User: aku
Comment: Updated my work list, added first notes about 'cvs import' functionality.
Tags And Properties
- branch=trunk inherited from [a28c83647d]
- sym-trunk inherited from [a28c83647d]
Changes
Added ci_cvs.txt version [06c022ce6b]
@@ -1,1 +1,86 @@
+
+Notes about CVS import.
+
+- Problem: CVS does not really track changesets, only individual
+  revisions of files. To recover changesets it is necessary to look
+  at author, branch, and timestamp information, and at the commit
+  messages. Even so this is only a heuristic, not foolproof.
+
+  Existing tool: cvsps.
+
+  Processes the output of 'cvs log' to recover changesets. Problem:
+  It sees only a linear list of revisions and does not see
+  branchpoints, etc. It cannot use the tree structure to help in
+  making its decisions.
+
+- Problem: CVS does not track merge points at all. Recovery through
+  heuristics is brittle at best, looking for keywords in commit
+  messages which might indicate that a branch was merged with some
+  other.
+
+
+Ideas regarding an algorithm to recover changesets.
+
+Key feature: Use the per-file revision trees to help in uncovering
+the underlying changesets and the global revision tree G.
+
+The per-file revision tree for a file X is in essence the global
+revision tree with all nodes not pertaining to X removed from it.
+In reverse this allows us to build up the global revision tree from
+the per-file trees by matching nodes to each other and extending.
+
+Start with the per-file revision tree of a single file as the
+initial approximation of the global tree. All nodes of this tree
+refer to the revision of the file belonging to it, and through that
+to the file itself. At each step the global tree contains the nodes
+for a finite set of files, and all nodes in the tree refer to
+revisions of all files in the set, making the mapping total.
+
+To add a file X to the tree take its per-file revision tree R and
+perform the following actions (a code sketch of the matching and
+propagation steps follows after these notes):
+
+- For each node N in R use the tuple <author, branch, commit
+  message> to identify a set of nodes in G which may match N. Use
+  the timestamp to locate the node nearest in time.
+
+- This process will leave some nodes in R unmapped. If there are
+  unmapped nodes which have no neighbouring mapped nodes we have to
+  abort.
+
+  Otherwise take the nodes which have mapped neighbours. Trace the
+  edges and see which of these nodes are connected in the local
+  tree. Then look at the identified neighbours and trace their
+  connections.
+
+  If two global nodes have a direct connection, but a multi-edge
+  connection in the local tree, insert global nodes mapping to the
+  intermediate local nodes and map them together. This expands the
+  global tree to hold the revisions added by the new file.
+
+  If instead both sides have multi-edge connections, abort. This
+  looks like a merge of two different branches, but there is no such
+  thing in CVS ... Wait ... sort the nodes over time and fit the new
+  nodes in between the other nodes, per the timestamps. We have
+  overlapping / alternating changes to one file and others.
+
+  A last possibility is that a node is connected only to a mapped
+  parent. This may be a new branch, or again an alternating change
+  on the given line. Symbols on the revisions will help to map this.
+
+- We now have an extended global tree which incorporates the
+  revisions of the new file. However the new nodes refer only to the
+  new file, and old nodes may not refer to the new file. This has to
+  be fixed, as all nodes have to refer to all files.
+
+  Run over the tree and look at each parent/child pair. If a file is
+  referenced in the parent but not in the child, copy the reference
+  to that file revision from the parent forward to the child. This
+  signals that the file did not change in the given revision.
+
+- After all files have been integrated in this manner we have a
+  global revision tree capturing all changesets, including the
+  unchanged files per changeset.
+
+
+This algorithm has to be refined to also take Attic/ files into
+account.
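
A minimal sketch, in Python, of the matching and forward-propagation steps described in the notes above. Everything here is an assumption made for illustration: the RevNode shape, the map_node and propagate_unchanged names, and the 300-second clock-skew tolerance are not part of fossil or of any existing importer.

    # Hypothetical changeset node; a real importer would carry more state.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class RevNode:
        author: str
        branch: str
        message: str
        timestamp: float                            # seconds since the epoch
        files: dict = field(default_factory=dict)   # filename -> file revision
        parent: Optional["RevNode"] = None
        children: list = field(default_factory=list)

    def map_node(local, global_nodes, fuzz=300):
        """Map a per-file node onto the global node with the same
        <author, branch, commit message> tuple that is nearest in time,
        within an assumed tolerance of 'fuzz' seconds; None if no match."""
        key = (local.author, local.branch, local.message)
        candidates = [g for g in global_nodes
                      if (g.author, g.branch, g.message) == key]
        if not candidates:
            return None
        best = min(candidates,
                   key=lambda g: abs(g.timestamp - local.timestamp))
        return best if abs(best.timestamp - local.timestamp) <= fuzz else None

    def propagate_unchanged(root, filename):
        """Copy a file's revision reference from parent to child wherever
        the child does not mention the file, i.e. the file did not change
        in that changeset."""
        stack = [root]
        while stack:
            node = stack.pop()
            for child in node.children:
                if filename in node.files and filename not in child.files:
                    child.files[filename] = node.files[filename]
                stack.append(child)

The handling of unmapped nodes (inserting intermediate global nodes, the new-branch case, and the timestamp-based interleaving) would build on map_node and is left out of this sketch.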
Added ci_fossil.txt version [ffe422bb7a]
@@ -1,1 +1,100 @@
+To perform CVS imports for fossil we need at least the ability to
+parse CVS files, i.e. RCS files, with slight differences.
+
+For the general architecture of the import facility we have two
+major paths to choose between.
+
+One is to use an external tool which processes a CVS repository and
+drives fossil through its CLI to insert the found changesets.
+
+The other is to integrate the whole facility into the fossil binary
+itself.
+
+I dislike the second choice. It may be faster, as the implementation
+can use all internal functionality of fossil to perform the import,
+but it will also bloat the binary with functionality not needed most
+of the time. This becomes especially obvious if more importers are
+to be written, like for monotone, bazaar, mercurial, bitkeeper, git,
+SVN, Arch, etc. Keeping all this out of the core fossil binary is
+IMHO more beneficial in the long term, also from a maintenance point
+of view: the tools can evolve separately. This matters especially
+for CVS, as it will have to deal with lots of broken repositories,
+all different.
+
+However, nothing speaks against looking for common parts in all
+possible import tools and having these in the fossil core, as a
+general backend all importers may use. Something like that has
+already been proposed: the deconstruct|reconstruct methods. For us
+only reconstruct is actually important. Taking an unordered
+collection of files (data and manifests) it generates a proper
+fossil repository. With that method implemented, all import tools
+only have to generate the necessary collection and then leave the
+main work of filling the database to fossil itself.
+
+The disadvantage of this method however is that it will gobble up a
+lot of temporary space in the filesystem to hold all unique
+revisions of all files in their expanded form.
+
+It might be worthwhile to consider an extension of 'reconstruct'
+which is able to incrementally add a set of files to an existing
+fossil repository already containing revisions. In that case the
+import tool can be changed to incrementally generate the collection
+for a particular revision, import it, and iterate over all revisions
+in the origin repository. This of course also depends on the origin
+repository itself, i.e. how well it supports such incremental
+export.
+
+This also leads to a possible method for performing the import using
+only existing functionality ('reconstruct' has not been implemented
+yet). Instead of generating an unordered collection for each
+revision, generate a properly set up workspace and simply commit it.
+This will require use of the rm, add and update methods as well, to
+remove old and enter new files, and to point the fossil repository
+to the correct parent revision from which the new revision is
+derived.
+
+The relative efficiency (in time) of these incremental methods
+versus importing a complete collection of files encoding the entire
+origin repository is however not clear.
+
+----------------------------------
+
+reconstruct
+
+It is currently not clear to me when and how fossil does
+delta-compression. Does it use deltas, or reverse deltas, or a
+combination of both? And when does it generate the deltas?
+
+The trivial solution is that it uses deltas, i.e. the first revision
+of a file is stored without delta compression and all future
+versions are deltas from that, and the delta is generated when the
+new revision is committed.
+This has the obvious disadvantage that newer revisions take more
+and more time to decompress as the set of deltas to apply grows.
+
+And during xfer it simply sends the deltas as is, making for easy
+integration on the remote side.
+
+Reconstruct, however, initially sees only an unordered collection of
+files, some of which may be manifests while others are data files.
+If it imports them in random order it might find that file X, which
+was imported first and therefore has no delta compression, is
+actually somewhere in the middle of a line of revisions and should
+be delta-compressed, and then it has to find the predecessor and do
+the compression, etc.
+
+So depending on how the internal logic of delta-compression is done,
+reconstruct might need more logic to help the lower level achieve
+good compression.
+
+For example, a first pass could determine which files are manifests
+and read enough of them to determine their parent/child structure,
+and a second pass could then actually import them, in topological
+order, with all relevant non-manifest files for a manifest imported
+at that time too. With that the underlying engine would see the
+files in basically the same order as generated by a regular series
+of commits. (A code sketch of this two-pass ordering follows after
+these notes.)
+
+Problems for reconstruct: files referenced but not present, and,
+conversely, files present but not referenced. This check can be done
+as part of the second pass, aborting when a missing file is
+encountered, with (un)marking of used files so that at the end we
+know the unused files. It could also be a separate pass between the
+first and second.
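
A minimal sketch, in Python, of the two-pass ordering idea above: the first pass classifies the collection and records the parent/child structure among the manifests, the second yields them parents-first so a delta-compressing backend sees them roughly like a normal series of commits. The callbacks is_manifest and parse_parents are assumed stand-ins for real manifest parsing; none of this is existing fossil functionality.

    from collections import deque

    def reconstruct_order(files, is_manifest, parse_parents):
        """Return manifest names in topological (parents before children)
        order.

        files          -- {name: content}, the unordered collection
        is_manifest    -- content -> bool            (assumed callback)
        parse_parents  -- content -> [parent names]  (assumed callback)
        """
        manifests = {n: c for n, c in files.items() if is_manifest(c)}

        # Pass 1: record the parent/child edges among the manifests.
        pending  = {n: {p for p in parse_parents(c) if p in manifests}
                    for n, c in manifests.items()}
        children = {n: [] for n in manifests}
        for name, parents in pending.items():
            for p in parents:
                children[p].append(name)

        # Pass 2: Kahn's algorithm, roots (no unimported parents) first.
        ready = deque(n for n, ps in pending.items() if not ps)
        order = []
        while ready:
            n = ready.popleft()
            order.append(n)
            for c in children[n]:
                pending[c].discard(n)
                if not pending[c]:
                    ready.append(c)
        if len(order) != len(manifests):
            raise RuntimeError("cycle among manifests; input is inconsistent")
        return order

The actual import of each manifest in the second pass would then pull in and mark the data files it references, so a missing file aborts the run early and any files left unmarked at the end are reported as unreferenced, as described in the notes above.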
Modified todo-ak.txt from [0f794bbab7] to [7f4b3bcffb].
@@ -11,15 +11,15 @@
 * Think about exposure of functionality as libraries, i.e. Tcl
   packages. Foundations like delta, etc. first, work up to
   higher-levels.

-* Document delta format, delta encoder.
-
 * Document the merge algorithm.

 * Document the xfer protocol.
+
+* CVS import. Testcases: Tcl, Tk, Tcllib.

 Questions

 * In the timeline seen at http://fossil-scm.hwaci.com/fossil/timeline
   the manifest uuids are links to pages providing additional