Artifact 0f9001c9f0d15a71bc48a46351fda01b62d9bbcc

File cvs2fossil.txt part of check-in [27ed4f7dc3] - Extended pass InitCsets and underlying code with more log output geared towards memory introspection, and added markers for special locations. Extended my notes with general observations from the first test runs over my example CVS repositories. by aku on 2008-02-16 06:46:41.


Known problems and areas to work on
===================================

*	Not yet able to handle the specification of multiple projects
	for one CVS repository. I.e. I can, for example, import all of
	tcllib, or a single subproject of tcllib, like tklib, but not
	multiple sub-projects in one go.

*	We have to look into the pass 'InitCsets' and hunt for the
	cause of the large amount of memory it is gobbling up.

	Results from the first look using the new memory tracking
	subsystem:

	(1) The general architecture, workflow, is a bit wasteful. All
	    changesets are generated and kept in memory before getting
	    persisted. This means that allocated memory piles up over
	    time, with later changesets pushing the boundaries. This
	    is made worse by the fact that some of the preliminary
	    changesets seem to require a lot of temporary memory as
	    part of getting broken down into the actual ones.
	    'InitializeBreakState' seems to be the culprit here. Its
	    memory usage is possibly quadratic in the number of items
	    in the changeset.

	(2) A number of small inefficiencies, like 'state eval' always
	    pulling the whole result into memory before processing it
	    with 'foreach'. These results are potentially large lists.

	(3) We maintain an in-memory map from tagged items to their
	    changesets. While this is needed later in the sorting
	    passes, during the creation it is wasted space. It is
	    also wasted time to maintain it during the creation and
	    breaking.
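
	The cost described in (2) can be illustrated with a small
	sketch. It uses Python's sqlite3 module rather than the
	importer's own Tcl code, and the table 'item' with its columns
	is a made-up schema. The only point is the difference between
	materializing a full result list and consuming rows as they
	are produced.

```python
import sqlite3

def make_db(n):
    """Build a throwaway in-memory database (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE item (iid INTEGER PRIMARY KEY, cset INTEGER)")
    db.executemany("INSERT INTO item (iid, cset) VALUES (?, ?)",
                   ((i, i % 10) for i in range(n)))
    return db

def count_materialized(db, cset):
    """Pull the whole result into a list first, then loop over it."""
    rows = db.execute("SELECT iid FROM item WHERE cset = ?",
                      (cset,)).fetchall()
    total = 0
    for (iid,) in rows:      # 'rows' holds every result row at once
        total += 1
    return total

def count_streamed(db, cset):
    """Iterate the cursor directly; rows are fetched as the loop advances."""
    total = 0
    for (iid,) in db.execute("SELECT iid FROM item WHERE cset = ?",
                             (cset,)):
        total += 1
    return total
```

	Both functions compute the same answer. Only the first one
	needs memory proportional to the size of the result.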

	Changes:

	(a) Re-architect to create, break, and persist changesets one
	    by one, completely releasing all associated in-memory data
	    before going to the next. Should be low-hanging fruit with
	    high impact, as we have all the necessary operations
	    already, just not in that order. That alone should
	    already keep the pile from forming, making the spikes of
	    (2) more manageable.

	(b) Look into the smaller problems described in (2), and
	    especially (3). These should still be low-hanging fruit,
	    although of lesser effect than (a). For (3) disable the
	    map and its maintenance during construction, and put it
	    into a separate command, to be used when loading the
	    created changesets at the end.

	(c) With larger effect, but more difficult to achieve, go into
	    command 'InitializeBreakState' and the preceding
	    'internalsuccessors', and re-architect them. Definitely
	    not low-hanging fruit. Possibly also something we can skip
	    if doing (a) had a large enough effect.
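
	A rough sketch of how changes (a) and (b) could fit together,
	written in Python for illustration only and not taken from the
	importer. All names ('preliminary_changesets',
	'break_changeset', 'persist', 'load_item_map') and the table
	'csitem' are hypothetical, and the breaking rule used here
	(splitting into fixed-size chunks) is a stand-in for the real
	algorithm. Each preliminary changeset is broken and persisted
	before the next one is produced, and the item-to-changeset map
	is built only afterwards, from the persisted data.

```python
import sqlite3

def preliminary_changesets():
    """Yield (hypothetical) preliminary changesets one at a time."""
    yield ["a1", "a2", "a3"]
    yield ["b1"]
    yield ["c1", "c2", "c3", "c4", "c5"]

def break_changeset(items, limit=2):
    """Stand-in for the real breaking step: split into chunks of 'limit'."""
    for start in range(0, len(items), limit):
        yield items[start:start + limit]

def persist(db, cid, items):
    """Write one final changeset to the database."""
    db.executemany("INSERT INTO csitem (cid, item) VALUES (?, ?)",
                   ((cid, item) for item in items))

def create_all(db):
    """Create, break, and persist one by one; no changeset list is kept."""
    db.execute("CREATE TABLE csitem (cid INTEGER, item TEXT)")
    cid = 0
    for prelim in preliminary_changesets():
        for final in break_changeset(prelim):
            cid += 1
            persist(db, cid, final)
        # Nothing from this changeset is retained once the loop moves on.
    db.commit()
    return cid

def load_item_map(db):
    """Separate step: build the item -> changeset map only when needed."""
    return {item: cid
            for cid, item in db.execute("SELECT cid, item FROM csitem")}
```

	With this ordering the peak memory is bounded by the largest
	single changeset being broken, plus whatever the map needs
	once it is actually loaded.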

*	Look at the dependencies on external packages and consider
	which of them can be moved into the importer, either as a
	simple utility command, or wholesale.

	struct::list
		assign, map, reverse, filter

		Very few and self-contained commands.

	struct::set
		size, empty, contains, add, include, exclude,
		intersect, subsetof

		Most of the core commands.

	fileutil
		cat, appendToFile, writeFile,
		tempfile, stripPath, test

	fileutil::traverse
		In toto

	struct::graph
		In toto

	snit
		In toto

	sqlite3
		In toto