<!-- dbda8d6ce9 2007-07-21 drh -->
<html>
<head>
<title>Fossil Repository Integrity Self-Checks</title>
</head>
<body bgcolor="white">
<h1 align="center">
Fossil Repository Integrity Self-Checks
</h1>

<p>
Even though fossil is a relatively new project and still contains
many bugs, it is designed with features to give it a high level
of integrity so that you can have confidence that you will not
lose your files. This note describes the defensive measures that
fossil uses to help prevent file loss due to bugs.
</p>

<h2>Atomic Check-ins With Rollback</h2>

<p>
The fossil repository is an
<a href="http://www.sqlite.org/">SQLite</a> database file. SQLite
is very mature and stable and has been in widespread use for many
years, so we have little worry that it might cause repository
corruption. SQLite
databases do not become corrupt even if a program crash, system crash,
or power failure occurs in the middle of an update. If some kind of crash
does occur in the middle of a change, then all the changes are rolled
back the next time that the database is accessed.
</p>

<p>
A check-in operation in fossil makes many changes to the repository
database, but all of these changes happen within a single transaction.
If something goes wrong in the middle of the commit, then the transaction
is rolled back and the database is unchanged.
</p>

<h2>Verification Of Delta Encodings Prior To Transaction Commit</h2>

<p>
The content files that comprise the global state of a fossil repository
are stored in the repository as a tree. The leaves of the tree are
stored as zlib-compressed BLOBs. Interior nodes are deltas from their
descendants. There is a lot of encoding going on here. There is
zlib-compression, which is relatively well-tested but still might
cause corruption if used improperly. And there is the relatively
new delta-encoding mechanism designed expressly for fossil. We want
to make sure that bugs in these encoding mechanisms do not lead to
loss of data.
</p>

<p>
To increase our confidence that everything in the repository is
recoverable, fossil makes sure it can extract an exact replica
of every content file that it changes just prior to transaction
commit. During the course of a check-in, many different files
in the repository might be modified.
Some files are simply
compressed. Other files are delta encoded and then compressed.
While all this is going on, fossil makes a record of every file
that is encoded and the MD5 hash of the original content of that
file. Then just before transaction commit, fossil re-extracts
the original content of all files that were written, computes
the MD5 checksum again, and verifies that the checksums match.
If anything does not match up, an error
message is printed and the transaction rolls back.
</p>

<p>
In other words, fossil always checks to make sure it can
re-extract a file before it commits a check-in of that file.
Hence bugs in fossil are unlikely to corrupt the repository in
a way that prevents us from extracting historical versions of
files.
</p>

<h2>Checksums On All Files And Versions</h2>

<p>
Repository records of type "file" (records that hold the content
of project files) contain a "cksum" property which records the
MD5 checksum of the content of that file. So if something goes
wrong in the file extraction process we will at least know about
it. This checksum is in addition to the digital signature that
covers the entire header and content of the record.
</p>

<p>
Repository records of type "version" contain a "cksum"
property that holds the MD5 checksum of the concatenation of
every file in the entire project. During a check-in, after
fossil has inserted all changes into the repository, it goes
back and rereads every file out of the repository and recomputes
this global checksum based on the repository content. It then
computes an MD5 checksum over the files on disk. If these two
checksums do not match, the check-in fails and rolls back.
Thus if a check-in transaction is successful, we have high
confidence that the content in the repository exactly matches
the content on disk.
</p>

<p>
Every project file is verified by three separate checksums.
There is an SHA1 checksum used as part of the digital signature
on the file. There is an MD5 checksum on the content of each
individual file. And there is a global MD5 checksum over the
entire project source tree. If any of these cross-checks do not
match then the operation fails and an error is displayed. Taken
together, these cross-checks give us high confidence that the
files you checked out are identical to the files you checked in.
</p>
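
<p>
The verify-before-commit idea described above can be illustrated with a
short Python sketch. This is not fossil's actual code or schema: the
<tt>demo_blob</tt> table, the <tt>open_repo</tt> and <tt>checked_store</tt>
functions, and the pluggable <tt>encode</tt> and <tt>decode</tt> parameters
are all hypothetical names chosen for illustration, and plain zlib
compression stands in for fossil's delta encoding. The sketch records the
MD5 hash of each original file, writes the encoded form inside one SQLite
transaction, re-extracts every file, and commits only if all hashes match.
</p>

```python
import hashlib
import sqlite3
import zlib


def open_repo(path=":memory:"):
    """Open a toy repository. The demo_blob table is hypothetical,
    not fossil's real schema."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS demo_blob("
        "name TEXT PRIMARY KEY, data BLOB)"
    )
    return db


def checked_store(db, files, encode=zlib.compress, decode=zlib.decompress):
    """Store every file in one transaction and verify before committing.

    files maps a file name to its content as bytes. encode and decode
    are pluggable so that a faulty encoder can be simulated.
    """
    recorded = {}  # file name mapped to the MD5 of the original content
    try:
        for name, content in files.items():
            recorded[name] = hashlib.md5(content).hexdigest()
            db.execute(
                "INSERT OR REPLACE INTO demo_blob(name, data) VALUES (?, ?)",
                (name, encode(content)),
            )
        # Just before commit: re-extract each file and compare checksums.
        for name, expected in recorded.items():
            (data,) = db.execute(
                "SELECT data FROM demo_blob WHERE name = ?", (name,)
            ).fetchone()
            actual = hashlib.md5(decode(data)).hexdigest()
            if actual != expected:
                raise ValueError("checksum mismatch for " + name)
        db.commit()
    except Exception:
        # Any failure undoes every write made by this check-in.
        db.rollback()
        raise
```

<p>
Calling <tt>checked_store</tt> with an encoder that silently drops a byte
raises an error and leaves the table exactly as it was before the call,
which mirrors the rollback behavior described in this note.
</p>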