Scm migration scope

From Apache OpenOffice Wiki

Update

The OpenOffice.org steering committee (2008/07/03) decided to go with route c) "Trunk migration only"

Summary

Migrating a project with lots of history from CVS to SVN is difficult and needs a bit of planning. In particular, the question "Which part of the history needs to be migrated?" must be answered up front to avoid later disappointment. This paper discusses the pros and cons of four different approaches:

  • a) Full migration, including all CWS branches.
  • b) Full migration minus "finished" CWS branches.
  • c) Trunk migration only.
  • d) No history migration at all.

Introduction, or why is it so hard to migrate from CVS to a modern SCM?

One thing we need to define within our migration project is how much history we will convert to the next OpenOffice.org SCM system, which will be Subversion first and later most probably a distributed SCM (DSCM).

All modern open source SCM systems I know of are change set oriented. This means that they do not track files individually but track change sets, which are bundles of changes to a potentially huge number of files. One benefit of change sets is obvious: changes in the structure of files (directories, file renames etc.) can be recorded as well. Another very important benefit is consistency: changes which belong together are recorded together in one place.

Consider a CVS branch or tag operation over the whole source tree: all active repository files (those which are not in /Attic) are marked with a tag or branch label, which involves rewriting the complete active archive, some 12 GiB or so in the case of OpenOffice.org. This is not only incredibly slow but also quite unsafe, because thousands of individual files are involved in recording one simple thing like a branch or tag label. If something goes wrong during such an operation, the repository as a whole is left in a corrupt state.

All CVS repositories of notable size are corrupted in one way or another. This is usually not a problem and rarely noticed in daily CVS usage, but it can become a problem when trying to recreate a historical state of the project, or when trying to import the project history into a new SCM system.

CVS best practices recommend never moving a tag or, even worse, a branch after it has been created, in order to keep the repository consistent. But moving tags and branches is at the heart of the CWS "resync" mechanism we employ for OpenOffice.org. We can therefore expect a certain amount of corruption within the OOo repository.

What needs to be done when migrating the project history from an "old style" CVS repository to a new change set based repository? First, one needs to identify which changes in which files belong to one change set. The naive way is to represent each change in each file as an individual change set. This can lead to an incredibly bloated new repository, because there is a certain overhead per change set. Some change set based SCMs are better in this respect than others, but still, representing, say, an identical license change in 10000 files as 10000 individual change sets is going to be wasteful. How does a conversion tool recognize that individual changes in different files belong together? Well, they probably belong together if a) the revision comment is identical and b) the commit times are within a certain time span of each other. Remember, CVS stores the history in individual repository files which know nothing about each other, so extracting "correct" change sets is going to be an imprecise science at best. Conversion tools employ quite a few heuristics for this. Oh, and simply going by chronological order is not an option, because unrelated changes could have been committed at the same time: it is possible that a long commit of, say, 1000 files is interleaved with a commit of one completely unrelated file.
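The grouping heuristic described above can be illustrated with a minimal sketch. It is not how cvs2svn or cvsps actually work internally; it only shows the idea of matching on the revision comment plus a time window. The `FileRevision` type, the `group_into_change_sets` function, the 300 second default window, and the additional matching on the author are all assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FileRevision:
    """One revision of one file, as stored in a single CVS repository file."""
    path: str
    author: str      # assumption: author is used as an extra matching key
    comment: str
    timestamp: int   # commit time in seconds


def group_into_change_sets(
    revisions: List[FileRevision], window_seconds: int = 300
) -> List[List[FileRevision]]:
    """Group per-file revisions into candidate change sets.

    Two revisions land in the same change set if they share author and
    comment, the gap to the previous revision of that set is at most
    window_seconds, and the file is not already part of the set.
    """
    open_sets: Dict[Tuple[str, str], List[FileRevision]] = {}
    result: List[List[FileRevision]] = []
    for rev in sorted(revisions, key=lambda r: r.timestamp):
        key = (rev.author, rev.comment)
        current = open_sets.get(key)
        if (
            current is not None
            and rev.timestamp - current[-1].timestamp <= window_seconds
            and all(r.path != rev.path for r in current)
        ):
            current.append(rev)
        else:
            # Start a new change set for this (author, comment) pair.
            current = [rev]
            open_sets[key] = current
            result.append(current)
    return result
```

Because the sets are keyed by author and comment instead of being cut purely by time, a one-file commit that happens in the middle of a long commit ends up in its own change set and does not split the long one. A real tool also has to respect tags and branches when cutting change sets, which this sketch ignores.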

Is it important that a conversion tool extracts "correct and minimal" change sets? Well, no, because in case of doubt it's always possible to represent a logical change which spans two or more files as two or more change sets, at the cost of some overhead. But there is an important constraint to be observed: CVS tags and branches must always be represented correctly in the new repository. After all, these are used to reconstruct historical states of the project.

There seem to be a number of cvs2xxx tools (with xxx being your favorite SCM system), but the more capable ones are based on either cvs2svn or cvsps. Since we are going to Subversion first, we'll use cvs2svn.

How much history should we migrate?

Good question. As a developer you'll probably say "Why, everything of course, thank you, that's what SCMs are for!". Right, but it's so wasteful! The resulting repository will be huge. The conversion time for the full OOo repository will be longer than a week even on a very powerful machine. And lots and lots of the migrated history isn't really useful to anyone anymore. Furthermore, it's an illusion to believe that you really have everything: repository corruption will always require a certain amount of tweaking of the original repository before a migration can succeed.

So how much history should really be migrated? The answer will always depend on the specifics of the project. For OpenOffice.org I can think of the following rough scenarios.

a) Full migration, including all CWS branches

We'll migrate everything from OOo day one, including all 7000+ historic CWS branches.

  • Pro:
    • Everything there. This has to count for something.
    • All historic states which are not corrupted (in CVS) can be extracted.
    • All current work can be checked out from new SCM.
    • CVS server can be switched off forever ... if you trust the conversion ...
  • Con:
    • The resulting repository will be huge, probably exceeding 100 GiB or so for Subversion, huge for DSCM as well.
    • Ridiculous number of revisions in the new repository, don't know how many, probably far exceeding a million or so.
    • The conversion will take ages. Well, not sure if it will ever finish, hasn't been tried before.
    • During the conversion we'll require a quiet period, no CVS commits allowed at all.
    • There is no way to really check the accuracy of the conversion for historic states.
    • If the CVS server is to be switched off, then all "physical" workspaces need to be converted at some time.

Well, this is the overkill solution. Not really practicable because the (branch and tag) symbol resolution pass of cvs2svn will probably never finish. Who needs all the information in finished or dead CWSs anyway?

b) Full migration minus "finished" CWS branches

We'll migrate everything from OOo day one with the exception of historic CWS branches and anchor tags. Only CWS branches which are not in state "integrated", "finished", "dead" or "canceled" are migrated. Obviously broken tags and branches are not migrated either. Obvious test branches or tags are left out when spotted.
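The selection rule for CWS branches can be sketched as a small filter. This is an illustration only: the `branches_to_migrate` function, the `cws_states` mapping, and the example names and states other than the four quoted above are hypothetical. Where the state information really comes from is not covered here.

```python
# States named in the text whose CWS branches are NOT migrated.
SKIPPED_STATES = {"integrated", "finished", "dead", "canceled"}


def branches_to_migrate(cws_states):
    """Return the sorted CWS names whose state is not in SKIPPED_STATES.

    cws_states maps a CWS name to its state string (hypothetical input).
    """
    return sorted(
        name
        for name, state in cws_states.items()
        if state.lower() not in SKIPPED_STATES
    )


if __name__ == "__main__":
    # Hypothetical example data, not taken from the real CWS database.
    example = {
        "cws_one": "integrated",
        "cws_two": "new",
        "cws_three": "dead",
        "cws_four": "in progress",
    }
    print(branches_to_migrate(example))
```

Broken or test branches and tags would still have to be excluded by hand, since they cannot be recognized from a state field.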

  • Pro:
    • Almost everything of relevance there.
    • All historic states which are not corrupted (in CVS) can be extracted, with the exception of historic CWS branches.
    • All current work can be checked out from new SCM.
    • Conversion can be done, this has been tried out.
    • CVS server can be switched off forever ... if you trust the conversion ...
  • Con:
    • The resulting repository will still be huge, probably exceeding 70 GiB or so for Subversion, huge for DSCM as well.
    • Number of revisions in the new repository estimated > 500000.
    • The conversion will take about a week.
    • During the conversion we'll require a quiet period, no CVS commits allowed at all.
    • There is no way to really check the accuracy of the conversion for historic states.
    • If the CVS server is to be switched off, then all "physical" workspaces need to be converted at some time, including legacy workspaces which are kept for legal reasons.

This is the scenario which we have been planning for so far, which doesn't mean it's the most sensible scenario. I'm convinced we can do this, though it might take a lot of time.

Repository as of July 18th 2008

  • Time for conversion (Sun Fire x4150 with 64GB RAM): 4d 15h 29min
  • Number of revisions: 633995
  • Size of repository: 91 GB

c) Trunk migration only

In this scenario we'll migrate only the trunk of files which a) still belong to the current OOo build and b) are source files. Only the latest revision of binary files will be migrated. Old releases and existing CWSs are still maintained in CVS. When an existing CWS is bound for integration, developers or RE will create a patch as a one-time operation. New CWSs will use the new SCM.

  • Pro:
    • Still almost everything of relevance there. How often do you look at the history of removed files and binaries?
    • Conversion is easy, this has been tried out.
    • Repository is lean and mean. This is even more important with a DSCM.
    • Since CVS server is not switched off we don't have to care for the accuracy of the conversion, migrated history is for reference only.
    • Old CVS server is still there for browsing the complete history, even that of finished CWSs.
    • No quiet period. Only RE has to refrain from integrations for a while, everyone else can work as usual on their CWS during migration.
    • No hassle with workspace migration.
  • Con:
    • Requires that the CVS server remains online for quite some time.
    • A one time patch effort by either Developers or RE is needed when integrating an existing CWS into the trunk.
    • Inspecting really ancient history might require a switch to the CVS server.
    • All historic states need to be constructed from old CVS server.
    • Maintaining old releases and existing CWSs must still be done with CVS.

This is actually my favorite scenario. Others went a similar road as well, for instance the NetBeans team. It results in a handy repository and gets rid of all the binary mess people have committed over time (like that 600 MB ISO file some smart guy checked in some time ago). Later conversion to a DSCM should be easy as well, and then the reduced repository size will be a huge plus. Because there is no developer quiet time, the pressure on the migration guy will be a lot less :-). Historic data is still available in CVS, making maintenance work on old releases really safe, which should be a comfort to RE and to managers as well ... :-).

Repository as of July 18th 2008

  • All code modules, no pruning
    • Time for conversion (Sun Fire x4150 with 64GB RAM): 19h 30min
    • Number of revisions: 362042
    • Size of repository: 12 GB
  • Only active code modules, prune all files from /Attic
    • Time for conversion (Sun Fire x4150 with 64GB RAM): 13h 21min
    • Number of revisions: 272687
    • Size of repository: 6.6 GB
  • Only active code modules, prune all files from /Attic, reduce binary files to last revision
    • Time for conversion (Sun Fire x4150 with 64GB RAM): 11h 58min
    • Number of revisions: 264465
    • Size of repository: 6.0 GB
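To put the three trunk-only measurements next to the scenario b) measurement from the same date, the ratios can be computed with a few lines of Python. The numbers are copied from the lists in this document. The helper names `minutes` and `relative_to_baseline` are made up for this sketch.

```python
def minutes(days=0, hours=0, mins=0):
    """Convert a duration given as days/hours/minutes into minutes."""
    return (days * 24 + hours) * 60 + mins


# (label, conversion time in minutes, number of revisions, size in GB)
BASELINE = ("b) full minus finished CWS", minutes(4, 15, 29), 633995, 91.0)
TRUNK_VARIANTS = [
    ("c) all code modules, no pruning", minutes(0, 19, 30), 362042, 12.0),
    ("c) active modules, Attic pruned", minutes(0, 13, 21), 272687, 6.6),
    ("c) as before, binaries reduced", minutes(0, 11, 58), 264465, 6.0),
]


def relative_to_baseline(variant, baseline=BASELINE):
    """Return time, revision and size ratios of a variant vs. the baseline."""
    _, time_min, revisions, size_gb = variant
    _, base_time, base_revisions, base_size = baseline
    return {
        "time": time_min / base_time,
        "revisions": revisions / base_revisions,
        "size": size_gb / base_size,
    }


if __name__ == "__main__":
    for variant in TRUNK_VARIANTS:
        r = relative_to_baseline(variant)
        print(
            f"{variant[0]}: time {r['time']:.0%}, "
            f"revisions {r['revisions']:.0%}, size {r['size']:.0%}"
        )
```

With these figures the leanest trunk-only variant comes out at roughly 7% of the repository size, about 42% of the revisions and about 11% of the conversion time of scenario b).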

d) No history migration at all

The most radical solution. Just take the latest version, import it and that's it. A new beginning can be so sweet ...

  • Pro:
    • Conversion is trivial.
    • Repository is lean and mean. This is even more important with a DSCM.
    • Since CVS server is not switched off we don't have to care for the accuracy of the conversion.
    • Old CVS server is still there for browsing the complete history, even that of finished CWSs.
    • No quiet period. Only RE has to refrain from integrations for a while, everyone else can work as usual on their CWS during migration.
    • No hassle with workspace migration.
  • Con:
    • Requires that the CVS server remains online for quite some time.
    • A one time patch effort by either Developers or RE is needed when integrating an existing CWS into the trunk.
    • Inspecting any history requires a switch to CVS server.
    • All historic states need to be constructed from old CVS server.
    • Maintaining old releases and existing CWSs must still be done with CVS.

Some projects (e.g. Thunderbird) decided that it's not worth the hassle to migrate any history at all, especially if you have to keep a CVS server online anyway. The idea is to create a web front end which transparently browses CVS if old history data is needed, and Mercurial (the SCM the Thunderbird team will migrate to) otherwise. There is some value in that approach ...
