SCM Migration

Glossary

The exact meaning of two terms is essential for the following migration guide:

  • project: a top level project, with a project lead and a separate space on the OpenOffice.org web site and in the OpenOffice.org repository. Example: gsl (the project which hosts the vcl code module, rsc the resource compiler and 16 other code modules), zh (this project hosts the Chinese language community).
  • module: the next level of structure below a project. Code projects typically host several modules; language projects usually have only a www module. Attention: some modules have the same name as their hosting project, for example sw is also a module in the project sw.

Repository restructuring

Whether we choose subversion as the new SCM tool or a distributed SCM like git, bazaar or mercurial, the necessary migration is also a good opportunity to restructure our repository and to do some badly needed clean up.

This restructure and migration guide is geared towards a migration to subversion, but the same principles can and should be applied to a potential migration to another SCM tool.

Currently we have 141 top level projects. Inside these projects we have varying numbers of modules, either web content modules or code modules. Many projects are dedicated to the OpenOffice.org language communities, which are essentially independent of each other. Modules from code projects, on the other hand, are highly dependent on each other. We have 260 of these code modules (some of them historical).

The idea is to move all modules containing OOo source code into a single repository. After that, each project gets its own repository, which is mostly for web content. After the migration, modules inside the new "code project" get linked into their original projects, to maintain the integrity of these projects.

Clean up

In 6 years we accumulated a lot of cruft in the CVS repository. We take the opportunity to leave some dead ends out of the migration. The rule is that every released version of OOo must be represented in the new repository. Otherwise we are pretty free to define what we want to migrate and what not. Currently I plan to implement the following strategy:

  1. migrate all releases of OOo to the new SCM; this means release tags and branches must be preserved
  2. skip experimental branches and tags if they can be proven to be obsolete
  3. skip obviously dead parts of the repository
  4. skip tags and branches of all CWS (child workspaces) with status integrated, finished, deleted or canceled at a certain date (currently the date is 2007/05/15)

The last rule reduces the number of branches to be migrated from about 5000 to about 500.
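Rule 4 can be written down as a small predicate. The following is only a sketch: the function name, the record layout and the reading of "at a certain date" (the status was reached on or before the cut-off date) are assumptions, not the way the migration scripts actually decide.

```python
from datetime import date

# Statuses named in rule 4; a CWS in one of these states is considered done.
DONE_STATUSES = {"integrated", "finished", "deleted", "canceled"}

# Cut-off date quoted in the text (2007/05/15).
CUTOFF = date(2007, 5, 15)


def skip_cws(status, status_date, cutoff=CUTOFF):
    """Return True if the tags and branches of this CWS can be skipped.

    Assumption: `status_date` is the day the CWS reached `status`;
    the real bookkeeping of CWS states may look different.
    """
    return status.lower() in DONE_STATUSES and status_date <= cutoff
```

A CWS that was integrated in March 2007 would be skipped, while one that is still open, or one integrated after the cut-off date, would be migrated.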

Recipe for migrating the code repository

A migration to subversion needs a fast Unix machine with cvs, subversion-1.4 and the cvs2svn python script installed.

Copy and restructure the CVS code repository

Create a copy of the CVS repository. In the following <work> is the directory which contains the 136 OOo top level projects.

Compare the repository with the reference module list

I've prepared three files to help in the migration. The repository structure document repositorystructure.ods contains the new structure, the script repositorystructure.sh moves code modules into the new code project and removes obsolete stuff, and repositorystructure.txt.reference is the reference list of projects and modules.

The OOo repository is constantly growing. Before you start the migration it's mandatory to check if new projects or modules have been added to the CVS repository. Compare the directory with repositorystructure.txt.reference.

$ cd <work>
$ echo */* | sed -e 'y/ /\n/' | sort > repositorystructure.txt
$ diff repositorystructure.txt.reference repositorystructure.txt
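For those who prefer to do this check from a script, the same comparison can be sketched in Python. The function names are made up for this illustration; only the file names come from the text. Note that Python sorts by code point, while sort(1) follows the locale, so the reference file should be produced by the same method that is used for the check.

```python
import difflib
import glob
import os


def list_modules(work):
    """Return sorted 'project/module' entries, similar to `echo */* | sort`."""
    entries = glob.glob(os.path.join(work, "*", "*"))
    return sorted(os.path.relpath(e, work).replace(os.sep, "/") for e in entries)


def compare_with_reference(work, reference_file):
    """Return unified-diff lines between the reference list and the current tree."""
    with open(reference_file) as fh:
        reference = [line.strip() for line in fh if line.strip()]
    current = list_modules(work)
    return list(difflib.unified_diff(reference, current,
                                     "reference", "current", lineterm=""))
```

An empty result means that no projects or modules were added or removed since the reference list was written; lines starting with + are new entries.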

If new projects and/or modules have been added, please add them to repositorystructure.sh and repositorystructure.ods according to the principles described above, and don't forget to update the reference repositorystructure.txt.reference, too.

Restructure repository

Use the repositorystructure.sh script to restructure the repository.

$ cd <work>
$ sh repositorystructure.sh

This script moves all modules with source code into the new code project but leaves the language projects alone. Additionally it removes some cruft from the module level of the repository, like nonsensical empty modules etc.

Repository clean up

The CVS repository contains a number of broken CVS archives, which fall in three categories:

  • Files which do not contain any revisions, but just the RCS header. These files are not valid RCS files and can safely be removed.
  • Files which are present in a <dir> and also in <dir/Attic>, for example hu/hu-po/crashrep.po,v and hu/hu-po/Attic/crashrep.po,v. Here a decision has to be made which version is the right one; the other one must be removed. The one which we will keep is the one with the higher head revision number.
  • Files which have a tag/branch on a deleted revision. The tags/branches have to be removed via the rcs command.
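The "higher head revision number" rule for the second category can be sketched as follows. This is an illustration, not the logic of cleanbrokenfiles.sh: it assumes that the archive text contains the usual RCS admin line of the form "head 1.12;", and the function names are hypothetical. Revisions are compared as tuples of integers, so that 1.10 counts as higher than 1.9.

```python
import re

# Matches the RCS admin line "head <revision>;" at the start of a line.
HEAD_RE = re.compile(r"^head\s+([0-9.]+)\s*;", re.MULTILINE)


def head_revision(rcs_text):
    """Extract the head revision of an RCS file as a tuple of ints, e.g. (1, 12)."""
    match = HEAD_RE.search(rcs_text)
    if match is None:
        # Also raised for archives with an empty head, i.e. without revisions.
        raise ValueError("no head revision found")
    return tuple(int(part) for part in match.group(1).split("."))


def pick_survivor(path_a, text_a, path_b, text_b):
    """Return the path whose archive has the higher head revision.

    On a tie the first path wins; that choice is arbitrary in this sketch.
    """
    return path_a if head_revision(text_a) >= head_revision(text_b) else path_b
```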

The script cleanbrokenfiles.sh does the cleanup. It requires that the RCS command rcs can be found in $PATH.

$ cd <work>
$ sh cleanbrokenfiles.sh

It's quite likely that there are more cases with files present in <dir> and <dir/Attic> in the meantime. These can be found with the python script finddouble.py (requires python 2.5).

$ python finddouble.py

If this tool prints one or more lines, please add them to the cleanbrokenfiles.sh script and rerun the clean script. The output of finddouble.py is formatted in such a way that the line(s) can be directly added to the clean script.
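The search itself is simple. The sketch below is not the actual finddouble.py (which targets python 2.5 and prints lines ready for the clean script); it only shows, in Python 3 and with a made-up function name, how archives present in both a directory and its Attic can be found.

```python
import os


def find_doubles(root):
    """Yield (live, attic) path pairs for archives present in both places."""
    for dirpath, dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) == "Attic":
            continue  # Attic directories are handled from their parent
        attic = os.path.join(dirpath, "Attic")
        if not os.path.isdir(attic):
            continue
        attic_files = set(os.listdir(attic))
        for name in sorted(filenames):
            # Only RCS archives (*,v) are of interest.
            if name.endswith(",v") and name in attic_files:
                yield os.path.join(dirpath, name), os.path.join(attic, name)
```

Each pair that is reported still needs the decision described above: which of the two archives is kept.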

Converting the repository

We convert the repository with the python script cvs2svn-1.5.1. This script has quite a few dependencies: it needs a working Berkeley DB, the Berkeley DB python bindings, the rcs tools in the path, subversion-1.4, and the subversion python bindings. cvs2svn is very flexible and detects quite a few CVS repository inconsistencies. It has 9 passes. Many problems will be found during the first pass, which parses all the *,v CVS archives. If you encounter a problem in this pass, you'll have to go back to the previous section, add the affected files to the clean up script and restart the conversion afterwards.

Converting the language projects

Please use the script convert.sh. The first section of the script contains a few paths; please adapt them to your needs. The script iterates over all 138 language projects and creates 138 subversion repositories. If more language projects have been introduced in the meantime, add them to this script accordingly. The script assumes that the cvs2svn script is in your path. It will create subversion repositories in the Berkeley DB format.

$ cd ..
$ sh convert.sh

Converting the code modules

For converting the code modules we'll need the full flexibility of cvs2svn. cvs2svn can be customized via a so-called "options" file. This file contains python instructions on how exactly the conversion should be done. We need customization for:

  • excluding old and/or obsolete tags and branch names (called symbolic names by cvs2svn)
  • forcing symbolic names which are used as both tag and branch to be either a tag or a branch
  • resolving the problem of "Blocked Exclusion", that is, symbolic names which are no longer needed but have other symbols depending on them, so they can't simply be excluded.

The customization file is named cvs2svn.options. Please adapt the first part of the options file to your needs (especially the paths). In its current form it will generate a Berkeley DB based subversion repository. It can be configured to create just a subversion dump file.

$ cvs2svn --options=cvs2svn.options

Hint: I experienced problems with python memory leaks which led to an out of memory condition. In this case the problem can be worked around by running the first eight passes and the final pass of cvs2svn separately.

$ cvs2svn -p 1:8 --options=cvs2svn.options
$ cvs2svn -p 9:9 --options=cvs2svn.options
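If the conversion is driven from a script, the two runs can be wrapped as shown below. This is a sketch with made-up function names; it only reproduces the two command lines above and starts each of them in its own process, so that the memory of the first run is released before the final pass starts.

```python
import subprocess


def pass_commands(options_file="cvs2svn.options", passes=9):
    """Build the two cvs2svn command lines used in the text."""
    first = ["cvs2svn", "-p", "1:%d" % (passes - 1), "--options=" + options_file]
    final = ["cvs2svn", "-p", "%d:%d" % (passes, passes), "--options=" + options_file]
    return [first, final]


def run_in_separate_processes(commands, runner=subprocess.check_call):
    """Run each command in its own process; stop at the first failure."""
    for cmd in commands:
        runner(cmd)  # check_call raises CalledProcessError on a non-zero exit
```

Calling run_in_separate_processes(pass_commands()) in the directory that contains cvs2svn.options corresponds to the two commands above.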

It might be necessary to recreate parts of the cvs2svn.options customization file. This is the case if the list of excluded branches/tags is changed or new broken tags/branches (symbolic names) appear. The python script options.py helps with this. It requires a file named cws_done.csv to be available in the run directory. cws_done.csv is a list of all child workspaces which are considered finished and will not be migrated to subversion. Broken tags/branches are handled directly in options.py. Run the script with:

$ python options.py > options

and replace the section of cvs2svn.options marked with [option.py] with the content of options.
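To give an idea of what such a generator does, here is a sketch in Python 3. It is not options.py: the layout of cws_done.csv (CWS name in the first column) and both function names are assumptions, and the mapping from a CWS name to its CVS tag and branch names is left out, as is the cvs2svn-specific syntax that wraps each pattern in the options file.

```python
import csv
import io
import re


def read_done_cws(csv_text):
    """Return the CWS names from CSV text, assuming the name is in column 0."""
    rows = csv.reader(io.StringIO(csv_text))
    return [row[0].strip() for row in rows if row and row[0].strip()]


def exclusion_patterns(names):
    """Return one anchored, escaped regular expression per finished CWS."""
    return ["^%s$" % re.escape(name) for name in sorted(set(names))]
```

Escaping and anchoring each name makes sure that a pattern excludes exactly one symbolic name and not every name that merely contains it.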

Notes

The scripts mentioned above have been updated as of 2007/10/30, with the exception of cws_done.csv, which is as of March 2007. The migration of the code repository takes about 84:20h on a [x4200]. I tried to upgrade the conversion script from cvs2svn-1.5.1 to cvs2svn-2.0.1 but failed, because the script was so slow it never finished.

Access to test server

Subversion

The result of the conversion can be accessed with subversion via the URL svn+ssh://svn@o3-build.services.openoffice.org/svn

Example: See the latest change to the repository:

$ svn info svn+ssh://svn@o3-build.services.openoffice.org/svn


You'll need an ssh key; send me (hr) your public key if you plan to take part in the testing.

A read-only service without authentication is available via the URLs svn://o3-build.services.openoffice.org/svn and http://o3-build.services.openoffice.org/svn.

Example: list all tags in the repository:

$ svn list svn://o3-build.services.openoffice.org/svn/tags
$ svn list http://o3-build.services.openoffice.org/svn/tags

Bazaar

The o3-build server also hosts a flat import of OpenOffice 2.3.0 in a bazaar repository. Note that this repository has no history information at all so it is not comparable to the subversion repository above. As soon as we have a working import I'll replace it with a real repository.

The bazaar repository can be accessed via sftp and (read-only) via http. A smart server setup (bzr+ssh) will follow soon.

Example: lightweight checkout via sftp

$ bzr checkout --lightweight sftp://svn@o3-build.services.openoffice.org/srv/bzr/trunk my.lightweight.checkout

Example: branch via http

$ bzr branch http://o3-build.services.openoffice.org/~svn/bzr/trunk my.branch

Please note the differences in the access URLs.

Git

The o3-build server also hosts an import of OpenOffice in a git repository with almost the same amount of history as the SVN repository above.

The git repository can be accessed via the git protocol and (read-only) via http.

Example: clone via the git protocol

$ git clone git://o3-build.services.openoffice.org/git/ooo.git

Example: clone via http

$ git clone http://o3-build.services.openoffice.org/~svn/ooo.git

Please note the differences in the access URLs.

Replicate test server

The repository can be replicated with the svnsync tool. No special server side setup is necessary (read-only access is sufficient), but you need to make certain that the target repository can't be modified by other means than svnsync.

  • First create an empty target repository:
$ svnadmin create /absolute/path/to/rep
  • Implement the pre-revprop-change and start-commit hooks (both files must be executable):
$ cat /absolute/path/to/rep/hooks/pre-revprop-change
#!/bin/sh 

USER="$3"

if [ "$USER" = "syncuser" ]; then exit 0; fi

echo "Only the syncuser user may change revision properties" >&2
exit 1
$ cat /absolute/path/to/rep/hooks/start-commit
#!/bin/sh 

USER="$2"

if [ "$USER" = "syncuser" ]; then exit 0; fi

echo "Only the syncuser user may commit new revisions" >&2
exit 1
  • Initialize the target repository:
$ svnsync init file:///absolute/path/to/rep http://o3-build.services.openoffice.org/svn 
  • And finally synchronize the target repository with the source repository
$ svnsync synchronize file:///absolute/path/to/rep http://o3-build.services.openoffice.org/svn
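Installing the two hooks can also be scripted. The sketch below writes the same hook scripts as shown above and sets the executable bit, which subversion requires before it will run a hook. The function names are made up for this illustration, and the repository is expected to have been created with svnadmin create already.

```python
import os
import stat

# Same hook body as in the listing above; {arg} is the position of the
# user name in the hook's argument list.
HOOK_TEMPLATE = """#!/bin/sh

USER="${arg}"

if [ "$USER" = "{user}" ]; then exit 0; fi

echo "{message}" >&2
exit 1
"""


def write_hook(repo, name, user_arg, message, user="syncuser"):
    """Write one hook script into <repo>/hooks and make it executable."""
    path = os.path.join(repo, "hooks", name)
    with open(path, "w") as fh:
        fh.write(HOOK_TEMPLATE.format(arg=user_arg, user=user, message=message))
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path


def install_sync_hooks(repo, user="syncuser"):
    """Install both hooks; only `user` may commit or change revision properties."""
    return [
        write_hook(repo, "pre-revprop-change", 3,
                   "Only the %s user may change revision properties" % user, user),
        write_hook(repo, "start-commit", 2,
                   "Only the %s user may commit new revisions" % user, user),
    ]
```

install_sync_hooks("/absolute/path/to/rep") would be called between the svnadmin create and the svnsync init steps.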

The full details for replicating SVN repositories can be found [here].

Evaluating the SCM candidates, Metrics

Evaluating centralized vs. distributed SCM systems for their viability for hosting the OpenOffice.org source code repository isn't as straightforward as one might hope, because the workflows differ substantially between distributed and centralized SCM systems. The best approach seems to be to define typical workflows for each SCM and to evaluate them against the test repositories above; these workflows define our metrics. Since each developer group within the OpenOffice.org community has quite different needs, no single workflow will fit all. I (hr) would like to ask each developer group within the OOo community to add their expected typical workflow with each SCM below.

Sun Hamburg RE

Subversion

Workflow step     Unix, local disk (warm)   Windows (cygwin), local disk   Unix, remote volume   Windows, remote volume
checkout (2)      5m38s                     35m25s                         77m17s                157m14s
tag anchor (3)    ~1s                       ~1s                            ~1s
tag branch (4)    ~1s                       ~1s                            ~1s
switch (5)        59s                       5m03s                          3.23s
diff (7)          6s                        35s                            47s
commit (8)        48s                       3m48s                          4m21s
rebase (9)        2m33s                     4m36s                          6m29s
commit (10)       2m06s                     4m50s                          5m11
move tag (11)     ~1s                       ~1s                            ~1s
switch (12)       39s                       6m00s                          5m35s
integrate (13)    2m30s                     6m07s                          8m69s


Operation                             Unix (local disk)   Windows (local disk)
local status over whole tree (warm)   2s                  19s
local status over whole tree (cold)   33s                 na (don't know how to drop caches on windows)
log over single file                  ~1s                 ~1s
annotate over single file             ~1s                 ~1s

CWS creation, workflow:

  1. tag test milestone on basis of OOo_2_3_0 release: [not timed]
  2. check out test milestone from o3-build: [time]
  3. tag test milestone with anchor tag: [time]
  4. tag test milestone with branch tag: [time]
  5. switch to branch tag: [time]
  6. make changes on test branch (~2157 files): [not timed]
  7. diff changes: [time]
  8. commit changes on branch: [time]
  9. rebase branch to newer milestone with non-conflicting changes (another ~2157 files): [time]
  10. commit changes on branch: [time]
  11. move anchor tag to new milestone: [time]
  12. switch to trunk: [time]
  13. integrate (merge) branch into trunk: [time]

Misc. operations:

  1. status over whole tree, cold: [time]
  2. status over whole tree, warm: [time]
  3. log on single file: [time]
  4. annotate on single file: [time]

Git

Workflow step          Unix, local disk (warm)   Windows, local disk         Unix, remote volume   Windows, remote volume
clone remote (2)       24m13s                    33m07s (some lock errors)   41m98s                stops at 39% completion
clone local (3)        1m58s                     14m43s                      13m15s
create branch (4)      <1s                       <1s                         <1s
switch (5)             3s                        13s                         54s
diff (7)               3s                        10s                         35s
commit (8)             8s                        25s                         36s
pull to pristine (9)   49s                       1m16s                       1m15s
pull to working (10)   6s                        30s                         1m10s
rebase (11)            1m04s                     1m51s                       56s
push (12)              9s                        15s                         16s


Operation                             Unix (local disk)   Windows (local disk)
local status over whole tree (warm)   2s                  18s
local status over whole tree (cold)   37s                 na
log over single file                  20s                 25s
annotate over single file             8s                  23


CWS creation, workflow:

  1. tag test milestone on basis of OOo_2_3_0 release: [not timed]
  2. clone git repository from o3-build to pristine local copy: [time]
  3. clone repository to local working copy: [time]
  4. create new branch in working copy: [time]
  5. switch to branch: [time]
  6. make changes on test branch (~1000 files): [not timed]
  7. diff changes: [time]
  8. commit changes on branch: [time]
  9. pull non-conflicting changes (another ~1000 files) from upstream into pristine copy: [time]
  10. pull non-conflicting changes from pristine copy into local copy: [time]
  11. rebase branch to newer milestone with non-conflicting changes: [time]
  12. push changes to upstream: [time]

Misc. operations:

  1. status over whole tree, cold: [time]
  2. status over whole tree, warm: [time]
  3. log on single file: [time]
  4. annotate on single file: [time]
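
The [time] entries in the workflows above are wall-clock times of single commands. A small wrapper like the following can collect them in a uniform way. It is only a sketch with made-up function names, not the method used for the measurements in the tables.

```python
import subprocess
import sys
import time


def time_command(cmd, cwd=None):
    """Run `cmd` and return (exit_code, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    code = subprocess.call(cmd, cwd=cwd)
    return code, time.perf_counter() - start


def format_elapsed(seconds):
    """Format seconds in the style used in the tables, e.g. 5m38s or 59s."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return "%dm%02ds" % (minutes, secs) if minutes else "%ds" % secs


if __name__ == "__main__":
    # Harmless stand-in command; replace it with e.g. ["svn", "status"].
    code, elapsed = time_command([sys.executable, "-c", "pass"])
    print(code, format_elapsed(elapsed))
```

For the warm/cold distinction the same command is simply timed twice: once right after the file system caches have been dropped and once again directly afterwards.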

Preliminary Evaluation results

For the ESC steering committee meeting on February 18th, 2008, we (Jan Holesovsky and Jens-Heiner Rechtien) prepared a paper about the current status of the SCM evaluation. Please find it here.

Evaluating DSCM candidates

The current version control system for OpenOffice.org is going to be replaced by a Distributed Software Configuration Management (DSCM) system. An evaluation has been prepared and was presented at the March 2009 ESC meeting.

DSCM System Preferences Survey

During the ESC meeting it was suggested to consult OpenOffice.org contributors about experiences and preferences with version control systems.

The survey started 2009-03-12 and closed after 2 weeks.

Participation in the Survey

  1. Click on the link below. You will be asked to enter your name and your @openoffice.org email address. The email address is mandatory and we will ignore any submissions from any other email addresses.
  2. An email will be sent to your @openoffice.org address containing a link. Clicking on the link will take you to the survey system. Note: if you do not receive the email, please check your spam filter!
  3. When you have made your choices, the system will send you a second email to confirm that your selections have been stored.

http://surveys.services.openoffice.org/surveys/index.php?sid=52123&lang=en

SCM System Preferences Survey Thanks

Thank you for participating in the survey. The results will be taken into consideration for the final decision.

DSCM System Preferences Survey Results

149 contributors participated in the survey. 3% chose Bazaar as their system of choice, 23% Git and 49% Mercurial; 25% had no preference.
