User:LA2


LA2 is the username for Lars Aronsson, Sweden, also known from Wikipedia, Project Runeberg, and other projects.

Useful links

Diary

August 2013: Continue reading on Apertium wiki.

June 18, 2012: Göran Andersson closes DSSO and archives the existing dictionary on Sourceforge, as Anders Wallenquist wrote on Google+.

June 2011: Project Runeberg starts to digitize and publish more recent dictionaries based on the assumption that they are not covered by copyright for 70 years, but only by catalog/database rights for 15 years after publication. The first of this kind is an English-Swedish Dictionary (1948), soon followed by Spansk-svensk ordbok (1978) and many others.

June 2011: OpenOffice.org moves from Oracle to the incubator of the Apache Foundation.

April 2011: Oracle announces the discontinuation of Oracle OpenOffice (= StarOffice) and its intention to make OpenOffice.org a "purely community-based project".

September 28, 2010: The Document Foundation is founded, in opposition to Oracle, intending to fork the OpenOffice.org software and community and create LibreOffice. The Norwegian foundation "Åpne kontorprogram på norsk" (incorporated in 2004) decides to support LibreOffice. Their spokesperson is Karl Morten Ramberg. It seems the Swedish community coordinator Per Eriksson is also active within TDF and LibreOffice.

August 2010: Former coordinator of the Swedish OOo community Lars Nooden is no longer active here.

January 27, 2010: Sun Microsystems is acquired by Oracle Corporation.

November 11, 2007: Niklas Johansson (in Norrköping) says on the Swedish dev list that he's working on a free Swedish thesaurus, En synonymordbok.

August 5, 2007: Karl Jonsson's Weblog discovers that the Swedish spell checker in OpenOffice.org (version not mentioned) is pretty bad.

June 17, 2007: As sv.openoffice coordinator Lars Noodén passes through my home town Linköping, I meet up with him and two local open source advocates, Ralf Andersson and Inge Wallin (KDE/Koffice developer). We have lunch, coffee and a brief walk around town before his train leaves.

May 7, 2007: The article by Kalevi Kilkki, "A practical model for analyzing long tails", in First Monday magazine provides exactly the same kind of coverage-popularity diagrams, but with a better mathematical model than used in the paper by Géza Németh and Csaba Zainkó (see December 20, 2006, below).

March 23, 2007: In IDG's Computer Sweden, Anders Lotsson writes a column about En massa sammansatta ord (lots of compound words), providing examples of hard-to-explain Swedish compounds with "massa" meaning either "large quantity" (mass- or -masse-): kroppsmasseindex, massmedier, folkmassor, folkmassefobi, massvariationer, massuggestion, etermassmedier; or "substance" (massa- or -masse-): pappersmassa, massafabrik, mandelmassa, massaved, pappersmassefabrik. The column also mentions kyrkogårdar, kyrkofullmäktige, kyrkvaktmästare, kyrkkaffe, kvinnfolk.

March 19, 2007: Still waiting for issue 62268.

February 6, 2007: In a chapter by Henning Spang-Hanssen, "Den retskrivningsmæssige udvikling i Danmark siden det nordiske retskrivningsmøde i Stockholm 1869" in Språk i Norden 1970 (1970), the following statistics are presented: In Danish text of the 1850s or 1800s, for every 100 words, some 30 edits are necessary to make the orthography fully modern.

Year of spelling reform   Edit                                        Number of edits per 100 words
1889-1892                 Change gj/kj/ii/uu/x to g/k/i/u/ks, etc.    10
1948                      Remove capitalization of nouns              15
1948                      Change aa to å                               5
Total                                                                 30


February 1, 2007: After the IETF in RFC 3066 (January 2001) devised a best current practice for language codes for use in Internet standards and protocols, there was a need for more codes. In particular, Germans wanted codes for their language before and after the 1996 spelling reform. For some time, the Internet Assigned Numbers Authority (IANA) maintained a list of additional language tags, but this has been incorporated into the new series of three RFCs: 4645, Initial Language Subtag Registry; 4646, Tags for Identifying Languages; and 4647, Matching of Language Tags (September 2006). In addition to these rules, there is a new registry, operated by IANA. Here, de-1901 is the traditional German spelling (daß, illustrierte, Schiffahrt, Tier) and de-1996 is the new German spelling (dass, illustrierte, Schifffahrt, Tier). No other languages have codes with regard to spelling reform. And there is as yet no code for German before 1901 (daß, illustrirte, Schiffahrt, Thier).
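To make the matching of such tags concrete, here is a minimal Python sketch of the "basic filtering" rule from RFC 4647: a language range matches a tag if the two are equal (ignoring case) or if the tag starts with the range followed by a hyphen. The function name basic_filter and the sample tag list are my own choices for illustration, not taken from the RFC or from any library.

```python
def basic_filter(language_range: str, tag: str) -> bool:
    """Return True if `language_range` matches `tag` under RFC 4647 basic filtering."""
    r = language_range.lower()
    t = tag.lower()
    if r == "*":
        # The wildcard range matches every tag.
        return True
    # Either an exact match, or the range is a prefix that ends at a subtag boundary.
    return t == r or t.startswith(r + "-")


# Sample tags: de-1901 and de-1996 are registered, the list itself is just an example.
tags = ["de", "de-1901", "de-1996", "sv"]
print([t for t in tags if basic_filter("de", t)])       # all three German tags
print([t for t in tags if basic_filter("de-1996", t)])  # only the reformed spelling
```

A document tagged de-1996 is thus still found by a reader who asks for plain "de", while a request for "de-1996" does not pick up text tagged only "de".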

I think it could make sense to propose the following language codes:

Code Used for Samples
da-1775 Danish orthography before the 1892 reform Kjøbenhavn, sexten
Dansk biografisk Lexikon, 1st ed.
da-1892 Danish spelling reforms of 1889-1892.
Plural verbs (ere, bleve) became optional in 1900 and are
almost completely absent from literature that follows this spelling
København, seksten, Maade, kunde, skulde, vilde
Salmonsens Konversationsleksikon, 2nd ed.
da-1948 Modern day Danish, reform of 1948 måde, kunne, skulle, ville
sv-1801 Orthography of Carl Gustaf af Leopold elf, godt, jern, qvacksalfvare
blefvo, gingo, åto, ega, äro
Nordisk familjebok, 1st ed.
sv-1889 SAOL, 6th ed. godt, järn, kvacksalfvare, älf
blefvo, gingo, åto, äga, äro
Nordisk familjebok, 2nd ed.
sv-1906 Modern day Swedish, spelling reform of Fridtjuv Berg.
Plural verbs (äro, blevo) become optional around 1940
and are completely absent around 1970
gott, järn, kvacksalvare, älv
blevo, gingo, åto, äga, äro
Nils Holgerssons underbara resa genom Sverige
SAOL, 8th ed.

For Norwegian, the situation is a lot more complex and I need to learn more before I can propose something like this:

Code Used for Samples
nb-1862 First uniquely Norwegian (non-Danish) instruction on orthography
nb-1907 First official norm for riksmål mænd, ryg, hesterne
nb-1917 Reform introduces letter å, changes many æ to e, removes r from plurals menn, rygg, hestene
nb-1982 Final adjustment of the 1938 reform.
Is this really different from nb-1917?
nn-1853 Ivar Aasen's landsmål Vin, Dyr, Sjo, kastade-kastat
nn-1901 Norwegian education ministry's norm for landsmål ven, dør, sjø, kasta, hestarne
nn-1917 Reform introduces letter å, removes r from plurals hestane
nn-1938 Reform


January 28, 2007: During the weekend I'm trying to figure out if there is any algorithm that can determine the language of a text. There are several approaches, such as comparing the most common words or counting bigrams and trigrams. In Perl, there is a CPAN module called Lingua::Ident. It seems to work fine for telling English apart from Spanish, but it is a whole different problem to separate Norwegian bokmål from nynorsk. Or to tell Swedish modern spelling apart from old spelling. Just from trigram analysis, Swedish and Norwegian are very similar. However, a spell checker will find the differences. Run a text through a spell checker for modern Swedish (or Danish or Norwegian) spelling, and all the words with old spelling come out.
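The trigram approach mentioned above can be sketched in a few lines of Python. This is not how Lingua::Ident works internally (I have not checked); it is only a toy version that compares character trigram counts by cosine similarity. The function names and the two tiny reference texts are invented for the example, and real profiles would need far larger samples.

```python
from collections import Counter
import math


def trigram_profile(text: str) -> Counter:
    """Count character trigrams, after lowercasing and collapsing whitespace."""
    cleaned = " ".join(text.lower().split())
    padded = f" {cleaned} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two trigram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def guess_language(sample: str, references: dict) -> str:
    """Pick the reference profile most similar to the sample."""
    profile = trigram_profile(sample)
    return max(references, key=lambda lang: cosine_similarity(profile, references[lang]))


# Toy reference texts, invented for this example only.
references = {
    "en": trigram_profile("the quick brown fox jumps over the lazy dog and the cat sat on the mat"),
    "sv": trigram_profile("och det är en katt som sitter på mattan i huset med hunden"),
}
print(guess_language("the dog sat on the mat", references))
```

With closely related languages or spelling variants the profiles overlap heavily, which is why a spell checker pass is the more telling test there.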

Last week (see Jan. 22 below) I released word frequency statistics for old Norwegian texts. I have now completely mapped all texts in Project Runeberg to language and year and started to look closer at Danish. There have been two major Danish spelling reforms in 1892 and 1948, as described in the timeline below. The following table shows how some large text bodies in Project Runeberg relate to these dates:

Works of Danish literature in Project Runeberg. Columns: work, years, volumes, pages, words, vocabulary, then occurrences of foer, fór, skjøn..., skøn..., ere, bleve, vox.../vex..., voks.../veks..., and finally a comment.
Salmonsens konversationsleksikon (*scanned so far, out of 26) 1915-1930 9* 9173 7,397,317 539,286 - - 6 2499 39 1 8 3587
Dagligt Liv i Norden i det sekstende Aarhundrede 1914-1915 14 3817 1,197,865 83,248 11 10 5 582 198 17 - 509
Gustav Wied : Mindeudgave 1920 8 3414 897,195 70,043 - 85 9 224 23 - 1 193
Nordisk illustreret Havebrugsleksikon 1920-1921 2 1130 846,839 71,720 - - 1 211 6 - 9 1275
Historiske Afhandlinger af A. D. Jørgensen 1898-1899 4 1864 587,235 57,421 - - 4 281 82 1 98 15 Also uses "å", non-capitalized nouns
Georg Brandes Levned 1905-1908 3 1212 331,587 39,455 15 - 3 377 3 - - 90
Pre-1892 spelling
Dansk biografisk Lexikon 1887-1905 17 12036 4,388,789 168,177 - 45 2192 6 2286 1579 901 4
Tidsskrift for Physik og Chemi 1871-1878 8 3068 832,077 78,905 1 - 239 1 2496 476 370 -
Illustreret dansk Litteraturhistorie 1902 3 2460 726,421 79,768 25 - 738 2 1021 112 197 -
Illustreret dansk Literaturhistorie. Danske Digtere i det 19de Aarhundrede 1907 1 814 328,631 42,407 6 - 386 26 91 20 124 3

The most visible sign of the 1892 spelling reform was the dropping of the silent j after g and k. This is shown here in the "skjø" and "skø" occurrences. Also, letters c/qv/x/z were changed to s/kv/ks/s in many words, as shown here in the change from vox/vex... to voks/veks...

Plural verbs (ere, bleve) were made optional in 1900, and this reform largely coincides with the 1892 spelling reform.

Dropping the silent e after long vowels is a less clear sign. The word "for", being a preposition with the same meaning as in English (Swedish "för"; German "für"), is one of the 20 most common words in Danish (ranking 11 through 19 in the texts above). However, it is also the imperfect of the verb "at fare" (to fare, to travel, to go, to leave; German "fuhr", Swedish "for"). In this capacity, it has a longer o sound (just like German "fuhr" and Swedish "for") and has historically been spelled "foer", then "fór" and in modern Danish just "for". As can be seen above, the occurrence of these older forms is one distinct feature of the spelling in the period 1880-1930. Adding to this complexity, "foer" can also be the spelling of another word (lining, the inner cloth of a jacket; Swedish "foder"). An unambiguous case is erfoer/erfór/erfor (experienced, learned; German "erfuhr"), but this is far too uncommon to be useful for statistics.

Salmonsens encyclopedia seems to be useful as a reference, not only because it is the largest body, but also since it consistently sticks to the 1892 reform.

Of the total 8.3 million words in Salmonsen+Wied, there are 576K unique words, including some OCR errors. Here is the coverage distribution:

Corpus coverage %    Required number of word forms    Comment
3.38 1 og
8.62 3 i, af
16.81 10 en, til, er, den, at, de, med
28.28 30 der, som, det, for, paa, ved, han, et, -, var, har, sig, fra, ikke, men, blev, B., A., om, e
39.42 100 ... første (7648 occurrences)
49.91 300 ... smaa (2499 occurrences)
60.87 1000 ... hvorpaa (712 occurrences)
70.11 3000 ... Venstre (231 occurrences)
79.62 10000 ... Tiltrækning (60 occurrences)
86.73 30000 ... diplomatique (16 occurrences)
92.64 100000 ... Wanderjahre (3 occurrences)
96.66 300000
100.00 576534
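A coverage table like the one above can be computed from a plain word list. The following Python sketch sorts the word forms by frequency and reports the cumulative share of the corpus covered by the top N forms. The function name coverage_table and the ten-word toy corpus are made up for the example; the real input would be the tokenized corpus.

```python
from collections import Counter


def coverage_table(words, ranks):
    """Return (rank, cumulative coverage %) for each requested rank.

    `words` is the corpus as a sequence of word tokens; `ranks` lists the
    numbers of most frequent word forms to report on.
    """
    counts = Counter(words)
    total = sum(counts.values())
    freqs = sorted(counts.values(), reverse=True)
    wanted = set(ranks)
    table = []
    running = 0
    for rank, freq in enumerate(freqs, start=1):
        running += freq
        if rank in wanted:
            table.append((rank, 100.0 * running / total))
    return table


# Toy corpus of ten tokens and four word forms, for illustration only.
corpus = "a a a a b b b c c d".split()
for rank, pct in coverage_table(corpus, [1, 2, 4]):
    print(rank, pct)
```

Ranks beyond the vocabulary size are silently skipped, so asking for 1, 3, 10, 30, ... on a small sample simply yields a shorter table.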


January 24, 2007: What about translation dictionaries? Could that be a new component for OpenOffice? What's available and how are they used? Two command line applications for English-German are leo and translate, both available as Ubuntu packages. Below is a comparison screenshot of the two GUI applications OpenDict (left, using FreeDict dictionaries) and Ding (right), both showing a lookup of the word "fly" in the English-German translation dictionary. In this particular comparison, Ding wins on a number of points:

  • In Ding you don't have to click to see the different "fly" words, only scroll.
  • Ding shows word classes (adj.), gender of nouns (Fliege f.), and the inflection of verbs (flew, flown).

LA2-dictfly.png

January 22, 2007: As an experiment, I publish some Norwegian word frequency lists by year 1880-1935 based on Project Runeberg's texts.

January 18, 2007: Aspell has some very annoying limitations in that colons and digits cannot be parts of words. How should I handle Swedish words such as Maj:ts and p2p-överföring? I have tried to send my questions to hunspell-devel, but does anybody read that list? My postings there: colon in WORDCHARS (Jan. 4) and digits in words (today).

January 14, 2007: The Finns are running their own software project for spell and grammar checking, Voikko. Fortunately for the rest of us, a description of their architecture is available in English. For more details on the project, see Harri Pitkänen's Hunspell-fi in Kesäkoodi 2006: Final Report (PDF, 14 pages).

January 7, 2007: A self-appointed committee, named "Stavekontrolden", for the improvement of the Danish spell checker holds a constituting assembly in Odense, as Finn Gruwier Larsen reports on the "dansk" mailing list. Chairman is Esben Aaberg. There is already a website at www.stavekontrolden.dk.

January 6, 2007: Wikipedia history diff as a revision corpus, summary of experiments by Marcin Miłkowski, after we discussed this on the dev@lingucomponent mailing list. "In short, it seems that Lars' idea was brilliant".

December 30, 2006: Here's an experiment in coverage. One very classical Swedish text is Nils Holgerssons underbara resa genom Sverige, by Nobel Prize winner Selma Lagerlöf, at the same time a geography textbook and a novel. The text at Project Runeberg is in modern spelling (post 1906), but uses plural forms of verbs (pre 1970). The text with HTML markup removed contains 198148 words or 1091951 characters. I'm assuming the spelling is all correct. Passing this through aspell with the 2003 Swedish dictionary (aspell -l sv list) outputs 6741 words as not recognized or 3.4 percent of all words in the text body. This means the Aspell dictionary covers 96.6 percent of the text, which actually isn't so bad. Further filtering through my own dictionary of old spellings (including plural verbs) removes another 1842 words or 0.93 percent, increasing the coverage to 97.5 percent. If instead I change to my own main dictionary, coverage increases only marginally. Apparently, the improvements I feel I have made don't really matter if you are spell checking Nils Holgersson.

 $ cat k*.html | wc -w
 198148
 $ cat k*.html | aspell -l sv list | wc -w
 6741

The 4883 words still unrecognized by the combination of my own dictionaries (at 97.5 % coverage) are 3670 unique word forms, of which 2991 appear only once, 433 appear twice, 129 appear three times and 117 appear four or more times. If I added these 117 word forms to my dictionary, that would cover another 639 words or 0.32 percent of this text, pushing the coverage to 97.8 percent.

It turns out some of those words shouldn't be added to a dictionary because they are names of fictional characters that only appear in this book. A select few spelling errors are also found and will be corrected. Of the unrecognized words, many are minor variations (in case and punctuation) that are covered by just adding one word to the dictionary. After some work, my coverage is up to 98.49 percent, leaving 2991 words unrecognized, being 2674 unique word forms of which 2509 appear only once and 111 appear twice.
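The arithmetic in this experiment is easy to redo in Python. The sketch below computes the coverage from a total word count and a count of unrecognized words, and summarizes a list of unrecognized words by how often each form occurs. The function names and the threshold parameter are my own, and the short word list in the usage example is invented; only the figures 198148, 6741 and 2991 come from the experiment above.

```python
from collections import Counter


def coverage_percent(total_words: int, unrecognized_words: int) -> float:
    """Share of the text (in percent) that the dictionary recognizes."""
    return 100.0 * (total_words - unrecognized_words) / total_words


def remainder_summary(unrecognized, threshold=4):
    """Summarize the spell checker's output list.

    Returns the number of unique word forms, a histogram of forms by number
    of occurrences (with everything at or above `threshold` lumped together),
    and the number of tokens that would be covered by adding the forms that
    occur at least `threshold` times.
    """
    counts = Counter(unrecognized)
    histogram = Counter(min(n, threshold) for n in counts.values())
    gain = sum(n for n in counts.values() if n >= threshold)
    return len(counts), dict(histogram), gain


print(round(coverage_percent(198148, 6741), 1))   # Aspell 2003 dictionary alone
print(round(coverage_percent(198148, 2991), 2))   # after the cleanup described above

# Invented miniature example of an unrecognized-word list.
print(remainder_summary(["a", "a", "a", "a", "b", "c", "c"]))
```

The input to remainder_summary would be the output of the aspell list command, one word per line.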

December 21, 2006: Apparently the Swedish spell checker in Microsoft Word 6.0 accepts the following misspelled words from my test page: andledning, andvänd, andvändning, bakrund, ballett, diskusanalys, finlandsvensk, finness, fiskeläger, följetång, företeckning, förmögenhetskatt, hårddraget, innerbär, jämnlik, kolrot, Lindköping, lösensumma, majonäs, model, modellbetäckning, parantes, situationstecken, stadsbesök, stadschef, terass, trilogi, vädersträck, överrens

And Microsoft Word 2003 accepts these errors: andledning, alvarlig, andvändare, ballett, Ceasar, diskusanalys, europisk, fiskeläger, frisörsalong, följetång, företeckning, förmögenhetskatt, grejor, hårddraget, interesse, krigsföring, landsbyggd, Lindköping, lösensumma, mediespelare, modellbetäckning, parantes, San Fransisco, sattelit, situationstecken, stadsbesök, stadschef, Stockolm, Storbrittanninen, tabblett, tipps, utryck, våldtäckt, vädersträck, ytterliggare, åldersbestigna, överrens.

December 20, 2006: As I observed the other day, writely.com automatically understands which language I'm using, and applies the right spelling dictionary. Apparently, Microsoft Word has a similar feature. Why have I never seen this feature in free software for spell checking? A quick web search indicates that I'm not the first to ask this question. Suggested solutions include trying an ad hoc list of common words, prefixes and suffixes from each language or sampling trigrams. There is also an attempt at Bayesian language detection. Nothing indicates that the creators of these three approaches are familiar with Zipf's law. Looking for "you" and "me" is not optimal, when in fact "the" is known to be the most common word in English texts. The Bayesian filter approach probably comes closest to optimum, but in an overly complicated way.

If we were to look for a single word from each language, that would be the most common word, which is "the" for English, "der" for German and "och" for Swedish. Since 7% of all English words are "the" we would need on average 14 words to find one occurrence of "the" (1/0.07 is about 14). In Swedish, which mostly marks definiteness with a suffix rather than a separate article (the, der, le, los), the most common word is "och" (and), which makes up 3.8 % of all words, so we'd need a sample of 26 words to expect one occurrence of "och".

In order to determine the language from a shorter sample, we could throw in some more words from each language, from the top of the frequency listing. The 10 most frequent words in Swedish (och, i, att, en, av, som, den, till, med, på) together account for 14% of the words in any text corpus. With the next ten (det, för, de, han, är, ett, sig, så, jag, var), the top 20 words account for 21%, the top 50 words account for 29%, and the top 100 words account for 36%. Some of these words also appear in English but aren't so frequent there, so they are better predictors for Swedish than for English. For each language, a weighted prediction score can be computed. Finding an "I" adds more to the Swedish score than to the English. These top ranking word frequencies haven't changed much since Shakespeare, so a C program should be able to keep a static table of the weights for 100 words in source code. The downloadable text of Wikipedia and the written OpenOffice.org documentation could be used as a corpus for sampling the weights.
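Here is a rough Python sketch of the weighted score idea, just to show how small such a detector could be. Only the weights for "the" (7%) and "och" (3.8%) come from the figures above; every other number in the table is a made-up placeholder that would have to be sampled from a real corpus. The names WEIGHTS, language_scores, guess_language and expected_sample_size are mine.

```python
# Relative word frequencies per language (fraction of all tokens).
# Only "the" and "och" are taken from the text; the rest are placeholders.
WEIGHTS = {
    "en": {"the": 0.070, "of": 0.030, "and": 0.030, "to": 0.025, "i": 0.005},
    "sv": {"och": 0.038, "i": 0.020, "att": 0.020, "en": 0.015, "av": 0.012},
}


def language_scores(text: str) -> dict:
    """Sum the weights of all known words in the sample, per language."""
    tokens = text.lower().split()  # naive tokenizer, punctuation is not stripped
    return {
        lang: sum(table.get(tok, 0.0) for tok in tokens)
        for lang, table in WEIGHTS.items()
    }


def guess_language(text: str):
    """Return the best scoring language, or None if no known word was seen."""
    scores = language_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def expected_sample_size(frequency: float) -> int:
    """Average number of words needed to see one occurrence of a word."""
    return round(1 / frequency)


print(expected_sample_size(0.07), expected_sample_size(0.038))
print(guess_language("jag bor i en stad och arbetar"))
print(guess_language("the cat and the dog"))
```

A static table like WEIGHTS is exactly what could be compiled into a C program, with about 100 entries per language.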

The paper by Géza Németh and Csaba Zainkó, Multilingual Statistical Text Analysis, Zipf's Law and Hungarian Speech Generation explains that to cover 97.5% of an English corpus, you need a dictionary of 20,000 unique word forms, but to achieve the same coverage in German you need 80,000 word forms and for Hungarian 400,000 word forms. This is an important observation, as it sets a goal for how large spell checker dictionaries need to be. Presumably, Finnish and Estonian have the same characteristics as Hungarian, while the Scandinavian languages (Swedish, Danish, Norwegian) take a middle position between that and German. Our current dictionaries contain many specialized words that aren't very common. These words don't contribute so much to the dictionary's footprint or coverage of a typical corpus as they should do. For example, for each word we add in its basic form, we also try to include every other form of that word, even though those forms aren't very common. What we need for German are the 80,000 most common word forms, but when we add the really common word Haus, we also tend to add the less common Hauses, Häuser and Häusern. Perhaps we should aim for 80,000 basic forms (this is very close to what Björn Jacke has, see below) instead of 80,000 variations (he has 300,000 variations).

I made some tests on a corpus of 17.99 million Swedish words from proofread texts in Project Runeberg. Many of these words use old spelling (and describe old concepts) and won't be found in modern dictionaries, so this is not a perfect test case for contemporary spell checking dictionaries. When I run this corpus through my own dictionaries, which do contain some words in old spelling, it leaves a remainder of 1.482 million words or 8.2% of the corpus, meaning that I now have 91.8% coverage of this corpus. If I combine my old spelling component with the existing Aspell dictionary (Göran Andersson's from 2003), it leaves a remainder of 1.837 million words or 10.4% of the corpus, meaning a 89.6% coverage. So my progress above Göran's dictionary is indeed very small. This coverage around 90% can be achieved for German with a dictionary of the 20,000 most common word forms, which can be compared to the 24,000 basic forms in Göran Andersson's 2003 dictionary. Even though my dictionary has many additional word forms, their contribution to the corpus coverage isn't very large.

Corpus coverage %    Required number of word forms    Comment
3.44 1 och
5 2 i
10 6 att, en, av, som
15 10 den, till, med, på
20 17 det, för, de, han, är, ett, sig
25 30
30 53
35 90
40 151
45 260
50 451
55 795
60 1387
65 2415
70 4227
75 7452
80 13606
The long tail
Corpus coverage %    Required number of word forms
85 26544
86 30731
87 35800
88 42026
89 49735
90 59402
91 71767
92 87837
93 109094
94 137919
95 178319
96 234459
97 320515
98 453458
99 633358
100 812979

December 19, 2006: On the dev@lingucomponent list, Kevin Scannell discusses how to use precision and recall metrics for spell checkers.
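As a reminder to myself of what the two metrics mean, here is a small Python sketch using the usual information retrieval definitions: precision is the share of flagged words that really are errors, and recall is the share of real errors that get flagged. Whether this matches the exact definitions used in the list discussion is an assumption on my part. The function name and the tiny word sets are only for illustration.

```python
def precision_recall(flagged, true_errors):
    """Precision and recall of a spell checker on one test text.

    `flagged` holds the words the checker marks as misspelled,
    `true_errors` the words that actually are misspelled.
    """
    flagged, true_errors = set(flagged), set(true_errors)
    hits = flagged & true_errors
    precision = len(hits) / len(flagged) if flagged else 1.0
    recall = len(hits) / len(true_errors) if true_errors else 1.0
    return precision, recall


# Illustrative sets: "Linköping" is a correct word flagged by mistake,
# while "parantes" and "överrens" are real errors that slip through.
flagged = {"andledning", "ballett", "Linköping"}
true_errors = {"andledning", "ballett", "parantes", "överrens"}
print(precision_recall(flagged, true_errors))
```

A dictionary that accepts too much (for example through free compounding) loses recall, while a dictionary that is too small loses precision.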

December 18, 2006: I update the Nordic Words page and publish a version of my own Swedish dictionary (ss100.txt) with 221,599 words. I also do a little survey of the spell checking support in various programs:

Google's writely.com is a web word processor. It has a built-in spell checker that automatically recognizes the language. Its Swedish spell checker behaves exactly like OpenOffice.org 2.0.4, which indicates the same Ispell/Aspell/Myspell/Hunspell dictionary is used for Swedish (Göran Andersson's dictionary from 2003). When I pasted the words from my test page, there were so many errors that the spell checker automatically shut down and I had to turn it back on manually.

The Opera web browser (I tried version 9.10) has a built-in spell checker for web forms. The user interface is a bit old-fashioned, in that it doesn't underline the errors, but uses a dialog window that steps through the web form. On Apple's Mac OS X it uses the system's built-in spell checker, but on all other platforms it requires the user to install GNU Aspell.

The word processor Abiword has built-in spell checking support. The user interface underlines any errors. The Swedish spell checking is apparently based on Göran Andersson's 2003 dictionary, although I cannot find out which software it uses (GNU Aspell, ispell or Myspell).

KDE's editors kate and kwrite have built-in spell checking support, apparently based on GNU Aspell. The user interface doesn't underline errors, but provides a dialog window that steps through the text one word at a time.

Note to self: I should take a closer look at Freedict.de. Where do these dictionaries really come from? Are they maintained?

A look at the German ispell dictionaries by Björn Jacke:

          Occurrences            Affix
Nov. 2005  Feb. 2003  Nov. 1999  flag   Usage
 -------    -------    -------   ----   ----------------------
  79681      81191      75748           Basic forms
 307261     308860     294897           Unique words
 -------    -------    -------   ----   ----------------------
  13755      13933      11257     /S    Genitive -s
  11815      11723      11397     /A    Adjective inflexion
   8848       9319      10070     /P    Plural -en
   8166       8374       8048     /N    Plural -n
   7367       7346       7004     /D    Participle -d
   6837       6828       6595     /I    Regular verbs, present tense
   6620       6611       6358     /X    Regular verbs, present tense
   5310       5303       5140     /Y    Regular verbs, past tense
   4315       4578       4525     /T    Genitive -es
   4189       4406       4118     /E    Plural -e
   2066       2061       1991     /O    Participle inflexion
   1999       1971         82     /J    -ung and inflexions
   1846       1840       1813     /C    Adjective comparison
   1580       1656       1636     /p    Irregular plurals
   1452       1047        831     /F    -in and inflexions
    721        719        615     /Z    Non-regular verbs, past tense
    672        665        619     /U    Prefix un-
    619        620        615     /V    Prefix ver-
    574        569        486     /B    -bar and inflexions
    497        492        138     /W    Imperatives
    289        310        280     /R    Plural -er
    235        251        250     /Q    Plural -sse
    206        206        208     /G    Prefix ge-
     64         68         65     /q    Plural -sse, special case for feminines
     57         56         44     /M    -chen and inflexions
     20         21         22     /f    Words ending in -ph can also have -f
     18         17         19     /L    -lich and inflexions
      4          4          4     /H    -heit and inflexions
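Counts like the ones in this table can be produced with a short script. The Python sketch below assumes a dictionary file with one basic form per line, optionally followed by a slash and single-character affix flags (written either as word/AB or word/A/B); I have not verified that this covers every variant of the ispell format. The function name and the four sample entries are invented for the example and are not taken from Björn Jacke's dictionary.

```python
from collections import Counter


def count_affix_flags(lines):
    """Count basic forms and the number of entries carrying each affix flag.

    Assumes one entry per line in the form `word` or `word/FLAGS`, where
    every flag is a single character.
    """
    basic_forms = 0
    flags = Counter()
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        basic_forms += 1
        _word, _, flag_part = entry.partition("/")
        # Drop any further slashes so that word/A/B counts the same as word/AB.
        flags.update(ch for ch in flag_part if ch != "/")
    return basic_forms, flags


# Invented sample entries, for illustration only.
sample = ["Haus/Tp", "Auto/S", "klein/AC", "Tier/SE", ""]
basic_forms, flags = count_affix_flags(sample)
print(basic_forms)
for flag, n in flags.most_common():
    print(f"{n:7d}  /{flag}")
```

Counting the expanded "unique words" is a different matter, since that requires applying the rules in the affix file to each entry.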

December 15, 2006: Two Danish OpenOffice developers meet with CST, Center for Sprogteknologi, a commercial provider of Danish dictionaries, to discuss how to improve the Danish spell checking dictionary for OpenOffice. Brief report on the 'dansk' mailing list. To me it seems unlikely that any useful solution would be found this way.

December 11, 2006: Version 2.0.4 of OpenOffice.org has auto corrections (AutoKorrigeringar) for Swedish, based on a static list of about 100 word pairs, e.g. HJE -> hej, MEDECIN -> medicin. Where do they come from? They're not part of the spelling dictionary. There are also word pairs for Danish and German (both have longer lists), but none for Norwegian.

Firefox 2.0 offers spell checking for web forms (e.g. wiki editing). There is a Swedish spelling dictionary by Hasse Wallanger, based on the Swedish Myspell dictionary of August 14, 2003 ("baserad på den svenska ordlistan från 20030814 för Myspell"). It behaves a little differently than the Swedish spell checker in OpenOffice 2.0.4, in that it allows free concatenation of words. It also only spell checks an initial fraction of a web form. In bug 360434 this is explained. Type about:config in the URL field and look for the variable extensions.spellcheck.inline.max-misspellings which defaults to 500. Double click on this value and change it to a much higher value, e.g. 15000.

December 7, 2006: I think we need a test case for the Swedish spell checking, that is separate from the development of the dictionary. As a pilot test, I'm starting a subpage /Test av stavningskontrollen. Göran Andersson publishes version 1.22 of DSSO.

December 2, 2006: I sign up for various OpenOffice mailing lists, and this wiki. What takes me here is the poor spell checking support for Swedish in OpenOffice 2.0.2. The spelling dictionary is version 1.3.8 from sv.speling.org, which hasn't been updated since March 2002. It only contains 24490 words (basic forms), some of which are misspelled. The myspell affix file seems to have been automatically converted from the ispell affix file.

Timeline of Scandinavian orthography

  • November 25, 2006: Göran Andersson publishes version 1.19 of DSSO. Version 1.21 follows on December 1.
  • 2006: The Swedish Academy publishes the 13th edition of SAOL.
  • 2005: Volume 34 of SAOB ends at Tojs. The full work is expected to be completed in 2017.
  • 2005: Spelling reform in bokmål. Some forms from riksmål are introduced: frem.
  • January 2005: Project Runeberg's OCR spelling dictionaries for Swedish and Danish are published within Nordic Words.
  • April 2004: Public editing of susning.nu is closed. The user community migrates to the Swedish Wikipedia.
  • March 6, 2003: My posting Svensk ordlista on the SSLUG-LOCALE mailing list.
  • February 2003: On the Swedish web forum Gnuheter, I ask around for a business case for a Swedish dictionary (Affärsmodeller och fritt innehåll) without getting any useful answers.
  • 2003: Göran Andersson takes back control of the Swedish spelling dictionary, now dsso.se, dissatisfied with some modifications made to it during the time it was at sv.speling.org.
  • May 6, 2002: I join sslug-locale mailing list for speling.org.
  • 2002-2003: I digitize two editions (58 volumes) of the classic Swedish encyclopedia Nordisk familjebok (1876-1926). This is more food for word frequencies and spelling dictionaries.
  • October 2001: I start susning.nu, a Swedish wiki, which grows very fast. As a spinoff I return to computing word frequencies and compiling my own spelling dictionary.
  • January 29, 1998: Göran Andersson hands over his Swedish ispell dictionary (now version 1.2.1) to sv.speling.org
  • September 26, 1997: Göran Andersson's ispell dictionary version 1.2 accepts compound words. The list has 24082 basic forms, expanding to 117617 unique words.
  • February 23, 1997: Göran Andersson's ispell dictionary version 1.1 has 24722 basic forms, expanding to 84740 unique words.
  • January 15, 1997: Göran Andersson's ispell dictionary version 1.0 has 27737 basic forms, expanding to 76364 unique words. The brand new affix file is based on inspiration from a Danish affix file.
  • November 1996: Within Project Runeberg, the subproject "Nordic Words" is started, maintained by Anders Brun. No updates are made after December 1997.
  • 1993: The Swedish Academy introduces computers in editing SAOB.
  • December 1992: I start Project Runeberg, the Scandinavian e-text archive
  • 1991-1993: I experiment with spelling dictionaries for spell and ispell.
  • 1986: Spelling reform in riksmål. Some words from bokmål are introduced: Etter, språk, nå.
  • 1970s-1980s: A Swedish morphological spellchecker "stava" is developed at FOA/QZ in Stockholm. Traces of this might be available at KTH. Viggo Kann would know. Several later Swedish spell checkers with the same name exist. Various dictionaries float around. Linguists have access to proprietary lists for research purposes, and are not interested in creating "open content".
  • June 1, 1981: Norwegian parliament adopts a proposal from Norsk språkråd (January 1979) to once again allow in bokmål many of the forms that were banned in 1938. Female gender inflections become optional. The new rules are introduced in schools during 1982.
  • 1979: Of Norwegian children 16.4 % receive school education in nynorsk.
  • 1972: Norsk språkråd (Norwegian language council) replaces Norsk språknemnd. A paragraph on uniting the two languages is dropped from the mission statement. The new council includes representatives from the two protest organizations Riksmålsforbundet and Foreldreaksjonen mot samnorsk.
  • 1970: Major Swedish newspapers abandon plural forms of verbs.
  • 1968: The polite use of "Ni" (You/Sie) is replaced with simple "du" (you/du), making Swedish conversation as simple as English.
  • 1960s: A young computational linguist Sture Allén uses paper tape from newspaper typesetters to compute word frequencies of the Swedish language. Laying the foundation for the Språkdata department at Gothenburg University, he later becomes secretary of the Swedish Academy.
  • 1959: Norsk språknemnd (Norwegian language committee), formed by the government in 1951, publishes Ny læreboknormal 1959, that relaxes parts of the 1938 reform.
  • 1959: Friends of further reform and unification of Norwegian language form an association, Landslaget for språklig samling.
  • February 8, 1955: Danish ministry of education recommends that the letter å be sorted at the end of the alphabet, after æ, ø. There was no recommendation on this when the letter å was introduced in 1948.
  • 1955: Dansk Sprognævn (Danish language committee) is formed by the ministry of education. It continues the publication of the school dictionary. The first orthography textbook after the 1948 reform is published the same year.
  • 1952: The Norwegian association "Riksmålsforbundet" protests against further reform and publishes their own dictionary, reinstating many words that were removed from bokmål in the 1938 reform.
  • 1951: Norwegian parliament unanimously decides to change counting words from the German/Danish pattern (tre-og-femti, three-and-fifty) to English/Swedish (femti-tre, fifty-three). In 1970 a poll shows that 70% of the population agree this was a good reform, but only 30% actually use it.
  • July 2, 1948: The new Danish orthography is prescribed for use in government agencies.
  • March 22, 1948: A Danish spelling reform for schools introduces å (for aa) and removes capitalization of nouns. Also, the words kunde, skulde, vilde are replaced with kunne, skulle, ville. Since the powers of the orthographic committee have not yet been reinstated (cf. 1943), the reform is enacted by government and education minister Hartvig Frisch. Already in the year before, some municipalities had introduced the reform in their administration.
  • 1945: Swedish public schools make plural endings of verbs optional. Students who opt not to use them must indicate this and then stick to their chosen style.
  • 1944: The percentage of Norwegian children that receive school education in nynorsk peaks at 34.1 %. Ongoing industrialization, urbanization and increased wealth benefit bokmål.
  • March 1943: Danish ministry of education (under Nazi occupation) revokes the orthographic committee's (Retskrivningsudvalget) power to alter spelling. Any changes must be approved by the ministry.
  • 1941: "Dansk forening til nordisk sprogrøgt", a society for making Danish more Scandinavian and less German, is founded during Nazi occupation. Among its proposals are the introduction of å, the dropping of capitalization of nouns, and changing "af" to "av".
  • 1939: At Easter, with fascicle 156, the Swedish Academy celebrates SAOB being halfway (A--K) completed.
  • 1938: Spelling reform for both Norwegian languages aims to bring them closer to each other. Female gender is made mandatory in bokmål.
  • 1929: The two Norwegian languages get new names. Riksmål changes to bokmål, and landsmål changes to nynorsk. However, those who protested the 1938 reform of bokmål took up the old name riksmål.
  • 1923: As a replacement for the 1892 dictionary by Såby/Thorsen, Dansk Retskrivningsordbog is published by an orthographic committee (Retskrivningsudvalget) appointed by the Danish ministry of education.
  • 1917: Norwegian spelling reform for both languages. The letter Å is introduced. R is removed from plurals (hestane/hestene). In riksmål many æ change to e (menn, verk). Female gender is introduced in riksmål and made optional.
  • 1913: The proceedings of the Swedish parliament (riksdagens protokoll) adopt the spelling of the 1906 reform.
  • June 10, 1910: Norwegian government resolution allows -e for some female nouns (vise, jente), dropping -r- in definite form plural, using y instead of ju, jo in some words (bryta, fyka).
  • 1910: The polite use of "Ni" (You/Sie) is introduced in Swedish as a replacement for complicated titles, making Swedish conversation as simple as German.
  • February 19, 1907: A government resolution establishes the first official spelling standard for Norwegian riksmål, based on a proposal from J. Aars and M. Nygaard. This is close to the language of Bjørnson (hesterne, mænd, mænn, værk, ryg). Many Danish b/d/g are changed to p/t/k. Nouns and verbs get Norwegian inflexions. In part this norm was guided by the idea to unify the two Norwegian languages (samnorsktanken).
  • 1907: Bjørnstjerne Bjørnson becomes chairman of riksmålsforeningen in Oslo. In 1909 all such societies are united in a federation, Riksmålsforbundet.
  • 1906: A major Swedish spelling reform does away with the combinations dt, fv, and hv. This is introduced by minister of church and schools Fridtjuv Berg (1851-1916).
  • 1901: Norway's education (church) ministry defines a standard orthography for those school textbooks that are written in landsmål. Vin, Dyr, Sjo are changed to ven, dør, sjø. Verb forms kastade-kastat are changed to kasta-kasta. This reform of 1901 is also known as "Midlandsnormalen".
  • 1900: Danish education ministry allows the dropping of plural forms of verbs (ere, bleve).
  • 1899: A society for riksmål, riksmålsforeningen, is founded in Oslo, chaired by Hj. Falk.
  • 1892: Norway's school districts can decide whether they should teach landsmål or the common language known from books. Secondary schools introduce this reform in 1896.
  • 1888 or 1889, and revised in 1891 or 1892: Denmark's ministry of schools and churches (under minister Jacob Frederik Scavenius) authorizes a new spelling using j and v rather than i and u at the end of diphthongs, reducing the use of c, q, z, and x to foreign words, abandoning double vowels, abandoning the silent e, abandoning silent j after k and g. The dictionary by Viggo Såby (1835-1898), Ordbog med befalet Retskrivning til Brug for Skolene becomes the norm for spelling in Danish schools. The dictionary was later continued by Peder Kristian Thorsen (born October 28, 1851, died 1920, [2]). In 1889 the word sexten (sixteen) is changed to sejsten (its actual pronunciation), but in 1892 this all too radical reform is moderated to seksten (x to ks), which is the current spelling.
  • 1889: The 6th edition of SAOL introduces many of the changes proposed by the 1869 congress. This includes the change from e to ä in elf/älf, jern/järn. It also allows a change from qv to kv, e.g. qvarn/kvarn, qvinna/kvinna.
  • 1885: Norwegian parliament rules that landsmål is a parallel official language.
  • 1883: A new editor restarts the Academy's dictionary. The first fascicle is printed in 1893 and the first volume of "Svenska Akademiens Ordbok" (SAOB) is completed in 1898. The dictionary documents Swedish spelling since 1526.
  • 1877: Norway no longer requires capitalization of nouns.
  • 1874: The Swedish Academy publishes a spelling dictionary in one volume, Svenska Akademiens Ordlista (SAOL). This 1st edition by Johan Erik Rydqvist (1800-1877) is very conservative in spelling, as a direct protest against Hazelius and the changes proposed by the 1869 congress. Its 6th edition (1889) and 8th edition (1923) are out of copyright.
  • 1872: Danish ministry of education recommends a new orthography for use in schools, implementing most of the proposals from the Stockholm meeting. However, capitalization of nouns and the use of aa are kept. Protests come from writers who defend the literaire orthography (rather than "litterære").
  • 1869: A Scandinavian spelling congress (det nordiske Retskrivningsmøde, det nordiska rättstavningsmötet) is held in Stockholm, suggesting that nouns should no longer be capitalized (in Danish-Norwegian) and that ä should replace e in many places (in Swedish). Among the Norwegian representatives was Henrik Ibsen, who immediately adopted the new proposals in his own writing, such as changing from gj/kj to g/k (gerne, kærlighed, skemt, igen), from ei/øi to ej/øj (Freja, dreje, fløjel), from ch/x/qv to k/ks/kv (Krist, veksel, kvinde). In the 1870s he set out to republish his older works in this new language. The Swedish Academy was not invited, because the organizers wanted to achieve consensus in the direction of reform, and this would not have been accepted by the Academy. Secretary for the Swedish section was Artur Hazelius (1833-1901), who published Om svensk rättstafning. 1. Om rättstafningens grunder med särskildt afseende på svenska språket (1870, "On the foundations of orthography with special consideration on the Swedish language") and 2. Redogörelse för Nordiska rättstafningsmötets förslag till ändringar i stafningssättet jemte berättelse om mötet (1871, "Presentation of the Scandinavian spelling congress' proposals for changes in orthography and proceedings of the congress").
  • 1864: Prussia and Austria attack Denmark in the Second war on Schleswig. Sweden and Norway don't come to Denmark's rescue and this puts an end to any dreams of political Scandinavism. All that remains now is Scandinavian exchange in literature and language.
  • 1862: Norwegian instruction on orthography changes ph/ch/x/qv to f/k/ks/kv. Double vowels (Eed, Huus, siig, viid) and silent e (gaaer, roer, Tyrannie) are dropped (Ed, Hus, sig, vid, gaar, ror, Tyranni). This reform has no effect in Denmark. This reform is similar to the Danish reform of 1892, but Norwegian doesn't change gj/kj.
  • 1853: Ivar Aasen publishes some samples of dialects in Prøver af Landsmaalet i Norge (1853). In this collection, some stories are also printed in a standardized version of Norwegian language, which marks the creation of landsmaal (in 1929 renamed nynorsk), one of the two Norwegian languages. Writers Aa. O. Vinje and Arne Garborg start to use the new language.
  • 1842-1846: Norwegian linguist Ivar Aasen travels the country to collect samples from dialects. His observations are summarized in a grammar and a dictionary: Det norske Folksprogs Grammatik (1848) and Ordbog over det norske Folksprog (1850).
  • 1842: Public schools are made compulsory by law in Sweden.
  • 1830: Sixteen years after Norway's separation from Denmark, the idea to create a Norwegian language is first mentioned.
  • 1826: Danish linguist Rasmus Rask (1787-1832) publishes a Forsøg til en videnskabelig dansk Retskrivningslære (Attempt at a scientific Danish orthography), in which he proposes a far-reaching reform of Danish spelling. Among other things, he proposed the introduction of å to replace aa (this reform took place in 1948). One of his disciples was Niels Mathias Petersen (1791-1862), who continued to work for reforming the Danish language.
  • 1822: Mauritz Hansen tries to reform spelling in Norway, but gets very few followers. He introduced k in kontor, Kristiania, karakter, dropped the silent h from ti, ur, te, changed from x/ph to ks/f. He also wrote losji, sersjant, marsj, sitron, sensur, sjalu, sjef, angballasje, proposisjon.
  • 1814: Public schools are made compulsory by law in Denmark.
  • 1801: Carl Gustaf af Leopold, a member of the Swedish Academy, publishes Afhandling om svenska stafsättet (treatise on Swedish orthography), 267 pages, in Proceedings of the Swedish Academy part 1, also printed separately, and later republished in the author's collected works. This is a de facto description of the already existing practice, with some new guidelines on how to adapt foreign loanwords to Swedish. This orthography is the official one until 1889.
  • 1786: The Swedish Academy is founded by king Gustav III. One of its main tasks is to compile a dictionary of the Swedish language. Work begins immediately, but stops already in 1814. New attempts are started in 1834 and 1855. A fascicle for the letter "A" is published in 1870.
  • 1775: Danish government issues the first instructions on spelling to higher schools.
  • 1750-1800: In the latter half of the 18th century, capitalization of nouns is introduced in Danish.
  • 1753: Swedish scholar Sven Hof publishes Swänska språkets rätta skrifsätt ("The right spelling of the Swedish language").
  • 1726: Danish-Norwegian playwright Ludvig Holberg (1684–1754) documents his own spelling in "Orthographiske Anmerkninger" in Metamorphosis. An online version is found here. However, that text does not use Holberg's unique orthography. This might seem odd, but is explained by the fact that book printers changed Holberg's very disciplined spelling to their own random spelling. A good background is given in this article on the history of Danish language in the encyclopedia Salmonsens Konversationsleksikon.
  • 1703: First attempt to use Antiqva (rather than Fraktur) for Danish books, but only very few books are printed this way. Fraktur continues to dominate.
  • 1647: A Danish Bible translation does away with male/female gender of nouns and always uses "den".
  • 1526: Sweden's Lutheran church reformer Olaus Petri translates the New Testament into Swedish. The Old Testament follows in 1541. His style of writing marks the beginning of modern Swedish orthography.
  • 9th Century A.D.: About the same time as Iceland is populated by the Norwegians, Sweden's longest runic inscription, the Rök runestone, is carved. Runes are Scandinavian letters inspired by Greek/Latin alphabets but adapted for carving in stone or wood. Two different runic alphabets were used between c. 500 and 1000 A.D., the first with 24 letters, later simplified to one with 16 letters. With the introduction of Christianity around 1000 A.D., runes are gradually replaced with Latin script.