Compiled on 2017-04-22 from rev f1a9714315d6.
We use a major.minor versioning scheme.
Until the dictionary reaches 5000 words, major is 0. minor is updated from time to time.
$ hg clone http://hg.defun.work/gadict gadict
$ hg clone http://hg.code.sf.net/p/gadict/code gadict-hg
$ hg push ssh://$USER@hg.defun.work/gadict
$ hg push ssh://$USER@hg.code.sf.net/p/gadict/code
$ hg push https://$USER:$PASS@hg.code.sf.net/p/gadict/code
The gadict project used the dictd C5 source file format in the past. The C5 format has several issues:
- C5 is not a structured format, so producing other forms and converting to other formats is not possible.
- C5 has no markup for links or for anything else.
Before that, the project used the dictd TAB file format, which requires placing each article on a single long line. That format is not suitable for human editing at all.
Other dictionary source file formats were considered, such as TEI, ISO, xdxf and MDF. XML-like formats are also not suitable for human editing. XML also lacks syntax locality: the full file has to be scanned to validate local changes.
Note that the StarDict, ABBYY Lingvo, Babylon and dictd formats were not considered because they are all about presentation, not structure. They are target formats for compilation.
A fancy-looking analog of MDF + C5 was developed.
The beginning of the file describes dictionary information.
Each article is separated by \n__\n\n and consists of two parts:
- word variations with pronunciation
- word translations, with supplementary information like part of speech, synonyms, antonyms, examples of usage
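The article separator described above can be illustrated with a minimal Python sketch. It assumes that everything before the first separator is the dictionary header; the function name `split_articles` and the sample text are hypothetical and are not taken from the gadict sources.

```python
ARTICLE_SEPARATOR = "\n__\n\n"


def split_articles(text):
    """Return the header block and the list of raw article texts."""
    # Assumption: the dictionary information comes before the first separator.
    header, *articles = text.split(ARTICLE_SEPARATOR)
    return header, [a.strip("\n") for a in articles if a.strip()]


# Hypothetical sample, only to show the splitting.
sample = (
    "Hypothetical dictionary header\n"
    "\n__\n\n"
    "first\n\nbody of first article\n"
    "\n__\n\n"
    "second\n\nbody of second article\n"
)
header, articles = split_articles(sample)
print(len(articles))
```

The sketch does not split an article into its two parts, because the boundary between them is not specified here.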
Word variations are:
Parts of speech are:
I try to keep the word meanings in an article in the above POS order.
Each meaning may refer to topics, like:
A translation is marked by a lowercase ISO 639-1 code followed by a : (colon) character, like:
An example is marked by a lowercase ISO 639-1 code followed by a > (greater-than) character.
An explanation or glossary is marked by a lowercase ISO 639-1 code followed by an = (equals) character.
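A rough Python sketch of how such marked lines could be recognized follows. It assumes the marker character comes immediately after the two-letter language code at the start of a line; the regular expression, the `classify` helper and the sample lines are illustrative assumptions, not the project's actual parser.

```python
import re

# Assumption: two lowercase letters (ISO 639-1 code), then one marker character.
LINE_RE = re.compile(r"^([a-z]{2})([:>=])\s*(.*)$")
KINDS = {":": "translation", ">": "example", "=": "explanation"}


def classify(line):
    """Return (kind, language, text) for a marked line, or None otherwise."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    lang, marker, text = m.groups()
    return KINDS[marker], lang, text


# Hypothetical sample lines.
for line in ["uk: благати", "en> I beseech you.", "en= to ask urgently", "n"]:
    print(classify(line))
```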
Pronunciation variants are marked by:
The rare attribute on the first headword marks a word as having low frequency. SRS file writers skip entries marked as rare. I found it convenient to check frequency with:
As the cut-off point I chose the word beseech. All less frequent words receive the rare marker.
The dictd C5 file format was used as the source file format. See:
$ man 1 dictfmt
- Headwords are preceded by a line of 5 or more underscore characters _ and a blank line.
- An article may have several headwords; in that case they are placed on one line and separated by ;<SPACE>.
- All text until the next headword is considered as the definition.
- Any leading @ characters are stripped out, but the file is otherwise unchanged.
- UTF-8 encoding is supported at least by Goldendict.
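The C5 rules listed above can be sketched as a small Python reader. This is an illustration under the stated rules only: it is not the dictfmt implementation, it ignores the stripping of leading @ characters, and the names `parse_c5` and `sample` are hypothetical.

```python
import re

# A line of 5 or more underscores, a blank line, then the headword line.
HEADWORD_RE = re.compile(r"^_{5,}\n\n(.+)\n", re.MULTILINE)


def parse_c5(text):
    """Return a list of (headwords, definition) pairs."""
    entries = []
    matches = list(HEADWORD_RE.finditer(text))
    for i, m in enumerate(matches):
        # The definition runs until the next headword marker or end of text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        headwords = m.group(1).split("; ")
        entries.append((headwords, text[m.end():end].strip()))
    return entries


# Hypothetical sample with two articles.
sample = (
    "_____\n\n"
    "colour; color\n"
    "a hypothetical definition\n"
    "_____\n\n"
    "cat\n"
    "another hypothetical definition\n"
)
entries = parse_c5(sample)
print(entries)
```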
The gadict project used the C5 format in the past but switched to its own format.
Entries or parts of text that are not completed are marked by keywords:
- urgent incomplete
The Makefile rule todo finds these occurrences in the sources:
$ make todo
- Dictionary writing system
- Multi-Dictionary Formatter (MDF). It defines about 100 data field markers.
- FieldWorks Language Explorer (or FLEx, for short) is designed to help field linguists perform many common language documentation and analysis tasks.
- LIFT (Lexicon Interchange FormaT) is an XML format for storing lexical information, as used in the creation of dictionaries. It's not necessarily the format for your lexicon.
- Lexique Pro is an interactive lexicon viewer and editor, with hyperlinks between entries, category views, dictionary reversal, search, and export tools. It's designed to display your data in a user-friendly format so you can distribute it to others.
- DEBII — Dictionary Editor and Browser
National corpus of the Russian language. There are parallel Russian-Ukrainian texts. Search by keywords, grammatical function, thesaurus properties and other properties.
Corpus of the mova.info project. There are literal search and search by word family.
Frequency wordlists use several statistics:
number of word occurrences in corpus, usually marked by F
adjusted number of occurrences per 1,000,000 in corpus, usually marked by U
Standard Frequency Index (SFI) is a logarithmic frequency scale; each step of 10 corresponds to a tenfold change in frequency:
|90||1 per 10|
|80||1 per 100|
|70||1 per 1,000|
|60||1 per 10,000|
|50||1 per 100,000|
|40||1 per 1,000,000|
|30||1 per 10,000,000|
deviation of word frequency across documents in corpus, usually marked by D
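The SFI table above is consistent with the formula SFI = 10 * (log10(U) + 4), where U is the adjusted number of occurrences per 1,000,000 words. The following Python sketch assumes that this is the intended definition; the function name `sfi` is made up for illustration.

```python
import math


def sfi(per_million):
    """Standard Frequency Index for a frequency given per 1,000,000 words."""
    # Assumption: SFI = 10 * (log10(U) + 4), which reproduces the table rows.
    return 10 * (math.log10(per_million) + 4)


# "1 per 10" is 100,000 per million, "1 per 1,000,000" is 1 per million.
for u in (100_000, 1, 0.1):
    print(u, round(sfi(u), 2))
```

Under this formula a word that occurs once per million words gets SFI 40, and every tenfold increase in frequency adds 10.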
Sorting numerically, in descending order, on the first column:
$ sort -k 1nr,2 <$IN >$OUT
The Open American National Corpus (OANC) is a roughly 15 million word subset of the ANC Second Release that is unrestricted in terms of usage and redistribution.
I downloaded OANC from: http://www.anc.org/OANC/OANC-1.0.1-UTF8.zip
After unpacking only the .txt files:
$ unzip OANC-1.0.1-UTF8.zip '*.txt'
$ cd OANC; find . -type f | xargs cat | wc
2090929 14586935 96737202
I built the frequency list with:
Then I manually removed one- and two-letter words, filtered out misspelled words with the en_US hunspell spell-checker, and merged word variations into their base forms using WordNet. See details in obsolete/oanc.py.
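The counting step can be sketched in a few lines of Python. This is not the code from obsolete/oanc.py: the spell-checking and WordNet merging steps are left out, and the tokenization rule (runs of ASCII letters), the function name `word_frequencies` and the sample sentences are assumptions for illustration only.

```python
import re
from collections import Counter

# Assumption: a word is a run of ASCII letters, compared case-insensitively.
WORD_RE = re.compile(r"[a-z]+")


def word_frequencies(texts, min_length=3):
    """Count words across texts, dropping one- and two-letter words."""
    counts = Counter()
    for text in texts:
        for word in WORD_RE.findall(text.lower()):
            if len(word) >= min_length:
                counts[word] += 1
    return counts


# Hypothetical sample texts instead of the OANC .txt files.
counts = word_frequencies(["The cat saw the dog.", "A dog is in the yard."])
for word, n in counts.most_common(3):
    print(n, word)
```

Printing the count first gives lines that can be sorted numerically on the first column.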
Useful word lists:
Obsolete or proprietary word list:
Updated GSL (General Service List) was obtained from:
The first column represents the number of occurrences per 1,000,000 words of the Brown corpus, counted by word family.
NGSL was obtained from:
The first column represents the adjusted frequency per 1,000,000 words, counted by base word family.
The Academic Word List (AWL) was published in the Summer 2000 issue of the TESOL Quarterly (v. 34, no. 2). It was developed by Averil Coxhead of Victoria University of Wellington, New Zealand. The AWL is a replacement for the University Word List (published by Paul Nation in 1984).
AWL (Academic Word List) was obtained from:
Its structure is a headword followed by a frequency level (from 1 as most frequent to 10 as least frequent).
Frequency word list was obtained from:
The SFI and D columns were deleted and the U and Word columns were swapped. The data was sorted by the U column (adjusted frequency per 1,000,000 words).
NSWL headword list with word variations was obtained from:
It was encoded in latin-1 and I recoded it into utf-8 (because of the É symbol).
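For reference, such a recoding can be done in Python as sketched below. The helper name `recode_latin1_to_utf8` is hypothetical, and this is not necessarily the tool that was used for the list.

```python
def recode_latin1_to_utf8(data: bytes) -> bytes:
    """Decode bytes as latin-1 and encode the result as utf-8."""
    return data.decode("latin-1").encode("utf-8")


# The letter É is one byte in latin-1 and two bytes in utf-8.
raw = "É".encode("latin-1")
print(raw, recode_latin1_to_utf8(raw))
```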
The 1700 words of the BSL 1.01 version give up to 97% coverage of general business English materials when combined with the 2800 words of the NGSL.
Wordlist with variations was obtained from:
Based on a 1.5 million word corpus of various TOEIC preparation materials, the 1200 words of the TSL 1.1 version give up to 99% coverage of TOEIC materials and tests when combined with the 2800 words of the NGSL.
Wordlist with variations was obtained from:
Paul Nation prepared a frequency wordlist from the combined BNC and COCA corpora:
It has 25000 base words (each base word comes with its variations) split into chunks of 1000 words.
I got the list from:
To enter IPA characters in Emacs, use the ipa input method. To enable it, type:
C-u C-\ ipa <enter>
All characters of the alphabet are typed as usual. To type special IPA characters, use the following key bindings (or read the help in Emacs with M-x describe-input-method or C-h I).
æ   ae
ɑ   o| or A
ɒ   |o or /A
ʊ   U
ɛ   /3 or E
ɔ   /c
ə   /e
ʌ   /v
ɪ   I

θ   th
ð   dh
ʃ   sh
ʧ   tsh
ʒ   zh or 3
ŋ   ng
ɡ   g
ɹ   /r

ː   : (colon)
ˈ   ' (quote)
ˌ   ` (back quote)
Alternatively use ipa-x-sampa or ipa-kirshenbaum input method (for help type: C-h I ipa-x-sampa RET or C-h I ipa-kirshenbaum RET).