gadict HACKING guide

Versioning rules

We use a major.minor scheme.

Until we reach 5000 words the major version stays 0. The minor version is updated from time to time.

Getting sources

Cloning repository:

$ hg clone http://hg.defun.work/gadict gadict
$ hg clone http://hg.code.sf.net/p/gadict/code gadict-hg

Pushing changes:

$ hg push ssh://$USER@hg.defun.work/gadict
$ hg push ssh://$USER@hg.code.sf.net/p/gadict/code
$ hg push https://$USER:$PASS@hg.code.sf.net/p/gadict/code

Browsing sources online

http://hg.defun.work/gadict

hgweb at home page.

http://hg.code.sf.net/p/gadict/code

hgweb at the old home page (still supported as a mirror).

https://sourceforge.net/p/gadict/code/

SourceForge Allura interface (not primary, a mirror).

Building project

The gadict project provides dictionaries encoded in a custom format. To process them you need GNU Make, Python 2.7 and possibly other tools.

To produce dictionaries in dictd format you need to install the dictd distribution with the dictfmt and dictzip utilities:

$ sudo apt install dictfmt dictzip

and run:

$ make dict

To make Anki decks, check out the Anki sources:

$ git clone https://github.com/dae/anki.git
$ cd anki

and update to a specific revision (the one just before pyaudio became a hard dependency, because pyaudio is not available on Cygwin):

$ git checkout 1d75cff5e7458c6538a4e75728c16bef8b7adb3e^

$ git show 1d75cff5e7458c6538a4e75728c16bef8b7adb3e
commit 1d75cff5e7458c6538a4e75728c16bef8b7adb3e
Author: Damien Elmes <git@ichi2.net>
Date:   2016-06-23 12:04:48 +1000

    pyaudio is no longer optional

Previously the build used Python 2 and depended on earlier source revisions (before the port to Python 3):

$ git checkout 15b349e3^

$ git show 15b349e3
commit 15b349e3a8b34bf80c134b406c9b90f61250ee9e
Author: Damien Elmes <git@ichi2.net>
Date:   2016-05-12 14:45:35 +1000

    start port to python 3

and put the path to the Anki source directory into Makefile.config:

ANKI_PY_DIR := $(HOME)/devel/anki

The build command to make Anki decks is:

$ make anki

Alternative Anki generators

https://github.com/kerrickstaley/genanki: A Library for Generating Anki Decks.
https://github.com/lervag/apy: CLI script for interacting with local Anki collection.
https://github.com/damaru2/ankigenbot/blob/master/src/send_card.py: Pushes cards to https://ankiweb.net

Dictionary source file format

The gadict project used the dictd C5 source file format in the past. The C5 format has several issues:

C5 is not a structured format, so producing other forms and converting to other formats is not possible.

C5 has no markup for links or for anything else.

Before that the project used the dictd TAB file format, which requires placing each article on a single long line. That format is not suited to human editing at all.

Other dictionary source file formats were considered, such as TEI, ISO, xdxf and MDF. XML-like formats are not suited to human editing either. XML also lacks syntax locality: the full file has to be scanned to validate a local change.

Note that the StarDict, ABBYY Lingvo, Babylon and dictd formats were not considered because they are all about presentation, not structure. They are target formats for compilation.

Instead, a fancy-looking analog of MDF + C5 was developed.

The beginning of the file describes the dictionary information.

Each article is separated by \n__\n\n and consists of two parts:

word variations with pronunciation

word translations, with supplementary information like part of speech, synonyms, antonyms and examples of usage
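The separator rule above can be sketched in Python as follows. This is only an illustration: the function name split_articles and the sample text are made up here, the article bodies are placeholders rather than real article syntax, and the project's own parser may work differently.

```python
# Separator between articles, as described above.
ARTICLE_SEPARATOR = "\n__\n\n"


def split_articles(text):
    """Return (header, articles) for a dictionary source string.

    The header is everything before the first separator; each article
    is returned as raw text with surrounding newlines removed.
    """
    chunks = text.split(ARTICLE_SEPARATOR)
    header = chunks[0].strip("\n")
    articles = [chunk.strip("\n") for chunk in chunks[1:] if chunk.strip()]
    return header, articles


# Hypothetical sample: placeholder text, not real gadict article syntax.
SAMPLE = (
    "dictionary info\n"
    "\n__\n\n"
    "first article\n"
    "\n__\n\n"
    "second article\n"
)

if __name__ == "__main__":
    header, articles = split_articles(SAMPLE)
    print(header)
    print(articles)
```

Splitting on the full separator string (rather than on any line containing underscores) keeps underscores inside article text from being mistaken for article boundaries.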

Word variations are:

number: s - singular, pl - plural.
verb form: v1 - infinitive, v2 - past tense, v3 - past participle.
gender: male or female.
comparison: comp - comparative, super - superlative.

Parts of speech (ordered by preference):

v - verb
n - noun
pron - pronoun
adv - adverb
adj - adjective
prep - preposition
conj - conjunction
num - numeral
int - interjection
abbr - abbreviation
phr - phrase
phr.v - phrasal verb
contr - contraction
prefix - word prefix

Note

I try to keep the word meanings in an article in the above POS order.

Each meaning may refer to topics, like:

sci - about science
body - part of body
math - mathematics
chem - chemicals
bio - biology
music
meal, office, etc
size, shape, age, color
archaic - old fashioned, no longer used

Word relation (ordered by preference):

topic: - topics/tags
ant: - antonyms
syn: - synonyms
hyper: - hypernyms
hypo: - hyponyms
rel: - related (see also) terms

Translations are marked by a lowercase ISO 639-1 code followed by a : (colon) character, like:

en: - English
ru: - Russian
uk: - Ukrainian
la: - Latin

Examples are marked by a lowercase ISO 639-1 code followed by a > (greater-than) character.

Explanations or glossary entries are marked by a lowercase ISO 639-1 code followed by a = (equals) character.
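A small Python sketch of telling these three kinds of lines apart. The names MARKER_RE and classify_line are hypothetical, and the handling of surrounding whitespace is an assumption; the sketch relies only on the rule that a two-letter lowercase code is followed by :, > or =.

```python
import re

# Two lowercase letters (ISO 639-1 code) followed by one marker character.
MARKER_RE = re.compile(r"^([a-z]{2})([:>=])\s*(.*)$")

KINDS = {":": "translation", ">": "example", "=": "explanation"}


def classify_line(line):
    """Return (kind, language, text), or None for any other line."""
    match = MARKER_RE.match(line.strip())
    if match is None:
        return None
    language, marker, text = match.groups()
    return KINDS[marker], language, text


if __name__ == "__main__":
    # Hypothetical sample lines.
    for sample in ["en: book", "en> I read a book.", "en= a set of pages", "syn: volume"]:
        print(sample, "->", classify_line(sample))
```

Because relation markers such as syn: or topic: are longer than two letters, they do not match and are left for other code to handle.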

Pronunciation variants are marked by:

Am - American
Br - British
Au - Australian

The rare attribute on the first headword marks a word as having low frequency. SRS file writers skip entries marked as rare. I found it convenient to check frequency with:

https://books.google.com/ngrams/: Google N-grams from books 1800-2010.

As the cut-off point I chose the word beseech. All less frequent words receive the rare marker.

gaphrase & gadialog file formats

gaphrase & gadialog files keep data for generating one-sided Anki cards.

Both use the same numbering scheme, which allows merging updated articles with the originals without losing learning progress:

The first line of the file starts with ## NUM to keep track of the latest used number.
Articles are separated by a number line of the form # NUM.

gadialog additionally maintains a dialog; each part is marked by a line starting with - TEXT.
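The numbering scheme can be sketched in Python like this. It is an illustration only: the function name parse_numbered and the sample are made up, and the sketch assumes that each # NUM line introduces the article that follows it and that the number is separated from the marker by a single space.

```python
import re

HEADER_RE = re.compile(r"^## (\d+)\s*$")
NUMBER_RE = re.compile(r"^# (\d+)\s*$")


def parse_numbered(text):
    """Return (last_used_number, {number: article_text})."""
    lines = text.splitlines()
    header = HEADER_RE.match(lines[0]) if lines else None
    if header is None:
        raise ValueError("first line must look like '## NUM'")
    last_used = int(header.group(1))

    articles = {}
    current = None
    for line in lines[1:]:
        match = NUMBER_RE.match(line)
        if match:
            # Assumption: a number line opens the article with that number.
            current = int(match.group(1))
            articles[current] = []
        elif current is not None:
            articles[current].append(line)
    joined = {num: "\n".join(body).strip() for num, body in articles.items()}
    return last_used, joined


# Hypothetical sample in the spirit of a gadialog file.
SAMPLE = "## 2\n# 1\nHow are you?\n# 2\n- Hello!\n- Hi!\n"

if __name__ == "__main__":
    print(parse_numbered(SAMPLE))
```

Keying articles by their number is what makes a merge safe: an updated article keeps its number, so the card generated from it can be matched to the existing one.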

C5 dictionary source file format

The dictd C5 file format was used as the source file format. See:

$ man 1 dictfmt

In short:

Headwords are preceded by 5 or more underscore characters _ and a blank line.

An article may have several headwords; in that case they are placed on one line and separated by ;<SPACE>.

All text until the next headword is considered the definition.

Any leading @ characters are stripped out, but the file is otherwise unchanged.

UTF-8 encoding is supported at least by GoldenDict.

The gadict project used the C5 format in the past but switched to its own format.
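For illustration, a Python sketch that writes one article following the C5 rules summarized above. The function name format_c5_article and the sample article are hypothetical, and the exact layout (a line of five underscores, a blank line, the headword line, then the definition) is my reading of that summary; check it against man 1 dictfmt before relying on it.

```python
def format_c5_article(headwords, definition):
    """Format one article in C5 style.

    headwords  -- list of headword strings for the same article
    definition -- text of the article body
    """
    separator = "_" * 5                 # 5 or more underscores
    headline = "; ".join(headwords)     # several headwords share one line
    return "%s\n\n%s\n%s\n" % (separator, headline, definition.strip("\n"))


if __name__ == "__main__":
    # Hypothetical article.
    print(format_c5_article(["colour", "color"], "n\n  a visual property"))
```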

TODO convention

Entries or parts of text that are not complete are marked by keywords:

TODO

incomplete

XXX

urgent incomplete

The Makefile rule todo finds these occurrences in the sources:

$ make todo
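What such a search does can be sketched in Python as below. This is not the Makefile's actual implementation; the names TODO_RE and find_todos are made up, and matching the keywords as whole words is an assumption.

```python
import re

# Match the two keywords as whole words.
TODO_RE = re.compile(r"\b(TODO|XXX)\b")


def find_todos(lines):
    """Yield (line_number, keyword, line) for every marked line."""
    for number, line in enumerate(lines, start=1):
        match = TODO_RE.search(line)
        if match:
            yield number, match.group(1), line.rstrip("\n")


if __name__ == "__main__":
    # Hypothetical file content.
    sample = ["complete entry\n", "en: TODO\n", "XXX check pronunciation\n"]
    for hit in find_todos(sample):
        print(hit)
```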

World wide dictionary formats and standards

http://en.wikipedia.org/wiki/Dictionary_writing_system

Dictionary writing system

http://www.sil.org/computing/shoebox/mdf.html

Multi-Dictionary Formatter (MDF). It defines about 100 data field markers.

http://fieldworks.sil.org/flex/

FieldWorks Language Explorer (or FLEx, for short) is designed to help field linguists perform many common language documentation and analysis tasks.

http://code.google.com/p/lift-standard/

LIFT (Lexicon Interchange FormaT) is an XML format for storing lexical information, as used in the creation of dictionaries. It's not necessarily the format for your lexicon.

http://www.lexiquepro.com/

Lexique Pro is an interactive lexicon viewer and editor, with hyperlinks between entries, category views, dictionary reversal, search, and export tools. It's designed to display your data in a user-friendly format so you can distribute it to others.

http://deb.fi.muni.cz/index.php

DEBII — Dictionary Editor and Browser

Linguistic sources

Ukrainian linguistics corpora

National corpus of the Russian language. It includes parallel Russian-Ukrainian texts. Search by keywords, grammatical function, thesaurus properties and other properties.

http://www.ruscorpora.ru/search-para-uk.html: Page for querying online.

Corpus of the mova.info project. It offers literal search and search by word family.

http://www.mova.info/corpus.aspx: Page for querying online.

Word lists

Frequency wordlists use several statistics:

number of word occurrences in corpus, usually marked by F
adjusted number of occurrences per 1.000.000 in corpus, usually marked by U
deviation of word frequency across documents in corpus, usually marked by D
Standard Frequency Index (SFI), computed from U as:

SFI = 40 + 10*log₁₀(U)

SFI	Freq
90	1 per 10
80	1 per 100
70	1 per 1000
60	1 per 10.000
50	1 per 100.000
40	1 per 1.000.000
30	1 per 10.000.000
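The SFI formula and the table above can be checked with a short Python sketch. The function names sfi_from_u and u_from_sfi are made up for this example; U is taken to be the adjusted frequency per 1,000,000 words, as defined in the list of statistics.

```python
import math


def sfi_from_u(u):
    """Standard Frequency Index for U occurrences per 1,000,000 words."""
    if u <= 0:
        raise ValueError("U must be positive")
    return 40 + 10 * math.log10(u)


def u_from_sfi(sfi):
    """Inverse of sfi_from_u: occurrences per 1,000,000 words."""
    return 10 ** ((sfi - 40) / 10.0)


if __name__ == "__main__":
    # 1 per 1,000,000 words is U = 1; 1 per 10 words is U = 100,000.
    for u in (1, 10, 100000):
        print(u, round(sfi_from_u(u), 2))
```

Each step of 10 in SFI corresponds to a factor of 10 in frequency, which is why the table rows differ by one decimal order of magnitude.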

Sorting numerically on first column:

$ sort -k 1nr,2 <$IN >$OUT

https://www.wordandphrase.info/frequencyList.asp: Word frequency info based on COCA.
https://www.english-corpora.org/coca/: COCA corpus with word frequency info.

OANC frequency wordlist

The Open American National Corpus (OANC) is a roughly 15 million word subset of the ANC Second Release that is unrestricted in terms of usage and redistribution.

I got OANC from: http://www.anc.org/OANC/OANC-1.0.1-UTF8.zip

After unpacking only the .txt files:

$ unzip OANC-1.0.1-UTF8.zip '*.txt'
$ cd OANC; find . -type f | xargs cat | wc
2090929 14586935 96737202

I built the frequency list with:

http://www.laurenceanthony.net/software/antconc/: A freeware corpus analysis toolkit for concordancing and text analysis.

Then I manually removed one- and two-letter words, filtered out misspelled words with the en_US hunspell spell-checker and merged word variations into base forms using WordNet. See details in obsolete/oanc.py.
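The clean-up steps above can be sketched in Python as follows. This is not the content of obsolete/oanc.py: the function name frequency_list is made up, and the spell-checker and WordNet are replaced by two plain stand-ins, a set of known words and a dictionary mapping variations to base forms.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z]+")


def frequency_list(text, known_words, base_forms):
    """Count base forms in text, most frequent first.

    known_words -- stand-in for a spell-checker: set of accepted words
    base_forms  -- stand-in for WordNet: {variation: base form}
    """
    counts = Counter()
    for word in WORD_RE.findall(text.lower()):
        if len(word) <= 2:              # drop one- and two-letter words
            continue
        if word not in known_words:     # drop misspelled words
            continue
        counts[base_forms.get(word, word)] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    text = "The cats saw a cat. Teh cat is on the mat."
    known = {"the", "cats", "cat", "saw", "mat"}
    bases = {"cats": "cat"}
    print(frequency_list(text, known, bases))
```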

http://www.anc.org/data/oanc/download/: OANC download page.
http://www.anc.org/data/oanc/: OANC home page.
https://anc.org/data/anc-second-release/frequency-data/: 2nd release of ANC.

https://en.wikipedia.org/wiki/Word_lists_by_frequency

Useful word lists:

https://en.wikipedia.org/wiki/Academic_Word_List: Academic Word List at Wikipedia.
https://web.archive.org/web/20080212073904/http://language.massey.ac.nz/staff/awl/headwords.shtml: Academic Word List by Averil Coxhead created in 2000 as addition to GSL and has 570 headwords.

Obsolete or proprietary word lists:

https://en.wikipedia.org/wiki/Basic_English: 850 headword list created in 1930.

General Service List

Updated GSL (General Service List) was obtained from:

http://jbauman.com/gsl.html: A 1995 revised version of the GSL with minor changes by John Bauman. He added 284 new headwords to the original 2000-word list created by Michael West in 1953.

First column represents the number of occurrences per 1,000,000 words of the Brown corpus based on counting word families.

https://en.wikipedia.org/wiki/General_Service_List: General Service List at Wikipedia.
http://jbauman.com/aboutgsl.html: About the General Service List by John Bauman.
https://www.eapfoundation.com/vocab/general/gsl/: Sheldon Smith about GSL.

New General Service List

NGSL was obtained from:

http://www.newgeneralservicelist.org/s/NGSL-101-by-band-qq9o.xlsx: Microsoft Excel (XLSX) file with headword, frequency and SFI.

The first column represents the adjusted frequency per 1,000,000 words, counting base word families.

Academic Word List

The Academic Word List (AWL) was published in the Summer 2000 issue of the TESOL Quarterly (v. 34, no. 2). It was developed by Averil Coxhead, of Victoria University of Wellington, in New Zealand. The AWL is a replacement for the University Word List (published by Paul Nation in 1984).

AWL (Academic Word List) is obtained from:

https://web.archive.org/web/20081014065815/http://language.massey.ac.nz/staff/awl/download/awlheadwords.rtf: Original Academic Word List in RTF format.

Its structure is a headword followed by a frequency level (from 1 as most frequent to 10 as least frequent).

New Academic Word List

Frequency word list was obtained from:

http://www.newacademicwordlist.org/s/NAWL_SFI.csv: CSV with columns Word,SFI,U,D.

The SFI and D columns were deleted and the U and Word columns were swapped. The data was sorted by the U column (adjusted frequency per 1,000,000 words).
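The column transformation can be sketched in Python with the standard csv module. The function name nawl_to_frequency_rows and the sample rows are made up (the numbers are not real NAWL data), and sorting from most to least frequent is an assumption, since the direction is not stated above.

```python
import csv
import io


def nawl_to_frequency_rows(csv_text):
    """Turn 'Word,SFI,U,D' CSV text into (U, Word) rows sorted by U."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Keep only U and Word, in swapped order; SFI and D are dropped.
    rows = [(float(record["U"]), record["Word"]) for record in reader]
    # Assumption: most frequent words first.
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows


# Made-up sample values, for illustration only.
SAMPLE = "Word,SFI,U,D\nalpha,50.0,10.0,0.9\nbeta,53.0,20.0,0.8\n"

if __name__ == "__main__":
    for u, word in nawl_to_frequency_rows(SAMPLE):
        print("%s\t%s" % (u, word))
```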

The NAWL headword list with word variations was obtained from:

http://www.laurenceanthony.net/software/antwordprofiler/: Laurence Anthony's AntWordProfiler home page.

It was encoded in latin-1 and has been recoded into utf-8 (because of the É symbol).
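Such a recoding can be sketched in Python as below. The function name recode_latin1_to_utf8 is made up, and this is not necessarily the tool that was used for the list.

```python
def recode_latin1_to_utf8(data):
    """Convert latin-1 encoded bytes into UTF-8 encoded bytes."""
    return data.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    # In latin-1 the letter É is the single byte 0xC9;
    # in UTF-8 it becomes the two bytes 0xC3 0x89.
    print(recode_latin1_to_utf8(b"\xc9"))
```

Pure ASCII bytes come out unchanged, so only words with accented letters are affected.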

Special English word list

https://en.wikipedia.org/wiki/Special_English: Special English is a controlled version of the English language used by the United States broadcasting service Voice of America (VOA). 1557 headwords.

Business Service List

The 1700 words of the BSL 1.01 version give up to 97% coverage of general business English materials when combined with the 2800 words of the NGSL.

Wordlist with variations was obtained from:

http://www.newgeneralservicelist.org/s/AWPngslbsl-twcg.zip: In AntWordProfiler compatible format.
http://www.newgeneralservicelist.org/bsl-business-service-list/: BSL home & download page.

TOEIC Service List

Based on a 1.5 million word corpus of various TOEIC preparation materials, the 1200 words of the TSL 1.1 version give up to 99% coverage of TOEIC materials and tests when combined with the 2800 words of the NGSL.

Wordlist with variations was obtained from:

http://www.newgeneralservicelist.org/s/AWPngsltsl.zip: In AntWordProfiler compatible format.
http://www.newgeneralservicelist.org/toeic-list/: The TOEIC Service List home page.

KET wordlist

The KET Vocabulary List gives teachers a guide to the vocabulary needed when preparing students for the KET and KET for Schools examinations.

The list covers vocabulary appropriate to the A2 level on the CEFR.

http://www.cambridgeenglish.org/images/22105-ket-vocabulary-list.pdf: Key English Test (KET) Vocabulary List © UCLES 2012.

PET wordlist

The Preliminary and Preliminary for Schools Vocabulary List gives teachers a guide to the vocabulary needed when preparing students for the Preliminary and Preliminary for Schools examinations.

The list covers vocabulary appropriate to the B1 level on the CEFR.

BNC+COCA wordlist

Paul Nation prepared a frequency wordlist from combined BNC and COCA corpus:

http://www.victoria.ac.nz/lals/about/staff/paul-nation: Paul Nation's home page and list download page.
https://simple.wiktionary.org/wiki/Wiktionary:BNC_spoken_freq: About list on Wikimedia.

It has 25000 basewords (each baseword comes with variations) split into chunks of 1000 words.

I got the list from:

http://www.laurenceanthony.net/software/antwordprofiler/: Laurence Anthony's AntWordProfiler home page.
https://www.laurenceanthony.net/resources/wordlists/bnc_coca_cleaned_ver_002_20141015.zip: Direct download link with 25k words + extra (dated 2014).
https://www.wgtn.ac.nz/lals/resources/paul-nations-resources/vocabulary-lists: Paul's page at Victoria University with download of wordlist (first 10k).

Oxford 3000/5000

https://www.oxfordlearnersdictionaries.com/wordlists/: Based on extensive corpora and aligned to the CEFR.

Miscellaneous wordlists

The Dolch word list is a list of frequently used English words compiled by Edward William Dolch. The list was prepared in 1936 and was originally published in his book Problems in Reading in 1948. Dolch compiled the list based on children's books of his era. The list contains 220 "service words". The compilation excludes nouns, which comprise a separate 95-word list.

The Dolch wordlist is already covered by gadict.

https://en.wikipedia.org/wiki/Dolch_word_list: Wikipedia article with list itself.

The Leipzig-Jakarta list is a 100-word word list used by linguists to test the degree of chronological separation of languages by comparing words that are resistant to borrowing. The Leipzig-Jakarta list became available in 2009.

The Leipzig-Jakarta wordlist is already covered by gadict.

https://en.wikipedia.org/wiki/Leipzig%E2%80%93Jakarta_list: Wikipedia article with list itself.

The words in the Swadesh lists were chosen for their universal, culturally independent availability in as many languages as possible. Swadesh's final list, published in 1971, contains 100 terms.

The Swadesh wordlist is already covered by gadict except for some rare words.

https://en.wikipedia.org/wiki/Swadesh_list

Typing IPA chars in Emacs

For entering IPA chars use IPA input method. To enable it type:

C-u C-\ ipa <enter>

All letters of the alphabet are typed as usual. To type special IPA chars use the following key bindings (or read the help in Emacs with M-x describe-input-method or C-h I).

For vowels:

æ  ae
ɑ  o| or A
ɒ  |o  or /A
ʊ  U
ɛ  /3 or E
ɔ  /c
ə  /e
ʌ  /v
ɪ  I

For consonants:

θ  th
ð  dh
ʃ  sh
ʧ  tsh
ʒ  zh or 3
ŋ  ng
ɡ  g
ɹ  /r

Special chars:

ː  : (colon)
ˈ  ' (quote)
ˌ  ` (back quote)

Alternatively use ipa-x-sampa or ipa-kirshenbaum input method (for help type: C-h I ipa-x-sampa RET or C-h I ipa-kirshenbaum RET).