Skip to content

StrombergNLP/bornholmsk

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Bornholmsk

Language processing resources and tools for Bornholmsk, a language spoken on the island of Bornholm, with roots in Danish and closely related to Scanian. Includes corpora, word vectors, and parallel Bornholmsk/Danish text.

The language resources are in resources/. Processed corpora are in the top directory, including the parallel text for machine translation (parallel.*). The CC embeddings are 300-dimension and aligned with the FastText multilingual space using the algorithm in align.py. The .clean corpus skips some less-canonical resources. The resources/ directory also contains some source PDFs, including a good scan of Kuhre's Sansager and Espersen's 1908 Ordbog (both public domain). The Kuhre is split into chapters and aligned, tidied Danish and Bornholmsk. The split-chapter versions have had some slight edits to ensure 1:1 sentence alignment. There are also a few selected works by Otto Lunch. Note that the resources/ contains also some Danish gazetteers, mostly for anchoring the machine translation system, which can otherwise be ignored.

Licensing

Annotations licensed CC-BY. Some material public domain (e.g. CC0). The embeddings are licensed CC-BY (open for use, attribution is required). If you use this data/tools, you must acknowledge their source.

Contact

Contact: Leon Strømberg-Derczynski, ITU Copenhagen (ld@itu.dk)

Attribution

Some data is credited to Allan B. Hansen / Gubbana.dk, to Espersen's 1908 dictionary, to Dion Westh / Bornholmersnak.dk, and to Wikipedia / The Wikimedia Foundation.

Acknowledgment

Credit this as "Derczynski, L. S., & Kjeldsen, A. S. (2019, October). Bornholmsk Natural Language Processing: Resources and Tools. In Proceedings of the 22nd Nordic Conference on Computational Linguistics. NEALT."

Bibtex is:

@inproceedings{derczynski2019bornholmsk,
  title={Bornholmsk Natural Language Processing: Resources and Tools},
  author={Derczynski, Leon Str{\o}mberg and Copenhagen, ITU and Kjeldsen, Alex Speed},
  booktitle={Proceedings of the 22nd Nordic Conference on Computational Linguistics},
  year={2019},
  pages={338--410},
  publisher={NEALT}
}