Journal Naše řeč

Problémy automatické morfologické disambiguace češtiny

Vladimír Petkevič

[Articles]


Problems of automatic morphological disambiguation of Czech

The article focuses on some of the main problems in current automatic morphological disambiguation of Czech. Following a description of the methods used for disambiguating Czech texts and of their accuracy, the author discusses the main reasons why correct morphological disambiguation of the Czech texts in the corpora of the SYN series of the Czech National Corpus project is very difficult to achieve, and why, notwithstanding an improvement in disambiguation (e.g. the SYN2013PUB corpus is tagged better than the SYN2000 corpus), much work remains to be done. The author concentrates exclusively on the problems of rule-based disambiguation rather than stochastic disambiguation, trying to identify areas where disambiguation could be improved in the future. The necessity of reliable disambiguation of Czech texts as a key prerequisite for their successful subsequent syntactic analysis is also stressed.

Key words: automatic morphological disambiguation, corpora of the SYN series, improvement in tagging, rule-based and stochastic disambiguation
Klíčová slova: automatická morfologická disambiguace, korpusy řady SYN, zlepšení značkování, disambiguace pomocí pravidel a disambiguace stochastická

The text is available online in the CEEOL database.

Ústav teoretické a komputační lingvistiky FF UK
Celetná 13, 110 00 Praha 1
vladimir.petkevic@ff.cuni.cz

Naše řeč, volume 97 (2014), issue 4–5, pp. 194–207
