
Stanford POS Tagger, v3.7.0 - 2016-10-31
Copyright (c) 2002-2012 The Board of Trustees of
The Leland Stanford Junior University. All Rights Reserved.

This document describes the models included in this release, as well as
those that may be downloaded from the POS tagger website at
http://nlp.stanford.edu/software/tagger.shtml . If you have downloaded
the full tagger, all of the models mentioned in this document are in the
downloaded package, in the same directory as this readme. Otherwise, the
download includes two English taggers, and the remaining taggers may be
downloaded from the website. Each tagger is accompanied by the props
file used to create it; please examine these files for more detailed
information about how the taggers were trained.

For English, the bidirectional taggers are slightly more accurate but
tag much more slowly; choose the appropriate tagger based on your
speed/accuracy needs.
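By way of illustration, the standard command-line invocation looks
roughly like this. This is a sketch, not a tested recipe: it assumes the
full download with stanford-postagger.jar in the current directory, and
an input file sample.txt of your own; adjust paths to your setup.

```shell
# Tag a plain-text file with a chosen model (here the faster left3words model).
# Jar and model paths are assumptions; point them at your unpacked download.
java -mx600m -classpath stanford-postagger.jar \
  edu.stanford.nlp.tagger.maxent.MaxentTagger \
  -model models/english-left3words-distsim.tagger \
  -textFile sample.txt > sample.tagged
```

Swapping in english-bidirectional-distsim.tagger trades speed for the
slightly higher accuracy noted above.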

English taggers
---------------------------
wsj-0-18-bidirectional-distsim.tagger
Trained on WSJ sections 0-18 using a bidirectional architecture and
including word shape and distributional similarity features.
Penn Treebank tagset.
Performance:
97.28% correct on WSJ 19-21
(90.46% correct on unknown words)

wsj-0-18-left3words.tagger
Trained on WSJ sections 0-18 using the left3words architecture and
including word shape features. Penn tagset.
Performance:
96.97% correct on WSJ 19-21
(88.85% correct on unknown words)

wsj-0-18-left3words-distsim.tagger
Trained on WSJ sections 0-18 using the left3words architecture and
including word shape and distributional similarity features. Penn tagset.
Performance:
97.01% correct on WSJ 19-21
(89.81% correct on unknown words)

english-left3words-distsim.tagger
Trained on WSJ sections 0-18 and extra parser training data using the
left3words architecture and including word shape and distributional
similarity features. Penn tagset.

english-bidirectional-distsim.tagger
Trained on WSJ sections 0-18 using a bidirectional architecture and
including word shape and distributional similarity features.
Penn Treebank tagset.

wsj-0-18-caseless-left3words-distsim.tagger
Trained on WSJ sections 0-18 using the left3words architecture and
including word shape and distributional similarity features.
Penn tagset. Ignores case.

english-caseless-left3words-distsim.tagger
Trained on WSJ sections 0-18 and extra parser training data using the
left3words architecture and including word shape and distributional
similarity features. Penn tagset. Ignores case.
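Any of the models above can also be loaded programmatically through the
tagger's Java API. The sketch below assumes stanford-postagger.jar is on
the classpath and that the model path matches the layout of the unpacked
full download; the demo class name is illustrative.

```java
// Sketch only: requires stanford-postagger.jar on the classpath to compile and run.
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class TagDemo {
    public static void main(String[] args) {
        // Model path is relative to the unpacked full-tagger directory (an assumption).
        MaxentTagger tagger =
            new MaxentTagger("models/english-left3words-distsim.tagger");
        // tagString returns the input with a Penn Treebank tag appended to each token.
        String tagged = tagger.tagString("The quick brown fox jumps over the lazy dog.");
        System.out.println(tagged);
    }
}
```

Loading a model is the expensive step; construct the MaxentTagger once
and reuse it across all the text you need to tag.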


Chinese tagger
---------------------------
chinese-nodistsim.tagger
Trained on a combination of CTB7 texts from Chinese and Hong Kong
sources.
LDC Chinese Treebank POS tag set.
Performance:
93.46% on a combination of Chinese and Hong Kong texts
(79.40% on unknown words)

chinese-distsim.tagger
Trained on a combination of CTB7 texts from Chinese and Hong Kong
sources with distributional similarity clusters.
LDC Chinese Treebank POS tag set.
Performance:
93.99% on a combination of Chinese and Hong Kong texts
(84.60% on unknown words)

Arabic tagger
---------------------------
arabic.tagger
Trained on the *entire* ATB p1-3.
When trained on the train part of the ATB p1-3 split done for the 2005
JHU Summer Workshop (Diab split), using (augmented) Bies tags, it gets
the following performance:
96.26% on test portion according to Diab split
(80.14% on unknown words)

French tagger
---------------------------
french.tagger
Trained on the French treebank.

German tagger
---------------------------
german-hgc.tagger
Trained on the first 80% of the Negra corpus, which uses the STTS tagset.
The Stuttgart-Tübingen Tagset (STTS) is a set of 54 tags for annotating
German text corpora with part-of-speech labels, which was jointly
developed by the Institut für maschinelle Sprachverarbeitung of the
University of Stuttgart and the Seminar für Sprachwissenschaft of the
University of Tübingen. See:
http://www.ims.uni-stuttgart.de/projekte/CQPDemos/Bundestag/help-tagset.html
This model uses features from the distributional similarity clusters
built over the HGC (Huge German Corpus).
Performance:
96.90% on the first half of the remaining 20% of the Negra corpus (dev set)
(90.33% on unknown words)

german-fast.tagger
Lacks distributional similarity features, but is several times faster
than the other alternatives.
Performance:
96.61% overall
(86.72% on unknown words)

german-ud.tagger
This model produces Universal Dependencies POS tags rather than STTS tags.