Differences between revisions 6 and 24 (spanning 18 versions)
Revision 6 as of 2009-09-23 09:31:49
Editor: FlorianLaws
Revision 24 as of 2014-05-19 06:47:08
= WordGraph Research Project =
Most natural language processing systems rely heavily on information about words and their meanings as provided by a lexicon.
However, a lexicon is never complete. Language evolves constantly, driven among other things by morphological productivity, sense extensions, loans from other languages, and the steady introduction of new technological and scientific terminology.
Since the manual maintenance of lexicons is not only slow, but also susceptible to inconsistencies, automatic acquisition of lexical information has become an important research area and a practical necessity for large systems working with real data.

The goal of the DFG-sponsored research project ''!WordGraph'' is to develop new approaches for the acquisition of lexical information from text corpora. These approaches are based on graph theory.

Relationships between words in a text can be naturally represented by a graph which has words as nodes and relationships between them as edges. The nodes and edges in such a textual graph are of various types. Node types correspond to word classes (e.g. nouns, verbs, adjectives), and edge types represent different kinds of dependencies between them (e.g. syntactic dependencies, joint occurrence in a coordination, co-occurrence). The meaning of a word is characterized by its relationships (links) to the other words (nodes) in the word graph. The connectivity structure of the word graph thus contains valuable information about words and their meanings.
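
The typed graph described above can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: the class name `TypedWordGraph`, its method names, the edge type labels, and the choice of undirected edges are all assumptions made for the example.

```python
from collections import defaultdict


class TypedWordGraph:
    """Toy word graph with typed nodes and typed, undirected edges.

    Hypothetical structure for illustration; the project's actual
    data structures are not described in this page.
    """

    def __init__(self):
        self.node_type = {}            # word -> word class, e.g. "noun"
        self.edges = defaultdict(set)  # word -> {(neighbour, edge_type)}

    def add_node(self, word, word_class):
        self.node_type[word] = word_class

    def add_edge(self, source, target, edge_type):
        # Assumption: edges are stored in both directions (undirected).
        self.edges[source].add((target, edge_type))
        self.edges[target].add((source, edge_type))

    def neighbours(self, word, edge_type=None):
        """Return the sorted neighbours of a word, optionally by edge type."""
        return sorted(
            other
            for other, kind in self.edges.get(word, ())
            if edge_type is None or kind == edge_type
        )


graph = TypedWordGraph()
graph.add_node("water", "noun")
graph.add_node("air", "noun")
graph.add_node("drink", "verb")
graph.add_edge("air", "water", "coordination")         # "air and water"
graph.add_edge("drink", "water", "syntactic-dependency")  # "drink water"

print(graph.neighbours("water"))
print(graph.neighbours("water", edge_type="coordination"))
```

In this representation, the set of typed links returned by `neighbours` is what characterizes a word.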


In particular, we are investigating node similarity algorithms such as !SimRank for the induction and extension of bilingual lexicons.
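
For orientation, the basic SimRank recursion can be sketched as follows. This is the plain monolingual formulation on an undirected toy graph, not the bilingual variant investigated in the project; the function name `simrank`, the decay value 0.8, the iteration count, and the example words are assumptions chosen for the illustration.

```python
def simrank(neighbours, decay=0.8, iterations=10):
    """Naive SimRank on a graph given as {node: [neighbour, ...]}.

    Two nodes are similar if their neighbours are similar; every node
    is maximally similar to itself. Each neighbour must itself be a key
    of the dictionary. Runs in O(n^2 * d^2) per iteration, so it is
    only suitable for small examples.
    """
    nodes = list(neighbours)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                elif not neighbours[a] or not neighbours[b]:
                    new[a][b] = 0.0
                else:
                    total = sum(
                        sim[i][j] for i in neighbours[a] for j in neighbours[b]
                    )
                    norm = len(neighbours[a]) * len(neighbours[b])
                    new[a][b] = decay * total / norm
        sim = new
    return sim


# Toy graph: "air" and "water" are both linked to "clean".
toy = {"air": ["clean"], "water": ["clean"], "clean": ["air", "water"]}
scores = simrank(toy)
print(scores["air"]["water"])
```

Here "air" and "water" receive a high score because they share their only neighbour. For lexicon induction, the graphs of two languages would additionally have to be connected, for example through a seed lexicon; that step is not shown in this sketch.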


== Resources ==

As part of this ongoing project, we have created resources that we believe to be useful for other researchers working on lexical acquisition as well as for the general NLP research community. We provide these resources as a service to the community.


=== Parsed Wikipedia data ===

We have parsed the text of English and German Wikipedia articles using [[http://www.ims.uni-stuttgart.de/tcl/SOFTWARE/BitPar.html|BitPar]].
This is one of the few large collections of comparable text parsed with the same parser.

 * English parses (3.4 GB, approx. 30M sentences): [[http://www.ims.uni-stuttgart.de/tcl/RESOURCES/WordGraph/parsed-english-wp.tar|Download English parses]]
 * German parses (1.6 GB): [[http://www.ims.uni-stuttgart.de/tcl/RESOURCES/WordGraph/parsed-german-wp.tar|Download German parses]]

The data comes in an archive that bundles gzipped files, each containing about 500 parsed sentences.
Each line contains the parse tree of one sentence, encoded as a structure of nested brackets.
Example sentence (rewrapped for better readability):
{{{
(TOP (S/fin/. (NP-SBJ/3s/base (PRP/3s It))
              (VP/3s (VBZ/n is)
                     (NP-PRD/pp (NP/base (QP (\<QP\[CD\]IN/of|CD\> (CD one)(IN/of of))
                                (CD 58))(NNS counties))
                                (PP/of/NP (IN/of of)(NP/base (NNP Gansu)))))(. .)))
}}}
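
A possible way to read this format in Python is sketched below. The function names `iter_parse_lines`, `parse_tree`, and `leaves` are hypothetical, and the sketch assumes that every regular file in the archive is gzip-compressed, that the text is UTF-8 encoded, and that category labels and words never contain parentheses or whitespace.

```python
import gzip
import re
import tarfile

# A token is an opening bracket, a closing bracket, or a run of other
# non-whitespace characters (a category label or a word).
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def iter_parse_lines(archive_path):
    """Yield one bracketed parse per non-empty line from the tar archive.

    Assumption: each regular member is a gzipped UTF-8 text file.
    """
    with tarfile.open(archive_path, "r") as archive:
        for member in archive:
            if not member.isfile():
                continue
            raw = archive.extractfile(member)
            with gzip.open(raw, "rt", encoding="utf-8") as lines:
                for line in lines:
                    line = line.strip()
                    if line:
                        yield line


def parse_tree(text):
    """Turn a bracketed parse into nested (label, children) tuples.

    Children are either subtrees or plain word strings.
    """
    tokens = TOKEN.findall(text)
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the parse tree")
    return tree


def leaves(tree):
    """Return the words of the sentence in their original order."""
    words = []
    for child in tree[1]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


example = r"""(TOP (S/fin/. (NP-SBJ/3s/base (PRP/3s It))
  (VP/3s (VBZ/n is)
    (NP-PRD/pp (NP/base (QP (\<QP\[CD\]IN/of|CD\> (CD one)(IN/of of))
      (CD 58))(NNS counties))
      (PP/of/NP (IN/of of)(NP/base (NNP Gansu)))))(. .)))"""

print(" ".join(leaves(parse_tree(example))))
```

The parser ignores line breaks, so it accepts both the one-line form used in the data files and the rewrapped example shown above.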





=== Noun-Noun coordination data ===

A large dataset of nouns that occur together in a coordination (such as "X and Y"), extracted from Wikipedia.

 * English noun coordinations (approx. 5M coordinations): [[http://www.ims.uni-stuttgart.de/tcl/RESOURCES/WordGraph/en-noun-coordinations.txt.gz|Download English coordinations]]
 * German noun coordinations (approx. 2M coordinations): [[http://www.ims.uni-stuttgart.de/tcl/RESOURCES/WordGraph/de-noun-coordinations.txt.gz|Download German coordinations]]

The data comes as gzipped text files. Each line contains a single coordination.
Each word is annotated with its part-of-speech tag and lemma, separated by slashes: Word/Tag/Lemma

Examples:
{{{
Luft/NN/Luft und/KON/und Wasser/NN/Wasser
der/ART/d Starbesetzung/NN/Starbesetzung und/KON/und der/ART/d technischen/ADJA/technisch Raffinessen/NN/Raffinesse

complexity/NN/complexity and/CC/and length/NN/length
history/NN/history and/CC/and cultural/JJ/cultural heritage/NN/heritage
}}}
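
The Word/Tag/Lemma format can be read with a few lines of Python. This is a minimal sketch: the names `Token`, `parse_coordination`, and `noun_lemmas` are hypothetical, and selecting nouns by tags that start with `NN` is an assumption based only on the examples shown above.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str


def parse_coordination(line: str) -> List[Token]:
    """Split one line of the dataset into (word, tag, lemma) tokens.

    Splitting from the right keeps a slash inside the word form intact;
    this assumes that tags and lemmas themselves contain no slash.
    """
    tokens = []
    for item in line.split():
        word, tag, lemma = item.rsplit("/", 2)
        tokens.append(Token(word, tag, lemma))
    return tokens


def noun_lemmas(tokens: List[Token]) -> List[str]:
    """Return the lemmas of all tokens whose tag starts with 'NN'."""
    return [token.lemma for token in tokens if token.tag.startswith("NN")]


german = parse_coordination(
    "der/ART/d Starbesetzung/NN/Starbesetzung und/KON/und "
    "der/ART/d technischen/ADJA/technisch Raffinessen/NN/Raffinesse"
)
english = parse_coordination(
    "history/NN/history and/CC/and cultural/JJ/cultural heritage/NN/heritage"
)
print(noun_lemmas(german))
print(noun_lemmas(english))
```

Each file can be opened line by line with `gzip.open(path, "rt", encoding="utf-8")`; the UTF-8 encoding is likewise an assumption.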



=== Co-occurrence data ===

A list of co-occurring word tuples, extracted from Wikipedia.

=== Lexicon induction test dataset ===

Comparative evaluation of methods for bilingual lexicon induction is hampered by the lack of a common evaluation methodology and a common test dataset. Together with [[http://www.fask.uni-mainz.de/user/rapp/|Reinhard Rapp]] (Johannes Gutenberg University Mainz), we propose a common test dataset for the evaluation of lexicon induction experiments. We hope that this data will serve as a basis for a standard evaluation.
Please visit: http://www.ims.uni-stuttgart.de/forschung/projekte/WordGraph.en.html or http://www.cis.uni-muenchen.de/forschung/statnlp-ir/wordgraph/index.html

extern/WordGraph (last edited 2014-05-19 06:47:08 by AndreBlessing)