WordGraph Research Project
Introduction
Most natural language processing systems rely heavily on information about words and their meanings as provided by a lexicon. However, a lexicon is never complete. Language evolves constantly, driven among other things by morphological productivity, sense extensions, loans from other languages, and the steady introduction of new technological and scientific terminology. Since the manual maintenance of lexicons is not only slow but also susceptible to inconsistencies, automatic acquisition of lexical information has become an important research area and a practical necessity for large systems working with real data.
The goal of the DFG-sponsored research project WordGraph is to develop new approaches for the acquisition of lexical information from text corpora. These approaches are based on graph theory.
Relationships between words in a text can be naturally represented by a graph which has words as nodes and relationships between them as edges. The nodes and edges in such a textual graph are of various types. Node types correspond to word classes (e.g. nouns, verbs, adjectives), and edge types represent different kinds of dependencies between them (e.g. syntactic dependencies, joint occurrence in a coordination, co-occurrence). The meaning of a word is characterized by its relationships (links) to the other words (nodes) in the word graph. The connectivity structure of the word graph thus contains valuable information about words and their meanings.
In particular, we are investigating node similarity algorithms such as SimRank for the induction and extension of bilingual lexicons.
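To make the idea of node similarity concrete, here is a minimal sketch of textbook SimRank on a small undirected word graph. It is not the project's implementation: the page does not say which SimRank variant is used, and the function name `simrank`, the `decay` and `iterations` parameters, and the toy graph are assumptions chosen for illustration. Two words count as similar when their neighbors are similar, and every word is maximally similar to itself.

```python
def simrank(neighbors, decay=0.8, iterations=5):
    """Naive SimRank on an undirected graph.

    `neighbors` maps every node to the set of its adjacent nodes
    (every neighbor must itself be a key). Hypothetical helper,
    only suitable for toy graphs: the cost is quadratic in the
    number of nodes.
    """
    nodes = list(neighbors)
    # Start from the identity: a node is only similar to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif not neighbors[a] or not neighbors[b]:
                    new[(a, b)] = 0.0
                else:
                    # Average similarity over all pairs of neighbors,
                    # damped by the decay factor.
                    total = sum(sim[(i, j)]
                                for i in neighbors[a]
                                for j in neighbors[b])
                    new[(a, b)] = decay * total / (
                        len(neighbors[a]) * len(neighbors[b]))
        sim = new
    return sim


# Toy graph (invented): "cat" and "dog" share the neighbor "pet".
toy_graph = {
    "cat": {"purr", "pet"},
    "dog": {"pet", "bark"},
    "purr": {"cat"},
    "pet": {"cat", "dog"},
    "bark": {"dog"},
}
scores = simrank(toy_graph)
print(scores[("cat", "dog")])
```

In the toy graph, "cat" and "dog" receive a non-zero score because they share the neighbor "pet", while "purr" and "bark" only become similar in later iterations through the similarity of "cat" and "dog".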
Resources
As part of this ongoing project, we have created resources that we believe to be useful for other researchers working on lexical acquisition as well as for the general NLP research community. We provide these resources as a service to the community.
Noun Coordination Data
Large dataset of nouns that occur together in a coordination (such as "X and Y"). Extracted from Wikipedia.
English noun coordinations (approx. 5.8M coordinations): Download EN data (gzipped, 114MB)
German noun coordinations (approx. 2.2M coordinations): Download DE data (gzipped, 50MB)
Each line contains a single coordination. Each word is annotated with its part-of-speech tag and lemma, separated by slashes: Word/Tag/Lemma
Examples:
complexity/NN/complexity and/CC/and length/NN/length
history/NN/history and/CC/and cultural/JJ/cultural heritage/NN/heritage
Luft/NN/Luft und/KON/und Wasser/NN/Wasser
der/ART/d Starbesetzung/NN/Starbesetzung und/KON/und der/ART/d technischen/ADJA/technisch Raffinessen/NN/Raffinesse
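A line in this format can be read with a few lines of Python. The following is a minimal sketch, not an official reader: the names `Token`, `parse_coordination` and `lemmas_with_tag_prefix` are hypothetical, and the code assumes that tokens are separated by whitespace and that neither tag nor lemma contains a slash.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str


def parse_coordination(line: str) -> list[Token]:
    """Split one line of Word/Tag/Lemma items into tokens."""
    tokens = []
    for item in line.split():
        # Split from the right so that a slash inside the word form
        # is kept (assumption: tag and lemma contain no slash).
        word, tag, lemma = item.rsplit("/", 2)
        tokens.append(Token(word, tag, lemma))
    return tokens


def lemmas_with_tag_prefix(tokens: list[Token], prefix: str = "NN") -> list[str]:
    """Return the lemmas whose tag starts with `prefix` (default: NN...)."""
    return [t.lemma for t in tokens if t.tag.startswith(prefix)]


line = "history/NN/history and/CC/and cultural/JJ/cultural heritage/NN/heritage"
print(lemmas_with_tag_prefix(parse_coordination(line)))
# prints ['history', 'heritage']
```

Filtering on the tag prefix `NN` matches the common-noun tags seen in the examples; whether other noun tags occur in the data is not stated here, so the prefix may need adjusting.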
Adjective-Noun modification data
List of adjectives modifying nouns (extracted from Wikipedia)
Download EN data (gzipped, 157MB) (32M relationships)
Download DE data (gzipped, 71MB) (12M relationships)
Each line contains a single adjective-noun pair.
Examples:
left-wing ideology
political party
religious leader
chemisch Element
deutsch Film
grell Lampe
Verb-object data
List of verbs and their direct objects (extracted from the Wikipedia-derived parse trees described below).
Download EN data (gzipped, 5.3MB) (11.7M relationships)
Download DE data (gzipped, 1.6MB) (1.6M relationships)
Each line contains a single verb-object pair.
Examples:
turn#off brain
outwit enemy
rouse suspicion
abfahren Strecke
weiterentwickeln Technik
annehmen Ruf
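Pair files like these map directly onto the word graph described in the introduction: each pair is an edge between two word nodes. The sketch below counts how often each pair occurs and stores the result as a nested dictionary. The function name `build_pair_graph` is hypothetical, and the code assumes exactly two whitespace-separated fields per line (the same layout as the adjective-noun data); lines that do not fit are skipped.

```python
from collections import defaultdict


def build_pair_graph(lines):
    """Build {first_word: {second_word: count}} from two-column lines."""
    graph = defaultdict(lambda: defaultdict(int))
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            # Skip empty or malformed lines (assumption: valid lines
            # have exactly two fields).
            continue
        first, second = fields
        graph[first][second] += 1
    return graph


sample = ["turn#off brain", "outwit enemy", "rouse suspicion", "outwit enemy"]
graph = build_pair_graph(sample)
print(dict(graph["outwit"]))
```

For the real files, the same function can be fed from `gzip.open(path, "rt", encoding="utf-8")`; the UTF-8 encoding is an assumption, since the page does not state it.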
Cross-lingual Relatedness Thesaurus
We used graph similarity algorithms to create a bilingual semantic relatedness thesaurus. For every English word, there are ten German words deemed related by the algorithm, and vice versa. The method used to create this resource will be described in a forthcoming publication (accepted at LREC 2010).
English->German relatedness data (approx. 9000 entries): Download EN->DE data
German->English relatedness data (approx. 6000 entries): Download DE->EN data
The data comes in gzipped text files, each containing blocks of one word and ten related words, each on its own line. The lines of the related words are indented with a TAB character. The next block is separated by an empty line.
Example:
(lion,n)
	(Panther,n)
	(Nashorn,n)
	(Löwe,n)
	(Büffel,n)
	(Jaguar,n)
	(Leopard,n)
	(Tiger,n)
	(Puma,n)
	(Elefant,n)
	(Antilope,n)

(Möwe,n)
	(gull,n)
	(swan,n)
	(goose,n)
	(duck,n)
	(teal,n)
	(flamingo,n)
	(loon,n)
	(grebe,n)
	(cormorant,n)
	(tern,n)
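The block layout described above (head word, TAB-indented related words, empty line between blocks) can be read as follows. This is a sketch under assumptions: the names `parse_entry` and `parse_thesaurus` are hypothetical, and each entry is assumed to have the form `(word,pos)` as in the example.

```python
import re

# "(word,pos)": the word may contain commas, the POS label may not.
ENTRY = re.compile(r"^\((.+),([^,()]+)\)$")


def parse_entry(text):
    """Turn '(lion,n)' into ('lion', 'n')."""
    match = ENTRY.match(text.strip())
    if match is None:
        raise ValueError(f"not a (word,pos) entry: {text!r}")
    return match.group(1), match.group(2)


def parse_thesaurus(lines):
    """Map each head entry to the list of its related entries."""
    thesaurus = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            current = None  # an empty line ends the block
        elif line.startswith("\t"):
            if current is None:
                raise ValueError("related word without a head word")
            thesaurus[current].append(parse_entry(line))
        else:
            current = parse_entry(line)
            thesaurus[current] = []
    return thesaurus


sample = "(lion,n)\n\t(Panther,n)\n\t(Löwe,n)\n\n(Möwe,n)\n\t(gull,n)\n"
print(parse_thesaurus(sample.splitlines()))
```

Keeping the part-of-speech label in the key avoids merging entries that share a word form but differ in word class.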
Parsed Wikipedia Data
We have parsed the text of English and German Wikipedia articles using BitPar. This is one of the few large collections of comparable text parsed with the same parser.
English parses (3.4GB, approx. 30M sentences): Download English parses
German parses (1.6GB, approx. 12.7M sentences): Download German parses
The data comes in an archive that bundles gzipped files, each containing about 500 parsed sentences. Each line consists of the parse tree of one sentence, encoded as a structure of nested brackets.
Example sentence "It is one of 58 counties of Gansu.":
(TOP (S/fin/. (NP-SBJ/3s/base (PRP/3s It)) (VP/3s (VBZ/n is) (NP-PRD/pp (NP/base (QP (<QP[CD]IN/of|CD> (CD one) (IN/of of)) (CD 58)) (NNS counties)) (PP/of/NP (IN/of of) (NP/base (NNP Gansu))))) (. .)))
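A bracketed tree like this one can be read with a small recursive parser. The sketch below is illustrative only: `parse_tree` and `leaves` are hypothetical names, and the code assumes that node labels and words never contain parentheses or whitespace, which holds for the example sentence but is not guaranteed for the whole corpus.

```python
import re

# A token is an opening bracket, a closing bracket, or any run of
# characters that is neither whitespace nor a bracket.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse_tree(text):
    """Parse '(LABEL child ...)' into (label, [children]); leaves are strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(tree):
    """Return the words of the sentence in order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


line = ("(TOP (S/fin/. (NP-SBJ/3s/base (PRP/3s It)) (VP/3s (VBZ/n is) "
        "(NP-PRD/pp (NP/base (QP (<QP[CD]IN/of|CD> (CD one) (IN/of of)) "
        "(CD 58)) (NNS counties)) (PP/of/NP (IN/of of) (NP/base (NNP Gansu))))) "
        "(. .)))")
print(" ".join(leaves(parse_tree(line))))
# prints: It is one of 58 counties of Gansu .
```

Truncated input with unbalanced brackets raises an `IndexError` in this sketch; a production reader would want to catch that and report the offending line.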
Co-occurrence Data
List of co-occurring word tuples (extracted from Wikipedia). Coming soon.
Lexicon Induction Test Dataset
Comparative evaluation of methods for bilingual lexicon induction is hampered by the lack of a common evaluation methodology and a common test dataset. Together with Reinhard Rapp (Johannes Gutenberg University Mainz), we propose a common test dataset for the evaluation of lexicon induction experiments. We hope that this data will serve as a basis for a standard evaluation.