The Dictionary Project

The Dictionary Project is a poetic and theoretical investigation aimed at building software that generates new domain-specific terminology using an ontologically driven text generation system. The software creates not only words that have never existed before but also definitions for those words, simulating the knowledge creation process.

The input to the system is a set of domain-specific source texts (e.g. Homeric epics). The software uses Princeton University's WordNet project as the ontological source that drives definition generation. The process consists of seven stages:

  1. Select and analyze source texts
  2. Building an ontology
  3. Generating new words
  4. Searching for an etymology
  5. Creating a definition
  6. Creating a usage example
  7. Creating a context example

Select and analyze source texts

To create terminology for a specific semantic domain, the system requires source texts from that domain to analyze. For example, to create new terminology for the domain of orthodontics, various texts related to dental techniques would be required. The choice of source texts influences two elements of the definition creation process:

  1. Types of words generated
    Source texts are used in the creation of new words. The new word creation process is based on a statistical analysis of letter groupings (a Markov model variant) from the source texts. One interesting side effect of this process is that the generated words take on a similar "feel" to the vocabulary of the source texts. For example, a set of scientific texts would generate scientific-sounding words. Once created, a new word is checked against a large common vocabulary to ensure it does not already exist.
  2. The ontology created for definitions
    Source texts are also used in the generation of a domain-specific ontology. A frequency analysis is performed on the source texts to extract some of their key terminology. This set of terms is then used to prune a subset ontology out of the larger WordNet ontological system.
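
The frequency analysis step can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `key_terms`, the tiny `STOP_WORDS` set and the sample sentence are all hypothetical, and the project's actual term extraction may work differently.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real run would need a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "on", "with"}

def key_terms(text, top_n=5):
    """Return the top_n most frequent words in text, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

# Invented sample text standing in for a real source text.
sample = "The ship sailed. The ship sank. The sea took the ship and the sea was calm."
print(key_terms(sample, top_n=2))
```

The resulting terms would then serve as the seeds for pruning the larger ontology.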

Building an ontology

The ontology used in the Dictionary Project is the freely available WordNet project developed at Princeton University. WordNet is defined by its authors as a "lexical reference system organized according to principles governing human lexical memory". English nouns, verbs and adjectives are organized into synonym sets, each representing one underlying lexical concept. Different semantic and lexical relations link the synonym sets, providing a rich structure to work with. The current version of the Dictionary Project is limited to the creation of new noun terminology; the remaining parts of speech (verbs and adjectives) are reserved for future investigation.

An ontology is an important element in the definition creation process. It provides the software with a hierarchical typological system which can be used as a reference point for the placement of new terminology into the system. Essentially, it provides a framework from which to hang new definitions, which can themselves be used as knowledge-enhancing extension points. For example, if the software were instructed to generate new terminology for the domain of pets (cats, dogs, etc.), it may decide that a new type of dog should be defined. Through the ontological attributes available to it, the software knows that a dog is a domesticated carnivorous animal possessing a tail, claws and a muzzle. It can extend these existing attributes in any number of ways to differentiate the new type of dog from the more generic concept of dog.

The nouns within WordNet are organized as an inheritance system. Version 2.0 of WordNet, the version used in the project, contains 114,648 unique noun concepts.

An ontology is created for the target domain by pruning the existing WordNet ontology to achieve a more specific ontological subset. The following diagram illustrates the process given Homer's "The Odyssey" as the source text and target domain.
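
One plausible reading of the pruning step is to keep each key term found in the hierarchy together with all of its ancestors. The sketch below assumes exactly that. The `HYPERNYMS` table is a toy, hand-written stand-in for WordNet's noun hierarchy (real WordNet concepts can have more than one parent), and `prune` is a hypothetical name rather than the project's own function.

```python
# Toy child -> parent links standing in for WordNet's noun hierarchy.
# These entries are invented for illustration only.
HYPERNYMS = {
    "entity": None,
    "object": "entity",
    "vessel": "object",
    "ship": "vessel",
    "raft": "vessel",
    "instrument": "object",
    "lyre": "instrument",
    "piano": "instrument",
    "organism": "entity",
    "person": "organism",
    "sailor": "person",
}

def prune(hypernyms, key_terms):
    """Keep each key term found in the hierarchy plus all of its ancestors."""
    kept = {}
    for term in key_terms:
        node = term if term in hypernyms else None
        # Walk upward until the root or an already kept node is reached.
        while node is not None and node not in kept:
            kept[node] = hypernyms[node]
            node = hypernyms[node]
    return kept

# "cyclops" is absent from the toy hierarchy and is simply skipped.
subset = prune(HYPERNYMS, ["ship", "lyre", "cyclops"])
print(sorted(subset))
```

Concepts unrelated to the key terms (here "piano", "raft" and the "person" branch) fall away, leaving a smaller hierarchy that still connects back to the root.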

Generating new words

New words are generated using a statistical method based on letter groupings from the original source texts. This method (a Markov model variation) has a beneficial side effect for the project: the choice of source texts directly shapes the character of the generated words. For example, a selection of scientific papers will generate words which "sound" scientific, which supports the intended goal of domain specificity.
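
A minimal sketch of a letter-level Markov generator is shown below. The project only says it uses "a Markov model variation", so the specifics here are assumptions: a fixed context of two letters, the `^` and `$` boundary markers, the function names, and the minimum word length of three.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # boundary markers (an assumption of this sketch)

def build_model(words, order=2):
    """Map each `order`-letter context to the letters observed after it."""
    model = defaultdict(list)
    for word in words:
        padded = START * order + word.lower() + END
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model

def generate_word(model, order=2, max_len=12, rng=random):
    """Walk the model from the start context until END or max_len letters."""
    context, letters = START * order, []
    while len(letters) < max_len:
        nxt = rng.choice(model[context])
        if nxt == END:
            break
        letters.append(nxt)
        context = context[1:] + nxt
    return "".join(letters)

def new_word(model, vocabulary, order=2, attempts=100, rng=random):
    """Return a generated word that is not in vocabulary, or None."""
    for _ in range(attempts):
        candidate = generate_word(model, order, rng=rng)
        if len(candidate) >= 3 and candidate not in vocabulary:
            return candidate
    return None

# Names from the Odyssey stand in for words drawn from a source text.
source_words = ["odysseus", "telemachus", "penelope", "poseidon",
                "polyphemus", "athena", "ithaca"]
model = build_model(source_words)
print(new_word(model, set(source_words)))
```

Because duplicate letters are stored in each context's list, frequent letter transitions in the source are proportionally more likely in the output, which is what gives generated words the "feel" of the source vocabulary.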

Searching for an etymology

To augment the definition, the software searches within the newly generated word for an etymological basis. Currently, only Latin and Greek etymological analysis is supported.
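
The text does not say how the etymological search works. One simple possibility, sketched here purely as an assumption, is to scan the new word for known Latin and Greek roots. The `ROOTS` table and the `find_roots` function are hypothetical, and the table holds only a handful of entries for illustration.

```python
# Small illustrative root table: root -> (language, gloss).
ROOTS = {
    "aqua": ("Latin", "water"),
    "terr": ("Latin", "earth"),
    "naut": ("Greek", "sailor"),
    "therm": ("Greek", "heat"),
    "lith": ("Greek", "stone"),
}

def find_roots(word):
    """Return (position, root, language, gloss) for each root found in word."""
    word = word.lower()
    hits = []
    for root, (language, gloss) in ROOTS.items():
        pos = word.find(root)
        if pos != -1:
            hits.append((pos, root, language, gloss))
    return sorted(hits)  # ordered by position within the word

for hit in find_roots("thermolith"):
    print(hit)
```

A generated word with no recognizable roots simply yields an empty list, in which case no etymology would be attached to its definition.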

Creating a definition

Definition generation consists of three phases:

  1. The newly generated word is assigned as a conceptual subtype of an existing concept within the domain-specific ontology (e.g. X is a type of musical instrument, where X is the newly generated word).
  2. Semantic relationships to other concepts and a restricted vocabulary to use in the definition are calculated.
  3. The assigned concept subtype, semantic relationships and vocabulary are fed to a template-based text generation system to create a defining statement about the new concept.
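
The three phases above can be sketched as a small template filler. Everything concrete in this example is assumed: the `ONTOLOGY` entries, the `TEMPLATE` wording, the `define` function and the sample word "thalmoros" are invented for illustration, and random choice stands in for whatever selection logic the project actually uses.

```python
import random

# Hypothetical pruned ontology: each concept lists related parts and uses.
ONTOLOGY = {
    "musical instrument": {
        "parts": ["strings", "a hollow body", "a bronze frame"],
        "uses": ["accompanying song", "signalling across water"],
    },
    "vessel": {
        "parts": ["a single mast", "a curved prow", "twenty oars"],
        "uses": ["coastal voyages", "carrying grain"],
    },
}

TEMPLATE = "{word} (n.): a type of {parent} with {part}, used for {use}."

def define(word, ontology, rng=random):
    """Build a one-sentence definition for a newly generated noun."""
    # Phase 1: attach the new word as a subtype of an existing concept.
    parent = rng.choice(sorted(ontology))
    # Phase 2: pick related concepts and vocabulary for the definition.
    part = rng.choice(ontology[parent]["parts"])
    use = rng.choice(ontology[parent]["uses"])
    # Phase 3: feed the pieces to the text template.
    return TEMPLATE.format(word=word, parent=parent, part=part, use=use)

print(define("thalmoros", ONTOLOGY, random.Random(7)))
```

Passing a seeded `random.Random` makes a run repeatable, which is convenient when comparing templates.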

Creating a usage example

A usage example is also provided for some new definitions to show concretely how the new term may be used in a syntactically correct sentence. Usage examples are created using a verbal grammar system based on the University of Pennsylvania's VerbNet system. VerbNet is a verb lexicon that is compatible with WordNet but contains both syntactic and semantic information, whereas WordNet holds no syntactic information for words.
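
As a rough illustration of frame-based sentence generation, the sketch below fills a simple "Agent V Theme" pattern with the new noun. It does not use VerbNet's actual data or file format: the `FRAMES` and `PAST_TENSE` tables, the `usage_example` function and the sample word are hypothetical stand-ins loosely modelled on the idea of verb frames with thematic roles.

```python
# Hypothetical verb frames: each verb maps to an ordered list of slots.
FRAMES = {
    "carry": ["Agent", "V", "Theme"],
    "play": ["Agent", "V", "Theme"],
}

# Hand-written past-tense forms for the verbs above.
PAST_TENSE = {"carry": "carried", "play": "played"}

def usage_example(new_word, verb, agent="the sailor"):
    """Fill the verb's frame, placing the new noun in the Theme slot."""
    fillers = {
        "Agent": agent,
        "V": PAST_TENSE[verb],
        "Theme": "the " + new_word,
    }
    sentence = " ".join(fillers[slot] for slot in FRAMES[verb])
    return sentence[0].upper() + sentence[1:] + "."

print(usage_example("thalmoros", "play"))  # The sailor played the thalmoros.
```

The frame supplies the syntax that the ontology alone cannot, which is the role the text assigns to VerbNet.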

Creating a context example

A context example is also generated for some new terms. Context examples are designed to show how the new term may be used in popular or historical contexts (e.g. song lyrics, novel excerpts, speeches, quotations, etc.). They are generated using a template-based text generation system similar to the one used for creating the term definition.
