Integration of WordNet 1.7 in WebKB-2

The reader is assumed to have already explored the shared ontology of WebKB-2 a little. If this is not the case, please do so, starting with the Category Search interface, or the following simple text field (enter an English noun). Then, come back to this document. Note: this document contains various links, including links to the ontology in various formats and to all the corrections that I have made; click here if you are after a published article that summarizes the re-use, correction and extension of WordNet 1.7, the adopted methodology and the top-level ontology.

Reminder: the links subtypeOf, subtype, instanceOf, instance, equal, similar, exclusion and closed_exclusion are respectively represented by the characters: <, >, ^, :, =, ~, ! and /. The content-oriented links (part, location, member, substance and object) are not abbreviated anymore.
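For readers who process such statements with scripts, the abbreviations listed in the reminder above can be kept in a small lookup table. This is only an illustrative sketch: the names LINK_ABBREVIATIONS and expand_link are hypothetical and not part of WebKB-2.

```python
# Link names and their one-character abbreviations, as listed in the reminder above.
LINK_ABBREVIATIONS = {
    "subtypeOf": "<",
    "subtype": ">",
    "instanceOf": "^",
    "instance": ":",
    "equal": "=",
    "similar": "~",
    "exclusion": "!",
    "closed_exclusion": "/",
}

# Reverse table: abbreviation -> link name.
ABBREVIATION_TO_LINK = {symbol: name for name, symbol in LINK_ABBREVIATIONS.items()}


def expand_link(symbol):
    """Return the full link name for a one-character abbreviation."""
    return ABBREVIATION_TO_LINK[symbol]
```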

WordNet is a lexical database connecting English words/expressions to categories representing their meanings. It can also be seen as an ontology for natural language since the categories are connected by various kinds of semantic links, e.g. generalization, similar, exclusion, member, part and substance.

To ease and guide knowledge re-use/sharing/retrieval/entering, I initialized the knowledge base (KB) of WebKB-2 with the content of WordNet 1.7 related to nouns: 108,000 nouns and 74,500 categories referred to by nouns (in accordance with my lexical recommendations, I ignored information regarding verbs, adverbs and adjectives).

A first problem was that, although WordNet categories have intuitive names (English nouns or nominal expressions), they do not have intuitive identifiers (the WordNet API mainly uses numbers). Intuitive identifiers are mandatory for permitting people to read, write and update knowledge statements in text files, i.e. outside the graphical interface of a particular tool. This is a minimal requirement for knowledge sharing/re-use and also greatly simplifies the development of knowledge-based tools.
Hence, I designed an algorithm to create intuitive identifiers for WordNet categories based on their names. This algorithm combines various heuristics I learnt from many trials. Although the final version worked quite well, I still had to update a few generated category identifiers manually. This algorithm is detailed below.

A second problem was that WordNet has a poorly structured top-level and does not always classify categories according to distinctions that are important for the use of these categories in knowledge representations. To permit this use and some semantic checks on it, I have inserted WordNet top-level categories and some medium-level categories into a top-level ontology synthesizing and complementing various other top-level ontologies. This has led me to break a few WordNet generalization links. Click here for rationales behind my top-level ontology. Click here to explore the first specializations of my top-level concept type pm#thing, and here to explore the specializations of my top-level relation type pm#relation.

A third problem was that WordNet merges subtypeOf and instanceOf links into generalization links; in other words, it does not distinguish types from individuals (categories that cannot have subtypes or instances). This distinction is important for knowledge checking. However, the instanceOf link should not be over-used: to avoid forcing arbitrary choices or to compensate for wrong choices, WebKB-2 permits the use of certain types without quantifiers (i.e. as if they were individuals) within statements. I have isolated 6211 true individuals within WordNet 1.7. Click here for a list.

A fourth problem was that WordNet 1.7 contains inconsistencies and redundancies. Conversely, some categories for common English words are missing. Click here for a list of my semantic corrections (more than 300) and additions (more than 150). It should be noted that most of the inconsistencies I corrected were automatically detected thanks to the exclusive links in my top-level ontology (and, as mentioned above, the generalization of WordNet categories by categories in my top-level ontology). Two kinds of links, equal ('=') and location ('l'), had to be introduced to correct certain erroneous uses of the generalization link.

A fifth problem was that some categories did not have explicit enough names, or their ordering was not correct (category names in WordNet are ordered by decreasing frequency of use, but this ordering is generated from a few concordance files and therefore can be misleading). Click here for a list of the lexical modifications that I made to the WordNet ontology.

A sixth problem for knowledge representation is the lack of structure in WordNet and the fact that many categories have a lexical rather than semantic nature. Some structure was added via semantic links (the 161 additions cited above). I also added sub-annotations at the beginning of some category annotations, e.g. $(value)$ to represent the fact that the category represents a value, and $(artificial)$ to represent that it has a lexical nature and/or should not be used for knowledge representation. Click here for a list of value/artificial sub-annotations.

Finally, it should be noted that the semantics of the links part, member, substance and object in WordNet is not always clear or consistent. For instance, does a part link from the category airplane to the category wing mean that "any airplane has for part at least 1 wing", "all airplanes have for part the same wing", "any wing is part of a plane", "a wing is part of any plane", etc.? For graph matching (and hence inferencing) in WebKB-2, I have assumed that the first interpretation is correct; however, this is just a heuristic.
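To make the difference between these readings explicit, the first three can be written in first-order logic. This is only an illustrative formalization: the predicate names airplane, wing and part (with part(a, w) read as "a has for part w") are assumptions introduced here, not WebKB-2 notation.

```latex
\begin{align*}
\text{(1)}\quad & \forall a\, \bigl(\mathit{airplane}(a) \rightarrow \exists w\, (\mathit{wing}(w) \land \mathit{part}(a, w))\bigr) \\
\text{(2)}\quad & \exists w\, \bigl(\mathit{wing}(w) \land \forall a\, (\mathit{airplane}(a) \rightarrow \mathit{part}(a, w))\bigr) \\
\text{(3)}\quad & \forall w\, \bigl(\mathit{wing}(w) \rightarrow \exists a\, (\mathit{airplane}(a) \land \mathit{part}(a, w))\bigr)
\end{align*}
```

Reading (1) is "any airplane has for part at least 1 wing", the interpretation assumed for graph matching; (2) is "all airplanes have for part the same wing"; (3) is "any wing is part of a plane".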

I integrated WordNet 1.7 in January 2002. When representing knowledge between January and June, I sometimes updated the key names of the WordNet categories, and occasionally corrected some links, but more and more rarely. The WordNet part of the KB (and my top-level ontology) can now be considered quite stable. Hence, the identifiers can be used by people in their own files, and support knowledge sharing.

It is best to explore and filter parts of the ontology of WebKB-2 via my Category Search tool. However, if you do need all or parts of the ontology, see the file that loads all the other input files, including the top-level ontology and the representation of WordNet (10.3Mb file). These files are up-to-date and in the FT format, which gives a good understanding of the content of the ontology. Old versions of the whole ontology are also available in other formats:

A 35Mb DAML+OIL compliant RDF version was generated in June 2002 (hence, it is not up-to-date, e.g. changes due to the incorporation of the DOLCE ontology are not represented). Since this file includes more than very simple knowledge statements, the RDF representation is sometimes necessarily ad-hoc. It is also impoverished: the creators of links between categories have not been represented.
A 14.1Mb CGIF version was generated in January 2003. I am not entirely sure of this version and the creators of links between categories have not been represented. Use the category search tool with the CGIF option selected to check that this CGIF is ok for you.
The index.noun and data.noun WordNet files in their usual WordNet format. However, since my ontology has several kinds of links that WordNet does not possess, the encodings of certain links between verbs, adverbs and adjectives have been re-used for links between categories for nouns (this permits the re-use of classic WordNet browsers). See the header of data.noun for details (and why this export is very impoverished). The links between categories for nouns and categories for adjectives have also currently been lost but this will be fixed soon (this release may be a bit premature since I have not finished working on the generation of these WordNet files; if you encounter problems please e-mail me).

More top-level ontologies, e.g. from the SUO Library and the DAML Library, will be incorporated into the WebKB-2 knowledge base.

This work has now been published. It has been done to help principled and manual knowledge representation. It is insufficient for the inter-operation of fully automatic software agents, e.g. for e-commerce or database integration purposes; this article by R. Colomb gives some of the reasons why general automatic inter-operation (not pre-programmed business-to-business inter-operation) is not going to happen anytime soon.
My work is also very insufficient to help knowledge-based automatic natural language processing. One of the steps in this direction is provided by the ThoughtTreasure(TM) project and its downloadable resources. The Cyc and OpenCyc projects should of course also be cited. See also the pages about the Open Mind projects and the Natural Language Processing group at USC/ISI.

Generating user-friendly identifiers for WordNet categories

WordNet connects words to categories representing the meanings of these words. Each category has at least one name (word) and each name may be shared by several categories (since a word may have several meanings). Category keys (or "key names") need to be chosen for uniquely representing categories. (I use the expression "key name" instead of "category identifier" because in WebKB an identifier for a category is generally composed of a user identifier, a key name, and optionally other names separated by "__").
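The composition of an identifier described in the parenthesis above can be illustrated as follows. This is a minimal sketch: the function name make_identifier is hypothetical, the "#" separator after the user identifier is inferred from identifiers such as pm#thing used in this document, and the user identifier "wn" in the usage note is only an assumption for the example.

```python
def make_identifier(user, key_name, other_names=()):
    """Build '<user>#<key name>[__<other name>...]' from its components."""
    return f"{user}#" + "__".join([key_name, *other_names])
```

For instance, make_identifier("pm", "thing") gives pm#thing, and make_identifier("wn", "Friday", ["Fri"]) gives wn#Friday__Fri.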

In the WordNet API and database files, a category is referred to either by the offset of its description in one of the database files (e.g. the offset "12558316" for the category with names "Friday" and "Fri"), or by some sense indices, which are the names of the category with some suffixes to make unique key names (e.g. "friday%1:28:00::" and "fri%1:28:00::"; the "1" after the "%" indicates that the name is a noun; the "28" is a number for the lexicographer file containing this name; "00" is the order of the category in the list of the categories sharing this name in this lexicographer file).
Given that WebKB only stores categories representing the meaning of nouns (i.e. categories having nouns as names), I could have adapted sense indices to make relatively readable key names, e.g. #Friday-28 and #Fri-28. However, I found that knowledge is not easy to read or write when all the category identifiers have such suffixes.
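As an illustration of the sense index layout described above, here is a minimal parsing sketch. The names SenseIndex, parse_sense_index and readable_key are hypothetical; only the first three numeric fields are interpreted, following the description above, and the remaining (empty) fields are ignored. Note that sense indices are lowercase, so the original capitalization (#Friday-28) cannot be recovered from the index alone.

```python
from typing import NamedTuple


class SenseIndex(NamedTuple):
    name: str      # lowercase name, e.g. "friday"
    pos_code: int  # 1 indicates a noun
    lex_file: int  # number of the lexicographer file containing the name
    order: int     # order among the categories sharing this name in that file


def parse_sense_index(index: str) -> SenseIndex:
    """Split a sense index such as 'friday%1:28:00::' into its fields."""
    name, _, rest = index.partition("%")
    fields = rest.split(":")
    return SenseIndex(name, int(fields[0]), int(fields[1]), int(fields[2]))


def readable_key(index: str) -> str:
    """Build a key in the style '#<name>-<lexicographer file number>'."""
    sense = parse_sense_index(index)
    return f"#{sense.name}-{sense.lex_file}"
```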

Ideally, the key name of a category should look like one of the English words or expressions most commonly used for referring to what the category represents, and be unambiguous enough for a human reader to distinguish its meaning from the meanings of other categories. In WordNet, the most common name for a category is the first in the list of its names, but less ambiguous names may appear after. When one of the other names is a compound name beginning or ending with the first name (as "Steve_Martin" begins with "Steve" and ends with "Martin"), it constitutes a better choice for a key name than the first name.

Hence, here are the first rules (ordered by decreasing order of priority) that I chose to generate key names:
1) when the 1st name of a category begins or ends one of the other names, select this other name as key name (unless it is shared by another category without generated key name yet);
2) select the 1st name of a category as key (unless it is shared by another category without generated key name yet);
3) try the first two rules on the 2nd name instead of the 1st;
4) try the first two rules on the 3rd name instead of the 1st;
5) etc.

To respect the decreasing order of priority of these rules, I scanned the KB many times (each time, testing all remaining categories without key name), allowing the test of a lower priority rule only when the application of rules of greater priority did not lead to any more change. (The order of the rules was also respected when testing each category). This may not be an efficient approach but it was efficient enough given that WebKB-2 could scan the whole KB quite quickly (0.45 second on average).
The application of the first two rules (i.e. trying to use only the 1st name of each category) permitted the assignment of key names to 75% of the categories (56074 out of 74488). The gradual use of the other category names raised this to 84% (62873 out of 74488). This means that each category in the remaining 16% shared all its names with another category that was also in this 16%.
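The rules and the repeated scanning described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual WebKB-2 code: the function names and the data layout (a dictionary from a category id to its ordered list of names) are hypothetical, compound names are assumed to use "_" as word separator (as in "Steve_Martin"), and a name is also considered unavailable once it is used as the key name of another category.

```python
def is_compound_of(base, other):
    """True if `other` is a compound name beginning or ending with `base`."""
    return other.startswith(base + "_") or other.endswith("_" + base)


def assign_key_names(categories):
    """categories: dict mapping a category id to its ordered list of names.
    Returns a dict mapping category ids to key names (some ids may be missing)."""
    keys = {}
    max_names = max(len(names) for names in categories.values())
    # Rules by decreasing priority: for the 1st name, the compound rule and then
    # the plain rule; then the same two rules for the 2nd name, and so on.
    rules = [(i, kind) for i in range(max_names) for kind in ("compound", "plain")]

    def available(name, cat_id):
        if name in keys.values():
            return False
        # Not available if shared by another category without key name yet.
        return not any(name in names for other, names in categories.items()
                       if other != cat_id and other not in keys)

    def try_rule(cat_id, rule):
        i, kind = rule
        names = categories[cat_id]
        if i >= len(names):
            return None
        if kind == "compound":
            options = [n for n in names if is_compound_of(names[i], n)]
        else:
            options = [names[i]]
        return next((n for n in options if available(n, cat_id)), None)

    enabled = 1  # number of rules currently allowed, by decreasing priority
    while enabled <= len(rules):
        changed = False
        for cat_id in categories:
            if cat_id in keys:
                continue
            for rule in rules[:enabled]:
                key = try_rule(cat_id, rule)
                if key is not None:
                    keys[cat_id] = key
                    changed = True
                    break
        if not changed:
            enabled += 1  # allow the next lower-priority rule
    return keys
```

With hypothetical input such as {1: ["Martin", "Steve_Martin"], 2: ["Martin", "Dean_Martin"], 3: ["Friday", "Fri"], 4: ["bank"], 5: ["bank"]}, the first three categories receive the key names Steve_Martin, Dean_Martin and Friday, while categories 4 and 5 stay without key name because they share all their names; this is the situation of the remaining 16% mentioned above.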

To go further, I had to generate suffixes. I used numbers when I integrated WordNet 1.6 but, when using categories in knowledge representations, I realized that this option was not user-friendly enough and that a much clearer option was to use the key name of the first supertype. Such suffixes often help people to guess the meaning of a category without having to access all its supertypes. However, I did not want to give a key name with a suffix to all the remaining categories without key name. Hence, I added the following rules (by decreasing order of priority and with a lower priority than the previous rules) to select the categories to which key names with a suffix would be assigned:
- select the category with a frequency-of-use number far lower than the other categories sharing all the same names (this number is given by WordNet and represents the frequency of appearance of the category in a few concordance documents; it is an indication but not of paramount importance; "far lower" was first set at 30 and then to decreasing values);
- select the category with a far lower number of subtypes than the other categories sharing all the same names (actually, in these last two rules, I used combinations of gradually decreasing values of frequency-of-use and number of subtypes; I also tried to reduce the assignment of suffixes to subtypes of #action, as these types are more frequently used than the others in knowledge representations).
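The first of these two rules can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, "far lower" is interpreted here as a difference of at least min_gap between the lowest frequency-of-use number and the next one (the text above does not say whether a difference or a ratio was used), and the suffix is appended after a "." as in identifiers such as #Singapore.island.

```python
def pick_category_for_suffix(group, min_gap):
    """group: list of (category id, frequency-of-use number) for categories
    sharing all the same names. Returns the id of the category whose number is
    lower than all the others by at least min_gap, or None if there is none."""
    if len(group) < 2:
        return None
    ranked = sorted(group, key=lambda item: item[1])
    (lowest_id, lowest), (_, second_lowest) = ranked[0], ranked[1]
    return lowest_id if second_lowest - lowest >= min_gap else None


def suffixed_key(name, suffix):
    """Key name with a suffix, e.g. the key name of the first supertype."""
    return f"{name}.{suffix}"
```

For example, pick_category_for_suffix([("a", 50), ("b", 3)], 30) selects "b", which would then receive a key name such as suffixed_key("Singapore", "island"), i.e. Singapore.island, leaving the plain name for the more frequently used category. Calling the function again with decreasing values of min_gap mirrors the "decreasing values" mentioned above.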

After several additional scans of the KB with all the rules, there were still a few dozen categories without key name. To fix this, I added more precise names to these categories and/or re-ordered their names (some of my lexical additions to WordNet come from this phase). I also had to correct some attributions of suffixes and some choices of key names. For example, #Republic_of_Singapore, instead of #Singapore, had been selected as key name (in application of the 1st rule) but #Singapore is a more convenient identifier, while the island of Singapore and the capital of Singapore are better referred to via #Singapore.island and #Singapore.capital than #Singapore. To fix that, before re-running the key name assignment procedure from scratch, I semi-automatically pre-assigned suffixes to many categories, especially the specializations of #location. For example, I added the suffixes ".capital", ".city", ".island", ".country", and ".colony" to disambiguate many category names. However, instead of using the generalizing category for creating the suffix, I sometimes followed the partOf link. For example, in WordNet 1.7, #town has three instances with the unique name "Bangor" but part of different regions. Hence, I named them #Bangor.Northern_Ireland, #Bangor.Wales and #Bangor.Maine. I have not listed these manual and automatic additions of suffixes in my lexical additions to WordNet. However, you can click here for the current list of the 5944 WordNet categories that have been assigned a key name with a suffix.

Philippe A. MARTIN Created on June 21st, 2002. Last updated on January 15th, 2003.