|
|
Under the Covers:How Search Engines WorkWhen you send a request to a smart search engine, it does more than just a lookup and return. Language processing can help an engine uncover what you really meant to find.Read the entire article here as it appears on Microsoft Internet Developer Magazine.
The amount of computerized data continues to grow exponentially. There are public and private databases on CD-ROMs and the Internet. As more and more data on more and more topics is accumulated in databases, how to retrieve all the information on a topic readily, efficiently, and economically is a hot topic. Adding intelligence that will assist in the search will be of enormous use. In this article I outline the structure of Natural Language Processing (NLP) and Information Retrieval (IR) technology that will eventually help Web users. Looking for InformationIndividual organizations used to construct private databases of accounting or scientific data that could only be interrogated by that organization's computer experts. Such mining required professional programming expertise and technical knowledge of the structure of individual databases. As database technology improved, stylized query languages emerged that still required some software expertise, knowledge of individual database schemas, and some professional and logical expertise. In those days you needed to know how to build sophisticated queries and how the search engine worked. You needed to master its internals and you needed to know how to use things like operators and modifiers precisely-just like a logician. Further, the response or the results of the search were often limited because non-logical connections between items were usually excluded. You no longer need to be a computer expert to find the information you are looking for.
This has been made possible by integrating Natural Language Processing capability into
search engine pre- and post-processors. Two of the main elements of NLP are morphology,
the study of word forms and their relationships to parts of speech, and syntax, the study
of the ways in which words are combined together to develop phrasal structures. With a natural language ranking method, all significant terms in a search query are used, and a search based on this approach can produce some results even if the terms in the query are misspelled. This would not be possible in a less-forgiving Boolean search. Novice users are familiar with the terms they want to find, but are not comfortable with Boolean searches. When a search engine has been augmented with NLP capability, there is no need for users to master a very complex query syntax because they can query data sources using everyday language. An NLP system can add relationship information to the words in a database of indexes. NLP systems make indexes of words more sophisticated because Information Retrieval systems can differentiate between search terms on the basis of morphosyntactic relationships. "Morphosyntactic" means nothing more than taking grammatical rules of a language's structure into consideration. For instance, "the White House" or "the Fourth Amendment" will be recognized as phrases with special meaning; therefore, the IR system will differentiate between documents containing "The White House" and documents containing phrases such as "my neighbor's white house." NLP systems are capable of detecting noun phrases as syntactic components of a language. For instance, I can analyze the syntax (how words are arranged into a structure) of the sentence "Internet companies have a high market value." I will identify "Internet companies" as a noun phrase where "Internet" is the adjective and "companies" is the noun. Why is noun phrase detection so fundamentally important? Because noun phrases capture the meaning of the information you want to find. In a content-rich and unstructured environment like the Web, this capability can save the user time and effort. The ability to distinguish noun phrases allows increased word discrimination that can greatly improve the relevance of returned documents. For instance, searching for "nuclear," "power," and "plant" separately instead of "nuclear power plant" would result in many irrelevant matches. Thus, IR systems that use noun phrases can enhance IR precision. LanguagesMajor search databases now exist in languages other than English, and NLP will work in them as well. Some vendors offer NLP capabilities bundled with their search and retrieval engines that are capable of handling languages such as French, German, Spanish, Italian, Dutch, Danish, Portuguese, and other European languages. These include Xerox/XSoft and InSo Corp. (previously InfoSoft, a spin-off of Houghton Mifflin, the company that provides thesauruses, spell checkers, and grammar checkers to several companies including Microsoft). Both Xerox and Inso are also developing support for double-byte character sets for languages such as Japanese, Chinese, and Korean. The combination of language support, NLP, and IR systems has given users much more flexibility. Multilingual document collections can be queried through a multilanguage query syntax capability. Users can set up a collection of email, tech notes, or other documentation in French, German, and Italian-even the same document can contain different languages-and perform successful queries. You can mix terms in different languages when you query a data set containing multilingual documents. For example, you can use this query "find all the documents regarding computer, computadora, logiciel, elaboratore dati." to retrieve any English document containing the word "computer," in addition to Spanish, French, and Italian documents containing the corresponding (if not translated) term. Clearly, multinationals and international organizations can benefit from this multilingual approach. As a rule, such companies have settled on a few popular languages for searches. However, Internet users will soon expect to use their own languages when searching for information. Indexing and RankingIndexing is the first task an IR system has to accomplish; you can only search and
retrieve a document if it has been indexed into a database. Each word (index) is
represented by a set of coordinates that describes where a word is located-in what
document, sentence, paragraph, heading, and so on. The resolution of the index controls
whether retrieved pointers point to whole documents, subsections within documents, or
exact word locations. Ultimately, IR systems analyze queries to see if the words in the queries are in the indexes. If they are, the system displays the matching documents and ranks or scores them according to various scoring methods/algorithms. The documents that receive the highest score represent the best match, and the system presents them at the top of the list of retrieved documents. Tool CharacteristicsNLP tools need to explicitly address the following issues for indexing and querying:
Computer-specific issues include platforms, nongrammatical input, internationalization (support for non-English character sets), code footprint, and lexical coverage (number of words recognized). The ability to customize lexicons (dictionaries, special vocabularies) and noun phrase grammars is useful too. A stemming function can reduce all tokens in the index and all tokens in queries to their baseform.Natural Language AnalysisAs programs and procedures for adding NLP capability to IR systems are developed, they are tested on a body of text. Sometimes a collection of news reports from Agence France (the French press) or Der Spiegel (the German press) will serve as an informal test bed. In a formal setting, a recognized corpus of documents is used. A corpus is a standard set of texts used for study and comment by linguistic scholars. Examples include the British National Corpus (http://info.ox.ac.uk/bnc/) and the Brown University Corpus of American English. One way to add NLP capability to a search and retrieval engine is to apply three analyses to the searcher's natural language query expression: morphosyntactic analysis, term expansion, and query construction. Each is described below. The combination of these analyses will generate a final form of the search query that, depending on the text to be retrieved, may be very complex, mirroring the input a computer analyst had to generate in the old days. This final form of the search query is then input to the search engine. The
first function, morphosyntactic analysis, tokenizes a query, recognizes
parts of speech including noun phrases in a free-form text, and tags them.
Put another way, this function parses the user's expression, breaks it
into recognizable words, then analyzes it. The IR system must provide part
of speech information to construct noun phrases. A
word's part of speech may be ambiguous when taken out of context. For
instance, the word "code" can be a noun or a verb, depending on
the context. Thus, the IR system must provide a basic disambiguation.
mechanism that increases the accuracy of noun phrase construction by
refining part of speech tags. Term expansion expands the number of key terms by adding related words.
Some words may be added to the text in the query expression. These words
may be related by stemming. Remember, the stem is the part of the word to
which inflectional endings are added. This means that if you identify
"think" as a key word, then "thinks" is a word related
by stemming. By applying the same concept, you identify "fly" or
"flight" as words related by stemming to the key word
"flew." "Better" is related to "good" and
"best," and "bought" to "buy." Stemming and
uninflection must be sophisticated, particularly in the cases of such
irregular verbs and adjectives. Take
the word "thought"_ in addition to being a noun, it's two forms
of the verb "to think" ('I thought," the simple past verb,
and "I have thought," the past participle form). A stemming
function can reduce all tokens in the index to a baseform and similarly
reduce all tokens in queries to their baseform. This can be complex; in
Italian, for instance, "sono," "siamo," and
"furono" are all inflected forms of the verb "essere"
(to be). In Spanish, there are over sixty inflected forms of the Spanish
verb "minar." Full morphological analyses for Romance-family
languages typically contain much more information than English; it is
claimed that in some languages like Finnish, a verb can appear in over ten
thousand forms! A different
implementation would relate words by thesaurus (a database of words
related by sense) or context, so that all the words related to a
particular subject defined in a thesaurus can be included. A key word such
as "information retrieval" can be expanded to words such as
"retrieval system" and "information database" when a
thesaurus is used. A
third term expansion method is truncation, in which words are fused by
using wildcard characters so that a prefix will match several words. For
instance, "histo*" will find "history,"
"historian," "historians," and "historical."
Conversely, a stoplist is used to eliminate a set of words that have no
indexing value, such as articles like "the" and "a,"
and prepositions like "for" and "to." Back
to natural language analysis methods. Query construction is the creation
of a query language expression, which use
'd to be built manually by a computer expert. The use of an NLP
operator such as <LANGUAGE>, for instance, would allow users to find
all the corresponding terms for "press" and
"politicians" in other languages. So,
the query: <LANGUAGE>press
<AND> <LANGUAGE>politicians where
<LANGUAGE> is an NLP operator and <AND> is a Boolean, would
find the American English terms "press" and
"politicians" when the following query is used: <American>press
<AND> <American>politicians. The query <French>presse
<et> <French>politicantes would retrieve
the French terms "Presse" and "politicantes." The
query <Italian>stampa
<e> <Italian>politici would retrieve
the Italian terms "stampa" and "politici." Information
Retrieval systems augmented by NLP capability can enhance two essential IR
requirements: recall and precision. By enhancing information retrieval
recall, the user can retrieve a larger amount of documents on the searched
topic. Improving IR precision translates into better relevance of the
retrieved information, which means users will most likely not retrieve
documents that have no pertinence with the subject researched. Better
precision alone could put an end to a very common complaint by Web users,
who find search engines currently available on the Web unsuitable to
retrieve precise and accurate information on a consistent basis. One very
simple use of Markov Chains is to record the probabilities that one word
or word type follows another.
Markov ChainsTwo
mathematical procedures are used in the development of NLP-based
information retrieval systems: Markov Chains and scoring.
A Markov Model
is a finite-state device (also called a word chain device) that, when
faced with a choice between two or more lists, chooses among them
according to specified probabilities. It is capable of creating an
unlimited number of distinct combinations from a finite set of elements.
The components of a word chain device are, a number of lists of words or
phrases and a set of directions for going from one list to another. A
processor selects one word from the first list, another word from the
second list, and so on to form a complete sentence. A word chain
is the simplest way to combine words together. An implementation of a word
chain device is the re- corded voice that gives you a phone number when
you dial directory assistance. In linguistics, there have been studies
that model the English language as a very large word chain. Some
psychologists have suggested that human language is based on a large word
chain stored in the brain. They have theorized that a stimulus causes a
human to respond with a word, which in turn constitutes the stimulus for
another word out of many possibilities, and so on. The linguist
Noam Chomsky has criticized the idea that a language can b 'e reduced to a
word chain, because a sentence in English is completely different than a
string of words chained together, according to the languages transition
probabilities. Chomsky has shown that an improbable sequence of words can
be grammatical and that nonsense can be grammatical. But the fundamental
difference between a language and a word chain is that a word chain can
only remember which word list it has just chosen from. It cannot remember
things such as an "if" said at the beginning of a sentence,
while a human being can, thus enabling a person to follow the "if
with the appropriate word categories. A Markov Chain is also defined as a
special type of dynamic statistical function. It is represented by a
matrix whose elements are conditional probabilities. This is practical if
the number of statistical states is small and manageable. One very simple
use of Markov Chains, in principle, is to record the probabilities that
one word or word type follows another in a language. For instance, the
word "thought" may be tagged ambiguously as both noun and verb;
examination of the con- text would allow the system to assign a most Rely
part of speech to the token. Estimates of the conditional probabilities
can be refined in successive testing stages. This process is often the one
involved in training networks. Scoring
A
scoring function assigns weights to components of a query based on such
variables as length, position within its paragraph, position of its
paragraph within the document, presence in headings, and presence or
absence of fixed phrases, theme words, and uppercase words. The resulting
score, used for ranking, is the sum of these weights. The values of the
weights differ among document types and languages. When weighting, words
that are indexed or used in a query are assigned a numerical value usually
based on the frequency of the words in stored documents. Ranking models
can be probabilistic. By using information about word frequency, IR
systems can assign a probability of relevance to all the documents that
can be retrieved. These documents are then ranked according to this
probability. Ranking is very useful because document databases tend to be
very large, and large return sets can be daunting if not sorted. XLT
For
this article, I examined XLT from Xerox/XSoft and Intelliscope from Inso.
At the time of my analysis, XLT was available in English, French, German,
Spanish, Italian, and Portuguese. Intelliscope was available for English,
French, German, and Spanish. Both products will segment or tokenize, stem
words, tag parts of speech, and detect noun phrases. XLT linguistic
database products perform sophisticated natural language analysis and have
an underlying architecture that can provide support for all European
languages with a single API. What
distinguishes XLT is that it is composed of algorithms rather than lists
of words and relationships between the words. The data structure
underlying this technology is called a finite-state transducer and the
code that runs these transducers-achieving morphological analysis,
generation, tagging, tokenization, OCR guidance, spellchecking and so
on-is very small and language-independent, saving on storage. New language
models can be plugged into the existing runtime code. Finite-state
transducers can be combined into one transducer. Xerox
uses two simple principles in creating lexical transducers: all inflected
forms of the same word are mapped to the same canonical dictionary in
citation form. Morphological parts of speech (categories) and features are
included in the lexical representation. Oracle and America OnLine have
signed leasing agreements for Xerox's XLT technology.
Intelliscope's flexibility allows users to build their own personal
dictionaries and noise-word stop lists.
IntelliscopeIntelliscope,
a new natural language tool, is based on a simpler concept and design than
Xerox's XLT, but is just as powerful in noun phrase detection, part of
speech tagging, stemming, and uninflection. Its flexibility allows users
to build their own personal dictionaries and noise-word stop lists. This
capability would require Xerox to build a new transducer, since Xerox
doesn't provide a way for users to build their own. Intelliscope also
offers custom lexicons and custom noun phrase grammars; Intelliscope users
can build their own dictionaries and change the way in which noun phrases
are detected. This is an important feature for customers like
pharmaceutical or chemical companies that have already developed their own
terminology through years of operation in their specific industry sector. Intelliscope
functions for index preprocessing and search preprocessing include
language- sensitive tokenization. Intelliscope defines tokens as
whitespace-delimited strings with leading and trailing punctuation
removed. Intelliscope follows predetermined punctuation rules to resolve
cases where attached punctuation is ambiguous. It also returns the
beginning and ending points of each sentence in the input buffer and the
number of tokens in each sentence. Token
normalization allows users to configure Intelliscope to perform customized
token conditioning according to code and data size, speed, and accuracy
requirements. Uninflection
and-inflection identifies the baseforms of inflected words. In ambiguous
cases, Intelliscope returns a list of baseforms. Compound
processing recognizes compound forms and their constituents. A compound is
a word form made up of other constituent words. For instance, the German
word "Eisenbahngesellschaft" could return "Eisenbahn"
and "Gesellschaft.” This process is available for languages such as
Danish, Dutch, German, Norwegian, and Swedish. Clitic
processing identifies clitics for their removal to find root words.
Clitics are articles, pronouns, and prepositions attached directly or with
punctuation to other words. For instance, Intelliscope can recognize the
French word "l' enfant" as a clitic form and will strip l' to
return the word root "enfant." This process is available for
romance languages such as French, Italian, Portuguese, and Spanish. Intelliscope
takes unverified tokens and returns spelling correction alternatives,
generated from its database, as suggestions for possible replacement. This
will allow users to avoid running IR queries with misspelled words. Intelliscope
databases are rich in linguistic information. Tokens are verified to
ensure their existence in the language. Capitalization or parts of speech
are examples of returned information about words from the database.
Unrecognized words are flagged as unverified. To maximize language
coverage, Intelliscope uses comprehensive wordlists in many languages. The
Intelliscope noise-word filter removes noise words, words that are not
helpful in discriminating relevant documents from non-relevant documents.
Microsoft Index Server, part of the Microsoft, Commercial Internet System,
uses Inso software in its implementation. In the FieldVerity,
a well-known company in the Internet industry, has deployed Intelliscope
in its sophisticated retrieval technology, The integration of Intelliscope
will allow users to query data sources using their everyday language. Verity
Natural Language Tool is capable of utilizing part of speech information
and detecting noun phrases that capture the essence of a free-text query.
The company plans to ship the free-text query parser as part of the
standard product. The free-text query parser will transform a natural
language query in any of the languages Verity supports (English, French,
German, Spanish, Italian, and Dutch) into a text reference query
expression (TRL). The tool can also perform spell checking in all of the
languages Verity supports. The
basic morphosyntactic analysis of the input that the tool performs could
be used with or without the help of Verity Topic thesaurus or other
knowledge bases (information structures) to automatically construct a
query in the TRL. In other words, the tool could auto-match any query
input to a predefined Topic set. From
preliminary testing, Verity has found a consistent and significant
improvement in the precision of the retrieved results of free-text
searching, without negatively affecting performance when compared with
queries run without the use of a natural language-based tool. Verity
is currently pursuing proper name detection. Since proper names add very
specific knowledge, they are very important to the quality of the
free-text search. Verity is also planning to integrate a statistical
algorithm to automatically create a corpus-derived thesaurus. This
functionality will further allow a free-text query to expand to a more
specific search, therefore improving the quality of precision and recall.
|