United States Patent Application 20040243568
Kind Code A1
Wang, Hai-Feng; et al.
December 2, 2004
Search engine with natural language-based robust parsing of user query and
relevance feedback learning
Abstract
A search engine architecture is designed to handle a full range of user
queries, from complex sentence-based queries to simple keyword searches.
The search engine architecture includes a natural language parser that
parses a user query and extracts syntactic and semantic information. The
parser is robust in the sense that it not only returns fully-parsed
results (e.g., a parse tree), but is also capable of returning
partially-parsed fragments in those cases where more accurate or
descriptive information in the user query is unavailable. A question
matcher is employed to match the fully-parsed output and the
partially-parsed fragments to a set of frequently asked questions (FAQs)
stored in a database. The question matcher then correlates the questions
with a group of possible answers arranged in standard templates that
represent possible solutions to the user query. The search engine
architecture also has a keyword searcher to locate other possible answers
by searching on any keywords returned from the parser. The answers
returned from the question matcher and the keyword searcher are presented
to the user for confirmation as to which answer best represents the
user's intentions when entering the initial search query. The search
engine architecture logs the queries, the answers returned to the user,
and the user's confirmation feedback in a log database. The search engine
has a log analyzer to evaluate the log database to glean information that
improves performance of the search engine over time by training the
parser and the question matcher.
Inventors: Wang, Hai-Feng (Kowloon, HK); Lee, Kai-Fu (Woodinville, WA); Yang, Qiang (Burnaby, CA)
Correspondence Address: LEE & HAYES PLLC, 421 W RIVERSIDE AVENUE SUITE 500, SPOKANE, WA 99201
Family ID: 32682766
Appl. No.: 10/806789
Filed: March 22, 2004
Related U.S. Patent Documents
| Application Number | Filing Date | Patent Number |
|---|---|---|
| 10806789 | Mar 22, 2004 | |
| 09645806 | Aug 24, 2000 | 6766320 |
Current U.S. Class: 1/1; 707/999.003; 707/E17.084
Current CPC Class: G06F 17/3043 20130101; G06F 17/30616 20130101; G06F 2216/03 20130101; Y10S 707/99935 20130101; Y10S 707/99933 20130101
Class at Publication: 707/003
International Class: G06F 007/00
Claims
1. Canceled
2. Canceled
3. Canceled
4. Canceled
5. Canceled
6. Canceled
7. Canceled
8. Canceled
9. Canceled
10. Canceled
11. Canceled
12. Canceled
13. Canceled
14. Canceled
15. Canceled
16. Canceled
17. Canceled
18. Canceled
19. Canceled
20. Canceled
21. Canceled
22. Canceled
23. Canceled
24. Canceled
25. Canceled
26. Canceled
27. Canceled
28. Canceled
29. Canceled
30. Canceled
31. Canceled
32. Canceled
33. Canceled
34. Canceled
35. Canceled
36. Canceled
37. A method comprising: receiving a query; mapping the query from a
query space to a question space to identify associated frequently asked
questions; mapping the questions from the question space to a template
space to identify associated templates; mapping the templates from the
template space to an answer space to identify associated answers; and
returning the answers in response to the query.
38. A method as recited in claim 37, wherein the mapping from the query
space to the question space comprises: parsing the query to identify at
least one associated concept; and correlating the concept to one or more
frequently asked questions.
39. A method as recited in claim 37, wherein the mapping from the question
space to the template space comprises cross-indexing from a first table
containing question identifications to a second table containing
template identifications.
40. A method as recited in claim 39, wherein the mapping from the template
space to the answer space comprises cross-indexing from the second table
to a third table containing answer identifications.
41. A method as recited in claim 37, further comprising: presenting the
answers to a user for confirmation as to which of the answers represent
the user's intentions in the query; analyzing the query and the answers
confirmed by the user; and modifying the answers that are returned in
response to the query based on information gleaned from the analyzing.
42. Canceled
43. Canceled
44. Canceled
45. Canceled
46. Canceled
47. Canceled
48. Canceled
49. Canceled
50. Canceled
51. Canceled
52. Canceled
53. Canceled
54. Canceled
55. Canceled
56. Canceled
57. Canceled
58. Canceled
59. Canceled
60. Canceled
61. Canceled
62. Canceled
63. Canceled
64. Canceled
65. Canceled
66. Canceled
67. Canceled
68. Canceled
69. Canceled
70. Canceled
71. Canceled
72. A method of parsing a search query, comprising: segmenting the search
query into individual character strings; producing a parse tree from at
least one parsable character string of the search query; and generating
at least one keyword based on at least one non-parsable character string of
the search query.
73. The method of claim 72, further comprising: conducting keyword
searching using the at least one keyword.
74. The method of claim 72, wherein the parse tree represents a collection
of concepts related to the search query.
75. The method of claim 74, further comprising matching the parsed
concepts to a list of frequently asked questions.
76. The method of claim 75, further comprising: identifying at least one
answer associated with the list of frequently asked questions that match
the parsed concepts and keywords; and presenting the at least one answer
to a user in a user interface that permits a user to select a desired
answer from the one or more answers.
77. The method of claim 76, further comprising: logging the search query
and at least one answer selected by the user in a log database; and
analyzing the log database to derive at least one weighting factor
indicating how relevant the frequently asked questions are to the parsed
concepts and keywords.
78. A parser for a search engine, comprising: a segmentation module that
segments a search query into one or more individual character strings; a
natural language parser module that produces a parse tree from one or
more parsable character strings of the search query; and a keyword
searcher to identify one or more keywords in the search query and to
output the keywords.
79. The parser of claim 78, wherein the parse tree represents a collection
of concepts related to the search query.
80. The parser of claim 78, further comprising a search module that
matches the parsed concepts to a list of frequently asked questions.
81. The parser of claim 80, wherein the search module: identifies at least
one answer associated with the list of frequently asked questions that
match the parsed concepts and keywords; and presents the at least one
answer to a user in a user interface that permits a user to select a
desired answer from the one or more answers.
82. The parser of claim 81, wherein the search module: logs the search
query and at least one answer selected by the user in a log database; and
analyzes the log database to derive at least one weighting factor
indicating how relevant the frequently asked questions are to the parsed
concepts and keywords.
Description
TECHNICAL FIELD
[0001] This invention relates to search engines and other information
retrieval tools.
BACKGROUND
[0002] With the explosive growth of information on the World Wide Web,
there is an acute need for search engine technology to keep pace with
users' need for searching speed and precision. Today's popular search
engines, such as "Yahoo!" and "MSN.com", are used by millions of users
each day to find information. Unfortunately, the basic search method has
remained essentially the same as that of the first search engine introduced years
ago.
[0003] Search engines have undergone two main evolutions. The first
evolution produced keyword-based search engines. The majority of search
engines on the Web today (e.g., Yahoo! and MSN.com) rely mainly on
keyword searching. These engines accept a keyword-based query from a user
and search in one or more index databases. For instance, a user
interested in Chinese restaurants in Seattle may type in "Seattle,
Chinese, Restaurants" or a short phrase "Chinese restaurants in Seattle".
[0004] Keyword-based search engines interpret the user query by focusing
only on identifiable keywords (e.g., "restaurant", "Chinese", and
"Seattle"). Because of their simplicity, keyword-based search engines
can produce unsatisfactory search results, often returning many
irrelevant documents (e.g., documents on the Seattle area or restaurants
in general). In some cases, the engines return millions of documents in
response to a simple keyword query, which often makes it impossible for a
user to find the needed information.
[0005] This poor performance is primarily attributable to the inability of
simple keywords to capture the complex search semantics a user wishes to
express in the query. Keyword-based search engines simply interpret the
user query without ascribing any intelligence to the form and expression
entered by the user.
[0006] In response to this problem of keyword-based engines, a second
generation of search engines evolved to go beyond simple keywords. The
second-generation search engines attempt to characterize the user's query
in terms of predefined frequently asked questions (FAQs), which are
manually indexed from user logs along with corresponding answers. One key
characteristic of FAQ searches is that they take advantage of the fact
that commonly asked questions are far fewer than the total number of
questions, and thus can be manually entered. By using user logs, they can
compute which questions are most commonly asked. With these search
engines, one level of indirection is added by asking the user to confirm
one or more rephrased questions in order to find an answer. A prime
example of a FAQ-based search engine is the engine employed at the Web
site "Askjeeves.com".
[0007] Continuing our example to locate a Chinese restaurant in Seattle,
suppose a user at the "Askjeeves.com" site enters the following search
query:
[0008] "What Chinese restaurants are in Seattle?"
[0009] In response to this query, the search engine at the site rephrases
the question as one or more FAQs, as follows:
[0010] How can I find a restaurant in Seattle?
[0011] How can I find a yellow pages listing for restaurants in Seattle,
Wash.?
[0012] Where can I find tourist information for Seattle?
[0013] Where can I find geographical resources from Britannica.com on
Seattle?
[0014] Where can I find the official Web site for the city of Seattle?
[0015] How can I book a hotel in Seattle?
[0016] If any of these rephrased questions accurately reflect the user's
intention, the user is asked to confirm the rephrased question to
continue the searching process. Results from the confirmed question are
then presented.
[0017] An advantage of this style of interaction and cataloging is much
higher precision. Whereas the keyword-based search engines might return
thousands of results, the FAQ-based search engine often yields a few very
precise results as answers. It is plausible that this style of FAQ-based
search engines will enjoy remarkable success in limited domain
applications, such as web-based technical support.
[0018] However, the FAQ-based search engines are also limited in their
understanding of the user's query, because they only look up frequently
occurring words in the query, and do not perform any deeper syntactic or
semantic analysis. In the above example, the search engine still
experiences difficulty locating "Chinese restaurants", as exemplified by
the omission of the modifier "Chinese" in any of the rephrased questions.
While FAQ-based second-generation search engines have improved search
precision, there remains a need for further improvement in search
engines.
[0019] Another problem with existing search engines is that most people
are dissatisfied with the user interface (UI). The chief complaint is
that the UI is not designed to allow people to express their intention.
Users often browse the Internet with the desire to obtain useful
information. For keyword-based search engines, two main problems hinder
the discovery of user intention. First, it is not easy for users to
express their intention with simple keywords. Second,
keyword-based search engines often return too many results unrelated to
the users' intention. For example, a user may want to get travel
information about Beijing. On entering `travel` as a keyword query in
Yahoo, the user is given 289 categories and 17925 sites, and the
travel information about Beijing is nowhere in the first 100 items.
[0020] Existing FAQ-based search engines offer UIs that allow entry of
pseudo natural language queries to search for information. However, the
underlying engine does not try to understand the semantics of the query
or users' intention. Indeed, the user's intention and the actual query
are sometimes different.
[0021] Accordingly, there is a further need to improve the user interface
of search engines to better capture the user's intention as a way to
provide higher quality search results.
SUMMARY
[0022] A search engine architecture is designed to handle a full range of
user queries, from complex sentence-based queries to simple keyword
searches. The search engine architecture includes a natural language
parser that parses a user query and extracts syntactic and semantic
information. The parser is robust in the sense that it not only returns
fully-parsed results (e.g., a parse tree), but is also capable of
returning partially-parsed fragments in those cases where more accurate
or descriptive information in the user query is unavailable. This is
particularly beneficial in comparison to previous efforts that utilized
full parsers (i.e., not robust parsers) in information retrieval. Whereas
full parsers tended to fail on many reasonable sentences that were not
strictly grammatical, the search engine architecture described herein
always returns the best fully-parsed or partially-parsed interpretation
possible.
[0023] The search engine architecture has a question matcher to match the
fully-parsed output and the partially-parsed fragments to a set of
frequently asked questions (FAQs) stored in a database. The question
matcher correlates the questions with a group of possible answers
arranged in standard templates that represent possible solutions to the
user query.
[0024] The search engine architecture also has a keyword searcher to
locate other possible answers by searching on any keywords returned from
the parser. The search engine may be configured to search content in
databases or on the Web to return possible answers.
[0025] The search engine architecture includes a user interface to
facilitate entry of a natural language query and to present the answers
returned from the question matcher and the keyword searcher. The user is
asked to confirm which answer best represents his/her intentions when
entering the initial search query.
[0026] The search engine architecture logs the queries, the answers
returned to the user, and the user's confirmation feedback in a log
database. The search engine has a log analyzer to evaluate the log
database and glean information that improves performance of the search
engine over time. For instance, the search engine uses the log data to
train the parser and the question matcher. As part of this training, the
log analyzer is able to derive various weighting factors indicating how
relevant a question is to a parsed concept returned from the parser, or
how relevant a particular answer is to a particular question. These
weighting factors help the search engine obtain results that are more
likely to be what the user intended based on the user's query.
[0027] In this manner, depending upon the intelligence provided in the
query, the search engine's ability to identify relevant answers can be
statistically measured in terms of a confidence rating. Generally, the
confidence ratings of an accurate and precise search improve with the
ability to parse the user query. Search results based on a fully-parsed
output typically garner the highest confidence rating because the search
engine uses most of the information in the user query to
discern the user's search intention. Search results based on a
partially-parsed fragment typically receive a comparatively moderate
confidence rating, while search results based on keyword searching are
given the lowest confidence rating.
BRIEF DESCRIPTION OF THE DRAWINGS
[0028] FIG. 1 is a block diagram of an exemplary computer network in which
a server computer implements a search engine for handling client queries.
[0029] FIG. 2 is a block diagram of a search engine architecture.
[0030] FIG. 3 is a flow diagram of a search process using the search
engine.
[0031] FIG. 4 is a block diagram of a robust parser employed in the search
engine.
[0032] FIG. 5 is a diagrammatic illustration of a tokenization of a
Chinese sentence to demonstrate the added difficulties of parsing
languages other than English.
[0033] FIG. 6 is a flow diagram of a question matching process employed in
the search engine.
[0034] FIG. 7 illustrates database tables used during the question
matching process of FIG. 6.
[0035] FIG. 8 illustrates a first screen view of a Chinese-version search
engine user interface implemented by the search engine.
[0036] FIG. 9 illustrates a second screen view of the Chinese-version search
engine user interface implemented by the search engine.
DETAILED DESCRIPTION
[0037] This disclosure describes a search engine architecture that handles
a full range of user queries, from complex sentence-based queries to
simple keyword searches. Unlike traditional search engines, the
architecture includes a natural language parser that parses a user query
and extracts syntactic and semantic information. The parser is robust in
the sense that it not only returns fully-parsed results, but is also
capable of returning partially-parsed fragments in those cases where more
accurate or descriptive information in the user query is unavailable.
[0038] When facing ambiguity, the search engine architecture interacts
with the user to confirm the concept the user is asking about. The query
logs are recorded and processed repeatedly, thus
providing a powerful language model for the natural language parser as
well as for indexing the frequently asked questions and providing
relevance-feedback learning capability.
[0039] The search engine architecture is described in the context of an
Internet-based system in which a client submits user queries to a server
and the server hosts the search engine to conduct the search on behalf of
the client. Moreover, the search engine architecture is described as
handling English and Chinese languages. However, the architecture may be
implemented in other environments and extended to other languages. For
instance, the architecture may be implemented on a proprietary local area
network and configured to handle one or more other languages (e.g.,
Japanese, French, German, etc.).
[0040] Exemplary Computing Environment
[0041] FIG. 1 shows an exemplary computer network system 100 in which the
search engine architecture may be implemented. The network system 100
includes a client computer 102 that submits user queries to a server
computer 104 via a network 106, such as the Internet. While the search
engine architecture can be implemented using other networks (e.g., a wide
area network or local area network) and should not be limited to the
Internet, the architecture will be described in the context of the
Internet as one suitable implementation.
[0042] The client 102 is representative of many diverse computer systems,
including general-purpose computers (e.g., desktop computer, laptop
computer, etc.), network appliances (e.g., set-top box (STB), game
console, etc.), and wireless communication devices (e.g., cellular
phones, personal digital assistants (PDAs), pagers, or other devices
capable of receiving and/or sending wireless data communication). The
client 102 includes a processor 110, a volatile memory 112 (e.g., RAM), a
non-volatile memory 114 (e.g., ROM, Flash, hard disk, optical, etc.), one
or more input devices 116 (e.g., keyboard, keypad, mouse, remote control,
stylus, microphone, etc.) and one or more output devices 118 (e.g.,
display, audio speakers, etc.).
[0043] The client 102 is equipped with a browser 120, which is stored in
non-volatile memory 114 and executed on processor 110. The browser 120
facilitates communication with the server 104 via the network 106. For
discussion purposes, the browser 120 may be configured as a conventional
Internet browser that is capable of receiving and rendering documents
written in a markup language, such as HTML (hypertext markup language).
[0044] In the illustrated implementation, the server 104 implements a
search engine architecture that is capable of receiving user queries from
the client 102, parsing the queries to obtain complete phrases, partial
phrases, or keywords, and returning the appropriate results. The server
104 is representative of many different server environments, including a
server for a local area network or wide area network, a backend for such
a server, or a Web server. In this latter environment of a Web server,
the server 104 may be implemented as one or more computers that are
configured with server software to host a site on the Internet 106, such
as a Web site for searching.
[0045] The server 104 has a processor 130, volatile memory 132 (e.g.,
RAM), and non-volatile memory 134 (e.g., ROM, Flash, hard disk, optical,
RAID memory, etc.). The server 104 runs an operating system 136 and a
search engine 140. For purposes of illustration, operating system 136 and
search engine 140 are illustrated as discrete blocks stored in the
non-volatile memory 134, although it is recognized that such programs and
components reside at various times in different storage components of the
server 104 and are executed by the processor 130. Generally, these
software components are stored in non-volatile memory 134 and from there,
are loaded at least partially into the volatile main memory 132 for
execution on the processor 130.
[0046] The search engine 140 includes a robust parser 142 to parse a query
using natural language parsing. Depending on the search query, the robust
parser produces a fully-parsed output (e.g., a parse tree), one or more
partially-parsed fragments, and/or one or more keywords. A FAQ matcher
144 matches the fully-parsed output (e.g., a parse tree) and the
partially-parsed fragments to a set of possible frequently asked
questions that are stored in a database. The FAQ matcher then correlates
the questions with a group of possible answers to the user query. A
keyword searcher 146 attempts to locate other possible answers by
conducting keyword searching using the keywords returned from the parser.
[0047] Unlike traditional engines, the search engine architecture robustly
accommodates many types of user queries, from single keyword strings to
full, grammatically correct sentences. If the user enters a complete
sentence, the search engine 140 has the ability to parse the sentence for
syntactic and semantic information. This information better reveals the
user's intention and allows for a more precise search with higher quality
results. If the user enters a grammatically incorrect sentence or an
incomplete sentence (i.e., a phrase), the search engine 140 attempts to
map the partial fragments to FAQ concepts. Finally, even if the user
query contains only one or a few search terms, the search engine is able
to handle the query as a keyword-based search and return at least some
results, albeit not with the same precision and quality.
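The three-level fallback described above (full parse, partial fragment, keywords only) can be sketched as follows. This is a minimal illustration and not the parser of the disclosure: the regular-expression "grammar", the stopword list, and the names `robust_parse` and `ParseResult` are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field

# Hypothetical toy "grammar": only one sentence shape counts as a full parse.
FULL_PATTERN = re.compile(
    r"^what (?P<cuisine>\w+) restaurants are in (?P<city>\w+)\??$", re.IGNORECASE
)
# Hypothetical fragment rule: a single word followed by "restaurant(s)".
FRAGMENT_PATTERN = re.compile(r"(?P<cuisine>\w+) restaurants?", re.IGNORECASE)
STOPWORDS = {"what", "are", "in", "the", "a", "of"}  # illustrative only


@dataclass
class ParseResult:
    level: str                      # "full", "partial", or "keyword"
    concepts: dict = field(default_factory=dict)
    keywords: list = field(default_factory=list)


def robust_parse(query: str) -> ParseResult:
    """Return the best available interpretation; never fail outright."""
    text = query.strip()
    keywords = [w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS]

    full = FULL_PATTERN.match(text)
    if full:
        concepts = {name: value.lower() for name, value in full.groupdict().items()}
        return ParseResult("full", concepts, keywords)

    fragment = FRAGMENT_PATTERN.search(text)
    if fragment:
        return ParseResult("partial", {"cuisine": fragment.group("cuisine").lower()}, keywords)

    # Nothing parsable: fall back to plain keywords.
    return ParseResult("keyword", {}, keywords)
```

For example, `robust_parse("What Chinese restaurants are in Seattle?")` yields a full parse with both slots filled, `robust_parse("Chinese restaurants Seattle")` yields a partial fragment, and `robust_parse("Seattle")` yields keywords only.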
[0048] The search engine 140 presents the possible answers returned from
the FAQ matcher 144 and the keyword searcher 146 to a user. The user is
asked to confirm which of the answers best represents the user's
intentions in the query. Through this feedback, the search engine may
refine the search. Additionally, the search engine may use this relevance
feedback to train the architecture in its mapping of a parsed query into
relevant answers.
[0049] The search engine includes a query log analyzer 148 that tracks the
query, the returned results, and the user's feedback to those results in
a log database. The query log analyzer 148 analyzes the log database to
train the FAQ matcher 144. As part of this training, the query log
analyzer 148 is able to derive, over time, various weights indicating how
relevant a FAQ is to a parsed concept generated by parsing a particular
query, or how relevant a particular answer is to a particular FAQ. These
weights help the search engine obtain results that are more likely to be
what the user intended based on the user's query.
[0050] In this manner, depending upon the intelligence provided in the
query, the search engine's ability to identify relevant answers can be
statistically measured in terms of a confidence rating. Generally, the
confidence ratings of an accurate and precise search improve with the
ability to parse the user query. Search results based on a fully-parsed
output typically garner the highest confidence rating because the search
engine uses most of the information in the user query to
discern the user's search intention. Search results based on a
partially-parsed fragment typically receive a comparatively moderate
confidence rating, while search results based on keyword searching are
given the lowest confidence rating.
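One way to picture the confidence ordering is to tag each candidate answer with the evidence that produced it and sort accordingly. The sketch below assumes three numeric ratings; only their ordering comes from the text, while the values and the function name `rank_answers` are hypothetical.

```python
# Hypothetical numeric ratings; the text fixes only the ordering
# full parse > partial fragment > keyword search.
CONFIDENCE = {"full": 0.9, "partial": 0.6, "keyword": 0.3}


def rank_answers(candidates):
    """candidates: list of (answer, source) pairs, source being a CONFIDENCE key.

    Each answer keeps the highest rating among the sources that produced it.
    Returns (answer, rating) pairs, highest rating first, ties broken by name.
    """
    best = {}
    for answer, source in candidates:
        score = CONFIDENCE[source]
        if score > best.get(answer, 0.0):
            best[answer] = score
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))
```

An answer found both by keyword search and through a partial fragment would therefore be presented with the fragment's moderate rating rather than the keyword rating.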
[0051] Search Engine Architecture
[0052] The search engine architecture 140 is formulated according to an
underlying premise, referred to as the concept-space hypothesis, that a
small subset of concepts covers most user queries. Examples of concepts
are: "Finding computer and internet related products and services",
"Finding movies and toys on the Internet", and so on. It is believed that
the first few popular categories will actually cover most of the queries.
Upon analyzing a one-day log from MSN.com, the inventors discovered that
30% of the concepts covered approximately 80% of all queries in the
selected query pool.
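A coverage figure of this kind can be computed from a log in which each query has already been labeled with a concept. The following sketch assumes such labels exist; the function name and the sample data are hypothetical and are not taken from the MSN.com log.

```python
import math
from collections import Counter


def coverage_of_top_concepts(query_concepts, top_fraction):
    """Fraction of queries covered by the most frequent `top_fraction` of concepts.

    query_concepts: one concept label per logged query (labels are assumed
    to have been assigned beforehand).
    """
    if not query_concepts:
        return 0.0
    counts = Counter(query_concepts)
    n_top = max(1, math.ceil(len(counts) * top_fraction))
    covered = sum(count for _, count in counts.most_common(n_top))
    return covered / len(query_concepts)
```

With a toy log of five "travel", three "movies", one "toys", and one "cars" query, the top half of the concepts covers eight of the ten queries.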
[0053] FIG. 2 illustrates the search engine architecture 140 in more
detail. It has a search engine user interface (UI) 200 that seamlessly
integrates search functionality and browsing. In the FIG. 1 network
system, the search engine UI 200 is served in an HTML document to the
client 102 when the client initially addresses the Web site. One
exemplary implementation of the user interface 200 is described below in
more detail beneath the heading "Search Engine User Interface".
[0054] The user enters a search query via the search engine UI 200. A
query string is passed to the natural language-based robust parser 142,
which performs robust parsing and extracts syntactic as well as
semantic information for natural language queries. The robust parser 142
includes a natural language parser (NLP) 202 that parses the query string
according to rules kept in a rules database 204. The parsed output is
ranked with a confidence rating to indicate how likely the output is to
represent the query intentions.
[0055] The output of the natural language robust parser 142 is a
collection of concepts and keywords. The concepts are obtained through a
semantic analysis and include a fully-parsed output (e.g., a parse tree)
and partially-parsed fragments. One suitable semantic analysis is
described below in the section under the heading "NL-based Robust
Parsing". The keywords are either the key phrases extracted directly from
the user query or are expanded queries through a synonym table.
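Keyword expansion through a synonym table might look like the sketch below. The table contents and the name `expand_keywords` are assumptions for illustration; the disclosure does not specify the table's format in this passage.

```python
# Hypothetical synonym table; the entries are illustrative only.
SYNONYMS = {
    "restaurant": ["eatery", "diner"],
    "hotel": ["inn", "lodging"],
}


def expand_keywords(keywords, synonyms=SYNONYMS):
    """Return the original keywords followed by their synonyms, without duplicates."""
    expanded = []
    for word in keywords:
        for term in [word, *synonyms.get(word, [])]:
            if term not in expanded:
                expanded.append(term)
    return expanded
```

Each original keyword stays ahead of its synonyms, so a downstream searcher could still favor the user's own wording.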
[0056] After natural language processing, the concepts and keywords are
passed on to the FAQ matcher 144. The FAQ matcher 144 has a FAQ matching
component 206 that attempts to match the concepts and keywords to
predefined frequently asked questions stored in a FAQ database 208. From
the FAQs, the FAQ matching component 206 identifies related templates
from a template database 210 that group together similar question
parameters. The templates have associated indexed answers that are
maintained in an answer database 212.
[0057] Accordingly, the FAQ matcher 144 effectively maps the parsed
concepts and keywords to FAQs, the FAQs to templates, and the templates
to answers. In one implementation, the FAQ database 208 is configured as
a relational database that maintains a set of tables to correlate the
concepts, FAQs, templates, and answers. One example database structure is
described below with reference to FIG. 7.
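The chain of mappings (concept to FAQ, FAQ to template, template to answer) can be illustrated with three small relational tables joined in a single query. The schema, table names, and rows below are assumptions chosen for the example and are not the tables of FIG. 7.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical three-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE concept_faq (concept TEXT, faq_id INTEGER);
        CREATE TABLE faq (faq_id INTEGER PRIMARY KEY, question TEXT, template_id INTEGER);
        CREATE TABLE template_answer (template_id INTEGER, answer TEXT);

        INSERT INTO concept_faq VALUES ('restaurant', 1);
        INSERT INTO concept_faq VALUES ('hotel', 2);
        INSERT INTO faq VALUES (1, 'How can I find a restaurant in a city?', 10);
        INSERT INTO faq VALUES (2, 'How can I book a hotel in a city?', 20);
        INSERT INTO template_answer VALUES (10, 'restaurant-listing');
        INSERT INTO template_answer VALUES (20, 'hotel-booking');
        """
    )
    return conn


def answers_for_concept(conn, concept):
    """Follow concept -> FAQ -> template -> answer by cross-indexing the tables."""
    return conn.execute(
        """
        SELECT f.question, a.answer
        FROM concept_faq AS c
        JOIN faq AS f ON f.faq_id = c.faq_id
        JOIN template_answer AS a ON a.template_id = f.template_id
        WHERE c.concept = ?
        ORDER BY f.faq_id, a.answer
        """,
        (concept,),
    ).fetchall()
```

Keeping each mapping in its own table lets FAQs, templates, and answers be revised independently, which suits a matcher that is retrained from logs.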
[0058] Concurrent with FAQ-based searching, the NLP module 142 also sends
the keywords to a keyword-based module 146 for keyword searching on the
user's query. The keyword-based module 146 has a meta-search engine 214
that extracts answers from the Web 216.
[0059] The answers returned from the FAQ matcher 144 and keyword searcher
146 are presented to the user via UI 200. The user is asked to confirm
which, if any, of the returned answers best exemplifies the user's
intentions in the query. By analyzing which results the user selects, the
search engine may further refine the search using the confirmed answer as
a starting point and return even more accurate results.
[0060] In addition to facilitating various search levels in an integrated
manner, the search engine architecture 140 also supports a query log
analyzer 148 that implements methodology to process query logs for the
purpose of obtaining new question templates with indexed answers. It also
has relevance-feedback capability for improving its indexing and ranking
functions. This capability allows the architecture 140 to record users'
actions in browsing and selecting the search result, so that the ranking
of these results and the importance of each selection can be learned over
time.
[0061] The architecture has a log collector 218 to log user actions and
system output in a log database 220. Log data mining tools 222 may be
used to analyze the log database 220 to glean data used to refine the FAQ
database 208, template database 210, answer database 212, and FAQ
matching functions 206. A web crawler 224 may also be included to provide
information as needed from the Web 216.
[0062] In one implementation, the search engine architecture 140 may be
configured according to COM (Component Object Model) or DCOM (Distributed
COM). This allows for design modularity, allowing each individual module
to evolve independently from others as long as the inter-module interface
remains the same.
[0063] Compared to the traditional search engines, the search engine
architecture 140 offers many benefits, including a higher precision and
search efficiency on frequently asked questions. Additionally, the
indexed contents evolve with users' current interests and the engine's ranking
ability improves with usage over time. The search engine architecture
scales easily to offer relatively large coverage for users' questions, and
the natural user interface allows users to seamlessly integrate search and
browsing.
[0064] Search Process
[0065] FIG. 3 shows a search process 300 conducted on the search engine
architecture 140 of FIG. 2. The search process 300 is implemented as
computer executable instructions that, when executed, perform the
operations illustrated as blocks in FIG. 3. Selected operations of the
search process 300 are described after this section in more detail.
[0066] At block 302, the search engine 140 receives a user query entered
at remote client 102. At block 304, the user query is parsed at the
natural language robust parser 142 to produce the parsed concepts (if
any) and keywords. After parsing, the concepts and keywords are submitted
to the FAQ matcher 144 to match them with frequently asked questions in
the FAQ database (block 306). Upon identifying matched FAQs, the FAQ
matcher 144 identifies associated templates with indexed answers from
databases 210 and 212 to obtain answers for the user queries (block 308).
[0067] Concurrently with the FAQ-matching operations, the search engine also
performs a keyword search at keyword-based module 146 (block 310). At
block 312, the results of the FAQ matching and keyword searching are
presented to the user via the search engine UI 200. The user is then
given the opportunity to offer feedback in an attempt to confirm the
accuracy of the search.
[0068] Meanwhile, apart from the search functions, the search engine
also provides relevance feedback learning through analysis of the query,
the returned results, and the user feedback on the search results. At
block 314, the log collector 218 logs user queries, results returned to
the user, and selections made by the user. These records are stored in
the log database 220.
[0069] At block 316, the log database 220 is analyzed to ascertain
frequently asked questions from a large number of user questions and to
automatically develop or find answers for the questions. The log is
further analyzed to determine weights indicating how likely it is that the
returned results pertain to the users' queries (block 318). In
particular, the log analyzer determines how likely it is that the FAQs
represent the user queries and that the answers pertain to the FAQs. The
weightings are used to modify the FAQ matcher 144 (block 320).
[0070] NL-Based Robust Parsing (Block 304)
[0071] The natural language-based robust parser 142 employs robust parsing
to accommodate many diverse types of user queries, including full and
partial sentences, meaningful phrases, and independent search terms. User
queries are often entered into search engines as incomplete or
grammatically incorrect sentences. For instance, users who want to know
about Chinese restaurants in Seattle might enter queries quite
differently, as illustrated by the following examples:
[0072] Chinese restaurants in Seattle
[0073] Seattle's best Chinese restaurants
[0074] Any Chinese restaurants in Seattle?
[0075] Where is the closest Chinese restaurant?
[0076] What is the best Chinese restaurant in Seattle?
[0077] While it is difficult to parse such sentences using a traditional
natural language parser, the robust parser 142 is capable of handling
such partial or grammatically incorrect sentences. Unlike traditional
parsing, which requires a hypothesis and a partial parse to cover adjacent
words in the input, robust parsing relaxes this requirement, making it
possible to omit noisy words in the input. If a user query contains words
that are not parsable, the natural language parsing module 142 can skip
these words or phrases and still output a result.
[0078] Additionally, different hypotheses can result from partial parses
by skipping some symbols in parse rules. Thus, if a given sentence is
incomplete such that natural language parsing is unable to find a
suitable rule to match it exactly, the robust parser provides multiple
interpretations of the parsing result and associates with each output a
confidence level. In the search engine 140, this confidence level is
built based on statistical training.
[0079] FIG. 4 shows an exemplary implementation of the natural language
robust parser 142. The module includes a word segmentation unit 400,
which identifies individual words in a sentence. The word segmentation
unit 400 relies on data from a query log 402 and a dictionary 404. In
English, words are separated by spaces and hence, word segmentation is
easily accomplished. However, in other languages, segmentation is not a
trivial task. With Chinese text, for example, there is no separator
between words. A sequence of characters may have many possible parses in
the word-tokenization stage. Thus, effective information retrieval of
Chinese first requires good word segmentation.
[0080] FIG. 5 shows an example tokenization 500 of a simple Chinese
sentence "", having only four characters. Here, these four characters can
be parsed in five ways into words. For example, the dotted path 502
represents a parsing to the phrase "dismounted a horse", and the bold
path 504 represents "immediately coming down". This figure also shows
seven possible "words", some of which (e.g., ) might be disputable on
whether they should be considered "words."
[0081] To accommodate Chinese input, the robust parser can accept two
kinds of input: Lattice and N-best. The lattice input includes almost all
possible segmentations. However, as there may be too much ambiguity, the
parsing process can become very slow. An alternative choice is to use the
N-best input.
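By way of illustration only, the difference between lattice input and N-best input can be sketched as follows. The four-symbol string "abcd" and the seven-entry dictionary are hypothetical stand-ins for a four-character sentence with seven candidate words; the function names and the "fewer words is better" score are likewise assumptions made for this sketch and are not taken from the described implementation.

```python
from typing import Callable, List, Set


def all_segmentations(text: str, dictionary: Set[str]) -> List[List[str]]:
    """Return every way to split `text` into words found in `dictionary`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            # Extend this candidate word with every segmentation of the rest.
            for rest in all_segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results


def n_best(segmentations: List[List[str]],
           score: Callable[[List[str]], float],
           n: int) -> List[List[str]]:
    """Keep only the n highest-scoring segmentations (the N-best input)."""
    return sorted(segmentations, key=score, reverse=True)[:n]


# Hypothetical stand-in for a four-character sentence with seven "words".
DICTIONARY = {"a", "b", "c", "d", "ab", "bc", "cd"}

# Lattice-style input: all segmentations are handed to the parser.
lattice_paths = all_segmentations("abcd", DICTIONARY)

# N-best input: only the top candidates under an assumed score
# (here, simply preferring segmentations with fewer words).
best = n_best(lattice_paths, score=lambda words: -len(words), n=2)
```

In this toy case the lattice holds five segmentations, while the N-best list passes only two of them to the parser, which trades some coverage for parsing speed.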
[0082] With reference again to FIG. 4, after segmentation, the segmented
sentence is passed to a natural language parser 410 and a keyword
module 412. The parser 410 attempts to parse the segmented sentence
according to a set of rules found in a rule database 414. If a sentence
parses successfully, the parser 410 outputs a parse tree. If
parsing is unsuccessful, the keyword module 412 uses a word database 416 to
extract and output keywords from the segmented sentence. As shown in FIG.
2, the parse tree and keywords are passed to the FAQ matcher 144 and the
keywords are passed to the keyword-based component 146. Accordingly, the
architecture 140 allows templates to be matched regardless of the type of
output, whether parse trees or keywords.
[0083] Exemplary Parsing Methodology
[0084] One particular implementation of a robust parser is based on a
spoken language system known as "LEAP", which stands for Language Enabled
Applications. LEAP is technology being developed in Microsoft Research
that aims at spoken language understanding. For a more detailed
discussion of LEAP, the reader is directed to an article by Y. Wang,
entitled "A robust parser for spoken language understanding", Proc. of
6th European Conference on Speech Communication and Technology
(Eurospeech99), Budapest, Hungary, September 1999, Vol. 5, pp. 2055-2058.
[0085] The robust parser employs a parsing algorithm that is an extension
of a bottom-up chart-parsing algorithm. The grammar defines semantic
classes. Each semantic class is defined by a set of rules and
productions. For example, a semantic class <Route> is defined for
the travel path from one place to another. This class is represented as
follows:
<Route> TravelPath {
    => @from <PlaceName:place1> @to <PlaceName:place2> @route;
    @from => from | ...;
    . . . . . .
}
<PlaceName> Place {
    Beijing | Shanghai | ...;
}
[0086] In the semantic classes above, <Route> defines a return class
type, and TravelPath is a semantic class that contains a number of rules
(the first line) and productions (the second line). In this class,
"@from" parses a piece of the input sentence according to a production as
shown in the second line. The input item after the "@from" object matches
according to a <PlaceName> semantic class. If there are input
tokens that are not parsable by any part of the rule, they are ignored
by the parser. In this case, the scoring of the parse result will be
correspondingly discounted to reflect a lower level of confidence in the
parse result.
[0087] As an example, suppose the input query is:
[0088] ?(How to go from Beijing to Shanghai?)
[0089] The robust parser will return the following result:
[0090] <VOID> place place
[0091] <Route> place place
[0092] <PlaceName:place1> place
[0093] <PlaceName:place2> place
[0094] Here <VOID> represents the root semantic class. Note that
this input query cannot be parsed using the first rule in the semantic
class TravelPath if a traditional parser is used because the Chinese word
"" cannot match any objects in the rule. Since the robust parser can skip
this word to match the rest, parsing will continue to produce a partial
result. In one implementation, the score of the parsing result is
calculated by discounting the number of input items and rule items that
are skipped during the parsing operation. This score is normalized to
give a percentage confidence value.
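By way of illustration only, robust matching with skipped items and a discounted score might be sketched as follows. The greedy left-to-right matching strategy, the English gloss tokens, the `LEXICON` contents, and the particular normalization (one minus the fraction of skipped items) are all assumptions made for this sketch; the description above states only that skipped input items and rule items are discounted and that the score is normalized.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical rule for the TravelPath class, as an ordered list of items.
RULE: List[str] = ["@from", "place1", "@to", "place2", "@route"]

# Hypothetical lexicon: the tokens each rule item accepts (English glosses).
LEXICON: Dict[str, Set[str]] = {
    "@from": {"from"},
    "place1": {"Beijing", "Shanghai"},
    "@to": {"to"},
    "place2": {"Beijing", "Shanghai"},
    "@route": {"how-to-go"},
}


def robust_match(tokens: List[str],
                 rule: List[str],
                 lexicon: Dict[str, Set[str]]) -> Tuple[Dict[str, str], float]:
    """Match tokens against rule items in order, skipping what does not fit."""
    bindings: Dict[str, str] = {}
    position = 0          # next rule item that may still be matched
    skipped_tokens = 0    # input items that matched no rule item
    for token in tokens:
        hit = next((i for i in range(position, len(rule))
                    if token in lexicon[rule[i]]), None)
        if hit is None:
            skipped_tokens += 1   # noisy word: skip it and keep parsing
            continue
        bindings[rule[hit]] = token
        position = hit + 1        # rule items jumped over stay unmatched
    skipped_items = len(rule) - len(bindings)
    total = len(tokens) + len(rule)
    # Assumed normalization: 1.0 means nothing was skipped on either side.
    score = 1.0 - (skipped_tokens + skipped_items) / total
    return bindings, score


# A query with one unparsable word ("um") still yields a full binding.
full_bindings, full_score = robust_match(
    ["um", "from", "Beijing", "to", "Shanghai", "how-to-go"], RULE, LEXICON)

# A keyword-like query matches only the two place items.
partial_bindings, partial_score = robust_match(
    ["Beijing", "Shanghai"], RULE, LEXICON)
```

The first query loses a small amount of confidence for the single skipped word, whereas the second is discounted more heavily because three of the five rule items are left unmatched.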
[0095] Evaluating Parsing Results
[0096] A parsed result will be selected if it covers the most words in the
query and the most parts of rules. To improve the scoring strategy, the
search engine learns probabilities from query logs, including:
[0097] probabilities of the rules;
[0098] penalty for robust rule matching (insertion, deletion,
substitution);
[0099] probabilities of "non-matching" words;
[0100] term probabilities according to their frequency in the query log.
[0101] Consider the rule in the semantic class <Route>
TravelPath:
[0102] @from <PlaceName:place1> @to <PlaceName:place2> @route;
[0103] The search engine can train the probabilities associated with this
rule. A rule with a high probability value means that using the rule to
parse a query is more reliable. The search engine can also train the
penalty values for robust matching by exacting a penalty for any item in
either a rule or the query sentence that is skipped during parsing.
[0104] Consider the above rule for the sentence "" ("How to get from
Beijing to Shanghai?"). A relatively low penalty is set if the @from item
"(should)" is skipped. A higher penalty is assigned if the @route item
"(how to go)" is skipped.
[0105] Statistics are gathered using the query log files as the base data.
A more detailed discussion of training the robust parser using query log
files is provided below under the heading "Training Robust Parser
Using Query Log Files".
[0106] Question Matching (Blocks 306 and 308)
[0107] The FAQ matcher 144 attempts to find a set of relevant concepts and
their related answers from a given user query. To accomplish this, the
FAQ matcher 144 maps the concepts through several intermediate spaces to
ultimately identify answers to the queries.
[0108] FIG. 6 shows a mapping process 600 of the question matching
operation. The mapping process 600 is implemented as computer executable
instructions that, when executed, perform the operations illustrated as
blocks in FIG. 6. For discussion purposes, the mapping process is
described in the context of a realistic example in which a user asks:
[0109] ? ("How to go from Beijing to Shanghai?")
[0110] At block 602, the FAQ matcher maps the parsed query from a query
space to a concept or FAQ space. The natural language processing module
142 returns a parse tree containing a semantic class and its parameters:
[0111] <VOID> place place
[0112] <Route> place place
[0113] <PlaceName:place1> place
[0114] <PlaceName:place2> place
[0115] A collection of concepts indexed on "" ("Route") and "" ("Travel"),
and possibly other related concepts, are stored in the FAQ database 208.
[0116] FIG. 7 illustrates example database tables 700 maintained in the
FAQ database 208. In this example, the FAQ database is configured as a
relational database in which data records are organized in tables that
may be associated with one another using definable relationships. The
database includes a Concept-FAQ table 702, a FAQ table 704, a template
table 706, and an answer table 708. For this example, the answer table
708 pertains to answers about a flight schedule, and hence is labeled as
a "Flight Table".
[0117] The Concept-FAQ table 702 is the core data structure for the whole
database. It correlates concepts with frequently asked questions (FAQs).
A FAQ is made up of a few concepts that are in fact represented by
certain terms, such as "Route". Every FAQ is related to one or more
concepts and every concept is related to one or more FAQs. Thus, there is
a many-to-many relationship between FAQs and concepts. Every FAQ is
assigned a FAQ ID to uniquely distinguish FAQs from one another.
[0118] A record in the Concept-FAQ table 702 includes a concept, a FAQ ID,
and a weight. Each record indicates that a FAQ (with a particular ID) is
related to the concept according to a correlation weighting factor. The
weighting factor indicates how likely it is that the concept pertains to
the associated FAQ. The weighting factor is learned from a later analysis of
the query log file.
[0119] Using the Concept-FAQ table 702, the FAQ matcher 144 computes a
correlation between a concept set $\Phi = (\mathrm{concept}_1,
\mathrm{concept}_2, \ldots, \mathrm{concept}_n)$ and a FAQ with ID of x as
follows:

$$\sum_{i=1}^{n} \mathrm{Weight}(\mathrm{concept}_i, x)$$
[0120] Hence, given a concept set, the FAQ matcher can obtain the top n
best-matched FAQs. For example, the concept set of the question "" ("How
to go from Beijing to Shanghai") consists of "Travel" and "Route", and the
match result is the FAQ set {101 (weight 165), 105 (weight 90)}.
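By way of illustration only, the correlation between a concept set and each FAQ can be sketched as a sum of weights over Concept-FAQ records. The totals 165 and 90 for FAQs 101 and 105 come from the example above; the split of those totals across individual records, the extra "Weather" record, and the function name are hypothetical values chosen for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical Concept-FAQ records: (concept, FAQ ID, weight).
CONCEPT_FAQ: List[Tuple[str, int, float]] = [
    ("Route", 101, 100.0),
    ("Travel", 101, 65.0),
    ("Travel", 105, 90.0),
    ("Weather", 230, 70.0),
]


def top_faqs(concepts: Iterable[str],
             table: List[Tuple[str, int, float]],
             n: int) -> List[Tuple[int, float]]:
    """Sum Weight(concept_i, x) per FAQ x and return the n best FAQs."""
    wanted = set(concepts)
    scores: Dict[int, float] = defaultdict(float)
    for concept, faq_id, weight in table:
        if concept in wanted:
            scores[faq_id] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


matched = top_faqs(["Travel", "Route"], CONCEPT_FAQ, n=2)
```

With these assumed records, the concept set ("Travel", "Route") yields FAQ 101 with weight 165 followed by FAQ 105 with weight 90.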
[0121] The semantic class returned from the parser is used to search the
concept-FAQ table. In our example, the semantic class "Route" is used as
a key to search the Concept-FAQ table 702. The search determines that the
third entry 710 in the table yields a perfect match. Corresponding to the
"route" entry 710 is the FAQ with ID "101", which can be used to index
the FAQ table 704.
[0122] At block 604 in the mapping process of FIG. 6, the FAQ matcher maps
the FAQs from the FAQ space to a template space. A template represents a
class of standard questions and corresponds to a semantic class in the
robust parser. Every template has one or more parameters with values.
Once all the parameters in a template are assigned a value, a standard
question is derived from this template.
[0123] For example, " (Which flights are there)" is a template
representing a class of questions about the flight from or to a certain
location. Here, the wild card "*" denotes that there is a parameter in
the template that can be assigned an arbitrary place name. If
"(Shanghai)" is chosen, this template is transformed into a standard
question "(Which Shanghai flights are there)".
[0124] The FAQ table 704 associates frequently asked questions with
templates. The FAQ table 704 may also include a weight to indicate how
likely it is that a FAQ pertains to a template. In our example, the
frequently asked
question with an ID of "101" has three entries in the FAQ table 704,
identifying three corresponding templates with IDs 18, 21 and 24.
Template 24 carries a weight of "100", indicating that this template is
perhaps a better fit for the given FAQ than the other templates. The
template IDs can then be used to index into the template table 706.
[0125] The template table 706 correlates template IDs with template
descriptions and identities of corresponding answer sets. In FIG. 7, for
example, the template with ID 18 corresponds to an answer table that is
named "Flight Table."
[0126] It is infeasible to construct a template for every question because
there are many similar questions. Instead a single template is prepared
for all similar questions. This effectively compresses the FAQ set. In
our example, the mapping result for FAQ set {101, 105} is a template set
{24 (weight 165+100), 18 (weight 165+80), 21 (weight 165+50), 31 (weight
90+75)}, where the weights are obtained by a simple addition of the
weights from previous steps.
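By way of illustration only, the mapping from FAQ space to template space by simple addition of weights can be sketched as follows, using the weights given in the example above. The function name and the choice to keep the larger combined weight when a template is reachable from more than one FAQ are assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

# FAQ weights from the concept-matching step of the example.
FAQ_SCORES: Dict[int, int] = {101: 165, 105: 90}

# FAQ table of the example: FAQ ID -> list of (template ID, weight).
FAQ_TEMPLATES: Dict[int, List[Tuple[int, int]]] = {
    101: [(24, 100), (18, 80), (21, 50)],
    105: [(31, 75)],
}


def rank_templates(faq_scores: Dict[int, int],
                   faq_templates: Dict[int, List[Tuple[int, int]]]
                   ) -> List[Tuple[int, int]]:
    """Add each FAQ weight to its template weights and rank the templates."""
    combined: Dict[int, int] = {}
    for faq_id, faq_weight in faq_scores.items():
        for template_id, template_weight in faq_templates.get(faq_id, []):
            total = faq_weight + template_weight
            # Assumption: if several FAQs reach one template, keep the best.
            combined[template_id] = max(combined.get(template_id, 0), total)
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


ranked = rank_templates(FAQ_SCORES, FAQ_TEMPLATES)
```

The ranking places template 24 first (165+100), followed by templates 18, 21 and 31.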
[0127] At block 606 in the mapping process of FIG. 6, the FAQ matcher
maps templates from the template space to an answer space. All answers
for a template are previously stored in a separate answer table, such as
answer table 708. The answer table is indexed by parameter values of the
template. When matching is done, the best parameter is calculated and
passed to the search engine UI 200 to be shown to the user.
[0128] As shown in answer table 708, every answer has two parts: a URL and
its description. In our example, if the user chooses template 18 ()
and the value of the parameter is set to "", the flight table is
returned with the portion of "" in the table shown to the user.
[0129] Training Robust Parser Using Query Log Files
[0130] The search engine architecture 140 uses information mined by the
log analyzer 148 to adapt the robust parser 142 so that it evaluates the
output based on the coverage of a rule against the input query. A parsed
result will be selected if it covers the most words in the query and the
most parts of rules. To improve the scoring strategy, probabilities
learned from query logs include:
[0131] confidence values associated with each rule;
[0132] confidence values associated with each item in a rule;
[0133] confidence values associated with each word in an input sentence.
[0134] First, consider the confidence values associated with each rule. To
evaluate the parsing result more accurately, each rule is assigned a
probability. Since the rules are local to a semantic class, the sum of
probabilities of all the rules in a semantic class is one. Considering a
semantic class having n rules, if the probability of the i-th rule is
$w_{r_i}$, then

$$\sum_{i=1}^{n} w_{r_i} = 1$$
[0135] The productions in grammar are either global or local to a semantic
class. The probabilities for all global productions (the productions
always available) that expand the same item sum to one. The probabilities
for all productions local to one semantic class (the productions only
available within a semantic class) that expand the same item sum to one
too.
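By way of illustration only, one simple way to obtain rule probabilities that sum to one within a semantic class is a relative-frequency estimate over the query log. The rule identifiers, the counts, and the use of relative frequency itself are assumptions made for this sketch; the description above requires only that the probabilities in a class sum to one.

```python
from typing import Dict


def rule_probabilities(usage_counts: Dict[str, int]) -> Dict[str, float]:
    """Normalize per-rule usage counts of one semantic class to sum to one."""
    total = sum(usage_counts.values())
    if total == 0:
        # No evidence in the log yet: fall back to a uniform distribution.
        return {rule: 1.0 / len(usage_counts) for rule in usage_counts}
    return {rule: count / total for rule, count in usage_counts.items()}


# Hypothetical counts of how often each TravelPath rule parsed a logged query.
travel_path_counts = {"rule_1": 30, "rule_2": 10}
travel_path_probs = rule_probabilities(travel_path_counts)
```

The same normalization could be applied separately to each group of global or local productions that expand the same item.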
[0136] After learning the probabilities for each rule, the next task is to
learn the confidence values associated with each item in a
rule. Considering a rule having N items, robust matching is performed on
the rule. Suppose the items $T_{i_1}, T_{i_2}, \ldots, T_{i_m}$ are
matched, but the items $T_{j_1}, T_{j_2}, \ldots, T_{j_n}$
($1 \le i_l, j_k \le N$) are not matched. A confidence value indicating how well
this rule is matched is then measured. The measurement may be performed,
for example, by using neural networks.
[0137] One suitable implementation is to use a perceptron to measure the
confidence. A perceptron has N input units, each of them representing an
item in the rule, and one output unit, which represents the confidence of
the rule matching. To represent the confidence as a continuous rather than
Boolean value, a sigmoid function is used as the activation function for the
output unit. For a matched item $T_{i_l}$, the corresponding
input is $I_{i_l} = C_{i_l}$, in which
$C_{i_l}$ is the confidence of $I_{i_l}$; whereas for
a non-matched item $T_{j_k}$, the input is
$I_{j_k} = 0$.
[0138] The output unit is:

$$c_r = \mathrm{sigmoid}\left(\sum_{p} w_{tp} I_p\right)$$
[0139] where $w_{tp}$ is the weight from input unit $I_p$ to the output
unit. A standard gradient descent method is used to train the perceptron,
such as that described in S. Russell, P. Norvig, "Artificial
Intelligence", Prentice-Hall, Inc. 1995, pp 573-577. The training data is
the user query log file where the sentences are classified as positive
and negative examples.
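By way of illustration only, a perceptron with a sigmoid output unit trained by gradient descent might be sketched as follows. The three-item rule, the training examples, the squared-error loss, the learning rate, and the epoch count are all assumptions made for this sketch; in practice the positive and negative examples would be drawn from the query log.

```python
import math
import random
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: List[float], inputs: List[float]) -> float:
    """Confidence of a rule match: sigmoid of the weighted sum of inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))


def train(examples: List[Tuple[List[float], float]],
          n_items: int,
          learning_rate: float = 0.5,
          epochs: int = 2000,
          seed: int = 0) -> List[float]:
    """Gradient descent on squared error for a single sigmoid output unit."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_items)]
    for _ in range(epochs):
        for inputs, target in examples:
            output = predict(weights, inputs)
            # Derivative of 0.5 * (target - output)^2 through the sigmoid.
            delta = (target - output) * output * (1.0 - output)
            for p in range(n_items):
                weights[p] += learning_rate * delta * inputs[p]
    return weights


# Hypothetical examples for a rule with items (@from, place, @route).
# Each input is the item's confidence if matched, or 0 if the item was skipped.
EXAMPLES = [
    ([1.0, 1.0, 1.0], 1.0),  # everything matched: correct parse
    ([0.0, 1.0, 1.0], 1.0),  # @from skipped: still a correct parse
    ([1.0, 1.0, 0.0], 0.0),  # @route skipped: wrong parse
    ([1.0, 0.0, 0.0], 0.0),  # only @from matched: wrong parse
]

weights = train(EXAMPLES, n_items=3)
confidence_full = predict(weights, [1.0, 1.0, 1.0])
confidence_no_route = predict(weights, [1.0, 1.0, 0.0])
```

After training on these assumed examples, a match that skips the @route item receives a lower confidence than a complete match, which mirrors the idea that some rule items are more costly to skip than others.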
[0140] Finally, after learning the confidence values associated with each
item in a rule, the last task is to learn the confidence values
associated with each word in an input sentence. A non-matching word is
the word in the input sentence that does not match any item in the rule.
For a word W, if there are n non-matching occurrences in the training
corpus, and if m ($m \le n$) of them result in correct rule-matching,
then the confidence of this non-matching is $p = m/n$. The confidence of the
robust sentence matching is:

$$c_s = \prod_{i} p_i$$
[0141] The confidence of a rule r is calculated as below:

$$P = w_r \cdot c_r \cdot c_s$$
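By way of illustration only, the three learned quantities can be combined as follows. The sketch assumes that the sentence-level confidence is the product of the per-word confidences p = m/n of the skipped words; the function names and all numeric values are hypothetical.

```python
from math import prod  # available in Python 3.8 and later
from typing import List


def word_confidence(correct: int, occurrences: int) -> float:
    """p = m / n for a non-matching word seen n times, m of them harmless."""
    return correct / occurrences if occurrences else 0.0


def sentence_confidence(word_confidences: List[float]) -> float:
    """Assumed form of c_s: product over the skipped (non-matching) words."""
    return prod(word_confidences)  # an empty list gives 1.0 (nothing skipped)


def rule_confidence(w_r: float, c_r: float,
                    word_confidences: List[float]) -> float:
    """P = w_r * c_r * c_s."""
    return w_r * c_r * sentence_confidence(word_confidences)


# Hypothetical values: rule probability 0.6, perceptron output 0.9, and two
# skipped words that were harmless in 8 of 10 and 5 of 10 logged cases.
skipped = [word_confidence(8, 10), word_confidence(5, 10)]
overall = rule_confidence(0.6, 0.9, skipped)
```

Under these assumed values the overall confidence is about 0.216, and a parse that skips no words keeps the full value of the first two factors.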
[0142] Search Engine User Interface
[0143] The search engine UI 200 is designed to improve efficiency and
accuracy in information retrieval based on a user's search intention. The
intention-centric UI design guides users to a small number of
high-quality results, often consisting of fewer than ten
intention-related answers. The "intention" of a search on the Internet is
a process rather than an event. The search engine UI 200 attempts to
capture the process as three main tasks. First, users are permitted to
pose queries as natural language questions. Second, the UI presents
parameterized search results from the search engine and asks users to
confirm their intention. Finally, users are permitted to select their
desired answer.
[0144] FIG. 8 shows an example screen display 800 of the search engine UI
200. The screen display has a query entry area 802 that allows users to
enter natural language questions. Consider, for example, the following
two queries in the traveling domain search:
[0145] () () ? (How many traveling routes exist from (Beijing) to
(Shanghai)?)
[0146] () ? (Please tell me about the famous sights in (Beijing)?)
[0147] Natural language is a powerful tool for expressing the user
intention. The most important parts of a query are referred to as core
phrases. In these examples, the underlined words are core phrases, the
parenthesized words are keywords, and the remaining words are redundant
words.
[0148] In some cases, it is difficult or impossible to identify users'
intention from the original query alone. In this case, the search engine
selects all possibly relevant concept templates and asks the user to
confirm. Related concepts are clustered according to their similarity and
the different parts of the result are treated as parameters. From the
above query, two similar search results "" ("famous sites in Beijing")
and "; " ("famous sites in Shanghai") are combined into one group, where
(Beijing) and (Shanghai) are treated as parameters.
[0149] FIG. 9 shows an exemplary display screen 900 that is returned with
various parameterized search results. The result "() " (famous sites in
[Beijing|Shanghai]) is depicted in result area 902. The
parameterized result can help focus users' attention on the core phrases,
which in this case correspond to "" (famous sites).
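By way of illustration only, combining similar results into one parameterized result might be sketched as follows. The sketch assumes that the parameter values (here, place names) are known in advance and uses plain string replacement in place of the similarity-based clustering described above; the function name and the English glosses are hypothetical.

```python
from typing import Dict, List


def parameterize(results: List[str], parameter_values: List[str]) -> List[str]:
    """Group results that differ only in one known parameter value."""
    groups: Dict[str, List[str]] = {}
    for text in results:
        # Find the first known parameter value occurring in this result.
        value = next((v for v in parameter_values if v in text), None)
        template = text.replace(value, "*", 1) if value else text
        values = groups.setdefault(template, [])
        if value and value not in values:
            values.append(value)
    # Render each group, showing the alternatives in place of the wild card.
    return [
        template.replace("*", "[" + "|".join(values) + "]", 1) if values else template
        for template, values in groups.items()
    ]


PLACES = ["Beijing", "Shanghai"]
grouped = parameterize(
    ["famous sites in Beijing", "famous sites in Shanghai"], PLACES)
```

For the two glossed results above, the sketch yields the single entry "famous sites in [Beijing|Shanghai]", leaving the shared core phrase visible and the place name as a parameter.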
[0150] In addition to intention centricity, the search engine UI is
designed to seamlessly integrate searching and browsing. The search
engine UI is constructed with a strong sense of structure and navigation
support so that users know where they are, where they have been, and
where they can go. In particular, there are two kinds of combination
modes for search and browsing: (1) browsing followed by searching, and
(2) searching followed by browsing.
[0151] For discussion purposes, suppose a user wants to know how to travel
to Shanghai for fun. At first, the user does not know what kind of
information the web can provide. The user can open a travel
information-related web site and find that there is information about
"travel routes" (). At this point, the user may pose a query about the
specific route to go to Shanghai from Beijing by asking, for example, "?"
("How to get from Beijing to Shanghai?")
[0152] Alternatively, the user may wish to search first, rather than
browse to a travel web site. After the user inputs a natural language
query, the search engine judges the user intention by using the core
phrases. Because the intention extends beyond a simple question, the
search engine predicts the user's intention from the current query and
provides reasonable answers for confirmation. For example, in the above
example, the real goal of the user is to get useful information about
traveling to Shanghai. Thus, the sightseeing information about Shanghai
is related to the user's intention. In response to the above query, the
search results are two alternative answers related to the user's
intention:
[0153] () () . (The sightseeing routes from Beijing to Shanghai)
[0154] () (The sightseeing sites in Shanghai)
[0155] Conclusion
[0156] A new-generation search engine for Internet searching combines
natural language understanding, FAQ template database matching, and user
interface components. The architecture is configured to precisely index
frequently asked concepts and intentions from user queries, based on
parsed results and/or keywords.
[0157] Although the description above uses language that is specific to
structural features and/or methodological acts, it is to be understood
that the invention defined in the appended claims is not limited to the
specific features or acts described. Rather, the specific features and
acts are disclosed as exemplary forms of implementing the invention.
* * * * *