United States Patent Application: 20120136885
Kind Code: A1
ZHOU; Hong; et al.
May 31, 2012
QUERY REWRITING WITH ENTITY DETECTION
Abstract
A system receives a search query, determines whether the received search
query includes an entity name, and determines whether the entity name is
associated with a common word or phrase. When the entity name is
associated with a common word or phrase, the system generates a link to a
rewritten query, performs a search based on the received search query to
obtain first search results, and provides the first search results and
the link to the rewritten query. When the entity name is not associated
with a common word or phrase, the system rewrites the received search
query to include a restrict identifier associated with the entity name,
generates a link to the received search query, performs a search based on
the rewritten search query to obtain second search results, and provides
the second search results and the link to the received search query.
Inventors:
ZHOU; Hong (Sunnyvale, CA)
BHARAT; Krishna (San Jose, CA)
SCHMITT; Michael (Neufakru, DE)
CURTISS; Michael (Sunnyvale, CA)
MAYER; Marissa (Palo Alto, CA)
Assignee: GOOGLE INC., Mountain View, CA
Serial No.: 367114
Series Code: 13
Filed: February 6, 2012
Current U.S. Class: 707/765; 707/E17.066
Class at Publication: 707/765; 707/E17.066
International Class: G06F 17/30 20060101 G06F017/30
Claims
1-22. (canceled)
23. A method comprising: receiving, by one or more devices, a first
search query that includes a particular search query term; determining,
by the one or more devices, that the particular search query term does
not correspond to one of a plurality of words or phrases stored in a data
structure; in response to the determining, modifying, by the one or more
devices, the first search query to obtain a second search query, the
second search query being different than the first search query, and the
second search query including information that restricts a search,
performed based on the second search query, to a particular domain; and
causing, by the one or more devices, a search to be performed based on
the second search query.
24. The method of claim 23, where the first search query is received from
another device that is different than the one or more devices, the method
further comprising: providing one or more search results, responsive to
the second search query, to the other device.
25. The method of claim 23, where the particular domain is associated
with the particular search query term.
26. The method of claim 23, further comprising: receiving a third search
query that includes another search query term; determining that the other
search query term corresponds to one of the plurality of words or
phrases; in response to determining that the other search query term
corresponds to one of the plurality of words or phrases: generating a
link to a fourth search query that is different than the third search
query, the fourth search query including information that restricts a
search, performed based on the fourth search query, to a domain
associated with the other search term; and providing the link for
display.
27. The method of claim 23, where the particular search query term is
associated with one of a news source, a store, a product, a brand, a
manufacturer, or a place, and the particular domain being associated with
the one of the news source, the store, the product, the brand, the
manufacturer, or the place.
28. The method of claim 23, further comprising: comparing the particular
search query term to information that includes a plurality of search
terms and identifies a plurality of domains, each of the plurality of
domains being associated with one or more of the plurality of search
terms; determining that the particular search query term is included in
the plurality of search terms based on comparing the particular search
term to the information, a domain, of the plurality of domains,
corresponding to the particular domain, the domain being associated with
the particular search query term; and determining that the particular
search query term does not correspond to one of the plurality of words or
phrases after determining that the particular search query term is
included in the plurality of search terms.
29. The method of claim 23, further comprising: generating a link to the
first search query; and providing the link for display.
30. A system, comprising: one or more computers to: receive a first
search query; determine that a particular search query term, included in
the first search query, does not correspond to one of a plurality of
words or phrases stored in a data structure; in response to the
determining, modify the first search query to obtain a second search
query, the second search query being different than the first search
query, and the second search query including information that restricts a
search, performed based on the second search query, to a particular
corpus of documents; and cause a search to be performed based on the
second search query.
31. The system of claim 30, where the particular corpus of documents is
associated with the particular search query term.
32. The system of claim 30, where the first search query is received from
another device that is different than the one or more computers, and the
one or more computers further to: provide one or more search results,
responsive to the second search query, to the other device for display.
33. The system of claim 30, where the particular search query term is
associated with one of a news source, a store, a product, a brand, a
manufacturer, or a place, and the particular corpus of documents being
associated with the one of the news source, the store, the product, the
brand, the manufacturer, or the place.
34. A computer-readable memory device comprising: a plurality of
instructions which, when executed by one or more processors of one or
more devices, cause the one or more processors to: receive a first search
query; determine that a particular search query term, included in the
first search query, does not correspond to one of a plurality of words or
phrases; modify the first search query to obtain a second search query in
response to determining that the particular search query term does not
correspond to one of the plurality of words or phrases, the second search
query being different than the first search query, and the second search
query including information that restricts a search, performed based on
the second search query, to a particular corpus of documents; and cause a
search to be performed based on the second search query.
35. The computer-readable memory device of claim 34, where the particular
corpus of documents is associated with the particular search query term
and corresponds to documents associated with a single domain.
36. The computer-readable memory device of claim 34, further comprising
one or more instructions to: compare the particular search query term to
information that includes a plurality of search terms and identifies a
plurality of domains, each of the plurality of domains being associated
with one or more of the plurality of search terms; determine that the
particular search query term is included in the plurality of search terms
based on comparing the particular search query term to the information,
the particular corpus of documents corresponding to a particular domain
of the plurality of domains, the particular domain being associated with
the particular search query term; and determine that the particular
search query term does not correspond to the one of the plurality of
words or phrases after determining that the particular search query term
is included in the plurality of search terms.
37. The computer-readable memory device of claim 36, further comprising
one or more instructions to: obtain the information from a plurality of
sources that include at least one of an online directory, a group
posting, or a corpus of documents.
38. The computer-readable memory device of claim 34, where the first
search query is received from another device that is different than the
one or more devices, the computer-readable memory device further
comprising one or more instructions to: provide one or more search
results, responsive to the second search query, to the other device.
39. The computer-readable memory device of claim 34, further comprising
one or more instructions to: generate a link to the first search query;
provide the link for display; detect selection of the link; and cause a
search to be performed based on the first search query, in response to
the detected selection of the link, the search, performed based on the
first search query, not being restricted to the particular corpus of
documents.
40. The computer-readable memory device of claim 34, where the particular
search query term is associated with one of a news source, a store, a
product, a brand, a manufacturer, or a place, and where the particular
corpus of documents is associated with the one of the news source, the
store, the product, the brand, the manufacturer, or the place.
41. The computer-readable memory device of claim 34, further comprising
one or more instructions to: receive a third search query that includes
another search query term; determine that the other search query term
corresponds to one of the plurality of words or phrases; and cause a
search to be performed based on the third search query in response to
determining that the other search query term corresponds to one of the
plurality of words or phrases.
42. The computer-readable memory device of claim 41, further comprising
one or more instructions to: generate a link to a fourth search query
that is different than the third search query, in response to determining
that the other search query term corresponds to one of the plurality of
words or phrases, the fourth search query including information that
restricts a search, performed based on the fourth search query, to a
domain associated with the other search term; and provide the link for
display with one or more search results responsive to the third search
query.
Description
BACKGROUND OF THE INVENTION
[0001] 1. Field of the Invention
[0002] Systems and methods consistent with the principles of the invention
relate generally to information retrieval and, more particularly, to
rewriting of search queries based on detection of the names of certain
entities in the queries.
[0003] 2. Description of Related Art
[0004] The World Wide Web ("web") contains a vast amount of information.
Search engines assist users in locating desired portions of this
information by cataloging web documents. Typically, in response to a
user's request, a search engine returns links to documents relevant to
the request.
[0005] Search engines may base their determination of the user's interest
on search terms (called a search query) provided by the user. The goal of
a search engine is to identify links to relevant results based on the
search query. Typically, the search engine accomplishes this by matching
the terms in the search query to a corpus of pre-stored web documents.
Web documents that contain the user's search terms are considered "hits"
and are returned to the user.
[0006] Some search engines permit a user to restrict a search to a set of
related documents, such as documents associated with the same web site,
by including special characters or terms in the search query. Oftentimes,
however, users forget to include these special characters/terms or do not
know about them.
SUMMARY OF THE INVENTION
[0007] According to one aspect consistent with the principles of the
invention, a method may include receiving a search query, determining
whether the received search query includes an entity name, and
determining whether the entity name is associated with a common word or
phrase. The method may also include selectively rewriting the received
search query based on whether the entity name is determined to be
associated with a common word or phrase, performing a search based on the
received search query or the rewritten search query to obtain search
results, and presenting the search results.
[0008] According to another aspect, a system may include means for
receiving a search query, means for determining whether the received
search query includes an entity name, and means for determining whether
the entity name is associated with a common word or phrase. The system
may also include means for rewriting the received search query when it is
determined that the entity name is associated with a common word or
phrase, means for performing a search based on the rewritten search query
to obtain search results, and means for providing the search results.
[0009] According to yet another aspect, a system includes a memory and a
processor connected to the memory to receive a search query, determine
whether the received search query includes an entity name, and
selectively rewrite the received search query to obtain a rewritten
search query when it is determined that the received search query
includes an entity name.
[0010] According to a further aspect, a method may include determining a
set of entity names, determining whether each of the entity names is
associated with a common word or phrase, and generating a table of the
entity names that are associated with common words or phrases.
[0011] According to another aspect, a method may include receiving a
search query, determining whether the received search query includes an
entity name, and determining whether the entity name is associated with a
common word or phrase. When the entity name is associated with a common
word or phrase, the method may include generating a link to a rewritten
query, performing a search based on the received search query to obtain
first search results, and providing the first search results and the link
to the rewritten query. When the entity name is not associated with a
common word or phrase, the method may include rewriting the received
search query to include a restrict identifier associated with the entity
name, generating a link to the received search query, performing a search
based on the rewritten search query to obtain second search results, and
providing the second search results and the link to the received search
query.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] The accompanying drawings, which are incorporated in and constitute
a part of this specification, illustrate an embodiment of the invention
and, together with the description, explain the invention. In the
drawings,
[0013] FIG. 1 is a diagram of an exemplary network in which systems and
methods consistent with the principles of the invention may be
implemented;
[0014] FIG. 2 is an exemplary diagram of a client and/or server of FIG. 1
according to an implementation consistent with the principles of the
invention;
[0015] FIG. 3 is an exemplary functional block diagram of a portion of a
server of FIG. 1 according to an implementation consistent with the
principles of the invention;
[0016] FIG. 4 is an exemplary diagram of a list of candidate strings
according to an implementation consistent with the principles of the
invention;
[0017] FIG. 5 is a flowchart of exemplary processing for generating a list
of candidate strings according to an implementation consistent with the
principles of the invention;
[0018] FIG. 6 is a flowchart of exemplary processing for selectively
rewriting a query according to an implementation consistent with the
principles of the invention;
[0019] FIGS. 7 and 8 are diagrams of an automatic query rewrite example in
a news context according to an implementation consistent with the
principles of the invention; and
[0020] FIGS. 9-11 are diagrams of a query rewrite suggestion example in
the news context according to an implementation consistent with the
principles of the invention.
DETAILED DESCRIPTION
[0021] The following detailed description of the invention refers to the
accompanying drawings. The same reference numbers in different drawings
may identify the same or similar elements. Also, the following detailed
description does not limit the invention.
Overview
[0022] Systems and methods consistent with the principles of the invention
may rewrite search queries or generate suggestion links to rewritten
search queries upon detection of the names of certain entities. An
"entity," as used herein, may refer to anything that can be tagged as
being associated with certain documents. Examples of entities may include
news sources, stores, such as online stores, product categories, brands
or manufacturers, specific product models, condition (e.g., new, used,
refurbished, etc.), authors, artists, people, places, and organizations.
[0023] Some entity names are unambiguous and uniquely identify particular
entities. A large number of names, however, are somewhat ambiguous or
generic, making it more difficult to identify the entities to which they
are intended to correspond when included in users' search queries.
Systems and methods consistent with the principles of the invention
provide mechanisms for determining the entities to which entity names
correspond and selectively rewriting users' search queries based on the
entity names. Accordingly, a user's search query may be restricted to a
search of document(s) associated with the entity that the user intended
in the search.
Exemplary Network Configuration
[0024] FIG. 1 is an exemplary diagram of a network 100 in which systems
and methods consistent with the principles of the invention may be
implemented. Network 100 may include multiple clients 110 connected to
multiple servers 120-140 via a network 150. Network 150 may include a
local area network (LAN), a wide area network (WAN), a telephone network,
such as the Public Switched Telephone Network (PSTN), an intranet, the
Internet, a memory device, or a combination of networks. Two clients 110
and three servers 120-140 have been illustrated as connected to network
150 for simplicity. In practice, there may be more or fewer clients and
servers. Also, in some instances, a client may perform the functions of a
server and a server may perform the functions of a client.
[0025] Clients 110 may include client components. A component may be
defined as a device, such as a wireless telephone, a personal computer, a
personal digital assistant (PDA), a laptop, or another type of
computation or communication device, a thread or process running on one
of these devices, and/or an object executable by one of these devices.
Servers 120-140 may include server components that gather, process,
search, and/or maintain documents in a manner consistent with the
principles of the invention. Clients 110 and servers 120-140 may connect
to network 150 via wired, wireless, and/or optical connections.
[0026] In an implementation consistent with the principles of the
invention, server 120 may include a search engine 125 usable by clients
110. Server 120 may crawl a corpus of documents (e.g., web pages), index
the documents, and store information associated with the documents in a
repository of crawled documents. Servers 130 and 140 may store or
maintain documents that may be crawled by server 120. While servers
120-140 are shown as separate entities, it may be possible for one or
more of servers 120-140 to perform one or more of the functions of
another one or more of servers 120-140. For example, it may be possible
that two or more of servers 120-140 are implemented as a single server.
It may also be possible for a single one of servers 120-140 to be
implemented as two or more separate (and possibly distributed) devices.
[0027] A "document," as the term is used herein, is to be broadly
interpreted to include any machine-readable and machine-storable work
product. A document may include an e-mail, a web site, a file, a
combination of files, one or more files with embedded links to other
files, a news group posting, a blog, a web advertisement, etc. In the
context of the Internet, a common document is a web page. Web pages often
include textual information and may include embedded information (such as
meta information, images, hyperlinks, etc.) and/or embedded instructions
(such as JavaScript, etc.). A "link," as the term is used herein, is to
be broadly interpreted to include any reference to or from a document.
Exemplary Client/Server Architecture
[0028] FIG. 2 is an exemplary diagram of a client or server component
(hereinafter called "client/server component"), which may correspond to
one or more of clients 110 and servers 120-140, according to an
implementation consistent with the principles of the invention. The
client/server component may include a bus 210, a processor 220, a main
memory 230, a read only memory (ROM) 240, a storage device 250, an input
device 260, an output device 270, and a communication interface 280. Bus
210 may include a path that permits communication among the elements of
the client/server component.
[0029] Processor 220 may include a conventional processor or
microprocessor, or another type of processing logic that interprets and
executes instructions. Main memory 230 may include a random access memory
(RAM) or another type of dynamic storage device that stores information
and instructions for execution by processor 220. ROM 240 may include a
conventional ROM device or another type of static storage device that
stores static information and instructions for use by processor 220.
Storage device 250 may include a magnetic and/or optical recording medium
and its corresponding drive.
[0030] Input device 260 may include a conventional mechanism that permits
an operator to input information to the client/server component, such as
a keyboard, a mouse, a pen, voice recognition and/or biometric
mechanisms, etc. Output device 270 may include a conventional mechanism
that outputs information to the operator, including a display, a printer,
a speaker, etc. Communication interface 280 may include any
transceiver-like mechanism that enables the client/server component to
communicate with other devices and/or systems. For example, communication
interface 280 may include mechanisms for communicating with another
device or system via a network, such as network 150.
[0031] As will be described in detail below, the client/server component,
consistent with the principles of the invention, may perform certain
searching-related operations. The client/server component may perform
these operations in response to processor 220 executing software
instructions contained in a computer-readable medium, such as memory 230.
A computer-readable medium may be defined as a physical or logical memory
device and/or carrier wave.
[0032] The software instructions may be read into memory 230 from another
computer-readable medium, such as storage device 250, or from
another device via communication interface 280. The software instructions
contained in memory 230 may cause processor 220 to perform processes that
will be described later. Alternatively, hardwired circuitry may be used
in place of or in combination with software instructions to implement
processes consistent with the principles of the invention. Thus,
implementations consistent with the principles of the invention are not
limited to any specific combination of hardware circuitry and software.
Exemplary Server
[0033] FIG. 3 is an exemplary functional block diagram of a portion of
server 120 according to an implementation consistent with the principles
of the invention. According to one implementation, one or more of the
functions described below may be performed by search engine 125.
According to another implementation, one or more of these functions may
be performed by a component external to server 120, such as a computer
associated with server 120 or one of servers 130 and 140.
[0034] Server 120 may include an entity identification unit 310 and a
query processing unit 320 connected to a repository. The repository may
include information associated with documents that were previously
crawled and stored, for example, by server 120.
[0035] Entity identification unit 310 may generate a list of entity names.
Entity identification unit 310 may obtain an initial set of entity names
for entities in a particular context (e.g., names of news sources in the
news source context or store names in the store context). There are many
ways that entity identification unit 310 can obtain the initial set of
entity names in a particular context. For example, entity identification
unit 310 may obtain entity names from online directories, lists, group
postings, by analyzing a corpus of documents, etc.
[0036] For each of these names, entity identification unit 310 may also
identify an entity identifier, such as a homepage domain name or a
category identifier, associated with the name. For example, if the name
is Washington Post, then the associated entity identifier might be
washingtonpost.com. Entity identification unit 310 may identify the
associated entity identifier from, for example, an analysis of the
document information in the repository.
[0037] Entity identification unit 310 may then process the entity names to
produce a list of variations of the names. Entity identification unit 310
may apply several transformations to the name and/or its entity
identifier, such as: using the entity name as is; using the entity
identifier as is; removing modifiers, such as "a," "the," "inc," "inc.,"
"co," and "co." from the entity name; replacing spaces with hyphens or
underscores, or vice versa, within the entity name; removing apostrophes
from the entity name; interchanging "and" and "&" in the entity name
and/or the entity identifier; removing "and" and "&" from the entity name
and/or the entity identifier; removing the initial "www." and/or the
trailing ".com" from the entity identifier; and/or treating periods in
the entity identifier with no spaces on either side of them as spaces or
deleting the periods. Other or different transformations may also be
used.
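
The transformations listed above can be sketched in code. The following is a minimal illustration only, not the implementation of entity identification unit 310: the function name name_variations, the decision to lowercase all strings, and the subset of transformations shown are assumptions made for the example.

```python
import re
from typing import Set

# Modifiers named in the description; other modifiers could be added.
MODIFIERS = {"a", "the", "inc", "inc.", "co", "co."}


def name_variations(entity_name: str, entity_id: str) -> Set[str]:
    """Return candidate strings for one entity name and its identifier."""
    name = entity_name.lower()   # lowercasing is an assumption of this sketch
    ident = entity_id.lower()
    variants = {name, ident}     # the name as is and the identifier as is

    # Entity name without modifiers such as "the" or "inc."
    variants.add(" ".join(w for w in name.split() if w not in MODIFIERS))

    # Spaces replaced with hyphens or underscores
    variants.add(name.replace(" ", "-"))
    variants.add(name.replace(" ", "_"))

    # Apostrophes removed
    variants.add(name.replace("'", ""))

    # "and" and "&" interchanged
    variants.add(re.sub(r"\band\b", "&", name))
    variants.add(name.replace("&", "and"))

    # Initial "www." and trailing ".com" removed from the identifier
    stripped = ident
    if stripped.startswith("www."):
        stripped = stripped[len("www."):]
    if stripped.endswith(".com"):
        stripped = stripped[:-len(".com")]
    variants.add(stripped)

    # Remaining periods treated as spaces or deleted
    variants.add(stripped.replace(".", " "))
    variants.add(stripped.replace(".", ""))

    variants.discard("")
    return variants
```

For the name "The Washington Post" and the identifier "www.washingtonpost.com", this sketch yields, among others, "washington post", "the-washington-post", and "washingtonpost".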
[0038] Entity identification unit 310 may form these name variations into
a list of candidate strings. FIG. 4 is an exemplary diagram of a list of
candidate strings 400 according to an implementation consistent with the
principles of the invention. Candidate string list 400 might include a
number of entries (candidate strings) associated with the various
versions of entity names and their associated entity identifiers. An
entry in list 400 might include an entity name field 410 and an entity ID
field 420. Entity name field 410 may include a variation of an entity
name or its associated entity identifier. Entity ID field 420 may include
information that uniquely identifies the entity corresponding to the
entity name in entity name field 410, such as a domain, a URL, or a
category identifier. An example of an entry for the news source
Washington Post might include "washington post" in entity name field 410
and "www.washingtonpost.com" in entity ID field 420.
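
One possible in-memory representation of candidate string list 400 is sketched below. The type CandidateEntry, the helper functions, and the choice to index entries by name for constant-time lookup are assumptions of this example; the description specifies only that an entry might include an entity name field 410 and an entity ID field 420.

```python
from typing import Dict, List, NamedTuple


class CandidateEntry(NamedTuple):
    entity_name: str  # corresponds to entity name field 410
    entity_id: str    # corresponds to entity ID field 420


def build_candidate_list(variations_by_id: Dict[str, List[str]]) -> List[CandidateEntry]:
    """Create one entry per distinct (lowercased) variation of each entity."""
    entries = []
    for entity_id, variations in variations_by_id.items():
        for variation in sorted({v.lower() for v in variations}):
            entries.append(CandidateEntry(variation, entity_id))
    return entries


def index_by_name(entries: List[CandidateEntry]) -> Dict[str, str]:
    """Map each candidate string to its entity ID for fast lookup."""
    return {entry.entity_name: entry.entity_id for entry in entries}
```

If two entities share a variation, the later entry overwrites the earlier one in this simple index; a fuller implementation would need a policy for such collisions.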
[0039] Returning to FIG. 3, query processing unit 320 may process the list
of candidate strings to determine whether a search query should be
automatically rewritten or whether rewriting of a query should be
suggested. For example, query processing unit 320 may determine whether a
query includes an entity name or any variation thereof. Query processing
unit 320 may check the terms of the query against list of candidate
strings 400 (FIG. 4). In one implementation, query processing unit 320
may check whether a word or phrase (hereinafter, "term" will be used to
encompass both a "word" and a "phrase") at the leftmost or rightmost
position of the query matches one of the candidate strings. In another
implementation, query processing unit 320 may check whether any term in
the query matches one of the candidate strings.
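
The first implementation above, in which only the leftmost and rightmost positions of the query are checked, might be sketched as follows. The function name, the max_words limit on term length, and the preference for longer matches are assumptions of this example, and the identifier "www.time.com" in the usage note is hypothetical.

```python
from typing import Dict, Optional, Tuple


def match_entity_at_edges(
    query: str, candidates: Dict[str, str], max_words: int = 3
) -> Optional[Tuple[str, str, str]]:
    """Return (matched term, entity ID, rest of query), or None if no match."""
    words = query.lower().split()
    # Try longer terms first so that a phrase wins over its first word.
    for n in range(min(max_words, len(words)), 0, -1):
        left = " ".join(words[:n])
        if left in candidates:
            return left, candidates[left], " ".join(words[n:])
        right = " ".join(words[-n:])
        if right in candidates:
            return right, candidates[right], " ".join(words[:-n])
    return None
```

With candidates {"washington post": "www.washingtonpost.com", "time": "www.time.com"}, the query "washington post election" matches "washington post" at the left edge, and the query "election results time" matches "time" at the right edge.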
[0040] If a term matches one of the candidate strings, query processing
unit 320 may optionally determine whether a word in the query that
neighbors the term indicates that no further processing of the query
should occur. For example, query processing unit 320 may determine
whether a word that neighbors the term (e.g., is adjacent to or near the
term) forms a common phrase with the term, such that the combination of
this word with the term forms a phrase that should not be decomposed.
[0041] To illustrate this, assume that the query includes the words "time
travel" and the term "time" has been identified as an entity name. The
user who provided the query may have meant one of two things. First, the user
may want to find information on the phrase "time travel." Alternatively,
the user may want to find information on "travel" from the news source
"Time." In this case, query processing unit 320 may recognize the phrase
"time travel" as a common phrase and determine that the phrase should not
be decomposed.
[0042] Query processing unit 320 may identify common phrases from an
exhaustive list of phrases. The list of phrases may be obtained from a
number of sources. One such source may include the repository of
documents. For example, documents in the repository may be analyzed to
identify phrases that appear more than a threshold number of times in
different documents.
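
A minimal sketch of this phrase analysis and of the neighboring-word check follows. It assumes that phrases are two-word sequences, that "more than a threshold number of times in different documents" means the number of distinct documents containing the phrase, and that the matched term is a single word; none of these choices is mandated by the description.

```python
from collections import Counter
from typing import Iterable, Set


def common_phrases(documents: Iterable[str], threshold: int) -> Set[str]:
    """Return two-word phrases found in more than `threshold` documents."""
    doc_counts: Counter = Counter()
    for text in documents:
        words = text.lower().split()
        # A set ensures each phrase is counted once per document.
        bigrams = {" ".join(pair) for pair in zip(words, words[1:])}
        doc_counts.update(bigrams)
    return {phrase for phrase, count in doc_counts.items() if count > threshold}


def neighbor_forms_common_phrase(query: str, term: str, phrases: Set[str]) -> bool:
    """Check whether `term` and an adjacent word form a known common phrase."""
    words = query.lower().split()
    for i, word in enumerate(words):
        if word != term:
            continue
        if i > 0 and f"{words[i - 1]} {word}" in phrases:
            return True
        if i + 1 < len(words) and f"{word} {words[i + 1]}" in phrases:
            return True
    return False
```

In the "time travel" example, if "time travel" is among the common phrases, the check returns true for the term "time" and the query would not be decomposed.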
[0043] When query processing unit 320 determines that no further
processing of the query should occur, then query processing unit 320 may
perform a search using the original query and present the search results
to the user. In this case, query processing unit 320 may optionally
include a link to a rewritten query with the search results. The
rewritten query may restrict the search to the entity identifier (e.g.,
domain) associated with the entity name (or variation) in the query.
[0044] When query processing unit 320 determines that further processing
of the query should occur, then query processing unit 320 may determine
whether the term is associated with a common word or phrase. There are
several ways that query processing unit 320 may determine whether the
term is associated with a common word or phrase. For example, query
processing unit 320 may compare the term to a dictionary of English words
and phrases. Alternatively, query processing unit 320 may use an inverse
document frequency (IDF) weighting technique or a conventional linguistic
modeling technique. One such technique may involve analyzing a corpus of
documents and creating a hash table based on the terms in the documents.
For example, each term in a document may be identified and hashed. The
count value in the corresponding entry in the hash table may then be
incremented. Once the corpus has been analyzed, the count values may
reflect which terms occurred more often and which terms occurred less
often. Query processing unit 320 may identify terms that have occurred
more than a threshold amount as common terms.
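
The counting technique described above can be illustrated with a short sketch. Python's collections.Counter is a hash-table-backed mapping, so it stands in for the hash table of count values; the whitespace tokenization, the lowercasing, and the function names are assumptions of this example.

```python
from collections import Counter
from typing import Iterable


def count_terms(corpus: Iterable[str]) -> Counter:
    """Hash each term in each document and increment its count value."""
    counts: Counter = Counter()
    for document in corpus:
        counts.update(document.lower().split())
    return counts


def is_common_term(term: str, counts: Counter, threshold: int) -> bool:
    """A term is common if it occurred more than `threshold` times."""
    return counts[term.lower()] > threshold
```

A term absent from the corpus has a count of zero and is therefore never treated as common in this sketch.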
[0045] If query processing unit 320 determines that the query term is not
associated with a common word or phrase, then query processing unit 320
may rewrite the query. The rewritten query may be based on the
identification of an entity name and restrict the query to a search
associated with the entity name. For example, if a user query includes
"washingtonpost," then the query may be rewritten to
"source:washingtonpost" to indicate that the search is to be restricted
to the entity identifier (domain) associated with the news source
Washington Post. The "source:" may correspond to a restrict identifier in
the news context that indicates that the search should be restricted to
the news source that follows it. Similar restrict identifiers may be used
in other contexts.
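
The rewriting step might be sketched as a string substitution. In this example the matched term is replaced by the restrict identifier followed by a restrict value; the function name, the parameter names, and the decision to replace only the first occurrence are assumptions made for illustration.

```python
import re


def rewrite_with_restrict(
    query: str, term: str, restrict_value: str, restrict_id: str = "source"
) -> str:
    """Replace the first whole-term occurrence of `term` with a restrict token."""
    # Match the term only when bounded by whitespace or the ends of the query.
    pattern = re.compile(r"(?<!\S)" + re.escape(term) + r"(?!\S)", re.IGNORECASE)
    token = f"{restrict_id}:{restrict_value}"
    return pattern.sub(lambda match: token, query, count=1)
```

Under these assumptions, the query "washingtonpost election" with term "washingtonpost" becomes "source:washingtonpost election"; a query that does not contain the term is returned unchanged.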
[0046] Query processing unit 320 may then perform a search based on the
rewritten query and present results to the user. Query processing unit
320 may also offer a query link associated with the original query to the
user. The query link, if selected by the user, may cause query processing
unit 320 to perform a search based on the original query (i.e., without
restricting the search to a particular entity).
[0047] If query processing unit 320 determines that the query term is
associated with a common word or phrase, then query processing unit 320
may use the original query to perform a search (i.e., without restricting
the search to a particular entity). Query processing unit 320 may also
generate a query link associated with a rewritten query. Query processing
unit 320 may rewrite the query, as described above, and provide a link to
this rewritten query to the user. The query link, if selected by the
user, may cause query processing unit 320 to perform a search based on
the rewritten query.
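One way a query link might be formed is to encode the alternative query into a URL, as in the following sketch. The base URL and the "q" parameter name are placeholders assumed for illustration; the description does not specify how the link is encoded.

```python
from urllib.parse import urlencode


def query_link(query, base_url="https://news.example.com/search"):
    """Build a link that, when selected, submits `query` as a new search."""
    # urlencode escapes spaces and the ':' in a restrict term.
    return f"{base_url}?{urlencode({'q': query})}"


link = query_link("korea source:time")
```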
Exemplary Processing
[0048] FIG. 5 is a flowchart of exemplary processing for generating a list
of candidate strings according to an implementation consistent with the
principles of the invention. Processing may begin with obtaining a list
of entity names for a particular context (act 510). For each of the
entity names, a corresponding entity identifier may also be identified
(act 520). Several techniques exist for identifying entity names and/or
entity identifiers for the list. For example, entity names and/or entity
identifiers may be identified from online directories, lists, group
postings, by analyzing a corpus of documents, etc.
[0049] A list of candidate strings may then be produced by transforming
the entity names and/or entity identifiers (act 530). For example, the
list of candidate strings for a particular entity name and its associated
entity identifier may include the entity name as is, the entity
identifier as is, the entity name without modifiers (e.g., "a," "the,"
"inc," "inc.," "co," and "co."), the entity name with spaces replaced
with hyphens or underscores, and vice versa, the entity name without
apostrophes, the entity name and/or entity identifier with "and" replaced
with "&," and vice versa, the entity name and/or entity identifier
without "and" and "&," the entity identifier without an initial "www."
and/or a trailing ".com," and the entity identifier with a period with no
spaces on either side of it replaced with spaces or deleted. Other or
different transformations may also be used. One such list of candidate
strings is illustrated in FIG. 4.
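The transformations of act 530 might be sketched as follows. This is an illustrative approximation only: the function name, the lowercasing of inputs, and the exact regular expressions are assumptions, and the output is not asserted to match FIG. 4.

```python
import re

MODIFIERS = {"a", "the", "inc", "inc.", "co", "co."}


def candidate_strings(entity_name, entity_id):
    """Produce candidate strings for one entity name and its identifier."""
    name, ident = entity_name.lower(), entity_id.lower()
    candidates = {name, ident}  # name as is, identifier as is

    # Entity name without modifiers.
    candidates.add(" ".join(w for w in name.split() if w not in MODIFIERS))

    # Spaces replaced with hyphens or underscores, and vice versa.
    candidates.add(name.replace(" ", "-"))
    candidates.add(name.replace(" ", "_"))
    candidates.add(name.replace("-", " ").replace("_", " "))

    # Entity name without apostrophes.
    candidates.add(name.replace("'", ""))

    for text in (name, ident):
        # "and" replaced with "&", and vice versa.
        candidates.add(re.sub(r"\band\b", "&", text))
        candidates.add(text.replace("&", "and"))
        # Without "and" and "&".
        candidates.add(" ".join(re.sub(r"\band\b|&", " ", text).split()))

    # Identifier without an initial "www." and/or a trailing ".com".
    no_www = ident[4:] if ident.startswith("www.") else ident
    no_com = ident[:-4] if ident.endswith(".com") else ident
    bare = no_www[:-4] if no_www.endswith(".com") else no_www
    candidates.update({no_www, no_com, bare})

    # A period with no spaces on either side replaced with a space or deleted.
    candidates.add(re.sub(r"(?<=\S)\.(?=\S)", " ", ident))
    candidates.add(re.sub(r"(?<=\S)\.(?=\S)", "", ident))

    candidates.discard("")
    return candidates


post = candidate_strings("The Washington Post", "www.washingtonpost.com")
```

A set is used so that transformations producing the same string are stored once.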
[0050] FIG. 6 is a flowchart of exemplary processing for selectively
rewriting a search query according to an implementation consistent with
the principles of the invention. Processing may begin with receiving a
search query from a user (act 610). The search query may contain one or
more terms, which may or may not include the name of an entity.
[0051] The search query may be evaluated to identify possible entity names
based on the list of candidate strings (act 620). For example, a term of
the search query may be compared to the entity names, which include the
variations of the entity names, in the list of candidate strings. In one
implementation, the terms at the left-most position and/or right-most
position within the search query may be evaluated to determine whether
they correspond to one of the entity names in the list of candidate
strings. In another implementation, each term of the query may be
evaluated.
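Act 620 might be sketched as shown below, assuming the list of candidate strings is held as a mapping from each candidate string to its entity identifier. The function name, the `edges_only` flag, and the sample mapping are hypothetical, and the sketch matches single-term candidates only.

```python
def find_entity(query, candidates, edges_only=True):
    """Return (position, entity_id) for the first matching term, else None."""
    terms = query.lower().split()
    if not terms:
        return None
    if edges_only:
        # Left-most and right-most positions only (deduplicated).
        positions = list(dict.fromkeys([0, len(terms) - 1]))
    else:
        # Alternative implementation: evaluate each term of the query.
        positions = range(len(terms))
    for i in positions:
        if terms[i] in candidates:
            return i, candidates[terms[i]]
    return None


# Hypothetical candidate list for the news context.
candidates = {"msnbc": "msnbc.com", "time": "time.com"}
hit = find_entity("george bush msnbc", candidates)
```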
[0052] If a term in the search query matches one of the entity names, it
may then optionally be determined whether the search query should be
further processed (act 630). For example, it may be determined whether a
word in the search query that neighbors the entity name forms a common
phrase with the entity name, such that the combination of this word with
the entity name forms a phrase that should not be decomposed. Common
phrases may be identified from an exhaustive list of phrases, as
described above.
[0053] When it is determined that no further processing of the query
should occur, such as when a word in the search query forms a common
phrase with the entity name, a search using the original query may be
performed and the search results presented to the user. Optionally, a
link to a rewritten query may be presented with the search results. The
rewritten query may restrict the search to the entity identifier (e.g.,
domain) associated with the entity name in the query.
[0054] When it is determined that further processing of the query should
occur, then it may be determined whether the entity name is associated
with a common word or phrase (act 640). For example, the entity name may
be compared to a dictionary of English words and phrases to determine
whether it is associated with a common word or phrase. Alternatively, an
IDF weighting technique or a conventional linguistic modeling technique
may be used, as described above.
[0055] In one implementation, portions of act 640 may be performed
beforehand to generate a table of entity names that are common words or
phrases. In this case, the determination of whether the entity name is
associated with a common word or phrase may be performed by a simple
table lookup operation.
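A sketch of such a precomputed table follows. The dictionary-based test, the function names, and the sample word lists are assumptions made for illustration.

```python
def build_common_entity_table(entity_names, dictionary_words):
    """Precompute the set of entity names that are also common words."""
    words = {w.lower() for w in dictionary_words}
    return frozenset(n.lower() for n in entity_names if n.lower() in words)


# Built once, ahead of query time (hypothetical data).
COMMON_ENTITY_NAMES = build_common_entity_table(
    ["time", "msnbc", "washingtonpost"],
    ["time", "korea", "post"],
)


def is_common(entity_name):
    """Query-time check: a single hash-table lookup."""
    return entity_name.lower() in COMMON_ENTITY_NAMES
```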
[0056] If it is determined that the entity name is not associated with a
common word or phrase, then the query may be rewritten to restrict the
query to a search associated with the entity name (act 650). For example,
the query may be rewritten to include a restrict identifier associated
with a particular context. The restrict identifier may thereby restrict a
search associated with the query to a search associated with the entity
name. A search may then be performed based on the rewritten query.
[0057] A query link may also be generated that links to the original query
(i.e., without restricting the search to a particular entity name) (act
660). The query link may be beneficial in those instances where the user
did not intend a search based on the rewritten query.
[0058] If it is determined that the entity name is associated with a
common word or phrase, then a query link to a rewritten query may be
generated (act 670). For example, the query may be rewritten, as
described above. Selection of the query link by the user may cause a
search to be performed based on the rewritten query. A search may then be
performed using the original query (i.e., without restricting the search
to a particular entity name) (act 680).
[0059] The search, whether performed based on the rewritten query or the
original query, may identify documents that are relevant to the query
that was searched. For example, a
repository of documents may be searched to identify documents that
include one or more terms of the query. The resulting documents may form
search results that may be presented to the user (act 690). In one
implementation, the search results might take the form of links to the
documents.
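The decision flow of FIG. 6 (acts 620 through 680) might be summarized by the following sketch. The `QueryPlan` type, the function name, and the sample data are hypothetical. The sketch checks only the left-most and right-most terms and always generates the link that the description treats as optional when a common phrase is found.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryPlan:
    search_query: str          # query actually searched (act 650 or 680)
    link_query: Optional[str]  # alternative query offered as a link (act 660 or 670)
    rewritten: bool            # True when the rewritten query is searched


def plan_query(query, candidates, common_names,
               common_phrases=frozenset(), restrict_id="source"):
    terms = query.lower().split()

    # Act 620: look for an entity name at either end of the query.
    index = None
    for i in ([0, len(terms) - 1] if terms else []):
        if terms[i] in candidates:
            index = i
            break
    if index is None:
        return QueryPlan(query, None, False)

    name = terms[index]
    rest = terms[:index] + terms[index + 1:]
    rewritten = " ".join(rest + [f"{restrict_id}:{candidates[name]}"])

    # Act 630: does a neighboring word form a common phrase with the name?
    neighbors = []
    if index > 0:
        neighbors.append(f"{terms[index - 1]} {name}")
    if index < len(terms) - 1:
        neighbors.append(f"{name} {terms[index + 1]}")
    in_common_phrase = any(p in common_phrases for p in neighbors)

    # Act 640: is the entity name itself a common word or phrase?
    if in_common_phrase or name in common_names:
        # Acts 670 and 680: search the original, link to the rewrite.
        return QueryPlan(query, rewritten, False)
    # Acts 650 and 660: search the rewrite, link to the original.
    return QueryPlan(rewritten, query, True)


candidates = {"msnbc": "msnbc", "time": "time"}
auto = plan_query("george bush msnbc", candidates, common_names={"time"})
suggest = plan_query("time korea", candidates, common_names={"time"})
```

Here `auto` searches "george bush source:msnbc" and links back to the original query, while `suggest` searches "time korea" and links to "korea source:time".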
Automatic Query Rewrite Example
News Context
[0060] FIGS. 7 and 8 are diagrams of an automatic query rewrite example in
the news context according to an implementation consistent with the
principles of the invention. As shown in FIG. 7, a user may enter a
search query via a graphical user interface associated with a search
engine, such as search engine 125 (FIG. 1). In this example, the user
enters the search query "george bush msnbc." Assume that the term "msnbc"
identifies the news source msnbc.com and, thus, is included in the list
of candidate strings (e.g., see FIG. 4).
[0061] Search engine 125 may identify "msnbc" as an entity name. Assume
that search engine 125 determines that the phrase "bush msnbc" and/or the
phrase "george bush msnbc" are not common phrases. Search engine 125 may
then evaluate the entity name "msnbc" to determine whether it is
associated with a common word or phrase. In this case, search engine 125
determines that "msnbc" is not associated with a common word or phrase.
Search engine 125 may then rewrite the query to "george bush
source:msnbc," as shown in FIG. 8.
[0062] Search engine 125 performs a search of a repository for documents
(e.g., news documents) associated with the source msnbc.com that are
relevant to the rewritten query. There are many ways to determine
document relevancy. For example, documents that contain one or more of
the search terms of the rewritten query may be identified as relevant.
Documents that include a greater number of the search terms may be
identified as more relevant than documents that include a fewer number of
the search terms.
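The term-count notion of relevancy described above, combined with a source restriction, might be sketched as follows. The document record layout (`title`, `source`, `text`), the function name, and the sample repository are assumptions for illustration. Actual relevancy determinations may use other signals.

```python
def search(query, documents, restrict_id="source"):
    """Rank documents by how many query terms they contain."""
    prefix = restrict_id + ":"
    terms = query.lower().split()
    # Separate restrict terms (e.g., "source:msnbc") from ordinary terms.
    sources = {t[len(prefix):] for t in terms if t.startswith(prefix)}
    words = {t for t in terms if not t.startswith(prefix)}

    results = []
    for doc in documents:
        # Skip documents from other sources when a restriction is present.
        if sources and doc["source"] not in sources:
            continue
        score = len(words & set(doc["text"].lower().split()))
        if score:
            results.append((score, doc["title"]))

    # More matching terms first; ties broken by title for stable output.
    results.sort(key=lambda r: (-r[0], r[1]))
    return [title for _, title in results]


# Hypothetical repository of news documents.
documents = [
    {"title": "A", "source": "msnbc", "text": "george bush speaks"},
    {"title": "B", "source": "msnbc", "text": "bush visits ohio"},
    {"title": "C", "source": "time", "text": "george bush in korea"},
]
restricted = search("george bush source:msnbc", documents)
unrestricted = search("george bush", documents)
```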
[0063] Search engine 125 may then present the relevant documents to the
user as search results. As shown in FIG. 8, each search result may
include a link 810 to a corresponding document, a news source identifier
along with an indicator of when the document was created 820, and a brief
description 830 of the corresponding document. Search engine 125 may also
provide a query link 850 to the original query entered by the user. In
this case, query link 850 may correspond to a query associated with a
search for the search term "george," the search term "bush," and/or the
search term "msnbc."
Suggest Query Rewrite Example
News Context
[0064] FIGS. 9-11 are diagrams of a query rewrite suggestion example in
the news context according to an implementation consistent with the
principles of the invention. As shown in FIG. 9, a user may enter a
search query via a graphical user interface associated with a search
engine, such as search engine 125 (FIG. 1). In this example, the user
enters the search query "time korea." Assume that the term "time"
identifies the news source time.com and, thus, is included in the list of
candidate strings (e.g., see FIG. 4).
[0065] Search engine 125 may identify "time" as an entity name. Assume
that search engine 125 determines that the phrase "time korea" is not a
common phrase. Search engine 125 may then evaluate the entity name "time"
to determine whether it is associated with a common word or phrase. In
this case, search engine 125 determines that "time" is associated with a
common word or phrase. Search engine 125 may then rewrite the query to
"korea source:time" and generate a link 1010 ("Search News Source Time
for Korea") to the rewritten query, as shown in FIG. 10.
[0066] Search engine 125 performs a search of a repository for documents
(e.g., news documents) that are relevant to the original search query. As
described above, there are many ways to determine document relevancy. For
example, documents that contain one or more of the search terms of the
original query may be identified as relevant. Documents that include a
greater number of the search terms may be identified as more relevant
than documents that include a fewer number of the search terms. In this
case, search engine 125 searches for documents that include the search
terms "time" and/or "korea."
[0067] Search engine 125 may then present the relevant documents to the
user as search results. As shown in FIG. 10, each search result may
include a link 1020 to a corresponding document, a news source identifier
along with an indicator of when the document was created 1030, and a
brief description 1040 of the corresponding document. Because the search
was not limited to the news source Time, the search results are
associated with a number of different news sources (e.g., the New York
Times, British Broadcasting Corporation (BBC), and Atlanta Journal
Constitution).
[0068] If the user selects link 1010 associated with the rewritten query,
search engine 125 performs a search of the repository for documents
(e.g., news documents) associated with the news source time.com that are
relevant to the rewritten query. Search engine 125 may then present the
relevant documents to the user as search results. As shown in FIG. 11,
each search result may include a link 1110 to a corresponding document, a
news source identifier along with a date indicator 1120 corresponding to
the date on which the document was created, and a brief description 1130
of the corresponding document. Optionally, search engine 125 may also
provide a link 1150 to the original query entered by the user. In this
case, link 1150 may correspond to a query associated with a search for
the search term "time" and/or the search term "korea."
CONCLUSION
[0069] Systems and methods consistent with the principles of the invention
may selectively rewrite search queries upon detection of the names of
certain entities.
[0070] The foregoing description of preferred embodiments of the present
invention provides illustration and description, but is not intended to
be exhaustive or to limit the invention to the precise form disclosed.
Modifications and variations are possible in light of the above teachings
or may be acquired from practice of the invention.
[0071] For example, it has been described that query processing unit 320
may perform a search based on the original or rewritten search query. In
other implementations, query processing unit 320 may not perform the
search, but may provide the original or rewritten search query to a
search engine, such as search engine 125 (FIG. 1) to perform the search
and provide the search results.
[0072] Also, while a series of acts has been described with regard to FIGS.
5 and 6, the order of the acts may be modified in other implementations
consistent with the principles of the invention. Further, non-dependent
acts may be performed in parallel.
[0073] In one implementation, server 120 may perform most, if not all, of
the acts described with regard to the processing of FIGS. 5 and/or 6. In
another implementation consistent with the principles of the invention,
one or more, or all, of the acts may be performed by another component,
such as another server 130 and/or 140 or client 110.
[0074] It will also be apparent to one of ordinary skill in the art that
aspects of the invention, as described above, may be implemented in many
different forms of software, firmware, and hardware in the
implementations illustrated in the figures. The actual software code or
specialized control hardware used to implement aspects consistent with
the principles of the invention is not limiting of the present invention.
Thus, the operation and behavior of the aspects were described without
reference to the specific software code--it being understood that one of
ordinary skill in the art would be able to design software and control
hardware to implement the aspects based on the description herein.
[0075] No element, act, or instruction used in the present application
should be construed as critical or essential to the invention unless
explicitly described as such. Also, as used herein, the article "a" is
intended to include one or more items. Where only one item is intended,
the term "one" or similar language is used. Further, the phrase "based
on" is intended to mean "based, at least in part, on" unless explicitly
stated otherwise.
* * * * *