Towards the development of heuristics for
automatic query expansion
Jesús Vilares, Manuel Vilares, and Miguel A. Alonso
Departamento de Computación
Facultad de Informática, Universidad de La Coruña
Campus de Elviña s/n, 15071 La Coruña, Spain
[email protected], [email protected], [email protected]
http://coleweb.dc.fi.udc.es/
© Springer-Verlag
Abstract. In this paper we study the performance of linguistically-motivated conflation techniques for Information Retrieval in Spanish. In particular, we have studied the application of productive derivational morphology for single word term conflation and the extraction of syntactic dependency pairs for multi-word term conflation. These techniques have been tested on several search engines implementing different indexing models. The aim of this study is to find the strong and weak points of each technique in order to develop heuristics for automatic query expansion.
1 Introduction
In Information Retrieval (IR) systems, documents are represented through a set
of index terms or keywords. For such a purpose, documents are conflated by
means of text operations [1, 6], which reduce their linguistic variety by grouping
together textual occurrences referring to similar or identical concepts. However,
most classical IR techniques for such tasks (such as the elimination of stopwords, i.e. too frequent words or words without apparent significance, or the use of stemming, which reduces distinct words to their supposed grammatical root) lack solid linguistic grounding. Even text operations with an apparent linguistic basis (e.g. stemming), which obtain very good results for English, perform badly when applied to languages with a very rich lexis and morphology, as is the case of Spanish. For these languages, we must face such tasks by employing Natural Language Processing (NLP) techniques, which results in greater complexity and a higher computational cost.
2 NLP Techniques for Term Indexing
One of the main problems of natural language processing in Spanish is the lack
of available resources: large tagged corpora, treebanks and advanced lexicons are
not freely available. In this context, we propose to extend classical IR techniques in two ways: firstly, at the word level, using morphological families; and secondly, at the phrase level, using groups of words related through their syntactic structure.
2.1 Morphological Families
Single word term conflation is usually accomplished in English through a stemmer [8], a simple tool from a linguistic point of view with a low computational cost. The results obtained are satisfactory enough, since the inflectional morphology of English is very simple. The situation for Spanish is completely different, because inflectional modifications exist at multiple levels, with many irregularities [10]. Generative morphology is similarly very rich and complex in Spanish [7].
Using a lemmatizer we can solve the problems derived from inflection in Spanish. As a second step, we have developed a new approach based on morphological
families [9]. We define a morphological family as a set of words obtained from
the same morphological root through derivation mechanisms. It is expected that
a basic semantic relationship will remain between the words of a given family.
For single word term conflation via morphological families, we first obtain
the part of speech and the lemmas of the text to be indexed. Next, we replace
each of the lemmas obtained by the representative of its morphological family. In
this way, we are covering relations between terms of the type process-result, e.g.
producción (production) / producto (product), process-agent, e.g. manipulación
(manipulation) / manipulador (manipulator), and similar ones. These relations
remain in the index because related terms are conflated to the same index term.
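The following sketch illustrates this conflation step. The tagger, the lemmatizer and the family table are hypothetical stand-ins used only for illustration; they are not the actual tools of the system described here.

# Minimal sketch of single word term conflation via morphological families.
# The analyses dictionary and the family table are hypothetical stand-ins.

FAMILY_REPRESENTATIVE = {
    "producción": "producto",       # process-result relation
    "producto": "producto",
    "manipulación": "manipulador",  # process-agent relation
    "manipulador": "manipulador",
}

def tag_and_lemmatize(text):
    """Stand-in for a POS tagger plus lemmatizer; returns (lemma, tag) pairs."""
    analyses = {
        "la": ("el", "DET"),
        "producción": ("producción", "NOUN"),
        "aumenta": ("aumentar", "VERB"),
    }
    return [analyses.get(token.lower(), (token.lower(), "UNK"))
            for token in text.split()]

def conflate(text):
    """Replace each content-word lemma by the representative of its family."""
    terms = []
    for lemma, tag in tag_and_lemmatize(text):
        if tag in {"NOUN", "ADJ", "VERB"}:   # content words only
            terms.append(FAMILY_REPRESENTATIVE.get(lemma, lemma))
    return terms

print(conflate("la producción aumenta"))   # -> ['producto', 'aumentar']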
2.2 Syntactic and Morpho-Syntactic Variants
A multi-word term is a term containing two or more content words (nouns, adjectives and verbs). There exist several techniques to obtain such terms. The first one is text simplification: in a first step, single word stemming is performed, after which stopwords are deleted; in the final step, terms are extracted and conflated by means of pattern matching [2] or statistical criteria [3]. As we can see, most of these operations lack solid linguistic grounding, which often results in incorrect conflations. Nevertheless, this is the easiest and least costly method.
At the other extreme, we find the morpho-syntactic analysis of the text using a parser that produces syntactic trees denoting the dependency relations between the words involved. In this way, structures with similar dependency relations are conflated in the same way.
At the midpoint, we have syntactic pattern matching, which is based on the hypothesis that the most informative parts of a text correspond to specific syntactic patterns [5].
We take an approach which combines these last two solutions, based on indexing noun syntagmas and their syntactic and morpho-syntactic variants [4]. A syntactic or morpho-syntactic variant of a multi-word term is a textual utterance that can be substituted for the original term in an information access task:
– Syntactic variants result from the inflection of individual words and from modifications of the syntactic structure of the original term, e.g. chico gordo (fat boy) → chicos gordos y altos (fat and tall boys).
– Morpho-syntactic variants differ from syntactic variants in that at least one of the content words of the original term is transformed into another word derived from the same morphological stem, e.g. medición del contenido (measurement of the content) → medir el contenido (to measure the content).
From a morphological point of view, syntactic variants involve inflectional morphology, whereas morpho-syntactic variants also involve derivational morphology. From a syntactic point of view, syntactic variants have a very restricted scope, a noun syntagma, whereas morpho-syntactic variants can span a whole sentence, including a verb and its complements, e.g. comida de perros (dog food) → los perros comen carne (dogs feed on meat). However, both kinds of variant can be obtained through transformations of noun syntagmas.
To extract such index terms we use syntactic matching patterns obtained from the syntactic structure of the noun syntagmas and their variants. For this task we take as our basis an approximate grammar for Spanish.
2.3 Syntactic and Morpho-Syntactic Variants as a Text Operation
The first task to be performed when indexing a text is to identify the index terms. Taking as our basis the syntactic trees corresponding to noun syntagmas, and according to an approximate grammar for Spanish, we apply the mechanisms associated with syntactic and morpho-syntactic variants, obtaining their syntactic trees. These trees are then flattened into regular expressions formed by the part-of-speech labels of the words involved. Such matching patterns are applied over the tagged text to be indexed in order to identify the index terms. In this way, we deal with the problem through a surface processing approach at the lexical level, leading to a considerable reduction in running cost.
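As an illustration of this surface matching step, the sketch below applies a flattened part-of-speech pattern over a toy tagged text. The tag set and the pattern are invented for the example and do not correspond to the actual approximate grammar.

import re

# Toy tagged text: word/TAG tokens (the tag set N, A, C is invented here).
tagged = [("coches", "N"), ("y", "C"), ("motos", "N"), ("rojas", "A")]
flat = " ".join(f"{word}/{tag}" for word, tag in tagged)

# A flattened pattern: noun, optionally coordinated with a second noun,
# followed by an adjective. This stands in for the real matching patterns.
pattern = re.compile(r"(\S+/N)(?: \S+/C (\S+/N))? (\S+/A)")

for match in pattern.finditer(flat):
    print([g for g in match.groups() if g])
# -> ['coches/N', 'motos/N', 'rojas/A']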
Once the index terms have been identified, they must be conflated. This process consists of two phases. Firstly, we identify the relations between pairs of content words inside the multi-word term, in order to conflate it into syntactic dependency-pairs. Secondly, single word term conflation mechanisms (lemmatization or morphological families) are applied to the words which form such pairs.
The relations we can find in a multi-word term correspond to three types:
1. Modified-Modifier, found in noun syntagmas. A dependency-pair is obtained for each combination of the head of the modifiers with the head of the modified terms. For example, coches y motos rojas (red cars and bikes) is conflated into (coche, rojo) and (moto, rojo), i.e. (car, red) and (bike, red); see the sketch after this list.
2. Subject-Verb, relating the head of the subject and the verb.
3. Verb-Complement, relating the verb and the head of the complement.
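The Modified-Modifier case can be sketched as follows. The helper function and the pre-extracted heads are hypothetical, assuming that tagging, parsing and lemmatization have already been carried out.

from itertools import product

# Hypothetical helper: one dependency-pair per combination of the head of
# the modified terms with the head of the modifiers (heads already lemmatized).
def modified_modifier_pairs(modified_heads, modifier_heads):
    return list(product(modified_heads, modifier_heads))

# coches y motos rojas (red cars and bikes)
print(modified_modifier_pairs(["coche", "moto"], ["rojo"]))
# -> [('coche', 'rojo'), ('moto', 'rojo')]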
In the case of syntactic variants, the dependencies of the original term always remain in the variant. In the case of morpho-syntactic variants, we cannot guarantee the presence of the original term dependencies unless morphological families are applied. For example, given the term recorte de gastos (spending cutback) and its morpho-syntactic variant recortar gastos (to cut back spending), using lemmatization we obtain two different dependency-pairs, (recorte, gasto) and (recortar, gasto), respectively, whereas using morphological families, and supposing that the representatives are recorte (cutback) and gastar (to spend), we obtain the same dependency-pair (recorte, gastar) for both the original term and its variant. Therefore, the degree of conflation obtained when using morphological families is higher than when using lemmatization.
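The sketch below contrasts the two conflation options on this example. The family table is a small hypothetical stand-in for the real resource.

# Hypothetical family table: recortar and recorte share a family whose
# representative is assumed to be recorte; gasto and gastar share gastar.
FAMILY_REPRESENTATIVE = {
    "recorte": "recorte", "recortar": "recorte",
    "gasto": "gastar", "gastar": "gastar",
}

def conflate_pair(pair, use_families):
    if use_families:
        return tuple(FAMILY_REPRESENTATIVE.get(lemma, lemma) for lemma in pair)
    return pair  # lemmatization only: the pair already contains lemmas

original = ("recorte", "gasto")    # recorte de gastos
variant = ("recortar", "gasto")    # recortar gastos

print(conflate_pair(original, False), conflate_pair(variant, False))
# lemmatization: ('recorte', 'gasto') ('recortar', 'gasto') -> two distinct pairs
print(conflate_pair(original, True), conflate_pair(variant, True))
# families: ('recorte', 'gastar') ('recorte', 'gastar') -> a single shared pair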
3 Evaluation
The techniques proposed in this paper are independent of the indexing engine chosen, because documents are preprocessed before being indexed. We have performed experiments using the following engines: Altavista SDK (http://solutions.altavista.com/), SMART (ftp://ftp.cs.cornell.edu/pub/smart/), based on a vector model, and SWISH-E (http://sunsite.berkeley.edu/SWISH-E/), based on a boolean model. For all engines, we have tested five different conflation techniques:
1. Elimination of stopwords using the list provided by SMART (pln).
2. Lemmatization of content words (lem).
3. Morphological families of content words (fam).
4. Syntactic dependency-pairs with lemmatization (FNL).
5. Syntactic dependency-pairs with morphological families (FNF).
We have tested the five proposed approaches on a corpus of 21,899 documents
of a journalistic nature (national, international, economy, culture, . . . ) covering
the year 2000. The average length of the documents is 447 words. We have
considered a set of 14 natural language queries with an average length of 7.85
words per query, 4.36 of which were content words.
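Throughout the evaluation we report average precision and recall over this query set. The sketch below assumes the standard set-based definitions of these measures, macro-averaged over the queries; this is an assumption for illustration, since the exact formulas are not spelled out here, and the identifiers are invented.

# Set-based precision and recall for one query, then macro-averaged over
# the query set (assumed evaluation procedure; identifiers are illustrative).
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_over_queries(runs):
    """runs: one (retrieved_ids, relevant_ids) tuple per query."""
    pairs = [precision_recall(ret, rel) for ret, rel in runs]
    return (sum(p for p, _ in pairs) / len(pairs),
            sum(r for _, r in pairs) / len(pairs))

# Two toy queries: P = (2/3 + 1) / 2, R = (1 + 1/2) / 2
print(average_over_queries([({1, 2, 3}, {2, 3}), ({4}, {4, 5})]))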
3.1 Altavista SDK
Average precision and recall are shown at the bottom-right of Fig. 1. The single word term conflation techniques, fam and lem, have led to a remarkable increase in precision and recall with respect to pln. The increase in precision is even higher for the multi-word term conflation techniques FNL and FNF. Moreover, the FNF technique attains good recall, which implies that it can significantly increase precision without seriously affecting recall.
[Figure 1 contains three precision (P) vs. recall (R) plots, one per engine, comparing the five conflation techniques; only the bottom-right table of average precision and recall is reproduced below.]

          pln   lem   fam   FNL   FNF
P (Al)   0.20  0.21  0.22  0.32  0.33
R (Al)   0.60  0.63  0.68  0.50  0.57
P (SM)   0.17  0.20  0.20  0.31  0.32
R (SM)   0.55  0.63  0.60  0.48  0.56
P (SW)   0.39  0.52  0.42  0.14  0.14
R (SW)   0.13  0.41  0.45  0.01  0.01

Fig. 1. Global results: Altavista (top-left), SMART (top-right), SWISH-E (bottom-left) and average precision and recall (bottom-right)
With respect to the evolution of precision vs. recall, Fig. 1 confirms the pln technique as the worst one. The best behaviour corresponds to the FNF technique, except for the recall segment between 0.5 and 0.7, where the single word term conflation techniques, lem and fam, are slightly better. For low recall rates (≤ 0.3) FNF is clearly the best, whilst the other conflation techniques show a similar behaviour. For the recall segment between 0.3 and 0.5, the single word term conflation techniques are closer to FNF, whereas FNL is closer to pln. For recall rates greater than 0.7, the conflation techniques using lemmatization tend to converge, as do those using morphological families.
3.2 SMART
The average recall and precision are similar to those obtained with Altavista SDK, except in the case of single word term conflation via morphological families, fam, which does not appear to improve the global behaviour of the system in relation to lemmatization. On the contrary, its efficiency is somewhat reduced, in contrast with the results for Altavista. Nevertheless, the use of such families together with multi-word terms gives a remarkable increase in recall with regard to the use of lemmatization with complex terms, as in the case of Altavista. In fact, FNF improves on the recall of pln. We can also notice a greater homogeneity in the behaviour of all methods with respect to recall.
With respect to the evolution of precision vs. recall, we can observe in Fig. 1 that the greater complexity of the vector model tends to reduce the differences between the techniques. Nevertheless, some noticeable differences appear between the behaviour of the conflation techniques in SMART and in Altavista. Firstly, in SMART the behaviour of the fam technique is clearly worse than that of the lem technique, which supports the results obtained for average precision and recall. Another difference is that in SMART the FNF technique only obtains better results than the rest of the methods for low and high recall (≤ 0.2 and ≥ 0.5), while for the rest of the interval the best method is clearly lem. On the other hand, we also find some similarities, such as the fact that pln and FNL show the worst behaviour in comparison with the other methods.
3.3 SWISH-E
The first conclusion we can reach is that the use of multi-word term methods in combination with the boolean model is completely inadequate, because boolean engines require all the terms involved in a query to match index terms in a given document, a rare situation when dealing with syntactic dependencies. The use of plain text with a boolean model is also completely inadequate, because this model is more sensitive to inflectional variation than the previous engines. When using lemmatization to conflate the text we reach a noticeably higher level of recall, with very high precision. The use of morphological families for single word term conflation obtains a higher level of recall, but a lower level of precision than that attained with lem, due to the noise introduced by inaccurate families. However, it is also interesting to remark that the precision reached by pln, lem and fam is the highest reached in the whole test suite, but at the cost of reduced recall.
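The sketch below illustrates why conjunctive boolean matching penalizes multi-word terms: a document is returned only if it contains every query index term, and exact dependency-pairs rarely all co-occur. The terms used are invented for the example.

# Conjunctive boolean retrieval: a document matches only if it contains
# every index term of the query (terms here are invented for illustration).
def boolean_and_match(query_terms, doc_terms):
    return set(query_terms) <= set(doc_terms)

doc_single_terms = {"recorte", "gasto", "gobierno"}       # single word terms
doc_pair_terms = {("recorte", "gasto")}                   # dependency-pairs

print(boolean_and_match({"recorte", "gasto"}, doc_single_terms))   # True
print(boolean_and_match({("recorte", "gasto"), ("recorte", "presupuesto")},
                        doc_pair_terms))                            # False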
3.4 Behaviour for Particular Queries
The behaviour of the different techniques varies according to the characteristics
of each particular query. We will try to illustrate this fact with some practical
examples obtained during the test process. This study is a first step towards the
development of heuristics for automatic query expansion.
The first example we will work with is the query “experimentos sobre la clonación de monos” (experiments on the cloning of monkeys), which illustrates how multi-word terms can discriminate very specific information. For this query only two relevant documents were found in the corpus. Nevertheless, since cloning is a very popular topic nowadays, there are many related but non-relevant documents which introduce a great deal of noise.
We center this discussion on the results obtained by multi-word term conflation techniques. As we can see in the precision vs. recall graphs for Altavista and SMART in Fig. 2, the evolution of the techniques based on families (fam and FNF) is similar, and the same occurs with the techniques using lemmatization (lem and FNL).
[Figure 2 contains precision (P) vs. recall (R) plots for Altavista, SMART and SWISH-E for this query; only the table of average precision and recall is reproduced below.]

          pln   lem   fam   FNL   FNF
P (Al)   0.03  0.03  0.05  0.50  0.67
R (Al)   0.50  0.50  1.00  0.50  1.00
P (SM)   0.03  0.05  0.05  0.50  0.67
R (SM)   0.50  1.00  1.00  0.50  1.00
P (SW)   0.33  0.33  0.33  0.00  0.00
R (SW)   0.50  0.50  0.50  0.00  0.00

Fig. 2. “experimentos sobre la clonación de monos”: results for Altavista (top-left), SMART (top-right), SWISH-E (bottom-left) and average precision and recall (bottom-right)
Nevertheless, these last two techniques obtain a lower degree of recall and precision. However, it is in the table of average precision and recall where we find the most important differences. We can see that the average precision reached by the multi-word term conflation techniques is significantly higher than that obtained by the single word term techniques. Moreover, the levels of recall are maintained. This means that the set of documents returned is small but precise, because the multi-word term techniques have been able to discriminate the relevant documents adequately without losing recall.
As a second example, we consider the query “negociaciones del PP con el
PSOE sobre el pacto antiterrorista” (negotiations between PP and PSOE about
the pact against terrorism) to illustrate a case where single word term conflation
techniques achieve a better performance than multi-word techniques. As we can observe in Fig. 3, there are some similarities in the precision vs. recall graphs for the three indexing engines. In particular, the fam technique shows the best behaviour, even better than the multi-word techniques. The lem technique performs worse than fam but better than pln for all search engines. These results are due to the fact that lemmatization solves variations caused by inflectional morphology. In addition, morphological families also solve variations caused by derivational morphology, retrieving more documents, most of which are relevant, because this query involves words with derivatives which frequently appear in texts on the topic of the query, such as negociación (negotiation), negociar (to negotiate) and negociador (negotiator), or pacto (pact) and pactar (to agree on).
[Figure 3 contains precision (P) vs. recall (R) plots for Altavista, SMART and SWISH-E for this query; only the table of average precision and recall is reproduced below.]

          pln   lem   fam   FNL   FNF
P (Al)   0.15  0.13  0.28  0.20  0.15
R (Al)   0.38  0.31  0.69  0.50  0.38
P (SM)   0.10  0.18  0.18  0.18  0.15
R (SM)   0.25  0.44  0.44  0.44  0.38
P (SW)   0.50  0.24  0.31  0.00  0.00
R (SW)   0.13  0.31  0.50  0.00  0.00

Fig. 3. “negociaciones del PP con el PSOE sobre el pacto antiterrorista”: results for Altavista (top-left), SMART (top-right), SWISH-E (bottom-left) and average precision and recall (bottom-right)
The third example we consider corresponds to the query “el PSOE reclama
un debate entre Aznar y Almunia” (PSOE demands a debate between Aznar
and Almunia). This query refers to the constant demand by the PSOE party for
a TV debate between the two main candidates in the general elections in the
year 2000. We must take into account that words like PSOE, Almunia, Aznar
and debate appear in several hundreds of non-relevant documents about political
issues, introducing a lot of noise. Finally, we have found that only 11 documents
were relevant to this query. With this example we try to illustrate the situations
where a boolean model beats other indexing models, as is shown in Fig. 4.
Comparing the precision vs. recall graph for SWISH-E with the graphs for Altavista and SMART, we observe that precision in these last two models is lower than or similar to that of SWISH-E for all levels of recall, except for the interval ≤ 0.1. However, the main difference arises in the measures of average precision and recall. Recall in SWISH-E reaches 55%, in contrast to the 64% and 73% reached by Altavista and SMART, respectively. Nevertheless, precision in SWISH-E reaches 40%, in contrast to a maximum of 29% reached by the multi-word term conflation techniques in the other engines.
[Figure 4 contains precision (P) vs. recall (R) plots for Altavista, SMART and SWISH-E for this query; only the table of average precision and recall is reproduced below.]

          pln   lem   fam   FNL   FNF
P (Al)   0.15  0.18  0.15  0.29  0.29
R (Al)   0.55  0.64  0.55  0.36  0.36
P (SM)   0.20  0.20  0.20  0.29  0.29
R (SM)   0.73  0.73  0.73  0.36  0.36
P (SW)   0.00  0.40  0.40  0.00  0.00
R (SW)   0.00  0.55  0.55  0.00  0.00

Fig. 4. “el PSOE reclama un debate entre Aznar y Almunia”: results for Altavista (top-left), SMART (top-right), SWISH-E (bottom-left) and average precision and recall (bottom-right)
This increase in precision is due to the high level of discrimination achieved by the boolean model between relevant and non-relevant documents when the query is very similar to the way the information is expressed in the documents, i.e. there is little variation in the way the concepts involved in the query are expressed.
4 Conclusions
We have shown how linguistically-motivated indexing can improve the performance of information retrieval systems working on languages with a rich lexis and
morphology, such as Spanish. In particular, two text operations have been applied to effectively reduce the linguistic variety of documents: productive derivational morphology for single word term conflation and extraction of syntactic
dependency-pairs for multi-word term conflation. These techniques require a minimum of linguistic resources, which makes them adequate for processing European minority languages. The increase in computational cost is also minimal, because they are based on finite-state technology, which makes them useful for practical systems.
These techniques have been tested on a test suite of journalistic documents using different search engines. We have found that:
– Indexing of plain text (pln) is the worst option, independently of the indexing
model used.
– Morphological families show good recall and precision when they do not
introduce noise.
– Multi-word term conflation techniques (FNL, FNF) do not work properly in combination with the boolean model.
– Multi-word term conflation significantly increases precision in non-boolean models, and when combined with morphological families (FNF) it also shows a good level of recall.
– Lemmatization (lem) is not the best technique, but it offers a good balance among all the techniques considered.
In consequence, we can propose an automatic heuristic: use morphological families (fam) with a boolean model when the words involved in the query have variants with a high frequency of appearance in the document corpus. Depending on user needs, we can also propose the following heuristics for interaction between the information retrieval system and the user (summarized in the sketch after this list):
– When the user is searching for nearly literal utterances, the best option is to employ lemmatized text (lem) with a boolean search engine.
– If the user requires high precision, even at the expense of reduced recall, dependency-pairs with families (FNF) is the most accurate approach.
– When the user wishes to increase recall, for example when the other techniques return few documents, morphological families (fam) may be used.
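These heuristics can be restated as a simple selection function, sketched below. The engine and user-need labels are invented for illustration; only the mapping from situations to techniques follows the conclusions above.

# Simplified restatement of the proposed heuristics as a selection function.
# The engine_model and user_need labels are invented for illustration.
def choose_technique(engine_model, user_need=None,
                     query_has_frequent_derivatives=False):
    if engine_model == "boolean":
        if query_has_frequent_derivatives:
            return "fam"    # families with a boolean model
        return "lem"        # nearly literal utterances: lemmatized text
    if user_need == "high_precision":
        return "FNF"        # dependency-pairs with families
    if user_need == "high_recall":
        return "fam"        # broaden the search via morphological families
    return "lem"            # otherwise, a reasonable balance

print(choose_technique("boolean"))                            # -> 'lem'
print(choose_technique("vector", "high_precision"))           # -> 'FNF'
print(choose_technique("boolean",
                       query_has_frequent_derivatives=True))  # -> 'fam'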
Acknowledgments
This research has been partially supported by Plan Nacional de Investigación
Científica, Desarrollo e Innovación Tecnológica (Grant TIC2000-0370-C02-01),
FEDER of EU (Grant 1FD97-0047-C04-02) and Xunta de Galicia (Grants
PGIDT99XI10502B and PGIDT01PXI10506PN).
References
1. R. Baeza-Yates and B. Ribeiro-Neto. 1999. Modern information retrieval. Addison-Wesley, Harlow, England.
2. M. Dillon and A.S. Gray. 1983. FASIT: A fully automatic syntactically based
indexing system. Journal of the American Society for Information Science, 34(2):99–
108.
3. J.L. Fagan. 1987. Automatic phrase indexing for document retrieval: An examination of syntactic and non-syntactic methods. In Proceedings of ACM SIGIR’87,
pages 91–101.
4. C. Jacquemin and E. Tzoukerman. 1999. NLP for term variant extraction: A synergy
of morphology, lexicon and syntax. In T. Strzalkowski, editor, Natural Language
Information Retrieval, pages 25–74. Kluwer Academic, Boston.
5. J.S. Justeson and S.M. Katz. 1995. Technical terminology: some linguistic properties
and an algorithm for identification in text. Natural Language Engineering, 1:9–27.
6. G. Kowalski. 1997. Information retrieval systems: theory and implementation.
Kluwer Academic, Boston.
7. M. F. Lang. 1990. Spanish Word Formation: Productive Derivational Morphology
in the Modern Lexis. Croom Helm. Routledge, London and New York.
8. M. Lennon, D.S. Pierce, and P. Willett. 1981. An evaluation of some conflation
algorithms. Journal of Information Science, 3:177–183.
9. J. Vilares, D. Cabrero, and M. A. Alonso. 2001. Applying Productive Derivational
Morphology to Term Indexing of Spanish Texts. In Proc. 2nd International Conference on Intelligent Text Processing and Computational Linguistics, CICLing-2001,
Mexico City, Mexico; February 18–24. To be published in Lecture Notes in Artificial
Intelligence.
10. M. Vilares, J. Graña, and P. Alvariño. 1997. Finite-state morphology and formal
verification. In A. Kornai, editor, Extended Finite State Models of Language, pages
37–47. Cambridge University Press.