
NLP aka
Computational Linguistics
Introduction to
Natural Language Processing
2011
Slides adapted from J. Pustejovsky
What is Computational Linguistics?
• Computational Linguistics is the computational
analysis of natural languages.
– Process information contained in natural language.
• Can machines understand human language?
– Define ‘understand’
– Understanding is the ultimate goal, but a system doesn't need full understanding to be useful.
Silly sentences
• Children make delicious snacks
• Stolen painting found by tree
• I saw the Grand Canyon flying to New York
• Court to try shooting defendant
• Ban on nude dancing on Governor's desk
• Red tape holds up new bridges
• Iraqi head seeks arms
• Blair wins on budget, more lies ahead
• Local high school dropouts cut in half
• Hospitals are sued by seven foot doctors
• In America a woman has a baby every 15 minutes.
CL/NLP
• Information extraction
• Named entity recognition
• Trend analysis
• Subjectivity analysis
• Text classification
• Anaphora resolution, alias resolution
• Cross-document cross reference
• Parsing
• Semantic analysis
• Word sense disambiguation
• Word clustering
• Question answering
• Summarization
• Document retrieval (filtering, routing)
• Structured text (relational tables)
• Paraphrasing and paraphrase/entailment identification
• Text generation
• Machine translation
What is needed: (1) linguistic knowledge
• Examples:
  – Zipf's law: rank(w_i) × freq(w_i) ≈ constant (see the sketch after this list)
  – Collocations:
    • Strong beer but *powerful beer
    • Big sister but *large sister
    • Stocks rise but ?stocks ascend (225,000 hits on Google vs. 47 hits)
  – Constituents:
    • Children eat pizza.
    • They eat pizza.
    • My cousin's neighbor's children eat pizza.
    • ∅ Eat pizza! (imperative, omitted subject)
• How to get it:
  – Manual rules
  – Automatically acquired from large text collections (corpora)
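A minimal sketch (not from the slides) of checking Zipf's law on a plain-text corpus; "corpus.txt" is a placeholder path for any large English text.

# Check Zipf's law, rank(w_i) * freq(w_i) ~ constant, on a plain-text corpus.
from collections import Counter
import re

with open("corpus.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(words)
# Sort word types by frequency; rank 1 = most frequent type.
for rank, (word, freq) in enumerate(counts.most_common(20), start=1):
    # If Zipf's law holds, rank * freq stays roughly constant down the list.
    print(f"{rank:>4} {word:<12} freq={freq:<8} rank*freq={rank * freq}")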
Linguistics
• Knowledge about language:
– Phonetics and phonology - the study of sounds
– Morphology - the study of word components
– Syntax - the study of sentence and phrase structure
– Lexical semantics - the study of the meanings of words
– Compositional semantics - how to combine word meanings
– Pragmatics - how language is used to accomplish goals
– Discourse conventions - how to deal with units larger than utterances
What is needed: (2) mathematical and
computational tools
• Language models
• Estimation methods
• Hidden Markov Models (HMM): for sequences
• Context-free grammars (CFG): for trees
• Conditional Random Fields (CRF)
• Generative/discriminative models
• Maximum entropy models
• Random walks
• Latent semantic indexing (LSI)
+ Representation issues
+ Feature engineering
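To make the HMM bullet above concrete, here is a toy Viterbi decoder for a tiny tagging problem. The tag set and all transition/emission probabilities are invented for illustration; they are not from the slides.

# Viterbi decoding for a toy HMM tagger (all probabilities are made up).
states = ["DET", "N", "V"]
start_p = {"DET": 0.6, "N": 0.3, "V": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "N": 0.9, "V": 0.05},
    "N":   {"DET": 0.1,  "N": 0.3, "V": 0.6},
    "V":   {"DET": 0.5,  "N": 0.4, "V": 0.1},
}
emit_p = {
    "DET": {"the": 0.9},
    "N":   {"children": 0.5, "eat": 0.1, "pizza": 0.4},
    "V":   {"eat": 0.9, "pizza": 0.1},
}

def viterbi(words):
    # V[t][s] = (best probability of any tag sequence ending in tag s at position t, backpointer)
    V = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
    for t in range(1, len(words)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][r][0] * trans_p[r][s] * emit_p[s].get(words[t], 0.0), r)
                for r in states
            )
            V[t][s] = (prob, prev)
    # Follow backpointers from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    tags = [last]
    for t in range(len(words) - 1, 0, -1):
        tags.append(V[t][tags[-1]][1])
    return list(reversed(tags))

print(viterbi(["the", "children", "eat", "pizza"]))  # ['DET', 'N', 'V', 'N']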
Theoretical Computer Science
• Automata
– Deterministic and non-deterministic finite-state automata
– Push-down automata
• Grammars
– Regular grammars
– Context-free grammars
– Context-sensitive grammars
• Complexity
• Algorithms
– Dynamic programming
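As an example of dynamic programming in NLP (not from the slides), here is the classic minimum edit distance computation, used for instance in spelling correction.

# Minimum edit distance via dynamic programming (insert/delete/substitute, cost 1 each).
def edit_distance(a: str, b: str) -> int:
    # d[i][j] = cost of turning a[:i] into b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(a)][len(b)]

print(edit_distance("gramatical", "grammatical"))  # 1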
Mathematics and Statistics
• Probabilities
• Statistical models
• Hypothesis testing
• Linear algebra
• Optimization
• Numerical methods
Artificial Intelligence
• Logic
– First-order logic
– Predicate calculus
• Agents
– Speech acts
• Planning
• Constraint satisfaction
• Machine learning
Relation of CL to
Other Disciplines
CL sits at the intersection of:
• Artificial Intelligence (AI) (notions of representation, search, etc.)
• Machine Learning (particularly probabilistic or statistical ML techniques)
• Human-Computer Interaction (HCI)
• Electrical Engineering (EE) (optical character recognition)
• Linguistics (syntax, semantics, etc.)
• Psychology
• Philosophy of Language, Formal Logic
• Theory of Computation
• Information Retrieval
A Sampling of
“Other Disciplines”
• Linguistics: formal grammars, abstract characterization of what is to be learned.
• Computer Science: algorithms for efficient learning or online deployment of these systems in automata.
• Engineering: stochastic techniques for characterizing regular patterns for learning and ambiguity resolution.
• Psychology: insights into what linguistic constructions are easy or difficult for people to learn or to use.
Outline
• History of NLP
• The Turing test
• NLP: review and challenges
• Linguistic issues
  – Part of speech (POS)
  – Morphology
  – Syntax
  – Semantics
  – Pragmatics
  – Discourse analysis
History: 1940s-1950s
• Development of formal language theory (Chomsky, Kleene, Backus).
  – Formal characterization of classes of grammars (regular, context-free)
  – Association with automata
• Probability theory: understanding language as decoding over a noisy channel (Shannon)
  – Use of information-theoretic concepts (e.g., entropy) to measure the success of language models.
1957-1983
Symbolic vs. stochastic modeling
• Symbolic modeling
  – Use of formal grammars as the basis for NLP and learning systems (Chomsky, Harris).
  – Use of logic (and logic programming) to characterize semantic and syntactic inference (Kaplan, Kay, Pereira).
  – First "toy" natural language understanding and generation systems (Woods, Minsky, Schank, Winograd, Colmerauer).
  – Discourse processing: intention, focus (Grosz, Sidner, Hobbs).
• Stochastic modeling
  – First speech recognition and optical character recognition (OCR) systems (Bledsoe; Browning, Jelinek, Black, Mercer).
1983-1993:
Return to empiricism
• Use of stochastic techniques for part-of-speech tagging, parsing, word sense disambiguation, etc.
• Comparison of stochastic and symbolic models on language understanding and learning tasks.
1993-Present
• Advances in software and hardware create demand for NLP in:
  – Information retrieval (the web)
  – Machine translation
  – Spelling and grammar checking
  – Speech recognition and synthesis
• Combination of stochastic and symbolic methods for real applications.
Language and Intelligence: The Turing Test
• The Turing test:
  – A computer, a human, and a human judge
    • The judge asks questions of both the human and the computer
  – The computer's job is to act like a human
  – The human's job is to convince the judge that he or she is not the computer
  – The computer is judged "intelligent" if it can fool the judge.
• The judgment of "intelligence" is tied to the system giving appropriate responses.
ELIZA
• A simple Rogerian psychotherapist.
• Uses pattern matching to carry on a limited conversation.
• Appears to pass the Turing test (McCorduck, 1979, pp. 225-226)
• Demo: http://www.lpa.co.uk/pws_dem4.htm
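A tiny illustration of the kind of pattern matching ELIZA relies on; the rules below are made up here and are not Weizenbaum's original script.

# ELIZA-style responses via regular-expression pattern matching.
import re

# Each pattern maps to a response template; \1 reuses the captured text.
rules = [
    (r".*\bI am (.*)", r"How long have you been \1?"),
    (r".*\bI feel (.*)", r"Why do you feel \1?"),
    (r".*\bmy (mother|father|family)\b.*", r"Tell me more about your \1."),
    (r".*", "Please go on."),  # default fallback
]

def respond(utterance: str) -> str:
    for pattern, template in rules:
        m = re.match(pattern, utterance, re.IGNORECASE)
        if m:
            return m.expand(template)

print(respond("I am worried about my exam"))  # How long have you been worried about my exam?

A fuller version would also swap pronouns (my → your, I → you) before filling in the template.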
What is involved in an "intelligent" response?
• Analysis: decomposing the (spoken or written) signal into meaningful units.
• Speech and character recognition
  – Segmentation of words into phonemes or letters
  – Decomposition into words
  Requires knowledge of phonological patterns:
  • I mean to make you proud.
Resources for NLP
Components
• Morphology and spelling rules
• Grammar rules
• Lexicons
• Semantic interpretation rules
• Discourse interpretation
• World (encyclopedic) knowledge, gazetteers
NLP involves:
• learning rules for each component,
• using the rules in algorithms,
• using the algorithms to process the input.
Why study NLP?
• Language is a distinguishing feature of human intelligence.
• Text is the largest repository of human knowledge, and it is growing rapidly:
  – emails, newspaper articles, web pages, IRC, scientific articles, insurance claims, customer complaint letters, transcripts of phone calls, technical documents, government documents, contracts, etc.
NLP applications
• Question answering
  – Who won the World Series in 2004?
• Text categorization/routing
  – customer e-mails.
• Text mining
  – Find everything that interacts with PDGF.
• Machine (assisted) translation
• Language teaching and learning
  – Usage checking
• Spelling correction
  – Is that just dictionary lookup?
NLP challenges: Ambiguity
• Words or phrases can often be understood in more than one way:
  – Teacher Strikes Idle Kids
  – Killer Sentenced to Die for Second Time in 10 Years
  – They denied the petition for his release that was signed by over 10,000 people.
  – child abuse expert / child computer expert
  – Who does Mary love?
Probabilistic/statistical ambiguity resolution
• Choose the interpretation most likely to be correct.
For example: how often do people say
• "Mary loves …"
• "the Mary love"
And which interpretation is the most likely?
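A minimal sketch of the counting idea on this slide; the toy corpus below is invented, and in practice one would count over the web or a large corpus.

# Prefer the reading whose word sequence is more frequent in a corpus.
corpus = "mary loves john . john knows that mary loves the garden .".split()

def count_ngram(tokens, ngram):
    # Count how often the exact word sequence `ngram` occurs in `tokens`.
    n = len(ngram)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == ngram)

for phrase in (["mary", "loves"], ["the", "mary", "love"]):
    print(" ".join(phrase), "->", count_ngram(corpus, phrase))
# "mary loves" occurs more often, so that reading is preferred.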
Challenges in CL: Variation
• The same meaning can be expressed in many different ways:
  – Who wrote "The Language Instinct"?
  – Steven Pinker, a Harvard professor and author of "The Language Instinct", said …
Linguistic levels
• Part of speech (POS)
  – Word class (verb, noun, preposition, etc.)
• Morphology
  – Internal structure of words (e.g., Spanish am-é, am-aba, am-ó)
• Syntax
  – Internal structure of sentences (parse trees or other representations)
• Semantics
  – Interpretation of the meaning of words, phrases, and sentences.
• Pragmatics
• Discourse
Part of speech
• Syntactic categories that words belong to:
  – N, V, Adj/Adv, Prep, Aux, …
  – Open/closed class, lexical/functional categories
Also known as: grammatical categories, syntactic tags, POS tags, word classes, among others.
Examples of POS
Open class:
  N       noun          baby, toy
  V       verb          see, kiss
  ADJ     adjective     tall, grateful, alleged
  ADV     adverb        quickly, frankly, ...
Closed class:
  P       preposition   in, on, near
  DET     determiner    the, a, that
  WhPron  wh-pronoun    who, what, which, …
  COORD   coordinator   and, or
Substitution test
• Two words belong to the same category if one can be substituted for the other:
  – The _____ is angry.
  – The ____ dog is angry.
  – Fifi ____ .
  – Fifi ____ the book.
POS Tags
• There is no standard set of part-of-speech tags
  – Some use coarse classes, e.g., N for noun
  – Others prefer finer-grained distinctions (e.g., the Penn Treebank):
    • PRP: personal pronouns (you, me, she, he, them, him, her, …)
    • PRP$: possessive pronouns (my, our, her, his, …)
    • NN: singular common nouns (sky, door, theorem, …)
    • NNS: plural common nouns (doors, theorems, women, …)
    • NNP: singular proper names (Fifi, IBM, Canada, …)
    • NNPS: plural proper names (Americas, Carolinas, …)
Part-of-speech tagging
• Words can have more than one POS, e.g., back:
  – The back door = JJ (adjective)
  – On my back = NN (noun)
  – Win the voters back = RB (adverb)
  – Promised to back the bill = VB (verb)
• The POS tagging problem is to determine the POS tag for a particular instance of a word.
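A short example, assuming NLTK and its default tokenizer/tagger models are installed, showing "back" receiving different Penn Treebank tags in different contexts.

# Requires NLTK data packages, e.g.:
#   nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

for sentence in ["The back door was open.", "Promised to back the bill."]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
# "back" should come out as JJ/NN in the first sentence and VB in the second.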
Morphology
• Morphology deals with the internal structure of words:
  – The fearsome cats attacked the foolish dog
Different types of morphology
• Inflectional
  – Does not change the grammatical category of the word: cats/cat-s, attacked/attack-ed
• Derivational
  – May involve a change of grammatical category: fearsome/fear-some, foolish/fool-ish
Morphological analysis
• Inflectional
  – duck + s = [N duck] + [plural s]
  – duck + s = [V duck] + [3rd person s]
• Derivational
  – kind, kindness
• Spelling changes
  – drop, dropping
  – hide, hiding
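A small rule-based sketch (not a full morphological analyzer) covering the inflectional analyses and spelling changes listed above; the tiny lexicon is invented for illustration.

# Toy morphological analysis with a hand-written lexicon and two spelling rules.
def analyze(word, lexicon):
    analyses = []
    for pos in lexicon.get(word, []):
        analyses.append((word, pos, "base"))
    if word.endswith("s") and word[:-1] in lexicon:
        # duck + s: [N duck] + plural, or [V duck] + 3rd person singular
        for pos in lexicon[word[:-1]]:
            analyses.append((word[:-1], pos, "plural" if pos == "N" else "3sg"))
    if word.endswith("ing"):
        stem = word[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[:-1] in lexicon:
            analyses.append((stem[:-1], "V", "gerund"))   # dropp-ing -> drop (undouble consonant)
        elif stem + "e" in lexicon:
            analyses.append((stem + "e", "V", "gerund"))  # hid-ing -> hide (restore final e)
        elif stem in lexicon:
            analyses.append((stem, "V", "gerund"))
    return analyses

lexicon = {"duck": ["N", "V"], "drop": ["V"], "hide": ["V"]}
for w in ["ducks", "dropping", "hiding"]:
    print(w, analyze(w, lexicon))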
Morphology is not as easy as it looks
• Examples from Woods et al. 2000:
  – delegate (de + leg + ate): take the legs from
  – caress (car + ess): female car
  – cashier (cashy + er): more wealthy
  – lacerate (lace + rate): speed of tatting
  – ratify (rat + ify): infest with rodents
  – infantry (infant + ry): childish behavior
A Turkish Example
[Oflazer & Guzey 1994]
• uygarlastiramayabileceklerimizdenmissinizcesine
• uygar/civilized las/BECOME tir/CAUS ama/NEG yabil/POT ecek/FUT ler/3PL imiz/POSS-1SG den/ABL mis/NARR siniz/2PL cesine/AS-IF
• an adverb meaning roughly “(behaving) as if you
were one of those whom we might not be able to
civilize.”
Syntax
• Syntax is the study of the regularities and constraints on word order and phrase structure:
  – How words are organized into phrases
  – How phrases combine into larger phrases (including sentences).
Phrase structure
• Constraints on word order
• Constituents: NP, PP, VP, AP
• Phrase-structure grammars
  [S [NP [PN Spot]] [VP [V chased] [NP [Det a] [N bird]]]]
Phrase structure
• Paradigmatic relationships (e.g., constituency)
• Syntagmatic relationships (e.g., collocations)
[S [NP That man] [VP [VBD caught] [NP the butterfly] [PP [IN with] [NP a net]]]]
Phrase-structure grammars
Peter gave Mary a book.
Mary gave Peter a book.
• constituent order (SVO, SOV)
• imperative forms
• sentences with auxiliary verbs
• interrogative sentences
• declarative sentences
• start symbol and rewrite rules
• context-free view of language
Sample phrase-structure
grammar
S   → NP VP
NP  → DET NNS
NP  → DET NN
NP  → NP PP
VP  → VP PP
VP  → VBD
VP  → VBD NP
PP  → IN NP
DET → the
NNS → children
NNS → students
NNS → mountains
VBD → slept
VBD → ate
VBD → saw
IN  → in
IN  → of
NN  → cake
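The sample grammar above can be tried out with a chart parser; this sketch assumes NLTK is installed and encodes the rules verbatim.

# Parse a sentence with the sample grammar using NLTK's chart parser.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DET NNS | DET NN | NP PP
VP -> VP PP | VBD | VBD NP
PP -> IN NP
DET -> 'the'
NNS -> 'children' | 'students' | 'mountains'
VBD -> 'slept' | 'ate' | 'saw'
IN -> 'in' | 'of'
NN -> 'cake'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the children ate the cake".split()):
    print(tree)
# (S (NP (DET the) (NNS children)) (VP (VBD ate) (NP (DET the) (NN cake))))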
Phrase structure grammars
• Local dependencies
• Non-local dependencies
• Subject-verb agreement: The women who found the wallet were given a reward.
• wh-extraction: Should Peter buy a book? / Which book should Peter buy?
• Empty nodes
Parsing
• Analysis of the structure of a sentence
  [S [NP [D The] [N student]] [VP [V put] [NP [D the] [N book]] [PP [P on] [NP [D the] [N table]]]]]
Syntactic ambiguity
Teacher strikes idle kids has two parses:
  [S [NP [N Teacher] [N strikes]] [VP [V idle] [NP [N kids]]]]   (the strikes leave kids idle)
  [S [NP [N Teacher]] [VP [V strikes] [NP [A idle] [N kids]]]]   (the teacher hits idle kids)
Ambiguity
I made her duck.
  – I made duckling for her
  – I made the duckling belonging to her
  – I created the duck she owns
  – I forced her to lower her head
  – By magic, I changed her into a duck
Syntactic disambiguation
• Structural ambiguity:
  [S [NP I] [VP [V made] [S [NP her] [VP [V duck]]]]]
  [S [NP I] [VP [V made] [NP [det her] [N duck]]]]
Semantics
• A way of representing meaning
• Abstracts away from syntactic structure
• For example, the first-order logic form watch(I, show) can represent either "I watched the show" or "The show was watched by me".
• More complex:
  – What did I watch?
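A minimal sketch of what "abstracting from syntactic structure" means: active and passive clauses are mapped onto the same logical form. The function and its argument layout are illustrative assumptions, not a real semantic parser.

# Map active and passive clauses onto the same predicate-argument form, e.g. watch(I, show).
def logical_form(subject, verb, obj, passive=False):
    # In a passive clause the surface subject is the patient and the
    # by-phrase (if present) supplies the agent.
    agent, patient = (obj, subject) if passive else (subject, obj)
    return f"{verb}({agent}, {patient})"

print(logical_form("I", "watch", "show"))                # watch(I, show)
print(logical_form("show", "watch", "I", passive=True))  # watch(I, show)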
Lexical Semantics
The show is what I watched.
  I              = experiencer
  watch the show = predicate
  the show       = patient
Pragmatics
• Real world knowledge, speaker
intention, goal of utterance.
• Related to sociology.
• Example 1:
– Could you turn in your assignments now (command)
– Could you finish the homework? (question, command)
• Example 2:
– I couldn’t decide how to catch the crook. Then I
decided to spy on the crook with binoculars.
– To my surprise, I found out he had them too. Then I
knew to just follow the crook with binoculars.
[ the crook [with binoculars]]
[ the crook] [ with binoculars]
Discourse Analysis
• Discourse: How propositions fit together
in a conversation—multi-sentence
processing.
– Pronoun reference:
The professor told the student to finish the
assignment. He was pretty aggravated at how long
it was taking to pass it in.
– Multiple reference to same entity:
George W. Bush, president of the U.S.
– Relation between sentences:
John hit the man. He had stolen his bicycle.
NLP Pipeline
speech → Phonetic Analysis; text → OCR/Tokenization
then: Morphological analysis → Syntactic analysis → Semantic Interpretation → Discourse Processing
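One way to run such a pipeline in practice is with an off-the-shelf toolkit; this sketch uses spaCy (an assumption, the slides name no toolkit) to perform tokenization, morphology (lemmas), POS tagging, and dependency parsing in one pass.

# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # tokenizer, tagger, parser, NER in one pipeline
doc = nlp("The student put the book on the table.")

for token in doc:
    # surface form, lemma (morphology), POS tag, and syntactic dependency
    print(f"{token.text:<8} {token.lemma_:<8} {token.pos_:<6} {token.dep_:<8} head={token.head.text}")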
Dependency: arguments and
adjuncts
Sue watched the man at the next table.
• Event + dependents (verb arguments are usually NPs)
• agent, patient, instrument, goal - semantic roles
• subject, direct object, indirect object
• transitive, intransitive, and ditransitive verbs
• active and passive voice
Subcategorization
• Arguments: subject + complements
• adjuncts vs. complements
• adjuncts are optional and describe time,
place, manner…
• subordinate clauses
• subcategorization frames
Subcategorization
Subject: The children eat candy.
Object: The children eat candy.
Prepositional phrase: She put the book on the table.
Predicative adjective: We made the man angry.
Bare infinitive: She helped me walk.
To-infinitive: She likes to walk.
Participial phrase: She stopped singing that tune at the
end.
That-clause: She thinks that it will rain tomorrow.
Question-form clauses: She asked me what book I was
reading.
Semantics and pragmatics
• Lexical semantics and compositional semantics
• Hypernyms, hyponyms, antonyms, meronyms and holonyms
(part-whole relationship, tire is a meronym of car), synonyms,
homonyms
• Senses of words, polysemous words
• Homophony (bass).
• Collocations: white hair, white wine
• Idioms: to kick the bucket
Discourse analysis
• Anaphoric relations:
1. Mary helped Peter get out of the car. He thanked her.
2. Mary helped the other passenger out of the car.
The man had asked her for help because of his foot
injury.
• Information extraction problems (entity cross-referencing)
Hurricane Katrina destroyed 400,000 homes.
At an estimated cost of 3 billion dollars, the disaster
has been the most costly in the nation’s history.
Pragmatics
• The study of how knowledge about the
world and language conventions
interact with literal meaning.
• Speech acts
• Research issues: resolution of
anaphoric relations, modeling of speech
acts in dialogues