NLP aka Computational Linguistics
Introduction to Natural Language Processing, 2011
Slides adapted from J. Pustejovsky

What is Computational Linguistics?
• Computational Linguistics is the computational analysis of natural languages.
  – Processing the information contained in natural language.
• Can machines understand human language?
  – Define "understand".
  – Understanding is the ultimate goal; however, a system does not need full understanding to be useful.

Silly sentences
• Children make delicious snacks
• Stolen painting found by tree
• I saw the Grand Canyon flying to New York
• Court to try shooting defendant
• Ban on nude dancing on Governor's desk
• Red tape holds up new bridges
• Iraqi head seeks arms
• Blair wins on budget, more lies ahead
• Local high school dropouts cut in half
• Hospitals are sued by seven foot doctors
• In America a woman has a baby every 15 minutes.

CL/NLP
• Information extraction
• Named entity recognition
• Trend analysis
• Subjectivity analysis
• Text classification
• Anaphora resolution, alias resolution
• Cross-document cross-reference
• Parsing
• Semantic analysis
• Word sense disambiguation
• Word clustering
• Question answering
• Summarization
• Document retrieval (filtering, routing)
• Structured text (relational tables)
• Paraphrasing and paraphrase/entailment identification
• Text generation
• Machine translation

What is needed: (1) linguistic knowledge
• Examples:
  – Zipf's law: rank(w_i) × freq(w_i) ≈ constant (a counting sketch follows at the end of this overview)
  – Collocations:
    • Strong beer but *powerful beer
    • Big sister but *large sister
    • Stocks rise but ?stocks ascend (225,000 hits on Google vs. 47 hits)
  – Constituents:
    • Children eat pizza.
    • They eat pizza.
    • My cousin's neighbor's children eat pizza.
    • _ Eat pizza!
• How to get it:
  – Manual rules
  – Automatically acquired from large text collections (corpora)

Linguistics
• Knowledge about language:
  – Phonetics and phonology: the study of sounds
  – Morphology: the study of word components
  – Syntax: the study of sentence and phrase structure
  – Lexical semantics: the study of the meanings of words
  – Compositional semantics: how to combine words
  – Pragmatics: how to accomplish goals
  – Discourse conventions: how to deal with units larger than utterances

What is needed: (2) mathematical and computational tools
• Language models
• Estimation methods
• Hidden Markov Models (HMM): for sequences
• Context-free grammars (CFG): for trees
• Conditional Random Fields (CRF)
• Generative/discriminative models
• Maximum entropy models
• Random walks
• Latent semantic indexing (LSI)
+ Representation issues
+ Feature engineering

Theoretical Computer Science
• Automata
  – Deterministic and non-deterministic finite-state automata
  – Push-down automata
• Grammars
  – Regular grammars
  – Context-free grammars
  – Context-sensitive grammars
• Complexity
• Algorithms
  – Dynamic programming

Mathematics and Statistics
• Probabilities
• Statistical models
• Hypothesis testing
• Linear algebra
• Optimization
• Numerical methods

Artificial Intelligence
• Logic
  – First-order logic
  – Predicate calculus
• Agents
  – Speech acts
• Planning
• Constraint satisfaction
• Machine learning

Relation of CL to Other Disciplines
CL sits at the intersection of:
• Artificial Intelligence (AI): notions of representation, search, etc.
• Machine Learning: particularly probabilistic or statistical ML techniques
• Human Computer Interaction (HCI)
• Electrical Engineering (EE): Optical Character Recognition
• Linguistics: syntax, semantics, etc.
• Psychology
• Philosophy of Language, Formal Logic
• Theory of Computation
• Information Retrieval
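The Zipf's law bullet above (rank × frequency ≈ constant) is easy to check empirically. Below is a minimal Python sketch; the file name corpus.txt is a placeholder for any plain-text corpus you have at hand.

```python
from collections import Counter

def zipf_table(text, top=10):
    """Rank words by frequency and report rank * frequency,
    which Zipf's law predicts to be roughly constant."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common(top)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

if __name__ == "__main__":
    # Placeholder corpus file; any large text collection will do.
    text = open("corpus.txt", encoding="utf-8").read()
    for rank, word, freq, product in zipf_table(text):
        print(f"{rank:>4}  {word:<15}  {freq:>8}  {product:>10}")
```

On a sufficiently large corpus the last column stays within the same order of magnitude across ranks, which is the regularity the slide refers to.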
A Sampling of "Other Disciplines"
• Linguistics: formal grammars, abstract characterization of what is to be learned.
• Computer Science: algorithms for efficient learning or online deployment of these systems in automata.
• Engineering: stochastic techniques for characterizing regular patterns for learning and ambiguity resolution.
• Psychology: insights into what linguistic constructions are easy or difficult for people to learn or to use.

Outline
• History of NLP
• The Turing Test
• NLP review; challenges
• Linguistic issues
  – Part of speech (POS)
  – Morphology
  – Syntax
  – Semantics
  – Pragmatics
  – Discourse analysis

History: 1940s-1950s
• Development of formal language theory (Chomsky, Kleene, Backus).
  – Formal characterization of classes of grammars (regular, context-free)
  – Association with automata
• Probability theory: language understanding as decoding over noisy channels (Shannon)
  – Use of information-theoretic concepts (e.g., entropy) to measure the success of language models.

1957-1983: Symbolic vs. stochastic modeling
• Symbolic modeling
  – Formal grammars as the basis for NLP and learning systems (Chomsky, Harris).
  – Logic (and logic programming) to characterize semantic and syntactic inference (Kaplan, Kay, Pereira).
  – First "toy" natural language understanding and generation systems (Woods, Minsky, Schank, Winograd, Colmerauer).
  – Discourse processing: intention, focus (Grosz, Sidner, Hobbs).
• Stochastic modeling
  – First speech recognition and optical character recognition (OCR) systems (Bledsoe; Browning, Jelinek, Black, Mercer).

1983-1993: Return to empiricism
• Use of stochastic techniques for part-of-speech tagging, parsing, word sense disambiguation, etc.
• Comparison of stochastic and symbolic models on language understanding and learning tasks.

1993-present
• Advances in software and hardware create demand for NLP in:
  – Information retrieval (the web)
  – Machine translation
  – Spelling and grammar checking
  – Speech recognition and synthesis
• Combination of stochastic and symbolic methods for real applications.

Language and intelligence: the Turing Test
• The Turing Test:
  – A computer, a human, and a human judge
    • The judge asks questions of both the human and the computer
  – The computer's job is to act like a human
  – The human's job is to convince the judge that he or she is not the computer
  – The computer is judged "intelligent" if it can fool the judge.
• Judging "intelligence" is tied to the adequacy of the system's responses.

ELIZA
• A simple Rogerian psychotherapist.
• Uses pattern matching to carry on a limited conversation (a toy sketch follows at the end of this slide group).
• Appears to pass the Turing Test (McCorduck, 1979, pp. 225-226).
• Demo: http://www.lpa.co.uk/pws_dem4.htm

What is involved in an "intelligent" response?
• Analysis: decomposition of the (spoken or written) signal into meaningful units.
• Speech and character recognition:
  – Decomposition into words
  – Segmentation of words into phonemes or letters
• Requires knowledge of phonological patterns, e.g.:
  – I mean to make you proud.
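As a companion to the ELIZA slide above, here is a toy sketch of the pattern-matching idea. The rules and responses are invented for illustration; they are not Weizenbaum's original script.

```python
import re

# Invented ELIZA-style rules: (regular expression, response template).
RULES = [
    (r"i need (.*)", "Why do you need {0}?"),
    (r"i am (.*)", "How long have you been {0}?"),
    (r"my (.*)", "Tell me more about your {0}."),
    (r"(.*)", "Please go on."),  # fallback when nothing else matches
]

def respond(utterance):
    """Return the response of the first rule whose pattern matches."""
    text = utterance.lower().strip(" .!?")
    for pattern, template in RULES:
        match = re.fullmatch(pattern, text)
        if match:
            return template.format(*match.groups())

print(respond("I am feeling sad"))  # How long have you been feeling sad?
print(respond("My dog"))            # Tell me more about your dog.
```

A real ELIZA script also reflects pronouns (my becomes your, I becomes you) and keeps some dialogue state, but the transform-by-pattern core is the same.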
Resources for NLP
Components:
• Morphology and spelling rules
• Grammar rules
• Lexicons
• Semantic interpretation rules
• Discourse interpretation
• World (encyclopedic) knowledge, gazetteers
NLP involves:
• learning rules for each component,
• using the rules in algorithms,
• using the algorithms to process the input.

Why study NLP?
• Language is a distinctive mark of human intelligence.
• Text is the largest repository of human knowledge, and it is growing rapidly:
  – emails, newspaper articles, web pages, IRC, scientific papers, insurance claims, customer complaint letters, transcripts of phone calls, technical documents, government documents, contracts, etc.

NLP applications
• Question answering
  – Who won the World Series in 2004?
• Text categorization / routing
  – customer e-mails
• Text mining
  – Find everything that interacts with PDGF.
• Machine (assisted) translation
• Language teaching and learning
  – checking usage
• Spelling correction
  – Is that just dictionary lookup?

NLP challenges: ambiguity
• Words and phrases can often be understood in several ways:
  – Teacher Strikes Idle Kids
  – Killer Sentenced to Die for Second Time in 10 Years
  – They denied the petition for his release that was signed by over 10,000 people.
  – child abuse expert / child computer expert
  – Who does Mary love?

Probabilistic/statistical ambiguity resolution
• Choose the interpretation most likely to be correct. For example: how often do people say
  – "Mary loves …"
  – "the Mary love"
  and which interpretation is the most likely? (A frequency-comparison sketch follows at the end of this slide group.)

CL challenges: variation
• The same meaning can be expressed in different ways:
  – Who wrote "The Language Instinct"?
  – Steven Pinker, a Harvard professor and author of "The Language Instinct", said …

Linguistic levels
• Part of speech (POS)
  – Word class (verb, noun, preposition, etc.)
• Morphology
  – Internal structure of words (e.g., Spanish am-: am-é, am-aba, am-ó)
• Syntax
  – Internal structure of sentences (syntactic trees or another representation)
• Semantics
  – Interpretation of the meaning of words, phrases, and sentences.
• Pragmatics
• Discourse

Part of speech
• Syntactic categories that words belong to:
  – N, V, Adj/Adv, Prep, Aux, …
  – Open/closed class, lexical/functional categories
• Also known as: grammatical categories, syntactic tags, POS tags, word classes, among others.

POS examples
Open class:
• N (noun): baby, toy
• V (verb): see, kiss
• ADJ (adjective): tall, grateful, alleged
• ADV (adverb): quickly, frankly, …
Closed class:
• P (preposition): in, on, near
• DET (determiner): the, a, that
• WhPron (wh-pronoun): who, what, which, …
• COORD (coordinator): and, or

Substitution test
• Two words belong to the same category if one can be substituted for the other:
  – The _____ is angry.
  – The _____ dog is angry.
  – Fifi _____ .
  – Fifi _____ the book.
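To make the "Mary loves …" vs. "the Mary love" comparison concrete, here is a minimal sketch of frequency-based disambiguation. The three-sentence corpus is an invented stand-in for the large text collections the slides mention.

```python
from collections import Counter

def bigram_counts(sentences):
    """Count adjacent word pairs across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Toy corpus standing in for a real one.
corpus = [
    "Mary loves the dog",
    "Mary loves music",
    "John thinks Mary loves him",
]

counts = bigram_counts(corpus)
print(counts[("mary", "loves")])  # 3: a frequent, hence plausible, sequence
print(counts[("the", "mary")])    # 0: essentially never attested
```

Preferring the reading whose word sequence is more frequent is the simplest version of the statistical disambiguation the slide describes; real systems use smoothed language models rather than raw counts.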
POS tags
• There is no standard set of part-of-speech tags.
  – Some use coarse classes, e.g., N for noun.
  – Others prefer lower-level distinctions (e.g., the Penn Treebank):
    • PRP: personal pronouns (you, me, she, he, them, him, her, …)
    • PRP$: possessive pronouns (my, our, her, his, …)
    • NN: singular common nouns (sky, door, theorem, …)
    • NNS: plural common nouns (doors, theorems, women, …)
    • NNP: singular proper names (Fifi, IBM, Canada, …)
    • NNPS: plural proper names (Americas, Carolinas, …)

Part-of-speech tagging
• Words can have more than one POS, e.g., back:
  – The back door = JJ (adjective)
  – On my back = NN (noun)
  – Win the voters back = RB (adverb)
  – Promised to back the bill = VB (verb)
• The POS tagging problem is to determine the POS tag for a particular instance of a word.

Morphology
• Morphology is concerned with the internal makeup of words:
  – The fearsome cats attacked the foolish dog

Types of morphology
• Inflectional
  – Does not change the grammatical category of the word: cats/cat-s, attacked/attack-ed
• Derivational
  – May involve changes of grammatical category: fearsome/fear-some, foolish/fool-ish

Morphological analysis
• Inflectional
  – duck + s = [N duck] + [plural s]
  – duck + s = [V duck] + [3rd person s]
• Derivational
  – kind, kindness
• Spelling changes
  – drop, dropping
  – hide, hiding

Morphology is not as easy as it looks
• Examples from Woods et al. 2000:
  – delegate (de + leg + ate): take the legs from
  – caress (car + ess): female car
  – cashier (cashy + er): more wealthy
  – lacerate (lace + rate): speed of tatting
  – ratify (rat + ify): infest with rodents
  – infantry (infant + ry): childish behavior

A Turkish example [Oflazer & Guzey 1994]
• uygarlastiramayabileceklerimizdenmissinizcesine
• uygar/civilized + las/BECOME + tir/CAUS + ama/NEG + yabil/POT + ecek/FUT + ler/3PL + imiz/POSS-1SG + den/ABL + mis/NARR + siniz/2PL + cesine/AS-IF
• An adverb meaning roughly "(behaving) as if you were one of those whom we might not be able to civilize."

Syntax
• Syntax is the study of the regularities and constraints on word order and phrase structure:
  – How words are organized into phrases
  – How phrases are combined into larger phrases (including sentences).

Phrase structure
• Constraints on word order
• Constituents: NP, PP, VP, AP
• Phrase-structure grammars
• Example tree: (S (NP (PN Spot)) (VP (V chased) (NP (Det a) (N bird))))

Phrase structure
• Paradigmatic relationships (e.g., constituency)
• Syntagmatic relationships (e.g., collocations)
• Example tree: (S (NP That man) (VP (VBD caught) (NP the butterfly) (PP (IN with) (NP a net))))

Phrase-structure grammars
Peter gave Mary a book. / Mary gave Peter a book.
• Constituent order (SVO, SOV)
• Imperative forms
• Sentences with auxiliary verbs
• Interrogative sentences
• Declarative sentences
• Start symbol and rewrite rules
• Context-free view of language

Sample phrase-structure grammar (tried out in the sketch below)
S   → NP VP
NP  → DET NNS
NP  → DET NN
NP  → NP PP
VP  → VP PP
VP  → VBD
VP  → VBD NP
PP  → IN NP
DET → the
NNS → children | students | mountains
VBD → slept | ate | saw
IN  → in | of
NN  → cake

Phrase-structure grammars
• Local dependencies
• Non-local dependencies
  – Subject-verb agreement: The women who found the wallet were given a reward.
  – wh-extraction: Should Peter buy a book? / Which book should Peter buy?
  – Empty nodes
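The sample grammar above can be run as-is. Here is a minimal sketch using NLTK's chart parser; NLTK is an assumed, convenient toolkit choice, not something the slides prescribe.

```python
import nltk

# The sample phrase-structure grammar from the slide, in NLTK's notation.
grammar = nltk.CFG.fromstring("""
S   -> NP VP
NP  -> DET NNS | DET NN | NP PP
VP  -> VP PP | VBD | VBD NP
PP  -> IN NP
DET -> 'the'
NNS -> 'children' | 'students' | 'mountains'
VBD -> 'slept' | 'ate' | 'saw'
IN  -> 'in' | 'of'
NN  -> 'cake'
""")

parser = nltk.ChartParser(grammar)
sentence = "the children ate the cake".split()

# Print every tree the grammar licenses for the sentence.
for tree in parser.parse(sentence):
    print(tree)
# Expected, roughly:
# (S (NP (DET the) (NNS children)) (VP (VBD ate) (NP (DET the) (NN cake))))
```

A sentence such as "the children saw the students in the mountains" comes back with more than one tree, which illustrates the kind of structural ambiguity discussed in the parsing slides.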
Parsing
• Analysis of the structure of a sentence
• Example tree: (S (NP (D The) (N student)) (VP (V put) (NP (D the) (N book)) (PP (P on) (NP (D the) (N table)))))

Syntactic ambiguity
Two structures for "Teacher strikes idle kids":
• (S (NP (N Teacher) (N strikes)) (VP (V idle) (NP (N kids)))): "strikes" read as a noun, "idle" as a verb
• (S (NP (N Teacher)) (VP (V strikes) (NP (A idle) (N kids)))): "strikes" read as a verb, "idle" as an adjective

Ambiguity
I made her duck.
• I made duckling for her
• I made the duckling belonging to her
• I created the duck she owns
• I forced her to lower her head
• By magic, I changed her into a duck

Syntactic disambiguation
• Structural ambiguity, two structures for "I made her duck":
  – (S (NP I) (VP (V made) (NP her) (VP (V duck))))
  – (S (NP I) (VP (V made) (NP (Det her) (N duck))))

Semantics
• A way of representing meaning
• Abstracts away from syntactic structure; the first-order logic form watch(I, show) can stand for:
  – "I watched the show", or
  – "The show was watched by me"
• More complex:
  – What did I watch?

Lexical semantics
The show is what I watched.
• I = experiencer
• watch the show = predicate
• the show = patient

Pragmatics
• Real-world knowledge, speaker intention, goal of the utterance.
• Related to sociology.
• Example 1:
  – Could you turn in your assignments now? (command)
  – Could you finish the homework? (question, command)
• Example 2:
  – I couldn't decide how to catch the crook. Then I decided to spy on the crook with binoculars.
  – To my surprise, I found out he had them too. Then I knew to just follow the crook with binoculars.
  – [the crook [with binoculars]] vs. [the crook] [with binoculars]

Discourse analysis
• Discourse: how propositions fit together in a conversation; multi-sentence processing.
  – Pronoun reference: The professor told the student to finish the assignment. He was pretty aggravated at how long it was taking to pass it in.
  – Multiple references to the same entity: George W. Bush, president of the U.S.
  – Relation between sentences: John hit the man. He had stolen his bicycle.

NLP pipeline
• speech → phonetic analysis; text → OCR / tokenization
• then: morphological analysis → syntactic analysis → semantic interpretation → discourse processing

Dependency: arguments and adjuncts
Sue watched the man at the next table.
• Event + dependents (verb arguments are usually NPs)
• agent, patient, instrument, goal: semantic roles
• subject, direct object, indirect object
• transitive, intransitive, and ditransitive verbs
• active and passive voice

Subcategorization
• Arguments: subject + complements
• Adjuncts vs. complements
• Adjuncts are optional and describe time, place, manner, …
• Subordinate clauses
• Subcategorization frames

Subcategorization
• Subject: The children eat candy.
• Object: The children eat candy.
• Prepositional phrase: She put the book on the table.
• Predicative adjective: We made the man angry.
• Bare infinitive: She helped me walk.
• To-infinitive: She likes to walk.
• Participial phrase: She stopped singing that tune at the end.
• That-clause: She thinks that it will rain tomorrow.
• Question-form clauses: She asked me what book I was reading.

Semantics and pragmatics
• Lexical semantics and compositional semantics
• Hypernyms, hyponyms, antonyms, meronyms and holonyms (part-whole relationship; tire is a meronym of car), synonyms, homonyms (see the WordNet sketch at the end of this slide group)
• Senses of words, polysemous words
• Homophony (bass).
• Collocations: white hair, white wine
• Idioms: to kick the bucket

Discourse analysis
• Anaphoric relations:
  1. Mary helped Peter get out of the car. He thanked her.
  2. Mary helped the other passenger out of the car. The man had asked her for help because of his foot injury.
• Information extraction problems (entity cross-referencing):
  – Hurricane Katrina destroyed 400,000 homes. At an estimated cost of 3 billion dollars, the disaster has been the most costly in the nation's history.
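The lexical relations listed above (hypernyms, meronyms, synonyms, word senses) can be browsed programmatically. A minimal sketch using NLTK's WordNet interface, assuming NLTK and its WordNet data are installed; the toolkit choice is illustrative, not one made by the slides.

```python
from nltk.corpus import wordnet as wn

# Senses (synsets) of a polysemous word from the slide ("bass").
for synset in wn.synsets("bass"):
    print(synset.name(), "-", synset.definition())

# Lexical relations for one sense of "car".
car = wn.synset("car.n.01")
print([s.name() for s in car.hypernyms()])      # hypernym(s), e.g. motor_vehicle.n.01
print([s.name() for s in car.part_meronyms()])  # parts of a car
print(car.lemma_names())                        # synonyms within this sense
```

Because wn.synsets returns one synset per sense, the first loop also illustrates the polysemy and homonymy points on the slide.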
Pragmatics
• The study of how knowledge about the world and language conventions interact with literal meaning.
• Speech acts
• Research issues: resolution of anaphoric relations, modeling of speech acts in dialogues.