UM-IBM Sapphire Project at the University of Michigan

The Sapphire project aims to improve student advising at the University of Michigan and beyond by leveraging conversational systems that let students get course advice anytime, anywhere they need it.



Text-to-SQL Generation with Schema Attention

Sequence-to-sequence (seq2seq) neural models have shown promise in generating structured meaning representations, such as logical forms and SQL, from natural language utterances. However, prior methods do not take full advantage of information about the structure of the relevant knowledge base. We therefore augment seq2seq models with attention over a representation of the database schema, and we create a new Student Advising SQL dataset. We also explore how schema representations can help generalize training data from other domains to a target domain. In addition, we expose a shortcoming in previous evaluation methods for this task and offer a new way of splitting datasets that measures how well models generalize to novel questions within a known domain.
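The evaluation issue can be illustrated with a small sketch (the data, function name, and split logic below are illustrative, not the project's actual code): if paraphrases of the same question appear in both train and test, a model can score well by memorization, so instead all questions sharing one SQL template are held out together.

```python
from collections import defaultdict

def query_split(examples, test_fraction=0.25):
    """Toy query-based split: every natural-language paraphrase that maps
    to the same SQL template lands entirely in train or entirely in test,
    so the test set measures generalization to genuinely novel questions.
    `examples` is a list of (question, sql_template) pairs."""
    groups = defaultdict(list)
    for question, template in examples:
        groups[template].append(question)
    templates = sorted(groups)  # deterministic order for this sketch
    n_test = max(1, int(len(templates) * test_fraction))
    held_out = set(templates[:n_test])
    train = [(q, t) for t in templates if t not in held_out for q in groups[t]]
    test = [(q, t) for t in held_out for q in groups[t]]
    return train, test

# Hypothetical advising-style examples: two paraphrases of one query.
examples = [
    ("Who teaches EECS 280?", "SELECT instructor FROM course WHERE id = ?"),
    ("Tell me the instructor of EECS 280", "SELECT instructor FROM course WHERE id = ?"),
    ("When is EECS 280 offered?", "SELECT term FROM course WHERE id = ?"),
]
train, test = query_split(examples)
```

Under a naive question-based split, the two paraphrases could straddle the boundary; here they are guaranteed to stay on the same side.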

Neural methods for interacting with Named Entities and Databases

The challenging problem of using neural methods to extract information from databases is made even more difficult by the presence of many named entities (NEs) in databases. Adding each NE to the vocabulary and learning an individual representation for it is problematic: it causes an explosion in vocabulary size, and in many cases an NE occurs only a handful of times. Moreover, new NEs commonly appear at test time, where they become out-of-vocabulary words. We propose a general method for interacting with NEs, in which each NE is encoded on the fly into a distributed representation and stored in a specialized NE table. This representation then acts as a key to retrieve the NE whenever it is required. We further propose a multiple-attention-based method for retrieving information from large databases that uses this idea to handle the NEs they contain.
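A minimal sketch of the NE-table idea, with a character-trigram hash standing in for the learned on-the-fly encoder (all names and the similarity function here are illustrative assumptions, not the project's model):

```python
import math
import zlib

def encode_entity(name, dims=16):
    """Stand-in for a learned encoder: hash character trigrams of the
    surface form into a small fixed-size vector, so even an entity never
    seen in training gets a representation without growing the vocabulary."""
    vec = [0.0] * dims
    padded = f"#{name.lower()}#"
    for i in range(len(padded) - 2):
        vec[zlib.crc32(padded[i:i + 3].encode()) % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class NamedEntityTable:
    """Specialized table of (encoding, surface form) pairs; the encoding
    acts as a key, and retrieval returns the stored entity whose key is
    most similar (by dot product) to the query's encoding."""
    def __init__(self):
        self.entries = []

    def add(self, name):
        self.entries.append((encode_entity(name), name))

    def retrieve(self, query):
        key = encode_entity(query)
        dot = lambda vec: sum(a * b for a, b in zip(key, vec))
        return max(self.entries, key=lambda entry: dot(entry[0]))[1]

table = NamedEntityTable()
for name in ["EECS 281", "Rada Mihalcea", "North Quad"]:
    table.add(name)
```

In the full model the encoder would be a trained network and retrieval would be an attention read over the table, but the key-lookup structure is the same.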

Translation of Complex Natural Language Queries into SQL

In natural language, the same idea can be expressed in many different ways: not only with different words but also with entirely different sentence structures. Databases, too, can have radically different schema designs that represent the same information, and even against a specific schema there are multiple equivalent query expressions. This variety makes translation difficult to learn, and makes it unlikely that a rule-based translation that works for one natural language statement will work for another. To address these challenges, we divide and conquer the problem: we use deep learning to handle variation in natural language vocabulary, past query history to determine the desired query structure, and database statistics for entity disambiguation.
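The entity-disambiguation step can be sketched as follows (a toy illustration under assumed data; the similarity measure and column samples are placeholders for real database statistics):

```python
import difflib

def disambiguate(token, column_values):
    """Toy disambiguation using database contents: map a free-text token
    to the (column, value) pair whose stored value is most similar to it.
    `column_values` maps a column name to values sampled from the database."""
    best, best_score = None, 0.0
    for column, values in column_values.items():
        for value in values:
            score = difflib.SequenceMatcher(
                None, token.lower(), value.lower()).ratio()
            if score > best_score:
                best, best_score = (column, value), score
    return best

# Hypothetical value samples from an advising database.
stats = {
    "course_id": ["EECS 281", "EECS 484"],
    "instructor": ["Mihalcea", "Jagadish"],
}
```

For example, a noisy user token like "eecs281" resolves to the course_id column rather than an instructor name, which tells the translator where the entity belongs in the generated query.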

World Knowledge for Semantic Parsing with Abstract Meaning Representation

In our experiments, we introduce modifications to a parser for Abstract Meaning Representation (AMR) that allow it to more accurately identify concepts and named entities in semantic sentence representations. We examine the types of errors that exist in a state-of-the-art parser and explore how to integrate world knowledge to reduce them. We also examine the limitations of these features, providing insight into the potential for world knowledge to benefit future work in AMR parsing.

Crowd Annotation of Expert Content

Annotation is usually seen as a task undertaken by domain experts (or at least those who have had significant training in a given annotation schema and understand the content being annotated). Annotation guidelines can easily run to hundreds of pages and often require days of training to gain expertise, and expert annotators often work on annotation as a full-time job. However, what if we could get non-experts to annotate content, thereby parallelizing effort as well as dramatically reducing the amount of training required? We explore the feasibility of leveraging crowd workers to annotate text (e.g., Ubuntu IRC logs) with complex labels (e.g., entities) that require expertise they may not possess. In this project, we introduce novel interfaces and explore methods that provide domain-specific knowledge to help workers more accurately annotate expert content.

AI for disentanglement and conversation graph extraction

Dialogue often consists of multiple threads of conversation mixed together, with a complex structure: an utterance may respond to several previous utterances, some of them far back in the conversation, and a single utterance may in turn receive multiple responses. One key step in learning to understand conversation is disentangling these threads. We intend to develop neural-network-based models that can extract the graph representing the conversation, along with a hand-labeled dataset for evaluation. We will focus on the Ubuntu corpus, but also consider smaller IRC datasets used in prior work.
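The graph representation can be made concrete with a small sketch (the heuristic linker below is a hand-rolled placeholder for the neural model, and the log is invented):

```python
def extract_conversation_graph(messages, window=5):
    """Toy heuristic linker: message j is linked to an earlier message i
    whenever j's text mentions i's speaker, looking back up to `window`
    messages.  A message with no antecedent links to itself, marking the
    start of a new thread.  `messages` is a list of (speaker, text)
    pairs; returns (i, j) edges meaning "message j responds to message i".
    Note a message may gain several parents, and one message may receive
    several replies, so the result is a graph rather than a chain."""
    edges = []
    for j, (_, text) in enumerate(messages):
        parents = [i for i in range(max(0, j - window), j)
                   if messages[i][0].lower() in text.lower()]
        if parents:
            edges.extend((i, j) for i in parents)
        else:
            edges.append((j, j))
    return edges

# Invented IRC-style log with two interleaved threads.
log = [
    ("alice", "anyone know why my wifi driver keeps crashing?"),
    ("bob", "what kernel are you running?"),
    ("carol", "bob: 5.4 here, same problem"),
    ("dave", "alice: check dmesg for firmware errors"),
]
graph = extract_conversation_graph(log)
```

On this log the linker recovers that carol replies to bob while dave replies back to alice, i.e., two threads woven through one channel; the learned model would replace the mention heuristic with a scoring function over utterance pairs.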



Faculty

Satinder Baveja

Artificial Intelligence

H.V. Jagadish

Databases / Data Mining

Walter Lasecki

Crowdsourcing / HCI

Rada Mihalcea

Natural Language Processing

Jason Mars


Emily Mower Provost

Speech Processing

Dragomir Radev

Natural Language Processing

Post Docs / Staff

Falk Pollok

Software Engineer