About the talk
We present STAR, a schema-guided task-oriented dialog dataset consisting of 127,833 utterances and knowledge base queries across 5,820 task-oriented dialogs in 13 domains that is especially designed to facilitate task and domain transfer learning in task-oriented dialog. Furthermore, we propose a scalable crowd-sourcing paradigm to collect arbitrarily large datasets of the same quality as STAR. Moreover, we introduce novel schema-guided dialog models that use an explicit description of the task(s) to generalize from known to unknown tasks. We demonstrate the effectiveness of these models, particularly for zero-shot generalization across tasks and domains.
Presented by Shikib Mehri, PhD student at Carnegie Mellon University, at the 2021 Rasa Summit.
- Learn more about Rasa: [https://rasa.com](https://rasa.com)
- Rasa documentation: [http://rasa.com/docs](http://rasa.com/docs)
- Join the Rasa Community: [https://forum.rasa.com](https://forum.rasa.com)
- Twitter: [https://twitter.com/Rasa_HQ](https://twitter.com/Rasa_HQ)
- Facebook: [https://www.facebook.com/RasaHQ](https://www.facebook.com/RasaHQ)
- Linkedin: [https://www.linkedin.com/company/rasa](https://www.linkedin.com/company/rasa)
#conversationalAI #opensource #aichatbot
Hello. My name is Shikib Mehri, and today I'm excited to talk about our work on STAR, a schema-guided dialog dataset for transfer learning. This work was done in collaboration with my co-authors Johannes and Thomas. As we all know, pre-trained models have changed how we do natural language processing. The dominant paradigm is to perform large-scale pre-training and then use those pre-trained models for downstream fine-tuning. Pre-trained models have been especially significant in open-domain dialog, with models like DialoGPT, Meena, and Blender achieving impressive results in open-domain dialog response generation and in interactions with real users. But what about task-oriented dialog? Why have we not seen pre-trained models be as useful for task-oriented settings as they are for open-domain settings? Unlike chitchat systems, task-oriented systems must accomplish a goal. This means that there is a limited space of available responses at any given time, and pre-trained models must be a lot more precise when generating responses in a task-oriented setting. Furthermore, models for task-oriented systems must often interface with APIs and knowledge bases. To really understand why pre-trained models haven't helped as much for task-oriented settings, I'd like to pose a question: after training on Reddit, is it reasonable to expect a system to make restaurant reservations? To answer this question, let's consider a scenario. Mary joins a COVID-19 hotline. Mary has human-level NLU and NLG and open-domain dialog abilities. Can Mary do her job without any training?
In some sense, Mary is the upper bound for what a pre-trained model can be: the best that we can hope for after large-scale pre-training is human-level language understanding, language generation, and open-domain dialog ability. So is it reasonable? Just because Mary is very good at open-domain dialog and chitchat, at understanding and generating utterances, is it reasonable to expect Mary to perform well on a COVID-19 hotline? Probably not. This task requires a lot of domain-specific knowledge: knowledge about the task, and knowledge about how to respond to the scenarios she might encounter. So if large-scale pre-training alone can't give us the ability to perform a new task without any training, what can it give us? We can get human-level language understanding, language generation, and general dialog skills. What we don't get is task-specific instructions and rules. So let's reformulate the question: Mary joins a COVID-19 hotline with her human-level NLU, NLG, and open-domain dialog ability; what is the fastest way to get Mary up to speed? Well, you could give Mary a giant textbook that covers every single situation she might encounter and how best to respond to it. This is analogous to training data. Instead of burdening Mary with reading a giant textbook, we could instead give her a few examples, and we could design these examples to cover as much as they possibly can. This is analogous to few-shot learning. Especially with a task like operating a COVID-19 hotline, these two approaches might not work that well. The problem is that such a task changes rapidly: you might encounter new situations, and you may have to respond differently to certain situations as time progresses. What we don't want is to have to rewrite all of our training data, or rewrite our few-shot examples, after something new happens. So the third approach, and the approach that we focused on when collecting the STAR dataset, is to give Mary a flowchart that describes
the task. What this flowchart can do is cover the different situations that she might encounter and how best to respond to them. The good thing about this flowchart is that it's rapidly adaptable: you just need to draw new arrows and new nodes, and you can adapt to whatever new information you get. This is what we call a task-specific schema. So let's say COVID-19 is cured and Mary is out of a job. Luckily, she finds a new job at a tech support call center. They give her a task-specific schema that looks a lot like what she used at her old job. Can Mary do her job without any training? Probably. The key here is that Mary already knows how to read a task-specific schema: she knows how to interpret it, she knows how to use it to generate responses, and she knows how to follow it to complete a task. So, given this intuition and the scenario we presented, our schema paradigm works as follows. When you need to perform a new task, the schema acts as an inductive bias that gives you the relevant information about the task. Rather than pre-training and then expecting zero-shot performance on a new task, the new paradigm involves an intermediate step of training with schemas. So you pre-train, you learn how to understand schemas, how to use them to generate responses, and how to complete a task, and then, after training with schemas and knowing how to interpret a schema, you perform zero-shot on a new task. So what transfers in this approach? Well, from the pre-training we get NLU, NLG, and general dialog skills, and from the step of training with schemas the model gets some understanding of how to follow a schema. So now, when it sees a completely new schema, with the NLU, the NLG, the general dialog skills, and the knowledge of how to interpret and utilize the schema for generating responses, it knows how to perform the task. To this end, we collected STAR, a schema-guided dialog dataset for transfer learning. We performed Wizard-of-Oz data collection using schemas
along the way. We have 24 tasks across 13 different domains. Here's the list of tasks; they include things like checking your bank balance, searching for a hotel, scheduling a meeting, booking a ride, and getting directions. For every single one of these 24 tasks we designed a schema, and the schemas have varying degrees of complexity. On the right here you see the schema for planning a party: the system must ask the user a couple of questions, potentially request something optional, make a database query, and then inform the user. Here we see a fork: the wizard has to make a query and, depending on the response to the query, it informs the user of something different. We also have schemas with loops in them, and so on. So, in addition to collecting a schema-guided dialog dataset, we also thought about what else makes a task-oriented dialog dataset good. First of all, system actions should be consistent. We want realistic and variable user behavior. We want an explicit interface with the API or knowledge base. And there should be a progression of difficulty in the dataset. Dialog often has a one-to-many problem; in task-oriented settings it's possible to eliminate this one-to-many problem, and we tried to do so. First of all, responses in a task-oriented system need not be diverse. We don't need to exemplify all the phenomena of human interaction in a task-oriented system; we don't need to generate engaging, diverse, or natural responses. We just need to generate responses that get the task done. To do this, we ensure that the system action at each time step is deterministic. We achieved this by designing the schemas and telling the AMT workers to follow the schemas as much as they could, and by adding a suggestions module that aimed to normalize the different responses. The way the suggestions module worked is that the wizard entered a free-form response, and then, given this free-form response, we used an NLU model to get a list of suggestions that occurred in the schema. The wizard could then either select one of the suggestions, which they did 80% of the time, or write a custom free-form response, which occurred 20% of the time. We also want realistic user behavior. While the system must be deterministic and as rigid as possible, the user side of things should be realistic and should follow what we might encounter with a real user. Realistic dialogs rarely follow the task script, which we refer to as the happy path: users often change their minds, they
request explanations or justifications, engage in small talk, get angry, and do all sorts of things that deviate from the standard schema. The way we addressed this problem, and aimed to make the dialog dataset have realistic user interactions, is that we had in-dialog user instructions. For example, here the system informs the user of some background information and then tells the user to ensure that the driver is not Connor. This is behavior that is realistic, and it forces the wizards to handle a situation that deviates from the schema. We also want an explicit API interface. You can't really transfer to a new task without making task-specific API requests and interpreting task-specific API outputs. So what we do is make the API requests and the responses part of the schema and part of the dialog. Instead of having the user talk to the system with the API being an implicit black box within the system, we formulate the dialog as a three-party interaction: the user engages in a dialog with the system, and then, in addition to responding to the user, the system may engage in a dialog with the API. To do this, we built a chat interface that allowed the wizard to explicitly make custom queries to the task-specific database, get responses from the task-specific database, and then use those responses to respond to the user.
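As a rough sketch of this three-party formulation (the turn structure and field names here are illustrative, not STAR's actual data format), API queries and knowledge base responses can be logged as ordinary turns alongside user and system utterances:

```python
from dataclasses import dataclass

# Hypothetical turn record; STAR's released format may differ.
@dataclass
class Turn:
    speaker: str   # "user", "system", or "api"
    text: str      # utterance, query string, or API response payload

def api_turns(dialog):
    """Return the explicit API sub-dialog: queries the wizard made
    and the responses the knowledge base returned."""
    return [t for t in dialog if t.speaker == "api" or t.text.startswith("QUERY")]

dialog = [
    Turn("user", "Book a table at North Heights for two."),
    Turn("system", "QUERY restaurant=North Heights party_size=2"),  # wizard -> API
    Turn("api", "RESULT available=yes time=19:00"),                 # API -> wizard
    Turn("system", "A table is available at 7pm. Shall I book it?"),
]
```

Because the API exchange is part of the dialog itself rather than hidden inside the system, a model trained on such logs sees exactly which query produced which response.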
In the dataset, we also wanted a progression of difficulty. We had single-task dialogs that followed a happy path; we had single-task dialogs that followed an unhappy path, meaning that they deviated from the schema and exhibited realistic user behavior; and then we had multi-task dialogs. We performed the data collection in several stages. Before every dialog-collection stage, we had a video tutorial and a quiz. To be qualified to collect dialogs for us, workers had to watch a video and then answer several quiz questions, to ensure that they actually watched the video and understood the content. Workers that passed stage one were allowed to engage in stage two: the happy and unhappy single-task data collection. After this, we had a tutorial and a quiz for the multi-task dialogs, and then, finally, we did the multi-task data collection. By having stages one and three, where workers had to watch a tutorial and answer quiz questions, we ensured that we were getting high-quality workers and that the workers really understood this complex data collection task. Here's an example of a quiz question: for the assistant, which of these is the most important? (a) Being helpful to the user, (b) following the flowchart, (c) making the conversation as short as possible, (d) making the conversation as long as possible. Hopefully, if you've watched this video so far, you know that the answer here is (b): we want the wizard to follow the schema as much as possible. And finally, here's our data collection interface. On the left we see the knowledge base tab; if the worker clicked the instructions tab, they would see the schema for the specific task. We see the response from the user, the response from the knowledge base, and the suggestions module suggesting several responses. So the final dataset has 5,820 dialogs consisting of 127,833 utterances across 24 tasks. We have several single-task dialogs and several multi-task dialogs. And the dataset
was collected with schema-guided data collection, which makes it better for transfer learning, and it has system-side consistency and exposed API requests. Here's an example happy dialog. This dialog follows the schema, and the user is generally very cooperative with the system: the user provides all the information, and once the user has provided all the information to the wizard, the wizard makes a query to the API saying that the venue is North Heights, the host name is Alexia, and so on, and the API responds with a message. Here's an unhappy dialog. Here, the user is trying to make a hotel reservation, and after the wizard has made the reservation, or is about to make it, the user provides some background information in a kind of chitchat, and then indicates that they don't want that specific hotel. And then we have a multi-task dialog that contains several different tasks at once: the user is planning a party, then asks for the weather, and, because of what the weather is, the user instead wants to book a restaurant. So we have the party-planning task, the weather task, the restaurant search task, and eventually, in this dialog, the restaurant reservation task. This is a nice graph of the tasks that co-occur; we made the co-occurring tasks as natural as possible, so that the multi-task dialogs would be realistic and follow some kind of story. Often, when people are performing other tasks, they might ask for the weather because they might change their plans. So, what can we do with this dataset? A couple of things that we explore here are response generation, i.e. producing the next response given the dialog history, the outputs of the APIs, and the task-specific schema; and next-action prediction, where we predict the next action, or intent, of the system based on the dialog history, the API outputs, and the schema. In this presentation, I'm going to focus on
next-action prediction, but in our paper we present results for both response generation and next-action prediction. So, here's an example of a response and the corresponding action, given the dialog history and the API response. Now I'd like to talk about the models that we introduced for next-action prediction on the STAR dataset. First, let's look at a schema-free classification model. In this model, we encode the dialog history using BERT and, given the vector representation produced by BERT, we pass the vector representation h_CLS through a linear classifier to get a probability distribution over the set of next actions. So how do we incorporate schemas into our models? Well, the first step is to figure out how to effectively represent the schemas. As we see on the left here, this is the schema, or flowchart, for the task of reporting bank fraud. We see that the system has to request a number of things from the user, including their full name, account number, and PIN, and, depending on whether or not the user is able to provide these things, the system must perform different actions. For example, if the user provides the information, the system should ask for the details of the fraud report; if the user is missing the PIN, the next thing for the system to do is to ask for the user's date of birth. So we can represent this as a graph, as shown on the right: 'May I have your PIN?' is followed by a node that corresponds to a user action, and, depending on which user action is taken, the system has two different responses and two different paths through the dialog. One thing that I'd like to point out here is that the system action will always be deterministic: the nodes representing a dialog state before a system action always have an out-degree of one. This means that if there's an arrow pointing to a system action, such as a response or a query, the node before that system action will only ever have one edge going out of it. So, given this representation of the schema as a graph, we can now introduce a schema-guided classification model. The idea behind the schema-guided classification model is to use the task-specific schema to guide next-action prediction. By conditioning next-action prediction on the schema, it should ideally be easier to transfer to a new task, even without data.
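A minimal way to encode such a schema graph, including a check of the determinism constraint just described, might look like this (node names and fields are invented for illustration; they are not STAR's actual schema format):

```python
# Each node has a type ("user" or "system"), the text shown at that node,
# and edges to the nodes that may follow it.
schema = {
    "ask_pin":      {"type": "system", "text": "May I have your PIN?",
                     "next": ["user_has_pin", "user_no_pin"]},
    "user_has_pin": {"type": "user", "text": "yes", "next": ["query_fraud"]},
    "user_no_pin":  {"type": "user", "text": "no",  "next": ["ask_dob"]},
    "query_fraud":  {"type": "system", "text": "QUERY fraud_report", "next": []},
    "ask_dob":      {"type": "system", "text": "What is your date of birth?",
                     "next": []},
}

def is_deterministic(schema):
    """Check the constraint from the talk: a node whose successor is a
    system action must have exactly one outgoing edge."""
    for node in schema.values():
        succ = node["next"]
        if any(schema[n]["type"] == "system" for n in succ) and len(succ) != 1:
            return False
    return True
```

Note that a user node may fan out to several user reactions, but once a system action follows, the path is forced, which is exactly what makes the next action predictable from the schema.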
The first thing we do is represent the schema graph as a set of keys, which are BERT representations, and values, which are the action labels represented as one-hot vectors. The way we do this is we take every single node that occurs in the graph and encode the text associated with that node. For example, for 'inform booking available', the text might be something like 'The doctor is available on that date. Would you like me to book an appointment?'. We pass the text corresponding to each node through BERT to get a vector representation for that node. Then, for every single node, we look at the action that follows it. For example, for this 'yes' node, the next action is 'query book'; so we encode 'yes' through BERT to get k_1, and the corresponding value for this key is 'query book', the action that follows the 'yes'. So the first step is to get these keys and these values. Next, we do exactly what we did in the schema-free model: we encode the dialog history with BERT to get the vector representation of the dialog history, h_CLS.
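The key/value construction just described can be sketched as follows. Here a hash-seeded random vector stands in for BERT's [CLS] encoding, and the action inventory is a made-up subset, so this only illustrates the shapes involved:

```python
import torch

# Illustrative action inventory; STAR's real action set is larger.
ACTIONS = ["query_book", "ask_doctor_name", "inform_booking_available"]

def encode(text, dim=16):
    """Toy stand-in for BERT's [CLS] vector: a deterministic
    hash-seeded embedding. The real model uses BERT here."""
    g = torch.Generator().manual_seed(hash(text) % (2**31))
    return torch.randn(dim, generator=g)

def build_schema_memory(schema_nodes):
    """schema_nodes: list of (node_text, following_action) pairs.
    Returns keys (encoded node text) and values (one-hot action labels)."""
    keys = torch.stack([encode(text) for text, _ in schema_nodes])
    values = torch.zeros(len(schema_nodes), len(ACTIONS))
    for i, (_, action) in enumerate(schema_nodes):
        values[i, ACTIONS.index(action)] = 1.0
    return keys, values

nodes = [("yes", "query_book"),
         ("The doctor is available on that date.", "inform_booking_available")]
keys, values = build_schema_memory(nodes)
```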
We then produce a probability distribution using a linear classifier. This is without the schema, and it is exactly the same as the schema-free model. Next, we aim to produce a probability distribution using the schema. The way we do this is we use the vector representation of the dialog history, h_CLS, to compute a dot product and a softmax over all the keys, i.e. all the nodes in the schema graph. Then, given the values attributed to every one of these keys, we assign the scores of the softmax to the values associated with each key. So, for example, if k_1 scores 0.2 after the softmax, we assign a probability of 0.2 to the next action 'query'; and if k_2 gets 0.7, and the value corresponding to the k_2 node is 'ask doctor name', then the next action 'ask doctor name' will have a probability of 0.7 in the next-action probability distribution. In this way, we are attending over the schema graph, with the keys being the nodes and the values being the following actions. So, now that we have a probability distribution obtained through a classifier and a probability distribution obtained through the schema, we need to combine these in some way. The way we combine them is we use h_CLS, the hidden representation of the dialog history, to produce a value between 0 and 1 by passing it through a linear layer and then a sigmoid function. Given this value between zero and one, we use it to do a weighted combination of the probability distribution from the schema and the probability distribution from the classifier. Through this dynamic weighted sum, we get a final distribution over the set of next actions. First, I'm going to present the results when given the full data. For each stage (the happy dialogs, the unhappy dialogs, and the multi-task dialogs), our models are trained with 80% of the data from the current stage and all the data from the previous stages. This means that the unhappy models are trained with all the happy dialogs and 80% of the unhappy dialogs; similarly, the multi-task models are trained with all the happy and unhappy single-task dialogs, as well as 80% of the multi-task dialogs. What we see in these results is that the schema doesn't really help too much, and the standalone BERT classifier performs pretty well. This is not surprising, because we expect a sophisticated model like BERT to do well when given a lot of data. The value of the schema really
shines in zero-shot prediction. Here, we don't train the model at all on a specific task. Instead, in the task-transfer setting, we train the model on 23 of the tasks and then evaluate it on the 24th task. What we see is that, when provided with the schema and using a schema-guided model, we perform a lot better at zero-shot next-action prediction. What's happening here is that, when training on the 23 tasks, our schema-guided model is learning to follow the schema: it's learning how to leverage the schema to detect what the next action should be. Now, when it's given a completely new task and the corresponding schema, it's able to effectively interpret the schema and use it to predict what the next action should be. We see strong performance in both task transfer and domain transfer, where, rather than transferring across tasks, we transfer across domains. We also observe that the schema-guided model helps slightly more on the happy tasks than on the unhappy tasks, and this is kind of intuitive, just because the unhappy dialogs are less likely to follow the schema. In addition to next-action prediction, the STAR dataset can be used for response generation, which we present in our paper; knowledge base query prediction, i.e. predicting the knowledge base query, which is equivalent to state tracking; schema prediction, where, given a set of dialogs, you predict the schema graph; and out-of-domain detection, where you predict when you've gone outside the schema. To summarize: in this presentation, I talked about STAR, a schema-guided dialog dataset for transfer learning. The core idea behind STAR is that you have a task-specific schema that creates the potential for zero-shot transfer learning when the schema is leveraged. The STAR dataset has system consistency and realistic user behavior, as well as a progression of difficulty from happy dialogs, to unhappy dialogs, to multi-task dialogs. We also presented schema-guided models for classification and generation.
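Putting the pieces together, here is a sketch of the schema-guided classifier in PyTorch: a schema-free linear classifier, attention over schema-node keys whose scores are moved onto action values, and a learned sigmoid gate mixing the two distributions. The dimensions and random inputs are stand-ins for BERT encodings; this illustrates the gating idea rather than reproducing the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SchemaGuidedActionModel(nn.Module):
    """Combine a schema-free classifier with attention over schema
    nodes, mixed by a learned sigmoid gate, as described in the talk."""
    def __init__(self, hidden_dim, num_actions):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_actions)  # schema-free branch
        self.gate = nn.Linear(hidden_dim, 1)                  # mixing weight

    def forward(self, h_cls, keys, values):
        # h_cls: (hidden_dim,) BERT [CLS] encoding of the dialog history
        # keys:  (n_nodes, hidden_dim); values: (n_nodes, num_actions) one-hot
        p_free = F.softmax(self.classifier(h_cls), dim=-1)
        attn = F.softmax(keys @ h_cls, dim=-1)   # attend over schema nodes
        p_schema = attn @ values                 # move scores onto actions
        g = torch.sigmoid(self.gate(h_cls))      # gate in (0, 1)
        return g * p_schema + (1 - g) * p_free   # dynamic weighted sum

model = SchemaGuidedActionModel(hidden_dim=16, num_actions=3)
h = torch.randn(16)                               # stand-in for h_CLS
keys = torch.randn(4, 16)                         # stand-in node encodings
values = torch.eye(3)[torch.tensor([0, 1, 2, 0])] # each node's following action
probs = model(h, keys, values)
```

Since both branches output proper distributions and the gate is a convex weight, the mixed output is itself a distribution over the next actions.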
Future work can improve the schema-guided models, and with this dataset there's potential to investigate happy-to-unhappy transfer and single-task-to-multi-task transfer. Another interesting direction for future work could be predicting the schemas from a set of example dialogs. Thank you for your attention. The QR codes and the links here point to our paper and the dataset, respectively.