Dr. Jie Wang is Professor of Computer Science at the University of Massachusetts Lowell and an adjunct data scientist at the VA Hospital in Massachusetts. He chaired the department from 2007 to 2016. He received a PhD in Computer Science from Boston University in 1990, and an MS in Computer Science and a BS in Computational Mathematics from Sun Yat-sen University in 1984 and 1982, respectively. He has about 30 years of teaching and research experience at the university level and has worked as a network security consultant for a national bank. His research interests include data modeling and applications, text mining and learning, text automation systems, machine learning, algorithms and combinatorial optimization, medical computation, network security, and computational complexity theory. He has published over 180 journal and conference papers, 12 books, and 4 edited books. His research has been funded by the National Science Foundation, IBM, Intel, and a few startup companies. He is active in professional service, including chairing conference program committees, organizing workshops, and serving as a journal editor and as the editor-in-chief of a book series on mathematical and interdisciplinary modeling. He has graduated 18 PhD students and is currently directing 7 PhD students.
About the talk
Thank you for the nice introduction, and thank you for listening to my talk. The title of my talk is AI-assisted text mining to facilitate reading for understanding. A word about my first name: people pronounce it in several different ways, and I'm used to all of them, so that's fine.
So why are we doing this? We know that reading is a basic skill, so we are interested in developing text mining tools to facilitate reading for understanding and assessing reading comprehension. Reading has several different goals. Sometimes we read for fun, sometimes we read for information, and most of the time we read for understanding: when we read a paper or a report, we want to know what is going on. Reading is also, I would say, the main way for us to learn new knowledge.
We usually read in linear order. Given an article, we start with the title, then the abstract, then the sections, following the natural order. That is what we have been doing. But many people ask: how do you read better? How do you get the main points faster? We realized that hierarchical reading would be more effective. What do I mean by hierarchical reading? Imagine that we have an oracle that can rank sentences according to their importance in the article you are reading or are going to read. That makes it possible for us to read layers of sentences in descending order of importance, with each new layer strengthening our understanding of the earlier sentences. In other words, we turn linear reading into layered reading.
Let me provide an example. We are going to read a case study published by Harvard Business School on Apple's design and innovation. This is a 14-page PDF document. Business school students often have to read a pile of case studies in a short period of time, so this can help them get the ideas quicker. Let me switch to the website.
Okay, this is the tool we built. Currently it can handle English and Chinese. I have already loaded the article. As I said, it is a case study on Apple's design thinking and innovation. It was published on May 1, 2012, so that was several years ago. The reason I chose it is that we do know something about Apple, so when the tool picks out the most important parts, we can judge them.
So this is the research paper. I select 1% as the default. You can begin with the default or go to 10%, but for a long article 10% can still be too long, so sometimes I use just 1%. If you read these sentences carefully, you will know at least what the article is talking about: Apple's design thinking and innovation. There are maybe 5 or 7 sentences here. If you still want to see which is the most important one, I can do that. This is the one sentence: for companies like Apple, who put a premium on design, resources and time invested in the initial product are leveraged across their products, which can be developed and ramped up more quickly because they draw on and make use of existing design elements. So this specifically talks about Apple. If you want to read the second sentence, you get one more sentence, and you can read more, and so on.
Our experience is that once you read about 10 sentences, you can basically get the main ideas of this 14-page article. That is what I mean by layered reading. Now let's go to 10%. 10% is quite a bit, so it probably gives you almost all the main points, and if you want more details you can read more. This would be 20%: the highlighted sentences are the next 10% of sentences, added to the earlier 10% of the most important ones. On the right-hand side is a list of keywords.
That is what I mean by hierarchical reading. The next question is how accurate the ranking is. We assumed an oracle, but in practice we have to have an algorithm to do this. Last year, at this same conference, I presented the sentence ranking algorithm in a little more detail. At that time it outperformed individual human judges over the SummBank benchmark. This is a collection of articles, and three human judges each ranked the sentences of each article. Because their rankings may differ, there is also a combined ranking of the three judges. What we presented last year already outperformed the individual judges, but it did not always compare favorably with the combined sentence ranking of all judges in the range of 40% to 50%. The bottom line here is what we did last year. At 40%, it scored 80.4, while the combined human ranking is 82. We were able to increase that to 81.11, which means that we are now very close to the combined ranking of the human judges. In that regard, the ranking algorithm we use in this tool is accurate: it beats the individual judges and is very close to the combined ranking of all three judges.
The next thing we want to do, which is also what I want to spend most of my time on today, is this: I have this tool and I read through the document, so how do I assess my understanding of the document? We need some kind of automatic tool for this too. The natural way is to generate multiple-choice questions on a given article. So we want an algorithm, or a system, where we input an article and it generates a number of multiple-choice questions. Generating a multiple-choice question has two parts. First, we need a question and an answer, that is, a question-answer pair. Second, we need a bunch of distractors specific to the article. These are things that look like answers but are incorrect, so you need some understanding of the article in order to tell that they are incorrect. That is what we are trying to do.
The study of QA has been very active in recent years, particularly with neural network models. But in general that is question answering: you have a question, and the system provides an answer, which may come from a very big database. What we have instead is just a given article, and we want to generate QAPs, that is, question-answer pairs. If you think about the neural network approach, it is sort of like learning in a language environment without going to school, because these models are trained on a very large amount of training data. It is like when we were young and learned to speak by listening to our parents, siblings, and the people around us. But we also know that humans are good at summarizing and following language rules, and that is why, say at the age of six, we need to go to school, so that we can learn grammar and learn how to speak and write properly, not just from the language environment. So the thinking was: how about we find a way to do something similar, generating questions and answers by going to school, so to speak?
What we came up with is the notion of a meta sequence. The idea is this: we represent each sentence, whether a declarative sentence or an interrogative sentence, as a meta sequence. When you want to assess reading comprehension, what you have is a bunch of facts, that is, declarative sentences, and then you ask about these facts. A meta sequence is a sequence of tuples, and each tuple is a vector of several syntactic and semantic tags for a word. We can use semantic role labeling, part-of-speech tags, and named entity recognition; these three are what we are using currently. We can also add other kinds of tags. So we represent a declarative sentence and the corresponding interrogative sentence as meta sequences. There are no more words, just a sequence of syntactic and semantic tags.
Then we built a system called Meta-QAP to learn meta-sequence pairs of declarative and interrogative sentences. In other words, we have a training dataset of declarative sentences and their corresponding interrogative sentences. There may be more than one of the latter, because we can come up with different ways to ask questions, so one declarative sentence can have multiple interrogative sentences. The system learns the meta sequences from this training data and stores them. Once we have a new article, we look at each of its declarative sentences, turn it into a meta sequence, and search for a match.
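To make the learn-then-match idea concrete, here is a minimal Python sketch. It is not the speaker's implementation. The tag triples, the hand-chunked sentences, the `MetaStore` class, and the "recipe" format (a question pattern plus the position of the answer chunk) are all assumptions made for illustration; the system described in the talk learns such mappings from training pairs and derives its tags with semantic role labeling, part-of-speech tagging, and named entity recognition.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical tag triple per chunk: (semantic role, part of speech, entity type).
Tag = Tuple[str, str, str]
Chunk = Tuple[str, Tag]          # (text, tags)
MetaSeq = Tuple[Tag, ...]
# A recipe is (question pattern, index of the answer chunk). Pattern items are
# literal words (str) or positions of chunks in the declarative sentence (int).
Recipe = Tuple[List[Union[str, int]], int]


def meta_sequence(chunks: List[Chunk]) -> MetaSeq:
    """Drop the words and keep only the tags."""
    return tuple(tags for _, tags in chunks)


class MetaStore:
    """Toy knowledge base keyed by declarative meta sequences."""

    def __init__(self) -> None:
        self.rules: Dict[MetaSeq, List[Recipe]] = {}

    def learn(self, chunks: List[Chunk], recipes: List[Recipe]) -> None:
        # Store the question recipes under the sentence's meta sequence.
        self.rules.setdefault(meta_sequence(chunks), []).extend(recipes)

    def generate(self, chunks: List[Chunk]) -> List[Tuple[str, str]]:
        # Exact-match lookup; an empty result means no learned sequence matched.
        recipes = self.rules.get(meta_sequence(chunks), [])
        pairs = []
        for pattern, answer_idx in recipes:
            words = [chunks[p][0] if isinstance(p, int) else p for p in pattern]
            pairs.append((" ".join(words) + "?", chunks[answer_idx][0]))
        return pairs


# Hand-assigned tags (assumed, not produced by a real tagger).
TAGS: List[Tag] = [
    ("A1", "NNP", "ORG"),
    ("AUX", "VBD", "O"),
    ("V", "VBN", "O"),
    ("TMP", "PP", "DATE"),
]
train = list(zip(["Apple", "was", "founded", "in 1976"], TAGS))
new = list(zip(["Harvard", "was", "established", "in 1636"], TAGS))

store = MetaStore()
store.learn(train, [(["When", 1, 0, 2], 3), (["What", 1, 2, 3], 0)])
for question, answer in store.generate(new):
    print(question, "->", answer)
```

Because only tags are stored, a pattern learned from one sentence applies to any new sentence with the same meta sequence, which is the point of dropping the words.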
What we have already learned is like the knowledge, or the rules, we learn by going to school. When a new sentence comes in, we check which learned meta sequence matches its meta sequence. If I find a good match, I can generate interrogative sentences for this declarative sentence. That's the idea. You may not always find a good match. When that happens, it means we have not learned enough, so we have to keep learning: we ask the user to tell us more about the possible interrogative sentences, so that we can add new meta sequences to our knowledge base.
This is the architecture. We have training data, which we turn into meta sequences and store. When an article comes in, we check all its declarative sentences and generate their meta sequences in order to know how to ask a question. Sometimes there is no good match. To identify a match, we basically look at the longest substring. If it is not a good match, we have two options. One option is to go ahead and ask a question anyway, though that question may not be perfect. The other is to ask the user of the system to enter some interrogative sentences, so that the system can learn new meta sequences from them. The system keeps being trained this way as new passages come in. Eventually you end up with a database of meta sequences that can generate pretty good question-answer pairs. That is the goal, and we have already built the system.
So how much time do I have? The next thing we need is to generate distractors. That means: I have a question and I have an answer, and I need an algorithm that generates three distractors to go with the answer, so we can send them to the reader, who has to pick which one is correct. A distractor looks very similar to a correct answer but cannot be correct. Distractors are all answers that are similar to the correct one in terms of grammar, in terms of semantics, and in terms of the context. We call the answer a target. We have target words, target phrases, or target sentences, because sometimes the answer could be a sentence itself. In general there are three types of targets. Type 1 is targets involving numerical values, such as quantities, ages, that kind of thing. Type 2 is targets like locations and people's names. For type 1, we just need to come up with a different numeric value in the same format as the target, so that is not so hard. For type 2, we need to rely on domain knowledge bases for names and for locations. For instance, for politicians, we have to come up with names of politicians from the same time and the same region to generate distractors. For locations we also need to do something similar.
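For the numeric case, a minimal Python sketch follows. The talk only says the distractor should be a different numeric value in the same format as the target; the function name `numeric_distractors`, the fixed offsets, and the choice to shift the first number found in the target are assumptions made for illustration.

```python
import re
from typing import List, Sequence


def numeric_distractors(target: str, offsets: Sequence[int] = (-2, 1, 3)) -> List[str]:
    """Return variants of `target` with its first number shifted by each offset.

    The surrounding text and the number of decimal places are preserved,
    so every distractor keeps the target's format.
    """
    match = re.search(r"\d+(?:\.\d+)?", target)
    if match is None:
        raise ValueError("target contains no number")
    number = match.group()
    decimals = len(number.split(".")[1]) if "." in number else 0
    step = 10 ** -decimals  # 1 for integers, 0.1 for one decimal place, ...
    results = []
    for offset in offsets:
        value = float(number) + offset * step
        if value < 0:
            continue  # skip negative values, which would change the format
        text = f"{value:.{decimals}f}"
        results.append(target[: match.start()] + text + target[match.end():])
    return results


print(numeric_distractors("in 2012"))
print(numeric_distractors("14 pages"))
```

Fixed offsets keep the sketch deterministic; a real system would likely choose offsets that stay plausible for the kind of quantity involved.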
What about the last type, a general word? How can we come up with distractors? The easiest way to think about it is to use word embeddings to find semantically similar words. That's what we do. However, when you come up with similar words using word embeddings, you have to get rid of the ones that are very similar and the ones that are not so similar, and keep the words in the middle part. Remember we need three; if we have more than three, we still need to pick. So we also use ranking: we rank the candidates so that we can pick the three that are most suitable as distractors. For this ranking we use three different measures. Say w_c is a candidate and w_t is the target word. One measure is the similarity of their word vectors. Another is the WUP score. The third also looks at how similar the two words are. Putting these measures together gives a ranking of how close a candidate is to the target word. If the candidate is an antonym of the target, we adjust the combination by putting more weight on the similarity; otherwise we weight each of these measures equally. Finally, we use this information to select the distractors. By the way, one of the things we found is that word embeddings do not handle words with multiple meanings.
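The filtering and ranking steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the speaker's method: the toy 2-D vectors, the WUP scores supplied as a plain dictionary, the similarity band `[low, high]`, the antonym weights, and the use of `difflib.SequenceMatcher` as the third measure are all hypothetical. The talk names word-vector similarity and a WUP score, and leaves the third measure unclear in this transcript.

```python
import math
from difflib import SequenceMatcher
from typing import Dict, List, Sequence, Set


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_distractors(
    target: str,
    vectors: Dict[str, Sequence[float]],
    wup: Dict[str, float],
    antonyms: Set[str],
    low: float = 0.3,
    high: float = 0.9,
    k: int = 3,
) -> List[str]:
    """Keep mid-similarity candidates, score them, and return the top k."""
    scored = []
    for cand, vec in vectors.items():
        if cand == target:
            continue
        s_emb = cosine(vec, vectors[target])
        if not low <= s_emb <= high:
            continue  # drop words that are too similar or not similar enough
        s_wup = wup.get(cand, 0.0)  # assumed precomputed WUP score
        s_str = SequenceMatcher(None, cand, target).ratio()  # stand-in measure
        # Assumed weighting: equal weights, unless the candidate is an antonym,
        # in which case the embedding similarity counts for more.
        weights = (0.5, 0.25, 0.25) if cand in antonyms else (1 / 3, 1 / 3, 1 / 3)
        score = weights[0] * s_emb + weights[1] * s_wup + weights[2] * s_str
        scored.append((score, cand))
    scored.sort(reverse=True)
    return [cand for _, cand in scored[:k]]


# Toy data, invented for the example.
vectors = {
    "design": [1.0, 0.0],
    "designs": [0.99, 0.05],  # too similar: filtered out
    "banana": [0.0, 1.0],     # not similar enough: filtered out
    "style": [0.8, 0.6],
    "plan": [0.6, 0.8],
    "layout": [0.7, 0.7],
    "price": [0.4, 0.9],
}
wup = {"style": 0.8, "plan": 0.7, "layout": 0.75, "price": 0.4}

print(rank_distractors("design", vectors, wup, antonyms=set()))
```

The band filter implements the "keep the middle part" step, and the weighted sum implements the combined ranking from which three distractors are picked.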
So we spent some time studying how to come up with a better way to represent the meanings of a word that has multiple meanings, and to determine which meaning of the word is actually being used. Nevertheless, that is the system we built, so let's try it and see what it looks like. For the article we just mentioned, the 2012 case study, a 14-page PDF file, we were able to generate over 400 question-answer pairs.
Some of them are good; some may still need to be improved. Let's take a look at what it looks like. By the way, this computation takes some time, so I precomputed it and stored the results. This is the first question. It is a fill-in-the-blank type of question, and these are its distractors. Or: when did Apple's share price reach 6000? So it asks questions like these. What is the iPad? A tablet computer. Next, what did