About the talk
There are many resources for building models from numeric data, which has meant processing text had to occur outside the model. This talk will introduce RaggedTensors and TF.Text, showcase their text-based ops, and show how you can quickly build a model starting with text input in tf.data.
Presented by: Robby Neale
I'm Robby. I'm an engineer at Google, and I'm going to talk to you about how to leverage TF.Text for your preprocessing and language models inside of TensorFlow. If you're not familiar with language models, they're used for article summaries, spell check, autocomplete, text-to-speech, spam filters, chatbots; you really can't get away from them. And it's a really good time to be in NLP right now, because we're going through somewhat of a renaissance. Last year the BERT paper was released, which uses attention and Transformers. I'm not going to go too much into it, other than this: traditionally, when you're working with text, models don't play as well with strings, so you convert those into numbers. We've used embeddings, word2vec, GloVe, different ways to create vector representations of your words, and these work pretty well. The one problem you have is that with some words, when you look them up in your vocab, like "bat": am I talking about the animal, am I talking about baseball equipment? They don't even have to sound the same: "entrance" and "entrance", as in being in a trance, are spelled exactly the same. So when you're trying to represent these as a vector, you're trying to attach two different meanings to the same sequence of letters. BERT got around this: it's a model that actually uses the context of the sentence to create the vector for the words, and it has done really well. This is SQuAD, the Stanford question-answering dataset, from late 2018. The scores before BERT were in the low 70s; it came out and jumped up to around 82, and ever since then people have been iterating on this model: RoBERTa, XLNet, ALBERT. I pulled the scores from last week, and you can see the most recent model, ALBERT, is actually outperforming humans, which is pretty crazy. So it's a really exciting time right now. Let's jump right in. Our charter was basically to make programming language models in TensorFlow
easier, because traditionally it's been very difficult. Say you have some data, here's an example of queries, and we want to build a model. Well, before we can do that, we need to do some preprocessing, which is basically tokenization, outside of TensorFlow, because we didn't have that capability inside of TensorFlow. Then, once we did this preprocessing, we had to somehow fit it into a tensor, so we would feed this preprocessed text into the graph and then normally pad out our tensors to make them a uniform shape. Then we go to publish it and put it on a model server. Okay, we're ready to go, right? Well, when we get the serving data, you can't just plug that serving data right in. You have to either rely on the client to transform the data as well, or you're doing it yourself, and a lot of times it's in a different language than what your training scripts were in. And I've seen cases where even when the preprocessing is supposedly exactly the same, say it uses the exact same regex, because it's different libraries, one might consider a character class, say punctuation, differently than the other does, and you get training skew, where the preprocessing steps are subtly different. So when you actually go to serve the model, you don't get the same performance, and that's problematic. So our charter was to make this as easy as possible, to support text natively inside of TensorFlow, and to do that we want to do all the text processing in graph. We do this through a series of text and sequential APIs that were not previously available, and we actually created a new tensor type, the RaggedTensor, that better represents this text. So if we go back to when it was painful: what we really want to do is just get rid of that preprocessing step and put everything in the graph, so all your preprocessing happens in graph, and then when you go to serve the model you're not relying on the client to perform those same steps when it calls. And really, the main thing that was missing was
tokenization. So last year we had an RFC for the tokenizer API, and we wanted to make it as simple and straightforward as possible. It's a very simple abstract Tokenizer class with one method, tokenize: it takes a string tensor as input and gives you back your tokens. You can see here it's very simple: we have a couple of sentences, and we tokenize them into words. What's not completely obvious until you see examples is that our input is a rank-1 tensor and our output is rank 2. The reason is that the tokens are grouped by the string they were split from, so it's really easy, from the user's or engineer's perspective, to tell which tokens were pulled from which string in the original tensor. The one thing you can't tell from this output is where in the originating string each token came from, and for that we have one extra class, TokenizerWithOffsets, an abstract class with a method tokenize_with_offsets. It's the same thing: you give it an input tensor of strings, and it gives you your tokens, but it also gives you where those tokens start and end. We can see that in the example here: we tokenize with offsets, and we can see that "I" starts at position zero and ends one position in, and then "know" starts at the second position and ends six characters in. Using these offsets, if you want to know where the tokens are in the originating string, you can find that. And note that the shapes of the offsets are exactly the same as the shapes of the tokens, so mapping a token to its start and end limits is very simple from here.
So we provide five basic tokenizers. One of the questions when we first did the RFC was: why don't we just have one, one tokenizer to rule them all? The problem is that every model is different; you have different limitations and constraints you want to work around, and we don't want to push our opinion on you. We just want to build the tools and let you make the decision. A lot of these are very simple. Whitespace obviously just splits on whitespace. Unicode script: Unicode characters are grouped together in what are called Unicode scripts, so you have Latin characters, Greek, Arabic, and Japanese, as just some examples, and spaces, punctuation, and numbers are grouped as well. In the most simple case, if you're just working with English, the main difference from the whitespace tokenizer is that it splits out the punctuation. WordPiece was popularized by the BERT model I mentioned earlier. It takes text that you've already tokenized and splits those words into even smaller subword units. This greatly reduces the size of your vocabulary: as you try to encapsulate more information, your vocabulary grows, and by breaking words down into subword units you can make it much smaller and encapsulate more meaning with less data. For the vocabulary, we have a Beam pipeline on GitHub so you can generate your own, or the original BERT model provides one. SentencePiece is a very popular tokenizer that was released previously; there's a GitHub project that people have used in many popular places. It takes a configuration where you set up a bunch of preprocessing steps, you feed that to it, and it performs them; it does subword, word, and character tokenization. And finally, we're releasing a BERT tokenizer that does all the preprocessing the original paper did, so, like I said, you can use WordPiece tokenization, and it'll do the pretokenization steps, some other normalization, and then the WordPiece tokenization. So now that we have tokenizers, we really just needed a way to represent the results, and that's why we created RaggedTensors: a better representation for text. Let's look at an example. We have two sentences, and like I said, your sentences are never the same length, so we need to try to create a tensor
out of these, and as you know, a tensor needs to be of uniform shape. Like I said previously, we'd pad out the strings, and in this example you might think, okay, three extra values is not so bad. But when you're actually running these models, you don't know how long your sentences are going to be, so you use a fixed size, and a lot of times you'd see a fixed size of 128 words, and you just have all this extra information that you don't really need inside your tensor. And then, if you try to make that smaller, when you do have a long sentence it gets truncated. So you might think, well, we have sparse tensors. And this is also not quite right, because there's a lot of wasted data you're having to supply in a sparse tensor. If you don't know sparse tensors: in TensorFlow everything is made of tensors, so a sparse tensor is actually made of three tensors, which are the values, a shape, and the indices where those values exist within your matrix shape. And you can see there's actually a pattern here: our strings aren't really sparse, they're dense; they just have varying lengths. So it would be good if we could say, hey, the first row has indices 0 through 5, the second row has indices 0 through 2, and those make up our sentences. And that's what we did with RaggedTensors. They're easy to create, you just call tf.ragged.constant. Similar to how a sparse tensor is built, a ragged tensor is made up of values and row splits, and so it minimizes the waste of information: all the values are in one tensor, and then we say where we want to split them to build up our different rows. It's easier to see in this form, where the gray block on the left side is the ragged tensor and its internal representation, on the right is how it looks when printed, and down below is the call you would actually make to build it yourself inside TensorFlow. Row splits were the original way, but some people came to us representing these in different ways. So we also provide row IDs, where each ID tells which row a value belongs to, and row lengths, which say these are the lengths of each row: the first row takes the first four values, and you can have empty rows of length zero, too. We want you to be able to treat these like any normal tensor, so ragged tensors have a rank, just like normal tensors.
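A minimal sketch of those three equivalent constructors (the numbers are made up for illustration):

```python
import tensorflow as tf

values = tf.constant([3, 1, 4, 1, 5, 9, 2, 6])

# row_splits: row i is values[row_splits[i]:row_splits[i+1]].
rt = tf.RaggedTensor.from_row_splits(values, row_splits=[0, 4, 4, 6, 8])
print(rt.to_list())  # [[3, 1, 4, 1], [], [5, 9], [2, 6]] (note the empty row)

# The same tensor built from row lengths...
rt2 = tf.RaggedTensor.from_row_lengths(values, row_lengths=[4, 0, 2, 2])

# ...or from a per-value row id.
rt3 = tf.RaggedTensor.from_value_rowids(
    values, value_rowids=[0, 0, 0, 0, 2, 2, 3, 3], nrows=4)

print(rt.shape)  # (4, None) -- None marks the ragged dimension
```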
In this example we have a rank of two. When we print the shape, the None denotes the ragged dimension. It's not necessarily always at the end, but in this case it is. And we can use normal TensorFlow functions and ops like we would with normal tensors. Here we're using gather, which grabs the second and then the first row; gather_nd, which gathers by index; and concat, which concatenates on the different axes. And of course, we made this for sequential and text processing, so your string ops work well with ragged tensors. Here we decode the strings into codepoints and then encode them back into strings. Conditionals work as well; in this case a where clause uses ragged tensors inside. The one caveat here is that the ragged tensors in the where must have the same row splits, so the rows must be of the same length. It's easy to convert into and out of ragged tensors: you can just use from_tensor or from_sparse to create a ragged tensor, and to move back you take your ragged tensor and call to_tensor or to_sparse, and to_list actually gives you a Python list if you want to print it out. We're also adding support to Keras. These are the layers that are currently compatible, and there's a lot left to do, so we're pushing to get more layers compatible with ragged tensors. If you do use them within your Keras model and come across something that's not compatible, TensorFlow Text, there at the bottom, provides a ToDense layer that will just convert it for you. The other thing I want to point out is RNN support. In our tests we see a 10% average speedup, and with large batches 30% or more. This is very exciting, and I won't go into details, but it's very intuitive: when you're looping through your tensor, once you're at the end of the ragged dimension you can stop your computation, whereas if you're using padded tensors with mask values, the masked values can be not just at the end but in the middle, so you have to keep computing until you reach the full tensor length or width. You just have a lot less computation, and you save a lot there. All right, I wanted to go over a couple of examples so you can see how easy it is to work with. First, you can install TensorFlow Text with pip. Our versions now map to TensorFlow versions, so if you're using TensorFlow 2.0, use tensorflow-text 2.0; if you're using TensorFlow 1.15, use tensorflow-text 1.15. Because of the custom ops,
the versioning must match. You can import it like this; we generally import tensorflow_text as text, so in these examples you will see it written as text. Let's go over a very basic preprocessing flow you might use. Normally you'll get your input text; here we have a couple of sentences. We want to tokenize those sentences into words, and then we want to map those words into IDs in a vocabulary that we'll feed into a model. The preprocess function might look like this, where we just instantiate the tokenizer and tokenize, and then map a table lookup into our vocabulary over the values of the ragged tensor. If you remember what the ragged tensor looks like underneath: when we have our words and tokens, we have that ragged tensor where the values are one tensor and the row splits are separate, in their own tensor. So when we want to map those words to IDs, we're keeping the same shape; we only want to map over the values. That's why map-over-values is there: we're just converting, doing the lookup on each word individually. The resulting ragged tensor is there at the end, and we can see what it actually represents above. Once this preprocessing is done, using tf.data you normally create a dataset and map that preprocessing function over your dataset. I won't go into all the details, but you can create a model with Keras pretty simply and then fit the dataset on the model, and that trains the model.
Then you can use that same preprocessing function in your serving input function, so you have the same preprocessing at serving time for your inference, and the training skew we have seen so many times in the past is prevented. So let's go over another example, a character bigram model. Before that, I just want to quickly go over n-grams. A bigram is an n-gram with a width of two: basically a grouping of a fixed size over a series. We provide three different ways to join the groups together: you can string-join, you can sum values, and you can also take averages.
Here's an example of a bigram of words. We have to split the sentence up into words, and then we call the ngrams function in TensorFlow Text, which groups those words together; for strings this is basically a join, so every two words are grouped together. Here's a width of three: you can see we split our sentence into characters and then group them, every three characters, with a width of three. In this situation the default separator is a space, so we just pass the empty string instead. It also works with numbers. If we have a series here, 2, 4, 6, 8, 10, as a tensor, and we want to sum every two numbers: 2 plus 4 is 6, 4 plus 6 is 10, and so on. And then there's also average, which is the mean reduction type. Generally when you talk about n-grams you're talking about a language context, but here's where this would be helpful: say you were taking a temperature reading every 20 minutes, so you have a series of temperature readings at 20-minute intervals, but what you actually want to feed your model is an average of those temperatures over an hour, at every 20-minute step. You can do a trigram with a mean reduction, so it takes the average of those 20-minute intervals, and you get the average temperature over the hour at every 20-minute step and can feed that into your model. But generally, like I said, n-grams are most often used in NLP.
How does that work? You generally split the text up, either into words or characters, and have a vocabulary dictionary you can look those groupings up in. In our example we cheat a little bit: we get our codepoints right from the input. So once we have our input, we can get codepoints, and as you see, the rank has increased again: we had a shape of three, and now we have three rows with a ragged dimension. We use merge_dims to combine those two dimensions, since we don't care about word boundaries in this case; it takes the second-to-last axis and the last axis and combines them. Then we just sum those up to create kind of a unique ID, in this case, that we'll feed into the model. Generally, like I said, you would do string joins and look those up in the vocabulary, but for this toy model, sums work. This is our preprocessing function; again we create a dataset, using TFRecordDataset, map our preprocessing function over those values, and then we can train the resulting model using this preprocessing function.
Finally, I want to go over the BERT preprocessing. There's a little bit more code in this one, so I just want to say up front that we provide the BertTokenizer for you, so you can feel comfortable knowing that you don't really have to write all of this. If you don't want to, you can just use the BertTokenizer; its tokenize does all this for you. But I feel like there are a lot of good examples in what it does, and if you're doing text preprocessing, these are things you should probably think about, so I wanted to go over them with you. What it does in its preprocessing is tokenization, splitting out Chinese characters and emoji by character, and then WordPiece on top of all that. First, lowercasing and normalizing. This is something you do because, when you're looking up words in your vocab, you want the words to match and not have duplicate entries. Capitalization kind of gets in the way of that: words at the beginning of sentences are capitalized, and when you look one up it would effectively be in your dictionary or vocabulary twice, so generally you lowercase these. And normalization: a lot of Unicode characters with accents can be represented in different ways, and normalization normalizes the text so it's represented in a single way, again so you don't have the same word multiple times in your vocabulary, which would also make your model larger than it needs to be. case_fold is, to my taste, just a more aggressive version of to_lower. It lowercases, but it also works with non-Latin characters and accented characters; it doesn't mangle those letters, it keeps them as they are, and it does an NFKC case-folding normalization, which I'll talk a little more about. So we do that in our first step; here's an example of what it looks like, and in this example it really just lowercases the "I". BERT actually normalized to NFD, but because case folding leaves you close to NFKC, we're going to normalize to that next. I won't go deep into this; just know, again, that letters can have many different forms, so it's good to have a single normalization, so that when you're working with international characters they're not represented in multiple ways. So here we normalize to NFKC. Now we're going to do some basic tokenization, and we'll split our text on Unicode scripts. What you might notice here is that, while the rest of the sentence got split, the Chinese characters have not, and that's because it's a single script throughout that whole sentence and there are no spaces or any other way of delimiting words. What we want to do is split that up by character, and this is kind of where a lot of the code comes in. You can follow along, but I think the main point is just to know that these are things we thought about, and if you run across them, there are ways to work around this.
As promised, we get the codepoints of the characters, and then we get the script IDs of those characters. You can see that the first sentence is all script 17, which is the Han script, so Chinese; the Latin characters are script 25; and emoji and punctuation are zero, the common script. Then we can just apply math, like equality, on a ragged tensor; here we're checking whether it's the Han script, which gives us true values, and then we use slice notation to just grab the first character, because we know from the Unicode script tokenization that they're all the same within a token. Next we also want to check for emoji, and in TensorFlow Text we provide a function, wordshape, with which you can ask different questions about words. It's basically a set of regular expressions covering questions you might want to ask: here I'm asking, does this text have any emoji; other ones are, is there any punctuation, are there any numbers, is my string all numbers. These are things you might want to find out, and wordshape provides you a method to do that.
Here we just OR the two conditions together to say whether we should split or not, and it works with ragged tensors. Then we go ahead and split everything into characters, so that when we apply our where clause with the split-or-not condition: if we should split, we grab it from the characters that we've already split; if not, we just grab it from the tokens we made when we tokenized earlier. And here we do a little reformatting of the shape. Then, in the end, we can finally WordPiece-tokenize: we provide it with our vocab table, and it splits everything up into subwords. That leaves an extra dimension, so we just get rid of it with merge_dims. All right, we made it through; that wasn't too bad. So we have a dataset with training data, and we map our preprocessing function across it. Here we can grab a classifier BERT model from the official BERT models and just train that classifier. I know that was a lot to go through; hopefully you followed along. The main thing to know is that with TF.Text we're looking to bring all that preprocessing inside the graph, so you don't have to worry about training skew; you can just write your TensorFlow and train. We do that by giving you what I consider a superior data structure for sequence data as well as text data, the RaggedTensor, plus the APIs that are required for the processing. Again, you can install it with pip install tensorflow-text. Thank you; here are some links, and if there's anything you think is missing that we should add, feel free to file an issue. We also have a colab tutorial on tensorflow.org that you should check out; it walks through some of this more slowly than I did here. Thanks. [Host:] Okay, we still have about seven minutes left, so we can open this up for Q&A. There are some microphones on either side, but I can also bring one to you if needed. [Audience:] Just a quick question: does
TF.Text handle Japanese text? It's a mixture of hiragana, katakana, kanji, romaji, all thrown in. [Robby:] Really, we go through the characters. There's a lot of Unicode support we've added to core TensorFlow, and when we're searching for the scripts, these are ICU scripts, ICU being the open-source Unicode library, so hiragana, katakana, and kanji each have their own script ID there. [Audience:] Thanks for the informative talk. For inferencing, do you send in text, or do you have to preprocess it yourself? [Robby:] Yeah, you can send text. At inference time, like here in training, we use this preprocessing function, and you can use that same preprocessing function: when you save your model as a SavedModel, you give it a serving input function that does the preprocessing on your input. So if you send in the string sentences, you can use the same function, or a variation of it, in that input function, and it should process them. [Audience:] Thank you very much. My question kind of relates to his: what's the advantage of applying it with a map versus having a layer that does it? Because you could, with a Lambda layer, or with the new Keras preprocessing layers, have a layer that does it, and then it's saved as a trackable as part of the model. [Robby:] Actually, we're looking at which layers we should provide, and someone on the Keras team is helping in building out their preprocessing layers, and if there's functionality we can supply for that from the TF.Text library, we will. So it's really up to you, as someone building a model, how you want to apply those. [Audience:] Thank you for the talk. Two quick questions. The first one: do the tokenizers you provide also have a decoding function, to go from the tokens, from the integers, back to the sequence of text? Because if you decode them, for instance with a BERT vocabulary, you will have all these additional characters there, and then you want to get back a proper sequence of text. [Robby:] I think what you're asking is: you send your words through a model, you get a vector representation, you translate that back into tokens, and then you want to map that sequence of tokens back into a string of text. That's not something that we provide inside the library right now. [On the second question, about TF.Text being a separate package:] Yeah, so with 2.0, things have gotten kind of unwieldy for the core team; it's too much to handle, the tests run too long, and it's really too much for one
team to maintain. So I think you'll see more modules like TensorFlow Text that are focused on one particular area like this. A lot of the stuff we've done, like ragged tensors and some of the string ops, is actually in core TensorFlow, but some of these things that are outside that scope, like n-grams and tokenization, live in a separate module. [Audience:] Thank you. Hi, I've got a question on TF.Text. Since it can be incorporated into the TensorFlow graph, is it intended that you actually build a model with this preprocessing step, and if so, are there performance implications, say in TensorFlow Serving? Has that been measured? [Robby:] There's definitely some performance impact. It's done at the input level, and this is actually a problem tf.data is looking at, as far as consuming this data and parallelizing these input functions. So if you're running your model on GPU or TPU, the inputs are parallelized, and you're feeding it as fast as possible. This is something you might want to look at, but it's also something a lot of other people are looking at as well. [Audience:] And then, I guess, if it's part of the graph in TensorFlow Serving, how are the nodes allocated and computed? Is the preprocessing on the CPU? [Audience 2:] Are you compatible with TF 2? Because I just pip installed tensorflow-text and it uninstalled TensorFlow 2 and installed 1.14. [Robby:] You need to do pip install tensorflow-text==2.0.0, which I think is why it did that: that version is actually, I believe, still at RC0, so if you install the RC explicitly, it'll install TF 2.0 for you. [Audience:] First of all, I'd like to say that this is really cool. Second, does TF.Text integrate with other NLP libraries, such as spaCy, anything in that area? Just out of curiosity.