About the talk
As billions of people around the world continue to use products or services with AI at their core, it's more important than ever to build AI responsibly. It has always been our highest priority to build products that are socially beneficial, safe, inclusive, and accountable to our communities. In this session, we go over the collection of offerings in the Responsible AI Toolkit, a growing library of lessons learned and resources that anyone can apply to AI deployments.
Responsible AI Toolkit Documentation → https://goo.gle/2QEEuVV
Building trusted AI products → https://goo.gle/3eXWLFM
Know Your Data → https://goo.gle/3gLYxwh
Beyond evaluation: Improving fairness with Model Remediation → https://goo.gle/3fiw7aR
Speakers: Catherina Xu, Ludovic Peran
TensorFlow at Google I/O 2021 Playlist → https://goo.gle/io21-TensorFlow-1
All Google I/O 2021 Technical Sessions → https://goo.gle/io21-technicalsessions
All Google I/O 2021 Sessions → https://goo.gle/io21-allsessions
Subscribe to TensorFlow → https://goo.gle/TensorFlow
Ludo is the Product Lead for the People + AI Research team, where he helps ship products around machine learning interpretability and data exploration. Ludo is passionate about the impact of technology on society. Before joining Google AI as a Product Manager, he worked in Google's public policy teams in France and in the US, focusing on artificial intelligence policy.
Hello, my name is Ludo and I'm a product manager at Google. I'm here with my colleague Catherina to show you how to build with the Responsible AI Toolkit. Today we will cover an overview of what responsible AI means at Google, introduce the Responsible AI Toolkit and the tools you can use, and walk through a real-world example.

From aiding physicians to suggesting the fastest route to a nearby dining location, AI serves millions of purposes for billions of people daily, and it needs to do so responsibly. At Google, we have a set of AI Principles that guide our understanding of responsible AI: we believe that AI should be socially beneficial, safe, private, and accountable to users. Today we are introducing the Responsible AI Toolkit, a collection of tools that helps put our principles into practice. We open-sourced the toolkit last year and have been adding tools and resources to it ever since.

Let's take a look at a typical machine learning workflow and see how responsible AI considerations arise at every single step. The first step is to ask whether your product requirements match your users' needs. Next, you collect and prepare the data: your data may not reflect real-world imbalances, and the decisions you make during collection and labeling matter, so ask what the appropriate dataset is. Then it's time to build and train the model; as you'll see in the rest of the presentation, there are tools that can be used to train the model with fairness, interpretability, privacy, and security principles in mind. At the evaluation step, it is important to evaluate not only overall performance but also performance on specific slices, with metrics of your choice. And finally, when you deploy, you decide what people need to know about your model and when its use is appropriate.

The Responsible AI Toolkit is a suite of tools that helps put responsible AI into practice at every stage of the machine learning workflow. It can be accessed through the link on this slide. It's impossible to cover the whole variety of tools in the toolkit for each step of the workflow, so let's walk
through a real-world example to put these tools into context. Imagine you're the owner of a new global restaurant chain. Diners leave online reviews about their experience at your restaurants. Nothing out of the ordinary here: the inputs to our sentiment classifier are the review text and any associated image that may be part of the review. First, we need to scope our problem and define success; I won't expand on that right now, for the sake of the presentation. For the launch of the new version of the classifier, we have two main objectives: we want to be sure that our sentiment classifier is accurate, and we also want to make sure that the classifier does not misclassify reviews based on sensitive characteristics, whether mentioned in the review or specific to the user. We will use accuracy as a metric, and follow it up with false positive rate across groups, to assess performance.
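The two objectives can be sketched in a few lines of plain Python. This is illustration only, not the Fairness Indicators API; the slice names and label vectors are invented:

```python
# Plain-Python illustration of the two objectives; slice names and
# label vectors are invented, and this is not the Fairness Indicators
# API. Labels: 1 = positive review, 0 = negative review.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def false_positive_rate(y_true, y_pred):
    # FPR = wrongly-positive predictions / actual negatives
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(p == 1 for p in negatives) / len(negatives) if negatives else 0.0

# Hypothetical slices keyed by whether a review mentions a gendered term.
slices = {
    "male_terms":   ([0, 0, 1, 0], [1, 1, 1, 0]),   # (y_true, y_pred)
    "female_terms": ([0, 1, 0, 1], [0, 1, 0, 1]),
}

for name, (y_true, y_pred) in slices.items():
    print(name, accuracy(y_true, y_pred), false_positive_rate(y_true, y_pred))
```

Computing both metrics per slice, rather than overall, is exactly what surfaces the disparities discussed later in the talk.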
The next step is to inspect the data. To do so, today we're launching the beta of Know Your Data, a tool that helps product owners identify issues in their datasets, such as feature imbalance. You can visualize and slice the dataset based on its existing features: here we see that restaurants in the USA are over-represented, while reviews from other regions are barely present, so the dataset may not represent the needs of all users. You can also annotate the data with additional features that are not available in the original dataset.
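A feature-imbalance check of this kind can be approximated by hand. Know Your Data itself is a hosted exploration UI; the `reviews` records and the `feature_balance` helper below are hypothetical:

```python
from collections import Counter

# Hand-rolled feature-imbalance check in the spirit of Know Your Data,
# which is itself a hosted exploration UI; the records and the
# feature_balance helper are hypothetical.
reviews = [
    {"country": "USA", "photo_setting": "indoor"},
    {"country": "USA", "photo_setting": "indoor"},
    {"country": "USA", "photo_setting": "indoor"},
    {"country": "USA", "photo_setting": "outdoor"},
    {"country": "France", "photo_setting": "indoor"},
]

def feature_balance(rows, feature):
    counts = Counter(row[feature] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

print(feature_balance(reviews, "country"))        # USA is over-represented
print(feature_balance(reviews, "photo_setting"))  # few outdoor photos
```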
One thing that is important to note is that there are only a few outdoor photos in the training data we use as a proxy for consumer sentiment. On a different note, when using user-generated data it is very important to make sure that your model is trained with privacy in mind. In the next month, we are launching privacy measurement in TensorFlow Privacy: you run privacy attacks against your own model and get back a privacy score. As you can see in the graph, there is a trade-off between accuracy and privacy, and the score helps you make informed decisions to mitigate potential shortcomings. The walkthrough continues like this for the rest of the responsible AI workflow, where you will discover more tools.

Next, we will use Fairness Indicators to evaluate the model with fairness principles in mind. Fairness Indicators allows us to compute common classification fairness metrics, such as false positive rate and false negative rate, across individual groups, with confidence intervals. It is also integrated with the What-If Tool for a deeper dive into individual data points.
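The confidence intervals that accompany each sliced metric can be approximated with a simple bootstrap. This is a hand-rolled sketch, not the Fairness Indicators API, and the `male_slice` data is invented:

```python
import random

# Hand-rolled bootstrap confidence interval for a sliced FPR, only to
# illustrate the kind of uncertainty estimate Fairness Indicators
# reports next to each metric; this is not its actual API, and the
# male_slice data is invented.

def fpr(pairs):
    """pairs of (y_true, y_pred), where 1 means positive sentiment."""
    negatives = [pred for true, pred in pairs if true == 0]
    return sum(negatives) / len(negatives) if negatives else 0.0

def bootstrap_ci(pairs, n_resamples=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    stats = sorted(
        fpr([rng.choice(pairs) for _ in pairs]) for _ in range(n_resamples)
    )
    low = stats[int(n_resamples * alpha / 2)]
    high = stats[int(n_resamples * (1 - alpha / 2)) - 1]
    return low, high

# Reviews containing male gender terms (hypothetical):
male_slice = [(0, 1), (0, 1), (0, 0), (1, 1), (0, 1), (0, 0), (1, 1), (0, 0)]
low, high = bootstrap_ci(male_slice)
print("FPR %.2f, 95%% CI [%.2f, %.2f]" % (fpr(male_slice), low, high))
```

Wide intervals on a small slice are a signal to collect more data before trusting a disparity, which is why the intervals matter as much as the point estimates.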
We can use Fairness Indicators to evaluate our first objective, accuracy, sliced by the presence of gendered terms in the reviews. We wouldn't want the presence of these words to unduly influence the outcome of the classifier. Fairness Indicators shows us that reviews containing terms implying certain genders have a lower accuracy than other reviews, which is concerning. Now we turn to our second objective, false positive rate. This graph shows us that reviews containing terms that imply certain genders have a higher FPR than other reviews, especially for the male gender. This means that they are far more likely to be wrongly classified as positive, which can lead to misleading conclusions in your assessment efforts.

We can use the Language Interpretability Tool, or LIT, to further investigate these findings. In LIT, you can inspect each text example in the UI. It also allows you to tweak the text and evaluate examples side by side, a concept known as counterfactual testing. You can edit each individual data point, or use transformer functions to generate counterfactuals for a batch of examples. Let's try this for the data point selected here, a seemingly sarcastic negative review, and change the gendered reference in the sentence. In LIT, you can edit the data point directly in the UI, add it to the dataset, and compare the old and new versions. The bar graph shows that the classifier considers the sentence with "waitress" more likely to be negative sentiment than the identical one with "waiter", even though these words shouldn't differ in their impact on sentiment.

LIT also lets you use salience methods that indicate how much each word accounts for the model's prediction. The identity terms "waitress" and "waiter" are darker in color than the surrounding words, indicating that they have very strong predictive power. This could be one of the hidden causes behind the disparities we surfaced in Fairness Indicators, in which the false positive rate was higher for samples containing male gender terms. We could probe further by, for example, re-evaluating after dropping the gendered terms entirely from the review. You can use LIT with your own model by installing the pip package lit-nlp. You simply define a LIT model that can make predictions for a given input, defined in the input spec, which in this case includes the review text and the ground-truth sentiment label.
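Counterfactual generation by word substitution, in the spirit of LIT's transformer functions, can be sketched as follows. This is not the lit-nlp API; the swap list is a deliberately tiny, hypothetical example:

```python
import re

# Hand-rolled counterfactual generation in the spirit of LIT's
# word-substitution transformers; this is not the lit-nlp API, and the
# swap list is a deliberately tiny, hypothetical example.
SWAPS = {"waiter": "waitress", "waitress": "waiter", "he": "she", "she": "he"}
PATTERN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)

def counterfactual(text):
    def swap(match):
        word = match.group(0)
        replacement = SWAPS[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement
    return PATTERN.sub(swap, text)

review = "The waiter was friendly but he forgot our order."
print(counterfactual(review))
# -> The waitress was friendly but she forgot our order.
```

A fair classifier should score the original and swapped versions nearly identically; a large score gap flags the gendered term as an unwanted signal, which is exactly what the salience view surfaced.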
Now that we have a better grasp of what might be the underlying problem, we can use the Model Remediation library to retrain and improve our model. As part of this library, we partnered with Responsible ML research teams to launch MinDiff, a technique that can balance error rates across different slices of your data. It works by penalizing distributional differences between these groups. We can try this as one of many potential remediation techniques on our model. We take our original model, define the type of loss we want to use during MinDiff training, and set the MinDiff weight, a hyperparameter that defines the allowable trade-off between minimizing intergroup differences and maximizing accuracy. Then we create the MinDiff model and train it as usual with the MinDiff dataset, which specifies which subgroups to apply MinDiff to: in this case, comments containing gender identity terms. As you can see in our example, applying MinDiff reduced the FPR for both the male and female groups, and decreased the difference between the groups as well. This can take a bit of hyperparameter tweaking to get right. I'll also add that MinDiff is not the only model remediation method, and it is important to understand what each method can and can't do before applying it to your model. You can learn more on the Responsible AI site.
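The MinDiff idea of penalizing distributional differences can be illustrated with a toy loss. The real tensorflow-model-remediation library uses kernel-based losses such as MMD between prediction distributions; the absolute mean gap and every number below are simplified, made-up stand-ins:

```python
# Toy illustration of the MinDiff idea, not the
# tensorflow-model-remediation API: penalize the difference between the
# model's score distributions for two groups, scaled by a MinDiff
# weight. The real library uses kernel-based losses such as MMD; an
# absolute mean gap is the simplest stand-in, and all numbers are made up.

def min_diff_loss(task_loss, scores_group_a, scores_group_b, min_diff_weight=1.5):
    mean = lambda xs: sum(xs) / len(xs)
    gap = abs(mean(scores_group_a) - mean(scores_group_b))
    return task_loss + min_diff_weight * gap

# Hypothetical "positive" scores on truly negative reviews per group:
male_terms = [0.70, 0.65, 0.80]      # wrongly skewed positive
female_terms = [0.30, 0.25, 0.40]
print(min_diff_loss(0.42, male_terms, female_terms))
```

Because the penalty only vanishes when the two groups' score distributions agree, gradient descent on this combined loss pushes the model toward equalized error rates, at a cost to raw accuracy that the weight controls.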
The last tool we'll dive into today is the Model Card Toolkit. Model transparency is important for a wide variety of audiences: developers who are making decisions about how to incorporate a model into a product, users who are impacted by those models, and overseers who want to ensure the model is working as intended. Model cards help enable transparency by providing a framework that communicates the essential facts of an ML model in a structured, accessible way. We know that building model documentation is time-intensive and requires specific expertise to accurately represent both the qualitative and quantitative details of a model; the Model Card Toolkit helps automate card creation using data generated during pipeline runs. To see this in practice, we initialize a Model Card Toolkit instance, extract and filter the quantitative statistics that are automatically generated from the run, and manually enter the qualitative model card details. Then we update the model card JSON template and export the document as HTML. Here's the qualitative section of our generated model card, with an overview, use cases, ethical considerations, and limitations. The quantitative section describes the distributions of our training and evaluation sets, sliced by mentioned identity terms. And finally, we were able to include the latest evaluation results on the metrics we really care about. That was a lot of tools!
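The kind of information a model card captures can be sketched as a plain data structure rendered to HTML. This is not the model-card-toolkit API (the real toolkit scaffolds a card, updates a JSON template, and exports HTML); every field name and value below is illustrative:

```python
import html
import json

# A minimal sketch of the information a model card captures, rendered
# to HTML. This is not the model-card-toolkit API (the real toolkit
# scaffolds a card, updates a JSON template, and exports HTML); every
# field name and value below is illustrative.
model_card = {
    "model_details": {"name": "restaurant-review-sentiment", "version": "2"},
    "considerations": {
        "use_cases": ["Classify the sentiment of restaurant reviews"],
        "ethical_considerations": [
            "FPR was higher for reviews containing male gender terms; "
            "MinDiff was applied to reduce the gap."
        ],
        "limitations": ["Mostly US restaurants; few outdoor photos."],
    },
    "quantitative_analysis": {"fpr": {"male_terms": 0.11, "female_terms": 0.09}},
}

def to_html(card):
    sections = "".join(
        "<h2>%s</h2><pre>%s</pre>"
        % (html.escape(name), html.escape(json.dumps(body, indent=2)))
        for name, body in card.items()
    )
    return "<html><body><h1>Model Card</h1>%s</body></html>" % sections

print(to_html(model_card)[:60])
```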
As we've seen, responsible AI is an iterative process: the workflow is not linear, and we often have to go back a step to re-evaluate our data or model as circumstances change. There are many more tools than the ones we mentioned today in the Responsible AI Toolkit, which collectively represents the work of many teams across Responsible AI at Google. Please don't hesitate to reach out to us if you have any questions or feedback. Thank you.