Data Science and ML with an emphasis on Banking... By Deepti Gupta, Data Scientist, Santander Bank


About speaker

Deepti Gupta
Strategy Consultant, Data Scientist at IBM

Deepti Gupta is a data scientist on the credit risk team at Santander Bank. She has authored a book on data science and machine learning. Deepti has an MBA in Finance and Operations Research.


About the talk

Data is the fuel of all industries, and machine learning is the art of using patterns in data to make predictions. In this talk I provide an overview of how data science and machine learning algorithms are reshaping industries like healthcare, retail, aviation, insurance, and banking, with real-world case studies.


So good morning and good evening, everyone. Welcome to the Global Artificial Intelligence Conference. I hope everyone is doing well and staying safe. I want to thank the organizers for giving me the opportunity to present, and thank you for the kind introduction. I'm joining in from Boston, Massachusetts, and the title of my talk today is Data Science and Machine Learning with an Emphasis on the Banking Industry.

To start off, here is an outline of today's talk: an introduction to data science, machine learning, artificial intelligence, and big data; industries and applications of data science; tools and evolving capabilities; and data, its uniqueness, and its challenges. Today I'll be focusing on one use case for classification algorithms from the banking domain, where we demonstrate that use case by predicting the probability of bank loan default, and finally I'll close the talk with a short conclusion.

First, I want to provide a quick introduction to four buzzwords in the world of data. Let's start with data science.

Data science continues to rank as one of the most promising and in-demand career paths for skilled professionals. It is the field of study that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insight from structured and unstructured data.

Machine learning algorithms are responsible for the vast majority of the artificial intelligence advancements and applications you hear about. Machine learning gives systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning is the process that powers many of the services we use today: the recommendation engines on Netflix, voice assistants like Siri and Alexa, and the list goes on.

Now let's move on to artificial intelligence. AI is a branch of computer science concerned with building smart machines capable of performing tasks that typically require human intelligence, for example robots and self-driving cars.

Deep learning runs input through a biologically inspired neural network; the layers through which the data is processed allow the machine to go deep in its learning, making connections and weighing input for the best results, for example in conversational bots for marketing and customer service.

As the data science field continues to grow, data science software is evolving at a high pace, both in terms of capabilities and diversity. So we have a few tools. The most frequently used open platforms for data science are R and Python. For visualization we have Tableau, QlikView, IBM Watson, and even Power BI, which I have not mentioned in the slides, but yes, Power BI is one of the frequently used tools for visualization purposes. Coming to big data platforms, we have Hadoop, MapReduce, Apache Spark, and Cloudera. And when we talk about commercial software for data science, we have SAS, IBM SPSS Modeler, and many more.

Now we will talk about machine learning applications in various industries. Let's start with the banking sector.

The banking sector was one of the earliest adopters of data science, and analytics plays an important role in redefining the banking industry in a holistic sense, with applications ranging from fraud detection and predicting loan default to customer acquisition and retention and increasing cross-selling and upselling.

Now we'll talk about applications in the retail industry. Predictive analytics is widely used by both conventional retail stores and e-commerce firms for analyzing their historical data and building models for customer engagement, supply chain optimization, and price optimization.

Now we'll talk about the aviation industry. Data is collected from different sources, like passengers' travel information and sensor data from the plane. In today's time, the airline industry must consider itself a data company first and a travel company second. Machine learning algorithms are widely used in the aviation sector for personalized offers and passenger experience, predicting flight delays, and safer flights.

Now we'll focus on the telecom industry. The telecom industry is one of the most progressive and challenging industries. The high volatility of this market is due to rapid changes in technology, especially in the wireless sector. The fast-changing needs of the target audience make it necessary for telecommunication firms to quickly adapt to modern technology as well as align their marketing and sales strategies accordingly. The telecom sector relies heavily on advanced analytics for a wide variety of applications, which include network optimization, fraud identification, and predicting customer churn.

Now we will talk about the healthcare industry. Data science and machine learning are playing a very important role in the healthcare industry in a holistic sense, with applications ranging from predicting the outbreak of disease and preventive management to predicting chronic diseases, and improving patient satisfaction is now the focus.

Last but not least is the FMCG industry. The FMCG industry relies on sales volumes and speed due to low prices and the non-durable nature of its products. Predictive analytics and machine learning are playing an important role in redefining the FMCG industry, with applications ranging from customer experience and engagement to logistics management and markdown optimization.

Now let's talk about data. The textbook definition of data is basically a collection of variables, facts, and figures which serve as raw material to create information and generate insight. The data needs to be manipulated, processed, and aligned in order to draw explainable insight. There are various forms in which data exists: it can be structured data, unstructured data, or semi-structured data.

Now we'll talk about challenges in data processing. The ability to access, integrate, and utilize data from multiple sources has drastically enhanced the capabilities of data science and machine learning, but it brings its own challenges too. Integrating multiple file formats, including text, audio, and video, requires significant data preprocessing. Real-world data is rarely clean and homogeneous; typically it tends to be incomplete, noisy, and inconsistent.

It is an important task of a data scientist to preprocess the data. Starting with missing values: it is important to handle them, as they could lead to wrong predictions or classifications for any given model being used. Your data can have missing values for a number of reasons, such as observations that were not recorded or data that was corrupted. We can handle missing values using various methods: you can mark invalid or corrupt values as missing in your dataset; you can remove rows with missing data from your dataset; or you can impute missing values with the mean, median, or mode values in the dataset, as your business case dictates.

Missing data also affects the selection of the algorithm that can be applied. Some algorithms are robust to missing data, such as k-nearest neighbors and decision trees, which can ignore a feature from the distance or split calculation when its value is missing, while neural networks cannot handle missing values in the dataset. So it's very important to handle the missing values before we go ahead and start modeling our dataset.
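As a quick illustration, here is a minimal pandas sketch of the three options just described (marking, removing, and imputing missing values). The file name and column names are hypothetical, not from the talk's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical loan dataset; the file and column names are illustrative only.
df = pd.read_csv("loans.csv")

# 1. Mark invalid or corrupt values as missing (e.g., a negative age cannot be valid).
df.loc[df["age"] < 0, "age"] = np.nan

# 2. Remove rows with missing data.
df_complete = df.dropna()

# 3. Or impute with the mean / median / mode, as the business case dictates.
df["income"] = df["income"].fillna(df["income"].mean())
df["age"] = df["age"].fillna(df["age"].median())
df["loan_type"] = df["loan_type"].fillna(df["loan_type"].mode()[0])
```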

The next important data processing step is to identify outliers or anomalies in the data. Outliers or anomalies are extreme values that fall far outside of the other observations. So why do we care about anomalies? Many algorithms are sensitive to the range and distribution of attribute values in the input data; outliers can skew and mislead the training process of machine learning algorithms, resulting in longer training times, less accurate models, and ultimately poorer results.

On the other hand, outliers can represent examples of data instances that are relevant to the problem, such as anomalies in the case of fraud detection and computer security. Another reason why we need to detect anomalies is that when preparing datasets for a machine learning model, it is really important to detect all the outliers and either get rid of them or analyze them to know why you had them there in the first place. As I just discussed with fraud detection, in that case outliers can be very useful information for identifying fraudulent activities. Common ways to detect anomalies are by looking at standard deviations and boxplots, or by opting for Robust Random Cut Forest.
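As a sketch of the two simple approaches just mentioned, the standard-deviation rule and the boxplot (IQR) rule fit in a few lines of pandas; the file and column names are again hypothetical.

```python
import pandas as pd

def outliers_std(s: pd.Series, k: float = 3.0) -> pd.Series:
    # Standard-deviation rule: flag points more than k sigma from the mean.
    return (s - s.mean()).abs() > k * s.std()

def outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    # Boxplot (IQR) rule: flag points beyond Q1 - k*IQR or Q3 + k*IQR.
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

df = pd.read_csv("loans.csv")            # hypothetical file
flagged = df[outliers_iqr(df["income"])]
print(flagged)                           # inspect before dropping: these rows may be fraud signals
```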

Another common problem and challenge is the class imbalance issue. If a class imbalance remains unaddressed, it will lead to misleading accuracies, and I will discuss the various methods to handle class imbalance issues in the banking use case in my upcoming slides.

Last but not least are feature scaling, feature engineering, and feature selection. Feature scaling is a technique to standardize the independent features present in the data to a fixed range. It is performed during data preprocessing. So now the question is: why scaling? Most of the time your dataset will contain features highly varying in magnitude, units, and range. If left alone, algorithms take in only the magnitude of features, neglecting the units, and the results would vary greatly between different units, like 5 kg and 5000 grams: the feature with high magnitude weighs in a lot more in the distance calculations than the feature with low magnitude. To suppress this effect, we need to bring all features to the same level of magnitude, and this can be achieved by scaling. There are various methods to perform feature scaling: it can be done by standardization, normalization, and min-max scaling.
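For instance, here is a minimal scikit-learn sketch of two of these methods (standardization and min-max scaling); the toy numbers echo the kg-versus-grams example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on wildly different scales, echoing the kg-versus-grams example.
X = np.array([[5.0, 5000.0],
              [3.0, 2000.0],
              [8.0, 9000.0]])

# Standardization: rescale each feature to zero mean and unit variance.
print(StandardScaler().fit_transform(X))

# Min-max scaling: squeeze each feature into the [0, 1] range.
print(MinMaxScaler().fit_transform(X))
```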

Now what is feature selection? Feature engineering and feature selection are critical parts of any machine learning pipeline. The data features that you use to train your machine learning models have a huge influence on the performance you can achieve. Feature selection is the process where you automatically or manually select those features which contribute most to the prediction variable or output in which you are interested.

Before we proceed, we need to answer this question: why don't we give all the features to the machine learning algorithm and let the machine decide which feature is important? There are three reasons why we don't do that. The very first is the curse of dimensionality, which is the root of overfitting: if you have more columns in the data than the number of rows, we will be able to fit our training data perfectly, but the model won't generalize to new samples, and thus we learn absolutely nothing from it. Next is Occam's razor: we want our model to be simple and explainable, and we lose explainability when we have a lot of features. And the third is about garbage in, garbage out: most of the time we will have many non-informative features and redundant features, for example name or ID variables. Poor quality input will produce poor quality output, and a large number of features also makes a model bulky, time-consuming, and harder to implement in production. So we select only useful features.

There are a lot of ways in which we can think of feature selection, but mostly the feature selection methods used are filter, wrapper, and embedded methods. Embedded methods use algorithms that have built-in feature selection; for example, LASSO regression and random forest have their own feature selection methods. Another one is principal component analysis, which is a widely used dimensionality reduction technique.
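As an illustration of the embedded and PCA approaches just mentioned, here is a small scikit-learn sketch on synthetic data; the dataset and the choice of top-5 features are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Embedded selection: a fitted random forest exposes per-feature importances.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top5 = np.argsort(rf.feature_importances_)[::-1][:5]
print("most useful features:", top5)

# Dimensionality reduction: keep the components explaining 95% of the variance.
X_reduced = PCA(n_components=0.95).fit_transform(X)
print("reduced shape:", X_reduced.shape)
```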

Now let's talk about the data science workflow. The workflow of a data science project goes through four steps. The very first step is data collection: data is collected by data engineers in various forms. It can be structured data, unstructured data, or semi-structured data, and it can come from different sources: it can be your social media data, your internal data, data coming from your clients or vendors, and various more sources.

Once the data is collected, you move to step two, which is about data preparation and enrichment. The data preparation phase is actually the most labor-intensive one of all: on average, 70 to 80 percent of a data scientist's time goes into data cleaning, and it's a very important step in any data science project, as better data results in a better model. Step three is modeling and predictions: once we have our cleaned data, we apply appropriate algorithms based on our business scenario, and we assess performance, compare results, and optimize our models accordingly. Then step four, which is the final step, is about execution and deployment, which produces a result that can be consumed by diverse internal and external systems.

Now let's talk about classification in machine learning. In machine learning, classification is basically a type of supervised learning in which the computer learns from the values or predictors given to it and then uses that learning to classify new observations. Examples are fraud detection, to identify fraudulent activities, and loan defaulters, the customers who have a high probability of defaulting on a loan. Common machine learning classification algorithms are decision trees, SVM (support vector machines), random forest, extreme gradient boosting, and neural networks.
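As a toy sketch of training several of these classifiers in scikit-learn, using GradientBoostingClassifier as a stand-in for extreme gradient boosting (the separate xgboost package would be the closer match); the synthetic data and class weights are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem; the weights mimic a rare positive class.
X, y = make_classification(n_samples=1000, weights=[0.87], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "random forest": RandomForestClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    print(name, model.fit(X_tr, y_tr).score(X_te, y_te))
```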

So now let's start with our case study: how a classification algorithm can be used for loan default prediction in the banking sector. Let's take a look at the problem statement. A large banking firm wanted to find more effective ways of detecting applications which have a high probability of defaulting on loans. They only had sufficient resources to follow up on 5% of the applications identified as potential defaulters, so they wanted to find a way to ensure, to the highest possible probability, that the applications referred for investigation were indeed those most likely to default.

For the audience less familiar with banking vocabulary, I want to briefly explain loan default and its consequences. When a borrower fails to pay back the principal or interest on a loan on time, the person is considered a loan defaulter. Loans can be home, education, personal, business, and auto loans, and loan defaults cause a huge capital loss for the banks.

Now we will talk a little bit about the data details. Various machine learning methods were applied for predicting the probability of bank loan default. Here our target variable is whether a customer is going to default or not. In the data, only 13 percent of the customers are defaulters, so we have an imbalance issue: we have an imbalanced dataset, with loan non-defaulters at 87 percent and loan defaulters at 13 percent.

So whenever we have an imbalanced dataset, there is a severe skew in the class distribution, such as 1:100 or 1:1000 examples in the minority class relative to the majority class. This bias in the training dataset can influence many machine learning algorithms, leading some to ignore the minority class entirely. And this is a problem, as it is typically the minority class on which predictions are most important.

One way to address the problem of class imbalance is to randomly resample the training dataset. The two main approaches to randomly resampling an imbalanced dataset are to delete examples from the majority class, called undersampling, and to duplicate examples from the minority class, called oversampling. A third method is a combination of over- and undersampling, where we combine the two techniques into a hybrid strategy to balance our dataset.
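A minimal sketch of both resampling approaches, assuming the third-party imbalanced-learn package and synthetic data with the same 87/13 split:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic data with the same 87/13 split as the talk's loan data.
X, y = make_classification(n_samples=1000, weights=[0.87], random_state=0)
print("original:", Counter(y))

# Undersampling: drop majority-class examples until the classes are even.
X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled:", Counter(y_u))

# Oversampling: duplicate minority-class examples instead.
X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X, y)
print("oversampled:", Counter(y_o))
```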

This is the architecture of the proposed model: step one is data collection, then data preprocessing and feature selection; then, if we have an imbalanced dataset, we go for sampling techniques to balance the dataset; and then model building and finally the validation part.

Now here is a correlation plot of a few of the variables which we have taken in this use case. We have variables such as term, employed and unemployed, checking amount, credit score, saving amount, education loan, home loan, and many more. Here default is the dependent variable, and we see in the correlation plot that default is positively correlated with term and with unemployed. It means the customers who have a longer term on the loan have a much higher probability of default than the ones with a shorter term, and the person who is unemployed has a high probability of default compared with the person who is employed. The opposite holds for checking amount, credit score, and saving amount: default, the dependent variable, is negatively correlated with these independent variables. Default is negatively correlated with checking amount, so it means if you have a high checking amount, the probability of default will be less, and vice versa. The same goes for credit score: if you have a high credit score, the probability of default will be less. And the same applies to saving amount.
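For reference, a correlation plot like this can be produced with pandas and seaborn; the column names below mirror the variables named in the talk, but the file itself is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("loans.csv")  # hypothetical file; columns mirror the talk's variables
cols = ["default", "term", "unemployed", "checking_amount",
        "credit_score", "saving_amount"]
corr = df[cols].corr()

sns.heatmap(corr, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation of loan-default variables")
plt.show()
```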

Now we'll talk about the accuracy of the algorithms before addressing the class imbalance issue. When we built models while the class imbalance was not yet addressed in our data, and we applied support vector machines, random forest, and extreme gradient boosting decision trees, we saw accuracies of 85, 87, and 88 percent; specificities of 92, 93, and 94 percent; but sensitivities of only 31, 32, and 35 percent. If you look at these percentages, they are not very promising, because we know that we have a class imbalance issue in our data. So we have to solve the class imbalance issue by using various techniques, and in this case study we are going for a sampling approach, and that is undersampling.
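Before moving on, a quick note on these metrics: sensitivity and specificity are the recall on the defaulter and non-defaulter classes respectively. A small sketch with toy labels, using scikit-learn's confusion matrix:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0]   # toy labels: 1 = defaulter
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]   # toy model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # recall on defaulters, the class we care about
specificity = tn / (tn + fp)   # recall on non-defaulters
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"accuracy={accuracy:.2f}  sensitivity={sensitivity:.2f}  specificity={specificity:.2f}")
```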

In the undersampling approach, we balance the dataset by reducing the size of the abundant class. This method is used when the quantity of data is sufficient: by keeping all the samples in the rare class and randomly selecting an equal number of samples from the abundant class, a balanced new dataset can be retrieved. In this case we are predicting class 1, and class 1 means the customers who have a high probability of default, so reducing the size of the majority class in order to balance the dataset is the better way to handle the issue.

Now, the accuracy of the algorithms after addressing the class imbalance issue by using undersampling: we have again applied the same algorithms, support vector machine, random forest, and extreme gradient boosting decision trees, but this time after addressing the problem of class imbalance, and we can see a big change in the accuracy, specificity, and sensitivity percentages. Accuracy is at around 80 to 81 percent, specificity at 79, 81, and 82 percent, and sensitivity is now at 80, 81, and 83 percent. And if you compare the three models, we can definitely see that gradient boosting is the one which comes up with better accuracy and better sensitivity. So this is the way addressing class imbalance issues helps models increase accuracy and helps in giving better predictions.

In summary, the solution was that the banking data science team combined internal customer data, then built and tested various loan default prediction models against the aggregated data to find the algorithm that provides reliable predictions. They set up a real-time API based on that algorithm to guide the automated routing of applications within the system based on the likelihood of defaulting.
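The talk does not say which framework the team used, so purely as a sketch of what such a real-time scoring API could look like, assuming Flask and a pre-trained model serialized to a hypothetical model.pkl:

```python
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:   # hypothetical pre-trained model artifact
    model = pickle.load(f)

@app.route("/score", methods=["POST"])
def score():
    # Expects JSON like {"features": [term, credit_score, saving_amount, ...]}.
    features = request.get_json()["features"]
    proba = model.predict_proba([features])[0][1]   # probability of default (class 1)
    return jsonify({"default_probability": float(proba),
                    "refer_for_review": proba > 0.5})  # illustrative routing threshold

if __name__ == "__main__":
    app.run()
```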

As for the results and benefits realized: the company reports that the new system of identifying potential loan defaulters has proven itself to be four times more effective at detecting loan defaulters than the legacy approach, and they anticipate even better performance as the model is continuously improved through the flow of real-time data.

So I would like to conclude my talk with the following takeaways. Data science is truly transforming how business is conducted across verticals. The most critical aspect remains the data: a thorough understanding of the data, its uniqueness, and how the limitations of the data on hand can be addressed determines the outcome of the project. Various models and algorithms constitute the tools which can be applied across verticals to address diverse industry problems. Thank you for your time and attention, and I will be...

