Sebastian is a Co-founder & CEO at Statice, a Berlin-based data anonymization company. Statice enables companies to undertake safe and efficient data science without compromising the security or privacy of consumer data. Leading corporates from the finance, insurance, healthcare and automotive fields leverage Statice technology to safely build value with data assets. Prior to founding Statice, Sebastian developed ventures in IoT, Data Science and Big Data at WATTx, a corporate company builder. Sebastian holds a master's degree from the Rotterdam School of Management and is a member of the Working Group on Data Privacy of the German Initiative for Artificial Intelligence. He has always been passionate about data-driven innovation that creates value for people and their businesses. He believes in a data-driven present and future, and is sure that we can combine data-based value creation with ethical business practices.
About the talk
Privacy is a state in which one is free from public attention and not observed or disturbed by others. In the context of data, privacy is therefore a state in which an individual's data is used only with their specific consent, and in which any person or organization party to that individual's data guarantees to prevent unauthorized disclosure or misuse of that information.
To protect the individual's privacy, strict regulations have already been introduced in many regions and countries worldwide, such as CCPA in California or GDPR in the EU, and we can expect many more to come. This puts businesses in a position where they need to find a way to leverage data while preserving privacy. We will address this topic and show how businesses can benefit from synthetic data and unlock the value of data.
Hi, everybody. Sebastian here from Statice, very happy to be talking to you about the business benefit of privacy-preserving synthetic data. As I say, it's a title that could mean a lot of different things, so I'll actually be focusing on one particular topic, which is the use of synthetic data for anonymization, to actually protect sensitive information in sensitive data. Before I start, a quick background on myself and Statice, the company I'm representing as well. We're a German company focused on
data anonymization and data protection. We are operating, again, out of Berlin, Germany, right now. And yeah, I have a few interesting use cases and also a few interesting ideas and examples around the use of synthetic data that I'd like to showcase and explore a little further. Now, we've obviously talked and heard a lot about the use of data, especially sensitive customer data, today, and the general idea overall, of course, is that user data, customer data, even patient data is very much at the core of
companies and businesses. It's a great resource to improve services and the customer experience, as well as to provide very personalized products and experiences to users and customers, which obviously is a great thing, and I think there are a lot of exciting use cases and applications for using and leveraging customer data. But obviously the reality often looks quite different. When we talk to the customers and partners we work with, we realize that the ability to leverage and use sensitive data is far from
as fluent as you would expect it to be. The reality looks like this: the data is not accessible for users internal to the organization that collected it in the first place, due to very complex governance and compliance processes. As a consequence, the ability to use data for exploration, for example, is basically not feasible. Additionally, this also leads to the problem that quite often the companies we talk to are unable to actually share production or sensitive data with, for example, cloud providers in order to use scalable infrastructure, which at this point is
basically not doable, due to the sensitivity and therefore, of course, the associated compliance aspects to be considered when using such data. And the question that we are continuously facing and always seeing when we talk to organizations globally is: how is it possible to benefit from customer data in the way I presented earlier on, to really build services and applications in a way that is meaningful and great for the user base, while still preserving customers' privacy? This is the main question I'd like to focus on today
in talking a little about the several different approaches towards protecting privacy, and then leading up to the idea of what can be done to actually protect customer data while still keeping it in a meaningful format for further analysis and exploration. We've all seen examples of companies in the past that seemingly protected sensitive data; the ones shown right now are only a few. But we also see that data protection is not as trivial as one might think, and the reason is that data itself
has so many unique attributes and unique characteristics to be considered in their entirety when it comes to protecting and anonymizing data. So if you look at information on an individual patient, we obviously can see that there are almost always direct identifiers available in the set that will link this data to a specific person; in this case right now, we're looking at a phone number. So what we're seeing here is that there is usually an identifier, which can be a customer ID,
maybe an email address, a phone number, or a name, obviously, that can be linked to an actual individual person and therefore identify data quite easily. This has obviously already been shown in the famous case of Cambridge Analytica, where it was possible to use user information attributed by ID to individual people to, I mean, as we're all familiar with, influence political campaigns, or to actually build algorithms that were able to predict user behavior and therefore, of course, recommend certain content to be shown to such people. Obviously, while technically
this is an exciting opportunity, it is, of course, on the other hand, ethically and legally very, very harmful. And obviously regulations such as the GDPR, together with cases in the past, have already shown that the use or misuse of sensitive data can have a very severe impact, not just on companies, obviously, but on global democracy as well as on the actual freedom and integrity of individuals. In the GDPR, for example, there is one line that directly outlines how companies would be able to use sensitive data in a way that does not require them to go through very difficult compliance protocols. For example,
when a company wants to use sensitive data right now, what usually has to happen is that there needs to be a reason for processing such data, which can be, most famously, customer consent, but also other potential reasons such as the legitimate interest of a company, and a few others. Now, this is really quite difficult to argue for and quite difficult to obtain. And in this case, we always support customers in... One second, I'll be right back. Apparently, we're having some
connectivity issues. It seems like we had some issues with the connection. Now we're back. As I was saying, there are several different reasons for companies to process sensitive user data, but really this is quite difficult to manage and quite difficult to enable in the real world. So this is why, according to the GDPR, there is just one very direct way of using data in a privacy-compliant manner, which is the anonymization of data. Now, what is anonymization of data? Anonymization of data means the ability to change a data set to
an extent that it does not allow for the re-identification of an individual, even with very significant effort. Now, as is easy to imagine, the reality is again often quite different, and this is difficult to obtain. There are certain well-known methods that aggregate data sets in order to also obfuscate so-called quasi-identifiers, such as birth date, sex, or ZIP code, so that they cannot be linked back to actual individual people. But even with aggregated quasi-identifiers, it's quite easy to re-identify people just by having a certain combination of attributes and some additional information.
Researchers went ahead and had a look at a publicly available medical data set, and by just linking this publicly available medical data set, based on the attributes of ZIP code, birth date, and gender of a person, with a publicly available voter list, they were actually able to identify most of the individuals in this medical data, without having names, addresses, or other such information in the medical file. Just by connecting the two data sets based on these three attributes, they were easily able to link information directly to a public voter list,
so all this information, which was seemingly anonymous and seemingly protected, was easily linked to a person and therefore, of course, exposed very sensitive information that, ideally, nobody wants to have publicly available. And that brings us to the typical problem of data in itself: even if there are no direct, or even indirect, identifiers in a set of data, information that might seem very ordinary can, in a very big quantity, itself serve as a fingerprint.
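The linkage described here can be sketched in a few lines of Python. This is only an illustration: the records, names, and field names below are invented and are not taken from the study mentioned in the talk. The sketch shows how joining two data sets on ZIP code, birth date, and gender re-identifies every record whose combination of those three attributes is unique in the public list.

```python
from collections import defaultdict

# Hypothetical "anonymized" medical records: names removed, quasi-identifiers kept.
MEDICAL = [
    {"zip": "10115", "birthdate": "1980-04-12", "gender": "F", "diagnosis": "asthma"},
    {"zip": "10117", "birthdate": "1975-09-30", "gender": "M", "diagnosis": "diabetes"},
    {"zip": "10119", "birthdate": "1990-01-05", "gender": "F", "diagnosis": "migraine"},
]

# Hypothetical public voter list with names and the same three attributes.
VOTERS = [
    {"name": "Anna Example", "zip": "10115", "birthdate": "1980-04-12", "gender": "F"},
    {"name": "Bernd Sample", "zip": "10117", "birthdate": "1975-09-30", "gender": "M"},
    {"name": "Clara Muster", "zip": "10119", "birthdate": "1990-01-05", "gender": "F"},
    {"name": "Dora Muster", "zip": "10119", "birthdate": "1990-01-05", "gender": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birthdate", "gender")


def quasi_key(record):
    """Return the combination of quasi-identifiers used for linking."""
    return tuple(record[attr] for attr in QUASI_IDENTIFIERS)


def link(medical, voters):
    """Map voter names to diagnoses wherever the quasi-identifier match is unique."""
    names_by_key = defaultdict(list)
    for voter in voters:
        names_by_key[quasi_key(voter)].append(voter["name"])

    reidentified = {}
    for record in medical:
        candidates = names_by_key.get(quasi_key(record), [])
        if len(candidates) == 1:  # exactly one voter fits: re-identified
            reidentified[candidates[0]] = record["diagnosis"]
    return reidentified
```

In this toy data, two of the three medical records are linked to a name. The third stays ambiguous only because two voters happen to share the same ZIP code, birth date, and gender.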
For example, for the tracking information here, it's possible, just by having enough data points, to use the information as a fingerprint and completely link it back to an individual person. In this case, a recently released study shows that in car systems it takes only around 15 minutes of braking patterns to uniquely identify the driver. This gives you an idea of how information that in itself seems not to be relevant or sensitive at all, just by having enough of it, already provides the ability to link very, very sensitive information to a specific
person. In another case of such a fingerprint, researchers went ahead and used the publicly available data set that Netflix released for a data science competition more than 10 years ago. Looking at this set of data, which again was seemingly anonymous, they were able, just by looking at the movie ratings of individuals, to link this with data from IMDb, a publicly available database as well. Just by looking at four to five movie ratings of a person, they were able to link entries from the Netflix data to a public profile of a person, and again were able to identify over 80%
of the individuals by just looking at the sequence of movie ratings. And again, this gives you a similar idea of how data that is seemingly protected really isn't. Now, we've talked about data protection and privacy, and obviously there is also a different aspect, which is compliance. If you, for example, look at the example of Strava, a running app that released data on geolocation traces of its users, what was interesting to see is that while there were very beautiful pictures of running tracks available in this data, it also accidentally revealed very
sensitive information on things that ideally were not supposed to be revealed. For example, here, in an area in Africa, running tracks in a very particular pattern in a small area actually revealed the presence of a secret military base that was not supposed to be publicly leaked by such a data set. So next to data protection as a privacy concern, we also have the issue that data itself, while not being directly linked to a certain person, might reveal in a certain way very sensitive information that should be kept confidential. Now, what do we take
away? What we see here is that, first of all, pseudonymization, meaning the scrubbing or deletion of information such as an ID, an email, or another direct identifier, is not enough to protect data. On the other side, even if you go ahead and start aggregating or changing data in some parts and leave the other parts of the data set untouched, this untouched data can still be linked back to real individuals, as it contains data that itself serves as a fingerprint. It is important, therefore, to not assume that you know what information in a
set of data is relevant and needs to be protected, and what information in a data set is not sensitive and therefore does not need to be protected, as it only takes a specific amount of information points that themselves serve as a fingerprint and can therefore be linked back to an actual individual or sensitive entity. Now, this also means that if we take anonymization very seriously, we technically would have to go and just scratch out all the information in a set of data, because somehow everything can be linked to a real person or entity that needs to be protected. But this tends to also
coincide with the loss of a lot of data utility and granularity, and therefore we have this problem that we continuously deal with a trade-off between privacy on one side and data utility on the other side, something that ideally should not be the case. And so the question comes around: is it even possible to protect data properly while still preserving its original statistical relevance and value, so that it can be processed further at the granularity of the original data sets?
So, a methodology that we focus on quite significantly at Statice is the idea of using so-called deep generative models and the concept of differential privacy in combination, to train models in a privacy-preserving manner, picking up statistically relevant information from an original source of data and using this to generate a new set of data, which we consider, again, synthetic data, and which itself has no one-to-one relationship to the original data but still preserves the overall structure and statistical properties. Now, differential privacy,
to me, is a very exciting mechanism. I'm not going to go into too much detail, but the general idea is that it is basically a mathematical definition that, if fulfilled, shows us that a query to a certain data set, as well as to another copy of this data set minus one entry, results in roughly the same probability of the same result, which doesn't allow us to infer information about an individual being or not being in either data set.
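Stated a little more formally, a randomized mechanism M is called ε-differentially private if, for any two data sets D and D' that differ in one entry and any set of possible outputs S, Pr[M(D) in S] <= e^ε * Pr[M(D') in S]. One textbook way to satisfy this for a counting query is the Laplace mechanism, sketched below. This is an illustration only and is not presented as the method Statice uses; the function names and the example data are assumptions made for this sketch.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Answer "how many records satisfy predicate?" with epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1 / epsilon is sufficient.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [34, 51, 29, 62, 45]   # hypothetical data set
    neighbour = ages[:-1]         # the same data set minus one entry
    print(private_count(ages, lambda a: a >= 40, 0.5, rng))
    print(private_count(neighbour, lambda a: a >= 40, 0.5, rng))
```

A smaller ε means more noise and a stronger guarantee; the noisy answers on the two neighbouring data sets become harder to tell apart, which is the "roughly the same probability of the same result" property described above.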
This, first of all, makes sure that we don't have to make too many assumptions about an attack scenario, it provides a formal mathematical definition of anonymity, and it also allows us to not have to worry about, how do you say, computational complexity or computation-heavy attacks, due to a very robust privacy guarantee. So what does this look like? While data anonymization methodologies that are more traditional tend to change a set of data, what we do here is, again, have a generative model that is trained on the original data, which we see on top, and picks up statistical properties and
uses this to generate a new set of data, which we are looking at at the bottom. Now, these are only five entries out of a larger data set, but we get to see, which is quite exciting, that if we compare the original data on top and the synthetic data on the bottom, trends are picked up almost perfectly, allowing us to run several different statistical tests on both data sets that result in almost the same output. Now, this allows us to, how do you say, use or provide a method of anonymization that, again, combines both concepts, privacy and utility, at the same time.
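A toy version of this train-then-generate workflow can be sketched as follows. It is a deliberately simplified stand-in: the deep generative model is replaced by a bivariate Gaussian fitted to two numeric columns, and no differential privacy is applied, so this sketch carries no privacy guarantee. The column meanings (age, income), the data, and all function names are assumptions made for illustration.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def std(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def corr(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (std(xs) * std(ys))


def fit(rows):
    """'Train' the model: keep only means, spreads, and the correlation."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    return {"mx": mean(xs), "my": mean(ys),
            "sx": std(xs), "sy": std(ys), "rho": corr(xs, ys)}


def generate(model, n, rng):
    """Sample n new rows from the fitted bivariate Gaussian."""
    rho = model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


def make_original(n, rng):
    """Hypothetical 'original' data: (age, income) with a linear trend."""
    rows = []
    for _ in range(n):
        age = rng.gauss(45, 12)
        income = 1000 + 50 * age + rng.gauss(0, 300)
        rows.append((age, income))
    return rows


if __name__ == "__main__":
    rng = random.Random(42)
    original = make_original(2000, rng)
    synthetic = generate(fit(original), 2000, rng)
    for name, rows in (("original", original), ("synthetic", synthetic)):
        xs, ys = [r[0] for r in rows], [r[1] for r in rows]
        print(name, round(mean(xs), 1), round(mean(ys), 1), round(corr(xs, ys), 3))
```

The generated rows are new samples rather than copies of original rows, yet summary statistics such as the means and the correlation come out close to those of the original data, which is the utility side of the comparison described above.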
I'm going to skip these parts. Now, what is exciting are actually the use cases that are possible with such privacy-preserving synthetic data. Where beforehand data was generally not applicable and not available, and where people working in corporations, across corporate entities, or even between partners had to go through complex compliance and governance processes, they are now able to use data that mimics the original data in an almost perfect way, without having to go through these kinds of processes, basically without having to wait
for access, and use it for quite a variety of different applications, such as statistical evaluations and BI analyses, but also the training of machine learning algorithms, where previously nobody could do so. And the ability to do this, of course, results in quicker and improved development efficiency in the use of data, up to opening up new revenue streams for companies, but of course also in ensuring compliance along the whole data value chain, wherever sensitive data is used for further analysis, which ideally, in our
perspective, serves as a very good and very viable alternative to traditional approaches to anonymization, because it actually combines the ability to protect data on one side with ensuring the usability of statistically relevant information on the other side. Thank you very much.