About the talk
Starting from setting a product goal and going all the way to putting results in front of live customers, we talk about Cherre’s efforts to build a knowledge graph using Commercial Real Estate (CRE) data. Knowledge graphs are a data resource that can answer questions beyond the scope of traditional data analytics. By organizing and storing data to emphasize the relationship between entities, we can discover the complex connections between multiple sources of information. The CRE knowledge graph is a temporal graph (edges contain current and historical connections that can span decades) that connects entities such as properties, addresses, people, and companies to each other. Building such a graph is complicated by the inherently “messy” nature of CRE data, which we’ve collected from various data sources and requires NLP to standardize.
Our CRE graph is built to be the foundation of our ML efforts, allowing us to solve complex problems that leverage the connected nature of the data. This talk discusses the combination of multiple disciplines (NLP, data mining, graph analytics, deep learning) and domain knowledge that we use to build and extract value from the graph. Since this is done at the scale of big data (hundreds of millions of nodes, billions of edges), creating smart metrics to measure success is also discussed.
John Maiden is Head of Machine Learning at Cherre, developing ML solutions that leverage real estate data to provide insights across multiple industries. He and his team focus on building Cherre’s knowledge graph as a foundation for expanding the company’s product offerings. Prior to Cherre, John worked at JP Morgan Chase, where he led a team that produced personalized insights that were delivered directly to millions of Chase customers. He has a BA from Hamilton College and a PhD in Physics from University of Wisconsin – Madison.
Let me introduce myself. I'm Mary, director of research infrastructure at 23andMe, and I've been on the review committee for a few years now.

I'll give a quick introduction while we're waiting. Hi everybody, I'm John Maiden. I am the head of machine learning at Cherre, a data tech company in the commercial real estate space. Prior to working at Cherre, I was at JPMorgan Chase on the consumer side for about four years, trying to deliver insights to millions of Chase customers. A long, long time ago I was also in physics, but that's been a long time.

That sounds great. Okay, I think we're good to get started. It's great that you already introduced yourself; you made my job easy, so I don't have to do that. The floor is yours. I'll duck out and then come back for the next session.

So, today I'm going to talk about lessons learned from building a real-world knowledge graph, all the exciting work we've been doing at Cherre. As I said before, my name is John Maiden. I work at Cherre, which is a commercial real estate data company; I'll have a slide for the company in a minute. Prior to this I was at JPMorgan Chase, trying to find lots of insights into customer behavior. Here is my contact information. Feel free to reach out and ask questions; I also have it at the end.

So, today we're going to talk about the knowledge graph that we built using commercial real estate data at Cherre. The interesting thing about it is that it's a large graph, and it's constructed from a lot of messy data. The intent was to build all of this to help drive insights and drive where we wanted to go with the products. A lot of different talks have been given about knowledge graphs, and industry tends to focus on graphs built for queries, which might have very complex ontologies. We built this with the intent of building models off of it, so the structure of the graph is going to be a little simpler than you might see in some of the other talks. The idea is that we wanted something that could be reused across multiple product directions and could be used to build different models. That's how we focused the work, and by doing that we ended up using a lot of different ML techniques, first to build the graph and then to mine information from it.

I'm going to start off with the business use case that drove our work in the first place. From there, finding the right data: you can't have a knowledge graph without really good data, so I'll cover how we explored what we had and what we needed to find. Then graph construction, all of the nitty-gritty of everything that needed to be built to get it there, so a little bit of machine learning and a little engineering. And then, finally, measuring success, which is always important when you have a business use case.

So, a quick introduction to Cherre. Cherre is a data connector in the commercial real estate space. What we do is connect foundation data, which is usually public data sources; paid third-party data, such as demographic data and the like; and our customers' own data. We connect it in a series of pipelines that then deliver the data through APIs, which customers can use to analyze all of the different data feeds in one place.

While we're doing introductions, I should also give a quick overview of real estate data and explain why it's very messy. Real estate data as a whole is collected in the US at the county level, and there are about three thousand counties in the US. The regulations, conditions and restrictions are set at the state level, which is why the datasets vary so widely. Texas is a good example of a state without a lot of reporting requirements. Different states have different reporting requirements, and they might have different codes. If you want all of the national real estate data, you actually have to go county by county. The way real estate typically works as an industry is that there are a couple of big providers whose business is to go out on a regular basis, collect all of the 3,000 county-level datasets, and combine them into a national dataset that you can use. Cherre doesn't do that; that's not our part of the business. We use these national providers and then add insight on top of that.

In commercial real estate, the types of data provided at the public level start with the tax owner, so whoever is currently paying taxes on a tax lot. A tax lot or tax entity could be an apartment building, it could be a unit in the building itself, especially with a co-op, or it could be an empty lot. Whatever the tax object is, that's what people are looking at, and it's collected by the assessor's office. The tax owner isn't necessarily the owner of the property. An accountant who files your taxes could be the tax owner on record even though you own the property, because that's where the tax bill should go. On top of that there are the deed transactions, which come from the recorder's office. Deed transactions could be buying and selling activity, a transfer of mortgage, or a transfer of title.

There's also the difference between residential real estate and commercial real estate. Residential is what most people are familiar with: single-family homes, condos, apartment units. It's a B2C-type business with large volumes of data points. Commercial is multifamily, so you're looking at four or more units in the building, along with apartment buildings, office, retail and industrial. The business model is more B2B. Think of the difference between buying a $100,000 apartment and paying a million dollars for the building. They are different perspectives and different sets of data.

To build this graph, we have to look at our business use case. Why are we building a graph? What do we need to get out of it once it's constructed? The initial use case that we used to drive this was finding the true ownership of a property. In commercial real estate, owners commonly hold buildings using an LLC on a per-property basis. Sometimes this is for tax reasons, sometimes it's the business structure, and sometimes they just don't want to be found. So the same company might own 123 Main Street and 124 Main Street, but the owners of the two buildings on the tax assessor record would be 123 Main Street LLC and 124 Main Street LLC.
So here are two distinct properties in the US. One of them is in Oceanside, and the listed tax assessor owner is Pulte Homes New York LLC. The other one, in North Carolina, is listed under a special purpose vehicle LLC. If you know the real estate industry, you know Pulte Homes is a developer, so obviously the actual owner is Pulte Homes and you can tie that back to the owner. For the special purpose vehicle, the actual owner is Amherst. The way I found this is that I just looked at the mailing address in the public data.
The public record lists not only the tax owner but also the mailing address. So you take that, you dump it into Google, and Google says: okay, that's the Amherst mailing address. So you might say, what I'm going to do is take all of the data I have, take all of those addresses, dump them into Google and see what I get. It might take a little while, but I'll get all of the owners of the properties. Well, that doesn't work very well. The reason it doesn't work very well is that there are certain owners who just don't want to be found. In Williamsburg there's a house that is very popular with a certain part of the community where they just don't want to be found as owners. So you have a whole bunch of anonymous LLCs that have the exact same address. They are all separate owners, all separate businesses; they are just making sure that you can't find out who they are, and so they all use the same property in Williamsburg.

So if we actually accomplish this, if we're able to clean the data well enough to find the true owner, who's going to use that? The motivation is the ultimate user of the data, and that's going to be a couple of different players. If we can get contact information for the true owner, that's going to be really useful for brokers and developers. They find a property online, or they find a property physically, and they really like it, but it's not listed and the owner is an anonymous LLC. If we can give them the true owner plus phone number and email, they're very happy: they can call and try to make a deal. If I can say this property is owned by Tishman Speyer, I can flip it and say: given the owner, tell me all of the other properties that Tishman Speyer owns. So I can make this a portfolio-level analysis. Given that this is a temporal graph, because we've got decades of real estate data, we can slice and dice and ask what someone's history is over time: what have they been buying and selling? There's other support network information that gets in here too. You get who the lenders are, usually from the mortgage information. Maybe they're using a certain property manager, which you might get from the data. So if you want to break into that market or go into that neighborhood, here are the lenders that have a history there, and here is a property manager that has a history there.

The other really useful part is that you can use this to start building property valuations. In residential real estate, if I want to value a house, I ask: what are three or four other houses close to my house that have the same number of bedrooms and bathrooms and that sold in the past six months? The average of those is the value of my house. That works in residential because there are a lot of properties and it's easy to find something nearby that has sold. Commercial real estate is a lot harder because most buildings are bespoke, and you're not going to find an apartment building or an industrial center that looks exactly like yours, that's close enough, and that sold recently enough. So you have to think about more datasets and more ways to give insight into the data to then build out valuation models.

So we know what we want to do: we want to be able to find the true owners. Now we need to find the data. On the data side, this is a nice generic commercial real estate picture of people looking out into the distance. They see all the opportunities, and they know they want it all. Okay, let's attach real data to this. On the real data side, we have transaction data, so we know a building was sold on a certain date for a certain amount. We might know who the mortgage lender is, if the mortgage was attached to the transaction. Most of this is tax-related, so you know what the assessed taxes are, and maybe you care about abatements and the like. Certain places might also provide a market value, so you'll get an estimated market value and the assessed taxes on a property. Then there is permit data: if you're filing a permit, you have to list a contact person, so the person connected to it might be useful. Depending on the permit information, that might be someone connected to the owner, or it might be a general contractor. And then you also get a whole bunch of different government-listed properties. So that's what we've got to start with.

Now, the general requirements for the data we want. First, who is the taxpayer behind a property? We get that no matter what. We get the transaction history as well; that's part of the public data that we want. Permit data requires a little more looking, but if we get it, it's useful because it has contact information and people who are definitely connected to owners. Those are things we can put into the graph no matter what. Public data is also really important. The top developers all have listings of their properties. If they have big properties and are very proud of them, they're going to list them on their web page. If our customers can find them there, we have to have the same answer. So we need to know the properties that Tishman, Vornado, CBRE and all the big names have, and we've got to make sure we incorporate that into the graph. Landmark information is also really useful. Anyone can look up who owns the Empire State Building on Wikipedia, so we have to have the same answer as well. No matter what, the obvious answers also have to be incorporated into the graph.

To be able to make all of these connections, we need to know about corporations and their registrations. The corporate structure is also very important. If I see a subsidiary, can I tie it back to the parent? If I see a random mailing address, can I place it? Not everything is going to use the corporate headquarters, so can I get enough mailing addresses to tie it back to different parts of the same company? Contact information is also really relevant for our customers, so we also want to capture that separately. And then there are lots of anonymous LLCs, particularly with everything registered in Delaware, so can we get as much LLC registration information as we can and join that back into the overall graph? So we've got our goal, and we've got our data.
For anyone who is interested in looking at real estate data themselves, I recommend looking at New York City data sources. New York City has a great open data initiative. Most of the great data is provided by the Department of Finance. Tax data is in the assessment roll, and the transaction data is in ACRIS. There's lots of history and lots of useful data. The Department of Buildings also has some great insights on New York City buildings as a whole. So if you're interested in giving it a try and seeing what real estate data is available, New York City has a lot of great source data, and they've also put a lot of effort into creating primary keys that allow you to connect datasets without too much effort. I should also add that we originally did this as a prototype on New York City data, and then eventually we redid it all using national data.

So we have our data and we have a goal. Let's talk about the actual graph construction. As you know, in building a knowledge graph you want to focus on what is considered to be the most important parts of your data. Nodes are the highlight of a graph, and in our case a property is going to be a big focus; in this case it's really a tax lot. Addresses that we capture from the data are also important. When I say address, it's going to be a mailing address, because that's going to be associated with the owner of a property. Property addresses might be relevant, but generally, especially on the commercial side, property addresses and mailing addresses are going to be two different things. Then, from the names we collect, we want to distinguish between people names and what we call non-people names, which are generally corporations. This plays into how we mine the graph and what we look at. We assign different relevance and weights depending on whether the connection is a person or not, and that has an important effect on how we decide ownership of the property itself. For a corporation, we care whether it's registered versus unknown, so we need that database of registered company information. For our use cases, customers care about things like whether a property is owned by a government entity. So we need to provide additional filtering and tagging based on the type of company that owns it.

Then there are the edges, which carry the source. Different sources are going to have different weights and different values. If we're looking at some type of transaction, I want to know which side of the transaction you were on, so I put that in the edge information, along with recency and frequency. So that's the basics of the graph. Like I said, this is not for general querying; it was designed for modeling, so it doesn't have a very deep or complex ontological structure.

From an engineering perspective, we take all of the different data sources that we're pulling in, many of which I've already mentioned. We clean and sanitize names, and we clean and sanitize addresses; I'll talk about why that's important. Then we put it into the knowledge graph as a series of edges. For the purposes of what we're doing, it really is an undirected graph. It doesn't really matter whether a property connects to an address or an address connects to a property. But the technology we're using, specifically Spark GraphFrames, really expects a directed graph. So as we built it out, we made sure that our edge data was consistent and that our graph was directed the right way. But otherwise, yes: a large graph requires big data tools.

So what did we do with this? We take the data, we build some edges, and we try to mine it for owners. There are a couple of different motifs, and we're not looking for extremely complicated ones. We had about half a dozen motifs in total, and they were of the form: property connects to this, and property connects to that. That's our first one. We did this as a waterfall approach. If I see this pattern first, I know it's the most relevant pattern I care about. If a property matches, good: that property is put aside. What this let us do was reduce the size of the graph over time. The graph actually shrank as we iteratively went through each one of the patterns, making it smaller and smaller. Basically, the first stages, the ones we had the most confidence in, took the largest chunk of the data, and then we would reduce it more and more until we got to the connections that were more and more tenuous. Pulling out as much data as we could made the graph faster in the later stages, and it also made sure that data points that might have been less relevant were removed from the graph, so it really kept the stuff we cared about. So, like I said, we looked at specific motifs, then we looked at the data sources, and we ranked them based on what we considered to be relevant. The graph itself is built at scale.
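The waterfall over motifs described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than Spark GraphFrames, and everything in it is an assumption made for the example: the edge tuples, the `KNOWN_HQ` lookup, the two motif functions and their priority order are hypothetical and are not the motifs used in the talk.

```python
from collections import defaultdict

# Hypothetical edges: (property_id, edge_source, connected_entity).
EDGES = [
    ("p1", "tax_owner", "123 MAIN ST LLC"),
    ("p1", "mailing_address", "1 BIG CO PLAZA"),
    ("p2", "tax_owner", "JANE DOE"),
    ("p3", "permit_contact", "BOB BUILDER"),
]

# Assumed lookup from a mailing address to a known company headquarters.
KNOWN_HQ = {"1 BIG CO PLAZA": "BIG CO"}


def motif_known_hq(edges):
    """Property -> mailing address that is a known corporate headquarters."""
    for source, entity in edges:
        if source == "mailing_address" and entity in KNOWN_HQ:
            return KNOWN_HQ[entity]
    return None


def motif_tax_owner(edges):
    """Property -> tax owner, used only when no stronger motif matched."""
    for source, entity in edges:
        if source == "tax_owner":
            return entity
    return None


# Motifs in priority order: most trusted pattern first.
WATERFALL = [("known_hq", motif_known_hq), ("tax_owner", motif_tax_owner)]


def assign_owners(edge_list, waterfall=WATERFALL):
    """Run motifs in order; matched properties leave the working set."""
    remaining = defaultdict(list)
    for prop, source, entity in edge_list:
        remaining[prop].append((source, entity))

    owners = {}
    for motif_name, motif in waterfall:
        matched = {}
        for prop, edges in remaining.items():
            owner = motif(edges)
            if owner is not None:
                matched[prop] = (owner, motif_name)
        owners.update(matched)
        # Shrink the working set so later, weaker motifs see less data.
        for prop in matched:
            del remaining[prop]
    return owners, dict(remaining)


owners, unresolved = assign_owners(EDGES)
```

Recording which motif produced each owner keeps the confidence ordering visible downstream, and the shrinking `remaining` set mirrors the point that each stage leaves a smaller graph for the next one.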
There are a couple of hundred million nodes in the graph and about a billion and a half edges. So there are engineering challenges, which goes back to having to run all of this with big data tools and Spark. We had to make sure that, even though the graph is large, it could be compressed further, or that we could add more data sources without making it even larger, keeping it at a size that could be run in a couple of hours using Spark. On top of that, there's also the analysis challenge. Because it is so large, we're not going to be able to immediately point and say we know we got all of this right. There is no way to check all of these points and determine exactly whether we got the right answer. So analysis becomes very important, and this ties into the metrics that we're going to talk about at the end.

On the machine learning side, building the graph raises a whole bunch of interesting challenges, and they are primarily NLP-based. On the address side, addresses can be very noisy. Sometimes there are just different ways to write the same thing. Our company used to be on 6th Avenue, so Sixth Avenue, 6th Avenue and Avenue of the Americas are all the same thing. You need to be able to take those, standardize them, and come up with one canonical form of the address. In addition to that, you might also have typos. Most of the address might be right, but the state is wrong or the ZIP is wrong; maybe someone fat-fingered it a little bit. So on that side, what we had to do was build a part-of-speech tagger for address components. There are a couple of different packages out there.

We built something from scratch, but there's a lot of analysis out there on how to take addresses, which are short, semi-structured text, and tag them for street number, street name and everything else. For those who are interested, there have been papers and projects around conditional random fields, I've seen hidden Markov models, and I know someone out there put a deep learning model together for tagging addresses as well. So there are a couple of different ways, but it is short, semi-structured text. All you've got is the address, that's it. You want to make sure you can parse the address quickly, identify the tags, and then join it against a large database of clean addresses. So you need a good set of clean address data to begin with, and then you can compare against it and find a match. It becomes a bit more of an engineering challenge, but it allows you to clean up the address as much as possible and come up with one form that becomes the node you put in the graph. That's at least a little more straightforward, because with addresses you have a gold standard of data to work with.
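As a rough illustration of the parse, tag, canonicalize and join steps just described, here is a small rule-based sketch. It is not the tagger from the talk, which was a purpose-built model; the alias table, the suffix table, the street number 99 and the `CLEAN_ADDRESSES` reference set are all assumptions invented for the example.

```python
import re

# Assumed alias table mapping street-name variants to one canonical form.
STREET_ALIASES = {
    "AVENUE OF THE AMERICAS": "6TH AVENUE",
    "SIXTH AVENUE": "6TH AVENUE",
}

# Assumed suffix expansions. Deliberately naive: "ST" can also mean "SAINT".
SUFFIXES = {"AVE": "AVENUE", "ST": "STREET", "RD": "ROAD"}

# Stand-in for a large reference database of clean addresses.
CLEAN_ADDRESSES = {"99 6TH AVENUE"}


def tag_address(raw):
    """Tag a raw address string with a street number and a street name."""
    text = re.sub(r"[^\w\s]", " ", raw.upper())  # drop punctuation
    tokens = text.split()
    number = tokens[0] if tokens and tokens[0].isdigit() else ""
    rest = tokens[1:] if number else tokens
    name = " ".join(SUFFIXES.get(token, token) for token in rest)
    name = STREET_ALIASES.get(name, name)
    return {"street_number": number, "street_name": name}


def canonical(raw):
    """Collapse the tagged components into one canonical string."""
    tags = tag_address(raw)
    return f"{tags['street_number']} {tags['street_name']}".strip()


def match(raw):
    """Join against the clean reference set; None means no match."""
    key = canonical(raw)
    return key if key in CLEAN_ADDRESSES else None


variants = ["99 Sixth Ave.", "99 6th Avenue", "99 Avenue of the Americas"]
canonical_forms = {canonical(v) for v in variants}  # a single canonical form
```

A learned tagger, such as the conditional random field approaches mentioned above, would replace `tag_address`; the canonicalize-then-join shape would stay the same.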
On the name side, it becomes a bit trickier, because there are a lot more of them. We had hundreds of millions of names. Names could be people, they could be companies, they could be trusts, they could be a whole bunch of things. So first of all, we wanted to type them. The first split was people versus non-people: John King has to be a person, Burger King has to be a company. We started off with some obvious keywords and then iterated on our approach. LLC, Corporation, Trust: all of these were words that we could build into tagging, and then you can iteratively find other similar words and build upon that. What we found, with hundreds of millions of names, is that certain names at the tail ends were very obvious. If you see LLC in a name, it's obviously a corporation, and you can put it at one end of the spectrum. If you see a name like Stephen, Karen or David, it's potentially going to be a person; that would be the other end of the spectrum. And then you have a whole bunch of names that fall in the middle and are kind of ambiguous. So the way you start off is just building a dataset that is large enough to cover everything, figuring out where you have the really strong signals at the ends, doing a simple lift score based on the names you've collected, and clipping the tail ends. What we eventually had to move to, for all the stuff in the middle, was a model. We trained a model off of the labels that we created, with the assumption that the labels were mostly good.
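The "simple lift score" on name tokens can be sketched as follows. Lift here means how much more often a token appears in non-person names than the base rate of non-person names would predict. The seed names, labels and threshold are hypothetical, and this is one common definition of lift, not necessarily the exact score used in the talk.

```python
from collections import Counter

# Hypothetical seed labels, e.g. produced by obvious keyword rules.
SEED = [
    ("MAIN STREET HOLDINGS LLC", "non_person"),
    ("OAK HOLDINGS LLC", "non_person"),
    ("JOHN KING", "person"),
    ("KAREN SMITH", "person"),
]


def token_lift(labeled_names, target="non_person"):
    """Return P(target | token) / P(target) for every token seen."""
    total = len(labeled_names)
    target_count = sum(1 for _, label in labeled_names if label == target)
    if total == 0 or target_count == 0:
        raise ValueError("need at least one example of the target class")
    base_rate = target_count / total

    seen, hits = Counter(), Counter()
    for name, label in labeled_names:
        for token in set(name.upper().split()):
            seen[token] += 1
            if label == target:
                hits[token] += 1
    return {token: (hits[token] / seen[token]) / base_rate for token in seen}


def strong_tokens(lift, threshold):
    """Clip the tail: keep tokens whose lift is at or above the threshold."""
    return {token for token, score in lift.items() if score >= threshold}


lift = token_lift(SEED)
candidates = strong_tokens(lift, threshold=2.0)
```

Tokens that clear the threshold, such as HOLDINGS in this toy data, become candidates for the keyword list on the next iteration, while names with no strong token on either end are the ambiguous middle that the model is meant to handle.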
Addresses are short, semi-structured text; names are not structured at all, and people names and non-people names have different formats and challenges. We used a simple, out-of-the-box text embedding, a character-level embedding for all of the strings. We fed that into a model along with a couple of other features that we had, and predicted at scale over a couple hundred million names: people versus non-people.

Once we had the predictions on the types, we had to go ahead and do further cleaning. The example I have here is Cravath, Swaine & Moore, which is actually a law firm. You might have different forms of the same name, and you want to be able to standardize those. You might have John Maiden LLC and John Maiden LLP; those are probably going to be the same company. You want to cut down on the noise, because if one version of a company name points to one address and another version of the name points to a different address, you want to consolidate those two names. That way you know this is really the same underlying company and it points to two different addresses, which gives you insight into further connections. Otherwise the graph is going to be too noisy and you're not going to have the insights and connections that you need. For some of this, regex patterns work really well. For people names there isn't much to do, because there is so much variation in people names. Company names are at least better behaved. Then there are typo corrections and a graph-based feedback loop; I'll quickly talk about those.

On the typos: there were lots of typos in the data. Remember that this is all collected at the county level, which means people are just typing it in by hand. There's no consistent pattern at all. So we quickly whipped up a typo correction. We had millions of names, and we extracted the most common words. We then did some hashing to avoid an n-squared comparison, because we had hundreds of thousands of words that we wanted to look at. So we used LSH to get similar root words into the right bucket. We were just looking for simple typos, things that are at most one character away, because we actually wanted to go in and manually verify that the typos were typos. So a simple typo correction would be: hashing to reduce the space, a Levenshtein distance of 1 to make sure you get something close enough, and on top of that a dictionary to make sure that, say, hotels and motels didn't get mixed together. So that's the typo correction, which is an iterative process that you run.

The other way we did it was using the graph to clean the graph: taking the data, cleaning it up as well as we could, and then using the connections that we had to infer additional connections and do entity resolution. If two different John Maidens connect to the same property, we could say that John Maiden and John B. Maiden might still be two different people, so let's keep them as distinct entities. But if we have two differently named corporations that both point to the same address, we can use that: name similarity, business rules, take your pick. All of those say this is really the same entity, and we need to collapse them into one single node. So it becomes a feedback loop, where you take the data, clean it as well as you can, use it to build the graph, and then use the graph to go back and come up with a new list of consolidations and collapse the graph even further. So that's the graph side.

The last piece of the puzzle is measuring success. We have a hundred and fifty million properties in total. So how do we actually determine a way to measure success, given that there's no way we can verify all of these results directly? This is where domain knowledge really comes into it. We care about certain asset types and certain owners. For the most part, a lot of properties in the database across the US are going to be single-family homes. Those are owned by people, and we expect that our patterns and motifs are going to capture those neatly. If we capture those pretty well, that large chunk of the data up front is going to be good enough. What we care about is the last 10, 15, 20 percent, which is the commercial real estate, and that's where we need to put our time and effort into analysis, making sure we get it right.
That is what our customers will be looking for, so we're going to focus on a niche. We care more about multifamily compared to retail or office. We have to narrow it down to the asset types we care about, because we have to go in and manually check at least some of the samples. On the owner side, the big names everybody knows about, so customers are not that interested in those. It's the middle space in real estate that they care about: owners who have more than one building and fewer than 500. Can we focus on those and get those right? Additionally, we haven't had any luck finding gold-standard datasets. We have little bits: one set might cover one geographic region for one asset type, another will cover a different region for a different asset type, and we try to use those too. But no matter what, prioritization and continuous improvement are important, because we constantly have to improve. We're getting new datasets, the data constantly updates, and we find other places we need to improve. Maybe we want better contact information, or maybe we find that in a certain region the data isn't sufficient. So it's about being able to put in place ways to constantly measure success. That's the talk. Are there any questions I might answer?

Sounds great. Thank you very much, John. Well, there's a question from Dan.

A billion, a billion and a half edges so far in the setup. So it's not a huge graph, but it's definitely large, and it's not something that would fit on a single computer.

Okay, thank you. So how do you deal with getting new data, with changes in structure and ontology and other changes? Or is it stable?

So, in terms of ontology, the general structure, like I said, we kept pretty simple, because we want to use it for a couple of different approaches. We only have a limited number of datasets and a limited number of generic types of nodes. We are constantly updating the data: we get new tax data continually, and we get new transaction records. So it's more about the infrastructure, making sure it can update and run on a regular basis, than the actual ontology, because that is pretty simple, like I said.