Industry disruptor Zillow leverages data about residential real estate and makes it available to the general public. The company's senior director of data science and engineering shares the secrets behind Zillow's data stack.
Residential real estate site Zillow stormed onto the market in the 2000s, letting consumers check on the property value of their own homes and those of all their friends, family members, and acquaintances, too, much to the dismay of real estate professionals.
Founded by a couple of former Microsoft executives who went on to start travel site Expedia and then Zillow, this site threatened to disrupt the real estate market when it debuted in 2006. It gave people access to information that had previously only been available through real estate pros.
Ten years later, Zillow has proven it has staying power. Built on the idea of ingesting, processing, and serving data from multiple sources to consumers, the company has made a name for itself with its "Zestimate" -- its secret data-driven formula for predicting the value of a piece of real estate. But none of this happens without a sophisticated IT department and data operation behind the scenes.
Jasjeet Thind, senior director of data science and engineering at Zillow, said that Zestimate is one of the ways Zillow uses machine learning. This real estate value estimate was the first available home valuation model, and it's composed of hundreds of models behind the scenes -- linear models, decision trees, deep learning, and more -- to predict values for every single home in the country.
Thind gave IT and data professionals an inside view of what is under the hood at Zillow during a presentation at September's Strata + Hadoop event in New York.
"Zillow Group's mission is to build the largest, most trusted, and vibrant home-related marketplace," he said during the session. Zillow Group refers to the company that Zillow has grown into in the decade since its launch. Now a publicly held company, Zillow owns several brands, including Trulia, HotPads, StreetEasy, Naked Apartments, Mortech, dotloop, and Retsly.
Thind said that Zillow operates a data lake composed of data from all those brands. It also gets data from counties, the MLS, real estate brokers, and directly from users via the "Claim Your Home" feature. Thind said that Zillow's ability to get updated information directly from homeowners is one of its key competitive edges.
Data obtained from government records can be tricky and not very glamorous to ingest. Some of this property data is in JPG form, while other data is typed text. Thind said that Zillow leverages OCR technology in its ingestion process to help optimize costs. Because the data can be input faster, the system also improves user experience.
Ensuring data quality is a big topic at Zillow, Thind said. Public records data comes in many different formats, and the company employs a data analyst whose full-time job is to ensure data quality. Zillow uses trend detection to look for anomalies in the number of sales transactions.
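Thind did not describe how the trend detection works, so the following is only a minimal sketch of one common approach: comparing each period's transaction count with the median of a trailing window and flagging large deviations. The function name, the window size, the threshold, and the sample counts are all assumptions made for illustration, not details from Zillow.

```python
from statistics import median


def flag_anomalous_counts(counts, window=6, threshold=3.5):
    """Return indices whose count deviates sharply from the trailing window.

    Uses a modified z-score based on the median absolute deviation (MAD).
    The window and threshold values are illustrative assumptions.
    """
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        center = median(history)
        # MAD is a spread estimate that is not distorted by a single outlier.
        mad = median(abs(c - center) for c in history)
        if mad == 0:
            # Perfectly flat history: any change at all is worth a look.
            if counts[i] != center:
                flagged.append(i)
            continue
        # 0.6745 rescales the MAD so the score is comparable to a z-score.
        score = 0.6745 * abs(counts[i] - center) / mad
        if score > threshold:
            flagged.append(i)
    return flagged


# Made-up monthly sales counts for one county; month 7 looks like a bad feed.
monthly_sales = [120, 131, 125, 118, 129, 122, 127, 12, 124, 126]
print(flag_anomalous_counts(monthly_sales))
```

A median-based score is used here instead of a mean and standard deviation so that one bad month does not hide the next one; other choices, such as seasonal baselines, would be equally plausible.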
There are also checks at the data field level, looking for listings that have, for example, 30,000 bedrooms. Zillow also flags certain types of transactions such as foreclosures, because these deals are not used in the Zestimate calculations.
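As a rough illustration of what field-level checks and transaction flags could look like, here is a small sketch. The `Listing` record, the plausibility bounds, and the set of excluded sale types are hypothetical; the article names only the 30,000-bedroom example and foreclosures.

```python
from dataclasses import dataclass

# Hypothetical plausibility bounds, chosen only for illustration.
FIELD_BOUNDS = {
    "bedrooms": (0, 50),
    "bathrooms": (0, 50),
    "square_feet": (50, 100_000),
}

# Hypothetical set of transaction types kept out of valuation.
EXCLUDED_SALE_TYPES = {"foreclosure"}


@dataclass
class Listing:
    bedrooms: int
    bathrooms: float
    square_feet: int
    sale_type: str = "standard"


def field_errors(listing):
    """Return a message for every field that falls outside its bounds."""
    errors = []
    for name, (low, high) in FIELD_BOUNDS.items():
        value = getattr(listing, name)
        if not low <= value <= high:
            errors.append(f"{name}={value} outside [{low}, {high}]")
    return errors


def usable_for_valuation(listing):
    """A record is usable only if it is plausible and not an excluded sale."""
    return not field_errors(listing) and listing.sale_type not in EXCLUDED_SALE_TYPES


suspicious = Listing(bedrooms=30_000, bathrooms=2, square_feet=1_800)
print(field_errors(suspicious))
print(usable_for_valuation(Listing(3, 2, 1_800, sale_type="foreclosure")))
```

Keeping the plausibility check separate from the sale-type flag means a foreclosure can still pass validation and be stored, while being left out of the valuation step.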
Zillow's technology platform includes Apache Spark. The company also uses Redis and Python for real-time scoring. Zillow taps AWS S3 for cloud storage and relies on AWS Redshift and Presto for its data warehouse. Thind said Zillow specifically turns to Presto when looking at historical data.
Beyond the Zestimate, Zillow provides other numbers to its audience, such as a Turbo Zestimate and a "hot homes" designation (which predicts how fast a home will sell). Many of these figures are based on Zillow's Zestimate calculation.
Zillow has also invested in predicting the preferences of its consumer users through personalization and search. Thind said Zillow uses different kinds of user vectors depending upon how sparse the signals are for a particular user.
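The article does not say what these user vectors contain or where the sparsity cutoff lies. One way to picture the idea, purely as an assumption, is to build a preference vector from a user's own activity when there is enough of it and to fall back to a broader segment's vector when there is not. Every name and number below is hypothetical.

```python
from collections import Counter

MIN_EVENTS = 5  # hypothetical cutoff below which a user's signals count as sparse


def preference_vector(events, features):
    """Share of feature mentions across a list of viewed-home feature lists."""
    counts = Counter(f for event in events for f in event if f in features)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(features)
    return [counts[f] / total for f in features]


def user_vector(user_events, segment_events, features, min_events=MIN_EVENTS):
    """Pick a personal vector for active users, a segment vector otherwise."""
    if len(user_events) >= min_events:
        return "personal", preference_vector(user_events, features)
    return "segment", preference_vector(segment_events, features)


features = ["garage", "pool", "good_schools"]
segment_events = [["garage", "good_schools"], ["good_schools"]]

# A new user with a single viewed home falls back to the segment vector.
print(user_vector([["pool"]], segment_events, features))
# A user with plenty of activity gets a vector built from their own views.
print(user_vector([["pool"]] * 5, segment_events, features))
```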
Users who share their email address with Zillow can get recommendations for homes they would like, based on what they've searched for in the past. Zillow may also send these users personalized collections of homes based on what factors seem important to the users, such as good school districts.
For the data pros in the audience, Zillow offers a special gift. The company publishes a small selection of data sets on its website that users can download. They are at Zillow.com/data.