One of the most important steps in our modern, mobile, location-based app world is building and maintaining a places database: a collection of all the physical locations, businesses, points of interest, and structures in the US and worldwide. Location-based services such as check-in applications (e.g. Foursquare), restaurant review apps (e.g. Yelp), local discovery apps (e.g. Loopt), and social networks (e.g. Facebook) all rely on such a database because places are core to their functionality.
At Fwix, we use a places database, unsurprisingly, for our geotagging technology. As mentioned in a previous blog post, we partnered with Factual as our baseline places database provider. Factual continues to be a great partner company, and their places data is open and free to use through our API.
One might think that, because we have partnered with another company for our baseline places data, our work in building and maintaining a places database is complete. It is not. We have a dedicated, full-time team maintaining a high-quality and constantly evolving places database. Several developers and companies have reached out to us seeking insight on this topic, so this blog post discusses the key challenges involved. It’s a lengthy explanation, but I hope it sheds light on the problems our great engineers have been working on.
1. Data Quality
While we only distribute our Factual places data through our public API, we actually consolidate many (10+) data sources to form our internal, canonical places database. A canonical place is an authoritative place listing reconstructed from those 10+ sources. For example, each source has data for the restaurant Chez Panisse in Berkeley, CA. In forming the canonical listing for Chez Panisse, we might take the phone number from one source and the street address from another. (Note: we only distribute the Factual places data through our API. We do not distribute the data we collect from the other sources.)
Consolidating multiple sources improves the breadth, depth, and quality of our data, but building a canonical place has its challenges. How do you determine data quality? How do you know which data source is better than another? One source might consistently have an accurate phone number for a business listing but lack a street address. One might infer that a source is poor in quality because it is missing a large percentage of attributes (e.g. phone number, address), yet the attributes it does provide may be the most accurate available. As a result, determining source quality is difficult, and each value in a canonical place becomes dynamic: a mathematical function of the quality of the sources behind it.
Neighborhood boundaries are another instance. The two maps below show two neighborhoods within San Francisco: SOMA and South Beach. While these are two distinct neighborhoods, some sources state that SOMA includes South Beach and others show them as two separate neighborhoods.
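To make the canonical-place idea concrete, here is a minimal sketch in Python. It assumes each source has a per-attribute quality weight and picks, for every attribute, the value with the highest total weight behind it. The source names, the weights, the `build_canonical` function, and the sample listing values are all hypothetical illustrations, not our actual sources or scoring function.

```python
from collections import defaultdict

# Hypothetical per-source, per-attribute quality weights in [0, 1].
SOURCE_QUALITY = {
    "source_a": {"phone": 0.9, "address": 0.4},
    "source_b": {"phone": 0.5, "address": 0.8},
    "source_c": {"phone": 0.3, "address": 0.6},
}


def build_canonical(listings, quality=SOURCE_QUALITY, default_weight=0.1):
    """For each attribute, keep the value with the highest summed source weight."""
    votes = defaultdict(lambda: defaultdict(float))
    for source, attrs in listings.items():
        for attr, value in attrs.items():
            if not value:
                continue  # a missing attribute casts no vote
            weight = quality.get(source, {}).get(attr, default_weight)
            votes[attr][value] += weight
    return {attr: max(cands, key=cands.get) for attr, cands in votes.items()}


# Made-up listings for one place as reported by three sources.
listings = {
    "source_a": {"phone": "555-0100", "address": None},
    "source_b": {"phone": "555-0199", "address": "12 Example Ave"},
    "source_c": {"phone": "555-0199", "address": "12 Example Avenue"},
}

canonical = build_canonical(listings)
print(canonical)
```

In this sample, the phone number comes from `source_a` even though two other sources agree with each other, because `source_a` is weighted as more reliable for phone numbers. The address comes from `source_b`. Because the weights are inputs, the canonical listing changes whenever the estimate of source quality changes.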
2. Data Constantly Changing
Places data changes all the time, and stale data must be updated. New businesses spring up, businesses shut down, businesses change their phone number, businesses move to a new location, businesses add a storefront URL, and the list goes on. But it’s not just businesses. ZIP codes change, neighborhood boundaries change, and more. The data is never static and must be continually refreshed to maintain quality. On average, we modify and update 2 million existing place entries per month.
3. Scale
Processing millions and millions of entries in a reasonable time is not a simple task. We have over 30 million place entries in our database, and this number is growing. No matter how much hardware you throw at the problem, it still requires a complex infrastructure for distributed processing. We use proprietary software tools for distributed processing to optimize performance. The numbers below give a sense of the constant fluctuation:
- Over 30,000,000 places in our database.
- Adding on average 300,000 place entries per month.
- Removing on average 200,000 place entries per month.
4. Removing Duplicates
Programmatically identifying duplicate records that refer to the same place is critical to maintaining a clean dataset. Does the name of this place already exist in our database? If there are multiple occurrences, are they actual duplicates, or perhaps part of a chain (e.g. Starbucks Coffee)?
For example, there are multiple “Osha Thai” restaurants here in San Francisco. One source may refer to this place as “Osha Thai Restaurant”, while another one simply refers to it as “Osha Thai”. Are they referring to the same place? Another example is the address: one source may state the street address to be “149 2nd Street” while another source may reference the street as “149 Second Street”. Obviously, there are many similar examples.
Resolving these duplicates, a.k.a. “de-duping”, requires fuzzy string matching, confining geospatial searches to a geographic region, and calculating a score from a function of several parameters to determine the most accurate match.
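The three ingredients above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names, the noise-word and ordinal tables, the 150-meter radius, the equal 50/50 weighting, and the coordinates are all hypothetical, and the fuzzy matcher is the standard library's `difflib.SequenceMatcher` rather than whatever a production system would use.

```python
import math
import re
from difflib import SequenceMatcher

# Hypothetical normalization tables; a real system would need far larger ones.
ORDINALS = {"1st": "first", "2nd": "second", "3rd": "third"}
NOISE_WORDS = frozenset({"restaurant", "cafe", "inc"})


def normalize(text, drop=frozenset()):
    """Lowercase, strip punctuation, drop noise words, spell out ordinals."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ORDINALS.get(t, t) for t in tokens if t not in drop)


def similarity(a, b, drop=frozenset()):
    """Fuzzy string similarity in [0, 1] on the normalized forms."""
    return SequenceMatcher(None, normalize(a, drop), normalize(b, drop)).ratio()


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    radius = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    h = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(h))


def duplicate_score(a, b, max_distance_m=150.0):
    """Score two records; anything outside the search radius scores 0."""
    dist = haversine_m(a["lat"], a["lon"], b["lat"], b["lon"])
    if dist > max_distance_m:
        return 0.0  # confine matching to a small geographic region
    name = similarity(a["name"], b["name"], NOISE_WORDS)
    addr = similarity(a["address"], b["address"])
    return 0.5 * name + 0.5 * addr


# Illustrative records; coordinates and the third address are made up.
rec_a = {"name": "Osha Thai Restaurant", "address": "149 2nd Street",
         "lat": 37.7875, "lon": -122.3995}
rec_b = {"name": "Osha Thai", "address": "149 Second Street",
         "lat": 37.7876, "lon": -122.3996}
rec_c = {"name": "Osha Thai", "address": "500 Example Street",
         "lat": 37.7838, "lon": -122.4180}

print(duplicate_score(rec_a, rec_b))  # same place, different spellings
print(duplicate_score(rec_a, rec_c))  # same chain, different location
```

The distance check is what separates a true duplicate from another branch of the same chain: the two “Osha Thai” records a few meters apart score as a match, while the record more than a kilometer away scores zero regardless of its name.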
5. Lack of Data, Going Global
It’s difficult to find data for every corner of the US, let alone the world. Census.gov is a good source for US data, but even within the US, there are hundreds of rural towns that lack this basic information. The problem multiplies dramatically when dealing with global places data.
6. Data Inferences
Another challenge arises from the lack of data in some geographic areas. When a value is missing, our technology algorithmically infers what the value should be.
An example will help illustrate. The map below displays the neighborhoods of Daly City (outlined in blue), a city just south of San Francisco. Suppose there is a business located at the dot below, and we have its street address but not its neighborhood. In that case, we have to infer the neighborhood. This is not an easy inference to make given the complex and coiled shape of Daly City. Inferring the value turns into a series of geospatial calculations involving the centroids of the neighborhoods in close proximity to the business.
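As a rough sketch of one such calculation, the Python below computes the centroid of each neighborhood polygon with the shoelace formula and assigns the business to the neighborhood whose centroid is nearest. This is a simplified heuristic and not necessarily the method used in production. The function names and the two rectangular “neighborhoods” are hypothetical, and coordinates are assumed to be already projected onto a flat local plane (e.g. meters east and north of a reference point).

```python
import math


def polygon_centroid(vertices):
    """Area-weighted centroid of a simple, non-degenerate polygon (shoelace formula)."""
    twice_area = 0.0
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # Centroid = sum / (6 * area) = sum / (3 * twice_area)
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


def infer_neighborhood(point, neighborhoods):
    """Return the name of the neighborhood whose centroid is closest to point."""
    def distance_to_centroid(name):
        cx, cy = polygon_centroid(neighborhoods[name])
        return math.hypot(point[0] - cx, point[1] - cy)
    return min(neighborhoods, key=distance_to_centroid)


# Two made-up rectangular neighborhoods sharing an edge along y = 0.
neighborhoods = {
    "north": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "south": [(0, -6), (4, -6), (4, 0), (0, 0)],
}

business = (3, -1)
print(infer_neighborhood(business, neighborhoods))
```

For compact shapes like these rectangles, the nearest centroid is a reasonable guess. For a coiled boundary like Daly City's, the nearest centroid can belong to a neighborhood that does not actually contain the point, which is why the inference usually needs more than a single distance calculation.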