By Jeff Gladchun, November 21, 2013

Four Tips for Crowdsourcing Database Deduplication

CRM database merging and de-duplication is a perfect use case for crowdsourcing. A variety of factors make these projects hard to solve programmatically. The standards and conventions used to create the databases are inconsistent, minor spelling errors occur in names or addresses, records can be incomplete, and database maintenance can be poor or nonexistent. Developing algorithms that sort and match these records while covering every possible edge case is both time-consuming and difficult.

CrowdFlower’s platform makes merging and de-duplication easy. By coming up with a set of matching rules for our crowd of human contributors to use, we can quickly and efficiently determine whether CRM business records are duplicates. For instance, we can eliminate the need to account for different levels of specificity in business names (Mitsubishi Group versus Mitsubishi Logistics) by instructing and training the crowd to count these records as matching regardless of the business division. Alternatively, these cases can be counted as not matching, depending on the rules defined for the job.

Once the rules have been determined, we develop a set of instructions to cover all the bases with our contributors and create test questions that ensure that all the edge cases are accounted for before we run the data through our platform to the crowd. Below is an example of the user interface (UI) of a task a contributor would use to sort these files.

[Image: Dedupe task UI]

Four things to keep in mind when you write the instructions for a de-duplication crowdsourcing job:

  1. Levels of specificity
    1. Business names (Inc, Plc, Ltd, etc.)
    2. Address information (suite names included or not included)
  2. Abbreviations
    1. Should abbreviations count as matching in business names?
    2. Example: Eddie Bauer LTD versus Eddie Bauer Limited
  3. Spelling errors
    1. Should minor spelling errors be tolerated?
    2. Example: CorwdFlower versus CrowdFlower
  4. Suite information
    1. Should the crowd ignore inconsistent suite information?
    2. Example: 2111 Mission St. versus 2111 Mission St. Ste. 302
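
To make these four decisions concrete, here is a minimal Python sketch of how the same rules might look if you wrote them down as code, for example to pre-screen obvious matches before sending the rest to the crowd. This is an illustration, not how CrowdFlower's platform works: the abbreviation table, the function names, and the 0.85 similarity threshold are all assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; extend it to fit the rules of your job.
ABBREVIATIONS = {
    "ltd": "limited",
    "inc": "incorporated",
    "st": "street",
    "ste": "suite",
}


def normalize(text):
    """Lowercase, drop punctuation, and expand known abbreviations."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def strip_suite(address):
    """Normalize an address and remove a trailing suite, e.g. 'Ste. 302'."""
    return re.sub(r"\s+suite\s+\w+$", "", normalize(address))


def names_match(a, b, tolerate_typos=True, threshold=0.85):
    """Compare business names, optionally tolerating minor spelling errors."""
    a, b = normalize(a), normalize(b)
    if a == b:
        return True
    # Assumed rule: a similarity ratio at or above the threshold is a typo.
    return tolerate_typos and SequenceMatcher(None, a, b).ratio() >= threshold


def addresses_match(a, b, ignore_suite=True):
    """Compare street addresses, optionally ignoring suite information."""
    if ignore_suite:
        return strip_suite(a) == strip_suite(b)
    return normalize(a) == normalize(b)


print(names_match("Eddie Bauer LTD", "Eddie Bauer Limited"))
print(names_match("CorwdFlower", "CrowdFlower"))
print(addresses_match("2111 Mission St.", "2111 Mission St. Ste. 302"))
```

Each flag (`tolerate_typos`, `ignore_suite`) corresponds to one of the yes/no questions in the list above. Writing the rules out this way also shows why the crowd is useful: every new abbreviation or address format needs another entry or pattern, while a trained contributor handles it on sight.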

Once the contributors have been trained, the edge cases accounted for, and the data processed, there are a few last loose ends to tie up. Namely, how do you want to package and deliver the data? Typically, a record with a non-matching result for business name, street address, city, state, or country is automatically sorted as a non-matching record. You will need to decide whether matching suite information is important to you and whether you want to count a record with non-matching suite information as an overall match.
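
As a sketch of that delivery rule, the snippet below combines per-field results into a single verdict. The field names and the `require_suite` switch are hypothetical and stand in for whatever columns and options your own job uses.

```python
# Fields that must all match for two records to count as duplicates.
# The names are assumptions made for this example.
REQUIRED_FIELDS = ("business_name", "street_address", "city", "state", "country")


def overall_match(field_results, require_suite=False):
    """Combine per-field results (True = match) into one overall verdict."""
    # A non-match, or a missing result, on any required field
    # makes the whole record a non-match.
    if not all(field_results.get(field, False) for field in REQUIRED_FIELDS):
        return False
    # Suite information only counts when the job is configured to require it.
    if require_suite:
        return field_results.get("suite", False)
    return True


results = {
    "business_name": True,
    "street_address": True,
    "city": True,
    "state": True,
    "country": True,
    "suite": False,
}
print(overall_match(results))                      # suite ignored
print(overall_match(results, require_suite=True))  # suite must match too
```

With the suite ignored, the sample record counts as a match; with `require_suite=True`, the same record is sorted as a non-match.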

CRM database merging and de-duplication is a slam-dunk use case for crowdsourcing. The crowd loves working on these tasks. The results are delivered in a matter of minutes. More importantly, the results are more accurate and more cost-effective than those from an algorithmic or programmatic model. Overall, as Forrester noted in a post this summer, the crowd helps infuse agility into data quality by enabling quick turnaround without expensive process change or integration expense.