Match Engine Theory and Configuration

We started a blog series on match engines and their importance to an MDM solution last week and this is our first entry taking a deeper dive into each area.  First up, match engine theory to understand what must be configured.

Right up front, thanks go to Dan Cidon, NextGate’s CTO, for the background and technical details that went into this post.

To understand how to configure a match engine, it is first useful to consider some of the types of challenges it will encounter in matching fields in records from different systems.  Some common ones include:

  • Spelling errors
  • Transpositions in IDs/SSNs
  • Aliases, i.e. Jon and Jonathan being a match
  • Transpositions of given/family name
  • Properly handling twins/triplets/etc.
  • Transpositions in or fat-fingered dates
  • Default and/or anonymous data, i.e. John or Jane Doe, an SSN of 999-99-9999, etc.

Clearly, a simple “equals” match is not going to suffice.  The approach most every match engine takes, NextGate’s included, is to compute a score indicating the likelihood of a match based on comparing selected fields using industry standard comparison functions and weighting for each comparison.  This score can then be tested against configured thresholds to determine if two records match, don’t match, or may match.  It is the configuration of the comparison functions and determination of the appropriate weights that are core to the configuration of the match engine and determining how well it will perform.

The following figure shows three example comparisons and the resulting scores and whether the records are determined to match.

ngs_scores

So how were these scores arrived at?  To understand that, we have to delve into match theory a bit.

The total score is the sum of the agreement weights  and disagreements weights for each field comparison.  Each field’s weight depends on how well the fields compare to each other, the Yes and No cases from above shown in the two figures below.

yes_weightsYes case no_weightsNo case

The weights in the figures should be fairly intuitive to a human looking at it, but it is getting the match engine to arrive at this “intuitive” result automatically that is the magic.

So how are the weights arrived at?  Each fields agreement/disagreement weight is determined from the likelihood that two fields from a population of matched records are the same or that the two fields from a population of unmatched records are the same.  From these probabilities, the agreement and disagreements weights can be computed.

For agreement weight:

  1. If the field contents of two records are identical then they are given an agreement weight defined for that field.
  2. The agreement weight is based on how likely the fields are identical based on random chance.
  3. The more likely a random identical match, the lower the agreement weight.

For disagreement weight:

  1. If the field contents of two records do not match identically then they are given a disagreement weight for that field.
  2. The disagreement weight is based on the reliability of that field. Reliability is the likelihood that the field contents of two records from the matched set are identical.
  3. The more reliable a field, the stronger (more negative) the disagreement weight.

Now, how are fields compared to determine if they are identical?

In some cases, it is pretty black and white, for example with genders where there are two possible values.  But when dealing with names, dates, addresses, etc. it isn’t, i.e. two records with a first name of John and Jon are not identical but it wouldn’t be appropriate to assign a complete disagreement weight.  This is where comparison functions come in.

Comparison functions return a value between 0 and 1 indicating how well two fields compare to each other, a 1 indicating complete agreement and 0 complete disagreement with partial agreement falling in between.  There are a host of functions used for different purposes from performing phonetic comparisons to handling transpositions to many many more.  It is the selection and configuration of the appropriate function that is critical to match engine performance.

The result of the comparison function must then transformed into a weight that ranges from the complete disagreement to complete agreement weight for that field.  One could simply do this linearly, but research has shown that piecewise linear functions where the weight drops off more rapidly as the comparison function result goes farther below 1 perform much better.  A graphical example of such similarity to weight mapping is shown in the figure below.

similarity_weight

But it isn’t enough to just have comparison functions for individual fields.  Sometimes, as is the case with multiple fields that make up a name, a comparison has to look at multiple fields or look at individual tokens of a field as if they were fields themselves.

For example, record A may store a person’s name as “John J Smith” and record B may store a person’s name as “Smith John Jason”.  Transpositions of name elements like this are a common source of data entry errors and something a match engine must deal with.

A multi-token match function compares all possible combinations of tokens between the record pairs and selects the highest weighted pairs when calculating an overall weight.  For the records above, the token comparisons that would be performed are shown in the table below (sorted by weight).

Token from Record A Token from Record B Comparison Weight
Smith Smith +6.1
John John +5.7
J Jason +4.0
J John +4.0
John Jason -6.0
J Smith -8.0
Smith John -8.0
Smith Jason -8.0
John Smith -8.0

The multi-token match function adds up all the weights, or can be configured to add the first N weights, to arrive at the weight used in the scoring.  For names with three components N would typically be 3 meaning the weight for this example would be 6.1+5.7+4.0=15.8.

This is just a sampling of how a match engine works and what is required to understand to configure it.  In future posts we’ll go into more detail and explore the other areas mentioned in the overview earlier.

The Importance of a Match Engine in your MDM Strategy

I came across an article, The What, Why and How of Master Data Management, that gives a good overview of what master data is, why it is important, what master data management is, and what it entails.

There is quite a lot that fits under the umbrella of MDM, but a key part of it is the matching of records from different systems, or new/updated records entering the system.  From the article:

Generate and test the master data: This step is where you use the tools you have developed or purchased to merge your source data into your master-data list. This is often an iterative process requiring tinkering with rules and settings to get the matching right. This process also requires a lot of manual inspection to ensure that the results are correct and meet the requirements established for the project. No tool will get the matching done correctly 100 percent of the time, so you will have to weigh the consequences of false matches versus missed matches to determine how to configure the matching tools. False matches can lead to customer dissatisfaction, if bills are inaccurate or the wrong person is arrested. Too many missed matches make the master data less useful, because you are not getting the benefits you invested in MDM to get.

Also:

Implement the maintenance processes: As we stated earlier, any MDM implementation must incorporate tools, processes, and people to maintain the quality of the data. All data must have a data steward who is responsible for ensuring the quality of the master data. The data steward is normally a business person who has knowledge of the data, can recognize incorrect data, and has the knowledge and authority to correct the issues. The MDM infrastructure should include tools that help the data steward recognize issues and simplify corrections. A good data-stewardship tool should point out questionable matches that were made, customers with different names and customer numbers that live at the same address, for example. The steward might also want to review items that were added as new, because the match criteria were close but below the threshold. It is important for the data steward to see the history of changes made to the data by the MDM systems, to isolate the source of errors and undo incorrect changes. Maintenance also includes the processes to pull changes and additions into the MDM system, and to distribute the cleansed data to the required places.

The article also describes cardinality, or the number of entities, as being an important criterion in deciding when to use an MDM solution.  A data set with “thousands of customers” is described as being a large data set and there are many MDM solutions that can handle this volume.  What separates the truly scalable solutions is how they deal with problem domains that are much larger.  For example, customer databases with tens of millions of records require very efficient algorithms not only for speed, but also to minimize the manual data stewardship required for such data sets.

Clearly, the match engine that the selected tool provides is critical to the performance and ROI of the solution.

So what are key features of a match engine?  At a high level they are:

  • Configurability – A match engine is useful only if it can be configured and tuned for the application at hand.
  • Integration – A match engine must be able to integrate with other systems to receive records and propagate changes/updates back to the systems.
  • Data Governance – A user interface and tools must be provided for comparing, merging, editing, and tracking the records.
  • Security / Auditing – Securing access to the data with role-based access controls as well as a complete audit trail of access is critical.
  • Accuracy – The whole point of a match engine is to automatically match the appropriate records so as to the number of records a human has to match manually.
  • Performance / Scalability – Initial load performance as well as ongoing performance is important, especially as the volume of data grows.
  • Reporting – Getting reports on match engine performance and data quality metrics can be very valuable.
  • Extensibility – The ability to extend backend functionality for specific domains or customize the user interface is often required.

In future entries on this blog, we’ll explore each of the above in much more detail. Stay tuned!