Match Engine Theory and Configuration
Last week we started a blog series on match engines and their importance to an MDM solution; this is our first entry taking a deeper dive into each area. First up: match engine theory, to understand what must be configured.
Right up front, thanks go to Dan Cidon, NextGate’s CTO, for the background and technical details that went into this post.
To understand how to configure a match engine, it is first useful to consider some of the types of challenges it will encounter in matching fields in records from different systems. Some common ones include:
- Spelling errors
- Transpositions in IDs/SSNs
- Aliases, e.g. Jon and Jonathan being a match
- Transpositions of given/family name
- Properly handling twins/triplets/etc.
- Transposed or fat-fingered dates
- Default and/or anonymous data, e.g. John or Jane Doe, an SSN of 999-99-9999, etc.
Clearly, a simple “equals” match is not going to suffice. The approach nearly every match engine takes, NextGate’s included, is to compute a score indicating the likelihood of a match, based on comparing selected fields using industry-standard comparison functions and a weighting for each comparison. This score can then be tested against configured thresholds to determine whether two records match, don’t match, or may match. It is the configuration of the comparison functions and the determination of the appropriate weights that are core to configuring the match engine and determining how well it will perform.
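As a sketch, the score-versus-thresholds decision described above might look like the following. The threshold values here are illustrative assumptions, not NextGate’s actual configuration:

```python
# Illustrative thresholds -- real values come from match engine configuration.
MATCH_THRESHOLD = 10.0      # at or above: the records match
NO_MATCH_THRESHOLD = 4.0    # below: the records do not match

def classify(total_score: float) -> str:
    """Map a total comparison score to match / may match / no match."""
    if total_score >= MATCH_THRESHOLD:
        return "match"
    if total_score >= NO_MATCH_THRESHOLD:
        return "may match"   # typically queued for manual review
    return "no match"
```

The middle "may match" band is what makes the thresholds worth tuning: set them too close together and borderline pairs get auto-decided; too far apart and reviewers drown in potential matches.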
The following figure shows three example comparisons, the resulting scores, and whether the records are determined to match.
So how were these scores arrived at? To understand that, we have to delve into match theory a bit.
The total score is the sum of the agreement and disagreement weights for each field comparison. Each field’s weight depends on how well the field values compare to each other; the Yes and No cases from above are shown in the two figures below.
*(Figures: the “Yes case” and “No case” field-weight examples)*
The weights in the figures should be fairly intuitive to a human looking at them, but getting the match engine to arrive at this “intuitive” result automatically is the magic.
So how are the weights arrived at? Each field’s agreement/disagreement weight is determined from the likelihood that two field values from a population of matched records are the same, or that two field values from a population of unmatched records are the same. From these probabilities, the agreement and disagreement weights can be computed.
For agreement weight:
- If the field contents of two records are identical then they are given an agreement weight defined for that field.
- The agreement weight is based on how likely the fields are identical based on random chance.
- The more likely a random identical match, the lower the agreement weight.
For disagreement weight:
- If the field contents of two records do not match identically then they are given a disagreement weight for that field.
- The disagreement weight is based on the reliability of that field. Reliability is the likelihood that the field contents of two records from the matched set are identical.
- The more reliable a field, the stronger (more negative) the disagreement weight.
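In the probabilistic matching literature these two probabilities are often written m (the chance the fields agree given the records truly match) and u (the chance they agree given the records don’t match), and the weights are computed as log-likelihood ratios. A minimal sketch of that computation, assuming base-2 logarithms:

```python
import math

def agreement_weight(m: float, u: float) -> float:
    """Positive weight for agreeing fields; larger when chance (u) agreement is rare."""
    return math.log2(m / u)

def disagreement_weight(m: float, u: float) -> float:
    """Negative weight for disagreeing fields; stronger for reliable fields (m near 1)."""
    return math.log2((1 - m) / (1 - u))
```

For example, a reliable field that rarely agrees by chance (m near 1, small u) earns a large positive agreement weight and a strongly negative disagreement weight, exactly the behavior described in the bullets above.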
Now, how are fields compared to determine if they are identical?
In some cases, the comparison is black and white, for example with genders, where there are only two possible values. But when dealing with names, dates, addresses, etc., it isn’t: two records with first names of John and Jon are not identical, but it wouldn’t be appropriate to assign the full disagreement weight. This is where comparison functions come in.
Comparison functions return a value between 0 and 1 indicating how well two field values compare to each other, with 1 indicating complete agreement, 0 complete disagreement, and partial agreement falling in between. There is a host of functions used for different purposes, from performing phonetic comparisons to handling transpositions and many more. Selecting and configuring the appropriate function is critical to match engine performance.
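To illustrate the shape of such a function (this is not one of the specific functions NextGate uses), Python’s standard library can produce a 0-to-1 similarity from edit-distance-style matching:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1]; 1.0 is complete agreement."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```

With this, identical names score 1.0, completely different strings score 0.0, and near-misses like John/Jon land somewhere in between, which is exactly the partial-agreement behavior described above.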
The result of the comparison function must then be transformed into a weight that ranges from the complete disagreement weight to the complete agreement weight for that field. One could simply do this linearly, but research has shown that piecewise linear functions, where the weight drops off more rapidly as the comparison function result falls farther below 1, perform much better. A graphical example of such a similarity-to-weight mapping is shown in the figure below.
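A sketch of one such piecewise linear mapping follows. The knee position (0.8) and the fraction of the agreement weight retained at the knee (0.7) are illustrative assumptions, not values from an actual configuration:

```python
def similarity_to_weight(sim: float, agree_w: float, disagree_w: float,
                         knee: float = 0.8, knee_frac: float = 0.7) -> float:
    """Piecewise linear map from a 0..1 similarity to a field weight.

    Between the knee and 1.0 the weight declines gently from the full
    agreement weight; below the knee it falls steeply toward the full
    disagreement weight.
    """
    knee_w = agree_w * knee_frac
    if sim >= knee:
        # Gentle upper segment: knee_w at sim=knee, agree_w at sim=1.0
        return knee_w + (agree_w - knee_w) * (sim - knee) / (1.0 - knee)
    # Steep lower segment: disagree_w at sim=0.0, knee_w at sim=knee
    return disagree_w + (knee_w - disagree_w) * sim / knee
```

The two-segment shape is the point: a small dip in similarity near 1 costs little weight, but once similarity falls past the knee the penalty accelerates toward full disagreement.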
But it isn’t enough to just have comparison functions for individual fields. Sometimes, as is the case with multiple fields that make up a name, a comparison has to look at multiple fields or look at individual tokens of a field as if they were fields themselves.
For example, record A may store a person’s name as “John J Smith” and record B may store a person’s name as “Smith John Jason”. Transpositions of name elements like this are a common source of data entry errors and something a match engine must deal with.
A multi-token match function compares all possible combinations of tokens between the record pairs and selects the highest weighted pairs when calculating an overall weight. For the records above, the token comparisons that would be performed are shown in the table below (sorted by weight).
| Token from Record A | Token from Record B | Comparison Weight |
|---------------------|---------------------|-------------------|
The multi-token match function adds up all the weights, or can be configured to add only the first N weights, to arrive at the weight used in the scoring. For names with three components, N would typically be 3, meaning the weight for this example would be 6.1 + 5.7 + 4.0 = 15.8.
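A sketch of this multi-token approach is below. The pairwise weight lookup is a toy stand-in: the individual pair weights are illustrative assumptions, chosen only so the total reproduces the 15.8 from the example above.

```python
from itertools import product

def multi_token_weight(tokens_a, tokens_b, pair_weight, top_n=3):
    """Weight every token pairing, greedily keep the highest-weighted pairs
    (each token used at most once), and sum the top_n kept weights."""
    ranked = sorted(
        ((pair_weight(a, b), i, j)
         for (i, a), (j, b) in product(enumerate(tokens_a), enumerate(tokens_b))),
        reverse=True,
    )
    used_a, used_b, kept = set(), set(), []
    for w, i, j in ranked:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            kept.append(w)
    return sum(sorted(kept, reverse=True)[:top_n])

# Toy pairwise weights; any unlisted pair gets a flat disagreement weight.
TOY_WEIGHTS = {("John", "John"): 6.1, ("Smith", "Smith"): 5.7, ("J", "Jason"): 4.0}

def toy_pair_weight(a, b):
    return TOY_WEIGHTS.get((a, b), -2.0)

score = multi_token_weight(["John", "J", "Smith"],
                           ["Smith", "John", "Jason"], toy_pair_weight)
```

Note that the function pairs "John J Smith" with "Smith John Jason" correctly despite the transposed ordering, because it considers every token combination rather than comparing position by position.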
This is just a sampling of how a match engine works and what you need to understand to configure one. In future posts we’ll go into more detail and explore the other areas mentioned in the overview earlier.