The following description of the CAS genesisWorld duplicate check algorithm provides administrators with the information needed to set the parameters for duplicate checking.
To filter duplicates from a given set of addresses, CAS genesisWorld uses an adapted implementation of the Levenshtein algorithm. Preselected fields of an address are extracted and transformed into a standardized form.
The Levenshtein algorithm measures the distance between two character strings. Before the percentage match between two addresses can be determined, the total distance is calculated.
Subsequently, the distance between the two addresses is calculated.
From this result and the total distance, the match between the two addresses is calculated as a percentage. If the calculated value exceeds the threshold value, the two addresses are marked as duplicates.
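The exact formulas used by the product are not documented here, but the following Python sketch illustrates the principle: a standard Levenshtein edit distance plus a match value in percent, assuming that the total distance is the maximum possible distance (the length of the longer string). The function names, the normalization and the threshold value are illustrative assumptions, not part of CAS genesisWorld.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard Levenshtein edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]


def match_percent(a: str, b: str) -> float:
    """Illustrative match value: 100 % means identical strings.

    Assumes the 'total distance' is the maximum possible distance,
    i.e. the length of the longer string; the product's exact formula
    may differ.
    """
    total = max(len(a), len(b))
    if total == 0:
        return 100.0
    return (1 - levenshtein(a, b) / total) * 100


# Hypothetical threshold: pairs above it would be flagged as duplicates.
THRESHOLD = 80.0
print(match_percent("Karlsruhe", "Karlsruh"))   # roughly 88.9
```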
When calculating the distance, each selected address field is assessed using the Levenshtein algorithm and the result is weighted with the set Factor in %. The factor defines how strongly each field is taken into consideration. The individual results are then added to give the distance between the two addresses.
The fields included in the duplicate check and their weighting factors are defined by your administrator on the Included fields tab.
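As an illustration of the weighting step, the following sketch (building on the match_percent helper above) combines the per-field results using assumed Factor in % values. The field names, factors and normalization are hypothetical examples; the actual fields and factors are whatever is configured on the Included fields tab.

```python
# Illustrative weighting factors ("Factor in %"); the real fields and
# factors are whatever the administrator defines on the Included fields tab.
FACTORS = {"Name": 40, "Street": 30, "Town": 20, "ZipCode": 10}


def weighted_match_percent(addr_a: dict, addr_b: dict) -> float:
    """Combine per-field match values into one weighted percentage.

    Uses the match_percent() helper from the previous sketch. Each selected
    field is compared on its own, the result is weighted with its factor,
    and the weighted results are summed and normalized.
    """
    total_factor = sum(FACTORS.values())
    weighted = 0.0
    for field, factor in FACTORS.items():
        a = addr_a.get(field, "").strip().lower()   # assumed normalization
        b = addr_b.get(field, "").strip().lower()
        weighted += match_percent(a, b) * factor
    return weighted / total_factor


addr1 = {"Name": "Meier GmbH", "Street": "Hauptstr. 1", "Town": "Karlsruhe", "ZipCode": "76131"}
addr2 = {"Name": "Meyer GmbH", "Street": "Hauptstr. 1", "Town": "Karlsruhe", "ZipCode": "76133"}
print(weighted_match_percent(addr1, addr2))   # high value, so likely duplicates
```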
For performance reasons, it is not possible to compare every address with every other address during the duplicate search: for n addresses, this would require (n² - n) / 2 comparisons, which for 100,000 addresses already amounts to roughly five billion comparisons. Consequently, the addresses are first split into subsets, and only the addresses within the same subset are compared with each other. The division into subsets is based on the Town field.
The number of characters taken into account from the subset fields determines the size of the subsets. This means that two duplicates must be identical in the first n characters of the town in order to be identified as duplicates.
The greater the number of characters n used to form the subsets, the smaller the subsets will be, which makes the duplicate search faster. The opposite also applies: the smaller the number of characters n, the larger the subsets will be, which results in a more precise but also slower duplicate search. At the same time, duplicates that do not match in the first n characters of the town are not identified as duplicates.
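The following sketch illustrates how such a subset split by the first n characters of the Town field could look. The grouping key, the normalization and the default of three characters are assumptions for illustration, not the product's actual implementation.

```python
from collections import defaultdict
from itertools import combinations


def build_subsets(addresses: list[dict], n_chars: int) -> dict[str, list[dict]]:
    """Group addresses by the first n_chars characters of the Town field."""
    subsets = defaultdict(list)
    for addr in addresses:
        key = addr.get("Town", "").strip().lower()[:n_chars]   # assumed normalization
        subsets[key].append(addr)
    return subsets


def candidate_pairs(addresses: list[dict], n_chars: int = 3):
    """Yield only those address pairs that fall into the same subset.

    A larger n_chars produces smaller subsets and fewer comparisons, but
    duplicates whose towns differ within the first n_chars characters are
    never compared and therefore never found.
    """
    for subset in build_subsets(addresses, n_chars).values():
        yield from combinations(subset, 2)
```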
The subset fields and the number of characters used to form the subsets are determined by your administrator on the Search options tab.