This document gives a practical description of a procedure for record linkage by means of Extreme Value Theory (EVT). No labeled training data are needed, but the user has to choose thresholds from a mean residual life plot (also known as a mean excess plot).

The examples below use the data set RLdata500. Because classification with EVT is weight-based, weights must first be computed for the record pairs to be classified. Here an EM algorithm is applied.

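library(RecordLinkage)
data(RLdata500)
# One blocking pass per listed column; columns 1 to 4 are compared as strings
bf <- list(1, 3, 5, 6, 7)
rpairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                        blockfld = bf, strcmp = 1:4)
# Compute EM-based weights for the record pairs
rpairs <- emWeights(rpairs)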

Calling getParetoThreshold opens a mean residual life (MRL) plot of the computed weights, as shown in Figure 1. From this plot, select an interval over which the graph shows a relatively long and approximately linear descent. Such an interval usually lies between 0 and 20 for weights computed with emWeights, or between 0.5 and 0.9 for weights computed with epiWeights. Figure 2 shows the same MRL plot with the appropriate segment marked.
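To make clear what the MRL plot displays, the following minimal sketch computes the empirical mean excess function in base R: for each candidate threshold u, it averages x - u over all observations x above u. This is an illustration only, not the package's own plotting code; the function name mean_excess and the weight vector w are hypothetical and chosen for this example.

```r
# Empirical mean excess (mean residual life) function:
# e(u) = mean(x - u) taken over all observations x > u.
mean_excess <- function(x, u) {
  vapply(u, function(t) {
    exceed <- x[x > t]
    # No observation above t means the mean excess is undefined.
    if (length(exceed) == 0) NA_real_ else mean(exceed - t)
  }, numeric(1))
}

# Hypothetical weights, used only for illustration.
w <- c(-12, -8, -5, -1, 0, 2, 4, 7, 11, 15)

# Evaluate at every distinct weight except the maximum,
# above which no excess exists.
u <- head(sort(unique(w)), -1)
mrl <- mean_excess(w, u)

# Plot mean excess against the threshold.
plot(u, mrl, type = "l", xlab = "Threshold u", ylab = "Mean excess")
```

For data whose tail follows a generalized Pareto distribution, the mean excess is a linear function of the threshold, which is why an approximately linear segment is the part of the plot to look for.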

The interval is selected by clicking on the endpoints of the desired segment of the graph. If the right endpoint coincides with the edge of the graph, only the left endpoint has to be selected. See the documentation of identify for more information on selecting points on a plot.

## Not run: getParetoThreshold(rpairs)
**Figure 1:** Basic MRL plot

**Figure 2:** MRL plot with appropriate graph segment marked

As an alternative to interactive selection, the interval can be passed to getParetoThreshold through the interval argument. In either case, the return value is a threshold that can be used directly for classification.

threshold <- getParetoThreshold(rpairs, interval = c(1.2, 12.8))
result <- emClassify(rpairs, threshold)
summary(result)
## 
## Deduplication Data Set
## 
## 500 records 
## 18643 record pairs 
## 
## 50 matches
## 18593 non-matches
## 0 pairs with unknown status
## 
## 
## Weight distribution:
## 
## [-30,-25] (-25,-20] (-20,-15] (-15,-10]  (-10,-5]    (-5,0] 
##     13320      2505      1492      1079       175        22 
##     (0,5]    (5,10]   (10,15]   (15,20]   (20,25]   (25,30] 
##         8        18        21         0         0         3 
## 
## 42 links detected 
## 0 possible links detected 
## 18601 non-links detected 
## 
## alpha error: 0.160000
## beta error: 0.000000
## accuracy: 0.999571
## 
## 
## Classification table:
## 
##            classification
## true status     N     P     L
##       FALSE 18593     0     0
##       TRUE      8     0    42
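
The error measures in the summary can be reproduced from the classification table. The sketch below copies the counts by hand and assumes that the alpha error is the share of true matches classified as non-links and the beta error the share of true non-matches classified as links, which is consistent with the values printed above. The object name tab is hypothetical, and the all-zero column P (possible links) is omitted.

```r
# Counts copied from the classification table above
# (rows: true status; columns: N = non-link, L = link).
tab <- matrix(c(18593,  0,
                    8, 42),
              nrow = 2, byrow = TRUE,
              dimnames = list(true_status = c("FALSE", "TRUE"),
                              classification = c("N", "L")))

# Missed matches relative to all true matches.
alpha <- tab["TRUE", "N"] / sum(tab["TRUE", ])
# False links relative to all true non-matches.
beta <- tab["FALSE", "L"] / sum(tab["FALSE", ])
# Correctly classified pairs relative to all pairs.
accuracy <- (tab["FALSE", "N"] + tab["TRUE", "L"]) / sum(tab)

round(c(alpha = alpha, beta = beta, accuracy = accuracy), 6)
```

The 8 missed matches out of 50 give the alpha error of 0.16, and since no non-match was linked, the beta error is 0.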