Using the Duplicate Detector Component

The Duplicate Detector Component is an SSIS data flow pipeline component that can be used to compare rows within a single data source and identify duplicate rows. Comparison filters can be added for fields in the source that determine how they will be compared. Comparisons can be either exact or approximate (fuzzy).

Each row is compared to other rows based on the specified filters, is determined to either be unique or a duplicate and is directed to the applicable output.

The component includes the following three pages to configure how you want to read data.

  • General
  • Comparison Settings
  • Error Handling

General Page

The General page of the Diff Detector Component allows you to specify the general settings of the component.

SSIS Duplicate Detector

Add Filter button

When a new duplicate detector component is added, it will not have any filters. Click the 'Add Filter' button to add one. Each filter has three properties, column name, match type and similarity threshold.

Column Name

This is the name of the column that will be compared. Column types other than IMAGE and BYTES can be used in filters.

Match Type

The match type of a filter can be set to either an Exact Match, the generic Fuzzy Match, or a more specific match type. Exact match columns must be matched exactly to be considered duplicates. Fuzzy match columns must match above a specified similarity threshold. If you select Fuzzy Match as the match type the similarity threshold will be enabled. Only string types such as STR, NTEXT, TEXT and WSTR can be used with a fuzzy match filter, a warning will be shown if any other type is selected.

  • Exact Match: Fields must be exactly the same to be considered a match. The Similarity Threshold slider is disabled because matches must always have a similarity of 1.00.
  • Fuzzy Match: Strings are considered a match if their similarity is above the specified similarity threshold.
  • Address Match: Strings are treated as street addresses, and common street name prefixes and directions are treated as equivalent. For example N. and North, or Avenue and Av. would be treated as exact matches.
  • First Name Match: Strings are treated as first names, and common nicknames are treated as the same name. For example, Bill and William would be an exact match.
  • Company Name: Strings are treated as company names, and common company suffixes such as INC, CORP and LLC are ignored when comparing.
  • Phone Number Match: Strings are parsed to attempt to identify a 10 digit phone number which is used for comparison, ignoring any other characters.
  • US Zip Code Match: Strings are parsed to attempt to identify a 5 digit zip code which is used for comparison, ignoring other characters.
  • Country Match (since v7.1): Strings are treated as country names, and country codes are matched with English short country names. For example, Canada and CA would be an exact match.
  • US State Match (since v7.1): Strings are treated as states, and codes are matched with the subdivision names. For example, New York and NY would be an exact match.
Similarity Threshold

When a match type other than ExactMatch is selected for a filter the similarity threshold slider will be enabled. The similarity threshold is a value between 0 and 1 that determines how similar the values need to be to be considered duplicates. A value of 0 would be not similar at all and a value of 1 would be an exact match.

Expression fx Button

Clicking the fx button to launch SSIS Expression Editor to enable dynamic updates of the property at run time.

Generate Documentation Button

Clicking the Generate Documentation button to generate a Word document that describes the component's metadata including relevant mapping, and so on.

Comparison Settings Page

The Comparison Settings page of the Duplicate Detector Component allows you to specify the settings that will be used when comparing fields.

SSIS Duplicate Detector - Comparison

Exact Filter and Fuzzy Filter
Ignore Leading and Trailing White Space

When Ignore Leading and Trailing White Space is checked leading and trailing white space of string fields will be ignored during the comparison.

Ignore Case

When Ignore Case is checked case will be ignored when comparing fields.

Treat Empty String As Null (since v6.0)
Enabling this option allows you to specify whether to treat an empty string as NULLs when performing a comparison in Duplicate Detector.
Token Delimiters

Specify the Token delimiters here for comparison. Some default values are provided already.

Company Name Filter
Ignore All White Space

When Ignore All White Space is checked all white space will be ignored when using the Company Name Filter.

Ignore Punctuation

When Ignore Punctuation is checked all punctuation when using the Company Name Filter.

Record Ranking Strategy(Since v22.1)

The Record Ranking Strategy option can be used to rank the records within the same duplicate record group, the result of such ranking could possibly be used to determine which record wins in the group. The strategy is defined by the combination of the following two options:

Output Column

Choose the output column based on which the records should be sorted.

Sort Direction

Choose the direction in which it should be sorted - Ascending or Descending

Error Handling Page

The Error Handling page allows you to specify how errors should be handled when they happen.

SSIS Duplicate Detector - errors

There are three options available.

  1. Fail on error
  2. Redirect rows to error output
  3. Ignore error

When the Redirect rows to error output option is selected, rows that failed to write to the Duplicate Detector Component will be redirected to the 'Error Output' output of the component. As indicated in the screenshot below, the green output connection represents rows that were successfully written, and the red 'Error Output' connection represents erroneous rows. The 'ErrorMessage' output column found in the 'Error Output' may contain the error message that was reported by the server or the component itself.

Duplicate Detection and Output

The duplicate detector has an output called "Unique Rows" and an output called "Duplicate Rows". In order for a row to be considered a duplicate of another row all of the fields found in filter columns must be considered matches, either an exact match or a fuzzy match above the specified similarity threshold. When two rows are compared to each other, if any one field fails to be a match the rows are not considered to be duplicates. If a row is not considered a duplicate of any other input row it is directed to the Unique Rows output.

When two rows are compared and are found to be a match they will be directed to the Duplicate Rows output.

The Duplicate Rows output has four calculated columns which are:

  • Richness Score: Each row is given a richness score which is meant to represent data quality. This score between 0 and 1 is equal to the number of fields that have data divided by the total number of fields.
  • Richness Rank: Each row in a duplicate group is given a rank based on how rich its data is.
  • Similarity Score: Each row in a duplicate group except the first one is given a similarity score based on how similar it is to the top-ranked row in the group.
  • Group Code: Two or more rows with the same group code are considered to be in the same group of duplicates. The group code for the group can be generated in two ways: the first way is using a hash of the values in the top-ranked group member, and the second way is a sequential integer starting from 1. The advanced property SequentialGroupCode can be set to true to enable sequential integer group codes.

Performance

Duplicate detection can be a performance-intensive process, especially with large datasets and fuzzy matching. The following information can be used to help address performance issues:

  • Exact match filters perform much better than fuzzy match filters; rows are compared for exact matches before any fuzzy matching is attempted. If you are able to add any exact match filters to your duplicate detector the component's performance will improve.
  • When using a fuzzy match filter the higher the similarity threshold you set, the better the performance will be (and the fewer matches you will get). A high similarity threshold allows potential duplicates to be eliminated more quickly.
  • There is an input column property, only available in the advanced editor, that is called MatchingPerformance. To access this property open the advanced editor of the duplicate detector component from the context menu, click on the Input and Output Properties tab, and expand the input columns node in the inputs and outputs tree. The MatchingPerformance property determines the balance between fuzzy match quality and performance and can be very useful to achieve desired levels of quality and performance. This property is a value between 0 and 1. It defaults to 0.75 and uses an asmyptotic scale, so small changes will have a bigger impact as they approach 1 and a smaller impact as they approach 0. If you are experiencing slow performance, especially with long strings (greater than 15 characters) try changing this value to something higher, such as 0.8 or 0.9.
  • If your performance is good, and you would like better fuzzy matching precision, you can try lowering the MatchingPerformance value for a field. This will be especially useful for very short string fields (less than 8 characters). Try setting this value to 0.66 or 0.5 for improved fuzzy matching precision and lower performance.