Using the Duplicate Detector Component
The Duplicate Detector Component is an SSIS data flow pipeline component that can be used to compare rows within a single data source and identify duplicate rows. Comparison filters can be added for fields in the source that determine how they will be compared. Comparisons can be either exact or approximate (fuzzy).
Each row is compared to other rows based on the specified filters, is determined to either be unique or a duplicate and is directed to the applicable output.
The component includes the following three pages to configure how you want to read data.
- Comparison Settings
- Error Handling
The General page of the Diff Detector Component allows you to specify the general settings of the component.
- Add Filter button
When a new duplicate detector component is added, it will not have any filters. Click the 'Add Filter' button to add one. Each filter has three properties, column name, match type and similarity threshold.
- Column Name
This is the name of the column that will be compared. Column types other than IMAGE and BYTES can be used in filters.
- Match Type
The match type of a filter can be set to either Exact Match, the generic Fuzzy Match, or a more specific match type. Exact match columns must be matched exactly to be considered duplicates. Fuzzy match columns must match above a specified similarity threshold. If you select Fuzzy Match as the match type the similarity threshold will be enabled. Only string types such as STR, NTEXT, TEXT and WSTR can be used with a fuzzy match filter, a warning will be shown if any other type is selected.
Exact Match - Fields must be exactly the same to be considered a match. The Similarity Threshold slider is disabled because matches must always have a similarity of 1.00.
Fuzzy Match - Strings are considered a match if their similarity is above the specified similarity threshold.
First Name Match - Strings are treated as first names, and common nicknames are treated as the same name. For example Bill and William would be an exact match.
Phone Number Match - Strings are parsed to attempt to identify a 10 digit phone number which is used for comparison, ignoring any other characters.
Street Address Match - Strings are treated as street addresses, and common street name prefixes and directions are treated as equivalent. For example N. and North, or Avenue and Av. would be treated as exact matches.
US Zip Code Match - Strings are parsed to attempt to identify a 5 digit zip code which is used for comparison, ignoring other characters.
Company Name - Strings are treated as company names, and common company suffixes such as INC, CORP and LLC are ignored when comparing.
- Similarity Threshold
When the match type other than ExactMatch is selected for a filter the similarity threshold slider will be enabled. The similarity threshold is a value between 0 and 1 that determines how similar the values need to be do be considered duplicates. A value of 0 would be not similar at all and a value of 1 would an exact match.
The Comparison Settings page of the Duplicate Detector Component allows you to specify the settings that will be used when comparing fields.
- Ignore Leading and Trailing White Space
When Ignore Leading and Trailing White Space is checked leading and trailing white space of string fields will be ignored during comparison.
- Ignore Case
When Ignore Case is checked case will be ignored when comparing fields.
- Ignore All White Space
When Ignore All White Space is checked all white space will be ignored when using the Company Name Filter.
- Ignore Punctuation
When Ignore Punctuation is checked all punctuation when using the Company Name Filter.
Error Handling page
The Error Handling page allows you to specify how errors should be handled when they happen.
There are three options available.
- Fail on error
- Redirect rows to error output
- Ignore error
When the Redirect rows to error output option is selected, rows that failed to write to the Duplicate Detector Component will be redirected to the 'Error Output' output of the component. As indicated in the screenshot below, the blue output connection represents rows that were successfully written, and the red 'Error Output' connection represents rows that were erroneous. The 'ErrorMessage' output column found in the 'Error Output' may contain the error message that was reported by server or the component itself.
Duplicate Detection and Output
The duplicate detector has an output called "Unique Rows" and an output called "Duplicate Rows". In order for a row to be considered a duplicate of another row all of the fields found in filter columns must be considered matches, either an exact match or a fuzzy match above the specified similarity threshold. If any one field fails to be a match the rows are not considered to be duplicates. If a row fails to be considered a match for every other row it is directed to the Unique Rows output.
When two rows are compared and are found to be a match they will be directed to the Duplicate Rows output.
The Duplicate Rows output has four calculated columns which are:
Richness Score - Each row is given a richness score which is meant to represent data quality. This score between 0 and 1 is equal to the number of fields that have data divided by the total number of fields.
Richness Rank - Each row in a duplicate group is given a rank based on how rich it's data is.
Similarity Score - Each row in a duplicate group except the first one is given a similarity score based on how similar it is the top ranked row in the group.
GroupID - Each row in a duplicate group is given an ID which is equal to the a hash of the values of the top ranked row in the group. There is an advanced property, SequentialGroupCode, which defaults to false, but when set to true will make group IDs sequential integers starting from 1.
Duplicate detection can be a performance intensive process, especially with large datasets and fuzzy matching. The following information can be used to help address performance issues:
- Exact match filters perform much better than fuzzy match filters; rows are compared for exact matches before any fuzzy matching is attempted. If you are able to add any exact match filters to your duplicate detector component performance will improve.
- When using a fuzzy match filter the higher the similarity threshold you set, the better the performance will be (and the fewer matches you will get). A high similarity threshold allows potential duplicates to be eliminated more quickly.
- There is an input column property, only available in the advanced editor, that is called MatchingPerformance. To access this property open the advanced editor of the duplicate detector component from the context menu, click on the Input and Output Properties tab, and expand the input columns node in the inputs and outputs tree. The MatchingPerformance property determines the balance between fuzzy match quality and performance and can be very useful to achieve desired levels of quality and performance. This property is a value between 0 and 1. It defaults to 0.75 and uses an asmyptotic scale, so small changes will have a bigger impact as they approach 1 and a smaller impact as they approach 0. If you are experiencing slow performance, especially with long strings (greater than 15 characters) try changing this value to something higher, such as 0.8 or 0.9.
- If your performance is good, and you would like better fuzzy matching precision, you can try lowering the MatchingPerformance value for a field. This will be especially useful for very short string fields (less than 8 characters). Try setting this value to 0.66 or 0.5 for improved fuzzy matching precision and lower performance.