Using the Premium Lookup Component
The Premium Lookup Component is an SSIS data flow pipeline component that can be used to search for rows in a lookup table based on fields from rows in a primary input. It offers a flexible way to perform lookup against any data sources. The component supports both exact match and approximate match (often referred to as fuzzy) for lookup purposes.
Each primary row is compared to the lookup table (or source) based on the specified filters and determined to either be an unmatched or matched row and is then directed to the applicable output. In the case of matched row, a single input row may produce multiple matched rows if there were several rows in the lookup table that were exact or approximate matches.
This component has 2 inputs:
|Lookup Table Input||The lookup table input contains the data that that the component will search for matching records.|
|Primary Input||The primary input rows are used to search the lookup table for a matching record.|
The component has 2 outputs:
|Output Name||Row Type||Description|
|Unmatched Rows||Primary Input Row||Primary input rows that did not match any rows in the lookup table.|
|Matched Rows||Primary Input Rows, Lookup Table Rows and Calculated Columns||When a primary input row matches a row in the lookup table a matched row is produced. A matched row has columns from the primary input, the lookup table and calculated columns that contain statistics about the match. If a primary row matches multiple rows in the lookup table a matched row will be directed for each matching row.|
The component includes the following four pages to configure how you want to search for data.
- Comparison Settings
- Error Handling
The General page of the Premium Lookup Component allows you to specify the general settings of the component.
- Fields to Compare grid
The Fields To Compare grid will display all the available fields from the lookup table. Each of these fields can be paired to a field in the primary input. For every paired field a lookup type must be chosen.
- Lookup Type Column
The lookup type for paired fields can be set to either Exact Match, the generic Fuzzy Match, or a more specific match type. Exact match columns must be matched exactly to be considered matches. Fuzzy Match columns must meet a similarity threshold. If you select Fuzzy Match as the lookup type the similarity threshold will be enabled. Only string types such as VARCHAR, NVARCHAR, NTEXT and TEXT can be used with a fuzzy match filter, a warning will be shown if any other type is selected. The available match types are:
Exact Match - Fields must be exactly the same to be considered a match. The Similarity Threshold slider is disabled because matches must always have a similarity of 100%.
Fuzzy Match - Strings are considered a match if their similarity is above the specified similarity threshold.
First Name Match - Strings are treated as first names, and common nicknames are treated as the same name. For example Bill and William would be an exact match.
Phone Number Match - Strings are parsed to attempt to identify a 10 digit phone number which is used for comparison, ignoring any other characters.
Street Address Match - Strings are treated as street addresses, and common street name prefixes and directions are treated as equivalent. For example N. and North, or Avenue and Av. would be treated as exact matches.
US Zip Code Match - Strings are parsed to attempt to identify a 5 or 9 digit zip code which is used for comparison, ignoring other characters.
Company Name - Strings are treated as company names, and common company suffixes such as INC, CORP and LLC are ignored when comparing.
- Similarity Threshold Column
When the a lookup type other than Exact Match is selected for a paired fields the similarity threshold slider will be enabled. The similarity threshold is a value between 1% and 100% that determines how similar the values need to be do be considered matches. A value of 1% would be not similar at all and a value of 100% would an exact match.
- Advanced Settings Column
The advanced properties displays which advanced settings have been selected. If you edit this column, and click on the button a dialog will launch that allows you to edit the advanced settings.
- Advanced Settings - Trim Leading and Trailing Whitespace
Selecting the Trim Leading and Trailing Whitespace option will remove all whitespace at the beginning and end of the primary field and the lookup field when comparing.
- Advanced Settings - Ignore Case
Selecting the Ignore Case option will ignore case of the primary field and the lookup field when comparing.
- Advanced Settings - Ignore All Whitespace
Selecting the Ignore All Whitespace option will remove all whitespace from the primary field and the lookup field when comparing.
- Advanced Settings - Ignore All Punctuation
Selecting the Ignore All Punctuation option will remove all punctuation from the primary field and the lookup field when comparing.
- Advanced Settings - Matching Performance
The Matching Performance slider will determine the balance between quality and speed of your results. Increasing this value will result in faster performance but fewer matches found in the lookup table. Default values are set for each of the fuzzy match types, but this can be adjusted if performance is found to be too slow, or if accuracy is too low.
- Hide Unpaired Fields
When the Hide Unpaired Fields checkbox is checked only paired reference table columns will be shown.
- Hide Paired Fields
When the Hide Paired Fields checkbox is checked only unpaired reference table columns will be shown.
- Refresh Component Button
Clicking the Refresh Component button will reload input columns from the reference table and primary inputs.
- Pair Unpaired Fields Button
Clicking the Pair Unpaired Fields button will attempt to pair all the reference table input fields to primary input fields with matching names.
- Clear All Pairings Button
Clicking the Clear All Pairings button will clear all pairings between reference table and primary fields.
The Comparison Settings page of the Premium Lookup Component allows you to specify the settings that will be used when comparing fields.
- Max Number of Matches
The Max Number of Matches property determines the maximum number of matches that will be found for each primary input row. If the number of matches is higher than this number the best matches will be displayed.
- Culture Identifier
The Culture Identifier allows better string comparison by taking cultural specific string comparison rules into account.
- Columns for Similarity of Each Compared Field
When the Columns for Similarity of Each Compared Field is checked a column will be added to matched output for each paired field. The value in each field will be the similarity of those fields for the compared row. The name of each column will be the name of the lookup field preceded by the _similarity_ prefix.
The Columns page of the Premium Lookup component allows you to specify which columns will be part of the output. Columns can be toggled for the Matched Rows output and the Unmatched Rows output. Available columns are all the input columns, and additionally there are 4 calculated columns in the Matched Rows output.
Calculated Column - Average Similarity Score
The Average Similarity Score column displays the average similarity score of all compared fields. Fields that are an exact match have a similarity of 100% and fields that were are a fuzzy match will have a score less than 100%.
Calculated Column - Richness Score
The Richness Score column displays the richness of the row from the lookup table. Richness is calculated based on number of fields are are null or empty. A row with data in every field would have a richness of 1.
Calculated Column - Richness Rank
The Richness Rank column displays the rank of the row based on its richness relative to the other matches found. For example if 5 matches are found the match with the highest richness will be ranked 1 and the match with the lowest richness will be ranked 5.
Calculated Column - Confidence Score
The Confidence column displays the confidence score for the match, which is a representation of how probable it is that this is the correct match. If there is a single match, the confidence will be the same as the average similarity for the row. The highst confidence scores will come from a single match with very high similarity. If there are multiple matches the matches will each be assigned a weight based on their similarity relative to the similarity of other matches, and will be assigned some fraction of the total confidence. The more matches that are found the lower the confidence will be for each match.
Error Handling page
The Error Handling page allows you to specify how errors should be handled when they happen.
There are three options available.
- Fail on error
- Redirect rows to error output
- Ignore error
When the Redirect rows to error output option is selected, rows that failed to write to the Premium Lookup Component will be redirected to the 'Error Output' output of the component. As indicated in the screenshot below, the blue output connection represents rows that were successfully written, and the red 'Error Output' connection represents rows that were erroneous. The 'ErrorMessage' output column found in the 'Error Output' may contain the error message that was reported by server or the component itself.
Lookup from a reference table can be a performance intensive process, especially with large datasets and fuzzy matching. The following information can be used to help address performance issues:
- Exact match filters perform much better than fuzzy match filters; rows are compared for exact matches before any fuzzy matching is attempted. If you are able to add any exact match filters to your premium lookup component performance will improve.
- When using a fuzzy match filter the higher the similarity threshold you set, the better the performance will be (and the fewer matches you will get). A high similarity threshold allows potential matches to be eliminated more quickly.
- The MatchingPerformance setting determines the balance between fuzzy match quality and performance and can be very useful to achieve desired levels of quality and performance. This property is a value between 0 and 1. It defaults to 0.75 for fuzzy match comparison type and uses an asymptotic scale, so small changes will have a bigger impact as they approach 1 and a smaller impact as they approach 0. If you are experiencing slow performance, especially with long strings (greater than 15 characters) try changing this value to something higher, such as 0.8 or 0.9. Each of the fuzzy lookup types has a default matching performance value which best suits that type of data.
- If your performance is good, and you would like better fuzzy matching precision, you can try lowering the MatchingPerformance value for a field. This will be especially useful for very short string fields (less than 8 characters). Try setting this value to 0.66 or 0.5 for improved fuzzy matching precision and lower performance.