Using the Data Profiler Component

The Data Profiler Component is an SSIS data flow pipeline component that can be used to analyze data and to compare rows from upstream data sources. Rows from any inputs will be passed through the component to corresponding outputs and once all the rows have been processed the component will output a single row to the "DataProfiler Output" with results of data analysis. Some profile types will create an additional output that will provide rows of relevant data.

The component includes one page to configure how you want to profile data.

Data Profiler Configuration

The data profiler configuration page allows you to add and configure profiles to the data profiler.

SSIS data profiler

Add button

The Add button will add a new profile to the profile list.

Remove button

Remove the selected profile from the profile list.

Profile List - Profile Category column

The category of the profile. This property filters the list of profile types available in the Type column.

Profile List - Profile Type column

The type of profile. This property determines what type the profile will be, and changes the properties in the profile properties grid to match the properties available for that profile type.

Profile List - Output / Output Column Name column

The name of the column in the DataProfilerOutput or the output where the data from the profile will be sent.  Some profiles output to a column in the DataProfilerOutput and some profiles have their own output.

Add Preconfigured Profile button

The Add Preconfigured Profile button allows common sets of profiles to be added more easily.

SSIS data profiler - preconfigured profile

After selecting the profile type one or more input columns will need to be selected. Using this information several profiles will be generated, configured and added to the component.

Profile Type Summary

Statistics Profiles
  • Standard Deviation: The standard deviation of a numeric column.
  • Min Value: The minimum value of a numeric or date column.
  • Max Value: The maximum value of a numeric or date column.
  • Mean Value: The mean value of a numeric column.
  • Count: The total number of rows.
Column Length Profiles
  • Column Length Min: The minimum length of a string column.
  • Column Length Max: The maximum length of a string column.
  • Column Length Distinct: The number of distinct string column lengths in the data.
  • Column Length Summary: A new output that lists all the distinct lengths that were found, and how many of each were found.
Null Value Profiles
  • Null Value Count: The number of null values in the column.
  • Not Null Count: The number of not null values in the column.
  • Null Value Ratio: The ratio of null values to not null values. A value of 1 would mean every row is null, a value of 0 would mean there were no null values.
Column Pattern Profiles
  • Column Pattern Match Count: The number of rows that match the specified regular expression.
  • Column Pattern Match Ratio: The ratio of rows that match the specified regular expression. A value of 1 would mean every row matched, a value of 0 would mean that no rows matched.
  • Column Pattern Match Violations: A new output that lists all of the values that violated the pattern.
Column Value Distribution
  • Column Value Distinct: The number of distinct values in the column.
  • Column Value Distinct Ratio: The ratio of distinct rows to total rows. A value of 1 would mean every row was distinct, a value of 0 would mean that every row was the same.
  • Column Value Summary: A new output that lists all of the distinct column values and their counts.
Candidate Key Profiles
  • Is Candidate Key: Determines if the selected columns could form a key. This will be true if the composite key formed by the data in each row is unique.
  • Distinct Key Count: The number of distinct composite keys formed by the selected columns.
  • Distinct Key Ratio: The ratio of distinct keys to total number of keys.
  • Key Count: The total number of rows in the selected columns.
  • Candidate Key Violations: A new output that lists all of the duplicate keys that were found, and how many times each duplicate was found.
Functional Dependency Profiles
  • Is Functional Dependency: Determines if the selected dependency columns are functionally dependent on the determinant columns.  For example if the dependency column was Zip Code and the determinate column was State then every unique zip code should have the same state in its row, any rows where this was not the case would be a violation.
  • Dependency Violation Count: The number of dependency violations in the data.
  • Dependency Violation Ratio: The ratio of dependency violations to the total number of rows  A value of 1 would mean that every row was a violation, a value of 0 would mean there were no violations.
  • Row Count: The total number of rows in the selected columns.
  • Dependency Summary: A new output that lists all the unique determinent columns found for each group of dependency columns.  The most frequently encountered determinent columns are considered to be the correct dependent columns, any less frequently occurring determinant columns are considered violations.
Value Inclusion Profiles
  • Value Inclusion Count: The total number of rows where the input row value was found in the lookup column.
  • Value Inclusion Count: The ratio of the number of value inclusions to the total rows. A value of 1 would mean that every input row was found in the lookup column, a value of 0 would mean that none of the input rows were found in the lookup column.
  • Value Inclusion Violations: A new output that list the items in the input column that were not found in the lookup table.

Profile Input Type Summary

Different profiles are compatible with different input types. The 4 relevant types are: String, Number, Date and Object (anything else).

Profile Name Number String Date Object
StandardDeviation yes no no no
MinValue yes no yes no
MaxValue yes no yes no
MeanValue yes no no no
Count yes yes yes yes
ColumnLengthMin no yes no no
ColumnLengthMax no yes no no
ColumnLengthDistinct no yes no no
ColumnLengthSummary no yes no no
NullValueRatio yes yes yes yes
NullValueCount yes yes yes yes
NotNullCount yes yes yes yes
ColumnPatternMatchCount no yes no no
ColumnPatternMatchRatio no yes no no
ColumnPatternViolations no yes no no
ColumnValueDistinct yes yes yes yes
ColumnValueDistinctRatio yes yes yes yes
ColumnValueSummary yes yes yes yes
IsCandidateKey yes yes yes yes
DistinctKeyCount yes yes yes yes
DistinctKeyRatio yes yes yes yes
KeyCount yes yes yes yes
CandidateKeyViolations yes yes yes yes
ValueInclusionRatio yes yes yes yes
ValueInclusionCount yes yes yes yes
ValueInclusionViolations
yes yes yes yes
IsFunctionalDependency yes yes yes yes
DependencyViolationCount yes yes yes yes
DependencyViolationRatio yes yes yes yes
RowCount yes yes yes yes
DependencySummary yes yes yes yes