Using the HTML Table Source Component

The HTML Table Source Component is a source component that retrieves an HTML document from an HTTP request or a local file and extracts a table element into column data. Output columns can be generated automatically or manually. For each HTML table column, up to four SSIS output columns can be configured, depending on the data that needs to be extracted:

  • Text
  • HTML
  • Links
  • Images

General Page

The General page determines where your HTML document comes from. The connection manager property can be set to either an HTTP Connection Manager or a local file. This page also specifies which table to use for extraction and how to extract data from it.

General Settings
Connection Manager

The HTML Table Source Component requires a connection in order to parse data from an HTML table. The Connection Manager drop-down shows a list of all connection managers that are available to your current SSIS package.

The following connection managers are supported:

  • Local File
  • HTTP Connection Manager
  • <<File Content in Variable>> (since v21.1)

Local File Path

The path on the file system to the HTML document that will be extracted.

Input Variable

This option lets you select, from a drop-down list, an SSIS variable or parameter that your package has access to.

Get Table By

Specifies how to uniquely identify the table element to use for extraction. There are four ways to do this:

  • Class: specify the class name and the position of the HTML table to use for extraction.
  • Id: specify the id of the HTML table to use for extraction.
  • Position: specify the position of the HTML table to use for extraction.
  • XPath: specify an XPath expression pointing to the HTML table to use for extraction.
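
As a rough illustration of the four lookup modes, the following Python sketch selects a table from a small, well-formed sample page using only the standard library. It is not the component's implementation: the find_table function, the sample markup, and the assumption that positions are 1-based are all hypothetical, and real-world HTML usually needs a more tolerant parser than xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for this illustration.
PAGE = """
<html><body>
  <table class="data"><tr><td>first</td></tr></table>
  <table id="prices" class="data"><tr><td>second</td></tr></table>
  <div><table><tr><td>third</td></tr></table></div>
</body></html>
"""


def find_table(root, by, value=None, position=1):
    """Return one <table> element; positions are assumed to be 1-based."""
    if by == "id":
        matches = root.findall(f".//table[@id='{value}']")
    elif by == "class":
        # Exact attribute match only; multi-class attributes are not handled.
        matches = root.findall(f".//table[@class='{value}']")
    elif by == "position":
        matches = root.findall(".//table")
    elif by == "xpath":
        # ElementTree supports only a limited subset of XPath.
        matches = root.findall(value)
    else:
        raise ValueError(f"unknown selector: {by}")
    if not 1 <= position <= len(matches):
        raise LookupError(f"no table at position {position} for {by}={value!r}")
    return matches[position - 1]


root = ET.fromstring(PAGE.strip())
by_id = find_table(root, "id", "prices")
by_class = find_table(root, "class", "data", position=2)
by_position = find_table(root, "position", position=3)
by_xpath = find_table(root, "xpath", ".//div/table")
print([t.find(".//td").text for t in (by_id, by_class, by_position, by_xpath)])
```

In this sketch, Class combines a class name with a position because several tables can share a class, whereas Id and XPath are expected to identify a single table on their own.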

Advanced Settings
Header Row Position

The position of the row in the HTML table to use as the header row. If the table has no header row, specify 0.

Top Rows To Skip

The number of rows at the top of the HTML table (after the header row position) to ignore.

Bottom Rows To Skip

The number of rows at the bottom of the HTML table to ignore.
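
The way these three settings interact can be sketched in Python as follows. This is only an illustration under stated assumptions: the select_data_rows function and its parameter names are hypothetical, the header row position is assumed to be 1-based (with 0 meaning no header row), and rows above the header row are assumed to be discarded.

```python
def select_data_rows(rows, header_row_position=1,
                     top_rows_to_skip=0, bottom_rows_to_skip=0):
    """Split table rows into (header, data) using assumed 1-based positions."""
    if header_row_position > 0:
        header = rows[header_row_position - 1]
        body = rows[header_row_position:]  # assumption: rows above the header are dropped
    else:
        header = None  # 0 means the table has no header row
        body = list(rows)
    body = body[top_rows_to_skip:]  # skipped rows are counted after the header
    if bottom_rows_to_skip > 0:
        body = body[:-bottom_rows_to_skip]
    return header, body


# Hypothetical table: a title row, the header, a units row, data, and a total.
rows = [
    ["Report title"],
    ["Name", "Price"],
    ["(units: USD)"],
    ["Apple", "1.20"],
    ["Pear", "0.90"],
    ["Total", "2.10"],
]
header, data = select_data_rows(rows, header_row_position=2,
                                top_rows_to_skip=1, bottom_rows_to_skip=1)
print(header)
print(data)
```

With these settings the units row and the total row are dropped, leaving only the Apple and Pear rows as data.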

Trim Whitespace

This option tells the component to trim leading and trailing whitespace from '_Text' columns.

Note: '_Text' columns are decoded using HTML decoding; however, whitespace trimming takes place after decoding. For example, if the HTML table has a cell with the content '&nbsp;Hello&nbsp;World&nbsp;' and Trim Whitespace is enabled, the output is 'Hello World'.
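
The order of operations matters here, as the following Python sketch suggests. It uses the standard library's html.unescape as a stand-in for the component's HTML decoding; the decode_then_trim function is hypothetical and the component's exact decoding and trimming rules may differ.

```python
import html


def decode_then_trim(cell_text, trim_whitespace=True):
    """Decode HTML entities first, then optionally trim the decoded text."""
    decoded = html.unescape(cell_text)
    return decoded.strip() if trim_whitespace else decoded


raw = "&nbsp;Hello&nbsp;World&nbsp;"

# Decoding first turns the outer &nbsp; entities into whitespace that can be trimmed.
trimmed = decode_then_trim(raw)

# Trimming first finds nothing to remove, because '&nbsp;' is still literal text.
wrong_order = html.unescape(raw.strip())

print(repr(trimmed))
print(repr(wrong_order))
```

In Python, the space that remains between the two words is still a non-breaking space character (U+00A0), which displays like an ordinary space.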

Refresh Columns

This button populates the output columns using the settings on the General page.

Expression fx icon

Click the blue fx icon to launch the SSIS Expression Editor, which enables dynamic updates of the property at run time.

Generate Documentation Button

Click the Generate Documentation icon to generate a Word document that describes the component's metadata, including relevant mappings.

Columns Page

The Columns page allows the user to add, remove, and move columns. Each row in the data grid view represents one column in the HTML table. Each row has four checkbox cells, which represent the following output columns:

  • Text: This adds an output column with the name of the HTML table column suffixed with '_Text'. This column contains the inner text of the HTML table column (no tags). The name and data type properties can be adjusted in the property grid.
  • HTML: This adds an output column with the name of the HTML table column suffixed with '_HTML'. This column contains all of the HTML inside the HTML table column.
  • Links: This adds an output column with the name of the HTML table column suffixed with '_Links'. This column contains a list of href values inside anchor (<a>) tags inside the HTML table column delimited by the pipe ( | ) character. Limit the result to 1 by specifying the Link position in the property grid.
  • Images: This adds an output column with the name of the HTML table column suffixed with '_Images'. This column contains a list of src values inside image (<img>) tags inside the HTML table column delimited by the pipe ( | ) character. Limit the result to 1 by specifying the Image position in the property grid.
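
To make the four output columns concrete, here is a minimal Python sketch that derives them from the HTML of a single table cell using the standard library's html.parser module. The CellParser class, the extract_cell function, and the sample cell are hypothetical; positions are assumed to be 1-based, with 0 meaning "return the whole pipe-delimited list", which is an assumption rather than documented behavior.

```python
from html.parser import HTMLParser


class CellParser(HTMLParser):
    """Collects inner text, href values, and src values from one cell."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_data(self, data):
        self.text_parts.append(data)


def pick(values, position):
    """Return one value at an assumed 1-based position, or the full list joined by '|'."""
    if position > 0:
        return values[position - 1] if position <= len(values) else ""
    return "|".join(values)


def extract_cell(cell_html, column, link_position=0, image_position=0):
    parser = CellParser()
    parser.feed(cell_html)
    parser.close()
    return {
        f"{column}_Text": "".join(parser.text_parts),
        f"{column}_HTML": cell_html,
        f"{column}_Links": pick(parser.links, link_position),
        f"{column}_Images": pick(parser.images, image_position),
    }


cell = '<a href="/a">Alpha</a> and <a href="/b">Beta</a><img src="/logo.png">'
all_values = extract_cell(cell, "Name")
second_link = extract_cell(cell, "Name", link_position=2)
print(all_values)
print(second_link["Name_Links"])
```

Setting a link or image position in this sketch mirrors the property grid option that limits the result to a single value instead of the delimited list.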

The Columns page grid consists of:
  • Column Name: Column that will be retrieved.
  • Properties window for the field listed
    • Name: Specify the column name.
    • Data type: The data type can be changed accordingly.
    • Length: Specify the length of the field. If the data type is a string, the length specified here is the maximum size. If the data type is not a string, the length is ignored.
    • Precision: Specify the number of digits in a number.
    • Scale: Specify the number of digits to the right of the decimal point in a number.
    • CodePage: Specify the Code Page of the field.
    • Link Position: Specify the position of the link value inside the list of href values.
    • Image Position: Specify the position of the image value inside the list of src values.