Using the HTML Table Source Component

The HTML Table Source Component is a source component that retrieves an HTML document from an HTTP request or a local file and extracts a table element into column data. Output columns can be generated automatically or manually. For each HTML table column, up to four SSIS output columns can be configured, depending on the data that needs to be extracted:

  • Text
  • HTML
  • Links
  • Images
General Page

The General Page determines where the HTML document comes from. The connection manager property can be set to either an HTTP Connection Manager or a local file. This page also specifies which table to use for extraction and how to extract data from it.

SSIS HTML Table Source

Local File Path - The path on the file system to the HTML document that will be used for extraction.

Get Table By - Specifies how to uniquely identify the table element to use for extraction. There are four ways to do this:

  • Class - specify the class name and the position of the HTML table to use for extraction.
  • Id - specify the id of the HTML table to use for extraction.
  • Position - specify the position of the HTML table to use for extraction.
  • XPath - specify an XPath expression pointing to the HTML table to use for extraction.
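
The four lookup modes can be illustrated outside SSIS with a short Python sketch. This is not the component's implementation: the function name get_table, the sample page, and the 1-based position are assumptions made for illustration, and the sample is well-formed markup so that the standard library's limited XPath support can be used.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page with three tables.
PAGE = """<html><body>
<table class="layout"><tr><td>nav</td></tr></table>
<table id="prices" class="data"><tr><td>A</td></tr></table>
<table class="data"><tr><td>B</td></tr></table>
</body></html>"""


def get_table(root, by, value=None, position=1):
    """Return one <table> element, or None if nothing matches.

    by is one of "Class", "Id", "Position", "XPath".
    position is assumed to be 1-based and is only used by Class and Position.
    """
    tables = list(root.iter("table"))  # all tables in document order
    if by == "Class":
        matches = [t for t in tables if value in t.get("class", "").split()]
    elif by == "Id":
        matches = [t for t in tables if t.get("id") == value]
        position = 1
    elif by == "Position":
        matches = tables
    elif by == "XPath":
        matches = root.findall(value)  # ElementTree supports an XPath subset
        position = 1
    else:
        raise ValueError(f"unknown lookup mode: {by}")
    if not 1 <= position <= len(matches):
        return None
    return matches[position - 1]


def first_cell(table):
    """Text of the first <td> in the table, used here only for display."""
    return table.find(".//td").text


root = ET.fromstring(PAGE)
print(first_cell(get_table(root, "Class", "data", position=2)))
print(first_cell(get_table(root, "Id", "prices")))
print(first_cell(get_table(root, "Position", position=1)))
print(first_cell(get_table(root, "XPath", ".//table[@class='data']")))
```

Class is the only mode that combines a name with a position, because several tables on a page may share the same class.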

Header Row Position - The position of the row in the HTML table to use as the header row. If the table has no header row, specify 0.

Top Rows To Skip - The number of rows at the top of the HTML table (after the header row position) to ignore.

Bottom Rows To Skip - The number of rows at the bottom of the HTML table to ignore.
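
As a rough illustration of how these three settings interact, the following Python sketch applies them to a table that has already been read into a list of rows. The function name slice_rows and the sample data are hypothetical, and the sketch assumes that any rows above the header row are discarded, which the component may or may not do.

```python
def slice_rows(rows, header_row_position=1, top_rows_to_skip=0, bottom_rows_to_skip=0):
    """Split rows into (header, data_rows) using the three settings.

    header_row_position is 1-based; 0 means the table has no header row.
    """
    if header_row_position > 0:
        header = rows[header_row_position - 1]
        body = rows[header_row_position:]  # everything after the header row
    else:
        header = None
        body = list(rows)
    body = body[top_rows_to_skip:]  # drop rows directly after the header
    if bottom_rows_to_skip > 0:
        body = body[:-bottom_rows_to_skip]  # drop trailing rows such as totals
    return header, body


rows = [
    ["Name", "Qty"],   # header row (position 1)
    ["(units)"],       # a note row to skip at the top
    ["a", "1"],
    ["b", "2"],
    ["Total", "3"],    # a summary row to skip at the bottom
]
header, data = slice_rows(rows, header_row_position=1, top_rows_to_skip=1, bottom_rows_to_skip=1)
print(header)
print(data)
```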

Trim Whitespace - This option will tell the component to trim whitespace from '_Text' columns.

NOTE: '_Text' columns are decoded using HTML decoding; however, whitespace trimming takes place after decoding. For example, if the HTML table had a cell with the content of ' Hello World ' and Trim Whitespace was enabled, the output would be 'Hello World'.
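
The order of operations matters when the whitespace in a cell is written as HTML entities. The Python sketch below shows the decode-then-trim order using the standard library; the function name decode_text is hypothetical, and the cell content with &nbsp; entities is an assumed example rather than one taken from the component's documentation.

```python
import html


def decode_text(raw_cell, trim_whitespace=True):
    """HTML-decode the cell text first, then optionally trim whitespace."""
    decoded = html.unescape(raw_cell)  # "&nbsp;" becomes U+00A0, "&amp;" becomes "&"
    # str.strip() removes Unicode whitespace, including the non-breaking space.
    return decoded.strip() if trim_whitespace else decoded


print(repr(decode_text("&nbsp;Hello World&nbsp;")))
print(repr(decode_text("&nbsp;Hello World&nbsp;", trim_whitespace=False)))
```

If trimming ran before decoding, the entity text would not count as whitespace and the decoded value would keep its leading and trailing non-breaking spaces.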

Refresh Columns - This will populate the output columns using the settings on the General Page.

Columns Page

The Columns Page lets you add, remove, and reorder columns. Each row in the data grid view represents one column in the HTML table. Each row has four checkbox cells, which correspond to the following output columns:

  • Text - This adds an output column with the name of the HTML table column suffixed with '_Text'. This column contains the inner text of the HTML table column (no tags). The name and data type properties can be adjusted in the property grid.
  • HTML - This adds an output column with the name of the HTML table column suffixed with '_HTML'. This column contains all of the HTML inside the HTML table column.
  • Links - This adds an output column with the name of the HTML table column suffixed with '_Links'. This column contains a pipe-delimited ( | ) list of the href values of the anchor (<a>) tags inside the HTML table column. To limit the result to a single link, specify the Link position in the property grid.
  • Images - This adds an output column with the name of the HTML table column suffixed with '_Images'. This column contains a pipe-delimited ( | ) list of the src values of the image (<img>) tags inside the HTML table column. To limit the result to a single image, specify the Image position in the property grid.
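
To make the four output columns concrete, here is a minimal Python sketch that derives comparable values from the HTML of a single cell. It uses only the standard library and is not the component's code: the names CellParser and extract_cell, the 1-based position arguments, and the exact delimiter (a bare pipe with no surrounding spaces) are assumptions.

```python
from html.parser import HTMLParser


class CellParser(HTMLParser):
    """Collects inner text, href values, and src values from one cell."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_cell(cell_html, link_position=0, image_position=0):
    """Return Text, HTML, Links, and Images values for one cell.

    A position of 0 returns every value joined by "|"; a 1-based position
    returns only that value (or an empty string if it does not exist).
    """
    parser = CellParser()
    parser.feed(cell_html)
    parser.close()  # flush any buffered text

    def pick(values, position):
        if position > 0:
            return values[position - 1] if position <= len(values) else ""
        return "|".join(values)

    return {
        "Text": "".join(parser.text_parts),
        "HTML": cell_html,
        "Links": pick(parser.links, link_position),
        "Images": pick(parser.images, image_position),
    }


cell = '<a href="/a">One</a> <img src="x.png"/> <a href="/b">Two</a>'
print(extract_cell(cell))
print(extract_cell(cell, link_position=2)["Links"])
```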

SSIS HTML Table Source - Columns