Using the Premium PDF Source

The Premium PDF Source is an SSIS source component that extracts data from tables in PDF files. The component is configured on two pages:

  • General
  • Columns

General Page

The General page of the Premium PDF Source allows you to specify the general settings of the component.

SSIS Premium PDF Source - General Page

Source File Settings
Connection Manager

The Premium PDF Source component requires a connection manager to access the PDF file. The Connection Manager drop-down lists all connection managers available in the current SSIS package. The supported connection managers are listed below:

  • Local File
  • FTPS Connection Manager
  • SFTP Connection Manager
  • Amazon S3 Connection Manager
  • Azure Blob Connection Manager
  • Azure Data Lake Storage Connection Manager
  • Azure Files Connection Manager
  • Box Connection Manager
  • Dropbox Connection Manager
  • Hadoop Connection Manager
  • Google Cloud Storage Connection Manager
  • Google Drive Connection Manager (since v21.2)
  • OneDrive Connection Manager
  • WebDAV Connection Manager
  • SharePoint Connection Manager (offered with the SSIS Integration Toolkit for Microsoft SharePoint)
Source File Path

The Source File Path specifies the location of the PDF file that you want to read from. Click the ellipsis button ('...') to open a browse dialog and select an item.

Password to Open

This option is used to specify the password to open the PDF file. If the PDF file is not encrypted, you can leave this field blank.

Locale

Specify the Locale of the file.

Configure Table Detection
Combine Tables Strategy

This option can be used to specify how to combine tables across multiple pages from within the PDF file. The available options are:

  • None
  • CombineAcrossPages
  • CombineAll (since v21.1)
    • Skip Tables at Start: This option specifies how many tables to exclude at the beginning of the file.
    • Skip Tables at End: This option specifies how many tables to exclude at the end of the file.
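
As a rough illustration of how CombineAll and the two skip settings might interact, the following Python sketch models detected tables as lists of rows. It is a conceptual model only, not the component's implementation: the function name combine_all, the parameters skip_start and skip_end, and the sample data are all hypothetical, and the sketch assumes that the remaining tables are simply concatenated in reading order.

```python
from typing import List

Row = List[str]
Table = List[Row]


def combine_all(tables: List[Table], skip_start: int = 0, skip_end: int = 0) -> Table:
    """Concatenate the rows of all detected tables, ignoring some at each end.

    Hypothetical model of the CombineAll strategy (assumption, not vendor code).
    """
    if skip_start < 0 or skip_end < 0:
        raise ValueError("skip counts must be non-negative")
    end = len(tables) - skip_end
    # If the skip counts cover every table, nothing is left to combine.
    kept = tables[skip_start:end] if end > skip_start else []
    return [row for table in kept for row in table]


# Three tables in reading order; the first is a title block to exclude.
detected = [
    [["Report", "2024"]],
    [["id", "name"], ["1", "Ann"]],
    [["2", "Bob"]],
]
combined = combine_all(detected, skip_start=1)
```

In this model, skipping one table at the start leaves the header row and the two data rows in a single combined table.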
Skip Empty Rows

Enabling this option will skip empty rows in the table.

Right Shift Misaligned Cells

This option lets you choose how misaligned cells are handled: Shift, Shift to End, or Do Not Shift.

Configure Source
Locate Table Strategy

This option allows you to choose the table from the PDF file you want to extract data from.

Column Header Row Index

This field is enabled when “Table Contains Column Names” is checked. It specifies the index of the row that contains the column headers.

Table Contains Column Names

Enable this option if the table contains a row of column names. When it is enabled, the Column Header Row Index field becomes available.

Data Start Row Index

Specify the row index for the data start position in the table.

Read to End

Enable this option to read until the end of the file.

Max Number of Rows

When “Read to End” is unchecked, this field is active. It allows you to specify the number of rows to read.
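
The way these row settings might fit together can be sketched in Python. This is only an illustrative model, not the component's code: the function read_table and its parameter names are hypothetical, row indexes are assumed to be zero-based, and empty rows are assumed to be dropped before the row limit is applied. The documentation does not confirm any of these details.

```python
from typing import List, Optional, Tuple

Row = List[str]


def read_table(
    rows: List[Row],
    has_header: bool = True,          # "Table Contains Column Names"
    header_row_index: int = 0,        # "Column Header Row Index" (zero-based: assumption)
    data_start_row_index: int = 1,    # "Data Start Row Index" (zero-based: assumption)
    read_to_end: bool = True,         # "Read to End"
    max_rows: Optional[int] = None,   # "Max Number of Rows"
    skip_empty_rows: bool = False,    # "Skip Empty Rows"
) -> Tuple[Optional[Row], List[Row]]:
    """Return (header, data rows) selected by the settings above."""
    header = rows[header_row_index] if has_header else None
    data = rows[data_start_row_index:]
    if skip_empty_rows:
        # A row counts as empty when every cell is blank (assumption).
        data = [r for r in data if any(cell.strip() for cell in r)]
    if not read_to_end and max_rows is not None:
        data = data[:max_rows]
    return header, data


table = [["id", "name"], ["1", "Ann"], ["", ""], ["2", "Bob"], ["3", "Cy"]]
header, data = read_table(table, read_to_end=False, max_rows=2, skip_empty_rows=True)
```

With Read to End disabled and a limit of two rows, the sketch returns the header row and the first two non-empty data rows.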

Refresh Component

Clicking the Refresh Component button retrieves the latest metadata from the PDF file you have specified. A status message then indicates how many fields have been updated, added, or deleted.

Expression fx Button

Click the fx button to launch the SSIS Expression Editor, which lets you set the property dynamically at run time.

Columns Page

The Columns page of the Premium PDF Source Component shows you all available attributes from the source that you specified on the General page.

SSIS Premium PDF Source - Columns Page

The checkbox at the top left of the grid toggles the selection of all available fields. This is a quick way to check or uncheck all of them at once.

The Columns Page grid consists of:

  • PDF Field: Column that will be retrieved from the PDF File.
  • Data Type: The data type of this field.
  • Properties window for the selected field:
    • Name: Specify the column name.
    • Data Type: Change the data type of the field as needed.
    • Length: Specify the length of the field.
    • Column Header Name Or Index: Specify the column header name or index.
    • Has Multiple Lines: Boolean that specifies whether the field contains multiple lines.
  • + sign: Add a field to the PDF file.
  • - sign: Remove a field from the PDF file.
  • Arrows: Move the fields to a desired location in the file.

Preview

The Preview button at the bottom of the Premium PDF Source component can be used to preview the table that was detected in your PDF file based on the configured settings.

SSIS Premium PDF Source - Preview PDF Table Data

Combine Tables Strategy

This option can be used to specify how to combine tables across multiple pages from within the PDF file. The available options are:

  • None
  • CombineAcrossPages
  • CombineAll (since v21.1)
    • Skip Tables at Start: This option specifies how many tables to exclude at the beginning of the file.
    • Skip Tables at End: This option specifies how many tables to exclude at the end of the file.
Skip Empty Rows

Enable to skip empty rows.

Right Shift Misaligned Cells

This option lets you choose how misaligned cells are handled: Shift, Shift to End, or Do Not Shift.

Multi-line Date Columns

Select the columns that are multi-line fields.

Apply Settings to Component

This option applies the settings changed on the Preview page to the component.

Close

Closes the Preview page.