Using the Hadoop Connection Manager

The Hadoop Connection Manager is an SSIS connection manager component that can be used to establish connections with Hadoop.

To add a Hadoop connection to your SSIS package, right-click the Connection Manager area in your Visual Studio project and choose "New Connection..." from the context menu. The "Add SSIS Connection Manager" window appears. Select the "Hadoop" item to add the new Hadoop connection manager.

The Hadoop Connection Manager contains the following two pages, which configure how you connect to Hadoop:

  • General
  • Advanced Settings

General page

The General page on the Hadoop Connection Manager allows you to specify general settings for the connection. 

Authentication Mode

This option allows you to select the authentication mode used to connect to Hadoop. Available options are:

  • Basic
  • Kerberos

WebHDFS Host

This option allows you to specify the HDFS server to which you will connect.

Use HTTP (Not recommended)

This option allows you to choose whether to connect over the HTTP protocol, which is unencrypted.

WebHDFS Port

The port number on which WebHDFS is running.

WebHDFS User

This option allows you to specify the user as which you will connect.
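To illustrate how the host, port, user, and protocol settings fit together, here is a minimal Python sketch that assembles a WebHDFS REST URL. The `/webhdfs/v1` path prefix and the `op` and `user.name` query parameters come from the public WebHDFS REST API; the function name, the host, port, and user values, and the assumption that HTTPS is used when "Use HTTP" is not selected are illustrative only and are not taken from the component itself.

```python
from urllib.parse import quote, urlencode


def build_webhdfs_url(host, port, user, path="/", op="LISTSTATUS", use_http=False):
    """Build a WebHDFS REST URL from connection-style settings (hypothetical helper)."""
    # Assumption: HTTPS is used unless the "Use HTTP" option is selected.
    scheme = "http" if use_http else "https"
    # user.name is the WebHDFS parameter for simple (non-Kerberos) authentication.
    query = urlencode({"op": op, "user.name": user})
    return f"{scheme}://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Placeholder values; substitute the settings of your own cluster.
url = build_webhdfs_url("namenode.example.com", 9871, "hdfs_user", "/tmp")
print(url)
```

With Kerberos authentication the user is normally established through the Kerberos handshake rather than the `user.name` parameter, so this sketch corresponds most closely to the Basic mode.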

Password (Available only when using Kerberos Authentication mode)

This option allows you to specify a password when connecting using Kerberos authentication.

Domain (Available only when using Kerberos Authentication mode)

This option allows you to specify a domain name when connecting using Kerberos authentication.

Upload Chunk Size

The Upload Chunk Size option allows you to specify the size of the chunks into which a large file is divided, so that the file can be uploaded sequentially, one chunk at a time.
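The idea of dividing a file into fixed-size chunks can be sketched in Python as follows. This is only an illustration of the general technique, not the component's implementation; the function name `iter_chunks` and the chunk size used here are assumptions made for the example.

```python
import io


def iter_chunks(stream, chunk_size):
    """Yield successive chunks of at most chunk_size bytes from a binary stream."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream reached
        yield chunk


# A 10-byte in-memory "file" split into chunks of 4 bytes.
data = io.BytesIO(b"x" * 10)
sizes = [len(chunk) for chunk in iter_chunks(data, 4)]
print(sizes)  # [4, 4, 2]
```

Each chunk would then be sent in order, so only one chunk needs to be held in memory at a time.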

Timeout (secs)

The Timeout (secs) option allows you to specify a timeout value in seconds for the connection. The default value is 120 seconds.

Retry on Intermittent Errors

This option is designed to help recover from intermittent outages or disruptions of service, so that the integration does not have to be stopped because of such temporary issues. Enabling this option allows service calls to be retried upon certain types of failure. A service call may be retried up to 3 times before an exception is raised. Retries occur after 0 seconds, 15 seconds, and 60 seconds.

Warning: Although we have carefully designed this feature so that retries happen only when it is deemed safe to do so, in rare cases a retried service call could result in the creation of duplicate data.
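The retry schedule described above (up to 3 retries, after 0, 15, and 60 seconds) can be sketched in Python as follows. This is a sketch under stated assumptions rather than the component's actual logic: `TransientError` is a hypothetical marker for the failures considered safe to retry, and the `sleep` parameter exists only so that the example runs without real waiting.

```python
import time

RETRY_DELAYS = (0, 15, 60)  # seconds to wait before each retry, per the text above


class TransientError(Exception):
    """Hypothetical marker for failures that are considered safe to retry."""


def call_with_retry(operation, delays=RETRY_DELAYS, sleep=time.sleep):
    """Run operation; on TransientError, retry once after each delay in turn."""
    for delay in delays:
        try:
            return operation()
        except TransientError:
            sleep(delay)  # wait before the next retry
    return operation()  # final retry; any error now propagates to the caller


# Demo: an operation that fails twice, then succeeds. Delays are recorded, not slept.
calls = []
waits = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("temporary outage")
    return "ok"


result = call_with_retry(flaky, sleep=waits.append)
print(result, len(calls), waits)  # ok 3 [0, 15]
```

Errors other than `TransientError` are not caught and propagate immediately, which mirrors the idea that only certain types of failure are retried.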

Test Connection

After all the connection information has been provided, you may click the Test Connection button to test if the connection settings entered are valid.

Advanced Settings

The Advanced Settings page on the Hadoop Connection Manager allows you to specify some advanced and optional settings for the connection.

Proxy Mode

The Proxy Mode option allows you to specify how you want to configure the proxy server setting. There are three options available:

  • No Proxy
  • Auto-detect (Using system configured proxy)
  • Manual

Proxy Server

The Proxy Server option allows you to specify the name of the proxy server for the connection.

Port

The Port option allows you to specify the port number of the proxy server for the connection.

Username (Proxy Server Authentication)

The Username option (under Proxy Server Authentication) allows you to specify the proxy user account.

Password (Proxy Server Authentication)

The Password option (under Proxy Server Authentication) allows you to specify the proxy user's password.

Note: The Proxy Password is not included in the connection manager's ConnectionString property by default. This is by design for security reasons. However, you can include it in your ConnectionString if you want to parameterize your connection manager. The format is ProxyPassword=myProxyPassword; (make sure that the entry ends with a semicolon). The entry can appear anywhere in the ConnectionString.
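As an illustration of the format described in the note, the following Python sketch appends a ProxyPassword entry to an existing connection string. The helper name `with_proxy_password` and the `Key1`/`Key2` entries are placeholders invented for the example; they are not the component's documented key names. Only the `ProxyPassword=...;` entry itself comes from the note above.

```python
def with_proxy_password(connection_string, proxy_password):
    """Return connection_string with a ProxyPassword=...; entry appended."""
    base = connection_string.strip()
    # Make sure the existing entries are terminated before adding a new one.
    if base and not base.endswith(";"):
        base += ";"
    # The entry must end with a semicolon, as the note requires.
    return f"{base}ProxyPassword={proxy_password};"


# Placeholder keys; a real ConnectionString will contain the component's own keys.
existing = "Key1=value1;Key2=value2"
print(with_proxy_password(existing, "myProxyPassword"))
# Key1=value1;Key2=value2;ProxyPassword=myProxyPassword;
```

The sketch does not handle passwords that themselves contain a semicolon; how such values should be written is not covered here.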