Recently I worked on a project to complete an upgrade from Microsoft SQL Server 2008 R2 to SQL Server 2014. Before the upgrade could begin, I created new packages to source data from the client’s operational data store, which was recently migrated to a PostgreSQL database. Communicating between database engines can be tricky, and I encountered a number of issues that were not simple to fix. Since I couldn’t find immediate solutions online, I hope this article will help fill in some gaps that you might face when connecting to PostgreSQL.
The scope and design of this project was straightforward: source data from the PostgreSQL database with minimal transformations and create a MD5 hash value at the source to determine Type 1 updates. My colleague, Gilbert Sharp, recently wrote a great article discussing hash values and surrogate keys, Surrogate Keys in a Business Intelligent Ecosystem.
First I had to determine the best way to connect to a PostgreSQL database. A quick search online will deliver several blog posts with detailed steps on connecting to a PostgreSQL database using an ADO.NET source connection in SQL Server Integration Services (SSIS). Using this method is common, but after mapping the source columns to the target table, I received several error messages: “Cannot Convert between Unicode” and “Non Unicode String Data Type.” If you have developed in SSIS before you know that it is very particular about datatypes and conversions, and in my experience, unlike other tools, SSIS does not perform or allow implicit datatype conversions.
I determined that the database encoding was ‘SQL_ASCII,’ which pointed me to the ADO.NET connection as the culprit. Apparently, ADO.NET does an implicit conversion from varchar to nvarchar regardless of the source database encoding, so the conversion of the data type would need to be handled within SSIS.
The obvious solution would be to use either a data conversion or derived column transformation. Each component used in SSIS performance is impacted; in this case, it would be minimal but would definitely increase the development time for the project. In most cases, I prefer to use a derived column, since SSIS allows you to replace the existing column with the new derived column. This is more of a time saver for development since SSIS has the option to map your columns to the target table by name if the source column names are the same as the target. However, this would tremendously increase my development times. On average, I had 75 columns per table with the datatype of varchar for a majority of the columns. This is also why I opted not to use the data conversion task because the number of columns that would need to be created and then mapped to the target manually not only increased development time but also introduced the likelihood of mapping a source to target incorrectly.
While looking for alternatives, and after working through several trial scenarios, I remembered that I created the connection to the database using the 32-bit ODBC Administrator for the PostgreSQL ANSI driver. So why not use the ODBC type listed in the SSIS Connection Manager? I have to admit, this epiphany didn’t occur immediately due to our new SQL Server 2014 instance not being available and my contemplating the initial ETL architecture using Microsoft Visual Studio 2008 where creating and ODBC source connection option is not available.
Creating a connection within the SSIS Connection Manager with the type of ODBC allowed me to decrease development time and avoid potential performance issues by forcing SSIS to handle data conversion and adding extra components to the package. There are several blog posts that evaluate the performance of an ODBC source and Microsoft’s shifting stance on using ODBC data connections as well. With Microsoft Azure along with ODBC becoming a standard data source connection, many believe that Microsoft will continue to invest and develop ODBC connections within SSIS for better performance. I will not discuss this topic in this blog post, but I would suggest a quick online search to familiarize yourself with these topics.
While sourcing data using the 32-bit ODBC driver, there were a few errors I encountered specific to this source type. Here is some background on my packages that may have attributed to the ODBC issues I encountered:
One master package, which contained five execute packages tasked to call five different packages simultaneously
Each package called contained 10 containers that sourced data in parallel with no dependencies
Within each container the Data Flow Task sourced tables up to 120 columns wide with various data types and as many as 5.4 million records
The goal of this project was to source the data in the most efficient and time-saving manner, therefore we ran multiple packages at the same time
One error I encountered was, “Error HY000 Out of memory while reading tuples.” I was able to correct this issue by ensuring the “Use Declare/Fetch” option was checked in the ODBC Data Source Administrator. Also, OBDC sources in SSIS provide the option to set the Batch Size. It may need to be adjusted based on the amount of data sourced and the server capacity, which you’ll know if you receive the following error: “ODBC Error: Unable to allocate memory. Use a lower Batch Size property value”.
After completing development and troubleshooting the ODBC specific errors, I was able to run the packages locally through SQL Server Data tools. However, after deploying the packages to Integration Services Catalog and creating jobs to run through SQL Server Agent, I received the following error again: “ODBC Error: Unable to allocate memory. Use a lower Batch Size property value”. After several hours of trial and error and various blog and forum research, I was sure that if I could run the package and force the source connection to use the 64-bit ODBC PostgreSQL driver, this would solve my issue.
SQL Server Data Tools is a 32-bit application, so the Integration Services design environment only recognizes 32-bit providers while you are designing a package. There may be several ways to work around this but my solution was to configure the connection within the SQL Server Job Step, under configurations by changing the ConnectionString property value to be the name of the 64-bit ODBC driver I configured locally on the server.
I couldn’t find immediate answers to many of these issues online, so hopefully, this article will provide some context and solutions when using SSIS to connect to a PostgreSQL database.