Thursday, 14 December 2017

Azure Integration Services Preview (ADF V2)

Case
I just created an Azure Data Factory V2 to start with SSIS in the cloud, but I cannot find the SSIS options in ADF. How do I configure SSIS in Azure?
Azure Data Factory V2

Solution
At the moment SSIS in ADF V2 is still in preview. The user interface for SSIS is not yet available in the Azure portal (it is visible in this video, but probably only for Microsoft).

For now you are stuck with a little PowerShell scripting. Before you start, make sure you have an ADF V2 and an Azure SQL Database server available, preferably in the same Azure region. ADF V2 is only available in East US, East US 2 and West Europe. Since I'm from the Netherlands I selected West Europe.

PowerShell ISE
If you have never used PowerShell for Azure before, you first need to start Windows PowerShell ISE as Administrator and install the Azure Resource Manager module. Execute the following command:
# PowerShell code 
Install-Module AzureRM

And if you have worked with Azure PowerShell before, then you probably only have to update the Azure module by adding the -Force parameter at the end:
Install-Module AzureRM -Force

Parameters
This PowerShell script starts with a parameter section that provides all the details necessary for the script to run.

First start with the name of your subscription. If you are not sure which one it is, you can look it up on the overview page of your Azure Data Factory under Subscription name.
Subscription name in ADF overview page

#################################################
################## PARAMETERS ###################
################################################# 
$SubscriptionName = "mySubscriptionName"
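
If you prefer PowerShell over the portal, you can also list the subscription names your account can see with the Get-AzureRmSubscription cmdlet (optional; run Login-AzureRmAccount first if you are not logged in yet):
# Optional: list all subscriptions for the logged in account
Get-AzureRmSubscription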

To store the SSISDB we need to provide the database server URL and the server admin user with its password. The URL can be found on the SQL database overview page and the user can be found on the SQL server overview page.
# Provide login details for the existing database server
$CatalogServerEndpoint = "myDBServer.database.windows.net"
$DBUser = "Joost"
$DBPassword = "5ecret!"

Next we need to provide the details of our newly created Azure Data Factory V2 environment. You can find all the required information on the overview page of ADF. The location is either WestEurope, EastUs or EastUs2.
Azure Data Factory V2 overview page

# Provide details about your existing ADF V2
$DataFactoryName = "bitools"
$ResourceGroupName = "Joost_van_Rossum"
$Location = "WestEurope"

The last part is the configuration of the Integration Runtime. The Catalog Pricing Tier is the database tier of your SSISDB database: Basic or one of the Standard tiers (S0, S1, S2, S3, etc.). The node size should be Standard_A4_v2, Standard_A8_v2, Standard_D1_v2, Standard_D2_v2, Standard_D3_v2 or Standard_D4_v2. After the preview phase more sizes will be available. Prices can be found here.
# Provide details for the new Integration Runtime
$IntegrationRuntimeName = "SSISJoostIR"
$IntegrationRuntimeDescription = "My First Azure Integration Catalog"
$CatalogPricingTier = "Basic" # S0, S1, S2, S3
$IntegrationRuntimeNodeSize = "Standard_A4_v2"
$IntegrationRuntimeNodeCount = 2
$IntegrationRuntimeParallelExecutions = 2


The script
And now the script itself. It starts with a setting to stop after an error (not the default behavior) and a login to Azure. When executing Login-AzureRmAccount a login pop-up will appear. Log in with your Azure account.
Login to Azure

#################################################
################## THE SCRIPT ###################
#################################################
$ErrorActionPreference = "Stop"

# Login to Azure (a pop-up will appear)
Login-AzureRmAccount 

After the login we need to select the right subscription. The Out-Null prevents all properties of your subscription from being written to the screen.
# Select the right subscription
Select-AzureRmSubscription -SubscriptionName $SubscriptionName | Out-Null 

And this creates a new credential object with the user id and password. We need it for the next command.
# Create database credential with user id and password
$SecureDBPassword = ConvertTo-SecureString $DBPassword -AsPlainText -Force
$ServerCreds = New-Object System.Management.Automation.PSCredential($DBUser, $SecureDBPassword)

This command will create the Integration Runtime environment in Azure Data Factory.
# Create the Integration Runtime
Write-Host "Creating your integration runtime."
Set-AzureRmDataFactoryV2IntegrationRuntime  -ResourceGroupName $ResourceGroupName `
                                            -DataFactoryName $DataFactoryName `
                                            -Name $IntegrationRuntimeName `
                                            -Type Managed `
                                            -CatalogServerEndpoint $CatalogServerEndpoint `
                                            -CatalogAdminCredential $ServerCreds `
                                            -CatalogPricingTier $CatalogPricingTier `
                                            -Description $IntegrationRuntimeDescription `
                                            -Location $Location `
                                            -NodeSize $IntegrationRuntimeNodeSize `
                                            -NodeCount $IntegrationRuntimeNodeCount `
                                            -MaxParallelExecutionsPerNode $IntegrationRuntimeParallelExecutions 

After creating the Integration Runtime environment we need to start it; only then can you use it. Starting the environment takes 20 to 30 minutes! There is also a Stop-AzureRmDataFactoryV2IntegrationRuntime cmdlet which takes the same parameters and completes in 2 to 3 minutes (see the sketch below).
# Start the Integration Runtime (takes 20 to 30 minutes)
Write-Warning "Starting your integration runtime. This command takes 20 to 30 minutes to complete."
Start-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                             -DataFactoryName $DataFactoryName `
                                             -Name $IntegrationRuntimeName `
                                             -Force
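
As mentioned above, there is also a Stop cmdlet with the same parameters. A minimal sketch (reusing the parameters from this script) of checking the status of the runtime and stopping it again when you are done, so the nodes are not running while you don't use them:
# Check the current status of the Integration Runtime
Get-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                           -DataFactoryName $DataFactoryName `
                                           -Name $IntegrationRuntimeName `
                                           -Status

# Stop the Integration Runtime again (takes 2 to 3 minutes)
Stop-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                            -DataFactoryName $DataFactoryName `
                                            -Name $IntegrationRuntimeName `
                                            -Force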


Connecting with SSMS
Now we can use SQL Server Management Studio (SSMS) to connect to our newly created SSISDB in Azure. It is a little different compared to on-premises: you need to click the Options button and select the SSISDB database first, otherwise you won't see the Integration Services Catalog.
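Once connected, an optional sanity check is to query a couple of the standard catalog views in SSISDB; they are there even before you deploy your first project:
-- Optional check, run against the SSISDB database
SELECT * FROM catalog.folders;
SELECT * FROM catalog.projects;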
Connecting to the SSISDB in Azure

Spot the differences

In a future post I will show how to deploy to the Integration Services Catalog in Azure. And here is the complete script for copy-and-paste purposes.
#################################################
################## PARAMETERS ###################
################################################# 
$SubscriptionName = "mySubscriptionName"

# Provide login details for the existing database server
$CatalogServerEndpoint = "myDBServer.database.windows.net"
$DBUser = "Joost"
$DBPassword = "5ecret!"

# Provide details about your existing ADF V2
$DataFactoryName = "bitools"
$ResourceGroupName = "Joost_van_Rossum"
$Location = "WestEurope" # or EastUs/EastUs2

# Provide details for the new Integration Runtime
$IntegrationRuntimeName = "SSISJoostIR"
$IntegrationRuntimeDescription = "My First Azure Integration Catalog"
$CatalogPricingTier = "Basic" # S0, S1, S2, S3
$IntegrationRuntimeNodeSize = "Standard_A4_v2"
$IntegrationRuntimeNodeCount = 2
$IntegrationRuntimeParallelExecutions = 2

# In public preview, only Standard_A4_v2, Standard_A8_v2, Standard_D1_v2,
# Standard_D2_v2, Standard_D3_v2, Standard_D4_v2 are supported.

#################################################
################## THE SCRIPT ###################
#################################################
$ErrorActionPreference = "Stop"

# Login to Azure (a pop-up will appear)
Login-AzureRmAccount 

# Select the right subscription
Select-AzureRmSubscription -SubscriptionName $SubscriptionName | Out-Null 

# Create database credential with user id and password
$SecureDBPassword = ConvertTo-SecureString $DBPassword -AsPlainText -Force
$ServerCreds = New-Object System.Management.Automation.PSCredential($DBUser, $SecureDBPassword)

# Create the Integration Runtime
Write-Host "Creating your integration runtime."
Set-AzureRmDataFactoryV2IntegrationRuntime  -ResourceGroupName $ResourceGroupName `
                                            -DataFactoryName $DataFactoryName `
                                            -Name $IntegrationRuntimeName `
                                            -Type Managed `
                                            -CatalogServerEndpoint $CatalogServerEndpoint `
                                            -CatalogAdminCredential $ServerCreds `
                                            -CatalogPricingTier $CatalogPricingTier `
                                            -Description $IntegrationRuntimeDescription `
                                            -Location $Location `
                                            -NodeSize $IntegrationRuntimeNodeSize `
                                            -NodeCount $IntegrationRuntimeNodeCount `
                                            -MaxParallelExecutionsPerNode $IntegrationRuntimeParallelExecutions 


# Start the Integration Runtime (takes 20 to 30 minutes)
Write-Warning "Starting your integration runtime. This command takes 20 to 30 minutes to complete."
Start-AzureRmDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                             -DataFactoryName $DataFactoryName `
                                             -Name $IntegrationRuntimeName `
                                             -Force

Write-Host "Done"




Sunday, 29 October 2017

Calculating Hash values in SSIS

Case
I want to calculate a hash value over a couple of columns in SSIS. In T-SQL you can use HASHBYTES, but that doesn't work for other sources like flat files, and in SQL 2012/2014 the input is limited to only 8000 bytes. Is there an alternative for HASHBYTES?
Calculating a hash value over multiple columns

Solution
There are several alternatives for the T-SQL HASHBYTES function. First of all, there are various custom components available for SSIS, like the SSIS Multiple Hash on CodePlex, but if you don't want to (or cannot) use custom components, you can accomplish the same result with a little .NET scripting. If you really want to stick to T-SQL, you can also first stage your files in a table and then calculate the hash with T-SQL afterwards. This blog post shows the scripting solution.

But first, why do you need a hash? When you want to keep track of history with a Persistent stage, Data Vault or Data Warehouse, you want to know whether the record from the stage layer is different from the one you have in your historical data layer. You could check each column one by one, but when you have a whole bunch of columns that is a lot of work and a bit slow.

A hash in ETL is used to generate a single, corresponding (but shorter) value for a whole bunch of columns. It is stored in the stage table as a new column. If one character changes in one of those columns, the hash value will also be different. When comparing the two records (one from the stage layer and one from the historical layer) you now only have to compare the hash values. If the hash did not change, you know you don't have to process the record in your historical layer. Since you only want to calculate the hash once (in the stage package), you also store it in the historical layer.
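
As a rough illustration (with hypothetical stage and history tables and a Hash column), the comparison then boils down to something like this:
-- Hypothetical example: only pick up records that are new or changed,
-- by comparing the hash instead of every individual column.
SELECT     s.*
FROM       stg.Customer  AS s
LEFT JOIN  hist.Customer AS h
           ON h.CustomerId = s.CustomerId
WHERE      h.CustomerId IS NULL    -- new record
   OR      h.[Hash] <> s.[Hash]    -- changed record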

Now it is time to explain the scripting solution.

1) Starting point
The starting point of this example is a Data Flow Task with a Flat File source component.
Flat File Source

2) Script Component - Input Columns
Add a new Script Component (transformation) to the Data Flow Task. Give it a suitable name and connect it to your flow. Then edit it and, on the Input Columns pane, select all columns you want to include in the hash. Since we are not changing the existing columns, you can keep the default Usage Type 'ReadOnly'.
Script Component Input Columns

Which columns do you want to hash? The three most common options:
  1. If you do not know the Primary Key: select all columns to calculate the hash.
  2. If you do know the Primary Key: select all columns except the Primary Key to calculate the hash.
  3. If the Primary Key consists of multiple columns you could even calculate a separate hash for the key columns only.
3) Script Component - Output Column
We need to store the calculated hash in a new column. Go to the Inputs and Outputs pane and add a new column in Output 0. The data type is string and the size depends on which hash algorithm you want to use. For this example we use the MD5 algorithm, which returns a 128-bit hash. When you convert that to an ASCII string it becomes a 32-character string (that only contains hexadecimal digits).
Script Component Inputs and Outputs

4) Script Component - The script preparation
Now we are almost ready to add the actual script. Go to the Script pane and select your scripting language; this example will be in C#. Then hit the Edit Script... button to start the VSTA environment. This is a new instance of Visual Studio and it will take a few moments to start.
Edit Script...

Optional:
I always start by removing all unnecessary methods and comments to keep the code clean. For this example we do not need the PreExecute and PostExecute methods and I do not want to keep the default help comments.
Clean up before start

5) Script Component - The code
First we need to add two extra usings to shorten the code. Unfold the Namespaces region at the top and add the following usings:
using System.Security.Cryptography;
using System.Text;

Then locate the method called Input0_ProcessInputRow and add a new GetMd5Hash method below this existing method (below its closing }). The new method is copied from this MSDN page. I only changed the encoding to Unicode (see note 1):
static string GetMd5Hash(MD5 md5Hash, string input)
{
    // Convert the input string to a byte array and compute the hash.
    byte[] data = md5Hash.ComputeHash(Encoding.Unicode.GetBytes(input));

    // Create a new Stringbuilder to collect the bytes
    // and create a string.
    StringBuilder sBuilder = new StringBuilder();

    // Loop through each byte of the hashed data
    // and format each one as a hexadecimal string.
    for (int i = 0; i < data.Length; i++)
    {
        sBuilder.Append(data[i].ToString("x2"));
    }

    // Return the hexadecimal string.
    return sBuilder.ToString();
}


And finally, change the code of the existing method Input0_ProcessInputRow to:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    string Separator = "|";
    string RowData = "";

    using (MD5 md5Hash = MD5.Create())
    {
        Row.Hash = GetMd5Hash(md5Hash, RowData = (Row.Title_IsNull ? "" : Row.Title.ToString()) + Separator + (Row.FirstName_IsNull ? "" : Row.FirstName.ToString()) + Separator + (Row.MiddleName_IsNull ? "" : Row.MiddleName.ToString()) + Separator + (Row.LastName_IsNull ? "" : Row.LastName.ToString()));
    }
}


The code above first concatenates all columns with a separator between them (see note 2) and checks whether the value is NULL, because we cannot add NULL to a string (see note 3). You will see that this piece of code is repeated for each column before calling the hash method:
(Row.Title_IsNull ? "" : Row.Title.ToString()) + Separator
For the first record in our example it will hash the following text: Mr.|Syed|E|Abbas
And for the third row, which contains a NULL value, it will hash this text: Ms.|Kim||Abercrombie

6) Testing the code
After closing the VSTA editor and clicking OK in the Script Component to close it, add a dummy Derived Column behind it and add a Data Viewer to see the result.
Similar result to T-SQL HASHBYTES

Note 1:
When you want the exact same result as with T-SQL HASHBYTES, you have to make sure you use the same encoding, otherwise you get a different hash. In the GetMd5Hash method, on the first line of code, you see Encoding.Unicode.GetBytes(. There are more options besides Unicode, for example ASCII, UTF7, UTF8, UTF32, etc. However, as long as you don't have to compare hashes generated by two different methods (T-SQL and .NET), it doesn't matter. In this Stack Overflow post you will find more examples.
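For example, to reproduce the hash of the first record in T-SQL (assuming the same Unicode encoding, i.e. an NVARCHAR input), something like this should return the same value as the script:
-- Style 2 strips the 0x prefix; LOWER() matches the lowercase "x2" formatting in .NET
SELECT LOWER(CONVERT(VARCHAR(32), HASHBYTES('MD5', N'Mr.|Syed|E|Abbas'), 2)) AS [Hash]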

Note 2:
The column separator is added to prevent unwanted matches. If you have these two records with two columns:
Column1 Column2
123 456
12 3456
Without the separator these two will both get concatenated to 123456 and therefore generate the same hash. With the separator you will have two different values to hash: 123|456 and 12|3456. Choose your separator wisely: the number 3 would not be a wise choice in this case.

Note 3:
In the code you see that the columns are checked for NULL values, because you cannot add NULL to a string. The NULL values are replaced with an empty string. However, this shows a small imperfection of this method, because a string column with a NULL value isn't the same as one with an empty string. To overcome this you could use a replacement string that is unlikely to occur in your data. For numeric and date data types you can just keep the empty string, because those columns can never contain an empty string themselves, something like:
(Row.MyNumberColumn_IsNull ? "" : Row.MyNumberColumn.ToString()) + Separator
(Row.MyDateColumn_IsNull ? "" : Row.MyDateColumn.ToString()) + Separator
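For example, a hypothetical NULL placeholder for string columns (pick something that will not occur in your data):
(Row.MiddleName_IsNull ? "~NULL~" : Row.MiddleName.ToString()) + Separator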


Note 4:
MD5 only uses 128 bits and there are better, safer (but also a bit slower) methods to calculate hashes:
SHA and SHA1 - 160 bits
SHA2_256 - 256 bits
SHA2_512 - 512 bits

Safer? As long as you don't use it to hash passwords, you are still OK with MD5.
Better? In rare cases two different strings could return the same MD5 hash (a collision), but you have a higher chance of winning the galaxy lottery.

Rather use SHA2_512? Just use this code instead:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    string Separator = "|";
    string RowData = "";

    using (SHA512 shaHash = new SHA512Managed())
    {
        Row.hash2 = GetShaHash(shaHash, RowData = (Row.Title_IsNull ? "" : Row.Title.ToString()) + Separator + (Row.FirstName_IsNull ? "" : Row.FirstName.ToString()) + Separator + (Row.MiddleName_IsNull ? "" : Row.MiddleName.ToString()) + Separator + (Row.LastName_IsNull ? "" : Row.LastName.ToString()));
    }
}

static string GetShaHash(SHA512 shaHash, string input)
{
    // Convert the input string to a byte array and compute the hash.
    byte[] data = shaHash.ComputeHash(Encoding.Unicode.GetBytes(input));

    // Create a new Stringbuilder to collect the bytes
    // and create a string.
    StringBuilder sBuilder = new StringBuilder();

    // Loop through each byte of the hashed data
    // and format each one as a hexadecimal string.
    for (int i = 0; i < data.Length; i++)
    {
        sBuilder.Append(data[i].ToString("x2"));
    }

    // Return the hexadecimal string.
    return sBuilder.ToString();
}

Too many columns => too much coding?
In my book Extending SSIS with .NET Scripting you will find a script component example that simply loops through all columns to generate the hash. No money to buy the book? The code in this post was the base for that script, and a rough sketch of such a column loop is shown below.
Another alternative is to generate the Script Component and its code with BIML. Here is an example of a Script Component in BIML; getting the hash to work is a bit of a challenge, but doable.
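
For illustration only, a minimal reflection-based sketch of such a column loop (assuming the output column is called Hash, ignoring binary/blob columns and accepting some reflection overhead per row); it requires an extra using System.Reflection; next to the usings added earlier:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    string Separator = "|";
    StringBuilder rowData = new StringBuilder();

    // Loop over the generated column properties of the input buffer and skip
    // the boolean <Column>_IsNull helpers and the Hash output column itself.
    foreach (PropertyInfo column in Row.GetType().GetProperties(
        BindingFlags.Public | BindingFlags.Instance | BindingFlags.DeclaredOnly))
    {
        if (column.Name.EndsWith("_IsNull") || column.Name == "Hash")
        {
            continue;
        }

        // Only read the value when the matching <Column>_IsNull property says it isn't NULL.
        PropertyInfo isNullProperty = Row.GetType().GetProperty(column.Name + "_IsNull");
        bool isNull = (isNullProperty != null) && (bool)isNullProperty.GetValue(Row, null);

        rowData.Append(isNull ? "" : Convert.ToString(column.GetValue(Row, null)));
        rowData.Append(Separator);
    }

    using (MD5 md5Hash = MD5.Create())
    {
        Row.Hash = GetMd5Hash(md5Hash, rowData.ToString());
    }
}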