Tag: Azure

Data Factory: How to upsert a record in SQL

When importing data to a database we want to do one of three things, insert the record if it doesn't already exist, update the record if it does or potentially delete the record.

For the first two, if your writing a stored procedure this often can lead to a bit of SQL that looks something like this:

IF EXISTS(SELECT 1 FROM DestinationTable WHERE Foo = @keyValue)
BEGIN
UPDATE DestinationTable
SET Baa = @otherValue
WHERE Foo = @keyValue
END
ELSE
BEGIN
INSERT INTO DestinationTable(Foo, Baa)
VALUES (@keyValue, @otherValue)
END

Essentially an IF statement to see if they record exists based on some matching criteria.

Data Factory - Mapping Data Flows

With a mapping data flow, data is inserted into a SQL DB using a Sink. The Sink let's you specify a dataset (which will specify the table to write to), along with mapping options to map the stream data to the destination fields. However the decision on if a row is an Insert/Update/Delete must already be specified!

Let's use an example of some data containing a persons First Name, Last Name and Age. Here's the table in my DB;

And here's a CSV I have to import;

FirstName,LastName,Age
John,Doe,10
Jane,Doe,25
James,Doe,50

As you can see in my import data Jane's age has changed, there's a new entry for James and Janet doesn't exist (but I do want to keep here in the DB). There's also no ID's in my source data as that's an identity created by SQL.

If I look at the Data preview on my source in the Data Flow, I can see the 3 rows from my CSV, but notice there is also a little green plus symbol next to each one.

This means that they are currently being treated as Inserts. Which while true for one of them is not for the others. If we were to connect this to the sink it would result in 3 new records being added to the DB, rather than two being updated.

To change the Insert to an update you need an alert row step. This allows us to define rules to state what should be an insert and what should be an update.

However to know if something should be an insert or an update requires knowledge of what is in the DB. To do that would mean a second source, followed by a join on First Name/Last Name and then conditions based on which rows have an ID from the DB or not. This all seems a bit needlessly complicated, and it is.

Upsert

When using a SQL sink there is a 4th option for what kind of method should be used and that is an Upsert. An upsert will result in a SQL merge being used. SQL Merges take a set of source data, compare it to the data already in the table based on some matching keys and then decide to either update or insert new records based on the result.

On the sink's Settings tab untick Allow insert and tick Allow upsert. When you tick Allow upsert properties for Key columns will appear which is where you specify which columns should be used as a key. For me this is FirstName and LastName.

If you don't already have an Alter Row step it will warn you that this is missing.

Even though we are only doing what equates to a SQL merge, you still need to alter the rows to say they should be an upsert rather than an insert.

As we are upserting everything our condition can just be set to return true rather than analysing any row data.

And there we have it, all rows will be treated as an upsert. If we look at the Data preview we can now see the upsert icon on each row.

And if we look at the table after running the pipeline, we can see that Janes age has been update, James has been added and John and Janet stayed the same.

Data Factory: Importing multiple files with transformations

Let's assume you have a folder containing a bunch of files that you need to import somewhere. e.g. a database or another file store, and in the process of doing that you also need to transform the data in some sort of way.

One option would be to use a pipeline activity like Get Metadata to get your list of files, a ForEach to loop through them and a Mapping Data Flow within the for each to process each file.

This all sounds quite reasonable, but there's a catch. Each time we use a Data Flow activity, that activity will spin-up a Azure Databricks environment to run the Data Flow. So if you have 100 files to import, then that's 100 Databricks environments that will get created.

An alternative is to do everything within one Data Flow activity, resulting in just one Databrick environment being created.

One Data Flow

In your dataset configuration specify a filepath to a folder rather than an individual file (you probably actually had it this way for the Get Metadata activity).

In your data flow source object, pick your dataset. In the source options you can specify a wildcard path to filter what's in the folder, or leave it blank to load every file.

Now when the source is run it will load data from all files.

One major difference to note is now rather than iteratively going through each file we're loading them all in one go which changes how you may think of things.

If you need to know which file a particular row came from then the source options has a field where you can specify a column name for the file to be added to.

Your data now includes data from every file, and the filename it came from.

Equivalent of the Octopus Package Library for Azure Devops

I’ve been using Team City and Octopus deploy in our CI setups for several years, but over the last 6 months have slowly been moving over to use Azure Devops. This is mostly to move away from needing to keep an application like Team City updated by switching to a SAAS service. Octopus is available as a SAAS service, but as Azure Devops is also capable of doing releases I opted to start with that rather than using Octopus again.

One of my first challenges was how you replace Octopus’s package library. The solutions I’m working on (which are generally Sitecore based), consist of the application we write and store in Source Control, and all the other pre-built parts of Sitecore which we don’t store in source control.

With Octopus deploy we would add these static parts to the Octopus package library and then release a combination of them and our application to a server, wiping what is there before we do. That then gives us the exact same deploy on every target.

With Azure Devops however, things are a bit different.

Azure Artifacts

The closest equivalent of the package library is Azure Artifacts. Rather than a primary purpose being to upload packages to be included in a deploy. The goal of Artifacts is more inline with the output of a pipeline being saved to an Artifact which can then be a package feed for things like NuGet or npm.

Unfortunately, there is also no UI to directly upload a NuGet package to an Artifact feed like there is with Octopus’s package library. However, it is possible to upload a pre-built NuGet package another way using NuGet.exe itself.

Manually uploading a NuGet package to Azure Artifacts

To start you need to create a feed for your package to uploaded to.

Login to Azure DevOps and got to Artifacts. Click Create feed and give it a name.

Click connect to feed and select NuGet.exe. You will see under the heading project setup a nuget.config file to add to your solution.

Create an empty solution in Visual Studio and add a nuget.config file to the root folder using the source from the website. It will look something like this;

<?xml version="1.0" encoding="utf-8"?>
<configuration>
<packageSources>
  <clear />
  <add key="MyFeed" value="https://pkgs.dev.azure.com/.../_packaging/MyFeed/nuget/v3/index.json" />
</packageSources>
</configuration>

Copy the NuGet package you would like you publish into the root folder as well. I’m using a package we have created for a tool called Feydra. My folder now looks like this.

Before we can publish into our NuGet feed we need to setup a personal access token. This is done from within Azure Devops.

In the top right corner click the person icon and then select profile. On the profile screen you will see a section called Personal access tokens under Security.

From here click New Token and give it the Read, write & manage permission for Packaging.

You will be given a token which you should save a copy of.

Now we are ready to publish are NuGet package to the Artifact feed.

Open a command prompt at the folder containing your solution file and run the following commands

nuget push -Source &lt;SourceName&gt; -ApiKey az &lt;PackagePath&gt;

For me this is as follows:

nuget push -Source MyFeed -ApiKey az .\Feydra.Custom.1.0.0.30.nupkg

You will then be prompted for a username and password which you should use the personal access token for both.

Refresh the feed on the website and you should see you package added to it, ready to be included in a release pipeline.