Data Factory: How to upsert a record in SQL

When importing data into a database we usually want to do one of three things: insert the record if it doesn't already exist, update it if it does, or potentially delete it.

For the first two, if you're writing a stored procedure this often leads to a bit of SQL that looks something like this:

IF EXISTS(SELECT 1 FROM DestinationTable WHERE Foo = @keyValue)
BEGIN
    UPDATE DestinationTable
    SET Baa = @otherValue
    WHERE Foo = @keyValue
END
ELSE
BEGIN
    INSERT INTO DestinationTable(Foo, Baa)
    VALUES (@keyValue, @otherValue)
END

Essentially an IF statement to see if the record already exists, based on some matching criteria.

Data Factory - Mapping Data Flows

With a mapping data flow, data is inserted into a SQL DB using a Sink. The Sink lets you specify a dataset (which specifies the table to write to), along with mapping options to map the stream data to the destination fields. However, the decision on whether a row is an insert, update or delete must already have been made by the time it reaches the sink!

Let's use an example of some data containing a person's First Name, Last Name and Age. Here's the table in my DB:

And here's a CSV I have to import:

FirstName,LastName,Age
John,Doe,10
Jane,Doe,25
James,Doe,50

As you can see, in my import data Jane's age has changed, there's a new entry for James, and Janet doesn't exist (but I do want to keep her in the DB). There are also no IDs in my source data, as the ID is an identity column created by SQL.

If I look at the Data preview on my source in the Data Flow, I can see the 3 rows from my CSV, but notice there is also a little green plus symbol next to each one.

This means that they are all currently being treated as inserts, which while true for one of them is not true for the others. If we were to connect this straight to the sink it would result in three new records being added to the DB, rather than one being added and two being updated.

To change an insert to an update you need an Alter Row step. This allows us to define rules to state which rows should be an insert and which should be an update.

However, knowing whether something should be an insert or an update requires knowing what is already in the DB. To do that would mean adding a second source, followed by a join on First Name/Last Name, and then conditions based on which rows come back with an ID from the DB and which don't. This all seems a bit needlessly complicated, and it is.

Upsert

When using a SQL sink there is a fourth option for the kind of method that should be used, and that is an upsert. An upsert results in a SQL MERGE being used. A MERGE takes a set of source data, compares it to the data already in the table based on some matching keys, and then decides whether to update existing records or insert new ones based on the result.
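To make that a bit more concrete, here's a rough sketch of the kind of MERGE an upsert equates to, matching on FirstName and LastName as in the example above. The table name, parameter names and exact SQL are my own illustration; the statement Data Factory actually generates behind the scenes will differ.

MERGE INTO People AS target
USING (SELECT @firstName AS FirstName, @lastName AS LastName, @age AS Age) AS source
    ON target.FirstName = source.FirstName
   AND target.LastName = source.LastName
WHEN MATCHED THEN
    -- A row with the same key already exists, so update it
    UPDATE SET Age = source.Age
WHEN NOT MATCHED THEN
    -- No match on the key columns, so insert a new row
    INSERT (FirstName, LastName, Age)
    VALUES (source.FirstName, source.LastName, source.Age);

In practice the source would be the whole set of incoming rows rather than a single row of parameters, but the matched/not matched logic is the same.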

On the sink's Settings tab untick Allow insert and tick Allow upsert. When you tick Allow upsert, a Key columns property appears, which is where you specify which columns should be used as the matching key. For me this is FirstName and LastName.

If you don't already have an Alter Row step it will warn you that this is missing.

Even though we are only doing what equates to a SQL MERGE, you still need an Alter Row step to mark the rows as upserts rather than inserts.

As we are upserting everything, our condition can just be set to always return true rather than analysing any row data.
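In the Alter Row step that comes out as a single rule along these lines, using the data flow expression language (you could equally base the condition on row data if you only wanted to upsert some rows):

Upsert if: true()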

And there we have it, all rows will be treated as an upsert. If we look at the Data preview we can now see the upsert icon on each row.

And if we look at the table after running the pipeline, we can see that Jane's age has been updated, James has been added, and John and Janet have stayed the same.

Data Factory: Importing multiple files with transformations

Let's assume you have a folder containing a bunch of files that you need to import somewhere, e.g. a database or another file store, and in the process of doing that you also need to transform the data in some way.

One option would be to use a pipeline activity like Get Metadata to get your list of files, a ForEach to loop through them, and a Mapping Data Flow inside the ForEach to process each file.

This all sounds quite reasonable, but there's a catch. Each time we use a Data Flow activity, that activity will spin up an Azure Databricks environment to run the Data Flow. So if you have 100 files to import, that's 100 Databricks environments that will get created.

An alternative is to do everything within one Data Flow activity, resulting in just one Databricks environment being created.

One Data Flow

In your dataset configuration specify a filepath to a folder rather than an individual file (you probably actually had it this way for the Get Metadata activity).

In your data flow source object, pick your dataset. In the source options you can specify a wildcard path to filter what's in the folder, or leave it blank to load every file.
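For example, a wildcard path ending in something like the following (the folder name here is just an illustration) would pick up only the CSV files in that folder:

input/*.csv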

Now when the source is run it will load data from all files.

One major difference to note is that rather than iterating through each file, we're now loading them all in one go, which changes how you may need to think about things.

If you need to know which file a particular row came from, the source options have a field where you can specify a column name for the file name to be added to.

Your data now includes data from every file, and the filename it came from.

JavaScript frameworks explained to an ASP.NET dev

For most of my career I've been an ASP.NET dev and a JavaScript dev. If I was going to say I was more of an expert in one of them it would be the .NET side of things but I've never really lost touch with JavaScript.

Right now I think it's fair to say technology is starting to shift how we build websites, with JavaScript frameworks reaching a point where features like static site generation actually offer a decent performance incentive to use them. At some point Blazor may reach a point where it reverses this, but right now there's a compelling argument to move.

For an ASP.NET dev this can be a daunting task. You might be thinking of trying out a headless CMS with a JavaScript front end, but just take a look at this screen grab from Prismic's SDK list.

There are seven different JavaScript-based SDKs listed there! That's over half of the total, and none of them are that Angular thing you had heard about. Where do you start?

Let's compare to .NET

Well, recently I've been updating my JS skills, trying out some of the frameworks I hadn't used before, so I thought I'd share some learnings. The good news is, as always, that it's not really as different as it first seems. To take some of the pain out of understanding what all these frameworks are, I thought it would be good to try and relate them back to .NET and what the nearest equivalents are.

Assembly Code

No, not actual assembler, but what our code actually compiles to. In the .NET world we have CIL (Common Intermediate Language), previously known as MSIL (Microsoft Intermediate Language), which our C#, F#, VB etc. all compile down to before being converted to the correct machine code for wherever they run.

In the front-end world, think of JavaScript as being a bit like this (apart from the fact that you actually write JavaScript, whereas we don't write CIL).

View Engine

To render views into HTML, in the ASP.NET world we have Razor, but not just Razor. We also have Web Forms, Brail, Bellevue, NDjango (see more here); it just happens that we mostly use Razor.

I see the equivalents of these being ReactJS, VueJS and Angular. It's not an exact match, as they also aren't exact equivalents of each other, but they're largely the functionality that will take a model and turn it into HTML.

Web Application Framework

The problem with the word framework is that it applies to basically anything, but this is what I'm going with for describing ASP.NET MVC/ASP.NET Razor Pages/Web Forms, you know, all those things built on top of .NET that make it a website rather than a desktop app. They do things like routing, organising our files into controller and view folders, knowing how to respond to HTTP requests etc.

Here we have Next.js, Nuxt.js and maybe Gatsby. The link between these and the view engine is a bit stronger than in the ASP.NET MVC world, as you essentially have a one-to-one mapping (Next.js -> React, Nuxt.js -> Vue), but they are what adds routing, static site generation and organisation to your code.

Lower Level Framework

Now this one could be wrong :)

In .NET we have different versions of the framework, e.g. .NET Framework 3.5/4, .NET Core, .NET 5, Mono. On the front-end side they have Node.

Languages

In .NET we have choices including C#, F#, VB among others.

JavaScript has JavaScript (which I know I said was the assembly code), TypeScript, CoffeeScript and maybe more.

Not so daunting

There's probably a bunch of flaws in my comparison list, and reasons people can point out why things I've said are the same are in fact different, but my point was really to show that while .NET may appear as one button on an SDK list alongside seven JavaScript-based SDKs, it's not that different. Out of the seven, Node is based on JavaScript, Vue and React are based on Node, and Next/Gatsby/Nuxt are based on Vue/React. There just isn't the same concept of all of it being built by one company, or of one particular combination being dominant, in the same way that ASP.NET MVC + C# + Razor has been for the last generation of .NET websites.

Typescript: What to do when your third party NPM module doesn't have type declarations

As a developer who has probably spent slightly more time doing back-end work than front end, and whose front-end work has been more about SPAs than making things look pretty, I really like what TypeScript brings to writing JavaScript. While it's often a nice break from C# to write some code that doesn't require you to go and define every object, there are obviously a lot of benefits to doing so (otherwise we would just make everything dynamic in C#). Having file names show up red in VS Code when there's a type error is a really nice feature.

However, if you want to set your compiler options to strict, you will more than likely run into an issue with third-party code coming from NPM, in the form of:

Type error: Could not find a declaration file for module MODULE NAME HERE

Here's a screen grab of the compiler error from my build server...

It's offering a nice line of hope by suggesting I run an npm command to add in the types. Sometimes this works, but as not everyone feels the need to adopt TypeScript, you'll find there are times when it doesn't.

The only solution to this is to write your own.

Type definitions are stored in the node_modules folder, but as that isn't something you'll want to be adding to source control, you'll need another folder. I call mine @types and have it in the root. Within that folder you then need a folder for each module requiring type definitions; the name of each folder should match the module name.
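For example, the layout might look something like this (the module names here are just placeholders):

@types/
  some-untyped-module/
    index.d.ts
  another-untyped-module/
    index.d.ts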

Here's mine:

To make the actual declaration file you'll need to name it index.d.ts and place it in the folder you just created. You can read more about how declaration files should be structured here; however, there are a couple of issues with writing a declaration file for a third-party module:

1. It's very confusing to work out everything that should go in it

2. It's not your module so it might change

Unless you're planning on helping develop the module and creating the official type declarations, I suggest an easier option.

Create a file that looks like this:

/// <reference types="node" />

declare module 'module name here';

This is enough to satisfy the compiler. You won't get accurate types for the module, but they would only ever be as good as your interpretation anyway. It will at least unlock the ability for the rest of your application to keep working under strict type checking.
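If you later decide you want a bit more safety around the parts of the module you actually call, you can expand the wildcard declaration into a block and declare just those members. Everything below is hypothetical; the names would need to match whatever the module really exports:

declare module 'module name here' {
  // Declare only the bits of the API you actually use;
  // anything you leave out simply won't be type checked.
  export function doSomething(input: string): Promise<string>;
  export const version: string;
}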