Blog

Azure Data Factory: Date formats when working with XML in Data Flows

If there's one thing you can guarantee will always go wrong when importing data, it's date formats. Some countries like to show dates as dd/MM/yyyy, others prefer MM/dd/yyyy, and despite unambiguous formats like yyyy-MM-dd having been a thing in programming for a very long time, people still seem to create files with these ambiguous formats.

Finding out how to specify what format the date is in within the file was not obvious. If you go to the dataset item for an XML file, unlike other formats it's missing the schema tab.

This wasn't overly surprising, as I've come to find that when using Datasets in a Mapping Data Flow, half the time it seems to ignore any schema definition, and if you use wildcard paths it even seems to ignore all the path settings on the dataset.

Within the Data Flow, the source has a projection tab which will import a schema from your XML file. If your data looks like a date, this will hopefully set the data type to date. One thing I found was that having data in dd/MM/yyyy resulted in a data type of string rather than date. Annoyingly, unless you want to start manually editing the script the UI generates, there's no way of fixing the projection.

Assuming you do have date as the data type, this is a good start, but I then found that when I ran my data flow, which had dates in dd/MM/yyyy, the date field came through blank! So it definitely knows it's a date and not a string, but it doesn't like the format, so it has ignored the data.

Back on the projection tab of the source there is another button, "Define default format". This will open a side panel where you can set what format your dates, times, whole numbers and fractions will be in. Once I had set this, my dates started feeding through.
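
In my case the only value that needed changing was the date format, which just had to match the pattern of the incoming data (shown here as an illustration; the exact field labels in the panel may differ slightly between portal versions):

dd/MM/yyyy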

Debugging VueJS + TypeScript with VS Code

I recently asked a developer working with VueJS how they do debugging and they told me console.log. Not wanting to go back to the dark ages of development I went and found the solution myself.

The VueJS website has a guide on how to do debugging with VS Code (https://vuejs.org/v2/cookbook/debugging-in-vscode.html), however the instructions they give don't work with TypeScript. Interestingly, it does get you far enough to set breakpoints within either Chrome or Edge, but I don't really count that as a solution. The idea of debugging is that you step through your code and see everything that is happening, not step through a copy of your code and then go and find the corresponding file in your code editor.

After a bit of digging I managed to get it to work, so here's my solution.

Prerequisites

This part is essentially as per the VueJS guide.

Make sure you have VS Code installed and install the Debugger for Chrome extension. Don't worry if you use Edge; the same extension will work.

Create your project with the vue-cli, following the instructions in the Vue CLI Guide. When you do this, make sure you pick TypeScript.
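
If you haven't created the project yet, it looks something like this (the project name here is just a placeholder); when prompted, choose to manually select features so you can tick TypeScript:

vue create my-vue-app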

Displaying Source Code in the Browser

This isn't what we're after but it's still a step we must do.

Create a file called vue.config.js in the root of your solution and paste the following into it. This tells webpack to generate source maps, which give the debugger the mapping it needs to get from the compiled, bundled JavaScript back to your original source files.

module.exports = {
  configureWebpack: {
    // Emit full source maps so breakpoints in .ts and .vue files can be resolved
    devtool: 'source-map'
  }
}

Configure debugging from VS Code

In VS Code select the Run and Debug icon on the left and then click the Run and Debug button. If you don't see this button, it probably means that you already have a launch.json file defined; instead, a cog will appear at the top to edit the existing settings.

VS Code Run and Debug

From the options choose either Chrome or Edge.

VS Code select Chrome

This will create a launch.json file with some defaults filled out. Replace the contents of the file with the below (this is the part which differs from the instructions on VueJS's site).

I have included 2 configurations, one for debugging with Chrome and the other for Edge. They are both essentially the same and just launch a different browser.

{
  // Use IntelliSense to learn about possible attributes.
  // Hover to view descriptions of existing attributes.
  // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
  "version": "0.2.0",
  "configurations": [
    {
      "type": "pwa-msedge",
      "request": "launch",
      "name": "vuejs: edge",
      "url": "http://localhost:8080",
      "webRoot": "${workspaceFolder}",
      "breakOnLoad": true,
      "sourceMapPathOverrides": {
        "webpack:///./*": "${webRoot}/*"
      },
      "skipFiles": [
        "${workspaceFolder}/node_modules/**/*"
      ]
    },
    {
      "type": "chrome",
      "request": "launch",
      "name": "vuejs: chrome",
      "url": "http://localhost:8080",
      "webRoot": "${workspaceFolder}",
      "breakOnLoad": true,
      "sourceMapPathOverrides": {
        "webpack:///./*": "${webRoot}/*"
      },
      "skipFiles": [
        "${workspaceFolder}/node_modules/**/*"
      ]
    }
  ]
}

Debug from VS Code

To start your debugging you still need to start the application from a terminal with npm run serve. The debugging experience is more like Visual Studio's attach to process rather than running the application by clicking start.

Once your application is running, to attach the debugger either press F5 or go to the Run and Debug tab, make sure your browser of choice is selected and click the green play button. Your browser of choice will now open, VS Code will attach, and any breakpoints you create will be hit.

Data Factory: How to upsert a record in SQL

When importing data to a database we want to do one of three things: insert the record if it doesn't already exist, update the record if it does, or potentially delete the record.

For the first two, if you're writing a stored procedure this often leads to a bit of SQL that looks something like this:

IF EXISTS(SELECT 1 FROM DestinationTable WHERE Foo = @keyValue)
BEGIN
    UPDATE DestinationTable
    SET Baa = @otherValue
    WHERE Foo = @keyValue
END
ELSE
BEGIN
    INSERT INTO DestinationTable(Foo, Baa)
    VALUES (@keyValue, @otherValue)
END

Essentially an IF statement to see if the record exists based on some matching criteria.

Data Factory - Mapping Data Flows

With a mapping data flow, data is inserted into a SQL DB using a Sink. The Sink lets you specify a dataset (which will specify the table to write to), along with mapping options to map the stream data to the destination fields. However, the decision on whether a row is an Insert/Update/Delete must already be specified!

Let's use an example of some data containing a person's First Name, Last Name and Age. Here's the table in my DB:
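
For reference, the destination table is along these lines (just a sketch; the table name is made up for this post, and the ID column is the SQL identity mentioned below):

CREATE TABLE dbo.People
(
    Id INT IDENTITY(1,1) PRIMARY KEY,
    FirstName NVARCHAR(50) NOT NULL,
    LastName NVARCHAR(50) NOT NULL,
    Age INT NULL
)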

And here's a CSV I have to import:

FirstName,LastName,Age
John,Doe,10
Jane,Doe,25
James,Doe,50

As you can see, in my import data Jane's age has changed, there's a new entry for James, and Janet doesn't exist (but I do want to keep her in the DB). There are also no IDs in my source data, as that's an identity created by SQL.

If I look at the Data preview on my source in the Data Flow, I can see the 3 rows from my CSV, but notice there is also a little green plus symbol next to each one.

This means that they are currently being treated as inserts, which while true for one of them is not true for the others. If we were to connect this to the sink it would result in 3 new records being added to the DB, rather than two of them updating the existing rows.

To change the insert to an update you need an Alter Row step. This allows us to define rules to state what should be an insert and what should be an update.

However, knowing whether something should be an insert or an update requires knowledge of what is already in the DB. To do that would mean adding a second source, followed by a join on First Name/Last Name, and then conditions based on which rows have an ID from the DB or not. This all seems a bit needlessly complicated, and it is.

Upsert

When using a SQL sink there is a fourth option for what kind of method should be used, and that is an upsert. An upsert will result in a SQL merge being used. SQL merges take a set of source data, compare it to the data already in the table based on some matching keys, and then decide whether to update or insert records based on the result.
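
To illustrate, the merge behind an upsert is roughly equivalent to hand-writing something like the statement below for each batch of incoming rows (a sketch against the hypothetical People table from earlier, not the exact SQL Data Factory generates):

MERGE INTO dbo.People AS target
USING (VALUES ('Jane', 'Doe', 25)) AS source (FirstName, LastName, Age)
    ON target.FirstName = source.FirstName
    AND target.LastName = source.LastName
WHEN MATCHED THEN
    UPDATE SET Age = source.Age
WHEN NOT MATCHED THEN
    INSERT (FirstName, LastName, Age)
    VALUES (source.FirstName, source.LastName, source.Age);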

On the sink's Settings tab untick Allow insert and tick Allow upsert. When you tick Allow upsert, a Key columns property will appear, which is where you specify which columns should be used as the key. For me this is FirstName and LastName.

If you don't already have an Alter Row step it will warn you that this is missing.

Even though we are only doing what equates to a SQL merge, you still need to alter the rows to say they should be an upsert rather than an insert.

As we are upserting everything, our condition can just be set to an expression that returns true (for example the built-in true() function) rather than analysing any row data.

And there we have it, all rows will be treated as an upsert. If we look at the Data preview we can now see the upsert icon on each row.

And if we look at the table after running the pipeline, we can see that Jane's age has been updated, James has been added, and John and Janet stayed the same.