Tag: Data

Data Factory - Dynamic mappings in Data Flow

Azure Data Factory and Data Flows make transforming data from one format to another super simple with it's code free approach. However that doesn't mean we want to construct entire data flows when changing mappings from one value to another.

For example if my source data contained a field for favourite creature with value An for Ants, and Ca for Cats I could do the transformation using an iif expression in a derived column task. But in a few weeks time if I needed to add Ba for Bats, editing the whole Data Flow seems like a lot of work, not to mention nested iif statements are going to be come ugly and confusing.

One option would be to have the list of conversions as another source in the flow and do a join, but then this means having that data stored somewhere like blob storage.

Instead a solution I have is to pass the data in as a parameter to the data flow. Data Factory doesn't have an array parameter but we can put a comma separated list in as a string. e.g. An=Ants, Ba=Bats, Ca=Cats.

Then in our derived columns expression we can do this:

iif(instr($ParameterString, toString(SourceValue) + "=")==0,"No Mapping",
  substring(
          substring($ParameterString, instr($ParameterString, toString(SourceValue) + "=")),
          length(toString(SourceValue))+2,
          iif(instr(substring($ParameterString,instr($ParameterString, toString(SourceValue) + "=")+length(toString(SourceValue))+2),",") > 0, 
              instr(substring($ParameterString,instr($ParameterString, toString(SourceValue) + "=")+length(toString(SourceValue))+2),","), 
              length(substring($ParameterString,instr($ParameterString, toString(SourceValue) + "=")))
              )
          )
  )

There's quite a lot going on here so lets break it down

iif(instr($ParameterString, toString(SourceValue) + "=")==0,"No Mapping",

First we are doing an iif to check if our Source Value (the value from our dataset) exists within the Parameter String. If it doesn't then we're setting the value to "No Mapping"

substring(

If the value does exist then we need to grab just the part we want. So we need a sub string and we need to get everything after the source values equals to the next comma.

A reminder the parameters for substring are substring(<string to subset>: string,<from 1-based index>: integral, [<number of characters>: integral])

substring($ParameterString, instr($ParameterString, toString(SourceValue) + "=")),

Our first parameter needs to be the string to subset, that's going to come from our ParameterString, but we're going to do another substring on it to ignore everything before the one we ant to match.

So if our ParameterString was set too:

An=Ants, Ba=Bats, Ca=Cats

and our Source Value was Ba we would now have:

Ba=Bats, Ca=Cats

length(toString(SourceValue))+2,

Next is the start index which will be the length of our source value + 2. If we ended the substring now we would get.

Bats, Ca=Cats

iif(instr(substring($ParameterString,instr($ParameterString, toString(SourceValue) + "=")+length(toString(SourceValue))+2),",") > 0, 
              instr(substring($ParameterString,instr($ParameterString, toString(SourceValue) + "=")+length(toString(SourceValue))+2),","), 
              length(substring($ParameterString,instr($ParameterString, toString(SourceValue) + "=")))
              )
          )

The final parameter for the substring is a bit more complex, but it's similar to what we have just done.

Two scenarios need to be catered for:

  1. If there's more items in the list left then there will be a comma separating them. If this is the case then we need to get the position of the first comma in what we have left.
  2. If there isn't anything else in the list then there will be no comma. In this instance we need to get the length of what remains.

With that our substring will now return Bats

Azure Data Factory: Date formats when working with XML in Data Flows

If there was one thing you could guarantee would always go wrong when importing data it's date formats. Some countries like to show dates as dd/MM/yyyy others prefer MM/dd/yyyy and despite formats like yyyy-MM-dd being a thing in programming for a very long time, people seem to still create files with these ambiguous formats.

I found finding out how to specify what format the date is in within a flat file was not obvious. If you go to the dataset item for a XML file, unlike other formats it's missing the schema tab.

This wasn't overly surprising as I've come to find when using Datasets in a Mapping Data Flow half the time it seems to ignore any schema definition, or if you use wildcard paths it even seems to ignore all the path settings on the dataset.

Within the Data Flow, the source has a projection tab which will import a schema from your xml file. If your data looks like a date, then this will hopefully set the data type to date. One thing I found was having data in dd/MM/yyyy resulted in a format of string rather than date. Annoyingly unless you want to start manually editing the script the UI generates, there's no way of fixing the projection.

Assuming you have a date as the data type this is a good start, but I then found when I ran my data flow which had dates in dd/MM/yyyy, the date field was blank! So it definitely knows it's a date and not a string, but it's doesn't like the format so it's ignored the data.

Back on the projection tab of the source there is another button "Define default format". This will open a side panel where you can set what format your dates, times, whole numbers and fractions will be in. Once I had set this, my dates started feeding through.

Bulk Inserting data using Entity Framework

Using tools like Entity Framework makes life far easier for a developer. Recently I blogged about how using them is what makes .Net Core one of the best platforms for prototype development, but the benefits don’t end there. They are also great from a security perspective by cutting a lot of risk around SQL injection attacks just by avoiding easy mistakes when using regular ADO.NET.

However, they do have some downsides, a main one being that they are particularly slow when it comes to doing bulk inserts to a database.

For example, assume you have an application which regularly receives an xml import file consisting of 200,000 records and each one either needs to be an insert of an update into the db. You’ll quickly learn that looping through the whole lot and then calling save changes results in a process taking an extremely long time to run, it may even just timeout. You then decide to get rid of that long save changes line by breaking it up into blocks of 500 and call save changes for each of those. That may save the timeout issue, but it still results in a process potentially lasting around an hour.

The problem is that this is a scenario Entity Framework or EF.Core just weren’t designed to handle. As a solution you could opt to drop Entity Framework altogether and revert to something like a native SQL Bulk Insert command, but what if you need to be doing some processing in code on the record before the import happens? What if you have one of those classic not quite always valid XML, XML files which would cause SQLs Bulk Insert to fail.

The solution is to use an open source extension called EFCore.BulkExtensions.

EFCore.BulkExtensions

EFCore.BulkExtensions is a set of extension methods to Entity Framework that provide the functionality to do bulk inserts. You can add it to your project using NuGet and you’ll find the project on GitHub here https://github.com/borisdj/EFCore.BulkExtensions

Usage is also very simple to do. Let’s assume you have some existing tradition EF code that loops through a collection and for each one create a new db item and adds it to the db:

public void DoImport(List<foo> collection)
{
  foreach (var item in collection)
  {
      Jobs job = new Jobs();
      
      job.DateAdded = DateTime.UtcNow;
      job.Name = item.Name;
      job.Location = item.Location;

      await dbContext.Jobs.AddAsync(job);
  }

  await dbContext.SaveChangesAsync();
}

Rather than adding each item to the Entity Framework db context, you instead create a list of those objects and then call a BulkInsert function with them on your db context.

public void DoImport(List<foo> collection)
{
  List<Jobs> importJobs
  foreach (var item in collection)
  {
      Jobs job = new Jobs();
      
      job.DateAdded = DateTime.UtcNow;
      job.Name = item.Name;
      job.Location = item.Location;
      
      importJobs.Add(job);
  }

  await dbContext.BulkInsert(importJobs);
}

If also works for updates, but rather than creating a new item, first retrieve it form the db and then at the end call BulkInsertOrUpdate with the list.

await dbContext.BulkInsertOrUpdate(importJobs);

From my experience doing this took my import process that would run for over an hour down to something which would complete in a few minutes.