Advent of Code: Day 2

It’s the 2nd of December, ergo day two of Advent of Code. Read this if you want more context about the competition, or the series of blog posts.

The challenge

Today’s challenge was an interesting problem from a SQL perspective, because it involved some data wrangling,¬† but after that was actually really easy to solve. This is reminiscent of many of the problems we face in the data business on a day-to-day basis. The difficulty often lies in solving the problem, not actually writing the code.

Data preparation and wrangling is an essential part of any data professional’s skill set. I often find that data prep takes 70% or more of total project time, so it really is worth understanding how to best model your data for the problem you’re trying to solve.

The solution (download)

Essentially, part one of the challenge today was to determine the highest and lowest values in a row of data, and then subtract the lowest from the highest to create a checksum value per row. Add up the checksums for each row to complete the challenge.

First task, putting the data into a table that I can then manipulate. I had a couple of thoughts, use the SQL Server Import Wizard Task, use SSIS, but in the end, mainly because the file was pretty small, I just imported it as 16 text strings and stored it in a table variable. One of the best tips I’ve ever learned in SSMS is using the ALT-key in combination with the arrows for block selection. This technique has saved me countless hours in my job, and today, I created 16 rows of¬†(”),¬†and then block copied the data inside those quotes.

So now I have 16 rows of data, but each row contains 16 tab delimited values. Now I have to make them into values I can compare to each other. However, from a SQL perspective, functions like MIN and MAX are designed to work on columns of data, not rows. So the appropriate data structure for my data is not 16 rows with 16 columns in each, but rather 256 rows, each one indicating the number of the string, the position of the value (1-16) and the value itself.

So the following subset, becomes the table on the right.

Data in a usable table

Getting to this state is fairly straightforward, and something we actually do so often that I have a split list user defined table valued function which I keep with me at all times. For the sake of brevity I’ve stripped out the checking and error handling, but the function looks like this.

The previous version of this function used a cursor to loop through the data, but it was fairly unwieldy and it never felt comfortable. One day I found this XML technique online and I adapted my function accordingly. It really is a useful technique to have in your locker.

So we have a “Split List” function, but we now have to use it on every row. So we use an cross apply.

Anyway, back to the problem at hand. It now becomes very simple to see that we can find the MIN and MAX values per RowNumber with a simple query, and then sum them up by using a windowing function.

Pretty straightforward, isn’t it?!

The second part of the challenge required you to find the pair of values on the row where one value was a divisor of the other, and summing the result of that division for each row. Thankfully the data didn’t require any more wrangling, and I could query it in the format I already had it.

My method for finding the pair, was to join each value on a row to each other value on a row, and find the one value where the modulus was 0. I then added the result of the division and voila! Be wary of this though, creating a near-cartesian join can seriously hinder performance, especially on large data sets.

Right, 2 days down, 23 to go. Dunno if I’ll be able to blog every day, but I’m going to give it a go.

Until tomorrow.

One comment

Leave a Reply

Your email address will not be published. Required fields are marked *