Excel 365’s PowerQuery has many wonderful features but occasionally there are common needs that it can’t easily handle. One of those surprising complexities is removing older duplicates but retaining most recent row for each. We’ll explain the simplest solution then why it works.
Here’s some sample data on the left and the most recent purchase for each customer at right.
The solution
There are various ways to solve this problem, this is the easiest we’ve found.
- Load the data into Excel PowerQuery.
- Change the fields / columns to whatever formats you want. Make sure the Date field is in Date format.
- Sort the Date field, descending (so most recent dates are on top).
The code is :
Table.Sort(#”Changed Type”,{{“Date”, Order.Descending}}
We’ll come back to this code a little later because there’s more needed.
- Select the Name field/column then Remove Duplicates
The code is: Table.Distinct(#”Sorted Rows”, {“Name”})
That’s how you’d expect to get the right result, but it doesn’t work properly. You have to add a PowerQuery fix that gives the correct result.
- Go back to the Sorted Rows step, it needs a little extra to fix the PowerQuery problem.
In the formula bar wrap the existing Table.Sort command with the mysterious Table.Buffer function.
For example
Table.Sort(#”Changed Type”,{{“Date”, Order.Descending}}
Becomes …
Table.Buffer(Table.Sort(#”Changed Type”,{{“Date”, Order.Descending}})
There are other solutions, most commonly using Group By. We prefer this method because it’s simple to implement and understand.
Why it works
You’d think it’s a two-step process in PowerQuery would be enough; Sort into date order, descending then Remove Duplicates. Alas that doesn’t work, here’s what happens when you try.
The result for ‘Stephen’ is wrong – 17 July instead of 22 November,
The problem is the way PowerQuery handles a sort in the middle of some applied steps.
Table.Buffer()
Table.Buffer() forces PowerQuery to load the list after sorting. That means the following steps (like Remove Duplicates) work off the entire sorted list and give the correct results.
All Microsoft says about Table.Buffer is “Buffers a table in memory, isolating it from external changes during evaluation.” which doesn’t even begin to explain it’s vital importance in some situations and the performance gains possible elsewhere.
Excel 365’s data sources now include dynamic arrays
Solved! The problem with Excel Tables and Transpose
Better management of large databases in Excel