## Count Distinct Or Unique Values – VBA UDF

A handful of posts ago I looked at some `FREQUENCY() `formulae which could give a count of distinct or unique values in a given column. The formulae became quite complicated and slow – particularly when dealing with mixed data types –  so, in this post, I’m going to have a go at creating some VBA User Defined Functions to do the same thing. A UDF is simply a custom function (written in any language) which is called from a `Range `or, in other words, used in a worksheet formula.

Before I continue, let me clarify the difference between what I call a ‘distinct’ count and a ‘unique’ count. Suppose we have a list as follows:

a, a, b, b, c, d, e, e, f

• From this list, I would say there are 6 distinct values: a, b, c, d, e and f. I use the word ‘distinct’ to mean ‘different’ – the same as it is used in the SQL world .
• From this list, I would say there are 3 unique values: c, d and f. The unique values are the ones which appear exactly once in the list.

The two  VBA approaches I’m going to look at are:

• A `Collection `object
• A `Dictionary `object

As a starting point I’m going to create some basic UDFs which give a distinct count using each of these approaches. The distinct `FREQUENCY()` formula I’m going to compare those UDFs against is:

`=SUM(--(FREQUENCY(IF(A1:A9<>"",MATCH("~"&A1:A9,A1:A9&"",0)),ROW(A1:A9)-ROW(A1)+1)>0))`

This formula was broken down and explained in detail in my other post, so please have a read through that if you’re not sure how it works. I’ll design the basic UDFs in a somewhat contrived fashion so that they give the same results as this formula and then I’ll do some benchmark testing to see which one is faster to calculate.

#### Basic Distinct Count Using A Collection Object

```Public Function COUNTDISTINCTcol(ByRef rngToCheck As Range) As Variant

Dim colDistinct As Collection
Dim varValues As Variant, varValue As Variant
Dim lngCount As Long, lngRow As Long, lngCol As Long

On Error GoTo ErrorHandler

varValues = rngToCheck.Value

'if rngToCheck is more than 1 cell then
'varValues will be a 2 dimensional array
If IsArray(varValues) Then

Set colDistinct = New Collection

For lngRow = LBound(varValues, 1) To UBound(varValues, 1)
For lngCol = LBound(varValues, 2) To UBound(varValues, 2)

varValue = varValues(lngRow, lngCol)

'ignore blank cells and throw error
'if cell contains an error value
If LenB(varValue) > 0 Then

'if the item already exists then an error will
'be thrown which we want to ignore
On Error Resume Next
On Error GoTo ErrorHandler

End If

Next lngCol
Next lngRow

lngCount = colDistinct.Count
Else
If LenB(varValues) > 0 Then
lngCount = 1
End If

End If

COUNTDISTINCTcol = lngCount

Exit Function

ErrorHandler:
COUNTDISTINCTcol = CVErr(xlErrValue)

End Function
```

In summary, each item in the `Collection `must have a unique key and that unique key must be a `String`. If the code tries to create a duplicate key then an error is raised. The error is ignored due to the `On Error Resume Next `directive. A feature of the VBA `Collection `object is that the key is case-insensitive.

The `LenB()` function is used to check if the cell is blank; if the cell contains an error value then an error will be raised at this point and the UDF will return `#VALUE!`. I’ll rant about the confusion in Excel between empty and blank cells in a future blog post (::grin::) but when I say blank I mean the cell either contains:

• absolutely nothing (it is empty)
• a zero length string
• only a cell prefix character (usually `'`)

#### Basic Distinct Count Using A Dictionary Object

I’ve actually written two `Dictionary` object UDFs for benchmarking because there are a couple of ways the `Dictionary` object can be created and maintained. Firstly, a new `Dictionary` object could be created every time the UDF is called or, secondly, a `Dictionary` object could be created on the first call and then cleared (but not destroyed) by using the `Dictionary.RemoveAll()` method on subsequent calls.

The Dictionary class is part of the Microsoft Scripting Runtime library (scrrun.dll) and since I’ve used early binding in my code, a reference needs to be added to it in the VBA IDE under Tools > References. Here’s the UDF which creates a new `Dictionary `object on each call:

```Public Function COUNTDISTINCTdicNew( _
ByRef rngToCheck As Range) As Variant

'Early binding declaration (Microsoft Scripting Runtime
'reference required):
Dim dicDistinct As Scripting.Dictionary
Dim varValues As Variant, varValue As Variant
Dim lngCount As Long, lngRow As Long, lngCol As Long
Dim strValue As String

On Error GoTo ErrorHandler

varValues = rngToCheck.Value

'if rngToCheck is more than 1 cell then
'varValues will be a 2 dimensional array
If IsArray(varValues) Then

Set dicDistinct = CreateObject("Scripting.Dictionary")
dicDistinct.CompareMode = TextCompare

For lngRow = LBound(varValues, 1) To UBound(varValues, 1)
For lngCol = LBound(varValues, 2) To UBound(varValues, 2)

varValue = varValues(lngRow, lngCol)

'ignore blank cells and throw error
'if cell contains an error value
If LenB(varValue) > 0 Then

'cast everything to a string
'so dictionary is not type sensitive
strValue = CStr(varValue)

If Not dicDistinct.Exists(strValue) Then
End If
End If

Next lngCol
Next lngRow

lngCount = dicDistinct.Count
Else
If LenB(varValues) > 0 Then
lngCount = 1
End If

End If

COUNTDISTINCTdicNew = lngCount

Exit Function

ErrorHandler:
COUNTDISTINCTdicNew = CVErr(xlErrValue)

End Function
```

And here’s the UDF which retains the existing `Dictionary `object after the first call and subsequently clears it. The `Static `keyword is used instead of `Dim `so that the `Dictionary `object reference is retained between function calls:

```Public Function COUNTDISTINCTdicStatic( _
ByRef rngToCheck As Range) As Variant

Static dicDistinct As Scripting.Dictionary

Dim varValues As Variant, varValue As Variant
Dim lngCount As Long, lngRow As Long, lngCol As Long
Dim strValue As String

On Error GoTo ErrorHandler

varValues = rngToCheck.Value

'if rngToCheck is more than 1 cell then
'varValues will be a 2 dimensional array
If IsArray(varValues) Then

If dicDistinct Is Nothing Then
Set dicDistinct = CreateObject("Scripting.Dictionary")
dicDistinct.CompareMode = TextCompare
Else
dicDistinct.RemoveAll
End If

For lngRow = LBound(varValues, 1) To UBound(varValues, 1)
For lngCol = LBound(varValues, 2) To UBound(varValues, 2)

varValue = varValues(lngRow, lngCol)

'ignore blank cells and throw error
'if cell contains an error value
If LenB(varValue) > 0 Then

'cast everything to a string
'so dictionary is not type sensitive
strValue = CStr(varValue)

If Not dicDistinct.Exists(strValue) Then
End If
End If

Next lngCol
Next lngRow

lngCount = dicDistinct.Count
Else
If LenB(varValues) > 0 Then
lngCount = 1
End If
End If

COUNTDISTINCTdicStatic = lngCount

Exit Function

ErrorHandler:
COUNTDISTINCTdicStatic = CVErr(xlErrValue)

End Function
```

#### Checking The UDFs

First let’s verify that the UDFs all return the same result as the `FREQUENCY()` formula. As you can see, all four formulae are type-insensitive (they don’t distinguish between text, numbers, logicals etc) and case-insensitive (for example, they don’t distinguish between `"A"` and `"a"` ).

#### Calculation Benchmarking

The two main factors when benchmarking these UDFs are the size of the precedent range and the cardinality (the uniqueness) of the values, so I tested them against a variety of range sizes using the two extreme scenarios: all the values being different and all the values being the same. Here are the results:  • The `FREQUENCY() `formula performs the best when all the values are the same but becomes extremely slow when all the values are different; in fact, I gave up timing it beyond 10,000 cells.
• There’s no material difference between the two `Dictionary` object UDFs which surprised me. I expected the static one to be faster until the number of items in the `Dictionary` from the previous call became very large, thus slowing down the `RemoveAll()` method.
• All of the formulae are pretty fast when all the values are the same.
• The `Collection` object UDF would generally seem to be the best overall choice.

However, the testing so far has been a little bit unfair on the `Dictionary` object UDFs. This is because I went out of my way to make all the UDFs return the same results and, in so doing, I converted all the cell values to `String` types to make the `Dictionary` UDFs type-insensitive and used a `Dictionary.CompareMode` of `TextCompare` to make them case-insensitive. These both hit the performance. If I make the `Dictionary` UDFs type-sensitive (no conversion to `String` types) and case-sensitive (`BinaryCompare`) then the benchmarking looks a little different:  Now the `Dictionary` UDFs are the fastest in both scenarios. Given the performance tests and the features of each UDF, my general choice would be to go with the `Dictionary` object. If you specifically want a type-insensitive and case-insensitive UDF then go with the `Collection `object.

#### Enhanced Distinct Count – Dictionary Object

I’ve enhanced the basic, static `Dictionary` UDF below to give it a bit more punch. One of the changes I’ve made is to reduce the precedent range if a whole column reference has been passed in. To do that I’ve used the `Worksheet.UsedRange `property which Charles Williams recently blogged about: it can become extremely slow if there are a lot of cells containing data or formatting on the worksheet (it’s not the size of the used range that matters) because it internally reads from the Cell table. Have a read through his blog post – you might decide that you would prefer to remove it or use an alternative approach.

```Public Function COUNTDISTINCT( _
ByRef rngToCheck As Range, _
Optional ByVal blnCaseSensitive As Boolean = True _
) As Variant

Static dicDistinct As Object

Dim varValues As Variant, varValue As Variant
Dim lngCount As Long, lngRow As Long, lngCol As Long

On Error GoTo ErrorHandler

'minimise the precedent range in case of full column references
'see Charles Williams' blog:
'http://fastexcel.wordpress.com/2012/12/02/writing-efficient-udfs-part-11-full-column-references-in-udfs-used-range-is-slow/
Set rngToCheck = Intersect(rngToCheck.Worksheet.UsedRange, rngToCheck)

If Not rngToCheck Is Nothing Then

'assign cell value(s) into memory so they
'are faster to work with
varValues = rngToCheck.Value

'if rngToCheck is more than 1 cell then
'varValues will be a 2 dimensional array
If IsArray(varValues) Then

If dicDistinct Is Nothing Then
Set dicDistinct = CreateObject("Scripting.Dictionary")
dicDistinct.CompareMode = BinaryCompare
Else
dicDistinct.RemoveAll
End If

For lngRow = LBound(varValues, 1) To UBound(varValues, 1)
For lngCol = LBound(varValues, 2) To UBound(varValues, 2)

varValue = varValues(lngRow, lngCol)

'ignore error values
If Not IsError(varValue) Then

'ignore blank cells
'including formulae which return ""
If LenB(varValue) > 0 Then

'if we have a string then let's allow for case sensitivity
If VarType(varValue) = vbString Then
If Not blnCaseSensitive Then
varValue = UCase(varValue)
End If
End If

If Not dicDistinct.Exists(varValue) Then
End If

End If
End If
Next lngCol
Next lngRow

lngCount = dicDistinct.Count
Else
'ignore if cell contains an error or is blank
If Not IsError(varValues) Then
If LenB(varValues) > 0 Then
lngCount = 1
End If
End If
End If
End If

COUNTDISTINCT = lngCount

Exit Function

ErrorHandler:
COUNTDISTINCT = CVErr(xlErrValue)

End Function
```

Note the following points:

• It will count number, text and logical data types but it ignores error values such as `#N/A` and `DIV/0!.`
• It ignores empty (and blank) cells.
• It is case-sensitive by default.
• It differentiates between data types. For example, these two formulae would be considered to be different: `=TRUE() ="True"` , as would `="1" and =1.`
• A reference to the Microsoft Scripting Runtime library is required. If you do not include the reference then the `Dictionary` object needs to be declared as an `Object` type and `BinaryCompare` set as a constant equal to `0`.
• It can work on data across multiple columns.

Here are a couple of examples to validate the UDF’s results.

### #### Enhanced Unique Count – Dictionary Object

The code to get a unique count is very similar.

```Public Function COUNTUNIQUE( _
ByRef rngToCheck As Range, _
Optional ByVal blnCaseSensitive As Boolean = True _
) As Variant

Static dicDistinct As Object

Dim varValues As Variant, varValue As Variant, varItems As Variant
Dim lngCount As Long, lngItem As Long
Dim lngRow As Long, lngCol As Long

On Error GoTo ErrorHandler

'minimise the precedent range in case of full column references
'see Charles Williams' blog:
'http://fastexcel.wordpress.com/2012/12/02/writing-efficient-udfs-part-11-full-column-references-in-udfs-used-range-is-slow/
Set rngToCheck = Intersect(rngToCheck.Worksheet.UsedRange, rngToCheck)

If Not rngToCheck Is Nothing Then

'assign cell value(s) into memory so they
'are faster to work with
varValues = rngToCheck.Value

'if rngToCheck is more than 1 cell then
'varValues will be a 2 dimensional array
If IsArray(varValues) Then

If dicDistinct Is Nothing Then
Set dicDistinct = CreateObject("Scripting.Dictionary")
dicDistinct.CompareMode = BinaryCompare
Else
dicDistinct.RemoveAll
End If

For lngRow = LBound(varValues, 1) To UBound(varValues, 1)
For lngCol = LBound(varValues, 2) To UBound(varValues, 2)

varValue = varValues(lngRow, lngCol)

'ignore error values
If Not IsError(varValue) Then

'ignore blank cells
'including formulae which return ""
If LenB(varValue) > 0 Then

'if we have a string then let's allow for case sensitivity
If VarType(varValue) = vbString Then
If Not blnCaseSensitive Then
varValue = UCase(varValue)
End If
End If

'if it already exists then keep a counter on
'how many times it occurs
If dicDistinct.Exists(varValue) Then
dicDistinct.Item(varValue) = dicDistinct.Item(varValue) + 1
Else
'else add it with an occurence of 1
End If

End If
End If
Next lngCol
Next lngRow

'we're only interested in values which appeared exactly once
varItems = dicDistinct.Items

For lngItem = LBound(varItems, 1) To UBound(varItems, 1)
If varItems(lngItem) = 1 Then
lngCount = lngCount + 1
End If
Next lngItem

Else
'ignore if cell contains an error or is blank
If Not IsError(varValues) Then
If LenB(varValues) > 0 Then
lngCount = 1
End If
End If
End If
End If

COUNTUNIQUE = lngCount

Exit Function

ErrorHandler:
COUNTUNIQUE = CVErr(xlErrValue)

End Function
```  #### Wrapping It Up

Using a well-written UDF is a good solution to this problem. The complications apparent in the `FREQUENCY() `formula are hidden away in the UDF’s code which allows for a simple formula. The calculation performance of the `Dictionary `UDF is superior to the `FREQUENCY()` formula, particularly when there are a lot of different values in the precedent range. The flexibility afforded by using code also allows for simple adjustments to enhance the features of the UDF, such as making it case-sensitive.

Are there other good ways to get a distinct (or unique) count? Yes, there are. Two good options would be to use the advanced filter or a pivot table to get a distinct list and then use a simple formula to count the items in that list. I’ve got a blog post planned for automating the advanced filter using VBA, so I’ll cover that in more detail then.

RAD Developer Microsoft MVP - Excel 2009 - 2014
This entry was posted in Microsoft Excel, Microsoft Office and tagged , , , , , , , , , , , , , , , , , , , . Bookmark the permalink.

### 6 Responses to Count Distinct Or Unique Values – VBA UDF

1. Jeff Weir says:

Hi Colin. I did a post at http://dailydoseofexcel.com/archives/2013/10/23/dictionaries-can-be-rude/ recently that did some testing of deduping using dictionaries vs other approaches that may be of interest. Among other things, it turns out that if you have lots of unique items, if you sort them first before you add them to the dictionary, things run considerably faster.

Like

2. Aldo Arch says:

Beautifully Written. Thank you.

Like

3. Christos Rbs says:

It works perfectly! It really helped a lot.
Nice Job!

Like