Count Distinct Or Unique Values – VBA UDF

A handful of posts ago I looked at some FREQUENCY() formulae which could give a count of distinct or unique values in a given column. The formulae became quite complicated and slow – particularly when dealing with mixed data types –  so, in this post, I’m going to have a go at creating some VBA User Defined Functions to do the same thing. A UDF is simply a custom function (written in any language) which is called from a Range or, in other words, used in a worksheet formula.

Before I continue, let me clarify the difference between what I call a ‘distinct’ count and a ‘unique’ count. Suppose we have a list as follows:

a, a, b, b, c, d, e, e, f

  • From this list, I would say there are 6 distinct values: a, b, c, d, e and f. I use the word ‘distinct’ to mean ‘different’ – the same as it is used in the SQL world .
  • From this list, I would say there are 3 unique values: c, d and f. The unique values are the ones which appear exactly once in the list.

The two  VBA approaches I’m going to look at are:

  • A Collection object
  • A Dictionary object

As a starting point I’m going to create some basic UDFs which give a distinct count using each of these approaches. The distinct FREQUENCY() formula I’m going to compare those UDFs against is:

=SUM(--(FREQUENCY(IF(A1:A9<>"",MATCH("~"&A1:A9,A1:A9&"",0)),ROW(A1:A9)-ROW(A1)+1)>0))

This formula was broken down and explained in detail in my other post, so please have a read through that if you’re not sure how it works. I’ll design the basic UDFs in a somewhat contrived fashion so that they give the same results as this formula and then I’ll do some benchmark testing to see which one is faster to calculate.

Download Example Workbook

I’ve made a workbook available for download which has all of the below UDFs.

Basic Distinct Count Using A Collection Object

Public Function COUNTDISTINCTcol(ByRef rngToCheck As Range) As Variant

    Dim colDistinct As Collection
    Dim varValues As Variant, varValue As Variant
    Dim lngCount As Long, lngRow As Long, lngCol As Long

    On Error GoTo ErrorHandler

    varValues = rngToCheck.Value

    'if rngToCheck is more than 1 cell then
    'varValues will be a 2 dimensional array
    If IsArray(varValues) Then

        Set colDistinct = New Collection

        For lngRow = LBound(varValues, 1) To UBound(varValues, 1)
            For lngCol = LBound(varValues, 2) To UBound(varValues, 2)

                varValue = varValues(lngRow, lngCol)

                'ignore blank cells and throw error
                'if cell contains an error value
                If LenB(varValue) > 0 Then

                    'if the item already exists then an error will
                    'be thrown which we want to ignore
                    On Error Resume Next
                    colDistinct.Add vbNullString, CStr(varValue)
                    On Error GoTo ErrorHandler

                End If

            Next lngCol
        Next lngRow

        lngCount = colDistinct.Count
    Else
        If LenB(varValues) > 0 Then
            lngCount = 1
        End If

    End If

    COUNTDISTINCTcol = lngCount

    Exit Function

ErrorHandler:
    COUNTDISTINCTcol = CVErr(xlErrValue)

End Function

In summary, each item in the Collection must have a unique key and that unique key must be a String. If the code tries to create a duplicate key then an error is raised. The error is ignored due to the On Error Resume Next directive. A feature of the VBA Collection object is that the key is case-insensitive.

The LenB() function is used to check if the cell is blank; if the cell contains an error value then an error will be raised at this point and the UDF will return #VALUE!. I’ll rant about the confusion in Excel between empty and blank cells in a future blog post (::grin::) but when I say blank I mean the cell either contains:

  • absolutely nothing (it is empty)
  • a zero length string
  • only a cell prefix character (usually ')

Basic Distinct Count Using A Dictionary Object

I’ve actually written two Dictionary object UDFs for benchmarking because there are a couple of ways the Dictionary object can be created and maintained. Firstly, a new Dictionary object could be created every time the UDF is called or, secondly, a Dictionary object could be created on the first call and then cleared (but not destroyed) by using the Dictionary.RemoveAll() method on subsequent calls.

The Dictionary class is part of the Microsoft Scripting Runtime library (scrrun.dll) and since I’ve used early binding in my code, a reference needs to be added to it in the VBA IDE under Tools > References.

scrrun.dll reference

Here’s the UDF which creates a new Dictionary object on each call:

Public Function COUNTDISTINCTdicNew( _
    ByRef rngToCheck As Range) As Variant

    'Early binding declaration (Microsoft Scripting Runtime
    'reference required):
    Dim dicDistinct As Scripting.Dictionary
    Dim varValues As Variant, varValue As Variant
    Dim lngCount As Long, lngRow As Long, lngCol As Long
    Dim strValue As String

    On Error GoTo ErrorHandler

    varValues = rngToCheck.Value

    'if rngToCheck is more than 1 cell then
    'varValues will be a 2 dimensional array
    If IsArray(varValues) Then

        Set dicDistinct = CreateObject("Scripting.Dictionary")
        dicDistinct.CompareMode = TextCompare

        For lngRow = LBound(varValues, 1) To UBound(varValues, 1)
            For lngCol = LBound(varValues, 2) To UBound(varValues, 2)

                varValue = varValues(lngRow, lngCol)

                'ignore blank cells and throw error
                'if cell contains an error value
                If LenB(varValue) > 0 Then

                    'cast everything to a string
                    'so dictionary is not type sensitive
                    strValue = CStr(varValue)

                    If Not dicDistinct.Exists(strValue) Then
                        dicDistinct.Add strValue, vbNullString
                    End If
                End If

            Next lngCol
        Next lngRow

        lngCount = dicDistinct.Count
    Else
        If LenB(varValues) > 0 Then
            lngCount = 1
        End If

    End If

    COUNTDISTINCTdicNew = lngCount

    Exit Function

ErrorHandler:
    COUNTDISTINCTdicNew = CVErr(xlErrValue)

End Function

And here’s the UDF which retains the existing Dictionary object after the first call and subsequently clears it. The Static keyword is used instead of Dim so that the Dictionary object reference is retained between function calls:

Public Function COUNTDISTINCTdicStatic( _
    ByRef rngToCheck As Range) As Variant

    Static dicDistinct As Scripting.Dictionary

    Dim varValues As Variant, varValue As Variant
    Dim lngCount As Long, lngRow As Long, lngCol As Long
    Dim strValue As String

    On Error GoTo ErrorHandler

    varValues = rngToCheck.Value

    'if rngToCheck is more than 1 cell then
    'varValues will be a 2 dimensional array
    If IsArray(varValues) Then

        If dicDistinct Is Nothing Then
            Set dicDistinct = CreateObject("Scripting.Dictionary")
            dicDistinct.CompareMode = TextCompare
        Else
            dicDistinct.RemoveAll
        End If

        For lngRow = LBound(varValues, 1) To UBound(varValues, 1)
            For lngCol = LBound(varValues, 2) To UBound(varValues, 2)

                varValue = varValues(lngRow, lngCol)

                'ignore blank cells and throw error
                'if cell contains an error value
                If LenB(varValue) > 0 Then

                    'cast everything to a string
                    'so dictionary is not type sensitive
                    strValue = CStr(varValue)

                    If Not dicDistinct.Exists(strValue) Then
                        dicDistinct.Add strValue, vbNullString
                    End If
                End If

            Next lngCol
        Next lngRow

        lngCount = dicDistinct.Count
    Else
        If LenB(varValues) > 0 Then
            lngCount = 1
        End If
    End If

    COUNTDISTINCTdicStatic = lngCount

    Exit Function

ErrorHandler:
    COUNTDISTINCTdicStatic = CVErr(xlErrValue)

End Function

Checking The UDFs

First let’s verify that the UDFs all return the same result as the FREQUENCY() formula.

Distinct count basic UDF results

As you can see, all four formulae are type-insensitive (they don’t distinguish between text, numbers, logicals etc) and case-insensitive (for example, they don’t distinguish between "A" and "a" ).

Calculation Benchmarking

The two main factors when benchmarking these UDFs are the size of the precedent range and the cardinality (the uniqueness) of the values, so I tested them against a variety of range sizes using the two extreme scenarios: all the values being different and all the values being the same. Here are the results:

Basic UDFs All Values Different

Basic UDFs All Values Same

  • The FREQUENCY() formula performs the best when all the values are the same but becomes extremely slow when all the values are different; in fact, I gave up timing it beyond 10,000 cells.
  • There’s no material difference between the two Dictionary object UDFs which surprised me. I expected the static one to be faster until the number of items in the Dictionary from the previous call became very large, thus slowing down the RemoveAll() method.
  • All of the formulae are pretty fast when all the values are the same.
  • The Collection object UDF would generally seem to be the best overall choice.

However, the testing so far has been a little bit unfair on the Dictionary object UDFs. This is because I went out of my way to make all the UDFs return the same results and, in so doing, I converted all the cell values to String types to make the Dictionary UDFs type-insensitive and used a Dictionary.CompareMode of TextCompare to make them case-insensitive. These both hit the performance. If I make the Dictionary UDFs type-sensitive (no conversion to String types) and case-sensitive (BinaryCompare) then the benchmarking looks a little different:

Basic UDFs All Values Different (revised)

Basic UDFs All Values Same (revised)

Now the Dictionary UDFs are the fastest in both scenarios. Given the performance tests and the features of each UDF, my general choice would be to go with the Dictionary object. If you specifically want a type-insensitive and case-insensitive UDF then go with the Collection object.

Enhanced Distinct Count – Dictionary Object

I’ve enhanced the basic, static Dictionary UDF below to give it a bit more punch. One of the changes I’ve made is to reduce the precedent range if a whole column reference has been passed in. To do that I’ve used the Worksheet.UsedRange property which Charles Williams recently blogged about: it can become extremely slow if there are a lot of cells containing data or formatting on the worksheet (it’s not the size of the used range that matters) because it internally reads from the Cell table. Have a read through his blog post – you might decide that you would prefer to remove it or use an alternative approach.

Public Function COUNTDISTINCT( _
    ByRef rngToCheck As Range, _
    Optional ByVal blnCaseSensitive As Boolean = True _
                            ) As Variant

    Static dicDistinct As Object

    Dim varValues As Variant, varValue As Variant
    Dim lngCount As Long, lngRow As Long, lngCol As Long

    On Error GoTo ErrorHandler

    'minimise the precedent range in case of full column references
    'for more information on performance of Worksheet.UsedRange
    'see Charles Williams' blog:
    'http://fastexcel.wordpress.com/2012/12/02/writing-efficient-udfs-part-11-full-column-references-in-udfs-used-range-is-slow/
    Set rngToCheck = Intersect(rngToCheck.Worksheet.UsedRange, rngToCheck)

    If Not rngToCheck Is Nothing Then

        'assign cell value(s) into memory so they
        'are faster to work with
        varValues = rngToCheck.Value

        'if rngToCheck is more than 1 cell then
        'varValues will be a 2 dimensional array
        If IsArray(varValues) Then

            If dicDistinct Is Nothing Then
                Set dicDistinct = CreateObject("Scripting.Dictionary")
                dicDistinct.CompareMode = BinaryCompare
            Else
                dicDistinct.RemoveAll
            End If

            For lngRow = LBound(varValues, 1) To UBound(varValues, 1)
                For lngCol = LBound(varValues, 2) To UBound(varValues, 2)

                    varValue = varValues(lngRow, lngCol)

                    'ignore error values
                    If Not IsError(varValue) Then

                        'ignore blank cells
                        'including formulae which return ""
                        If LenB(varValue) > 0 Then

                            'if we have a string then let's allow for case sensitivity
                            If VarType(varValue) = vbString Then
                                If Not blnCaseSensitive Then
                                    varValue = UCase(varValue)
                                End If
                            End If

                            If Not dicDistinct.Exists(varValue) Then
                                dicDistinct.Add varValue, vbNullString
                            End If

                        End If
                    End If
                Next lngCol
            Next lngRow

            lngCount = dicDistinct.Count
        Else
            'ignore if cell contains an error or is blank
            If Not IsError(varValues) Then
                If LenB(varValues) > 0 Then
                    lngCount = 1
                End If
            End If
        End If
    End If

    COUNTDISTINCT = lngCount

    Exit Function

ErrorHandler:
    COUNTDISTINCT = CVErr(xlErrValue)

End Function

Note the following points:

  • It will count number, text and logical data types but it ignores error values such as #N/A and DIV/0!.
  • It ignores empty (and blank) cells.
  • It is case-sensitive by default.
  • It differentiates between data types. For example, these two formulae would be considered to be different: =TRUE() ="True" , as would ="1" and =1.
  • A reference to the Microsoft Scripting Runtime library is required. If you do not include the reference then the Dictionary object needs to be declared as an Object type and BinaryCompare set as a constant equal to 0.
  • It can work on data across multiple columns.

Here are a couple of examples to validate the UDF’s results.

 COUNTDISTINCT Case Sensitive

 COUNTDISTINCT Case Insensitive

Enhanced Unique Count – Dictionary Object

The code to get a unique count is very similar.

Public Function COUNTUNIQUE( _
    ByRef rngToCheck As Range, _
    Optional ByVal blnCaseSensitive As Boolean = True _
                            ) As Variant

    Static dicDistinct As Object

    Dim varValues As Variant, varValue As Variant, varItems As Variant
    Dim lngCount As Long, lngItem As Long
    Dim lngRow As Long, lngCol As Long

    On Error GoTo ErrorHandler

    'minimise the precedent range in case of full column references
    'for more information on performance of Worksheet.UsedRange
    'see Charles Williams' blog:
    'http://fastexcel.wordpress.com/2012/12/02/writing-efficient-udfs-part-11-full-column-references-in-udfs-used-range-is-slow/
    Set rngToCheck = Intersect(rngToCheck.Worksheet.UsedRange, rngToCheck)

    If Not rngToCheck Is Nothing Then

        'assign cell value(s) into memory so they
        'are faster to work with
        varValues = rngToCheck.Value

        'if rngToCheck is more than 1 cell then
        'varValues will be a 2 dimensional array
        If IsArray(varValues) Then

            If dicDistinct Is Nothing Then
                Set dicDistinct = CreateObject("Scripting.Dictionary")
                dicDistinct.CompareMode = BinaryCompare
            Else
                dicDistinct.RemoveAll
            End If

            For lngRow = LBound(varValues, 1) To UBound(varValues, 1)
                For lngCol = LBound(varValues, 2) To UBound(varValues, 2)

                    varValue = varValues(lngRow, lngCol)

                    'ignore error values
                    If Not IsError(varValue) Then

                        'ignore blank cells
                        'including formulae which return ""
                        If LenB(varValue) > 0 Then

                            'if we have a string then let's allow for case sensitivity
                            If VarType(varValue) = vbString Then
                                If Not blnCaseSensitive Then
                                    varValue = UCase(varValue)
                                End If
                            End If

                            'if it already exists then keep a counter on
                            'how many times it occurs
                            If dicDistinct.Exists(varValue) Then
                                dicDistinct.Item(varValue) = dicDistinct.Item(varValue) + 1
                            Else
                                'else add it with an occurence of 1
                                dicDistinct.Add varValue, 1
                            End If

                        End If
                    End If
                Next lngCol
            Next lngRow

            'we're only interested in values which appeared exactly once
            varItems = dicDistinct.Items

            For lngItem = LBound(varItems, 1) To UBound(varItems, 1)
                If varItems(lngItem) = 1 Then
                    lngCount = lngCount + 1
                End If
            Next lngItem

        Else
            'ignore if cell contains an error or is blank
            If Not IsError(varValues) Then
                If LenB(varValues) > 0 Then
                    lngCount = 1
                End If
            End If
        End If
    End If

    COUNTUNIQUE = lngCount

    Exit Function

ErrorHandler:
    COUNTUNIQUE = CVErr(xlErrValue)

End Function

COUNTUNIQUE Case Sensitive

COUNTUNIQUE Case Insensitive

Wrapping It Up

Using a well-written UDF is a good solution to this problem. The complications apparent in the FREQUENCY() formula are hidden away in the UDF’s code which allows for a simple formula. The calculation performance of the Dictionary UDF is superior to the FREQUENCY() formula, particularly when there are a lot of different values in the precedent range. The flexibility afforded by using code also allows for simple adjustments to enhance the features of the UDF, such as making it case-sensitive.

Are there other good ways to get a distinct (or unique) count? Yes, there are. Two good options would be to use the advanced filter or a pivot table to get a distinct list and then use a simple formula to count the items in that list. I’ve got a blog post planned for automating the advanced filter using VBA, so I’ll cover that in more detail then.

Advertisements

About Colin Legg

RAD Developer Microsoft MVP - Excel 2009 - 2014
This entry was posted in Microsoft Excel, Microsoft Office and tagged , , , , , , , , , , , , , , , , , , , . Bookmark the permalink.

5 Responses to Count Distinct Or Unique Values – VBA UDF

  1. Pingback: Arrays vs Collections vs Dictionary Objects (and Dictionary help) | Newton Excel Bach, not (just) an Excel Blog

  2. Jeff Weir says:

    Hi Colin. I did a post at http://dailydoseofexcel.com/archives/2013/10/23/dictionaries-can-be-rude/ recently that did some testing of deduping using dictionaries vs other approaches that may be of interest. Among other things, it turns out that if you have lots of unique items, if you sort them first before you add them to the dictionary, things run considerably faster.

    Like

  3. Aldo Arch says:

    Beautifully Written. Thank you.

    Like

  4. Pingback: Is Your Cell Blank Or Empty? | RAD Excel

  5. Christos Rbs says:

    It works perfectly! It really helped a lot.
    Nice Job!

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s