Results 1 to 3 of 3

Thread: VB5/6/VBA - HeapsortVarVar, Another VB Sort

  1. #1

    Thread Starter
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Lightbulb VB5/6/VBA - HeapsortVarVar, Another VB Sort

    This is a VB6 static (.BAS) module for sorting a Variant array of Variant arrays using the Heapsort algorithm. It could easily be converted to a class (.CLS) if desired as well.

    It should be usable in VB5 and VBA, but this hasn't been tested. it should even be convertable to VBScript, though with much reduced performance of course.


    Why VarVars?

    When I need to work on large sets of row/col-based (tabular) data and I can't use a database or ADO Recordset I find Variant arrays of Variant arrays handy. Much handier then a 2-D array of String because I can used typed data, far less clunky than working with inflexible UDT arrays where you can't iterate over the fields, and even more useful than 2-D Variant arrays.

    Typed data instead of String for everything can be very useful. In a field typed as Single, Double, Currency, etc. 2 and 2.0 will compare properly.

    For another thing you can build a "row" of fields easily just using the VB Array() function, then assign it to a "row" slot in your Variant rows array.


    Why Heapsort?

    The Heapsort is reasonably speedy, doesn't require additional space, and performs well even in worst-case input sequences.

    But the main reason is that it is easy to modify to perform the sort in several steps. This makes it possible to use a Timer control to drive a "background" sort that doesn't cause your programs to become unresponsive. This does add to the time a little, but the Timer.Interval need only be 16 to 32 milliseconds for most purposes. It also allows you update progress bars, handle "cancel" button clicks, etc.

    This has always been the preferred way to handle this in VB6, as the manual states in many places. DoEvents() calls have their place, but DoEvents() is evil, should rarely be used, and even then only used quite carefully.

    However both a synchronous sort and a quantified "pseudo-async" sort are provided here.


    Usage

    Basically, just add HeapsortVarVar.bas to your projects.

    Then when you have a Variant array of Variant arrays to sort, call the SortBy() subroutine (or the QuantumSortBy() function until it returns False).

    You pass the column index to sort on, the outer (rows) array to be sorted, and an optional Boolean Descending value (True or False, default False) to specify the direction of the sort.

    QuantumSortBy() has two more parameters but the source code comments should explain those. These need to be initialized before the first call of each sort to be performed.

    There are no extra dependencies.


    Issues

    There are sorts that can be faster, Quicksort is popular. However they aren't usually enough faster to warrant sacrificing some of the things that are easy to do to Heapsort (e.g. a quantified pseudo-async version).

    If you create a Quicksort adapted for this I'd love to see it. Right now HeapsortVarVar works plenty fast for me, but more speed without additional pain is always appreciated.

    Bugs. As far as I can tell there aren't any, but if you find some I'd like to know about them.

    I find this very versatile, and once "known bug free" it should make a useful and easily reusable sort for VB users. While you can use it for sorting single items, it would be better to just create a new version tailored to work on a simple single-valued 1-D array.


    Demos

    There are two in the ZIP archive attachment. They are "ready to run" with a sample input file included, though I'd try testing after compiling to an EXE first. There is more data included than it appears, though it ZIPpped fairly small because it's about 1500 records copied into the file multiple times. The sort isn't too bad even in the IDE, however populating and re-populating the flexgrid can take a while.

    Yes, I know the flexgrids can sort, but here it is just being used for demo purposes.

    HeapsortDemo

    This is a GUI program that loads and parses the sample data into a VarVar and displays it in an MSHFlexGrid. Then you can click or shift-click the column headers to sort and redisplay the VarVar contents.

    Yes, there are a couple of UI quirks, related to the column click selecting cells in the grid. The program tries to clear these selections but it doesn't always succeed. The grid is merely being used for demo purposes here anyway. But you may already know of a fix for that if you care.

    DeleteDups

    A simpler program without any Forms that loads up the sample data, then sorts on the second column as a Single value, then
    writes a new file including only the first row for each unique "second column" value.

    It uses MsgBox calls to display when it reaches each phase and the time each phase takes.
    Attached Files Attached Files

  2. #2

    Thread Starter
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: VB5/6/VBA - HeapsortVarVar, Another VB Sort

    I duplicated more rows to make a really big file. Then I ran the DeleteDups demo on it.

    Code:
    126 MB input:
    Loaded 1064880 rows in 47.703 seconds.
    Sorted 1064880 rows in 9.688 seconds.
    Wrote 17 rows in 0.375 seconds.
    Not a screamer, but good for a lot that I do. I am doing a lot of fiddling in there which is why the load takes so long.

    Yes, I realize this isn't the most scientific test with all those dups in there and such.

  3. #3

    Thread Starter
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: VB5/6/VBA - HeapsortVarVar, Another VB Sort

    Please note:

    This isn't and wasn't meant to be a general purpose, just copy & paste, meets all needs so you never have to think, answer to every sorting need.


    Variants have their cost, and when you're dealing with one Variant per data item plus one for each row of data their cost in memory can be quite large. You can hit practical limits based on system RAM size sooner than you might imagine, and 32-bit programs hit this limit quickly even if you do have a 12 GB desktop system.

    The technique used in HeapsortVarVar is best suited to applications that need only 100 to 100,000 rows. If you don't have a large number of fields then 1 million to 2 million rows is probably a "design maximum" to keep in mind.

    A Microsoft Jet 4.0 database doesn't require MS Access to be installed and is really your best bet generally. If you really are in the 100 million rows and up arena you probably will want to use SQL Server (and I don't mean Compact or Express) or another large scale database. Hopefully we'll soon see an easy to install, easy to use "No SQL" database for Windows, which can improve performance quite a bit when you don't require relational set operations and other SQL features.

    If that doesn't work for you then consider using external sorting utilities and sequential passes over sorted files or other alternatives.


    If you really need everything in RAM to work on it, you have gigantic sets of text file data to work on, etc. then you may need to consider highly tailored code for such applications. At that point your best bet may well indeed be to use arrays of UDTs or Byte arrays and a table of fixed-field offsets and lengths within records in a packed format.

    This is the "Cobol" way, but (a.) it was what let Cobol programs be successful in a world of vastly smaller machines than you have on your desktop today, and (b.) it can scale very well. To make it work well using UDTs you want to avoid any variable-length String fields you can though!

    You can adapt a sort like Heapsort or a Quicksort example, but keep in mind that hand-rolled sort code is nearly as dangerous as hand-rolled encryption logic. Bugs can be subtle and devastating. This is a strong factor in favor of just using proven database engines to handle such things.

    Another option is to build indexes in parallel to the raw data. This can end up being significantly faster than sorting ans resorting, though it also has a memory cost as well as increasing complexity and raising the hazard of subtle errors even higher.

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width