Hash & Probe Alternatives

**Judd** · Jan 31st, 2002, 04:16 AM

We have a database that locates records using a 'hash & step' method (linear probing). This has the typical problem of collisions causing speed issues when the record array is full. Hence, for speed, the array needs to be sized to ~ 2x the number of records. This is pretty inefficient when it comes to resources (each record being 512 bytes, typical arrays being 4000 records - ok its not that bad, but we have a couple of these arrays our 2MB app loads a 1MB database and has a 30-40 MB footprint!!).

Does anyone know of any fast search methods that could be alternatives? I was reading about quadratic probing instead of linear, but you'd still have to keep the 'null space' up...

**Guv** · Feb 1st, 2002, 11:38 PM

Possible solutions to this problem depend a lot on the update frequency.

A sorted file using binary search (or better, yet interpolated search) should result in reasonably fast retrieval. Similarly, retrieval from an indexed file tends to be fast. Both of these alternatives increase maintenance time. If the update activity is low, this seems to be the way to go, but is not a good alternative if there is a lot of maintenance activity.

Another possibility is maintaining a sorted or indexed file containing most of the data.. Another (hopefully a small) secondary file contains records not in the main file. Retrieval starts by quickly finding the desired record in the main file or quickly deciding that it is not there. If not there, a slower sequential search of the secondary file is required. Periodically, a time consuming job merges the two files into a larger main file (sorted or indexed) and an empty secondary file. This approach also requires marking some records in the main file as having been deleted. The virtually deleted records would be really deleted when the two files are merged.

A third alternative is to live with the problems caused by the hash and probe method using a file with only 15-25 percent extra space. The first 75% or so of the records entered would be processed fairly fast, with efficiency decreasing for the last 25% of the records.

I think the best alternative also uses two files. A main file using hash & probe, and a secondary file which is either sorted or indexed. When a record is to be added, it will be put into the main file only if less than (for example) 4 probes are required to insert it. The records not inserted would be put into a secondary file. Access to the main file would be reasonably efficient, without requiring only 50% utilization. Access to the secondary file would probably be even more efficient since it is either sorted or indexed. Maintenance of the secondary file would be slower due to the overhead resulting from the requirement for it being an indexed or sorted file. If the secondary file is not too large, the extra maintenance time would not be excessive. Note that maintaining a sorted file does not require sorting it every time you update it. You can read the file and write it to another file, after which you delete the original file and rename the new one. Added records are inserted during the rewrite process. With a more complex program and a memory buffer containing 10 to 20 or more records, you do not need to have enough disk space for two files.

It might take some analysis or perhaps experimentation to determine the optimum sizes for the two files required by the latter alternative. The analysis and experimentation would also attempt to determine the number of probes used and the utilization percentage for the main file.

wossname · Feb 2nd, 2002, 06:57 AM

My favourite tome at the moment is of "The Art of Computer Programming" by Donald E. Knuth.

Volume 3 --> "Sorting and Searching".

There are 25 sorting algorithms analysed in scrutinous detail. Including efficiency analysis and both RAM and disk overhead. Searching is equally well covered.

Get yourself to a good library and find this book. It is truly a work of classic literature. It contains the best possible solution to your problem. I guarantee it! It gives detailed descriptions of algorithms in both plain english and a pseudo ASM format (plus flow diagrams and numerical tables).

A nerd's heaven!

jim mcnamara · Feb 5th, 2002, 10:40 PM

Still my favorite, too - 30+ years later.

You can implement sparse arrays that use unique keys - it's called a collection in VB or ADA. Create the keys with a hash table that has a theoretical length greater by a factor of 2-3 times than all of your table entries. The array element consists of a quadword (64bit) result of hashing and a record number for indexing into the db or file or data array for direct access.

This assumes unique keys.

You can also buy ISAM software to index these things VERY efficiently.

Try:

http://www.mixsoftware.com/product/db/intro1c.htm

**Judd** · Feb 6th, 2002, 03:37 AM

I'm not sure how possible that is. You're saying have a sparse array with 64 bit keys, that presumably only stores the data present and not the sparsity (i.e. memory is not allocated for the missing elements). My arrays are in FORTRAN, which doesn't respond too well to arrays of unknown or unbounded size.

For 'items' I have two, equally size, arrays. The item array and the item_key array. Rather unsurprisingly, item_key, contains the key field, the item name. So we have:

Code:

item_key(n).name = the name of item n, this is the unique name, the hash key.
item_key(n).class = the class of item n
item_key(n).desc = the description of item n

and

item(n) = the data for item n

So the name is hashed into item_key to get index n, which is then used to store data in item at that index.

Once the database is loaded into the app, it only ever uses the index, and doesn't hash. Hashing is only done when creating new items. We used to store the database as a binary file which essentially just dumped the binary array data from memory (i.e. stored the prehashed indexes). However, for business and usability reasons, I'm converting the DB format to XML. This means that when loading a database I'll have to 'create' all my items, which means hashing, which means s..l..o..w..

Thread: Hash & Probe Alternatives

Thread Tools

Display

Hash & Probe Alternatives

Unfortunately...

Posting Permissions