dcsimg
Results 1 to 32 of 32

Thread: fast way of checking if line of file is in list

  1. #1

    Thread Starter
    Fanatic Member
    Join Date
    Oct 2011
    Location
    Sydney, Australia
    Posts
    756

    fast way of checking if line of file is in list

    Hi All,

    Been a while, I hope someone more clever and knowledgeable than me can help.

    I'm working on a little bit of code, it works according to this logic:

    creates a dictionary, this dictionary contains a large amount of items, in my testing at the moment im looking at 190,000 or so

    Now i need to then see if anything in this dictionary is in my excluded list. The excluded list is massive, we are taking 1.5 GB in size and almost 44 Million lines. Each line contains 1 value and its the values in the dictionary im testing against.

    I want to remove any value that appears in the list and prefer to do so fast. I could read it all into memory however that is not practical as it chews about 6GB of memory.

    I have at the moment gone with reading line by line which on its own goes really fast (less than 10 secs on my machine) to just read from start to finish.

    once I add logic in to test if the dictionary contains the read value it slows down WAY to much. I have yet to see it end.

    below is part of my code, fileha is the dictionary in question. I have tried adding matched values into a list to remove later but no difference. If I had to guess the issue is the .Contains part as this would check each value for each line read.... I don't know where to go from here

    Code:
     Using br As New BinaryReader(File.Open("C:\data\exclusions.txt", FileMode.Open))
                Try
                    Do
                        Dim line As String
                        line = br.ReadString
    'works great without the below If statement
                        If fileha.Values.Contains(line) Then
                            For Each i In fileha.Where(Function(kvp) kvp.Value = line).ToList()
                                fileha.Remove(i.Key)
                            Next
                        End If
                    Loop
               Catch ex As EndOfStreamException
                    'reached end of file, this is fine
                End Try
            End Using
    Last edited by bensonsearch; May 7th, 2016 at 11:58 PM.
    My CodeBank Submissions
    • Listbox with transparency and picture support - Click Here
    • Check for a true internet connection - Click Here
    • Open Cash drawer connected to receipt printer - Click Here
    • Custom color and size border around form - Click Here
    • Upload file to website without user logins, includes PHP - Click Here
    • List All Removable USB Storage Devices - Click Here
    • Custom On/Off Slide Control - Click Here
    • Insert multiple rows of data into one database table using parameters - Click Here
    • Trigger USB/Serial Cash Drawer - Click Here

  2. #2
    PowerPoster
    Join Date
    Sep 2005
    Location
    Modesto, Ca.
    Posts
    4,450

    Re: fast way of checking if line of file is in list

    Instead of looping thru 44million lines and checking if the value is in the dictionary, is should be faster to loop thru 190,000 items in the dictionary and check if the item is in the exclusion list.

    I don't know the exact code off hand but I'm sure it can be done.

  3. #3

    Thread Starter
    Fanatic Member
    Join Date
    Oct 2011
    Location
    Sydney, Australia
    Posts
    756

    Re: fast way of checking if line of file is in list

    that would be a matter of changing the using to go the other way. I'll have to do some math as I'd rather go through a small list many times rather than through the big list many times.
    My CodeBank Submissions
    • Listbox with transparency and picture support - Click Here
    • Check for a true internet connection - Click Here
    • Open Cash drawer connected to receipt printer - Click Here
    • Custom color and size border around form - Click Here
    • Upload file to website without user logins, includes PHP - Click Here
    • List All Removable USB Storage Devices - Click Here
    • Custom On/Off Slide Control - Click Here
    • Insert multiple rows of data into one database table using parameters - Click Here
    • Trigger USB/Serial Cash Drawer - Click Here

  4. #4

    Thread Starter
    Fanatic Member
    Join Date
    Oct 2011
    Location
    Sydney, Australia
    Posts
    756

    Re: fast way of checking if line of file is in list

    Wouldn't it be the same ? 44 million by 190,000 vs 190,000 by 44 million. Same number of times
    My CodeBank Submissions
    • Listbox with transparency and picture support - Click Here
    • Check for a true internet connection - Click Here
    • Open Cash drawer connected to receipt printer - Click Here
    • Custom color and size border around form - Click Here
    • Upload file to website without user logins, includes PHP - Click Here
    • List All Removable USB Storage Devices - Click Here
    • Custom On/Off Slide Control - Click Here
    • Insert multiple rows of data into one database table using parameters - Click Here
    • Trigger USB/Serial Cash Drawer - Click Here

  5. #5
    PowerPoster
    Join Date
    Sep 2005
    Location
    Modesto, Ca.
    Posts
    4,450

    Re: fast way of checking if line of file is in list

    Well, if you put the exclusion data into an array you wouldn't have to cycle thru it, you would just call the "find" method. Or you could us a List or a DataTable. But you shouldn't have to cycle thru the exclusion list each time.

  6. #6

    Thread Starter
    Fanatic Member
    Join Date
    Oct 2011
    Location
    Sydney, Australia
    Posts
    756

    Re: fast way of checking if line of file is in list

    The only issue is holding it all in any list or array uses to much ram. Seems to be about 6GB. That's to much
    My CodeBank Submissions
    • Listbox with transparency and picture support - Click Here
    • Check for a true internet connection - Click Here
    • Open Cash drawer connected to receipt printer - Click Here
    • Custom color and size border around form - Click Here
    • Upload file to website without user logins, includes PHP - Click Here
    • List All Removable USB Storage Devices - Click Here
    • Custom On/Off Slide Control - Click Here
    • Insert multiple rows of data into one database table using parameters - Click Here
    • Trigger USB/Serial Cash Drawer - Click Here

  7. #7
    PowerPoster
    Join Date
    Sep 2005
    Location
    Modesto, Ca.
    Posts
    4,450

    Re: fast way of checking if line of file is in list

    Maybe there's a better approach to what your trying to achieve. You need to provide more detail. What are the 44m items, why are they in a text file, what are the 190,000 items, where do they come from.

  8. #8

    Thread Starter
    Fanatic Member
    Join Date
    Oct 2011
    Location
    Sydney, Australia
    Posts
    756

    Re: fast way of checking if line of file is in list

    There might be. I don't wish to reveal the exact data so let's try it like this :

    The small 190,000 dictionary is held in memory and contains people's street number as key and suburb as value (again just example)

    The large 44 mill line file is a list of suburbs we don't care about or are to be excluded from the dictionary.

    The large file is a text file as I can't see any benifit holding it in a database eg size of database would be greater than txt file and read/write is much simpler to a binary stream in terms of raw speed (as far as I know). It's worth noting that I have already removed duplicates from the large file so it's not getting smaller and may even grow larger over time. Duplicates must stay in the dictionary as there are more than 1 person per suburb
    My CodeBank Submissions
    • Listbox with transparency and picture support - Click Here
    • Check for a true internet connection - Click Here
    • Open Cash drawer connected to receipt printer - Click Here
    • Custom color and size border around form - Click Here
    • Upload file to website without user logins, includes PHP - Click Here
    • List All Removable USB Storage Devices - Click Here
    • Custom On/Off Slide Control - Click Here
    • Insert multiple rows of data into one database table using parameters - Click Here
    • Trigger USB/Serial Cash Drawer - Click Here

  9. #9
    Sinecure devotee
    Join Date
    Aug 2013
    Location
    Southern Tier NY
    Posts
    5,919

    Re: fast way of checking if line of file is in list

    Quote Originally Posted by bensonsearch View Post
    Wouldn't it be the same ? 44 million by 190,000 vs 190,000 by 44 million. Same number of times
    Quote Originally Posted by bensonsearch View Post
    The only issue is holding it all in any list or array uses to much ram. Seems to be about 6GB. That's to much
    Holding the 44 million in RAM would be about 6GB you said originally.
    How would holding 190,000 itemes in RAM also be about 6GB?

    Since reading the disk can be a thousand times slower than reading from memory, even if mathematically you have to do the same number of comparisons, reading the file once (one item at a time) and searching memory for a match, will be much faster than reading the file 190,000 times looking for a match.

  10. #10

    Thread Starter
    Fanatic Member
    Join Date
    Oct 2011
    Location
    Sydney, Australia
    Posts
    756

    Re: fast way of checking if line of file is in list

    Sorry I'm not being clear. Holding the 44 million lines is about 6 GB of memory. Holding the 190,000 is negligible, less than 500MB

    I would love to search memory but I can't have all 44 million lines in memory. Not at once. Would maybe holding a GB at a time in memory be worth it? Read say x lines into array then remove matches, dispose of array then read another x lines?
    My CodeBank Submissions
    • Listbox with transparency and picture support - Click Here
    • Check for a true internet connection - Click Here
    • Open Cash drawer connected to receipt printer - Click Here
    • Custom color and size border around form - Click Here
    • Upload file to website without user logins, includes PHP - Click Here
    • List All Removable USB Storage Devices - Click Here
    • Custom On/Off Slide Control - Click Here
    • Insert multiple rows of data into one database table using parameters - Click Here
    • Trigger USB/Serial Cash Drawer - Click Here

  11. #11
    Super Moderator FunkyDexter's Avatar
    Join Date
    Apr 2005
    Location
    An obscure body in the SK system. The inhabitants call it Earth
    Posts
    7,531

    Re: fast way of checking if line of file is in list

    To be honest, if you're dealing with this volume of data then processing it on a client machine in a loop is never going to work very well. My suggestion would be to move this task into a Database, probably on a meaty server (although even a local one would help).

    I'm assuming you exclusion list remains reasonably consistent so you could just maintain that as a table instead of as a file (as you currently have).
    Give it a clustered index on the word column
    Take the source of fileha and import it as a second table.
    Give that a clustered index on the word column

    You're list excluding the exclusions can then be queried something like this (this is TSQL but there will be similar syntax for any DB)
    Code:
    Select word
    from fileha
    except
    Select word
    from exclusions
    Without physically testing it against your data I can't be sure but I'm willing to bet any decent DB will turn this around orders of magnitude quicker than anything you're likely to be able to write yourself.
    You can depend upon the Americans to do the right thing. But only after they have exhausted every other possibility - Winston Churchill

    Hadoop actually sounds more like the way they greet each other in Yorkshire - Inferrd

  12. #12

    Thread Starter
    Fanatic Member
    Join Date
    Oct 2011
    Location
    Sydney, Australia
    Posts
    756

    Re: fast way of checking if line of file is in list

    So even though a database will add overhead in terms of psychical size it would perform the finding quicker than a low level read? Yes SQL would make it a hell of a lot easier. Would a database even hold 44 million records ?
    My CodeBank Submissions
    • Listbox with transparency and picture support - Click Here
    • Check for a true internet connection - Click Here
    • Open Cash drawer connected to receipt printer - Click Here
    • Custom color and size border around form - Click Here
    • Upload file to website without user logins, includes PHP - Click Here
    • List All Removable USB Storage Devices - Click Here
    • Custom On/Off Slide Control - Click Here
    • Insert multiple rows of data into one database table using parameters - Click Here
    • Trigger USB/Serial Cash Drawer - Click Here

  13. #13
    Frenzied Member
    Join Date
    Jul 2011
    Location
    UK
    Posts
    1,282

    Re: fast way of checking if line of file is in list

    FD's reply looks really interesting. Never worked with a database, so I'd love to know just how quick that approach would be.


    Quote Originally Posted by bensonsearch View Post
    I would love to search memory but I can't have all 44 million lines in memory. Not at once. Would maybe holding a GB at a time in memory be worth it? Read say x lines into array then remove matches, dispose of array then read another x lines?
    Yes, but only if the 44 million lines stored in the file are already sorted. Sounds like you have control over the exclusions file, so could do that.

    I like questions like this, so I ran a few quick benchmarks (I'm weird like that). In my estimates, your code from the first post would take roughly 30 hours. You could get that down to about 23 hours if you pulled the values from the dictionary into their own string array and worked against that (which I found quite interesting ). Of course, the actual results would depend on the nature of the data you are working with.

    However, if you run a binary search against the data in the dictionary (line by line from the exclusions file), then that time looks like it will drop to around 2 minutes.

    A rough idea of the modifications needed when still reading from the file line by line:
    Code:
    Using br As New BinaryReader(File.Open("C:\data\exclusions.txt", FileMode.Open))
        Try
            Dim theValues() As String = fileha.Values.ToArray
            Array.Sort(theValues)
            Do
                Dim line As String
                line = br.ReadString
    
                'works great without the below If statement
                If Array.BinarySearch(theValues, line) >= 0 Then
                    'found a match
    
                    '   For Each i In fileha.Where(Function(kvp) kvp.Value = line).ToList()
                    '       fileha.Remove(i.Key)
                    '   Next
                End If
            Loop
        Catch ex As EndOfStreamException
            'reached end of file, this is fine
        End Try
    End Using
    Pulling the sorted data from the exclusions file into an array 4,000,000 million lines at a time, and running a binary search against that (value by value from the dictionary) would drop the time to about 30 seconds. But it has to be already sorted. Sorting it in code is too expensive (about 4 minutes).

  14. #14

    Thread Starter
    Fanatic Member
    Join Date
    Oct 2011
    Location
    Sydney, Australia
    Posts
    756

    Re: fast way of checking if line of file is in list

    Wow 2 minutes would be heaps faster than the current test code I have. 30 seconds is good. I do have control over the file yes. I have no idea what a binary search is? Im assuming i would sort the exclusions and that puts it in like alpha order? Each line of data is exactly the same length if that helps.

    I wonder why a .contains will take 30 hours and a binary search would take 2 minutes. That seems crazy.(like good but such a difference)
    My CodeBank Submissions
    • Listbox with transparency and picture support - Click Here
    • Check for a true internet connection - Click Here
    • Open Cash drawer connected to receipt printer - Click Here
    • Custom color and size border around form - Click Here
    • Upload file to website without user logins, includes PHP - Click Here
    • List All Removable USB Storage Devices - Click Here
    • Custom On/Off Slide Control - Click Here
    • Insert multiple rows of data into one database table using parameters - Click Here
    • Trigger USB/Serial Cash Drawer - Click Here

  15. #15
    Frenzied Member
    Join Date
    Jul 2011
    Location
    UK
    Posts
    1,282

    Re: fast way of checking if line of file is in list

    Yep, a binary search only works on sorted data. If you look at the code I posted, you will see I pull the values from the dictionary into their own array and then sort that array so it can be used in the binary search
    Code:
    Dim theValues() As String = fileha.Values.ToArray
    Array.Sort(theValues)
    Try it as above, without loading the exclusions into memory, and see if you get an acceptable time improvement when working with your actual data.

  16. #16
    Super Moderator FunkyDexter's Avatar
    Join Date
    Apr 2005
    Location
    An obscure body in the SK system. The inhabitants call it Earth
    Posts
    7,531

    Re: fast way of checking if line of file is in list

    A binary search is where you repeatedly check the middle of a dataset for the desired value and throw away half depending on whether the found value is higher or lower than you want

    E.g. I have the following:-
    Aaron
    Becky
    Benson
    Dexter
    Robert
    Samuel
    Xander

    If I'm searching for Benson I first go to Dexter because that's in the middle. Dexter > Benson so I discard Dexter and everything that comes after. I go to the new middle which is now Becky. Becky < Benson so I throw away Becky and everything that goes before it. The new middle is now Benson (its the only record left). Benson = Benson so I'm done.

    It a pretty fast search but it requires the records to be ordered. .contains can't use it because it has no concept of the ordering of the records. It's probably using a plain old sequential loop under the lid.

    BTW a database is still likely to be quicker because it maintains indexes, meaning it doesn't even need to do a binary search to find the record - it will maintain statistics that will tell it exactly where in the table a record is going to be. If you're comparing to indexed fields as I suggested it'll probably use a hash match which will out perform just about any other algorithm you can come up with. Of course, it does mean introducing a whole new technology to your program so the binary search will be a much simpler solution if it's fast enough for you.


    edit> Benson probably wasn't the best example to use there because both a sequential and a binary search would each have required 3 passes. Try the same exercise search for Samuel and you'll start to see why a binary search can make such a big difference.
    Last edited by FunkyDexter; May 9th, 2016 at 05:50 AM.
    You can depend upon the Americans to do the right thing. But only after they have exhausted every other possibility - Winston Churchill

    Hadoop actually sounds more like the way they greet each other in Yorkshire - Inferrd

  17. #17

    Thread Starter
    Fanatic Member
    Join Date
    Oct 2011
    Location
    Sydney, Australia
    Posts
    756

    Re: fast way of checking if line of file is in list

    A hash match? This is all so interesting ESP how the binary search works. Thanks for everyone's help. I'll try the binary search and see if it helps get me closer to what I'm looking for
    My CodeBank Submissions
    • Listbox with transparency and picture support - Click Here
    • Check for a true internet connection - Click Here
    • Open Cash drawer connected to receipt printer - Click Here
    • Custom color and size border around form - Click Here
    • Upload file to website without user logins, includes PHP - Click Here
    • List All Removable USB Storage Devices - Click Here
    • Custom On/Off Slide Control - Click Here
    • Insert multiple rows of data into one database table using parameters - Click Here
    • Trigger USB/Serial Cash Drawer - Click Here

  18. #18
    Super Moderator FunkyDexter's Avatar
    Join Date
    Apr 2005
    Location
    An obscure body in the SK system. The inhabitants call it Earth
    Posts
    7,531

    Re: fast way of checking if line of file is in list

    Strictly speaking I should have said Hash Join rather than Hash Match. The point is that it can prepare both dataset in advance and effectively "shuffle" them into each other at the right position rather than looping across the entirety of one record set. The binary search removes the loop from one dataset but you're still looping the other.

    Think of it as having two halves of a deck of cards that you want to shuffle together. A sequential algorithm is like repeatedly taking the top card from one half then checking each card in the other half until you find the place you want. A binary search algorithm is like repeatedly taking the top card from one deck and putting it straight into the correct place in the other half which will be exponentially quicker. A Hash Join is a riffle shuffle and will be exponentially quicker again.
    You can depend upon the Americans to do the right thing. But only after they have exhausted every other possibility - Winston Churchill

    Hadoop actually sounds more like the way they greet each other in Yorkshire - Inferrd

  19. #19
    Fanatic Member namrekka's Avatar
    Join Date
    Feb 2005
    Location
    Netherlands
    Posts
    639

    Re: fast way of checking if line of file is in list

    I agree with FD. The only thing is it must be sorted. So if the data will change a lot the sorting issue can cause the troubles. In case the system is busy with sorting, you can't retrieve data. Sorting 44 million lines consumes time and memory.

    It depends on what kind of application you are busy with. Is it a tool (just for you) or a commercial application. I would go for a database, or better a system with hash tables.
    Have a look here for the differences:

    https://en.wikipedia.org/wiki/Binary_search_algorithm

  20. #20
    PowerPoster SJWhiteley's Avatar
    Join Date
    Feb 2009
    Location
    South of the Mason-Dixon Line
    Posts
    2,256

    Re: fast way of checking if line of file is in list

    Another option is to break apart the matching data if it's arranged in logical groups. For example addresses: there's no point searching for '12345 Somestreet' in every single address if the town is 'Sometown' - you only need to search a subset of addresses. This would mean the data doesn't need to be sorted, but logically arranged or grouped (using indexed database columns, for example).
    "Ok, my response to that is pending a Google search" - Bucky Katt.
    "There are two types of people in the world: Those who can extrapolate from incomplete data sets." - Unk.
    "Before you can 'think outside the box' you need to understand where the box is."

  21. #21
    Frenzied Member
    Join Date
    Jul 2011
    Location
    UK
    Posts
    1,282

    Re: fast way of checking if line of file is in list

    I would love to try FD's method, but my knowledge of database programming is virtually non existent.

    However, all his talk of hashes made me wonder what would happen if you add the values from the dictionary to a HashSet(Of String).

    So I tried it, and used the HashSet's Contains method against each line from the file (without loading the file into memory).

    The binary search method was taking about 90 seconds. The HashSet.Contains method took about 9 seconds, which I think is impressive given that it takes just under 6 seconds just to read the 44 million lines from the file.

    I don't normally work with hash sets either, but I believe all the values in a HashSet have to be unique, which may complicate things. Shouldn't be a major problem though....

  22. #22

    Thread Starter
    Fanatic Member
    Join Date
    Oct 2011
    Location
    Sydney, Australia
    Posts
    756

    Re: fast way of checking if line of file is in list

    I did try that but had little success. Can't remember why (tried a lot). I think the issue is after the hashset is full of matches then how do we go about removing them from the original dictionary. And again fast.
    My CodeBank Submissions
    • Listbox with transparency and picture support - Click Here
    • Check for a true internet connection - Click Here
    • Open Cash drawer connected to receipt printer - Click Here
    • Custom color and size border around form - Click Here
    • Upload file to website without user logins, includes PHP - Click Here
    • List All Removable USB Storage Devices - Click Here
    • Custom On/Off Slide Control - Click Here
    • Insert multiple rows of data into one database table using parameters - Click Here
    • Trigger USB/Serial Cash Drawer - Click Here

  23. #23
    Frenzied Member
    Join Date
    Jul 2011
    Location
    UK
    Posts
    1,282

    Re: fast way of checking if line of file is in list

    How many matches are you likely to get?

  24. #24

    Thread Starter
    Fanatic Member
    Join Date
    Oct 2011
    Location
    Sydney, Australia
    Posts
    756

    Re: fast way of checking if line of file is in list

    That's impossible to say but easy several thousand. Really it should be about half of the 190,000 list. Ideally
    My CodeBank Submissions
    • Listbox with transparency and picture support - Click Here
    • Check for a true internet connection - Click Here
    • Open Cash drawer connected to receipt printer - Click Here
    • Custom color and size border around form - Click Here
    • Upload file to website without user logins, includes PHP - Click Here
    • List All Removable USB Storage Devices - Click Here
    • Custom On/Off Slide Control - Click Here
    • Insert multiple rows of data into one database table using parameters - Click Here
    • Trigger USB/Serial Cash Drawer - Click Here

  25. #25
    Frenzied Member
    Join Date
    Jul 2011
    Location
    UK
    Posts
    1,282

    Re: fast way of checking if line of file is in list

    Half

    I was working with 10000 matches, and got that down to 0.01 seconds to remove from the dictionary by again using a HashSet.

    Interesting things, these HashSets

  26. #26
    Super Moderator FunkyDexter's Avatar
    Join Date
    Apr 2005
    Location
    An obscure body in the SK system. The inhabitants call it Earth
    Posts
    7,531

    Re: fast way of checking if line of file is in list

    The binary search method was taking about 90 seconds. The HashSet.Contains method took about 9 seconds, which I think is impressive given that it takes just under 6 seconds just to read the 44 million lines from the file.
    Sounds about right and I'm guessing that's using much the same HashMatch algorithm a DB would use. The main thing the database brings to the table is the indexes. They have statistics telling it exactly where to jump forward to for the next leaf instead of having to forward loop toward it.

    You could roll up a sort of pseudo HashMatch by doing a forward only binary search using two ordered sets. The algorithm would go something like this:-
    1. Take the first record from you input
    2. Do a binary search to get to the correct record in your exclusions
    3. Throw away everything before that record from the exclusions - If your input is ordered then by definition no subsequent input can appear before the current one in the exclusions.
    4. Loop, but you're now searching against a progressively smaller set of exclusions.

    This isn't going to take advantage of the Block Nested Loops that a Hash Join uses but you'd be replacing a few scans with a lot of binary reads. I doubt it would be anywhere near as quick as getting the DB to do the work for you but should be a little quicker than a standard binary search. If my maths is correct you'd be cutting the iterations in half.

    You could probably implement your own Block Nested Loop code if you were feeling really keen but this is going to start getting increasingly complex for less and less benefit (it'd be beyond my abilities for sure). At some point you ought to stop and ask yourself how much effort do I want to put in to get increasingly trivial performance payoffs.
    You can depend upon the Americans to do the right thing. But only after they have exhausted every other possibility - Winston Churchill

    Hadoop actually sounds more like the way they greet each other in Yorkshire - Inferrd

  27. #27
    Super Moderator FunkyDexter's Avatar
    Join Date
    Apr 2005
    Location
    An obscure body in the SK system. The inhabitants call it Earth
    Posts
    7,531

    Re: fast way of checking if line of file is in list

    I think the issue is after the hashset is full of matches then how do we go about removing them from the original dictionary.
    You don't. You're looking for an exclusion rather than a merge so you push data to a separate collection every time the .contains returns a negative. That collection then ends up holding the records that weren't in the exclusion set.
    You can depend upon the Americans to do the right thing. But only after they have exhausted every other possibility - Winston Churchill

    Hadoop actually sounds more like the way they greet each other in Yorkshire - Inferrd

  28. #28
    Frenzied Member
    Join Date
    Jul 2011
    Location
    UK
    Posts
    1,282

    Re: fast way of checking if line of file is in list

    Quote Originally Posted by FunkyDexter View Post
    You don't. You're looking for an exclusion rather than a merge so you push data to a separate collection every time the .contains returns a negative. That collection then ends up holding the records that weren't in the exclusion set.
    Why didn't I think of that. I was adding the matches to a collection then removing from the dictionary after they were all found. Getting too old to justify my salt

  29. #29
    Frenzied Member
    Join Date
    Jul 2011
    Location
    UK
    Posts
    1,282

    Re: fast way of checking if line of file is in list

    Quote Originally Posted by FunkyDexter View Post
    You could roll up a sort of pseudo HashMatch by doing a forward only binary search using two ordered sets. The algorithm would go something like this:-
    1. Take the first record from you input
    2. Do a binary search to get to the correct record in your exclusions
    3. Throw away everything before that record from the exclusions - If your input is ordered then by definition no subsequent input can appear before the current one in the exclusions.
    4. Loop, but you're now searching against a progressively smaller set of exclusions.

    This isn't going to take advantage of the Block Nested Loops that a Hash Join uses but you'd be replacing a few scans with a lot of binary reads. I doubt it would be anywhere near as quick as getting the DB to do the work for you but should be a little quicker than a standard binary search. If my maths is correct you'd be cutting the iterations in half.
    That sounds really interesting. I wouldn't mind having a play with that, and also try to learn a bit about database programming, to see if they are as fast as people are saying. But it'd end in tears. I'm supposed to be glossing all the woodwork in the house ...... supposed to be

  30. #30
    Super Moderator FunkyDexter's Avatar
    Join Date
    Apr 2005
    Location
    An obscure body in the SK system. The inhabitants call it Earth
    Posts
    7,531

    Re: fast way of checking if line of file is in list

    Getting too old to justify my salt
    Got to admit - I haven't had to implement a search in nigh on 2 decades. I think that's why I've enjoyed contributing to this thread, it's been a walk down memory lane.

    to see if they are as fast as people are saying
    Ultimately they don't do anything you couldn't do yourself but they save you having to. A lot of very bright people have done the work for you. And for anything set based they're fast, really fast.

    If you really want to get carried away, take a look at how quick a NoSQL database like Hadoop will turn something like this around. Be prepared to spend a month and a half learning how to write Map-Reduce jobs though.
    You can depend upon the Americans to do the right thing. But only after they have exhausted every other possibility - Winston Churchill

    Hadoop actually sounds more like the way they greet each other in Yorkshire - Inferrd

  31. #31
    Frenzied Member
    Join Date
    Jul 2011
    Location
    UK
    Posts
    1,282

    Re: fast way of checking if line of file is in list

    Quote Originally Posted by FunkyDexter View Post
    If you really want to get carried away, take a look at how quick a NoSQL database like Hadoop will turn something like this around. Be prepared to spend a month and a half learning how to write Map-Reduce jobs though.
    An excuse to delay the decorating for a month and a half ..... sounds tempting

    I was going to say that most of the talk about databases sounds like Greek to me, but Hadoop actually sounds more like the way they greet each other in Yorkshire.

  32. #32
    Super Moderator FunkyDexter's Avatar
    Join Date
    Apr 2005
    Location
    An obscure body in the SK system. The inhabitants call it Earth
    Posts
    7,531

    Re: fast way of checking if line of file is in list

    Hadoop actually sounds more like the way they greet each other in Yorkshire
    That, came within a hairs breadth of getting added to my sig.

    In fact, stuff it... it's getting added.
    You can depend upon the Americans to do the right thing. But only after they have exhausted every other possibility - Winston Churchill

    Hadoop actually sounds more like the way they greet each other in Yorkshire - Inferrd

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Featured


Click Here to Expand Forum to Full Width