
Thread: [RESOLVED] Sorting Large Datasets

  1. #1

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    [RESOLVED] Sorting Large Datasets

    Hello all,

    I'm finding sorting files and data in VB.NET a bit confusing.
    I was wondering if I could explain what I need to do, in the hope that someone could point me to a tutorial that fits my needs.

    I have multiple flat-file datasets. Each record is 400 bytes with a fixed field layout. I need to merge the multiple files and sort them on 12 different elements scattered within the 400 bytes.
    I then need to process those records and either enhance them with appended data or drop them altogether, depending on the validity of the data they contain. I will then need to sort and process them a total of 6 more times, by differing data elements, until I finally have a single-file finished product.

    The individual files range in size from 5MB to 1.5GB.

    They have anywhere from 40K records all the way up to 600 million records.
    I do have a way of cutting down on the actual amount of data I will process. I have coded a process that reads in as many files as the user wants and produces two files:
    a small "mini record" file, with just a numeric sequence number and the 12 data fields I need to sort, and one very large file that has the full 400 bytes of data with the matching sequence number in the first 18 positions.

    Would anyone know of a way to perform a sort on multiple fields as seen in the example below?

    First initial/Last name/House Number/Street Name/Zip Code ("/" Denotes Within)

    So the sort order would be:

    Zip Code
    Then Street name within Zip Code
    Then House number within Street Name within zip code
    Then Last name within House number within Street Name within zip code
    Then First initial within Last name within House number within Street Name within zip code
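    In LINQ terms, I suppose what I'm after is something like this sketch (the collection and property names are just made up for illustration):
    vb.net Code:
        'Hypothetical: "people" is some list of parsed records with these properties.
        Dim sorted = people.
                OrderBy(Function(p) p.ZipCode).
                ThenBy(Function(p) p.StreetName).
                ThenBy(Function(p) p.HouseNumber).
                ThenBy(Function(p) p.LastName).
                ThenBy(Function(p) p.FirstInitial).
                ToList()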

    I have googled my eyes red and come across multiple options, but they are either very confusing or all single-field sorts.

    Thank you in advance!

    -NJ
    Lo And Behold
    This is my first Microsoft App Submission
    Your opinion of it would be appreciated

  2. #2
    Fanatic Member kpmc's Avatar
    Join Date
    Sep 2017
    Posts
    1,012

    Re: Sorting Large Datasets

    I would be lying if I said I understood your objective. I would start by at least creating and filling datatables into a data.dataset. What type of flat files are you working with? DBF? CSV? TXT? So far I seem to have more questions than answers.

  3. #3
    You don't want to know.
    Join Date
    Aug 2010
    Posts
    4,578

    Re: Sorting Large Datasets

    Yes, if you want any hope of being able to filter and sort this data quickly, you'll want to involve a database.

    Flat files are not fast. There's nothing you can do about that. Before you can even start sorting and filtering, you have to load the entire file into memory. Having a fixed field layout lets you play a bit loose with that, but if the data is not already sorted in some meaningful way, you have to pass over everything in order to perform any operation.

    We could talk about restructuring the files so that they would be easier to navigate, but that conversation would amount to "write your own database management system". It's worth going ahead and importing this data into a real one.
    This answer is wrong. You should be using TableAdapter and Dictionaries instead.

  4. #4
    Super Moderator jmcilhinney's Avatar
    Join Date
    May 2005
    Location
    Sydney, Australia
    Posts
    110,299

    Re: Sorting Large Datasets

    I would suggest that you use a TextFieldParser to read each file and then load each record into a DataTable. Once you've read all the files into the same DataTable, you can then filter and sort via the DefaultView, which is a DataView. Records can be added, edited and removed, and you can finally loop through the DefaultView and write the contents out to a new file. You would not create any intermediate files; all intermediate data would be stored in memory. E.g.
    vb.net Code:
    Using ofd As New OpenFileDialog
        ofd.Multiselect = True

        If ofd.ShowDialog() = DialogResult.OK Then
            Dim table As New DataTable

            With table.Columns
                .Add("Column1", GetType(String))
                .Add("Column2", GetType(Integer))
            End With

            For Each filePath In ofd.FileNames
                Using parser As New TextFieldParser(filePath)
                    'Fixed-width fields.
                    parser.TextFieldType = FieldType.FixedWidth
                    parser.SetFieldWidths(10, 15, 20)

                    Do Until parser.EndOfData
                        Dim fields = parser.ReadFields()
                        Dim row = table.NewRow()

                        row("Column1") = fields(0)
                        row("Column2") = CInt(fields(1))

                        table.Rows.Add(row)
                    Loop
                End Using
            Next

            Dim view = table.DefaultView

            view.Sort = "Column1, Column2 DESC"

            'Edit rows as required.
            For Each row As DataRow In table.Rows
                Dim number = CInt(row("Column2"))

                If number > 100 Then
                    row("Column1") = number * 2
                End If
            Next

            'Remove rows as required.
            For i = view.Count - 1 To 0 Step -1
                Dim rowView = view(i)
                Dim text = CStr(rowView("Column1"))

                If text.Length > 20 Then
                    'Row is removed from DataTable, which will be reflected in DataView.
                    table.Rows.Remove(rowView.Row)
                End If
            Next

            Using sfd As New SaveFileDialog
                If sfd.ShowDialog() = DialogResult.OK Then
                    Using writer As New StreamWriter(sfd.FileName)
                        For Each rowView As DataRowView In view
                            writer.WriteLine(String.Join(",", rowView.Row.ItemArray))
                        Next
                    End Using
                End If
            End Using
        End If
    End Using

  5. #5
    Super Moderator jmcilhinney's Avatar
    Join Date
    May 2005
    Location
    Sydney, Australia
    Posts
    110,299

    Re: Sorting Large Datasets

    Only after submitting that last post did I specifically consider the volume of data you mentioned. If memory usage is an issue when doing as I suggested, then you would certainly want to load the data into a database. Processing can be a bit more cumbersome for some tasks, but things like sorting and filtering will certainly be efficient.

  6. #6

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: Sorting Large Datasets

    Quote Originally Posted by kpmc View Post
    I would be lying if I said I understand your objective. I would start by at least creating and filling datatables into a data.dataset. What type of flat files are you working with? DBF?CSV?TXT? So far I seem to have more questions than answers.
    Hi,

    Initially I would be removing all the duplicate records, or coming as close as I can by bringing together all known possible likenesses, such as name and address matches or near-matches. Joe Public at 123 Main Street would be a match for J Public at 123 Main Street APT A.
    So instead of having two records there would be only one. Save a stamp, so to speak.
    But after the initial match, I would have 6 more sorts on differing fields in order to capture even more possible matches.

    I work on a mainframe with flat files and I can pass billions of records this way in minutes. I'm trying to duplicate the process somehow on a PC. I do not see an easy way of doing this.

    -NJ

  7. #7

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: Sorting Large Datasets

    Quote Originally Posted by jmcilhinney View Post
    I would suggest that you use a TextFieldParser to read each file and then load each record into a DataTable. Once you've read all the files into the same DataTable, you can then filter and sort via the DefaultView, which is a DataView. Records can be added, edited and removed, and you can finally loop through the DefaultView and write the contents out to a new file. You would not create any intermediate files; all intermediate data would be stored in memory. E.g.
    [code snipped, see post #4]
    Hi JMC,

    I tried that before (my version anyway; yours is a bit more elegant) and the entire system eventually freezes as the test data grows.

    -NJ

  8. #8

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: Sorting Large Datasets

    Quote Originally Posted by jmcilhinney View Post
    Only after submitting that last post did I specifically consider the volume of data you mentioned. If memory usage is an issue doing as I suggested then you would certainly want to load the data into a database. Processing can be a bit more cumbersome for some tasks but certainly things like sorting and filtering will be efficient.
    Hi,

    I am looking into SQLite as the engine, and I hope it works, as I'm trying to achieve this without much overhead. SQLite is a file-based DB, so I'll see what the performance level is once I figure out how to store and sort the data.
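    A minimal sketch of that idea with the System.Data.SQLite provider; the table, column and file names here are assumptions, not the poster's actual layout. The point is that a composite index matching the sort order makes the ORDER BY far cheaper:
    vb.net Code:
        Using conn As New SQLiteConnection("Data Source=records.db")
            conn.Open()

            Using cmd = conn.CreateCommand()
                'Composite index in the same order as the sort keys.
                cmd.CommandText = "CREATE INDEX IF NOT EXISTS ix_addr ON records (zip, street, house, lastname, firstinitial)"
                cmd.ExecuteNonQuery()

                cmd.CommandText = "SELECT * FROM records ORDER BY zip, street, house, lastname, firstinitial"

                Using reader = cmd.ExecuteReader()
                    Do While reader.Read()
                        'Write each row out to the flat output file here.
                    Loop
                End Using
            End Using
        End Using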

    Thank you all for the time you spent with my post!

    -NJ

  9. #9
    Wall Poster TysonLPrice's Avatar
    Join Date
    Sep 2002
    Location
    Columbus, Ohio
    Posts
    3,834

    Re: Sorting Large Datasets

    I would start with a SQLBULKCOPY into a database. Then use stored procedures to do the heavy lifting. Assuming you have that option.
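    A minimal SqlBulkCopy sketch, assuming a DataTable already filled from the flat files; the connection string and destination table name are placeholders:
    vb.net Code:
        Using conn As New SqlConnection("Server=.;Database=Work;Integrated Security=True")
            conn.Open()

            Using bulk As New SqlBulkCopy(conn)
                bulk.DestinationTableName = "dbo.Records"
                bulk.BatchSize = 500000   'Flush to the server in batches.
                bulk.WriteToServer(table)
            End Using
        End Using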
    Please remember next time...elections matter!

  10. #10
    Addicted Member
    Join Date
    Jul 2017
    Location
    Exeter, UK
    Posts
    180

    Re: Sorting Large Datasets

    I think, regardless of how you go about processing this problem, you will never achieve the outcome you seek. The address data can be formatted in so many ways that it will be impossible to identify duplicates unless a post/zip code is present. For example, J Public at 123 Main Street could be Jack, John or Jessie, and they could all reside there; which one would you decide is legitimate to remove? And what if the address lines are not in the same position within the table columns, e.g. J Public, 123 Main Street, London, OR, J Public, , 123 Main Street, London, etc.? You are on a hiding to nothing.

  11. #11

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: Sorting Large Datasets

    Quote Originally Posted by TysonLPrice View Post
    I would start with a SQLBULKCOPY into a database. Then use stored procedures to do the heavy lifting. Assuming you have that option.
    Yah, I do a bulk insert every 500K records, then free up memory for the next insert. So far I can pass 3.2 million records in just under 10 minutes. Not very good when I ramp things up to the hundreds of millions, but I'm working on it.

    I'm looking into upgrading the hardware. More memory and an SSD should improve things.

    Thank you for the reply!

    -NJ

  12. #12
    Wall Poster TysonLPrice's Avatar
    Join Date
    Sep 2002
    Location
    Columbus, Ohio
    Posts
    3,834

    Re: Sorting Large Datasets

    Have you considered copying the text files straight to the database? You can with MS:

    https://solutioncenter.apexsql.com/s...rt-techniques/

  13. #13

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: Sorting Large Datasets

    Quote Originally Posted by bmwpete View Post
    I think, regardless of how you go about processing this problem, you will never achieve the outcome you seek. The address data can be formatted in so many ways that it will be impossible to identify duplicates unless a post/zip code is present. For example, J Public at 123 Main Street could be Jack, John or Jessie, and they could all reside there; which one would you decide is legitimate to remove? And what if the address lines are not in the same position within the table columns, e.g. J Public, 123 Main Street, London, OR, J Public, , 123 Main Street, London, etc.? You are on a hiding to nothing.
    We already have algorithms in place to achieve a very high success rate. Our match rate is consistently in the very high 90s, hence the multi-pass process. The issue we have is the cost of the mainframe lease. It's slowly killing us, and all the off-the-shelf replacement software is extremely expensive, with tremendous recurring costs, and still not nearly as efficient and accurate as our own.
    When we use the mainframe for the pre-processing and send the files to our server for final output formatting and reports, we save quite a bit of money by removing the drudgery from the mainframe.
    We would really like to offload all the processing from the mainframe if possible. We're a very small-footprint company. There are only 5 of us, and three are sales/management. We are all longtime mainframe coders, self-taught on the PC and .NET.
    As a side note, I'm not tasked with a complete re-write; my assignment is to determine feasibility.
    If I can get the sorts done and come close to the agility of the mainframe, the associated costs per run/job would plummet.

    I have a few more tests to perform, but I think I may have figured out something. Once I code and test it, I'll report back.

    Thanks,

    -NJ

  14. #14

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: Sorting Large Datasets

    Quote Originally Posted by TysonLPrice View Post
    Have you considered copying the text files straight to the database? You can with MS:

    https://solutioncenter.apexsql.com/s...rt-techniques/
    Yah, I do something like that now. I load a dataset with 500K records, start a SQL transaction, insert the 500K records in a For Each loop, then do a Transaction.Commit (if there are no errors), and the commit itself takes just a few seconds.
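    That batched-transaction pattern, sketched for SQLite; the table name, column names and the batch source are assumptions for illustration:
    vb.net Code:
        'Assumes an open SQLiteConnection "conn" and a batch of parsed records.
        Using tran = conn.BeginTransaction()
            Using cmd As New SQLiteCommand("INSERT INTO records (seq, data) VALUES (@seq, @data)", conn, tran)
                cmd.Parameters.Add("@seq", DbType.Int64)
                cmd.Parameters.Add("@data", DbType.String)

                For Each rec In batch
                    cmd.Parameters("@seq").Value = rec.Seq
                    cmd.Parameters("@data").Value = rec.Data
                    cmd.ExecuteNonQuery()
                Next
            End Using

            'One commit for the whole batch is far cheaper than per-row commits.
            tran.Commit()
        End Using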

  15. #15
    Wall Poster TysonLPrice's Avatar
    Join Date
    Sep 2002
    Location
    Columbus, Ohio
    Posts
    3,834

    Re: Sorting Large Datasets

    Quote Originally Posted by NJDevils28 View Post
    Yah, I do something like that now. I load a dataset with 500K records, start a SQL transaction, insert the 500K records in a For Each loop, then do a Transaction.Commit (if there are no errors), and the commit itself takes just a few seconds.
    Just to make sure we are on the same page I'm not talking via a user application...I'm talking straight to the database. Then process the data from an application.

    https://docs.microsoft.com/en-us/sql...t-transact-sql

  16. #16
    Addicted Member
    Join Date
    Jul 2017
    Location
    Exeter, UK
    Posts
    180

    Re: Sorting Large Datasets

    If you already have the main logic on a mainframe, have you considered migrating it onto something like the Micro Focus server product? This will run exactly the same code on a laptop. There are of course costs involved, but it may be worth further enquiry and would at least remove the ongoing MIP charges you are currently paying.

  17. #17

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: Sorting Large Datasets

    Quote Originally Posted by TysonLPrice View Post
    Just to make sure we are on the same page I'm not talking via a user application...I'm talking straight to the database. Then process the data from an application.

    https://docs.microsoft.com/en-us/sql...t-transact-sql
    Hi,

    Yes, we are on the same page.
    I'm using SQLite, not MS SQL Server. Tools such as the one described in that document do not exist (at least that I know of) for SQLite. From what I have gathered so far, any GUIs for SQLite are non-commercial, created and maintained by independent developers. We shy away from that unless we code it ourselves.

    In full disclosure, we normally do not use any databases at all, well, not for this purpose anyway. We use flat files only and send them through our process to a final product of, once again, a flat file. That file, with an accompanying layout, is then secure-FTP'ed to the client for their in-house use.

  18. #18

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: Sorting Large Datasets

    Quote Originally Posted by bmwpete View Post
    If you already have the main logic on a mainframe, have you considered migrating it onto something like the 'Microfocus' server product, this will run exactly the same code on a laptop. There are of course 'costs' involved, but may be worth further enquiry and at least, remove the on going MIP charges you are paying for currently.
    Yes, we have used Micro Focus COBOL and, while it is indeed a good fit for our needs, once again the costs associated with the product make it very prohibitive to use.
    We understand that moving from a mainframe to a server would reduce our turnaround time to some degree, so the cost savings must overcome taking on that new liability. Our profit margin is not as wide as we would like in these lean times.
    If the service to the client is reduced and we run the risk of losing that client, then the cost of the mainframe lease is justified. We lease our mainframe as part of a development pool and are charged by the actual time consumed by the mainframe. That includes the data upload, the file processing and the data download. Running all our jobs after midnight also saves us money.
    That being said, if the costs and associated fees of Micro Focus COBOL do not offer the significant savings we're looking for, then we might as well stay with the mainframe.
    If we can reproduce the performance of the mainframe (or come close) while saving money without risking our client base, then it's a win/win for all concerned.

  19. #19
    PowerPoster ChrisE's Avatar
    Join Date
    Jun 2017
    Location
    Frankfurt
    Posts
    3,042

    Re: Sorting Large Datasets

    Quote Originally Posted by NJDevils28 View Post
    Hi,

    Yes, we are on the same page.
    I'm using SQLite, not MS SQL Server. Tools such as the one described in that document do not exist (at least that I know of) for SQLite. From what I have gathered so far, any GUIs for SQLite are non-commercial, created and maintained by independent developers. We shy away from that unless we code it ourselves.

    In full disclosure, we normally do not use any databases at all, well not for this purpose anyway. We use flat files only and send them through our process to a final product of, once again a flat file. That file, with an accompanying layout is then secure FTP'ed to the client for their in-house use.
    Your brain must be a USB stick; I'm surprised it still works since the 90s

    regards
    Chris
    to hunt a species to extinction is not logical !
    since 2010 the number of Tigers are rising again in 2016 - 3900 were counted. with Baby Callas it's 3901, my wife and I had 2-3 months the privilege of raising a Baby Tiger.

  20. #20
    You don't want to know.
    Join Date
    Aug 2010
    Posts
    4,578

    Re: Sorting Large Datasets

    Quote Originally Posted by NJDevils28 View Post
    From what I have gathered so far, any GUIs for SQLite are non-commercial, created and maintained by independent developers. We shy away from that unless we code it ourselves.
    This is a very unhealthy attitude and you are flushing money down the toilet.

    You will also be sad to hear SQLite itself is not commercial and it is created/maintained by independent developers working for free.

  21. #21

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: Sorting Large Datasets

    Quote Originally Posted by Sitten Spynne View Post
    This is a very unhealthy attitude and you are flushing money down the toilet.

    You will also be sad to hear SQLite itself is not commercial and it is created/maintained by independent developers working for free.
    I understand your high defense of the independent developer. I am one myself, on the side of course. I hold independent developers in high regard, I truly do.

    As for SQLite being an independently developed and maintained product, I am fully aware of that, and I enjoy leveraging it for my own purposes. I have also used it as a teaching tool. Unfortunately, we really do not have a need for a DB in our current business model.

    I use it for myself in my programs, but for my day job, no. I used it here as a proof of concept that maybe a DB engine could solve our issues, and the answer so far seems to be no.

    But when it comes to my actual paycheck, I do what I'm told. Period. We only use off-the-shelf, industry-proven products. It comes down to security and nothing more. We use only industry-proven IDS, anti-malware, and secure transmissions and email.

    We deal in personally identifiable information, and in order for us to maintain a certain level of trust we have very specific rules to follow.

    My apologies for any insult I may have implied.

    This community has been very helpful and supportive, and I would not like to jeopardize my relationship with it.

    Sincerely,

    -NJ

  22. #22

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: Sorting Large Datasets

    Quote Originally Posted by ChrisE View Post
    Your brain must be a USB stick; I'm surprised it still works since the 90s

    regards
    Chris
    Let's put it this way: our code has been around since the early 80s and still works great. We do not compete with the big boys for other reasons, and none of those reasons is that their software is superior to ours.

  23. #23
    You don't want to know.
    Join Date
    Aug 2010
    Posts
    4,578

    Re: Sorting Large Datasets

    "Security and verification requirements" is a good reason to avoid integrating things into your build-chain. Make sure to mention that when you say you "can't use independent tools".

    I only say this because there's a strong tendency in the community to practice what's called "Not Implemented Here Syndrome" and reject any tool that wasn't hand-written on the basis "we can do it better because we can customize it". There are very useful things (like "GUI tools for SQLite") that took many people many years to develop on their own, translating into hundreds of thousands of dollars worth of labor. The .NET community in particular has a strain of the illness, "Not Implemented By Microsoft Syndrome". They'll wait decades for MS to release some version of a tool rather than use existing alternatives. This behavior in the .NET community has caused many talented people to abandon their .NET projects and join communities for other platforms like OS X.

    I get now from your explanation you're working in an industry that puts requirements on you that mean you can't be on the "bleeding edge" or even "anywhere near the edge". That's a serious constraint you can't really work around.

    I'm kind of curious why you think a database won't be a help with this task, but I'm assuming that answer's something along the lines of "I'm constantly receiving these files and have to process them within a time envelope, I don't have time for a database import". When I read your first post I thought it was "this is a static set of files and it's OK if initial import is a long process".

  24. #24

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: Sorting Large Datasets

    Quote Originally Posted by Sitten Spynne View Post
    "Security and verification requirements" is a good reason to avoid integrating things into your build-chain. Make sure to mention that when you say you "can't use independent tools".

    [snip]

    I'm kind of curious why you think a database won't be a help with this task, but I'm assuming that answer's something along the lines of "I'm constantly receiving these files and have to process them within a time envelope, I don't have time for a database import". When I read your first post I thought it was "this is a static set of files and it's OK if initial import is a long process".
    Thank you for the kindness of your understanding.
    You are very correct in we are always a few versions behind anything even close to bleeding edge.

    And yes, you are very observant to conclude that the turnover from receipt of data to final product is very time restrictive. We get new Data daily and process over-night.
    I could describe the minutia of why the window is so small, but that would divert us from the original topic.

    I have been playing with a few sort algorithms out there, such as Quicksort, and even the built-in OS command-line sort, and I may have found my solution. I know Quicksort can be slow in its worst case, but it's also the easiest to understand.

    My idea so far...
    I will use the mini-record approach to apply a sequence number to all the records. Then I will create 7 mini files with just the sequence number and a sort key, in the order I want for each of the individual (7 total) sorts. I will then perform my first sort and use those results to select the records that will be the input to the next sort, and so on, until I have completed all 7 sorts. Then I will compare my output and time to completion against our current production process.

    I'm sure I'll have to tweak things, but if I get the sorts right then the data should match exactly.
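    One pass of that plan could be sketched like this. It's in-memory, so it's only viable while a mini file fits in RAM, and the file names and key offsets are made up for illustration:
    vb.net Code:
        'Sort one mini file by a key cut from fixed positions, then keep only
        'the sequence numbers (first 18 characters) to drive the next pass.
        'Hypothetical offsets: positions 19-23 = zip, 24-43 = street name.
        Dim ordered = File.ReadLines("pass1_mini.txt").
                OrderBy(Function(line) line.Substring(18, 5)).
                ThenBy(Function(line) line.Substring(23, 20)).
                Select(Function(line) line.Substring(0, 18))

        File.WriteAllLines("pass1_sequence.txt", ordered)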

    I will update the forum as I test.

    Thank you for your time,

    -NJ

  25. #25
    PowerPoster ChrisE's Avatar
    Join Date
    Jun 2017
    Location
    Frankfurt
    Posts
    3,042

    Re: Sorting Large Datasets

    Quote Originally Posted by NJDevils28 View Post
    Let's put it this way: our code has been around since the early 80s and still works great. We do not compete with the big boys for other reasons, and none of those reasons is that their software is superior to ours.
    I didn't mean that disrespectfully. Could you supply some sample data from your flat files? Way back in VB6 I wrote a class
    for working with flat files; you could treat them like recordsets. I only started with .NET around Easter this year,
    so the other guys here can give you better advice with .NET than I can.

    regards
    Chris

  26. #26

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: Sorting Large Datasets

    Hello All,

    Just to update everyone, I was unable to successfully sort the minimum amount of data records needed to offload some work from the mainframe to the server.
    We need the ability to pass 600 million records at least twice in order to remove duplicates. The initial 600 million reduces to a manageable 147 million after the first two passes, and after further refining, the end result is (usually) 120 million.
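
    Once the records are sorted on the dedup key, removing duplicates is a single linear pass that compares each record's key with the previous one. A sketch in Python (the helper name, record shapes, and keys below are all invented for illustration):

```python
def dedupe_sorted(records, key):
    # Single pass over records already sorted by `key`:
    # keep only the first record of each run of equal keys.
    out, prev = [], object()  # sentinel that compares unequal to any key
    for rec in records:
        k = key(rec)
        if k != prev:
            out.append(rec)
            prev = k
    return out

# Hypothetical mini records: (sort key, sequence number)
rows = [("A", 1), ("A", 2), ("B", 3), ("C", 4), ("C", 5)]
unique = dedupe_sorted(rows, key=lambda r: r[0])
# unique -> [("A", 1), ("B", 3), ("C", 4)]
```

    The pass streams, so it works on files far larger than RAM as long as the input arrives in sorted order.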

    We upgraded the server from 64GB to 128GB of RAM, and the 147 million records were able to run without a significant reduction in server performance.

    Thank you all for your advice with this issue!

    Regards,

    -NJ

  27. #27
    You don't want to know.
    Join Date
    Aug 2010
    Posts
    4,578

    Re: [RESOLVED] Sorting Large Datasets

    I still don't understand why a database "can't work". Millions of records is a small load, and odds are whatever you were doing could've been optimized. But you didn't/can't show your work, so no one could help you with that.
    This answer is wrong. You should be using TableAdapter and Dictionaries instead.

  28. #28

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: [RESOLVED] Sorting Large Datasets

    Quote Originally Posted by Sitten Spynne View Post
    I still don't understand why a database "can't work". Millions of records is a small load, and odds are whatever you were doing could've been optimized. But you didn't/can't show your work, so no one could help you with that.
    Hi Sitten,

    I hope you're doing well!

    We tried loading the 600 million mini records (160 bytes each) into a SQLite database. It took around 1TB of storage space and a few hours to load the data. We were able to perform some of the sorts, though not all, but the time it took to load the data (cutting the records from 400-800 bytes each down to 160 bytes), sort it, and then extract it for just a single client test took the better part of an entire day.
    Right now we can process a few hundred clients on the Mainframe in the same amount of time with approximately the same amount of data per client.

    We're hoping that the issue was the multi-step procedure of using the DB.

    Just a note: we do not have any commercially licensed DB engines, as we have never had a need for one. It could be the SQLite software that is causing the performance hit; we do not know.

  29. #29
    Fanatic Member kpmc's Avatar
    Join Date
    Sep 2017
    Posts
    1,012

    Re: [RESOLVED] Sorting Large Datasets

    We tried loading the 600 million mini records (160 bytes each record) into a SQLite data base
    Have you considered an XML solution?

  30. #30
    You don't want to know.
    Join Date
    Aug 2010
    Posts
    4,578

    Re: [RESOLVED] Sorting Large Datasets

    SQLite in particular is bad at bulk inserts, and I wouldn't be surprised if it wasn't designed to handle that much data all at once. Something like MySQL or Postgres is free and probably better at handling that.

    Basically: you have a complex problem and complex answers require more knowledge of the data involved. I can't really make a proof-of-concept for 600m records, but thought maybe someone else had some experience with bulk inserts like that. You really need someone that has a more-than-superficial knowledge of the DB.

    For an example of what I mean:

    The most obvious way to insert a lot of data into SQLite is a bulk insert. Using imaginary pseudocode it'd look like:
    Code:
    For Each item In recordSource
        _sqlite.Execute(INSERT INTO table ....)
    Next
    This is particularly slow for bulk insertions, because SQLite opens a separate transaction for each INSERT query. You can address this several ways. One way is to adjust the DB settings directly via PRAGMA commands. If you're ambitious, you can recompile SQLite so the option you want is the default. The alternative is to use a multi-row INSERT syntax that lets you provide more than one row of data per INSERT query. However, that's STILL limited to a maximum number of rows per INSERT (500 by default, via the compound SELECT limit).
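
    The transaction point is easy to demonstrate with Python's stdlib `sqlite3` module (the forum's code is VB.NET; the table and column names here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mini (seq INTEGER, sort_key TEXT)")

rows = ((i, f"key{i:09d}") for i in range(100_000))

# Without an explicit transaction, each INSERT gets its own implicit
# transaction and its own journal sync; wrapping the whole batch in one
# transaction removes that per-row overhead.
with conn:  # commits on success, rolls back on error
    conn.executemany("INSERT INTO mini VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM mini").fetchone()[0]
# count -> 100000
```

    `executemany` accepts a generator, so the rows can stream from a file without ever materializing in memory.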

    That's the kind of information you need, that's why there are people whose job is "DB Admin". Random tutorials don't really cut it when you're trying a really big task. That said, there aren't really "SQLite DB Admins" because it's just not intended for this kind of heavy-duty use. That's why it's SQLite.

    I'm not even 100% sure you can make a database work, but it does feel like you'll be duplicating a lot of effort other tools have been created to help with.

  31. #31
    Frenzied Member
    Join Date
    Dec 2014
    Location
    VB6 dinosaur land
    Posts
    1,191

    Re: [RESOLVED] Sorting Large Datasets

    The trick I've used for speeding up SQLite inserts is wrapping them in a transaction, but I have no idea how well it would work for your scenario. You'd have to experiment with how many inserts you could do in one transaction based on RAM, etc. I was doing commits every 5,000 rows, which sped things up quite a bit. My initial testing was on an SSD. When I tried my program on a traditional platter drive, it was taking 7-8 seconds to load only 750 records, and I had no idea why! Once I learned how SQLite handles inserts and used this trick, the platter drive became quicker than the SSD had been without it.
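
    The commit-every-N-rows pattern looks roughly like this in Python's stdlib `sqlite3` (a sketch only; the helper name, table, and batch size are illustrative, not from the thread):

```python
import sqlite3
from itertools import islice

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mini (k TEXT)")

def insert_in_batches(conn, rows, batch_size=5000):
    # Commit every batch_size rows: no single transaction grows
    # unbounded, yet we avoid one implicit commit per row.
    it, total = iter(rows), 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        with conn:  # one transaction (and one commit) per batch
            conn.executemany("INSERT INTO mini (k) VALUES (?)",
                             ((r,) for r in batch))
        total += len(batch)
    return total

inserted = insert_in_batches(conn, (f"row{i}" for i in range(12_345)))
# inserted -> 12345
```

    Tuning `batch_size` trades commit overhead against journal size and RAM, which matches the experimentation topshot describes.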

  32. #32

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: [RESOLVED] Sorting Large Datasets

    Quote Originally Posted by topshot View Post
    The trick I've used for speeding SQLite inserts is wrapping the inserts in a transaction, but I have no idea on how well it would work for your scenario. You'd have to experiment with how many inserts you could do on one transaction based on RAM, etc. I was doing commits every 5000. That will speed it up quite a bit. My initial testing has been on SSD. When I tried my program on a traditional platter drive, it was taking 7-8 seconds to load only 750 records and I had no idea why! Once I learned how SQLite handled Inserts and used this trick, it became quicker than using the SSD.
    Hi Topshot.

    We did just that.
    We wrapped 500K records in a transaction and then executed the commit. That was surprisingly quick. It was reading in the records, creating the mini records, and then inserting them into the database as a whole that consumed copious amounts of time. Each commit also took a lot of system resources.
    Plus, we could never figure out the proper SQL statement to perform the sort when selecting the records.

    Believe it or not I'm still plugging away at it.



    I forgot to add this. After loading the 600 million records, I executed this SQL statement

    SELECT DISTINCT * FROM table ORDER BY field1,field2,field3,field4 DESC,field5;

    Those aren't the real field names, just the gist of what I executed. It has been running for four hours now.
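
    One SQL detail worth flagging in that statement: ASC/DESC binds only to the column it immediately follows, so as written only field4 sorts descending while the other four ascend. A small Python `sqlite3` demo (table and columns invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1), (2, 2)])

# DESC applies only to the column it follows: a ascends, b descends
rows = conn.execute("SELECT a, b FROM t ORDER BY a, b DESC").fetchall()
# rows -> [(1, 2), (1, 1), (2, 2), (2, 1)]
```

    If every field should descend, each one needs its own DESC in the ORDER BY list.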
    Last edited by NJDevils28; Sep 26th, 2017 at 03:07 PM.

  33. #33
    Frenzied Member
    Join Date
    Dec 2014
    Location
    VB6 dinosaur land
    Posts
    1,191

    Re: [RESOLVED] Sorting Large Datasets

    Well, that is lots of sorting for 600M and Distinct may not necessarily be the most efficient execution plan either. Then again, it may be the best you can do. Hard to say without lots of testing.

  34. #34

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: [RESOLVED] Sorting Large Datasets

    Quote Originally Posted by topshot View Post
    Well, that is lots of sorting for 600M and Distinct may not necessarily be the most efficient execution plan either. Then again, it may be the best you can do. Hard to say without lots of testing.
    It's still running. It's been running since 2 PM yesterday, and yes, I was in error with the server I chose to use. It is running on an older machine with 8GB of RAM and a 5400RPM drive.
    I am about to kill the process and start over again on a production server. The new server has 128GB of RAM and a 2TB M.2 SSD available for test use. I can't kick it off until after 5:00 PM, unfortunately, but it will give me a more real-world test than the development server I used last night.

    Would you like me to post results or just let the thread die?

    -NJ

  35. #35
    PowerPoster ChrisE's Avatar
    Join Date
    Jun 2017
    Location
    Frankfurt
    Posts
    3,042

    Re: [RESOLVED] Sorting Large Datasets

    Hi,

    here are a few thoughts on how to import the data, first into a DataGrid

    a sample with CSV or text; it uses the Jet provider

    Code:
    ''Import from  Textfile into a new  Table
         sSQL = "Select * Into MyImport From [data1.txt] IN 'c:\test' 'Text;'"
    
    ''Import from Textfile and add to existing Table 
          sSQL = "Insert Into MyImport Select * From [data1.txt] In 'C:\test' 'Text;'"
    
    then you would fire the Execute...
    Or, with CSV to a DataGrid:
    if there are headers in the CSV, you can use the Select...
    Code:
    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
            Dim _tb As New DataTable
            Dim SearchBox As String
            SearchBox = TextBox1.Text
            Me.DataGridView1.DataSource = _tb
            Dim SrcDir As String = "C:\"
            Dim sConn As String = String.Join(";", New String() { _
               "Provider=Microsoft.Jet.OLEDB.4.0", _
               "Data Source=" & SrcDir, _
               "Extended Properties=""Text; HDR=Yes; FMT=Delimited"""})
            Dim SQL As String = "Select * From [CSVTest.csv] Where AD_Name1 Like  '" & SearchBox & "%';"
    
            'in my CSV table there is a Header named AD_Name1 and I used a Filter
            'to select all name with the Letter 'B%'
            Using Cn As New OleDbConnection(sConn), ta As New OleDbDataAdapter(SQL, Cn)
    
                Cn.Open()
                ta.Fill(_tb)
            End Using
        End Sub


    regards
    Chris

  36. #36

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: [RESOLVED] Sorting Large Datasets

    Quote Originally Posted by ChrisE View Post
    here a few thought's how to Import Data first to Datagrid ...
    Hi Chris,

    I thought of doing something like that, but I would have to cut the data into manageable parts, as 600M records would eventually blow out memory and grind the system to a stall.

    As an update: once I moved everything back over to the Production/Test server with the SSDs, I was able to process all the records from beginning to end and produce a file matching the mainframe output in under four hours.
    Note that the four hours was not a continuous process. I did a lot of manual validity tests with sample data (while the next step ran) and manual kicking off of jobs.
    As for system resources during the run, they never peaked higher than 28% for my jobs (overall they flashed to 100% a few times). If I can process four or five jobs per server, I may have a proof of concept.

    This all went surprisingly well. I did not have much hope for a pleasant completion. I still do not have a final time for the process as I still have to work out the procedures and code them, but I have confidence now where I did not 24 hours ago.

    The hardware made all the difference. Copious amounts of RAM and a blazing fast SSD.

    Thanks all,

    -NJ

  37. #37
    PowerPoster ChrisE's Avatar
    Join Date
    Jun 2017
    Location
    Frankfurt
    Posts
    3,042

    Re: [RESOLVED] Sorting Large Datasets

    Hi,

    maybe sort your data in blocks and then combine them...

    Code:
     Private Function AddFile(ByVal SourceFile As String, ByVal DestinationFile As String) As Boolean
            Try
                Using SW As New IO.StreamWriter(DestinationFile, True, System.Text.Encoding.Default)
                    Using SR As New IO.StreamReader(SourceFile, System.Text.Encoding.Default)
                        SW.Write(SR.ReadToEnd)
                    End Using
                End Using
                Return True
            Catch ex As Exception
                Debug.Print(ex.Message)
                Return False
            End Try
        End Function
    
        Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
            Dim D_File As String = "c:\Text_Test\combined.txt"
    
            Debug.Print(AddFile("c:\Text_Test\1_to_49999.txt", D_File))
            Debug.Print(AddFile("c:\Text_Test\50000_to_100000.txt", D_File))
        End Sub
    both files would have to have the same schema

    a matter of trial and error ...
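
    One caveat on the append approach: straight concatenation only stays sorted if the chunk files cover disjoint key ranges (as the 1_to_49999 / 50000_to_100000 names suggest). When the runs overlap, the usual "sort in blocks, then combine" step is a k-way merge of the sorted runs. A Python sketch using the stdlib (file contents and names invented for the demo):

```python
import heapq
import os
import tempfile

# Write two pre-sorted runs to temp files (stand-ins for chunk files)
runs = [["apple\n", "melon\n"], ["banana\n", "zucchini\n"]]
paths = []
for run in runs:
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(run)
    paths.append(path)

# heapq.merge does a streaming k-way merge: only one line per file
# is buffered, so the runs can be far larger than RAM.
files = [open(p) for p in paths]
merged = list(heapq.merge(*files))
for f in files:
    f.close()
for p in paths:
    os.remove(p)
# merged -> ["apple\n", "banana\n", "melon\n", "zucchini\n"]
```

    In production you would write `merged` incrementally to the output file instead of collecting it in a list; this is the core of a classic external merge sort.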

    regards
    Chris

  38. #38

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: [RESOLVED] Sorting Large Datasets

    Quote Originally Posted by ChrisE View Post
    maby sort your Data in Blocks and then combine them ...
    Hi Chris,

    Thank you for the code!
    Normally I would shy away from using a DB, as it is not my forte. But since I was able to produce the desired results with off-the-shelf software, I find myself leaning towards that as a solution. I am now coding a WinForm to kick off as many jobs as we think the system can handle.
    I have contacted a SQLite support group, and I now have (I think) a plan to maximize the system resources in order to process four or more jobs concurrently, with the ability to kick off new jobs as previous ones complete. I will be coding this for quite some time, but once again, I have confidence now.

    Thank you to all the folks who recommended the path to using a DB Engine!

    -NJ

  39. #39
    Frenzied Member
    Join Date
    Dec 2014
    Location
    VB6 dinosaur land
    Posts
    1,191

    Re: [RESOLVED] Sorting Large Datasets

    Quote Originally Posted by NJDevils28 View Post
    The hardware made all the difference. Copious amounts of RAM and a blazing fast SSD.
    "Use a bigger hammer"

  40. #40

    Thread Starter
    Hyperactive Member
    Join Date
    May 2013
    Posts
    295

    Re: [RESOLVED] Sorting Large Datasets

    heh

