Thread: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

  1. #81
    PowerPoster wqweto's Avatar
    Join Date
    May 2011
    Location
    Sofia, Bulgaria
    Posts
    6,169

    Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

    Btw, you can unzip on a separate thread/process so that decompression and ingress happen in parallel. I've just cobbled a generic async approach to unzipping here: ZipAsync.zip

    In the test project you can see how it uses the built-in GetObject to retrieve an object reference from the remote ZipAsync process. It then calls OpenArchive and successive ReadChunk calls on this remote object to retrieve chunks of data as they are being decompressed in the ZipAsync process, in parallel with any processing that happens in the driver test process.

    The name of the object (a GUID) is generated by the driver process and passed on the command line, so that ZipAsync registers the name "ZipAsync.<<guid>>" in the ROT and the driver process retrieves the same name from the ROT using GetObject. For debugging purposes both processes use a fixed GUID, currently preset in the ZipAsync project's "Command Line" setting, so the two instances of the VB IDE (one with ZipAsync, one with the driver project) can communicate while you debug both projects.

    Some throttling will probably be necessary on the unzipping side if ingress is too slow to keep up with decompression, so that OOM is prevented in the ZipAsync process (currently not implemented).

    Also keep in mind that, as a final polish, the driver project and the ZipAsync project can be combined into a single project/binary that selects its mode of execution from command line parameters, e.g. a "--channel <<GUID>>" option would start the application in a hidden ZipAsync mode, much the same way chrome.exe starts a gazillion processes and does its multi-threading on several separate channels in parallel.
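The single-binary idea can be sketched in a few lines. This is only an illustration in Python, not ZipAsync's actual code: the `--channel` option name comes from the post, while `parse_mode` and `worker_command_line` are hypothetical helper names.

```python
import argparse
import uuid


def parse_mode(argv):
    """Return ("worker", channel) when --channel is given, else ("driver", None)."""
    parser = argparse.ArgumentParser(description="one binary, two modes")
    # --channel is the option name suggested in the post; everything else is assumed.
    parser.add_argument("--channel", metavar="GUID", default=None)
    args = parser.parse_args(argv)
    if args.channel is None:
        return ("driver", None)
    # Validate the GUID early so a typo fails loudly instead of registering a bad name.
    channel = str(uuid.UUID(args.channel.strip("{}")))
    return ("worker", channel)


def worker_command_line(exe_path):
    """Build the command line a driver could use to start a hidden worker instance."""
    channel = str(uuid.uuid4())
    return [exe_path, "--channel", channel], channel
```

The driver would generate the GUID, launch a second copy of itself with that option, and then look the worker up under the same name.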

    cheers,
    </wqw>

  2. #82

    Thread Starter
    PowerPoster Elroy's Avatar
    Join Date
    Jun 2014
    Location
    Near Nashville TN
    Posts
    10,910

    Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

    @SmUX2k: Hey, no problem. As long as you and Olaf don't get into some long protracted discussion of RC#, go ahead and post here.

    Quote Originally Posted by wqweto View Post
    Btw, you can unzip on a separate thread/process so that decompression and ingress happen in parallel.
    Ok, that might be fun to play with. Right now, the unzipping is MUCH faster than the file processing (of the unzipped files). But I'd still want some kind of callback telling me when each file was unzipped (so I didn't try to start processing it before it was unzipped).

    Not sure you're looking at what we're doing, but the unzipped TXT files are deleted after they're processed.

    ------------

    Right now though, after reveling in another Verstappen win and eating lunch, I'm playing around with MariaDB. I'm thinking that might provide a big improvement in processing time over writing MDB files, as it too runs in a different thread. In fact, that'll be a 64-bit thread (after ODBC hands it to the actual MariaDB server).
    Any software I post in these forums written by me is provided "AS IS" without warranty of any kind, expressed or implied, and permission is hereby granted, free of charge and without restriction, to any person obtaining a copy. To all, peace and happiness.

  3. #83
    Fanatic Member
    Join Date
    Apr 2021
    Posts
    616

    Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

    Quote Originally Posted by wqweto View Post
    Btw, you can unzip on a separate thread/process so that decompression and ingress happen in parallel. I've just cobbled a generic async approach to unzipping here: ZipAsync.zip
    This might also be handy for me, and I've already downloaded it (think I was the first to get it, when you posted this)...my plan is to move over from SQLite and instead create my OWN BLOB storage system that uses Dil's HugeBinaryFile class (so it goes beyond the 2GB limit) to write the data directly into a huge binary DB...SQL is too complicated for me at times, and there's too much I have to cater for just to add a file into the DB, while I think (as files will rarely be deleted from the DB, it's generally write and forget) I can create something a little more tailored to my needs...which was my original plan.

    Granted, I have RC6's LZMA compression as an option, but async zip is still worth considering...not least as an interim step for files, before they're permanently placed in the database at maximum compression.

    And Elroy, if you think about it, this COULD be handy in other ways...you're getting a buffer of decompressed data coming in as it decompresses...it could be processing on thread 1 while the next block of data is decompressing on thread 2 (I'm assuming we're talking about in-memory decompression)...perhaps sometimes the processor would have to wait for the decompressor, but that's easy enough to handle. You're essentially cutting out that wait time at the start while it decompresses, and when there's tons of files every second counts.
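The "buffer of decompressed data coming in as it decompresses" idea can be sketched as a generator. This is a rough Python illustration using a plain zlib stream rather than a real ZIP container, and `iter_decompressed` is a hypothetical name, not part of ZipAsync.

```python
import zlib


def iter_decompressed(compressed_chunks, max_out=64 * 1024):
    """Yield decompressed blocks bounded by max_out as compressed input arrives.

    The final flush may add a small remainder block.
    """
    d = zlib.decompressobj()
    for chunk in compressed_chunks:
        data = chunk
        while data:
            # Cap the output size; input that was not consumed is kept for the next pass.
            out = d.decompress(data, max_out)
            if out:
                yield out
            data = d.unconsumed_tail
    tail = d.flush()
    if tail:
        yield tail
```

Each yielded block can be handed to the processing code while the next compressed chunk is still being read, so memory use stays near `max_out` instead of the full file size.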

  4. #84

    Thread Starter
    PowerPoster Elroy's Avatar
    Join Date
    Jun 2014
    Location
    Near Nashville TN
    Posts
    10,910

    Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

    Quote Originally Posted by SmUX2k View Post
    And Elroy, if you think about it, this COULD be handy in other ways...you're getting a buffer of decompressed data coming in as it decompresses...it could be processing on thread 1 while the next block of data is decompressing on thread 2 (I'm assuming we're talking about in-memory decompression)...perhaps sometimes the processor would have to wait for the decompressor, but that's easy enough to handle. You're essentially cutting out that wait time at the start while it decompresses, and when there's tons of files every second counts.
    Exactly. With a SQL server, it'd actually potentially get three threads going: 1) unzipping, 2) VB6 processing, 3) writing to the database. And yeah, to be safe, there'd need to be some interaction between the first two, and maybe the last two as well. I'm sure SQL servers have a buffer, but there may be a point at which the buffer is full and we need to wait to write anymore.
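A minimal sketch of that three-stage arrangement, written in Python purely for illustration. The `unzip`, `convert` and `write` callables are placeholders for the real work, and the bounded queues stand in for "wait when the next stage is behind". Error handling is left out: if a stage raises, the others would block.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(sources, unzip, convert, write, max_pending=2):
    """Run unzip -> convert -> write on three threads with bounded hand-offs."""
    q_unzipped = queue.Queue(maxsize=max_pending)
    q_converted = queue.Queue(maxsize=max_pending)

    def stage_unzip():
        for src in sources:
            q_unzipped.put(unzip(src))  # blocks while the converter is behind
        q_unzipped.put(_DONE)

    def stage_convert():
        while True:
            item = q_unzipped.get()
            if item is _DONE:
                q_converted.put(_DONE)
                return
            q_converted.put(convert(item))  # blocks while the writer is behind

    def stage_write():
        while True:
            item = q_converted.get()
            if item is _DONE:
                return
            write(item)

    threads = [threading.Thread(target=f) for f in (stage_unzip, stage_convert, stage_write)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With one thread per stage, items keep their original order, and `max_pending` limits how far any stage can run ahead of the next one.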

  5. #85
    Fanatic Member
    Join Date
    Apr 2021
    Posts
    616

    Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

    Another thought for you, Wqw, and hopefully Elroy will back me up on the usefulness of it for both of us and other users...I haven't looked at the update, so forgive me if it is already there!

    We can be talking about 6GB+ files (and often 1GB+) that are being decompressed, and obviously VB6 can't easily manage that amount of data without clever trickery...perhaps a way to start and stop the decompression would be handy here? Maybe set a limit on the amount of data in the outgoing buffer, or number of buffer entries before the thread continues decompressing? I think manual control over the decompression would be the most logical of those, but intelligent buffer limits would work just as well.

    I've avoided using the in-memory decompression specifically for this reason...I have 32GB of memory, but VB6 obviously doesn't have access to all of it at once...but being able to decompress in-memory rather than writing the output to the HD would definitely improve the overall speed of things :-)

    I agree that in most cases this isn't an issue, but there will be people who are decompressing files of the size that VB6 wouldn't be able to normally handle...I'm proof that there's at least one :-P

  6. #86
    PowerPoster wqweto's Avatar
    Join Date
    May 2011
    Location
    Sofia, Bulgaria
    Posts
    6,169

    Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

    FYI, ZipAsync does not hit the disk, i.e. all decompression is done in memory and output is asynchronously buffered until ReadChunk is called. I just updated the ZipAsync.zip archive above to implement a NUM_BACKLOG_FILES constant (currently 1) which throttles decompression, i.e. it limits the number of decompressed files which are outstanding/not yet processed by ReadChunk.

    But this might not be enough in your case; it will probably be necessary to implement some kind of maximum backlog buffer size for large individual files. This will require splitting the decompression output of a single file into several buffers/streams, so that these can be discarded early as ReadChunk receives data from the current file piecemeal.
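A size-based backlog limit could look roughly like this. It is a Python sketch under assumptions, not the ZipAsync implementation: `ByteBudgetBuffer` and its methods are hypothetical names, and only `read_chunk` is loosely modelled on ReadChunk.

```python
import threading
from collections import deque


class ByteBudgetBuffer:
    """FIFO of byte chunks that blocks the producer once max_bytes are outstanding."""

    def __init__(self, max_bytes):
        self._max = max_bytes
        self._chunks = deque()
        self._size = 0
        self._closed = False
        self._cond = threading.Condition()

    def put(self, chunk):
        with self._cond:
            # An empty buffer always admits a chunk, so an oversized chunk cannot deadlock.
            while self._chunks and self._size + len(chunk) > self._max:
                self._cond.wait()
            self._chunks.append(chunk)
            self._size += len(chunk)
            self._cond.notify_all()

    def close(self):
        """Signal that the producer will not add any more chunks."""
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def read_chunk(self):
        """Return the next chunk, or None once the buffer is closed and drained."""
        with self._cond:
            while not self._chunks and not self._closed:
                self._cond.wait()
            if not self._chunks:
                return None
            chunk = self._chunks.popleft()
            self._size -= len(chunk)
            self._cond.notify_all()  # wake a producer waiting for room
            return chunk
```

The decompressor calls `put` per output block and stalls when the consumer falls behind, so a 6GB file never needs more than `max_bytes` in memory at once.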

    cheers,
    </wqw>

  8. #88

    Thread Starter
    PowerPoster Elroy's Avatar
    Join Date
    Jun 2014
    Location
    Near Nashville TN
    Posts
    10,910

    Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

    Ok, it'll be a while before I get into my office, but I'm focusing on how it's going to run with MariaDB right now. I've got it all installed, with test programs up and running.

    I just need to incorporate it all into the BigFile program that processes the stock trades files.

    Also, it dawns on me that, with a SQL server, there probably should be some kind of search to see if the data is already there before doing it again. We certainly don't want to duplicate this stuff in a database. But, that's a possible next step.

    The unzipping actually runs pretty fast now. The bottleneck now that I'm seeing is in the ASCII-to-binary conversion and database writing, which I'm doing all in one step, so I can't tease those apart. But I feel pretty confident that the ASCII-to-binary conversion is about as fast as it's going to get. Hopefully, later today, I'll report time differences between writing MDB files versus shoving it all into a MariaDB SQL server (with ASCII-to-binary part of both of those).

  9. #89
    PowerPoster wqweto's Avatar
    Join Date
    May 2011
    Location
    Sofia, Bulgaria
    Posts
    6,169

    Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

    Quote Originally Posted by Elroy View Post
    Also, it dawns on me that, with a SQL server, there probably should be some kind of search to see if the data is already there before doing it again. We certainly don't want to duplicate this stuff in a database. But, that's a possible next step.
    In MSSQL you have IGNORE_DUP_KEY option for unique indexes like this:

    CREATE UNIQUE [CLUSTERED] INDEX MyIndex ON MyTable(UniqueCol1, UniqueCol2, ...) WITH (IGNORE_DUP_KEY = ON)

    It basically does not raise an error when you insert a row that duplicates an existing (UniqueCol1, UniqueCol2, ...) combination; the duplicate row is skipped and you just get an info message (a warning) that it was ignored.
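For a runnable feel of the "skip duplicates instead of failing" behaviour, here is a small Python sketch against SQLite, not MSSQL. It swaps in SQLite's `INSERT OR IGNORE` (MariaDB has a comparable `INSERT IGNORE`), and the `quotes` table and its columns are made up for the example.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory table with a unique index over the key columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE quotes (symbol TEXT, y INT, m INT, d INT, bid REAL)")
    conn.execute("CREATE UNIQUE INDEX ux_quotes ON quotes (symbol, y, m, d)")
    return conn


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]


def insert_ignoring_duplicates(conn, rows):
    """Insert rows, skipping any that collide with the unique index.

    Returns the number of rows actually added.
    """
    before = row_count(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO quotes (symbol, y, m, d, bid) VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return row_count(conn) - before
```

Re-running the same import then adds nothing new, at the cost of an index lookup per inserted row.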

    cheers,
    </wqw>

  10. #90
    Fanatic Member
    Join Date
    Apr 2021
    Posts
    616

    Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

    Quote Originally Posted by Elroy View Post
    The unzipping actually runs pretty fast now. The bottleneck now that I'm seeing is in the ASCII-to-binary conversion and database writing, which I'm doing all in one step, so I can't tease those apart. But I feel pretty confident that the ASCII-to-binary conversion is about as fast as it's going to get. Hopefully, later today, I'll report time differences between writing MDB files versus shoving it all into a MariaDB SQL server (with ASCII-to-binary part of both of those).
    I know this is probably not useful to you, but I'll mention it anyway. Is the app constantly writing to the DB, or is it stopping and starting constantly? If the latter, perhaps some sort of write caching system is needed. With my BLOB writer, I store a cache of BLOBs in memory and only write to the DB when that cache size reaches a certain point (~100MB, but obviously this is tweakable). If you find that the main bottleneck is the writing to the DB, this would probably remove that bottleneck or, more specifically, defer it until later. In my case, it might take a second to write each BLOB individually when I do them on demand, one at a time as they become available, but doing them in bulk like this seems to cut the time taken by more than half.
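The write cache described above can be sketched like this. It is a generic Python illustration, not the BLOB writer itself: `WriteCache` is a hypothetical name and `flush_func` stands for whatever performs the bulk DB write.

```python
class WriteCache:
    """Collect blobs in memory and hand them to flush_func in bulk."""

    def __init__(self, flush_func, max_bytes=100 * 1024 * 1024):
        self._flush_func = flush_func  # called with a list of blobs
        self._max = max_bytes          # ~100MB threshold, tweakable
        self._items = []
        self._size = 0

    def add(self, blob):
        self._items.append(blob)
        self._size += len(blob)
        if self._size >= self._max:
            self.flush()

    def flush(self):
        """Write out whatever is cached; call once more at the end of the run."""
        if self._items:
            batch, self._items, self._size = self._items, [], 0
            self._flush_func(batch)
```

In a real writer, `flush_func` would typically wrap the whole batch in a single transaction, which is usually where most of the saving comes from.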

    What if you were to build the data into arrays and write them to a file, and have another app (hopefully one that is using a different thread) that is there to load these files and place them into the DB? It seems like an extra step, but I would guess that writing to a file would be far quicker than writing to DB. Obviously I don't know SQL well enough, but I would assume it would be possible to write 1 column at a time, and if so you should be able to target a specific element of a UDT (like the timestamps, for instance) and write them on their own into a file. I'd say write the file with a ".tmp" suffix and when the file is complete you change it to ".txt" (and the second app looks specifically for ".txt" files so won't try to load a file you're still writing).
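The ".tmp then rename" hand-off might look like this in outline. It is a Python sketch with hypothetical function names; the folder layout is assumed.

```python
import os


def publish_file(folder, name, data):
    """Write data under a .tmp name, then rename to .txt once it is complete."""
    tmp_path = os.path.join(folder, name + ".tmp")
    final_path = os.path.join(folder, name + ".txt")
    with open(tmp_path, "wb") as f:
        f.write(data)
    # The rename happens only after the file is closed, so readers never see a partial file.
    os.replace(tmp_path, final_path)
    return final_path


def ready_files(folder):
    """List the finished files the second app may load (it ignores .tmp files)."""
    return sorted(
        os.path.join(folder, n)
        for n in os.listdir(folder)
        if n.lower().endswith(".txt")
    )
```

The rename is a single step as long as both names live on the same volume, so the loader either sees a complete ".txt" file or no file at all.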

    Take DB writing out of the bottleneck equation and you're left with just the conversion bottleneck...and you yourself say that you've optimised it as much as possible (probably not true, there are always improvements to make...they just improve things less each time) so this might improve the efficiency. If the DB writing causes a backlog (I expect it will) you can always put in a wait if there are more than X files in the DB queue (the number of text files in the folder, as the DB writer will delete files it has processed).
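The "wait if there's more than X files in the queue" rule can be sketched as a simple polling loop. This is a Python illustration with hypothetical names and an assumed ".txt" queue folder.

```python
import os
import time


def pending_count(folder, suffix=".txt"):
    """Number of finished files still waiting for the DB writer."""
    return sum(1 for n in os.listdir(folder) if n.lower().endswith(suffix))


def wait_for_room(folder, max_pending, poll_seconds=0.5, timeout=None):
    """Block until fewer than max_pending files are queued; return False on timeout."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while pending_count(folder) >= max_pending:
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True
```

The converter would call `wait_for_room` before producing each new file, so the folder never holds more than `max_pending` files at a time.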

    Perhaps now you're seeing why I went with the BLOB method over RAW data...yes, an entire column of millions of data points takes 1 second or less to write to the DB (partly because it's LZMA compressed so much smaller...even L1 LZMA made a decent difference)...and 100MB at a time is usually done in single digit seconds (so under 10s). I'm not trying to dissuade you from using the RAW method, just pointing out that I didn't need it so went with daily data in bulk.

  11. #91

    Thread Starter
    PowerPoster Elroy's Avatar
    Join Date
    Jun 2014
    Location
    Near Nashville TN
    Posts
    10,910

    Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

    Quote Originally Posted by wqweto View Post
    In MSSQL you have IGNORE_DUP_KEY option for unique indexes like this:

    CREATE UNIQUE [CLUSTERED] INDEX MyIndex ON MyTable(UniqueCol1, UniqueCol2, ...) WITH (IGNORE_DUP_KEY = ON)

    It basically does not raise an error when you insert a row that duplicates an existing (UniqueCol1, UniqueCol2, ...) combination; the duplicate row is skipped and you just get an info message (a warning) that it was ignored.

    cheers,
    </wqw>
    To truly be unique, a unique key would need to include the "Seconds", which is currently a float (Single). I've always been nervous about including a float in a unique key. But year, month, day is probably enough (although that wouldn't be unique at all).
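One common way around a float in a key is to store the seconds as a whole number of smaller units. The Python sketch below shows why a Single does not compare equal to the decimal value it came from, and how an integer key avoids the issue. The millisecond resolution is purely an assumption about the data.

```python
import struct


def as_float32(x):
    """Round-trip a value through 4-byte IEEE 754 storage, like a Single/FLOAT column."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def seconds_to_millis(seconds):
    """Map fractional seconds to an exact integer key (assumes millisecond resolution)."""
    return int(round(seconds * 1000))


stored = as_float32(12.345)          # what a Single actually holds
print(stored == 12.345)              # False: the stored value is only close to 12.345
print(seconds_to_millis(stored))     # 12345, an exact value that is safe in a unique key
```

Two identical Single values still compare equal to each other; the trouble starts when a key is compared against a decimal literal or a value that went through a different conversion.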

    Also, I'm worried that an IGNORE_DUP_KEY would slow things down. I was just thinking about searching the database at the start of each month's dump to see if anything for that Year-Month was there, and, if so, issuing a warning. Again, that's another "phase" to this though. I just need to get MariaDB up and running first, and do some timings compared to dumping to MDBs.
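The per-month check could be a single existence query before each dump. This is a Python sketch using SQLite as a stand-in for MariaDB; the `quotes` table and its `y`/`m` columns are hypothetical names.

```python
import sqlite3


def month_already_loaded(conn, year, month):
    """Return True if any row for the given Year-Month is already in the table."""
    row = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM quotes WHERE y = ? AND m = ?)",
        (year, month),
    ).fetchone()
    return bool(row[0])
```

With an index on the year and month columns this is one cheap lookup per monthly file, rather than a uniqueness check on every inserted row.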

    ----------

    Also, I'm working from memory, but here are my MariaDB columns: AutoID, Ticker Symbol, Year, Month, Day, Hour, Minute, Seconds, BidPrice, BidVolume, BidExchange, AskPrice, AskVolume, AskExchange.

    Symbol=CHAR(5) latin1
    Year=SMALLINT
    Month=TINYINT
    Day=TINYINT
    Hour=TINYINT
    Minute=TINYINT
    Seconds=FLOAT

    ------------

    Further FYI, the files I've got now are ZIPs for a year, with one-TXT-file-per-month in them.
    ...
    Last edited by Elroy; Mar 10th, 2024 at 09:43 AM.

  12. #92
    Addicted Member
    Join Date
    Feb 2022
    Posts
    217

    Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

    I feel like I should get a participation trophy for reading every word of this thread. kidding. Did we ever get the final benchmarks of RC6/sqlite vs HugeBigFile?

  13. #93
    PowerPoster
    Join Date
    Jun 2013
    Posts
    7,454

    Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

    Quote Originally Posted by taishan View Post
    Did we ever get the final benchmarks of RC6/sqlite vs HugeBigFile?
    In case you mean "RC6-cCSV -> Sqlite-imports" vs. "HugeFile-class -> MariaDB-imports" -
    the performance is about a factor of 3 better with RC6/sqlite (according to my own tests).

    Olaf
