[RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Printable View

Show 60 post(s) from this thread on one page

Mar 7th, 2024, 04:12 PM
Elroy

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast
I've about got a prototype up and running.
- Extract ZIP file.
- Process 12 TXT files.
- Create 12 MDB files with records and fields.
Time for lunch though.

From alpha-testing, it looks like it's going to process a whole ZIP file (12 TXT files) in about a minute.

I haven't started on the "file windowing" yet though, as that's going to take a little thought. And some of your files do blow up VB6 memory. But there are solutions.

I'll be back.
Mar 7th, 2024, 04:33 PM
SmUX2k

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Quote:

Originally Posted by Elroy

I haven't started on the "file windowing" yet though, as that's going to take a little thought. And some of your files do blow up VB6 memory. But there are solutions.

https://www.vbforums.com/showthread....File-I-O-Class is what I used...took *4* seconds to fully load in a file (1MB at a time) and calculate exactly where each day ends and the next day begins...and that's with using StrConv on the byte array output! This is the output I get from my code (in immediate window):

Code:

21:29:24 - Started 21:29:24 1 101884744 2021-03-01 0 101884743 101884744 21:29:24 2 122164934 2021-03-02 101884744 224049677 122164934 21:29:25 3 145589353 2021-03-03 224049678 369639030 145589353 21:29:25 4 219832632 2021-03-04 369639031 589471662 219832632 21:29:25 5 237883695 2021-03-05 589471663 827355357 237883695 21:29:26 6 194742596 2021-03-08 827355358 1022097953 194742596 21:29:26 7 121977061 2021-03-09 1022097954 1144075014 121977061 21:29:26 8 160873534 2021-03-10 1144075015 1304948548 160873534 21:29:26 9 78840585 2021-03-11 1304948549 1383789133 78840585 21:29:26 10 93401787 2021-03-12 1383789134 1477190920 93401787 21:29:26 11 77502242 2021-03-15 1477190921 1554693162 77502242 21:29:26 12 103138946 2021-03-16 1554693163 1657832108 103138946 21:29:27 13 117197095 2021-03-17 1657832109 1775029203 117197095 21:29:27 14 112431124 2021-03-18 1775029204 1887460327 112431124 21:29:27 15 106238659 2021-03-19 1887460328 1993698986 106238659 21:29:27 16 73350569 2021-03-22 1993698987 2067049555 73350569 21:29:27 17 99863779 2021-03-23 2067049556 2166913334 99863779 21:29:27 18 102351865 2021-03-24 2166913335 2269265199 102351865 21:29:27 19 133777368 2021-03-25 2269265200 2403042567 133777368 21:29:28 20 103652747 2021-03-26 2403042568 2506695314 103652747 21:29:28 21 112607831 2021-03-29 2506695315 2619303145 112607831 21:29:28 22 86706024 2021-03-30 2619303146 2706009169 86706024 21:29:28 23 75862958 2021-03-31 2706009170 2780609680 74600511 21:29:28 - Finished

It's not particularly smart code, but it is loading 2.5GB of data into memory (from RAM drive, as I think you know by now) in fairly small blocks at ~600MB/s...with some smart management, I'm sure you could make good use of HugeBinaryFile for this :-)

Edit: For reference, the last 3 columns are Start pointer, End pointer, Size (and I know there's a weird dupe but check the last line and you'll see size doesn't tally...there's a reason, last number is right as it's calculated from start and end pointer...not important, this isn't about my dodgy code, it's about the speed of the HBF)
Mar 7th, 2024, 05:44 PM
Elroy

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

I've got a file with these lines in it:

Quote:

2021-01-06 09:37:56.491,51.49,100,13,51.5,50000,2
2021-01-06 09:37:56.491,51.49,100,13,51.5,50100,2

Apparently, someone wanted a LOT of shares. So, a VB6 Integer isn't going to work for the "AskVolume".

I'm just going to change both the volumes to Longs.
Mar 7th, 2024, 05:51 PM
SmUX2k

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Quote:

Originally Posted by Elroy

I've got a file with these lines in it:

Apparently, someone wanted a LOT of shares. So, a VB6 Integer isn't going to work for the "AskVolume".

I'm just going to change both the volumes to Longs.

I dealt with this for Schmidt also...divide by 100...they're generally reported in batches of 100, so you should be able to handle up to 3,276,700 while storing the value in the Ask Volume...I have never seen that many yet (just in case, consider capping the value at that), I don't think people buy or sell that many at once...it's just that people need to be aware that the value is multiples of 100 if using that data :-)

Storing as a long would be an alternative, though not necessarily worth it
Mar 7th, 2024, 06:40 PM
Elroy

4 Attachment(s)

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Ok, here's a prototype (attached).

I picked on the "ko_quote_year_2021_1icfge.zip" file because none of the TXT files were too large to fit into memory. That's a problem I'm still working on.

Here's a screen capture of the latest version of the program, after it processed that file:

Attachment 190667

Now, by far, the biggest time-consumer is writing out those MDB files. The rest (unzip, massage byte array) is all quite fast.

Here's a clip of the folder after it laid down all the MDB files:

Attachment 190668

The red rectangle is around the ones the program made. The green rectangle is around the one that was processed to produce those.

It's important to note that none of these MDB files are anywhere near the 2GB limit of MS Access MDB files, and I didn't even do any work on the UDT.

And here's just a look at the first of those MDB files:

Attachment 190669

Almost 5 million records in that thing, and that's just one of the 12 files (i.e., months).

------------------

I'll see about handling bigger files. I'm tempted to just use some far-memory stuff and still read the whole TXT files into memory. You said you had 32GB so that should be plenty to process anything you get.

------------------

But wait, the latest episode of Halo is out. I think it might be time to watch that. :)
Mar 7th, 2024, 07:24 PM
SmUX2k

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

"Time Taken: -86281.96 seconds."...that's quite a speed increase...almost a day faster than your best!...could be because I started it just before midnight (here) and it finished at 00:01 :-)

Did it again, 118s for MDBs of 294MB

You mention that the MDB is the slowest part...not a major worry for me as I will jump in at around that point and re-route the output into a byte array using CopyMemory and compress it into a BLOB to be added to my database system. Everything else seems to do things fast and the way I needed it done, definitely faster than I was achieving.

As for the bigger files, that is definitely a requirement...my first test of the app balked at the data I sent it, turns out one of the files was 700MB...but HugeBinaryFile definitely does the job. As I mentioned above, I used it to first grab day-by-day pointers and it was then able to load an entire day's worth of data into memory 1 MB at a time...in theory it could actually have loaded an ENTIRE day (you'll see from the output above that the largest "day" was ~237MB).

I might be able to handle some of the other bits and pieces I need for optimal usefulness. The data would be output on a day by day basis, so would generally ignore the DMY data and only check to see if the day has changed or not, and I am ignoring the Exchange (bid and ask) value. I won't do anything major until you're done...or if I do, I'll make notes throughout of where I've made edits so things don't clash if there's an update. I'll look more into it tomorrow, as it's late right now (if you hadn't guessed by the 00:01)

I've had a quick look at the code, and I'm pretty sure I could incorporate the HugeBinaryFile code into the cmdDoIt and have it iterate through the file day by day and perform the output when complete...where it is now is actually pretty useful once I have tweaked it a little...I can happily say that I can follow the logic of the code easily for the most part :-)
Mar 8th, 2024, 01:35 PM
Elroy

1 Attachment(s)

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

WOW, ok, I've got everything up and running (between watching F1 qualifying). Ran into another problem though. Trying to process that aapl_quote_year_2021_g2yon9.zip file and it actually generates an MDB file that's >2GB (which is the limit).

As a quick-fix, I'm going to cut down on the UDT, as you've been encouraging me to do. That'll get it under the 2GB MDB limit (but not necessarily by much). This is definitely some "big data". :)

I'll attach what I've got so far. It'll read a TXT file of any size, and write "most" of them to MDB just fine.

This thing also ties up the processor thread badly (almost as if it's hung). Not sure what the best approach is to solve that one, or if it even needs solving.
Mar 8th, 2024, 01:46 PM
SmUX2k

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Quote:

Originally Posted by Elroy

WOW, ok, I've got everything up and running (between watching F1 qualifying). Ran into another problem though. Trying to process that aapl_quote_year_2021_g2yon9.zip file and it actually generates an MDB file that's >2GB (which is the limit).

Simple answer, and what I have already done...store the file by day rather than month...so the first file it outputs should be 2021-01-04.mdb, which is 53,560k. Give me 5-10 minutes and I will post an update on MY progress with adapting the code (though obviously I will have to now move the adaptations I need to over to the new code and make sure they still work :-) )...it still has 1 month of data to get through for the entire year (which has taken a little over an hour) and I have completed the test :-P

And on the hang, I generally put a DoEvents here and there to give the system back some time to catch up...this is one of the big problems of trying to process massive amounts of data constantly for long periods of time!

Edit: Oh, and if you do want to do by day rather than month, wait until I post my update as my code already has the bits in place to do this...should be easy enough to work with, or you can use an adapted version of the code I sent earlier to you and use HBF to pull the specific day into memory...it seems to work fine for me, the largest day in that AAPL file is under 200MB.
Mar 8th, 2024, 02:04 PM
SmUX2k

1 Attachment(s)

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Update:

I have made some slight modifications to the code, turning the Sub into a Function (which is called naked by the original Sub). This allows me to now pass a filename (ZIP or TXT) to the function and it'll skip the relevant parts (file dialog if ZIP, file dialog and zip decompress if TXT) if I decide to set up a queue automation system for processing entire folders worth of ZIPs, which I already have done in my own code. It also allows for the ability to customise or optimise the method for decompression...currently the app is set up to extract everything in the zip before continuing, but people might prefer to extract on demand and delete afterwards (I also added an option to flag whether it deletes temporary files when it's finished with them, the default being true but if you specify a TXT file it logically sets that to false...which can be overridden).

It has Elroy's updated HBF included (I've essentially commented out the original code in the file loading, and incorporate the HBF loading code there, rather than remove it) and is set up to index the TXT file (work out where the day splits are in the file) and load a SPECIFIC day's data into a byte array rather than the whole file. This isn't going to be 100% reliable, as I know sometimes individual days will screw with this simple system, and I should really block-load the data but I don't yet know enough about the overall way things are set up in the code to allow me to do partial blocks....but it works. I also added in protection against trailing CR/LF/spaces in the data so the CSV parser gets only the data it needs. Also, I suspect now it is set up to work with 1 day's worth of data at a time, it *should* be okay...it's rare to see 1 day's worth of data being more than 100MB (though I am testing it with AAPL, and they're going as high as 200MB but usually around the 100MB mark or lower)

I also have option buttons to allow the option to output to MDB (it outputs to a file based on the date, so 1 file per day, rather than writing everything for the month into 1 MDB), a SQL Blob (not implemented...the button is just there) or file(s) (again, not there...it would be 1 file per column per day, so 7 files for each day's output when it is done...is there REALLY any reason to output it to a CSV again? If I was just splitting the CSV into days I could have done that ages ago!).

I'm currently stress testing it on a year's worth of data and 81 days has taken 35 minutes as I originally wrote this. 183 days took 1 hour. For the record, I am taking this timing from the write time of the MDB files. First file was done at 17:32 and last file was 18:54 so 252 days in 82 minutes! About 19.5s per day on average. The app itself reports "5042.465s" in total which suggests 20s per day.

I need to do a little more work on the day splitter (I sent Elroy a copy of the code, though after sending it I realised there's a few things still missing) to make it a bit more robust and useful.

Just some quality of life improvements that I would have implemented...Elroy, feel free to take these ideas and redo them your way if you want to :-)

Obviously this was done using the original Big_File (from yesterday) so expect an updated version in a little while. Edit: A little while as in tomorrow, maybe...other things to do :-P
Mar 8th, 2024, 02:40 PM
Elroy

4 Attachment(s)

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Yeah, I was thinking about prompting for a "Folder" instead of a "ZIP file". Then, all the ZIP files in that folder could just be processed. An easy way to do that would be to prompt for the folder, and then load all the ZIP files into a ListBox. Then, just build a super-loop which processes them all.

But, I successfully processed the AAPL file. I think I might call my part of this thing done. With all we've explored, I don't think there are any additional huge gains in speed to be found ... maybe some small tweaks, but nothing that's going to be a 2X or anything like gains we've already achieved. Still 20 minutes for that huge AAPL file though.

Here's a screen-grab of processing the AAPL file:
Attachment 190682

To keep the MDB files under 2GB, I removed my AutoID field (which wasn't necessary), removed "Year" and "Month". And also made "Seconds" into a byte, only keeping the integer portion. March was still very large:
Attachment 190683

And here's a peek at March with a DB Viewer:
Attachment 190684

This latest version is attached.

Unless someone finds a bug, I think this one is done for me. :) Have fun with it.
Mar 8th, 2024, 02:51 PM
SmUX2k

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Quote:

Originally Posted by Elroy

To keep the MDB files under 2GB, I removed my AutoID field (which wasn't necessary), removed "Year" and "Month". And also made "Seconds" into a byte, only keeping the integer portion.

That is probably a big no-no...the seconds NEEDS milliseconds in the output (sometimes there are 100s of price changes in a second, often there's more than 10)...this is why I convert HMS to a Long, as there's only 86.4m milliseconds in a 24h period...this is how I do it: "(((60& * m) + (60& * 60 * h)) * 1000) + s * 1000"...there's probably tidier ways to do it, but this was a quick algorithm :-)

In a day of data that has JUST 1m lines in it, and only 30,600 seconds of data in an 8.5h period, you will expect an average of 33 entries per second...without milliseconds, there's no way to tell which came first unless you just trust they're in order

Edit: And there seems to be a discrepancy of some sort...my MDB output files aren't the same size as yours, though they're working with exactly the same data. I can't post full details as it hasn't processed them all, but:

AAPL_2021_01.mdb - 1,021,504k
AAPL_2021_02.mdb - 1,153,848k

And it's a bit slower on my PC, it seems...will take a while to process all of them to get a fuller picture (I would post a screenshot, but of course don't have them all done)

Edit: And it crashed on the 3rd month...guessing you've posted an old version (or I am somehow running one...redownloading and re-testing) :-P

Edit 2: I guess you re-uploaded it inbetween and I got an old version...idiot me didn't notice you had posted two versions today, I don't know why I am so tired...it shouldn't be able to even look at an AAPL file so it has to have been one you added HBF to. Or perhaps I was an idiot and was running my edited version (though pretty sure I would have known, I display the date rather than the month. It's fine now!
Mar 8th, 2024, 03:00 PM
Elroy

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Quote:

Originally Posted by SmUX2k

That is probably a big no-no...the seconds NEEDS milliseconds in the output (sometimes there are 100s of price changes in a second, often there's more than 10)...this is why I convert HMS to a Long, as there's only 86.4m milliseconds in a 24h period...this is how I do it: "(((60& * m) + (60& * 60 * h)) * 1000) + s * 1000"...there's probably tidier ways to do it, but this was a quick algorithm :-)

In a day of data that has JUST 1m lines in it, and only 30,600 seconds of data in an 8.5h period, you will expect an average of 33 entries per second...without milliseconds, there's no way to tell which came first unless you just trust they're in order

Ahhh, makes sense. I'm thinking about lunch though. I suspect you can work that one out. :)

Also, if I really want to look at this more, I should probably install a SQL server on my machine. That would get rid of that 2GB limit. I've got MariaDB on my NAS box, but that would require sending the data through the router and switches over on the other table in my office. And I'm not sure what the speeds are of my switches.

I've almost got another big update ready for my VB6_Frm_To_Py program, and I'm going to work on that now. You take care, and I'll be around. :)
Mar 8th, 2024, 03:27 PM
SmUX2k

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Quote:

Originally Posted by Elroy

Ahhh, makes sense. I'm thinking about lunch though. I suspect you can work that one out. :)

Also, if I really want to look at this more, I should probably install a SQL server on my machine. That would get rid of that 2GB limit. I've got MariaDB on my NAS box, but that would require sending the data through the router and switches over on the other table in my office. And I'm not sure what the speeds are of my switches.

I've almost got another big update ready for my VB6_Frm_To_Py program, and I'm going to work on that now. You take care, and I'll be around. :)

Thanks for all of this, it's given me stuff to think about...different ways to process things.

I've mentioned Schmidt's RC6 before, it comes with an SQLite DB system which allows you to create local SQL DBs of any size...I am sure there are other SQLite implementations out there, or similar, and (if you're a masochist) HBF has an output mode that WRITES data as well, and, theoretically it would be possible to create your own DB system that can hold any size of data. In the long term this is probably what I will do with my BLOB system, do away with SQL entirely and instead store the output files raw with an offset stored to say where the file actually is in the DB. I went with SQL simply because it was a simpler option than writing my own system, but in the long term it seems to have worked out I was probably better off doing my own system. What held me back most wasn't the managing of the data, it was more the logistics of managing millions (potentially) of files in a large binary file and keeping their file pointers accurate, and also pulling the pointer data for a specific file when there *are* millions of files to search through. It was also the 2GB file limit, but HBF fixes that and accepts a Currency absolute seek (so not limited by the max of a Long)

Just in case you didn't read my edits above, my file sizes for the MDBs don't match yours (though the source files are exactly the same) and it crashes at the 3rd month. It isn't a huge problem, you already know I don't plan to use that bit of the code, but thought I'd mention it in case it IS an old version

Edit: It was somehow an old version, ignore that comment!
Mar 8th, 2024, 04:14 PM
Elroy

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Yeah, it was that 3rd month (March) of the AAPL file I was having trouble with too. But the latest version (in post #70) should have that fixed (by cutting a few things out).

I'd hate to having one-file-per-day, as, to me, the only thing that makes any of this useful is the ability to have it in a database form where someone could do research with it. In my own mind, that's why the blob never made a great deal of sense to me.

But hey ho, you have fun with it. It's been a fun journey for me to figure some of this stuff out.
Mar 8th, 2024, 04:27 PM
SmUX2k

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Quote:

Originally Posted by Elroy

Yeah, it was that 3rd month (March) of the AAPL file I was having trouble with too. But the latest version (in post #70) should have that fixed (by cutting a few things out).

I'd hate to having one-file-per-day, as, to me, the only thing that makes any of this useful is the ability to have it in a database form where someone could do research with it. In my own mind, that's why the blob never made a great deal of sense to me.

But hey ho, you have fun with it. It's been a fun journey for me to figure some of this stuff out.

I'd love to store it 1 month (or even 1 year) per DB, but the problem is that DBs don't implement compression. I essentially need to be able to pull this data out 1 day at a time, and specific columns only, which is definitely easily doable with an SQL DB query...but it would pull 10MB of data from the database for one column of data, when if I LZMA compressed the data stream (the CopyMemory-pulled array data) it would be anything between 500k and 1MB. When 1 day of data is 80MB, you can imagine what 1 year of data would look like, 10 years of data is 10x that and 4000+ stocks worth of data just increases that to unimaginable levels. On the small scale where you're looking at 2022 data for AAPL, you don't see the problem...when you're looking at 2000-2024 data for AAPL, it's 24 years worth of data (though, to be fair, 2000-2020 is probably about as big as 2021 is...or, if not, not far off...my current earliest zip is 2010 and that's 368MB compared to the 1.5GB 2021 and 1.6GB 2022 that has December missing) and that's potentially 6000x 80MB (24 years of 250 days 6000...which is ~483GB) just for one stock.

I say better to store it as a BLOB or similar so someone can extract the specific data they want and do the research on it on demand, rather than having the data available raw 24/7 and taking up petabytes of space :-)
Mar 9th, 2024, 02:00 AM
Schmidt

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Quote:

Originally Posted by SmUX2k

I'd love to store it 1 month (or even 1 year) per DB,
but the problem is that DBs don't implement compression.

That's not really the case...
Most decent DB-engines can be configured for transparent compression (allowing direct queries)
at either row, column or page-level - (for example SQLite, MySQL, PostgreSQL and even MS-SQLServer).

The MS-Jet-Engine (*.mdb-Files) is one of the rare exceptions here, without any support for it...

That these compression-features (at DB-level) are not often known (or used) these days,
is due to transparent compression-support directly at FileSystem-Level -
(most modern Linux-FS' - but also MS-NTFS allow that with roughly similar compression-ratios as the DB-engines).

Sure, the choosen compression-algos for this kind of "live"-compression do not achieve rates as high as e.g. LZMA -
but their write/read throughput is ~400-800MB/sec (that's faster than the write/read rate of magnet-HDs).

So, with NTFS-Folder-based compression, you could interact and query your DB-data directly, in a decent performance -
at the cost of having only about half of the LZMA-compression-ratios.

Olaf
Mar 9th, 2024, 08:35 AM
Elroy

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Yeah, I was thinking about this too ... both the size issue and the compression issue.

I'm still tempted to install MariaDB on my new box, we'll see. But pretty much any SQL server software is going to solve the size issue. Furthermore, they'll probably run faster than writing to MDB files. This is true because all the writing will be handled in a different OS thread (and probably a different CPU core) because the server software will be running as a service independent of the VB6 program. Although, with the ADO, you can still directly interface with it.

Regarding compression, an easy solution is to just turn on Windows compression on the folder where the SQL server stores its files. That probably won't be "maximum" compression, but I suspect it'd be pretty good. And also, done that way, you can still run queries (or whatever) on your table(s) in your SQL server.

I've got to watch the F1 Saudi race today, and also clean the house. But we'll see. I'll keep thinking about installing MariaDB and taking a better look at some of this stuff.
Mar 9th, 2024, 09:48 AM
SmUX2k

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

The speed problem I have found with processing data is not the actual data processing but the actual writing of that data to an array (or variable). Schmidt's code (the CSV parser) runs pretty well but where the MAIN bottleneck occurs is when that data is moved over from the output callback into the array.

I tested this by moving the value from Csv.ParseNumber into a temporary variable then moving it from that temporary variable into the actual array I wanted the data in. I benchmarked both commands separately.

It took 200s (with this extra step, usually around 120s...I assume my benchmarking is going to add some time to the process...the act of observing something in motion will alter its trajectory, and all that) to process the entire 2.5GB file, according to Schmidt's timing (slightly more than the previous times, I know...not sure why)

40s of that was taking the ParseNumber value and putting it into the temporary variable. 20s of that was moving the data from temporary variable into the array. By this logic you can assume that the actual converting of the data is taking 20s.

I re-run the code, ParseNumber completed in 38.6s while assigning values took 21.7s...that suggest the converting took ~17s

I am the first to admit that my benchmarking code (using GetTickCount to get the number of ms it takes to do something) isn't mission-critical perfect, and there's 214m iterations of these (it only processes 4 values) so it isn't much time per individual write to the array...but it is indicative, if nothing else.

Later on, when I want to read this data in from the SQL, the issue is not going to be the speed of the SQL system but the writing of this data directly into the array. There will be ways to speed it up, but is it really going to be quicker than loading a BLOB from SQLite, decompressing it and writing it directly to the array using CopyMemory? This was my main reason for wanting to directly access the underlying memory of variables/arrays, to maximise the speed of loading of the data. If it was coming from SQL, a split() would probably be involved in the process if I don't iterate through each array element one by one, and we both know how godawful slow either of those options can be. I suppose if I also went the raw SQL DB route I could run comparisons on speed with what I get through CopyMemory...do I really need to though?
Mar 9th, 2024, 10:22 AM
Elroy

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Hi SmUX2k,

I'm no longer sure what "array" you're talking about. The way my latest code works is to just create one UDT at a time, and then dump it into the database.

I guess you're still building an array so you can make your "blob". Personally, I'm not at all convinced that a blob is the way to go. That just completely disables your ability to do easy queries, sorts, and analysis on the processed data, without a lot of "unpacking" before you can do that.

I know you may not use it, but I'm playing around with using MariaDB to store all the data in one database. If I were to actually want to use something like this, that's just the way I'd do it. I'd definitely want to "easily" do things like make daily or monthly histograms for, say, average stock price, and other such things. Just thinking about doing that from a blob already gives me a headache.

--------------

And hey, if you're trying to work out how to move your array (UDT array, or however you're doing it) to/from RC#'s version of MySQL, you may do better to start a new forum thread, as that's getting pretty far away from the OP topic of this thread. :)
Mar 9th, 2024, 10:57 AM
SmUX2k

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Quote:

Originally Posted by Elroy

I'm no longer sure what "array" you're talking about. The way my latest code works is to just create one UDT at a time, and then dump it into the database.

You yourself admitted that the slowest part of the app was the writing to DB, did you not? :-P

You may have set it up as a UDT, but you write it to DB as StockUDT()...you can just as easily access specific "arrays" by going StockUDT().element (for example StockUDT().BuyPrice would be the same as the entire array of BuyPrice...I would refer to that element of a UDT as an array). Within MY app (set up before you did your work on the CSV parser) I have DataXX() arrays (XX being TS, BP, AP, BV, AV, BE, AE for the 7 different elements)...I haven't converted it to a UDT yet as it wasn't high on my priorities, but I eventually will do.

Quote:

Originally Posted by Elroy

I guess you're still building an array so you can make your "blob". Personally, I'm not at all convinced that a blob is the way to go. That just completely disables your ability to do easy queries, sorts, and analysis on the processed data, without a lot of "unpacking" before you can do that.

No queries to do, sorts are done by timestamp BEFORE writing to the BLOB, no analysis done beyond the current day (comparisons to other days can be done, but it isn't really vital...I have no plans to yet and will deal with that if it occurs), and unpacking IS a slightly lengthier process than just pulling from DB but the data is in the array far quicker. ALL analysis is intraday, generally intra-minute, and done on pre-generated indicator results rather than on the price data (which is used to generate the indicator results).

Quote:

Originally Posted by Elroy

I know you may not use it, but I'm playing around with using MariaDB to store all the data in one database. If I were to actually want to use something like this, that's just the way I'd do it. I'd definitely want to "easily" do things like make daily or monthly histograms for, say, average stock price, and other such things. Just thinking about doing that from a blob already gives me a headache.

I may use it, I dunno...would be worth it to compare the BLOB method to the SQL raw method, if only to put to rest for certain which one is right for my needs :-P

And yes, though I mention the RC6 CSV Parser, your app is still there as part of my app...I'm just fiddling around with things and trying to get both the RC6 and your app's output tied into the BLOB generator so I can properly benchmark them.

You might want to easily do daily or monthly histograms of the average stock price for a stock...I actually don't...the statistical analysis I am doing relates to split-second triggers from results of indicators (I can't easily explain it any better than that) rather than stock prices directly, and it is this out-of-the-box thinking that I hope gives me the edge over all those people basing their analysis on historical prices and estimating which ones might be due to rise soon :-P

My analysis requires an entire day's worth of data to be loaded into memory and a simulation of the day's trading to be repeatedly run millions (or perhaps billions) of times with slightly different setups...the loading speed of a day's data isn't vitally important if it is only done once before the data is used millions of times, but there might come times when it is only used once once loaded (it depends entirely on what I am doing at that time...either hunting for good settings or testing if a setting is good). I should probably point out that it won't be the tick data that I use for this, it'll be aggregate data created from the tick data, so there won't be millions of lines of prices to simulate.

I'm not saying SQL raw data isn't an option, as it definitely is, but for my specific needs (where the query will essentially ALWAYS be for the specific BLOB of data...I might want all of 2022-01-02, and I will never want all except for before 10am, for instance...the BLOB is essentially a pre-generated query result that's nicely packaged up into a LZMA compressed binary ready to be CopyMemory copied into the array) I don't see it doing as well.

Essentially I have an idea in my head and I am following it...I may find it is the wrong idea later on, but I'm on the track and staying there as otherwise things get too complicated and I will never finish the project (which, as you can probably tell by another post from 2022, has been going for many years already...I started it a little after Covid hit)

Quote:

Originally Posted by Elroy

And hey, if you're trying to work out how to move your array (UDT array, or however you're doing it) to/from RC#'s version of MySQL, you may do better to start a new forum thread, as that's getting pretty far away from the OP topic of this thread. :)

I'm not, just using it as an example of the fact that the populating of individual elements of an array one element at a time is going to be time consuming. Getting the data from MariaDB is one thing, getting it from the query result into where I need it is an entirely different thing. There might be more efficient ways to import data from a SQL DB that I don't know, but (let me say this again) I *don't* know SQL well enough to know these things...I know VB6 for the most part, and go with what I know :-P
Mar 9th, 2024, 12:16 PM
wqweto

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Btw, you can unzip on a separate thread/process so that decompression and ingress happen in parallel. I've just cobbled a generic async approach to unzipping here: ZipAsync.zip

In the test project you can see how it uses built-in GetObject to retrieve an object reference from the remote ZipAsync process and then it calls OpenArchive and successive ReadChunk on this remote object to retrieve chunks of data as it's being decompressed in the ZipAsync process in parallel to any processing which happen in the driver test process.

The name of the object (a GUID) is generated by driver process and is passed on the command line so that ZipAsync registers "ZipAsync.<<guid>>" name in ROT and driver process retrieves the same name from ROT using GetObject. For debug purposes both processes use fixed GUID as currently preset in ZipAsync project's "Comnmand Line" setting, thus it's possible to successfully communicate between the two instances of VB IDE with ZipAsync and driver project while debugging both of the projects.

Probably some throttling will be necessary on the unzipping part if ingress is too slow to catch up with decompression so that OOM is prevented in ZipAsync process (currently not implemented).

Also keep in mind that as a final polish it's possible both driver project and ZipAsync project to be combined in a single project/binary which differentiate mode of execution by command line parameters e.g "--channel <<GUID>>" command line option would start the application in a hidden ZipAsync mode, much the same way chrome.exe starts a gazillion processes and does multi-threading on several separate channels in parallel.

cheers,
</wqw>
Mar 9th, 2024, 03:07 PM
Elroy

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

@SmUX2k: Hey, no problem. :) As long as you and Olaf don't get into some long protracted discussion of RC#, go ahead and post here.

Quote:

Originally Posted by wqweto

Btw, you can unzip on a separate thread/process so that decompression and ingress happen in parallel.

Ok, that might be fun to play with. Right now, the unzipping is MUCH faster than the file processing (of the unzipped files). But I'd still want some kind of callback telling me when each file was unzipped (so I didn't try to start processing it before it was unzipped).

Not sure you're looking at what we're doing, but the unzipped TXT files are deleted after they're processed.

------------

Right now though, after reveling about another Verstappen win and eating lunch, I'm playing around with the MariaDB. I'm thinking that might provide a big bump in processing time over writing MDB files, as it too runs in a different thread. In fact, that'll be a 64-bit thread (after ODBC hands it to the actual MariaDB server).
Mar 9th, 2024, 03:19 PM
SmUX2k

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Quote:

Originally Posted by wqweto

Btw, you can unzip on a separate thread/process so that decompression and ingress happen in parallel. I've just cobbled a generic async approach to unzipping here: ZipAsync.zip

This might also be handy for me, and I've already downloaded it (think I was the first to get it, when you posted this)...my plan is to move over from SQLite and instead create my OWN BLOB storage system that uses Dil's HugeBinaryFile class (so it goes beyond the 2GB limit) to write the data directly into a huge binary DB...SQL is too complicated for me at times, and there's too much I have to cater for just to add a file into the DB, while I think (as files will rarely be deleted from the DB, it's generally write and forget) I can create something a little more tailored to my needs...which was my original plan.

Granted, I have RC6's LZMA compression as an option, but async zip is still worth considering...least of all as an interim step for files, before they're permanently placed in the database at maximum compression.

And Elroy, if you think about it, this COULD be handy in other ways...you're getting a buffer of decompressed data coming in as it decompresses...it could be processing on thread 1 while the next block of data is decompressing on thread 2 (I'm assuming we're talking about in-memory decompression)...perhaps sometimes the processor would have to wait for the decompressor, but that's easy enough to handle. You're essentially cutting out that wait time at the start while it decompresses, and when there's tons of files every second counts.
Mar 9th, 2024, 03:24 PM
Elroy

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Quote:

Originally Posted by SmUX2k

And Elroy, if you think about it, this COULD be handy in other ways...you're getting a buffer of decompressed data coming in as it decompresses...it could be processing on thread 1 while the next block of data is decompressing on thread 2 (I'm assuming we're talking about in-memory decompression)...perhaps sometimes the processor would have to wait for the decompressor, but that's easy enough to handle. You're essentially cutting out that wait time at the start while it decompresses, and when there's tons of files every second counts.

Exactly. With a SQL server, it'd actually potentially get three threads going: 1) unzipping, 2) VB6 processing, 3) writing to the database. And yeah, to be safe, there'd need to be some interaction between the first two, and maybe the last two as well. I'm sure SQL servers have a buffer, but there may be a point at which the buffer is full and we need to wait to write anymore.
Mar 9th, 2024, 04:06 PM
SmUX2k

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Another thought for you, Wqw, and hopefully Elroy will back me up on the usefulness of it for both of us and other users...I haven't looked at the update, so forgive me if it is already there!

We can be talking about 6GB+ files (and often 1GB+) that are being decompressed, and obviously VB6 can't easily manage that amount of data without clever trickery...perhaps a way to start and stop the decompression would be handy here? Maybe set a limit on the amount of data in the outgoing buffer, or number of buffer entries before the thread continues decompressing? I think manual control over the decompression would be the most logical of those, but intelligent buffer limits would work just as well.

I've avoided using the in-memory decompression specifically for this reason...I have 32GB of memory, but VB6 obviously doesn't have access to all of it at once...but being able to decompress in-memory rather than writing the output to the HD would definitely improve the overall speed of things :-)

I agree that in most cases this isn't an issue, but there will be people who are decompressing files of the size that VB6 wouldn't be able to normally handle...I'm proof that there's at least one :-P
Mar 10th, 2024, 05:42 AM
wqweto

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

FYI, the ZipAsync does not hit the disk i.e. all decompression is done in memory and output is asychnornously buffered until ReadChunk is called. I just updated the ZipAsync.zip archive above to implement NUM_BACKLOG_FILES constant (currently 1) which controls throttle on decompression i.e. the number of decompressed files which are outstanding/not processed by ReadChunk.

But this might not be enough in your case, will probably need to implement some kind of maximum backlog buffer size for large individual files. This will require to split decompression output of a single file to several buffers/streams so that these can be discarded early as ReadChunk piecemeal receives data from current file.

cheers,
</wqw>
Mar 10th, 2024, 08:09 AM
wqweto

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

This is on another level — https://questdb.io/blog/1brc-merykittys-magic-swar/
Mar 10th, 2024, 08:34 AM
Elroy

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Ok, it'll be a while before I get into my office, but I'm focusing on how it's going to run with MariaDB right now. I've got it all installed, with test programs up and running.

I just need to incorporate it all into the BigFile program that process the stock trades files.

Also, it dawns on me that, with a SQL server, there probably should be some kind of search to see if the data is already there before doing it again. We certainly don't want to duplicate this stuff in a database. But, that's a possible next step.

The unzipping actually runs pretty fast now. The bottleneck now that I'm seeing is in the ASCII-to-binary conversion and database writing, which I'm doing all in one step, so I can't tease those apart. But I feel pretty confident that the ASCII-to-binary conversion is about as fast as it's going to get. Hopefully, later today, I'll report time differences between writing MDB files versus shoving it all into a MariaDB SQL server (with ASCII-to-binary part of both of those).
Mar 10th, 2024, 09:00 AM
wqweto

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Quote:

Originally Posted by Elroy

Also, it dawns on me that, with a SQL server, there probably should be some kind of search to see if the data is already there before doing it again. We certainly don't want to duplicate this stuff in a database. But, that's a possible next step.

In MSSQL you have IGNORE_DUP_KEY option for unique indexes like this:

CREATE UNIQUE [CLUSTERED] INDEX MyIndex ON MyTable(UniqueCol1, UniqueCol2, ...) WITH IGNORE_DUP_KEY

It basicly does not error/allow inserting a duplicate row by (UniqueCol1, UniqueCol2, ...) but just returns an info message row is ignored (as a warning).

cheers,
</wqw>
Mar 10th, 2024, 09:30 AM
SmUX2k

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Quote:

Originally Posted by Elroy

The unzipping actually runs pretty fast now. The bottleneck now that I'm seeing is in the ASCII-to-binary conversion and database writing, which I'm doing all in one step, so I can't tease those apart. But I feel pretty confident that the ASCII-to-binary conversion is about as fast as it's going to get. Hopefully, later today, I'll report time differences between writing MDB files versus shoving it all into a MariaDB SQL server (with ASCII-to-binary part of both of those).

I know this is probably not useful to you, but I'll mention it anyway. Is the app constantly writing to the DB, or is it stopping and starting constantly? If the latter, perhaps some sort of write cacheing system is needed. With my BLOB writer, I store a cache of BLOBs in memory and only write to the DB when that cache size reaches a certain point (~100MB, but obviously this is tweakable). If you find that the main bottleneck is the writing to the DB, this would probably remove that bottleneck or, more specifically, will defer it until later. In my case, it might take a second to write each BLOB individually when I do them on demand 1 at a time as and when they're available, but doing them in bulk like this seemingly tends to cut the time taken by more than half.

What if you were to build the data into arrays and write them to a file, and have another app (hopefully one that is using a different thread) that is there to load these files and place them into the DB? It seems like an extra step, but I would guess that writing to a file would be far quicker than writing to DB. Obviously I don't know SQL well enough, but I would assume it would be possible to write 1 column at a time, and if so you should be able to target a specific element of a UDT (like the timestamps, for instance) and write them on their own into a file. I'd say write the file with a ".tmp" suffix and when the file is complete you change it to ".txt" (and the second app looks specifically for ".txt" files so won't try to load a file you're still writing).

Take DB writing out of the bottleneck equation and you're left with just the conversion bottleneck...and you yourself say that you've optimised it as much as possible (probably not true, there's always improvements to make...just they improve things less each time) so this might improve the efficiency. If the DB writing causes a backlog (I expect it will) you can always put in a wait if there's more than X files in the DB queue (number of text files in the folder, as the DB writer will delete files they've processed).

Perhaps now you're seeing why I went with the BLOB method over RAW data...yes, an entire column of millions of data points takes 1 second or less to write to the DB (partly because it's LZMA compressed so much smaller...even L1 LZMA made a decent difference)...and 100MB at a time is usually done in single digit seconds (so under 10s). I'm not trying to dissuade you from using the RAW method, just pointing out that I didn't need it so went with daily data in bulk.
Mar 10th, 2024, 09:38 AM
Elroy

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Quote:

Originally Posted by wqweto

In MSSQL you have IGNORE_DUP_KEY option for unique indexes like this:

CREATE UNIQUE [CLUSTERED] INDEX MyIndex ON MyTable(UniqueCol1, UniqueCol2, ...) WITH IGNORE_DUP_KEY

It basicly does not error/allow inserting a duplicate row by (UniqueCol1, UniqueCol2, ...) but just returns an info message row is ignored (as a warning).

cheers,
</wqw>

To truly be unique, a unique key would need to include the "Seconds", which is currently a float (Single). I've always been nervous about including a float in a unique key. But probably year, month, day is probably enough (although that wouldn't be unique at all).

Also, I'm worried that a IGNORE_DUP_KEY would slow things down. I was just thinking about searching the database at the start of each month's dump and see if anything for that Year-Month was there, and, if so, issue a warning. Again, that's another "phase" to this though. I just need to get MariaDB up and running first, and do some timings compared to dumping to MDBs.

----------

Also, I'm working from memory, but here are my MariaDB columns: AutoID, Ticker Symbol, Year, Month, Day, Hour, Minute, Seconds, BidPrice, BidVolume, BidExchange, AskPrice, AskVolume, AskExchange.

Symbol=CHAR(5) latin1
Year=SMALLINT
Month=TINYINT
Day=TINYINT
Hour=TINYINT
Minute=TINYINT
Seconds=FLOAT

------------

Further FYI, the files I've got now are ZIPs for a year, with one-TXT-file-per-month in them.
...
Apr 4th, 2024, 03:21 AM
taishan

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

I feel like I should get a participation trophy for reading every word of this thread. kidding. Did we ever get the final benchmarks of RC6/sqlite vs HugeBigFile?
Apr 4th, 2024, 04:19 AM
Schmidt

Re: [RESOLVED] Converting ASCII Byte Array to Single or Double ... Fast

Quote:

Originally Posted by taishan

Did we ever get the final benchmarks of RC6/sqlite vs HugeBigFile?

In case you mean "RC6-cCSV -> Sqlite-imports" vs. "HugeFile-class -> MariaDB-imports" -
the performance is about factor 3 better with RC6/sqlite (according to my own tests).

Olaf

Show 60 post(s) from this thread on one page