-
[RESOLVED] Find string in range of Byte Array
I read a very large csv file into a byte array.
During reading, i find a carriage return charcode and now I need to find a string in a specific range of the byte array.
I know the starting address of the range and the end address. How fast can I find a string pattern in an array of bytes?
-
Re: Find string in range of Byte Array
open the file in binary mode #1
set a string variable = space$(filelen(path+filename)) or faster use space$(lof(1)) if file is already opened.
get #1, , string_variable
use x= INSTR(string_variable,search_reason)
x= position if found
x= 0 if not found
don't forget to close the file with CLOSE #1
the only one thing, as VB uses UNICODE (16bits), it will get the double in size in RAM than if holding in a byte array.
But, you can do fast searches, using INSTR than looping in a byte array.
Well, there is another way, which is very low level calls, get the array pointer, and do a low level CPU find routine, equivalent to instr(), but needs to handle pointers varptr iirc, (in C is &).
The best way is to compile in mASM a little routine for that, but IIRC Windows has a kernel function to do the same, it just needs pointers and buffer size, But can't recall its name.
-
Re: Find string in range of Byte Array
The comma-separated file contains records:
Category |
Available |
Quantity |
House |
Yes |
10 |
Apartment |
Yes |
9 |
Machine |
No |
0 |
Apartment |
No |
8 |
And to me from this file it is necessary to select records where there is a field an 'Apartment'. Then paste in new file.
VarPtr is better solution maybe.
-
Re: Find string in range of Byte Array
Quote:
Originally Posted by
sergeos
The comma-separated file contains records:
Category |
Available |
Quantity |
House |
Yes |
10 |
Apartment |
Yes |
9 |
Machine |
No |
0 |
Apartment |
No |
8 |
And to me from this file it is necessary to select records where there is a field an 'Apartment'. Then paste in new file.
VarPtr is better solution maybe.
it is text format with CRCL line termination?
it is easy, seems like a programming test for the first year students???
Any computer since 70's can do that in basic.
open source$ for input as #1
open dest$ for output as #2
while eof(1) = false
line input #1, a$
if instr(a$,"Apartment")=1 then PRINT#2, a$
wend
close #2
close #1
DONE!.
-
Re: Find string in range of Byte Array
Quote:
Originally Posted by
flyguille
it is text format with CRCL line termination?
it is easy, seems like a programming test for the first year students???
Any computer since 70's can do that in basic.
Smile. Did you try your code on file over 100-150MB? Termination symbol is vbLf
New Computers lost few hours for that work.
70-s goes to eternal peace...
If it was so easy I would not create a topic.
-
Re: Find string in range of Byte Array
You already have the data in a byte array.
Then search for the byte with value '10' starting on offset 0.
Then check if the first 9 bytes of the starting on offset matches 'Apartment'
If so copy the bytes from offset till "found location -1" to a new file and append a byte with value '10' (vbLf)
-
Re: Find string in range of Byte Array
Quote:
Originally Posted by
sergeos
Smile. Did you try your code on file over 100-150MB? Termination symbol is vbLf
New Computers lost few hours for that work.
70-s goes to eternal peace...
If it was so easy I would not create a topic.
I didn't said it is the faster way. Because it reads one text line at a time.
and is one disk I/O operation per line, two i/o operation if found the target, this is the slower method.
-
Re: Find string in range of Byte Array
ok there goes, air coding.....
dim b() as byte
open source$ for binary access read #1
redim b(0 to lof(1)-1)
get #1,, b()
close #1
open dest$ for binary access write lock read as #1
x=0
on error goto ENDing ' IF compiled without arrays boundaries checks for faster execution it won't works.
while x<= ubound(b)
if b(x)=&h41 then if b(x+1)=&h70 then if b(x+2)=&h61 then if b(x+3)=&h72 then if b(x+4)=&h74 then if b(x+5) =&h6d then if b(x+6) = &h65 then if b(x+7)=&h6e then if b(x+8) = &h74 thenGosub SendLine
x=x+1
wend
close #1
exit sub
SendLine:
if x=0 then else if b(x-1) = 10 then else RETURN
while x<=ubound(b)
put#1, ,b(x)
if b(x)=10 then RETURN
x=x+1
wend
x=x-1:RETURN
ENDing:
close #1
end sub
-
Re: Find string in range of Byte Array
Just a thought. Have you tried to use ADO to read the csv file? It would have the advantage to filter on the Category header along with handling vbLf (chr 10) and/or vbCr (chr 13) terminated lines. Thinking no need to "reinvent the wheel" if not absolutely necessary.
Edited: Another advantage suits your needs. The filtered recordset can be written to disk as a separate file, i.e., snippet:
Code:
' establish ADO connection to the folder containing csv
Set RS = New ADODB.RecordSet
strSQL = "Select * FROM [" & filename+extension & "] WHERE (Category = 'Apartment')"
RS.Open strSQL, [connection], adOpenForwardOnly, adLockReadOnly, adCmdText
fNr = FreeFile()
Open [new file] For Output As #fNr
Print #fNr, RS.GetString(adClipString, , ","); ' can error if no records returned
Close #fNr
RS.Close
' close connection & set ADODB objects to Nothing
Headers won't be included, but you can iterate through the RS.Fields collection to print those to file beforehand if needed. Plenty of examples in this forum on connecting to and reading csv files with ADO
-
1 Attachment(s)
Re: Find string in range of Byte Array
You could use the Jet 4.0 Text IISAM to do this, and by selecting into a new table (file) you can get the header row as well. It should also be faster that way than sucking it into RAM all at once and using GetString() on a Recordset.
But for files of any substantial size it won't be speedy. Not bad but not as fast as you seem to hope for.
I tweaked an existing Text I/O class to support raw ANSI Text I/O as well as ANSI and Unicode. Then I applied a few more optimizations. That along with a few other small tricks to avoid creating tons of small String values got the time down a bit. It may not be bug-free but so far it seems to pass the tests I have run.
There are two Projects here: MakeFile and FilterFile.
MakeFile creates a sample file of 8,000,000 rows, about 105MB. The compiled EXE takes around 25 seconds to run on this PC, faster machines should run faster since it is mostly CPU-bound.
FilterFile reads that created sample file. It copies the column header row, then copies data rows where the first column (Category) = "Apartment" to a new file. The compiled EXE takes around 10 seconds to run on this PC.
Code:
Option Explicit
Private Declare Function GetTickCount Lib "kernel32" () As Long
'Case-sensitive:
Private Declare Function StrCmpN Lib "shlwapi" Alias "StrCmpNA" ( _
ByVal lpStr1 As Long, _
ByVal lpStr2 As Long, _
ByVal nChar As Long) As Long
'Case insensitive:
'Private Declare Function StrCmpNI Lib "shlwapi" Alias "StrCmpNIA" ( _
' ByVal lpStr1 As Long, _
' ByVal lpStr2 As Long, _
' ByVal nChar As Long) As Long
Private Sub Main()
Dim T0 As Long
Dim Row As String
Dim htfOut As HugeTextFile
Dim AnsiComma As String
Dim AnsiApartment As String
Dim CommaPos As Long
Dim Apartments As Long
T0 = GetTickCount()
ChDir App.Path
ChDrive App.Path
With New HugeTextFile
.OpenFile "DataOrig.csv", _
Access:=HTF_ACCESS_READ, _
LfNewLine:=True, _
Encoding:=HTF_ENC_ANSIRAW
Row = .ReadLine() 'Colheader row.
Set htfOut = New HugeTextFile
htfOut.OpenFile "DataFiltered.csv", _
Access:=HTF_ACCESS_WRITE, _
LfNewLine:=True, _
Encoding:=HTF_ENC_ANSIRAW
htfOut.WriteLine Row
AnsiComma = StrConv(",", vbFromUnicode)
AnsiApartment = StrConv("Apartment", vbFromUnicode)
Do Until .EOF
Row = .ReadLine()
'This logic assumes:
'
' o Category field values begin in column 1 and have no spaces
' between the value and its delimiting comma, i.e. no trimming
' is needed.
'
' o Category field values have expected case (case-sensitive
' compare is made).
CommaPos = InStrB(1, Row, AnsiComma)
If StrCmpN(StrPtr(Row), StrPtr(AnsiApartment), CommaPos - 1) = 0 Then
Apartments = Apartments + 1
htfOut.WriteLine Row
End If
Loop
htfOut.CloseFile
.CloseFile
End With
MsgBox "Done:" & vbNewLine _
& vbNewLine _
& Format$(Apartments, "#,##0") & " Apartments copied" & vbNewLine _
& vbNewLine _
& Format$(CSng(GetTickCount() - T0) / 1000, "0.0##") & " seconds"
End Sub
Even if you don't want to use this code it may have a few ideas in it that you could make use of.
-
Re: Find string in range of Byte Array
Quote:
You could use the Jet 4.0 Text IISAM to do this, and by selecting into a new table (file) you can get the header row as well. It should also be faster that way than sucking it into RAM all at once and using GetString() on a Recordset.
Just FYI to others. GetString isn't truly necessary. The data can be extracted from source to destination, in same format, same headers, just using an ADO text connection and executing a single statement. Something as simple as:
Code:
SELECT * INTO [%DestFileName%] IN '' [text;Database=%DestPath%] FROM
[%SourceTableName%] WHERE (Category = 'Apartment')
The placeholders above would need to be provided. A schema.ini should be provided for the source file to ensure no oddities like guessing wrong on source data types
-
Re: Find string in range of Byte Array
I don't have anything in my schema notes to suggest that you can specify the row delimiter though. You'd probably always get CRLFs in the output file.
-
Re: Find string in range of Byte Array
Quote:
I don't have anything in my schema notes to suggest that you can specify the row delimiter though.
Question for me? If so, the answer is no and good point if the default delimiter isn't desired. If that is the case, then my last post isn't a player. The GetString method does allow that, though as you suggest, performance may be an issue. I don't know if OLEDB sucks up the entire 150mb csv at once or not when a where clause is used. If the schema.ini has MaxScanRows=0, then yes. Otherwise, if you say so, I'll defer to your statement. Just a bit surprised no one considered ADO as a possible solution.
-
Re: Find string in range of Byte Array
I had looked at it and performance was pretty bad just using it to create a large sample file. Bad enough that I gave it up, though more like 120 seconds than "a few hours." ;)
I misread earlier posts and thought that 70 seconds had been achieved and rejected as too slow. Considering the earlier posts again the Jet Text IISAM (ADO or even DAO) might be acceptable.
That is a decent general tool for this sort of thing.
-
Re: Find string in range of Byte Array
Quote:
That is a decent general tool for this sort of thing.
Agreed, key word being general. Custom solutions often out perform general libraries. Besides, depending on the actual format of the source csv, ADO likely won't exactly duplicate the output to match the source. Example: if the source has some fields quote delimited, others not, for example. That may or may not be a requirement.
-
Re: Find string in range of Byte Array
LaVolpe, Can't open CSV with your method. Only one column selecting into RS. Maybe separator must be set, don't know.
dilettante, no bad! Your case faster by me on 7 sec. My code is:
Code:
Public Function GetRecord(ByRef Fields() As Byte) As Boolean
Dim arrTemp() As Byte
Dim lPOint As Long
Const SEARCH_CRITERIA As String = "Apartment"
ReDim arrTemp(Len(SEARCH_CRITERIA) - 1)
Do
If BufNextChar >= BufSize Then
'Need to read the next buffer of data.
If FileCharsLeft <= 0 Then
'No more data to read: make sure we didn't have a last
'record with no CRLF, or exit with EOF result.
Exit Function
'End If
Else
'If FileCharsLeft < BufSize Then
'Final buffer is smaller than the others.
BufSize = FileCharsLeft
End If
ReDim Buf(BufSize - 1) 'Undo previous Redim Preserve.
Get #F, , Buf
ReDim Preserve Buf(BufSize) 'We need to append a terminating 0 byte.
FileCharsLeft = FileCharsLeft - BufSize
BufNextChar = 0
lPOint = 0
End If
Do Until BufNextChar >= BufSize
If Buf(BufNextChar) = Delim Or Buf(BufNextChar) = CR Then
CopyMemory arrTemp(0), ByVal VarPtr(Buf(lPOint)), LenB(Buf(lPOint)) * (UBound(arrTemp) + 1)
If StrConv(arrTemp, vbUnicode) = SEARCH_CRITERIA Then
If (Not Fields) = -1 Then
ReDim Preserve Fields(BufNextChar - lPOint + 0) ' UBound(Fields))
Else
ReDim Preserve Fields(BufNextChar - lPOint + UBound(Fields) + 1)
End If
CopyMemory Fields(UBound(Fields) - (BufNextChar - lPOint)), ByVal VarPtr(Buf(lPOint)), LenB(Buf(lPOint)) * (BufNextChar - lPOint + 1)
End If
lPOint = BufNextChar + 1
End If
BufNextChar = BufNextChar + 1 'Dirty chars and commas skipped here.
Loop
Loop
End Function
-
Re: Find string in range of Byte Array
The problem with custom low-level code is always getting it right though. I'm still not sure the Text I/O class I posted above is fully debugged.
Data corruption is bad enough, but silent loss of data can be even more insidious. Using a power tool known to be reliable makes sense even if it costs you some performance. Of course you still need to test, test, test because one can easily make an error working with high-level tools.
-
Re: Find string in range of Byte Array
Quote:
LaVolpe, Can't open CSV with your method. Only one column selecting into RS. Maybe separator must be set, don't know
If you are interested in the ADO approach, would need to know more details about your specific csv file... For example:
- What is the column/field delimiter or are the columns fixed widths?
- Can any fields be interpreted incorrectly, i.e., a text field interpreted as a Date, Integer, Boolean field incorrectly?
Those are generally the most common issues and can be resolved by supplying a schema.ini file. That file tells the csv parser (ADO, OLEDB, etc) how to handle specifics when the csv is not technically to specs.
Edited. Using the simple example you provided in post #3, I created a file called tester.csv with that content. Then executed a statement similar to the one in post #11 which produced a file containing exactly this:
Code:
"Category","Available","Quantity"
"House","Yes",10
"Apartment","Yes",9
"Machine","No",0
"Apartment","No",8
Adding a where clause (Category='Apartment') produced exactly this:
Code:
"Category","Available","Quantity"
"Apartment","Yes",9
"Apartment","No",8
Connection string used... replace %TARGET% with the csv folder
Code:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=%TARGET%;Extended Properties='text;HDR=YES;FMT=CSVDelimited'
-
1 Attachment(s)
Re: Find string in range of Byte Array
LaVolpe
This is my code for ADO. read only one Column
Code:
Dim dbConnectionString As String
Dim CN As Object
Dim vRS As Object
Dim sSQL As String
Dim PathFile As String
Set CN = CreateObject("ADODB.Connection")
dbConnectionString = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & "C:\mm" & "" & ";Extended Properties='text;HDR=NO'"
CN.Open dbConnectionString
PathFile = "test.csv"
Set vRS = CreateObject("ADODB.Recordset")
sSQL = "Select * FROM [" & PathFile & "]" 'WHERE (Category = 'Tyres')"
vRS.Open sSQL, CN, 0, 1, 1
'If Not vRS.EOF Then .Range("A2").CopyFromRecordset vRS
Attachment 145971
-
Re: Find string in range of Byte Array
Thank you for the test.zip. That will help out others tremendously.
Your file appears to be semi-colon delimited. By adding a schema.ini file, it appears to be parsed out correctly; however, to save the matching records in the same format would require using ADO's GetString or something like what Dilettane suggests.
If one does not already exist, create a file in same folder as your csv and name the file: schema.ini
Paste this into the file and save it, then try your sample code again
Code:
[test.csv]
Format=Delimited(;)
ColNameHeader=False
There appears to be money values in European format which converted incorrectly on my US system; though they may parse correctly on yours. If this file were actually something that needed to be parsed correctly on anyone's pc, the schema.ini file would need to be adjusted for even greater detail. Just FYI. If more interested in schema.ini files, simply google: csv schema.ini
-
Re: Find string in range of Byte Array
One more note. The test file you provided recently isn't anything like the sample data your provided in post #3.
If the test file isn't actual data you are trying to process, please provide a small sample file of actual data. It doesn't help you if we are providing solutions for data you won't even be using. Only a couple records are needed, along with any headers. If your actual data contains information that should not be public, you can blank out those field values or provide bogus values, i.e., instead of something like a real phone number, use 555-0123 for example.
-
Re: Find string in range of Byte Array
Hi sergeos,
You didn't really specify things very well in your OP (post #1). For instance, is your input file ANSI or Unicode? In theory, it could quite easily be either. A CSV can certainly be either, but I suspect ANSI is probably more typical, so that's what I'll assume.
Also, apparently, the file is of a size that it'll fit into memory or you wouldn't be talking about a byte array. If it were me, I probably wouldn't read it into a byte array in the first place, but that's what you've asked for, so I'll go with it.
Therefore, we're reading an ANSI file into a byte array, doing some kind of parsing such that we want to search within portions of it for some string. Here's how I'd tend to do that....
Code:
Option Explicit
'
Private Declare Sub CopyMemory Lib "kernel32.dll" Alias "RtlMoveMemory" (ByRef Dest As Any, ByRef Source As Any, ByVal Bytes As Long)
'
Private Sub Form_Load()
' Create a byte array with ANSI "ABCDEFGHIJK" in it (11 letters).
Dim bb() As Byte
ReDim bb(1 To 11)
bb(1) = 65 ' A
bb(2) = 66 ' B
bb(3) = 67 ' C
bb(4) = 68 ' D
bb(5) = 69 ' E
bb(6) = 70 ' F
bb(7) = 71 ' G
bb(8) = 72 ' H
bb(9) = 73 ' I
bb(10) = 74 ' J
bb(11) = 75 ' K
' Create a needle to search for.
Dim sNeedle As String
sNeedle = "CD"
' Do search and report results.
MsgBox StringIsInByteArrayRange(bb(), 1, 11, sNeedle)
MsgBox StringIsInByteArrayRange(bb(), 3, 11, sNeedle)
MsgBox StringIsInByteArrayRange(bb(), 4, 11, sNeedle)
MsgBox StringIsInByteArrayRange(bb(), 3, 4, sNeedle)
Unload Me
End Sub
Private Function StringIsInByteArrayRange(bbHay() As Byte, iRangeBegin As Long, iRangeEnd As Long, sNeedle As String, Optional bHayIsUnicode As Boolean = False) As Long
' Returns the position where found, with iRangeBegin being first position.
' Returns ZERO if not found.
' Returns -1 if there's something wrong with things.
' CAVEATS: If you're actually searching for vbNullChar specified in sNeedle, this function won't work correctly.
Dim iErr As Long
Dim iLength As Long
Dim iHayStrLen As Long
Dim sHay As String
Dim i As Long
'
' Set number of bytes to search in hay.
iLength = iRangeEnd - iRangeBegin + 1
'
' Set "something wrong" return value.
StringIsInByteArrayRange = -1
'
' First a bit of validity checking.
On Error Resume Next
If iRangeBegin < LBound(bbHay) Then Exit Function
If iRangeEnd > UBound(bbHay) Then Exit Function
If iRangeBegin > iRangeEnd Then Exit Function
If LBound(bbHay) > UBound(bbHay) Then Exit Function
If bHayIsUnicode And (iLength Mod 2) Then Exit Function ' If Hay is Unicode, we must search an even number of bytes.
iErr = Err
On Error GoTo 0
If iErr Then Exit Function ' This is a "DIM'ed" check.
'
' We should be good, so do it.
iHayStrLen = (iLength + 1) \ 2 ' The +1 handles odd lengths. The \2 is because strings are Unicode, not bytes.
' There may be an extra 0 at the end of the string (for odd byte ranges), but nothing we can do about that.
sHay = String$(iHayStrLen, 0)
'
' Move our hay range into the string.
' Note that "bytes" are moved with no conversion to Unicode.
CopyMemory ByVal StrPtr(sHay), bbHay(iRangeBegin), iLength
'
' Is our hay ANSI (or just bytes) or is it Unicode?
' If it's ANSI (or bytes) we must convert to Unicode for search.
If Not bHayIsUnicode Then sHay = StrConv(sHay, vbUnicode)
'
' We're now ready to use Instr for our search.
StringIsInByteArrayRange = InStr(sHay, sNeedle)
End Function
Enjoy,
Elroy
p.s. I did put an option in the search function for dealing with a Unicode byte array (i.e., Unicode CSV file), but I didn't test it. Also, hopefully you can see that, for testing, I just created a small byte array that represents your ANSI file.
-
Re: Find string in range of Byte Array
And here's another little procedure I often use. It'll search for your strings without reading the file into memory. However, it does work only for ANSI strings in the file. I've just never (at least not yet) had a need to search a file for Unicode strings.
Code:
Option Explicit
Private Sub Form_Load()
Dim sFileSpec As String
Dim iFoundPos As Long
sFileSpec = "C:\Users\Elroy\Desktop\test.txt"
MsgBox BinaryFileSearch(sFileSpec, "asdf", False, 6, iFoundPos)
MsgBox iFoundPos
End Sub
Public Function BinaryFileSearch(sFileSpec As String, sSearchString As String, Optional bCaseSensitive As Boolean = True, _
Optional lStartPosition As Long = 1, Optional lFoundPosition As Long, _
Optional lFileHandleToUse As Long = 0) As Boolean
' Returns true if sSearchString is found, else false.
' sSearchString can be no longer than 128.
' This will work even if Word or Excel has the file open.
' The lFoundPosition is a return argument.
' It returns the latest position before lStartPosition (if there isn't one after lStartPosition) or
' it returns the earliest position after lStartPosition.
Dim iFle As Long
Dim FileData As String
Dim FilePointer As Long
Dim FileLength As Long
Dim sFind As String
Dim iPos As Long
'
If Len(sSearchString) > 128 Then
Err.Raise 1234
Exit Function
End If
'
If lFileHandleToUse = 0 Then
If Not bFileExists(sFileSpec) Then Exit Function
iFle = FreeFile
On Error Resume Next
Open sFileSpec For Binary As iFle
If Err <> 0 Then
Close iFle
On Error GoTo 0
Exit Function
End If
On Error GoTo 0
'
If Len(iFle) = 0 Then Close iFle: Exit Function
Else
iFle = lFileHandleToUse ' The file MUST be opened BINARY for this to work.
End If
'
If bCaseSensitive Then
sFind = sSearchString
Else
sFind = LCase$(sSearchString)
End If
FileData = Space(1024)
FileLength = LOF(iFle)
FilePointer = lStartPosition
Do
If FilePointer > FileLength Then Exit Do
Get iFle, FilePointer, FileData
If Not bCaseSensitive Then FileData = LCase$(FileData)
iPos = InStr(FileData, sFind)
If iPos <> 0 Then
lFoundPosition = FilePointer + iPos - 1
If lFoundPosition >= lStartPosition Then
BinaryFileSearch = True
Exit Do
End If
End If
FilePointer = ((FilePointer + 1024) - Len(sFind)) + 1
Loop
If lFileHandleToUse = 0 Then Close iFle
End Function
Public Function bFileExists(fle As String) As Boolean
On Error GoTo FileExistsError
' If no error then something existed.
bFileExists = (GetAttr(fle) And vbDirectory) = 0
Exit Function
FileExistsError:
bFileExists = False
Exit Function
End Function
Enjoy,
Elroy
-
Re: Find string in range of Byte Array
-
Re: Find string in range of Byte Array
To be honest, I forgot about this function.
As it turned out there were plenty of decisions. The problem is solved.
-
Re: [RESOLVED] Find string in range of Byte Array
@sergeos. When resolved, please mark your thread as such. Use the Thread Tools menu near top of your 1st post. Additionally, it would be helpful to indicate which method you used to solve the problem. This could help others that happen upon your thread for a similar problem.
Edited: Bad timing on my part, you resolved this while I was typing the post. ;)
-
Re: [RESOLVED] Find string in range of Byte Array
Hi xxdoc123,
Well, you certainly could use InstrB. In fact, you could pass the entire bb() Byte Array into InstrB. However, be sure to recognize that InstrB is going to make a copy of the Byte Array (and convert it to a string) when it's passed into InstrB. I don't know how large Sergeos's files will be, but InstrB will certainly double the memory requirements. If we're talking about relatively small files, I'll agree that InstrB is a good choice. In fact, it could be used in the function of my post #22 to make it work a bit better.
All The Best,
Elroy