-
Aug 2nd, 2019, 02:20 AM
#1
Thread Starter
Hyperactive Member
[RESOLVED] How to parse big files ?
I need to parse a big txt file (non csv) that has over 700 Mb and so far all my attempts failed with Out of memory error. Initially this file had a smaller size and the lines below worked very well but when this file, with the same structure, increased in volume the method used crashed. Many other approaches based on Line Input, Open file For Binary Access, FSO did not work as well. Any suggestion ?
Code:
Dim intFile As Integer, myArr As Variant, myFile As String, s As String
Const delimit As String = "=========================================="
myFile = app.Path & "\BigFile.txt"
On Error GoTo errHandler
intFile = FreeFile
Open myFile For Input As #intFile
s = Input(LOF(intFile), #intFile)
Close #intFile
myArr = Split(s, delimit)
"VB code is practically pseudocode" - Tanner Helland
"When you do things right, people won't be sure you've done anything at all" - Matt Groening
"If you wait until you are ready, it is almost certainly too late" - Seth Godin
"Believe nothing you hear, and only one half that you see" - Edgar Allan Poe
-
Aug 2nd, 2019, 02:27 AM
#2
Re: How to parse big files ?
Why do you want to read it in one go?
You first read the file in a string, then do a split on the same string
The string needs 2 times the size of the text file (ANSI->UNICODE)
Then again the same size for the string array.
So for a 700MB text file you need 2*1400MB working (string) space...
Just read the file in chunks
-
Aug 2nd, 2019, 03:11 AM
#3
Thread Starter
Hyperactive Member
Re: How to parse big files ?
The initial request was based on a smaller file (100.000 rows) and the approach above worked very well and fast. The reason to read it twice was to capture some specific details from blocks (delimited by delimit variable) that have different sizes. The time cost is minor and it was worth for a better control. Yes, it is feasible to use chunks but these are based on buffer size and not on my delimiter position (which is variable). I have to imagine somehow a link between 2 consecutive chunks so that to find the block delimiters and to parse the content by some rules.
"VB code is practically pseudocode" - Tanner Helland
"When you do things right, people won't be sure you've done anything at all" - Matt Groening
"If you wait until you are ready, it is almost certainly too late" - Seth Godin
"Believe nothing you hear, and only one half that you see" - Edgar Allan Poe
-
Aug 2nd, 2019, 03:47 AM
#4
Re: How to parse big files ?
You can use my clsFile class.
I use it for reading huge text files, either Unix or DOS based (vbLf or vbCrLf delimiter)
It also deals with UTF8 files.
I assume your text file still has linefeeds?
Otherwise you need to replace the hardcode '10' and '13' for the linefeed detection by a variable.
Code:
'---------------------------------------------------------------------------------------
' Module : clsFile
' DateTime : 18-11-2005
' Author : Original Will Barden
' Purpose :
' Revision : 12-3-2014, Unix/Dos mode, UTF8
' : 24-8-2016, Rewrite of pReadLine because possible ending of buffer in the middle of UTF8 character
'---------------------------------------------------------------------------------------
Option Explicit
Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" (pDst As Any, pSrc As Any, ByVal ByteLen As Long)
Private Declare Function MultiByteToWideChar Lib "kernel32.dll" ( _
ByVal CodePage As Long, _
ByVal dwFlags As Long, _
ByVal lpMultiByteStr As Long, _
ByVal cbMultiByte As Long, _
ByVal lpWideCharStr As Long, _
ByVal cchWideChar As Long) As Long
Private Const CP_UTF8 As Long = 65001
Private m_iEOL As Integer
Private m_lBufferSize As Long
Private m_iID As Integer
Private m_bFileOpen As Boolean
Private m_lFileSize As Long
Private m_lBytesRead As Long
Private m_bEOF As Boolean
Private m_bUTF8Header As Boolean
Private m_lBlockPointer As Long
Private m_lNofBlocks As Long
Private m_lBlock As Long
Private m_bNextLine As Boolean
Public Property Get Progress() As Double
If m_bFileOpen And m_lFileSize <> 0 Then
Progress = 100# * m_lBytesRead / m_lFileSize
If Progress > 100 Then Progress = 100
End If
End Property
Public Property Get BytesRead() As Long
BytesRead = m_lBytesRead
End Property
Public Property Get FileSize() As Long
FileSize = m_lFileSize
End Property
'---------------------------------------------------------------------------------------
' Procedure : EndOfFile, DateTime : 18-11-2005
'---------------------------------------------------------------------------------------
Public Property Get EndOfFile() As Boolean
EndOfFile = m_bEOF Or Not m_bFileOpen
End Property
'---------------------------------------------------------------------------------------
' Procedure : CloseTextStream
' DateTime : 18-11-2005
' Author : Will Barden
' Purpose :
'---------------------------------------------------------------------------------------
Public Sub CloseTextStream()
If m_bFileOpen Then Close m_iID
m_bFileOpen = False
m_bNextLine = False
m_bEOF = True
End Sub
'---------------------------------------------------------------------------------------
' Procedure : OpenTextStream, DateTime : 18-11-2005
'---------------------------------------------------------------------------------------
Public Function OpenTextStream(sFileName As String) As Boolean
On Error GoTo errHandler
If Not m_bFileOpen Then
If Len(Dir$(sFileName)) > 0 Then
OpenTextStream = pOpenFile(sFileName)
End If
If Not OpenTextStream Then
Close #m_iID
m_bFileOpen = False
End If
End If
Exit Function
errHandler:
End Function
'---------------------------------------------------------------------------------------
' Procedure : ReadLine, DateTime : 18-11-2005
'---------------------------------------------------------------------------------------
Public Function ReadLine() As String
If m_bFileOpen Then
ReadLine = pReadLine
m_lBytesRead = m_lBytesRead + Len(ReadLine) + m_iEOL
End If
End Function
'---------------------------------------------------------------------------------------
' Procedure : pOpenFile, DateTime : 18-11-2005
'---------------------------------------------------------------------------------------
Private Function pOpenFile(sFileName As String) As Boolean
On Error GoTo errHandler
m_iID = FreeFile
Open sFileName For Binary Access Read As m_iID Len = 32766
m_lFileSize = LOF(m_iID)
m_lNofBlocks = m_lFileSize \ m_lBufferSize
m_lBlock = 0
m_lBlockPointer = -1
m_lBytesRead = 0
m_iEOL = 1
m_bEOF = False
m_bFileOpen = True
pOpenFile = True
Exit Function
errHandler:
End Function
Private Function pReadLine() As String
Dim bData() As Byte
If pReadLineBuffer(bData) Then
pReadLine = pBytesToStringEx(bData)
End If
End Function
'---------------------------------------------------------------------------------------
' Procedure : pReadLineBuffer, DateTime : 23-8-2016
'---------------------------------------------------------------------------------------
Private Function pReadLineBuffer(bReturn() As Byte) As Boolean
Dim bFound As Boolean
Dim lUBound As Long, lStart As Long, lLen As Long
Static bBuffer() As Byte
' Read new block of data
If m_lBlockPointer = -1 Then
If m_lBlock > m_lNofBlocks Then
m_bEOF = True
Exit Function
End If
bBuffer = pReadBlock(m_lBlock)
m_lBlock = m_lBlock + 1
End If
' Start of the new string
lStart = m_lBlockPointer + 1
lUBound = UBound(bBuffer)
Do Until m_lBlockPointer = lUBound Or bFound
m_lBlockPointer = m_lBlockPointer + 1
' End of Line: LineFeed ^10
bFound = (bBuffer(m_lBlockPointer) = 10)
Loop
If bFound Then
' End of line, so start building string
lLen = m_lBlockPointer - lStart
' DOS line ^13^10
If m_lBlockPointer > 0 Then
If bBuffer(m_lBlockPointer - 1) = 13 Then
lLen = lLen - 1
m_iEOL = 2
End If
End If
If lLen > 0 Then
ReDim bReturn(lLen - 1)
CopyMemory bReturn(0), bBuffer(lStart), lLen
pReadLineBuffer = True
End If
Else
' No EOL, Build 1st part using current buffer, second part from next buffer
m_lBlockPointer = -1
lLen = lUBound - lStart + 1
' Check if last character is 1st byte of DOS CrLf ^13^10 pair
If lLen > 0 Then
' If so, remove
If bBuffer(lStart + lLen - 1) = 13 Then lLen = lLen - 1
End If
If lLen > 0 Then
Dim bNextBuffer() As Byte
ReDim bReturn(lLen - 1)
CopyMemory bReturn(0), bBuffer(lStart), lLen
If pReadLineBuffer(bNextBuffer) Then pAppendBytes bReturn, bNextBuffer
pReadLineBuffer = True
Else
pReadLineBuffer = pReadLineBuffer(bReturn)
End If
End If
End Function
Private Function pReadBlock(lBlock As Long) As Byte()
Dim lLen As Long
Dim bBuffer() As Byte
If lBlock = m_lNofBlocks Then
lLen = m_lFileSize Mod m_lBufferSize
ReDim bBuffer(lLen - 1)
Else
ReDim bBuffer(m_lBufferSize - 1)
End If
Get #m_iID, , bBuffer
pReadBlock = bBuffer
End Function
Private Sub pAppendBytes(bBytes() As Byte, bAppend() As Byte)
Dim lAppendPos As Long
lAppendPos = UBound(bBytes) + 1
ReDim Preserve bBytes(lAppendPos + UBound(bAppend))
CopyMemory bBytes(lAppendPos), bAppend(0), UBound(bAppend) + 1
End Sub
'---------------------------------------------------------------------------------------
' Procedure : pBytesToStringEx, DateTime : 23-8-2016
'---------------------------------------------------------------------------------------
Private Function pBytesToStringEx(bArr() As Byte) As String
Dim bBuffer() As Byte
Dim lLength As Long
If Not m_bNextLine Then
m_bUTF8Header = pHasUTF8header(bArr)
If m_bUTF8Header Then
lLength = UBound(bArr) + 1
ReDim bBuffer(lLength - 4)
CopyMemory bBuffer(0), bArr(3), lLength - 3
m_lBytesRead = m_lBytesRead + 3
m_bUTF8Header = True
bArr = bBuffer
End If
m_bNextLine = True
End If
If pContainsUTF8(bArr) Then
pBytesToStringEx = pConvertFromUTF8(bArr)
Else
pBytesToStringEx = StrConv(bArr, vbUnicode)
End If
End Function
Private Function pConvertFromUTF8(ByRef Source() As Byte) As String
Dim Size As Long, Pointer As Long, Length As Long, Buffer As String
Size = UBound(Source) - LBound(Source) + 1
Pointer = VarPtr(Source(LBound(Source)))
Length = MultiByteToWideChar(CP_UTF8, 0, Pointer, Size, 0, 0)
Buffer = Space$(Length)
MultiByteToWideChar CP_UTF8, 0, Pointer, Size, StrPtr(Buffer), Length
pConvertFromUTF8 = Buffer
End Function
'---------------------------------------------------------------------------------------
' Procedure : pHasUTF8header
' DateTime : 14-6-2013
' Author : ArnoutV
' Purpose :
' Reference : http://en.wikipedia.org/wiki/Byte_order_mark
'---------------------------------------------------------------------------------------
Private Function pHasUTF8header(Source() As Byte) As Boolean
' Header: EF, BB, BF
If UBound(Source) >= 2 Then
If Source(0) = &HEF Then
If Source(1) = &HBB Then
If Source(2) = &HBF Then
pHasUTF8header = True
Exit Function
End If
End If
End If
End If
End Function
'---------------------------------------------------------------------------------------
' Procedure : pContainsUTF8
' DateTime : 23-10-2012
' Author : ArnoutV
' Purpose : Detect UTF8 characters in byte array
' : If byte1 >= "&C2" and byte1 <= "D3" and
' : byte2 >= "&80" and byte2 <= "BF" then UTF8 encoded character
' Bron : http://www.utf8-chartable.de/unicode-utf8-table.pl
' Revision : 23-11-2012
' : 02-04-2014 - 3 bytes UTF8 : http://www.utf8-chartable.de/unicode-utf8-table.pl?start=2048
'---------------------------------------------------------------------------------------
Private Function pContainsUTF8(ByRef Source() As Byte) As Boolean
Dim i As Long, lUBound As Long
Dim hexC2 As Byte, hexDB As Byte, hexBF As Byte, hex80 As Byte, hexE0 As Byte
Dim CurByte As Byte
If m_bUTF8Header Then
pContainsUTF8 = True
Exit Function
End If
hexC2 = &HC2: hexDB = &HDB: hex80 = &H80: hexBF = &HBF: hexE0 = &HE0
lUBound = UBound(Source)
For i = 0 To lUBound - 1
CurByte = Source(i)
If CurByte >= hexC2 And CurByte <= hexDB Then
If (Source(i + 1) And hex80) Then
pContainsUTF8 = True
Exit For
End If
End If
If i + 2 <= lUBound Then
If CurByte >= hexE0 Then
If (Source(i + 1) And hex80) And (Source(i + 2) And hex80) Then
pContainsUTF8 = True
Exit For
End If
End If
End If
Next i
End Function
Private Sub Class_Initialize()
m_lBufferSize = 102400
m_bEOF = True
End Sub
Private Sub Class_Terminate()
' Just to be sure
If m_bFileOpen Then Close m_iID
End Sub
-
Aug 2nd, 2019, 03:54 AM
#5
Re: How to parse big files ?
@Arnoutdv: JFYI, this line
Open sFileName For Binary Access Read As m_iID
. . . does not work on read-only media. It turns out For Binary always opens the underlying OS file handle for R/W no matter if Access Read is requested.
Another pitfall with For Binary Access Read is that sFileName is created empty when the file is non-existant -- not something you would expect from Access Read option.
cheers,
</wqw>
-
Aug 2nd, 2019, 04:01 AM
#6
Re: How to parse big files ?
Originally Posted by wqweto
@Arnoutdv: JFYI, this line
Open sFileName For Binary Access Read As m_iID
. . . does not work on read-only media. It turns out For Binary always opens the underlying OS file handle for R/W no matter if Access Read is requested.
Another pitfall with For Binary Access Read is that sFileName is created empty when the file is non-existant -- not something you would expect from Access Read option.
cheers,
</wqw>
The original code dates from 2005 or before and I only added some functionality (UTF8) and some small bug fixed in the last years.
There is however a check to see if the file exists before opening it.
Code:
If Len(Dir$(sFileName)) > 0 Then
OpenTextStream = pOpenFile(sFileName)
End If
The need for "Access Read" is also unclear to me.
I believe the Len has no effect either on binary files
Code:
Open sFileName For Binary Access Read As m_iID Len = 32766
Should just be
Code:
Open sFileName For Binary As m_iID
It also only works for files up to 2GB, I never got the time to rewrite it to work with >2GB files
-
Aug 2nd, 2019, 04:57 AM
#7
Thread Starter
Hyperactive Member
Re: How to parse big files ?
Thank you for reply, Arnout. I didn't understand very well the entire usage of the class but I was able to read the first line and I found out the file size (757517246) and total number of blocks (7397). I do not find inside any method related to how to read/parse these blocks. Could you please give me some help?
"VB code is practically pseudocode" - Tanner Helland
"When you do things right, people won't be sure you've done anything at all" - Matt Groening
"If you wait until you are ready, it is almost certainly too late" - Seth Godin
"Believe nothing you hear, and only one half that you see" - Edgar Allan Poe
-
Aug 2nd, 2019, 05:03 AM
#8
Re: How to parse big files ?
I don't think I understand your request.
You say the file is made of rows (lines in a text file I assume)
But you also talk about blocks.
What do you mean with blocks?
And what do you need to do with each block of data?
-
Aug 2nd, 2019, 06:42 AM
#9
Thread Starter
Hyperactive Member
Re: How to parse big files ?
A block is, in your example, a chunk of a file that has a size defined by the buffer size. You call it m_lBlock explicitly. In my case a block is the data within 2 delimiters (see above "======") and this info has a specific meaning that I have to read it as if it had an unique ID (timestamp). Of course, either we take the first reference or the second a block/chunk has lines and some of these lines will be inserted in a database after the parsing process.
"VB code is practically pseudocode" - Tanner Helland
"When you do things right, people won't be sure you've done anything at all" - Matt Groening
"If you wait until you are ready, it is almost certainly too late" - Seth Godin
"Believe nothing you hear, and only one half that you see" - Edgar Allan Poe
-
Aug 2nd, 2019, 06:47 AM
#10
Re: How to parse big files ?
No I think you misunderstood. This is just a general text file reader.
A replacement for the following:
Code:
' Classic VB File IO
Open FileName For Input As #1
Do While Not EOF(1)
Line Input #1, sLine
Loop
Close #1
Code:
Dim cFile As clsFile
Set cFile = New clsFile
' File IO using cFile object
If cFile.OpenTextStream(FileName) Then
Do While Not cFile.EndOfFile
sLine = cFile.ReadLine
Loop
cFile.CloseTextStream
End If
Is it correct that the file is a normal text file?
The chunks you want to process are separated by "============================"
Then just process the file line by line until you encounter the chunk separator
Can you explain how you handle the blocks of data?
Because your original blocks are still strings with newline separators
Last edited by Arnoutdv; Aug 2nd, 2019 at 06:50 AM.
-
Aug 2nd, 2019, 06:56 AM
#11
Re: How to parse big files ?
This could be a skeleton for what you want to do:
Code:
Option Explicit
Private Const delimit As String = "=========================================="
Public Sub ParseFile(sFileName As String)
Dim sLine As String
Dim sHeader As String
Dim bDelimit As Boolean
Dim cFile As clsFile
Set cFile = New clsFile
If cFile.OpenTextStream Then
Do Until cFile.EndOfFile
sHeader = cFile.ReadLine
Do
sLine = cFile.ReadLine
bDelimit = (sLine = delimit)
If Not bDelimit Then
pUpdateDBsLine, sHeader
End If
Loop Until cFile.EndOfFile Or bDelimit
Loop
End If
End Sub
Private Sub pUpdateDB(sLine As String, sHeader As String)
' whatever you need to do
End Sub
-
Aug 2nd, 2019, 08:00 AM
#12
Thread Starter
Hyperactive Member
Re: How to parse big files ?
Originally Posted by Arnoutdv
No I think you misunderstood. This is just a general text file reader.
A replacement for the following:
Code:
' Classic VB File IO
Open FileName For Input As #1
Do While Not EOF(1)
Line Input #1, sLine
Loop
Close #1
Code:
Dim cFile As clsFile
Set cFile = New clsFile
' File IO using cFile object
If cFile.OpenTextStream(FileName) Then
Do While Not cFile.EndOfFile
sLine = cFile.ReadLine
Loop
cFile.CloseTextStream
End If
Is it correct that the file is a normal text file?
The chunks you want to process are separated by "============================"
Then just process the file line by line until you encounter the chunk separator
Can you explain how you handle the blocks of data?
Because your original blocks are still strings with newline separators
Yes, it is a normal text file with row separators LF. My ''chunks'' are separated by "============================" (...I call them blocks because chunks are associated with space of a string and usually they have a predefined size). Now, I took your advice and I run the Do While - Loop above in my context just to check any possible Out of memory error. The result says that 21965012 lines were read (without other conditions) in 94 seconds. But considering that these blocks of data require themselves a secondary parsing and the results have be saved further as a csv format I expect to exceed 10 minutes... You are right, the schema of my parsing process has that logic. Thank you.
"VB code is practically pseudocode" - Tanner Helland
"When you do things right, people won't be sure you've done anything at all" - Matt Groening
"If you wait until you are ready, it is almost certainly too late" - Seth Godin
"Believe nothing you hear, and only one half that you see" - Edgar Allan Poe
-
Aug 2nd, 2019, 01:07 PM
#13
Re: How to parse big files ?
Originally Posted by Daniel Duta
The result says that 21965012 lines were read (without other conditions) in 94 seconds.
If you want to gain some more speed, you could try to use the RC5-CSV-Parser in "pure LineMode"...
LineMode-Parsing can be set, as shown in this Class (name it cLineParsing.cls):
Code:
Option Explicit
Implements ICSVCallback
Private CSV As cCSV, InsideBlock As Boolean
Public Sub ParseLargeFile(FileName As String)
Set CSV = New_c.CSV
CSV.ColSepChar = vbNullChar '<- for LineMode, overwrite the default with a Chr(0)
CSV.QuotChar = vbNullChar '<- for LineMode, overwrite the default with a Chr(0)
CSV.ParseFile FileName, Me
End Sub
Private Function ICSVCallback_NewValue(ByVal RowNr As Long, ByVal ColNr As Long, B() As Byte, ByVal BValStartPos As Long, ByVal BValLen As Long) As Long
Dim LineStartsWithBlockChar As Boolean
If BValLen Then LineStartsWithBlockChar = (B(BValStartPos) = 61) '<- this test is cheap and fast, since it looks up the ByteArray directly for "="
If Not InsideBlock Then 'we are outside block-delimiters
If LineStartsWithBlockChar Then 'a new block starts
InsideBlock = True 'we don't do anything on that line, other than switching the Private Variable
Else 'handle the lines, which are outside (if needed)
Debug.Print "Outside: " & CSV.GetStringValue(B, BValStartPos, BValLen)
End If
Else 'we are inside the block-delimiters
If LineStartsWithBlockChar Then 'it can only be the end of a block
InsideBlock = False 'we don't do anything on that line, other than switching the Private Variable
Else 'handle the lines, which are inside the block
Debug.Print "Inside: " & CSV.GetStringValue(B, BValStartPos, BValLen)
End If
End If
End Function
Form-Code for testing:
Code:
Option Explicit
Private Sub Form_Load()
With New_c.StringBuilder
.AppendNL "Line 1 (outside block)"
.AppendNL "Line 2 (outside block)"
.AppendNL "======================"
.AppendNL " Line 3 (inside block)"
.AppendNL " Line 4 (inside block)"
.AppendNL "======================"
.AppendNL "Line 5 (outside block)"
.Append "Line 6 (outside block)"
New_c.FSO.WriteByteContent "c:\temp\largefile.txt", .ToUTF8
End With
With New cLineParsing
.ParseLargeFile "c:\temp\largefile.txt"
End With
End Sub
If you want to measure performance on your huge file,
make sure to comment out the Debug.Prints in the cLineParsing.cls...
HTH
Olaf
-
Aug 4th, 2019, 11:03 AM
#14
Thread Starter
Hyperactive Member
Re: How to parse big files ?
Thank you Olaf for your solution. You have synthesized very well above what I intend to do (what is it possible to do) within blocks and on the other hand the speed indeed increased significantly (from 94 sec. to 3 sec., although the comparison is not done 1:1). Even I've had to adjust your code a bit I couldn't to append all rows using the AppendNL method because a run-time error Out of String-Memory occurred when StringBuilder accumulated 295481 rows ... Below is the code used. As you see, I had to include a dictionary so that to be able to check if some items (fields) does exist in each row parsed. Perhaps it was possible to get similar results using your ArrayList method but I didn't understand how could I use it in my context. Thank you.
Code:
Option Explicit
Implements ICSVCallback
Private SB As vbRichClient5.cStringBuilder
Private CSV As vbRichClient5.cCSV
Private Const delimit As String = "-------------------------------------------------------------------------------"
Private rand As String, col() As String, InsideBlock As Boolean
Private dict As Scripting.Dictionary
Private Sub Form_Load()
Dim filePath As String, i As Long
ReDim col(1 To 33)
Set dict = New Scripting.Dictionary
col(1) = "Log Record ID:"
col(2) = "Event Time:"
col(3) = "Event Type:"
'................................
col(32) = "SystemType:"
col(33) = "sourceType:"
'Create a dictionary to check if an item exists.
For i = 1 To UBound(col)
dict.Add col(i), i
Next i
filePath = App.Path & "\hugeFile.txt"
ParseLargeFile filePath
End Sub
Public Sub ParseLargeFile(FileName As String)
New_c.Timing True
Set SB = New_c.StringBuilder
Set CSV = New_c.CSV
CSV.ColSepChar = vbNullChar '<- for LineMode, overwrite the default with a Chr(0)
CSV.QuotChar = vbNullChar '<- for LineMode, overwrite the default with a Chr(0)
CSV.ParseFile FileName, Me
New_c.FSO.WriteByteContent App.Path & "\largefile.txt", SB.ToUTF8
Debug.Print New_c.Timing(False)
End Sub
Private Function ICSVCallback_NewValue(ByVal RowNr As Long, ByVal ColNr As Long, b() As Byte, ByVal BValStartPos As Long, ByVal BValLen As Long) As Long
Dim k As Long, chunk As String
rand = CSV.GetStringValue(b, BValStartPos, BValLen)
If rand = delimit Then
InsideBlock = True
SB.AppendNL Join(col, vbTab) 'append the header just at the first occurrence of delimiter
ReDim col(1 To 33) 'make array empty
Else
If InsideBlock = True Then
k = InStr(1, rand, ":") 'note: ":" or "=" are mandatory for the rows I am interested in
If k = 0 Then k = InStr(1, rand, "=")
If k > 0 Then 'the right row format for parsing
chunk = Mid$(rand, 1, k)
If dict.Exists(chunk) Then 'chunk is a required field in database
col(dict(chunk)) = Trim(Mid$(rand, Len(chunk) + 1)) 'get data in the field order
End If
End If
End If
End If
End Function
"VB code is practically pseudocode" - Tanner Helland
"When you do things right, people won't be sure you've done anything at all" - Matt Groening
"If you wait until you are ready, it is almost certainly too late" - Seth Godin
"Believe nothing you hear, and only one half that you see" - Edgar Allan Poe
-
Aug 4th, 2019, 11:44 AM
#15
Re: How to parse big files ?
Originally Posted by Daniel Duta
... I couldn't to append all rows using the AppendNL method because a run-time error Out of String-Memory occurred when StringBuilder accumulated 295481 rows ...
If something does not fit into memory, then you'll have to use the file-system.
It's still worthwhile to buffer a certain amount of Line-Strings in the StringBuilder,
though you'll have to make sure, to "releave it" by writing its content to the file occasionally
(when a certain size has accumulated in the SB)...
The approach, how to do that in conjunction with an RC5-FileStream-instance is shown in post #20 here:
http://www.vbforums.com/showthread.p...=1#post5403219
It boils down to these lines basically (inside the parsing-callback):
Code:
SB.AppendNL Join(col, vbTab)
If SB.length > 1048576 Then 'write-out, when 1MB have accumulated
FS.WriteFromByteArr SB.ToUTF8
SB.Clear 'clear the StringBuilder for the next additions
End If
After exiting the parsing, you'll have to write-out the rest-bytes from the StringBuilder as well of course,
with a final check like this:
If SB.length Then FS.WriteFromByteArr SB.ToUTF8
HTH
Olaf
-
Aug 4th, 2019, 12:53 PM
#16
Re: How to parse big files ?
Why not insert into the database as you go? I'm not sure why you need to try sticking the entire mess into RAM for this.
-
Aug 4th, 2019, 05:24 PM
#17
Thread Starter
Hyperactive Member
Re: How to parse big files ?
Originally Posted by dilettante
Why not insert into the database as you go? I'm not sure why you need to try sticking the entire mess into RAM for this.
Hi, Dile. Of course, it would be more efficient to move all content in a database directly but we work in a team with some specific task. Shortly, we have 3 layers of tasks inside our ETL process:
1.Receiving, transformation and saving data in a basic format, csv or xls (we have to keep these inputs for a while and it is more advantageous to save them as structured data than in their original format; considering the example from this thread it means to save on server 105 Mb instead 730 Mb);
2.Transferring these standard files in some databases considering some queries and projects (these processes are triggered by other colleagues who prepare data flows from previous point so that to provide both a certain format/structure and an internal consistency for layer 3 (data scientists);
3.Capitalizing on the results; here are guys who use all data from layer 2 in many flows of web visualization, where customers find various statistics, charts, predictions etc.
So, these kind of stuff are not personal projects but some work requirements and when tasks are divided in this way the responsibility for data integrity is assumed by each one specifically.
"VB code is practically pseudocode" - Tanner Helland
"When you do things right, people won't be sure you've done anything at all" - Matt Groening
"If you wait until you are ready, it is almost certainly too late" - Seth Godin
"Believe nothing you hear, and only one half that you see" - Edgar Allan Poe
-
Aug 5th, 2019, 12:32 AM
#18
Thread Starter
Hyperactive Member
Re: How to parse big files ?
Originally Posted by Schmidt
If something does not fit into memory, then you'll have to use the file-system.
It's still worthwhile to buffer a certain amount of Line-Strings in the StringBuilder,
though you'll have to make sure, to "releave it" by writing its content to the file occasionally
(when a certain size has accumulated in the SB)...
The approach, how to do that in conjunction with an RC5-FileStream-instance is shown in post #20 here:
http://www.vbforums.com/showthread.p...=1#post5403219
It boils down to these lines basically (inside the parsing-callback):
Code:
SB.AppendNL Join(col, vbTab)
If SB.length > 1048576 Then 'write-out, when 1MB have accumulated
FS.WriteFromByteArr SB.ToUTF8
SB.Clear 'clear the StringBuilder for the next additions
End If
After exiting the parsing, you'll have to write-out the rest-bytes from the StringBuilder as well of course,
with a final check like this:
If SB.length Then FS.WriteFromByteArr SB.ToUTF8
HTH
Olaf
I took all these suggestions transferring the StringBuilder content on disk at each 1Mb accumulated in SB (that in this context is a buffer). The results are beyond my initial expectations: total rows read by parser= ~22 million, total rows wrote on file (tab delimited columns) = 326207, total processing time = 17.4 second. As I tested so far, it is not very easy to find a competitive solution at these performances... Thank you very much, Olaf.
"VB code is practically pseudocode" - Tanner Helland
"When you do things right, people won't be sure you've done anything at all" - Matt Groening
"If you wait until you are ready, it is almost certainly too late" - Seth Godin
"Believe nothing you hear, and only one half that you see" - Edgar Allan Poe
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|