Results 1 to 18 of 18

Thread: [RESOLVED] How to parse big files ?

  1. #1

    Thread Starter
    Hyperactive Member Daniel Duta's Avatar
    Join Date
    Feb 2011
    Location
    Bucharest, Romania
    Posts
    396

    Resolved [RESOLVED] How to parse big files ?

    I need to parse a big txt file (non csv) that has over 700 Mb and so far all my attempts failed with Out of memory error. Initially this file had a smaller size and the lines below worked very well but when this file, with the same structure, increased in volume the method used crashed. Many other approaches based on Line Input, Open file For Binary Access, FSO did not work as well. Any suggestion ?
    Code:
    Dim intFile As Integer, myArr As Variant, myFile As String, s As String
    Const delimit As String = "=========================================="
    
    myFile = app.Path & "\BigFile.txt"
    
    On Error GoTo errHandler
    
      intFile = FreeFile
    
          Open myFile For Input As #intFile
            s = Input(LOF(intFile), #intFile)
          Close #intFile
    
        myArr = Split(s, delimit)
    "VB code is practically pseudocode" - Tanner Helland
    "When you do things right, people won't be sure you've done anything at all" - Matt Groening
    "If you wait until you are ready, it is almost certainly too late" - Seth Godin
    "Believe nothing you hear, and only one half that you see" - Edgar Allan Poe

  2. #2
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    5,872

    Re: How to parse big files ?

    Why do you want to read it in one go?
    You first read the file in a string, then do a split on the same string
    The string needs 2 times the size of the text file (ANSI->UNICODE)
    Then again the same size for the string array.
    So for a 700MB text file you need 2*1400MB working (string) space...

    Just read the file in chunks

  3. #3

    Thread Starter
    Hyperactive Member Daniel Duta's Avatar
    Join Date
    Feb 2011
    Location
    Bucharest, Romania
    Posts
    396

    Re: How to parse big files ?

    The initial request was based on a smaller file (100.000 rows) and the approach above worked very well and fast. The reason to read it twice was to capture some specific details from blocks (delimited by delimit variable) that have different sizes. The time cost is minor and it was worth for a better control. Yes, it is feasible to use chunks but these are based on buffer size and not on my delimiter position (which is variable). I have to imagine somehow a link between 2 consecutive chunks so that to find the block delimiters and to parse the content by some rules.
    "VB code is practically pseudocode" - Tanner Helland
    "When you do things right, people won't be sure you've done anything at all" - Matt Groening
    "If you wait until you are ready, it is almost certainly too late" - Seth Godin
    "Believe nothing you hear, and only one half that you see" - Edgar Allan Poe

  4. #4
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    5,872

    Re: How to parse big files ?

    You can use my clsFile class.
    I use it for reading huge text files, either Unix or DOS based (vbLf or vbCrLf delimiter)
    It also deals with UTF8 files.

    I assume your text file still has linefeeds?
    Otherwise you need to replace the hardcode '10' and '13' for the linefeed detection by a variable.

    Code:
    '---------------------------------------------------------------------------------------
    ' Module    : clsFile
    ' DateTime  : 18-11-2005
    ' Author    : Original Will Barden
    ' Purpose   :
    ' Revision  : 12-3-2014, Unix/Dos mode, UTF8
    '           : 24-8-2016, Rewrite of pReadLine because possible ending of buffer in the middle of UTF8 character
    '---------------------------------------------------------------------------------------
    Option Explicit
    
    Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" (pDst As Any, pSrc As Any, ByVal ByteLen As Long)
    
    Private Declare Function MultiByteToWideChar Lib "kernel32.dll" ( _
                             ByVal CodePage As Long, _
                             ByVal dwFlags As Long, _
                             ByVal lpMultiByteStr As Long, _
                             ByVal cbMultiByte As Long, _
                             ByVal lpWideCharStr As Long, _
                             ByVal cchWideChar As Long) As Long
                             
    Private Const CP_UTF8 As Long = 65001
    
    Private m_iEOL As Integer
    Private m_lBufferSize As Long
    Private m_iID As Integer
    Private m_bFileOpen As Boolean
    
    Private m_lFileSize As Long
    Private m_lBytesRead As Long
    Private m_bEOF As Boolean
    Private m_bUTF8Header As Boolean
    
    Private m_lBlockPointer As Long
    Private m_lNofBlocks As Long
    Private m_lBlock As Long
    Private m_bNextLine As Boolean
    
    Public Property Get Progress() As Double
      If m_bFileOpen And m_lFileSize <> 0 Then
        Progress = 100# * m_lBytesRead / m_lFileSize
        If Progress > 100 Then Progress = 100
      End If
    End Property
    
    Public Property Get BytesRead() As Long
      BytesRead = m_lBytesRead
    End Property
    
    Public Property Get FileSize() As Long
      FileSize = m_lFileSize
    End Property
    
    '---------------------------------------------------------------------------------------
    ' Procedure : EndOfFile, DateTime  : 18-11-2005
    '---------------------------------------------------------------------------------------
    Public Property Get EndOfFile() As Boolean
      EndOfFile = m_bEOF Or Not m_bFileOpen
    End Property
    
    '---------------------------------------------------------------------------------------
    ' Procedure : CloseTextStream
    ' DateTime  : 18-11-2005
    ' Author    : Will Barden
    ' Purpose   :
    '---------------------------------------------------------------------------------------
    Public Sub CloseTextStream()
      If m_bFileOpen Then Close m_iID
      m_bFileOpen = False
      m_bNextLine = False
      m_bEOF = True
    End Sub
    
    '---------------------------------------------------------------------------------------
    ' Procedure : OpenTextStream,  DateTime  : 18-11-2005
    '---------------------------------------------------------------------------------------
    Public Function OpenTextStream(sFileName As String) As Boolean
      On Error GoTo errHandler
      
      If Not m_bFileOpen Then
        
        If Len(Dir$(sFileName)) > 0 Then
          OpenTextStream = pOpenFile(sFileName)
        End If
      
        If Not OpenTextStream Then
          Close #m_iID
          m_bFileOpen = False
        End If
      
      End If
      
      Exit Function
    
    errHandler:
    End Function
    
    '---------------------------------------------------------------------------------------
    ' Procedure : ReadLine,  DateTime  : 18-11-2005
    '---------------------------------------------------------------------------------------
    Public Function ReadLine() As String
      If m_bFileOpen Then
        ReadLine = pReadLine
        m_lBytesRead = m_lBytesRead + Len(ReadLine) + m_iEOL
      End If
    End Function
    
    '---------------------------------------------------------------------------------------
    ' Procedure : pOpenFile, DateTime  : 18-11-2005
    '---------------------------------------------------------------------------------------
    Private Function pOpenFile(sFileName As String) As Boolean
      On Error GoTo errHandler
      
      m_iID = FreeFile
      Open sFileName For Binary Access Read As m_iID Len = 32766
      m_lFileSize = LOF(m_iID)
      m_lNofBlocks = m_lFileSize \ m_lBufferSize
      m_lBlock = 0
      m_lBlockPointer = -1
      m_lBytesRead = 0
      m_iEOL = 1
      m_bEOF = False
      m_bFileOpen = True
      pOpenFile = True
      Exit Function
    
    errHandler:
    End Function
    
    Private Function pReadLine() As String
      Dim bData() As Byte
      
      If pReadLineBuffer(bData) Then
        pReadLine = pBytesToStringEx(bData)
      End If
    End Function
    
    '---------------------------------------------------------------------------------------
    ' Procedure : pReadLineBuffer, DateTime  : 23-8-2016
    '---------------------------------------------------------------------------------------
    Private Function pReadLineBuffer(bReturn() As Byte) As Boolean
      Dim bFound As Boolean
      Dim lUBound As Long, lStart As Long, lLen As Long
      Static bBuffer() As Byte
      
      ' Read new block of data
      If m_lBlockPointer = -1 Then
        If m_lBlock > m_lNofBlocks Then
          m_bEOF = True
          Exit Function
        End If
        bBuffer = pReadBlock(m_lBlock)
        m_lBlock = m_lBlock + 1
      End If
      
      ' Start of the new string
      lStart = m_lBlockPointer + 1
      
      lUBound = UBound(bBuffer)
      Do Until m_lBlockPointer = lUBound Or bFound
        m_lBlockPointer = m_lBlockPointer + 1
        ' End of Line: LineFeed ^10
        bFound = (bBuffer(m_lBlockPointer) = 10)
      Loop
      
      If bFound Then
        ' End of line, so start building string
        lLen = m_lBlockPointer - lStart
        
        ' DOS line ^13^10
        If m_lBlockPointer > 0 Then
          If bBuffer(m_lBlockPointer - 1) = 13 Then
            lLen = lLen - 1
            m_iEOL = 2
          End If
        End If
        
        If lLen > 0 Then
          ReDim bReturn(lLen - 1)
          CopyMemory bReturn(0), bBuffer(lStart), lLen
          pReadLineBuffer = True
        End If
      Else
        ' No EOL, Build 1st part using current buffer, second part from next buffer
        m_lBlockPointer = -1
        lLen = lUBound - lStart + 1
        
        ' Check if last character is 1st byte of DOS CrLf ^13^10 pair
        If lLen > 0 Then
          ' If so, remove
          If bBuffer(lStart + lLen - 1) = 13 Then lLen = lLen - 1
        End If
        
        If lLen > 0 Then
          Dim bNextBuffer() As Byte
          ReDim bReturn(lLen - 1)
          CopyMemory bReturn(0), bBuffer(lStart), lLen
          If pReadLineBuffer(bNextBuffer) Then pAppendBytes bReturn, bNextBuffer
          pReadLineBuffer = True
        Else
          pReadLineBuffer = pReadLineBuffer(bReturn)
        End If
      End If
      
    End Function
    
    Private Function pReadBlock(lBlock As Long) As Byte()
      Dim lLen As Long
      Dim bBuffer() As Byte
      
      If lBlock = m_lNofBlocks Then
        lLen = m_lFileSize Mod m_lBufferSize
        ReDim bBuffer(lLen - 1)
      Else
        ReDim bBuffer(m_lBufferSize - 1)
      End If
      
      Get #m_iID, , bBuffer
      
      pReadBlock = bBuffer
    End Function
    
    Private Sub pAppendBytes(bBytes() As Byte, bAppend() As Byte)
      Dim lAppendPos As Long
      
      lAppendPos = UBound(bBytes) + 1
      ReDim Preserve bBytes(lAppendPos + UBound(bAppend))
      CopyMemory bBytes(lAppendPos), bAppend(0), UBound(bAppend) + 1
      
    End Sub
    
    '---------------------------------------------------------------------------------------
    ' Procedure : pBytesToStringEx, DateTime  : 23-8-2016
    '---------------------------------------------------------------------------------------
    Private Function pBytesToStringEx(bArr() As Byte) As String
      Dim bBuffer() As Byte
      Dim lLength As Long
      
      If Not m_bNextLine Then
        m_bUTF8Header = pHasUTF8header(bArr)
        If m_bUTF8Header Then
          lLength = UBound(bArr) + 1
          ReDim bBuffer(lLength - 4)
          CopyMemory bBuffer(0), bArr(3), lLength - 3
          m_lBytesRead = m_lBytesRead + 3
          m_bUTF8Header = True
          bArr = bBuffer
        End If
        m_bNextLine = True
      End If
      
      If pContainsUTF8(bArr) Then
        pBytesToStringEx = pConvertFromUTF8(bArr)
      Else
        pBytesToStringEx = StrConv(bArr, vbUnicode)
      End If
    End Function
    
    Private Function pConvertFromUTF8(ByRef Source() As Byte) As String
      Dim Size As Long, Pointer As Long, Length As Long, Buffer As String
      
      Size = UBound(Source) - LBound(Source) + 1
      Pointer = VarPtr(Source(LBound(Source)))
      
      Length = MultiByteToWideChar(CP_UTF8, 0, Pointer, Size, 0, 0)
      Buffer = Space$(Length)
      MultiByteToWideChar CP_UTF8, 0, Pointer, Size, StrPtr(Buffer), Length
      pConvertFromUTF8 = Buffer
    End Function
    
    '---------------------------------------------------------------------------------------
    ' Procedure : pHasUTF8header
    ' DateTime  : 14-6-2013
    ' Author    : ArnoutV
    ' Purpose   :
    ' Reference : http://en.wikipedia.org/wiki/Byte_order_mark
    '---------------------------------------------------------------------------------------
    Private Function pHasUTF8header(Source() As Byte) As Boolean
      
      ' Header: EF, BB, BF
      If UBound(Source) >= 2 Then
        If Source(0) = &HEF Then
          If Source(1) = &HBB Then
            If Source(2) = &HBF Then
              pHasUTF8header = True
              Exit Function
            End If
          End If
        End If
      End If
    
    
    End Function
    
    '---------------------------------------------------------------------------------------
    ' Procedure : pContainsUTF8
    ' DateTime  : 23-10-2012
    ' Author    : ArnoutV
    ' Purpose   : Detect UTF8 characters in byte array
    '           : If byte1 >= "&C2" and byte1 <= "D3" and
    '           :    byte2 >= "&80" and byte2 <= "BF" then UTF8 encoded character
    ' Bron      : http://www.utf8-chartable.de/unicode-utf8-table.pl
    ' Revision  : 23-11-2012
    '           : 02-04-2014 - 3 bytes UTF8 : http://www.utf8-chartable.de/unicode-utf8-table.pl?start=2048
    '---------------------------------------------------------------------------------------
    Private Function pContainsUTF8(ByRef Source() As Byte) As Boolean
      Dim i As Long, lUBound As Long
      Dim hexC2 As Byte, hexDB As Byte, hexBF As Byte, hex80 As Byte, hexE0 As Byte
      Dim CurByte As Byte
      
      If m_bUTF8Header Then
        pContainsUTF8 = True
        Exit Function
      End If
      
      hexC2 = &HC2: hexDB = &HDB: hex80 = &H80: hexBF = &HBF: hexE0 = &HE0
      
      lUBound = UBound(Source)
      For i = 0 To lUBound - 1
        CurByte = Source(i)
        If CurByte >= hexC2 And CurByte <= hexDB Then
          If (Source(i + 1) And hex80) Then
            pContainsUTF8 = True
            Exit For
          End If
        End If
        If i + 2 <= lUBound Then
          If CurByte >= hexE0 Then
            If (Source(i + 1) And hex80) And (Source(i + 2) And hex80) Then
              pContainsUTF8 = True
              Exit For
            End If
          End If
        End If
      Next i
    End Function
    
    Private Sub Class_Initialize()
      m_lBufferSize = 102400
      m_bEOF = True
    End Sub
    
    Private Sub Class_Terminate()
      ' Just to be sure
      If m_bFileOpen Then Close m_iID
    End Sub

  5. #5
    PowerPoster wqweto's Avatar
    Join Date
    May 2011
    Location
    Sofia, Bulgaria
    Posts
    5,120

    Re: How to parse big files ?

    @Arnoutdv: JFYI, this line

    Open sFileName For Binary Access Read As m_iID

    . . . does not work on read-only media. It turns out For Binary always opens the underlying OS file handle for R/W no matter if Access Read is requested.

    Another pitfall with For Binary Access Read is that sFileName is created empty when the file is non-existant -- not something you would expect from Access Read option.

    cheers,
    </wqw>

  6. #6
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    5,872

    Re: How to parse big files ?

    Quote Originally Posted by wqweto View Post
    @Arnoutdv: JFYI, this line

    Open sFileName For Binary Access Read As m_iID

    . . . does not work on read-only media. It turns out For Binary always opens the underlying OS file handle for R/W no matter if Access Read is requested.

    Another pitfall with For Binary Access Read is that sFileName is created empty when the file is non-existant -- not something you would expect from Access Read option.

    cheers,
    </wqw>
    The original code dates from 2005 or before and I only added some functionality (UTF8) and some small bug fixed in the last years.
    There is however a check to see if the file exists before opening it.
    Code:
        If Len(Dir$(sFileName)) > 0 Then
          OpenTextStream = pOpenFile(sFileName)
        End If
    The need for "Access Read" is also unclear to me.
    I believe the Len has no effect either on binary files
    Code:
    Open sFileName For Binary Access Read As m_iID Len = 32766
    Should just be
    Code:
    Open sFileName For Binary As m_iID
    It also only works for files up to 2GB, I never got the time to rewrite it to work with >2GB files

  7. #7

    Thread Starter
    Hyperactive Member Daniel Duta's Avatar
    Join Date
    Feb 2011
    Location
    Bucharest, Romania
    Posts
    396

    Re: How to parse big files ?

    Thank you for reply, Arnout. I didn't understand very well the entire usage of the class but I was able to read the first line and I found out the file size (757517246) and total number of blocks (7397). I do not find inside any method related to how to read/parse these blocks. Could you please give me some help?
    "VB code is practically pseudocode" - Tanner Helland
    "When you do things right, people won't be sure you've done anything at all" - Matt Groening
    "If you wait until you are ready, it is almost certainly too late" - Seth Godin
    "Believe nothing you hear, and only one half that you see" - Edgar Allan Poe

  8. #8
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    5,872

    Re: How to parse big files ?

    I don't think I understand your request.

    You say the file is made of rows (lines in a text file I assume)
    But you also talk about blocks.
    What do you mean with blocks?
    And what do you need to do with each block of data?

  9. #9

    Thread Starter
    Hyperactive Member Daniel Duta's Avatar
    Join Date
    Feb 2011
    Location
    Bucharest, Romania
    Posts
    396

    Re: How to parse big files ?

    A block is, in your example, a chunk of a file that has a size defined by the buffer size. You call it m_lBlock explicitly. In my case a block is the data within 2 delimiters (see above "======") and this info has a specific meaning that I have to read it as if it had an unique ID (timestamp). Of course, either we take the first reference or the second a block/chunk has lines and some of these lines will be inserted in a database after the parsing process.
    "VB code is practically pseudocode" - Tanner Helland
    "When you do things right, people won't be sure you've done anything at all" - Matt Groening
    "If you wait until you are ready, it is almost certainly too late" - Seth Godin
    "Believe nothing you hear, and only one half that you see" - Edgar Allan Poe

  10. #10
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    5,872

    Re: How to parse big files ?

    No I think you misunderstood. This is just a general text file reader.

    A replacement for the following:
    Code:
    ' Classic VB File IO
    Open FileName For Input As #1
    Do While Not EOF(1)
      Line Input #1, sLine
    Loop 
    Close #1
    Code:
    Dim cFile As clsFile
    Set cFile = New clsFile
    
    ' File IO using cFile object
    If cFile.OpenTextStream(FileName) Then
      Do While Not cFile.EndOfFile
        sLine = cFile.ReadLine
      Loop
      cFile.CloseTextStream
    End If
    Is it correct that the file is a normal text file?
    The chunks you want to process are separated by "============================"
    Then just process the file line by line until you encounter the chunk separator

    Can you explain how you handle the blocks of data?
    Because your original blocks are still strings with newline separators
    Last edited by Arnoutdv; Aug 2nd, 2019 at 06:50 AM.

  11. #11
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    5,872

    Re: How to parse big files ?

    This could be a skeleton for what you want to do:
    Code:
    Option Explicit
    
    Private Const delimit As String = "=========================================="
    
    Public Sub ParseFile(sFileName As String)
      Dim sLine As String
      Dim sHeader As String
      Dim bDelimit As Boolean
      Dim cFile As clsFile
      
      Set cFile = New clsFile
      If cFile.OpenTextStream Then
        Do Until cFile.EndOfFile
          sHeader = cFile.ReadLine
          Do
            sLine = cFile.ReadLine
            bDelimit = (sLine = delimit)
            If Not bDelimit Then
              pUpdateDBsLine, sHeader
            End If
          Loop Until cFile.EndOfFile Or bDelimit
        Loop
      End If
    
    End Sub
    
    Private Sub pUpdateDB(sLine As String, sHeader As String)
      ' whatever you need to do
    End Sub

  12. #12

    Thread Starter
    Hyperactive Member Daniel Duta's Avatar
    Join Date
    Feb 2011
    Location
    Bucharest, Romania
    Posts
    396

    Re: How to parse big files ?

    Quote Originally Posted by Arnoutdv View Post
    No I think you misunderstood. This is just a general text file reader.

    A replacement for the following:
    Code:
    ' Classic VB File IO
    Open FileName For Input As #1
    Do While Not EOF(1)
      Line Input #1, sLine
    Loop 
    Close #1
    Code:
    Dim cFile As clsFile
    Set cFile = New clsFile
    
    ' File IO using cFile object
    If cFile.OpenTextStream(FileName) Then
      Do While Not cFile.EndOfFile
        sLine = cFile.ReadLine
      Loop
      cFile.CloseTextStream
    End If
    Is it correct that the file is a normal text file?
    The chunks you want to process are separated by "============================"
    Then just process the file line by line until you encounter the chunk separator

    Can you explain how you handle the blocks of data?
    Because your original blocks are still strings with newline separators
    Yes, it is a normal text file with row separators LF. My ''chunks'' are separated by "============================" (...I call them blocks because chunks are associated with space of a string and usually they have a predefined size). Now, I took your advice and I run the Do While - Loop above in my context just to check any possible Out of memory error. The result says that 21965012 lines were read (without other conditions) in 94 seconds. But considering that these blocks of data require themselves a secondary parsing and the results have be saved further as a csv format I expect to exceed 10 minutes... You are right, the schema of my parsing process has that logic. Thank you.
    "VB code is practically pseudocode" - Tanner Helland
    "When you do things right, people won't be sure you've done anything at all" - Matt Groening
    "If you wait until you are ready, it is almost certainly too late" - Seth Godin
    "Believe nothing you hear, and only one half that you see" - Edgar Allan Poe

  13. #13
    PowerPoster
    Join Date
    Jun 2013
    Posts
    7,219

    Re: How to parse big files ?

    Quote Originally Posted by Daniel Duta View Post
    The result says that 21965012 lines were read (without other conditions) in 94 seconds.
    If you want to gain some more speed, you could try to use the RC5-CSV-Parser in "pure LineMode"...

    LineMode-Parsing can be set, as shown in this Class (name it cLineParsing.cls):
    Code:
    Option Explicit
    
    Implements ICSVCallback
    
    Private CSV As cCSV, InsideBlock As Boolean
    
    Public Sub ParseLargeFile(FileName As String)
      Set CSV = New_c.CSV
          CSV.ColSepChar = vbNullChar '<- for LineMode, overwrite the default with a Chr(0)
          CSV.QuotChar = vbNullChar '<- for LineMode, overwrite the default with a Chr(0)
          CSV.ParseFile FileName, Me
    End Sub
    
    Private Function ICSVCallback_NewValue(ByVal RowNr As Long, ByVal ColNr As Long, B() As Byte, ByVal BValStartPos As Long, ByVal BValLen As Long) As Long
      Dim LineStartsWithBlockChar As Boolean
      If BValLen Then LineStartsWithBlockChar = (B(BValStartPos) = 61) '<- this test is cheap and fast, since it looks up the ByteArray directly for "="
      
      If Not InsideBlock Then 'we are outside block-delimiters
        
        If LineStartsWithBlockChar Then 'a new block starts
          InsideBlock = True 'we don't do anything on that line, other than switching the Private Variable
        
        Else 'handle the lines, which are outside (if needed)
          Debug.Print "Outside: " & CSV.GetStringValue(B, BValStartPos, BValLen)
        End If
        
      Else 'we are inside the block-delimiters
      
        If LineStartsWithBlockChar Then 'it can only be the end of a block
          InsideBlock = False 'we don't do anything on that line, other than switching the Private Variable
        
        Else 'handle the lines, which are inside the block
          Debug.Print "Inside: " & CSV.GetStringValue(B, BValStartPos, BValLen)
        End If
        
      End If
    End Function
    Form-Code for testing:
    Code:
    Option Explicit
    
    Private Sub Form_Load()
      With New_c.StringBuilder
        .AppendNL "Line 1 (outside block)"
        .AppendNL "Line 2 (outside block)"
        .AppendNL "======================"
        .AppendNL "  Line 3 (inside block)"
        .AppendNL "  Line 4 (inside block)"
        .AppendNL "======================"
        .AppendNL "Line 5 (outside block)"
        .Append   "Line 6 (outside block)"
        New_c.FSO.WriteByteContent "c:\temp\largefile.txt", .ToUTF8
      End With
      
      With New cLineParsing
        .ParseLargeFile "c:\temp\largefile.txt"
      End With
    End Sub
    If you want to measure performance on your huge file,
    make sure to comment out the Debug.Prints in the cLineParsing.cls...

    HTH

    Olaf

  14. #14

    Thread Starter
    Hyperactive Member Daniel Duta's Avatar
    Join Date
    Feb 2011
    Location
    Bucharest, Romania
    Posts
    396

    Re: How to parse big files ?

    Thank you Olaf for your solution. You have synthesized very well above what I intend to do (what is it possible to do) within blocks and on the other hand the speed indeed increased significantly (from 94 sec. to 3 sec., although the comparison is not done 1:1). Even I've had to adjust your code a bit I couldn't to append all rows using the AppendNL method because a run-time error Out of String-Memory occurred when StringBuilder accumulated 295481 rows ... Below is the code used. As you see, I had to include a dictionary so that to be able to check if some items (fields) does exist in each row parsed. Perhaps it was possible to get similar results using your ArrayList method but I didn't understand how could I use it in my context. Thank you.
    Code:
    Option Explicit
    
    Implements ICSVCallback
    
    Private SB As vbRichClient5.cStringBuilder
    Private CSV As vbRichClient5.cCSV
    Private Const delimit As String = "-------------------------------------------------------------------------------"
    Private rand As String, col() As String, InsideBlock As Boolean
    Private dict As Scripting.Dictionary
    
    Private Sub Form_Load()
        Dim filePath As String, i As Long
        ReDim col(1 To 33)
    Set dict = New Scripting.Dictionary
    
        col(1) = "Log Record ID:"
        col(2) = "Event Time:"
        col(3) = "Event Type:"
        '................................
        col(32) = "SystemType:"
        col(33) = "sourceType:"
        'Create a dictionary to check if an item exists.
        For i = 1 To UBound(col)
            dict.Add col(i), i
        Next i
    
        filePath = App.Path & "\hugeFile.txt"
        ParseLargeFile filePath
    
    End Sub
    
    Public Sub ParseLargeFile(FileName As String)
          New_c.Timing True
      Set SB = New_c.StringBuilder
      Set CSV = New_c.CSV
          CSV.ColSepChar = vbNullChar '<- for LineMode, overwrite the default with a Chr(0)
          CSV.QuotChar = vbNullChar   '<- for LineMode, overwrite the default with a Chr(0)
          CSV.ParseFile FileName, Me
          New_c.FSO.WriteByteContent App.Path & "\largefile.txt", SB.ToUTF8
      Debug.Print New_c.Timing(False)
    End Sub
    
    Private Function ICSVCallback_NewValue(ByVal RowNr As Long, ByVal ColNr As Long, b() As Byte, ByVal BValStartPos As Long, ByVal BValLen As Long) As Long
      Dim k As Long, chunk As String
    
         rand = CSV.GetStringValue(b, BValStartPos, BValLen)
    
        If rand = delimit Then
            InsideBlock = True
            SB.AppendNL Join(col, vbTab) 'append the header just at the first occurrence of delimiter
            ReDim col(1 To 33) 'make array empty
        Else
            If InsideBlock = True Then
                   k = InStr(1, rand, ":") 'note: ":" or "=" are mandatory for the rows I am interested in
                If k = 0 Then k = InStr(1, rand, "=")
              If k > 0 Then 'the right row format for parsing
                    chunk = Mid$(rand, 1, k)
                 If dict.Exists(chunk) Then 'chunk is a required field in database
                    col(dict(chunk)) = Trim(Mid$(rand, Len(chunk) + 1)) 'get data in the field order
                 End If
              End If
            End If
        End If
    End Function
    "VB code is practically pseudocode" - Tanner Helland
    "When you do things right, people won't be sure you've done anything at all" - Matt Groening
    "If you wait until you are ready, it is almost certainly too late" - Seth Godin
    "Believe nothing you hear, and only one half that you see" - Edgar Allan Poe

  15. #15
    PowerPoster
    Join Date
    Jun 2013
    Posts
    7,219

    Re: How to parse big files ?

    Quote Originally Posted by Daniel Duta View Post
    ... I couldn't to append all rows using the AppendNL method because a run-time error Out of String-Memory occurred when StringBuilder accumulated 295481 rows ...
    If something does not fit into memory, then you'll have to use the file-system.

    It's still worthwhile to buffer a certain amount of Line-Strings in the StringBuilder,
    though you'll have to make sure, to "releave it" by writing its content to the file occasionally
    (when a certain size has accumulated in the SB)...

    The approach, how to do that in conjunction with an RC5-FileStream-instance is shown in post #20 here:
    http://www.vbforums.com/showthread.p...=1#post5403219

    It boils down to these lines basically (inside the parsing-callback):
    Code:
     SB.AppendNL Join(col, vbTab)
     If SB.length > 1048576 Then 'write-out, when 1MB have accumulated
        FS.WriteFromByteArr SB.ToUTF8
        SB.Clear 'clear the StringBuilder for the next additions
     End If
    After exiting the parsing, you'll have to write-out the rest-bytes from the StringBuilder as well of course,
    with a final check like this:
    If SB.length Then FS.WriteFromByteArr SB.ToUTF8

    HTH

    Olaf

  16. #16
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: How to parse big files ?

    Why not insert into the database as you go? I'm not sure why you need to try sticking the entire mess into RAM for this.

  17. #17

    Thread Starter
    Hyperactive Member Daniel Duta's Avatar
    Join Date
    Feb 2011
    Location
    Bucharest, Romania
    Posts
    396

    Re: How to parse big files ?

    Quote Originally Posted by dilettante View Post
    Why not insert into the database as you go? I'm not sure why you need to try sticking the entire mess into RAM for this.
    Hi, Dile. Of course, it would be more efficient to move all content in a database directly but we work in a team with some specific task. Shortly, we have 3 layers of tasks inside our ETL process:
    1.Receiving, transformation and saving data in a basic format, csv or xls (we have to keep these inputs for a while and it is more advantageous to save them as structured data than in their original format; considering the example from this thread it means to save on server 105 Mb instead 730 Mb);
    2.Transferring these standard files in some databases considering some queries and projects (these processes are triggered by other colleagues who prepare data flows from previous point so that to provide both a certain format/structure and an internal consistency for layer 3 (data scientists);
    3.Capitalizing on the results; here are guys who use all data from layer 2 in many flows of web visualization, where customers find various statistics, charts, predictions etc.
    So, these kind of stuff are not personal projects but some work requirements and when tasks are divided in this way the responsibility for data integrity is assumed by each one specifically.
    "VB code is practically pseudocode" - Tanner Helland
    "When you do things right, people won't be sure you've done anything at all" - Matt Groening
    "If you wait until you are ready, it is almost certainly too late" - Seth Godin
    "Believe nothing you hear, and only one half that you see" - Edgar Allan Poe

  18. #18

    Thread Starter
    Hyperactive Member Daniel Duta's Avatar
    Join Date
    Feb 2011
    Location
    Bucharest, Romania
    Posts
    396

    Re: How to parse big files ?

    Quote Originally Posted by Schmidt View Post
    If something does not fit into memory, then you'll have to use the file-system.

    It's still worthwhile to buffer a certain amount of Line-Strings in the StringBuilder,
    though you'll have to make sure, to "releave it" by writing its content to the file occasionally
    (when a certain size has accumulated in the SB)...

    The approach, how to do that in conjunction with an RC5-FileStream-instance is shown in post #20 here:
    http://www.vbforums.com/showthread.p...=1#post5403219

    It boils down to these lines basically (inside the parsing-callback):
    Code:
     SB.AppendNL Join(col, vbTab)
     If SB.length > 1048576 Then 'write-out, when 1MB have accumulated
        FS.WriteFromByteArr SB.ToUTF8
        SB.Clear 'clear the StringBuilder for the next additions
     End If
    After exiting the parsing, you'll have to write-out the rest-bytes from the StringBuilder as well of course,
    with a final check like this:
    If SB.length Then FS.WriteFromByteArr SB.ToUTF8

    HTH

    Olaf
    I took all these suggestions transferring the StringBuilder content on disk at each 1Mb accumulated in SB (that in this context is a buffer). The results are beyond my initial expectations: total rows read by parser= ~22 million, total rows wrote on file (tab delimited columns) = 326207, total processing time = 17.4 second. As I tested so far, it is not very easy to find a competitive solution at these performances... Thank you very much, Olaf.
    "VB code is practically pseudocode" - Tanner Helland
    "When you do things right, people won't be sure you've done anything at all" - Matt Groening
    "If you wait until you are ready, it is almost certainly too late" - Seth Godin
    "Believe nothing you hear, and only one half that you see" - Edgar Allan Poe

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width