Thread: Reading a text file, as Binary, Line by Line ?

  1. #1

    Thread Starter
    Hyperactive Member
    Join Date
    Oct 2013
    Posts
    389

    Reading a text file, as Binary, Line by Line ?

    Need some direction here,
    i'm trying to read a txt file, as binary, and split each line's content into a string.

    Currently the code i use is:
    Code:
    ...
    
    Open FilePath For Binary As #ff
        
            Do While Not (EOF(ff))
            
                strLine = strLine & InputB(1, #ff)
    
            Loop
                
    Close #ff

    I would like to split it into lines so that I can reuse a long function I already wrote for text reading,
    which relies on Line Input, without working too hard to Split() and parse the complete file string (strLine) from scratch.

    Code:
    ...
    Open path For Input As #fileNo
    
            Line Input #fileNo, tempString
                Label1.Caption = tempString
    
            Line Input #fileNo, tempString
                Label2.Caption = tempString
            
            ....
    
    Close #fileNo
    As seen in the first snippet above, each iteration of my loop reads only 1 byte.

    I see two options here,
    either splitting the complete text by vbCrLf, or finding the byte sequence for vbCrLf and parsing each line with it.

    Anyone got a better idea ?
    Last edited by stum; Oct 7th, 2014 at 04:04 AM.

  2. #2
    Frenzied Member
    Join Date
    May 2014
    Location
    Kallithea Attikis, Greece
    Posts
    1,289

    Re: Reading a text file, as Binary, Line by Line ?

    To get a Unicode or ASCII text file into a string, and make vbCrLf the default paragraph break, use this function:
    Code:
    Public Function ReadUnicodeOrAscii(ByVal f$) As String
    If f$ = "" Then Exit Function
    Dim W As Long, i As Long, buf$, mw As Long, maxmw As Long, buf2$
    maxmw = 32000  'check it for maxmw=200
    Dim a() As Byte
    
    W = FreeFile
    On Error Resume Next
    Err.clear
    mw = FileLen(f$)
    If Err.Number > 0 Then Exit Function
    If mw < 2 Then Exit Function
    Open f$ For Binary As W
    a() = ChrW(0)
    Get #W, , a()
    buf$ = a()
    
    If buf$ <> ChrW(&HFEFF) Then
    ' no unicode
    buf$ = ""
    Seek #W, 1
    If maxmw > mw Then maxmw = mw
        While mw > 0
            If mw < maxmw Then
                ReDim a(mw - 1) As Byte
                Get #W, , a()
                buf$ = buf$ + StrConv(a(), vbUnicode, 0)
                mw = 0
            Else
                ReDim a(maxmw - 1) As Byte
                Get #W, , a()
                buf$ = buf$ + StrConv(a(), vbUnicode, 0)
                mw = mw - maxmw
            End If
        Wend
    Else
    buf$ = ""
    mw = mw - 2 ' exclude 2 bytes FEFF
    If maxmw > mw Then maxmw = mw
        While mw > 0
            If mw < maxmw Then
                ReDim a(mw - 1) As Byte
                Get #W, , a()
                buf2$ = a()
                buf$ = buf$ + buf2$
                mw = 0
            Else
                ReDim a(maxmw - 1) As Byte
                Get #W, , a()
                buf2$ = a()
                buf$ = buf$ + buf2$
                mw = mw - maxmw
            End If
     
        Wend
                If InStr(1, buf$, vbCrLf) = 0 Then
                a() = Split(buf$, ChrW(&HD))   ' if we have only vbCR...
              buf$ = Join(a(), vbCrLf)
                       End If
           buf$ = Left$(buf$, Len(buf$))
    End If
    Close W
    ReadUnicodeOrAscii = buf$
    
    End Function
    Handling a disk file line by line is not useful these days (we have gigabytes of RAM...). So read it all at once, and then reading each line is easy...

    Get the mydoc class, which uses the above function to place all text paragraphs in the class (I use a doubly linked list and a system to reuse the deleted paragraph holders).


    Dim a as new mydoc
    a.editdoc= ReadUnicodeOrAscii("c:\that.txt")
    debug.print a.DocLines, a.DocParagraphs

    There is no break function, but if you write:

    Public WithEvents a As mydoc

    and, in Form_Load:
    Set a = New mydoc

    you get an event handler:
    Private Sub a_BreakLine(Data As String, datanext As String)
    ' do something here
    End Sub

    But you can also use mydoc without breaking the lines, so that each paragraph is a single line.

    to walk through the lines is easy

    dim i as long
    for i=1 to a.doclines
    debug.print a.TextLine(i)
    next i

    You can insert, append or delete paragraphs...

    you can save the text in mydoc using this function

    Code:
    Public Function SaveUnicode(ByVal f$, ByVal buf As String) As Boolean
    Dim W As Long, a() As Byte
    On Error GoTo t12345
    If f$ <> "" Then Kill f$
    If Err.Number > 0 Then Exit Function
    W = FreeFile
    DoEvents
    Open f$ For Binary As W
    buf$ = ChrW(&HFEFF) + buf$
    
    Dim maxmw As Long, ipos As Long
    ipos = 1
    maxmw = 32000 ' check it with maxmw 20 OR 1
    For ipos = 1 To Len(buf) Step maxmw
    a() = Mid$(buf, ipos, maxmw)
    Put #W, , a()
    Next ipos
    Close W
    SaveUnicode = True
    t12345:
    End Function
    Attached Files

  3. #3
    Frenzied Member
    Join Date
    Jun 2006
    Posts
    1,098

    Re: Reading a text file, as Binary, Line by Line ?

    @George: That is needlessly overcomplicated.

    Splitting the complete text on vbCrLf is the easiest way to get all lines of a text file.
    Code:
    Dim strLines() As String, ff As Integer, i As Long
    
    ff = FreeFile
    Open FilePath For Binary As #ff
    strLines = Split(Input(LOF(ff), #ff), vbCrLf)   ' LOF takes the plain file number, without "#"
    Close #ff
    
    For i = 0 To UBound(strLines)
      ' Do something with strLines(i)
    Next i
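
    Splitting on vbCrLf alone returns a Unix-style (Lf-only) file as a single element. As a minimal sketch, assuming the whole text is already in a string, the line endings can be normalized before splitting. SplitLines is a hypothetical helper name, not something from this thread:
    Code:
    ' Hypothetical helper: splits Text into lines whatever the line-ending style.
    Public Function SplitLines(ByVal Text As String) As String()
        Text = Replace(Text, vbCrLf, vbLf)   ' Windows (CrLf) -> Lf
        Text = Replace(Text, vbCr, vbLf)     ' lone Cr -> Lf
        SplitLines = Split(Text, vbLf)
    End Function

    The CrLf replacement has to come first, otherwise every Windows line break would turn into two Lf characters and produce empty lines.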

  4. #4
    Frenzied Member
    Join Date
    May 2014
    Location
    Kallithea Attikis, Greece
    Posts
    1,289

    Re: Reading a text file, as Binary, Line by Line ?

    The myDoc class is somewhat bigger than a plain string array that can be filled with the Split function, so it is not needed here (it is meant more as an idea of what a nice class for document processing can be).
    Because a txt file can be in Unicode, I prefer to read it with ReadUnicodeOrAscii, which I posted before. In your code, I think Input cannot fetch more than 32k chars, but maybe I am wrong. Have you tested your code with a big text file?

  5. #5
    PowerPoster
    Join Date
    Feb 2012
    Location
    West Virginia
    Posts
    14,205

    Re: Reading a text file, as Binary, Line by Line ?

    The biggest file I have read in using Input method such as above is just over 500mb, no problem

    I would expect it to fail if you exceed your memory size or if the file is over 2gb

    Also note that reading 1 byte at a time as in the OP is the slowest possible way to read a file and could take thousands of times longer to read the file than would a method that reads larger chunks at once.
    Last edited by DataMiser; Oct 7th, 2014 at 01:57 PM.

  6. #6
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: Reading a text file, as Binary, Line by Line ?

    Why do you want to open the file as a binary file if it is a text file? Is there something you didn't mention?

  7. #7

    Thread Starter
    Hyperactive Member
    Join Date
    Oct 2013
    Posts
    389

    Re: Reading a text file, as Binary, Line by Line ?

    Quote Originally Posted by dilettante View Post
    Why do you want to open the file as a binary file if it is a text file? Is there something you didn't mention?
    Reading Unicode/UTF-8/Big Endian text files requires me to read them as binary; otherwise, VB6 automatically converts them to ANSI.

    Quote Originally Posted by DataMiser
    Also note that reading 1 byte at a time as in the OP is the slowest possible way to read a file and could take thousands of times longer to read the file than would a method that reads larger chunks at once.
    You are correct, I actually overlooked this.
    It was originally meant to detect a vbCrLf byte sequence,
    but looking back now, it just seems silly to keep the buffer this small. Thank you for that.

    @Georgekar
    Thanks, but it is overkill for such a task.
    Last edited by stum; Oct 9th, 2014 at 02:55 AM.

  8. #8
    PowerPoster
    Join Date
    Jun 2013
    Posts
    7,219

    Re: Reading a text file, as Binary, Line by Line ?

    Quote Originally Posted by stum View Post
    Reading Unicode/UTF-8/Big Endian text files requires me to read them as binary; otherwise, VB6 automatically converts them to ANSI.
    Then why not simply read it completely into a ByteArray first:
    Code:
    Sub Test()
    Dim B() As Byte
        B = ReadFileBytes("C:\Tests\SomeUnicode.txt")
    
        'now, one can check the first few Bytes of B() for BOM-information
        'or if the format is known to e.g. being UTF8, pass the Array directly 
        'into an UTF8toVBString-routine
    
        'then Split the retrieved Unicode-String as you like (e.g. on vbCrLf)
    End Sub
    
    Function ReadFileBytes(FileName As String) As Byte()
    Dim FNr&: FNr = FreeFile
      Open FileName For Binary Access Read As FNr
        ReDim ReadFileBytes(0 To LOF(FNr) - 1)
        Get FNr, , ReadFileBytes
      Close FNr
    End Function


    Olaf

  9. #9
    Fanatic Member
    Join Date
    Jan 2006
    Posts
    557

    Re: Reading a text file, as Binary, Line by Line ?

    Another solution is to pre-process your file in Notepad.

    Open your file in notepad, save as ansi, all UTF-8 codes will be correctly translated to ANSI equivalent characters according to the active codepage of your system. After having tried a number of solutions internal to VB, I have pretty much standardized on the pre-process trick.

  10. #10
    Frenzied Member
    Join Date
    May 2014
    Location
    Kallithea Attikis, Greece
    Posts
    1,289

    Re: Reading a text file, as Binary, Line by Line ?

    I posted a simple routine in #2 for loading ASCII or Unicode. It is easy... just copy it.

  11. #11
    Fanatic Member
    Join Date
    Jan 2006
    Posts
    557

    Re: Reading a text file, as Binary, Line by Line ?

    Are you talking Unicode or UTF-8? The Unicode that VB6 handles is always two bytes per character. UTF-8 is not Unicode! A UTF-8 character can vary from 1 to 4 bytes, with the lower part of ASCII (less than 128) always a single byte. The UTF-8 specification was made to be backward compatible with ASCII. ANSI is a one-byte system based on a regional code page. A UTF-8 file has a unique 3 bytes BOM at the beginning and converting the remainder of the bytes one at a time is not trivial. I still recommend the Notepad trick... If you need a working piece of code... holler, I'll post something that will do the job.
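
    To make the "1 to 4 bytes" point concrete, here is a small sketch of how the first byte of a UTF-8 sequence announces its length. Utf8SeqLen is a hypothetical name chosen for illustration; the byte ranges follow the UTF-8 definition:
    Code:
    ' Hypothetical helper: byte count of a UTF-8 sequence, judged from its first byte.
    ' Returns 0 for a continuation byte (&H80-&HBF) or an invalid lead byte.
    Public Function Utf8SeqLen(ByVal Lead As Byte) As Long
        Select Case Lead
            Case &H0 To &H7F:  Utf8SeqLen = 1   ' ASCII range, stored unchanged
            Case &HC2 To &HDF: Utf8SeqLen = 2
            Case &HE0 To &HEF: Utf8SeqLen = 3
            Case &HF0 To &HF4: Utf8SeqLen = 4
        End Select
    End Function

    For example, "é" is stored as C3 A9 (lead byte C3, two bytes) and "€" as E2 82 AC (lead byte E2, three bytes).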

  12. #12
    PowerPoster
    Join Date
    Jun 2013
    Posts
    7,219

    Re: Reading a text file, as Binary, Line by Line ?

    Quote Originally Posted by Navion View Post
    Are you talking Unicode or UTF-8? The Unicode that VB6 handles is always two bytes per character. UTF-8 is not Unicode!
    Well, as a valid Unicode-Encoding, UTF-8 is surely Unicode.

    All in all a really nice and efficient format for storage and transfers, because it not only
    covers the whole Unicode-range, but is also capable of transporting Chars in the ASCII-range (unencoded).

    Quote Originally Posted by Navion View Post
    A UTF-8 file has a unique 3 bytes BOM at the beginning
    Sometimes, but not necessarily.

    Quote Originally Posted by Navion View Post
    and converting the remainder of the bytes one at a time is not trivial.
    I still recommend the Notepad trick...
    Nah, there's no need for an external Application, when you can do the same
    ANSI-conversion yourself directly per VB-Code - although one could ask:
    Why convert an already existing, decoded Unicode-VBString back into ANSI?

    Anyways, here's some code which shows how to read (UTF8-)ByteArrays from
    a file and then convert them (with or without BOM) into a Uni-VBString.


    Code:
    Option Explicit
      
    Private Declare Function MultiByteToWideChar& Lib "kernel32" (ByVal CodePage&, ByVal dwFlags&, MultiBytes As Any, ByVal cBytes&, ByVal pWideChars&, ByVal cWideChars&)
    Private Declare Function TextOutW& Lib "gdi32" (ByVal hDC&, ByVal x&, ByVal y&, ByVal pS&, ByVal LenS&)
     
    Private Sub Form_Click()
    Dim B() As Byte
        B = ReadFileBytes("C:\Tests\SomeUtf8.txt")
    
    Dim S As String
        S = UTF8ToVBString(B)
        If StrPtr(S) Then TextOutW hDC, 4, 4, StrPtr(S), Len(S)
     
      'just in case ANSI-W conversion of the Uni-VBString is needed
      '(that's similar to your suggested NotePad-method)
      S = StrConv(StrConv(S, vbFromUnicode), vbUnicode)
      Print vbLf; vbLf; " "; S
    End Sub
    
    Public Function UTF8ToVBString(B() As Byte) As String
    Dim LB As Long, Bytes As Long, WChars As Long
    On Error GoTo ReturnEmptyString
      LB = LBound(B) + IIf(HasUTF8BOM(B), 3, 0)
      Bytes = UBound(B) - LB + 1
      
      WChars = MultiByteToWideChar(65001, 0, B(LB), Bytes, 0, 0)
      UTF8ToVBString = Space$(WChars)
      MultiByteToWideChar 65001, 0, B(LB), Bytes, StrPtr(UTF8ToVBString), WChars
    ReturnEmptyString:
    End Function
    
    Public Function HasUTF8BOM(B() As Byte) As Boolean
    Dim LB&: LB = LBound(B)
      If UBound(B) - LB > 1 Then HasUTF8BOM = (B(LB) = 239 And B(LB + 1) = 187 And B(LB + 2) = 191)
    End Function
     
    Public Function ReadFileBytes(FileName As String) As Byte()
      If FileLen(FileName) = 0 Then ReadFileBytes = vbNullString: Exit Function
      Dim FNr&: FNr = FreeFile
      Open FileName For Binary Access Read As FNr
        ReDim ReadFileBytes(0 To LOF(FNr) - 1)
        Get FNr, , ReadFileBytes
      Close FNr
    End Function

    Olaf

  13. #13
    Fanatic Member
    Join Date
    Jan 2006
    Posts
    557

    Re: Reading a text file, as Binary, Line by Line ?

    Quote Originally Posted by Schmidt View Post

    Sometimes, but not necessarily.

    Olaf
    That's true, but without a BOM there is no way to know what the file contains and whether there is encoding in that particular file, unless maybe it's an XML or web file with a UTF-8 tag.

    At any rate, I made up my mind last week to write a typesetting program in VB6. My provider of raw text files has all sorts of flavors of text files: ANSI, UTF-8, a number of other UTF formats, and some ISO ones as well, with a 2-byte BOM. So I delved into the matter seriously. After much study and some code writing, I was ready to make what I think are sound decisions for my own needs.

    I looked at a number of code examples on the web and found out I did not like all that much the MultiByteToWideChar approach and decided not to use it. Instead, I found a piece of code, all VB, without API calls, that dealt with the matter in what I consider, after exhaustive tests, as flawless and in a manner that is more compatible with my programming style and preferences. Without a BOM of either 2 or 3 bytes, as per established standards, I will tend to treat a file as ANSI.

    Now the text files I am going to work with are thousands of pages long. I am quite experienced in dealing with very large text files, as I built a special C++ I/O stream class to deal with huge ANSI DXF files coming from the Unix world, often 200 megabytes in size or more, for which the standard VB6 file I/O proved not to be the best tool. As I prefer to do as little C++ as I can get away with, I decided not to modify the class with multi-byte recognition.

    All factors weighed in, the solution that worked best for me, as far as the typesetting business goes, is to pre-process the UTF-8 and UTF-16 files in Notepad so that they are compatible with my ANSI I/O stream class without modification. As a bonus, the whole thing led me to build a pure VB6 (no API) UTF-8-only I/O stream class for when the need arises to quickly convert smaller BOM-compliant text files on the fly.

  14. #14
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Re: Reading a text file, as Binary, Line by Line ?

    I'm reminded of The Notepad file encoding problem, redux and related articles.

    Note also that once Notepad rolls the dice and makes its guess it calls... you guessed it... MultiByteToWideChar as necessary.

    BTW, as far as I can tell from available documentation DXF files should never contain ANSI, but only either escaped ASCII or Unicode. Which Unicode encoding is a mystery as they seem to be incredibly poor about expressing such things, but from their ramblings I'd guess UTF-8. They also seem to just ram ANSI-1252 into ASCII fields and let the chips fall where they may.

    This sounds like some scary "house of cards" software to me, with the stink of Unix all over it.

    I fail to see how slamming such data through Notepad does anything but cause more grief.

  15. #15
    Fanatic Member
    Join Date
    Jan 2006
    Posts
    557

    Re: Reading a text file, as Binary, Line by Line ?

    Quote Originally Posted by dilettante View Post
    BTW, as far as I can tell from available documentation DXF files should never contain ANSI, but only either escaped ASCII or Unicode.
    Actually, about the graphics files from Unix, I just made the story short by calling them ANSI to simplify things a bit. There are a few flavors there too. The DXF files are pure ASCII, but they are non-standard in that lines are delimited with Lf only instead of CrLf. They are too large to be managed by VB6's Split, and Line Input does not work with them, each line being too long and making parsing as per Autodesk specifications a nightmare. Along with these, I also get a number of fairly large .PS, .PDF and .EPS files containing RLE-encoded sections, and the .EPS ones have a binary header of variable length, all being Lf-only delimited. Stream processing was the best option to tackle all those problems, since I had access to the encoding C++ libraries provided by the customer.

    The UTF-8 and other similar ones are for a personal project. I ran tests with my own UTF-8 library and the Notepad option, and the results are consistently identical. I insisted that the provider deliver UTF-8, UTF-16 or ANSI. I will not get into other specifications, so I think I am good there.

  16. #16
    Frenzied Member
    Join Date
    May 2014
    Location
    Kallithea Attikis, Greece
    Posts
    1,289

    Re: Reading a text file, as Binary, Line by Line ?

    Read the code... it handles no UTF-8, only 2-byte UTF-16 (not the variant with 4 bytes or more). I also check whether ChrW(&HFEFF) is at the start of the file, to decide what to do next...

    As for vbLf in place of vbCrLf... we have to do something... Is the vbLf a wide char or a single byte?
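
    On the wide-char question, a short sketch that can be pasted into any Sub: inside a VB string a line feed occupies two bytes, while in an ANSI (or UTF-8) file it is the single byte &H0A. The variable names are illustrative only:
    Code:
    Dim Wide() As Byte, Narrow() As Byte
    Wide = vbLf                             ' raw UTF-16LE bytes of the VB string: &H0A &H00
    Narrow = StrConv(vbLf, vbFromUnicode)   ' ANSI bytes: &H0A
    Debug.Print UBound(Wide) + 1; UBound(Narrow) + 1   ' byte counts: 2 and 1

    So when scanning a Byte array read from an ANSI or UTF-8 file, look for the single byte 10; in a UTF-16LE file look for the pair 10, 0.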
    Last edited by georgekar; Oct 10th, 2014 at 04:16 AM.

  17. #17
    PowerPoster
    Join Date
    Jun 2013
    Posts
    7,219

    Re: Reading a text file, as Binary, Line by Line ?

    Quote Originally Posted by Navion View Post
    I looked at a number of code examples on the web and found out I did not like all that much the MultiByteToWideChar approach and decided not to use it. Instead, I found a piece of code, all VB, without API calls, that dealt with the matter in what I consider, after exhaustive tests, as flawless ...
    Well, in the case of UTF8-decoding I'm nearly 100% sure that the "piece of VB-Code-without APIs" you
    found on the Web will produce results that *differ* from those of the System's MultiByteToWideChar-APIcall - and
    those differing results will - with high probability - be wrong.

    On a side-note ... since when is "Code-snippets which avoid external libs" a criterion for quality?
    You might be lucky with your current choice (perhaps because the Input you feed in, is coming from
    a "range" which is understood and handled well enough) - but as said, the System-API was tested in
    *all* possible scenarios on all Unicode-ranges - you're doing yourself no favour in not choosing it for
    UTF8-decoding (and the mapping to 16bit-WideChars).

    Quote Originally Posted by Navion View Post
    All factors weighed in, the solution that worked best for me, as far as the typesetting business goes, is to pre-process the UTF-8 and UTF-16 files in Notepad so that they are compatible with my ANSI I/O stream class without modification.
    But you're losing information this way ... in case your UTF8-TextBlob contained a mix of English, Cyrillic and Chinese chars,
    you will lose information when you load this as UTF8 into Notepad and then save it from there as ANSI.
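
    Whether a given string survives the trip through ANSI can be checked with the same double StrConv round trip used in #12. This is only a sketch, and LosesDataAsAnsi is a hypothetical helper name:
    Code:
    ' Hypothetical helper: True if S changes when forced through the system's ANSI codepage.
    Public Function LosesDataAsAnsi(ByVal S As String) As Boolean
        Dim RoundTrip As String
        RoundTrip = StrConv(StrConv(S, vbFromUnicode), vbUnicode)
        ' Binary comparison, so the result does not depend on Option Compare
        LosesDataAsAnsi = (StrComp(RoundTrip, S, vbBinaryCompare) <> 0)
    End Function

    Characters that have no slot in the active codepage come back as "?" or as a look-alike substitute, so the comparison fails for exactly the mixed-script texts described above.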

    Olaf

  18. #18
    Frenzied Member
    Join Date
    May 2014
    Location
    Kallithea Attikis, Greece
    Posts
    1,289

    Re: Reading a text file, as Binary, Line by Line ?

    Inside a VB string we have Unicode wide chars. I posted a simple routine that opens a text file, in a Windows environment, that was previously saved from Notepad as ASCII or as Unicode... If we deal with something else, then we have to write the right decoder. MultiByteToWideChar is the best choice for the UTF-8 to wide-char conversion.
    Preparing the file by loading and saving it in Notepad makes sense for a user, not for a programmer. It does accomplish the task... but for a programmer, this preparation must be done in some automatic way. So we need a routine to do that, and the OS provides one for the conversion: MultiByteToWideChar.

  19. #19
    PowerPoster
    Join Date
    Jun 2013
    Posts
    7,219

    Re: Reading a text file, as Binary, Line by Line ?

    Quote Originally Posted by georgekar View Post
    I posted a simple routine that opens a text file, in a Windows environment, that was previously saved from Notepad as ASCII or as Unicode...
    In case you mean the routine you posted in #2, this is not really recommendable
    (since it still contains at least one bug and is not written very efficiently).

    Below is an alternative routine which is not longer than yours, but does about twice as much
    (including UTF8- as well as 16bit-BigEndian-Decoding and Detection of all 3 BOM-types).

    Code:
    Declare Function MultiByteToWideChar& Lib "kernel32" (ByVal CodePage&, ByVal dwFlags&, MultiBytes As Any, ByVal cBytes&, ByVal pWideChars&, ByVal cWideChars&)
    
    Function ReadUnicodeOrANSI(FileName As String, Optional ByVal EnsureWinLFs As Boolean) As String
    Dim i&, FNr&, BLen&, WChars&, BOM As Integer, BTmp As Byte, B() As Byte
    
    On Error GoTo ErrHandler
      BLen = FileLen(FileName)
      If BLen = 0 Then Exit Function
      
      FNr = FreeFile
      Open FileName For Binary Access Read As FNr
      
        Get FNr, , BOM
        Select Case BOM
          Case &HFEFF, &HFFFE 'one of the two possible 16 Bit BOMs
            If BLen >= 3 Then
              ReDim B(0 To BLen - 3): Get FNr, 3, B 'read the Bytes
              
              If BOM = &HFFFE Then 'big endian, so lets swap the byte-pairs
                For i = 0 To UBound(B) Step 2
                  BTmp = B(i): B(i) = B(i + 1): B(i + 1) = BTmp
                Next
              End If
              ReadUnicodeOrANSI = B
            End If
          Case &HBBEF 'the start of a potential UTF8-BOM
            Get FNr, , BTmp
            If BTmp = &HBF Then 'it's indeed the UTF8-BOM
              If BLen >= 4 Then
                ReDim B(0 To BLen - 4): Get FNr, 4, B 'read the Bytes
                
                WChars = MultiByteToWideChar(65001, 0, B(0), BLen - 3, 0, 0)
                ReadUnicodeOrANSI = Space$(WChars)
                MultiByteToWideChar 65001, 0, B(0), BLen - 3, StrPtr(ReadUnicodeOrANSI), WChars
              End If
            Else 'not an UTF8-BOM, so read the whole Text as ANSI
              ReadUnicodeOrANSI = Space$(BLen)
              Get FNr, 1, ReadUnicodeOrANSI
            End If
            
          Case Else 'no BOM was detected, so read the whole Text as ANSI
            ReadUnicodeOrANSI = Space$(BLen)
            Get FNr, 1, ReadUnicodeOrANSI
        End Select
        
        If EnsureWinLFs And InStr(ReadUnicodeOrANSI, vbCrLf) = 0 Then
          If InStr(ReadUnicodeOrANSI, vbLf) Then
            ReadUnicodeOrANSI = Replace(ReadUnicodeOrANSI, vbLf, vbCrLf)
          ElseIf InStr(ReadUnicodeOrANSI, vbCr) Then
            ReadUnicodeOrANSI = Replace(ReadUnicodeOrANSI, vbCr, vbCrLf)
          End If
        End If
        
    ErrHandler:
    If FNr Then Close FNr
    If Err Then Err.Raise Err.Number, Err.Source & ".ReadUnicodeOrANSI", Err.Description
    End Function
    Edit:
    Correction of a Copy&Paste-mistake in the EnsureWinLFs-section at the end of the above routine...

    Changed the old Line:
    ElseIf InStr(ReadUnicodeOrANSI, vbLf) Then

    To the new one:
    ElseIf InStr(ReadUnicodeOrANSI, vbCr) Then

    Thanks to Bonnie West for pointing out that bug...

    Olaf
    Last edited by Schmidt; Oct 10th, 2014 at 07:45 PM. Reason: Code-correction in the EnsureWinLFs-section

  20. #20
    Frenzied Member
    Join Date
    May 2014
    Location
    Kallithea Attikis, Greece
    Posts
    1,289

    Re: Reading a text file, as Binary, Line by Line ?

    Are you sure that you can get as many bytes as the length of a byte array in one read operation, or is there a limit of 32768 or something like that when reading a file in binary access?

  21. #21
    Frenzied Member
    Join Date
    May 2014
    Location
    Kallithea Attikis, Greece
    Posts
    1,289

    Re: Reading a text file, as Binary, Line by Line ?

    Mr. Schmidt,
    your code is OK...
    So your code works for UTF-8 as well. I think this code can be inserted into my M2000 interpreter (the latest version)...

    I have to remember why I use partial reads...and not one read for all... Where did you find a bug in my code?

  22. #22
    PowerPoster
    Join Date
    Jun 2013
    Posts
    7,219

    Re: Reading a text file, as Binary, Line by Line ?

    Quote Originally Posted by georgekar View Post
    I have to remember why I use partial reads...and not one read for all...
    One should use partial reads on huge Files which are larger than - say - 100-200MB,
    because above that the Memory-allocator could start to choke whilst attempting to
    reserve consecutive memory (for the Byte-Array or VB-String which is supposed to
    hold the file-contents).

    Aside from that consideration, VBs Binary File-Mode is by principle able to feed directly into
    (Variable-provided) allocations of any size.
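
    A partial read along those lines might look like the following sketch. The routine name and the 64 KB default are assumptions made for illustration, not values taken from this thread:
    Code:
    ' Hypothetical helper: walks through a file in fixed-size chunks.
    Public Sub ReadInChunks(ByVal FileName As String, Optional ByVal ChunkSize As Long = 65536)
        Dim FNr As Integer, Remaining As Long, B() As Byte
        FNr = FreeFile
        Open FileName For Binary Access Read As #FNr
        Remaining = LOF(FNr)
        Do While Remaining > 0
            If Remaining < ChunkSize Then ChunkSize = Remaining   ' last, shorter chunk
            ReDim B(0 To ChunkSize - 1)
            Get #FNr, , B              ' fills B() from the current file position
            Remaining = Remaining - ChunkSize
            ' ... process B() here ...
        Loop
        Close #FNr
    End Sub

    Note that a chunk boundary can fall in the middle of a CrLf pair or a multi-byte UTF-8 sequence, so any decoding or line splitting done per chunk has to carry the incomplete tail over into the next chunk.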

    Quote Originally Posted by georgekar View Post
    Where did you find a bug in my code?
    For example your routine will not read the content of a File which contains only the Character "A",
    because of these lines which come already at the top of the function:

    mw = FileLen(f$)
    ...
    If mw < 2 Then Exit Function

    The rest I see is not really "bugs" - but by not checking for e.g. the quite common 3-Byte UTF8-BOM,
    you would currently decode such an UTF8-File with your ANSI-decoder (producing garbage then) -
    or your vbCr-to-vbCrLf replacement block is sitting exclusively only under the 16Bit-Uni-Codepath,
    not under the ANSI-one... (although it would make sense to do that in both cases) - also you left
    out the check for "plain vbLFs" (which are quite common in Unix-Textfiles as e.g. open C-Sources).

    Also your Code for that part is not really efficient (the Split-Join thing you do is slower than
    Replace with a preceding Instr-check).

    Code:
    If InStr(1, Buf$, vbCrLf) = 0 Then
      A() = Split(Buf$, ChrW(&HD))   ' if we have only vbCR...
      Buf$ = Join(A(), vbCrLf)
    End If
    Buf$ = Left$(Buf$, Len(Buf$))
    Also the last line with the Left$-instruction of your code-block above is entirely redundant.

    If you care for one more word of advice ...

    My coding style is also not the best - but try to get at least the (nested) indentations right -
    consistently! (some parts of your code contain proper indents - some parts not at all).
    It's really difficult to read - and can give the wrong impression about otherwise interesting
    code.

    Olaf
    Last edited by Schmidt; Oct 11th, 2014 at 09:44 AM.

  23. #23
    Frenzied Member
    Join Date
    May 2014
    Location
    Kallithea Attikis, Greece
    Posts
    1,289

    Re: Reading a text file, as Binary, Line by Line ?

    Thank you (I put your code in the M2000 Interpreter, credited to Schmidt, member of vbforums. I still have work to do on it: Unicode support, and dropping the common controls (selectors) in favor of glist).
    As for my style of coding: first I write simple code to do a simple task, then I run it and observe it, then I make some changes and observe the execution again. I am always thinking about what can go wrong, so I have to think about the initial state of the variables and the range of values to deal with. Many times I have no idea what the code should finally do to be perfect, but as I write, it becomes obvious where things are going... For this reason, programming is an art.

    About indentation: I use indentation when I have to understand what is happening... and when things go wrong.

    As for the bug (maybe after reading this you will see that it isn't a bug): the input file is supposed to consist of lines ending in vbCrLf or vbCr, so if a text file holds only one byte, the function returns an empty string... If the file holds just a vbCrLf or a vbCr, we get effectively the same result.
    When is this no good? When we have to read the data as ASCII and not as lines of text.

  24. #24
    Fanatic Member DrUnicode's Avatar
    Join Date
    Mar 2008
    Location
    Natal, Brazil
    Posts
    631

    Re: Reading a text file, as Binary, Line by Line ?

    Quote Originally Posted by Navion
    A UTF-8 file has a unique 3 bytes BOM at the beginning
    Quote Originally Posted by Schmidt
    Sometimes, but not necessarily.
    Quote Originally Posted by Navion
    That's true but without a BOM, there is no way to know what the file contains and if there is encoding in that particular file, unless maybe it's an XML or web file with a UTF-8 tag.
    I disagree. There are at least 2 ways to detect UTF-8 encoded text files that do not have a BOM.

    See the attached project, which handles UTF-8 without BOM as well as handling the UTF16LE, UTF16BE, and UTF8 BOMs.
    Several sample text files are included in project.

    BTW - Mozilla Firefox and Notepad++ have in-built detection of UTF-8 without a BOM and there are probably other apps that can do this as well.
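
    One common heuristic for BOM-less files (not necessarily what the attached project does) is to check whether the bytes form structurally valid UTF-8 with at least one multi-byte sequence, since random ANSI text rarely does. A simplified sketch, assuming B() is a non-empty array; LooksLikeUtf8 is a hypothetical name, and the check does not reject overlong forms or encoded surrogates:
    Code:
    ' Hypothetical helper: True if B() is structurally valid UTF-8 and contains
    ' at least one multi-byte sequence (pure ASCII returns False: nothing to detect).
    Public Function LooksLikeUtf8(B() As Byte) As Boolean
        Dim i As Long, n As Long, Trail As Long, SawMultiByte As Boolean
        i = LBound(B)
        Do While i <= UBound(B)
            Select Case B(i)
                Case &H0 To &H7F:  Trail = 0
                Case &HC2 To &HDF: Trail = 1
                Case &HE0 To &HEF: Trail = 2
                Case &HF0 To &HF4: Trail = 3
                Case Else: Exit Function                  ' stray continuation or invalid lead byte
            End Select
            If i + Trail > UBound(B) Then Exit Function   ' sequence cut off at the end
            For n = 1 To Trail
                If (B(i + n) And &HC0) <> &H80 Then Exit Function   ' not a continuation byte
            Next n
            If Trail > 0 Then SawMultiByte = True
            i = i + Trail + 1
        Loop
        LooksLikeUtf8 = SawMultiByte
    End Function

    This remains a guess rather than a proof: a short ANSI file can happen to pass, which is why a BOM or an explicit encoding tag is still the more reliable signal.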

    [Image: GenericFileReader.jpg]
    Attached Files
    Last edited by DrUnicode; Oct 12th, 2014 at 08:33 PM. Reason: Moer info

  25. #25
    Fanatic Member
    Join Date
    Jan 2006
    Posts
    557

    Re: Reading a text file, as Binary, Line by Line ?

    The readfile project worked well on French and Spanish UTF-8 files I had on hand.

    With some other files, results varied a bit. Here is the result for a UTF-8 file with a 3-byte BOM.

    Everyone can draw their own conclusions. By the way, I also dropped the file into Mozilla Firefox: same results as Notepad.

    Notepad, Firefox and my UTF-8 decoding routine gave the exact same results. Not perfect, most probably due to the locale codepage.

    MultibyteToWideChar performed the worst.


    [Image: UTF-8-results.jpg]

  26. #26
    Fanatic Member DrUnicode's Avatar
    Join Date
    Mar 2008
    Location
    Natal, Brazil
    Posts
    631

    Re: Reading a text file, as Binary, Line by Line ?

    According to your screenshots, none of the methods processed the original file correctly, which makes me wonder whether the original file was encoded properly in UTF-8. The word "Hejâz" was incorrect in all of your screenshots except the original.

    I am guessing that your sample file has ANSI text with a UTF-8 BOM.

    I hand typed your original text into Notepad and saved as UTF-8 (with BOM).
    It works just fine with prjReadFile.

    MultibyteToWideChar performed the worst.
    I find that hard to believe. Many of us have been using MultibyteToWideChar for years and it works flawless.

    Anyway, try the attached UTF-8 file to see if it works OK.
    Attached Files

  27. #27
    Fanatic Member
    Join Date
    Jan 2006
    Posts
    557

    Re: Reading a text file, as Binary, Line by Line ?

    After more study, it goes like this :

    [Image: example1.jpg]

    [Image: example2.jpg]

    What we have here is a cross-platform inconsistency combined with a Notepad bug and most likely the use of a seldom used text file format.

    I modified my program to show the first three bytes of a file to reduce guesswork.
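
    Showing the first three bytes can be done along these lines. This is only a sketch with a hypothetical helper name (DescribeBom); it assumes the file exists and it does not try to recognise UTF-32 BOMs.

```vb
' Hypothetical helper: reports which BOM, if any, starts the file.
' Assumption: sPath names an existing file.
Public Function DescribeBom(ByVal sPath As String) As String
    Dim ff As Integer
    Dim b(0 To 2) As Byte   ' stays 0 for bytes the file does not have
    Dim n As Long

    ff = FreeFile
    Open sPath For Binary Access Read As #ff
    n = LOF(ff)
    If n > 0 Then Get #ff, 1, b     ' read up to 3 bytes from the start
    Close #ff

    If n >= 3 And b(0) = &HEF And b(1) = &HBB And b(2) = &HBF Then
        DescribeBom = "UTF-8 BOM (EF BB BF)"
    ElseIf n >= 2 And b(0) = &HFF And b(1) = &HFE Then
        DescribeBom = "UTF-16LE BOM (FF FE)"
    ElseIf n >= 2 And b(0) = &HFE And b(1) = &HFF Then
        DescribeBom = "UTF-16BE BOM (FE FF)"
    Else
        ' No known BOM: show the raw bytes as two-digit hex.
        DescribeBom = "No BOM; first bytes: " & _
            Right$("0" & Hex$(b(0)), 2) & " " & _
            Right$("0" & Hex$(b(1)), 2) & " " & _
            Right$("0" & Hex$(b(2)), 2)
    End If
End Function
```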

    The conclusions are :

    - Both my UTF-8 stream and MultiByteToWideChar work the same on valid UTF-8 files.

    - There is a difference though: the UTF-8 stream inspects each line to decide whether it is UTF-8 or ANSI. If the line of text is not ANSI, it assumes UTF-8, although it might actually be something else. This is why the results are similar to Notepad's.

    - The faulty file has NO BOM, yet Notepad falsely but POSITIVELY identifies it as a UTF-8 file. That means Notepad does not rely on the BOM alone, and makes the same error, so to speak, as the UTF-8 routine, both in detecting and in processing each line. Hence, Notepad does not use MultiByteToWideChar.

    - RichTextBox does use MultiByteToWideChar or equivalent as the results are the same as GenericReadFile.

    - The faulty file essentially has an ANSI text portion at the beginning, followed by a block of text that is neither ANSI, nor UTF-8 (nor UTF-16be or le I would assume), then another ANSI block at the end.

    It was a bit of a pain, but some things were learned in the process.

    (Edit: it's too bad the uploader reduces files to smaller JPEGs. The originals are lossless PNGs for better quality.)

  28. #28
    PowerPoster Elroy's Avatar
    Join Date
    Jun 2014
    Location
    Near Nashville TN
    Posts
    9,853

    Re: Reading a text file, as Binary, Line by Line ?

    Wow, I just hammered out a routine that does almost exactly this. See post #32 on http://www.vbforums.com/showthread.p...h-notes-field!

    Also, I've got another procedure for reading in Unix style files (with only LF as line terminator). I'll post it too.

    The file has to be already opened as Binary before this procedure is called. Also, the maximum line length is 2000 which worked for me, but you can change it.

    Code:
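    ' Returns the next LF-terminated line from a file already opened For Binary.
    ' Returns Chr$(0) at end of file, or when no LF is found within the buffer.
    Private Function GetNextLine(hFile As Long) As String
        Dim s As String * 2000      ' Fixed-length buffer; also the maximum line length.
        Dim i As Long
        Dim iStartPtr As Long
        '
        iStartPtr = Seek(hFile)     ' Remember the (1-based) byte position this read starts at.
        If EOF(hFile) Then
            GetNextLine = Chr$(0)
            Exit Function
        End If
        '
        Get hFile, , s              ' Read up to 2000 bytes from the current position.
        '
        i = InStr(s, vbLf)          ' Position of the first LF in the buffer, 0 if none.
        Select Case i
        Case 0 ' Maybe need to make s bigger.
            GetNextLine = Chr$(0)
            Exit Function
        Case 1
            Seek hFile, iStartPtr + 1
            Exit Function ' Just return null string on repeating vbLF's.
        End Select
        '
        Seek hFile, iStartPtr + i       ' Reposition just past the LF for the next call.
        GetNextLine = Left$(s, i - 1)   ' Return the line without its LF.
    End Function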
    This could also be fairly easily adapted to other types of line terminators. Or, just monitor for Chr$(13) and toss them.

    EDIT: Yes, just to note, this routine was specifically written to read ASCII files. There is no consideration of Unicode (and no need for it in the spot it was used).
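
    If the file fits comfortably in memory, another way to deal with mixed line terminators is to read everything with a single Get and normalise the terminators before splitting. The sketch below is an alternative, not a rewrite of GetNextLine; ReadAllLines is a hypothetical name, and like the routine above it assumes ANSI/ASCII content.

```vb
' Hypothetical helper: reads an ANSI/ASCII file in one Get and returns its lines.
' Assumption: sPath names an existing file small enough to hold in memory.
Public Function ReadAllLines(ByVal sPath As String) As String()
    Dim ff As Integer
    Dim b() As Byte
    Dim sText As String

    ff = FreeFile
    Open sPath For Binary Access Read As #ff
    If LOF(ff) > 0 Then
        ReDim b(0 To LOF(ff) - 1)
        Get #ff, 1, b                       ' whole file into the byte array
        sText = StrConv(b, vbUnicode)       ' ANSI bytes -> VB string
    End If
    Close #ff

    sText = Replace(sText, vbCrLf, vbLf)    ' Windows CRLF -> LF
    sText = Replace(sText, vbCr, vbLf)      ' any lone CR -> LF
    ReadAllLines = Split(sText, vbLf)
End Function
```

    A file that ends with a line terminator yields one extra empty element at the end of the array, which the caller may want to skip.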
    Last edited by Elroy; Oct 30th, 2014 at 10:05 AM.

  29. #29
    Frenzied Member
    Join Date
    May 2014
    Location
    Kallithea Attikis, Greece
    Posts
    1,289

    Re: Reading a text file, as Binary, Line by Line ?

    With files opened for Binary, EOF cannot be used when reading with Input.
    From http://msdn.microsoft.com/en-us/libr...(v=vs.90).aspx:
    "With files opened for Binary access, an attempt to read through the file using the Input function until EOF returns True generates an error. Use the LOF and Loc functions instead of EOF when reading binary files with Input, or use Get when using the EOF function. With files opened for Output, EOF always returns True."
    LOF() returns the length in bytes. To read Unicode we have to get 2 bytes for each character, so we may ask for 2 bytes when only 1 is left. EOF does not help in that case, because Get is asked for 2 bytes while a single byte remains in the file. Using 2 * (LOF(thatfile) \ 2) gives an end-of-file position that is safe for Unicode text (assuming UTF-16, with 16 bits per character).
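
    The Loc/LOF approach can be sketched as follows. It is an illustration only: ReadUtf16Chars is a hypothetical name, the file is assumed to be UTF-16LE, and a BOM, if present, simply comes back as the first character (ChrW$(&HFEFF)).

```vb
' Hypothetical sketch: reads a UTF-16LE file one 16-bit unit at a time,
' stopping at the last even byte offset so a stray odd byte is ignored.
' Assumption: sPath names an existing file.
Public Function ReadUtf16Chars(ByVal sPath As String) As String
    Dim ff As Integer
    Dim nEven As Long
    Dim w As Integer        ' one 16-bit code unit
    Dim sOut As String

    ff = FreeFile
    Open sPath For Binary Access Read As #ff
    nEven = 2 * (LOF(ff) \ 2)       ' largest even length not above LOF
    Do While Loc(ff) < nEven        ' Loc = number of bytes read so far
        Get #ff, , w                ' read exactly 2 bytes
        sOut = sOut & ChrW$(w)
    Loop
    Close #ff

    ReadUtf16Chars = sOut
End Function
```

    Appending one character at a time is slow on large files; reading the even-length block into a Byte array in one Get and assigning it to a String would be faster, at the cost of hiding the Loc/LOF loop being illustrated here.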
    Last edited by georgekar; Oct 30th, 2014 at 09:24 AM.
