
Thread: How to read a unicode text file line by line.

  1. #1

    Thread Starter
    Fanatic Member
    Join Date
    Mar 2010
    Posts
    762

    How to read a unicode text file line by line.

    Hi
    I have a Unicode text file (a file with the extension "txt" that can contain Unicode text, for example Greek, Russian or Chinese).
    Notepad can easily handle such a file.
    However, how can I read such a file programmatically, line by line?
    Currently, I can find a lot of examples in this forum that can do such thing for me like this:
    ListFNr = FreeFile
    Open ListFilePath For Binary Access Read As #ListFNr
    ReDim bytResults(LOF(ListFNr) - 1)
    Get #ListFNr, , bytResults
    Close #ListFNr

    LineStr = Split(Mid(bytResults, 2), vbCrLf)
    However, I don't want to read the ENTIRE file at once and then split it into lines.
    I need to read the contents of the file, actually line by line.
    I need to loop through the file, line by line, read one line at a time, process it, and then read the next line from the file and process it, and so on and so forth until I reach the end of the file.
    Above all, I need to AVOID reading the entire content of the file in one shot.
    How can I do that?
    Thanks.

  2. #2
    Junior Member
    Join Date
    Jun 2015
    Posts
    23

    Re: How to read a unicode text file line by line.

    In my opinion the best way to do this would be to chunk it out:

    1. Read 4096 bytes from the file.
    2. Search those 4096 bytes for the line-break sequence (CR LF if UTF-8, NUL CR NUL LF if UTF-16 BE, or CR NUL LF NUL if UTF-16 LE).
    3. If you find it, go process that line.
    4. If not, read more data and keep searching.
    5. Loop until the end of the file.
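
    The steps above can be sketched in VB6 roughly as follows. This is only a minimal sketch under stated assumptions: the file is UTF-16 LE (what Notepad calls "Unicode"), lines end in CR LF, and the file size is an even number of bytes. ReadUtf16LinesChunked and ProcessLine are hypothetical names chosen for illustration; they are not part of any library.

    ```vb
    Option Explicit

    ' Hypothetical per-line handler; replace with your own processing.
    Private Sub ProcessLine(ByVal LineText As String)
        Debug.Print LineText
    End Sub

    ' Reads a UTF-16 LE file in 4096-byte chunks and passes each line to
    ' ProcessLine. Returns the number of lines processed.
    Public Function ReadUtf16LinesChunked(ByVal FilePath As String) As Long
        Const CHUNK_BYTES As Long = 4096   ' even, so no UTF-16 code unit is split
        Dim FNr As Integer
        Dim Remaining As Long, N As Long
        Dim Chunk() As Byte
        Dim Piece As String, Pending As String
        Dim StartPos As Long, BreakPos As Long
        Dim FirstChunk As Boolean

        FNr = FreeFile
        Open FilePath For Binary Access Read As #FNr
        Remaining = LOF(FNr)
        FirstChunk = True

        Do While Remaining > 0
            N = IIf(Remaining < CHUNK_BYTES, Remaining, CHUNK_BYTES)
            ReDim Chunk(0 To N - 1)
            Get #FNr, , Chunk
            Remaining = Remaining - N

            Piece = Chunk                  ' raw bytes are copied as UTF-16 LE text
            If FirstChunk Then
                ' Drop the byte order mark (U+FEFF) if the file starts with one.
                If Left$(Piece, 1) = ChrW$(&HFEFF) Then Piece = Mid$(Piece, 2)
                FirstChunk = False
            End If
            Pending = Pending & Piece

            ' Hand over every complete line currently in the buffer.
            StartPos = 1
            Do
                BreakPos = InStr(StartPos, Pending, vbCrLf)
                If BreakPos = 0 Then Exit Do
                ProcessLine Mid$(Pending, StartPos, BreakPos - StartPos)
                ReadUtf16LinesChunked = ReadUtf16LinesChunked + 1
                StartPos = BreakPos + 2
            Loop
            Pending = Mid$(Pending, StartPos)   ' keep the unfinished tail
        Loop
        Close #FNr

        ' The last line may not end with CR LF.
        If Len(Pending) > 0 Then
            ProcessLine Pending
            ReadUtf16LinesChunked = ReadUtf16LinesChunked + 1
        End If
    End Function
    ```

    Because a line break that straddles two chunks stays in the Pending buffer until the next chunk arrives, only one chunk plus the current unfinished line is held in memory at a time.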

  3. #3
    Default Member Bonnie West's Avatar
    Join Date
    Jun 2012
    Location
    InIDE
    Posts
    4,060

    Re: How to read a unicode text file line by line.

    Here are (late-bound!) examples of the 2 simplest solutions commonly used for this kind of task:

    Code:
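    Private Sub Form_Load()
    ' Define the conditional compilation constant UseADO (e.g. #Const UseADO = 1)
    ' to compile the ADO branch; otherwise the FSO branch is compiled.
    #If UseADO Then
    
        Const adReadLine = -2&   ' read up to the next line separator
    
        With CreateObject("ADODB.Stream")
            .Open                ' the default Charset is "Unicode" (UTF-16)
            .LoadFromFile "Unicode.txt"
    
            Do Until .EOS
                Debug.Print .ReadText(adReadLine)
            Loop
    
            .Close
        End With
    
    #Else 'UseFSO
    
        Const TristateTrue = -1& ' open the file as Unicode (UTF-16)
    
        With CreateObject("Scripting.FileSystemObject")
            With .OpenTextFile("Unicode.txt", Format:=TristateTrue)
                Do Until .AtEndOfStream
                    Debug.Print .ReadLine
                Loop
    
                .Close
            End With
        End With
    
    #End If
    End Sub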
    Last edited by Bonnie West; Jul 1st, 2015 at 04:46 AM.
    On Local Error Resume Next: If Not Empty Is Nothing Then Do While Null: ReDim i(True To False) As Currency: Loop: Else Debug.Assert CCur(CLng(CInt(CBool(False Imp True Xor False Eqv True)))): Stop: On Local Error GoTo 0
    Declare Sub CrashVB Lib "msvbvm60" (Optional DontPassMe As Any)

  4. #4

    Thread Starter
    Fanatic Member
    Join Date
    Mar 2010
    Posts
    762

    Re: How to read a unicode text file line by line.

    Thanks to both of you for the great advice.
    I tried the ADO and the FSO methods, and both of them worked perfectly.
    I just have a couple of questions about them:
    1. With the ADO method, you used a .Close statement to close the file in the end.
    With the FSO method, you didn't close the file. Why?
    How can I close the file at the end of that loop?
    2. Which one of these two methods (ADO and FSO) is superior?
    3. Let's say in addition to the contents of the file being unicode text, the name of the file (and the folder where the file resides) is also unicode. In that case, which one of the two methods is superior?
    Please advise.
    Thanks.

  5. #5
    Default Member Bonnie West's Avatar
    Join Date
    Jun 2012
    Location
    InIDE
    Posts
    4,060

    Re: How to read a unicode text file line by line.

    Quote Originally Posted by IliaPreston View Post
    1. With the ADO method, you used a .Close statement to close the file in the end.
    With the FSO method, you didn't close the file. Why?
    How can I close the file at the end of that loop?
    Sorry, I forgot to include that statement. I suspect, however, that the FSO implicitly closes the text file when the TextStream object is set to Nothing (i.e., during its Class_Terminate() event), so explicitly invoking the Close method is probably optional. I'm not really sure about this though, so you may still want to call that method just to be safe. I've included that statement in the above code now.

    Quote Originally Posted by IliaPreston View Post
    2. Which one of these two methods (ADO and FSO) is superior?
    If you want to know which one is more efficient/faster, sorry, but I have no idea. I haven't done any benchmarks that compare the two because it is pointless anyway. Both are slower than the equivalent Windows APIs, so by using either of the 2 approaches, you accept that it isn't the optimal solution. That's usually the case in programming: simple solutions are typically less efficient, while low-level methods are more complicated and thus harder to understand and debug.

    Quote Originally Posted by IliaPreston View Post
    3. Let's say in addition to the contents of the file being unicode text, the name of the file (and the folder where the file resides) is also unicode. In that case, which one of the two methods is superior?
    A quick test revealed that both methods can handle both Unicode text and path.

  6. #6
    PowerPoster
    Join Date
    Jan 2008
    Posts
    11,074

    Re: How to read a unicode text file line by line.

    Can someone post a Unicode test file so I can test with it? Thanks.


    Anything I post is an example only and is not intended to be the only solution, the total solution nor the final solution to your request nor do I claim that it is. If you find it useful then it is entirely up to you to make whatever changes necessary you feel are adequate for your purposes.

  7. #7

    Thread Starter
    Fanatic Member
    Join Date
    Mar 2010
    Posts
    762

    Re: How to read a unicode text file line by line.

    Thanks a lot Bonnie West for the new explanations.
    I believe the two solutions that you provided are great and I will use one of them.
    However, I cannot skip this question, because I am curious:
    Is there also a third way of doing it? Specifically, can it be done with the Shell32 object?
    A whole lot of file manipulation can be done with Shell32, so I guess this specific task (reading lines from a Unicode text file) should also be possible with Shell32.
    If I am right, how can it be done with Shell32?
    Once more, I ask only out of curiosity, as the other two methods are fine.
    I just cannot convince myself to remain in the dark about the Shell32 possibility.
    Thanks

  8. #8

    Thread Starter
    Fanatic Member
    Join Date
    Mar 2010
    Posts
    762

    Re: How to read a unicode text file line by line.

    Quote Originally Posted by jmsrickland View Post
    Can someone post a Unicode test file so I can test with it? Thanks.
    Hi jmsrickland.
    Thank you for showing interest in this topic.
    The attached file is what I am using to test my program.
    I hope it is useful to you.

  9. #9
    Default Member Bonnie West's Avatar
    Join Date
    Jun 2012
    Location
    InIDE
    Posts
    4,060

    Re: How to read a unicode text file line by line.

    Quote Originally Posted by jmsrickland View Post
    Can someone post a Unicode test file so I can test with it? Thanks.
    Notepad can be used to create text files encoded in either UTF-8, UTF-16 LE (Unicode) or UTF-16 BE (Unicode big endian). You can generate Unicode text by using Google Translate.

    Quote Originally Posted by Bonnie West View Post
    I suspect, however, that the FSO implicitly closes the text file when the TextStream object is set to Nothing (i.e., during its Class_Terminate() event), so explicitly invoking the Close method is probably optional.
    It looks like my suspicion was correct. I've observed via Process Hacker that the TextStream object does close its internal file handle (in case it wasn't closed yet) just prior to being terminated.

  10. #10
    Default Member Bonnie West's Avatar
    Join Date
    Jun 2012
    Location
    InIDE
    Posts
    4,060

    Re: How to read a unicode text file line by line.

    Quote Originally Posted by IliaPreston View Post
    Is there also a third way of doing it? Specifically, can it be done with the Shell32 object?
    A whole lot of file manipulation can be done with Shell32, so I guess this specific task (reading lines from a Unicode text file) should also be possible with Shell32.
    If I am right, how can it be done with Shell32?
    I don't think there's any Shell Object that can read and write to text files.

  11. #11
    PowerPoster
    Join Date
    Jun 2013
    Posts
    7,219

    Re: How to read a unicode text file line by line.

    Quote Originally Posted by Bonnie West View Post
    If you want to know which one is more efficient/faster, sorry, but I have no idea.
    It seems that the FSO-LineReader starts out a little bit slower than the ADO-Stream
    with smaller files (around 10000 lines) - but it scales linearly with larger FileSizes,
    whilst the ADO-Stream-Reading gets exponentially slower with larger FileSizes
    (50000 lines or more).

    For performance-comparisons (also with the RC5-LineParser-Class, which
    is much faster), see the attached project further below...

    Quote Originally Posted by Bonnie West View Post
    A quick test revealed that both methods can handle both Unicode text and path.
    With the quite notable exception that only the ADO-Stream-Object can handle
    UTF8-Files properly (the FSO-Stream only supports UTF16-LE).

    UTF8 is the much more common format these days (due to the Web).
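
    For reference, reading a UTF-8 file with the ADO Stream differs from the UTF-16 case mainly in the Charset property. The following is a minimal late-bound sketch, assuming the file uses CR LF line endings; PrintUtf8Lines is a hypothetical name used only for illustration, and the constant values are the ones documented for ADODB.Stream.

    ```vb
    Option Explicit

    ' Prints every line of a UTF-8 text file and returns the number of lines read.
    Public Function PrintUtf8Lines(ByVal FilePath As String) As Long
        Const adTypeText = 2&
        Const adReadLine = -2&
        Const adCRLF = -1&              ' use 10 (adLF) for LF-only files

        With CreateObject("ADODB.Stream")
            .Type = adTypeText
            .Charset = "utf-8"          ' the default would be "Unicode" (UTF-16)
            .LineSeparator = adCRLF
            .Open
            .LoadFromFile FilePath

            Do Until .EOS
                Debug.Print .ReadText(adReadLine)
                PrintUtf8Lines = PrintUtf8Lines + 1
            Loop

            .Close
        End With
    End Function
    ```

    Calling PrintUtf8Lines "Utf8.txt" from the Immediate window should list the file's lines one by one; the file name here is just a placeholder.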

    The following Demo-Project:
    UnicodeParsing.zip

    compares ADO- and FSO-LineParsing, based on the content IliaPreston
    has posted (in UTF16-LE-Mode, since that's the only mode the FSO supports) -
    and a second comparison (using the same content, but encoded as UTF8)
    is also included, this time comparing ADO-UTF8-mode with the UTF8-Parser
    which comes with vbRichClient5.

    Here's a ScreenShot:


    RC5-Parsing being the fastest - and that timing includes not only the
    parsing of the current line's content into a String-Variable (as in the other 3 cases),
    but already splitting that Line-String into its two separate components
    (English and Russian words) - as well as the import of all lines as new
    records into a DB-Table, which then stands ready for convenient querying.

    Olaf

  12. #12
    Default Member Bonnie West's Avatar
    Join Date
    Jun 2012
    Location
    InIDE
    Posts
    4,060

    Re: How to read a unicode text file line by line.

    Quote Originally Posted by Schmidt View Post
    For performance-comparisons (also with the RC5-LineParser-Class, which
    is much faster), see the attached project further below...
    Thanks for that! Now we know which approach is best in a given scenario.

  13. #13

    Thread Starter
    Fanatic Member
    Join Date
    Mar 2010
    Posts
    762

    Re: How to read a unicode text file line by line.

    Thanks for all the great advice.
    One thing that I need to understand: if I use the ADO or the FSO approach, is there a guarantee that they loop through the file from the very first line to the second, then the third, and so on until the last line? Or is it possible that they lose the sequence, for example by reading the third line, then the first, then the 4th, then the 6th and then the 5th?
    Is the sequence guaranteed or can they read lines out of sequence?
    The reason that I ask this is that I am trying to search a text file for a substring and the program should find the line number where that substring is located. So, if the substring is found in a line that is read in the third run of the loop, can we be 100% sure that the substring is located in the 3rd line of the text file?

  14. #14
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    5,872

    Re: How to read a unicode text file line by line.

    Your question makes no sense to me.
    The byte order of a file will not change; this has nothing to do with the actual way the file is stored on a storage device.

  15. #15
    Default Member Bonnie West's Avatar
    Join Date
    Jun 2012
    Location
    InIDE
    Posts
    4,060

    Re: How to read a unicode text file line by line.

    Both the ADO Stream object and the FileSystemObject's TextStream always read lines sequentially, from the first through the last. The Stream object's ReadText documentation indirectly provides proof of this fact:

    Quote Originally Posted by MSDN
    ReadText cannot be used to read backwards.
    Note that the TextStream object returned by the FileSystemObject has a (read-only) property called Line "that returns the current line number in a TextStream file", so you don't have to keep track of the current line number yourself if you're using that object.
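
    As an illustration of the substring search described earlier in the thread, here is a minimal sketch that uses the Line property. It assumes a UTF-16 LE file and relies on Line starting at 1 and advancing after each ReadLine, which is why the value is captured before the line is read. FindLineNumber is a hypothetical name, and the function returns 0 when nothing is found.

    ```vb
    Option Explicit

    ' Returns the 1-based number of the first line containing Needle, or 0.
    Public Function FindLineNumber(ByVal FilePath As String, _
                                   ByVal Needle As String) As Long
        Const ForReading = 1&
        Const TristateTrue = -1&        ' open as Unicode (UTF-16)
        Dim LineNo As Long

        With CreateObject("Scripting.FileSystemObject")
            With .OpenTextFile(FilePath, ForReading, False, TristateTrue)
                Do Until .AtEndOfStream
                    LineNo = .Line      ' number of the line about to be read
                    If InStr(1, .ReadLine, Needle, vbBinaryCompare) > 0 Then
                        FindLineNumber = LineNo
                        Exit Do
                    End If
                Loop

                .Close
            End With
        End With
    End Function
    ```

    Keeping your own counter inside the loop would work just as well; the Line property merely saves that bookkeeping.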

  16. #16
    Frenzied Member
    Join Date
    Jan 2010
    Posts
    1,103

    Re: How to read a unicode text file line by line.

    Why has nobody given an API solution? Is the ReadFile API less efficient because it cannot easily distinguish line breaks (vbCrLf)?

  17. #17

    Thread Starter
    Fanatic Member
    Join Date
    Mar 2010
    Posts
    762

    Re: How to read a unicode text file line by line.

    Thanks for everybody's advice.
    Note that the FileSystemObject has a (read-only) property called Line "that returns the current line number in a TextStream file", so you don't have to keep track of the current line number yourself if you're using that object
    Thanks for that "Line" property.
    Is there also another property that would give me the total number of lines in the file?
    I searched for such a thing, and it looks like (correct me if I am wrong) there isn't one. For example, this page:
    https://msdn.microsoft.com/en-us/lib...=vs.84%29.aspx
    doesn't list any property that would give the total number of lines.
    So I would have to read lines until I reach the end of the file in order to obtain the total line count.
    Is there any better way of obtaining the total line count?
    Thanks.

  18. #18
    PowerPoster Arnoutdv's Avatar
    Join Date
    Oct 2013
    Posts
    5,872

    Re: How to read a unicode text file line by line.

    The number of lines can only be determined by processing the whole file,
    unless every line has the same length; in that case you can calculate the number of lines from the length of the first line and the total size of the file.
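
    Both options can be sketched as follows. This is only an illustration under stated assumptions: the file is UTF-16 LE with a 2-byte byte order mark, and for the fixed-length shortcut every line (including the last) has the same number of UTF-16 code units and ends in CR LF. CountLines and CountFixedLengthLines are hypothetical names.

    ```vb
    Option Explicit

    ' Counts lines by walking through the whole file without keeping the text.
    Public Function CountLines(ByVal FilePath As String) As Long
        Const ForReading = 1&, TristateTrue = -1&

        With CreateObject("Scripting.FileSystemObject")
            With .OpenTextFile(FilePath, ForReading, False, TristateTrue)
                Do Until .AtEndOfStream
                    .SkipLine
                    CountLines = CountLines + 1
                Loop

                .Close
            End With
        End With
    End Function

    ' Shortcut for files in which every line has exactly CharsPerLine characters.
    Public Function CountFixedLengthLines(ByVal FilePath As String, _
                                          ByVal CharsPerLine As Long) As Long
        Const BOM_BYTES As Long = 2
        Dim BytesPerLine As Long

        BytesPerLine = (CharsPerLine + 2) * 2   ' text + CR + LF, 2 bytes each
        CountFixedLengthLines = (FileLen(FilePath) - BOM_BYTES) \ BytesPerLine
    End Function
    ```

    The first function still touches every line, so it is no faster than reading the file; it only avoids storing the text.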

  19. #19
    Member
    Join Date
    Jun 2010
    Posts
    63

    Re: How to read a unicode text file line by line.

