Hi
I have a unicode text file (a file with extension "txt", but one that can contain unicode text for example Greek, Russian or Chinese text in the file).
Notepad can easily handle such a file.
However, how can I read such a file programmatically, line by line?
Currently, I can find a lot of examples in this forum that do something like this:
ListFNr = FreeFile
Open ListFilePath For Binary Access Read As #ListFNr
ReDim bytResults(LOF(ListFNr) - 1)
Get #ListFNr, , bytResults
Close #ListFNr
LineStr = Split(Mid(bytResults, 2), vbCrLf)
However, I don't want to read the ENTIRE file at once and then split it into lines.
I need to read the contents of the file, actually line by line.
I need to loop through the file, line by line, read one line at a time, process it, and then read the next line from the file and process it, and so on and so forth until I reach the end of the file.
All the above, I need to AVOID reading the entire content of the file in one shot.
How can I do that?
Thanks.
In my opinion the best way to do this would be to chunk it out:
1. Read 4096 bytes from the file.
2. Search that 4096 bytes for the proper sequence (CR LF if UTF-8, NUL CR NUL LF or CR NUL LF NUL if UTF-16)
3. If you find it, go process that line
4. If not read more data and keep searching
5. Loop till file end.
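The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the file is UTF-16 LE with a 2-byte BOM, lines end in CR LF, and the names ReadUtf16LinesInChunks and ProcessLine are made up for the example. It relies on VB6 copying the raw bytes of a Byte array into a String as UTF-16 when one is assigned to the other.

```vb
' Sketch only: assumes UTF-16 LE with a 2-byte BOM and CR LF line ends.
Private Sub ReadUtf16LinesInChunks(ByVal FilePath As String)
    Const CHUNK_BYTES As Long = 4096    ' even, so no UTF-16 code unit is split
    Dim FNr As Integer, Remaining As Long, EolPos As Long
    Dim Chunk() As Byte, Piece As String, Buffer As String

    FNr = FreeFile
    Open FilePath For Binary Access Read As #FNr
    Remaining = LOF(FNr) - 2            ' bytes left after the BOM
    Seek #FNr, 3                        ' 1-based position just past the BOM

    Do While Remaining > 0
        ' Step 1: read the next chunk (the last one may be shorter)
        If Remaining < CHUNK_BYTES Then
            ReDim Chunk(0 To Remaining - 1)
        Else
            ReDim Chunk(0 To CHUNK_BYTES - 1)
        End If
        Get #FNr, , Chunk
        Remaining = Remaining - (UBound(Chunk) + 1)

        Piece = Chunk                   ' Byte array -> String copies the bytes as UTF-16
        Buffer = Buffer & Piece

        ' Steps 2 to 4: hand over every complete line found so far
        EolPos = InStr(1, Buffer, vbCrLf, vbBinaryCompare)
        Do While EolPos > 0
            ProcessLine Left$(Buffer, EolPos - 1)
            Buffer = Mid$(Buffer, EolPos + 2)
            EolPos = InStr(1, Buffer, vbCrLf, vbBinaryCompare)
        Loop
    Loop
    Close #FNr

    ' A last line without a trailing CR LF is still in the buffer
    If Len(Buffer) > 0 Then ProcessLine Buffer
End Sub

' Hypothetical per-line handler; replace with the real processing.
Private Sub ProcessLine(ByVal LineText As String)
    Debug.Print LineText
End Sub
```

Because the unprocessed tail of each chunk stays in Buffer, a CR LF pair that happens to straddle two chunks is still found once the next chunk has been appended.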
Here are (late-bound!) examples of the 2 simplest solutions commonly used for this kind of task:
Code:
Private Sub Form_Load()
#If UseADO Then
    ' ADODB.Stream: defaults to text mode with Charset "Unicode" (UTF-16)
    Const adReadLine = -2&
    With CreateObject("ADODB.Stream")
        .Open
        .LoadFromFile "Unicode.txt"
        Do Until .EOS
            Debug.Print .ReadText(adReadLine)   ' reads up to the next line separator
        Loop
        .Close
    End With
#Else 'UseFSO
    Const TristateTrue = -1&    ' open the file as Unicode (UTF-16 LE)
    With CreateObject("Scripting.FileSystemObject")
        With .OpenTextFile("Unicode.txt", Format:=TristateTrue)
            Do Until .AtEndOfStream
                Debug.Print .ReadLine
            Loop
            .Close
        End With
    End With
#End If
End Sub
Last edited by Bonnie West; Jul 1st, 2015 at 04:46 AM.
On Local Error Resume Next: If Not Empty Is Nothing Then Do While Null: ReDim i(True To False) As Currency: Loop: Else Debug.Assert CCur(CLng(CInt(CBool(False Imp True Xor False Eqv True)))): Stop: On Local Error GoTo 0
Thanks to both of you for the great advice.
I tried the ADO and the FSO methods, and both of them worked perfectly.
I just have a couple of questions about them:
1. With the ADO method, you used a .Close statement to close the file in the end.
With the FSO method, you didn't close the file. Why?
How can I close the file at the end of that loop?
2. Which one of these two methods (ADO and FSO) is superior?
3. Let's say in addition to the contents of the file being unicode text, the name of the file (and the folder where the file resides) is also unicode. In that case, which one of the two methods is superior?
Please advise.
Thanks.
1. With the ADO method, you used a .Close statement to close the file in the end.
With the FSO method, you didn't close the file. Why?
How can I close the file at the end of that loop?
Sorry, I forgot to include that statement. I suspect, however, that the FSO implicitly closes the text file when the TextStream object is set to Nothing (i.e., during its Class_Terminate() event), so explicitly invoking the Close method is probably optional. I'm not really sure about this though, so you may still want to call that method just to be safe. I've included that statement in the above code now.
Originally Posted by IliaPreston
2. Which one of these two methods (ADO and FSO) is superior?
If you want to know which one is more efficient/faster, sorry, but I have no idea. I haven't done any benchmarks that compare the two because it is pointless anyway. Both are slower than the equivalent Windows APIs, so by using either of the 2 approaches, you accept that it isn't the optimal solution. That's usually the case in programming: simple solutions are typically less efficient, while low-level methods are more complicated and thus harder to understand and debug.
Originally Posted by IliaPreston
3. Let's say in addition to the contents of the file being unicode text, the name of the file (and the folder where the file resides) is also unicode. In that case, which one of the two methods is superior?
A quick test revealed that both methods can handle both Unicode text and path.
Can someone post a unicode test file so I can use it to test with? Thanks.
Anything I post is an example only and is not intended to be the only solution, the total solution nor the final solution to your request nor do I claim that it is. If you find it useful then it is entirely up to you to make whatever changes necessary you feel are adequate for your purposes.
Thanks a lot Bonnie West for the new explanations.
I believe the two solutions that you provided are great and I will use one of them.
However, I can not skip this question, because I am curious:
Is there also any third way of doing it? What I specifically mean is how to do it by Shell32 object.
A whole lot of file manipulation can be done by Shell32, so I guess this specific task (line reading from a unicode text file) should be also possible by Shell32.
If I am right, then how can it be done by Shell32?
One more time, it is just because of curiosity as the other two methods are fine.
I just cannot convince myself to remain in the dark on the Shell32 possibility.
Thanks
Can someone post a unicode test file so I can use it to test with? Thanks.
Notepad can be used to create text files encoded in either UTF-8, UTF-16 LE (Unicode) or UTF-16 BE (Unicode big endian). You can generate Unicode text by using Google Translate.
Originally Posted by Bonnie West
I suspect, however, that the FSO implicitly closes the text file when the TextStream object is set to Nothing (i.e., during its Class_Terminate() event), so explicitly invoking the Close method is probably optional.
It looks like my suspicion was correct. I've observed via Process Hacker that the TextStream object does close its internal file handle (in case it wasn't closed yet) just prior to being terminated.
Is there also any third way of doing it? What I specifically mean is how to do it by Shell32 object.
A whole lot of file manipulation can be done by Shell32, so I guess this specific task (line reading from a unicode text file) should be also possible by Shell32.
If I am right, then how can it be done by Shell32?
I don't think there's any Shell Object that can read and write to text files.
If you want to know which one is more efficient/faster, sorry, but I have no idea.
It seems that the FSO line reader starts out a little slower than the ADO Stream
with smaller files (around 10000 lines), but it scales linearly with larger file sizes,
whilst ADO Stream reading gets exponentially slower with larger file sizes
(50000 lines or more).
For performance-comparisons (also with the RC5-LineParser-Class, which
is much faster), see the attached project further below...
Originally Posted by Bonnie West
A quick test revealed that both methods can handle both Unicode text and path.
With the quite notable exception that only the ADO Stream object can handle
UTF-8 files properly (of the Unicode encodings, the FSO TextStream only supports UTF-16 LE).
UTF-8 is the much more common format these days (due to the Web).
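For illustration, a minimal late-bound sketch of reading a UTF-8 file line by line with the ADO Stream might look like this. The routine name ReadUtf8Lines is made up for the example, and it assumes the file uses CR LF line ends, which is the Stream's default line separator.

```vb
' Sketch: read a UTF-8 text file line by line with ADODB.Stream (late-bound).
Private Sub ReadUtf8Lines(ByVal FilePath As String)
    Const adTypeText = 2&
    Const adReadLine = -2&

    With CreateObject("ADODB.Stream")
        .Type = adTypeText
        .Charset = "utf-8"      ' set before loading, so the bytes are decoded as UTF-8
        .Open
        .LoadFromFile FilePath
        Do Until .EOS
            Debug.Print .ReadText(adReadLine)
        Loop
        .Close
    End With
End Sub
```

The only difference from the UTF-16 case is the Charset property; without it the Stream decodes the file with its default "Unicode" charset.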
The attached project compares ADO and FSO line parsing, based on the content IliaPreston
has posted (in UTF-16 LE mode, since that's the only Unicode mode the FSO supports),
and a second comparison (using the same content, but encoded as UTF-8)
is also included, this time comparing the ADO UTF-8 mode with the UTF-8 parser
that comes with vbRichClient5.
Here's a ScreenShot:
RC5 parsing is the fastest, and that timing includes not only the
parsing of the current line's content into a String variable (as in the other 3 cases),
but also splitting that line string into its two separate components
(English and Russian words), as well as the import of all lines as new
records into a DB table, which then stands ready for convenient querying.
For performance-comparisons (also with the RC5-LineParser-Class, which
is much faster), see the attached project further below...
Thanks for that! Now we know which approach is best in a given scenario.
Thanks for all the great advice.
One thing that I need to understand: if I use the ADO or the FSO approach, is there a guarantee that they loop through the file in order, from the very first line to the second, the third, and so on until the last line? Or is it possible that they lose the sequence, for example by reading the third line, then the first, then the 4th, then the 6th, and then the 5th?
Is the sequence guaranteed, or can they read lines out of sequence?
The reason I ask is that I am trying to search a text file for a substring, and the program should find the line number where that substring is located. So, if the substring is found in a line that is read in the third run of the loop, can we be 100% sure that the substring is located in the 3rd line of the text file?
Your question makes no sense to me.
The byte order of a file does not change; this has nothing to do with the way the file is physically stored on a storage device.
Both the ADO Stream object and the FileSystemObject always read lines sequentially from the first through the last. The Stream object's ReadText documentation indirectly provides proof of this fact:
Originally Posted by MSDN
ReadText cannot be used to read backwards.
Note that the FileSystemObject has a (read-only) property called Line "that returns the current line number in a TextStream file", so you don't have to keep track of the current line number yourself if you're using that object.
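As a rough sketch of how the Line property could be used for the substring search described above (the function name FindLineNumber is made up for the example, and the file is assumed to be UTF-16 LE):

```vb
' Sketch: returns the 1-based number of the first line containing Needle,
' or 0 if no line contains it.
Private Function FindLineNumber(ByVal FilePath As String, ByVal Needle As String) As Long
    Const ForReading = 1&
    Const TristateTrue = -1&        ' open as Unicode (UTF-16 LE)
    Dim LineNr As Long

    With CreateObject("Scripting.FileSystemObject")
        With .OpenTextFile(FilePath, ForReading, False, TristateTrue)
            Do Until .AtEndOfStream
                LineNr = .Line      ' number of the line that is about to be read
                If InStr(1, .ReadLine, Needle, vbBinaryCompare) > 0 Then
                    FindLineNumber = LineNr
                    Exit Do
                End If
            Loop
            .Close
        End With
    End With
End Function
```

The line number is captured before calling ReadLine because Line refers to the current position in the stream, which moves on once the line has been read.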
Note that the FileSystemObject has a (read-only) property called Line "that returns the current line number in a TextStream file", so you don't have to keep track of the current line number yourself if you're using that object
Thanks for that "Line" property.
Is there also another property that would give me the total number of lines in the file?
I searched for such a thing and it looks like (correct me if I am wrong) there isn't one. For example this page: https://msdn.microsoft.com/en-us/lib...=vs.84%29.aspx
doesn't list any such property that would give the total number of lines.
I would read lines until I reach the end of the file in order to obtain the total count of the file lines then.
Is there any better way of obtaining the total count of the file lines?
Thanks.
The number of lines can only be determined by processing the whole file.
The exception is when every line has the same length: then you can calculate the number of lines from the length of the first line and the total size of the file.
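A sketch of that calculation for the fixed-length case, under loud assumptions: the file is UTF-16 LE with a 2-byte BOM, every line (including the last) ends in CR LF, and every line has exactly the same number of characters. The function name CountFixedLengthLines is made up for the example.

```vb
' Sketch: line count for a file in which every line has the same length.
' Assumes UTF-16 LE, a 2-byte BOM and CR LF after every line.
Private Function CountFixedLengthLines(ByVal FilePath As String) As Long
    Const ForReading = 1&
    Const TristateTrue = -1&
    Dim BytesPerLine As Long, FileBytes As Double

    With CreateObject("Scripting.FileSystemObject")
        FileBytes = .GetFile(FilePath).Size
        With .OpenTextFile(FilePath, ForReading, False, TristateTrue)
            If Not .AtEndOfStream Then
                ' characters in the first line + CR + LF, 2 bytes each
                BytesPerLine = (Len(.ReadLine) + 2) * 2
            End If
            .Close
        End With
    End With

    If BytesPerLine > 0 Then
        ' subtract the BOM, then divide by the size of one line
        CountFixedLengthLines = Int((FileBytes - 2) / BytesPerLine)
    End If
End Function
```

For example, a file with 3 lines of 10 characters each is 2 + 3 * 24 = 74 bytes, and (74 - 2) / 24 gives 3. If any of the assumptions does not hold, reading the file to the end and counting remains the safe approach.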