Need some direction here.
I'm trying to read a txt file as binary and split each line's content into a string.
Currently the code I use is:
Code:
...
Open FilePath For Binary As #ff
Do While Not EOF(ff)
    strLine = strLine & InputB(1, #ff)
Loop
Close #ff
I would like to split it into lines so I can reuse a long function I already wrote for text reading,
which uses Line Input, without working too hard to Split() and parse the complete file string (strLine) from scratch.
Code:
...
Open path For Input As #fileNo
Line Input #fileNo, tempString
Label1.Caption = tempString
Line Input #fileNo, tempString
Label2.Caption = tempString
....
Close #fileNo
As seen above, each read in my loop is only 1 byte long.
I see two options here:
either splitting the complete text by vbCrLf, or finding the byte sequence for vbCrLf and parsing each line with it.
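For illustration, the second option could look roughly like this (an untested sketch; strLine is assumed to be the raw byte string built in the loop above, and the helper the lines are handed to is hypothetical):
Code:
```vb
Dim crlfB As String, pos As Long, nextPos As Long, rawLine As String
crlfB = ChrB$(13) & ChrB$(10)    ' the CrLf *byte* sequence
pos = 1
Do
    nextPos = InStrB(pos, strLine, crlfB)
    If nextPos = 0 Then Exit Do
    ' extract the bytes of one line and convert them to a VB string
    rawLine = StrConv(MidB$(strLine, pos, nextPos - pos), vbUnicode)
    ' ... hand rawLine to the existing line-handling routine here
    pos = nextPos + 2            ' skip past the two CrLf bytes
Loop
```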
Re: Reading a text file, as Binary, Line by Line ?
To get a Unicode or ASCII text file into a string, and make vbCrLf the default paragraph break, use this function:
Code:
Public Function ReadUnicodeOrAscii(ByVal f$) As String
    If f$ = "" Then Exit Function
    Dim W As Long, i As Long, buf$, mw As Long, maxmw As Long, buf2$
    maxmw = 32000 'check it for maxmw=200
    Dim a() As Byte
    W = FreeFile
    On Error Resume Next
    Err.Clear
    mw = FileLen(f$)
    If Err.Number > 0 Then Exit Function
    If mw < 2 Then Exit Function
    Open f$ For Binary As W
    a() = ChrW(0)
    Get #W, , a()
    buf$ = a()
    If buf$ <> ChrW(&HFEFF) Then
        ' no unicode
        buf$ = ""
        Seek #W, 1
        If maxmw > mw Then maxmw = mw
        While mw > 0
            If mw < maxmw Then
                ReDim a(mw - 1) As Byte
                Get #W, , a()
                buf$ = buf$ + StrConv(a(), vbUnicode, 0)
                mw = 0
            Else
                ReDim a(maxmw - 1) As Byte
                Get #W, , a()
                buf$ = buf$ + StrConv(a(), vbUnicode, 0)
                mw = mw - maxmw
            End If
        Wend
    Else
        buf$ = ""
        mw = mw - 2 ' exclude 2 bytes FEFF
        If maxmw > mw Then maxmw = mw
        While mw > 0
            If mw < maxmw Then
                ReDim a(mw - 1) As Byte
                Get #W, , a()
                buf2$ = a()
                buf$ = buf$ + buf2$
                mw = 0
            Else
                ReDim a(maxmw - 1) As Byte
                Get #W, , a()
                buf2$ = a()
                buf$ = buf$ + buf2$
                mw = mw - maxmw
            End If
        Wend
        If InStr(1, buf$, vbCrLf) = 0 Then
            Dim parts() As String 'Split returns a String array, not Byte
            parts = Split(buf$, ChrW(&HD)) ' if we have only vbCR...
            buf$ = Join(parts, vbCrLf)
        End If
        buf$ = Left$(buf$, Len(buf$))
    End If
    Close W
    ReadUnicodeOrAscii = buf$
End Function
These days there is little point in handling a disk file line by line (we have gigabytes of RAM), so read it all at once and then walk through the lines easily.
Get the mydoc class, which uses the above function to place all text paragraphs in the class (I use a doubly linked list and a system to reuse the deleted paragraph holders):
Dim a As New mydoc
a.editdoc = ReadUnicodeOrAscii("c:\that.txt")
Debug.Print a.DocLines, a.DocParagraphs
There is no break function, but if you declare
Public WithEvents a As mydoc
and in Form_Load write
Set a = New mydoc
then you get a
Private Sub a_BreakLine(Data As String, datanext As String)
    ' do something here
End Sub
But you can also use mydoc without breaking the lines, so that each paragraph is a single line.
Walking through the lines is easy:
Dim i As Long
For i = 1 To a.DocLines
    Debug.Print a.TextLine(i)
Next i
You can insert, append or delete paragraphs...
you can save the text in mydoc using this function
Code:
Public Function SaveUnicode(ByVal f$, ByVal buf As String) As Boolean
    Dim W As Long, a() As Byte
    On Error GoTo t12345
    If f$ <> "" Then Kill f$
    If Err.Number > 0 Then Exit Function
    W = FreeFile
    DoEvents
    Open f$ For Binary As W
    buf = ChrW(&HFEFF) + buf
    Dim maxmw As Long, ipos As Long
    maxmw = 32000 ' check it with maxmw 20 OR 1
    For ipos = 1 To Len(buf) Step maxmw
        a() = Mid$(buf, ipos, maxmw)
        Put #W, , a()
    Next ipos
    Close W
    SaveUnicode = True
t12345:
End Function
Re: Reading a text file, as Binary, Line by Line ?
@George: That is needlessly overcomplicated.
Splitting the complete text on vbCrLf is the easiest way to get all lines of a text file.
Code:
Dim strLines() As String, ff As Integer, i As Long
ff = FreeFile
Open FilePath For Binary As #ff
strLines = Split(Input(LOF(ff), #ff), vbCrLf)
Close #ff
For i = 0 To UBound(strLines)
    ' Do something with strLines(i)
Next i
Re: Reading a text file, as Binary, Line by Line ?
The myDoc class is somewhat bigger than a plain string array that can be filled with the Split function, so it is not strictly needed (it is offered here more as an idea of what a nice class for document processing could be).
Because a txt file can be in Unicode, I prefer to read it with ReadUnicodeOrAscii, which I posted before. In your code I think Input cannot fetch more than 32K chars, but maybe I am wrong. Have you tested your code with a big text file?
Re: Reading a text file, as Binary, Line by Line ?
The biggest file I have read in using the Input method as above is just over 500 MB, no problem.
I would expect it to fail if you exceed your memory size or if the file is over 2 GB.
Also note that reading 1 byte at a time as in the OP is the slowest possible way to read a file; it could take thousands of times longer than a method that reads larger chunks at once.
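For illustration, a chunked version of the loop from the first post might look like this (an untested sketch: strLine and FilePath are the OP's variables, the 32 KB chunk size is arbitrary, and StrConv here treats the bytes as ANSI text):
Code:
```vb
Dim ff As Integer, b() As Byte, remaining As Long, chunk As Long
ff = FreeFile
Open FilePath For Binary As #ff
remaining = LOF(ff)
Do While remaining > 0
    chunk = remaining
    If chunk > 32768 Then chunk = 32768   ' read up to 32 KB per Get
    ReDim b(0 To chunk - 1)
    Get #ff, , b                          ' one disk round-trip per chunk
    strLine = strLine & StrConv(b, vbUnicode) ' bytes -> ANSI string
    remaining = remaining - chunk
Loop
Close #ff
```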
Last edited by DataMiser; Oct 7th, 2014 at 01:57 PM.
Re: Reading a text file, as Binary, Line by Line ?
Originally Posted by dilettante
Why do you want to open the file as a binary file if it is a text file? Is there something you didn't mention?
Reading Unicode\UTF-8\big-endian text files requires me to read them as binary - otherwise, VB6 automatically converts the text to ANSI.
Originally Posted by DataMiser
Also note that reading 1 byte at a time as in the OP is the slowest possible way to read a file and could take thousands of times longer to read the file than would a method that reads larger chunks at once.
You are correct, I actually overlooked this.
The 1-byte buffer was originally meant to detect a vbCrLf byte sequence,
but looking back now, it just seems silly to keep the buffer this small. Thank you for that.
@Georgekar
Thanks, but it is overkill for such a task.
Re: Reading a text file, as Binary, Line by Line ?
Originally Posted by stum
Reading Unicode\UTF-8\big-endian text files requires me to read them as binary - otherwise, VB6 automatically converts the text to ANSI.
Then why not simply read it first into a Byte array completely:
Code:
Sub Test()
    Dim B() As Byte
    B = ReadFileBytes("C:\Tests\SomeUnicode.txt")
    'now, one can check the first few Bytes of B() for BOM-information
    'or, if the format is known to be e.g. UTF8, pass the Array directly
    'into an UTF8toVBString-routine
    'then Split the retrieved Unicode-String as you like (e.g. on vbCrLf)
End Sub

Function ReadFileBytes(FileName As String) As Byte()
    Dim FNr&: FNr = FreeFile
    Open FileName For Binary Access Read As FNr
    ReDim ReadFileBytes(0 To LOF(FNr) - 1)
    Get FNr, , ReadFileBytes
    Close FNr
End Function
Re: Reading a text file, as Binary, Line by Line ?
Another solution is to pre-process your file in Notepad.
Open your file in Notepad and save it as ANSI; all UTF-8 code points will be correctly translated to their ANSI equivalents according to the active codepage of your system. After having tried a number of solutions internal to VB, I have pretty much standardized on this pre-processing trick.
Re: Reading a text file, as Binary, Line by Line ?
Are you talking Unicode or UTF-8? The Unicode that VB6 handles is always two bytes per character. UTF-8 is not Unicode! A UTF-8 character can vary from 1 to 4 bytes, with the lower part of ASCII (less than 128) always a single byte. The UTF-8 specification was made to be backward compatible with ASCII. ANSI is a one-byte system based on a regional code page. A UTF-8 file has a unique 3-byte BOM at the beginning, and converting the remainder of the bytes one at a time is not trivial. I still recommend the Notepad trick... If you need a working piece of code, holler and I'll post something that will do the job.
Re: Reading a text file, as Binary, Line by Line ?
Originally Posted by Navion
Are you talking Unicode or UTF-8? The Unicode that VB6 handles is always two bytes per character. UTF-8 is not Unicode!
Well, as a valid Unicode encoding, UTF-8 surely is Unicode.
All in all it is a really nice and efficient format for storage and transfers, because it not only
covers the whole Unicode range but is also capable of transporting chars in the ASCII range (unencoded).
Originally Posted by Navion
A UTF-8 file has a unique 3-byte BOM at the beginning
Sometimes, but not necessarily.
Originally Posted by Navion
and converting the remainder of the bytes one at a time is not trivial.
I still recommend the Notepad trick...
Nah, there's no need for an external application when you can do the same
ANSI conversion yourself directly in VB code - although one could ask:
why convert an already existing, decoded Unicode VB-String back into ANSI?
Anyway, here's some code which shows how to read (UTF8-)ByteArrays from
a file, then convert them (with or without BOM) into a Unicode VB-String.
Code:
Option Explicit

Private Declare Function MultiByteToWideChar& Lib "kernel32" (ByVal CodePage&, ByVal dwFlags&, MultiBytes As Any, ByVal cBytes&, ByVal pWideChars&, ByVal cWideChars&)
Private Declare Function TextOutW& Lib "gdi32" (ByVal hDC&, ByVal x&, ByVal y&, ByVal pS&, ByVal LenS&)

Private Sub Form_Click()
    Dim B() As Byte
    B = ReadFileBytes("C:\Tests\SomeUtf8.txt")

    Dim S As String
    S = UTF8ToVBString(B)
    If StrPtr(S) Then TextOutW hDC, 4, 4, StrPtr(S), Len(S)

    'just in case ANSI-W conversion of the Uni-VBString is needed
    '(that's similar to your suggested NotePad-method)
    S = StrConv(StrConv(S, vbFromUnicode), vbUnicode)
    Print vbLf; vbLf; " "; S
End Sub

Public Function UTF8ToVBString(B() As Byte) As String
    Dim LB As Long, Bytes As Long, WChars As Long
    On Error GoTo ReturnEmptyString
    LB = LBound(B) + IIf(HasUTF8BOM(B), 3, 0)
    Bytes = UBound(B) - LB + 1
    WChars = MultiByteToWideChar(65001, 0, B(LB), Bytes, 0, 0)
    UTF8ToVBString = Space$(WChars)
    MultiByteToWideChar 65001, 0, B(LB), Bytes, StrPtr(UTF8ToVBString), WChars
ReturnEmptyString:
End Function

Public Function HasUTF8BOM(B() As Byte) As Boolean
    Dim LB&: LB = LBound(B)
    If UBound(B) - LB > 1 Then HasUTF8BOM = (B(LB) = 239 And B(LB + 1) = 187 And B(LB + 2) = 191)
End Function

Public Function ReadFileBytes(FileName As String) As Byte()
    If FileLen(FileName) = 0 Then ReadFileBytes = vbNullString: Exit Function
    Dim FNr&: FNr = FreeFile
    Open FileName For Binary Access Read As FNr
    ReDim ReadFileBytes(0 To LOF(FNr) - 1)
    Get FNr, , ReadFileBytes
    Close FNr
End Function
Re: Reading a text file, as Binary, Line by Line ?
Originally Posted by Schmidt
Sometimes, but not necessarily.
Olaf
That's true, but without a BOM there is no way to know what the file contains and whether there is encoding in that particular file, unless maybe it's an XML or web file with a UTF-8 tag.
At any rate, I set my mind last week on writing a typesetting program in VB6. My provider of raw text files delivers all sorts of flavors: ANSI, UTF-8 and also a number of other UTF formats, some other ISO ones as well, with a 2-byte BOM. So I delved into the matter seriously. After much study and some code writing, I was ready to make what I think are sound decisions for my own needs.
I looked at a number of code examples on the web and found out I did not like all that much the MultiByteToWideChar approach and decided not to use it. Instead, I found a piece of code, all VB, without API calls, that dealt with the matter in what I consider, after exhaustive tests, as flawless and in a manner that is more compatible with my programming style and preferences. Without a BOM of either 2 or 3 bytes, as per established standards, I will tend to treat a file as ANSI.
Now, the text files I am going to work with are thousands of pages long. I am quite experienced with dealing with super-large text files, as I built a special C++ I/O stream class to deal with huge ANSI DXF files coming from the Unix world, often 200 megabytes in size or more, for which standard VB6 file I/O proved not the best tool. As I prefer to do as little C++ as I can get away with, I decided not to modify that class to add multi-byte recognition.
All factors weighed in, the solution that worked best for me, as far as the typesetting business goes, is to pre-process the UTF-8 and UTF-16 files in Notepad so that they are compatible with my ANSI I/O stream class without modification. As a bonus, the whole exercise led me to build a pure VB6 (no API) UTF-8-only I/O stream class for when the need arises to quickly convert smaller, BOM-compliant text files on the fly.
Note also that once Notepad rolls the dice and makes its guess it calls... you guessed it... MultiByteToWideChar as necessary.
BTW, as far as I can tell from available documentation DXF files should never contain ANSI, but only either escaped ASCII or Unicode. Which Unicode encoding is a mystery as they seem to be incredibly poor about expressing such things, but from their ramblings I'd guess UTF-8. They also seem to just ram ANSI-1252 into ASCII fields and let the chips fall where they may.
This sounds like some scary "house of cards" software to me, with the stink of Unix all over it.
I fail to see how slamming such data through Notepad does anything but cause more grief.
Re: Reading a text file, as Binary, Line by Line ?
Originally Posted by dilettante
BTW, as far as I can tell from available documentation DXF files should never contain ANSI, but only either escaped ASCII or Unicode.
Actually, about the graphics files from Unix: I just made the story short by calling them ANSI to simplify things a bit. There are a few flavors there too. The DXF files are pure ASCII, but they are non-standard in that lines are delimited with Lf only instead of CrLf. They are too large to be managed by VB6's Split, and Line Input does not work with them, each line being too long, which makes parsing per the Autodesk specifications a nightmare. Along with these, I also get a number of fairly large .PS, .PDF and .EPS files containing RLE-encoded sections, and the .EPS ones have a binary header of variable length, all of them Lf-delimited only. Stream processing was the best option to tackle all those problems, since I had access to the encoding C++ libraries provided by the customer.
The UTF-8 and similar ones are for a personal project. I ran tests with my own UTF-8 library and the Notepad option and the results are consistently identical. I insisted that the provider deliver UTF-8, UTF-16 or ANSI. I will not get into other specifications, so I think I am good there.
Re: Reading a text file, as Binary, Line by Line ?
Read the code... it handles no UTF-8, only 2-byte UTF-16 (not the variants with 4 bytes or more). Also, I check whether ChrW(&HFEFF) is at the start of the file to decide what to do next.
As for vbLf in place of vbCrLf... we have to do something there. Is the vbLf a wide char or a single byte?
Last edited by georgekar; Oct 10th, 2014 at 04:16 AM.
Re: Reading a text file, as Binary, Line by Line ?
Originally Posted by Navion
I looked at a number of code examples on the web and found out I did not like all that much the MultiByteToWideChar approach and decided not to use it. Instead, I found a piece of code, all VB, without API calls, that dealt with the matter in what I consider, after exhaustive tests, as flawless ...
Well, in the case of UTF8-decoding I'm nearly 100% sure that the "piece of VB code without APIs" you
found on the Web will produce *different* results from the system's MultiByteToWideChar API call - and
those differing results will, with high probability, be wrong.
On a side-note ... since when is "Code-snippets which avoid external libs" a criterion for quality?
You might be lucky with your current choice (perhaps because the Input you feed in, is coming from
a "range" which is understood and handled well enough) - but as said, the System-API was tested in
*all* possible scenarios on all Unicode-ranges - you're doing yourself no favour in not choosing it for
UTF8-decoding (and the mapping to 16bit-WideChars).
Originally Posted by Navion
All factors weighted in, the solution that worked best for me, as far as the typesetting business goes is to pre-process the UTF-8 and UTF-16's in notepad so that they be compatible with my ANSI I/O stream class without modification.
But you're losing information this way... in case your UTF8 text blob contained a mix of English, Cyrillic and Chinese chars,
you will lose information when you load it as UTF8 into Notepad and then save it from there as ANSI.
Re: Reading a text file, as Binary, Line by Line ?
Inside a VB string we have Unicode wide chars. I posted a simple routine that opens a text file in a Windows environment that was previously saved from Notepad as ASCII or as Unicode. If we deal with something else, then we have to write the right decoder; MultiByteToWideChar is the best for the UTF-8 to wide-char conversion.
The preparation method of loading and re-saving in Notepad makes sense for a user, not for a programmer - it accomplishes the task once. For a programmer, this preparation must be done in some automatic way, so we need a routine to do it, and the OS provides MultiByteToWideChar for the conversion.
Re: Reading a text file, as Binary, Line by Line ?
Originally Posted by georgekar
I posted a simple routine that opens a text file in a Windows environment that was previously saved from Notepad as ASCII or as Unicode...
In case you mean the routine you posted in #2, this is not really recommendable
(since it still contains at least one bug and is not written very efficiently).
Below is an alternative routine which is not longer than yours, but does about twice as much
(including UTF8- as well as 16bit-BigEndian-Decoding and Detection of all 3 BOM-types).
Code:
Declare Function MultiByteToWideChar& Lib "kernel32" (ByVal CodePage&, ByVal dwFlags&, MultiBytes As Any, ByVal cBytes&, ByVal pWideChars&, ByVal cWideChars&)

Function ReadUnicodeOrANSI(FileName As String, Optional ByVal EnsureWinLFs As Boolean) As String
    Dim i&, FNr&, BLen&, WChars&, BOM As Integer, BTmp As Byte, B() As Byte
    On Error GoTo ErrHandler

    BLen = FileLen(FileName)
    If BLen = 0 Then Exit Function

    FNr = FreeFile
    Open FileName For Binary Access Read As FNr
    Get FNr, , BOM

    Select Case BOM
        Case &HFEFF, &HFFFE 'one of the two possible 16 Bit BOMs
            If BLen >= 3 Then
                ReDim B(0 To BLen - 3): Get FNr, 3, B 'read the Bytes
                If BOM = &HFFFE Then 'big endian, so lets swap the byte-pairs
                    For i = 0 To UBound(B) Step 2
                        BTmp = B(i): B(i) = B(i + 1): B(i + 1) = BTmp
                    Next
                End If
                ReadUnicodeOrANSI = B
            End If
        Case &HBBEF 'the start of a potential UTF8-BOM
            Get FNr, , BTmp
            If BTmp = &HBF Then 'it's indeed the UTF8-BOM
                If BLen >= 4 Then
                    ReDim B(0 To BLen - 4): Get FNr, 4, B 'read the Bytes
                    WChars = MultiByteToWideChar(65001, 0, B(0), BLen - 3, 0, 0)
                    ReadUnicodeOrANSI = Space$(WChars)
                    MultiByteToWideChar 65001, 0, B(0), BLen - 3, StrPtr(ReadUnicodeOrANSI), WChars
                End If
            Else 'not an UTF8-BOM, so read the whole Text as ANSI
                ReadUnicodeOrANSI = Space$(BLen)
                Get FNr, 1, ReadUnicodeOrANSI
            End If
        Case Else 'no BOM was detected, so read the whole Text as ANSI
            ReadUnicodeOrANSI = Space$(BLen)
            Get FNr, 1, ReadUnicodeOrANSI
    End Select

    If EnsureWinLFs And InStr(ReadUnicodeOrANSI, vbCrLf) = 0 Then
        If InStr(ReadUnicodeOrANSI, vbLf) Then
            ReadUnicodeOrANSI = Replace(ReadUnicodeOrANSI, vbLf, vbCrLf)
        ElseIf InStr(ReadUnicodeOrANSI, vbCr) Then
            ReadUnicodeOrANSI = Replace(ReadUnicodeOrANSI, vbCr, vbCrLf)
        End If
    End If

ErrHandler:
    If FNr Then Close FNr
    If Err Then Err.Raise Err.Number, Err.Source & ".ReadUnicodeOrANSI", Err.Description
End Function
Edit:
Correction of a Copy&Paste-mistake in the EnsureWinLFs-section at the end of the above routine...
Changed the old Line:
ElseIf InStr(ReadUnicodeOrANSI, vbLf) Then
To the new one:
ElseIf InStr(ReadUnicodeOrANSI, vbCr) Then
Thanks to Bonnie West for pointing out that bug...
Olaf
Last edited by Schmidt; Oct 10th, 2014 at 07:45 PM.
Reason: Code-correction in the EnsureWinLFs-section
Re: Reading a text file, as Binary, Line by Line ?
Are you sure that you can get as many bytes as the length of a byte array in one read operation? Or is there a limit of 32768 or something like that when reading from a file opened for Binary access?
Re: Reading a text file, as Binary, Line by Line ?
Mr. Schmidt,
your code is OK...
So your code works for UTF-8 as well. I think this code can be inserted into my M2000 interpreter (the latest version)...
I have to remember why I used partial reads and not one read for everything... Where did you find a bug in my code?
Re: Reading a text file, as Binary, Line by Line ?
Originally Posted by georgekar
I have to remember why I use partial reads...and not one read for all...
One should use partial reads on huge files which are larger than, say, 100-200 MB,
because above that the memory allocator could start to choke whilst attempting to
reserve consecutive memory (for the Byte array or VB string which is supposed to
hold the file contents).
Aside from that consideration, VB's Binary file mode is in principle able to feed directly into
(variable-provided) allocations of any size.
Originally Posted by georgekar
Where you found a bug in my code?
For example, your routine will not read the content of a file which contains only the character "A",
because of these lines, which come right at the top of the function:
mw = FileLen(f$)
...
If mw < 2 Then Exit Function
The rest I saw are not really "bugs" - but by not checking for e.g. the quite common 3-byte UTF8-BOM,
you currently decode such a UTF8 file with your ANSI decoder (producing garbage then);
also, your vbCr-to-vbCrLf replacement block sits exclusively under the 16-bit Unicode code path,
not under the ANSI one (although it would make sense to do it in both cases) - and you left
out the check for "plain vbLFs" (which are quite common in Unix text files, e.g. open C sources).
Your code for that part is not really efficient either (the Split-Join thing you do is slower than
Replace with a preceding InStr check).
Code:
If InStr(1, Buf$, vbCrLf) = 0 Then
A() = Split(Buf$, ChrW(&HD)) ' if we have only vbCR...
Buf$ = Join(A(), vbCrLf)
End If
Buf$ = Left$(Buf$, Len(Buf$))
Also the last line with the Left$-instruction of your code-block above is entirely redundant.
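Spelled out, the faster InStr-plus-Replace variant being described would be roughly this (a sketch against the buf$ variable from the quoted block):
Code:
```vb
' One cheap InStr probe, then a single Replace pass over the buffer,
' instead of allocating a whole String array via Split and re-Joining it.
If InStr(buf$, vbCrLf) = 0 Then
    buf$ = Replace(buf$, vbCr, vbCrLf) ' only bare vbCRs present
End If
```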
If you care for one more word of advice ...
My coding style is also not the best - but try to get at least the (nested) indentations right,
consistently! (Some parts of your code contain proper indents, some parts none at all.)
It's really difficult to read, and that can give the wrong impression about otherwise interesting
code.
Olaf
Last edited by Schmidt; Oct 11th, 2014 at 09:44 AM.
Re: Reading a text file, as Binary, Line by Line ?
Thank you (I put your code in the M2000 interpreter, credited as from Schmidt, member of vbforums; I still have work to do on it: Unicode support and dropping the common controls -selectors- by using glist).
As for my style of coding: first I write simple code to do a simple task, then I run it and observe it, then I make some changes and observe the execution. I am always thinking about what can go wrong, so I have to consider the initial status of variables and the range of values to handle. Many times I have no idea at the start what the perfect final code should be, but as I write, it becomes obvious where things are going. For this reason, programming is an art.
About indentation: I add indentation when I have to understand what is happening... and things are going wrong.
About the bug (maybe reading this you will see that it isn't really a bug): because the input file is expected to contain lines ending with vbCrLf or vbCr, a file holding only a single byte returns an empty string; if we do have a vbCrLf or vbCr, we get exactly the same content back.
When is this no good? When we have to read the data as ASCII and not as lines of text.
Re: Reading a text file, as Binary, Line by Line ?
A UTF-8 file has a unique 3-byte BOM at the beginning
Sometimes, but not necessarily.
That's true, but without a BOM there is no way to know what the file contains and whether there is encoding in that
particular file, unless maybe it's an XML or web file with a UTF-8 tag.
I disagree. There are at least 2 ways to detect UTF-8 encoded text files that do not have a BOM.
See the attached project, which handles UTF-8 without a BOM as well as handling the UTF16LE, UTF16BE, and UTF8 BOMs.
Several sample text files are included in the project.
BTW - Mozilla Firefox and Notepad++ have built-in detection of UTF-8 without a BOM, and there are probably other apps that can do this as well.
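For reference, one common way to test for BOM-less UTF-8 (a sketch only; this is not necessarily how the attached project does it) is to let the system validate the byte stream via the MB_ERR_INVALID_CHARS flag of MultiByteToWideChar:
Code:
```vb
Private Declare Function MultiByteToWideChar& Lib "kernel32" _
    (ByVal CodePage&, ByVal dwFlags&, MultiBytes As Any, _
     ByVal cBytes&, ByVal pWideChars&, ByVal cWideChars&)

Private Const CP_UTF8 As Long = 65001
Private Const MB_ERR_INVALID_CHARS As Long = 8

'Returns True when the bytes form a valid UTF-8 sequence.
'(Pure ASCII also passes, which is harmless: ASCII *is* valid UTF-8.)
Public Function LooksLikeUTF8(B() As Byte) As Boolean
    Dim nBytes As Long
    nBytes = UBound(B) - LBound(B) + 1
    If nBytes <= 0 Then Exit Function
    LooksLikeUTF8 = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, _
                        B(LBound(B)), nBytes, 0, 0) <> 0
End Function
```
A second heuristic would be a hand-rolled scan of the multi-byte lead/continuation patterns, but letting the API reject invalid sequences is the shorter route.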
Last edited by DrUnicode; Oct 12th, 2014 at 08:33 PM.
Reason: More info
Re: Reading a text file, as Binary, Line by Line ?
According to your screenshots, none of the methods processed the original file correctly, which makes me wonder whether the original file was coded properly in UTF-8. The word "Hejâz" was incorrect in all of your screenshots except the original.
I am guessing that your sample file has ANSI text with a UTF-8 BOM.
I hand typed your original text into Notepad and saved as UTF-8 (with BOM).
It works just fine with prjReadFile.
MultibyteToWideChar performed the worst.
I find that hard to believe. Many of us have been using MultiByteToWideChar for years and it works flawlessly.
Anyway, try the attached UTF-8 file to see if works OK.
Re: Reading a text file, as Binary, Line by Line ?
After more study, it goes like this :
What we have here is a cross-platform inconsistency combined with a Notepad bug and most likely the use of a seldom used text file format.
I modified my program to show the first three bytes of a file to reduce guesswork.
The conclusions are :
- Both my UTF-8 stream and MultiByteToWideChar work the same on valid UTF-8 files.
- There is a difference though: the UTF-8 stream inspects each line to decide between UTF-8 and ANSI. If a line of text is not ANSI, it assumes UTF-8, although it might actually be something else. This is why the results are similar to Notepad's.
- The faulty file has NO BOM, yet Notepad falsely but POSITIVELY identifies it as a UTF-8 file. That means Notepad does not rely on the BOM alone, and it makes the same error, so to speak, as the UTF-8 routine, both in the detection and in the processing of each line. Hence, Notepad does not use MultiByteToWideChar.
- RichTextBox does use MultiByteToWideChar or equivalent as the results are the same as GenericReadFile.
- The faulty file essentially has an ANSI text portion at the beginning, followed by a block of text that is neither ANSI, nor UTF-8 (nor UTF-16be or le I would assume), then another ANSI block at the end.
It was a bit of a pain, but some things were learned in the process.
(Edit: it's too bad the uploader reduces files to smaller JPEGs. The originals are lossless PNGs for better quality.)
Also, I've got another procedure for reading in Unix style files (with only LF as line terminator). I'll post it too.
The file has to be already opened as Binary before this procedure is called. Also, the maximum line length is 2000 which worked for me, but you can change it.
Code:
Private Function GetNextLine(hFile As Long) As String
    Dim s As String * 2000
    Dim i As Long
    Dim iStartPtr As Long
    '
    iStartPtr = Seek(hFile)
    If EOF(hFile) Then
        GetNextLine = Chr$(0)
        Exit Function
    End If
    '
    Get hFile, , s
    '
    i = InStr(s, vbLf)
    Select Case i
        Case 0 ' Maybe need to make s bigger.
            GetNextLine = Chr$(0)
            Exit Function
        Case 1
            Seek hFile, iStartPtr + 1
            Exit Function ' Just return null string on repeating vbLF's.
    End Select
    '
    Seek hFile, iStartPtr + i
    GetNextLine = Left$(s, i - 1)
End Function
This could also be fairly easily adapted to other types of line terminators. Or, just monitor for Chr$(13) and toss them.
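For example, "tossing the Chr$(13)s" for CrLf-terminated files could be as simple as trimming a trailing CR from each returned line (a sketch; sLine and ff are hypothetical caller-side names, with ff already opened For Binary):
Code:
```vb
sLine = GetNextLine(ff)
If Len(sLine) > 0 Then
    ' a CrLf file leaves a stray vbCr before the vbLf delimiter
    If Right$(sLine, 1) = vbCr Then sLine = Left$(sLine, Len(sLine) - 1)
End If
```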
EDIT: Yes, just to note, this routine was specifically written to read ASCII files. There is no consideration of Unicode (and no need for it in the spot it was used).
With files opened for Binary access, an attempt to read through the file using the Input function until EOF returns True generates an error. Use the LOF and Loc functions instead of EOF when reading binary files with Input, or use Get when using the EOF function. With files opened for Output, EOF always returns True.
LOF() returns the length in bytes. If we want to read Unicode, we have to get 2 bytes for each char, so we may ask for 2 bytes when only 1 is left; EOF doesn't help if we use Get to fetch 2 bytes while only one byte remains in the file. Using 2 * (LOF(thatfile) \ 2) we get an end-of-file position that is safe for Unicode text (assuming UTF-16 with 2 bytes per char).
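A sketch of that even-length guard (assuming ff is already open For Binary and wChar is an Integer holding one UTF-16 code unit):
Code:
```vb
Dim evenLen As Long, wChar As Integer
evenLen = 2 * (LOF(ff) \ 2)      ' ignore a possible odd trailing byte
Do While Loc(ff) < evenLen
    Get #ff, , wChar             ' reads exactly 2 bytes per iteration
    ' ... process ChrW(wChar) here
Loop
```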
Last edited by georgekar; Oct 30th, 2014 at 09:24 AM.