[RESOLVED] ADODB.Stream auto-detect charset problem
Sorry if I'm being a bit annoying with these threads, but I wouldn't ask if I were able to solve this by myself... :/
I now have a problem reading files that are in different encodings. I have found a pretty simple solution using the ADODB.Stream object, but the problem is that it doesn't do a good job of distinguishing between the ANSI/UTF-8 and Unicode charsets.
Module code:
VB Code:
Private Sub Main()
    With CreateObject("ADODB.Stream")
        .Open
        .LoadFromFile "1.txt"
        MsgBox .ReadText
    End With
End Sub
Function MsgBox(Prompt As String, Optional Buttons As VbMsgBoxStyle = vbOKOnly, Optional Title As String) As VbMsgBoxResult
File "1.txt", Unicode: Сампле тест - or if ANSI: Sample test
The above example works if the file is saved as Unicode, but when I save "1.txt" as UTF-8 or ANSI, the result from ADODB.Stream is corrupted. It appears that I need to specify the charset manually in the stream if it isn't Unicode (I have also tried setting .Charset = "_autodetect", but that doesn't work). So how can I do that automatically, since I don't know which file will be loaded (it depends on the user)? Also, I have read that in some cases the file has no BOM, but programs like Notepad nevertheless read those files correctly.
Last edited by MikiSoft; Feb 25th, 2015 at 07:03 PM.
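For the case where a BOM is present, the charset can be picked before the stream is opened. The following is only a minimal sketch, not a full solution: CharsetFromBom and ReadTextAuto are hypothetical helper names, only the UTF-8 and UTF-16 little-endian BOMs are checked, and the "windows-1252" fallback for BOM-less files is an assumption that will be wrong for other ANSI code pages and for UTF-8 saved without a BOM.

```vb
Option Explicit

' Hypothetical helper: looks only at the first bytes of the file.
' Returns "" when no (recognised) BOM is present.
Public Function CharsetFromBom(ByVal FileName As String) As String
    Dim f As Integer
    Dim size As Long
    Dim b(0 To 2) As Byte

    f = FreeFile
    Open FileName For Binary Access Read As #f
    size = LOF(f)
    If size >= 2 Then Get #f, 1, b      ' bytes past the end of the file stay 0
    Close #f

    If size >= 3 And b(0) = &HEF And b(1) = &HBB And b(2) = &HBF Then
        CharsetFromBom = "utf-8"        ' EF BB BF
    ElseIf size >= 2 And b(0) = &HFF And b(1) = &HFE Then
        CharsetFromBom = "unicode"      ' FF FE = UTF-16 little-endian
    End If
End Function

' Hypothetical wrapper around ADODB.Stream.
' ASSUMPTION: files without a BOM are decoded with FallbackCharset.
Public Function ReadTextAuto(ByVal FileName As String, _
        Optional ByVal FallbackCharset As String = "windows-1252") As String
    Dim cs As String

    cs = CharsetFromBom(FileName)
    If Len(cs) = 0 Then cs = FallbackCharset

    With CreateObject("ADODB.Stream")
        .Type = 2                       ' adTypeText
        .Charset = cs                   ' set before the data is read
        .Open
        .LoadFromFile FileName
        ReadTextAuto = .ReadText
        .Close
    End With
End Function
```

Usage would be something like MsgBox ReadTextAuto("1.txt"). Files without a BOM still need a content-based guess, which is what the rest of the thread is about.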
Hi MikiSoft, me again. Dilettante's link is a great one for understanding why this is a difficult problem, and that article links another Michael Kaplan article (http://www.siao2.com/2005/01/30/363308.aspx) which goes into even more detail.
Dr Unicode of vbForums is probably the best resource for your problem. He has a nice VB-only implementation of a standard UTF-8 detection system similar to the methods mentioned here. Here's a link to his project:
Dr. Unicode's approach should successfully distinguish between UTF-8 and ANSI about as well as you can hope for. More sophisticated approaches exist (e.g. http://www-archive.mozilla.org/proje...Detection.html), but I've never seen anything that comprehensive implemented in VB.
ANSI encodings are problematic since there are multiple code pages possible and few if any hints to sniff for to decide which one might have been used.
In practice, when faced with this you have to limit your ambitions a little. Some things can be sacrificed for a specific application but usually when a schmantzy character encoding of any kind enters the picture it is because you have to deal with characters outside the safe ASCII range from 1 to 126.
So if we know what limits you can assume, it is possible to use one of the common "guesser" algorithms, or even one of the less typical and simpler but more failure-prone algorithms.
Interesting link! That's not a half-bad heuristic, all things considered.
Like you say, ANSI is the real problem here, because codepages are near-impossible to distinguish without prior knowledge or complex heuristics (e.g. testing letter occurrence probability against various languages).
ANSI vs UTF-8 should be slightly better. Stealing numbers from this link, the false positive rate for a standard UTF-8 check (similar to the cyberactivex link, above) is only 3.9% for a 2-byte check (1920/49152). For a 7-byte sequence, it's less than 1%. For a 12-byte sequence, it's less than 0.1%. For a 24-byte sequence, it's less than 1 in a million.
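To make the kind of check behind those numbers concrete, here is a rough sketch of a structural UTF-8 validity test. It is not Dr. Unicode's code; LooksLikeUtf8 is a hypothetical name, and the check is deliberately simplified (it does not reject every overlong or surrogate encoding). It assumes a non-empty byte array, and it treats pure ASCII as "not UTF-8" because such data decodes the same way under ANSI.

```vb
Option Explicit

' Hypothetical helper: True if Bytes() is structurally valid UTF-8
' AND contains at least one multi-byte sequence.
' ASSUMPTION: Bytes() is dimensioned and non-empty.
Public Function LooksLikeUtf8(Bytes() As Byte) As Boolean
    Dim i As Long, last As Long
    Dim b As Long, need As Long, k As Long
    Dim sawMultiByte As Boolean

    i = LBound(Bytes)
    last = UBound(Bytes)

    Do While i <= last
        b = Bytes(i)
        If b < &H80 Then
            need = 0                        ' plain ASCII
        ElseIf b >= &HC2 And b <= &HDF Then
            need = 1                        ' lead byte of a 2-byte sequence
        ElseIf b >= &HE0 And b <= &HEF Then
            need = 2                        ' lead byte of a 3-byte sequence
        ElseIf b >= &HF0 And b <= &HF4 Then
            need = 3                        ' lead byte of a 4-byte sequence
        Else
            Exit Function                   ' stray trail byte or invalid lead byte
        End If

        If i + need > last Then Exit Function   ' sequence cut off at end of data

        For k = 1 To need
            ' every trail byte must look like 10xxxxxx
            If (Bytes(i + k) And &HC0) <> &H80 Then Exit Function
        Next

        If need > 0 Then sawMultiByte = True
        i = i + need + 1
    Loop

    LooksLikeUtf8 = sawMultiByte
End Function
```

The 2-byte lead range C2..DF combined with the trail range 80..BF gives 30 x 64 = 1920 valid pairs, which matches the 1920 figure quoted above. The longer the run of non-ASCII bytes that passes, the less likely the data is really ANSI.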
But if the data is ANSI and potentially from different code-pages, you're probably out of luck. It is theoretically possible to tap into IE's heuristics engine via the IMultiLanguage interface and accompanying DetectInputCodepage function (https://msdn.microsoft.com/en-us/lib...=vs.85%29.aspx), but I've never seen that attempted in VB.
Option Explicit

' Needs a type library that exposes CMultiLanguage / IMultiLanguage2 (MLang).
Private Sub Form_Load()
    Dim MLang As CMultiLanguage
    Dim IMLang2 As IMultiLanguage2
    Dim Encoding() As tagDetectEncodingInfo
    Dim encCount As Long
    Dim inp() As Byte
    Dim index As Long
    Dim f As Integer

    ' Read the whole file as raw bytes (the file must not be empty)
    f = FreeFile
    Open "c:\Test.txt" For Binary As #f
    ReDim inp(LOF(f) - 1)
    Get #f, , inp()
    Close #f

    Set MLang = New CMultiLanguage
    Set IMLang2 = MLang

    ' Ask for up to 16 candidates; on return encCount holds the number found
    encCount = 16
    ReDim Encoding(encCount - 1)
    IMLang2.DetectInputCodepage 0, 0, inp(0), UBound(inp) + 1, Encoding(0), encCount

    For index = 0 To encCount - 1
        If Encoding(index).nCodePage = 65001 Then 'UTF-8
            Debug.Print "Detection UTF-8, probably " & Encoding(index).nConfidence
        End If
    Next
End Sub
If you use MLang, you can immediately load text from a file.
How do I do that? I understand your code above as an example of detecting the UTF-8 charset, but it doesn't show the contents of the file as properly decoded text.
Last edited by MikiSoft; Feb 26th, 2015 at 12:29 PM.
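One possible way to get from a detected code page to a displayable string is sketched below. Note that it swaps in the kernel32 MultiByteToWideChar API rather than MLang's own conversion functions, so it is not necessarily what was meant above. BytesToString is a hypothetical helper name; it assumes a non-empty byte array and a multi-byte code page such as 65001 (UTF-8) or 1252. UTF-16 data does not go through this API and would need to be handled separately.

```vb
Option Explicit

Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, ByVal dwFlags As Long, _
    ByRef lpMultiByteStr As Any, ByVal cbMultiByte As Long, _
    ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long

' Hypothetical helper: decodes Bytes() using the given code page.
' Returns "" if the conversion fails.
' ASSUMPTION: Bytes() is dimensioned and non-empty.
Public Function BytesToString(Bytes() As Byte, ByVal CodePage As Long) As String
    Dim cb As Long      ' input size in bytes
    Dim cch As Long     ' output size in characters

    cb = UBound(Bytes) - LBound(Bytes) + 1

    ' First call: ask how many characters the result needs
    cch = MultiByteToWideChar(CodePage, 0, Bytes(LBound(Bytes)), cb, 0, 0)
    If cch = 0 Then Exit Function

    ' Second call: convert straight into the string's buffer
    BytesToString = String$(cch, vbNullChar)
    MultiByteToWideChar CodePage, 0, Bytes(LBound(Bytes)), cb, _
                        StrPtr(BytesToString), cch
End Function
```

With the detection code above, this could be called as BytesToString(inp, Encoding(index).nCodePage) for the candidate you decide to trust. Keep in mind that a plain VB6 MsgBox or label may still not display characters outside the system code page.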
You realize you have Headache Number 2 ahead as well, right?
If you are dealing with text files and not simple strings there is almost zero probability that you can ignore the issue of line separators. While "Windows native" files normally use CRLF, a lot of UTF-8 in the wild will have LF. You could even see naked CR, but since the old Mac OS died that's rare now.
And then you have the related issue: are these line separators or terminators? I.e. some files use a CRLF or LF to mean "the end of the last (and each) line" while others treat them as partitions between lines. The difference is whether a CRLF or LF at the very end of the file means end of last line or an empty line.
And don't forget the issue of Ctrl-Z in DOS, CP/M, etc. text files. VB6's native text I/O still respects that, but once you start straying outside those lines all bets are off. A text file can have a Ctrl-Z (&H1A) followed by thousands of bytes of garbage you are supposed to ignore.
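As a rough illustration of the points above, once the text has been decoded it can be normalized before further processing. This is only a sketch under stated assumptions: NormalizeText and SplitLines are hypothetical names, Ctrl-Z is always treated as end-of-text (which may not be what you want for every file), and a trailing line break is treated as a terminator rather than as an extra empty line.

```vb
Option Explicit

' Hypothetical helper: cuts the text at the first Ctrl-Z and
' converts CRLF, lone LF and lone CR to CRLF.
Public Function NormalizeText(ByVal Text As String) As String
    Dim p As Long

    ' ASSUMPTION: anything from the first Ctrl-Z (&H1A) onward is garbage
    p = InStr(1, Text, Chr$(26), vbBinaryCompare)
    If p > 0 Then Text = Left$(Text, p - 1)

    ' Collapse everything to LF first so CRLF is not doubled up
    Text = Replace(Text, vbCrLf, vbLf)
    Text = Replace(Text, vbCr, vbLf)
    NormalizeText = Replace(Text, vbLf, vbCrLf)
End Function

' Hypothetical helper: splits into lines, treating a final line break
' as a terminator (so "a" & vbLf & "b" & vbLf gives two lines, not three).
Public Function SplitLines(ByVal Text As String) As String()
    Text = NormalizeText(Text)
    If Right$(Text, 2) = vbCrLf Then Text = Left$(Text, Len(Text) - 2)
    SplitLines = Split(Text, vbCrLf)
End Function
```

If the files you handle use the "separator" convention instead, drop the Right$ check in SplitLines so that a trailing line break produces a final empty line.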
Thanks, wqweto! It now works when the file is Unicode or ANSI, but not if it's saved with Notepad as UTF-8, so I guess it has to be combined with some of the code above for UTF-8 detection.
Last edited by MikiSoft; Mar 1st, 2015 at 08:48 AM.
I had some obligations these past few days and didn't get around to solving this problem. So finally here it is: a perfectly working sample in the attachment.
Thanks to all people who tried to help me, especially to wqweto and The trick!
Last edited by MikiSoft; Mar 1st, 2015 at 10:31 AM.