[RESOLVED] ADODB.Stream auto-detect charset problem

**MikiSoft** · Feb 25th, 2015, 06:56 PM

Sorry if I'm being a bit annoying with these threads, but I wouldn't ask if I were able to solve this by myself... :/

I now have a problem reading files that are in different encodings. I found a pretty simple solution using the ADODB.Stream object, but it doesn't reliably distinguish between ANSI/UTF-8 and Unicode charsets.

Module code:

VB Code:

Private Sub Main()
  With CreateObject("ADODB.Stream")
    .Open
    .LoadFromFile "1.txt"
    MsgBox .ReadText
  End With
End Sub
 
Function MsgBox(Prompt As String, Optional Buttons As VbMsgBoxStyle = vbOKOnly, Optional Title As String) As VbMsgBoxResult
  MsgBox = CreateObject("WScript.Shell").Popup(Prompt, 0&, Title, Buttons)
End Function

File "1.txt", Unicode: Сампле тест - or if ANSI: Sample test

The above example works if the file is saved as Unicode, but when I save "1.txt" as UTF-8 or ANSI, the text returned by ADODB.Stream is corrupted. It appears that I need to specify the charset manually if the file isn't Unicode (I have also tried .Charset = "_autodetect", but it doesn't work). How can I do that automatically, since I don't know which file will be loaded (it depends on the user)? Also, I have read that in some cases the file has no BOM, yet programs like Notepad still read those files correctly.

**dilettante** · Feb 25th, 2015, 07:05 PM

Maybe see The Notepad file encoding problem, redux.

Summary: there is no magic.

**MikiSoft** · Feb 25th, 2015, 07:15 PM

So is there some function which will detect/guess file encoding like Notepad does?

**dilettante** · Feb 25th, 2015, 07:46 PM

As mentioned in that blog post, you could look at IsTextUnicode. I haven't seen any examples of use though.
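For reference, a minimal sketch of calling IsTextUnicode from VB6. The helper name LooksLikeUtf16 is hypothetical, and the sketch assumes the file has already been read into a non-empty byte array. The API only makes a statistical guess, so it can misfire, especially on short input.

```vb
Option Explicit

' lpiResult is declared ByVal so that 0 can be passed as a NULL pointer,
' which asks the API to apply all of its tests.
Private Declare Function IsTextUnicode Lib "advapi32" ( _
    ByRef lpv As Any, ByVal iSize As Long, ByVal lpiResult As Long) As Long

' Hypothetical helper: True if the sampled bytes look like UTF-16 text.
' Assumes Bytes() holds at least one byte.
Private Function LooksLikeUtf16(Bytes() As Byte) As Boolean
    Dim cb As Long
    cb = UBound(Bytes) - LBound(Bytes) + 1
    LooksLikeUtf16 = (IsTextUnicode(Bytes(LBound(Bytes)), cb, 0&) <> 0)
End Function
```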

**Tanner_H** · Feb 25th, 2015, 09:07 PM

Hi MikiSoft, me again. Dilettante's link is a great one for understanding why this is a difficult problem, and that article links another Michael Kaplan article (http://www.siao2.com/2005/01/30/363308.aspx) which goes into even more detail.

Dr Unicode of vbForums is probably the best resource for your problem. He has a nice VB-only implementation of a standard UTF-8 detection system similar to the methods mentioned here. Here's a link to his project:

http://cyberactivex.com/UnicodeTutor...htm#FileReader

Dr. Unicode's approach should successfully distinguish between UTF-8 and ANSI about as well as you can hope for. More sophisticated approaches exist (e.g. http://www-archive.mozilla.org/proje...Detection.html), but I've never seen anything that comprehensive implemented in VB.

**dilettante** · Feb 26th, 2015, 10:40 AM

ANSI encodings are problematic since there are multiple code pages possible and few if any hints to sniff for to decide which one might have been used.

In practice, when faced with this you have to limit your ambitions a little. Some things can be sacrificed for a specific application, but usually when a schmantzy character encoding of any kind enters the picture it is because you have to deal with characters outside the safe ASCII range from 1 to 126.

So if we know what limits you can assume, it is possible to use one of the common "guesser" algorithms or even one of the less typical and simpler but more failure-prone algorithms.

**dilettante** · Feb 26th, 2015, 10:51 AM

Here is an example of a simple but narrowly applicable algorithm: Simple Character Encoding Detection.

It ignores the possibility of ASCII or ANSI, and assumes the text (at least the portion sampled) will never contain NUL characters as valid text.

**Tanner_H** · Feb 26th, 2015, 11:01 AM

Interesting link! That's not a half-bad heuristic, all things considered.

Like you say, ANSI is the real problem here, because codepages are near-impossible to distinguish without prior knowledge or complex heuristics (e.g. testing letter occurrence probability against various languages).

ANSI vs UTF-8 should be slightly better. Stealing numbers from this link, the false positive rate for a standard UTF-8 check (similar to the cyberactivex link, above) is only 3.9% for a 2-byte check (1920/49152). For a 7-byte sequence, it's less than 1%. For a 12-byte sequence, it's less than 0.1%. For a 24-byte sequence, it's less than 1 in a million.
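A "standard UTF-8 check" of this kind can be sketched in plain VB6 roughly as follows. This is a simplified illustration, not the cyberactivex implementation: the function name is made up, it only checks lead/continuation byte structure (it does not reject every overlong or surrogate sequence), and it assumes the sample is a non-empty byte array that does not end in the middle of a character.

```vb
Option Explicit

' Hypothetical helper: True if Bytes() is structurally valid UTF-8 AND
' contains at least one multi-byte sequence. Pure ASCII returns False
' because it cannot be told apart from ANSI by this test.
Private Function LooksLikeUtf8(Bytes() As Byte) As Boolean
    Dim i As Long
    Dim b As Long
    Dim need As Long            ' continuation bytes still expected
    Dim sawMultiByte As Boolean

    For i = LBound(Bytes) To UBound(Bytes)
        b = Bytes(i)
        If need = 0 Then
            If b < &H80 Then
                ' plain ASCII, nothing to do
            ElseIf b >= &HC2 And b <= &HDF Then
                need = 1        ' 2-byte sequence
            ElseIf b >= &HE0 And b <= &HEF Then
                need = 2        ' 3-byte sequence
            ElseIf b >= &HF0 And b <= &HF4 Then
                need = 3        ' 4-byte sequence
            Else
                Exit Function   ' invalid lead byte, so not UTF-8
            End If
            If need > 0 Then sawMultiByte = True
        Else
            ' Continuation bytes must be in the range &H80..&HBF.
            If b < &H80 Or b > &HBF Then Exit Function
            need = need - 1
        End If
    Next

    LooksLikeUtf8 = (need = 0) And sawMultiByte
End Function
```

The longer the run of non-ASCII bytes that passes this test, the less likely it is that the data is really ANSI, which is where the falling false-positive rates quoted above come from.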

But if the data is ANSI and potentially from different code-pages, you're probably out of luck. It is theoretically possible to tap into IE's heuristics engine via the IMultiLanguage interface and accompanying DetectInputCodepage function (https://msdn.microsoft.com/en-us/lib...=vs.85%29.aspx), but I've never seen that attempted in VB.

**The trick** · Feb 26th, 2015, 11:07 AM

Use MLANG for this purpose.

Code:

Option Explicit
 
Private Sub Form_Load()
    Dim MLang       As CMultiLanguage
    Dim IMLang2     As IMultiLanguage2
    Dim Encoding()  As tagDetectEncodingInfo
    Dim encCount    As Long
    Dim inp()       As Byte
    Dim index       As Long
    
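    ' Read the whole file into a byte array (assumes the file is not empty).
    Dim f           As Integer
    f = FreeFile
    Open "c:\Test.txt" For Binary Access Read As #f
    ReDim inp(LOF(f) - 1)
    Get #f, , inp()
    Close #f
    
    ' CMultiLanguage comes from an MLang type library referenced by the project.
    Set MLang = New CMultiLanguage
    Set IMLang2 = MLang
    
    ' Ask for up to 16 candidates; on return encCount holds the number found.
    encCount = 16
    ReDim Encoding(encCount - 1)
    IMLang2.DetectInputCodepage 0, 0, inp(0), UBound(inp) + 1, Encoding(0), encCount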
    
    For index = 0 To encCount - 1
        
        If Encoding(index).nCodePage = 65001 Then 'UTF-8
            Debug.Print "UTF-8 detected, confidence " & Encoding(index).nConfidence
        End If
        
    Next
    
End Sub

**Tanner_H** · Feb 26th, 2015, 11:12 AM

I stand corrected; someone has put together a .tlb for it. Thanks, Trick!

**MikiSoft** · Feb 26th, 2015, 11:26 AM

Thanks to all! But how do I use The trick's implementation with my example in the first post?

**The trick** · Feb 26th, 2015, 11:59 AM

I wrote an example.

**MikiSoft** · Feb 26th, 2015, 12:07 PM

Yes but that doesn't display the contents of a file like ADODB.Stream. I don't know how to put that together.

**The trick** · Feb 26th, 2015, 12:12 PM

Originally Posted by MikiSoft

Yes but that doesn't display the contents of a file like ADODB.Stream. I don't know how to put that together.

Give up the ADODB.Stream. If you use MLang, you can immediately load text from a file.

**MikiSoft** · Feb 26th, 2015, 12:22 PM

Originally Posted by The trick

If you use MLang, you can immediately load text from a file.

How do I do that? I understand your code above as an example of detecting the UTF-8 charset, but it doesn't show the contents of the file as properly decoded text.

**The trick** · Feb 26th, 2015, 12:30 PM

You know the final encoding (UCS-2, which is what VB strings use), and you get the initial encoding from MLang. Next, use MultiByteToWideChar to convert to it (WideCharToMultiByte goes the other way).
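A minimal sketch of that conversion step, assuming the file is already in a non-empty byte array and the code page number (for example 65001 for UTF-8, or a value taken from nCodePage) is known. The function name BytesToString is hypothetical. Input that is already UTF-16LE needs no conversion: apart from any BOM, the byte array can be assigned to a String directly.

```vb
Option Explicit

Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, ByVal dwFlags As Long, _
    ByRef lpMultiByteStr As Any, ByVal cbMultiByte As Long, _
    ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long

' Hypothetical helper: decode Bytes() from the given code page into a VB String.
' Returns an empty string if the conversion fails.
Private Function BytesToString(Bytes() As Byte, ByVal CodePage As Long) As String
    Dim cb As Long
    Dim cch As Long

    cb = UBound(Bytes) - LBound(Bytes) + 1

    ' First call: ask how many UTF-16 characters the result needs.
    cch = MultiByteToWideChar(CodePage, 0, Bytes(LBound(Bytes)), cb, 0, 0)
    If cch = 0 Then Exit Function

    ' Second call: convert straight into the string's buffer.
    BytesToString = String$(cch, vbNullChar)
    MultiByteToWideChar CodePage, 0, Bytes(LBound(Bytes)), cb, _
                        StrPtr(BytesToString), cch
End Function
```

With the byte array from the MLang example this would be called as something like MsgBox BytesToString(inp, Encoding(0).nCodePage).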

**dilettante** · Feb 26th, 2015, 12:30 PM

You realize you have Headache Number 2 ahead as well, right?

If you are dealing with text files and not simple strings there is almost zero probability that you can ignore the issue of line separators. While "Windows native" files normally use CRLF a lot of UTF-8 in the wild will have LF. You could even see naked CR, but since the old Mac OS died that's rare now.

And then you have the related issue: are these line separators or terminators? I.e. some files use a CRLF or LF to mean "the end of the last (and each) line" while others treat them as partitions between lines. The difference is whether a CRLF or LF at the very end of the file means end of last line or an empty line.

And don't forget the issue of Ctrl-Z in DOS, CP/M, etc. text files. VB6's native text I/O still respects that, but once you start straying outside those lines all bets are off. A text file can have a Ctrl-Z (&H1A) followed by thousands of bytes of garbage you are supposed to ignore.
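One possible way to handle these points once the text has been decoded is sketched below. The name NormalizeText is hypothetical, and the choices it makes (cut at the first Ctrl-Z, convert everything to CRLF, leave a trailing line break alone) are assumptions that will not suit every file.

```vb
Option Explicit

' Hypothetical helper: apply DOS-style end-of-file handling and make
' all line breaks CRLF.
Private Function NormalizeText(ByVal Text As String) As String
    Dim p As Long

    ' Treat Ctrl-Z (&H1A) as end of file and drop everything from there on.
    p = InStr(1, Text, Chr$(&H1A), vbBinaryCompare)
    If p > 0 Then Text = Left$(Text, p - 1)

    ' Collapse CRLF and bare CR to LF first, then expand LF to CRLF,
    ' so that no line break is converted twice.
    Text = Replace(Text, vbCrLf, vbLf)
    Text = Replace(Text, vbCr, vbLf)
    NormalizeText = Replace(Text, vbLf, vbCrLf)
End Function
```

Whether a final CRLF means "end of last line" or "one more empty line" still has to be decided by the caller, for example when splitting the result with Split(..., vbCrLf).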

**MikiSoft** · Feb 26th, 2015, 12:37 PM

So I should give up, since there is no complete solution to this problem?

**dilettante** · Feb 26th, 2015, 12:40 PM

I would not give up. But you have to choose a subset of the battles to fight, and expect some cases to fail.

Hopefully you know a little about the files you will have to process and can decide which issues to try to handle.

**wqweto** · Feb 27th, 2015, 04:11 PM

Originally Posted by MikiSoft

(I have also tried to put .Charset = "_autodetect" but it won't work)

Try .Charset = "_autodetect_all" because "_autodetect" is for "Japanese (Auto-Select)" -- probably not what you wanted.

Check out all of the charsets under HKCR\Mime\Database\Charset

cheers,
</wqw>

**MikiSoft** · Feb 27th, 2015, 05:05 PM

Thanks, wqweto! It works now when the file is in Unicode or ANSI, but not if it's saved with Notepad as UTF-8, so I guess it has to be combined with some of the code above for UTF-8 detection.
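One way the two ideas could be combined is sketched below. This is not the attachment posted later in the thread. It assumes the UTF-8 file starts with the UTF-8 BOM (EF BB BF), as files saved by Notepad as UTF-8 did at the time, and falls back to "_autodetect_all" otherwise. The function name ReadTextFile is hypothetical. UTF-8 files without a BOM would still need a content check.

```vb
Option Explicit

' Hypothetical helper: read a text file through ADODB.Stream, choosing
' "utf-8" when the file starts with the UTF-8 BOM and letting the stream
' guess in every other case.
Private Function ReadTextFile(ByVal FileName As String) As String
    Dim f As Integer
    Dim bom(0 To 2) As Byte
    Dim cs As String

    cs = "_autodetect_all"

    ' Peek at the first three bytes of the file.
    f = FreeFile
    Open FileName For Binary Access Read As #f
    If LOF(f) >= 3 Then
        Get #f, , bom
        If bom(0) = &HEF And bom(1) = &HBB And bom(2) = &HBF Then cs = "utf-8"
    End If
    Close #f

    With CreateObject("ADODB.Stream")
        .Type = 2               ' adTypeText
        .Charset = cs
        .Open
        .LoadFromFile FileName
        ReadTextFile = .ReadText
        .Close
    End With
End Function
```

With the module from the first post, the call would simply be MsgBox ReadTextFile("1.txt").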

**MikiSoft** · Mar 1st, 2015, 08:55 AM

I had some obligations these days, so I didn't get around to solving this problem until now. So finally here it is: a perfectly working sample in the attachment.
Thanks to all people who tried to help me, especially to wqweto and The trick!
