Thread: [RESOLVED] ADODB.Stream auto-detect charset problem

  1. #1

    Thread Starter
    Hyperactive Member
    Join Date
    Jun 2011
    Posts
    461

    [RESOLVED] ADODB.Stream auto-detect charset problem

    Sorry if I'm being a bit annoying with these threads, but I wouldn't ask if I were able to solve this by myself... :/

    I now have a problem reading files that are in different encodings. I found a pretty simple solution using the ADODB.Stream object, but it doesn't do a good job of distinguishing between the ANSI/UTF-8 and Unicode charsets.

    Module code:
    VB Code:
    Private Sub Main()
      ' Late-bound stream: Type defaults to text and Charset defaults to "Unicode" (UTF-16)
      With CreateObject("ADODB.Stream")
        .Open
        .LoadFromFile "1.txt"
        MsgBox .ReadText
      End With
    End Sub

    ' Shadows the built-in MsgBox; WScript.Shell's Popup can display Unicode text
    Function MsgBox(Prompt As String, Optional Buttons As VbMsgBoxStyle = vbOKOnly, Optional Title As String) As VbMsgBoxResult
      MsgBox = CreateObject("WScript.Shell").Popup(Prompt, 0&, Title, Buttons)
    End Function
    File "1.txt", Unicode: Сампле тест - or if ANSI: Sample test

    The above example works if the file is saved as Unicode, but when I save "1.txt" as UTF-8 or ANSI, the text returned by ADODB.Stream is corrupted. It appears that I need to specify the charset manually in the stream if it isn't Unicode (I have also tried .Charset = "_autodetect", but it doesn't work). How can I do that automatically, given that I don't know which file will be loaded (it depends on the user)? Also, I have read that in some cases the file has no BOM, but programs like Notepad nevertheless read those files correctly.
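
    For the files that do carry a BOM, the charset can be picked by reading the first few bytes directly. The following is only a minimal sketch and not code posted in this thread: the name CharsetFromBom is made up for illustration, and the returned charset names are assumed to be registered on the machine. An empty result means no BOM was found, so the encoding still has to be guessed some other way.

    ```vb
    Option Explicit

    ' Hypothetical helper: maps a byte order mark to an ADODB.Stream charset name.
    ' Returns "" when the file has no recognised BOM.
    Public Function CharsetFromBom(ByVal FileName As String) As String
        Dim f As Integer
        Dim b() As Byte
        Dim n As Long

        f = FreeFile
        Open FileName For Binary Access Read As #f
        n = LOF(f)
        If n > 3 Then n = 3                ' only the first three bytes matter here
        If n > 0 Then
            ReDim b(0 To n - 1)
            Get #f, , b
        End If
        Close #f

        If n >= 3 Then
            If b(0) = &HEF And b(1) = &HBB And b(2) = &HBF Then
                CharsetFromBom = "utf-8"
                Exit Function
            End If
        End If
        If n >= 2 Then
            If b(0) = &HFF And b(1) = &HFE Then
                CharsetFromBom = "unicode"          ' UTF-16 little-endian
            ElseIf b(0) = &HFE And b(1) = &HFF Then
                CharsetFromBom = "unicodeFFFE"      ' UTF-16 big-endian
            End If
        End If
    End Function
    ```

    If the result is not empty, it can be assigned to .Charset before the stream is opened. A UTF-32 file would be misreported by this sketch, since its little-endian BOM begins with the same two bytes as UTF-16.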
    Last edited by MikiSoft; Feb 25th, 2015 at 07:03 PM.

  2. #2
    PowerPoster
    Join Date
    Feb 2006
    Posts
    20,969

    Re: ADODB.Stream auto-detect charset problem

    Maybe see The Notepad file encoding problem, redux.

    Summary: there is no magic.

  3. #3

    Thread Starter
    Hyperactive Member
    Join Date
    Jun 2011
    Posts
    461

    Re: ADODB.Stream auto-detect charset problem

    So is there some function which will detect/guess file encoding like Notepad does?
    Last edited by MikiSoft; Feb 25th, 2015 at 07:18 PM.

  4. #4
    PowerPoster
    Join Date
    Feb 2006
    Posts
    20,969

    Re: ADODB.Stream auto-detect charset problem

    As mentioned in that blog post, you could look at IsTextUnicode. I haven't seen any examples of use though.

  5. #5
    Fanatic Member
    Join Date
    Aug 2013
    Posts
    806

    Re: ADODB.Stream auto-detect charset problem

    Hi MikiSoft, me again. Dilettante's link is a great one for understanding why this is a difficult problem, and that article links another Michael Kaplan article (http://www.siao2.com/2005/01/30/363308.aspx) which goes into even more detail.

    Dr Unicode of vbForums is probably the best resource for your problem. He has a nice VB-only implementation of a standard UTF-8 detection system similar to the methods mentioned here. Here's a link to his project:

    http://cyberactivex.com/UnicodeTutor...htm#FileReader

    Dr. Unicode's approach should successfully distinguish between UTF-8 and ANSI about as well as you can hope for. More sophisticated approaches exist (e.g. http://www-archive.mozilla.org/proje...Detection.html), but I've never seen anything that comprehensive implemented in VB.
    Check out PhotoDemon, a pro-grade photo editor written completely in VB6. (Full source available at GitHub.)

  6. #6
    PowerPoster
    Join Date
    Feb 2006
    Posts
    20,969

    Re: ADODB.Stream auto-detect charset problem

    ANSI encodings are problematic since there are multiple code pages possible and few if any hints to sniff for to decide which one might have been used.

    In practice, when faced with this you have to limit your ambitions a little. Some things can be sacrificed for a specific application, but usually when a schmancy character encoding of any kind enters the picture it is because you have to deal with characters outside the safe ASCII range from 1 to 126.

    So if we know what limits you can assume, it is possible to use one of the common "guesser" algorithms or even one of the less typical and simpler but more failure-prone algorithms.

  7. #7
    PowerPoster
    Join Date
    Feb 2006
    Posts
    20,969

    Re: ADODB.Stream auto-detect charset problem

    Here is an example of a simple but narrowly applicable algorithm: Simple Character Encoding Detection.

    It ignores the possibility of ASCII or ANSI, and assumes the text (at least the portion sampled) will never contain NUL characters as valid text.
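
    The linked algorithm is not reproduced in this thread, so the following is not it. It is only one plausible sketch built on the same "no NUL in valid text" assumption: in UTF-16, mostly-Latin text has a zero byte in every other position, and which position tells you the byte order. The function name and the returned charset names are assumptions, and the array is assumed to hold at least one byte.

    ```vb
    Option Explicit

    ' Hypothetical helper: guesses between UTF-8 and UTF-16 from where NUL bytes fall.
    Public Function GuessByNulPattern(b() As Byte) As String
        Dim i As Long
        Dim evenNul As Long
        Dim oddNul As Long

        For i = LBound(b) To UBound(b)
            If b(i) = 0 Then
                If (i - LBound(b)) Mod 2 = 0 Then
                    evenNul = evenNul + 1
                Else
                    oddNul = oddNul + 1
                End If
            End If
        Next

        If evenNul = 0 And oddNul = 0 Then
            GuessByNulPattern = "utf-8"         ' no NUL at all: not UTF-16 Latin text
        ElseIf oddNul > evenNul Then
            GuessByNulPattern = "unicode"       ' zero high bytes follow: little-endian
        Else
            GuessByNulPattern = "unicodeFFFE"   ' zero high bytes lead: big-endian
        End If
    End Function
    ```

    A guess like this fails for UTF-16 text with few characters below U+0100 (pure Cyrillic, for instance, produces zero bytes only for spaces and line breaks), which is the kind of narrow applicability mentioned above.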

  8. #8
    Fanatic Member
    Join Date
    Aug 2013
    Posts
    806

    Re: ADODB.Stream auto-detect charset problem

    Interesting link! That's not a half-bad heuristic, all things considered.

    Like you say, ANSI is the real problem here, because codepages are near-impossible to distinguish without prior knowledge or complex heuristics (e.g. testing letter occurrence probability against various languages).

    ANSI vs UTF-8 should be slightly better. Stealing numbers from this link, the false positive rate for a standard UTF-8 check (similar to the cyberactivex link, above) is only 3.9% for a 2-byte check (1920/49152). For a 7-byte sequence, it's less than 1%. For a 12-byte sequence, it's less than 0.1%. For a 24-byte sequence, it's less than 1 in a million.
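
    A check of this kind comes down to verifying that every byte belongs to a well-formed UTF-8 sequence. The following is a minimal sketch, not Dr. Unicode's code: the name IsValidUtf8 is made up for illustration, and the array is assumed to hold at least one byte.

    ```vb
    Option Explicit

    ' Hypothetical helper: True when b() contains only well-formed UTF-8 sequences.
    Public Function IsValidUtf8(b() As Byte) As Boolean
        Dim i As Long, n As Long, need As Long
        Dim lo As Long, hi As Long

        n = UBound(b)
        i = LBound(b)
        Do While i <= n
            lo = &H80: hi = &HBF            ' allowed range for the next continuation byte
            Select Case b(i)
                Case &H0 To &H7F
                    need = 0                ' plain ASCII
                Case &HC2 To &HDF
                    need = 1
                Case &HE0
                    need = 2: lo = &HA0     ' reject overlong 3-byte forms
                Case &HE1 To &HEC, &HEE To &HEF
                    need = 2
                Case &HED
                    need = 2: hi = &H9F     ' reject UTF-16 surrogates
                Case &HF0
                    need = 3: lo = &H90     ' reject overlong 4-byte forms
                Case &HF1 To &HF3
                    need = 3
                Case &HF4
                    need = 3: hi = &H8F     ' nothing above U+10FFFF
                Case Else
                    Exit Function           ' stray continuation or invalid lead byte
            End Select
            If i + need > n Then Exit Function      ' sequence is cut off
            Do While need > 0
                i = i + 1
                If b(i) < lo Or b(i) > hi Then Exit Function
                lo = &H80: hi = &HBF        ' only the first continuation is restricted
                need = need - 1
            Loop
            i = i + 1
        Loop
        IsValidUtf8 = True
    End Function
    ```

    Pure ASCII passes this check too, so the false positive figures only apply to data that contains bytes of &H80 or above. Of all two-byte combinations, the only ones with a high byte that pass are a lead byte &HC2 to &HDF followed by a continuation byte &H80 to &HBF, which is 30 x 64 = 1920 pairs. The 49152 in the quoted ratio is consistent with counting every pair that has at least one high byte (65536 - 16384).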

    But if the data is ANSI and potentially from different code-pages, you're probably out of luck. It is theoretically possible to tap into IE's heuristics engine via the IMultiLanguage interface and accompanying DetectInputCodepage function (https://msdn.microsoft.com/en-us/lib...=vs.85%29.aspx), but I've never seen that attempted in VB.

  9. #9
    Frenzied Member
    Join Date
    Feb 2015
    Posts
    1,584

    Re: ADODB.Stream auto-detect charset problem

    Use MLANG for this purpose.
    Code:
    Option Explicit
     
    Private Sub Form_Load()
        Dim MLang       As CMultiLanguage
        Dim IMLang2     As IMultiLanguage2
        Dim Encoding()  As tagDetectEncodingInfo
        Dim encCount    As Long
        Dim inp()       As Byte
        Dim index       As Long
        
        Open "c:\Test.txt" For Binary As #1
        ReDim inp(LOF(1) - 1)
        Get #1, , inp()
        Close #1
        
        Set MLang = New CMultiLanguage
        Set IMLang2 = MLang
        
        encCount = 16
        ReDim Encoding(encCount - 1)
        IMLang2.DetectInputCodepage 0, 0, inp(0), UBound(inp) + 1, Encoding(0), encCount
        
        For index = 0 To encCount - 1
            
            If Encoding(index).nCodePage = 65001 Then 'UTF-8
                Debug.Print "Detection UTF-8, probably " & Encoding(index).nConfidence
            End If
            
        Next
        
    End Sub

  10. #10
    Fanatic Member
    Join Date
    Aug 2013
    Posts
    806

    Re: ADODB.Stream auto-detect charset problem

    I stand corrected; someone has put together a .tlb for it. Thanks, Trick!

  11. #11

    Thread Starter
    Hyperactive Member
    Join Date
    Jun 2011
    Posts
    461

    Re: ADODB.Stream auto-detect charset problem

    Thanks to all! But how do I use The trick's implementation with my example in the main post?
    Last edited by MikiSoft; Feb 26th, 2015 at 11:30 AM.

  12. #12

  13. #13

    Thread Starter
    Hyperactive Member
    Join Date
    Jun 2011
    Posts
    461

    Re: ADODB.Stream auto-detect charset problem

    Yes, but that doesn't display the contents of a file the way ADODB.Stream does. I don't know how to put the two together.

  14. #14

  15. #15

    Thread Starter
    Hyperactive Member
    Join Date
    Jun 2011
    Posts
    461

    Re: ADODB.Stream auto-detect charset problem

    Quote Originally Posted by The trick View Post
    If you use MLang, you can immediately load text from a file.
    How do I do that? I understand your code above as an example of detecting the UTF-8 charset, but it doesn't show the contents of the file in the proper text format.
    Last edited by MikiSoft; Feb 26th, 2015 at 12:29 PM.

  16. #16

  17. #17
    PowerPoster
    Join Date
    Feb 2006
    Posts
    20,969

    Re: ADODB.Stream auto-detect charset problem

    You realize you have Headache Number 2 ahead as well, right?

    If you are dealing with text files and not simple strings there is almost zero probability that you can ignore the issue of line separators. While "Windows native" files normally use CRLF a lot of UTF-8 in the wild will have LF. You could even see naked CR, but since the old Mac OS died that's rare now.

    And then you have the related issue: are these line separators or terminators? I.e. some files use a CRLF or LF to mean "the end of the last (and each) line" while others treat them as partitions between lines. The difference is whether a CRLF or LF at the very end of the file means end of last line or an empty line.

    And don't forget the issue of Ctrl-Z in DOS, CP/M, etc. text files. VB6's native text I/O still respects that, but once you start straying outside those lines all bets are off. A text file can have a Ctrl-Z (&H1A) followed by thousands of bytes of garbage you are supposed to ignore.
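
    To make those choices concrete, here is a minimal sketch of one possible policy, applied to text that has already been decoded into a VB6 string. The name SplitLines is made up for illustration, and each decision in it (cut at the first Ctrl-Z, treat a final line break as a terminator) is an assumption that may be wrong for your files.

    ```vb
    Option Explicit

    ' Hypothetical helper: splits decoded text into lines regardless of the
    ' line-ending convention used in the file.
    Public Function SplitLines(ByVal Text As String) As String()
        Dim p As Long

        ' Assumption: everything from the first Ctrl-Z (&H1A) onward is garbage
        p = InStr(Text, Chr$(&H1A))
        If p > 0 Then Text = Left$(Text, p - 1)

        ' Reduce CRLF and bare CR to LF; CRLF has to be handled first
        Text = Replace(Text, vbCrLf, vbLf)
        Text = Replace(Text, vbCr, vbLf)

        ' Terminator convention: one trailing LF ends the last line
        ' instead of starting an empty one
        If Right$(Text, 1) = vbLf Then Text = Left$(Text, Len(Text) - 1)

        SplitLines = Split(Text, vbLf)
    End Function
    ```

    With this policy, "a" & vbCrLf & "b" & vbLf yields two lines, "a" and "b". Dropping the trailing-LF test switches to the separator convention, under which the same input yields a third, empty line.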

  18. #18

    Thread Starter
    Hyperactive Member
    Join Date
    Jun 2011
    Posts
    461

    Re: ADODB.Stream auto-detect charset problem

    So should I give up, since there is no complete solution to this problem?

  19. #19
    PowerPoster
    Join Date
    Feb 2006
    Posts
    20,969

    Re: ADODB.Stream auto-detect charset problem

    I would not give up. But you have to choose a subset of the battles to fight, and expect some cases to fail.

    Hopefully you know a little about the files you will have to process and can decide which issues to try to handle.

  20. #20
    Frenzied Member
    Join Date
    May 2011
    Posts
    1,988

    Re: ADODB.Stream auto-detect charset problem

    Quote Originally Posted by MikiSoft View Post
    (I have also tried to put .Charset = "_autodetect" but it won't work)
    Try .Charset = "_autodetect_all" because "_autodetect" is for "Japanese (Auto-Select)" -- probably not what you wanted.

    Check out all of the charsets under HKCR\Mime\Database\Charset

    cheers,
    </wqw>

  21. #21

    Thread Starter
    Hyperactive Member
    Join Date
    Jun 2011
    Posts
    461

    Re: ADODB.Stream auto-detect charset problem

    Thanks, wqweto! It works now when the file is in Unicode or ANSI, but not if it's saved with Notepad in UTF-8, so I guess that it has to be combined with some of the code above for UTF-8 detection.
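
    One way to combine the two is to decide on a charset name first and then hand it to the stream. The following is a minimal sketch, not necessarily what the attached sample does: the name ReadTextAs is made up for illustration, and the caller is assumed to pass "utf-8" when its own detection says UTF-8 and "_autodetect_all" otherwise.

    ```vb
    Option Explicit

    ' Hypothetical helper: reads a whole text file with ADODB.Stream
    ' using the charset name chosen by the caller.
    Public Function ReadTextAs(ByVal FileName As String, ByVal CharsetName As String) As String
        With CreateObject("ADODB.Stream")
            .Type = 2                   ' adTypeText
            .Charset = CharsetName      ' set before any data is loaded
            .Open
            .LoadFromFile FileName
            ReadTextAs = .ReadText      ' reads to the end of the stream by default
            .Close
        End With
    End Function
    ```

    For example, MsgBox ReadTextAs("1.txt", "utf-8") for a file known to be UTF-8. It is worth checking whether the returned string starts with ChrW$(&HFEFF) and dropping that character if it does.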
    Last edited by MikiSoft; Mar 1st, 2015 at 08:48 AM.

  22. #22

    Thread Starter
    Hyperactive Member
    Join Date
    Jun 2011
    Posts
    461

    Re: [RESOLVED] ADODB.Stream auto-detect charset problem

    I had some obligations these past few days and didn't get around to solving this problem. So finally here it is: a perfectly working sample in the attachment.
    Thanks to everyone who tried to help me, especially wqweto and The trick!
    Last edited by MikiSoft; Mar 1st, 2015 at 10:31 AM.
