Thread: [RESOLVED] Converting email attachment filenames encoded in UTF-8 to ANSI - .eml (RFC 822) files?

  1. #1

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2015
    Location
    Finland
    Posts
    679

    Is there a better method (API or otherwise) to convert UTF-8 encoded .eml attachment filenames to the Windows-1252 code page than a manual replacement/lookup table?

    For example, an attachment filename in an .eml file:

    Content-Transfer-Encoding: base64
    Content-Type: application/pdf;
    name*=UTF-8''S%C3%A4hk%C3%B6posti%20testi.pdf
    Content-Disposition: attachment;
    filename*=UTF-8''S%C3%A4hk%C3%B6posti%20testi.pdf;

    Code:
    Debug.Print EMLFilenameDecode("S%C3%A4hk%C3%B6posti%20testi.pdf")
    
    '-> output Sähköposti testi.pdf
    
    Private Function EMLFilenameDecode(ByVal StringToDecode As String) As String
    Dim sTemp As String
     
    sTemp = StringToDecode
    sTemp = Replace(sTemp, "%20", " ") 'space
    sTemp = Replace(sTemp, "%C3%A4", "ä") 'ä = C3 A4
    sTemp = Replace(sTemp, "%C3%84", "Ä") 'Ä = C3 84
    sTemp = Replace(sTemp, "%C3%B6", "ö") 'ö = C3 B6
    sTemp = Replace(sTemp, "%C3%96", "Ö") 'Ö = C3 96
    sTemp = Replace(sTemp, "%C3%BC", "ü") 'ü = C3 BC
    sTemp = Replace(sTemp, "%C3%9C", "Ü") 'Ü = C3 9C
    'etc
    
    EMLFilenameDecode = sTemp
    End Function

  2. #2

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2015
    Location
    Finland
    Posts
    679

    Nah... that sounded too easy. The problem lies in either the ADODB stream or CDOEX...

    Code:
    'Using CDO for Exchange 2000 (CDOEX), load the eml into a CDO.Message
                'object and extract attachments.
                Dim strm As New AdoDB.Stream
                Dim myMail As New CDO.Message
                'Dim strm As ADODB.Stream
                Set strm = myMail.GetStream()
                strm.Type = AdoDB.StreamTypeEnum.adTypeBinary  '.adTypeText
                strm.LoadFromFile sFilename '("c:\MyTrash\TestFile.eml")
                strm.Flush
                                        
                Dim attach As IBodyPart
                
                sEmailDate = myMail.SentOn
                If Len(sEmailDate) < 2 Then 'if sent time is not set, then use received time
                    sEmailDate = myMail.ReceivedTime
                End If
                
                lngAttachmentCount = myMail.Attachments.Count
                If lngAttachmentCount = 0 Then 'No attachments
                'Filename in inline or base64 string
                'extract attachment from BodyPart if found.
                Dim Strm2
                iItemCount = myMail.BodyPart.BodyParts.Count
                For iItems = 1 To iItemCount
                    iStart = myMail.BodyPart.BodyParts.Item(iItems).BodyParts.Count
                    If iStart > 0 Then
                    For iStart = 1 To myMail.BodyPart.BodyParts.Item(iItems).BodyParts.Count
                        sHeader = myMail.BodyPart.BodyParts.Item(iItems).BodyParts.Item(iStart).ContentTransferEncoding
                        Debug.Print sHeader
                        If sHeader = "base64" Or sHeader = "inline" Then
                            sHeader = myMail.BodyPart.BodyParts.Item(iItems).BodyParts.Item(iStart).Filename 'Is there filename?
                            If Len(sHeader) Then 'filename found.
    'etc...
    Now the question is, how to tell CDOEX that attachment filename is UTF-8 encoded?

    This line of code
    Code:
     sHeader = myMail.BodyPart.BodyParts.Item(iItems).BodyParts.Item(iStart).Filename 'Is there filename?
    returns an empty string when name and filename are UTF-8 encoded as follows:

    name*=UTF-8''S%C3%A4hk%C3%B6posti%20testi.pdf
    Content-Disposition: attachment;
    filename*=UTF-8''S%C3%A4hk%C3%B6posti%20testi.pdf;

    Likewise, when the name and filename lines are converted to ANSI, it does return a filename, but with '?' -> '䰲' in place of the 'ä' character.

    name="Sähköposti testi.pdf"
    Content-Disposition: attachment;
    filename="Sähköposti testi.pdf";

    I am stuck on this for now.

  3. #3

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2015
    Location
    Finland
    Posts
    679

    The ContentTransferEncoding property is writable, but I think it is only used when creating an email message.
    https://msdn.microsoft.com/en-us/lib...exchg.65).aspx
    In compliance with RFC 2045, the ContentTransferEncoding property is enforced to be "7bit", "8bit", or "binary" when ContentMediaType indicates a composite content type such as "message" or "multipart". An attempt to encode a composite type with any other mechanism results in an error.

    CDO can decode a body part from "mac-binhex40" but cannot encode using this mechanism.

    The contents of ContentTransferEncoding are not case-sensitive. The default value is "7bit".
    Still, the default is '7bit', so possibly that also applies when loading a message from a stream.

  4. #4
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    I don't have anything handy to create EML files, so the examples I found might be too limited. Too bad you haven't posted an example that exhibits your issue.

    This seems to look fine for the examples that I do have:

    [Screenshot: sshot1.png]

    [Screenshot: sshot2.png]

    Sorry about the typo in the column header!
    Attached Files

  5. #5
    PowerPoster
    Join Date
    Jun 2013
    Posts
    7,253

    Without a concrete file example (which consists of valid, unchanged content),
    you will have to "explain yourself to death" in a lot of postings - better (and faster for all)
    to just zip and upload something that doesn't contain "confidential data",
    but covers the problem scenario sufficiently

    Olaf

  6. #6

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2015
    Location
    Finland
    Posts
    679

    Attached is a zipped sample file, example3.eml. The filename of attachment #2 is UTF-8 encoded.

    Content-Transfer-Encoding: base64
    Content-Type: application/pdf;
    name*=UTF-8''Sis%C3%A4profiili%20oikea%20v2.pdf
    Content-Disposition: attachment;
    filename*=UTF-8''Sis%C3%A4profiili%20oikea%20v2.pdf;
    size=140867
    Attached Images
    Attached Files

  7. #7

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2015
    Location
    Finland
    Posts
    679

    Tested a couple of mail clients. All of them seem to have problems with the attachment names; some of them generate a random name for attachment #2.

    Windows Live Mail -> ATT12115.pdf where 12115 is increasing random number
    Outlook -> Unnamed attachment 12345.pdf, where 12345 is likewise increasing random number.
    Thunderbird -> 12345.dat increasing random number.
    Domino/Notes shows filename 'as is' -> Sis%C3%A4profiili%20oikea%20v2.pdf

    Seems like an 'industry-wide' problem with names containing anything other than 7-bit characters.

  8. #8

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2015
    Location
    Finland
    Posts
    679

    Rewriting received .eml files so that the filename encoding is iso-8859-1, before opening them in a mail client, seems to be the only option; at least that would work.

    For example, attachment #2 of the above example3.eml:

    name="=?iso-8859-1?Q?Sis=E4profiili=20oikea=20v2.pdf?="
    Content-Disposition: attachment;
    filename="=?iso-8859-1?Q?Sis=E4profiili=20oikea=20v2.pdf?=";
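
    A rough sketch of how such an iso-8859-1 "Q" encoded-word could be built in VB6. QEncodeLatin1 is a made-up helper name, not part of CDO or any library. It assumes the system ANSI code page is Windows-1252 and that the name only contains characters that exist in iso-8859-1. It does not handle the 75-character limit for encoded-words.

    ```vb
    'Hypothetical helper: builds an RFC 2047 "Q" encoded-word for iso-8859-1.
    'Assumption: Asc() returns Windows-1252 codes (Western European system locale).
    Public Function QEncodeLatin1(ByVal Text As String) As String
        Const SAFE_CHARS As String = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
        Dim Index As Long
        Dim Code As Long
        Dim Char As String
        Dim Result As String

        For Index = 1 To Len(Text)
            Char = Mid$(Text, Index, 1)
            If InStr(1, SAFE_CHARS, Char, vbBinaryCompare) > 0 Then
                Result = Result & Char
            Else
                'Everything else, including space and ".", becomes =XX
                Code = Asc(Char) 'e.g. 228 (&HE4) for "ä" on a 1252 system
                Result = Result & "=" & Right$("0" & Hex$(Code), 2)
            End If
        Next

        QEncodeLatin1 = "=?iso-8859-1?Q?" & Result & "?="
    End Function
    ```

    This version is stricter than the hand-written header above: it also encodes the dot as =2E, which decoders should treat the same as a literal dot.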

  9. #9
    PowerPoster
    Join Date
    Jul 2010
    Location
    NYC
    Posts
    5,708

    (ignore, sorry)

  10. #10
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    Well taking the bull by the horns and decoding funky names myself, I have something that works for your test case.

    The problem seems to stem from "filename*" not being a legal ADO Field Name. At least that's all I can deduce.


    I handle this as a special case and do a ton of fiddling to extract the "filename*" value and decode it.


    [Screenshot: sshot.png]


    Does this look right to you?


    I renamed my example files from 1 through 3 to 0 through 2. Add your own example3.eml after unzipping this attachment (not included because it is so big, download it above if you want it).
    Attached Files

  11. #11

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2015
    Location
    Finland
    Posts
    679

    Quote Originally Posted by dilettante View Post
    The problem seems to stem from "filename*" not being a legal ADO Field Name. At least that's all I can deduce.

    I handle this as a special case and do a ton of fiddling to extract the "filename*" value and decode it.

    Does this look right to you?
    Yes, Sisäprofiili oikea is the name, 'inner profile right' in English.

    Thanks Dilettante.

    Investigated the filename dilemma a bit more and found that certain user agents generate 'filename*' instead of 'filename', and 'name*' instead of 'name', for the attachment filename parameter when the value is encoded. I don't know exactly why, as the RFCs (particularly RFC 6266 and RFC 5987)* do not give a clue. Certainly confusing, as the 'filename*' parameter name is used quite rarely. I ran a regex over some 250000 eml files containing attachments, but found fewer than 100 files where the 'filename*' notation was used.

    https://tools.ietf.org/html/rfc6266#section-4.1
    https://tools.ietf.org/html/rfc5987#section-3.2.1

    A good MIME information page; they have some downloadable test files (samples.zip).
    http://hunnysoft.com/mime/
    http://hunnysoft.com/mime/samples.zip

    The package has a test file, m3004.txt, produced by the Pine mail user agent, which uses the 'filename*' notation.
    Content-Type: TEXT/PLAIN; charset=iso-8859-1; name*="iso-8859-1''HasenundFr%F6sche.txt"
    Content-Transfer-Encoding: BASE64
    Content-ID: <Pine.LNX.4.21.0005191026120.8452@penguin.example.com>
    Content-Description: Short story in German
    Content-Disposition: attachment; filename*="iso-8859-1''HasenundFr%F6sche.txt"

    For example, the other 'filename*' occurrences I found were produced by the Elisa Webmail/1.0 user agent of the Finnish Internet service provider Elisa.
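
    As far as I know, the 'filename*' form is the extended parameter syntax from RFC 2231 (RFC 5987 is the HTTP profile of the same idea): charset, then an optional language tag, then the percent-encoded value, separated by apostrophes. A minimal sketch of splitting such a value in VB6 follows. SplitExtValue is a hypothetical helper name, and the caller is assumed to have already cut the value out of the header line (without the trailing semicolon).

    ```vb
    'Hypothetical helper: splits an extended parameter value such as
    '  UTF-8''S%C3%A4hk%C3%B6posti%20testi.pdf
    'into charset, language tag and the still percent-encoded value.
    'Returns False when the input is not in charset'language'value form.
    Public Function SplitExtValue(ByVal ExtValue As String, _
                                  ByRef Charset As String, _
                                  ByRef Language As String, _
                                  ByRef Encoded As String) As Boolean
        Dim Parts() As String

        'Pine wraps the whole value in double quotes (see m3004.txt above)
        If Len(ExtValue) >= 2 And Left$(ExtValue, 1) = """" And Right$(ExtValue, 1) = """" Then
            ExtValue = Mid$(ExtValue, 2, Len(ExtValue) - 2)
        End If

        'A limit of 3 keeps any further apostrophes inside the value part
        Parts = Split(ExtValue, "'", 3)
        If UBound(Parts) <> 2 Then Exit Function

        Charset = LCase$(Parts(0))  'charset names are not case-sensitive
        Language = Parts(1)         'usually empty
        Encoded = Parts(2)          'still needs percent-decoding in the given charset
        SplitExtValue = True
    End Function
    ```

    The charset matters for the next step: the Pine sample declares iso-8859-1, so its %F6 is a single Latin-1 byte, while the UTF-8 samples need their bytes decoded as UTF-8.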

  12. #12
    PowerPoster
    Join Date
    Feb 2006
    Posts
    24,482

    I suppose this extension to the EML file format came along after CDO was cast in stone. Exchange moved on to other things like a WebDAV API, and then a REST API, etc. Microsoft had given CDO to the Exchange folks and they more or less lost interest and just stopped doing more than bug fixes.

  13. #13

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2015
    Location
    Finland
    Posts
    679

    Quote Originally Posted by dilettante View Post
    I suppose this extension to the EML file format came along after CDO was cast in stone.
    Possibly so.

    Likewise, the 'random attachment naming' in user agents (mail applications) seems to me a kind of 'quick and dirty fix'.

    Although I can't imagine even one good reason why the 'filename*' notation was proposed and accepted by the IETF in the first place.

  14. #14
    PowerPoster
    Join Date
    Jun 2015
    Posts
    2,224

    good thread info.

    a slightly related blog post.
    https://blog.nodemailer.com/2017/01/...ent-filenames/

  15. #15

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2015
    Location
    Finland
    Posts
    679

    The mess with attachment filenames continues.

    At least certain mail clients from the 'bitten apple' encode 8-bit characters as a base character followed by a separate umlaut. The encoding is supposed to be UTF-8, but it does not quite look like it.

    For example, the character 'ä' is 'a%CC%88'.

    Code:
    Content-Disposition: attachment;
    	filename*=utf-8''Silma%CC%88t%20Test.pdf
    Content-Type: application/octet-stream;
    	x-unix-mode=0777;
    	name="=?utf-8?Q?Silma=CC=88t_Test=2Epdf?="
    Content-Transfer-Encoding: quoted-printable
    Other umlaut chars are encoded likewise
    char 'Ä' is capital 'A' and umlaut -> 'A%CC%88'
    char 'ö' is 'o%CC%88'
    char 'Ö' is 'O%CC%88'
    char 'ü' is 'u%CC%88'
    etc.

    Can't use
    Code:
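    'Needs CP_UTF8 (65001) and a MultiByteToWideChar Declare with "As Any" buffer parameters
    Function UTF8decode(ByVal Text As String) As String
        Dim bText() As Byte
        Dim bOut() As Byte
        Dim lOut As Long
        If Len(Text) = 0 Then Exit Function
        bText = StrConv(Text, vbFromUnicode) 'Text to ANSI byte array, one byte per char
        lOut = UBound(bText) + 1 'UTF-16 never needs more code units than there are UTF-8 bytes
        ReDim bOut(lOut * 2 - 1)
        lOut = MultiByteToWideChar(CP_UTF8, 0&, ByVal VarPtr(bText(0)), UBound(bText) + 1, ByVal VarPtr(bOut(0)), lOut)
        If lOut > 0 Then
            ReDim Preserve bOut(lOut * 2 - 1) 'trim to the characters actually written
            UTF8decode = bOut
        End If
    End Function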
    Code:
    'StrConv and MultiByteToWideChar
    sAttachmentFileName = UTF8decode(myMail.Attachments(iAttachment).Filename)
    'nor only
    StrConv(myMail.Attachments(iAttachment).Filename, vbFromUnicode, 1035)
    Any idea how to decode these?
    Last edited by Tech99; Mar 20th, 2018 at 03:03 PM.

  16. #16
    PowerPoster
    Join Date
    Jun 2015
    Posts
    2,224

    those are proper URL Encoded UTF-8. what are you using to URL Decode?

  17. #17

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2015
    Location
    Finland
    Posts
    679

    I am using that UTF8decode function. It decodes 'properly', but not quite as wanted: the character in the filename is 'ä', not 'a' followed by an umlaut, which is what the decoded result contains.
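
    One way to see what the decoded string really contains is to list its UTF-16 code units. The sketch below uses a made-up helper name, CodeUnits; it is only a diagnostic and not part of any library.

    ```vb
    'Hypothetical diagnostic: lists the UTF-16 code units of a string as U+XXXX.
    Public Function CodeUnits(ByVal Text As String) As String
        Dim Index As Long
        Dim Code As Long
        For Index = 1 To Len(Text)
            'AscW returns a signed Integer, so mask it into the 0..65535 range
            Code = AscW(Mid$(Text, Index, 1)) And &HFFFF&
            CodeUnits = CodeUnits & "U+" & Right$("000" & Hex$(Code), 4) & " "
        Next
        CodeUnits = RTrim$(CodeUnits)
    End Function
    ```

    For the decomposed form, CodeUnits("a" & ChrW$(&H308)) gives "U+0061 U+0308" (the letter plus a combining diaeresis), while the precomposed CodeUnits(ChrW$(&HE4)) gives "U+00E4". Both display as 'ä' where the font supports it, but they are different strings.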

  18. #18
    PowerPoster
    Join Date
    Jun 2015
    Posts
    2,224

    "URLDecode" not UTF8Decode

    edit: funny story, MultiByteToWideChar doesn't compose them properly anyway...

    you can try this. MessageBox renders it properly.

    Code:
    Public Function URLDecode(ByVal Chars As String) As String
        Dim Bytes() As Byte
        Dim Char As String
        Dim CurChar As Long
        Dim CurByte As Long
        Dim Count As Long
        
        If Len(Chars) = 0 Then Exit Function 'AscB("") further down would raise an error
        
        Do
            CurChar = CurChar + 1
            If Mid$(Chars, CurChar, 1) = "%" Then
                Count = Count + 1
                CurChar = CurChar + 2
            Else
                Count = Count + 1
            End If
        Loop Until CurChar >= Len(Chars)
        
        ReDim Bytes(0 To Count - 1) As Byte
        
        CurChar = 0
        If Count Then
            Do
                CurChar = CurChar + 1
                Char = Mid$(Chars, CurChar, 1)
                Select Case Char
                    Case "+"
                        Bytes(CurByte) = 32
                    Case "%"
                        Bytes(CurByte) = Val("&H" & Mid$(Chars, CurChar + 1, 2))
                        CurChar = CurChar + 2
                    Case Else
                        Bytes(CurByte) = AscB(Char)
                End Select
                CurByte = CurByte + 1
            Loop Until CurChar >= Len(Chars)
        End If
        
        URLDecode = UTF8.GetChars(Bytes)
    End Function
    Last edited by DEXWERX; Mar 20th, 2018 at 04:23 PM.

  19. #19

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2015
    Location
    Finland
    Posts
    679

    Thanks DEXWERX, but it does not render.

    Is the missing UTF8.GetChars(Bytes) a .NET System.Text call? I can't use that, as this is a VB6 app; I would have to write a wrapper function.

  20. #20
    PowerPoster
    Join Date
    Jun 2015
    Posts
    2,224

    You could have used your UTF8Decode function there.

    UTF8.bas
    Code:
    Private Const CP_UTF8 As Long = 65001
    Private Declare Function WideCharToMultiByte Lib "kernel32" (ByVal CodePage As Long, ByVal dwFlags As Long, ByRef lpWideCharStr As Any, ByVal cchWideChar As Long, ByRef lpMultiByteStr As Any, ByVal cbMultiByte As Long, Optional ByVal lpDefaultChar As Long, Optional ByVal lpUsedDefaultChar As Long) As Long
    Private Declare Function MultiByteToWideChar Lib "kernel32" (ByVal CodePage As Long, ByVal dwFlags As Long, ByRef lpMultiByteStr As Any, ByVal cbMultiByte As Long, ByRef lpWideCharStr As Any, ByVal cchWideChar As Long) As Long
    
    
    Public Function GetBytes(ByVal Chars As String) As Byte()
        Dim Bytes() As Byte
        Dim Length As Long
        Length = WideCharToMultiByte(CP_UTF8, _
                                     0, _
                                     ByVal StrPtr(Chars), _
                                     Len(Chars), _
                                     ByVal 0&, _
                                     0)
        ReDim Bytes(0 To Length - 1)
        WideCharToMultiByte CP_UTF8, _
                            0, _
                            ByVal StrPtr(Chars), _
                            Len(Chars), _
                            Bytes(0), _
                            Length
        GetBytes = Bytes()
    End Function
    
    Public Function GetChars(Bytes() As Byte) As String
        Dim Length As Long
        Length = MultiByteToWideChar(CP_UTF8, _
                                     0, _
                                     Bytes(0), _
                                     UBound(Bytes) + 1, _
                                     ByVal 0&, _
                                     0)
        GetChars = String$(Length, vbNullChar)
        MultiByteToWideChar CP_UTF8, _
                            0, _
                            Bytes(0), _
                            UBound(Bytes) + 1, _
                            ByVal StrPtr(GetChars), _
                            Len(GetChars)
    End Function
    and here's my URLEncode for completeness. They're not very fast, but they are correct.

    Code:
    Public Function UrlEncode(ByVal Chars As String, _
                              Optional SpaceToPlus As Boolean = True _
                              ) As String
                              
        Const LEGAL_CHARS As String = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~"
        
        Dim Bytes() As Byte
        Dim Char As String
        Dim Index As Long
        
        Bytes = UTF8.GetBytes(Chars)
        
        For Index = 0 To UBound(Bytes)
            
            Char = Chr$(Bytes(Index))
            
            If InStr(LEGAL_CHARS, Char) Then
                UrlEncode = UrlEncode & Char
            ElseIf Bytes(Index) = &H20 And SpaceToPlus Then
                UrlEncode = UrlEncode & "+"
            ElseIf Bytes(Index) < &H10 Then
                UrlEncode = UrlEncode & "%0" & Hex$(Bytes(Index))
            Else
                UrlEncode = UrlEncode & "%" & Hex$(Bytes(Index))
            End If
        Next
        
    End Function
    Last edited by DEXWERX; Mar 21st, 2018 at 07:06 AM.

  21. #21

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2015
    Location
    Finland
    Posts
    679

    Thanks, GetChars causes program crash.

    Changed API declarations etc.

    Code:
    Private Declare Function WideCharToMultiByte Lib "kernel32.dll" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long, ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, ByVal lpDefaultChar As Long, ByVal lpUsedDefaultChar As Long) As Long
    
    Private Declare Function MultiByteToWideChar Lib "kernel32.dll" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long
    Code:
    Private Function GetChars(ByRef Source() As Byte) As String
    'Convert from UTF-8
    Dim lSize As Long
    Dim lPointer As Long
    Dim lLength As Long
    Dim Buffer As String
    
    lSize = UBound(Source) - LBound(Source) + 1
    lPointer = VarPtr(Source(LBound(Source)))
    lLength = MultiByteToWideChar(CP_UTF8, 0&, lPointer, lSize, 0&, 0&)
    Buffer = Space$(lLength)
    MultiByteToWideChar CP_UTF8, 0&, lPointer, lSize, StrPtr(Buffer), lLength
    GetChars = Buffer
    End Function

  22. #22

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2015
    Location
    Finland
    Posts
    679

    Tested this latest revision, no worky...

    For example, a%CC%88 still translates to an 'a' char + umlaut. Same with the other umlaut chars.

    Maybe now is the time to do a 'raw conversion' at the file level using lookup tables, before any further file processing.

  23. #23
    PowerPoster
    Join Date
    Jun 2015
    Posts
    2,224

    Quote Originally Posted by Tech99 View Post
    Thanks, GetChars causes program crash.

    Changed API declarations etc.

    right, your API declarations aren't compatible with my use.
    If you paste them into their own UTF8.bas they work correctly.

  24. #24
    PowerPoster
    Join Date
    Jun 2015
    Posts
    2,224

    Quote Originally Posted by Tech99 View Post
    Tested this latest revision, no worky...

    For example, a%CC%88 still translates to an 'a' char + umlaut. Same with the other umlaut chars.

    Maybe now is the time to do a 'raw conversion' at the file level using lookup tables, before any further file processing.
    yes, it's a known issue with the APIs, and no flag fixes it; they won't normalize a postfixed umlaut.
    MS never updated their normalization tables. They've left them broken since Vista.

    No big deal though really - the char + umlaut is not wrong - that's how it was URL-encoded at the source.

    garbage in, garbage out

    it should have been encoded as %C3%A4 not a%CC%88


    edit: if it's telling you the file is named a%CC%88, why are you normalizing it?

    for reference: https://stackoverflow.com/questions/...on-environment

    https://github.com/walling/unorm

    https://stackoverflow.com/questions/...ion-in-windows
    Last edited by DEXWERX; Mar 21st, 2018 at 09:34 AM.
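
    The two spellings are both valid UTF-8; they just encode different code points. 'ä' as one character is U+00E4, while the Apple form is 'a' followed by U+0308 (combining diaeresis). A small sketch of the two-byte UTF-8 arithmetic shows where %C3%A4 and %CC%88 come from. PercentEncode2Byte is a hypothetical helper name and only covers U+0080 to U+07FF.

    ```vb
    'Hypothetical helper: percent-encodes one code point in the range
    'U+0080..U+07FF as two UTF-8 bytes (110xxxxx 10xxxxxx).
    Public Function PercentEncode2Byte(ByVal CodePoint As Long) As String
        Dim Lead As Long
        Dim Trail As Long

        If CodePoint < &H80 Or CodePoint > &H7FF Then
            Err.Raise 5 'Invalid procedure call: outside the two-byte range
        End If

        Lead = &HC0 Or (CodePoint \ &H40)     'marker 110 + top 5 bits
        Trail = &H80 Or (CodePoint And &H3F)  'marker 10 + low 6 bits
        PercentEncode2Byte = "%" & Hex$(Lead) & "%" & Hex$(Trail)
    End Function
    ```

    PercentEncode2Byte(&HE4) returns "%C3%A4", and "a" & PercentEncode2Byte(&H308) returns "a%CC%88".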

  25. #25
    PowerPoster
    Join Date
    Jun 2017
    Location
    Frankfurt
    Posts
    3,048

    Quote Originally Posted by DEXWERX View Post
    yes, it's a known issue with the APIs, and no flag fixes it; they won't normalize a postfixed umlaut.
    MS never updated their normalization tables. They've left them broken since Vista.

    No big deal though really - the char + umlaut is not wrong - that's how it was URL-encoded at the source.

    garbage in, garbage out

    it should have been encoded as %C3%A4 not a%CC%88


    edit: if it's telling you the file is named a%CC%88, why are you normalizing it?

    for reference: https://stackoverflow.com/questions/...on-environment

    https://github.com/walling/unorm

    https://stackoverflow.com/questions/...ion-in-windows
    glad to know I'm not the only one that gets garbage

    Code:
    Private Sub Form_Load()
    Text2.text = UmlauteErsetzen("Silma%CC%88t%20Test.pdf")
    End Sub
    
    Function UmlauteErsetzen(ByVal s As String) As String
        'Precomposed characters to percent-encoded UTF-8
        s = Replace(s, "Ä", "%C3%84", , , vbBinaryCompare)
        s = Replace(s, "Ö", "%C3%96", , , vbBinaryCompare)
        s = Replace(s, "Ü", "%C3%9C", , , vbBinaryCompare)
        s = Replace(s, "ä", "%C3%A4", , , vbBinaryCompare)
        s = Replace(s, "ö", "%C3%B6", , , vbBinaryCompare)
        s = Replace(s, "ü", "%C3%BC", , , vbBinaryCompare)
        s = Replace(s, "ß", "%C3%9F", , , vbBinaryCompare) 'UTF-8 for ß is C3 9F; DF is the Latin-1 byte
        
        'Decomposed "a" + combining diaeresis (CC 88) to the precomposed form
        s = Replace(s, "a%CC%88", "%C3%A4", , , vbBinaryCompare)
    
        UmlauteErsetzen = s
    End Function
    regards
    Chris
    Last edited by ChrisE; Mar 21st, 2018 at 10:58 AM.

  26. #26
    PowerPoster
    Join Date
    Jun 2015
    Posts
    2,224

    Hmm let's see if this works... https://msdn.microsoft.com/en-us/lib...(v=vs.85).aspx

    try this

    Code:
    Private Enum NORM_FORM
      NormalizationOther
      NormalizationC
      NormalizationD
      NormalizationKC = &H5
      NormalizationKD
    End Enum
    
    Private Declare Function NormalizeString Lib "normaliz" ( _
        ByVal NormForm As NORM_FORM, _
        ByVal lpSrcString As Long, _
        Optional ByVal cwSrcLength As Long = -1&, _
        Optional ByVal lpDstString As Long, _
        Optional ByVal cwDstLength As Long _
        ) As Long
    Private Declare Function IsNormalizedString Lib "normaliz" ( _
        ByVal NormForm As NORM_FORM, _
        ByVal lpString As Long, _
        Optional ByVal cwLength As Long = -1& _
        ) As Long
    
    Public Function IsNormalized(Str As String) As Boolean
        IsNormalized = IsNormalizedString(NormalizationC, StrPtr(Str))
    End Function
    
    Public Function Normalize(NonNormalized As String) As String
        Dim Length As Long
        Length = NormalizeString(NormalizationC, StrPtr(NonNormalized))
        Normalize = String$(Length, vbNullChar)
        Length = NormalizeString(NormalizationC, _
                                 StrPtr(NonNormalized), _
                                 Len(NonNormalized), _
                                 StrPtr(Normalize), _
                                 Len(Normalize))
        Normalize = Left$(Normalize, Length)
    End Function
    Code:
    ?Normalize(URL.Decode("a%CC%88"))
    ä
    Last edited by DEXWERX; Mar 21st, 2018 at 10:54 AM. Reason: added IsNormalized()

  27. #27

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2015
    Location
    Finland
    Posts
    679

    Quote Originally Posted by DEXWERX View Post
    edit: if it's telling you the file is name a%CC%88 why are you normalizing it?
    Applications that handle only ANSI filenames consume these email attachments (mainly CAD drawings etc.), which are sent to us for quotations and when placing manufacturing orders.
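
    Since the consumers are ANSI-only, it may help to check whether a decoded name survives the conversion to Windows-1252 at all. Below is a minimal sketch, assuming code page 1252 is the target; FitsInCp1252 is a hypothetical helper name. It asks WideCharToMultiByte to report whether it had to substitute the default character, and uses WC_NO_BEST_FIT_CHARS so that near-miss characters are not silently swapped for lookalikes.

    ```vb
    Private Const CP_WINDOWS_1252 As Long = 1252
    Private Const WC_NO_BEST_FIT_CHARS As Long = &H400

    Private Declare Function WideCharToMultiByte Lib "kernel32" ( _
        ByVal CodePage As Long, ByVal dwFlags As Long, _
        ByVal lpWideCharStr As Long, ByVal cchWideChar As Long, _
        ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, _
        ByVal lpDefaultChar As Long, ByVal lpUsedDefaultChar As Long) As Long

    'Hypothetical helper: True when every character of Text has an exact
    'Windows-1252 representation.
    Public Function FitsInCp1252(ByVal Text As String) As Boolean
        Dim Buffer() As Byte
        Dim UsedDefault As Long 'receives a BOOL from the API

        If Len(Text) = 0 Then
            FitsInCp1252 = True
            Exit Function
        End If

        'Code page 1252 is single-byte, so this buffer is more than large enough
        ReDim Buffer(0 To Len(Text) * 2 - 1)
        If WideCharToMultiByte(CP_WINDOWS_1252, WC_NO_BEST_FIT_CHARS, _
                               StrPtr(Text), Len(Text), _
                               VarPtr(Buffer(0)), UBound(Buffer) + 1, _
                               0&, VarPtr(UsedDefault)) > 0 Then
            FitsInCp1252 = (UsedDefault = 0)
        End If
    End Function
    ```

    With this check, a name that still contains 'a' + U+0308 should be reported as not fitting, while the same name after the Normalize() call from the earlier post should pass.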

  28. #28
    PowerPoster
    Join Date
    Jun 2015
    Posts
    2,224

    see the above Normalize() function

  29. #29

    Thread Starter
    Fanatic Member
    Join Date
    Apr 2015
    Location
    Finland
    Posts
    679

    Quote Originally Posted by DEXWERX View Post
    Hmm let's see if this works...
    It did not occur to me to try normalizing the string; glad you steered me that way. Thanks, it works. As for these attachment filename problems in general, we face them mainly from 'bitten apple' people: artists, designers etc.
    Engineers, CAD operators etc. tend to use 'proper' systems; we have never had problems with email attachment filenames received from the latter group.

    That particular sender, whose filenames contain Scandinavian characters like 'a%CC%88', is a domestic partner, so there should be no problems with locales etc., but I found out that different versions of the 'Apple Mail' client app encode and behave differently.

    Like
    Content-Type: application/pdf;
    name*=UTF-8''SEIN%C3%84.pdf
    Content-Disposition: attachment;
    filename*=UTF-8''SEIN%C3%84.pdf;

    and then the %CC%88 form. The order of the name/filename/Content-Disposition lines also differs between versions, as does the mix of encodings used for filename and name.

    Content-Disposition: attachment;
    filename*=utf-8''...
    Content-Type: application/octet-stream;
    name="=?utf-8?Q?...

  30. #30
    PowerPoster
    Join Date
    Jun 2015
    Posts
    2,224

    I'm glad we have a Normalizing API to address the shortcomings of MultiByteToWideChar() / MB_PRECOMPOSED.
    Javascripters are stuck rolling their own Unicode Normalize. :/

    Quote Originally Posted by MSDN
    The use of the MB_PRECOMPOSED flag has very little effect on most code pages because most input data is composed already. Consider calling NormalizeString after converting with MultiByteToWideChar. NormalizeString provides more accurate, standard, and consistent data, and can also be faster. Note that for the NORM_FORM enumeration being passed to NormalizeString, NormalizationC corresponds to MB_PRECOMPOSED and NormalizationD corresponds to MB_COMPOSITE.
