Thread: How to parse an XML file without the BOM character

  1. #1

    Thread Starter
    Junior Member
    Join Date
    Jan 2013
    Posts
    24

    How to parse an XML file without the BOM character

    Hello.
    I have to parse several XML files, and I have no control over how they are written.
    The StreamReader gives me an invalid character error at line 1, position 1 because of the BOM character.
    Can I parse each of them without the BOM, without having to rewrite them?
    Thanks.

  2. #2
    You don't want to know.
    Join Date
    Aug 2010
    Posts
    4,578

    Re: How to parse an XML file without the BOM character

    It sort of depends on your parser, but I can tell you enough to get started.

    Reading text in .NET always ends up using a TextReader at the deepest, lowest level. You usually don't have to worry about this because convenience overloads create the TextReader for you. That class is abstract; the most commonly used implementation is StreamReader. Another fact that is not commonly brought to your attention is that any time a StreamReader is used, an Encoding object is selected and configured. The API didn't put a property to control BOM generation on the abstract Encoding class, but specific classes like UTF8Encoding take constructor parameters that allow you to control it.

    So the process to control BOM for most of the XML APIs is:
    • Create the appropriate Encoding object with the appropriate BOM configuration.
    • Create a StreamReader using that Encoding.
    • Create your XML parser/parse your XML using that StreamReader.


    I guess you're not using UTF-8, because I think it uses/expects a BOM by default. That you're getting this error might mean you're trying to parse UTF-8 files with Encoding.ASCII. Consider reaching for UTF-8 as your default from now on. It is friendlier to multiple languages and has been meant to replace ASCII since 1998. The only reason anyone knows ASCII anymore is that each generation stupidly passes ASCII down to the next generation of developers as some kind of perverse joke. And old hardware.

    Anyway, the setup for all of the XML parsing APIs is the same. First you have to create your encoding and your StreamReader:
    Code:
    ' True = this encoding emits a BOM when writing; when reading, StreamReader detects and skips a BOM.
    Dim encoding As New System.Text.UTF8Encoding(True)
    Dim reader As New System.IO.StreamReader("your file path", encoding)
    Remember to dispose of the StreamReader when you're done; a Using statement might be handy.
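
    A minimal sketch of that Using pattern, assuming a console program that receives the file path as its first command-line argument. The module name and the use of args(0) are made up for illustration; the ReadLine call only stands in for whatever parser you hand the reader to.

    ```vbnet
    Imports System.IO
    Imports System.Text

    Module UsingSketch
        Sub Main(args As String())
            ' Assumption: args(0) is the path to one of your XML files.
            Dim encoding As New UTF8Encoding(True)

            Using reader As New StreamReader(args(0), encoding)
                ' Hand "reader" to your XML parser here; this just peeks at the first line.
                Console.WriteLine(reader.ReadLine())
            End Using ' The reader and its file handle are released here, even if an exception is thrown.
        End Sub
    End Module
    ```

    The point of Using is that you can't forget the Dispose call, which matters if you are looping over several files.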

    If you're using XmlTextReader:
    Code:
    Dim xmlReader As New XmlTextReader(reader)
    If you're using XmlDocument:
    Code:
    Dim xmlDocument As New XmlDocument()
    xmlDocument.Load(reader)
    If you're using LINQ-to-XML:
    Code:
    ' XDocument.Load is a Shared method that returns a new document, so call it on the type.
    Dim xDoc As XDocument = XDocument.Load(reader)
    That will get you an XML parser that doesn't choke on the BOM. I'm not sure what it does if the BOM isn't there. I do know that StreamReader specifically checks for the BOM and, if it isn't there, tries to soldier on with the encoding you gave it. Part of why this works is that the lower 128 code points of UTF-8 are identical to ASCII for compatibility.
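
    If you want to check the with-BOM and without-BOM cases yourself, here is a rough sketch. It assumes a console program with System.Xml.Linq available; RoundTrip is a hypothetical helper, not part of any API, and the tiny XML string is made up. It writes a temp file with or without a BOM, then reads it back through the same StreamReader setup as above.

    ```vbnet
    Imports System.IO
    Imports System.Text
    Imports System.Xml.Linq

    Module BomCheckSketch
        ' Hypothetical helper: writes a small XML file (with or without a BOM),
        ' parses it back through a StreamReader, and returns the root element name.
        Function RoundTrip(writeBom As Boolean) As String
            Dim tempFile As String = Path.GetTempFileName()
            Try
                ' UTF8Encoding(True) makes WriteAllText emit the BOM; UTF8Encoding(False) omits it.
                File.WriteAllText(tempFile, "<root><item>1</item></root>", New UTF8Encoding(writeBom))

                Using reader As New StreamReader(tempFile, New UTF8Encoding(True))
                    Return XDocument.Load(reader).Root.Name.LocalName
                End Using
            Finally
                File.Delete(tempFile)
            End Try
        End Function

        Sub Main()
            Console.WriteLine("With BOM:    " & RoundTrip(True))
            Console.WriteLine("Without BOM: " & RoundTrip(False))
        End Sub
    End Module
    ```

    As far as I understand StreamReader, both calls should parse cleanly, but run it against your own files before trusting that.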
    Last edited by Sitten Spynne; Aug 31st, 2016 at 01:49 PM.

  3. #3

    Thread Starter
    Junior Member
    Join Date
    Jan 2013
    Posts
    24

    Resolved Re: How to parse an XML file without the BOM character

    Thank you.

    I'm using StreamReader and now it works.

    Originally Posted by Sitten Spynne:
    It sort of depends on your parser, but I can tell you enough to get started.

    Code:
    Dim encoding As New System.Text.UTF8Encoding(True) ' The boolean parameter controls BOM
    Dim reader As New System.Io.StreamReader("your file path", encoding)
    Last edited by mehrlicht; Aug 31st, 2016 at 04:38 PM. Reason: missing quote ending tag
