How to parse an XML file without the BOM character
Hello.
I have to parse several XML files, and I can't control how they are written.
They give me an "invalid character" error at line 1, position 1 in the StreamReader because of the BOM character.
Can I parse each of them without the BOM, without having to rewrite them?
Thanks.
Re: How to parse an XML file without the BOM character
It sort of depends on your parser, but I can tell you enough to get started.
Reading text in .NET always ends up using a TextReader at the deepest, lowest level. You usually don't have to worry about this because convenience overloads create the TextReader for you. That class is abstract; the most commonly used implementation is StreamReader. Another fact that is rarely brought to your attention is that any time a StreamReader is used, an Encoding object is selected and configured. The API didn't put a property to control BOM generation on the abstract Encoding class, but specific classes like UTF8Encoding take constructor parameters that let you control it.
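As a small sketch of what that constructor parameter does (the module and variable names here are just for illustration), you can compare the preamble each UTF8Encoding instance reports. The preamble is the BOM the encoding is configured to use:

```vbnet
Imports System
Imports System.Text

Module PreambleDemo
    Sub Main()
        ' True: GetPreamble() returns the three UTF-8 BOM bytes.
        Dim withBom As New UTF8Encoding(True)
        ' False: GetPreamble() returns an empty array.
        Dim withoutBom As New UTF8Encoding(False)

        Console.WriteLine(BitConverter.ToString(withBom.GetPreamble())) ' EF-BB-BF
        Console.WriteLine(withoutBom.GetPreamble().Length)              ' 0
    End Sub
End Module
```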
So the process to control BOM for most of the XML APIs is:
- Create the appropriate Encoding object with the appropriate BOM configuration.
- Create a StreamReader using that Encoding.
- Create your XML parser/parse your XML using that StreamReader.
My guess is you're not reading the files as UTF-8, because I think UTF-8 handles the BOM by default. Getting this error might mean you're trying to parse UTF-8 files with Encoding.ASCII. Consider reaching for UTF-8 as your default from now on. It is friendlier to multiple languages and has been meant to replace ASCII since 1998. The only reason anyone knows ASCII anymore is that each generation stupidly passes it down to the next generation of developers as some kind of perverse joke. And old hardware.
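To see why the wrong encoding could produce an error at line 1, position 1, here is a rough sketch that decodes BOM-prefixed bytes as ASCII. The byte array stands in for one of your files, and whether this is what actually happened in your case is only an assumption:

```vbnet
Imports System
Imports System.Text

Module AsciiBomDemo
    Sub Main()
        ' Stand-in for a UTF-8 file that starts with a BOM (EF BB BF).
        Dim bom As Byte() = New Byte() {&HEF, &HBB, &HBF}
        Dim body As Byte() = Encoding.UTF8.GetBytes("<root/>")
        Dim fileBytes(bom.Length + body.Length - 1) As Byte
        bom.CopyTo(fileBytes, 0)
        body.CopyTo(fileBytes, bom.Length)

        ' ASCII cannot represent bytes above 127, so each BOM byte
        ' is replaced with "?" and lands in front of the XML.
        Console.WriteLine(Encoding.ASCII.GetString(fileBytes)) ' ???<root/>
    End Sub
End Module
```

An XML parser handed that string sees junk before the root element, which would fit the error you describe.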
Anyway, the setup for all of the XML parsing APIs is the same. First you have to create your encoding and your StreamReader:
Code:
Dim encoding As New System.Text.UTF8Encoding(True) ' The Boolean parameter controls the BOM
Dim reader As New System.IO.StreamReader("your file path", encoding)
Remember to dispose of the StreamReader when you're done; a Using statement is handy for that.
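A minimal sketch of the Using approach follows. The temp file is only a stand-in for one of your real files so the example runs on its own, and the variable names are my own. It loads the file with XmlDocument, but any of the parsers would sit inside the Using block the same way:

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Xml

Module UsingDemo
    Sub Main()
        ' Stand-in for one of your files: a temp file written as UTF-8 with a BOM.
        Dim filePath As String = Path.GetTempFileName()
        File.WriteAllText(filePath, "<root/>", New UTF8Encoding(True))

        Dim utf8 As New UTF8Encoding(True)
        Using reader As New StreamReader(filePath, utf8)
            Dim doc As New XmlDocument()
            doc.Load(reader)
            Console.WriteLine(doc.DocumentElement.Name) ' root
        End Using ' The reader is disposed here, even if Load throws.

        File.Delete(filePath)
    End Sub
End Module
```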
If you're using XmlTextReader:
Code:
Dim xmlReader As New XmlTextReader(reader)
If you're using XmlDocument:
Code:
Dim xmlDocument As New XmlDocument()
xmlDocument.Load(reader)
If you're using LINQ-to-XML:
Code:
' Load is a Shared method that returns the parsed document
Dim xDoc As XDocument = XDocument.Load(reader)
That will get you an XML parser that doesn't choke on the BOM. As far as I know, it is the StreamReader (rather than UTF8Encoding itself) that checks for the BOM: if one is there it gets skipped, and if it isn't, the reader soldiers on and decodes the bytes as UTF-8. Part of why this also works for plain ASCII files is that the lower 128 code points of UTF-8 are identical to ASCII for compatibility.
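If you want to check the with-BOM and without-BOM cases yourself, something like this sketch should do it. Encode and RootName are hypothetical helpers I made up for the test, and the in-memory bytes stand in for your files:

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Xml.Linq

Module BomOptionalDemo
    ' Hypothetical helper: UTF-8 bytes for the text, with or without a leading BOM.
    Function Encode(text As String, emitBom As Boolean) As Byte()
        Using stream As New MemoryStream()
            Using writer As New StreamWriter(stream, New UTF8Encoding(emitBom))
                writer.Write(text)
            End Using
            Return stream.ToArray()
        End Using
    End Function

    ' Hypothetical helper: parses the bytes through a StreamReader, as in the steps above.
    Function RootName(data As Byte()) As String
        Using stream As New MemoryStream(data)
            Using reader As New StreamReader(stream, New UTF8Encoding(True))
                Return XDocument.Load(reader).Root.Name.LocalName
            End Using
        End Using
    End Function

    Sub Main()
        Dim withBom As Byte() = Encode("<root/>", True)
        Dim withoutBom As Byte() = Encode("<root/>", False)

        Console.WriteLine(withBom.Length & " bytes -> " & RootName(withBom))       ' 10 bytes -> root
        Console.WriteLine(withoutBom.Length & " bytes -> " & RootName(withoutBom)) ' 7 bytes -> root
    End Sub
End Module
```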
Re: How to parse an XML file without the BOM character
Thank you.
I'm using StreamReader and now it works.
Quote:
Originally Posted by Sitten Spynne
It sort of depends on your parser, but I can tell you enough to get started.
Code:
Dim encoding As New System.Text.UTF8Encoding(True) ' The boolean parameter controls BOM
Dim reader As New System.IO.StreamReader("your file path", encoding)