Thread: XML cleanup with missing tags

  1. #1

    Thread Starter
    Hyperactive Member
    Join Date
    Nov 2008
    Location
    USA
    Posts
    257

    XML cleanup with missing tags

    I am trying to clean up XML entries so that I can read them with XmlTextReader without errors. I add a default header and footer whenever I notice that the opening or closing tags are missing, remove illegal characters, and check for Unicode. Even so, some entries still fail with the error "Data at the root level is invalid", and when I check, the entry has either slipped through the cleaning process or has an unmatched tag somewhere. Right now I use


    Code:
    Dim stringSplitter() As String = {"</entry>"} 
            ' split the file content based on the closing entry tag 
            sampleResults = _html.Split(stringSplitter, StringSplitOptions.RemoveEmptyEntries)
    to split my XML into individual entries before I start the cleanup process. Here are my default header and footer:


    Code:
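    ' Atom namespace declaration placed on the default <entry> tag
    Private defaultNameSpace As String = "xmlns=""http://www.w3.org/2005/Atom"""
        ' XML declaration plus an opening <entry> tag, prepended when an entry has no opening tag
        Private header As String = "<?xml version=""1.0"" encoding=""utf-8""?>" & vbNewLine & "<entry " & defaultNameSpace & ">"
        ' Closing tag, appended when an entry has no closing tag
        Private footer As String = "</entry>"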
    Is there any tool in the .NET Framework that can detect and clean up unmatched tags so that I can get this to work? It works for the most part with a whole bunch of If statements, but I am wondering whether there is something more practical to use.
    -- Please rate me if I am helpful --

  2. #2
    PowerPoster techgnome's Avatar
    Join Date
    May 2002
    Posts
    34,687

    Re: XML cleanup with missing tags

    What's the source of the XML? Are you building it yourself? If you are, then don't build XML using strings like that... it'll just cause you heartache (as you're finding out). Instead use the System.Xml namespace and create an XmlDocument. From there you can create nodes, attributes and so on. Then once you have your XmlDocument built and all the nodes added, you can use the .OuterXml property to get the XML itself, or use .Save/.Load to write to and read from files (.LoadXml parses a string).
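
    Something along these lines, as a rough sketch only. The title element and its text are made up for illustration; the Atom namespace is the one from the first post.

```vbnet
Imports System
Imports System.Xml

Module BuildEntryDemo
    Sub Main()
        Const AtomNs As String = "http://www.w3.org/2005/Atom"

        Dim doc As New XmlDocument()
        ' <?xml version="1.0" encoding="utf-8"?>
        doc.AppendChild(doc.CreateXmlDeclaration("1.0", "utf-8", Nothing))

        ' Root <entry> element in the Atom namespace
        Dim entry As XmlElement = doc.CreateElement("entry", AtomNs)
        doc.AppendChild(entry)

        ' Hypothetical child element; InnerText escapes &, < and > for you
        Dim title As XmlElement = doc.CreateElement("title", AtomNs)
        title.InnerText = "Fish & chips <special>"
        entry.AppendChild(title)

        ' The whole document as a string; opening and closing tags always match
        Console.WriteLine(doc.OuterXml)
    End Sub
End Module
```

    Because the document is built from nodes rather than concatenated text, there is no way to end up with an unmatched tag or an unescaped character.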

    -tg
    * I don't respond to private (PM) requests for help. It's not conducive to the general learning of others.*
    * I also don't respond to friend requests. Save a few bits and don't bother. I'll just end up rejecting anyways.*
    * How to get EFFECTIVE help: The Hitchhiker's Guide to Getting Help at VBF - Removing eels from your hovercraft *
    * How to Use Parameters * Create Disconnected ADO Recordset Clones * Set your VB6 ActiveX Compatibility * Get rid of those pesky VB Line Numbers * I swear I saved my data, where'd it run off to??? *

  3. #3
    Karen Payne MVP kareninstructor's Avatar
    Join Date
    Jun 2008
    Location
    Oregon
    Posts
    6,714

    Re: XML cleanup with missing tags

    As per techgnome's suggestion, check out the examples at http://msdn.microsoft.com/en-us/vbasic/bb688087.aspx

    Here is a quick/simple look at how you can easily construct a document using LINQ to XML.

    Code:
        Public Sub xDemo()
    
            ' Sample data: parallel arrays of names and prices
            Dim Fruits() As String = {"Apple", "Peach", "Orange", "Grape"}
            Dim Prices() As Double = {1.23, 2.0, 1.11, 0.87}
    
            ' XML literal: an XDocument with an empty <foo/> root
            Dim doc = <?xml version="1.0" encoding="UTF-16" standalone="yes"?><foo/>
    
            For x As Integer = 0 To Fruits.Length - 1
                ' Append one <Entry> per fruit to the root element
                doc...<foo>(0).Add( _
                                  <Entry>
                                      <Key><%= x + 1 %></Key>
                                      <Value><%= Fruits(x) %></Value>
                                      <Price><%= Prices(x) %></Price>
                                  </Entry>)
            Next
    
            Console.WriteLine(doc.ToString)
    
        End Sub
    Output
    Code:
    <foo>
      <Entry>
        <Key>1</Key>
        <Value>Apple</Value>
        <Price>1.23</Price>
      </Entry>
      <Entry>
        <Key>2</Key>
        <Value>Peach</Value>
        <Price>2</Price>
      </Entry>
      <Entry>
        <Key>3</Key>
        <Value>Orange</Value>
        <Price>1.11</Price>
      </Entry>
      <Entry>
        <Key>4</Key>
        <Value>Grape</Value>
        <Price>0.87</Price>
      </Entry>
    </foo>

  4. #4

    Thread Starter
    Hyperactive Member
    Join Date
    Nov 2008
    Location
    USA
    Posts
    257

    Re: XML cleanup with missing tags

    Reply to techgnome:
    The XML data is captured unformatted from a site into text files, after which I take the data and try to form valid XML entries. When I try XmlDocument.LoadXml(),
    I get the error

    Name cannot begin with the '<' character, hexadecimal value 0x3C.

    So I clean the characters before loading the XML document, and fall back to text parsing when the document still cannot be loaded.
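
    For reference, the load attempt can be wrapped so that a failure reports where the parser gave up, which helps to see what kind of damage an entry has. This is only a minimal sketch; DescribeXmlError and the two sample strings are made up for illustration.

```vbnet
Imports System
Imports System.Xml

Module LoadCheckDemo
    ' Returns Nothing when the text is well-formed, otherwise a short
    ' description of where and why the parser stopped.
    Function DescribeXmlError(ByVal xmlText As String) As String
        Try
            Dim doc As New XmlDocument()
            doc.LoadXml(xmlText)
            Return Nothing
        Catch ex As XmlException
            ' LineNumber and LinePosition point at the offending character
            Return String.Format("line {0}, position {1}: {2}", _
                                 ex.LineNumber, ex.LinePosition, ex.Message)
        End Try
    End Function

    Sub Main()
        ' Well-formed sample: no error description is returned
        Console.WriteLine(DescribeXmlError("<entry><title>ok</title></entry>") Is Nothing)

        ' A stray "<" before a tag name makes the parser reject the text
        Console.WriteLine(DescribeXmlError("<entry><<title>bad</title></entry>"))
    End Sub
End Module
```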

  5. #5
    PowerPoster techgnome's Avatar
    Join Date
    May 2002
    Posts
    34,687

    Re: XML cleanup with missing tags

    OK, so it's already in XML format... but there seems to be a problem with it? What does the data look like? From the sound of the error, it sounds like there's something wrong with the data to begin with, and it really should be fixed at the source.

    -tg

  6. #6

    Thread Starter
    Hyperactive Member
    Join Date
    Nov 2008
    Location
    USA
    Posts
    257

    Re: XML cleanup with missing tags

    The data comes from an API response stream, and I capture it in chunks because it is too large to read with ReadToEnd(). I append the chunks to a StringBuilder and write it out to a text file once it reaches a certain size. So the data in the text files is XML, but the individual files are not well-formed.

    E.g., at times a file may start with

    Code:
    erb xmlns:activity="http........</entry></results>
    instead of beginning as follows:


    Code:
    <results data="1"  publisher="xxxxxr" endpoint="Notices" refreshURL="https://xxxxxxxx.com.activities.xml?max=10000">
    <entry xmlns:.........

    So when I read the text files I have two methods: first I try to load the file as an XML document after cleaning illegal characters and adding a default header and footer, and if that fails I fall back to text parsing. The problem is that I need to parse these entries and get values out of them. Regex is tedious for this, whereas XmlTextReader and its ReadElementContentAs... methods are very convenient. But a typical file does not contain well-formed data, so I have to work out which opening tag is missing, or which tag is left open, and fix it.
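
    One way to pick out the usable entries without tracking every tag by hand could look like the sketch below. It assumes that entries are never nested and that each <entry> declares the namespace prefixes it uses; ExtractEntries, the id element and the sample text are made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Xml
Imports System.Xml.Linq

Module EntryExtractDemo
    ' Pulls every complete <entry ...>...</entry> block out of raw text and
    ' returns the ones that parse; truncated or malformed pieces are skipped.
    Function ExtractEntries(ByVal raw As String) As List(Of XElement)
        Dim entries As New List(Of XElement)
        Const CloseTag As String = "</entry>"
        Dim pos As Integer = 0

        While pos < raw.Length
            Dim startIdx As Integer = raw.IndexOf("<entry", pos, StringComparison.Ordinal)
            If startIdx < 0 Then Exit While
            Dim endIdx As Integer = raw.IndexOf(CloseTag, startIdx, StringComparison.Ordinal)
            If endIdx < 0 Then Exit While ' last entry is cut off by the file boundary

            Dim candidate As String = raw.Substring(startIdx, endIdx + CloseTag.Length - startIdx)
            Try
                entries.Add(XElement.Parse(candidate))
            Catch ex As XmlException
                ' Malformed entry: skip it (or log it) instead of failing the whole file
            End Try
            pos = endIdx + CloseTag.Length
        End While

        Return entries
    End Function

    Sub Main()
        ' Starts with the tail of a cut-off entry and ends with a cut-off start
        Dim raw As String = "erb>junk</entry><entry><id>1</id></entry>" & _
                            "<entry><id>2</id></entry><entry><id>3"
        For Each entry As XElement In ExtractEntries(raw)
            Console.WriteLine(entry.Element("id").Value)
        Next
    End Sub
End Module
```

    The leading fragment before the first opening tag and the trailing fragment after the last closing tag are simply ignored here; joining them across neighbouring files would need extra bookkeeping.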

    I hope this makes sense, since I don't know any other way to handle it. The idea is to capture the response continuously, parse the files on a different thread, and then recreate valid XML files to deliver.
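
    An alternative that avoids cutting the data at arbitrary points might be to let an XmlReader pull entries straight off the response stream, since XmlReader reads incrementally and never needs the whole response in memory. The sketch below uses a StringReader as a stand-in for the real stream, and the sample data is made up.

```vbnet
Imports System
Imports System.IO
Imports System.Xml
Imports System.Xml.Linq

Module StreamEntriesDemo
    Sub Main()
        ' Stand-in for the API response; in practice the reader would be
        ' created over the response stream instead of a StringReader.
        Dim sample As String = "<results data=""1"">" & _
                               "<entry><id>1</id></entry>" & _
                               "<entry><id>2</id></entry>" & _
                               "</results>"

        Using reader As XmlReader = XmlReader.Create(New StringReader(sample))
            reader.MoveToContent()
            While Not reader.EOF
                If reader.NodeType = XmlNodeType.Element AndAlso reader.LocalName = "entry" Then
                    ' ReadFrom consumes one whole <entry> and leaves the reader
                    ' positioned on the node that follows it
                    Dim entry As XElement = DirectCast(XNode.ReadFrom(reader), XElement)
                    Console.WriteLine(entry.Element("id").Value)
                Else
                    reader.Read()
                End If
            End While
        End Using
    End Sub
End Module
```

    Each entry arrives as a complete XElement, so it could be handed to another thread or written to its own file without any tag repair.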
