Results 1 to 1 of 1

Thread: Simple SGML parser (for HTML, XHTML & XML)

Threaded View

  1. #1

    Thread Starter
    VB6, XHTML & CSS hobbyist Merri's Avatar
    Join Date
    Oct 2002
    Location
    Finland
    Posts
    6,654

    Simple SGML parser (for HTML, XHTML & XML)

    This is just a random doodle I made. It isn't very usable in real life, but it gives a good basis for writing a better SGML parser. It is simply one function so it doesn't involve tons of class modules that attempt to represent tag elements or W3C specified DOM elements. It also isn't very good against invalid files, broken tag pairs easily makes this parse incorrectly. CSS and JavaScript inside a HTML file can also break things. This simply doesn't attempt to be perfect: it just is a simple basis to build on.

    Code:
    Option Explicit
    
    Private Sub Form_Load()
        Dim colSGML As Collection
        Open "C:\test.html" For Input As #1
            Set colSGML = SGML(Input(LOF(1), #1), 1)
        Close #1
    End Sub
    
    Public Function SGML(Data As String, Optional ByVal DebugDepth As Long = 0) As Collection
        Dim lngOpenTagEnd As Long, lngOpenTagStart As Long, lngCloseTag As Long, lngCloseTagEnd As Long
        Dim lngA As Long, lngDebugDepth As Long
        Dim colTag As Collection, strLCase As String, strAttributes() As String, strAttribute() As String, strTag As String
        Set SGML = New Collection
        strLCase = LCase$(Data)
        lngOpenTagStart = 1
        If DebugDepth Then lngDebugDepth = DebugDepth + 1
        Do While (lngCloseTagEnd < Len(Data)) And (lngOpenTagStart > 0)
            lngOpenTagStart = InStr(lngCloseTagEnd + 1, Data, "<")
            If lngOpenTagStart Then
                lngOpenTagEnd = InStr(lngOpenTagStart + 1, Data, ">")
                If lngOpenTagEnd Then
                    strAttributes = Split(Mid$(Data, lngOpenTagStart + 1, lngOpenTagEnd - lngOpenTagStart - 1), " ")
                    strTag = LCase$(strAttributes(0))
                    Set colTag = New Collection
                    For lngA = 1 To UBound(strAttributes)
                        strAttribute = Split(strAttributes(lngA), "=", 2)
                        On Error Resume Next
                        colTag.Add strAttribute, LCase$(strAttribute(0))
                        On Error GoTo 0
                    Next lngA
                    If strAttributes(UBound(strAttributes)) = "/" Or Left$(strTag, 1) = "!" Then
                        lngCloseTag = lngOpenTagEnd + 1
                        lngCloseTagEnd = lngCloseTag
                    Else
                        lngCloseTag = InStr(lngOpenTagEnd + 1, strLCase, "</" & strTag & ">")
                        If lngCloseTag = 0 Then
                            lngCloseTag = Len(Data)
                            lngCloseTagEnd = lngCloseTag
                            If lngCloseTag = lngOpenTagEnd Then lngOpenTagEnd = lngCloseTag - 1
                        Else
                            lngCloseTagEnd = lngCloseTag + Len(strTag) + 2
                        End If
                    End If
                    If DebugDepth Then Debug.Print String$(DebugDepth - 1, vbTab) & "<" & strTag & ">"
                    SGML.Add Array(strTag, SGML(Mid$(Data, lngOpenTagEnd + 1, lngCloseTag - lngOpenTagEnd - 1), lngDebugDepth), colTag)
                    If DebugDepth Then Debug.Print String$(DebugDepth - 1, vbTab) & "</" & strTag & ">"
                Else
                    If DebugDepth Then
                        Debug.Print String$(DebugDepth - 1, vbTab) & "<TEXT>" & _
                            Replace(Replace(Mid$(Data, lngCloseTagEnd + 1), vbLf, vbNullString), vbCr, vbNullString) & _
                            "</TEXT>"
                    End If
                    SGML.Add Array("Text", Mid$(Data, lngCloseTagEnd + 1))
                    lngCloseTag = Len(Data)
                    lngCloseTagEnd = lngCloseTag
                End If
            Else
                If DebugDepth Then
                    Debug.Print String$(DebugDepth - 1, vbTab) & "<TEXT>" & _
                        Replace(Replace(Mid$(Data, lngCloseTagEnd + 1), vbLf, vbNullString), vbCr, vbNullString) & _
                        "</TEXT>"
                End If
                SGML.Add Array("Text", Mid$(Data, lngCloseTagEnd + 1))
            End If
        Loop
    End Function
    For a far more advanced example you can find something that represents the whole other end at PSC, attempting to be a W3C compliant HTML & XML parser.


    Random trivia: before this post there were a whopping 19 VB6 posts on these forums that mentioned SGML. This is the post #20. Searching for HTML, XHTML or XML gives a bit too many results...



    Edit!
    I also forgot to add detection for text elements before a tag and between tags, so that is one thing to improve upon (unless I actually get myself to spend the time to do it).
    Last edited by Merri; May 30th, 2009 at 08:22 AM.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width