|
-
May 30th, 2009, 07:36 AM
#1
Simple SGML parser (for HTML, XHTML & XML)
This is just a random doodle I made. It isn't very usable in real life, but it gives a good basis for writing a better SGML parser. It is simply one function so it doesn't involve tons of class modules that attempt to represent tag elements or W3C specified DOM elements. It also isn't very good against invalid files, broken tag pairs easily makes this parse incorrectly. CSS and JavaScript inside a HTML file can also break things. This simply doesn't attempt to be perfect: it just is a simple basis to build on.
Code:
Option Explicit
Private Sub Form_Load()
Dim colSGML As Collection
Open "C:\test.html" For Input As #1
Set colSGML = SGML(Input(LOF(1), #1), 1)
Close #1
End Sub
Public Function SGML(Data As String, Optional ByVal DebugDepth As Long = 0) As Collection
Dim lngOpenTagEnd As Long, lngOpenTagStart As Long, lngCloseTag As Long, lngCloseTagEnd As Long
Dim lngA As Long, lngDebugDepth As Long
Dim colTag As Collection, strLCase As String, strAttributes() As String, strAttribute() As String, strTag As String
Set SGML = New Collection
strLCase = LCase$(Data)
lngOpenTagStart = 1
If DebugDepth Then lngDebugDepth = DebugDepth + 1
Do While (lngCloseTagEnd < Len(Data)) And (lngOpenTagStart > 0)
lngOpenTagStart = InStr(lngCloseTagEnd + 1, Data, "<")
If lngOpenTagStart Then
lngOpenTagEnd = InStr(lngOpenTagStart + 1, Data, ">")
If lngOpenTagEnd Then
strAttributes = Split(Mid$(Data, lngOpenTagStart + 1, lngOpenTagEnd - lngOpenTagStart - 1), " ")
strTag = LCase$(strAttributes(0))
Set colTag = New Collection
For lngA = 1 To UBound(strAttributes)
strAttribute = Split(strAttributes(lngA), "=", 2)
On Error Resume Next
colTag.Add strAttribute, LCase$(strAttribute(0))
On Error GoTo 0
Next lngA
If strAttributes(UBound(strAttributes)) = "/" Or Left$(strTag, 1) = "!" Then
lngCloseTag = lngOpenTagEnd + 1
lngCloseTagEnd = lngCloseTag
Else
lngCloseTag = InStr(lngOpenTagEnd + 1, strLCase, "</" & strTag & ">")
If lngCloseTag = 0 Then
lngCloseTag = Len(Data)
lngCloseTagEnd = lngCloseTag
If lngCloseTag = lngOpenTagEnd Then lngOpenTagEnd = lngCloseTag - 1
Else
lngCloseTagEnd = lngCloseTag + Len(strTag) + 2
End If
End If
If DebugDepth Then Debug.Print String$(DebugDepth - 1, vbTab) & "<" & strTag & ">"
SGML.Add Array(strTag, SGML(Mid$(Data, lngOpenTagEnd + 1, lngCloseTag - lngOpenTagEnd - 1), lngDebugDepth), colTag)
If DebugDepth Then Debug.Print String$(DebugDepth - 1, vbTab) & "</" & strTag & ">"
Else
If DebugDepth Then
Debug.Print String$(DebugDepth - 1, vbTab) & "<TEXT>" & _
Replace(Replace(Mid$(Data, lngCloseTagEnd + 1), vbLf, vbNullString), vbCr, vbNullString) & _
"</TEXT>"
End If
SGML.Add Array("Text", Mid$(Data, lngCloseTagEnd + 1))
lngCloseTag = Len(Data)
lngCloseTagEnd = lngCloseTag
End If
Else
If DebugDepth Then
Debug.Print String$(DebugDepth - 1, vbTab) & "<TEXT>" & _
Replace(Replace(Mid$(Data, lngCloseTagEnd + 1), vbLf, vbNullString), vbCr, vbNullString) & _
"</TEXT>"
End If
SGML.Add Array("Text", Mid$(Data, lngCloseTagEnd + 1))
End If
Loop
End Function
For a far more advanced example you can find something that represents the whole other end at PSC, attempting to be a W3C compliant HTML & XML parser.
Random trivia: before this post there were a whopping 19 VB6 posts on these forums that mentioned SGML. This is the post #20. Searching for HTML, XHTML or XML gives a bit too many results...
Edit!
I also forgot to add detection for text elements before a tag and between tags, so that is one thing to improve upon (unless I actually get myself to spend the time to do it).
Last edited by Merri; May 30th, 2009 at 08:22 AM.
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|