dcsimg
Results 1 to 3 of 3

Thread: CDO HTML to Plain Text

  1. #1

    Thread Starter
    PowerPoster
    Join Date
    Feb 2006
    Posts
    20,549

    CDO HTML to Plain Text

    We've had CDO For Windows 2000 quite a while now. But about all I've seen people doing with it is sending out spam via SMTP using code obviously cribbed from some ancient ASP VBScript.

    I'm not sure people realize it can be used for a lot more things than that. Here is just one of them.

    HTML page:

    Code:
    <!DOCTYPE html>
    <html>
    :
    :
    <body>
    :
    :
    <div class="A">
    :
    <h1>first header text</h1>
    :
    <h3>second header text</h3>
    :
    <div class="B">
    :
    bunch of text with random whitespace and <br> line-breaks
    :
    </div>
    :
    </div>
    :
    :
    </body>
    </html>
    To avoid a lot of hand-editing after the fact we'll extract the two header values and then the innerText of <div/> B. Wrap that innerText in a minimal HTML page and then run that through a CDO.Message to convert the HTML to plan text.

    Or you could skip that and just feed in the entire raw HTML document.

    CDO handles mapping random whitespace (multiple CR, LF, CRLF, TAB, blank spaces... all intermixed) to a single space.

    A little extra fiddling and we can work around the CDO-imposed 72-character plain text line width limit too, since for most text a "line" is really a paragraph.

    Code:
        Dim Html As String
        Dim Pos As Long
        Dim PosInnerText As Long
        Dim PosCloseTag As Long
        Dim SpecsItem As Variant
        Dim SpecsSubItem As Variant
    
        mnuGo.Enabled = False
        Text1.Visible = False
    
        'Load HTML into Html as Unicode text
        With New ADODB.Stream
            .Open
            .Type = adTypeText
            .CharSet = cdoUTF_8
            .LineSeparator = adLF
            .LoadFromFile "Sample s01e01.txt"
            Html = .ReadText(adReadAll)
            .Close
        End With
    
        'Parse out the portion we want as String variable Html:
        Pos = 1
        SkipDivClass Pos, Html, DIV_CLASS_MAIN_CONTENT_LEFT
        Text1.SelText = "Script: " & ExtractInnerText(Pos, Html, "h1") & vbNewLine & vbNewLine
        Text1.SelText = "Title: " & ExtractInnerText(Pos, Html, "h3") & vbNewLine & vbNewLine
        'This <div/> contains our body text:
        SkipDivClass Pos, Html, DIV_CLASS_SCROLLING_STRIPT_CONTAINER
        Pos = ScanWhile(Pos, Html, WHITE_SPACE)
        PosInnerText = Pos
        PosCloseTag = FindSkipCloseTag(Pos, Html, "div")
        Html = Mid$(Html, PosInnerText, PosCloseTag - PosInnerText)
    
        'Wrap the extracted HTML fragment in a dummy document, convert the HTML into plain
        'text via CDO:
    
        'Work around CDO's forced 72-char line wrapping (part 1), tag the hard line breaks
        'with pilcrows (paragraph marks):
        Html = Replace$(Html, "<br>", "&para;<br>")
        With Message
            .AutoGenerateTextBody = True
            .HTMLBody = "<html><body>" & Html & "</body></html>"
            'Work around CDO's forced 72-char line wrapping (part 2), retain only hard line
            'breaks and change soft breaks to spaces:
            Text1.SelText = Replace$(Replace$(Replace$(.TextBody, _
                                                       "" & vbNewLine, _
                                                       MARKER_CHAR), _
                                              vbNewLine, _
                                              " "), _
                                     MARKER_CHAR, _
                                     vbNewLine)
            .BodyPart.BodyParts.DeleteAll
        End With
        Text1.SelStart = 0
        Text1.Visible = True

    Name:  sshot.png
Views: 113
Size:  4.8 KB


    In this case the HTML was a page from a TV show fan site containing contributed transcriptions of episodes. They wanted to create an offline desktop database with a front-end permitting free text searching the entire show, adding personal annotations, highlighting passages, bookmarking, etc.

    The result looks a lot like a CHM Help file, but with a few more features.
    Attached Files Attached Files
    Last edited by dilettante; Oct 4th, 2019 at 04:44 AM.

  2. #2

    Thread Starter
    PowerPoster
    Join Date
    Feb 2006
    Posts
    20,549

    Re: CDO HTML to Plain Text

    Oops! I didn't mean to imply I attached their database application above.

    The code above is just a demo of using CDO to assist in scraping their web site to plain text documents for each episode from the awful HTML generated by the tool they had used to create the site. The demo program doesn't fetch the HTML from the site, it just uses a copy captured to disk (one dummy page included).

  3. #3
    PowerPoster
    Join Date
    Jun 2013
    Posts
    4,379

    Re: CDO HTML to Plain Text

    For simpler stuff, I'm using (in my *.asp-Scripts) the MS-HTML-ObjectLib for that task -
    (e.g. when integrated into your Demo, it would look like that...):
    Code:
    Private Sub mnuGo2_Click()
      Text1.Text = Html2Text(ReadFromUTF8File(App.Path & "\Sample s01e01.txt"))
    
    ''  it also understands shorter snippets, which don't contain a body-tag (like e.g. delivered from CKEditor)
    '  Text1.Text = Html2Text("<p><b>abc</b> <strong>123</strong><br><i>xyz</i></p>")
    End Sub
    
    Function Html2Text(ByVal sHTML)
       With CreateObject("htmlfile")
         If InStr(1, sHTML, "</body>", 1) = 0 Then sHTML = "<body>" & sHTML & "</body>"
         .Write sHTML
         Html2Text = .body.InnerText 'that's it already...
         
         'the rest is cosmetics
         Html2Text = Replace(Html2Text, vbCrLf & vbCrLf & vbCrLf & vbCrLf & vbCrLf, vbCrLf & vbCrLf)
         Html2Text = Replace(Html2Text, vbCrLf & vbCrLf & vbCrLf & vbCrLf, vbCrLf & vbCrLf)
         Html2Text = Replace(Html2Text, vbCrLf & vbCrLf & vbCrLf, vbCrLf & vbCrLf)
      End With
    End Function
    
    Function ReadFromUTF8File(FileName)
      With CreateObject("ADODB.Stream")
        .Open: .Type = 2 'adTypeText
          .CharSet = "utf-8"
          .LoadFromFile FileName
          ReadFromUTF8File = .ReadText
        .Close
      End With
    End Function
    Olaf

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Featured


Click Here to Expand Forum to Full Width