Extracting main body text from webpage

**JXDOS** · Sep 10th, 2013, 11:42 PM

Hi all,

I am trying to download text from a number of different news sources in a systematic way. I know I can use GetElementById or GetElementsByTagName, but it would be quite tedious to write one for each of the 100+ sources. Is there a way to use WebClient or WebBrowser to extract the main content without the HTML formatting?

e.g. wclient.downloadMainBodyText?? A bit like the content that shows up when an iPad/iPhone uses the Reader function in Safari.

Thanks in advance.

**dunfiddlin** · Sep 11th, 2013, 09:28 AM

Webpages seem to have moved on without you. Just about feasible for those old home pages when we still used <h1> for the title and nobody had heard of CSS but totally impractical now. I doubt that many, if any, of your websources even have text as such. All that the reader function does is strip away extraneous frames, ads, headers and so on. If it gets you exactly what you want then it's more by luck than judgment!

**JXDOS** · Sep 11th, 2013, 07:19 PM

I have thought of a potential solution to this, though. The main body tends to have the least <> around it, so the distance between one > and the next < is longest there. So I have replaced all < and > with ~ and tried to locate every ~ to identify the start and end points of the largest chunk of text without <>, but my loop seems rather inefficient and it's making my program unresponsive. Any suggestions for a more efficient approach?

The code I have now is as follows:

Code:

        WebBrowser1.Navigate(url1)
        WaitForPageLoad()

        Dim abstract As String = WebBrowser1.DocumentText
        abstract = abstract.Replace("<", "~")
        abstract = abstract.Replace(">", "~")
        Dim wordColl As System.Text.RegularExpressions.MatchCollection = System.Text.RegularExpressions.Regex.Matches(abstract, "~")

        Dim m As Integer = CInt(wordColl.Count)
        MsgBox(m)

        Dim textend As Integer = abstract.Length
        Dim lastindex As Integer = abstract.LastIndexOf("~")
        Dim last1 As Integer = 0
        RichTextBox1.Text = abstract

        While (last1 < lastindex)
            On Error Resume Next
            RichTextBox1.Find("~", last1, textend, RichTextBoxFinds.WholeWord)
            Dim n As Integer = RichTextBox1.SelectionStart
            'MsgBox(n)
            ListBox4.Items.Add(n)
            last1 = RichTextBox1.Text.IndexOf("~", last1) + 2
        End While

I was gonna add listbox5 items as the differences between the listbox4 numbers.
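
The longest-run idea described above does not need a RichTextBox or ListBox at all; it can be done in one pass over the string. The following is only a minimal sketch under stated assumptions: FindLongestTextRun is a hypothetical helper name, and the heuristic itself may well pick the wrong chunk (for example the contents of a script block, which also contain few tags).

```vb
Imports System

Module LongestTextRun

    ' Hypothetical helper (not from the thread): returns the longest stretch
    ' of text found between a ">" and the next "<" in the given HTML.
    Public Function FindLongestTextRun(ByVal html As String) As String
        Dim bestStart As Integer = 0
        Dim bestLength As Integer = 0
        Dim pos As Integer = 0

        While pos < html.Length
            ' A candidate text run starts right after the next ">" ...
            Dim closeIdx As Integer = html.IndexOf(">"c, pos)
            If closeIdx < 0 Then Exit While

            ' ... and ends at the following "<", or at the end of the document.
            Dim openIdx As Integer = html.IndexOf("<"c, closeIdx + 1)
            If openIdx < 0 Then openIdx = html.Length

            ' Remember the run only if it beats the best one seen so far.
            Dim runLength As Integer = openIdx - closeIdx - 1
            If runLength > bestLength Then
                bestStart = closeIdx + 1
                bestLength = runLength
            End If

            ' Continue scanning from the start of the next tag.
            pos = openIdx
        End While

        Return html.Substring(bestStart, bestLength).Trim()
    End Function

End Module
```

Calling FindLongestTextRun(WebBrowser1.DocumentText) would then return the candidate body text directly. Because the work is a single scan with String.IndexOf and no UI controls are touched inside the loop, it should stay responsive even on large pages.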

Another problem I'm having is that the site is saying the browser I am using is out of date, and it gives me an error 500 when I use WebClient.DownloadString instead.

**dunfiddlin** · Sep 11th, 2013, 07:37 PM

"the main body tends to have the least <> around it"

Er .. well ... not really, but let's go with it for now ....

Why are we replacing < and > with that squiggly thing that I can never find on this keyboard exactly? And what's that regex supposed to capture?

Why is this monstrosity On Error Resume Next here at all?

"the site is saying the browser I am using is out of date"

Well, is it? Bear in mind that the WebBrowser control does not announce the IE version that underlies it so it will usually be treated as IE7 or an unknown browser by sites that test for such things. Error 500 is a server side error so unless you're doing something to break the server (which shouldn't be possible!) there is absolutely nothing you can do about it other than choose better coded websites!

If you're prepared to reveal some of the sites you're using and what you're trying to get from them then I'm happy to spend some time tomorrow to see if something a little more logical is possible. As of now I'm afraid I can't make head nor tail of what you're trying to do in the code you've posted.

**JXDOS** · Sep 11th, 2013, 08:44 PM

Thanks a lot dunfiddlin~! My logic was that news articles tend to have less formatting within the body text. The reason for replacing < and > with ~ is to use them as uniform markers for the starts and ends of markup. The longer the chunk between two markers, the more likely it is to be the lengthy body text.

A few examples of the links to the pages I want to extract are as follows:

http://www.legacy.com/obituaries/nyt...&pid=124294512
http://www.legacy.com/obituaries/chi...&pid=109461797
http://www.legacy.com/obituaries/lvr...&pid=140928569
http://www.legacy.com/obituaries/des...&pid=156017697
http://www.legacy.com/obituaries/her...&pid=155729720

**ident** · Sep 12th, 2013, 12:24 PM

Your program is unresponsive because you are likely looping until the page has finished loading. What does this WaitForPageLoad() do? I imagine it is some loop with an Application.DoEvents inside.

A WebBrowser is a UI control, so it should not be used just to fetch HTML. If the server is rejecting the request, then the request is not meeting what the server requires. Here that is simply a User-Agent header.

Note that DownloadString in the code below will block the calling thread, so we should really be looking at WebClient's async methods.

vb Code:

Imports System.Net
 
Public Class Form1
 
    Private ReadOnly m_userAgent As String = "Mozilla/5.0 (Windows NT 5.1; rv:13.0) Gecko/20100101 Firefox/13.0"
 
    Private Function GetHtml(ByVal url As String) As String
        Dim source As String = Nothing
 
        Using wClient As New WebClient
            wClient.Headers.Set(HttpRequestHeader.UserAgent, Me.m_userAgent)
            source = wClient.DownloadString(New Uri(url))
        End Using
 
        Return source
    End Function
End Class
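
The non-blocking variant mentioned above could look roughly like the sketch below. It is an assumption about what was meant: the post may have had the event-based WebClient.DownloadStringAsync in mind, whereas this uses WebClient.DownloadStringTaskAsync with Async/Await, which requires .NET 4.5 or later. The module and function names are hypothetical.

```vb
Imports System
Imports System.Net
Imports System.Threading.Tasks

Module AsyncDownload

    ' Same user agent string as in the post above.
    Private Const UserAgent As String = "Mozilla/5.0 (Windows NT 5.1; rv:13.0) Gecko/20100101 Firefox/13.0"

    ' Hypothetical helper: downloads the page without blocking the caller.
    Public Async Function GetHtmlAsync(ByVal url As String) As Task(Of String)
        Using wClient As New WebClient()
            ' Send a browser-like User-Agent so the server accepts the request.
            wClient.Headers.Set(HttpRequestHeader.UserAgent, UserAgent)
            ' Await hands control back to the caller until the download completes.
            Return Await wClient.DownloadStringTaskAsync(New Uri(url))
        End Using
    End Function

End Module
```

From a form, an event handler declared as Async Sub could then assign the result with RichTextBox1.Text = Await GetHtmlAsync(url), and the UI thread stays free while the download runs.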

Now it's a question of parsing the HTML. In the links you provided there are two types of format. The first is marked by "ContentPlaceHolder1_ObitText" and the other by a "ctl00_MainContentPlaceholder_Text" span tag. OK, so focusing on the obituary format at http://obits.reviewjournal.com/obitu...69#fbLoggedOut, I've placed the already downloaded HTML in htmlTextBox. These cannot all be matched with one exact pattern because the HTML differs between them. You will also need to sort out how you wish to remove the rest of the HTML.

Anyway, a start:

vb Code:

Imports System.Net
Imports System.Text.RegularExpressions
 
Public Class Form1
 
    Private ReadOnly m_userAgent As String = "Mozilla/5.0 (Windows NT 5.1; rv:13.0) Gecko/20100101 Firefox/13.0"
    Private ReadOnly m_pattern As String = "(?<=<span id=""ctl00_MainContentPlaceholder.+>).+(?=</p></span>)"
 
    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        Dim source As String = GetHtml("http://www.legacy.com/obituaries/chicagotribune/obituary.aspx?n=Paul-Rasmussen&pid=109461797#fbLoggedOut")
        Dim rx As New Regex(Me.m_pattern)
        Me.RichTextBox1.Text = rx.Match(source).Value
    End Sub
 
    Private Function GetHtml(ByVal url As String) As String
        Dim source As String = Nothing
 
        Using wClient As New WebClient
            wClient.Headers.Set(HttpRequestHeader.UserAgent, Me.m_userAgent)
            source = wClient.DownloadString(New Uri(url))
        End Using
 
        Return source
    End Function
End Class
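
For removing the rest of the HTML from the matched fragment, one rough option is sketched below. It is not from the thread: StripTags is a hypothetical helper, and a regex-based tag stripper is only an approximation that can mis-handle comments, scripts, or a ">" inside an attribute value. WebUtility.HtmlDecode requires .NET 4.0 or later.

```vb
Imports System
Imports System.Net
Imports System.Text.RegularExpressions

Module HtmlCleanup

    ' Hypothetical helper: reduces an HTML fragment to plain text.
    Public Function StripTags(ByVal fragment As String) As String
        ' Turn <br> and </p> into line breaks so paragraphs stay separated.
        Dim text As String = Regex.Replace(fragment, "<br\s*/?>|</p>", Environment.NewLine, RegexOptions.IgnoreCase)

        ' Drop every remaining tag.
        text = Regex.Replace(text, "<[^>]+>", String.Empty)

        ' Convert entities such as &amp; back into characters.
        text = WebUtility.HtmlDecode(text)

        Return text.Trim()
    End Function

End Module
```

With this in place, the last line of Form1_Load could become Me.RichTextBox1.Text = StripTags(rx.Match(source).Value). For anything beyond a quick start, a real HTML parser is likely to be more robust than regular expressions.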
