
Thread: Extracting main body text from webpage

  1. #1

    Thread Starter
    Hyperactive Member JXDOS's Avatar
    Join Date
    Aug 2006
    Location
    Mars...
    Posts
    423

    Extracting main body text from webpage

    Hi all,

    I am trying to download text from various news sources in a systematic way. I know I can use GetElementById or GetElementsByTagName, but it would be quite tedious to write one for each of the 100+ sources. Is there a way to use WebClient or WebBrowser to extract the main content without the HTML formatting?

    e.g. wclient.downloadMainBodyText?? A bit like the content that shows up when an iPad/iPhone uses the Reader function in Safari.

    Thanks in advance.
    If my post has been helpful, please rate it!

  2. #2
    PowerPoster dunfiddlin's Avatar
    Join Date
    Jun 2012
    Posts
    8,245

    Re: Extracting main body text from webpage

    Webpages seem to have moved on without you. Just about feasible for those old home pages when we still used <h1> for the title and nobody had heard of CSS but totally impractical now. I doubt that many, if any, of your websources even have text as such. All that the reader function does is strip away extraneous frames, ads, headers and so on. If it gets you exactly what you want then it's more by luck than judgment!
    As the 6-dimensional mathematics professor said to the brain surgeon, "It ain't Rocket Science!"

    Reviews: "dunfiddlin likes his DataTables" - jmcilhinney

    Please be aware that whilst I will read private messages (one day!) I am unlikely to reply to anything that does not contain offers of cash, fame or marriage!

  3. #3

    Thread Starter
    Hyperactive Member JXDOS's Avatar
    Join Date
    Aug 2006
    Location
    Mars...
    Posts
    423

    Re: Extracting main body text from webpage

    I have thought of a potential solution to this, though: the main body tends to have the least <> around it, so the distance between one < or > and the next is longest there. So I have replaced every < and > with ~ and tried to locate each ~ in order to identify the start and end points of the largest chunk of text without <>. Currently my loop seems rather inefficient and it's making my program unresponsive. Any suggestions for a more efficient approach?

    The code I have now is as follows:
    Code:
            WebBrowser1.Navigate(url1)
            WaitForPageLoad()
    
            Dim abstract As String = WebBrowser1.DocumentText
            abstract = abstract.Replace("<", "~")
            abstract = abstract.Replace(">", "~")
            Dim wordColl As System.Text.RegularExpressions.MatchCollection = System.Text.RegularExpressions.Regex.Matches(abstract, "~")
    
            Dim m As Integer = CInt(wordColl.Count)
            MsgBox(m)
    
            Dim textend As Integer = abstract.Length
            Dim lastindex As Integer = abstract.LastIndexOf("~")
            Dim last1 As Integer = 0
            RichTextBox1.Text = abstract
    
    
            While (last1 < lastindex)
                On Error Resume Next
                RichTextBox1.Find("~", last1, textend, RichTextBoxFinds.WholeWord)
                Dim n As Integer = RichTextBox1.SelectionStart
                'MsgBox(n)
                ListBox4.Items.Add(n)
                last1 = RichTextBox1.Text.IndexOf("~", last1) + 2
            End While
    I was going to fill ListBox5 with the differences between consecutive ListBox4 values.

    Another problem I'm having is that the site says the browser I am using is out of date, and I get an error 500 when I use WebClient.DownloadString instead.

  4. #4
    PowerPoster dunfiddlin's Avatar
    Join Date
    Jun 2012
    Posts
    8,245

    Re: Extracting main body text from webpage

    "the main body tends to have the least <> around it"
    Er .. well ... not really, but let's go with it for now ....

    Why are we replacing < and > with that squiggly thing that I can never find on this keyboard exactly? And what's that regex supposed to capture?

    Why is this monstrosity On Error Resume Next here at all?

    "the site is saying the browser I am using is out of date"
    Well, is it? Bear in mind that the WebBrowser control does not announce the IE version that underlies it so it will usually be treated as IE7 or an unknown browser by sites that test for such things. Error 500 is a server side error so unless you're doing something to break the server (which shouldn't be possible!) there is absolutely nothing you can do about it other than choose better coded websites!

    If you're prepared to reveal some of the sites you're using and what you're trying to get from them then I'm happy to spend some time tomorrow to see if something a little more logical is possible. As of now I'm afraid I can't make head nor tail of what you're trying to do in the code you've posted.

  5. #5

    Thread Starter
    Hyperactive Member JXDOS's Avatar
    Join Date
    Aug 2006
    Location
    Mars...
    Posts
    423

    Re: Extracting main body text from webpage

    Thanks a lot dunfiddlin~! My logic was that news articles tend to have less formatting within the body text. The reason for replacing < and > with ~ is to use them as uniform markers for the start and end of each piece of code/abstract. The longer the chunk between two markers, the more likely it is to be the lengthy body text.
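
    For what it's worth, the "longest chunk between markers" idea can be sketched without the RichTextBox, the ListBox or the ~ replacement at all, just by walking the string with IndexOf. This is only a rough sketch: LongestTextChunk and the sample HTML are made up for illustration, and the heuristic will happily pick a long script or style block if the page has one.

    ```vb
    Imports System

    Module LongestChunkSketch

        ' Hypothetical helper (not from the thread): returns the longest run of
        ' text found between a ">" and the next "<".
        Function LongestTextChunk(ByVal html As String) As String
            Dim best As String = String.Empty
            Dim pos As Integer = 0

            While pos < html.Length
                ' End of the current tag.
                Dim tagEnd As Integer = html.IndexOf(">"c, pos)
                If tagEnd < 0 Then Exit While

                ' Start of the next tag, or end of input if there is none.
                Dim nextTag As Integer = html.IndexOf("<"c, tagEnd + 1)
                If nextTag < 0 Then nextTag = html.Length

                Dim chunk As String = html.Substring(tagEnd + 1, nextTag - tagEnd - 1).Trim()
                If chunk.Length > best.Length Then best = chunk

                pos = nextTag
            End While

            Return best
        End Function

        Sub Main()
            ' Made-up sample markup, for illustration only.
            Dim sample As String = "<div><a href=""#"">Home</a><p>A long paragraph of article text.</p></div>"
            Console.WriteLine(LongestTextChunk(sample))
        End Sub

    End Module
    ```

    Each character is visited once, so this stays fast even on large pages, and nothing here touches a UI control.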

    A few examples of the links to the pages I want to extract are as follows:

    http://www.legacy.com/obituaries/nyt...&pid=124294512
    http://www.legacy.com/obituaries/chi...&pid=109461797
    http://www.legacy.com/obituaries/lvr...&pid=140928569
    http://www.legacy.com/obituaries/des...&pid=156017697
    http://www.legacy.com/obituaries/her...&pid=155729720
    Last edited by JXDOS; Sep 11th, 2013 at 08:49 PM.

  6. #6
    Bad man! ident's Avatar
    Join Date
    Mar 2009
    Location
    Cambridge
    Posts
    5,401

    Re: Extracting main body text from webpage

    Your program is unresponsive since you are likely looping until the page has finished loading. What does this WaitForPageLoad() do? I imagine some loop with an Application.DoEvents inside.

    A browser is a UI element, so it should not be used for this. If the server is rejecting the request, then the request is not meeting what the server requires. Most likely it is simply missing a user agent.

    Obviously the download below will block the calling thread, so we should really be looking at WebClient's Async methods.

    vb Code:
    Imports System.Net

    Public Class Form1

        ' User agent sent with every request so the server treats us as a regular browser.
        Private ReadOnly m_userAgent As String = "Mozilla/5.0 (Windows NT 5.1; rv:13.0) Gecko/20100101 Firefox/13.0"

        ' Downloads the raw HTML of the page at the given url (blocks until done).
        Private Function GetHtml(ByVal url As String) As String
            Dim source As String = Nothing

            Using wClient As New WebClient
                wClient.Headers.Set(HttpRequestHeader.UserAgent, Me.m_userAgent)
                source = wClient.DownloadString(New Uri(url))
            End Using

            Return source
        End Function
    End Class
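
    The function above still uses the blocking DownloadString. A minimal sketch of the asynchronous variant follows, assuming .NET 4.5 or later so that Async/Await and WebClient.DownloadStringTaskAsync are available. GetHtmlAsync and the module name are made up for this example; it is written as a console module only so it can run on its own.

    ```vb
    Imports System
    Imports System.Net
    Imports System.Threading.Tasks

    Module AsyncDownloadSketch

        ' Same user agent string as in the code above.
        Private Const UserAgent As String = "Mozilla/5.0 (Windows NT 5.1; rv:13.0) Gecko/20100101 Firefox/13.0"

        ' Hypothetical awaitable counterpart of GetHtml: the caller's thread is
        ' free to do other work while the download is in flight.
        Async Function GetHtmlAsync(ByVal url As String) As Task(Of String)
            Using wClient As New WebClient
                wClient.Headers.Set(HttpRequestHeader.UserAgent, UserAgent)
                Return Await wClient.DownloadStringTaskAsync(New Uri(url))
            End Using
        End Function

        Sub Main()
            ' A console Main cannot Await here, so block once at the top level.
            Dim url As String = "http://www.legacy.com/obituaries/chicagotribune/obituary.aspx?n=Paul-Rasmussen&pid=109461797#fbLoggedOut"
            Dim html As String = GetHtmlAsync(url).GetAwaiter().GetResult()
            Console.WriteLine(html.Length)
        End Sub

    End Module
    ```

    In a form you would mark the event handler as Async Sub and write Await GetHtmlAsync(url) instead of blocking, which keeps the window responsive without any DoEvents loop.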

    Now it's a question of parsing the HTML. In the links you provided there are two formats. The first is marked by "ContentPlaceHolder1_ObitText" and the other by a "ctl00_MainContentPlaceholder_Text" span tag. OK, so focusing on the obituary format at http://obits.reviewjournal.com/obitu...69#fbLoggedOut, I've placed the already downloaded HTML in htmlTextBox. Not every page can be matched exactly, because the HTML differs between them. You will also need to sort out how you wish to remove the rest of the HTML.

    Anyway, a start:

    vb Code:
    Imports System.Net
    Imports System.Text.RegularExpressions

    Public Class Form1

        Private ReadOnly m_userAgent As String = "Mozilla/5.0 (Windows NT 5.1; rv:13.0) Gecko/20100101 Firefox/13.0"
        ' Captures everything between the opening ctl00_MainContentPlaceholder... span and the closing </p></span>.
        Private ReadOnly m_pattern As String = "(?<=<span id=""ctl00_MainContentPlaceholder.+>).+(?=</p></span>)"

        Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
            Dim source As String = GetHtml("http://www.legacy.com/obituaries/chicagotribune/obituary.aspx?n=Paul-Rasmussen&pid=109461797#fbLoggedOut")
            Dim rx As New Regex(Me.m_pattern)
            ' The match still contains inner HTML tags; Value is empty if nothing matched.
            Me.RichTextBox1.Text = rx.Match(source).Value
        End Sub

        ' Downloads the raw HTML of the page at the given url (blocks until done).
        Private Function GetHtml(ByVal url As String) As String
            Dim source As String = Nothing

            Using wClient As New WebClient
                wClient.Headers.Set(HttpRequestHeader.UserAgent, Me.m_userAgent)
                source = wClient.DownloadString(New Uri(url))
            End Using

            Return source
        End Function
    End Class
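
    As for removing the rest of the HTML from the matched fragment, one possible approach is sketched below. It is an assumption about what "plain text" should mean here rather than anything tested against the real pages: HtmlToPlainText is a made-up helper, the sample fragment is invented, and a regex like this only copes with simple, well-formed markup.

    ```vb
    Imports System
    Imports System.Net
    Imports System.Text.RegularExpressions

    Module StripTagsSketch

        ' Hypothetical helper: reduces an HTML fragment to plain text.
        Function HtmlToPlainText(ByVal fragment As String) As String
            ' Turn <br> and </p> into line breaks so paragraphs stay separated.
            Dim text As String = Regex.Replace(fragment, "<\s*(br\s*/?|/p)\s*>", Environment.NewLine, RegexOptions.IgnoreCase)

            ' Drop every remaining tag.
            text = Regex.Replace(text, "<[^>]+>", String.Empty)

            ' Decode entities such as &amp; and &quot;, then trim the ends.
            Return WebUtility.HtmlDecode(text).Trim()
        End Function

        Sub Main()
            ' Invented sample fragment, for illustration only.
            Dim sample As String = "<p>Born in <b>Chicago</b> &amp; raised in Reno.</p><p>Survived by his wife.</p>"
            Console.WriteLine(HtmlToPlainText(sample))
        End Sub

    End Module
    ```

    You could pass rx.Match(source).Value through a function like this before assigning it to RichTextBox1.Text. For anything messier than these pages, a proper HTML parser would be the safer choice.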
    My Github - 1d3nt
