-
Sep 10th, 2013, 11:42 PM
#1
Thread Starter
Hyperactive Member
Extracting main body text from webpage
Hi all,
I am trying to download various texts from different news sources in a systematic way. I know I can use get element by id/tag but it would be quite tedious to make one for each of the 100+ sources. Is there a way to use webclient or webbrowser to extract the main content without the html formatting?
e.g. wclient.downloadMainBodyText?? A bit like the content that shows up when an iPad/iPhone uses the Reader function in safari.
Thanks in advance.
If my post has been helpful, please rate it! 
-
Sep 11th, 2013, 09:28 AM
#2
Re: Extracting main body text from webpage
Webpages seem to have moved on without you. Just about feasible for those old home pages when we still used <h1> for the title and nobody had heard of CSS but totally impractical now. I doubt that many, if any, of your websources even have text as such. All that the reader function does is strip away extraneous frames, ads, headers and so on. If it gets you exactly what you want then it's more by luck than judgment!
As the 6-dimensional mathematics professor said to the brain surgeon, "It ain't Rocket Science!"
Reviews: "dunfiddlin likes his DataTables" - jmcilhinney
Please be aware that whilst I will read private messages (one day!) I am unlikely to reply to anything that does not contain offers of cash, fame or marriage!
-
Sep 11th, 2013, 07:19 PM
#3
Thread Starter
Hyperactive Member
Re: Extracting main body text from webpage
I have thought of a potential solution to this, though: the main body tends to have the least <> around it, making the distance between the last > or < and the next < or > the longest. So I have replaced all < and > with ~ and tried to locate every ~ in order to identify the start and end points of the largest chunk of text without <>, but currently my loop seems rather inefficient and it's making my program unresponsive. Any suggestions for a more efficient approach?
The code I have now is as follows:
Code:
WebBrowser1.Navigate(url1)
WaitForPageLoad()
Dim abstract As String = WebBrowser1.DocumentText
abstract = abstract.Replace("<", "~")
abstract = abstract.Replace(">", "~")
Dim wordColl As System.Text.RegularExpressions.MatchCollection = System.Text.RegularExpressions.Regex.Matches(abstract, "~")
Dim m As Integer = CInt(wordColl.Count)
MsgBox(m)
Dim textend As Integer = abstract.Length
Dim lastindex As Integer = abstract.LastIndexOf("~")
Dim last1 As Integer = 0
RichTextBox1.Text = abstract
While (last1 < lastindex)
    On Error Resume Next
    RichTextBox1.Find("~", last1, textend, RichTextBoxFinds.WholeWord)
    Dim n As Integer = RichTextBox1.SelectionStart
    'MsgBox(n)
    ListBox4.Items.Add(n)
    last1 = RichTextBox1.Text.IndexOf("~", last1) + 2
End While
I was gonna add listbox5 items as the differences between the listbox4 numbers.
Another problem I'm having is that the site is saying the browser I am using is out of date.. and gives me an error 500 when I use webclient.downloadstring instead.
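For what it's worth, the longest-chunk idea described above can be computed in a single pass over the string, without a RichTextBox or ListBox. The following is only a minimal sketch: the module and function names (LongestTextRun, FindLongestTextRun) and the sample markup are made up for illustration, and it differs slightly from the code above in that it only measures text sitting between a ">" and the next "<", so attribute text inside a tag is not counted. Script and style content would still be counted, so it remains a heuristic.

```vbnet
Imports System

Module LongestTextRun
    ' Returns the longest stretch of text found between a ">" and the next "<".
    ' Hypothetical helper for illustration; not part of any library.
    Function FindLongestTextRun(ByVal html As String) As String
        Dim bestStart As Integer = 0
        Dim bestLength As Integer = 0
        Dim pos As Integer = 0

        While pos < html.Length
            ' Find the end of the current tag.
            Dim closeIdx As Integer = html.IndexOf(">"c, pos)
            If closeIdx = -1 Then Exit While

            ' Find the start of the next tag (or the end of the document).
            Dim openIdx As Integer = html.IndexOf("<"c, closeIdx + 1)
            If openIdx = -1 Then openIdx = html.Length

            ' Remember this run if it is the longest seen so far.
            Dim runLength As Integer = openIdx - closeIdx - 1
            If runLength > bestLength Then
                bestStart = closeIdx + 1
                bestLength = runLength
            End If

            pos = openIdx
        End While

        Return html.Substring(bestStart, bestLength).Trim()
    End Function

    Sub Main()
        ' Invented sample markup, not taken from any of the real pages.
        Dim sample As String = "<div><a href=""#"">Home</a><p>A long paragraph of article text.</p></div>"
        Console.WriteLine(FindLongestTextRun(sample))
        ' Prints: A long paragraph of article text.
    End Sub
End Module
```

Because each character is visited at most a couple of times and nothing touches a UI control, this should stay responsive even on large pages.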
-
Sep 11th, 2013, 07:37 PM
#4
Re: Extracting main body text from webpage
the main body tends to have the least <> around it
Er .. well ... not really, but let's go with it for now ....
Why are we replacing < and > with that squiggly thing that I can never find on this keyboard exactly? And what's that regex supposed to capture?
Why is this monstrosity On Error Resume Next here at all?
the site is saying the browser I am using is out of date
Well, is it? Bear in mind that the WebBrowser control does not announce the IE version that underlies it so it will usually be treated as IE7 or an unknown browser by sites that test for such things. Error 500 is a server side error so unless you're doing something to break the server (which shouldn't be possible!) there is absolutely nothing you can do about it other than choose better coded websites!
If you're prepared to reveal some of the sites you're using and what you're trying to get from them then I'm happy to spend some time tomorrow to see if something a little more logical is possible. As of now I'm afraid I can't make head nor tail of what you're trying to do in the code you've posted.
-
Sep 11th, 2013, 08:44 PM
#5
Thread Starter
Hyperactive Member
Re: Extracting main body text from webpage
Thanks a lot dunfiddlin~! My logic was that news articles tend to have less formatting within the body text. The reason for replacing < and > with ~ is to use them as uniform markers for the starts and ends of code/abstract. The longer the chunk between two markers, the more likely it is to be the lengthy body text.
A few examples of the links to the pages I want to extract are as follows:
http://www.legacy.com/obituaries/nyt...&pid=124294512
http://www.legacy.com/obituaries/chi...&pid=109461797
http://www.legacy.com/obituaries/lvr...&pid=140928569
http://www.legacy.com/obituaries/des...&pid=156017697
http://www.legacy.com/obituaries/her...&pid=155729720
Last edited by JXDOS; Sep 11th, 2013 at 08:49 PM.
-
Sep 12th, 2013, 12:24 PM
#6
Re: Extracting main body text from webpage
Your program is unresponsive because you are likely looping until the page has finished loading. What does WaitForPageLoad() do? I imagine it is some loop with an Application.DoEvents inside.
A browser is a UI element, so it should not be used for this. If a bad request is being thrown then you are not meeting the standards the request requires. It is simply a matter of sending a user agent.
Obviously this will block the calling thread, so we should be looking at the WebClient's Async methods.
vb Code:
Imports System.Net

Public Class Form1
    Private ReadOnly m_userAgent As String = "Mozilla/5.0 (Windows NT 5.1; rv:13.0) Gecko/20100101 Firefox/13.0"

    Private Function GetHtml(ByVal url As String) As String
        Dim source As String = Nothing
        Using wClient As New WebClient
            wClient.Headers.Set(HttpRequestHeader.UserAgent, Me.m_userAgent)
            source = wClient.DownloadString(New Uri(url))
        End Using
        Return source
    End Function
End Class
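As a rough illustration of the non-blocking route mentioned above, here is a sketch built on WebClient.DownloadStringAsync and its DownloadStringCompleted event. The class name AsyncDownloader, the HtmlReady event and the BeginGetHtml method are hypothetical names chosen for this example; only the WebClient members are real. Treat it as one possible shape, not the definitive way to do it.

```vbnet
Imports System
Imports System.Net

Public Class AsyncDownloader
    Private Const UserAgent As String = "Mozilla/5.0 (Windows NT 5.1; rv:13.0) Gecko/20100101 Firefox/13.0"

    ' Raised when the download finishes. html is Nothing if the request failed
    ' or was cancelled; failure then carries the exception, if there was one.
    Public Event HtmlReady(ByVal html As String, ByVal failure As Exception)

    ' Starts the download and returns immediately instead of blocking the caller.
    Public Sub BeginGetHtml(ByVal url As String)
        Dim wClient As New WebClient()
        wClient.Headers.Set(HttpRequestHeader.UserAgent, UserAgent)
        AddHandler wClient.DownloadStringCompleted, AddressOf OnDownloadCompleted
        wClient.DownloadStringAsync(New Uri(url))
    End Sub

    Private Sub OnDownloadCompleted(ByVal sender As Object, ByVal e As DownloadStringCompletedEventArgs)
        ' The sender is the WebClient that started the request; clean it up here.
        Dim wClient As WebClient = DirectCast(sender, WebClient)
        RemoveHandler wClient.DownloadStringCompleted, AddressOf OnDownloadCompleted
        wClient.Dispose()

        If e.Error IsNot Nothing OrElse e.Cancelled Then
            RaiseEvent HtmlReady(Nothing, e.Error)
        Else
            RaiseEvent HtmlReady(e.Result, Nothing)
        End If
    End Sub
End Class
```

A form could declare a field such as `Private WithEvents downloader As New AsyncDownloader`, call `downloader.BeginGetHtml(url)`, and fill its text box in a handler for `downloader.HtmlReady`. When the request is started from the UI thread of a Windows Forms application, the completed event is normally raised back on that thread, but that is worth verifying in your own setup.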
Now it's a question of parsing the HTML. In the links you provided there are two types of format. The first is notably the "ContentPlaceHolder1_ObitText" span tag and the other the "ctl00_MainContentPlaceholder_Text" span tag. OK, so focusing on the obituary format at http://obits.reviewjournal.com/obitu...69#fbLoggedOut, I've placed the HTML already downloaded in htmlTextBox. These cannot all be matched with exactly the same pattern because the HTML varies from page to page. You will also need to sort out how you wish to remove the rest of the HTML.
Anyway, a start:
vb Code:
Imports System.Net
Imports System.Text.RegularExpressions

Public Class Form1
    Private ReadOnly m_userAgent As String = "Mozilla/5.0 (Windows NT 5.1; rv:13.0) Gecko/20100101 Firefox/13.0"
    Private ReadOnly m_pattern As String = "(?<=<span id=""ctl00_MainContentPlaceholder.+>).+(?=</p></span>)"

    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        Dim source As String = GetHtml("http://www.legacy.com/obituaries/chicagotribune/obituary.aspx?n=Paul-Rasmussen&pid=109461797#fbLoggedOut")
        Dim rx As New Regex(Me.m_pattern)
        Me.RichTextBox1.Text = rx.Match(source).Value
    End Sub

    Private Function GetHtml(ByVal url As String) As String
        Dim source As String = Nothing
        Using wClient As New WebClient
            wClient.Headers.Set(HttpRequestHeader.UserAgent, Me.m_userAgent)
            source = wClient.DownloadString(New Uri(url))
        End Using
        Return source
    End Function
End Class
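On removing the rest of the HTML from whatever the regex returns, one simple option is to turn line-break tags into new lines, drop every remaining tag, and decode entities. This is only a sketch: the module and function names (HtmlCleanup, StripTags) and the sample fragment are invented for illustration, and a regex-based strip like this will not cope with every kind of markup (comments, scripts, malformed tags), so a proper HTML parser may be the better choice for messy pages.

```vbnet
Imports System
Imports System.Net
Imports System.Text.RegularExpressions

Module HtmlCleanup
    ' Hypothetical helper: reduces an HTML fragment to plain text.
    Function StripTags(ByVal fragment As String) As String
        ' Treat <br> and </p> as line breaks so paragraphs stay separated.
        Dim text As String = Regex.Replace(fragment, "<br\s*/?>|</p>", Environment.NewLine, RegexOptions.IgnoreCase)

        ' Remove every remaining tag.
        text = Regex.Replace(text, "<[^>]+>", String.Empty)

        ' Convert entities such as &amp; back into characters.
        text = WebUtility.HtmlDecode(text)

        Return text.Trim()
    End Function

    Sub Main()
        ' Invented sample fragment, not taken from any of the real pages.
        Dim fragment As String = "<p>John Smith, 84, of <b>Chicago</b> &amp; Evanston.</p>"
        Console.WriteLine(StripTags(fragment))
        ' Prints: John Smith, 84, of Chicago & Evanston.
    End Sub
End Module
```

WebUtility.HtmlDecode lives in System.Net from .NET Framework 4.0 onwards; on older frameworks System.Web.HttpUtility.HtmlDecode would be the equivalent, at the cost of a reference to System.Web.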