-
Aug 29th, 2012, 03:14 PM
#1
Thread Starter
Addicted Member
HtmlElement In a_tags is not staying within if condition
I'm trying to scrape a page using the code below, but it is not doing what I intend. I want the inner text of each "a" tag whose itemprop attribute is "name". The text that comes back is correct, but each item is scraped twice.
The For Each a_tag As HtmlElement In a_tags loop is not staying within the itemprop="offers" condition.
******************
<tr itemtype="http://schema.org/Offer" itemscope="" itemprop="offers">
<a itemprop="name" title="2 Sets of Cross Country Asics Spikes with Handles!" class="vip" href="http://www.ebay.com/itm/2-Sets-of-Cross-Country-Asics-Spikes-with-Handles-/140838036468?pt=LH_DefaultDomain_0&hash=item20ca99e3f4">2 Sets of Cross Country Asics Spikes with Handles!</a>
</tr>
<tr itemtype="http://schema.org/Offer" itemscope="" itemprop="offers">
<a itemprop="name" title="2 Sets of Cross Country Asics Spikes with Handles!" class="vip" href="http://www.ebay.com/itm/2-Sets-of-Cross-Country-Asics-Spikes-with-Handles-/140838036468?pt=LH_DefaultDomain_0&hash=item20ca99e3f4">#2 - 2 Sets of Cross Country Asics Spikes with Handles!</a>
</tr>
*****************************
Dim trs As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("tr")
For Each tr As HtmlElement In trs
    If tr.GetAttribute("itemprop") = "offers" Then
        Dim a_tags As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("a")
        For Each a_tag As HtmlElement In a_tags
            If a_tag.GetAttribute("itemprop") = "name" Then
                TextBox1.Text = TextBox1.Text & vbCrLf & a_tag.InnerText
            End If
        Next
    End If
Next
-
Aug 29th, 2012, 03:26 PM
#2
Re: HtmlElement In a_tags is not staying within if condition
Well, yeah. It would. At every <tr> you load all of the page's <a> tags into a second collection and loop over every one of them, so each matching <a> text is output once per <tr>. If you've got 2 <tr> tags, you get 2 copies of the same <a> tag text. Why are you tracking <tr> at all if it's the <a> tags you actually want?
Dim a_tags As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("a")
For Each a_tag As HtmlElement In a_tags
    If a_tag.GetAttribute("itemprop") = "name" Then
        TextBox1.AppendText(a_tag.InnerText & vbCrLf)
    End If
Next
-
Aug 29th, 2012, 03:45 PM
#3
Thread Starter
Addicted Member
Re: HtmlElement In a_tags is not staying within if condition
The data I'm trying to scrape is in a table with multiple columns, and each row is a "tr" tag. If I scrape only the "a" tags, I can't keep the data from each row together.
-
Aug 29th, 2012, 04:02 PM
#4
Re: HtmlElement In a_tags is not staying within if condition
Well, that's not quite true. The HTML has to follow a logical sequence, so the <a> tags will always appear in document order, and the items from each row will always be adjacent to each other in the list. If you want to separate them into sections, you can divide a_tags.Count by trs.Count to get the number of items per row, assuming every <tr> holds the same number of <a> tags and nothing else on the page adds to either count.
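To make the counting idea concrete, here is a rough sketch of the grouping step on its own. It assumes the <a> texts have already been collected into a list in document order; GroupByRow and its parameter names are made up for illustration and are not part of any library.

```vbnet
Imports System
Imports System.Collections.Generic

Module RowGrouping
    ' Hypothetical helper: splits a flat, document-ordered list of items
    ' into one sub-list per table row, assuming every row holds the same
    ' number of items.
    Function GroupByRow(items As List(Of String), rowCount As Integer) As List(Of List(Of String))
        If rowCount <= 0 OrElse items.Count Mod rowCount <> 0 Then
            ' The counts do not divide evenly, so the rows cannot be
            ' reconstructed reliably from position alone.
            Throw New ArgumentException("Item count is not a multiple of the row count.")
        End If

        Dim rows As New List(Of List(Of String))
        Dim perRow As Integer = items.Count \ rowCount   ' integer division
        If perRow = 0 Then Return rows                   ' nothing to group

        For i As Integer = 0 To items.Count - 1 Step perRow
            rows.Add(items.GetRange(i, perRow))
        Next
        Return rows
    End Function
End Module
```

With six items and rowCount = 2, this gives two sub-lists of three items each. The exception is the guard against the "off by one count" case: if the totals do not divide evenly, it fails loudly instead of silently shifting every row.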
-
Aug 29th, 2012, 04:17 PM
#5
Thread Starter
Addicted Member
Re: HtmlElement In a_tags is not staying within if condition
I understand what you are suggesting. Is that how most people scrape data from a table? It seems like if you're off by one count, your entire data is incorrect.
I was trying to grab the first "tr" tag that met my condition, then search all "a" tags within that tag that met my other conditions. I was hoping to scrape multiple columns of data on that row and then move to the next "tr" tag and continue in that fashion.
-
Aug 29th, 2012, 04:48 PM
#6
Re: HtmlElement In a_tags is not staying within if condition
Well you might be able to do it that way if, say, you created a new HTML document from the inner html of each table row and then scanned that for the <a> tags. So you'd have something like (structure not code)
For Each <tr> in Original HTML
    New HTML = inner <tr>
    For each <a> in New HTML
        Append Text inner <a>
    Next
    Append Text separator
Next
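A possible shortcut, sketched here as a different route from the new-document structure above: HtmlElement has its own GetElementsByTagName method, so each <tr> can be searched directly for the <a> tags inside it. The module and function names are invented for this example, and it assumes the page has finished loading (for instance, that it is called from the WebBrowser's DocumentCompleted handler).

```vbnet
Imports System.Collections.Generic
Imports System.Windows.Forms

Module RowScraper
    ' Hypothetical helper: returns one list per <tr itemprop="offers">,
    ' holding the InnerText of that row's <a itemprop="name"> tags.
    Function ScrapeOfferNames(doc As HtmlDocument) As List(Of List(Of String))
        Dim rows As New List(Of List(Of String))

        For Each tr As HtmlElement In doc.GetElementsByTagName("tr")
            If tr.GetAttribute("itemprop") <> "offers" Then Continue For

            Dim names As New List(Of String)
            ' Search only this row's descendants, not the whole document.
            For Each a_tag As HtmlElement In tr.GetElementsByTagName("a")
                If a_tag.GetAttribute("itemprop") = "name" Then
                    names.Add(a_tag.InnerText)
                End If
            Next

            rows.Add(names)
        Next

        Return rows
    End Function
End Module
```

It would be called as ScrapeOfferNames(WebBrowser1.Document). Because the inner loop starts from tr rather than from the document, each row's values stay together, and other columns could be collected in the same pass.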
-
Aug 29th, 2012, 05:46 PM
#7
Thread Starter
Addicted Member
Re: HtmlElement In a_tags is not staying within if condition
That's a great solution. I'm having trouble implementing the code. I'm not sure how to temporarily store the HtmlElement and set a new HtmlElementCollection.
****************
Dim current_tag_tr As String
Dim trs As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("tr")
For Each tr As HtmlElement In trs
    If tr.GetAttribute("itemprop") = "offers" Then
        'syntax not correct
        current_tag_tr = tr.OuterHtml
        Dim a_tags As HtmlElementCollection = current_tag_tr.GetElementsByTagName("a")
        For Each a_tag As HtmlElement In a_tags
            If a_tag.GetAttribute("itemprop") = "name" Then
                TextBox11.Text = TextBox11.Text & vbCrLf & a_tag.InnerText
            End If
        Next
    End If
Next
-
Aug 30th, 2012, 11:21 AM
#8
Thread Starter
Addicted Member
Re: HtmlElement In a_tags is not staying within if condition
I think the syntax is correct in this code, but I'm still not getting the correct result. I have two test points in this code (the "ZZ" and "XX" lines written to TextBox3). The output of the first test point is shown below. The second is blank.
Test point 1 output
*******
<TR itemprop="offers" itemtype="http://schema.org/Offer" itemscope><TD class="pic lt"><!-- Moved to ResultSet.tag --><!-- Moved to ResultSet.tag --><A class=img href="http://www.ebay.com/itm/Mens-Asics-Gel-Noosa-TRI-6-Racings-Shoes-Neon-yellow-White-Turquoise-/280950896347?pt=US_Men_s_Shoes&hash=item4169fa76db" itemprop="url"><IMG class=img alt="Men's Asics-Gel Noosa TRI 6 Racings Shoes Neon yellow/White/Turquoise" src="http://thumbs4.ebaystatic.com/d/l225/m/miz8OhHhVLOkFCVWXGaDycg.jpg" itemprop="image"> </A></TD>
<TD class=dtl>
<DIV class=ittl><A class=vip title="Men's Asics-Gel Noosa TRI 6 Racings Shoes Neon yellow/White/Turquoise" href="http://www.ebay.com/itm/Mens-Asics-Gel-Noosa-TRI-6-Racings-Shoes-Neon-yellow-White-Turquoise-/280950896347?pt=US_Men_s_Shoes&hash=item4169fa76db" itemprop="name">Men's Asics-Gel Noosa TRI 6 Racings Shoes Neon yellow/White/Turquoise</A> </DIV><!-- Moved to ResultSet.tag -->
<DIV class="dyn dynS">
<DIV class="s2 distLoc"></DIV>
<DIV class=s2>Returns: Not accepted</DIV>
<DIV style="CLEAR: left"></DIV></DIV>
<DIV></DIV>
<DIV class=anchors>
<DIV class=group>
<DIV class=mi-l><!-- Moved to ResultSet.tag -->
<DIV class=mi><A class="lnk iconQuickLook_14x14 mi-a" url="http://www.ebay.com/sch/moreinfo/?_id=280950896347&_ptns=US_Men_s_Shoes&_pppn=r1" t="QL">Quick Look</A> </DIV></DIV></DIV></DIV></TD>
<TD class=trs></TD>
<TD class="bids bin1"><!-- Moved to ResultSet.tag -->
<DIV>9 bids</DIV><!-- Moved to ResultSet.tag --></TD>
<TD class=prc><!-- Moved to ResultSet.tag -->
<DIV class=g-b itemprop="price">$27.00</DIV><!-- Moved to ResultSet.tag --></TD>
<TD class="tme "><B class=hidlb>Time left:</B> <SPAN class=tme><B class=hidlb>Time left:</B> <SPAN>3d 6h 55m</SPAN> </SPAN></TD></TR>
*************************
code
-----
Dim trs As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("tr")
For Each tr As HtmlElement In trs
    If tr.GetAttribute("itemprop") = "offers" Then
        Dim wb As New WebBrowser
        'wb.DocumentText = "your html string"
        wb.DocumentText = tr.OuterHtml
        TextBox3.Text = TextBox3.Text & vbCrLf & "ZZ" & tr.OuterHtml
        Dim doc As HtmlDocument = wb.Document
        Dim a_tags As HtmlElementCollection = doc.GetElementsByTagName("a")
        For Each a_tag As HtmlElement In a_tags
            TextBox3.Text = TextBox3.Text & vbCrLf & "XX" & a_tag.OuterHtml
            If a_tag.GetAttribute("itemprop") = "name" Then
                TextBox4.Text = TextBox4.Text & vbCrLf & a_tag.GetAttribute("href").ToString
                TextBox11.Text = TextBox11.Text & vbCrLf & a_tag.InnerText
            End If
        Next
    End If
Next
-
Aug 30th, 2012, 11:38 AM
#10
Re: HtmlElement In a_tags is not staying within if condition
vb.net Code:
Dim trs As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("tr")
For Each tr As HtmlElement In trs
    If tr.GetAttribute("itemprop") = "offers" Then
        Dim wb As WebBrowser = New WebBrowser 'allows the use of HTMLDocument
        wb.DocumentText = "" 'initialises document
        wb.Document.Write(tr.OuterHtml) 'creates HTML document from the <tr>
        Dim a_tags As HtmlElementCollection = wb.Document.GetElementsByTagName("a")
        For Each a_tag As HtmlElement In a_tags
            If a_tag.GetAttribute("itemprop") = "name" Then
                TextBox11.AppendText(a_tag.InnerText & vbCrLf) 'that's the way to do it!
            End If
        Next
    End If
Next
-
Aug 30th, 2012, 01:08 PM
#11
Thread Starter
Addicted Member
Re: HtmlElement In a_tags is not staying within if condition
That is it. Thanks. Appreciate all your help.