html / tag relationship

Claudiu
Posts: 32
Joined: Fri Aug 07, 2009 10:16 am

html / tag relationship

Post by Claudiu » Tue Dec 06, 2011 12:03 am

I'm trying to pull data from Google search results, and I'm having trouble extracting the URLs of all the websites displayed on Google's first page.
The main problem is that the iMacros TAG command extracts both good data and bad data.

The web page I'm extracting data from is
http://www.google.co.uk/search?q=double ... WIBQ&gbv=2

* The good data is: 1st position in Google, 2nd, etc.
** The bad data is: the results from Shopping results (Limelight Beds Furniture Limelight, etc.).

The very same TAG command that extracts the good data also extracts the bad data.
There might be a way around this, though, and maybe with your help I can get it to work.

What could help me ignore the bad data is the HTML code.
Below you can see the HTML for both the good and the bad data.

You'll see that they look similar, but one thing makes the difference: the HTML tag <h3 class="r">, which appears only in the good data.

html code for the good data

Code: Select all

<h3 class="r">
	<a href="http://www.website.co.uk/" class=l onmousedown="return rwt">
		Double Beds Frames 
	</a>
</h3>

html code for the bad data

Code: Select all

	<a href="http://www.website.co.uk/" class=l onmousedown="return rwt">
		Double Beds Frames 
	</a>
Notice how the good data has an extra HTML tag:

Code: Select all

<h3 class="r">
That line, I think, makes the difference for achieving what I need.

Bottom line: I need to extract the website URL at Google's 1st position, then 2nd, 3rd, etc., which is easy using the iMacros browser.
The iMacros browser would return:

Code: Select all

TAG POS=1 TYPE=A ATTR=CLASS:l&&TXT:*&&HREF:* EXTRACT=HREF
This is good because it extracts what I need.
The problem is that it also extracts data I don't need, namely the URLs from the Shopping results. (I badly need to skip the Shopping results, so that iMacros doesn't see them at all.)



First, I tried:

Code: Select all

TAG POS=1 TYPE=H3 ATTR=CLASS:r EXTRACT=HREF
which only works if the extract type is EXTRACT=TXT.

Second, I tried:

Code: Select all

TAG POS=1 TYPE=H3 ATTR=CLASS:r&&TXT:*&&HREF:* EXTRACT=TXT
which also only works if the extract type is EXTRACT=TXT.

Third, I tried:

Code: Select all

TAG POS=1 TYPE=H3 ATTR=CLASS:r&&CLASS:l&&TXT:*&&HREF:* EXTRACT=TXT
which doesn't work at all.

I'd appreciate some help with this.
Marcia, Tech Support
Posts: 1095
Joined: Thu Jan 29, 2009 1:10 pm

Re: html / tag relationship

Post by Marcia, Tech Support » Tue Dec 06, 2011 9:11 am

Hello,

Since you want to tag only elements that are inside the <h3 class="r"> element, I think your best bet is to use XPATH.

Please try the following TAG command:

Code: Select all

TAG XPATH="//*[@id='rso']/li[{{!LOOP}}]/div/h3[@class='r']/a[@class='l']" EXTRACT=HREF
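
To see why this path skips the Shopping links, here is a rough Python sketch that runs the same path shape against a small, made-up page. The markup, the example.com URLs and the function name organic_hrefs are all assumptions for illustration (the real Google markup may differ), and Python's ElementTree only supports a subset of XPath, so treat it as a demonstration of the idea rather than of what iMacros does internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modeled on the snippets in this thread.
# The second <li> stands in for a Shopping result: its link has class "l"
# but is NOT wrapped in <h3 class="r">.
PAGE = """
<html><body><ol id="rso">
  <li><div><h3 class="r"><a href="http://one.example.com/" class="l">Double Beds Frames</a></h3></div></li>
  <li><div><a href="http://shop.example.com/" class="l">Limelight Beds</a></div></li>
  <li><div><h3 class="r"><a href="http://two.example.com/" class="l">Double Beds</a></h3></div></li>
</ol></body></html>
"""


def organic_hrefs(page):
    """Return hrefs of links that sit inside <h3 class="r">, in page order."""
    root = ET.fromstring(page.strip())
    item_count = len(root.findall(".//*[@id='rso']/li"))
    hrefs = []
    # 'loop' plays the role of {{!LOOP}} in the TAG command: 1, 2, 3, ...
    for loop in range(1, item_count + 1):
        path = f".//*[@id='rso']/li[{loop}]/div/h3[@class='r']/a[@class='l']"
        for link in root.findall(path):
            hrefs.append(link.get("href"))
    return hrefs


if __name__ == "__main__":
    for href in organic_hrefs(PAGE):
        print(href)
```

In this sketch the second list item yields no match, because its link has no <h3 class="r"> parent, so only the two links wrapped in the heading are collected.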
Regards,

Marcia
Claudiu
Posts: 32
Joined: Fri Aug 07, 2009 10:16 am

Re: html / tag relationship

Post by Claudiu » Tue Dec 06, 2011 11:23 am

perfecto ! .. thank you ^1000