How to deal with information that is only sometimes there


by Tech Support on Fri Dec 05, 2008 5:25 pm

When you extract information, e.g. from a job website, you might notice that not every listing contains every field. If you use relative extraction, the TAG command that serves as the anchor will trigger an error message whenever its field is missing.

As an example, we use our demo listing at http://www.iopus.com/imacros/demo/v6/ex ... sting1.htm

The standard extraction macro is:
URL GOTO=http://www.iopus.com/imacros/demo/v6/extract1/listing1.htm
TAG POS=1 TYPE=B ATTR=TXT:Salary:
TAG POS=R-1 TYPE=NOBR ATTR=TXT:* EXTRACT=TXT
TAG POS=1 TYPE=B ATTR=TXT:*Position*
TAG POS=R-1 TYPE=NOBR ATTR=TXT:* EXTRACT=TXT
TAG POS=1 TYPE=B ATTR=TXT:*Ref*
TAG POS=R-1 TYPE=NOBR ATTR=TXT:* EXTRACT=TXT
(Note: POS=R-1 is used because the HTML tag that holds the information starts before the HTML tag that holds the label text.)

The problem with this macro is that if a field such as "Position:" is missing, you will get a TAG error message.

Solution: Extract the anchor text, too! This way you will (a) get no TAG error if the anchor is missing and (b) be able to match each anchor text easily with its extracted information:


URL GOTO=http://www.iopus.com/imacros/demo/v6/extract1/listing1.htm
TAG POS=1 TYPE=B ATTR=TXT:Salary: EXTRACT=TXT
TAG POS=R-1 TYPE=NOBR ATTR=TXT:* EXTRACT=TXT
TAG POS=1 TYPE=B ATTR=TXT:*Position* EXTRACT=TXT
TAG POS=R-1 TYPE=NOBR ATTR=TXT:* EXTRACT=TXT
TAG POS=1 TYPE=B ATTR=TXT:*Ref* EXTRACT=TXT
TAG POS=R-1 TYPE=NOBR ATTR=TXT:* EXTRACT=TXT
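
Once the anchors are extracted as well, the result can be paired up outside of iMacros. The following Python sketch shows one possible way to do that. It rests on assumptions not stated in this post: that the extracted values arrive as one string joined by "[EXTRACT]", and that a TAG command that finds nothing yields the placeholder "#EANF#". The function name pair_extracts and the sample string are hypothetical and only serve as illustration. Because this post does not say what the relative TAG returns when its anchor is missing, the sketch simply discards the value of any pair whose anchor was not found.

```python
EANF = "#EANF#"          # assumed placeholder for "element not found"
SEPARATOR = "[EXTRACT]"  # assumed separator between extracted values


def pair_extracts(raw):
    """Pair anchor texts with their values, skipping anchors that were not found."""
    parts = [p.strip() for p in raw.split(SEPARATOR)]
    record = {}
    # Walk the list two items at a time: anchor text first, then its value.
    for label, value in zip(parts[0::2], parts[1::2]):
        if label == EANF:
            # Anchor missing on this listing: ignore the pair entirely.
            continue
        key = label.rstrip(":").strip()
        record[key] = None if value == EANF else value
    return record


# Hypothetical extraction result for a listing without a "Position:" field.
sample = "Salary:[EXTRACT]50000[EXTRACT]#EANF#[EXTRACT]#EANF#[EXTRACT]Ref:[EXTRACT]A-17"
print(pair_extracts(sample))
```

Keying the result by the anchor text rather than by position means a missing field only drops one entry and does not shift the remaining values into the wrong columns.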