How to deal with information that is only sometimes there

Post by Tech Support » Sat Dec 06, 2008 12:25 am

When you extract information, e.g. from a job website, you might notice that not every listing contains every field. If you use relative extraction, the TAG command that serves as the anchor will trigger an error message whenever its field is missing.

As an example, we use our demo listing at http://www.iopus.com/imacros/demo/v6/ex ... sting1.htm

The standard extraction macro is:

URL GOTO=http://www.iopus.com/imacros/demo/v6/extract1/listing1.htm     
TAG POS=1 TYPE=B ATTR=TXT:Salary:  
TAG POS=R-1 TYPE=NOBR ATTR=TXT:* EXTRACT=TXT 
TAG POS=1 TYPE=B ATTR=TXT:*Position*    
TAG POS=R-1 TYPE=NOBR ATTR=TXT:* EXTRACT=TXT 
TAG POS=1 TYPE=B ATTR=TXT:*Ref*     
TAG POS=R-1 TYPE=NOBR ATTR=TXT:* EXTRACT=TXT 
(Note: POS=R-1 is used because the NOBR tag that contains the information starts before the B tag that contains the anchor text, so the value is the first NOBR tag before the anchor.)

The problem with this macro is that if a field is missing, e.g. the "Position:" information, you will get a TAG error message.

Solution: Extract the anchor text, too! This way you will (a) get no TAG error if the anchor is missing and (b) be able to easily match the anchor text with the extracted information:

URL GOTO=http://www.iopus.com/imacros/demo/v6/extract1/listing1.htm     
TAG POS=1 TYPE=B ATTR=TXT:Salary:  EXTRACT=TXT 
TAG POS=R-1 TYPE=NOBR ATTR=TXT:* EXTRACT=TXT 
TAG POS=1 TYPE=B ATTR=TXT:*Position*    EXTRACT=TXT  
TAG POS=R-1 TYPE=NOBR ATTR=TXT:* EXTRACT=TXT 
TAG POS=1 TYPE=B ATTR=TXT:*Ref*     EXTRACT=TXT 
TAG POS=R-1 TYPE=NOBR ATTR=TXT:* EXTRACT=TXT 
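
Once the anchors are extracted along with the values, matching them up can be done in whatever script processes the results. The following Python function is only a rough sketch of that matching step, not part of iMacros itself. It assumes the extraction result arrives as one string with the items separated by [EXTRACT], and that an item that could not be found shows up as the placeholder #EANF#; check both against the output of your iMacros version. The function name pair_fields and the sample values are made up for illustration.

```python
from typing import Dict, Optional

SEPARATOR = "[EXTRACT]"  # assumed separator between extracted items
NOT_FOUND = "#EANF#"     # assumed placeholder for an item that was not found


def pair_fields(raw: str) -> Dict[str, Optional[str]]:
    """Pair each extracted anchor text with the value extracted right after it.

    The macro extracts anchor, value, anchor, value, ... so the items are
    read two at a time. Pairs whose anchor is missing are skipped, because
    the value that follows a missing anchor cannot be trusted.
    """
    items = [item.strip() for item in raw.split(SEPARATOR)]
    fields: Dict[str, Optional[str]] = {}
    for label, value in zip(items[0::2], items[1::2]):
        if label == NOT_FOUND:
            continue  # anchor missing on this listing: drop the pair
        key = label.rstrip(":").strip()  # "Salary:" becomes "Salary"
        fields[key] = None if value == NOT_FOUND else value
    return fields
```

With a listing that has no "Position:" entry, the result then simply contains no "Position" key, so each listing can be written out with exactly the fields it really has.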