Three fundamental techniques of extracting a table's data

by Hannes, Tech Support on Mon Sep 24, 2007 3:13 am

We will demonstrate the different ways of extracting a table's data using the following page: http://finance.yahoo.com/q?s=ACU (Note that iOpus is not associated with Yahoo or with Acme United Corp. (ACU). We use this site for demonstration purposes only.)

Here's what the site and the table look like:

Image

Technique 1: By position only:

As HTML table fields are enclosed by <td> tags, we can use this tag in the extraction. Here's the code that extracts the first field ("Last Trade"), using the table's heading as an extraction anchor:

Code: Select all
TAG POS=1 TYPE=B ATTR=TXT:ACME<SP>UNITED<SP>CP   
TAG POS=R3 TYPE=TD ATTR=TXT:* EXTRACT=TXT 


Playing with the parameters in the Extraction Wizard, we find that POS=R4 extracts the next field (first row, second column)

Image

and POS=R5 extracts the first entry of the second row.

Thus, extracting the whole table simply means looping through the position values (starting with POS=R3). The odd values extract the left column's data, while the even values extract the right column's data.
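The position arithmetic can be sketched in Python. This is only an illustration, assuming the two-column layout described here with the first data cell at POS=R3. The helper names cell_for_position and tag_commands are hypothetical and not part of iMacros; the script merely writes out the macro lines you would otherwise type by hand, repeating the anchor before each extraction as in the two-line macro above.

```python
FIRST_POS = 3  # POS=R3 is the first data cell ("Last Trade"), per the post
COLUMNS = 2    # assumption: one label column plus one value column

# Anchor line copied from the macro shown in the post.
ANCHOR = "TAG POS=1 TYPE=B ATTR=TXT:ACME<SP>UNITED<SP>CP"


def cell_for_position(pos):
    """Map a relative position (3, 4, 5, ...) to a 1-based (row, column) pair."""
    if pos < FIRST_POS:
        raise ValueError("positions below R3 do not address table data")
    offset = pos - FIRST_POS
    return offset // COLUMNS + 1, offset % COLUMNS + 1


def tag_commands(rows):
    """Return anchor + extraction lines for every cell of the given row count."""
    lines = []
    last_pos = FIRST_POS + rows * COLUMNS - 1
    for pos in range(FIRST_POS, last_pos + 1):
        lines.append(ANCHOR)  # re-set the anchor before each relative TAG
        lines.append(f"TAG POS=R{pos} TYPE=TD ATTR=TXT:* EXTRACT=TXT")
    return lines


if __name__ == "__main__":
    for line in tag_commands(2):
        print(line)
```

Odd positions (R3, R5, ...) map to column 1 and even positions (R4, R6, ...) to column 2, which matches what the Wizard showed.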

======

Technique 2: By special formatting
When using the Extraction Wizard on the first element of the first column ("Last Trade"),

Image

the extraction command looks like this:

Code: Select all
TAG POS=1 TYPE=TD ATTR=CLASS:yfnc_tablehead1&&TXT:* EXTRACT=TXT


And playing with the POS values in the Wizard shows that simply looping through the POS values (POS=1, POS=2,...) extracts all items of the first column.

For the second column, it is
Code: Select all
TAG POS=1 TYPE=TD ATTR=CLASS:yfnc_tabledata1&&TXT:* EXTRACT=TXT 


that extracts all fields (when looping through the POS values).

(Note that the extracted data might need to be post-processed in a script to yield valid CSV.)
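As a rough idea of what such a script could look like, here is a minimal Python sketch. It assumes the two columns have already been loaded into two lists of strings (how you get them out of iMacros is not covered here), and the function name columns_to_csv is hypothetical. The sample values in the usage note are made up for illustration.

```python
import csv
import io


def columns_to_csv(labels, values):
    """Pair the two extracted columns row by row and return CSV text."""
    if len(labels) != len(values):
        raise ValueError("both columns must contain the same number of fields")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, value in zip(labels, values):
        # Collapse whitespace runs left over from the page markup.
        writer.writerow([" ".join(label.split()), " ".join(value.split())])
    return buffer.getvalue()
```

Calling columns_to_csv(["Last Trade:", "Volume:"], ["12.34", "1,234"]) with these made-up values returns two CSV lines, and the csv module quotes the second value because it contains a comma.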

=======
[Edit: updated this section on 2008/09/26]

Technique 3: By extracting whole lines or the whole table at once:

As HTML uses the <tr> tag to enclose a table's rows, we can also use these tags to extract the table's data. Using the Wizard (and relative extraction, again), we click on the table's heading, and then replace the type with TR and the ATTRibute with "TXT:*".

get.tables.line.png


When saving this extraction, the macro looks like this:
Code: Select all
TAG POS=1 TYPE=H1 ATTR=TXT:ACME<SP>UNITED<SP>CP   
TAG POS=R1 TYPE=TR ATTR=TXT:* EXTRACT=TXT 


And the subsequent relative POS values (POS=R2, POS=R3, ...) do indeed extract the following rows.

The same can be done by using TABLE as the extraction tag's TYPE

get.table.at.once.png


which produces the following macro commands:
Code: Select all
TAG POS=1 TYPE=H1 ATTR=TXT:ACME<SP>UNITED<SP>CP   
TAG POS=1 TYPE=TABLE ATTR=TXT:* EXTRACT=TXT 


(Note that the extracted data might need to be post-processed in a script to yield valid CSV.)
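For a whole-row or whole-table extraction, the post-processing could look roughly like the Python sketch below. The exact row and cell separators in the extracted text are not stated in this post, so the sketch takes them as parameters instead of guessing; the function name table_text_to_csv is hypothetical.

```python
import csv
import io


def table_text_to_csv(raw, row_sep, cell_sep):
    """Split extracted table text into rows and cells and return CSV text.

    row_sep and cell_sep are whatever separators your extracted text
    actually contains; they are assumptions supplied by the caller.
    """
    rows = []
    for line in raw.split(row_sep):
        cells = [cell.strip() for cell in line.split(cell_sep)]
        if any(cells):  # skip rows that are empty after stripping
            rows.append(cells)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Letting the csv module do the writing means that cells containing commas or quotes are escaped correctly, which is the usual stumbling block when building CSV by hand.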
Hannes, iOpus Support

Re: Three fundamental techniques of extracting a table's data

by dharmendra2000 on Tue Jul 22, 2008 11:53 pm

Does iMacros for Firefox not include the Extraction Wizard?

Dharmendra Uteshiya

Re: Three fundamental techniques of extracting a table's data

by Hannes, Tech Support on Mon Jul 28, 2008 3:31 am

dharmendra2000 wrote: Does iMacros for Firefox not include the Extraction Wizard?


Correct, it does not, but you can "create" extract commands by recording the TAG command (by clicking the page's element) and manually adding "EXTRACT=TXT" (or the like) to the recorded TAG command.

Cf. http://wiki.imacros.net/TAG#The_EXTRACT_Parameter
Hannes, iOpus Support

