Three fundamental techniques of extracting a table's data

Post by Hannes, Tech Support » Mon Sep 24, 2007 10:13 am

We will demonstrate the different ways of extracting a table's data using the following page: http://finance.yahoo.com/q?s=ACU (Note that iOpus is associated neither with Yahoo nor with Acme United Corp. (ACU). We use this site for demonstration purposes only.)

Here is what the site and the table look like:

Image

Technique 1: By position only:

As HTML table fields are enclosed by <td> tags, we can use this in the extraction. Here's the code that extracts the first field ("Last Trade"), using the table's heading as an extraction anchor:

Code:

TAG POS=1 TYPE=B ATTR=TXT:ACME<SP>UNITED<SP>CP
TAG POS=R3 TYPE=TD ATTR=TXT:* EXTRACT=TXT

Playing with the parameters in the Extraction Wizard, we find that POS=R4 extracts the next field (first row, second column)

Image

and POS=R5 extracts the first entry of the second row.

Thus, extracting the whole table simply means looping through the position values (starting with POS=R3). The odd values extract the left column's data, while the even values extract the second column.
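The odd/even rule above can be written down as a small calculation. The following Python sketch is only an illustration and not part of iMacros: the helper names `cell_for_position` and `extract_command` are made up for this example, and it assumes a two-column table whose first field sits at POS=R3, as on the demo page.

```python
def cell_for_position(rel_pos, first_pos=3, columns=2):
    """Map the N of POS=RN to a 1-based (row, column) pair.

    Assumes the first table field is found at POS=R<first_pos> and that
    the table has <columns> columns (both taken from the demo page).
    """
    if rel_pos < first_pos:
        raise ValueError("position lies before the first table field")
    offset = rel_pos - first_pos
    return offset // columns + 1, offset % columns + 1


def extract_command(rel_pos):
    """Build the relative extraction command for one position."""
    return f"TAG POS=R{rel_pos} TYPE=TD ATTR=TXT:* EXTRACT=TXT"


# Show which cell each of the first few positions addresses.
for pos in range(3, 9):
    row, col = cell_for_position(pos)
    print(f"{extract_command(pos)}  (row {row}, column {col})")
```

The anchor command from the macro above still has to be played before the relative command; the sketch only shows how the position values map to table cells.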

======

Technique 2: By special formatting
When using the Extraction Wizard on the first element of the first column ("Last Trade"),

Image

the extraction command looks like this:

Code:

TAG POS=1 TYPE=TD ATTR=CLASS:yfnc_tablehead1&&TXT:* EXTRACT=TXT

Playing with the POS values in the Wizard shows that simply looping through the POS values (POS=1, POS=2, ...) extracts all items of the first column.

For the second column, the command is

Code:

TAG POS=1 TYPE=TD ATTR=CLASS:yfnc_tabledata1&&TXT:* EXTRACT=TXT

which extracts all fields of that column when looping through the POS values.

(Note that the extracted data might need to be post-processed in a script to yield valid CSV.)
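As a rough idea of such a post-processing script, here is a minimal Python sketch. It assumes the two columns have already been extracted into two lists of equal length; the function name `pairs_to_csv` and the sample values are invented for illustration and are not real quotes.

```python
import csv
import io


def pairs_to_csv(labels, values):
    """Combine first-column labels and second-column values into CSV text."""
    if len(labels) != len(values):
        raise ValueError("both columns must have the same number of entries")
    buffer = io.StringIO()
    writer = csv.writer(buffer)  # quotes fields that contain commas
    for label, value in zip(labels, values):
        writer.writerow([label.strip(), value.strip()])
    return buffer.getvalue()


# Made-up sample data, standing in for the extracted fields.
labels = ["Last Trade:", "Volume:"]
values = ["12.34", "1,234"]
print(pairs_to_csv(labels, values))
```

Using the csv module rather than joining strings by hand means that values containing commas (such as large numbers) are quoted properly.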

=======
[Edit: updated this section on 2008/09/26]

Technique 3: By extracting whole lines or the whole table at once:

As HTML uses the <tr> tag to enclose a table's rows, we can also use it to extract the table's data. Using the Wizard (and relative extraction, again), we click on the table's heading, and then set the type to TR and the ATTRibute to "TXT:*".
get.tables.line.png
When saving this extraction, the macro looks like this:

Code:

TAG POS=1 TYPE=H1 ATTR=TXT:ACME<SP>UNITED<SP>CP
TAG POS=R1 TYPE=TR ATTR=TXT:* EXTRACT=TXT

And all following POS values (POS=R2, POS=R3, ...) do indeed extract the following rows.

The same can be done by using TABLE as the extraction tag's TYPE
get.table.at.once.png
which produces the following macro commands:

Code:

TAG POS=1 TYPE=H1 ATTR=TXT:ACME<SP>UNITED<SP>CP
TAG POS=1 TYPE=TABLE ATTR=TXT:* EXTRACT=TXT

(Note that the extracted data might need to be post-processed in a script to yield valid CSV.)
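For the whole-table case, such a script mainly has to split the extracted text into rows and cells. The Python sketch below is an illustration only. It assumes that cells are separated by the marker #NEXT# and rows by #NEWLINE#; please check the actual extraction result of your iMacros version and adjust the two constants if it differs. The function names and the sample string are made up.

```python
import csv
import io

# Assumed separators in the extracted table text; adjust if yours differ.
CELL_SEP = "#NEXT#"
ROW_SEP = "#NEWLINE#"


def table_text_to_rows(extracted, cell_sep=CELL_SEP, row_sep=ROW_SEP):
    """Split one extracted table string into a list of rows of cells."""
    rows = []
    for raw_row in extracted.split(row_sep):
        cells = [cell.strip() for cell in raw_row.split(cell_sep)]
        if any(cells):  # skip rows that contain no text at all
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Write the rows as CSV text, quoting fields where necessary."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Made-up sample string, standing in for a real extraction result.
sample = "Last Trade:#NEXT#12.34#NEWLINE#Volume:#NEXT#1,234#NEWLINE#"
print(rows_to_csv(table_text_to_rows(sample)))
```

The same function can be applied to a single extracted row (TYPE=TR), which then yields a list with just one entry.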
Re: Three fundamental techniques of extracting a table's data

Post by dharmendra2000 » Wed Jul 23, 2008 6:53 am

iMacro for Firefox does not include Extraction Wizard?

Dharmendra Uteshiya
Re: Three fundamental techniques of extracting a table's data

Post by Hannes, Tech Support » Mon Jul 28, 2008 10:31 am

dharmendra2000 wrote:iMacro for Firefox does not include Extraction Wizard?
Correct, it does not, but you can "create" extraction commands by recording the TAG command (by clicking the page element) and manually adding "EXTRACT=TXT" (or the like) to the recorded TAG command.

Cf. http://wiki.imacros.net/TAG#The_EXTRACT_Parameter