## Three fundamental techniques of extracting a table's data


### Three fundamental techniques of extracting a table's data

We will demonstrate the different ways of extracting a table's data using the following page: http://finance.yahoo.com/q?s=ACU (Note that iOpus is associated neither with Yahoo nor with Acme United Corp. (ACU). We use this site for demonstration purposes only.)

Here's what the site and the table look like:

Technique 1: By position only:

As HTML table fields are enclosed by <td> tags, we can use these tags in the extraction. Here's the code that extracts the first field ("Last Trade") using the table's heading as an extraction anchor:

`TAG POS=1 TYPE=B ATTR=TXT:ACME<SP>UNITED<SP>CP`
`TAG POS=R3 TYPE=TD ATTR=TXT:* EXTRACT=TXT`

Playing with the parameters in the Extraction Wizard, we find that POS=R4 extracts the next field (first line, second column) and POS=R5 extracts the first entry of the second line.

Thus, extracting the whole table simply means looping through the position values (starting with POS=R3). The odd values extract the left column's data, while the even values extract the right column's.
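To make the odd/even pairing concrete, here is a minimal Python sketch of a post-processing step. It is not part of iMacros. It assumes the values were extracted in POS order (R3, R4, R5, ...) and arrive as a single string in which the values are separated by `[EXTRACT]`; change `SEPARATOR` if your setup delivers them differently. The function name `pairs_to_csv` and the sample values are made up for illustration.

```python
import csv
import io

# Assumption: extracted values arrive as one string joined by this marker.
SEPARATOR = "[EXTRACT]"


def pairs_to_csv(raw: str) -> str:
    """Turn label, value, label, value, ... into two-column CSV text."""
    values = [v.strip() for v in raw.split(SEPARATOR)]
    # Drop the empty piece left behind by a trailing separator (or empty input).
    if values and values[-1] == "":
        values.pop()
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of extracted values")

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Indexes 0, 2, 4, ... came from POS=R3, R5, ... (left column);
    # indexes 1, 3, 5, ... came from POS=R4, R6, ... (right column).
    for label, value in zip(values[0::2], values[1::2]):
        writer.writerow([label, value])
    return out.getvalue()


# Invented sample data, not real quotes.
sample = "Last Trade:[EXTRACT]10.50[EXTRACT]Volume:[EXTRACT]1,200"
print(pairs_to_csv(sample), end="")
```

With the invented sample this prints `Last Trade:,10.50` and `Volume:,"1,200"`. The `csv` module quotes the second value because it contains a comma, which is exactly the kind of field that breaks naive CSV output.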

======

Technique 2: By special formatting:

When using the Extraction Wizard on the first element of the first column ("Last Trade"), the extraction command looks like this:

`TAG POS=1 TYPE=TD ATTR=CLASS:yfnc_tablehead1&&TXT:* EXTRACT=TXT`

And playing with the POS values in the Wizard shows that simply looping through the POS values (POS=1, POS=2,...) extracts all items of the first column.

For the second column, the command is

`TAG POS=1 TYPE=TD ATTR=CLASS:yfnc_tabledata1&&TXT:* EXTRACT=TXT`

which extracts all fields of that column when looping through the POS values.

(Note that the extracted data might need to be post-processed in a script to yield valid CSV.)
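If you drive iMacros from a script, the POS loop for both columns can be written out as plain macro text. The following Python sketch only builds that text; it uses nothing beyond the two TAG commands shown above. It assumes that every table line has exactly one `yfnc_tablehead1` cell and one `yfnc_tabledata1` cell, so that the same POS value addresses matching cells. The helper name `build_macro` and the `rows` parameter are hypothetical.

```python
def build_macro(rows: int) -> str:
    """Return macro text that extracts label and value for each table line."""
    if rows < 1:
        raise ValueError("rows must be at least 1")
    lines = []
    for pos in range(1, rows + 1):
        # Left column: the label cell of this line.
        lines.append(
            f"TAG POS={pos} TYPE=TD ATTR=CLASS:yfnc_tablehead1&&TXT:* EXTRACT=TXT"
        )
        # Right column: the data cell of the same line.
        lines.append(
            f"TAG POS={pos} TYPE=TD ATTR=CLASS:yfnc_tabledata1&&TXT:* EXTRACT=TXT"
        )
    return "\n".join(lines)


# Print the commands for the first two table lines.
print(build_macro(2))
```

Alternating the two commands keeps each label next to its value in the extraction order, which makes the later conversion to CSV easier than extracting one complete column after the other.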

=======
[Edit: updated this section on 2008/09/26]

Technique 3: By extracting whole lines or the whole table at once:

As HTML uses the <tr> tag to enclose a table's lines (rows), we can also use these tags to extract the table's data. Using the Wizard (and relative extraction, again), we click on the table's heading, and then change the TYPE to TR and the ATTR to "TXT:*".

When saving this extraction, the macro looks like this:

`TAG POS=1 TYPE=H1 ATTR=TXT:ACME<SP>UNITED<SP>CP`
`TAG POS=R1 TYPE=TR ATTR=TXT:* EXTRACT=TXT`

And all following POS values (POS=R2, POS=R3, ...) do indeed extract the following lines.

The same can be done by using TABLE as the extraction tag's TYPE, which produces the following macro commands:

`TAG POS=1 TYPE=H1 ATTR=TXT:ACME<SP>UNITED<SP>CP`
`TAG POS=1 TYPE=TABLE ATTR=TXT:* EXTRACT=TXT`

(Note that the extracted data might need to be post-processed in a script to yield valid CSV.)
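As a rough illustration of such post-processing, here is a Python sketch that converts a whole-table extraction into CSV. The delimiters are an assumption: the sketch expects cells to be separated by `#NEXT#` and lines by `#NEWLINE#`, so check what your iMacros version actually returns and adjust `CELL_SEP` and `ROW_SEP` accordingly. The function name `table_to_csv` and the sample string are invented for the example.

```python
import csv
import io

# Assumptions: verify both markers against your own extraction output.
CELL_SEP = "#NEXT#"     # assumed marker between cells of one line
ROW_SEP = "#NEWLINE#"   # assumed marker between table lines


def table_to_csv(raw: str) -> str:
    """Convert one extracted table string into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in raw.split(ROW_SEP):
        cells = [c.strip() for c in row.split(CELL_SEP)]
        # Skip lines that contain no text at all (e.g. after a trailing marker).
        if any(cells):
            writer.writerow(cells)
    return out.getvalue()


# Invented sample data, not real quotes.
sample = "Last Trade:#NEXT#10.50#NEWLINE#Volume:#NEXT#1,200#NEWLINE#"
print(table_to_csv(sample), end="")
```

Writing through the `csv` module rather than joining with commas by hand means values that contain commas or quotes are escaped correctly.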
Hannes, iOpus Support
Hannes, Tech Support

Posts: 2120
Joined: Thu Sep 21, 2006 6:27 am

### Re: Three fundamental techniques of extracting a table's data

iMacros for Firefox does not include the Extraction Wizard?

Dharmendra Uteshiya
dharmendra2000

Posts: 214
Joined: Fri Jul 04, 2008 6:28 am

### Re: Three fundamental techniques of extracting a table's data

dharmendra2000 wrote: iMacros for Firefox does not include the Extraction Wizard?

Correct, it does not, but you can "create" extract commands by recording the TAG (by clicking the site's element) and manually adding "EXTRACT=TXT" (or the like) to the recorded TAG command.

Cf. http://wiki.imacros.net/TAG#The_EXTRACT_Parameter
Hannes, iOpus Support
Hannes, Tech Support

Posts: 2120
Joined: Thu Sep 21, 2006 6:27 am