Extracting Nested Tables

by Tom, Tech Support on Mon Jan 23, 2012 4:59 am

Hello everyone,

As many of you are aware, the table extraction format changed from iMacros 6 to iMacros 7. As a result, iMacros 7 no longer formats table data with the #NEXT# and #NEWLINE# delimiters when extracting a table containing nested tables - a normal text extraction is performed instead.

[Update: iMacros 8 now includes #NEXT# and #NEWLINE# delimiters when extracting nested tables.]

One workaround is to extract only the innermost table. However, there are many cases where you need to extract the entire table, including all sub-tables, while retaining the V6 format. This is especially important if you have many legacy macros and scripts that depend on this format.

The following VBScript routines will accomplish this for you!

This solution requires the iMacros Scripting Edition, and it also requires you to use EXTRACT=HTM when extracting your table instead of EXTRACT=TXT.

Once you have extracted the table HTML and retrieved it with a call to iimGetLastExtract, you simply pass it to ExtractTableV6 and it returns the data in a format similar to that used in V6.
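As a rough illustration of the calling side, here is a minimal sketch. It assumes the two functions from this post are pasted into the same .vbs file and that the iMacros Scripting Edition is installed. The inline macro is an adaptation of the demo macro in this post with EXTRACT=HTM swapped in. The variable names are only illustrative, and this is not the attached example script.

```vbscript
' Assumption: ExtractTableV6 and ExtractChildNodesV6 from this post are
' defined in the same .vbs file. Run with: cscript //nologo yourscript.vbs

Dim iim, iret, macro, html, tableData

Set iim = CreateObject("imacros")
iret = iim.iimInit()

' Inline macro (CODE: prefix) adapted from the demo macro in this post.
' EXTRACT=HTM returns the raw table HTML instead of plain text.
macro = "CODE:URL GOTO=http://www.iopus.com/imacros/demo/v6/extract2/" & vbNewLine & _
        "TAG POS=1 TYPE=TABLE ATTR=TXT:This<SP>line<SP>is<SP>extracted* EXTRACT=HTM"

iret = iim.iimPlay(macro)

If iret > 0 Then
   ' Retrieve the first (and only) extracted value, then convert it.
   html = iim.iimGetLastExtract(1)
   tableData = ExtractTableV6(html)
   WScript.Echo tableData
Else
   WScript.Echo "Macro failed with return code " & iret
End If

iret = iim.iimExit()
```

If your macro extracts several items, pass the index of the table extraction to iimGetLastExtract instead of 1.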

Code:
Function ExtractTableV6(htmlTable)
' Extracts a table (including nested tables) from raw HTML extracted with iMacros
' using the TYPE=TABLE and EXTRACT=HTM parameters.

   Dim doc
   Set doc = CreateObject("HTMLFile")
   doc.write htmlTable
   
   Dim outermostTable, tableData
   Set outermostTable = doc.getElementsByTagName("TABLE")(0)
   
   ExtractTableV6 = ExtractChildNodesV6(outermostTable, tableData, False)

End Function

Function ExtractChildNodesV6(ByRef node, ByRef tableData, ByVal isNestedTable)
' Recursive function to extract all child nodes of the given outermost TABLE node.
' The returned data contains #NEXT# and #NEWLINE# delimiters for each table cell and
' row in a format similar to iMacros 6.

   Dim child, text
   
   For Each child In node.children
      If child.tagName = "P" Then
         tableData = tableData & vbNewLine
      End If

      If child.children.length = 0 Then
         
         If child.tagName = "BR" Then
            text = vbNewLine
         Else
            If Len(tableData) > 1 And Right(tableData, 2) <> vbNewLine  Then
               text = " "
            Else
               text = ""
            End If
            text = text & child.innerText
         End If
         
         tableData = tableData & text

      Else
         ' Emit any text that precedes the first child element, then recurse.
         Dim afterBegin : afterBegin = child.getAdjacentText("afterBegin")
         
         If Len(afterBegin) > 0 Then
            tableData = tableData & afterBegin
         End If
         
         If child.tagName = "TABLE" Then
            isNestedTable = True
         End If
         
         tableData = ExtractChildNodesV6(child, tableData, isNestedTable)
         tableData = tableData & child.getAdjacentText("beforeEnd")
      End If

      If child.tagName = "TD" Or child.tagName = "TH" Then
         tableData = tableData & "#NEXT#"
      ElseIf child.tagName = "TR" Then   
         tableData = tableData & "#NEWLINE#"
         If isNestedTable Then
            tableData = tableData & vbNewLine
         End If
      End If
   Next
   
   ExtractChildNodesV6 = tableData

End Function

Note: The format may not be 100% identical to V6 with regard to spacing and end-of-line characters. However, the placement of the #NEXT# and #NEWLINE# markers that delimit table cells and rows should be equivalent to V6. This code is not guaranteed to work with all tables, and it has only been tested with a limited set of test data.
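To show how the delimiters can be consumed, here is a small sketch that splits such a string into rows and cells. The sample string is made up for illustration and is not real output from the demo page. Note that the routine above appends #NEXT# after every cell, so each row ends with an empty trailing field. For nested tables, the inner rows also end in #NEWLINE#, so they show up as separate rows in this flat split.

```vbscript
Option Explicit

' Hypothetical sample data in the delimiter format described in this post.
Dim sample
sample = "Name#NEXT#Price#NEXT##NEWLINE#Widget#NEXT#9.99#NEXT##NEWLINE#"

Dim rows, cells, r, c, rowText, line
rows = Split(sample, "#NEWLINE#")

For r = 0 To UBound(rows)
   ' Drop line breaks, which the routine may emit around nested rows.
   rowText = Replace(rows(r), vbNewLine, "")

   ' Skip the empty piece that follows the final #NEWLINE# marker.
   If Len(Trim(rowText)) > 0 Then
      cells = Split(rowText, "#NEXT#")
      line = ""
      For c = 0 To UBound(cells)
         If c > 0 Then line = line & " | "
         line = line & Trim(cells(c))
      Next
      WScript.Echo "Row " & r & ": " & line
   End If
Next
```

Whether to discard the trailing empty field depends on what your legacy scripts expect, so the sketch leaves it in place.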

This code is provided AS-IS. Please feel free to update and modify it to suit your own needs!

Attached is a complete example script that replicates the behavior and output of iMacros 6 for the following macro:

Code:
URL GOTO=http://www.iopus.com/imacros/demo/v6/extract2/
TAG POS=1 TYPE=P ATTR=CLASS:heading EXTRACT=TXT
TAG POS=1 TYPE=BLOCKQUOTE ATTR=CLASS:bdytxt EXTRACT=TXT
TAG POS=1 TYPE=TABLE ATTR=TXT:This<SP>line<SP>is<SP>extracted* EXTRACT=TXT
SAVEAS TYPE=EXTRACT FOLDER=* FILE=TestExtract_{{!NOW:hhmmss}}.csv

Attachment: ExtractNestedTables.zip (2.37 KiB)

Regards,

Tom, iMacros Support