Extracting Nested Tables

Post by Tom, Tech Support » Mon Jan 23, 2012 11:59 am

Hello everyone,

As many of you are aware, the table extraction format changed from iMacros 6 to iMacros 7. As a result, iMacros 7 no longer formats table data with the #NEXT# and #NEWLINE# delimiters when extracting a table that contains nested tables; a normal text extraction is performed instead.

[Update: iMacros 8 now includes #NEXT# and #NEWLINE# delimiters when extracting nested tables.]

One workaround is to extract only the innermost table. However, there are many cases where you may need to extract the entire table, including all sub-tables, while retaining the iMacros 6 (V6) format. This is especially important if you have many legacy macros and scripts that depend on this format.

The following VBScript routines will accomplish this for you!

This solution requires the iMacros Scripting Edition, and the table must be extracted with EXTRACT=HTM instead of EXTRACT=TXT.

Once you have extracted the table HTML and retrieved it with a call to iimGetLastExtract, you simply pass it to ExtractTableV6 and it returns the data in a format similar to that used in V6.
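
A minimal sketch of that call sequence follows. It assumes the iMacros Scripting Interface COM object (created with CreateObject("imacros")) and its iimInit, iimPlay, iimGetLastExtract, iimGetLastError and iimExit methods. The URL and TAG line are borrowed from the demo macro in this post, with EXTRACT=HTM swapped in, and will need adapting to your own page. It also assumes the two functions from this post are pasted into the same .vbs file.

```vbscript
Dim iim, iret, macro, html, tableV6

Set iim = CreateObject("imacros")   ' iMacros Scripting Interface (assumed installed)
iret = iim.iimInit()

' Extract the table as raw HTML (EXTRACT=HTM), not as text.
macro = "CODE:"
macro = macro & "URL GOTO=http://www.iopus.com/imacros/demo/v6/extract2/" & vbNewLine
macro = macro & "TAG POS=1 TYPE=TABLE ATTR=TXT:This<SP>line<SP>is<SP>extracted* EXTRACT=HTM"
iret = iim.iimPlay(macro)

If iret > 0 Then
	' Hand the extracted HTML to the conversion routine from this post.
	html = iim.iimGetLastExtract(1)
	tableV6 = ExtractTableV6(html)
	WScript.Echo tableV6
Else
	WScript.Echo "Macro failed: " & iim.iimGetLastError()
End If

iret = iim.iimExit()
```

ExtractTableV6 itself is defined in the listing that follows.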

Function ExtractTableV6(htmlTable)
' Extracts a table (including nested tables) from raw HTML extracted with iMacros
' using the TYPE=TABLE and EXTRACT=HTM parameters. 

	Dim doc
	Set doc = CreateObject("HTMLFile")
	doc.write htmlTable
	
	Dim outermostTable, tableData
	Set outermostTable = doc.getElementsByTagName("TABLE")(0)
	
	ExtractTableV6 = ExtractChildNodesV6(outermostTable, tableData, False)

End Function

Function ExtractChildNodesV6(ByRef node, ByRef tableData, ByVal isNestedTable)
' Recursive function to extract all child nodes of the given outermost TABLE node.
' The returned data contains #NEXT# and #NEWLINE# delimiters for each table cell and 
' row in a format similar to iMacros 6.

	Dim child, text
	
	For Each child In node.children
		If child.tagName = "P" Then
			tableData = tableData & vbNewLine
		End If

		If child.children.length = 0 Then
			
			If child.tagName = "BR" Then
				text = vbNewLine
			Else
				If Len(tableData) > 1 And Right(tableData, 2) <> vbNewLine  Then
					text = " "
				Else
					text = ""
				End If
				text = text & child.innerText
			End If
			
			tableData = tableData & text

		ElseIf child.children.length > 0 Then
			' Text that sits directly after the element's opening tag, before its first child.
			Dim afterBegin : afterBegin = child.getAdjacentText("afterBegin")
			
			If Len(afterBegin) > 0 Then
				tableData = tableData & afterBegin
			End If
			
			If child.tagName = "TABLE" Then
				isNestedTable = True
			End If
			
			tableData = ExtractChildNodesV6(child, tableData, isNestedTable)
			tableData = tableData & child.getAdjacentText("beforeEnd")
		End If

		If child.tagName = "TD" Or child.tagName = "TH" Then
			tableData = tableData & "#NEXT#"
		ElseIf child.tagName = "TR" Then	
			tableData = tableData & "#NEWLINE#"
			If isNestedTable Then
				tableData = tableData & vbNewLine
			End If
		End If
	Next
	
	ExtractChildNodesV6 = tableData

End Function

Note: The format may not be 100% identical to V6 with regard to spacing and end-of-line characters. However, the placement of the #NEXT# and #NEWLINE# markers that delimit table cells and rows should be equivalent to V6. This code is not guaranteed to work with all tables; it has only been tested with a limited set of test data.

This code is provided AS-IS; please feel free to update and modify it to suit your own needs.
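
For scripts that consume the result, the delimited string can be split back into rows and cells. The sketch below is not part of the original routines: ParseV6Table is a hypothetical helper name, and it assumes every cell ends with #NEXT# and every row with #NEWLINE#, as the code above produces. It does not rebuild the nesting; rows of a nested table simply show up as additional rows in the flat result.

```vbscript
Function ParseV6Table(tableData)
' Splits V6-style table data into an array of rows, where each row is
' an array of trimmed cell strings. ParseV6Table is an illustrative name.

	Dim rows, cells, result(), count, last, i, j

	count = 0
	rows = Split(tableData, "#NEWLINE#")

	For i = 0 To UBound(rows)
		' Skip pieces that hold nothing but line breaks or spaces,
		' such as the empty piece after the final #NEWLINE#.
		If Len(Trim(Replace(rows(i), vbNewLine, ""))) > 0 Then
			cells = Split(rows(i), "#NEXT#")
			last = UBound(cells)

			For j = 0 To last
				cells(j) = Trim(Replace(cells(j), vbNewLine, " "))
			Next

			' Drop the empty piece left after the row's last #NEXT#.
			If last > 0 And Len(cells(last)) = 0 Then ReDim Preserve cells(last - 1)

			ReDim Preserve result(count)
			result(count) = cells
			count = count + 1
		End If
	Next

	If count = 0 Then
		ParseV6Table = Array()
	Else
		ParseV6Table = result
	End If
End Function

' Usage with hand-written sample data: echoes one line per row.
Dim parsed, r
parsed = ParseV6Table("A#NEXT#B#NEXT##NEWLINE#C#NEXT#D#NEXT##NEWLINE#")
For Each r In parsed
	WScript.Echo Join(r, " | ")
Next
```

Splitting on the markers rather than on line breaks keeps multi-line cell text together, which matters because the routine inserts vbNewLine for BR and P elements.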

Attached is a complete example script that replicates the behavior and output of iMacros 6 for the following macro:

URL GOTO=http://www.iopus.com/imacros/demo/v6/extract2/
TAG POS=1 TYPE=P ATTR=CLASS:heading EXTRACT=TXT
TAG POS=1 TYPE=BLOCKQUOTE ATTR=CLASS:bdytxt EXTRACT=TXT
TAG POS=1 TYPE=TABLE ATTR=TXT:This<SP>line<SP>is<SP>extracted* EXTRACT=TXT
SAVEAS TYPE=EXTRACT FOLDER=* FILE=TestExtract_{{!NOW:hhmmss}}.csv

Attachment: ExtractNestedTables.zip (2.37 KiB)

Regards,

Tom, iMacros Support