need help with REGEXP extraction

by unfathomable on Thu Mar 01, 2012 3:42 pm

I'm trying to extract posts from a forum. I need each post with its HTML markup intact (e.g. <img> and <a> tags), saved to an .htm file so I can read it later with images and links working. Here is the section I am scraping content from:

<!-- END TEMPLATE: ad_showthread_firstpost_start -->
post content<br />
<img src="http://www." border="0"/> more content</div>

      <!-- / message -->

I have the scraping and saving working with the iMacros macro below:
SEARCH SOURCE=REGEXP:"<!-- END TEMPLATE: ad_showthread_firstpost_start -->([^+]+)<!-- / message -->" EXTRACT=$1

Now the problem is that when I view the test.htm file, the source code looks something like this:
post content<br />
<img src=""http://www."" border=""0""/> more content

I'm sure the double quotes have something to do with regexp but I'm not experienced with it at all and would appreciate any help, thanks.
Posts: 1
Joined: Thu Mar 01, 2012 3:19 pm

Re: need help with REGEXP extraction

by Marcia, Tech Support on Fri Mar 02, 2012 4:00 am


The SAVEAS TYPE=EXTRACT command saves the extracted data in CSV tabular format, which means that columns are separated by commas and rows are separated by new lines. Each time SAVEAS runs, a new "CSV row" is added to the "CSV table".

Since commas and new lines are separators in this format, there must be a way to have them inside the data as well. This is done by wrapping the data in double quotes. That raises the next problem: what happens if the data itself contains quotes? They are "escaped" by writing them twice, which is why every " in your HTML shows up as "".
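To see the quoting rule in action outside iMacros, here is a minimal Python sketch. It uses Python's standard csv module as a stand-in for whatever writer iMacros uses internally, so treat it as an illustration of the CSV convention rather than of iMacros itself. The helper name to_csv_field is made up for this example.

```python
import csv
import io


def to_csv_field(text):
    """Return `text` written as a one-column CSV row (hypothetical helper)."""
    buf = io.StringIO()
    # The default dialect quotes a field only when it contains a comma,
    # a quote character or a line break, and doubles any embedded quotes.
    csv.writer(buf, lineterminator="\n").writerow([text])
    return buf.getvalue()


if __name__ == "__main__":
    html = 'post content<br />\n<img src="http://www." border="0"/> more content'
    # The printed row is wrapped in quotes and every " appears as "".
    print(to_csv_field(html))
```

A field without commas, quotes or line breaks comes back unchanged, which is why the doubling only becomes visible when extracting HTML attributes.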

From this explanation you can see that the CSV format is not really suited to exporting HTML. But once you know what is happening, you can "unwrap" it. If you have one file for each SAVEAS, the solution is simple: replace every "" with ", and remove the opening and closing quotes. If you have several extractions in a single file, besides replacing "" with ", you will also need to remove the quotes that wrap each extracted value.
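The unwrapping described above can also be left to a CSV parser instead of doing the replacements by hand. The following is a rough Python sketch, assuming the saved file follows the usual CSV quoting convention. The function name unwrap_extracts and the file names in the usage example are placeholders, not anything iMacros defines.

```python
import csv
import io


def unwrap_extracts(csv_text):
    """Return every field in `csv_text` with the CSV quoting undone.

    Hypothetical helper: the parser strips the wrapping quotes and turns
    each doubled quote ("") back into a single one.
    """
    # newline="" keeps line breaks inside quoted fields untouched.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    return [field for row in reader for field in row]


if __name__ == "__main__":
    # Placeholder file names; adjust to wherever SAVEAS wrote the data.
    with open("extract.csv", newline="", encoding="utf-8") as src:
        posts = unwrap_extracts(src.read())
    with open("test.htm", "w", encoding="utf-8") as dst:
        dst.write("\n".join(posts))
```

This handles both cases from the explanation: a file with a single extraction yields a one-element list, and a file with several extractions yields one element per value.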

You might find it easier to use the scripting interface to retrieve the extracted data instead of SAVEAS. Please have a look at the iMacros for Firefox JavaScript scripting interface.

Marcia, Tech Support
Posts: 1060
Joined: Thu Jan 29, 2009 6:10 am
