Most Efficient Way To Extract Source Code

by thisissolame on Tue Oct 12, 2010 5:24 pm

This isn't really a question, just how I'm getting the source code for websites. I get ideas from old posts so maybe this will be of use to someone in the future.

I've been using iMacros 6 to get the source for web pages by saving the source code each time and then working on it with outside scripting. That is a roundabout way for me, because in the end I just want the entire source code stored in a variable.

The wiki talks about grabbing the source with TAG
Code:
TAG POS=1 TYPE=HTML ATTR=* EXTRACT=HTM

But this changes HTML tags to all upper case, doesn't preserve whitespace (line breaks, tabs, etc.), doesn't include comments, and might not include weird or incorrect HTML or other kinds of sloppy code.
-----

I've been using iMacros 7 for the past month, and with its SEARCH command you can use a regexp to get the raw source code, untouched, as a string in a variable.

Code:
SEARCH SOURCE=REGEXP:"(?s)(.*)" EXTRACT="$1"

I'm not very good with regex and had to mess around to get it right so if there's a better way to do it, let me know. But so far this is the quickest way I've found to get the source code.
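
For anyone wondering what the (?s) part of that pattern does, it can be tried outside iMacros. The sketch below uses Python's re module as a stand-in for the iMacros 7 regex evaluator, which is an assumption (the two engines are not guaranteed to behave identically), and the sample page text is made up.

```python
import re

# Made-up page source with line breaks, a comment, and mixed-case tags.
SAMPLE_SOURCE = "<html>\n<!-- a comment -->\n<Body>\n\t<p>Hello</p>\n</Body>\n</html>\n"


def capture_all(pattern: str, text: str) -> str:
    """Return capture group 1 of the first match of pattern in text."""
    match = re.search(pattern, text)
    return match.group(1) if match else ""


# Without (?s), "." does not match a line break, so the group ends at the first one.
first_line_only = capture_all(r"(.*)", SAMPLE_SOURCE)

# With (?s), "." also matches line breaks, so the group spans the whole text.
whole_source = capture_all(r"(?s)(.*)", SAMPLE_SOURCE)

print(repr(first_line_only))
print(whole_source == SAMPLE_SOURCE)
```

In this sketch the second pattern returns the text unchanged, including the comment, the tab, and the mixed-case tags.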

Re: Most Efficient Way To Extract Source Code

by Tom, Tech Support on Thu Oct 14, 2010 4:15 am

Excellent tip! Thanks for sharing with the community! I have added a reference to this topic to the Wiki page.
Regards,

Tom, iMacros Support

Re: Most Efficient Way To Extract Source Code

by Tom, Tech Support on Mon Jan 24, 2011 5:38 am

Please note that the REGEXP posted above only works with the regex evaluator used by the iMacros 7 Browser and iMacros for IE.

The following REGEXP works with iMacros for Firefox. It captures everything within the <html></html> tags, but not the actual opening and closing <html> tags themselves.

Code:
SEARCH SOURCE=REGEXP:"([\\s\\S]*)" EXTRACT="$1"
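
As a rough illustration of why [\s\S] is a drop-in replacement for (?s) with a dot, here is a short sketch using Python's re module rather than the Firefox regex engine (an assumption; the engines may differ in other details). The sample text is made up. The macro line writes the backslashes doubled, while the Python raw strings use single backslashes. The last pattern, with literal <html> tags around the group, is also an assumption and is not taken from the post; it shows one way to leave the tags themselves out of the capture.

```python
import re

# Made-up page source spanning several lines.
SAMPLE_SOURCE = "<html>\n<head></head>\n<body>\n<p>Hello</p>\n</body>\n</html>\n"

# [\s\S] means "whitespace or non-whitespace", so it matches any character,
# line breaks included, without needing the (?s) flag.
portable = re.search(r"([\s\S]*)", SAMPLE_SOURCE).group(1)
dotall = re.search(r"(?s)(.*)", SAMPLE_SOURCE).group(1)
print(portable == dotall == SAMPLE_SOURCE)

# Hypothetical variant: literal tags outside the group keep them out of the capture.
inner = re.search(r"<html>([\s\S]*)</html>", SAMPLE_SOURCE).group(1)
print(repr(inner))
```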

You can interactively try other expressions using the following site:

http://www.gskinner.com/RegExr/
Regards,

Tom, iMacros Support

Re: Most Efficient Way To Extract Source Code

by bumusic on Wed Dec 25, 2013 8:35 pm

Code:
SAVEAS TYPE=HTM FOLDER=* FILE=abc.txt


This one is better.
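
If the page is saved to a file this way, an outside script still has to read it back to get the source into a variable. A minimal sketch of that step in Python follows; the folder, the sample contents, and the UTF-8 encoding are all assumptions, and the temporary file only stands in for the abc.txt that iMacros would write.

```python
import tempfile
from pathlib import Path


def read_saved_source(path: Path) -> str:
    """Return the full contents of a saved page as one string."""
    # UTF-8 is an assumption; undecodable bytes are replaced instead of raising.
    return path.read_text(encoding="utf-8", errors="replace")


# Stand-in for the file iMacros would have written; the folder is hypothetical.
with tempfile.TemporaryDirectory() as folder:
    saved = Path(folder) / "abc.txt"
    saved.write_text("<html>\n<body>Hello</body>\n</html>\n", encoding="utf-8")
    page_source = read_saved_source(saved)

print(len(page_source))
```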