Most Efficient Way To Extract Source Code

thisissolame
Posts: 3
Joined: Mon May 03, 2010 9:33 pm

Most Efficient Way To Extract Source Code

Post by thisissolame » Wed Oct 13, 2010 12:24 am

This isn't really a question, just how I'm getting the source code for websites. I get ideas from old posts so maybe this will be of use to someone in the future.

I've been using iMacros 6 to get the source for web pages by saving the source code each time and then processing it with outside scripting. This is a roundabout way for me because in the end I just want the entire source code stored in a variable.

The wiki talks about grabbing the source with TAG:

Code:

TAG POS=1 TYPE=HTML ATTR=* EXTRACT=HTM

But this converts HTML tags to upper case, doesn't preserve whitespace (line breaks, tabs, etc.), drops comments, and might not include unusual or malformed HTML or other (sloppy) code.
-----

I've been using iMacros 7 for the past month, and with its SEARCH command you can use a regexp to get the raw, untouched source code as a string in a variable.

Code:

SEARCH SOURCE=REGEXP:"(?s)(.*)" EXTRACT="$1"

I'm not very good with regex and had to mess around to get it right, so if there's a better way to do it, let me know. But so far this is the quickest way I've found to get the source code.
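
For readers unfamiliar with the `(?s)` flag, here is a minimal sketch of what it changes. It uses Python's `re` module as a stand-in, which is an assumption: the iMacros 7 Browser has its own regex evaluator, and the sample page string is made up for illustration.

```python
import re

# Hypothetical page source, used only for illustration.
page = "<html>\n<!-- a comment -->\n<body>\tHi</body>\n</html>"

# Without (?s), "." does not match line breaks, so the group stops at the first one.
first_line_only = re.match(r"(.*)", page).group(1)

# With (?s), "." also matches line breaks, so the group spans the whole string.
whole_page = re.match(r"(?s)(.*)", page).group(1)

print(repr(first_line_only))
print(whole_page == page)
```

The comment, tab, and line breaks all survive in `whole_page` because the regex only copies characters and never parses the HTML.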
Tom, Tech Support
Posts: 3461
Joined: Mon May 31, 2010 4:59 pm

Re: Most Efficient Way To Extract Source Code

Post by Tom, Tech Support » Thu Oct 14, 2010 11:15 am

Excellent tip! Thanks for sharing with the community! I have added a reference to this topic to the Wiki page.
Regards,

Tom, iMacros Support
Tom, Tech Support
Posts: 3461
Joined: Mon May 31, 2010 4:59 pm

Re: Most Efficient Way To Extract Source Code

Post by Tom, Tech Support » Mon Jan 24, 2011 12:38 pm

Please note that the REGEXP posted above only works with the regex evaluator used by the iMacros 7 Browser and iMacros for IE.

The following REGEXP works with iMacros for Firefox. It captures everything within the <html></html> tags, but not the actual opening and closing <html> tags themselves.

Code:

SEARCH SOURCE=REGEXP:"([\\s\\S]*)" EXTRACT="$1"

You can interactively try other expressions using the following site:

http://www.gskinner.com/RegExr/
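
As a rough illustration of why `[\s\S]` can replace the `(?s)` flag: the class matches any whitespace or non-whitespace character, which is every character including line breaks, so no single-line mode is needed. The sketch below uses Python's `re` module rather than iMacros itself, and the page string is made up. The doubled backslashes in the macro are presumably string escaping and are not needed in a Python raw string.

```python
import re

# Hypothetical page source with mixed line endings, for illustration only.
page = "<html>\r\n<body>\n\t<p>text</p>\n</body>\n</html>\n"

# Character class that matches everything, with no flag required.
portable = re.match(r"([\s\S]*)", page).group(1)

# Single-line (dotall) flag, as in the first post.
dotall = re.match(r"(?s)(.*)", page).group(1)

print(portable == page)
print(portable == dotall)
```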
Regards,

Tom, iMacros Support
bumusic
Posts: 36
Joined: Sun Nov 08, 2009 9:59 pm

Re: Most Efficient Way To Extract Source Code

Post by bumusic » Thu Dec 26, 2013 3:35 am

Code:

SAVEAS TYPE=HTM FOLDER=* FILE=abc.txt

This one is better.
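
If the saved file is then processed by an outside script, as described in the first post, it helps to read it without newline translation so the whitespace stays untouched. A minimal sketch in Python follows; the function name, the UTF-8 encoding, and the throwaway demo file are assumptions for illustration, and the real location of abc.txt depends on your iMacros folder settings.

```python
import os
import tempfile


def read_saved_source(path, encoding="utf-8"):
    """Return the saved page source with line endings left as-is."""
    # newline="" turns off Python's universal-newline translation.
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


# Demo with a throwaway file standing in for the file written by SAVEAS.
demo = "<html>\r\n<!-- kept -->\r\n</html>\r\n"
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "abc.txt")
    with open(demo_path, "w", encoding="utf-8", newline="") as f:
        f.write(demo)
    source = read_saved_source(demo_path)

print(source == demo)
```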