Most Efficient Way To Extract Source Code

thisissolame
Posts: 3
Joined: Mon May 03, 2010 9:33 pm

Post by thisissolame » Wed Oct 13, 2010 12:24 am

This isn't really a question, just how I'm getting the source code for websites. I get ideas from old posts so maybe this will be of use to someone in the future.

I've been using iMacros 6 to get the source of web pages by saving the source code to a file each time and then processing it with outside scripts. This is a roundabout way of doing it, because in the end I just want the entire source code stored in a variable.
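The save-then-read workflow described above could look roughly like this in an outside script. This is only a sketch in Python, not part of iMacros; the function name `load_saved_source` and the file name `page.htm` are made up for illustration, and the saved file is assumed to be UTF-8.

```python
import tempfile
from pathlib import Path


def load_saved_source(path, encoding="utf-8"):
    """Return the full text of a saved page as one string.

    errors="replace" keeps the read from failing on bytes that do not
    match the assumed encoding.
    """
    return Path(path).read_text(encoding=encoding, errors="replace")


if __name__ == "__main__":
    # Stand-in for a page saved by the macro; the content is invented.
    sample = "<html>\n  <!-- comment -->\n  <body>Hi</body>\n</html>\n"
    with tempfile.TemporaryDirectory() as folder:
        saved = Path(folder) / "page.htm"
        saved.write_text(sample, encoding="utf-8")
        source = load_saved_source(saved)
        print(len(source), "characters read")
```

After the call, `source` holds the whole page as a single string, which is the end result the post is after.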

The wiki describes grabbing the source with TAG:

Code:

TAG POS=1 TYPE=HTML ATTR=* EXTRACT=HTM
But this converts HTML tags to upper case, doesn't preserve whitespace (line breaks, tabs, etc.), drops comments, and might leave out unusual or malformed HTML and other sloppy code.
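One way to check whether an extraction has altered a page in the ways listed above is to compare a few simple counts between the raw source and the extracted text. The sketch below is in Python rather than iMacros; the helper `summarize` and both sample strings are invented for illustration and are not real iMacros output.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_NAME = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)")


def summarize(source):
    """Count comments, line breaks, and upper-case tag names in HTML text."""
    # Strip comments first so tag-like text inside them is not counted.
    without_comments = COMMENT.sub("", source)
    names = TAG_NAME.findall(without_comments)
    return {
        "comments": len(COMMENT.findall(source)),
        "line_breaks": source.count("\n"),
        "upper_case_tags": sum(1 for n in names if n.isupper()),
    }


# Both strings are invented to mimic the differences described above.
raw = "<html>\n<!-- note -->\n<body>\n\t<p>Hi</p>\n</body>\n</html>"
extracted = "<HTML><BODY><P>Hi</P></BODY></HTML>"

print(summarize(raw))
print(summarize(extracted))
```

If the counts differ between the two, the extraction did not return the source untouched.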
-----

I've been using iMacros 7 for the past month, and with its SEARCH command you can use a regular expression to get the raw source code, untouched, as a string in a variable.

Code:

SEARCH SOURCE=REGEXP:"(?s)(.*)" EXTRACT="$1"
I'm not very good with regex and had to experiment to get it right, so if there's a better way to do it, let me know. So far, though, this is the quickest way I've found to get the source code.
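For anyone wondering why the `(?s)` is there: in many regex engines `.` does not match a line break unless single-line (dotall) mode is switched on, so `(.*)` alone would stop at the end of the first line. Here is a rough illustration using Python's `re` module. That is not the engine iMacros uses, so treat it only as a demonstration of the general behaviour; the sample HTML is invented.

```python
import re

# Invented sample standing in for a page's raw source.
source = "<html>\n<!-- a comment -->\n\t<BODY>text</BODY>\n</html>"

# Without (?s), "." stops at the first line break.
first_line_only = re.search(r"(.*)", source).group(1)

# With (?s), "." also matches line breaks, so the group spans everything.
whole_source = re.search(r"(?s)(.*)", source).group(1)

print(repr(first_line_only))
print(whole_source == source)
```

In this sample the comment, the tab, and the mixed-case tags all come through unchanged in `whole_source`, because the pattern simply captures every character.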
Tom, Tech Support
Posts: 3834
Joined: Mon May 31, 2010 4:59 pm

Re: Most Efficient Way To Extract Source Code

Post by Tom, Tech Support » Thu Oct 14, 2010 11:15 am

Excellent tip! Thanks for sharing with the community! I have added a reference to this topic to the Wiki page.
Regards,

Tom, iMacros Support
Tom, Tech Support
Posts: 3834
Joined: Mon May 31, 2010 4:59 pm

Re: Most Efficient Way To Extract Source Code

Post by Tom, Tech Support » Mon Jan 24, 2011 12:38 pm

Please note that the REGEXP posted above only works with the regex evaluator used by the iMacros 7 Browser and iMacros for IE.

The following REGEXP works with iMacros for Firefox. It captures everything within the <html></html> tags, but not the opening and closing <html> tags themselves.

Code:

SEARCH SOURCE=REGEXP:"([\\s\\S]*)" EXTRACT="$1"
You can interactively try other expressions using the following site:

http://www.gskinner.com/RegExr/
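As a quick check of what the character class does: `[\s\S]` means "a whitespace character or a non-whitespace character", which is any character at all, line breaks included, so the pattern does not depend on a `(?s)` flag. The sketch below uses Python's `re` module purely for illustration and is not the iMacros engine. The doubled backslashes in the macro appear to be escaping inside the quoted macro string (an assumption), so the Python raw string uses single ones. The sample text is invented.

```python
import re

# Invented multi-line sample standing in for a page's raw source.
source = "<head>\n<title>t</title>\n</head>\n<body>\n</body>"

# [\s\S] means "whitespace or non-whitespace", so it covers line breaks
# without needing the (?s) flag.
class_match = re.search(r"([\s\S]*)", source).group(1)

# Same result as the flag-based pattern from the first post.
flag_match = re.search(r"(?s)(.*)", source).group(1)

print(class_match == flag_match == source)
```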
Regards,

Tom, iMacros Support
bumusic
Posts: 36
Joined: Sun Nov 08, 2009 9:59 pm

Re: Most Efficient Way To Extract Source Code

Post by bumusic » Thu Dec 26, 2013 3:35 am

Code:

SAVEAS TYPE=HTM FOLDER=* FILE=abc.txt
I think this one is better.