Crawling a site to make site map

MacroUser
Posts: 4
Joined: Sun Jul 13, 2014 7:27 pm

Crawling a site to make site map

Post by MacroUser » Wed Jul 16, 2014 6:06 pm

Hi,

There's an informative site that does not provide a site map and I want to make one for my own use in referring to the information it provides. I'm looking to see if I can accomplish this with iMacros. What I need to do is crawl the site and extract the text and URLs for internal links on all its pages, saving them in some sort of tree that indicates the hierarchy of the links. Also, the site does have section menus in the left margin, and I could construct a full site map by just extracting those and knitting them together.

Is it possible to do that with the Firefox add-on?

As a brute-force method, I could also crawl all the links to URLs within the site, download the HTML for every page, and then use a text editor to manually extract the link tags and arrange them in the proper order. For that I would only need the ability to crawl the links and download the pages.

I looked through the iMacros Wiki page on "Data Extraction" and didn't see anything on how to crawl a site. I searched the Wiki for "crawl" and for "site map" and got no results. I looked through the forum on "How-To's and Examples for Web Scraping" and didn't see anything that seems applicable. And I did a search in this forum (Data Extraction and Web Screen Scraping) for "crawl" and "site map" and found one thread (http://forum.imacros.net/viewtopic.php?f=7&t=3291) that referred to the Wiki page on Web Scripting, which apparently requires the Enterprise edition, which I don't have, and which also does not say anything directly about crawling.

So, two bottom line questions:
* Can a site be crawled using a free version of iMacros?
* Is there documentation somewhere specifically on how to use iMacros to crawl a site?

Thanks for your help.


The requested background info is:
1. What version of iMacros are you using? - iMacros for Firefox 8.8.2
2. What operating system are you using? - Windows 7 Pro 64-bit English
3. Which browser(s) are you using? - Firefox v 30.0
chivracq
Posts: 10301
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: Crawling a site to make site map

Post by chivracq » Thu Jul 17, 2014 12:25 am

Yep, I think it can be done with iMacros (Add-on for FF, that's what I use), or at least I think I would personally manage to do it, even if it might get tricky at some point and would really be some kind of tailor-made work... You don't mention the URL of your Site, so I can't give you any more "educated" Feedback, like mentioned in my Sig. But nested looping over "TAG POS=n TYPE=A etc... EXTRACT=HREF", with 'n' computed from !LOOP using 'EVAL()' in a pure .iim Script or using JavaScript, would get you all the Links of the Site...
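
To make the "collect every link on a page" step concrete, here is a minimal sketch in Python rather than iMacros, so it is only an illustration of the idea and not a Macro you can run in the Add-on. It uses only the standard library; the names `LinkCollector` and `internal_links` and the example.com URLs are made up for the example. It keeps the link text together with the URL and drops links that point to another host.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects (text, absolute URL) pairs for every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (text, url) pairs, in page order
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def internal_links(html, base_url):
    """Return the links whose host matches base_url's host."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    return [(t, u) for t, u in collector.links if urlparse(u).netloc == host]


# Hypothetical page content, only to show the call.
page = '<a href="a.htm">First  page</a> <a href="http://other.example.org/x">Out</a>'
print(internal_links(page, "http://www.example.com/ECE1/index.htm"))
```

Running `internal_links` on each downloaded page, and queueing the URLs it returns that have not been visited yet, would give the crawl itself.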

For the "Tree" Structure, you may be able to get it from the URL's themselves or from the H1-H2-H3 big Titles on each Page...
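
As a rough illustration of the H1-H2-H3 idea (again in Python and not in iMacros, and assuming the Pages really use those heading tags consistently, which I have not checked), the heading levels can be turned into an indented outline like this. `HeadingOutline` and `outline` are names invented for the sketch.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collects (level, text) for each h1/h2/h3 heading, in document order."""

    LEVELS = {"h1": 1, "h2": 2, "h3": 3}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # level of the heading currently open, if any
        self._text = []      # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in self.LEVELS:
            self._level = self.LEVELS[tag]
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.LEVELS and self._level is not None:
            text = " ".join("".join(self._text).split())
            self.headings.append((self._level, text))
            self._level = None


def outline(html):
    """Return the headings as text, indented two spaces per level below h1."""
    parser = HeadingOutline()
    parser.feed(html)
    return "\n".join("  " * (level - 1) + text for level, text in parser.headings)


# Hypothetical page content, only to show the call.
print(outline("<h1>Course 1</h1><h2>Anatomy</h2><h3>Details</h3><h2>Physiology</h2>"))
```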

But there are also some specialized Web-Crawling Tools that might do a better and quicker job...
- (F)CI(M) = (Full) Config Info (Missing): iMacros + Browser + OS (+ all 3 Versions + 'Free'/'PE'/'Trial').
- FCI not mentioned: I don't even read the Qt...! (or only to catch Spam!)
- Script & URL help a lot for more "educated" Help...
MacroUser
Posts: 4
Joined: Sun Jul 13, 2014 7:27 pm

Re: Crawling a site to make site map

Post by MacroUser » Thu Jul 17, 2014 4:30 am

Thanks for your reply chivracq.

I didn't post the site URL because it's one that some people may not like (although it's perfectly respectable), being a sex education site. So I asked first without linking to it in case there was a simple answer that didn't depend on the details of the site.

There is a set of six online courses linked at http://www.sexarchive.info/Entrance_Pag ... e_Courses/ . Each course has a root URL of the form http://www.sexarchive.info/ECE<n>/ , where <n> is 1 through 6. Those are the six sites I want to map.

When you go to the first one, http://www.sexarchive.info/ECE1/ , you see there is a menu of links in the left margin. Every page has a menu like that, showing the next level of subsections. If I could just extract the menus for all the pages, I could knit them together to make a full site map. Or maybe there's a better way to do it.
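
To show what I mean by knitting the menus together, here is a small Python sketch (not iMacros). It assumes the menus have already been extracted into a dictionary that maps each page URL to the (text, URL) entries of its menu; the function name `site_map` and the example.com URLs are placeholders I made up. Each page is placed the first time it is seen, so a menu that links back to a parent or a sibling does not cause a loop.

```python
def site_map(menus, root, title="Home"):
    """Return indented lines for the tree of pages reachable from root.

    menus maps a page URL to the (text, url) entries of that page's menu.
    """
    lines = []
    seen = {root}

    def visit(url, text, depth):
        lines.append("  " * depth + f"{text} <{url}>")
        # Claim all new menu entries first, so siblings stay on one level
        # even if they also link to each other.
        children = []
        for child_text, child_url in menus.get(url, []):
            if child_url not in seen:
                seen.add(child_url)
                children.append((child_text, child_url))
        for child_text, child_url in children:
            visit(child_url, child_text, depth + 1)

    visit(root, title, 0)
    return lines


# Hypothetical menus, only to show the shape of the input.
base = "http://www.example.com/ECE1/"
menus = {
    base: [("Intro", base + "intro.htm"), ("Anatomy", base + "anat.htm")],
    base + "intro.htm": [("Anatomy", base + "anat.htm"), ("Home", base)],
    base + "anat.htm": [("Male", base + "male.htm")],
}
print("\n".join(site_map(menus, base, "Course 1")))
```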

Unfortunately, the site has no subfolders within each course, so the map cannot be inferred from folder structure.

When you mention Web crawling tools, if there's something designed to do this better than iMacros, please do let me know. Otherwise, is there something that explains how to do it in iMacros?

I appreciate the help.
chivracq
Posts: 10301
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: Crawling a site to make site map

Post by chivracq » Fri Jul 18, 2014 10:25 am

OK, I had a look at ECE1 (interesting Content to read indeed btw...!), and indeed the organisation of the Sections and Sub-Sections etc is a bit messy...

I would say the easiest and quickest way to draw the Site Structure would be to manually click, in Record Mode, on all the Menu Items you want to include in the Site Map (either Normal Mode, or Full HTML Mode if you want the URL together): Click-Click-Click for Items at the same Level, then use BACK to go back one Level higher, and even start again from the Home Button to get a distinction between the main Sections at the first Level...
And your Site Map will then be visible in your recorded Macro, which you can later on edit and clean up in Notepad and Excel.
And you could add an "EXTRACT=HREF" to all recorded Clicks on Links to extract the URL's to a .CSV File when you run the recorded Macro. Or you could automatically put each URL in the Clipboard (plus a PROMPT to halt your Macro and give you time) to paste it in your Excel Sheet next to the corresponding Menu Item.
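
If you end up cleaning the extracted data outside iMacros, a few lines of Python can write the Menu Items and URL's to a .CSV File that Excel opens directly. This is only a sketch: the column names, the function name `links_to_csv` and the example.com URLs are my own placeholders, not anything iMacros produces.

```python
import csv
import io


def links_to_csv(rows, out):
    """Write (level, menu item, url) rows to an open text file as CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["level", "menu item", "url"])
    for level, text, url in rows:
        writer.writerow([level, text, url])


# Hypothetical rows, only to show the call. Use open("sitemap.csv", "w",
# newline="") instead of StringIO to write a real file.
rows = [
    (1, "Intro", "http://www.example.com/ECE1/intro.htm"),
    (2, "Anatomy, basic", "http://www.example.com/ECE1/anat.htm"),
]
buffer = io.StringIO()
links_to_csv(rows, buffer)
print(buffer.getvalue())
```

Items that contain a comma are quoted by the `csv` module, so they stay in one column.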
Sontos15
Posts: 4
Joined: Thu Feb 15, 2018 9:20 pm

Re: Crawling a site to make site map

Post by Sontos15 » Sat Jul 21, 2018 6:54 am

Hello,
is there a way to crawl automatically within a website? I would use that from time to time to check that some of my websites are running correctly.

Thank you!