Extract urls from certain domain

Post by Chilly_Bang » Mon Apr 29, 2019 9:23 pm

Hi
I'm on:
Win7x64
FF52
iMacros: BUILD=8881205

I'm trying to do the following:

There are a bunch of paginated pages. From the source code of every page I want to extract the URLs that belong to a certain domain (not the URLs from other domains, which are present too).
Going through the pages is easy with

Code:

URL GOTO=https://www.example.com/page/{{!LOOP}}

Then I wait a couple of seconds until the page has loaded completely with

Code:

WAIT SECONDS=4

But then comes the failure:

Code:

TAG POS={{!LOOP}} TYPE=A ATTR=<A<SP>HREF="https://www.mydomain/mypath/"* EXTRACT=HREF
SAVEAS TYPE=EXTRACT FOLDER=* FILE=urls.csv

On the line where I want to match URLs from the needed domain+path I get a syntax error. I have tested quite a few notation variants, but I seem to be missing something fundamental.

Could somebody help me out with it? I'm comfortable matching things in Notepad++, where it would not be rocket science to match it with something like

Code:

^.*mydomain\/mypath.*\r\n

but here I've run out of ideas... :(
FCI: Win 7 x64 + Win10 x64 + FF 45.9.0 + iMacro for FF 9.0.3

Post by chivracq » Tue Apr 30, 2019 2:29 am

Oh...!, nice to see that you rediscovered our Forum 3 days after first opening your (parallel) thread on SOF: :wink:
- iMacros: extract all urls from special domain+path
Yep, I didn't reply to that one on SOF because FCI not "really" mentioned, and RuntimeError not mentioned either, hum, nor here also... :idea:
Hum, and I'm always a bit "allergic" to 'example.com' + 'mydomain' + 'myurl' etc, as I usually need to give 20 Levels of IF-IF-IF to cover all possible Cases and I can't do any Testing by myself... :idea:

FCI mentioned on our Forum, perfect, except that it doesn't correspond to the one mentioned in your Sig...!
=> Which Version of iMacros for FF are you REALLY using...!? => v9.0.3 like mentioned in your Sig and your Script on SOF...?, or v8.8.8 like mentioned in your FCI in this Thread...?
Mentioning your FCI in your Sig is never a "really" good idea anyway, as you only have one dynamic Sig for the whole Forum and it's only "confusing" for your older Threads/Posts or for your current Thread if you forgot to update your FCI in your Sig... And if you really per se want to mention it in your Sig, you should mention "Current" + the Date of your last Update of your Sig... :idea:

v8.8.8 for FF is a bit of a "strange" Version to use with FF52 anyway... :?
I'm not even sure that Version still worked in FF52, nor v8.8.9 which would already be a mini little bit more "Standard"...
:arrow: THE iMacros Version to use with FF52 is v8.9.7 anyway. (... And recommended FF Version is then FF v55.0.3... (or FF50 if you use the 'FILTER' Command on heavy Picture Pages, that got broken in FF51, + Multi-Login Authentication that became a bit of a hassle from FF51 also, hum, and maybe from FF50 already actually...))
Hum, or are you using FF52 because you are on the ESR Channel...? (Last ESR Version indeed to support v8.9.7, (before WebExtensions), but still no "valid" Reason to use v8.8.8...! :? )

But OK, what struck me, and what probably triggers your "secret" Syntax Error, is the "<A<SP>" part in:

Code:

TAG POS={{!LOOP}} TYPE=A ATTR=<A<SP>HREF="https://www.mydomain/mypath/"* EXTRACT=HREF

Some correct Syntax I would think would be: (=> "I would think", because I cannot check anything with your "myexample"/"myurl" etc Info...)

Code:

TAG POS={{!LOOP}} TYPE=A ATTR=HREF:https://www.mydomain/mypath/* EXTRACT=HREF

'HREF' is an Attribute, just like 'TXT' or 'CLASS'...
And hum, I'm not completely certain "HREF" is correct for the 'ATTR' Parameter, it might be "SRC" instead... :?
And notice the Alternation between "=" and ":" in the Syntax for Attributes... :wink:

Maybe some other things are also playing a role in your Pb but it's difficult to think "past" the first few "obvious" things... :wink:
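
For readers who think in regex terms, like the Notepad++ pattern in the question: the `HREF:https://www.mydomain/mypath/*` form suggested above is a wildcard match on each link's href, not a regular expression. The sketch below only illustrates that matching idea in Python, outside iMacros. It is not how iMacros is implemented, and `HrefCollector`, `extract_links` and the sample HTML are hypothetical names and data made up for the illustration.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, pattern):
    """Return the hrefs that match a shell-style wildcard pattern."""
    collector = HrefCollector()
    collector.feed(html)
    return [h for h in collector.hrefs if fnmatchcase(h, pattern)]


# Made-up sample page: two links on the wanted domain+path, one elsewhere.
sample = """
<a href="https://www.mydomain/mypath/one">one</a>
<a href="https://www.other.example/x">other</a>
<a href="https://www.mydomain/mypath/two?page=2">two</a>
"""

matches = extract_links(sample, "https://www.mydomain/mypath/*")
print(matches)
```

With the trailing `*`, only the two links under `https://www.mydomain/mypath/` are kept and the link on the other domain is dropped, which is the filtering the original question asks for.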
- (F)CI(M) = (Full) Config Info (Missing): iMacros + Browser + OS (+ all 3 Versions + 'Free'/'PE'/'Trial').
- FCI not mentioned: I don't even read the Qt...! (or only to catch Spam!)
- Script & URL help a lot for more "educated" Help...

Post by serbeer » Fri May 03, 2019 7:44 pm

chivracq,
not related to the original question, but what is this Multi-Login Authentication "that became a bit of a hassle from FF51 also, hum, and maybe from FF50 already " you mentioned in your response? Search of the forum resulted in no hits.

I am planning to upgrade my setup to the latest FF that iMacros version 8.9.7 will work with, which, per my research appears to be 56.0.2 (I realize I will have to remove image filtering from my scripts, and that should be OK), but wanted to make sure I am not missing any other implications, even though this is unlikely to be an issue since I am currently on FF v52 anyway, which is beyond FF50/51 you mentioned.

Thanks!

Post by chivracq » Sat May 04, 2019 2:07 am

Multi-Login:
OK, I remember after updating FF50 to FF51 that "stg" had changed in FF51 (was "vaguely" documented like always in the FF Change-Log), and I had to change/tweak/toggle 2 'about:config' Settings in FF to get some "OK-Acceptable" Config, took me less than 5min then...
Mini-search now, in my 'about:config' on "multi"/"logon"/"login"/"auth"/"log" for my FF55 and hum, I cannot locate which Settings I did modify at that time... So, hum-hum...!? I'm still at FF55, so FF didn't update (and is complaining often enough that my Version is "critically unsafe", ah-ah...!), so hum..., I don't know..., should be easy to find I would think, from the Change-Log for FF50 or FF51...
But yep indeed, if you were already at FF52, you were already "after" that Multi-Login Authentication Change and if you didn't notice anything when you had previously updated to FF52 (ESR I guess), then I guess you are already "immune"...

Hum-hum, was related to 'http'/'https' I think, and only for 'http'...
OK, YES, found them, "http" and "https" didn't find atg relevant, but "secure" did, youpidoo...!
=> OK, those are the 2 'about:config' Settings I had set to "false" after FF50 or FF51 around that time:
- security.insecure_password.ui.enabled
- security.insecure_field_warning.contextual.enabled

And I see I have another related Setting:
- security.insecure_field_warning.ignore_local_ip_address
=> but left at its Default (= "true") value...

My FF FCI is iMacros for FF v8.9.7 + FF v55.0.3 + Win10_x64.

serbeer wrote: "I am planning to upgrade my setup to the latest FF that iMacros version 8.9.7 will work with, which, per my research appears to be 56.0.2"
Not a "good" Idea is my "Advice"... :shock:
Yep, v8.9.7 still works on FF56, but hum..., had many Reports (on the Forum) that "many" Things didn't work anymore or correctly in FF56.

My Advice...: :wink:
Don't go any further than FF v55.0.3, which is the Version I use myself, and works correctly with v8.9.7 for FF. (apart from the 'FILTER' Command that got broken(*) from FF52 or FF53, I don't remember exactly..., I had quickly updated between 2 or 3 FF Versions in one day at that time as soon as I had found out that 'FILTER=ON' would "freeze" my Scripts for several Minutes..., and I was hoping that the next FF Version would "unbreak" it again...)

And "broken(*)" is only noticeable for "heavy" Pages with 1000++ 'IMG' Elements, most Users won't notice any Difference, I would think, would need to 'profile' iMacros before/after and/or to stopwatch-monitor it (or with '!NOW' at the ms Level) to check... Well, with 1000++ Images, we don't talk about ms (=Milliseconds) anymore, rather Units of 10 Sec, oops...!
A Page that would take 2-3-5 Sec to load with 1000++ Images on FF50 or FF51 with 'FILTER=ON', would then need between 3-5 Min to load on FF52 or FF53 (hum, or maybe FF54...(?)) (with 'FILTER=ON'), against about 45 Sec with 'FILTER=OFF' in both FF Versions..., meaning 'FILTER=ON' has no Use anymore from FF51+ as it actually drastically slows down the Loading Time of heavy Pages...

Post by Chilly_Bang » Sat May 04, 2019 3:28 pm

Thanks to all, guys! While we were talking here, the source page was updated and the links got titles!!! So I got it with

Code:

TYPE=A ATTR=TITLE:link<SP>title EXTRACT=HREF

Post by chivracq » Sat May 04, 2019 3:38 pm

Hum..., I'm not really convinced this is a very "reliable" Solution... Having the Link (also) displayed as the Page Title is probably a temporary "dirty" Solution for the time being from the Developer/Designer of that Page/Site that they will probably change again later into some "cleaner" Title or where they will truncate the URL or where some "special" Chars (like Space/Apostrophe/Qt_Mark/etc) can get converted and your Sol won't work anymore...

Post by serbeer » Mon May 06, 2019 6:26 pm

Thank you for the detailed response, chivracq. I looked and
- security.insecure_password.ui.enabled is already false in my build of FF, and this is actually the default there, I did not have to change it.
- security.insecure_field_warning.contextual.enabled does not even exist, and neither does security.insecure_field_warning.ignore_local_ip_address. Weird.

Thanks for the advice regarding staying at v55, perhaps it is a good idea. The reason I was planning to go to 56 is that more than one website would tell me that I need at least v56 of FF to use the browser, and even though none of the sites I actually use complain yet, I thought there was something significant about v56. I know they merged the 32-bit and 64-bit FF codebases in it, for example.

The sites I use the filter on have several dozen but not hundreds of images. I may still get rid of the image filter because the way I use iMacros has changed since the original implementation. I now very rarely open more than 2 or 3 sites at the same time, so I no longer do simultaneous processing of 15-20 of them in separate tabs as my setup allows, as I got much better at guessing which websites will be useful and which will not for a particular search. So filtering images out is not saving me much time currently, and the sites are surely much more pleasant to use with images on... In retrospect, I should have made it a switch in my control panel, like I did for some other things, like sorting results; it is not worth retrofitting at this time...

Thanks!