SOF: "Save all HREF's inside LI's in a named UL to a '.txt' File"

chivracq
Posts: 9807
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

SOF: "Save all HREF's inside LI's in a named UL to a '.txt' File"

Post by chivracq » Thu Jul 22, 2021 1:53 am

Interesting Thread on SOF, surprisingly "Good Quality"... First time I gave a '+1' I think...! (Or one out of 5 max...!)
https://stackoverflow.com/q/68442789/3799241

=> Opening a Parallel Thread on our Forum: @OP from SOF hasn't reacted so far, and I never had a "good" Relationship with that Forum/Site (stupidly Buggy + Monopoly Game about Posting and full of big Ego's controlling the Content... OK, no Comment...! :roll: )

I started posting an "Answer" on the Site, but I'm "afraid" it will "disappear" very soon, either from the User deleting their Qt once they get their Answer/Script working, or from some over-zealous "Cleaning-Bot" or some over-zealous Mod with 10k+ Rep on the Site who won't like my "Free Talking", ah-ah...!

How can I save all href values from list items to a text file with iMacros?

I am a newbie to imacros but have version 12.0.501.6698 installed on the PC.

I am trying to extract from the html of a page all the href values located in multiple list items. Then save those URLs into a text file.

The number of list items can be different so I cannot use a loop of a known number of iterations; I have to grab all the list items and extract the href attribute values.

Example of the format of the html code

Code: Select all

<ul class="bullet-list columns-2 columns--regular">
<li><a href="/search/agents/results.htm?location=ampthill" >Estate Agents in Ampthill</a></li>
<li><a href="/search/agents/results.htm?location=barton_le_clay" >Estate Agents in Barton-Le-Clay</a></li>
<li><a href="/search/agents/results.htm?location=bedford" >Estate Agents in Bedford</a></li>
<li><a href="/search/agents/results.htm?location=biggleswade" >Estate Agents in Biggleswade</a></li>
<li><a href="/search/agents/results.htm?location=bromham" >Estate Agents in Bromham</a></li>
<li><a href="/search/agents/results.htm?location=clapham_beds" >Estate Agents in Clapham</a></li>
</ul>
I have looked at the code in similar articles such as - iMacros: Extract ID attribute from a ul li list

This is the code I have tried in imacros.

Code: Select all

VERSION BUILD=12.0.501.6698
TAB T=1
SET !ERRORIGNORE YES
SET !EXTRACT_TEST_POPUP NO
TAB CLOSEALLOTHERS
'SET !PLAYBACKDELAY 0.00
URL GOTO=https://www.home.co.uk/search/agents/?county=beds

TAG POS=1 TYPE=UL ATTR=ID:bullet-list EXTRACT=LI
TAG POS=R{{!LOOP}} TYPE=A ATTR=ID:* EXTRACT=HREF
SAVEAS TYPE=EXTRACT FOLDER=* FILE=c:\Development\towns.txt
I get an error Box

[Screenshot of the error box: https://i.stack.imgur.com/lb3OV.png]

I have also tried modifying all the permutations for the Values of TYPE and what to EXTRACT.

The text file that should store the URLs from the href attributes in this line:

<li><a href="/search/agents/results.htm?location=clapham_beds" >Estate Agents in Clapham</a></li>

just contains a line #EANF# not "/search/agents/results.htm?location=clapham_beds"

>>>

https://i.stack.imgur.com/lb3OV.png
- (F)CI(M) = (Full) Config Info (Missing): iMacros + Browser + OS (+ all 3 Versions + 'Free'/'PE'/'Trial').
- I don't even read the Qt if that (required) Info is not mentioned...!
- Script & URL help a lot for more "educated" Help...

Re: SOF: "Save all HREF's inside LI's in a named UL to a '.txt' File"

Post by chivracq » Thu Jul 22, 2021 2:11 am

Mini-Paste of all Comments on the Thread...:

And the Programmers of the Site really have some time to "waste", ah-ah..., it is now (supposedly) "impossible" to Select and Copy Comments, pfff...! Or only 1 by 1... :roll:
A bit useless, ah-ah...! :lol:
(by me...)
Hum, OK, gave you a '+1' for maybe the first "Quality" Qt in the iMacros Tag (imacros) in many months, => Compliment...! I intend to post an Answer (several Solutions), but hum..., could you spell "iMacros" correctly (x3 times!)...?, + add your OS to your FCI...? (=> Win7/Win10_x32/_64, I reckon). + Maybe correct the Typo about "know"=>"known"...? [Advanced User in the Tag on this Site, Mod on the iMacros Forum and "Tech-Guru" for the 'Data Extraction' Sub-Forum...]
(Oh..., by me again, + some 40k+ User who edited the Thread, I didn't really agree with their Edit..., but not really relevant for the overall Content...)
=> A few more Comments, check SOF if interested..

... And @OP never reacted... :shock:

Re: SOF: "Save all HREF's inside LI's in a named UL to a '.txt' File"

Post by chivracq » Thu Jul 22, 2021 2:16 am

Alright, the "Important" Part, in case/before it gets deleted, I've already spent several Hours on this Thread/Qt...:

... C&P seems to work on "Answers"...
Direct Link to SOF...

Parallel Thread on the iMacros Forum (Opened by me..., as I don't really trust this Site for "Continuity"...):

SOF: Save all HREF's inside LI's in a named UL to a '.txt' File
Grrr..., annoying, Links never work on this Buggy Site, => Direct Link...:
viewtopic.php?f=7&t=31672

[Answer is now more or less "finished", but I might still edit it "slightly"...]

[Time spent writing this Answer: About [10] Hours...]...
=> Forum Posting: ~2h approx, Writing Script(s) and Testing: ... the rest, ah-ah...!
(And first time ever I spend so much time on an Answer, annoying to have to "fight" against the "Design" of the Site...)

>>>

Addressing all your different Qt's more or less in reverse order, + posting 2 (or 3 actually) different Solutions as "minimalistic" Scripts. I will mention several Concepts/Techniques that I won't explain (in depth), or I'd need to quote half of the Wiki and/or of the iMacros Forum...
(Terms I enclose between Single Quotes or Backticks are such Terms...)

>>>

Warning Popup about "Loop" and "Play":
Well, read the Msg on that Popup, it looks pretty clear and self-explanatory to me...
(Well, apart from the ugly Typo in it, of course...!)

>>>

Getting #EANF# in the EXTRACT and SAVEAS:

=> Yep, normal, that's because you are using the UL Element as 'Anchor' for 'Relative Positioning', you would need to use 'Double Relative Positioning' in this Case as the UL Element is actually the 'Container' for all LI + A Elements...
(More Explanation [on the Forum][1], where I've explained the Concept/Technique many times already...)
Hum, Linking doesn't seem to work..., here is the Link:
search.php?keywords=Double+Relative+Positioning
"The number of list items can be different so I cannot use a loop of a known number of iterations..."
(Quoted from your Qt, Emphasis mine...)
Hum, well..., this is not really true, a Loop would actually be the "easiest" Implementation in my Opinion... It has the Advantage that you can let SAVEAS take care of saving each Link on a separate/new Row for each Loop (or you'd need to implement a Mechanism for that Func yourself...), and you can simply let iMacros abort your Script "naturally" when an Element is "not found", like when there is no new Link to extract...
... And I will use 2 different Mechanisms for that part...

(All Scripts written and tested in iMacros for FF v8.8.2, PM v26.3.3, Win10_PRO_x64.)

>>>

Implementation 1: Looping + Abort on Not Found:

And that will give something like:

Code: Select all

VERSION BUILD=8820413 RECORDER=FX
TAB T=1

SET Search_Keyword "Estate Agents"

'Debug:
'SET !LOOP 15

'URL GOTO=https://www.home.co.uk/search/agents/?county=beds

'Extract Links using 'Relative Positioning':
'TAG POS=1 TYPE=H1 ATTR=TXT:Estate<SP>Agents<SP>in<SP>Bedfordshire  //  (Recorded)
TAG POS=1 TYPE=H1 ATTR=TXT:{{Search_Keyword}}<SP>in<SP>*
TAG POS=R{{!LOOP}} TYPE=A ATTR=TXT:{{Search_Keyword}}<SP>in<SP>* EXTRACT=HREF
'>
'Debug:
'PROMPT {{!EXTRACT}}

'Save Link to '.TXT':
SAVEAS TYPE=EXTRACT FOLDER=* FILE=c:\Development\towns.txt
'SAVEAS TYPE=EXTRACT FOLDER=* FILE=SOF_MSB.txt

'Abort Script if no more Link(s) to extract:
SET !TIMEOUT_STEP 1
TAG POS=R1 TYPE=LI ATTR=TXT:{{Search_Keyword}}<SP>in<SP>*
Yep, OK, this one works already...
21 Links on the Page with URL provided, I looped the Script 30x times, and it aborts by itself at the end of Loop=21...!

- Notice, I don't use !ERRORIGNORE, and the Abort Func actually relies on that...
- While extracting and looping on the "Links" (the A Elements), I "switched back" to an LI Element for the 2nd R-POS to abort the Script. If I had also used the next Link there: with EXTRACT, the Command never aborts a Script (by Design), it simply returns #EANF# if the Element is not found; and without EXTRACT, the Script would click on and follow the Link on every previous Loop.
- And !EXTRACT_TEST_POPUP can be omitted when looping a Script...
- Works "best" with the Page already loaded once "manually", as reloading the Page on every Loop will slow the Execution... If the Page "really" needs to be loaded from the Script, it's possible to add a Mechanism for a 'Conditional URL GOTO' (another "Concept/Technique" to search the iMacros Forum for, ah-ah...!) for loading the Page only for Loop=1...

>>>

Implementation 2: Looping + Abort with MacroError() + Report:

Alright..., and this one would be my "Favorite"...!:
Same as Script_1, but the Abort Check can now be applied to an A Element, and using MacroError() allows displaying a mini-Report in the iMacros Side-Panel, for example:

Code: Select all

VERSION BUILD=8820413 RECORDER=FX
TAB T=1

SET Search_Keyword "Estate Agents"

'Debug:
'SET !LOOP 15

'URL GOTO=https://www.home.co.uk/search/agents/?county=beds

'Extract Links using 'Relative Positioning':
'TAG POS=1 TYPE=H1 ATTR=TXT:Estate<SP>Agents<SP>in<SP>Bedfordshire  //  (Recorded)
TAG POS=1 TYPE=H1 ATTR=TXT:{{Search_Keyword}}<SP>in<SP>* EXTRACT=TXT
SET Title {{!EXTRACT}}
SET !EXTRACT NULL
TAG POS=R{{!LOOP}} TYPE=A ATTR=TXT:{{Search_Keyword}}<SP>in<SP>* EXTRACT=HREF
'>
'Debug:
'PROMPT {{!EXTRACT}}

'Save Link to '.CSV' (or '.TXT'):
'SAVEAS TYPE=EXTRACT FOLDER=* FILE=c:\Development\towns.txt
SAVEAS TYPE=EXTRACT FOLDER=* FILE=SOF_MSB.txt

'Abort Script if no more Link(s) to extract:
SET !TIMEOUT_STEP 1
SET !EXTRACT NULL
'TAG POS=R1 TYPE=LI ATTR=TXT:{{Search_Keyword}}<SP>in<SP>*
TAG POS=R1 TYPE=A ATTR=TXT:{{Search_Keyword}}<SP>in<SP>* EXTRACT=TXT

'Prepare mini-Report:
SET Report {{!LOOP}}<SP>Links<SP>extracted<SP>for:<BR>{{Title}}
SET Summary (No<SP>Error...!!)<SP>({{!NOW:yyyy-mm-dd<SP>hhhnn}})<BR><BR>{{Report}}<BR><BR>

SET !ERRORIGNORE NO
SET Abort_Report EVAL("var s='{{!EXTRACT}}'; if(s=='#EANF#'){MacroError(\"{{Summary}}\");}")
SET !ERRORIGNORE YES
Like Script_1, => can be looped 30 or 50 times, and it will then display:

Code: Select all

MacroError: (No Error...!!) (2021-07-22 15h57)

21 Links extracted for:
Estate Agents in Bedfordshire

, line 36 (Error code: -1340)
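
The Logic inside the EVAL() of Script_2 can be tried outside iMacros with a small stand-in. A sketch only, in plain JavaScript: in iMacros, MacroError() is provided to EVAL() by iMacros itself, so the stub below is an assumption about its Behaviour, and the names 'abortIfNotFound' and 'summary' (plus the 'continue' Return Value) are made up for this illustration:

```javascript
// Stand-in for the iMacros MacroError() (assumption: it stops the Macro
// and shows the Message). Not needed inside a real EVAL().
function MacroError(msg) {
  throw new Error('MacroError: ' + msg);
}

// Same Check as in the EVAL(): abort only when the last EXTRACT returned #EANF#.
// The 'continue' Return Value is only there to make the Sketch observable.
function abortIfNotFound(extract, summary) {
  if (extract == '#EANF#') {
    MacroError(summary);
  }
  return 'continue';
}

var summary = '(No Error...!!)\n\n21 Links extracted for:\nEstate Agents in Bedfordshire';

// A next Link was found: nothing happens, the Loop goes on.
console.log(abortIfNotFound('Estate Agents in Luton', summary));

// No further Link: the Report is raised as an "Error" and the Script stops.
try {
  abortIfNotFound('#EANF#', summary);
} catch (e) {
  console.log(e.message);
}
```

The Point of the Design is that the "Error" is raised on purpose: it is the only Branch that stops the Loop, and its Message doubles as the Report.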

>>>

Implementation 3: Extract all LI Elements with 1 EXTRACT from the Containing UL Element:

[... Work in Progress...]

This one is a "quick and dirty" Demo, as I find it a bit of a cumbersome Implementation, but here you go...:

Code: Select all

VERSION BUILD=8820413 RECORDER=FX
TAB T=1

SET Search_Keyword "Estate Agents"

URL GOTO=https://www.home.co.uk/search/agents/?county=beds

'TAG POS=1 TYPE=LI ATTR=TXT:Estate<SP>Agents<SP>in<SP>Ampthill
'TAG POS=1 TYPE=LI ATTR=TXT:Estate<SP>Agents<SP>in<SP>Barton-Le-Clay
'TAG POS=1 TYPE=DIV ATTR=TXT:Estate<SP>agent<SP>listings<SP>are<SP>available<SP>for<SP>th* EXTRACT=HTM

'TAG POS=1 TYPE=P ATTR=TXT:Estate<SP>agent<SP>listings<SP>are<SP>available*
'TAG POS=R1 TYPE=UL ATTR=* EXTRACT=HTM

'Hum, can better use the 'H1' Element as Anchor...:
'TAG POS=1 TYPE=H1 ATTR=TXT:Estate<SP>Agents<SP>in<SP>Bedfordshire  //  (Recorded)
TAG POS=1 TYPE=H1 ATTR=TXT:{{Search_Keyword}}<SP>in<SP>*
TAG POS=R1 TYPE=UL ATTR=* EXTRACT=HTM

SET Results_HREF EVAL("var s='{{!EXTRACT}}'; var w,x,y,z; w=s.split('regular\">')[1]; x=w.split('\"'); y=x[1]+','+x[3]+','+x[5]; z=y.split(',').join('\\r\\n'); z;")
'>
'Debug:
PROMPT Results:<BR><BR>_{{Results_HREF}}_

'Not really finished... (Quick and dirty Demo...)

'Save Links to '.CSV' (or '.TXT'):
'SAVEAS TYPE=EXTRACT FOLDER=* FILE=c:\Development\towns.txt

'>>>

'Extracted:
'<ul style="outline: 1px solid blue;" class="bullet-list columns-2 columns--regular"> 
'<li style="outline: 1px solid blue;"><a href="/search/agents/results.htm?location=ampthill">Estate Agents in Ampthill</a></li> 
'<li style="outline: 1px solid blue;"><a href="/search/agents/results.htm?location=barton_le_clay">Estate Agents in Barton-Le-Clay</a></li> 
'<li><a href="/search/agents/results.htm?location=bedford">Estate Agents in Bedford</a></li> 
'<li><a href="/search/agents/results.htm?location=biggleswade">Estate Agents in Biggleswade</a></li> 
'<li><a href="/search/agents/results.htm?location=bromham">Estate Agents in Bromham</a></li> 
'<li><a href="/search/agents/results.htm?location=clapham_beds">Estate Agents in Clapham</a></li>
'<li><a href="/search/agents/results.htm?location=dunstable">Estate Agents in Dunstable</a></li>
'<li><a href="/search/agents/results.htm?location=flitwick">Estate Agents in Flitwick</a></li>
'<li><a href="/search/agents/results.htm?location=harlington">Estate Agents in Harlington</a></li>
'<li><a href="/search/agents/results.htm?location=henlow">Estate Agents in Henlow</a></li>
'<li><a href="/search/agents/results.htm?location=houghton_regis">Estate Agents in Houghton Regis</a></li>
'<li><a href="/search/agents/results.htm?location=kempston">Estate Agents in Kempston</a></li>
'<li><a href="/search/agents/results.htm?location=langford">Estate Agents in Langford</a></li>
'<li><a href="/search/agents/results.htm?location=leighton_buzzard">Estate Agents in Leighton Buzzard</a></li>
'<li><a href="/search/agents/results.htm?location=linslade">Estate Agents in Linslade</a></li>
'<li><a href="/search/agents/results.htm?location=luton">Estate Agents in Luton</a></li>
'<li><a href="/search/agents/results.htm?location=potton">Estate Agents in Potton</a></li>
'<li><a href="/search/agents/results.htm?location=sandy">Estate Agents in Sandy</a></li>
'<li><a href="/search/agents/results.htm?location=shefford">Estate Agents in Shefford</a></li>
'<li><a href="/search/agents/results.htm?location=stotfold">Estate Agents in Stotfold</a></li>
'<li><a href="/search/agents/results.htm?location=toddington">Estate Agents in Toddington</a></li> </ul>
About the Script: it's a "quick and dirty" Solution, and the y part only demonstrates the Principle for the first 3 Links...
Neater would be to use a for Loop over the odd Indexes of x (Incr=2, so about x.length/2 Iterations), with Array.push(). With the "hard-coded" Approach from this Demo, the y String/Array would need to be extended "by hand" 30 or 50 times...

=> See Content of the Debug PROMPT...

(And the Script needs to be run only x1 time, => with the 'Play' Button, not with the 'Loop' Button.)
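
A minimal sketch of the "neater" for-Loop Variant mentioned above, written as plain JavaScript so it can be tried outside iMacros (it is not adapted to the Escaping that EVAL() requires). The function name 'collectHrefs' and the shortened sample HTML are assumptions for this illustration only, and it supposes a "clean" Extract, where the only Double Quotes after 'regular">' are the ones around the href Values:

```javascript
// Hypothetical helper (name chosen for this sketch): collects every href
// Value from the HTML of the UL, using the same split() idea as the EVAL().
function collectHrefs(html) {
  // Keep only what follows the opening UL Tag (its class ends with 'regular">').
  var w = html.split('regular">')[1];
  // Split on Double Quotes: the odd Indexes (1, 3, 5, ...) hold the href Values.
  var x = w.split('"');
  var links = [];
  for (var i = 1; i < x.length; i += 2) {
    links.push(x[i]);
  }
  return links;
}

// Shortened sample, assumed to have the same Shape as the "Extracted:" Section
// (without any injected style Attribute).
var sample =
  '<ul class="bullet-list columns-2 columns--regular"> ' +
  '<li><a href="/search/agents/results.htm?location=ampthill">Estate Agents in Ampthill</a></li> ' +
  '<li><a href="/search/agents/results.htm?location=bedford">Estate Agents in Bedford</a></li> </ul>';

// One Link per Row, like the '\r\n' join in the original EVAL().
var rows = collectHrefs(sample).join('\r\n');
console.log(rows);
```

This way the Number of Links doesn't need to be known in advance, the Loop simply follows the Length of the x Array.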

---


[1]: search.php?keywords=Double%20Relative%20Positioning

Oh yeah...!, about the "ugly Typo" in the Error Msg / Warning Popup, for @TechSup to see this Screenshot from @OP on 'imgur'...: :shock:

Code: Select all

https://i.stack.imgur.com/lb3OV.png
=> About "variable" with a "creative" Spelling, ah-ah...!

>>>

EDIT: Comment added to the 3rd Implementation/Script:
Hum, and I should mention that for this Implementation, it is actually "recommended" to "fresh"-load (or reload) the Page (=> with `URL GOTO`, that I have (re)activated in this Script), and certainly not to "play" with iMacros on that Page before "the" Script runs, as iMacros (Recording or Replay) injects some Styling in the HTML Structure of the Page, => like visible in my "Extracted:" Section with `style="outline: 1px solid blue;"` for the `UL` Element and the first 2 `LI` Elements...
"Problem" is that this extra `style` Attribute contains (2) Double Quotes each time, and I actually based one of the `split()` in the `EVAL()` on this very Double Quote Char (`"`) to isolate the `HREF` Values, so the `x[1]`/`x[3]`/`x[5]`/etc. Indexes will shift to higher Values for the (Start)Index in the Array... And the Increment would change as well...

Re: SOF: "Save all HREF's inside LI's in a named UL to a '.txt' File"

Post by chivracq » Thu Jul 22, 2021 3:24 pm

Oh...!, and I thought "New for our Forum", that the Concept about posting a mini-Report at the end of a Script would be interesting, but I see I had already posted a Script about this Technique..., hum in 2016 already, ah-ah...!, in this Thread for example... OK, good-good...! 8)

>>>

Also quite "powerful", is the 'EVAL()' Statement I used in the 3rd Script:

Code: Select all

SET Results_HREF EVAL("var s='{{!EXTRACT}}'; var w,x,y,z; w=s.split('regular\">')[1]; x=w.split('\"'); y=x[1]+','+x[3]+','+x[5]; z=y.split(',').join('\\r\\n'); z;")
It's quite a "nice" Demonstration of the 'split()' Method, that I quite like and find very "powerful" to isolate 'HREF' Values from a full HTML Extract... (Could probably use 'REGEX' also, but I prefer 'split()'...) :twisted:
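
For comparison, a sketch of 2 Alternatives to splitting on the bare Double Quote, written as plain JavaScript (not an iMacros Script, and not tested inside 'EVAL()', where the Quotes would need extra Escaping): one still uses 'split()' but on 'href="', the other one follows the 'REGEX' Route mentioned above. Both only look at what follows 'href="', so extra 'style' Attributes injected by iMacros should not shift anything. The function names and the sample String are assumptions made for this illustration:

```javascript
// Variant 1 (hypothetical name): split on 'href="' instead of on '"'.
function hrefsBySplit(html) {
  var chunks = html.split('href="');
  var links = [];
  for (var i = 1; i < chunks.length; i++) {
    // Each Chunk starts with the href Value, which ends at the next Double Quote.
    links.push(chunks[i].split('"')[0]);
  }
  return links;
}

// Variant 2 (hypothetical name): capture the href Values with a Regular Expression.
function hrefsByRegex(html) {
  var re = /href="([^"]*)"/g;
  var links = [];
  var m;
  while ((m = re.exec(html)) !== null) {
    links.push(m[1]);
  }
  return links;
}

// Shortened sample including the Styling that iMacros can inject (illustration only).
var styled =
  '<ul style="outline: 1px solid blue;" class="bullet-list columns-2 columns--regular"> ' +
  '<li style="outline: 1px solid blue;"><a href="/search/agents/results.htm?location=ampthill">Estate Agents in Ampthill</a></li> ' +
  '<li><a href="/search/agents/results.htm?location=bedford">Estate Agents in Bedford</a></li> </ul>';

console.log(hrefsBySplit(styled).join('\r\n'));
console.log(hrefsByRegex(styled).join('\r\n'));
```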