How to extract nested informations using XPATH?

Discussions and Tech Support related to website data extraction, screen scraping and data mining using iMacros.
balandongiv

How to extract nested informations using XPATH?

Post by balandongiv » Sun Apr 12, 2020 4:48 pm

iMacros ver: 10.0.2.1450 (FREE), Firefox, Windows 10

Hello,
The objective is to extract the values of HTML DOM properties such as id, href and data-download-file-url for each of the images displayed on this [website][https://www.freepik.com/search?dates=an ... rt=popular]. I believe XPath is suitable for this task, as each image can be accessed with the following generalised XPath:

Code: Select all

/html/body/main/section[2]/div/div/figure[X]/div
where the capital X is the image index, which takes values from 1 to 50 on the aforementioned website.
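
To make the substitution of X concrete, here is a minimal JavaScript sketch (the language iMacros uses inside EVAL). It assumes the page really holds 50 figures, as stated in the post, and the helper name `figureXPath` is hypothetical, introduced only for illustration:

```javascript
// Hypothetical helper: substitutes the figure index X into the generalised XPath.
function figureXPath(index) {
  return `/html/body/main/section[2]/div/div/figure[${index}]/div`;
}

// XPath positions are 1-based, so the indices run from 1 to 50 (count taken from the post).
const figureXPaths = Array.from({ length: 50 }, (_, i) => figureXPath(i + 1));

console.log(figureXPaths[0]);  // path for the first figure
console.log(figureXPaths[49]); // path for the last figure
```

Inside a macro, the built-in {{!LOOP}} variable can play the role of X when the macro is played in loop mode, so the list does not have to be generated up front.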

I know that the properties of Figure 1, for example, can be extracted with

Code: Select all

TAG XPATH="/html/body/main/section[2]/div/div/figure[1]"  EXTRACT=TXT
However, the line above outputs all DOM properties, including the ones I am not interested in.

According to the tutorials below:

[OP1][https://forum.imacros.net/viewtopic.php?t=26155]
[OP2][https://stackoverflow.com/questions/385 ... cros-xpath]

Extracting a specific DOM property can be achieved with something like the following

Code: Select all

TAG XPATH="/html/body/main/section[2]/div/div/figure[1]/div[@id='showcase__content'] "  EXTRACT=TXT
However, executing it gives an error instead.

I would really appreciate it if someone could shed some light on this problem.
Last edited by balandongiv on Mon Apr 13, 2020 4:22 am, edited 5 times in total.
chivracq
Posts: 9494
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: How to extract nested informations using XPATH

Post by chivracq » Sun Apr 12, 2020 5:18 pm

balandongiv wrote:
Sun Apr 12, 2020 4:48 pm

Code: Select all

iMacros ver: 10.0.2.1450 (FREE), Firefox, WIndow 10
Hello,
The objective is to extract the values of HTML DOM properties such as id, href and data-download-file-url for each of the images displayed on this [website][https://www.freepik.com/search?dates=an ... rt=popular]. I believe XPath is suitable for this task, as each image can be accessed with the following generalised XPath:

Code: Select all

/html/body/main/section[2]/div/div/figure[X]/div
where the capital X is the image index, which takes values from 1 to 50 on the aforementioned website.

I know that the properties of Figure 1, for example, can be extracted with

Code: Select all

TAG XPATH="/html/body/main/section[2]/div/div/figure[1]"  EXTRACT=TXT
However, the line above outputs all DOM properties, including the ones I am not interested in.

According to the tutorials below:

[OP1][https://forum.imacros.net/viewtopic.php?t=26155]
[OP2][https://stackoverflow.com/questions/385 ... cros-xpath]

Extracting a specific DOM property can be achieved with something like the following

Code: Select all

TAG XPATH="/html/body/main/section[2]/div/div/figure[1]/div[@id='showcase__content'] "  EXTRACT=TXT
However, executing it gives an error instead.

I would really appreciate it if someone could shed some light on this problem.

Example of the DOM properties for Figure 1. The properties are all shown in pink.

Code: Select all

https://drive.google.com/open?id=190q615C3uXLZUQNI8K4AJYL3Slii1ktO

Alright, Compliment on the Good Quality for your Post/Thread... :D

3 "Things" that you'll still need to edit/improve for me to answer/do any Thinking...:
- Question Mark missing in your Thread Title, you are not sharing a 'HowTo'...!
- FCIM...! :mrgreen: (Read my Sig...) => FF Version is missing...
- Can you upload your Screenshot directly to our Forum, rather than using some External Pix Hosting Site/'GoogleDrive'..., like mentioned/explained in the Forum Rules...? (You can also leave the Link to the 'GoogleDrive' if you want, even if that Link will probably stop working "one day"...)

+ 1 more "Thing", not "Blocking", but hum..., maybe "handy" if you could also put that Content (of the Screenshot/HTML Source) in plain Text in the Thread (in some ']code[' Formatting Tag), as a Screenshot is not searchable, and I won't go retyping any Content from it if I "need it"... :idea:

And I don't "really care", but correct Spelling is "Windows", you manage to squeeze 2 Typos in it, ah-ah...! :shock:

("FCIM" applies to your parallel Thread on SOF also..., I don't "care" for the other Items... :!: )

>>>

+ For those interested, parallel Thread on SOF:
- How to access HTML DOM Property using iMacros - xPath
- (F)CI(M) = (Full) Config Info (Missing): iMacros + Browser + OS (+ all 3 Versions + 'Free'/'PE').
- I don't even read the Qt if that (required) Info is not mentioned...!
- Script & URL help a lot for more "educated" Help...
balandongiv

Re: How to extract nested informations using XPATH?

Post by balandongiv » Mon Apr 13, 2020 1:23 am

Thanks, admin @chivracq, for approving the post and for your compliment. I have made the changes as requested.
chivracq
Posts: 9494
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: How to extract nested informations using XPATH?

Post by chivracq » Mon Apr 13, 2020 2:14 am

balandongiv wrote:
Mon Apr 13, 2020 1:23 am
Thanks, admin @chivracq, for approving the post and for your compliment. I have made the changes as requested.

OK..., I'm not "Admin", ah-ah...!, only "Moderator"..., and caring for Quality...! :wink:

OK, about your "Changes", hum..., Question Mark in Thread Title, yep, perfect, and Thanks, ... but for "the rest"..., hum...?

- Your FF Version is still missing from your FCI... (And on SOF also..., but OK, forget about SOF, the Quality is good enough there already...)
- And about uploading your Screenshot to the Forum, tja...!, you've simply removed/deleted that part, while it is nearly the most "interesting" part of your OP to understand a bit what you want actually, so no, I don't really "agree" with that "Change"..., it is not improving the Quality...
(And I'm nearly surprised you didn't get the same Request from SOF also, or that nobody edited your Qt/Thread yet to re-upload your Screenshot to their dedicated 'imgur' Pix Hosting Site...)

(And you still have 1 Typo in "Windows"..., I hope for you you'll never want to start programming in JavaScript, because you'll have a very hard time, ah-ah...!, as that Prog Lang is Case Sensitive... :P )

But OK, I see you got "a bit lucky" already, and got one Answer on SOF..., which looks a bit OK to me, so I guess you won't be very motivated to put much-much Effort into editing your OP... Alright, fair enough... I would have had some other Solution/Implementation, without using 'XPATH' that I don't like and never use... And hum, 'XPATH' was actually a bit Buggy in iMacros v10.0.x for FF and/or CR, I don't remember exactly..., but I hope for you you don't hit "that Bug"... :|
balandongiv

Re: How to extract nested informations using XPATH?

Post by balandongiv » Mon Apr 13, 2020 2:23 am

I did try the solution from SOF, but did not manage to make it work. I am just trying my luck here for another creative solution from the members.
I'm interested to know about your non-XPath solution, though.
chivracq
Posts: 9494
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: How to extract nested informations using XPATH?

Post by chivracq » Mon Apr 13, 2020 2:30 am

balandongiv wrote:
Mon Apr 13, 2020 2:23 am
I did try the solution from SOF, but did not manage to make it work. I am just trying my luck here for another creative solution from the members.
I'm interested to know about your non-XPath solution, though.

Hum, OK, I didn't try it...

But I told you I will start answering your Thread and doing any "Thinking" once you'll have complied with my 3 "Things", but you've only complied with 1 until now... :(
balandongiv

Re: How to extract nested informations using XPATH?

Post by balandongiv » Mon Apr 13, 2020 2:40 am

On the requested changes:
- I'm not sure which typo you are thinking of in "Windows 10"?
- The screenshot was removed because the term "HTML DOM Property" already implies the meaning, and the image seemed to make the thread redundant.
chivracq
Posts: 9494
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: How to extract nested informations using XPATH?

Post by chivracq » Mon Apr 13, 2020 4:02 am

balandongiv wrote:
Mon Apr 13, 2020 2:40 am
On the requested changes:
- I'm not sure which typo you are thinking of in "Windows 10"?
- The screenshot was removed because the term "HTML DOM Property" already implies the meaning, and the image seemed to make the thread redundant.

"Windows" <> "WIndows" :P

And for the rest, oh well, never mind if it's too "complicated"..., then your Thread doesn't meet my "Quality Criteria", and I don't answer..., fair enough... :|
You'll probably find my Solution anyway if you search my Posts on the Forum, surprisingly with "xpath" as Search Term, I would think..., hum, and I'm being "nice": + "eval+split"... 8)

>>>

Hum, and for those interested, quoting the one Answer at this moment on SOF I had referred to (as most Users on that Forum delete their Threads/Qt's related to iMacros once they've got their Solution/Working Script...):
Your XPath contains an error (@id instead of @class). Fix it with :

Code: Select all

//figure[1]/div[@class='showcase__content']
To access the url for downloading the file, it would be :

Code: Select all

//figure[1]/div[@class='showcase__content']//@data-download-file-url
balandongiv

Re: How to extract nested informations using XPATH?

Post by balandongiv » Mon Apr 13, 2020 4:29 am

Fixed the typo.
Maybe you can be kind enough to share the link to the mentioned thread. I even used the advanced search, but there is no thread that is similar to my problem, or I overlooked it.

One other thing: I have a very good reputation on SOF (please check my reputation) as well as on the MATLAB forum, and I am unlikely to delete any suggestion made by others. In fact, I am as surprised as you are by this kind of disrespectful act by other users! On the same note, perhaps you can mention in your previous response that the suggestion (as you quoted from SOF) is not working, at least in my experience. This can help future readers judge the usability of the suggestion.
chivracq
Posts: 9494
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: How to extract nested informations using XPATH?

Post by chivracq » Mon Apr 13, 2020 8:59 pm

balandongiv wrote:
Mon Apr 13, 2020 4:29 am
Fixed the typo.
Maybe you can be kind enough to share the link to the mentioned thread. I even used the advanced search, but there is no thread that is similar to my problem, or I overlooked it.

One other thing: I have a very good reputation on SOF (please check my reputation) as well as on the MATLAB forum, and I am unlikely to delete any suggestion made by others. In fact, I am as surprised as you are by this kind of disrespectful act by other users! On the same note, perhaps you can mention in your previous response that the suggestion (as you quoted from SOF) is not working, at least in my experience. This can help future readers judge the usability of the suggestion.

Oh, good...!, nearly 1 Page of a Thread to correct a Typo in "Windows", MS will be very Happy, "we" finally made it...!! :D

"but there is no thread that is similar to my problem", hum..., nearly all 17 Hits are relevant, and one is even your own Thread... :roll:

Yeah-yeah, don't worry, I had already checked your "Reputation", even before approving your Post/Thread on our Forum, I always run some "pretty Extended" Check on 1st time Posters on our Forum, and you don't want to know what "pretty Extended" means, ah-ah...! :twisted:

>>>

But OK, for those "Interested", some interesting Development on the SOF Front... The same User who had already posted an Answer, did some Try-out with iMacros, and updated their Answer, which apparently works, @OP is "Happy" and accepted the Sol, and hum...!, I'm pretty impressed, that User is GOOD...! And for a 1st time User of iMacros..., or not a regular User... Compliment...! :D

Quoting their updated Answer:
Your XPath contains an error (@id instead of @class). Fix it with :

Code: Select all

//figure[1]/div[@class='showcase__content']
To access the url for downloading the file, it would be :

Code: Select all

//figure[1]/div[@class='showcase__content']//@data-download-file-url
EDIT : To get values from specific attributes you have to extract the code from the element with the HTM function and then use regex. HREF attributes can be extracted directly.

I'm not an imacros user, so my code might not be the smartest :

Code: Select all

VERSION BUILD=1005 RECORDER=CR
URL GOTO=https://www.freepik.com/search?dates=any&format=search&page=1&query=Polygonal%20Human&sort=popular
TAG XPATH="//figure[1]/div[@class='showcase__content']/a" EXTRACT=HREF
SET !VAR3 {{!EXTRACT}}
TAG XPATH="//figure[1]/div[@class='showcase__content']/a" EXTRACT=HTM
SET !VAR1 EVAL("var regex = /url=\"(.+?)\"/; var str = '{{!EXTRACT}}';str.match(regex)[1];")
SET !VAR2 EVAL("var regex = /id=\"(.+?)\"/; var str = '{{!EXTRACT}}';str.match(regex)[1];")
PROMPT {{!VAR1}}
PROMPT {{!VAR2}}
PROMPT {{!VAR3}}
Side notes : free users of imacros are limited to 3 declared variables (!VAR1 to 3). You might need loops and SET !EXTRACT_TEST_POPUP NO to achieve your final goal.
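
For readers who want to see the regex step of the macro in isolation, here is a minimal JavaScript sketch. The HTML string is a made-up stand-in for what EXTRACT=HTM might return (the real Freepik markup may differ), and the variable names are chosen only for illustration:

```javascript
// Made-up sample markup; the URLs and the id value are placeholders, not real data.
const sampleHtml =
  '<a href="https://example.com/item/123" id="dtl-123" ' +
  'data-download-file-url="https://example.com/download/123">thumbnail</a>';

// Same patterns as in the macro: a lazy capture of the text between the quotes.
const urlMatch = sampleHtml.match(/url="(.+?)"/);
const idMatch = sampleHtml.match(/id="(.+?)"/);

// match() returns null when nothing matches, so guard before reading group 1.
const downloadUrl = urlMatch ? urlMatch[1] : null;
const elementId = idMatch ? idMatch[1] : null;

console.log(downloadUrl);
console.log(elementId);
```

Note that the patterns are not anchored to a full attribute name: `id="` would also match the tail of an attribute such as `data-id="` if one appeared earlier in the string, so a more specific pattern may be needed on other pages.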
"I'm not an imacros user, so my code might not be the smartest..." => Hum, actually that Code is nearly "too smart", ah-ah...!, I don't think many Users will be able to understand and reuse it for (slightly) different Needs... :|

And their Sol with 'EVAL()' is very similar to mine, except they use in my Opinion a "cumbersome" and "complicated" Construction with 'REGEXP' + 'match()' while 'split()' can get the same Result(s) in a much easier Way..., and always Repetitive, hardly any "Thinking" required... :|
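
The 'split()' construction itself is not spelled out in this thread. A minimal sketch of what such an approach could look like in plain JavaScript, under the assumption that the extracted HTML resembles the made-up string below (markup, URLs and variable names are all hypothetical):

```javascript
// Made-up sample markup standing in for the extracted HTML.
const extractedHtml =
  '<a href="https://example.com/item/123" id="dtl-123" ' +
  'data-download-file-url="https://example.com/download/123">thumbnail</a>';

// Split on the text just before the value, keep what follows,
// then split again on the closing quote and keep what precedes it.
const urlBySplit = extractedHtml.split('data-download-file-url="')[1].split('"')[0];
const idBySplit = extractedHtml.split(' id="')[1].split('"')[0];

// If a marker is absent, element [1] is undefined and the second split() throws,
// so the marker text has to be known to exist in the input.
console.log(urlBySplit);
console.log(idBySplit);
```

The same two-step pattern repeats for every attribute, which is the "Repetitive" quality referred to above; only the marker string changes.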
balandongiv

Re: How to extract nested informations using XPATH?

Post by balandongiv » Tue Apr 14, 2020 5:42 am

I seldom say this on the net, but I definitely discourage people from joining this forum, as some of the admins/moderators focus on formalities and tend to copy-paste solutions from others!
chivracq
Posts: 9494
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: How to extract nested informations using XPATH?

Post by chivracq » Tue Apr 14, 2020 7:26 pm

balandongiv wrote:
Tue Apr 14, 2020 5:42 am
I seldom say this on the net, but I definitely discourage people from joining this forum, as some of the admins/moderators focus on formalities and tend to copy-paste solutions from others!

Yep indeed, Thanks for the Recommendation, this is indeed a Quality Tech Forum... 8)
chivracq
Posts: 9494
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: How to extract nested informations using XPATH?

Post by chivracq » Thu Aug 27, 2020 8:54 am

Hum..., and for those interested, some very similar Thread opened recently on the Forum where I posted "my" Solution using 'EVAL()' + 'split()' that I find muuuch simpler than using 'XPATH' and 'REGEX'...: :idea:
(... And that I had never posted in this current Thread as the Quality went a bit quickly "down" as @OP "preferred" to start arguing about "nearly everything" with me, and they seem to have deleted their Account since, OK, fair enough... :( )
- Re: How to grab variable in source code to extract? (script type="text/javascript"> var abc = <payload>)

+ More Info and more Explanation(s) on "my" Technique in my 2nd Post with Scripts a bit further in the same Thread... 8)