Scrape with smart XPath

Chilly_Bang
Posts: 29
Joined: Tue Jan 27, 2015 9:13 am

Scrape with smart XPath

Post by Chilly_Bang » Sun Jun 23, 2019 7:58 pm

I usually scrape with a tool that understands "smart" XPath, meaning that if I set the scraper to the XPath

//div[@class='example-class']

I get the content of all occurrences of this class. I really need this, because the number of occurrences of this class on the site I scrape varies from zero to 10.

If I use this kind of expression with iMacros, I get only the first occurrence of this class. How should I set up the XPath in iMacros to get all occurrences of the class, regardless of how many there are?
FCI: Win 7 x64 + Win10 x64 + FF 45.9.0 + iMacro for FF 9.0.3
chivracq
Posts: 8523
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: Scrape with smart XPath

Post by chivracq » Sun Jun 23, 2019 10:25 pm

FCIM...! :mrgreen:

FCI: Win 7 x64 + Win10 x64 + FF 45.9.0 + iMacro for FF 9.0.3
=> Can you confirm/hard-code your FCI in your OP when you open a new Thread...? (FCI in Sig (with no Date when you last checked/updated it) is only confusing if you don't hard-post it in your OP, I've probably already told you and it's now clearly mentioned in the Forum Rules...)

v9.0.3 for FF is not a stable Version to use, and way too buggy...
+ FF45 is also a "strange" Version...

>>>

Then hum, you only mention some 'XPATH' without any Script posted, so I cannot see how you are using it, but 'EXTRACT' always extracts only 1 Element at a time, and whether you use the 'XPATH' or the 'POS' Parameter doesn't make a Difference, ah-ah...!
If you want to extract several Elements at the same time, that's possible, but you need to extract at some higher Level in the HTML/CSS Structure of your Page, and then use 'EVAL()' (or 'SEARCH' directly on the Source of the Page instead of 'EXTRACT') to isolate all Extracts, to populate an Array for example...
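
As a rough illustration of extracting at a higher Level and then isolating the individual Elements, here is a minimal JavaScript sketch of the kind of logic that could go inside 'EVAL()'. The sample HTML, the function name isolateDivs and the assumption that the target divs are not nested and carry exactly this class attribute are all hypothetical, not taken from any real Page.

```javascript
// Hypothetical HTML of a parent container, as an 'EXTRACT=HTM' on that
// container might return it (sample data, not from a real page).
const parentHtml =
  '<section><div class="example-class">A</div><p>x</p>' +
  '<div class="example-class">B</div></section>';

// Isolate the content of every <div class="..."> by splitting on the opening
// tag and then cutting each piece at the first closing tag.
// Assumes the target divs contain no nested <div> elements.
function isolateDivs(html, className) {
  const open = '<div class="' + className + '">';
  const parts = html.split(open); // parts[0] is everything before the first match
  const results = [];
  for (let i = 1; i < parts.length; i++) {
    results.push(parts[i].split('</div>')[0]);
  }
  return results;
}

const items = isolateDivs(parentHtml, 'example-class');
console.log(items);
```

Because the loop starts at index 1, a Page with zero matching Elements simply yields an empty Array, which covers the "from zero to 10" case.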

Mini-Rmk: I never use 'XPATH' myself, there are "better" (in my Opinion!) and easier ways with iMacros to achieve what you want without using the ueber-complex 'XPATH' Syntax, ah-ah...! :shock:
The only Case where 'XPATH' can be useful is for "Negative/Exclusive Tagging"... 8)
- (F)CIM = (Full) Config Info Missing: iMacros + Browser + OS with all 3 Versions...
- I usually don't even read the Question if that (required) Info is not mentioned...
- Script & URL usually help a lot for a more "educated" Help...

Re: Scrape with smart XPath

Post by Chilly_Bang » Mon Jun 24, 2019 1:45 pm

The FCI in my signature is up to date. 9.0.3 is the latest stable version I use, and I have been running it for some years without problems. I have no cause for complaint, not even with FF45...

chivracq wrote: "but 'EXTRACT' always only extracts 1 Element at the time,"

Aha, that is my problem, not the XPath...! Good to know, every day something new :)

I am referring to your answer in this thread:

chivracq wrote: "and then use 'EVAL()' + 'split()' to isolate the Data that you want to keep"

Eval→split has the potential to become one of my main tools when working with iMacros, but I have a knowledge gap right at the very beginning. Could you please explain briefly how eval→split really works?

I tried to study one example from your answer:

SET !EXTRACT "The repayable TSL debt as at 2016-10-16 is 4372.66."
SET !VAR1 EVAL("var s='{{!EXTRACT}}'; var x,y,z; y=s.split(' is '); z=y[1].split('.'); z[0];")
PROMPT {{!VAR1}}

What do [1] and [0] mean in this code example? And why are three vars declared first (var x,y,z;) when only two (y and z) are then used?

Re: Scrape with smart XPath

Post by chivracq » Mon Jun 24, 2019 4:03 pm

Hum, OK for your FCI then, but I "hope" you got my Msg about hard-posting it in your OP when you open a new Thread (for next time)...
And I guess you probably don't do any very "Advanced" Things in your current Macros or you would have quickly reverted to v8.9.7..., the 'EXTRACT' Mechanism is buggy for example in v9.0.3 and returns sometimes/often/always (I'm not sure, I never used this Version myself) the Data extracted double(!) :shock: for the 1st 'EXTRACT' in a Script, or when extracting a Table... But you will quickly find out by yourself if you get hit...

Then OK, for the Functionality that you want, very good, you've done your "Homework" and found indeed 2 very good Threads. :D
But, hum..., no Script, no Source, no URL posted, I will only be able to give you some "generic" Guidelines...

If you expect Max=10 Results from the Page, you could already use @iimfun's Solution about the [1] <-> [10] occurrences with your current 'XPATH' Statement by hard-coding them 10 times, and using 'EVAL()' to clean or count them based on "#EANF#". Or you can count them upfront by combining the Method you half-quoted from me from the same Thread with the Method I demonstrated in this Thread:
- Re: Number of Options in a Select tag
Then you concatenate all 'EXTRACT''s together, to re-'split()' them and only keep the first n Occurrences, or you use that n to compute the [1] <-> [10] incrementally until you reach n.
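
The "clean or count based on #EANF#" idea could look roughly like the JavaScript sketch below. It assumes that the 10 hard-coded 'EXTRACT' Results arrive as one String joined by "[EXTRACT]" (the separator iMacros uses in '!EXTRACT' for multiple Extractions) and that every missing Element shows up as "#EANF#". The sample data and the function name keepFound are hypothetical.

```javascript
// Hypothetical content of !EXTRACT after 10 hard-coded 'EXTRACT' steps on a
// page that only has 3 matching elements: 3 values followed by 7x "#EANF#".
const extracted = ['first', 'second', 'third']
  .concat(new Array(7).fill('#EANF#'))
  .join('[EXTRACT]');

// Split the combined string back into single extracts and drop the
// "#EANF#" placeholders left by the extracts that found nothing.
function keepFound(raw) {
  return raw.split('[EXTRACT]').filter(function (item) {
    return item !== '#EANF#';
  });
}

const found = keepFound(extracted);
console.log(found.length);      // number of elements actually present
console.log(found.join(' | ')); // the cleaned results
```

Inside a Macro the same Logic would have to be written as a single 'EVAL("...")' String, with '{{!EXTRACT}}' substituted for the sample data.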

Or still using "my Method" that you half-quoted, you either use 'EVAL()' + 'split()' 10 times to isolate all 10 Results one by one by incrementing both y[1] & z[0] from the 2nd Thread by 1 (=> y[2] & z[1] => [...] => y[10] & z[9]), or you can combine the 10 Double 'split()''s in one single 'EVAL()' Statement to return all Results together as an Array, or directly as the "real" Content of the '!EXTRACT' Var, as if you had done n "Standard" 'EXTRACT''s. (I don't know what you want to do with the Results as you didn't post any Script...)
And neater if you want to return all Results in just one 'EVAL()' would actually be to make a Function for the Double 'split()' and to use the 'push()' Method to populate the Array... I think there is an Example on the Forum, + several Examples on SOF (Stackoverflow) in the JS Section, and I think I made one for myself in one of my own Scripts, a few months ago... :o
(Well..., "I think" => because I'm not completely sure anymore, that was indeed my "first" way to implement the Functionality I wanted then, but "I think" I had then later chosen for another Implementation, and I'm not sure anymore if I had kept the part with 'push()' or not, ah-ah...!, that Script was fairly complex and I've since been a bit "reluctant" to dig into it again, ah-ah...!)
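
A Function for the Double 'split()' combined with 'push()' might be sketched as follows. This is not the Script mentioned above, only an illustration under simple assumptions: every Result sits between the same start Marker and end Marker, and the sample Text and the name extractAll are made up for the example.

```javascript
// Collect every piece of text found between a start marker and an end marker.
function extractAll(source, startMarker, endMarker) {
  var results = [];
  var y = source.split(startMarker); // y[0] is the text before the first marker
  for (var i = 1; i < y.length; i++) {
    var z = y[i].split(endMarker);   // z[0] is the text up to the end marker
    results.push(z[0]);
  }
  return results;
}

// Hypothetical sample text with three values to isolate.
var sample = 'Debt A is 100. Debt B is 250. Debt C is 75.';
var amounts = extractAll(sample, ' is ', '.');
console.log(amounts);
```

The Loop replaces the 10 hard-coded Double 'split()''s, so the same Function works whether the Source contains zero, three or ten Occurrences.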

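And then about your Qt about "y[1]" and "z[0]", well, 'split()' returns an Array, and the Items from an Array are identified by "Array[i]", with the Index 'i' starting at "0". So "y[1]" is the 2nd Part returned by the first 'split()' (the Text after ' is '), and "z[0]" is the 1st Part returned by the second 'split()' (the Text before the first '.').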
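
To make the Indexes concrete, here is the same Expression from the quoted Macro written out step by step in plain JavaScript, with the intermediate Arrays shown as Comments. Only the String and the two 'split()' calls come from the Thread; the layout is just for illustration.

```javascript
var s = 'The repayable TSL debt as at 2016-10-16 is 4372.66.';

// First split: cut the sentence at ' is '.
var y = s.split(' is ');
// y[0] = 'The repayable TSL debt as at 2016-10-16'
// y[1] = '4372.66.'

// Second split: cut the second part at every '.'.
var z = y[1].split('.');
// z[0] = '4372'
// z[1] = '66'
// z[2] = ''   (empty string after the final '.')

console.log(z[0]); // the value the macro ends up showing: 4372
```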

And in 'EVAL()', I often declare 3 Vars "x,y,z", even if I only use/need 1 or 2, then the Expression can easily be extended if needed, and all needed Vars have then already been declared... That's my "personal" Method/Syntax... :wink: