multiple row different extract areas

Maxscrap
Posts: 12
Joined: Thu Aug 02, 2018 7:57 am

multiple row different extract areas

Post by Maxscrap » Tue Oct 16, 2018 3:46 pm

Dear all, I know the rules and checked the forum and the FAQ for something similar before posting, but I couldn't find anything helpful.

I am trying to build an .iim macro to scrape data from an e-commerce site in order to find good spare parts for my very old car. I am new to the wonderful "iMacros personal V12" tool, and I would like to see whether it can help me find competitive prices across different e-commerce sites.
I tried to write the macro, but I can't get it right when an item has several suppliers: for some of them information is missing on the page (the price or the corresponding code), and when the macro scrapes the page it picks up seemingly random values from whichever fields are available, without following any schema. I am posting my question to explain what I don't understand, in case anybody can help me.

Here is my draft macro:

SET !extract_TEST_POPUP NO
VERSION BUILD=12.0.501.6698
TAB T=1
SET !DATASOURCE C:\Users\Massimo\Documents\iMacros\DataSources\test.csv
SET !LOOP 1
SET !DATASOURCE_LINE {{!LOOP}}
SET !EXTRACT NULL
SET !ERRORIGNORE YES
ADD !EXTRACT {{!COL1}}
ADD !EXTRACT {{!NOW:yyyy/mm/dd_hhnn}}

'SET !PLAYBACKDELAY 0.00
URL GOTO=https://https://www.autoparti.it/ricerca?keyword={{!COL1}}

TAG POS=1 TYPE=INPUT:SEARCH ATTR=NAME:pcode CONTENT={{!COL1}}
TAG POS=1 TYPE=INPUT:SUBMIT ATTR=CLASS:header-search__search-submit-btn
TAG POS=R1 TYPE=DIV ATTR=CLASS:art EXTRACT=TXT
TAG POS=R1 TYPE=SPAN ATTR=CLASS:price EXTRACT=TXT
TAG POS=R2 TYPE=DIV ATTR=CLASS:art EXTRACT=TXT
TAG POS=R2 TYPE=SPAN ATTR=CLASS:price EXTRACT=TXT
SAVEAS TYPE=EXTRACT FOLDER=* FILE=*


(I used a .csv file with the part codes in order to keep track of my searches.)

As you can see above, I used relative positioning for POS (here R1 and R2), but some spare parts have more suppliers and would need more Rn lines. Is there a way to write a general TAG that extracts every position, as many as are available?


Thanks for any help.
chivracq
Posts: 7722
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: multiple row different extract areas

Post by chivracq » Tue Oct 16, 2018 6:33 pm

Hum, always mention your FCI a bit more "clearly" than some "iMacros personal V12" a bit lost in the middle of some Text, preferably at the very beginning of your OP when you open a Thread, and your OS is missing btw... => will be Win7/Win10 I reckon...?, from your Path for '!DATASOURCE', and iMB only supports Win-OS anyway...
=> FCI:

iMB v12.0 'PE', Win7/Win10...?
I'll be able/willing to have a look at your Site/Script if you can post a few Keywords from your '!COL1' returning a different Number of Results illustrating the different Scenarios you describe... :idea:

But the "Idea" could be for example, to always try to extract, say 3 Results, on 3 different Lines with 3x 'SAVEAS', and to make the 2nd and the 3rd 'SAVEAS' "Conditional" if there is/are only 1 or 2 Results..., unless you don't mind Rows filled with "#EANF#" x4 times for the 4x 'EXTRACT''s...
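To make the fixed-number-of-rows idea concrete, here is a minimal sketch in Python (not iMacros syntax) of the output layout it would produce. It assumes three hard-coded blocks per product code; the function names `rows_for_code` and `drop_empty` and the sample supplier values are hypothetical, and `#EANF#` stands for the placeholder iMacros writes when an element is not found.

```python
EANF = "#EANF#"  # placeholder written when an element is not found


def rows_for_code(code, results, blocks=3):
    """Return exactly `blocks` rows for one product code.

    `results` is the list of (supplier, price) pairs actually found.
    Missing results are padded with EANF so every code has the same layout.
    """
    rows = []
    for i in range(blocks):
        if i < len(results):
            supplier, price = results[i]
        else:
            supplier, price = EANF, EANF
        rows.append([code, supplier, price])
    return rows


def drop_empty(rows):
    """"Conditional" variant: keep only the rows that hold a real result."""
    return [row for row in rows if row[1] != EANF]


# One result found, three blocks attempted (hypothetical values).
rows = rows_for_code("19019505", [("Supplier A", "10.00")])
print(rows)              # three rows, the last two padded with #EANF#
print(drop_empty(rows))  # only the row with real data
```

Padding keeps the CSV rectangular, which makes filtering afterwards (for example in Excel) straightforward; the conditional variant trades that regularity for a cleaner file.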

Some Remark:
Why do you double the "https://" part in the 'URL GOTO' Line...? This would be a Bug I suspect if this got recorded "like that" by iMB12...? :o
- (F)CIM = (Full) Config Info Missing: iMacros + Browser + OS with all 3 Versions...
- I usually don't even read the Question if that (required) Info is not mentioned...
- Script & URL usually help a lot for a more "educated" Help...
Maxscrap

Re: multiple row different extract areas

Post by Maxscrap » Tue Oct 16, 2018 6:56 pm

Hello Chivracq.
Glad to read you again!

chivracq wrote:always mention your FCI a bit more "clearly" than some "iMacros personal V12" a bit lost in the middle of some Text, preferably at the complete beginning of your OP when you open a Thread, and your OS is missing btw..., => will be Win7/Win10 I reckon...?, from your Path for '!DATASOURCE' and iMB only supports Win-OS anyway...
Sorry, you are right. My OS is Windows 8.1.

chivracq wrote:I'll be able/willing to have a look at your Site/Script if you can post a few Keywords from your '!COL1' returning a different Number of Results illustrating the different Scenarios you describe... :idea:
Here they are:
19019505
19018653

chivracq wrote:Some Remark:
Why do you double the "https://" part in the 'URL GOTO' Line...? This would be a Bug I suspect if this got recorded "like that" by iMB12...? :o
I just copied it as it was from the URL...

chivracq wrote:But the "Idea" could be for example, to always try to extract, say 3 Results, on 3 different Lines with 3x 'SAVEAS', and to make the 2nd and the 3rd 'SAVEAS' "Conditional" if there is/are only 1 or 2 Results..., unless you don't mind Rows filled with "#EANF#" x4 times for the 4x 'EXTRACT''s...
Well, I don't mind if "#EANF#" appears 4 times; I can filter it out afterwards with Excel, no problem. In any case, I am going to try the path you suggested (if I am able :P )


Thanks Chivracq, very helpful as always!
Mass
Mass
Maxscrap

Re: multiple row different extract areas

Post by Maxscrap » Thu Oct 18, 2018 10:29 am

Following up on my previous post: I've been trying to figure out the problem without success, or rather, I did but .... :shock:

Having solved the first problem, another one came up, and I worked on it for some hours without understanding how to fix it :| . The loop works perfectly and searches for each required code, but I don't know how to use EXTRACT properly, because some codes have more than one offer, each from a different supplier.

In that scenario, whenever one supplier has no price for the searched code, the macro automatically extracts the price of the next supplier displayed (which does have a price).


My question is: how can I restrict the output to each specific supplier, matching price and supplier name, code by code (in the loop)?

Here is the new code for that case, using the car part with the following code: 19018873

SET !extract_TEST_POPUP NO
VERSION BUILD=12.0.501.6698
TAB T=1

SET !DATASOURCE C:\Users\Massimo\Documents\iMacros\DataSources\test.csv
SET !LOOP 1
SET !DATASOURCE_LINE {{!LOOP}}
SET !EXTRACT NULL

ADD !EXTRACT {{!COL1}}
ADD !EXTRACT {{!COL2}}
ADD !EXTRACT {{!NOW:yyyy/mm/dd_hhnn}}

'SET !PLAYBACKDELAY 0.00
URL GOTO=https://www.exist.ru/Price/?pcode={{!COL1}}

TAG POS=1 TYPE=INPUT:SEARCH ATTR=NAME:pcode CONTENT={{!COL1}}
TAG POS=1 TYPE=INPUT:SUBMIT ATTR=CLASS:header-search__search-submit-btn

SET !ERRORIGNORE NO
TAG POS=1 TYPE=DIV ATTR=CLASS:art EXTRACT=CORTECO
TAG POS=1 TYPE=SPAN ATTR=CLASS:bestOffers EXTRACT=TXT
TAG POS=1 TYPE=SPAN ATTR=CLASS:price EXTRACT=TXT
SET !ERRORIGNORE YES

SAVEAS TYPE=EXTRACT FOLDER=* FILE=*


P.S. I am picking e-commerce websites at random (which site it is doesn't matter; it is an exercise to improve my iMacros skills).
Maxscrap

Re: multiple row different extract areas

Post by Maxscrap » Fri Oct 19, 2018 4:54 am

I have just realized that I forgot to specify my browser, which is iMB (= Internet Explorer); it was the only option I could use when I installed the personal version of iMacros.
Maxscrap

Re: multiple row different extract areas

Post by Maxscrap » Fri Oct 19, 2018 8:40 am

To solve the problem, is "absolute positioning" the right path to follow?
chivracq

Re: multiple row different extract areas

Post by chivracq » Fri Oct 19, 2018 1:47 pm

A bit late Reply but the Forum is a bit "busy" at the moment, and me as well, ah-ah...!, but I had had a look at your Site with the 2 Product Nb's you had provided..., which yield 23 and 33 different Results (a bit more than the 1-2-3 maybe max 5 I expected, ah-ah...!) and the Site offers 3 different Types of Views to present the Results, grouped by 20/12/21. (The second Type of View is supposed to display 20 Results but only displays 12, oops, mini-Bug...)

Depending on which Type of View you would decide to extract the Data from, you could decide to extract only the Results from the first Page, or if you want to extract all Results and know there will never be more than, say 50 Results in total, you could hard-code depending on the Type of View => 60(20x3)/48|60(12x4|5)/63(21x3) Blocks using the same Principle like I had previously mentioned for 3 or 5 Blocks.
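The Block Counts above follow from rounding the maximum Number of Results up to whole Pages. A small Python sketch of that arithmetic (not iMacros syntax; the function name `blocks_to_hardcode` is hypothetical, and the 50-Result ceiling is the assumed Maximum from the Paragraph above):

```python
import math


def blocks_to_hardcode(max_results, per_page):
    """Return (pages, blocks) needed to cover `max_results` results
    when the view shows `per_page` results on each page."""
    pages = math.ceil(max_results / per_page)
    return pages, pages * per_page


# The three view types group results by 20, 12 and 21.
for per_page in (20, 12, 21):
    pages, blocks = blocks_to_hardcode(50, per_page)
    print(f"{per_page} per page: {pages} pages, {blocks} blocks")
```

For the 12-per-Page View, 4 Pages (48 Blocks) are only enough if there are never more than 48 Results; covering 50 needs the 5th Page.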

The easiest (for iMacros) to extract Data is when the Data is presented/organized in Tables, like that was the case in your previous Thread. There you can extract the whole Table with just one 'TAG' Statement (which might retrieve "too much" Data if you are not interested in the whole Content of the Table), or Cell by Cell, using fixed/incremental Positioning, or 'Relative Positioning' on some Header or first Col as Anchor. This works because the "other" Cells in the Table are always there: even if they don't contain any Data, they will simply be empty.

But on this Site, the Data (= the Results) is presented in 'DIV' (and 'SPAN') Elements and when some Fields are sometimes not present, like for the 'Price' you mentioned, the corresponding 'DIV' Element will usually not be present in the HTML Structure, which makes 'Relative Positioning' then "not reliable" as some 'POS=R1' might then "catch" the Price from the next Result, oops...!

In this case, you then need to use 'Double Relative Positioning' to first "verify" that the Field tagged by the 2nd R-Pos is indeed the one that you used as Anchor within that Result and not the one from the next Result. And upon that Check, you will then "fire" the "real" Extract or not... It can be done, and I use it myself in a few of my own Scripts, but it's a bit cumbersome, and can easily be broken if the Site changes "anything" in the HTML Structure for each Result, by adding/removing any Field, for example if they add some special Promotion Price or Discount Code around Xmas, or "Free Shipping"...
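The Logic of that Check can be sketched in Python (not iMacros syntax). The Page is modelled as a flat List of Elements in Document Order, which is an assumption made only for illustration; `naive_price` and `verified_price` are hypothetical Names, and the Supplier/Price Values are made up. The naive Version mimics a single 'POS=R1' Hop and picks up the next Result's Price; the verified Version hops forward to the Price, then backward to the nearest 'art' Element, and only accepts the Price if that Element is the original Anchor.

```python
EANF = "#EANF#"  # placeholder for "element not found"

# Supplier B has no price element at all (hypothetical data).
page = [
    ("art", "Supplier A"), ("price", "10.00"),
    ("art", "Supplier B"),
    ("art", "Supplier C"), ("price", "30.00"),
]


def naive_price(elements, anchor):
    """Single relative hop: first 'price' after the anchor, whoever owns it."""
    for kind, text in elements[anchor + 1:]:
        if kind == "price":
            return text
    return EANF


def verified_price(elements, anchor):
    """Double relative hop: accept the price only if it belongs to the anchor."""
    # Hop 1 (forward): index of the first 'price' after the anchor.
    hit = next((i for i in range(anchor + 1, len(elements))
                if elements[i][0] == "price"), None)
    if hit is None:
        return EANF
    # Hop 2 (backward): nearest 'art' element before that price.
    owner = next((j for j in range(hit - 1, -1, -1)
                  if elements[j][0] == "art"), None)
    return elements[hit][1] if owner == anchor else EANF


print(naive_price(page, 2))     # picks up Supplier C's price for Supplier B
print(verified_price(page, 2))  # reports that Supplier B has no price
```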

More reliable in this case might be to extract the "whole" Data per Result at the Level of the Containing 'DIV' for each Result using 'EXTRACT=HTM' (rather than 'EXTRACT=TXT' which will contain the whole Data as well, but "badly" formatted), and using 'EVAL()' to isolate each Field one by one that you wanted to extract. It can be a bit cumbersome as well, but the extracted Data will be more reliable than using 'Relative Positioning'...
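As an Illustration of that per-Container Approach, here is a minimal Python sketch (not iMacros syntax, and not the exact 'EVAL()' Expression you would write). It assumes the HTML of one Result Container has already been extracted as a String; the Class Names 'art' and 'price' come from the Macros in this Thread, while the outer 'row' Class, the Function Name `field` and the sample Values are hypothetical. The Regex is deliberately simple and only handles plain Text directly inside the Element.

```python
import re

EANF = "#EANF#"  # placeholder for "field not present in this result"


def field(container_html, css_class):
    """Return the text of the first element carrying `css_class`
    inside ONE result container, or EANF if there is none."""
    pattern = (r'class="[^"]*\b' + re.escape(css_class) +
               r'\b[^"]*"[^>]*>([^<]*)<')
    match = re.search(pattern, container_html)
    return match.group(1).strip() if match else EANF


# Two hypothetical result containers; the second has no price element.
result_a = ('<div class="row"><div class="art">Supplier A</div>'
            '<span class="price">10.00</span></div>')
result_b = '<div class="row"><div class="art">Supplier B</div></div>'

for html in (result_a, result_b):
    print(field(html, "art"), field(html, "price"))
```

Because each Field is searched only inside its own Container, a missing Price can no longer be "borrowed" from the next Result.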
Maxscrap

Re: multiple row different extract areas

Post by Maxscrap » Fri Oct 19, 2018 2:06 pm

Well, thank you for your nice reply. I understand you are quite busy, and I appreciate your support anyway, even if it comes later, because I think I need to make my own effort to work it out as well; after all, you are helping me, not replacing me :P

Anyway, I am going to follow your advice, which gives me a good view of how to reach a solution, and I'll be back asap (depending on my results :lol: )

Mass
chivracq

Re: multiple row different extract areas

Post by chivracq » Fri Oct 19, 2018 2:14 pm

Yeah, good luck..., and don't worry, I see all Posts on the Forum, dare to "shout" if you get stuck somewhere, ah-ah...! :wink: