Scrape / Extract the data from second url / page


by a2515125 on Wed Nov 08, 2017 9:30 pm

Windows 10 / FF v50.1 / iMacros v8.9.7

Hey,

I know how to scrape the URLs of the product image pages (the first page) from an online shopping website, but today I have an extraction problem.

I would like to extract the seller's URL or ID. To get it, I have to click the product image first to reach the second page, which is the product description page with the seller information.

My question is: is it possible to scrape the seller's information from the product image page (the first page) without clicking through to the second page?

If the script has to click through to the second page in order to extract the seller information, it takes a lot of time.

I want to get the user ID "dakang.tw" on the first page instead of the second page.


first page
Image

second page
Image
a2515125
 
Posts: 84
Joined: Mon Dec 05, 2016 8:37 pm

Re: Scrape / Extract the data from second url / page

by chivracq on Thu Nov 09, 2017 7:46 am

If that "dakang.tw" Info is already displayed on the "first" Page (or contained somewhere in the HTML Source), then yep, of course you can extract it from the first Page by using some 'EXTRACT=TXT/HREF/HTM' + maybe 'EVAL()' if you need to "isolate" that specific Data only from some larger Content returned by the Extract.
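To illustrate the "isolate" step: the sketch below is plain JavaScript of the kind that could go inside an 'EVAL()'. The helper name isolateSellerId and the assumption that the extract contains a link of the form https://shopee.tw/<seller-id> are hypothetical, nothing here has been checked against the actual Site.

```javascript
// Hypothetical helper: pull the seller id out of a larger extracted string.
// Assumption: the extract contains a seller link shaped like
// https://shopee.tw/<seller-id>, where the id uses only letters, digits,
// dots, underscores and hyphens.
function isolateSellerId(extract) {
  const match = /https?:\/\/shopee\.tw\/([A-Za-z0-9._-]+)/.exec(extract);
  // Return null instead of throwing when no seller link is present.
  return match ? match[1] : null;
}

// Example with a made-up HTM extract:
const sample = '<a class="seller" href="https://shopee.tw/dakang.tw">shop</a>';
console.log(isolateSellerId(sample));
```

Any other link whose first path segment is plain ASCII would match as well, so the pattern would need tightening for a Page that mixes seller links with other links.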
- (F)CIM = (Full) Config Info Missing: iMacros + Browser + OS with all 3 Versions...
- I usually don't even read the Question if that (required) Info is not mentioned...
- Script & URL usually help a lot for a more "educated" Help...
chivracq
 
Posts: 6481
Joined: Sat Apr 13, 2013 6:07 am
Location: Amsterdam (NL)

Re: Scrape / Extract the data from second url / page

by a2515125 on Thu Nov 09, 2017 4:47 pm

Unfortunately, the "dakang.tw" info and the relevant HTML tag are only on the second page, not on the first page.

That is why I am asking whether it is possible to extract "dakang.tw" without clicking through to the second page, which dramatically slows down the extraction.

If I want to extract 10 seller IDs, the script has to click through to the second page and go BACK 10 times in order to extract them :idea:

Re: Scrape / Extract the data from second url / page

by chivracq on Thu Nov 09, 2017 11:18 pm

Hum..., if you can post the URL of your Site then I can have a look, I don't really "trust" your "Technical Insight", ah-ah...!, and I might have a few "Tricks" and "Techniques" that I can't really explain easily, or we'll be one week further before you get an Idea of how to code them, ah-ah...! :wink:
(And with no Script and no URL posted, I can't have a look at the Site or try a few "things", and I can only give some "generic" Advice...)

:arrow: Hum..., and for one of those Tricks, maybe the best one if your Info can indeed really only be found on the 2nd Page, make sure to stay at FF50 and to NOT update FF to any later Version as it got broken from FF52 or FF53. (It was still working in FF51.)

Re: Scrape / Extract the data from second url / page

by a2515125 on Fri Nov 10, 2017 2:06 am

https://shopee.tw/%E5%A5%B3%E7%94%9F%E8%A1%A3%E8%91%97-cat.62

Try this link.

Below is my clumsy code, which takes a long time clicking back and forth between the 50 products on every page in order to extract the user IDs or user URLs. It works fine but is too slow.

I always stay on FF v50, which works almost perfectly with the Multifox add-on :P

I can't wait to hear your advice and magic method!

Code:

SET !REPLAYSPEED FAST
SET !ERRORIGNORE YES
SET !TIMEOUT_STEP 0
FILTER TYPE=IMAGES STATUS=ON

'WAIT SECONDS=1
'Map the loop counter to a product position 1-50 (a multiple of 50 maps to position 50)
SET !VAR3 EVAL("var z=\"{{!LOOP}}\"; z = z % 50; if (z == 0) { z = 50; } z;")

'product1-50

TAG POS={{!VAR3}} TYPE=DIV ATTR=CLASS:shopee-item-card__btn-gap&&TXT:
WAIT SECONDS=0.5

'Extract the sellers_ID


TAG POS=1 TYPE=A ATTR=CLASS:product-page-seller-info__name-status EXTRACT=HREF

SAVEAS TYPE=EXTRACT FOLDER=C:\Users\admin\Desktop\iMacroscript FILE=shopee_seller_test.csv

BACK


SET next EVAL("var x; var y=\"{{!LOOP}}\";if(y%50==0){x=1}else{x=0}; x;")

TAG POS={{next}}  TYPE=DIV ATTR=CLASS:shopee-icon-button<SP>shopee-icon-button--right<SP>
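
The {{!LOOP}} arithmetic in the macro above (an item position from 1 to 50, plus a click on the "next page" button after every 50th item) can be sketched in plain JavaScript. The function names and the pageSize parameter are hypothetical and only there to make the mapping easy to check outside iMacros.

```javascript
// Map a 1-based global loop counter to a 1..pageSize item position.
// Loop 1 -> 1, loop 50 -> 50, loop 51 -> 1, and so on.
function itemPosition(loop, pageSize = 50) {
  return ((loop - 1) % pageSize) + 1;
}

// True when the current loop handles the last item of a result page,
// i.e. when the "next page" button should be clicked afterwards.
function isLastItemOnPage(loop, pageSize = 50) {
  return loop % pageSize === 0;
}

// Print the mapping around the first page boundary.
for (const loop of [1, 2, 49, 50, 51]) {
  console.log(loop, itemPosition(loop), isLastItemOnPage(loop));
}
```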




Re: Scrape / Extract the data from second url / page

by chivracq on Fri Nov 10, 2017 10:57 am

Hum..., "magic Method"..., we'll see...! A bit better than yours, but the Site is stupidly heavy, in the 'Amazon' Style, with each Page weighing about 5 MB, of which about 4 MB are JavaScript Scripts that need to load for the Page to be constructed... :roll:

And you were right, as far as I could tell, the Seller ID or URL is nowhere to be found on the Main Page with the 50 Items.
=> The only Option is then indeed to load each individual Page for each Item, but I "improved" your Method a bit by loading each 2nd Page in a 2nd TAB, which avoids having to go back to the Main Page (and having to reload it!)...
Using 'FILTER' is a good thing indeed, that's the Command I mentioned that got broken from FF52.
And I use a short '!TIMEOUT_PAGE' of 10 Sec for 'TAB_2', even if it doesn't really make a Difference, because iMacros waits for the HTML Page to load, but the Browser kind of freezes while loading those 4 MB of JS Scripts and the iMacros Timer freezes as well...

Running the following Script, it still took me 4 Min + 20 Sec for 10 Loops (=10 Items), but I have a very slow Connection, it will probably work much faster for you...

Code:
VERSION BUILD=8820413 RECORDER=FX
TAB T=1
SET !EXTRACT_TEST_POPUP NO

'Loop Script from:
'URL GOTO=https://shopee.tw/%E5%A5%B3%E7%94%9F%E8%A1%A3%E8%91%97-cat.62

'SET !LOOP 15

'SET !VAR3 EVAL("var z=\"{{!LOOP}}\";z = (z % 50);if(z == 1){ z = 50};z=z++;")
'TAG POS={{!VAR3}} TYPE=DIV ATTR=CLASS:shopee-item-card__btn-gap&&TXT: EXTRACT=HTM

'>>>

'TAG POS={{!LOOP}} TYPE=* ATTR=TXT:*320321* EXTRACT=HTM
'TAG POS={{!LOOP}} TYPE=* ATTR=TXT:*M016002* EXTRACT=HTM

SET !EXTRACT NULL
TAG POS={{!LOOP}} TYPE=DIV ATTR=CLASS:"shopee-item-card__text-name" EXTRACT=TXT
SET Item_Name {{!EXTRACT}}
'>
SET !EXTRACT NULL
TAG POS=1 TYPE=SCRIPT ATTR=TXT:*{{Item_Name}}* EXTRACT=TXT
SET Script_Extract {{!EXTRACT}}
'>
SET URL_Item EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('url\":\"'); y=x[1].split('\"'); z=y[0]; z;")

TAB T=2
SET !TIMEOUT_PAGE 10
FILTER TYPE=IMAGES STATUS=ON
URL GOTO={{URL_Item}}
FILTER TYPE=IMAGES STATUS=OFF

'Extract Seller URL:
SET !EXTRACT NULL
TAG POS=1 TYPE=A ATTR=CLASS:product-page-seller-info__name-status EXTRACT=HREF

SAVEAS TYPE=EXTRACT FOLDER=* FILE=Shopee.tw_Seller_URLs.csv

SET PROMPT_Msg LOOP:<SP>_{{!LOOP}}_<BR>VAR3:<SP>_{{!VAR3}}_<BR><BR>Item_Name:<BR>_{{Item_Name}}_<BR><BR>Script:<BR>_{{Script_Extract}}_<BR><BR>
ADD PROMPT_Msg URL:<BR>_{{URL_Item}}_<BR><BR>URL_Seller:<BR>_{{!EXTRACT}}_
'PROMPT {{PROMPT_Msg}}

TAB T=1
FILTER TYPE=IMAGES STATUS=OFF

'POS=3:
'_<script style="outline: 1px solid blue;" data-react-helmet="true" type="application/ld+json">
'{"@context":"http://schema.org","@type":"Product","name":"下殺 熱賣飛行外套 韓版 原宿 實拍 男女 雙層 復古 夾克 情侶外套 bf風 棒球外套 寬鬆 必備",
'"description":null,
'"url":"https://shopee.tw/%E4%B8%8B%E6%AE%BA-%E7%86%B1%E8%B3%A3%E9%A3%9B%E8%A1%8C%E5%A4%96%E5%A5%97-%E9%9F%93%E7%89%88-%E5%8E%9F%E5%AE%BF
'-%E5%AF%A6%E6%8B%8D-%E7%94%B7%E5%A5%B3-%E9%9B%99%E5%B1%A4-%E5%BE%A9%E5%8F%A4-%E5%A4%BE%E5%85%8B-%E6%83%85%E4%BE%B6%E5%A4%96%E5%A5%97
'-bf%E9%A2%A8-%E6%A3%92%E7%90%83%E5%A4%96%E5%A5%97-%E5%AF%AC%E9%AC%86-%E5%BF%85%E5%82%99-i.320321.17609900",
'"image":"https://cfshopeetw-a.akamaihd.net/file/59a346f8da7c394f3169a68183f3ca85",
'"offers":{"@type":"Offer","price":"250.00","priceCurrency":"TWD","availability":"http://schema.org/InStock"},
'"aggregateRating":{"@type":"AggregateRating","bestRating":5,"worstRating":1,"ratingCount":1611,"ratingValue":"4.81"}}</script>_

'POS=15:
'<div style="outline: 1px solid blue;" class="shopee-search-result-view__item-card">

'POS=19:
'_<div style="outline: 1px solid blue;" class="shopee-item-card__text-name">
'<!-- react-text: 4764 --> 現貨 ✌【M016002】百搭俏皮可愛素面撞色棒球服圓領短袖t恤 6色<!-- /react-text --></div>_
(Tested on iMacros for FF v8.8.2, PM v26.3.3 (=FF47), Win10_x64.)

And the Result of 10 Loops is:
Code:
"https://shopee.tw/twinklelady"
"https://shopee.tw/twinklelady"
"https://shopee.tw/shafiachen"
"https://shopee.tw/sasateng"
"https://shopee.tw/mito.shop"
"https://shopee.tw/vk.shop"
"https://shopee.tw/pigfish119"
"https://shopee.tw/rabbitqaq"
"https://shopee.tw/qq53453"
"https://shopee.tw/cccat_213"

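The Result above lists the same Seller twice, and it contains full URLs while the original Question asked for the Seller ID. A minimal JavaScript sketch for post-processing the saved CSV text could look like this. The function name uniqueSellerIds is hypothetical, and it assumes one quoted URL per line without a trailing slash, as in the Result shown.

```javascript
// Hypothetical post-processing of the saved extract: one quoted URL per line.
// Returns the seller ids (last path segment) without duplicates, keeping
// the order of first appearance.
function uniqueSellerIds(csvText) {
  const ids = csvText
    .split(/\r?\n/)
    .map((line) => line.trim().replace(/^"|"$/g, "")) // strip surrounding quotes
    .filter((line) => line.length > 0)                // skip empty lines
    .map((url) => url.split("/").pop());              // keep the last path segment
  return [...new Set(ids)];
}

const csv = [
  '"https://shopee.tw/twinklelady"',
  '"https://shopee.tw/twinklelady"',
  '"https://shopee.tw/shafiachen"',
].join("\n");
console.log(uniqueSellerIds(csv));
```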

I left in all my Debug Info showing how I progressed towards my "Solution", it was a bit like playing "Detective" to reconstruct the URL for each Item to reuse in TAB_2.
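
For reference, the URL reconstruction done by the 'EVAL()' in the Script can be tried outside iMacros with the sketch below. itemUrlBySplit mirrors the split logic from the Script. itemUrlByJson is an alternative that is not from this Thread and assumes the extracted text of the 'application/ld+json' Script is valid JSON. Both function names and the sample data are made up for illustration.

```javascript
// Same idea as the EVAL() in the macro: split on the literal text url":"
// and keep everything up to the next double quote.
function itemUrlBySplit(scriptText) {
  const afterKey = scriptText.split('url":"')[1];
  return afterKey === undefined ? null : afterKey.split('"')[0];
}

// Alternative (assumption): parse the JSON-LD text and read its "url" field.
function itemUrlByJson(scriptText) {
  try {
    const data = JSON.parse(scriptText);
    return typeof data.url === "string" ? data.url : null;
  } catch (err) {
    return null; // not valid JSON, e.g. truncated or mangled extract
  }
}

// Shortened, made-up sample in the shape of the debug output in the macro.
const sampleScript =
  '{"@context":"http://schema.org","@type":"Product","name":"Sample item",' +
  '"description":null,"url":"https://shopee.tw/sample-item-i.320321.17609900"}';
console.log(itemUrlBySplit(sampleScript));
console.log(itemUrlByJson(sampleScript));
```

The split Version is closer to what fits in a one-line 'EVAL()', while the JSON Version does not depend on the exact Position of the "url" Key.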

Re: Scrape / Extract the data from second url / page

by chivracq on Fri Nov 10, 2017 1:04 pm

Hum, 2 things I forgot to mention, as I was a bit in a rush when posting my previous Msg and some Chinese Chars that I had to remove were preventing the Post from being published...:
- Unless you hardly notice any Delay while the 4-5 MB of JS Scripts load for each Page, it might be worth "investigating" and "killing" those heavy Scripts one by one with some Ad/Content Blocker like AdBlock/ABP/ABE/uBlock, and therefore preventing them from loading, as probably only 1 or 2 are necessary for the Functionality that you need...

- And the Site is so ridiculously heavy, while technically quite high-tech and with a complex HTML Structure, that they might well have a Mobile Version of their Site for Mobile Users (I didn't check...), either with a separate URL/Domain or by dynamically checking your User Agent, which you can fool by using some UA-Switcher Add-on for FF or even the iMacros 'SET !USERAGENT' Command...