Extracting Text and Images from a Website

Discussions and Tech Support related to website data extraction, screen scraping and data mining using iMacros.
Ratfink
Posts: 5
Joined: Tue Jan 19, 2016 6:52 am

Extracting Text and Images from a Website

Post by Ratfink » Mon Mar 28, 2016 12:47 am

Hi All,
I am working on getting data from a website, http://www.midfifty.com.
Specifically, I need to get all the information from all the products, text and images, including pricing.
I have listed all the pertinent info below along with the code so far. It works fine, however I can't figure out the following:
  • How to loop to the next product/item.
  • How to loop to the next page to get the next page of items, and so on.
  • The best way to get the product images, and how to attach them, refer to them or save them alongside the extracted text in the CSV file.
I need to be able to get all the product data, with images, on all the pages. I can figure out how to get the info for one product but not for all of them. I can also figure out how to get one image but not all of them. How would I capture the images and link them to the part number text captured in the CSV file?

Thank you in advance, struggling a bit.


iMacros Version is 10.0.2.2823
Operating System is Windows Server 2012R2, English
Browser is IE 11, with latest updates.
iMacros Demos OK
iMacros Scripting Interface is OK
URL: http://www.midfifty.com
iMacros code is listed below:
VERSION BUILD=10022823
TAB T=1
TAB CLOSEALLOTHERS
URL GOTO=http://midfifty.com/store.php
' Click the "57-60 Parts" link, then the "Bed Insulators, Hold Down Pads" block
TAG POS=1 TYPE=A ATTR=TXT:57-60<SP>Parts
TAG POS=1 TYPE=DIV ATTR=TXT:Bed<SP>Insulators,<SP>Hold<SP>Down<SP>Pads
' Extract text from fixed positions on the page (every POS is hard-coded)
TAG POS=1 TYPE=DIV ATTR=CLASS:large EXTRACT=TXT
TAG POS=1 TYPE=DIV ATTR=CLASS:price<SP>bld<SP>lrgr EXTRACT=TXT
TAG POS=1 TYPE=DIV ATTR=CLASS:bld<SP>cntr EXTRACT=TXT
TAG POS=25 TYPE=TD ATTR=* EXTRACT=TXT
TAG POS=1 TYPE=DIV ATTR=CLASS:cntr EXTRACT=TXT
TAG POS=1 TYPE=DIV ATTR=CLASS:price<SP>cntr<SP>lrgr<SP>bld EXTRACT=TXT
TAG POS=2 TYPE=DIV ATTR=CLASS:cntr EXTRACT=TXT
' Write the extracted values to a timestamped CSV file
SAVEAS TYPE=EXTRACT FOLDER=* FILE=Extract_{{!NOW:ddmmyy_hhnnss}}.csv
IrishMacro
Posts: 135
Joined: Wed Nov 03, 2010 12:27 pm

Re: Extracting Text and Images from a Website

Post by IrishMacro » Thu Apr 07, 2016 8:44 am

For saving an image try http://wiki.imacros.net/SAVEPICTUREAS

Before you save the picture, extract its URL (EXTRACT=HREF) to a CSV file that you can open in Excel. You can create the TAG command by simply clicking the image while recording:

TAG POS=1 TYPE=IMG ATTR=SRC:http://m.midfifty.com/partpics/147-*jpg EXTRACT=HREF
SAVEAS TYPE=EXTRACT FOLDER=* FILE=resultsfile.csv
TAG POS=1 TYPE=IMG ATTR=SRC:http://m.midfifty.com/partpics/147-*jpg CONTENT=EVENT:SAVEPICTUREAS
Look for the image in your Downloads folder under the imacros folder
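
The link between a saved picture and its CSV row can also be made outside iMacros. Here is a minimal Python sketch, assuming you already have each product's part number, price and image URL from the extraction step. The sample values, the column names and the helper names `image_filename` and `write_product_rows` are hypothetical and not taken from midfifty.com. It derives the local file name from the image URL and writes it next to the part number, so every CSV row points to its picture.

```python
import csv
import io
import posixpath
from urllib.parse import urlsplit


def image_filename(image_url):
    """Return the last path segment of an image URL (query string ignored)."""
    return posixpath.basename(urlsplit(image_url).path)


def write_product_rows(products, out):
    """Write one CSV row per product, including the image file name."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["part_number", "price", "image_url", "image_file"])
    for part_number, price, image_url in products:
        writer.writerow([part_number, price, image_url, image_filename(image_url)])


# Hypothetical sample data; real values would come from the extraction step.
sample = [
    ("PART-001", "12.50", "http://example.com/partpics/part-001.jpg"),
    ("PART-002", "8.00", "http://example.com/partpics/part-002.jpg?size=large"),
]
buffer = io.StringIO()
write_product_rows(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of memory, pass a file opened with `open("products.csv", "w", newline="")` as `out`.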
Firefox free plugin, last version
Win7
NoraChoi
Posts: 6
Joined: Thu May 26, 2016 6:14 am

Re: Extracting Text and Images from a Website

Post by NoraChoi » Wed Jun 01, 2016 2:38 am

Hi Ratfink,

I've opened the site and if you just need the hrefs of all the images, this tool http://www.octoparse.com/?fo may help you.

You can use it to extract all the information for all the products and the related text, including pricing.

I recommend you check out the tutorials http://www.octoparse.com/Tutorial as it might spark some ideas for you. :wink:
janib4all
Posts: 132
Joined: Wed Jul 21, 2010 6:44 am
Location: Karachi, Sindh, Pakistan

Re: Extracting Text and Images from a Website

Post by janib4all » Tue Jun 07, 2016 8:28 pm

Check the JS tutorials. Octoparse and other visual grabbing tools are out there, but sometimes you need more control over scraping; use iMacros or Python instead.
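
For the Python route, here is a rough sketch of the "loop to the next page" part, using only the standard library. It assumes the pager marks its link with rel="next", which may not match the real site. The `PageParser` and `scrape_all_pages` names and the in-memory `PAGES` data are made up for illustration, so treat it as a starting point, not a finished scraper.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collect image URLs and the 'next page' link from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.image_urls.append(urljoin(self.base_url, attrs["src"]))
        # Assumption: the pager marks its link with rel="next".
        elif tag == "a" and attrs.get("rel") == "next" and attrs.get("href"):
            self.next_url = urljoin(self.base_url, attrs["href"])


def scrape_all_pages(start_url, fetch, max_pages=100):
    """Follow 'next' links from start_url and return every image URL found."""
    found, url, seen = [], start_url, set()
    # Stop on a missing link, a repeated page, or the page limit.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = PageParser(url)
        parser.feed(fetch(url))
        found.extend(parser.image_urls)
        url = parser.next_url
    return found


# Hypothetical two-page site held in memory so the sketch runs offline.
PAGES = {
    "http://example.com/store?page=1":
        '<img src="/partpics/a.jpg"><a rel="next" href="/store?page=2">Next</a>',
    "http://example.com/store?page=2": '<img src="/partpics/b.jpg">',
}
images = scrape_all_pages("http://example.com/store?page=1", PAGES.__getitem__)
print(images)
```

For a live site, `fetch` could be any function that takes a URL and returns the page HTML as a string, for example one built on `urllib.request.urlopen`.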
Hire the BoT-fReeak!
botspecialist.blogspot.com
Domus
Posts: 1
Joined: Fri Apr 12, 2019 2:52 pm

Re: Extracting Text and Images from a Website

Post by Domus » Fri Apr 12, 2019 3:00 pm

Another option for your web capture needs would be to use an online web scraper like GrabzIt. This allows you to use point and click to define what product details should be scraped. However if you want more control you can manually change the scrape instructions as well.
chivracq
Posts: 8693
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: Extracting Text and Images from a Website

Post by chivracq » Fri Apr 12, 2019 4:43 pm

Post approved, @Domus, even if I hesitated a bit, but the Tool and the Site look a bit "serious"...
You could have mentioned there is a Free (limited) Version, and the Full Version costs $6 a month...

+ You could also have mentioned if you are "affiliated" with this Tool/Site or a "User", and if "User" => you could have given your personal Feedback/Experience with this Tool/Service, and if "affiliated", this one Post is "OK", but don't go spamming the Forum by posting a similar Post in "all" Threads related to Web-Scraping on our Forum, or your Account will quickly get banned (and all Posts deleted), like it "nearly" happened with User @NoraChoi who posted earlier about another Tool in this same Thread... :idea:

:arrow: For Users who want to "advertize" about "other" Products/Tools/Services related to iMacros and some Functionality that iMacros also provides, this is "allowed" in some "monitored" way and the same Rules apply like for (Anti-)Captcha Providers... :!:
- (F)CI(M) = (Full) Config Info (Missing): iMacros + Browser + OS (+ all 3 Versions + 'Free'/'PE').
- I don't even read the Qt if that (required) Info is not mentioned...!
- Script & URL help a lot for more "educated" Help...
thecoder2012
Posts: 357
Joined: Sat Aug 15, 2015 5:14 pm
Location: Internet

Re: Extracting Text and Images from a Website

Post by thecoder2012 » Thu Jul 04, 2019 11:39 am

chivracq wrote:
Fri Apr 12, 2019 4:43 pm
You could have mentioned there is a Free (limited) Version, and the Full Version costs $6 a month...
Without an account I see no details about the "Free (limited) Version" like the newest iMacros, only the details of the free trial (7 days). Registration was not possible in my old browser (error message: "A strange error occured when grabzin' that!"), but it worked in a browser with the newest Firefox engine. Only cloud shit in my eyes. I found the details: Current Package for one month: Free, Remaining Captures 1000, Scrape Limit 100 pages

According to the Perl client source code, the online API is required. There is no Perl example for a custom proxy, and no proxy type (socks4, socks5, http/https) can be set. I gave up. :wink:
Join 9kw.eu Captcha Service now and let your iMacros continue downloads and scripts while you sleep. - Custom iMacros? Contact me! :idea:
chivracq
Posts: 8693
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: Extracting Text and Images from a Website

Post by chivracq » Thu Jul 04, 2019 12:37 pm

Yeah, well, I had checked the Site 3 months ago when this User had posted, and the Site looked "a bit serious" for $6 per Month, so I guess they've changed their Licensing Model since...

=> Good idea now to quote them in case they change it again, ah-ah...!:
Try all our premium features for free with a 7 day free trial. Then from $5.99 a month, unless cancelled.
$6 a month is still a "decent" Price I would think, even if I don't like the "unless cancelled"..., but I "recently" disapproved a Post about 10 days or maybe 2 weeks ago, and I don't remember the Name of the Site or Product, but I wouldn't mention it anyway actually, and their Licensing Model was 50 USD (I think, or maybe Euro...?, I don't remember) per Month, ouf-ouf already..., and only for Scraping from one single URL/Domain... => Ouf-ouf-ouf...!
=> That one I didn't approve, ah-ah...! :shock:

Even if hum..., I should maybe check with TechSup what they think...
Because for that stupid exorbitant Price, an iMacros 'PE' Version with v10.0.2 for FF or v10.0.5 for CR feels pretty cheap compared to it while iMacros is very powerful in my Opinion for Data-Extraction..., and very precise also... 8)
But that User was "acting" very spammy I remember, and simply trying to post everywhere on Internet and on every Forum somewhat related to Data-Extraction and Web-Scraping... No way I would have approved that User/Post... :roll:

But OK, I'll be checking with TechSup..., but I guess anybody can check Wikipedia for an extended List of all Products/Sites/Services for "Data-Extraction"/"Web-Scraping" and check by themselves each Product/Software one by one for their respective Advantages and Pricing Model, same for "Web-Testing", "Web-Automation", "Test-Automation", etc..., iMacros is never (and has never been) the only Product in each Category/Functionality...
I check those Wikipedia Pages myself also from time to time, just "to keep track" of all the Names and sometimes to try/test a bit other Software Products, I find it a "healthy Attitude", ah-ah...! :|