Copy Link Location From Multi-Layered Source

WindWalkerX
Posts: 6
Joined: Sun Mar 27, 2011 8:25 pm

Copy Link Location From Multi-Layered Source

Post by WindWalkerX » Sun Mar 27, 2011 8:41 pm

I want to do the equivalent of a right-click "Copy Link Location". Normally this works fine: I record a TAG command, then edit the line to add EXTRACT=HREF.
However, this fails for me on one type of page element, and I cannot solve it.

Here's an example script:

Code: Select all

VERSION BUILD=7110203 RECORDER=FX
SET !EXTRACT_TEST_POPUP NO
TAB T=1
URL GOTO=http://www.yahoo.com/?r0=1301260782
TAG POS=1 TYPE=IMG ATTR=SRC:http://l1.yimg.com/a/i/ww/news/2011/03/24/grandcanyon1-sm.jpg EXTRACT=HREF
ONDOWNLOAD FOLDER=* FILE=* WAIT=YES
SAVEAS TYPE=EXTRACT FOLDER=* FILE=urlsource.txt
On the Yahoo page I've chosen as an example, the stories are placed on the page with both an image and a link. Clicking the image takes you to the link.
If I right-click the image and choose Copy Link Location, I get the URL of the article, which is:

Code: Select all

http://travel.yahoo.com/p-interests-38846834
However, when I record it, iMacros records the JPG image location, and the EXTRACT grabs the location of the image, not the news article.
This is what it extracts:

Code: Select all

http://l1.yimg.com/a/i/ww/news/2011/03/24/grandcanyon1-sm.jpg
So how can I extract the link URL in cases like this, rather than the JPEG location?
Tom, Tech Support
Posts: 3530
Joined: Mon May 31, 2010 4:59 pm

Re: Copy Link Location From Multi-Layered Source

Post by Tom, Tech Support » Thu Mar 31, 2011 10:36 am

Hello WindWalkerX,

I attempted to run your macro but it failed to find the image in your TAG command. Would you please provide an updated example so that I can see exactly what you are working with? Thanks.
Regards,

Tom, iMacros Support

Post by WindWalkerX » Fri Apr 01, 2011 11:34 pm

The stories tend to update every few hours. They're the little sliding thumbnails on the main page.
A script that works right now is this:

Code: Select all

VERSION BUILD=7110203 RECORDER=FX
SET !EXTRACT_TEST_POPUP NO
TAB T=1
URL GOTO=http://www.yahoo.com/
TAG POS=1 TYPE=IMG ATTR=SRC:http://l1.yimg.com/a/i/ww/news/2011/04/01/rose-sm.jpg EXTRACT=HREF
ONDOWNLOAD FOLDER=* FILE=* WAIT=YES
SAVEAS TYPE=EXTRACT FOLDER=* FILE=urlsource.txt
I don't know how long the script will continue working. Yahoo was the only public site I could think of with an example like this.

Post by Tom, Tech Support » Tue Apr 12, 2011 7:16 pm

Hi WindWalkerX,

Sorry it has taken me a while to respond. I had to dig into the source HTML to figure this one out, and there are a couple of ways to do it, depending on exactly what you want.

First, here's an example of the source HTML for that portion of the page where you want to grab the links from:

Code: Select all

        <div class="y-carousel">
            <ol class="y-carousel-list y-today-ln-1">
                            <li id="p_13872472-panel0" class="y-panel y-today-grad1 clearfix ">                        <a y-pkgId="id-82774" data-b-tdh="_ylt=ArbFFx45YvJYVBz3oD_jbAybvZx4;_ylu=X3oDMTNpc2h1aHJzBGEDMTEwNDEyIHJpdmFscyBmcmVkZXR0ZSBjbGFzcyB0BGNwb3MDMQRnA2lkLTgyNzc0BGludGwDdXMEaXRjAzIEcGtndgMxMARzZWMDdGQtZmVhdARzbGsDdGh1bWIEdGVzdAM3MDEEd29lAzcwNDg0Mg--"  class=" y-today-grad2 y-ln-4 item selected" href="_ylt=AnUC5fW6uoqs74wrfadipymbvZx4;_ylc=X3oDMTh0ZjA1ajJwBF9TAzIwMjM1MzgwNzUEYQMxMTA0MTIgcml2YWxzIGZyZWRldHRlIGNsYXNzIHQEY3BvcwMxBGcDaWQtODI3NzQEaW50bAN1cwRpdGMDMARsdHh0A0JZVWFza3NzdGFydG9zdGF5YXdheQRwa2d2AzEwBHBvcwMwBHNlYwN0ZC1mZWF0BHNsawN0aHVtYmxpbmsEdGFyA2h0dHA6Ly9yaXZhbHMueWFob28uY29tL25jYWEvYmFza2V0YmFsbC9ibG9nL3RoZV9kYWdnZXIvcG9zdC9XaHktQllVLWhhcy1hc2tlZC1KaW1tZXItRnJlZGV0dGUtdG8tc3RvcC1hdHRlbmQ_dXJuPW5jYWFiLXdwMjA4MwR0ZXN0AzcwMQ--/SIG=14k4r04jv/EXP=1302719916/**http%3A//rivals.yahoo.com/ncaa/basketball/blog/the_dagger/post/Why-BYU-has-asked-Jimmer-Fredette-to-stop-attend%3Furn=ncaab-wp2083" >
                            <span class="y-fp-pg-controls indicator"></span>
                            <img class=" image y-ln-4 y-bg-1" src="http://l1.yimg.com/a/i/ww/news/2011/04/12/jimmer-pdsm.jpg" alt="Jimmer Fredette #32 of the Brigham Young Cougars (Photo by Doug Pensinger/Getty Images)"  title="Jimmer Fredette #32 of the Brigham Young Cougars (Photo by Doug Pensinger/Getty Images)">
                            
                            <span class="medium item-label" style="font-family: inherit;line-height:inherit;">BYU asks star to stay away</span>
                        </a>                        <a y-pkgId="id-82444" data-b-tdh="_ylt=AmNAeKTHn_4gTGQhCkzPQJKbvZx4;_ylu=X3oDMTNpZnJsdjlsBGEDMTEwNDExIG11c2ljIG9zYm91cm5lcyB0YXhlcyB0BGNwb3MDMgRnA2lkLTgyNDQ0BGludGwDdXMEaXRjAzIEcGtndgMxOQRzZWMDdGQtZmVhdARzbGsDdGh1bWIEdGVzdAM3MDEEd29lAzcwNDg0Mg--"  class=" y-today-grad1 y-today-ln-1 trans-border item" href="_ylt=AnUC5fW6uoqs74wrfadipymbvZx4;_ylc=X3oDMThkaDhvYWxhBF9TAzIwMjM1MzgwNzUEYQMxMTA0MTEgbXVzaWMgb3Nib3VybmVzIHRheGVzIHQEY3BvcwMyBGcDaWQtODI0NDQEaW50bAN1cwRpdGMDMARsdHh0A09zYm91cm5lc2hpdHdpdGhiaWd0YXhiaWxsBHBrZ3YDMTkEcG9zAzAEc2VjA3RkLWZlYXQEc2xrA3RodW1ibGluawR0YXIDaHR0cDovL25ldy5tdXNpYy55YWhvby5jb20vYmxvZ3Mvc3RvcHRoZXByZXNzZXMvMzkyMTg5L3RoZS1vc2JvdXJuZXMtMTctbWlsbGlvbi1pbi1kZWJ0LXJpc2stbG9zaW5nLWhvbWUvBHRlc3QDNzAx/SIG=13tg2iq5t/EXP=1302719916/**http%3A//new.music.yahoo.com/blogs/stopthepresses/392189/the-osbournes-17-million-in-debt-risk-losing-home/" >

                            <span class="y-fp-pg-controls indicator"></span>
                            <img class=" image y-ln-2 y-bg-1" src="http://l1.yimg.com/a/i/ww/news/2011/04/11/osbourne-sm.jpg" alt="Ozzy and Sharon Osbourne (Jeff Kravitz/FilmMagic)"  title="Ozzy and Sharon Osbourne (Jeff Kravitz/FilmMagic)">
                            
                            <span class="medium item-label" style="font-family: inherit;line-height:inherit;">Osbournes hit with big tax bill</span>
                        </a>                        <a y-pkgId="id-82396" data-b-tdh="_ylt=Ah8pDuJublQtCYboN9xDY_abvZx4;_ylu=X3oDMTNtdWd1MTlhBGEDMTEwNDExIG5ld3MgZ2FkaGFmaSBudXJzZSBzcGVha3MgdARjcG9zAzMEZwNpZC04MjM5NgRpbnRsA3VzBGl0YwMyBHBrZ3YDMjQEc2VjA3RkLWZlYXQEc2xrA3RodW1iBHRlc3QDNzAxBHdvZQM3MDQ4NDI-"  class=" y-today-grad1 y-today-ln-1 trans-border item" href="_ylt=AnUC5fW6uoqs74wrfadipymbvZx4;_ylc=X3oDMThwODY5dXRyBF9TAzIwMjM1MzgwNzUEYQMxMTA0MTEgbmV3cyBnYWRoYWZpIG51cnNlIHNwZWFrcyB0BGNwb3MDMwRnA2lkLTgyMzk2BGludGwDdXMEaXRjAzAEbHR4dANFeC1udXJzZXNwZWFrc291dG9uR2FkaGFmaQRwa2d2AzI0BHBvcwMwBHNlYwN0ZC1mZWF0BHNsawN0aHVtYmxpbmsEdGFyA2h0dHA6Ly9uZXdzLnlhaG9vLmNvbS9zL2RhaWx5YmVhc3QvMjAxMTA0MTEvdHNfZGFpbHliZWFzdC8xMzQyM19va3NhbmFiYWxpbnNrYXlhb25iZWluZ2xpYnlhc211YW1tYXJnYWRkYWZpc251cnNlBHRlc3QDNzAx/SIG=145i57t2s/EXP=1302719916/**http%3A//news.yahoo.com/s/dailybeast/20110411/ts_dailybeast/13423_oksanabalinskayaonbeinglibyasmuammargaddafisnurse" >
                            <span class="y-fp-pg-controls indicator"></span>
                            <img class=" image y-ln-2 y-bg-1" src="http://l1.yimg.com/a/i/ww/news/2011/04/11/041111gadhafi1-sm.jpg" alt="Moammar Gadhafi at United Nations Conference Hall in 2008. (AP/Sayyid Azim)"  title="Moammar Gadhafi at United Nations Conference Hall in 2008. (AP/Sayyid Azim)">
                            
                            <span class="medium item-label" style="font-family: inherit;line-height:inherit;">Ex-nurse speaks out on Gadhafi</span>

                        </a>                        <a y-pkgId="id-82648" data-b-tdh="_ylt=AnoFVd4ZtybQRVof53_FrfubvZx4;_ylu=X3oDMTNudmc1MGRvBGEDMTEwNDEyIG5ld3MgYmxvZyBjaXZpbCB3YXIgcGhvdG9zIHQEY3BvcwM0BGcDaWQtODI2NDgEaW50bAN1cwRpdGMDMgRwa2d2AzE1BHNlYwN0ZC1mZWF0BHNsawN0aHVtYgR0ZXN0AzcwMQR3b2UDNzA0ODQy"  class=" y-today-grad1 y-today-ln-1 trans-border item" href="_ylt=AnUC5fW6uoqs74wrfadipymbvZx4;_ylc=X3oDMThxZW1wbHN2BF9TAzIwMjM1MzgwNzUEYQMxMTA0MTIgbmV3cyBibG9nIGNpdmlsIHdhciBwaG90b3MgdARjcG9zAzQEZwNpZC04MjY0OARpbnRsA3VzBGl0YwMwBGx0eHQDUmFyZXNjZW5lc2Zyb21DaXZpbFdhcgRwa2d2AzE1BHBvcwMwBHNlYwN0ZC1mZWF0BHNsawN0aHVtYmxpbmsEdGFyA2h0dHA6Ly9uZXdzLnlhaG9vLmNvbS9zL3libG9nX25ld3Nyb29tLzIwMTEwNDEyL3VzX3libG9nX25ld3Nyb29tL3JhcmUtY2l2aWwtd2FyLXBob3Rvcy1kb2N1bWVudC1saWZlLWJldHdlZW4tYmF0dGxlcwR0ZXN0AzcwMQ--/SIG=1498tk9hp/EXP=1302719916/**http%3A//news.yahoo.com/s/yblog_newsroom/20110412/us_yblog_newsroom/rare-civil-war-photos-document-life-between-battles" >
                            <span class="y-fp-pg-controls indicator"></span>
                            <img class=" image y-ln-2 y-bg-1" src="http://l1.yimg.com/a/i/ww/news/2011/04/12/041211civilwar-sm.jpg" alt="Soldier's family in Civil War camp. (Apic – Getty Images)"  title="Soldier's family in Civil War camp. (Apic – Getty Images)">
                            
                            <span class="medium item-label" style="font-family: inherit;line-height:inherit;">Rare scenes from Civil War</span>
                        </a>            </li>            <li id="p_13872472-panel1" class="y-panel y-today-grad1 clearfix empty hide ">                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>            </li>            <li id="p_13872472-panel2" class="y-panel y-today-grad1 clearfix empty hide ">                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>            </li>            <li id="p_13872472-panel3" class="y-panel y-today-grad1 clearfix empty hide ">                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>            </li>            <li id="p_13872472-panel4" class="y-panel y-today-grad1 clearfix empty hide ">                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>            </li>            <li id="p_13872472-panel5" class="y-panel y-today-grad1 clearfix empty hide ">                <span 
class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>            </li>            <li id="p_13872472-panel6" class="y-panel y-today-grad1 clearfix empty hide ">                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>            </li>            <li id="p_13872472-panel7" class="y-panel y-today-grad1 clearfix empty hide ">                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>            </li>            <li id="p_13872472-panel8" class="y-panel y-today-grad1 clearfix empty hide ">                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>            </li>            <li id="p_13872472-panel9" class="y-panel y-today-grad1 clearfix empty hide ">                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>          
      <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>            </li>            <li id="p_13872472-panel10" class="y-panel y-today-grad1 clearfix empty hide ">                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>            </li>            <li id="p_13872472-panel11" class="y-panel y-today-grad1 clearfix empty hide ">                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>            </li>            <li id="p_13872472-panel12" class="y-panel y-today-grad1 clearfix empty hide ">                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>                <span class="y-today-grad1 y-today-ln-1 empty trans-border item"></span>            </li>

            </ol>
        </div>
Each anchor element carries long, convoluted attribute values that make it difficult to TAG the anchor directly. Looking for commonality, I noticed that within each anchor element is a SPAN element with a static class value of "medium item-label". We can easily TAG this element, which puts us in the middle of the anchor element. Normally, tagging this SPAN would click the link (because the SPAN is contained within the anchor <A> element), but since we don't necessarily want to follow the link, we'll add an EXTRACT parameter so that the TAG command extracts the SPAN's text instead of clicking it.

So now we've positioned ourselves in the middle of the anchor element (via the contained SPAN element), but the problem is that we can't extract the HREF of the anchor element unless we TAG it first. For this we'll use relative positioning. Because of the way relative positioning works, we first have to TAG the anchor element that follows the one we are currently positioned in (POS=R1), and then back up and TAG the previous anchor element relative to that one (POS=R-1), which is the anchor element we actually want. It sounds confusing, but stay with me.

Since we're going to be tagging a couple of anchor elements to finally get to the one we want, and we don't actually want to click on/follow those links, we'll be adding the EXTRACT parameter to each TAG command. But since we really only want the last extracted value, I will reset the internal !EXTRACT variable prior to the last TAG command.

Here's the macro:

Code: Select all

URL GOTO=http://www.yahoo.com
TAG POS=1 TYPE=SPAN ATTR=CLASS:"medium item-label" EXTRACT=TXT
TAG POS=R1 TYPE=A ATTR=TXT:* EXTRACT=HREF
SET !EXTRACT NULL
TAG POS=R-1 TYPE=A ATTR=TXT:* EXTRACT=HREF
URL GOTO={{!EXTRACT}}
I added the URL GOTO at the end just to demonstrate that the extracted URL takes you to the correct page. However, you may have noticed that in this approach, the full convoluted HREF of the anchor element is extracted.
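If you save that full HREF and only need the article address, one option is to post-process it outside of iMacros. The sketch below is plain Python (standard library only), not an iMacros feature. It assumes, based purely on the HTML sample above, that the article URL is the percent-encoded text after the last `**` in the HREF; that is an observation about this one page, not documented behavior. The function name `final_url_from_redirect` is made up for illustration.

```python
from urllib.parse import unquote


def final_url_from_redirect(href, marker="**"):
    """Return the percent-decoded text after the last `marker`, or None if absent."""
    _, found, target = href.rpartition(marker)
    if not found:
        # No marker in the string: treat it as "not a redirect-style HREF".
        return None
    return unquote(target)


# Tail of the first story's HREF from the HTML sample above. The long
# tracking prefix is left out because only the part after "**" matters here.
href_tail = (
    "/SIG=14k4r04jv/EXP=1302719916/**http%3A//rivals.yahoo.com/ncaa/basketball/"
    "blog/the_dagger/post/Why-BYU-has-asked-Jimmer-Fredette-to-stop-attend"
    "%3Furn=ncaab-wp2083"
)

print(final_url_from_redirect(href_tail))
```

The advantage over following the link is that the page never has to load; the drawback is that it depends on the HREF layout staying the same.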

If you just want the final URL of the article, then I would suggest the following simpler approach. This approach entails clicking the SPAN element and following the link to the article page, then extracting the URL from the internal !URLCURRENT variable:

Code: Select all

URL GOTO=http://www.yahoo.com
TAG POS=1 TYPE=SPAN ATTR=CLASS:"medium item-label"
WAIT SECONDS=1
SET !EXTRACT {{!URLCURRENT}}
BACK
PROMPT {{!EXTRACT}}
Here I added a BACK command to take you back to the home page followed by a PROMPT command to show you the extracted URL.

One final note: the TAG for the SPAN element selects the first (POS=1) story thumbnail in that panel. To select the second story you would specify POS=2, etc., or put it into a loop with POS={{!LOOP}}.
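To make the POS=1, POS=2, ... idea concrete outside of iMacros, here is a minimal Python sketch (standard library only) that walks a saved copy of the carousel HTML and pairs every "medium item-label" SPAN with the HREF of the anchor that contains it. The class name `LabelLinkParser`, the `SAMPLE` markup, and the example.com URLs are hypothetical stand-ins, not part of iMacros or of Yahoo's page; the real HREFs are the long tracking strings shown in the HTML sample above.

```python
from html.parser import HTMLParser


class LabelLinkParser(HTMLParser):
    """Pair each story label with the href of the anchor that contains it."""

    def __init__(self):
        super().__init__()
        self.current_href = None  # href of the <a> we are currently inside, if any
        self.in_label = False     # True while inside a "medium item-label" span
        self.pairs = []           # collected (label text, href) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.current_href = attrs.get("href")
        elif tag == "span" and attrs.get("class") == "medium item-label":
            self.in_label = True

    def handle_data(self, data):
        if self.in_label and data.strip():
            self.pairs.append((data.strip(), self.current_href))

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_label = False
        elif tag == "a":
            self.current_href = None


# Simplified, hypothetical markup with the same nesting as the carousel:
# an anchor wrapping an indicator span, a thumbnail image, and a label span.
SAMPLE = """
<a class="item selected" href="http://example.com/story-1">
  <span class="y-fp-pg-controls indicator"></span>
  <img src="http://example.com/thumb-1.jpg" alt="">
  <span class="medium item-label">BYU asks star to stay away</span>
</a>
<a class="item" href="http://example.com/story-2">
  <span class="y-fp-pg-controls indicator"></span>
  <img src="http://example.com/thumb-2.jpg" alt="">
  <span class="medium item-label">Osbournes hit with big tax bill</span>
</a>
"""

parser = LabelLinkParser()
parser.feed(SAMPLE)
for label, href in parser.pairs:
    print(label, "->", href)
```

The first pair corresponds to POS=1 in the macro, the second to POS=2, and so on. Note that the link lives on the enclosing anchor, not on the image or the SPAN, which is why extracting HREF from the IMG element itself returns the picture's address.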
Regards,

Tom, iMacros Support

Post by WindWalkerX » Thu Aug 23, 2012 1:57 am

I actually just encountered a similar problem again and was searching the forums and came across this :lol:
It helped, thanks very much for the assistance :D

Post by Tom, Tech Support » Thu Aug 23, 2012 10:02 am

You're welcome (again)!
Regards,

Tom, iMacros Support