EXTRACT plain text inside div but not any other tags

zumiez
Posts: 3
Joined: Wed May 02, 2018 9:34 pm

EXTRACT plain text inside div but not any other tags

Post by zumiez » Wed May 02, 2018 9:49 pm

HTML:

<div>
<h1>title</h1>
<div>address</div>
<div>extrainfo</div>
<div>location</div>
<div>phone</div>
<div>Website: <a href="https://www.google.ca">www.google.ca</a></div>
Random description and random products can be bought for $1.99 monday-friday
</div>

ISSUE:

The problem is the "Random description" line above. It's not within a tag like the other fields, so I can't grab only its data. Instead, extracting the master div grabs all of its content combined. Is there a way to grab only the plain text, or to exclude tags one level deeper?

MACRO:

SET !ERRORIGNORE YES
SET !TIMEOUT_STEP 1
SET !EXTRACT_TEST_POPUP NO
TAG SELECTOR="#content>DIV>DIV>DIV>DIV>DIV>DIV>DIV>H1" EXTRACT=TXT
TAG SELECTOR="#content>DIV>DIV>DIV>DIV>DIV>DIV>DIV>DIV" EXTRACT=TXT
TAG SELECTOR="#content>DIV>DIV>DIV>DIV>DIV>DIV>DIV>DIV:nth-of-type(2)" EXTRACT=TXT
TAG SELECTOR="#content>DIV>DIV>DIV>DIV>DIV>DIV>DIV>DIV:nth-of-type(3)" EXTRACT=TXT
TAG SELECTOR="#content>DIV>DIV>DIV>DIV>DIV>DIV>DIV>DIV:nth-of-type(4)" EXTRACT=TXT
TAG SELECTOR="#content>DIV>DIV>DIV>DIV>DIV>DIV>DIV>DIV:nth-of-type(5)>a" EXTRACT=HREF
TAG SELECTOR="#content>DIV>DIV>DIV>DIV>DIV>DIV>DIV" EXTRACT=TXT
SAVEAS TYPE=EXTRACT FOLDER=C:\savefilelocation FILE=*
chivracq
Posts: 7722
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: EXTRACT plain text inside div but not any other tags

Post by chivracq » Wed May 02, 2018 11:12 pm

CIM...! :mrgreen: (Read my Sig...)

But yep, different Methods are available for your Purpose..., that I will explain once you'll have mentioned your FCI... :wink:
(Even if I know already what Version you are using, from a Combination of Commands in your Script that are only supported in 2 Versions...)

And mini-Advice..., when using the 'TAG SELECTOR' or the 'EVENT' Mode, you should add Comments to your Script to explain (even to yourself...!) what your Script is doing exactly... If anything changes in the Layout of this Site, you won't know anymore what each Line is doing or was supposed to do if you ever need to modify it... :idea:
- (F)CIM = (Full) Config Info Missing: iMacros + Browser + OS with all 3 Versions...
- I usually don't even read the Question if that (required) Info is not mentioned...
- Script & URL usually help a lot for a more "educated" Help...
zumiez
Posts: 3
Joined: Wed May 02, 2018 9:34 pm

Re: EXTRACT plain text inside div but not any other tags

Post by zumiez » Wed May 02, 2018 11:30 pm

I didn't think versions, browsers or OS would matter for this, my bad!

I'm mostly using iMacros Sidebar for Internet Explorer (x64) Version 12.0.501.6698 to run the macro, and was using the latest version of Chrome to get some of the selector stuff, since I couldn't find the recording options in the IE version.
Windows 10 Pro 64-bit Version 1803

Good advice on the comments. I likely won't have to run it again for this particular site but I agree it is smart to comment scripts regardless.
chivracq
Posts: 7722
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: EXTRACT plain text inside div but not any other tags

Post by chivracq » Thu May 03, 2018 12:48 am

Ouf-ouf...!, you don't "really" need to quote yourself again, that doesn't "really" help for "Readability" of the Thread...

Yeah well, FCI not mentioned, I usually don't even read the Post, that's "simple", and certainly don't react/answer, or only once for first time Posters, ah-ah...!

So, OK, you are on iMacros for IE v12.0, IE... oh...!, IE Version not mentioned, grr...!, Win10-Pro_x64.

Hum, I'm a bit surprised to hear that the 'TAG SELECTOR' Mode works on IE, even with v12.0, I thought it was only supported on CR with v8.4.4 or v10.0.1. I had "tried" on FF with v8.9.7 and it was not supported at all, I took for "granted" it was the same with v12.0 for iMB and IE, and the Release Notes for v12.0 didn't mention anything about this Mode... But OK, you could be right, I've never checked in those Browsers/Versions that I don't use... :?

OK, "different Methods", I mentioned...
But hum, a bit surprised to see all the 'DIV''s you've posted from the Source do not have any Class you could use to identify them for the 'TAG POS' Mode, you might then be able to use Relative Positioning, on the 'H1' Field for example, or on the Web-Site. If that works, that would be the easiest...

Otherwise, you'll need indeed to first extract the whole Content of the Containing 'DIV' (or "master div" like you call it), either with 'EXTRACT=TXT' or 'EXTRACT=HTM', and for both, you could add an Extract on the Web-Site Name itself to reuse in a Temp Var to split the whole 'TXT' Data of the Containing 'DIV' into 2 parts with 'EVAL()' + 'split()', the 2nd Split after the Web-Site Name is the Data you are after...
If you do it on the 'EXTRACT=HTM', you don't need an extra Extract on the Web-Site Name, you could reuse directly the 'EXTRACT=HREF' you already have from the 5th Inner 'DIV'.
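
A rough sketch of the 'split()' on the Web-Site Name, written as plain JavaScript (the language 'EVAL()' runs). The function name `textAfterSiteName` and the sample string are made up for illustration, and the real 'EXTRACT=TXT' output may contain different whitespace or line breaks:

```javascript
// Hypothetical helper: return everything that follows the first
// occurrence of the web-site name in the text of the containing DIV.
function textAfterSiteName(fullText, siteName) {
  if (!siteName) return "";
  const parts = fullText.split(siteName);
  // If the name is not found, split() returns a single piece.
  if (parts.length < 2) return "";
  // Re-join the remaining pieces in case the name occurs again
  // inside the description itself.
  return parts.slice(1).join(siteName).trim();
}

// Assumed shape of the text extracted from the containing DIV.
const sampleText =
  "title\naddress\nextrainfo\nlocation\nphone\n" +
  "Website: www.google.ca\n" +
  "Random description and random products can be bought for $1.99 monday-friday";

console.log(textAfterSiteName(sampleText, "www.google.ca"));
```

This only works if the Web-Site Name does not also appear earlier in the text (in the title, for example); in that case the split would start too early.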

On 'EXTRACT=HTM', if you always get 5 Inner 'DIV''s, you could do the 'split()' between the "</div>" corresponding to the Web-Site 'DIV' and the final "</div>" closing the Containing 'DIV'.
And if the Number of Inner 'DIV''s might not be fixed, you might first need to count the Nb of "</div>" in the whole 'EXTRACT=HTM' to only keep the 'n-1'th Split, with 'n' being the Length of the Array returned by the 'split()'.
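
The counting variant could look roughly like this in plain JavaScript. It is only a sketch: `lastLooseText` is a made-up name, the sample HTML is modelled on the snippet posted above, and it assumes the loose text always sits directly before the closing "</div>" of the Containing 'DIV':

```javascript
// Hypothetical helper: split the HTML of the containing DIV on "</div>"
// and keep the 'n-1'th piece (index n - 2 when counting from 0),
// where n is the length of the array returned by split().
function lastLooseText(outerHtml) {
  const pieces = outerHtml.split(/<\/div>/i);
  const n = pieces.length;
  if (n < 2) return "";
  return pieces[n - 2].trim();
}

// Sample modelled on the posted snippet (5 inner DIVs).
const sampleHtml = [
  "<div>",
  "<h1>title</h1>",
  "<div>address</div>",
  "<div>extrainfo</div>",
  "<div>location</div>",
  "<div>phone</div>",
  '<div>Website: <a href="https://www.google.ca">www.google.ca</a></div>',
  "Random description and random products can be bought for $1.99 monday-friday",
  "</div>",
].join("\n");

console.log(lastLooseText(sampleHtml));
```

Because the index is derived from the array length, the same call keeps working when a page has fewer or more Inner 'DIV''s than five.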

OK, I hope you understand what I mean, it's always a bit difficult to explain with no URL posted, and you've probably truncated the Source you've posted anyway...
thecoder2012
Posts: 248
Joined: Sat Aug 15, 2015 5:14 pm
Location: Internet

Re: EXTRACT plain text inside div but not any other tags

Post by thecoder2012 » Thu May 03, 2018 1:13 am

zumiez wrote:I didn't think versions, browsers or OS would matter for this, my bad!
And your html code is incomplete.
zumiez wrote:Good advice on the comments. I likely won't have to run it again for this particular site but I agree it is smart to comment scripts regardless.
Short code for iMacros 8.9.7, Win8.1 and Waterfox 55 with your html code snippet:

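' Settings copied from the original macro
SET !ERRORIGNORE YES
SET !TIMEOUT_STEP 1
SET !EXTRACT_TEST_POPUP NO
' Extract the HTML of the first DIV (the containing DIV in the test snippet)
TAG POS=1 TYPE=DIV ATTR=* EXTRACT=HTM
' Split on "</div>" and keep piece [5]: the loose text after the 5th inner closing tag
SET newvalue EVAL("(\"{{!EXTRACT}}\".split(/<\/div>/))[5];")
' Display the result
PROMPT {{newvalue}}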

HTML:

<div>
<h1>title</h1>
<div>address</div>
<div>extrainfo</div>
<div>location</div>
<div>phone</div>
<div>Website: <a href="https://www.google.ca">www.google.ca</a></div>
Random description and random products can be bought for $1.99 monday-friday
</div>
Second Tab for Javascript in Chrome/IE is required in my eyes. But I have never used IE/Chrome in this case.
zumiez
Posts: 3
Joined: Wed May 02, 2018 9:34 pm

Re: EXTRACT plain text inside div but not any other tags

Post by zumiez » Thu May 03, 2018 11:24 pm

Sadly the important structure of the html is exactly as I posted it without ids or classes besides the parent #content div and above.

I tried the default recording instead, and it would sometimes grab other parts of the page since the child divs/fields are dynamic. The selector way seemed to at least keep the spacing once extracted and saved to a csv.

Was just hoping there was a built-in way to get text that is not inside an html tag.

I'll look into the syntax to potentially split the text. Thanks for the insight!
chivracq
Posts: 7722
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: EXTRACT plain text inside div but not any other tags

Post by chivracq » Fri May 04, 2018 12:04 am

Hum, OK then about the HTML Structure of the Page..., but you could still have used the "Standard" 'TAG' Mode with Relative Positioning for all 5 Inner 'DIV''s, with the 'H1' as Anchor.

For extracting Data, iMacros needs to be able to locate it, and the 'TAG' Mode needs that Data to be specified by a 'TYPE' Meta-Tag in the Source...
But hum, maybe the 'TAG XPATH' Mode can do it... I don't know I never use 'XPATH' and 'REGEX'...
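
For comparison, outside of the 'TAG' Mode, the DOM itself does expose such loose text as text nodes (node type 3) that are direct children of the Containing 'DIV'. A minimal JavaScript sketch follows; `ownText` is a made-up name, and whether and how such a function can be run from a given iMacros Version is not something this Thread establishes:

```javascript
// Node type of DOM text nodes (the value of Node.TEXT_NODE in browsers).
const TEXT_NODE = 3;

// Hypothetical helper: collect only the text that is a direct child of
// the element, skipping anything wrapped in child tags (H1, DIV, A, ...).
function ownText(element) {
  return Array.from(element.childNodes)
    .filter((node) => node.nodeType === TEXT_NODE)
    .map((node) => node.textContent.trim())
    .filter((text) => text.length > 0)
    .join(" ");
}

// In a browser this would be called with a real element, e.g. one
// returned by document.querySelector() with a selector for the
// containing DIV (the selector depends on the page).
```

Whitespace-only text nodes between the Inner 'DIV''s are dropped by the length check, so only the description line would remain for the posted snippet.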

But well, I gave you different Approaches to get what you want, and @thecoder2012 gave you a Code Example (even if it's not the "easiest" Syntax to reuse, I've produced many "easier to reuse Examples" if you search the Forum a bit), corresponding to my...:
chivracq wrote:On 'EXTRACT=HTM', if you always get 5 Inner 'DIV''s, you could do the 'split()' between the "</div>" corresponding to the Web-Site 'DIV' and the final "</div>" closing the Containing 'DIV'.
... which is the "easiest" Case, but it won't always work if you say some Fields are "dynamic", so I guess it's possible there are 2 Phone Nb's (or none) or less or more 'DIV''s..., then you'll need to implement the rest of what I mentioned..., either by counting, or splitting on the Web-Site Field...