How to extract plain text that hasn't got a html tag

reing
Posts: 4
Joined: Tue Apr 07, 2015 2:09 pm

How to extract plain text that hasn't got a html tag

Post by reing » Tue Apr 07, 2015 2:43 pm

Hi,

While I'm successful most of the time writing the code for iMacros, I'm having a problem now with the extraction of text from a website.
The text is not inside a html tag and has no id.

It looks like this:

Code: Select all

<html>
<body>
<table></table>
<h4></h4>
<br></br>
the leaves on this tree are green
<br></br>
</body>
</html>
So it's plain text that is not inside any HTML tag (besides the opening body and html tags), so I can't extract it using the TAG TYPE ATTR methods.
I was thinking about writing a script that would check for a certain keyword on the page and extract the whole sentence, but I failed to get it working.
A second possibility was to find the first <br> tag and extract the string of text after it, but I couldn't find out whether this is even possible.
Another possibility: since the sentence I want is always on the same line number in the source code, it may be possible to extract that specific line of source, but I couldn't figure out how to do that.

I hope someone can help me with this, it would be greatly appreciated.

I'm using win7, iMacros 8.9.2, Firefox 33.1.1
chivracq
Posts: 9507
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: How to extract plain text that hasn't got a html tag

Post by chivracq » Tue Apr 07, 2015 3:52 pm

Yep, no problem, you have several Solutions... (I get you all excited, I guess...!)

There is nothing wrong with using 'TYPE=BODY' together with 'EXTRACT=TXT', but your Page also contains a Table and other (HTML) Elements, so you'll end up extracting everything on the Page, and it might be difficult afterwards to isolate the Sentence you want. It's better to use 'EXTRACT=HTM' and then 'EVAL()' with 'split()' on "<br><br>":

Code: Select all

VERSION BUILD=8820413 RECORDER=FX
TAB T=1
URL GOTO=file:///C:/Users/Admin/Desktop/Green%20Leaves.htm
TAG POS=1 TYPE=BODY ATTR=* EXTRACT=HTM

'=> Extracted Text: "<body style="outline: 1px solid blue;"> <table></table> <h4></h4> <br><br> the leaves on this tree are green <br><br>  </body>"
SET Extracted_Sentence EVAL("var s='{{!EXTRACT}}'; var x; x=s.split('<br><br>'); x[1];")
PROMPT Extracted_Sentence:<BR>_{{Extracted_Sentence}}_
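To see why 'x[1]' is the wanted Sentence, here is a minimal standalone JavaScript sketch of what the 'split()' does with the Extract shown in the comment above. The function name 'secondPiece' and the extra 'trim()' are assumptions added for illustration; they are not part of the macro, and this has not been run inside 'EVAL()':

```javascript
// Body markup as reported in the "Extracted Text" comment of the macro above.
const extract = '<body style="outline: 1px solid blue;"> <table></table> <h4></h4> ' +
  '<br><br> the leaves on this tree are green <br><br>  </body>';

// Hypothetical helper: cut the string at every delimiter and return the
// piece between the first and the second delimiter (index 1).
function secondPiece(html, delimiter) {
  const parts = html.split(delimiter);
  // If the delimiter is missing, split() returns a single piece and parts[1]
  // would be undefined, so fall back to an empty string.
  // trim() (not in the original macro) removes the surrounding spaces.
  return parts.length > 1 ? parts[1].trim() : '';
}

const sentence = secondPiece(extract, '<br><br>');
console.log(sentence);
```

The same index logic applies in the macro: piece 0 is everything before the first "<br><br>", piece 1 is the Sentence, piece 2 is the rest of the Body.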
Another Solution, which works with your (simplified I suppose) HTML Source is to use the EVENT Mode with 'Shift^PageDown' and 'Ctrl^c' to build yourself your own Extract and copy it to the Clipboard... (But you may need to tweak it depending on other HTML Elements like your Table on the Page...):

Code: Select all

VERSION BUILD=8820413 RECORDER=FX
TAB T=1
URL GOTO=file:///C:/Users/Admin/Desktop/Green%20Leaves.htm
EVENT TYPE=CLICK SELECTOR="HTML>BODY" BUTTON=0
EVENT TYPE=KEYPRESS SELECTOR="HTML>BODY" KEY=34 MODIFIERS="shift"
EVENT TYPE=KEYPRESS SELECTOR="HTML>BODY" CHAR="c" MODIFIERS="ctrl"
SET Extracted_Sentence EVAL("var s='{{!CLIPBOARD}}'; var x=s.trim(); x;")
PROMPT Extracted<SP>Sentence:<BR>_{{Extracted_Sentence}}_
(Tested on iMacros for FF v8.8.2, Pale Moon v24.6.2 (=FF31), Win7-x64, using your 'Green Leaves.htm' HTML Page that I have re-uploaded to my Post...)
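If the copied Text contains more than the one Sentence (Title, Table content...), one possible tweak is to pick the wanted Line by Keyword, which is close to the keyword idea from the first Post. This is only a sketch in plain JavaScript: the sample Clipboard content and the function name 'pickLine' are made up for illustration, and it has not been tested inside 'EVAL()':

```javascript
// Hypothetical helper: split the copied text into lines, trim each one,
// drop the empty lines, and return the first line that contains the keyword.
function pickLine(clipboardText, keyword) {
  const lines = clipboardText
    .split(/\r?\n/)              // handle both Windows and Unix line endings
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const hit = lines.find((line) => line.includes(keyword));
  return hit === undefined ? '' : hit; // empty string when nothing matches
}

// Made-up clipboard content, only for the demonstration.
const sample = 'Title\r\n\r\n  the leaves on this tree are green  \r\nsome other text\r\n';
const picked = pickLine(sample, 'leaves');
console.log(picked);
```

This only helps if a Keyword is known to appear in the Sentence on every Page; otherwise picking the Line by its Position would be the alternative.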
Attachments
Green Leaves.zip
(251 Bytes) Downloaded 96 times
- (F)CI(M) = (Full) Config Info (Missing): iMacros + Browser + OS (+ all 3 Versions + 'Free'/'PE').
- I don't even read the Qt if that (required) Info is not mentioned...!
- Script & URL help a lot for more "educated" Help...

Re: How to extract plain text that hasn't got a html tag

Post by reing » Tue Apr 07, 2015 5:42 pm

Thank you for the quick and extensive reply. Excited indeed :D. Because I'm embedding this in a loop, the first solution is what I'm looking for. While it works for the code I posted above, it sadly doesn't work for my actual code :oops:. But I think we're almost there. The DOM in Firefox showed <br></br> tags, but the source code shows <br/> and <br /> tags, which are apparently not matched by the split() you used. Could that be the problem? I've tried several things but couldn't get it working.


So this is what the source code looks like:

Code: Select all

<html>
<body>
<h4>Title</h4>
<br/> <!-- without spaces--> 
the leaves on this tree are green<br /> <!-- with space after br-->
<font color="green">some other text</font><br /><br />
<table></table>
</body>
</html>

Could the macro be adapted to work with this code? That would be fantastic.

Re: How to extract plain text that hasn't got a html tag

Post by chivracq » Tue Apr 07, 2015 6:38 pm

Yep, but I need the exact Extract of the Body Element, and to know what changes and what remains constant between Loops, in order to build a Double 'split()' before and after your Sentence. I cannot use your "fake" "<!-- without spaces-->" / "<!-- with space after br-->" Comments, because 'split()' then searches for these exact Strings...
And I used 'split()' from your 1st Example because that was the easiest way, with "<br><br>" just in front of and just after your Sentence. If you don't want to post the URL of the Page or upload a few Samples of that Page, you need to build your own 'EVAL()' Statement(s) isolating your Sentence from the Extract using JavaScript String Functions...
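A Double 'split()' of that kind could look roughly like the sketch below, in plain JavaScript. Everything in it is an assumption for illustration: the function name 'between', the two Markers, and the sample Body String (guessed from the simplified Source posted above, not taken from a real Extract). The real Markers have to be whatever stays constant around the Sentence on every Page:

```javascript
// Hypothetical helper: return the text located between a "before" marker
// and the next "after" marker, using two successive split() calls.
function between(text, before, after) {
  const pieces = text.split(before);
  if (pieces.length < 2) return ''; // "before" marker not found
  // Re-join so that everything after the FIRST "before" marker is kept,
  // even if the marker occurs again later in the text.
  const rest = pieces.slice(1).join(before);
  // First piece of the second split = text up to the next "after" marker.
  // If "after" is missing, this is simply all of the remaining text.
  return rest.split(after)[0].trim();
}

// Assumed body markup, for the demonstration only.
const body = '<body> <h4>Title</h4> <br> the leaves on this tree are green<br> ' +
  '<font color="green">some other text</font><br><br> <table></table>  </body>';

const result = between(body, '</h4> <br>', '<br>');
console.log(result);
```

Inside a macro the same logic would have to be written on one line within 'EVAL()', with '{{!EXTRACT}}' in place of the sample String.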

Re: How to extract plain text that hasn't got a html tag

Post by reing » Tue Apr 07, 2015 7:29 pm

I think there is a slight misunderstanding, sorry. The loop does not have anything to do with the extraction, it just increases a page_id so it will extract the same string from the same position for that other page until the loop ends. For each page I save this string and the page_id into a csv file.

Code: Select all

 URL GOTO=https://SomeUrl.com/main.php?page_id={{!LOOP}}
After that your code follows.

The code underneath is an exact representation of the code on the webpage. With your previous macro, the content/HTML starting from the <table> tag is returned as the extract, which is also the case on the actual webpage (which I can't share here for privacy reasons). So it skips the string I want and returns the content after the last <br />, which is the table.

Code: Select all

<html>
<body>
<h4>Title</h4>
<br/>
the leaves on this tree are green<br /> 
<font color="green">some other text</font><br /><br />
<table></table>
</body>
</html>
I hope this explanation makes a solution to this problem possible?
ZmanZoo
Posts: 21
Joined: Mon Feb 16, 2015 8:24 pm

Re: How to extract plain text that hasn't got a html tag

Post by ZmanZoo » Tue Apr 07, 2015 8:15 pm

Have you tried recording? Start the recorder and just click on the words, and try different recording modes. The wizard in iMacros for IE works pretty well, and the result will most likely be compatible with your preferred browser after a few tweaks.

Re: How to extract plain text that hasn't got a html tag

Post by chivracq » Tue Apr 07, 2015 8:33 pm

Yep, there will always be some misunderstanding if I cannot play with the Page myself, as I mentioned... iMacros heavily depends on the HTML Structure of a Page; as soon as something changes on the Page, you have to adapt your Script. You can make it generic, but you need to investigate what changes/remains constant, like I've said already... But I gave you the Method and there is no difficulty in applying it to your 2nd Example:

In the first Example you gave, the HTM Extract returned:

Code: Select all

<body style="outline: 1px solid blue;"> <table></table> <h4></h4> <br><br> the leaves on this tree are green <br><br>  </body>
The Data you are looking for is surrounded before and after by "<br><br>", so I just used 'split()' on that Double "<br>" String:

Code: Select all

SET Extracted_Sentence EVAL("var s='{{!EXTRACT}}'; var x; x=s.split('<br><br>'); x[1];")
PROMPT Extracted_Sentence:<BR>_{{Extracted_Sentence}}_
In your 2nd Example, the HTM Extract returns:

Code: Select all

<body style="outline: 1px solid blue;"> <h4>Title</h4> <br> the leaves on this tree are green<br> <font color="green">some other text</font><br><br> <table></table>  </body>
The Data/Sentence you are after is now surrounded by only 1 "<br>" before and after, and it's still the 2nd Element ('x[1]') returned by a 'split()' on "<br>", so you just need to remove 1 "<br>" in the 'split()':

Code: Select all

SET Extracted_Sentence EVAL("var s='{{!EXTRACT}}'; var x; x=s.split('<br>'); x[1];")
PROMPT Extracted_Sentence:<BR>_{{Extracted_Sentence}}_
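Note that the plain "<br>" works here because, as the two Extracts above show, the HTM Extract contains "<br>" even where the Page Source has "<br/>" or "<br />". If the String to search ever comes from the raw Source instead, a Regular Expression that accepts all three Spellings is one option. This is a sketch in plain JavaScript with made-up names ('BR', 'pieceAfterFirstBreak'); it has not been tested inside 'EVAL()', where the Backslashes may need extra escaping:

```javascript
// Matches <br>, <br/> and <br /> (any amount of whitespace, case-insensitive).
const BR = /<br\s*\/?>/i;

// Hypothetical helper: return the piece between the first and second line break.
function pieceAfterFirstBreak(html) {
  const parts = html.split(BR);
  return parts.length > 1 ? parts[1].trim() : '';
}

// Markup as the HTM Extract reports it (browser serialization).
const fromDom = '<body style="outline: 1px solid blue;"> <h4>Title</h4> <br> ' +
  'the leaves on this tree are green<br> <font color="green">some other text</font>' +
  '<br><br> <table></table>  </body>';

// Markup as it appears in the page source posted earlier in the thread.
const fromSource = '<body>\n<h4>Title</h4>\n<br/>\nthe leaves on this tree are green<br /> \n' +
  '<font color="green">some other text</font><br /><br />\n<table></table>\n</body>';

console.log(pieceAfterFirstBreak(fromDom));
console.log(pieceAfterFirstBreak(fromSource));
```

Both calls isolate the same Sentence, so the Script no longer depends on how the line breaks happen to be written.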
Full Script:

Code: Select all

VERSION BUILD=8820413 RECORDER=FX
TAB T=1
URL GOTO=file:///C:/Users/Admin/Desktop/Green%20Leaves_2.htm
TAG POS=1 TYPE=BODY ATTR=* EXTRACT=HTM

'=> Extracted Text: "<body style="outline: 1px solid blue;"> <h4>Title</h4> <br> the leaves on this tree are green<br> <font color="green">some other text</font><br><br> <table></table>  </body>"

SET Extracted_Sentence EVAL("var s='{{!EXTRACT}}'; var x; x=s.split('<br>'); x[1];")
PROMPT Extracted_Sentence:<BR>_{{Extracted_Sentence}}_
(Tested on iMacros for FF v8.8.2, PM v24.6.2, Win7-x64.)

Tested on this File:
Attachments
Green Leaves_2.zip
(288 Bytes) Downloaded 93 times

Re: How to extract plain text that hasn't got a html tag

Post by reing » Tue Apr 07, 2015 8:47 pm

The single <br> statement in the split() did it. It works great. Thank you very much :D

Re: How to extract plain text that hasn't got a html tag

Post by chivracq » Tue Apr 07, 2015 8:50 pm

OK, good to hear... :D And I hope you understood the Method...