Extracting data of JSON

Discussions and Tech Support related to website data extraction, screen scraping and data mining using iMacros.
Forum rules
iMacros EOL - Attention!

The renewal maintenance has officially ended for Progress iMacros effective November 20, 2023 and all versions of iMacros are now considered EOL (End-of-Life). The iMacros products will no longer be supported by Progress (aside from customer license issues), and these forums will also no longer be moderated from the Progress side.

Thank you again for your business and support.

Sincerely,
The Progress Team

Before asking a question or reporting an issue:
1. Please review the list of FAQ's.
2. Use the search box (at the top of each forum page) to see if a similar problem or question has already been addressed.
3. Try searching the iMacros Wiki - it contains the complete iMacros reference as well as plenty of samples and tutorials.
4. We can respond much faster to your posts if you include the following information: CLICK HERE FOR IMPORTANT INFORMATION TO INCLUDE IN YOUR POST
kevnad
Posts: 25
Joined: Wed Mar 02, 2011 3:22 pm

Re: Extracting data of JSON

Post by kevnad » Wed Oct 31, 2018 12:58 pm

kevnad wrote: I'm trying to use the Search command with a REGEXP and EXTRACT option. Still trying to figure out the REGEX to get everything until the sondage section. I think it might work with that since the ' character will not be present.

I'll let you know how it goes!

THanks

Think I got it

SEARCH SOURCE=REGEXP:"messages(.*La Caisse)" EXTRACT=$1

So I extract everything from messages (that is at the start) and until the string "La Caisse" wich is at the start of the sondage section.

I will try to find something better than La Caisse, cause I believe that in the future, it might be elsewhere in the page.

Thanks for you help! greatly appreciated.
chivracq
Posts: 10301
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: Extracting data of JSON

Post by chivracq » Wed Oct 31, 2018 6:43 pm

kevnad wrote:
chivracq wrote:
1- Do you still get the (full) Extract in the 'EXTRACT_TEST_POPUP'...?

2- Does "PROMPT _{{!EXTRACT}}_" work...?

3- Same Qt's with 'EXTRACT=HTM' instead of 'EXTRACT=TXT'...?

4- Does your 'PROMPT' on 'EVAL()' work using 'EXTRACT=HTM'...?:

Code: Select all

TAG POS=1 TYPE=BODY ATTR=* EXTRACT=HTM
SET URL_nav EVAL("var s='{{!EXTRACT}}'; s;")
PROMPT URL_nav:<BR>_{{URL_nav}}_
5- & 6- Can you try a 'SAVEAS' both with 'EXTRACT=TXT' and '=HTM'...? (In order to bypass the 'PROMPT'...)
Example with 'EXTRACT=TXT':

Code: Select all

TAG POS=1 TYPE=BODY ATTR=* EXTRACT=TXT
SET URL_nav EVAL("var s='{{!EXTRACT}}'; s;")
PROMPT URL_nav:<BR>_{{URL_nav}}_
SET !EXTRACT {{URL_nav}}
SAVEAS TYPE=EXTRACT FOLDER=* FILE=JSON_Test_{{!NOW:yyyy-mm-dd_hhhnn}}.txt
1- Yes, I have the full extract with the EXTRACT_TEST_POPUP
2- Yes, the prompt extract works fine
3- Same problem with EXTRACT HTM than TXT
4- No, the EVAL does not work with EXTRACT HTM
5-6 The SAVEAS works fine.

Really, the issue is with the EVAL command.

I'm trying to use the Search command with a REGEXP and EXTRACT option. Still trying to figure out the REGEX to get everything until the sondage section. I think it might work with that since the ' character will not be present.

I'll let you know how it goes!

THanks
Hum, if 5 & 6 are working, that means that the '!EXTRACT' "transiting" through 'EVAL()' is not the Pb..., and the "End_Of_Page" Trick I mentioned might/should work, I would think... 8)

Ouf-ouf...!, 'SEARCH' + 'REGEX' is indeed a "good Idea", but hum, good luck, ah-ah...!, ['EXTRACT' + 'EVAL()' + 'split()'] is actually my "Workaround" that I find easier to use and more powerful than ['SEARCH' = 'REGEX'] that I find "a bit too complicated", ah-ah...! :oops:
kevnad wrote:
kevnad wrote: I'm trying to use the Search command with a REGEXP and EXTRACT option. Still trying to figure out the REGEX to get everything until the sondage section. I think it might work with that since the ' character will not be present.

I'll let you know how it goes!

THanks
Think I got it

Code: Select all

SEARCH SOURCE=REGEXP:"messages(.*La Caisse)" EXTRACT=$1
So I extract everything from messages (that is at the start) and until the string "La Caisse" wich is at the start of the sondage section.

I will try to find something better than La Caisse, cause I believe that in the future, it might be elsewhere in the page.

Thanks for you help! greatly appreciated.
OK, I had a bit "missed" this last Post that came onto a 3rd Page in the Thread and I implemented in the meantime my "End_Of_Page" Solution...:

Code: Select all

VERSION BUILD=8820413 RECORDER=FX
SET !EXTRACT_TEST_POPUP NO
SET !ERRORIGNORE YES
TAB T=1
'URL GOTO=file:///D:/TEMP/iMacros/Temp/_Forum%20Cases/kevnad/extract.json
'URL GOTO=file:///D:/TEMP/iMacros/Temp/_Forum%20Cases/kevnad/extractv2.json

'Easy Access Vars:
SET Descr_1 "yyyyyy-EOP"
'SET Descr_1 "-EOP"
SET Descr_2 "C.D. DU VIEUX-LONGUEUIL"
SET End_Of_Page "sectionCartesPretsMarges"


'Extract the full Content of the JSON File:
'TAG POS=1 TYPE=PRE ATTR=TXT:{<SP>"messages":[<SP>],<SP>"detention":{<SP>"messages":[* EXTRACT=TXT
TAG POS=1 TYPE=BODY ATTR=* EXTRACT=TXT

'Truncate the Extract to remove the "sondage" Section which contain a _'_  that seems to be problematic with 'EVAL()':
'=> 2 Methods, using 'split()' or 'indexOf()':
SET Extract_Trunc_1 EVAL("var s='{{!EXTRACT}}', eop='{{End_Of_Page}}', x,y,z; x=s.split(eop); z=x[0]; z;")
'PROMPT Extract_Trunc_1:<BR><BR>_{{Extract_Trunc_1}}_
'>
SET Extract_Trunc_2 EVAL("var s='{{!EXTRACT}}', eop='{{End_Of_Page}}', x,y,z; x=s.indexOf(eop); z=s.substring(0,x); z;")
'PROMPT Extract_Trunc_2:<BR><BR>_{{Extract_Trunc_2}}_


'Isolate the 'URL_nav', different Methods...:
'*********************************************
'=> Directly on the Extract (which seems to be problematic with the "sondage" Section):
'SET URL_nav EVAL("var s='{{!EXTRACT}}', d1='{{Descr_1}}', d2='{{Descr_2}}', x,y,z; x=s.split(d1)[1].split(d2); y=x[1].split('http'); z='http'+y[1].split('\"')[0]; z;")
'SET URL_nav EVAL("var s='{{!EXTRACT}}', d1='{{Descr_1}}', d2='{{Descr_2}}', a,b,c,d,e,z; a=s.split(d1); b=a[1].split(d2); c=b[1].split('http'); d=c[1].split('\"'); e=d[0]; z='http'+e; z;")
'>
'=> Using the truncated Extract:
'SET URL_nav EVAL("var s='{{Extract_Trunc_2}}', d1='{{Descr_1}}', d2='{{Descr_2}}', x,y,z; x=s.split(d1)[1].split(d2); y=x[1].split('http'); z='http'+y[1].split('\"')[0]; z;")
SET URL_nav EVAL("var s='{{Extract_Trunc_2}}', d1='{{Descr_1}}', d2='{{Descr_2}}', a,b,c,d,e,z; a=s.split(d1); b=a[1].split(d2); c=b[1].split('http'); d=c[1].split('\"'); e=d[0]; z='http'+e; z;")

PROMPT URL_nav:<BR>_{{URL_nav}}_
(Tested on iMacros for FF v8.8.2, PM v26.3.3 (=FF47), Win10_x64.)

You may want to re-enable the 2x 'PROMPT' about 'Extract_Trunc_[1|2]', and see if the 4th Method I chose is working for you... (They all 4 work for me.)

But your 'SEARCH' Solution is quite nice as well, (I'm impressed actually, ah-ah...! :D ), and I reckon you could use like me the "sectionCartesPretsMarges" String that I used as the "End_Of_Page" instead of your "La Caisse". 8)
- (F)CI(M) = (Full) Config Info (Missing): iMacros + Browser + OS (+ all 3 Versions + 'Free'/'PE'/'Trial').
- FCI not mentioned: I don't even read the Qt...! (or only to catch Spam!)
- Script & URL help a lot for more "educated" Help...
kevnad
Posts: 25
Joined: Wed Mar 02, 2011 3:22 pm

Re: Extracting data of JSON

Post by kevnad » Thu Nov 01, 2018 12:21 pm

I test some of your exemple, but anytime I use the EVAL and the problematic character is in the string, then it fail.

Must be something in the imacro browser.

I will open a support ticket to let them know of this issue.

seems like the ' character, once passe in the EVAL statement is seen as a remark/comment trigger so the rest of the statement, starting with the ' is seen as a Remark/Comment

But with the search command, then I don't have the ' character in the extract, so everything works fine.

Thanks again!
chivracq
Posts: 10301
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: Extracting data of JSON

Post by chivracq » Thu Nov 01, 2018 9:22 pm

kevnad wrote:I test some of your exemple, but anytime I use the EVAL and the problematic character is in the string, then it fail.

Must be something in the imacro browser.

I will open a support ticket to let them know of this issue.

seems like the ' character, once passe in the EVAL statement is seen as a remark/comment trigger so the rest of the statement, starting with the ' is seen as a Remark/Comment

But with the search command, then I don't have the ' character in the extract, so everything works fine.

Thanks again!
OK, still a bit "strange" then...

You nearly got "lucky" then that this Apostrophe/Single Quote (very common Char in French) only occurs once in the whole JSON File, and only after the Data you are interested in...

I don't think your Analysis about the _'_ being treated as a 'Comment' Sign/Command is correct, that Command is only "understood" by iMacros at the '.iim' Language/Script Level, (and the iMacros Editor is a bit "Buggy" with it btw, ah-ah...!, quite funny to see half-Lines coloured like Comments when an Apostrophe or Single Quote occurs in some 'ATTR' or 'CONTENT' Parameter or even in 'EVAL()' indeed...), what happens "inside" 'EVAL()' is handled by the JS Engine of the Browser and I suspect that it expects a 2nd Single Quote after that one and gets a bit "mixed up" because there is an uneven Nb of Single Quotes and/or would expect that Single Quote to have been escaped by iMacros and the 'EXTRACT' Mechanism when passing it through the '!EXTRACT' Var to the 'EVAL()', which apparently doesn't happen, maybe as the "Source File" is a Local '.JSON' File, treated like a '.TXT' File and handled by the Browser through the 'file:///' Protocol and not the 'http(s)://' Protocol..., and FF (or Pale Moon that I use) seem(s) to handle those 2 Protocols a bit differently...

Pb might come from Char Encoding, there is already some Char Encoding "Glitch" in the Process of creating the '.JSON' File as you can see from "Profil et préférences" + "données" + "Dépôt mobile" for example with all Accentuated Chars. :oops:
The File looks to me like it is already 'UTF-8' encoded, which should be correct, I think, but maybe the Char Encoding Settings from the Browser need to be "tuned"... I was using "Unicode (UTF-8)" also on PM26... :idea:

But OK, I'll be curious to hear what TechSup say, maybe include a Link to this Thread when you open your Ticket (and from where they could download your '.JSON' File anyway), and it will be "nice" if you can post their Reply in this current Thread... :wink:
- (F)CI(M) = (Full) Config Info (Missing): iMacros + Browser + OS (+ all 3 Versions + 'Free'/'PE'/'Trial').
- FCI not mentioned: I don't even read the Qt...! (or only to catch Spam!)
- Script & URL help a lot for more "educated" Help...
kevnad
Posts: 25
Joined: Wed Mar 02, 2011 3:22 pm

Re: Extracting data of JSON

Post by kevnad » Fri Nov 02, 2018 12:04 pm

chivracq wrote: OK, still a bit "strange" then...

You nearly got "lucky" then that this Apostrophe/Single Quote (very common Char in French) only occurs once in the whole JSON File, and only after the Data you are interested in...

I don't think your Analysis about the _'_ being treated as a 'Comment' Sign/Command is correct, that Command is only "understood" by iMacros at the '.iim' Language/Script Level, (and the iMacros Editor is a bit "Buggy" with it btw, ah-ah...!, quite funny to see half-Lines coloured like Comments when an Apostrophe or Single Quote occurs in some 'ATTR' or 'CONTENT' Parameter or even in 'EVAL()' indeed...), what happens "inside" 'EVAL()' is handled by the JS Engine of the Browser and I suspect that it expects a 2nd Single Quote after that one and gets a bit "mixed up" because there is an uneven Nb of Single Quotes and/or would expect that Single Quote to have been escaped by iMacros and the 'EXTRACT' Mechanism when passing it through the '!EXTRACT' Var to the 'EVAL()', which apparently doesn't happen, maybe as the "Source File" is a Local '.JSON' File, treated like a '.TXT' File and handled by the Browser through the 'file:///' Protocol and not the 'http(s)://' Protocol..., and FF (or Pale Moon that I use) seem(s) to handle those 2 Protocols a bit differently...

Pb might come from Char Encoding, there is already some Char Encoding "Glitch" in the Process of creating the '.JSON' File as you can see from "Profil et préférences" + "données" + "Dépôt mobile" for example with all Accentuated Chars. :oops:
The File looks to me like it is already 'UTF-8' encoded, which should be correct, I think, but maybe the Char Encoding Settings from the Browser need to be "tuned"... I was using "Unicode (UTF-8)" also on PM26... :idea:

But OK, I'll be curious to hear what TechSup say, maybe include a Link to this Thread when you open your Ticket (and from where they could download your '.JSON' File anyway), and it will be "nice" if you can post their Reply in this current Thread... :wink:
Yes, very lucky indeed and some update to the web site might screw everything we have been able to achieve right now.

The case number with the support is 00460809. I have send them a complete iim and json file to test on their end.

I include a link to this thread.

Will keep you posted!
chivracq
Posts: 10301
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: Extracting data of JSON

Post by chivracq » Fri Nov 02, 2018 6:04 pm

kevnad wrote:Yes, very lucky indeed and some update to the web site might screw everything we have been able to achieve right now.

The case number with the support is 00460809. I have send them a complete iim and json file to test on their end.

I include a link to this thread.

Will keep you posted!
Yep, because the Apostrophe is a very common Char in French, I was nearly surprised when I searched the File on that Char to only find 1 Occurrence indeed... Lucky your "C.D. DU VIEUX-LONGUEUIL" is not called ""L'AGENCE DU VIEUX-LONGUEUIL" for example, ah-ah...! :twisted:
Well..., there are always Workarounds, but pfff..., becomes quickly cumbersome...

OK for the Support Ticket Ref/Nb, even if I can't "do" anything with it and check it directly myself (I don't have any "special" Rights in the Internal Support Tracking System, I can only access the Tickets I have opened myself) but I could still ask Tom from TechSup about it after a while if I don't hear about it anymore... :wink:
- (F)CI(M) = (Full) Config Info (Missing): iMacros + Browser + OS (+ all 3 Versions + 'Free'/'PE'/'Trial').
- FCI not mentioned: I don't even read the Qt...! (or only to catch Spam!)
- Script & URL help a lot for more "educated" Help...
kevnad
Posts: 25
Joined: Wed Mar 02, 2011 3:22 pm

Re: Extracting data of JSON

Post by kevnad » Wed Nov 28, 2018 6:37 pm

I did not forget about you !

Just got word back from support, and the solution is very simple...

Basically, we need to use double quote instead of single quote (because there's a single quote in the text). and to use double quote in javascript, we need to use the \ as en Escaped character

All we need to change is s='{{!EXTRACT}}' to s=\"{{!EXTRACT}}\".

Complete answer from support :

Hello,

So per Tom the error is occurring because you are using single-quotes to enclose a string value in the Javascript code, but the data that is extracted also contains a single-quote (apostrophe) within it, so this gets interpreted as the ending of the string value rather than as part of the string itself. Javascript allows both single-quote and double-quote characters to be used for string literals in the code, so in this case, all that is needed is to use double-quotes in the Javascript code to reference the extracted data and it will work as expected. Since the EVAL command itself uses double-quotes to enclose the entire Javascript code, any embedded double-quotes must be escaped using a backslash (\) character.

Here is the revised line 12:

SET URL_DETAILCOMPTE EVAL("var s=\"{{!EXTRACT}}\", d1='{{Descr_1}}', d2='{{Descr_2}}', x,y,z; x=s.split(d1)[1].split(d2); y=x[1].split('http'); z='http'+y[1].split('\"')[0]; z;")

All that was changed s='{{!EXTRACT}}' to s=\"{{!EXTRACT}}\".

Thank you,
chivracq
Posts: 10301
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: Extracting data of JSON

Post by chivracq » Thu Nov 29, 2018 5:46 am

kevnad wrote:I did not forget about you !

Just got word back from support, and the solution is very simple...

Basically, we need to use double quote instead of single quote (because there's a single quote in the text). and to use double quote in javascript, we need to use the \ as en Escaped character

All we need to change is s='{{!EXTRACT}}' to s=\"{{!EXTRACT}}\".

Complete answer from support :

Hello,

So per Tom the error is occurring because you are using single-quotes to enclose a string value in the Javascript code, but the data that is extracted also contains a single-quote (apostrophe) within it, so this gets interpreted as the ending of the string value rather than as part of the string itself. Javascript allows both single-quote and double-quote characters to be used for string literals in the code, so in this case, all that is needed is to use double-quotes in the Javascript code to reference the extracted data and it will work as expected. Since the EVAL command itself uses double-quotes to enclose the entire Javascript code, any embedded double-quotes must be escaped using a backslash (\) character.

Here is the revised line 12:

Code: Select all

SET URL_DETAILCOMPTE EVAL("var s=\"{{!EXTRACT}}\", d1='{{Descr_1}}', d2='{{Descr_2}}', x,y,z; x=s.split(d1)[1].split(d2); y=x[1].split('http'); z='http'+y[1].split('\"')[0]; z;") 
All that was changed s='{{!EXTRACT}}' to s=\"{{!EXTRACT}}\".

Thank you,
Oh...!, nice to hear from you again..., and Thanks for posting this Update of course... :D

Hum, I'm not completely "convinced", it shouldn't make a Difference, and it works fine "for me" on FF using v8.8.2 + v8.9.7, so that means that iMB (v12 that you are using) is treating Single/Double Quotes a bit differently around Vars in 'EVAL()', and not escaping them automatically inside a Var like that happens on FF.

I've been using Single Quotes for years inside 'EVAL()' Statements around iMacros Vars and inside JS Functions without any Pb, and alternating Single with Double Quotes when you have several Levels of Double Quotes (as the inner Double Quotes then need to be Double/Triple Escaped otherwise...), but Escaped Double Quotes is indeed the (only) "official way" documented in the Wiki... => Fair enough then, good to know this Difference between different Browsers and different flavours of iMacros... :wink:
- (F)CI(M) = (Full) Config Info (Missing): iMacros + Browser + OS (+ all 3 Versions + 'Free'/'PE'/'Trial').
- FCI not mentioned: I don't even read the Qt...! (or only to catch Spam!)
- Script & URL help a lot for more "educated" Help...
Post Reply