Extraction of Table from a web page

jyotirmaya
Posts: 41
Joined: Wed Jul 27, 2016 6:25 pm

Extraction of Table from a web page

Post by jyotirmaya » Mon May 04, 2020 4:48 am

I am using the browser Firefox 56.0
iMacros for Firefox 8.9.7
Windows 7 Professional 64-bit Operating system

I am using the below code to extract data from a website

Code: Select all

VERSION BUILD=8970419 RECORDER=FX
'Uses a Windows script to submit several datasets to a website, e. g. for filling an online database
TAB T=1     
' Specify input file (if !COL variables are used, IIM automatically assumes a CSV format for the input file)
'CSV = Comma Separated Values in each line of the file
SET !DATASOURCE TEST.csv

'SET !DATASOURCE_COLUMNS 2
'Start at line 2 to skip the header in the file
SET !LOOP 2
'Increase the current position in the file with each loop 
SET !DATASOURCE_LINE {{!LOOP}}
SET !EXTRACT_TEST_POPUP NO
SET My_Data EVAL("var s='{{!EXTRACT}}'; var x,y,z; z=s.split('[EXTRACT]').join('<BR>'); z;")
TAG POS=1 TYPE=LEGEND FORM=ID:aspnetForm ATTR=TXT:Select<SP>Location<SP>for<SP>RoR
TAG POS=1 TYPE=TD ATTR=TXT:District
TAG POS=1 TYPE=SELECT FORM=ID:aspnetForm ATTR=ID:ctl00_ContentPlaceHolder1_ddlDistrict CONTENT=%5
WAIT SECONDS=3
TAG POS=1 TYPE=TD ATTR=TXT:Tahasil
TAG POS=1 TYPE=SELECT FORM=ID:aspnetForm ATTR=ID:ctl00_ContentPlaceHolder1_ddlTahsil CONTENT=%4
WAIT SECONDS=3
TAG POS=1 TYPE=TD ATTR=TXT:Village
TAG POS=1 TYPE=SELECT FORM=ID:aspnetForm ATTR=ID:ctl00_ContentPlaceHolder1_ddlVillage CONTENT=%62
WAIT SECONDS=3
TAG POS=1 TYPE=SPAN ATTR=ID:ctl00_ContentPlaceHolder1_lblColumnName
TAG POS=1 TYPE=SELECT FORM=ID:aspnetForm ATTR=ID:ctl00_ContentPlaceHolder1_ddlBindData CONTENT=%{{!COL1}}
WAIT SECONDS=1
TAG POS=1 TYPE=INPUT:SUBMIT FORM=ID:aspnetForm ATTR=ID:ctl00_ContentPlaceHolder1_btnRORFront
'TAG POS=1 TYPE=DIV ATTR=TXT:Schedule<SP>I<SP>Form<SP>No.39-A
'TAG POS=1 TYPE=TD ATTR=TXT:ଥାନା<SP>ନମ୍ବର<SP>:<SP>"149"
'Anchor:
TAG POS=1 TYPE=TD ATTR=TXT:ଜମିଦାରଙ୍କ<SP>ନାମ<SP>ଓ<SP>ଖେୱାଟ<SP>ବା<SP>ଖତିୟାନର<SP>କ୍ରମିକ*
'TAG POS=1 TYPE=TD ATTR=TXT:1)<SP>ଖତିୟାନର<SP>କ୍ରମିକ<SP>ନମ୍ବର
SET !EXTRACT NULL
TAG POS=R3 TYPE=TD ATTR=TXT:* EXTRACT=TXT
SET My_Data {{!EXTRACT}}
'TAG POS=1 TYPE=TD ATTR=TXT:1
SET !EXTRACT NULL
TAG POS=R1 TYPE=TD ATTR=TXT:* EXTRACT=TXT
ADD My_Data {{!EXTRACT}}
'TAG POS=1 TYPE=TD ATTR=TXT:2)<SP>ପ୍ରଜାର<SP>ନାମ,<SP>ପିତାର<SP>ନାମ,<SP>ଜାତି<SP>ଓ<SP>ବାସସ୍ଥ*
SET !EXTRACT NULL
TAG POS=R1 TYPE=TD ATTR=TXT:* EXTRACT=TXT
ADD My_Data {{!EXTRACT}}
'TAG POS=1 TYPE=TD ATTR=TXT:3)<SP>ସ୍ଵତ୍ଵ
TAG POS=R1 TYPE=TD ATTR=TXT:* EXTRACT=TXT
ADD My_Data {{!EXTRACT}}
'TAG POS=1 TYPE=TD ATTR=TXT:ସ୍ଥିତିବାନ
SET !EXTRACT NULL
TAG POS=R1 TYPE=TD ATTR=TXT:* EXTRACT=TXT
ADD My_Data {{!EXTRACT}}
'TAG POS=1 TYPE=TD ATTR=TXT:ସ୍ଥିତିବାନ
SET !EXTRACT NULL
TAG POS=1 TYPE=SPAN ATTR=ID:gvfront_ctl02_lblSpecialCase EXTRACT=TXT
ADD My_Data {{!EXTRACT}}    
'PROMPT {{!EXTRACT}}
'PROMPT {{My_Data}}
SET !CLIPBOARD {{My_Data}}
TAG POS=1 TYPE=INPUT:SUBMIT FORM=ID:form1 ATTR=ID:btnBackPg
TAG POS=2 TYPE=TABLE ATTR=* EXTRACT=TXT
SET TABLE EVAL("'{{!EXTRACT}}'.trim()")
TAG POS=1 TYPE=INPUT:SUBMIT FORM=ID:form1 ATTR=ID:btnKhatiyan
TAG POS=1 TYPE=TD ATTR=TXT:District

Up to this line, the code copies the data from the front page:

Code: Select all

     TAG POS=1 TYPE=INPUT:SUBMIT FORM=ID:form1 ATTR=ID:btnBackPg
After that, I use this code to copy the table on the back of the page:

Code: Select all

 TAG POS=2 TYPE=TABLE ATTR=* EXTRACT=TXT
        SET TABLE EVAL("'{{!EXTRACT}}'.trim()")
        
       TAG POS=1 TYPE=INPUT:SUBMIT FORM=ID:form1 ATTR=ID:btnKhatiyan
        TAG POS=1 TYPE=TD ATTR=TXT:District
But I am only able to extract the front-page data; the code after the click on the Back button doesn't work. When I run only this code:

Code: Select all

 TAG POS=2 TYPE=TABLE ATTR=* EXTRACT=TXT
        SET TABLE EVAL("'{{!EXTRACT}}'.trim()")
It extracts the data but not in table format.
[Screenshot: Capture.JPG (table)]
I want to extract the marked data below the column headers 7, 8, 9, 10, 11, 12, down to the end of the page. On some pages the table has hundreds of rows to extract. What should I change in the code?
Attachments
html page.zip
(173.56 KiB) Downloaded 44 times
Last edited by jyotirmaya on Mon May 04, 2020 1:15 pm, edited 2 times in total.
chivracq
Posts: 9425
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: Extraction of Table from a web page

Post by chivracq » Mon May 04, 2020 10:55 am

jyotirmaya wrote:
Mon May 04, 2020 4:48 am
I am using

Code: Select all

Bowser Firefox 56.0
iMacros for Firefox 8.9.7
Windows 7 Professional 64-bit Operating system
I am using the below code to extract data from a website

Code: Select all

VERSION BUILD=8970419 RECORDER=FX
'Uses a Windows script to submit several datasets to a website, e. g. for filling an online database
TAB T=1     
' Specify input file (if !COL variables are used, IIM automatically assume a CSV format of the input file
'CSV = Comma Separated Values in each line of the file
SET !DATASOURCE TEST.csv

'SET !DATASOURCE_COLUMNS 2
'Start at line 2 to skip the header in the file
SET !LOOP 2
'Increase the current position in the file with each loop 
SET !DATASOURCE_LINE {{!LOOP}}
        SET !EXTRACT_TEST_POPUP NO
        SET My_Data EVAL("var s='{{!EXTRACT}}'; var x,y,z; z=s.split('[EXTRACT]').join('<BR>'); z;")


        TAG POS=1 TYPE=LEGEND FORM=ID:aspnetForm ATTR=TXT:Select<SP>Location<SP>for<SP>RoR
        TAG POS=1 TYPE=TD ATTR=TXT:District
        TAG POS=1 TYPE=SELECT FORM=ID:aspnetForm ATTR=ID:ctl00_ContentPlaceHolder1_ddlDistrict CONTENT=%5
	WAIT SECONDS=3
        TAG POS=1 TYPE=TD ATTR=TXT:Tahasil
        TAG POS=1 TYPE=SELECT FORM=ID:aspnetForm ATTR=ID:ctl00_ContentPlaceHolder1_ddlTahsil CONTENT=%4
  	WAIT SECONDS=3
        TAG POS=1 TYPE=TD ATTR=TXT:Village
        TAG POS=1 TYPE=SELECT FORM=ID:aspnetForm ATTR=ID:ctl00_ContentPlaceHolder1_ddlVillage CONTENT=%62
        WAIT SECONDS=3

        
TAG POS=1 TYPE=SPAN ATTR=ID:ctl00_ContentPlaceHolder1_lblColumnName
        TAG POS=1 TYPE=SELECT FORM=ID:aspnetForm ATTR=ID:ctl00_ContentPlaceHolder1_ddlBindData CONTENT=%{{!COL1}}
        WAIT SECONDS=1
        TAG POS=1 TYPE=INPUT:SUBMIT FORM=ID:aspnetForm ATTR=ID:ctl00_ContentPlaceHolder1_btnRORFront
        'TAG POS=1 TYPE=DIV ATTR=TXT:Schedule<SP>I<SP>Form<SP>No.39-A
        'TAG POS=1 TYPE=TD ATTR=TXT:ଥାନା<SP>ନମ୍ବର<SP>:<SP>"149"
        'Anchor:
        TAG POS=1 TYPE=TD ATTR=TXT:ଜମିଦାରଙ୍କ<SP>ନାମ<SP>ଓ<SP>ଖେୱାଟ<SP>ବା<SP>ଖତିୟାନର<SP>କ୍ରମିକ*
        'TAG POS=1 TYPE=TD ATTR=TXT:1)<SP>ଖତିୟାନର<SP>କ୍ରମିକ<SP>ନମ୍ବର
        SET !EXTRACT NULL
        TAG POS=R3 TYPE=TD ATTR=TXT:* EXTRACT=TXT
        SET My_Data {{!EXTRACT}}
        'TAG POS=1 TYPE=TD ATTR=TXT:1
        SET !EXTRACT NULL
        TAG POS=R1 TYPE=TD ATTR=TXT:* EXTRACT=TXT
        ADD My_Data {{!EXTRACT}}
        'TAG POS=1 TYPE=TD ATTR=TXT:2)<SP>ପ୍ରଜାର<SP>ନାମ,<SP>ପିତାର<SP>ନାମ,<SP>ଜାତି<SP>ଓ<SP>ବାସସ୍ଥ*
        SET !EXTRACT NULL
        TAG POS=R1 TYPE=TD ATTR=TXT:* EXTRACT=TXT
        ADD My_Data {{!EXTRACT}}
        'TAG POS=1 TYPE=TD ATTR=TXT:3)<SP>ସ୍ଵତ୍ଵ
        TAG POS=R1 TYPE=TD ATTR=TXT:* EXTRACT=TXT
        ADD My_Data {{!EXTRACT}}
        'TAG POS=1 TYPE=TD ATTR=TXT:ସ୍ଥିତିବାନ
        SET !EXTRACT NULL
        TAG POS=R1 TYPE=TD ATTR=TXT:* EXTRACT=TXT
        ADD My_Data {{!EXTRACT}}
         'TAG POS=1 TYPE=TD ATTR=TXT:ସ୍ଥିତିବାନ
        SET !EXTRACT NULL
        TAG POS=1 TYPE=SPAN ATTR=ID:gvfront_ctl02_lblSpecialCase EXTRACT=TXT
         ADD My_Data {{!EXTRACT}}    
        'PROMPT {{!EXTRACT}}
        'PROMPT {{My_Data}}
        SET !CLIPBOARD {{My_Data}}
        TAG POS=1 TYPE=INPUT:SUBMIT FORM=ID:form1 ATTR=ID:btnBackPg
        
        TAG POS=2 TYPE=TABLE ATTR=* EXTRACT=TXT
        SET TABLE EVAL("'{{!EXTRACT}}'.trim()")
        
       TAG POS=1 TYPE=INPUT:SUBMIT FORM=ID:form1 ATTR=ID:btnKhatiyan
        TAG POS=1 TYPE=TD ATTR=TXT:District

Up to this portion I am using the code to copy the front web page data

Code: Select all

     TAG POS=1 TYPE=INPUT:SUBMIT FORM=ID:form1 ATTR=ID:btnBackPg
and after this I am using the code to copy the table in the backside of the page

Code: Select all

 TAG POS=2 TYPE=TABLE ATTR=* EXTRACT=TXT
        SET TABLE EVAL("'{{!EXTRACT}}'.trim()")
        
       TAG POS=1 TYPE=INPUT:SUBMIT FORM=ID:form1 ATTR=ID:btnKhatiyan
        TAG POS=1 TYPE=TD ATTR=TXT:District
But I am able to extract only front page data, and the code after clicking the Back button doesn't work, and when I am using the only code

Code: Select all

 TAG POS=2 TYPE=TABLE ATTR=* EXTRACT=TXT
        SET TABLE EVAL("'{{!EXTRACT}}'.trim()")
It extracts the data but not in table format.

Code: Select all

[img]https://1.bp.blogspot.com/-Od0JHUa71ok/Xq-Y8qVjIdI/AAAAAAAAGbQ/McAuVQfjHMIAyxmGbSc13IBzOUy_7ujDACLcBGAsYHQ/s1600/Capture.JPG[/img]
I want to extract the marked data after the columns showing 7 8 9 10 11 12 till the end of page. And for some pages there are 100's of rows are there in a table to extract. What should be the change in the code ?

Can you upload your Screenshot directly to the Forum, rather than using some external Pix Hosting Site...? (Explained in the Forum Rules...)

And can you maybe "clean" your Scripts, but especially the first/long one from all unnecessary many Whitespaces/Tabs at the beginning of nearly all Lines...? They make your Scripts very difficult to "read" and to follow/understand...

>>>

But OK from a quick Look without trying to read/follow your long Script (until you've "cleaned" it), use 'EVAL()' with 'split()' on the Cell containing the Nb "12" to keep the Data only after that Cell. That will remove all the Headers...
=> With stg like...:

Code: Select all

TAG POS=2 TYPE=TABLE ATTR=* EXTRACT=TXT
'SET TABLE EVAL("'{{!EXTRACT}}'.trim()")
SET Full_Table EVAL("'{{!EXTRACT}}'.trim()")
PROMPT Full_Table:<BR><BR>_{{Full_Table}}_
SET Table_Data EVAL("var ft='{{Full_Table}}'; var x,y,z; x=ft.split('\"12\",'); z=x[1]; z;")
PROMPT Table_Data:<BR><BR>_{{Table_Data}}_
I'm not completely sure what you need exactly in the 'split()', use the 2 'PROMPT''s to find out/debug...

And hum..., your Screenshot "only" shows Cols [7-12], but I "suspect"/reckon, the Table also contains Cols [1-6], and if I understand correctly, you won't want the Data from those 6 Cols, meaning you'll further need to split the Content of 'Table_Data' "vertically" for all Rows, which is a little bit more "complicated", you will need a 'for' Loop in the 'EVAL()'...
You'll find a few Examples on the Forum..., not from me I think, but rather from other Advanced User @iimfun I think I remember... I'll do some further "Thinking" if you don't come out by yourself after you've complied with the Rule for Screenshots..., and I think I will need more Info about the Full Structure of the Full Table and what you get exactly in the 'EXTRACT'...
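
As a rough idea of what such a 'for' Loop could look like, here is a minimal JavaScript sketch. It is not one of the Forum Examples mentioned above. It assumes the extract has one Row per line, Cells delimited by "," and no Cell containing that delimiter or a line break; the function name keepColumns and the sample data are hypothetical. Inside an iMacros 'EVAL()' the same logic would presumably have to be written on a single line with the inner double quotes escaped.

```javascript
// Hypothetical helper: keep only columns firstCol..lastCol (1-based) of each row.
// Assumption: one row per line, cells wrapped in double quotes and separated by ",".
function keepColumns(tableText, firstCol, lastCol) {
  const rows = tableText.trim().split('\n');
  const out = [];
  for (let i = 0; i < rows.length; i++) {
    // Drop the outer quotes of the row, then split on the cell delimiter.
    const cells = rows[i].replace(/^"|"$/g, '').split('","');
    const kept = cells.slice(firstCol - 1, lastCol);
    // Re-wrap the kept cells in the same quoted format.
    out.push('"' + kept.join('","') + '"');
  }
  return out.join('\n');
}

// Made-up 4-column sample, keeping columns 3 and 4.
const sampleRows = '"a","b","c","d"\n"e","f","g","h"';
console.log(keepColumns(sampleRows, 3, 4));
```

With real data the column numbers would be 7 and 12, and Cells that contain line breaks would need a different way of finding the Row boundaries.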

>>>

Oh...!, and you'll notice I renamed your "TABLE" Var into "Full_Table"...
1- "Convention" is to keep All Capitals for iMacros Commands and Vars, and not to use All Capitals for User Defined Vars... (for Readability).
2- Not to use "Reserved" iMacros/JS/HTML Keywords/Commands for your (User Defined) Vars, as you might get some "unexpected" Results, or Conflicts, and it's "confusing" to read/understand your Script anyway... And if it's confusing for me as an Advanced User, it must be even more confusing for yourself to re-understand/debug your own Script in case of a "Problem"...

>>>

Hum..., and first time I see using 'trim()' on a 'TABLE'... It probably has no Use and does nothing actually, I would think... :o
Because 'trim()' removes Spaces (+ (soft) Tabs and Returns) at the very beginning + very end of the 'EXTRACT' (often useful for 'DIV' and 'SPAN' or even 'TD' Elements indeed), but iMacros in an 'EXTRACT' Statement of "TYPE=TABLE" already wraps all Cells between Double Quotes, and from the 1st Cell in the Table, until the last one, + even adds 2 extra Double Quotes around the whole Table, so there won't be any Spaces to trim before that 1st Double Quote, and after the last one...
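
The behaviour of 'trim()' described above can be checked in plain JavaScript. This is a small sketch with made-up strings, not output from the real page: a string that starts and ends with a Double Quote comes back unchanged, while whitespace outside the text is removed. Whitespace that sits inside the quotes of a Cell is not touched either.

```javascript
// Hypothetical strings for illustration only.
const quotedTable = '"a","b"\n"c","d"';   // starts and ends with a double quote
const paddedText = '  \n\tsome text \n';  // whitespace before and after the text
const paddedCell = '"\n    256\n"';       // whitespace inside the quotes

// trim() only looks at the very beginning and the very end of the string.
const sameTable = quotedTable.trim() === quotedTable;
const cleaned = paddedText.trim();
const sameCell = paddedCell.trim() === paddedCell;

console.log(sameTable, JSON.stringify(cleaned), sameCell);
```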
- (F)CI(M) = (Full) Config Info (Missing): iMacros + Browser + OS (+ all 3 Versions + 'Free'/'PE').
- I don't even read the Qt if that (required) Info is not mentioned...!
- Script & URL help a lot for more "educated" Help...
jyotirmaya

Re: Extraction of Table from a web page

Post by jyotirmaya » Mon May 04, 2020 1:11 pm

Can you upload your Screenshot directly to the Forum, rather than using some external Pix Hosting Site...? (Explained in the Forum Rules...)

And can you maybe "clean" your Scripts, but especially the first/long one from all unnecessary many Whitespaces/Tabs at the beginning of nearly all Lines...? They make your Scripts very difficult to "read" and to follow/understand...

>>>

Sorry, I wasn't aware of the rule, and sorry for the whitespace and tabs. I will clean up my scripts in future posts.

But OK from a quick Look without trying to read/follow your long Script (until you've "cleaned" it), use 'EVAL()' with 'split()' on the Cell containing the Nb "12" to keep the Data only after that Cell. That will remove all the Headers...
=> With stg like...:

Code: Select all

TAG POS=2 TYPE=TABLE ATTR=* EXTRACT=TXT
'SET TABLE EVAL("'{{!EXTRACT}}'.trim()")
SET Full_Table EVAL("'{{!EXTRACT}}'.trim()")
PROMPT Full_Table:<BR><BR>_{{Full_Table}}_
SET Table_Data EVAL("var ft='{{Full_Table}}'; var x,y,z; x=ft.split('\"12\",'); z=x[1]; z;")
PROMPT Table_Data:<BR><BR>_{{Table_Data}}_
I'm not completely sure what you need exactly in the 'split()', use the 2 'PROMPT''s to find out/debug...
Even with "12" as the split separator, the headers are still there. The first prompt shows the Full_Table data as:

"ଖତିୟାନର କ୍ରମିକ ନଂ : 1","ମୌଜା : ଉପରହରିଡ଼ାବାଡ଼ି","ଜିଲ୍ଲା : ଗଂଜାମ"
"ପ୍ଲଟ ନମ୍ବର ଓ ଚକର ନାମ","କିସମ ଓ ପ୍ଲଟର ଖଜଣା","କିସମର ବିସ୍ତାରିତ ବିବରଣୀ ଓ ଚୌହଦି","ରକବା","ମନ୍ତବ୍ୟ"
"ଏ."," ଡି.","ହେକ୍ଟର"
"7","8","9","1","0","11","12"
"
256

କୂଅଝୋଳା
","
ବିଲ ଜଳସେଚିତ ଏକ ଫସଲି
","

ଉ: ସୁରେନ୍ଦ୍ର ମଳିକ


ଦ: ପତିତ


","
0
","
226
","

","

"
"
265

କୂଅଝୋଳା
","
ବିଲ ଜଳସେଚିତ ଏକ ଫସଲି
","

ଉ: ନିଜ


ଦ: ନିଜ


","
0
","
088
","

","

"
"
267

କୂଅଝୋଳା
","
ବିଲ ଜଳସେଚିତ ଏକ ଫସଲି
","

ଉ: ନିଜ


ଦ: ପତିତ


","
0
","
145
","

","

"
"3 plots","","","0","459","",""


and the second prompt shows Table_Data as

__



And hum..., your Screenshot "only" shows Cols [7-12], but I "suspect"/reckon, the Table also contains Cols [1-6], and if I understand correctly, you won't want the Data from those 6 Cols, meaning you'll further need to spit the Content of 'Table_Data' "vertically" for all Rows, which is a little bit more "complicated", you will need a 'for' Loop in the 'EVAL()'...
You'll find a few Examples on the Forum..., not from me I think, but rather from other Advanced User @iimfun I think I remember... I'll do some further "Thinking" if you don't come out by yourself after you've complied with the Rule for Screenshots..., and I think I will need more Info about the Full Structure of the Full Table and what you get exactly in the 'EXTRACT'...
I have added the HTML page as a .zip attachment; could you kindly check it to help me? The table doesn't contain columns 1 to 6. I have also uploaded the screenshot to this forum, sorry about that. The data is not extracted in table format. What else should I change in the code? Or is this the most I can get from the extraction?
chivracq

Re: Extraction of Table from a web page

Post by chivracq » Mon May 04, 2020 2:01 pm

jyotirmaya wrote:
Mon May 04, 2020 1:11 pm
Can you upload your Screenshot directly to the Forum, rather than using some external Pix Hosting Site...? (Explained in the Forum Rules...)

And can you maybe "clean" your Scripts, but especially the first/long one from all unnecessary many Whitespaces/Tabs at the beginning of nearly all Lines...? They make your Scripts very difficult to "read" and to follow/understand...

Sorry I wasn't aware of the rule, and sorry for the whitespaces or tabs, I will make clean my scripts when I will post further.

Yeah, OK, perfect, Thanks for Uploading your Screenshot directly to the Forum, it really helps understanding the "Scenario", and it would be "a pity" if that Screenshot disappeared after "a while" which often happens with external Hosting Sites that all go dark or become commercial "one day"...

The few empty Lines in your Script were still "useful", you didn't need to remove them... But OK, never mind... I understood your Scenario anyway, I don't really need to dig into the "big Script", ah-ah...!

>>>
jyotirmaya wrote:
Mon May 04, 2020 1:11 pm
But OK from a quick Look without trying to read/follow your long Script (until you've "cleaned" it), use 'EVAL()' with 'split()' on the Cell containing the Nb "12" to keep the Data only after that Cell. That will remove all the Headers...
=> With stg like...:

Code: Select all

TAG POS=2 TYPE=TABLE ATTR=* EXTRACT=TXT
'SET TABLE EVAL("'{{!EXTRACT}}'.trim()")
SET Full_Table EVAL("'{{!EXTRACT}}'.trim()")
PROMPT Full_Table:<BR><BR>_{{Full_Table}}_
SET Table_Data EVAL("var ft='{{Full_Table}}'; var x,y,z; x=ft.split('\"12\",'); z=x[1]; z;")
PROMPT Table_Data:<BR><BR>_{{Table_Data}}_
I'm not completely sure what you need exactly in the 'split()', use the 2 'PROMPT''s to find out/debug...
After using 12 as split also its keeping headers. In the first prompt it extracted Full Table data like

Code: Select all

[color=#8040FF]"ଖତିୟାନର କ୍ରମିକ ନଂ : 1","ମୌଜା : ଉପରହରିଡ଼ାବାଡ଼ି","ଜିଲ୍ଲା : ଗଂଜାମ"
"ପ୍ଲଟ ନମ୍ବର ଓ ଚକର ନାମ","କିସମ ଓ ପ୍ଲଟର ଖଜଣା","କିସମର ବିସ୍ତାରିତ ବିବରଣୀ ଓ ଚୌହଦି","ରକବା","ମନ୍ତବ୍ୟ"
"ଏ."," ଡି.","ହେକ୍ଟର"
"7","8","9","1","0","11","12"
"
                                            256
                                                
                                                 କୂଅଝୋଳା
                                        ","
                                            ବିଲ ଜଳସେଚିତ ଏକ ଫସଲି
                                        ","
                                            
                                            ଉ: ସୁରେନ୍ଦ୍ର ମଳିକ
                                            
                                            
                                            ଦ: ପତିତ
                                            
                                            
                                        ","
                                            0
                                        ","
                                            226
                                        ","
                                            
                                        ","
                                            
                                        "
"
                                            265
                                                
                                                 କୂଅଝୋଳା
                                        ","
                                            ବିଲ ଜଳସେଚିତ ଏକ ଫସଲି
                                        ","
                                            
                                            ଉ: ନିଜ
                                            
                                            
                                            ଦ: ନିଜ
                                            
                                            
                                        ","
                                            0
                                        ","
                                            088
                                        ","
                                            
                                        ","
                                            
                                        "
"
                                            267
                                                
                                                 କୂଅଝୋଳା
                                        ","
                                            ବିଲ ଜଳସେଚିତ ଏକ ଫସଲି
                                        ","
                                            
                                            ଉ: ନିଜ
                                            
                                            
                                            ଦ: ପତିତ
                                            
                                            
                                        ","
                                            0
                                        ","
                                            145
                                        ","
                                            
                                        ","
                                            
                                        "
"3 plots","","","0","459","",""[/color]
and after the 2nd prompt it extracted Table Data as

Code: Select all

__

OK, I see, and that's "normal", that's because there is no Comma after the "12" in the 'EXTRACT' (it's the last Cell of its Row, so it's followed by a Line Break), so you simply need to remove that Comma in the 'split()'... Try this one:

Code: Select all

SET Table_Data EVAL("var ft='{{Full_Table}}'; var x,y,z; x=ft.split('\"12\"'); z=x[1]; z;")
PROMPT Table_Data:<BR><BR>_{{Table_Data}}_
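
To see why the Comma matters, here is a minimal JavaScript sketch of the two 'split()' calls. The sample string is hypothetical and only modelled on the extract quoted above: the header Row ends with "12" followed by a line break, and the data Rows are shortened to three made-up Cells each.

```javascript
// Hypothetical sample: header row ending in "12", then two short data rows.
const fullTable = '"7","8","9","1","0","11","12"\n"256","0","226"\n"3 plots","0","459"';

// Separator with a trailing comma: it never occurs, so split() returns
// the whole string as a single element and element [1] is undefined.
const withComma = fullTable.split('"12",');

// Separator without the comma: it matches the last header cell,
// so element [1] holds everything after the header row.
const withoutComma = fullTable.split('"12"');
const tableData = withoutComma[1];

console.log(withComma.length, withoutComma.length);
console.log(JSON.stringify(tableData));
```

Note that the part after the separator starts with the line break that ended the header Row.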
>>>
jyotirmaya wrote:
Mon May 04, 2020 1:11 pm
And hum..., your Screenshot "only" shows Cols [7-12], but I "suspect"/reckon, the Table also contains Cols [1-6], and if I understand correctly, you won't want the Data from those 6 Cols, meaning you'll further need to spit the Content of 'Table_Data' "vertically" for all Rows, which is a little bit more "complicated", you will need a 'for' Loop in the 'EVAL()'...
You'll find a few Examples on the Forum..., not from me I think, but rather from other Advanced User @iimfun I think I remember... I'll do some further "Thinking" if you don't come out by yourself after you've complied with the Rule for Screenshots..., and I think I will need more Info about the Full Structure of the Full Table and what you get exactly in the 'EXTRACT'...
I have added the HTML page as .zip attachment, kindly can you check that to help me ?? and the table doesn't contain columns 1 to 6. I have uploaded the screenshot in this forum also, sorry for that. It doesn't extract the data in Table format, what may be the change in the code further ? or this is the maximum I can get from extraction ?

Yep, very good for the HTML Saveas, (I already had some "jyotirmay" Folder in my "Forum Cases" Folder, ah-ah...!, from some previous Thread of yours from Aug 2018, re-ah-ah...!)... Well, then if there are no Cols [1-6], then that's the "easy" Situation and I expect the "current" 'split()' without the Comma to already be the Solution... 8)
I let you do the Testing, I didn't do any myself... :wink:
chivracq

Re: Extraction of Table from a web page

Post by chivracq » Mon May 04, 2020 2:12 pm

Oh...!, but..., yep indeed, you're right with "It doesn't extract the data in Table format"... :oops:

Yep-yep-yep...!, the 'split()' will also remove the very 1st Double Quote at the beginning of the Extract, the one meant for enclosing the whole Table, which will then be missing, ah-ah...!, and therefore needs to be re-added to the Data... OK, let me think...! :evil:

Hum, OK, try this one then...:

Code: Select all

SET Table_Data EVAL("var ft='{{Full_Table}}'; var x,y,z; x=ft.split('\"12\"'); z='\"'+x[1]; z;")
PROMPT Table_Data:<BR><BR>_{{Table_Data}}_
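
Separately from the Double Quote handling, the "Table format" question may also come down to the line breaks and indentation inside each Cell. The following JavaScript sketch shows one possible way to bring every Row onto a single line. It is an alternative technique, not the method used in the snippets of this Thread, and it assumes that Rows are separated by a closing quote, a line break and an opening quote, that Cells are separated by ",", and that no Cell consists of a bare line break. The function name tidyRows and the sample data are hypothetical.

```javascript
// Hypothetical helper: turn the data part of the extract into one line per row.
function tidyRows(tableData) {
  // Remove surrounding whitespace and the outermost quotes.
  const body = tableData.trim().replace(/^"/, '').replace(/"$/, '');
  // Assumed row boundary: closing quote, line break, opening quote.
  const rows = body.split(/"\r?\n"/);
  return rows
    .map(function (row) {
      // Collapse whitespace runs inside each cell to one space, then trim.
      const cells = row.split('","').map(function (cell) {
        return cell.replace(/\s+/g, ' ').trim();
      });
      return '"' + cells.join('","') + '"';
    })
    .join('\n');
}

// Made-up sample with padded, multi-line cells in the first row.
const rawData = '\n"\n   256\n\n   Name\n","\n   0\n","\n   226\n"\n"3 plots","0","459"';
console.log(tidyRows(rawData));
```

Whether the result is still accepted by whatever consumes the extract afterwards would need testing against the real page.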
jyotirmaya

Re: Extraction of Table from a web page

Post by jyotirmaya » Tue May 05, 2020 4:47 am

Yep, very good for the HTML Saveas, (I already had some "jyotirmay" Folder in my "Forum Cases" Folder, ah-ah...!, from some previous Thread of yours from Aug 2018, re-ah-ah...!)... Well, then if there are no Cols [1-6], then that's the "easy" Situation and I expect the "current" 'split()' without the Comma to already be the Solution... 8)
I let you do the Testing, I didn't do any myself... :wink:
I am so lucky that a folder with my name is already there with you. :)

When I tried

Code: Select all

TAG POS=2 TYPE=TABLE ATTR=* EXTRACT=TXT
'SET TABLE EVAL("'{{!EXTRACT}}'.trim()")
SET Full_Table EVAL("'{{!EXTRACT}}'.trim()")
PROMPT Full_Table:<BR><BR>_{{Full_Table}}_

SET Table_Data EVAL("var ft='{{Full_Table}}'; var x,y,z; x=ft.split('\"12\"'); z=\"+x[1]; z;")
PROMPT Table_Data:<BR><BR>_{{Table_Data}}_
I got this Full_Table result:

Code: Select all

Full_Table:

_ଖତିୟାନର କ୍ରମିକ ନଂ : 1","ମୌଜା : ଉପରହରିଡ଼ାବାଡ଼ି","ଜିଲ୍ଲା : ଗଂଜାମ"
"ପ୍ଲଟ ନମ୍ବର ଓ ଚକର ନାମ","କିସମ ଓ ପ୍ଲଟର ଖଜଣା","କିସମର ବିସ୍ତାରିତ ବିବରଣୀ ଓ ଚୌହଦି","ରକବା","ମନ୍ତବ୍ୟ"
"ଏ."," ଡି.","ହେକ୍ଟର"
"7","8","9","1","0","11","12"
"
                                            256
                                                
                                                 କୂଅଝୋଳା
                                        ","
                                            ବିଲ ଜଳସେଚିତ ଏକ ଫସଲି
                                        ","
                                            
                                            ଉ: ସୁରେନ୍ଦ୍ର ମଳିକ
                                            
                                            
                                            ଦ: ପତିତ
                                            
                                            
                                        ","
                                            0
                                        ","
                                            226
                                        ","
                                            
                                        ","
                                            
                                        "
"
                                            265
                                                
                                                 କୂଅଝୋଳା
                                        ","
                                            ବିଲ ଜଳସେଚିତ ଏକ ଫସଲି
                                        ","
                                            
                                            ଉ: ନିଜ
                                            
                                            
                                            ଦ: ନିଜ
                                            
                                            
                                        ","
                                            0
                                        ","
                                            088
                                        ","
                                            
                                        ","
                                            
                                        "
"
                                            267
                                                
                                                 କୂଅଝୋଳା
                                        ","
                                            ବିଲ ଜଳସେଚିତ ଏକ ଫସଲି
                                        ","
                                            
                                            ଉ: ନିଜ
                                            
                                            
                                            ଦ: ପତିତ
                                            
                                            
                                        ","
                                            0
                                        ","
                                            145
                                        ","
                                            
                                        ","
                                            
                                        "
"3 plots","","","0","459","","_
And when I tried

Code: Select all

TAG POS=2 TYPE=TABLE ATTR=* EXTRACT=TXT
'SET TABLE EVAL("'{{!EXTRACT}}'.trim()")
SET Full_Table EVAL("'{{!EXTRACT}}'.trim()")
PROMPT Full_Table:<BR><BR>_{{Full_Table}}_

SET Table_Data EVAL("var ft='{{Full_Table}}'; var x,y,z; x=ft.split('\"12\"'); z=x[1]; z;")
PROMPT Table_Data:<BR><BR>_{{Table_Data}}_
I got the same Full_Table result as above, and this Table_Data:

Code: Select all

Table_Data:

_
"
                                            256
                                                
                                                 କୂଅଝୋଳା
                                        ","
                                            ବିଲ ଜଳସେଚିତ ଏକ ଫସଲି
                                        ","
                                            
                                            ଉ: ସୁରେନ୍ଦ୍ର ମଳିକ
                                            
                                            
                                            ଦ: ପତିତ
                                            
                                            
                                        ","
                                            0
                                        ","
                                            226
                                        ","
                                            
                                        ","
                                            
                                        "
"
                                            265
                                                
                                                 କୂଅଝୋଳା
                                        ","
                                            ବିଲ ଜଳସେଚିତ ଏକ ଫସଲି
                                        ","
                                            
                                            ଉ: ନିଜ
                                            
                                            
                                            ଦ: ନିଜ
                                            
                                            
                                        ","
                                            0
                                        ","
                                            088
                                        ","
                                            
                                        ","
                                            
                                        "
"
                                            267
                                                
                                                 କୂଅଝୋଳା
                                        ","
                                            ବିଲ ଜଳସେଚିତ ଏକ ଫସଲି
                                        ","
                                            
                                            ଉ: ନିଜ
                                            
                                            
                                            ଦ: ପତିତ
                                            
                                            
                                        ","
                                            0
                                        ","
                                            145
                                        ","
                                            
                                        ","
                                            
                                        "
"3 plots","","","0","459","","_
With this code I get the result as Table_Data, but it is again not in table format. :(
chivracq
Posts: 9425
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: Extraction of Table from a web page

Post by chivracq » Tue May 05, 2020 3:58 pm

Hum-hum-hum..., OK-OK indeed..., but hum..., several "Things" are or can be playing a Role...:

1- 'PROMPT' is maybe not the best "Tool" to debug and inspect/control the "exact" Content of 'EXTRACT' on a 'TABLE' Element, because that 'EXTRACT' contains "a lot" of Double Quotes, and 'PROMPT' tries to interpret those, usually as the Delimiter of a String (that might contain some Spaces in it) and won't display the "Outer" Double Quotes enclosing that String.

And this is for example visible I think in:

Code: Select all

Full_Table:

_ଖତିୟାନର କ୍ରମିକ ନଂ : 1","ମୌଜା : ଉପରହରିଡ଼ାବାଡ଼ି","ଜିଲ୍ଲା : ଗଂଜାମ"
[...]
"3 plots","","","0","459","","_
I would expect 1 or maybe 2 Double Quotes at the very beginning just after the '_' that I always use as Delimiter to display a 'PROMPT'.
No real big deal in your Case, as you actually cut that Content after the ["12"], so you don't really care about the beginning of the 'EXTRACT', but still, this is not exactly the Behaviour I would have expected... :o

And you can see for the last Row in the 'PROMPT', => [...,"0","459","","_], you do get a "final"/"orphan" Double Quote at the very end, but then I'm a bit surprised by the Comma just before it, because you didn't get a Comma after the ["12"], I would have rather expected [...,"0","459","""_] in the 'PROMPT'..., => with 3 Double Quotes together, =2 for the last Cell, which happens to be Empty, and +1 for enclosing the whole Table... :o


2- The "Purpose" of an 'EXTRACT' on a 'TYPE=TABLE' Element is "meant" to be combined with a 'SAVEAS TYPE=EXTRACT' to save that Data/Table to a '.CSV' File that in turn is meant to be opened in 'Excel' (or your corresponding Software Prog from 'OO' or 'LO' ('OpenOffice' or 'LibreOffice')), where you'll need to make sure to select the "correct" Settings for Delimiter + Separator(s) for the Data, as those will decide how the Data finally gets displayed in 'Excel'.

I don't see in your (long) Script what you are further doing with that extracted and cut and cleaned Data... What do you do with it...? :?:


3- Oh ja...!, related to '1-' and that 'PROMPT' is maybe not the best Tool to use to inspect the Content of 'Table_Data', the "best" Way to inspect/check that Data in a "raw State" and to make sure that iMacros with the 'PROMPT' Command, or the Browser is not trying to interpret some HTML/CSS Formatting, is to save that Data with 'SAVEAS TYPE=EXTRACT' to a '.CSV' or '.TXT' File, and to open that File in 'Notepad'... (And not in 'Excel', even if you saved it as '.CSV' and probably have the '.CSV' File Extension associated with 'Excel'.)
That's the only Way to check the raw Content... Even if the 'SAVEAS' Command still plays a Role in the Process, and behaves differently in different Versions of iMacros and with different Browsers... iMacros for FF v8.9.7 adds Double Quotes around every Cell and the whole Table in the 'EXTRACT' and 'SAVEAS', while you don't get any Double Quotes, or maybe only for the whole Table, in other Browsers... (And you then need to choose different Settings when you want to open the '.CSV' in 'Excel'...)


4- The 'EXTRACT' Mechanism on a 'TABLE' Element is rather "meant" for "simple"/"normal"/"standard" HTML Tables, made of a 'TBODY' + 1 'TH' (Table Header) + several 'TR' (Table Rows) with several 'TD' (Cells) all containing some "raw" Text, and you can quickly get some "unexpected" Results with "complex" or "fancy" Tables like in your Case, where for example the Cell containing the "459" is a "Standard"/fairly simple Cell:

Code: Select all

<td style="border-color: #000000">459</td>
... but if you take for example the Cell from Col_9 containing "ଉ: ନିଜ + ଦ: ପତିତ" (displayed on 2 Rows within the Cell), this one is defined by:

Code: Select all

<td style="font-size:17px;width:500px;border-color: #000000" align="center">
   <span id="gvRorBack_ctl07_lblKisama" class="line" style="color:#000000;font-size:14px;"></span>
   <span id="gvRorBack_ctl07_lbln_occu" class="line" style="color:#000000;font-size:14px;">ଉ: ନିଜ</span>
   <span id="gvRorBack_ctl07_lble_occu" class="line" style="color:#000000;font-size:14px;"></span>
   <br>
   <span id="gvRorBack_ctl07_lbls_occu" class="line" style="color:#000000;font-size:14px;">ଦ: ପତିତ</span>
   <br>
   <span id="gvRorBack_ctl07_lblw_occu" class="line" style="color:#000000;font-size:14px;"></span>
</td>
... => 5x 'SPAN' Elements + 2x '<BR>' Tags, and all those Elements and Formatting inside 1 single Cell...! And 3 of the 'SPAN' Elements are not even used and are Empty...! Then, tja...!, no wonder that iMacros then has some "Difficulties" trying to extract that "Cell". It's already nearly a "Miracle" that it manages to extract the Text Content, but don't be surprised to also get a lot of Spaces and Soft/Hard Returns with that Text...!

=> If you want the Data to be more "clean" than the 'EXTRACT' Mechanism can get on a 'TYPE=TABLE' Element, I'm "afraid" you'll have to extract that Data Cell by Cell (*), where using 'EVAL()', you'll then be able to "clean" the Data with 'trim()'. But even that might not be "enough", because "extracting Cell by Cell" means extracting at the 'TD' Level, and 'trim()' will only remove the Spaces and Tabs/Returns at the beginning + end of the Cell, but for the Example about the "ଉ: ନିଜ + ଦ: ପତିତ" Cell, you may still get some Spaces defined at the Sub-Level of the 'SPAN' Elements, and for sure the 2x '<BR>' Hard Returns, so you would even need to go one Level deeper in the HTML Structure of that/some Cell(s) and to extract at the 'SPAN' Level, and this whole Process, just for 1 Cell...!

+ You will probably need to loop your Script to handle 1 Row per Loop... You could hard-code 10 Blocks if you expected Max 10 Rows per Page/Table, but you mentioned that some Pages could contain 100's of Rows :shock: , so hard-coding 1 Block per Row would not be workable in your Case, then maybe you'll have to convert your '.iim' Script to a '.js' Script where you could loop per 1 Row/1 Block, depending on the Length and how many Rows on a Page... :idea:
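
The clean-up of a single extracted Cell could look roughly like the following JavaScript, usable in a '.js' Script once the Cell has been extracted. This is only a sketch under assumptions: the Function Name `cleanCellText` and its `separator` Parameter are made up for illustration (they are not part of iMacros), and the Cell is assumed to arrive as a plain String that still contains the Spaces and Returns left over from the 'SPAN' and '<BR>' Elements.

```javascript
// Hypothetical helper (not an iMacros function): tidy the text of ONE extracted cell.
// It splits the cell into lines, collapses runs of whitespace inside each line,
// drops the empty lines and joins what is left with a separator (a space by default).
function cleanCellText(raw, separator) {
  var sep = separator === undefined ? ' ' : separator;
  return String(raw)
    .split(/\r?\n/)
    .map(function (line) { return line.replace(/\s+/g, ' ').trim(); })
    .filter(function (line) { return line.length > 0; })
    .join(sep);
}
```

Unlike a plain 'trim()', this also removes the Returns in the middle of the Cell, so the 2 Lines of the "ଉ: ନିଜ + ଦ: ପତିତ" Cell end up on 1 Line, separated by whatever is passed as `separator`.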


5- (*) And before you ask, in case you get the Idea, ah-ah...! :twisted: , there is no Mechanism to extract a Full Row in one Go... :(

Of course you can extract a "TYPE=TR" Element, but hum..., 'EXTRACT=TXT' will give you the whole Text Content of the whole Row in just 1 Block of Data, and all Formatting gets lost: for your "easy" last Row you would get for example "3 plots0459", with all Data concatenated in just 1 String, without any Separator. You then need to use 'EXTRACT=HTM', but pfff..., you are then a bit on your own, as you need to re-code from scratch the whole 'EXTRACT' Mechanism from that HTML Source for a whole Row, which is already not very easy for a "simple"/"standard" Table, but will be a complete pain in the ass in your Case, with the Cells with 5x 'SPAN' Elements... :shock:
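
For what it's worth, re-coding the Cell splitting from the HTML Source of 1 Row could be sketched like this in JavaScript. It is a rough illustration only, under several assumptions: `rowHtmlToCells` is a made-up Name, the Input is assumed to be the HTML of 1 single 'TR' (for example what 'EXTRACT=HTM' returns), there are no nested Tables, no '>' inside Attribute Values, and the only Entity handled is '&nbsp;'.

```javascript
// Hypothetical helper: turn the HTML source of ONE table row into an array of cell texts.
// Assumes plain <td>/<th> cells without nested tables.
function rowHtmlToCells(rowHtml) {
  var cells = [];
  var cellPattern = /<t[dh]\b[^>]*>([\s\S]*?)<\/t[dh]>/gi;
  var match;
  while ((match = cellPattern.exec(rowHtml)) !== null) {
    var text = match[1]
      .replace(/<br\s*\/?>/gi, ' ')   // a <br> inside a cell becomes a space
      .replace(/<[^>]+>/g, '')        // drop the remaining tags, e.g. <span ...>
      .replace(/&nbsp;/gi, ' ')       // only this entity is handled in this sketch
      .replace(/\s+/g, ' ')           // collapse spaces, tabs and returns
      .trim();
    cells.push(text);
  }
  return cells;
}
```

Each Cell stays a separate Array Entry, so the Empty Cells are kept and the Data does not get concatenated into 1 String like with 'EXTRACT=TXT'.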
- (F)CI(M) = (Full) Config Info (Missing): iMacros + Browser + OS (+ all 3 Versions + 'Free'/'PE').
- I don't even read the Qt if that (required) Info is not mentioned...!
- Script & URL help a lot for more "educated" Help...
jyotirmaya
Posts: 41
Joined: Wed Jul 27, 2016 6:25 pm

Re: Extraction of Table from a web page

Post by jyotirmaya » Wed May 06, 2020 11:39 am

First of all, thanks for your valuable time and effort on the detailed analysis :)
I don't see in your (long) Script what you are further doing with that extracted and cut and cleaned Data... What do you do with it...? :?:
There are 1000s of pages I need to extract and save for offline use. The data is available online, but I need it frequently and I cannot access the internet all the time while on the job, so I want to extract it and store it in Excel. I am using the Ditto software: after iMacros extracts the data it is stored in Ditto, and I copy it from there and paste it into Excel for further use.

I want to save the data like this
iMacros Extract1.JPG
I don't want to extract the data of column 9 or the text below 256, 265, 267
iMacros Extract.JPG
5- (*) And before you ask, in case you get the Idea, ah-ah...! :twisted: , there is no Mechanism to extract a Full Row in one Go... :(
If the extract comes out concatenated, that will be a problem for me. But as you said, it will be very difficult to extract the text in table form, so there is no problem even if I get the data like this; I will use an Excel VBA macro to arrange it as I need.

Code: Select all

"256
 କୂଅଝୋଳା
 ,"""
ବିଲ ଜଳସେଚିତ ଏକ ଫସଲି
0
226
"265
 କୂଅଝୋଳା
 ,"""
ବିଲ ଜଳସେଚିତ ଏକ ଫସଲି
0
88
"267
 କୂଅଝୋଳା
 ,"""
ବିଲ ଜଳସେଚିତ ଏକ ଫସଲି
0
145
3 plots,"","","0","459","",""
iMacros Extract2.JPG
Once I get a regular pattern in the extracted data, I can use Excel VBA macros: I need to transform the first four rows into columns, then the next four lines into columns below the first ones, and so on. I also don't want any commas (,) or double quotes ("); they are of no use to me, and I will remove them all in Excel after extraction.
2- The "Purpose" of an 'EXTRACT' on a 'TYPE=TABLE' Element is "meant" to be combined with a 'SAVEAS TYPE=EXTRACT' to save that Data/Table to a '.CSV' File that in turn is meant to be opened in 'Excel' (or your corresponding Software Prog from 'OO' or 'LO' ('OpenOffice' or 'LibreOffice')), where you'll need to make sure to select the "correct" Settings for Delimiter + Separator(s) for the Data, as those will decide how the Data finally gets displayed in 'Excel'.
Yes, I also need the data saved in Excel, and saving it to a .csv file would be fine if possible. But if that needs a lot of coding, I am OK with continuing to use the Ditto software.
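
As a rough idea of the coding involved, the quoted extract could be read and tidied along these lines in JavaScript. This is a sketch under assumptions, not a tested solution for this site: it assumes the saved text has every cell in double quotes, cells separated by commas and rows separated by a line break, as in the Table_Data dump above. The names `parseQuotedCsv` and `tidyRows` and the `drop` / `firstLineOnly` options are invented for this example.

```javascript
// Minimal CSV reader: handles quoted cells, doubled quotes ("") and
// line breaks inside a quoted cell. Returns an array of rows (arrays of strings).
function parseQuotedCsv(text) {
  var rows = [], row = [], cell = '', inQuotes = false;
  for (var i = 0; i < text.length; i++) {
    var ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') { cell += '"'; i++; }
      else if (ch === '"') { inQuotes = false; }
      else { cell += ch; }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      row.push(cell); cell = '';
    } else if (ch === '\n') {
      row.push(cell); rows.push(row); row = []; cell = '';
    } else if (ch !== '\r') {
      cell += ch;
    }
  }
  if (cell !== '' || row.length > 0) { row.push(cell); rows.push(row); }
  return rows;
}

// Hypothetical clean-up step: collapse whitespace in every cell, keep only the
// first non-empty line of the columns listed in firstLineOnly, and remove the
// columns listed in drop (both use 0-based column positions).
function tidyRows(rows, options) {
  var drop = options.drop || [];
  var firstLineOnly = options.firstLineOnly || [];
  return rows.map(function (row) {
    return row
      .map(function (cell, index) {
        if (firstLineOnly.indexOf(index) !== -1) {
          var lines = cell.split('\n')
            .map(function (line) { return line.trim(); })
            .filter(function (line) { return line !== ''; });
          return lines.length > 0 ? lines[0] : '';
        }
        return cell.replace(/\s+/g, ' ').trim();
      })
      .filter(function (_, index) { return drop.indexOf(index) === -1; });
  });
}
```

With Table_Data as input, something like `tidyRows(parseQuotedCsv(data), { drop: [2], firstLineOnly: [0] })` would keep only the plot number in the first column and leave out the third column of that dump. The column positions are guesses from the dump and would need checking against the real table. Values stay text, so "088" keeps its leading zero until Excel converts it.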