Extract Formatting, curious if there is a better way.

Discussions and Tech Support related to website data extraction, screen scraping and data mining using iMacros.

Moderators: Community Moderators, iMacros Moderators

Forum rules
Before asking a question or reporting an issue:
1. Please review the list of FAQ's.
2. Use the Google search box (at the top of each forum page) to see if a similar problem or question has already been addressed. This will search the entire contents of the forums as well as the iMacros Wiki.
3. We can respond much faster to your posts if you include the following information:

CLICK HERE FOR IMPORTANT INFORMATION TO INCLUDE IN YOUR POST

Answering your own posts (e.g. attempting to "bump" your topic) drops your topic from the list of unanswered threads, so it may actually receive less views.

Extract Formatting, curious if there is a better way.

by rolakedot on Thu Sep 28, 2017 12:57 pm

Hiya,

FF = 56.0 (64-bit)
IM = 9.0.3 for Firefox
OS = Windows 7 Professional SP1

I finally got this script to work (ish, under ideal conditions). The problem arises when there are gaps in the extracted data (missing tags), which is totally fine, but it messes up the formatting in the .csv. The columns are all shifted right by the number of tags that are missing.

URL is question: https://www.ontario.ca/environment-and-energy/map-well-records

Once a region is selected, the relevant records are listed, and then I loop this script for the number of records.

My Script:
Code: Select all

'VARIABLE SETUP
SET !EXTRACT_TEST_POPUP NO
SET !ERRORIGNORE YES
SET !TIMEOUT_PAGE 60
SET DELAY 0.2
SET !LOOP 1
SET STEP 7
SET WELLID EVAL("{{STEP}}*{{!LOOP}}-6")
SET DATE EVAL("{{STEP}}*{{!LOOP}}")
SET DEPTH EVAL("{{STEP}}*{{!LOOP}}-1")

'LOOPS THOUGH THE MAIN TABLE LIST
'=================================
TAG POS={{WELLID}} TYPE=TD ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}
ADD !EXTRACT //
WAIT SECONDS={{DELAY}}

TAG POS={{DATE}} TYPE=TD ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}
ADD !EXTRACT //
WAIT SECONDS={{DELAY}}

TAG POS={{DEPTH}} TYPE=TD ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}
ADD !EXTRACT //
WAIT SECONDS={{DELAY}}

'RESET POSITION
TAG POS={{WELLID}} TYPE=TD ATTR=*
WAIT SECONDS={{DELAY}}
'CLICK ON HTML RECORD
TAG POS=R1 TYPE=A ATTR=TXT:HTML
WAIT SECONDS={{DELAY}}

'PAGE LOAD ALLOWANCE
WAIT SECONDS=1.5

'PULLS REST OF WELL DATA FROM HTML RECORD
'========================================
'Well Type (Domestic, Industrial, etc.)
TAG POS=1 TYPE=TH ATTR=TXT:Well<SP>Use
WAIT SECONDS={{DELAY}}
TAG POS=R4 TYPE=TD ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}
ADD !EXTRACT //
WAIT SECONDS={{DELAY}}

'Well Status (Water Supply, Observation, etc.)
TAG POS=6 TYPE=P ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}
ADD !EXTRACT //
WAIT SECONDS={{DELAY}}

'Water found at depth
TAG POS=1 TYPE=TH ATTR=TXT:Kind
WAIT SECONDS={{DELAY}}
TAG POS=R1 TYPE=TD ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}
ADD !EXTRACT //
WAIT SECONDS={{DELAY}}

'Static Water Level
TAG POS=1 TYPE=TD ATTR=TXT:SWL
WAIT SECONDS={{DELAY}}
TAG POS=R1 TYPE=TD ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}
ADD !EXTRACT //
WAIT SECONDS={{DELAY}}

'Pump Rate
TAG POS=1 TYPE=TH ATTR=TXT:Pumping<SP>Rate
WAIT SECONDS={{DELAY}}
TAG POS=R1 TYPE=TD ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}
ADD !EXTRACT //
WAIT SECONDS={{DELAY}}

'Recommended Pump rate
TAG POS=1 TYPE=TH ATTR=TXT:Recommended<SP>pump<SP>rate
WAIT SECONDS={{DELAY}}
TAG POS=R1 TYPE=TD ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}

'Save Record
SAVEAS TYPE=EXTRACT FOLDER=* FILE=WELL_RECORDS_{{!NOW:ddmmyyyy}}.csv
WAIT SECONDS={{DELAY}}

'RETURN TO THE MAIN TABLE
TAG POS=1 TYPE=A ATTR=TXT:Go<SP>Back<SP>to<SP>Map&&ONCLICK:MOECC_UI.goBacktoMap()

WAIT SECONDS=1.5


The only way I have found that kind of works is to append a 'spacer' ("//") after each extract, which at least allows me to count back the number of columns to reformat with the missing info, and then deleting all the placeholder columns. This can get very tedious when dealing with hundreds of records.

Is there not a better way?

Thanks in Advance,
rolakedot
 
Posts: 2
Joined: Thu Sep 28, 2017 12:25 pm

Re: Extract Formatting, curious if there is a better way.

by chivracq on Thu Sep 28, 2017 8:33 pm

rolakedot wrote:Hiya,

Code: Select all
FF = 56.0 (64-bit)
IM = 9.0.3 for Firefox
OS = Windows 7 Professional SP1


I finally got this script to work (ish, under ideal conditions). The problem arises when there are gaps in the extracted data (missing tags), which is totally fine, but it messes up the formatting in the .csv. The columns are all shifted right by the number of tags that are missing.

URL is question: https://www.ontario.ca/environment-and-energy/map-well-records

Once a region is selected, the relevant records are listed, and then I loop this script for the number of records.

My Script:
Code: Select all
'VARIABLE SETUP
SET !EXTRACT_TEST_POPUP NO
SET !ERRORIGNORE YES
SET !TIMEOUT_PAGE 60
SET DELAY 0.2
SET !LOOP 1
SET STEP 7
SET WELLID EVAL("{{STEP}}*{{!LOOP}}-6")
SET DATE EVAL("{{STEP}}*{{!LOOP}}")
SET DEPTH EVAL("{{STEP}}*{{!LOOP}}-1")

'LOOPS THOUGH THE MAIN TABLE LIST
'=================================
TAG POS={{WELLID}} TYPE=TD ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}
ADD !EXTRACT //
WAIT SECONDS={{DELAY}}

TAG POS={{DATE}} TYPE=TD ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}
ADD !EXTRACT //
WAIT SECONDS={{DELAY}}

TAG POS={{DEPTH}} TYPE=TD ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}
ADD !EXTRACT //
WAIT SECONDS={{DELAY}}

'RESET POSITION
TAG POS={{WELLID}} TYPE=TD ATTR=*
WAIT SECONDS={{DELAY}}
'CLICK ON HTML RECORD
TAG POS=R1 TYPE=A ATTR=TXT:HTML
WAIT SECONDS={{DELAY}}

'PAGE LOAD ALLOWANCE
WAIT SECONDS=1.5

'PULLS REST OF WELL DATA FROM HTML RECORD
'========================================
'Well Type (Domestic, Industrial, etc.)
TAG POS=1 TYPE=TH ATTR=TXT:Well<SP>Use
WAIT SECONDS={{DELAY}}
TAG POS=R4 TYPE=TD ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}
ADD !EXTRACT //
WAIT SECONDS={{DELAY}}

'Well Status (Water Supply, Observation, etc.)
TAG POS=6 TYPE=P ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}
ADD !EXTRACT //
WAIT SECONDS={{DELAY}}

'Water found at depth
TAG POS=1 TYPE=TH ATTR=TXT:Kind
WAIT SECONDS={{DELAY}}
TAG POS=R1 TYPE=TD ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}
ADD !EXTRACT //
WAIT SECONDS={{DELAY}}

'Static Water Level
TAG POS=1 TYPE=TD ATTR=TXT:SWL
WAIT SECONDS={{DELAY}}
TAG POS=R1 TYPE=TD ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}
ADD !EXTRACT //
WAIT SECONDS={{DELAY}}

'Pump Rate
TAG POS=1 TYPE=TH ATTR=TXT:Pumping<SP>Rate
WAIT SECONDS={{DELAY}}
TAG POS=R1 TYPE=TD ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}
ADD !EXTRACT //
WAIT SECONDS={{DELAY}}

'Recommended Pump rate
TAG POS=1 TYPE=TH ATTR=TXT:Recommended<SP>pump<SP>rate
WAIT SECONDS={{DELAY}}
TAG POS=R1 TYPE=TD ATTR=* EXTRACT=TXT
WAIT SECONDS={{DELAY}}

'Save Record
SAVEAS TYPE=EXTRACT FOLDER=* FILE=WELL_RECORDS_{{!NOW:ddmmyyyy}}.csv
WAIT SECONDS={{DELAY}}

'RETURN TO THE MAIN TABLE
TAG POS=1 TYPE=A ATTR=TXT:Go<SP>Back<SP>to<SP>Map&&ONCLICK:MOECC_UI.goBacktoMap()

WAIT SECONDS=1.5


The only way I have found that kind of works is to append a 'spacer' ("//") after each extract, which at least allows me to count back the number of columns to reformat with the missing info, and then deleting all the placeholder columns. This can get very tedious when dealing with hundreds of records.

Is there not a better way?

Thanks in Advance,

Hum, you shouldn't use only CAPITALS for your own User-Defined Vars, your 'DELAY' Var got me off-side for more than 30 sec, ah-ah...! (because of the '-delay' Switch' on 'imacros.exe' and '!PLAYBACKDELAY', I thought for a while, WTF, what is this Var...!? Ah-ah...!)

But OK, welcome to the "World" of buggy v9.0.3..., yep with this new buggy Behaviour for 'EXTRACT' and "#EANF#" that I have demonstrated in some other Thread to be buggy, hum, I don't remember exactly why...

Yep, your Workaround is OK, I would come up with a similar one as well, much easier is to revert to v8.9.7..., but hum, you are already on FF v56.0, and I didn't have a chance yet to check if v8.9.7 still works on FF56 (you are the 2nd User on the Forum with FF56, the 1st one was a bit vague, I think...), and I wasn't able to check if v8.9.7 still works on FF56 because I was not eligible from the completely fake random Mozilla Channel Update Scheme to update to FF56..., hum, now 3 days later I am, but still on FF v55.0.3, and a bit late now, back from some DJ-Evening, if anything goes wrong with the Update, I don't want to go spending hours of "Investigations", so you'll have to wait for "tomorrow"...

But hum, not sure what you mean exactly with "hundreds of records", that means "hundreds of Columns" in some '.CSV' to me, don't know what you do with those "hundreds of Columns", not easy for Analysis I would think...
I guess I would come up with some "easier" Workaround, I think, but I never bothered to install v9.0.3 which was too limited and buggy from Day_1, so I never had to "struggle" against it, ah-ah...!
- (F)CIM = (Full) Config Info Missing: iMacros + Browser + OS with all 3 Versions...
- I usually don't even read the Question if that (required) Info is not mentioned...
- Script & URL usually help a lot for a more "educated" Help...
chivracq
 
Posts: 6479
Joined: Sat Apr 13, 2013 6:07 am
Location: Amsterdam (NL)

Re: Extract Formatting, curious if there is a better way.

by rolakedot on Fri Sep 29, 2017 6:11 am

Hi,

Thanks for the reply,

Sorry about the variable confusion, some habits die hard..

Well if reverting to older, less buggy versions of IM and FF will resolve the issue for now then that's fine with me.

And by hundred of records I meant that it scrapes each (of potentially hundreds) record for about 8 pieces of information (columns) that are relevant to me.
With one SAVEAS at the end, each record gets its own row.

Example output:
Image

Which works great about 80% of the time, its just when an entry is missing (records are old, it happens) all subsequent info gets shifted.
Adding my spacer "//" lets me at least know where the gaps line up, but going back through a large recordset to scrub out the missing data becomes tedious.

Any way, I will try again with reverted versions and see how that goes.

Thanks again.

UPDATE: Confirmed, reverted to IM v8.9.7, and FF 55.0.3 and it works perfectly! missing tags are left blank but remain in the correct locations.
rolakedot
 
Posts: 2
Joined: Thu Sep 28, 2017 12:25 pm

Re: Extract Formatting, curious if there is a better way.

by chivracq on Sat Sep 30, 2017 10:23 am

rolakedot wrote:Hi,

Thanks for the reply,

Sorry about the variable confusion, some habits die hard..

Well if reverting to older, less buggy versions of IM and FF will resolve the issue for now then that's fine with me.

And by hundred of records I meant that it scrapes each (of potentially hundreds) record for about 8 pieces of information (columns) that are relevant to me.
With one SAVEAS at the end, each record gets its own row.

Example output:
Image

Which works great about 80% of the time, its just when an entry is missing (records are old, it happens) all subsequent info gets shifted.
Adding my spacer "//" lets me at least know where the gaps line up, but going back through a large recordset to scrub out the missing data becomes tedious.

Any way, I will try again with reverted versions and see how that goes.

Thanks again.

UPDATE: Confirmed, reverted to IM v8.9.7, and FF 55.0.3 and it works perfectly! missing tags are left blank but remain in the correct locations.

Ah OK, for "only" 8 Extracts/Columns, then you would only need to check 8 times the Result of each 'EXTRACT' and if Empty or Null, to set it (back) to an empty String or "#EANF#" using 'EVAL()' and to conditionally add an extra "[EXTRACT]" to the '!EXTRACT' Content. Sounds a bit cumbersome, but some People didn't like the "#EANF#" and would do it already..., by transforming the "#EANF#" into an empty String or a "Not found" or whatever... (Hum, I do it myself as well in one of my Scripts, ah-ah...!)

But hum, if I'm correct, a missing 'EXTRACT' would still return "#EANF" in v9.0.3 if '!ERRORIGNORE' is disabled.

And hum, I'm not sure if the "Column Shifting" happens at the 'EXTRACT' itself, or later at the 'SAVEAS' if '!EXTRACT' contains 2 "[EXTRACT][EXTRACT]" next to each other..., then an empty String would not solve the Pb, but you could use a Space (" ") or a Dot (".") instead...

But OK, so you went back to v8.9.7 and FF v55.0.3, OK... 8)
Hum, pity you didn't try v8.9.7 first on FF v56.0, I still haven't tried to update my FF55 to FF56... Hum, maybe today if I don't go out..., but I still have a few Scripts that need a bit to run on FF (because of some Flash Element) even if my "Prod" Env. runs on Pale Moon (v26.3.3) + v8.8.2 for FF, but the Update Process is a bit cumbersome as I always backup my FF Profiles in case I need to go back to a previous FF Version, and some "big" Changes are happening/going to happen soon to Profiles not being retro-compatible anymore between FF55/56/57, so if v8.9.7 doesn't work anymore on FF56, I want to make sure I can go back to the last working Env. as I don't want to (have to) use v9.0.3, beurk...! :evil:
Yep indeed, v9.0.3 is definitely not an Option for me, I use a lot of mini-TEMP 'SAVEAS' Files resulting from some 'EXTRACT's to be reused by some other Macro(s) to "communicate" with each others, and the File Structure is then important, and I practically always use '!ERRORIGNORE' in all my Macros, I think... Or I would need a few days to spot all the Places in about 50,000 Lines of Code, "not-gonna-happen!", no way...! :shock:
Hum, would be less actually as I only use a small part on FF, but pfff..., that would be even more complicated for me because the same Macros would still need to be able to run in v8.8.2 as well, and I don't want to have to fork and maintain 2 Versions for maybe 10 big Scripts...

Oh...!, and I think I remember why 'EXTRACT' is really Buggy in v9.0.3, it''s because 'EVENT:FAIL_IF_FOUND' apparently relies on 'EXTRACT' and "#EANF#" (by probably making a Check on 'EXTRACT' <> "#EANF#") and needs '!ERRORIGNORE' to be disabled to stop the Macro, but in v9.0.3, the Macro always stops, I think... (Hum, and that means the "Process" and Change in Behaviour really happens at the 'EXTRACT' Level, and not at the 'SAVEAS'...)
- (F)CIM = (Full) Config Info Missing: iMacros + Browser + OS with all 3 Versions...
- I usually don't even read the Question if that (required) Info is not mentioned...
- Script & URL usually help a lot for a more "educated" Help...
chivracq
 
Posts: 6479
Joined: Sat Apr 13, 2013 6:07 am
Location: Amsterdam (NL)


Return to Data Extraction and Web Screen Scraping

Who is online

Users browsing this forum: Google [Bot] and 3 guests

-->