Extract blank when tag fails

by szechuansauce on Mon May 15, 2017 7:38 am

Hello,

I'm using iMacros build 9030808 for Firefox 53.0.2 (32-bit) on Windows 7. I've been using iMacros to scrape social media profiles for certain pieces of information. Because information on the platform is self-reported, some fields are often missing, which causes the corresponding TAG command to fail. Ultimately, this means a lot of work for me afterwards, sorting the data so that each piece of information ends up in the correct column. Instead, I'd like to extract a blank each time the macro fails to find a piece of information.

I've looked all over the forums but have yet to find a solution. This thread appears to come close, but was never resolved: viewtopic.php?f=7&t=25503

I imagine the solution would involve the EVAL function, but I don't have any experience in JavaScript. Is there an obvious way to do this? I've pasted an example script below for context:

Code: Select all
VERSION BUILD=9030808 RECORDER=FX
TAB T=1
TAB CLOSEALLOTHERS
SET !ERRORIGNORE YES
SET !DATASOURCE Leadership.csv
SET !DATASOURCE_COLUMNS 1
SET !LOOP 1
SET !DATASOURCE_LINE {{!loop}}
URL GOTO={{!COL1}}

WAIT SECONDS=5

TAG POS=1 TYPE=h1 ATTR=CLASS:* EXTRACT=TXT
TAG POS=1 TYPE=h3 ATTR=class:"Sans-17px-black-85%-semibold" EXTRACT=TXT
TAG POS=1 TYPE=span ATTR=class:"pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%" EXTRACT=TXT
TAG POS=1 TYPE=h4 ATTR=class:"pv-entity__date-range Sans-**px-black-**%" EXTRACT=TXT
TAG POS=1 TYPE=h4 ATTR=class:"pv-entity__location Sans-**px-black-**% block" EXTRACT=TXT

TAG POS=2 TYPE=h3 ATTR=class:"Sans-17px-black-85%-semibold" EXTRACT=TXT
TAG POS=2 TYPE=span ATTR=class:"pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%" EXTRACT=TXT
TAG POS=2 TYPE=h4 ATTR=class:"pv-entity__date-range Sans-**px-black-**%" EXTRACT=TXT
TAG POS=2 TYPE=h4 ATTR=class:"pv-entity__location Sans-**px-black-**% block" EXTRACT=TXT

TAG POS=1 TYPE=h3 ATTR=class:"pv-entity__school-name Sans-17px-black-85%-semibold" EXTRACT=TXT
TAG POS=1 TYPE=p ATTR=class:"pv-education-entity__secondary-title pv-entity__degree-name pv-entity__secondary-title Sans-**px-black-**%" EXTRACT=TXT
TAG POS=1 TYPE=p ATTR=class:"pv-education-entity__secondary-title pv-entity__fos pv-entity__secondary-title Sans-**px-black-**%" EXTRACT=TXT
TAG POS=1 TYPE=p ATTR=class:"pv-education-entity__date pv-entity__dates Sans-**px-black-**%" EXTRACT=TXT

SAVEAS TYPE=EXTRACT FOLDER=* FILE=Leadership.csv

WAIT SECONDS=5


Thanks!
szechuansauce
 
Posts: 2
Joined: Mon May 15, 2017 7:04 am

Re: Extract blank when tag fails

by chivracq on Mon May 15, 2017 9:15 am

Yep, pity indeed that the User from the Thread you mention never mentioned their FCI (you are already doing a better "job" on that part, very good...!) and didn't bother to follow up. The Solution in that "Case" was simply to use '!TIMEOUT_STEP' with a short Value of '1' or '0' to shorten the Tag Waiting Time. This would be the Solution for you as well, well..., at least partially...
Because that other Thread was from Dec 2015, and I guess at that time v8.9.2 or v8.9.5 or v8.9.7 for FF were the "current" Versions, while you are now (May 2017) using v9.0.3 for FF.

And some "fundamental" Behaviour changed with v9.0.3 compared to v8.9.x related to the 'EXTRACT' Functionality: when using '!ERRORIGNORE' like you do, 'EXTRACT' will "skip" an HTML Element if it is not found, instead of storing "#EANF#" like previous Versions did. I'm not sure if this is an intentional Change or a Bug in v9.0.3, as it is not mentioned in the Release Notes for v9.0.3. There are a few Threads about this "Feature" on the Forum...
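
To make the consequence concrete, here is a minimal JavaScript sketch (plain JavaScript, not iMacros code) of why a skipped Field is worse than a placeholder once the results are saved as CSV. The field names, the sample profile and both function names are invented for illustration; only the "#EANF#" placeholder comes from the behaviour described above.

```javascript
// Hypothetical field list and sample profile; "company" is missing on the page.
const FIELDS = ["name", "title", "company", "dates"];
const profile = { name: "Jane Doe", title: "CEO", dates: "2015 - Present" };

// Skipping behaviour (as described for v9.0.3 with !ERRORIGNORE YES):
// a field that is not found contributes nothing to the row.
function rowWithSkipping(fields, data) {
  return fields.filter((f) => f in data).map((f) => data[f]);
}

// Placeholder behaviour (as described for v8.9.x):
// a field that is not found contributes "#EANF#", so the column count is stable.
function rowWithPlaceholder(fields, data) {
  return fields.map((f) => (f in data ? data[f] : "#EANF#"));
}

console.log(rowWithSkipping(FIELDS, profile).join(" | "));
console.log(rowWithPlaceholder(FIELDS, profile).join(" | "));
```

With skipping, the row has only three values and the dates land in the "company" column. With the placeholder, the row keeps four values and every column stays where it belongs.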

You have a few Options you can choose...!:
1- v9.0.3 is a bit Buggy and limited anyway compared to v8.9.x, and the stable Version at this moment, which still works on FF53, is v8.9.7 for FF. (Make sure to disable Automatic Updates for iMacros if you "downdate" to v8.9.7, or iMacros will want to update itself again to v9.0.3, ah-ah...!) With v8.9.7, you will get the "old" Behaviour of always getting "#EANF#" when a Field is not found, whether '!ERRORIGNORE' is enabled or not...

2- Option 2 is (in your current v9.0.3 for FF Version) to disable '!ERRORIGNORE' before doing your Extracts, which will return the '#EANF#' Values when the Fields are not found, and will conserve your Table Structure in your 'SAVEAS'.
(When using the 'EXTRACT' Command, your Script will never trigger a RuntimeError on a 'TAG' Statement if the Field is not found, if you were wondering...!)

3- Option 3 is indeed if you are not already "happy" with the '#EANF#' Values, to "transform" them using a fairly simple 'EVAL()' Statement to an empty String or any String you would prefer...

I modified your Script a bit to include Options 2 + 3 (with an easy Config-Switch at the beginning of the Script for the String you would like):
Code: Select all
VERSION BUILD=9030808 RECORDER=FX
TAB T=1
TAB CLOSEALLOTHERS
SET !ERRORIGNORE YES
SET !TIMEOUT_STEP 0

'Easy Access:
SET EANF_String ""

SET !DATASOURCE Leadership.csv
SET !DATASOURCE_COLUMNS 1
SET !LOOP 1
SET !DATASOURCE_LINE {{!loop}}
URL GOTO={{!COL1}}

WAIT SECONDS=5

'Disable '!ERRORIGNORE' to prevent iMacros from skipping Fields not found in v9.0.3 for FF:
SET !ERRORIGNORE NO

TAG POS=1 TYPE=h1 ATTR=CLASS:* EXTRACT=TXT
TAG POS=1 TYPE=h3 ATTR=class:"Sans-17px-black-85%-semibold" EXTRACT=TXT
TAG POS=1 TYPE=span ATTR=class:"pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%" EXTRACT=TXT
TAG POS=1 TYPE=h4 ATTR=class:"pv-entity__date-range Sans-**px-black-**%" EXTRACT=TXT
TAG POS=1 TYPE=h4 ATTR=class:"pv-entity__location Sans-**px-black-**% block" EXTRACT=TXT

TAG POS=2 TYPE=h3 ATTR=class:"Sans-17px-black-85%-semibold" EXTRACT=TXT
TAG POS=2 TYPE=span ATTR=class:"pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%" EXTRACT=TXT
TAG POS=2 TYPE=h4 ATTR=class:"pv-entity__date-range Sans-**px-black-**%" EXTRACT=TXT
TAG POS=2 TYPE=h4 ATTR=class:"pv-entity__location Sans-**px-black-**% block" EXTRACT=TXT

TAG POS=1 TYPE=h3 ATTR=class:"pv-entity__school-name Sans-17px-black-85%-semibold" EXTRACT=TXT
TAG POS=1 TYPE=p ATTR=class:"pv-education-entity__secondary-title pv-entity__degree-name pv-entity__secondary-title Sans-**px-black-**%" EXTRACT=TXT
TAG POS=1 TYPE=p ATTR=class:"pv-education-entity__secondary-title pv-entity__fos pv-entity__secondary-title Sans-**px-black-**%" EXTRACT=TXT
TAG POS=1 TYPE=p ATTR=class:"pv-education-entity__date pv-entity__dates Sans-**px-black-**%" EXTRACT=TXT

'Re-enable '!ERRORIGNORE':
SET !ERRORIGNORE YES

'Replace '#EANF#' Values with 'EANF_String' defined at beginning of Script:
SET Extracted_Data EVAL("var s='{{!EXTRACT}}'; var eanf='{{EANF_String}}'; var z; z=s.split('#EANF#').join(eanf); z;")
'>
'Debug:
PROMPT Original_EXTRACT:<BR>{{!EXTRACT}}<BR><BR>Cleaned_EXTRACT:<BR>{{Extracted_Data}}

SET !EXTRACT {{Extracted_Data}}
SAVEAS TYPE=EXTRACT FOLDER=* FILE=Leadership.csv

WAIT SECONDS=5
(Tested on iMacros for FF v8.8.2, Pale Moon v26.3.3 (=FF47), Win10-x64.)

The 'split().join()' Syntax I used is similar to a Global 'replace()', which uses 'REGEX'. I don't like (= don't master!) 'REGEX', so I prefer this "Trick" with 'split()' + 'join()' that I find easier to use and that even works directly with Special Characters that would require some Escaping in 'REGEX'... :wink:
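
For comparison, here is a small standalone JavaScript sketch of both approaches, runnable outside iMacros. The function names and the sample string are made up for illustration, and the "[EXTRACT]" separator between values is an assumption about how '!EXTRACT' stores several extractions.

```javascript
// Literal "replace all" using split() + join(): no escaping needed.
function replaceAllBySplitJoin(text, needle, replacement) {
  return text.split(needle).join(replacement);
}

// Escape characters that have a special meaning in a regular expression.
function escapeForRegex(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// The same result using a global regular expression, which needs the escaping step.
function replaceAllByRegex(text, needle, replacement) {
  const pattern = new RegExp(escapeForRegex(needle), "g");
  // A function is used so the replacement is inserted literally.
  return text.replace(pattern, () => replacement);
}

// Hypothetical content of !EXTRACT with two fields that were not found.
const extract = "Jane Doe[EXTRACT]#EANF#[EXTRACT]#EANF#[EXTRACT]CEO";

console.log(replaceAllBySplitJoin(extract, "#EANF#", ""));
console.log(replaceAllByRegex(extract, "#EANF#", ""));

// A needle containing "." shows why escaping matters for the regex variant:
// split/join treats "a.b" literally and leaves "axb" alone.
console.log(replaceAllBySplitJoin("a.b axb", "a.b", "-"));
```

Both variants turn the sample string into "Jane Doe[EXTRACT][EXTRACT][EXTRACT]CEO". The split/join version is shorter because it never has to think about escaping.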
- (F)CIM = (Full) Config Info Missing: iMacros + Browser + OS with all 3 Versions...
- I usually don't even read the Question if that (required) Info is not mentioned...
- Script & URL usually help a lot for a more "educated" Help...
chivracq
 
Posts: 5962
Joined: Sat Apr 13, 2013 6:07 am
Location: Amsterdam (NL)

Re: Extract blank when tag fails

by szechuansauce on Mon May 15, 2017 11:01 am

Hi chivracq,

Thanks for your reply. I just tested the script with your additions, and unfortunately I don't think options 2 or 3 will work for me. Since '!ERRORIGNORE' is set to NO, the script now stops when it fails to find what it's looking for. For instance, when I hit an unavailable profile link, the script stopped running and gave this error message: "Retry timeout, line: 21 (Error code: -1001)". I generally set my macros to run after the work day, so I can't monitor stoppages in the script. I will likely have to revert to the old version (Option 1) for now if there isn't another way to do this. Thanks for your time!
szechuansauce
 
Posts: 2
Joined: Mon May 15, 2017 7:04 am

Re: Extract blank when tag fails

by chivracq on Mon May 15, 2017 11:52 am

Hum, that's strange indeed... Line 21 is the first Extract if I'm correct...?:
Code: Select all
TAG POS=1 TYPE=h1 ATTR=CLASS:* EXTRACT=TXT


I've never seen this 'Retry timeout' RuntimeError. It's not even documented in the Wiki's List of Error Codes, or only as:
FF Error Codes:
-1001 Unknown error

That's maybe new to v9.0.3... I can't even test, I never bothered installing v9.0.3 as it was obvious from the beginning that it was too buggy and very limited compared to v8.9.7.

Something you could try is setting '!TIMEOUT_STEP' to "1" instead of "0" to see if that makes a difference, but that will slow down the Execution by 13x1=13 sec for each invalid Profile (1 sec for each of the 13 Extracts in your Script)...
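
As a rough back-of-the-envelope sketch in JavaScript: the function name is hypothetical, and it assumes that every 'TAG' whose Element is missing waits the full '!TIMEOUT_STEP' before giving up, so the waits simply add up. The count of 13 comes from counting the 'TAG ... EXTRACT' lines in the macro posted earlier in this thread.

```javascript
// Assumption: each failed TAG waits the full !TIMEOUT_STEP, so delays add up.
function worstCaseDelaySeconds(extractCount, timeoutStepSeconds) {
  return extractCount * timeoutStepSeconds;
}

// Number of TAG ... EXTRACT lines in the macro from this thread.
const EXTRACT_COUNT = 13;

console.log(worstCaseDelaySeconds(EXTRACT_COUNT, 1)); // !TIMEOUT_STEP 1
console.log(worstCaseDelaySeconds(EXTRACT_COUNT, 0)); // !TIMEOUT_STEP 0
```

Under that assumption, a profile where nothing is found costs 13 extra seconds with a step of 1 and nothing extra with a step of 0.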

I would still have a Workaround, but pfff..., it will be pretty cumbersome...!:
Depending on what 'EXTRACT' returns in v9.0.3 with '!ERRORIGNORE' activated when the Field is not found, and provided '!EXTRACT' is reset to "NULL" before each Extract, it would be possible to check the Content and/or Length of '!EXTRACT' for each Extract specifically. Length is "0" in v8.8.2 for '!EXTRACT' = "NULL" and Length is "6" for '!EXTRACT' = "#EANF#", but I don't know what v9.0.3 returns..., I guess it will be "0".
And it is then possible to "manually" manage the Content of the incremental Extract by using some Temp Var for the incremental Extract... But this is very cumbersome, and it is actually recreating the "Standard" Behaviour of the Extract Mechanism from v8.9.x...! :roll:
I could write that "Mechanism" for 1 Field, but you would need to repeat that Block of Code x13, once for each of the 13 Extracts in your Script...
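
The logic of that "Mechanism" can be sketched in plain JavaScript as follows. This is only a simulation under stated assumptions, not a tested iMacros macro: the function names are hypothetical, "[EXTRACT]" is assumed to be the separator between values in '!EXTRACT', and an empty result is assumed to be what v9.0.3 leaves behind when a Field is not found.

```javascript
const SEPARATOR = "[EXTRACT]"; // assumed separator between values in !EXTRACT
const EANF = "#EANF#";

// Decide what to store for one Extract: an empty result (Length 0) or the
// "#EANF#" marker (Length 6) both count as "Field not found".
function fieldOrBlank(rawExtract, blank) {
  const value = rawExtract == null ? "" : String(rawExtract);
  return value.length === 0 || value === EANF ? blank : value;
}

// Append one field to the Temp Var that holds the row built so far.
function appendField(rowSoFar, rawExtract, blank, isFirst) {
  const field = fieldOrBlank(rawExtract, blank);
  return isFirst ? field : rowSoFar + SEPARATOR + field;
}

// Simulate four Extracts, two of which found nothing.
const results = ["Jane Doe", "", "#EANF#", "2015 - Present"];
let row = "";
results.forEach((raw, i) => {
  row = appendField(row, raw, "", i === 0);
});

console.log(row);
```

The resulting row keeps one slot per Extract, found or not, which is what preserves the column structure in the saved CSV. In a real macro the check would have to be repeated after every 'TAG' line, which is why it gets cumbersome.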

My Advice is Go back to v8.9.7, ah-ah...! This is more reliable and less cumbersome. Many other Bugs and Limitations are still waiting for you in v9.0.3 otherwise, ah-ah...!
chivracq
 
Posts: 5962
Joined: Sat Apr 13, 2013 6:07 am
Location: Amsterdam (NL)

