LinkedIn tagging question


by morpdorp on Mon Jul 31, 2017 4:46 pm

Windows 10
iMacros for Firefox. VERSION BUILD=9030808
Firefox 54.0.1 (32-bit)

Hi all,

I have a LinkedIn profile scraper that extracts profile data into a CSV (code at bottom of post). It does the job very well, with one key caveat: it can't account for instances where a piece of information is missing from a section of the profile. As an example, see this profile: https://www.linkedin.com/in/darylpereira/

Look at the second position in Daryl's profile and note that he did not include a date range. My program is trained to find position 2 and print it, then find date range 2 and print it. But it can't tell when there is no date range 2. Instead, the date range associated with the 3rd position will be considered the second date range. The result is that the dates are off for him (I have him joining IBM in this capacity in Nov 2005, he really joined in Sep 2008). Many, many profiles have this issue. I need the date range data to be accurately attributed to the positional and educational data. Otherwise, it's not useful for the analysis.

I believe the answer is that I need to use Python or some other scripting interface so that I can add if/else logic. But I'm really hoping someone might have a workaround for iMacros, because I'm not any good at coding. Anyway, all ideas are appreciated. Here's the code I mentioned (feel free to use it - the referenced file is just the list of profile links):

Code: Select all
VERSION BUILD=9030808 RECORDER=FX
TAB T=1
TAB CLOSEALLOTHERS
SET !ERRORIGNORE YES
SET !DATASOURCE LinkedInURLs.csv
SET !DATASOURCE_COLUMNS 1
SET !LOOP 1
SET !DATASOURCE_LINE {{!loop}}

URL GOTO={{!COL1}}

WAIT SECONDS=10

ADD !EXTRACT {{!COL1}}

TAG POS=1 TYPE=h1 ATTR=CLASS:* EXTRACT=TXT

TAG POS=1 TYPE=h3 ATTR=class:"Sans-**px-black-**%-semibold" EXTRACT=TXT
TAG POS=1 TYPE=span ATTR=class:pv-entity__secondary-title* EXTRACT=TXT
TAG POS=1 TYPE=h4 ATTR=class:pv-entity__date-range* EXTRACT=TXT
TAG POS=1 TYPE=h4 ATTR=class:pv-entity__location* EXTRACT=TXT

TAG POS=2 TYPE=h3 ATTR=class:"Sans-**px-black-**%-semibold" EXTRACT=TXT
TAG POS=2 TYPE=span ATTR=class:pv-entity__secondary-title* EXTRACT=TXT
TAG POS=2 TYPE=h4 ATTR=class:pv-entity__date-range* EXTRACT=TXT
TAG POS=2 TYPE=h4 ATTR=class:pv-entity__location* EXTRACT=TXT

TAG POS=3 TYPE=h3 ATTR=class:"Sans-**px-black-**%-semibold" EXTRACT=TXT
TAG POS=3 TYPE=span ATTR=class:pv-entity__secondary-title* EXTRACT=TXT
TAG POS=3 TYPE=h4 ATTR=class:pv-entity__date-range* EXTRACT=TXT
TAG POS=3 TYPE=h4 ATTR=class:pv-entity__location* EXTRACT=TXT

TAG POS=4 TYPE=h3 ATTR=class:"Sans-**px-black-**%-semibold" EXTRACT=TXT
TAG POS=4 TYPE=span ATTR=class:pv-entity__secondary-title* EXTRACT=TXT
TAG POS=4 TYPE=h4 ATTR=class:pv-entity__date-range* EXTRACT=TXT
TAG POS=4 TYPE=h4 ATTR=class:pv-entity__location* EXTRACT=TXT

TAG POS=1 TYPE=h3 ATTR=class:pv-entity__school-name* EXTRACT=TXT
TAG POS=1 TYPE=p ATTR=class:"pv-entity__secondary-title pv-entity__degree-name*" EXTRACT=TXT
TAG POS=1 TYPE=p ATTR=class:"pv-entity__secondary-title pv-entity__fos*" EXTRACT=TXT
TAG POS=1 TYPE=p ATTR=class:pv-entity__dates* EXTRACT=TXT

SAVEAS TYPE=EXTRACT FOLDER=* FILE=LinkedInResults.csv

WAIT SECONDS=5
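To make the failure mode concrete, here is a toy sketch in plain JavaScript (not iMacros syntax; the profile data is made up) of how purely positional pairing shifts the dates when one entry has no date range:

```javascript
// Toy model of positional extraction: every h4 date-range element
// found on the page goes into one flat list, so a position with no
// date range makes every later date shift up by one slot.
const positions = ["Pos 1", "Pos 2", "Pos 3"];
// Position 2 has no date range on the page, so only two date h4s exist:
const dateRanges = ["Jan 2001 - Oct 2005", "Nov 2005 - Sep 2008"];

// Pairing by index (what TAG POS=n effectively does):
const paired = positions.map((p, i) => [p, dateRanges[i] || "missing"]);
// paired[1] is ["Pos 2", "Nov 2005 - Sep 2008"], a date that really
// belongs to Pos 3; paired[2] ends up with "missing".
```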
morpdorp
 
Posts: 1
Joined: Mon Jul 31, 2017 3:40 pm

Re: LinkedIn tagging question

by chivracq on Mon Jul 31, 2017 7:50 pm

morpdorp wrote: [post quoted in full above]

Compliments on the "Quality" of your Post...! :D
(This is becoming so rare that I had to mention it, ah-ah...!)

But OK, there are several different Approaches possible to do what you want, and you don't need any IF/ELSE in Python or whatever other Language you were afraid of, ah-ah...!

An easy way I would think is to extract your Data for each Job Description at the 'DIV' Level and to re-split the Data with 'EVAL()' like for example:
Code: Select all
VERSION BUILD=8820413 RECORDER=FX
SET !ERRORIGNORE YES
SET !EXTRACT_TEST_POPUP NO
TAB T=1
'URL GOTO=https://www.linkedin.com/in/darylpereira/

'Testing:
TAG POS=1 TYPE=H2 ATTR=TXT:Experience
'TAG POS=R1 TYPE=DIV ATTR=TXT:*Company<SP>Name* EXTRACT=HTM
'TAG POS=R1 TYPE=DIV ATTR=TXT:*Company<SP>Name* EXTRACT=TXT
'TAG POS=R1 TYPE=DIV ATTR=CLASS:pv-profile* EXTRACT=TXT
'TAG POS=R1 TYPE=DIV ATTR=CLASS:"pv-entity__summary-info" EXTRACT=TXT
'PAUSE

SET !EXTRACT NULL
TAG POS=1 TYPE=h1 ATTR=CLASS:* EXTRACT=TXT
SET Name {{!EXTRACT}}

'Job_1:
TAG POS=1 TYPE=H2 ATTR=TXT:Experience
SET !EXTRACT NULL
TAG POS=R1 TYPE=DIV ATTR=CLASS:"pv-entity__summary-info" EXTRACT=TXT
'>
SET Job_Title_1 EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('Company Name'); z=x[0].trim(); z;")
SET Company_Name_1 EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('Company Name'); y=x[1].split('Dates Employed'); z=y[0].trim(); z;")
SET Duration_1 EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('Dates Employed'); y=x[1].split('Employment Duration'); z=y[0].trim(); z;")
SET Location_1 EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('Location'); y=x[1].split('\n'); z=y[0].trim(); z;")
'>
SET Job_1 {{Job_Title_1}}[EXTRACT]{{Company_Name_1}}[EXTRACT]{{Duration_1}}[EXTRACT]{{Location_1}}
PROMPT _{{Job_1}}_

'Job_2:
TAG POS=1 TYPE=H2 ATTR=TXT:Experience
SET !EXTRACT NULL
TAG POS=R2 TYPE=DIV ATTR=CLASS:"pv-entity__summary-info" EXTRACT=TXT
'>
SET Job_Title_2 EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('Company Name'); z=x[0].trim(); z;")
SET Company_Name_2 EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('Company Name'); y=x[1].split('Dates Employed'); z=y[0].trim(); z;")
SET Duration_2 EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('Dates Employed'); y=x[1].split('Employment Duration'); z=y[0].trim(); z;")
SET Location_2 EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('Location'); y=x[1].split('\n'); z=y[0].trim(); z;")
'>
SET Job_2 {{Job_Title_2}}[EXTRACT]{{Company_Name_2}}[EXTRACT]{{Duration_2}}[EXTRACT]{{Location_2}}
PROMPT _{{Job_2}}_
 
'Job_3:
TAG POS=1 TYPE=H2 ATTR=TXT:Experience
SET !EXTRACT NULL
TAG POS=R3 TYPE=DIV ATTR=CLASS:"pv-entity__summary-info" EXTRACT=TXT
'>
SET Job_Title_3 EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('Company Name'); z=x[0].trim(); z;")
SET Company_Name_3 EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('Company Name'); y=x[1].split('Dates Employed'); z=y[0].trim(); z;")
SET Duration_3 EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('Dates Employed'); y=x[1].split('Employment Duration'); z=y[0].trim(); z;")
SET Location_3 EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('Location'); y=x[1].split('\n'); z=y[0].trim(); z;")
'>
SET Job_3 {{Job_Title_3}}[EXTRACT]{{Company_Name_3}}[EXTRACT]{{Duration_3}}[EXTRACT]{{Location_3}}
PROMPT _{{Job_3}}_

'Job_4:
TAG POS=1 TYPE=H2 ATTR=TXT:Experience
SET !EXTRACT NULL
TAG POS=R4 TYPE=DIV ATTR=CLASS:"pv-entity__summary-info" EXTRACT=TXT
'>
SET Job_Title_4 EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('Company Name'); z=x[0].trim(); z;")
SET Company_Name_4 EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('Company Name'); y=x[1].split('Dates Employed'); z=y[0].trim(); z;")
SET Duration_4 EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('Dates Employed'); y=x[1].split('Employment Duration'); z=y[0].trim(); z;")
SET Location_4 EVAL("var s='{{!EXTRACT}}'; var x,y,z; x=s.split('Location'); y=x[1].split('\n'); z=y[0].trim(); z;")
'>
SET Job_4 {{Job_Title_4}}[EXTRACT]{{Company_Name_4}}[EXTRACT]{{Duration_4}}[EXTRACT]{{Location_4}}
PROMPT _{{Job_4}}_
'PAUSE

'Reconstruct '!EXTRACT':
SET !EXTRACT {{!COL1}}[EXTRACT]{{Name}}
ADD !EXTRACT {{Job_1}}[EXTRACT]{{Job_2}}[EXTRACT]{{Job_3}}[EXTRACT]{{Job_4}}
PROMPT {{!EXTRACT}}
(Tested on iMacros for FF v8.8.2, Pale Moon v26.3.3 (=FF47), Win10-x64.)

This Script only handles the part with the 4 Jobs for this LinkedIn User, without the Looping from your DataSource or the Education part, for which you can reuse your existing Code...
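The reason this per-DIV approach stays aligned can be sketched in plain JavaScript (the 'EVAL()' bodies above are JavaScript anyway; the sample strings below are made-up stand-ins for LinkedIn's text): each job block is parsed independently, so a missing label only affects that one job:

```javascript
// Sketch of the per-DIV parsing idea: each job's summary text is
// split on LinkedIn's label strings, so a job with no "Dates Employed"
// label yields undefined for that job only - later jobs are unaffected.
function parseJob(s) {
  // Helper: text after the first occurrence of label, or undefined.
  function after(label) {
    var parts = s.split(label);
    return parts.length > 1 ? parts[1] : undefined;
  }
  var title = s.split("Company Name")[0].trim();
  var rest = after("Company Name");
  var company = rest === undefined ? undefined : rest.split("Dates Employed")[0].trim();
  var d = after("Dates Employed");
  var dates = d === undefined ? undefined : d.split("Employment Duration")[0].trim();
  return { title: title, company: company, dates: dates };
}

// Job with no date range vs. a complete job (made-up sample text):
var job2 = parseJob("Writer Company Name IBM");
var job3 = parseJob("Editor Company Name IBM Dates Employed Sep 2008 - Present Employment Duration 9 yrs");
// job2.dates is undefined; job3.dates is "Sep 2008 - Present"
```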
- (F)CIM = (Full) Config Info Missing: iMacros + Browser + OS with all 3 Versions...
- I usually don't even read the Question if that (required) Info is not mentioned...
- Script & URL usually help a lot for a more "educated" Help...
chivracq
 
Posts: 6490
Joined: Sat Apr 13, 2013 6:07 am
Location: Amsterdam (NL)

Re: LinkedIn tagging question

by chivracq on Thu Aug 03, 2017 5:50 pm

myself wrote:Compliments on the "Quality" of your Post...! :D
(This is becoming so rare that I had to mention it, ah-ah...!)

Hum, first Impression was very good, but I'm afraid I will have to lower it:
=> Not "impressed" by the Follow-up, with still no Reaction 3 days later...! :shock:
(This simply comforts me (again) in my "Policy" to (practically) never write Scripts for other Users, I thought this Thread/OP was a nice Exception, ah-ah...! :oops: )

>>>

Hum, Note to myself:
For each 'Job_n', I deliberately gave "_1" / "_2" etc. to all 4 Vars within all 4 Jobs because, for example, if 'Duration_2' returns "__undefined__", it would otherwise automatically keep the Value from 'Duration_1', which then screws the Data... This could be handled with an extra 'IF' Statement in each 'EVAL()' Statement, but it was easier to go for different Names for all Vars.
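That extra 'IF' would just be a length check on the split result; sketched here as a standalone JavaScript function (the same guard could go inside an 'EVAL()' body; "N/A" is a hypothetical fallback value, not something from the macro above):

```javascript
// Guarded version of the Duration split: if the "Dates Employed"
// label is absent, return an explicit fallback instead of letting
// the Var silently keep its previous value.
function duration(s) {
  var x = s.split("Dates Employed");
  if (x.length > 1) {
    return x[1].split("Employment Duration")[0].trim();
  }
  return "N/A"; // hypothetical fallback marker
}

// With the guard, a job missing its dates stays clearly marked:
var ok = duration("Job Dates Employed Sep 2008 - Present Employment Duration 9 yrs");
var missing = duration("Job with no dates at all");
```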

A possible Workaround, which would avoid all those "__undefined__" Values when a Field is not found, and would allow to repeat the same Block of Code 4 times (for the 4 Jobs) with the same Names for all 4 Vars each time, would be to use '!VARDEFAULT', a deprecated Built-in Var that I think still works, at least in v8.8.2 (but I won't bother to check because of the poor Follow-up, ah-ah...!). It could be completely unsupported in v9.0.3 though..., so that wouldn't be a Solution for @OP anyway...
(Many Commands that were previously deprecated (up to 8 years ago, but were still working until v8.9.7) got "cleaned up" in v9.0.3... I never bothered to install v9.0.3 (too buggy and limited), so I never had a chance to test which ones exactly..., and it's not documented anywhere...)

