Web Scraping and select Content value based on condition from CSV

Discussions and Tech Support related to website data extraction, screen scraping and data mining using iMacros.
Forum rules
Before asking a question or reporting an issue:
1. Please review the list of FAQ's.
2. Use the search box (at the top of each forum page) to see if a similar problem or question has already been addressed.
3. Try searching the iMacros Wiki - it contains the complete iMacros reference as well as plenty of samples and tutorials.
4. We can respond much faster to your posts if you include the following information: CLICK HERE FOR IMPORTANT INFORMATION TO INCLUDE IN YOUR POST
Post Reply
jyotirmaya
Posts: 49
Joined: Wed Jul 27, 2016 6:25 pm

Web Scraping and select Content value based on condition from CSV

Post by jyotirmaya » Mon May 09, 2022 8:28 am

I am using iMacros 8.9.7, Firefox 48.0
Windows 10 64-bit Operating system

I am using the below code

Code: Select all

VERSION BUILD=8970419 RECORDER=FX
TAB T=1
SET !DATASOURCE 111.csv
SET !LOOP 1
SET !DATASOURCE_LINE {{!LOOP}}
TAG POS=3 TYPE=TD ATTR=TXT:* EXTRACT=TXT
wait seconds=1
TAG POS=1 TYPE=A ATTR=TXT:Process
 SET !CLIPBOARD {{!EXTRACT}}
TAG POS=1 TYPE=A ATTR=TXT:Process
wait seconds=0.5
TAG POS=1 TYPE=SELECT FORM=ID:FormProcess ATTR=ID:ddlBLOUpdate CONTENT={{!COL2}}
wait seconds=0.5
TAG POS=1 TYPE=BUTTON FORM=ID:FormProcess ATTR=ID:aERAUD
wait seconds=0.5
TAG POS=4 TYPE=BUTTON ATTR=TXT:×
wait seconds=0.5
In the CSV file I have two columns, Column 1 is the data which I am extracting usig code

Code: Select all

TAG POS=3 TYPE=TD ATTR=TXT:* EXTRACT=TXT
and in Column 2 I have the data which I want to put in the below code

Code: Select all

TAG POS=1 TYPE=SELECT FORM=ID:FormProcess ATTR=ID:ddlBLOUpdate CONTENT={{!COL1}}
For example in the CSV file I have the data like this

Column 1 Column 2

1 10000
2 50000
3 70000
4 80000
5 35000

I want that if below code extracts 1 as value from the Web page

Code: Select all

TAG POS=3 TYPE=TD ATTR=TXT:* EXTRACT=TXT
then in the below code Content it should select 10000

Code: Select all

TAG POS=1 TYPE=SELECT FORM=ID:FormProcess ATTR=ID:ddlBLOUpdate CONTENT={{!COL1}}
Please help me with the code.
chivracq
Posts: 10098
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: Web Scraping and select Content value based on condition from CSV

Post by chivracq » Mon May 09, 2022 9:14 am

jyotirmaya wrote:
Mon May 09, 2022 8:28 am
I am using iMacros 8.9.7, Firefox 48.0
Windows 10 64-bit Operating system

I am using the below code

Code: Select all

VERSION BUILD=8970419 RECORDER=FX
TAB T=1
SET !DATASOURCE 111.csv
SET !LOOP 1
SET !DATASOURCE_LINE {{!LOOP}}
TAG POS=3 TYPE=TD ATTR=TXT:* EXTRACT=TXT
wait seconds=1
TAG POS=1 TYPE=A ATTR=TXT:Process
 SET !CLIPBOARD {{!EXTRACT}}
TAG POS=1 TYPE=A ATTR=TXT:Process
wait seconds=0.5
TAG POS=1 TYPE=SELECT FORM=ID:FormProcess ATTR=ID:ddlBLOUpdate CONTENT={{!COL2}}
wait seconds=0.5
TAG POS=1 TYPE=BUTTON FORM=ID:FormProcess ATTR=ID:aERAUD
wait seconds=0.5
TAG POS=4 TYPE=BUTTON ATTR=TXT:×
wait seconds=0.5
In the CSV file I have two columns, Column 1 is the data which I am extracting usig code

Code: Select all

TAG POS=3 TYPE=TD ATTR=TXT:* EXTRACT=TXT
and in Column 2 I have the data which I want to put in the below code

Code: Select all

TAG POS=1 TYPE=SELECT FORM=ID:FormProcess ATTR=ID:ddlBLOUpdate CONTENT={{!COL1}}
For example in the CSV file I have the data like this

Column 1 Column 2

1 10000
2 50000
3 70000
4 80000
5 35000

I want that if below code extracts 1 as value from the Web page

Code: Select all

TAG POS=3 TYPE=TD ATTR=TXT:* EXTRACT=TXT
then in the below code Content it should select 10000

Code: Select all

TAG POS=1 TYPE=SELECT FORM=ID:FormProcess ATTR=ID:ddlBLOUpdate CONTENT={{!COL1}}
Please help me with the code.

Yep, "easy Sol", ah-ah...!: :idea:

Code: Select all

TAG POS=3 TYPE=TD ATTR=TXT:* EXTRACT=TXT
SET !DATASOURCE_LINE {{!EXTRACT}}
'[...]
TAG POS=1 TYPE=SELECT FORM=ID:FormProcess ATTR=ID:ddlBLOUpdate CONTENT={{!COL2}}
(You have twice a Typo in your Post in "CONTENT={{!COL1}}" btw, you mean "={{!COL2}}"... :? )
- (F)CI(M) = (Full) Config Info (Missing): iMacros + Browser + OS (+ all 3 Versions + 'Free'/'PE'/'Trial').
- I don't even read the Qt if that (required) Info is not mentioned...!
- Script & URL help a lot for more "educated" Help...
jyotirmaya
Posts: 49
Joined: Wed Jul 27, 2016 6:25 pm

Re: Web Scraping and select Content value based on condition from CSV

Post by jyotirmaya » Tue May 10, 2022 10:45 am

Thanks for the help.
Yep, "easy Sol", ah-ah...!: :idea:

Code: Select all

TAG POS=3 TYPE=TD ATTR=TXT:* EXTRACT=TXT
SET !DATASOURCE_LINE {{!EXTRACT}}
'[...]
TAG POS=1 TYPE=SELECT FORM=ID:FormProcess ATTR=ID:ddlBLOUpdate CONTENT={{!COL2}}
After your Code, I have modified the code like this

Code: Select all

VERSION BUILD=8970419 RECORDER=FX
TAB T=1
SET !DATASOURCE AutoSelect.csv
TAG POS=3 TYPE=TD ATTR=TXT:* EXTRACT=TXT
SET !DATASOURCE_LINE {{!EXTRACT}}
wait seconds=1
TAG POS=1 TYPE=A ATTR=TXT:Process
wait seconds=0.5
TAG POS=1 TYPE=SELECT FORM=ID:FormProcess ATTR=ID:ddlBLOUpdate CONTENT={{!COL2}}
wait seconds=0.5
TAG POS=1 TYPE=BUTTON FORM=ID:FormProcess ATTR=ID:aERAUD
wait seconds=0.5
TAG POS=4 TYPE=BUTTON ATTR=TXT:×
wait seconds=0.5
But I am getting Error message that

RuntimeError: Invalid DATASOURCE_LINE value: 71, line 5 (Error code: -951)

The code has extracted the value 71from the Web Page but after that it stops. I have the Data Source file AutoSelect.csv in the Data Source folder and in that in column A the Value 71 is there in Row 3 and in Column B Row 3 the Value 78993 is there, But instead of moving further it stops. I want the CONTENT should select the value 78993.

Please Guide.

As We have used the DATASOURCE_LINE code it is searching for the Line 71 of the CSV file, When I entered data in Row 71 Column A as 71 and Column B as Value 78993, it worked, But I have random values to extract from the Web Page, and the CSV file is also not chronologically arranged. Please help what should be the change in the code.
(You have twice a Typo in your Post in "CONTENT={{!COL1}}" btw, you mean "={{!COL2}}"... :? )
Yes Sorry, actually its "CONTENT={{!COL1}}"
Attachments
Autoselect.png
jyotirmaya
Posts: 49
Joined: Wed Jul 27, 2016 6:25 pm

Re: Web Scraping and select Content value based on condition from CSV

Post by jyotirmaya » Tue May 10, 2022 11:08 am

Or the Solution is to Sort the values in Column1 As 1,2,3,4 etc ?? or any other alternative is there ?
chivracq
Posts: 10098
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Re: Web Scraping and select Content value based on condition from CSV

Post by chivracq » Tue May 10, 2022 5:42 pm

Alright..., OK, I'll first reply to that part:
jyotirmaya wrote:
Tue May 10, 2022 10:45 am
(You have twice a Typo in your Post in "CONTENT={{!COL1}}" btw, you mean "={{!COL2}}"... :? )
Yes Sorry, actually its "CONTENT={{!COL1}}"

=> Beh..., no...!, it is/must be "CONTENT={{!COL2}}", you use it yourself in your modified Script in:

Code: Select all

TAG POS=1 TYPE=SELECT FORM=ID:FormProcess ATTR=ID:ddlBLOUpdate CONTENT={{!COL2}}
But nvm, it doesn't really matter for the "whole Scenario", you can find out by yourself...

>>>
jyotirmaya wrote:
Tue May 10, 2022 10:45 am
Thanks for the help.
Yep, "easy Sol", ah-ah...!: :idea:

Code: Select all

TAG POS=3 TYPE=TD ATTR=TXT:* EXTRACT=TXT
SET !DATASOURCE_LINE {{!EXTRACT}}
'[...]
TAG POS=1 TYPE=SELECT FORM=ID:FormProcess ATTR=ID:ddlBLOUpdate CONTENT={{!COL2}}
After your Code, I have modified the code like this

Code: Select all

VERSION BUILD=8970419 RECORDER=FX
TAB T=1
SET !DATASOURCE AutoSelect.csv
TAG POS=3 TYPE=TD ATTR=TXT:* EXTRACT=TXT
SET !DATASOURCE_LINE {{!EXTRACT}}
wait seconds=1
TAG POS=1 TYPE=A ATTR=TXT:Process
wait seconds=0.5
TAG POS=1 TYPE=SELECT FORM=ID:FormProcess ATTR=ID:ddlBLOUpdate CONTENT={{!COL2}}
wait seconds=0.5
TAG POS=1 TYPE=BUTTON FORM=ID:FormProcess ATTR=ID:aERAUD
wait seconds=0.5
TAG POS=4 TYPE=BUTTON ATTR=TXT:×
wait seconds=0.5
But I am getting Error message that

RuntimeError: Invalid DATASOURCE_LINE value: 71, line 5 (Error code: -951)

The code has extracted the value 71from the Web Page but after that it stops. I have the Data Source file AutoSelect.csv in the Data Source folder and in that in column A the Value 71 is there in Row 3 and in Column B Row 3 the Value 78993 is there, But instead of moving further it stops. I want the CONTENT should select the value 78993.

Please Guide.

As We have used the DATASOURCE_LINE code it is searching for the Line 71 of the CSV file, When I entered data in Row 71 Column A as 71 and Column B as Value 78993, it worked, But I have random values to extract from the Web Page, and the CSV file is also not chronologically arranged. Please help what should be the change in the code.

Then OK, well..., I called it an "easy Solution", based on "[1,2,3,4,5]" being the "real" Data for your Col1 from our OP, then of course it won't work if your "real" Data turns out to be "[65,91,71,66,91]", ah-ah...! iMacros Solutions/Implementations often don't work very well with "fake" Data/fake Examples, ah-ah...! :wink:

Then OK, => "next Step"...! Well, don't worry, I have another half a dozen other Sols/Implementations, ah-ah...! :twisted:
=> Hum, I actually posted 10 different Methods, ah-ah-oops...! :P :shock: :twisted:

But I would have a few Qt's then...:
1- Why do you have x2 the same Value ("91") in Col1 in your "new/real" Data now...?
=> For Col1: [65,91,71,66,91]

2- About how many Rows/Pairs are we talking...? Is 5 Rows/Pairs the "real" Number, or do you have much more...? Give the Nb...
And will it be "fixed", or does that Nb regularly change, do you add new Pairs to the '.csv' File...?

3- Where do these Values/Pairs come from...? From some other iMacros Script extracting both Values from some "other" Web-Page..., or do you create/maintain the Content of that '.csv' File "manually" from 'Notepad' or 'Excel'...? (=> Well, you seem to be using 'Excel' from your Screenshot, but maybe that's only to "view" its Content...)

>

Then OK, several Sols/Implementations, from "easy" to "a bit more complex", ah-ah...!

1- If 5 is a "real" Nb for the Nb of Pairs/Rows in your '.csv', and depending on where that Content comes from and how often you need to edit it, an easy Sol would be to hard-code your Data in your Script directly... :idea:
=> With 2 Vars/Arrays, for example:

Code: Select all

SET AS_Input [65,91,71,66,93]
SET AS_Output [78987,79012,78993,78988,79013]
(I changed your 2nd "91" Value to "93", I don't understand otherwise why you would have twice the same Value... :? )

Then you would use `EVAL()` to pull out the corresponding Value from 'AS_Output' to use in your Script for the `CONTENT` Param.
(And you don't even need a '.csv' File/DataSource...)

2- Same like '1-', you can declare all the Data in just one Var/Array, if you prefer, using any Format you would like, here is one Example...:

Code: Select all

SET AS_Pairs [65,78987],[91,79012],[71,78993],[66,78988],[93,79013]
And same like '1-', use `EVAL()` for the Correspondence...

3- OK, we use a '.csv' DataSource now, ah-ah...!
If you only have 5 Rows of Data, then simply retrieve those 10 Cells/Values with 10 or 5 Statements to populate the 1 or 2 Var(s)/Array(s) from Sols '1-'/'2-' by hard-coding 5 times the `!DATASOURCE_LINE` to "1"/"2"/.../"5" like in:

Code: Select all

SET !DATASOURCE_LINE 1
SET AS_Input {{!COL1}}
SET AS_Output {{!COL2}}
'>
SET !DATASOURCE_LINE 2
ADD AS_Input ,{{!COL1}}
ADD AS_Output ,{{!COL2}}
'>
'[...]
'>
SET !DATASOURCE_LINE 5
ADD AS_Input ,{{!COL1}}
ADD AS_Output ,{{!COL2}}

4- Same like '3-' and based on '2-', if you prefer to use just 1 Var/Array, for example:

Code: Select all

SET !DATASOURCE_LINE 1
SET AS_Pairs {{!COL1}}/{{!COL2}}
'>
SET !DATASOURCE_LINE 2
ADD AS_Pairs ,{{!COL1}}/{{!COL2}}
'>
'[...]
'>
SET !DATASOURCE_LINE 5
ADD AS_Pairs ,{{!COL1}}/{{!COL2}}
Yeah, well, use any Separator(s) you'd like... :P

Implementations '3-' and '4-' would be fine if you don't want to maintain the Data in the Script itself (Methods '1-' + '2-'), like if the Content changes a bit often, and if the Nb of Rows/Pairs is pretty low/limited, like the "5" you've mentioned..., this would quickly become "a bit cumbersome" if you have more than 10 or 12 Rows, and if that Nb changes... :(

5- Still an "easy Implementation", => what about creating/maintaining your Data "horizontally" in just 2 Rows, instead of "vertically" like you are doing, with 2 Cols...? :idea:
=> That would give for your '.csv':

Code: Select all

65	91	71	66	91
78987	79012	78993	78988	79013
=> Then you can "fall back" onto Methods '1-' or '3-' (with 2 Vars/Arrays) and populate those 2 Vars in just 1 Retrieve Statement for each Row with a simple Trick:

Code: Select all

SET !DATASOURCE_DELIMITER #
'>
SET !DATASOURCE_LINE 1
SET AS_Input {{!COL1}}
'>
SET !DATASOURCE_LINE 2
SET AS_Output {{!COL1}}
By (temporarily) changing the `!DATASOURCE_DELIMITER` Param to some "fake" Separator (=> "#" in my Example instead of the "standard" ","), you "trick" iMacros into seeing a whole Row in the '.csv'/DataSource as just 1 Cell, and you can then keep all Commas separating all Cells, to reuse that Comma to re-separate all Values in the `EVAL()`.

Method '5-' would be fine for up to 100 Pairs, if you don't mind scrolling horizontally in your 'Excel' Sheet... :idea:

6- A bit like Method '5-', but if you really-really-really have to maintain the Data vertically with 2 Cols, => would be to add 1 or 2 (empty) Rows at the Top of your 'Excel' Sheet, => if you add 2 empty Rows, "61" will then shift from Cell 'A1' to 'A3', etc...), then select in 'Excel' the Cell Range for the 'A' Col from [A3-A7] (for 5 Pairs), => then well, until 'An' the last 'A' Value if you have more than 5 Pairs, => + 'Copy' ('Ctrrl^c') manually in 'Excel', + Right-Click on Cell 'A1' in your Sheet, + 'Paste Special' + 'Transpose' => Paste...!
And this will paste the complete Col1 into Row1 to get the same Content in Row1 like in Method '-5', where you will retrieve that complete Row using the same Technique, and finding the Value "71" for example in Cell 'C1' (=> at Position = "3") will "tell you" that its corresponding Col2 Value is located at Row 3+2 ("+2" because of the 2 added empty Rows) to retrieve directly Cell 'B5' (= "SET !DATASOURCE_LINE 5" + "{{!COL2}}") in your Script. Well, you'll need to use a Var for "5" of course...

Now with Methods '7-' + '8-' + '9-', those would be suitable for large Nb's of Pairs + highly "Dynamic" (Content/Values/Nb of Pairs changing often)...

7- From 'Excel', save your Sheet not as a '.csv' File but as '.html', and you can open this '.html' File from your Script (in a 2nd TAB), and if you extracted Value "71" for example, => then simply tag the Cell containing that "71" on this local '.html' File opened in 'TAB_2' with stg like "TAG POS=1 TYPE=* ATTR=TXT:71" (well with "ATTR=TXT:{{!EXTRACT}}" then, of course), and using "TAG POS=R1" + 'EXTRACT' will extract the corresponding Value (=> ="78993") to reuse and input in your Script.

8- Instead of using 'Excel', use some 'GoogleSheet', also in a 2nd TAB...
But hum..., 'GoogleSheet' can sometimes be a "Pain in the Ass" with 'Firefox', it works better in 'Chrome', I guess it's probably "a bit deliberate" from 'Google' to make sure that 'GoogleDocs' and 'GoogleSheets' "work better" in their own Browser, so they keep changing the Functionality of 'GoogleDocs/Sheets' regularly to use the "latest" tricky JS/CSS/HTML Features that "at some point" only the very-latest CR-Version supports, => constantly breaking some Func in FF and other Browsers, and you are actually using FF48, so "Chances are" 'GoogleSheets' will probably not work at all anymore in that Browser, or the Site will keep nagging you about updating to "a supported Browser", and especially 'Chrome', ah-ah...!

9- OK, last one, ah-ah...!
Still from 'Excel', => save your Sheet as '.txt' instead of '.csv'. (Or you can manually change the File Extension from 'Windows Explorer', the Result is the same...)
But you need to modify a bit the Structure of the Content of the Sheet, in 'Excel' (or 'Notepad'), and to add a "separating" Col between your current Col1 and Col2, using some repeated same Separator Char (=> "#" for example), and also another extra Col (either to the Left or to the Right of all existing Cols) with a different Separator (=> "##" for example), that will be used as Row Separator in `EVAL()` to re-separate Cell by Cell all Content/Values, after re-extracting the whole Content of the Sheet from that '.txt' (Local) File (also opened in a 2nd TAB) with a simple and single "TAG POS=1 TYPE=PRE ATTR=* EXTRACT=TXT" Statement. (Well, can use "TYPE=PRE" or "TYPE=BODY", what you prefer, iMacros (and the Browser) sees the whole Content as just one HTML Element...

Then you use the "#" and "##" Separators to re-separate all Cells/Values using `EVAL()` and you fall back onto any of the previous Methods I've already mentioned/explained...

But this last Method ('9-') is maybe a bit of "Overkill" in your Case, I would think, although it's very "powerful", ah-ah...!, I use it myself in a few of my own Scripts, with some Scripts reusing some Data extracted/retrieved by some other Scripts, as it gives me "Full Control" over the '.csv' Content/Format... And I go "even further", as I don't even need to open it (the '.txt' File) in a 2nd TAB, oh...!, but wait...!, then OK, I then have a 10th Method for you then, ah-ah...!

10- OK, this one is based on Method '9-' and requires all the "##" + "#" Formatting to separate Cells and Rows, well Rows maybe not, as the Row_Nb's (for `!DATASOURCE_LINE` need to be hard-coded, so it only "works" if you have a very limited Nb of Rows (like "5" from your Post), or if you maintain your Data "horizontally" (and not "vertically" in Cols).
But you need to create the '.txt' File from 'Notepad' (if you create it manually), and not from 'Excel', and to "make sure" that the whole File doesn't contain any Commas (unless you deliberately want to separate some Data into several Cells), then a single '{{!COL1}}' (like in Method '5-' will get you the whole Content of the File, or of a Row if it contains several Rows...

>

Then alright, you (already) have 10 different Methods to choose from, ah-ah...! :shock: , and hum..., depending on how much Data you have, and how many Rows/Pairs in your '.csv', if "5" is the "real" Nb, I think I would go "personally" for Method '1-' or '5-'... But that's your Choice, ah-ah...! :wink:

I can post the "needed" `EVAL()` Statements for those 2 Methods if you want, if you don't come out by yourself..., or for the Method you would like to choose, but it's not very complicated, I've already posted maybe 100's of Examples on the Forum, ah-ah...!, and after using iMacros for 6 years already now, this should be fairly simple for you also, I would think... :wink:

>>>
jyotirmaya wrote:
Tue May 10, 2022 11:08 am
Or the Solution is to Sort the values in Column1 As 1,2,3,4 etc ?? or any other alternative is there ?

Nope, won't help... See the 10 Methods I listed above... :twisted:
- (F)CI(M) = (Full) Config Info (Missing): iMacros + Browser + OS (+ all 3 Versions + 'Free'/'PE'/'Trial').
- I don't even read the Qt if that (required) Info is not mentioned...!
- Script & URL help a lot for more "educated" Help...
jyotirmaya
Posts: 49
Joined: Wed Jul 27, 2016 6:25 pm

Re: Web Scraping and select Content value based on condition from CSV

Post by jyotirmaya » Wed May 11, 2022 4:10 am

Thank you, you are the ocean of knowledge. :D
=> Beh..., no...!, it is/must be "CONTENT={{!COL2}}", you use it yourself in your modified Script in:
Yes it's {{!COL2}}" sorry again :cry:
1- Why do you have x2 the same Value ("91") in Col1 in your "new/real" Data now...?
=> For Col1: [65,91,71,66,91][/quote

I don't have pairs I wrongly mentioned number 91 twice and it's corresponding value in column B twice for demo purpose. I have unique value in Column A and unique value in column B, no such pairs or duplicate values are there.
2- About how many Rows/Pairs are we talking...? Is 5 Rows/Pairs the "real" Number, or do you have much more...? Give the Nb...
And will it be "fixed", or does that Nb regularly change, do you add new Pairs to the '.csv' File...?
There are nearly 300 rows in the CSV and numbers & the numbers don't change regularly.
1- If 5 is a "real" Nb for the Nb of Pairs/Rows in your '.csv', and depending on where that Content comes from and how often you need to edit it,
That content came from a different webpage and I don't need to edit those numbers, numbers are fix and unique.

Code: Select all

AS_Input [65,91,71,66,93]
SET AS_Output [78987,79012,78993,78988,79013]
Can I use this method for the 300 values? I can edit the script and enter the numbers, or it can work for only limited numbers??


All the mentioned 10 type of solutions are awesome and I learnt a lot of new things because of that, Thank you once again for your explanation. I haven't tried the methods yet. I will try them.

Thanks
Post Reply