Scraping text/links from specific web page <div> section to clipboard?

iMacros EOL - Attention!

The renewal maintenance has officially ended for Progress iMacros effective November 20, 2023 and all versions of iMacros are now considered EOL (End-of-Life). The iMacros products will no longer be supported by Progress (aside from customer license issues), and these forums will also no longer be moderated from the Progress side.

Thank you again for your business and support.

Sincerely,
The Progress Team

DavidRTurner
Posts: 3
Joined: Mon May 16, 2022 6:27 pm

Post by DavidRTurner » Mon May 16, 2022 7:30 pm

[EDITED after feedback - sorry for the long post, trying to input all relevant information...]

I've been working on a project for a couple of years (it's a continual thing, so it's never-ending).
I've progressed to a point of using mouse/keyboard macros to scrape a list of text/links from a set of variable-length pages, to paste into Excel; then run an Excel macro to manipulate that data; then return to the webpage, close it & repeat on the next one (I have some error-checking in place in case of a failure).
I do this every 6 months or so.

I am scraping about *75,000 cemetery index pages on http://www.billiongraves.com, copying the names/dates/links of the people interred there, then sorting, filtering and eventually editing errors & merging duplicate records.
*FYI - there are about 600,000 cemetery pages, but I do some data preparation first, extracting only the 75,000 pages with data on them.

Recently, because of some minor site changes and Firefox add-in customizations, the macros that I painstakingly created over time (to pixel-perfect page coordinates, with JitBit Macro Recorder) need to be shifted & changed, which will take me a week or more to do. It's painful...

I'm thinking of iMacros as an alternative (or as an additional part of the process), as I would LIKE TO do the following, but am not sure it's capable of this.
A typical page has a particular <div> section, identified by its id attribute, which shows the data I want - it would be ideal if I could select JUST that <div> section and copy all of its contents at once to the clipboard, which I can then paste into Excel.
*right now, my macro is scrolling & selecting specifically-positioned lines depending on the length of the list...

So I'm looking for this very basic need first, as I can build up more functionality around it later as I learn more about iMacros.

EXAMPLE PAGE: https://billiongraves.com/site-map?ceme ... 295&page=0 - in the Page Source is the section:

Code: Select all

	<div id="content">
        <h1 style="margin: 10px 0 25px 10px;">BillionGraves Site Map</h1>
        <div class="card">
            <h1 style="float:left; margin: 10px 0 10px 10px;">Burial records in <a href='/cemetery/Bethesda-Cemetery/100295' >Bethesda Cemetery</a></h1>
            <br class="clearfloat" />
            <div style="border-bottom:#CCC thin solid; width:916px;"> </div>

            <div class="center">
                <!-- HERE IS THE DIV ID SECTION 'MULTIPLE' WHICH CONTAINS THE DATA I WANT TO COPY -->
                <div id="multiple">
                    <div class='backlinks'><a href='/site-map'>Sitemap</a> > <a href='/site-map?country=United+States'>United States</a> > <a href='/site-map?country=United+States&state=Tennessee'>Tennessee</a> > <a href='/cemetery/Bethesda-Cemetery/100295'>Bethesda Cemetery</a></div><div><div class='record'><a href='/grave/William-R-Brooks/31780628' alt='Brooks, William R. (1833 - 1864)' title='Brooks, William R. (1833 - 1864)'>Brooks, William R. (1833 - 1864)</a></div><div class='record'><a href='/grave/Nathan-Andrew-Jackson/31709567' alt='Jackson, Nathan Andrew (1838 - 1864)' title='Jackson, Nathan Andrew (1838 - 1864)'>Jackson, Nathan Andrew (1838 - 1864)</a></div><div class='record'><a href='/grave/Josiah-S-Price/31694361' alt='Price, Josiah S (1838 - 1862)' title='Price, Josiah S (1838 - 1862)'>Price, Josiah S (1838 - 1862)</a></div><div class='record'><a href='/grave/Charles-J-Shropshire/31780629' alt='Shropshire, Charles J. (1841 - 1863)' title='Shropshire, Charles J. (1841 - 1863)'>Shropshire, Charles J. (1841 - 1863)</a></div><div class='record'><a href='/grave/William-A-Wingard/31709460' alt='Wingard, William  A. (1839 - 1864)' title='Wingard, William  A. (1839 - 1864)'>Wingard, William  A. (1839 - 1864)</a></div></div><br/><br/>Pages: <span>1</span>&nbsp;                </div>
            </div>
        </div>
My hope is that in selecting 'the entire block' - i.e. the whole <div> section - I can copy all the contents in one shot, rather than having macros scroll to the bottom to capture all the lines, which is not yet working 100% perfectly.

Is iMacros able to select a specific page section (edit: I have found that it can) and copy the hyperlink contents to clipboard or a file? I see that it can extract content to a CSV file, for example (edit: in the paid version, not the freeware one). This could work for me (if it creates a 2-column file, with the TEXT and also the LINK - really NEED both!), as I could later combine the CSVs and import to Excel in bulk.
*If the functionality is there to quickly/easily copy a defined <div> section, I'm happy to pay the $99 for the basic version to allow me to SAVEAS a file...

Any help or direction appreciated.


I just installed the (free) Firefox add-in "iMacros for Firefox" - v. 10.1.0.1485, on Windows10 Pro-64 (v.19043.1706) with Firefox v100 (64bit).
Last edited by DavidRTurner on Mon May 16, 2022 11:09 pm, edited 3 times in total.
chivracq
Posts: 10301
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Post by chivracq » Mon May 16, 2022 8:11 pm

DavidRTurner wrote:
Mon May 16, 2022 7:30 pm
I've been working on a project for a couple of years (it's a continual thing, so it's never-ending).
I've progressed to a point of using mouse/keyboard macros to scrape a list of text/links from a set of variable-length pages, to paste into Excel; then run an Excel macro to manipulate that data; then return to the webpage, close it & repeat on the next one (I have some error-checking in place in case of a failure).
I do this every 6 months or so.

I am scraping about *75,000 cemetery index pages on billiongraves.com, copying the names/dates/links of the people interred there, then sorting, filtering and eventually editing errors & merging duplicate records.
*FYI - there are about 600,000 cemetery pages, but I do some data preparation first, extracting only the 75,000 pages with data on them.

Recently, because of some minor site changes and Firefox customizations, the macros that I painstakingly created over time (to pixel-perfect page coordinates, with JitBit Macro Recorder) need to be shifted & changed, which will take me a week or more to do. It's painful...

I'm thinking of iMacros as an alternative (or as an additional part of the process), as I would LIKE TO do the following, but am not sure it's capable of this:
- a typical page has a particular <div> section - it would be IDEAL if I could select JUST that section and COPY ALL ITS CONTENTS (the list of names, with their links) at once, to the clipboard, which I can then paste into Excel.
*right now, I'm scrolling & selecting specifically-positioned lines depending on the length of the list...

So I'm looking for this very basic need first, as I can build up more functionality around it later as I learn more about iMacros.

EXAMPLE PAGE: https://billiongraves.com/site-map?ceme ... 295&page=0 - in the Page Source is the section:

Code: Select all

	<div id="content">
        <h1 style="margin: 10px 0 25px 10px;">BillionGraves Site Map</h1>
        <div class="card">
            <h1 style="float:left; margin: 10px 0 10px 10px;">Burial records in <a href='/cemetery/Bethesda-Cemetery/100295' >Bethesda Cemetery</a></h1>
            <br class="clearfloat" />
            <div style="border-bottom:#CCC thin solid; width:916px;"> </div>

            <div class="center">
                <!-- HERE IS THE DIV ID SECTION 'MULTIPLE' WHICH CONTAINS THE DATA I WANT TO COPY -->
                <div id="multiple">
                    <div class='backlinks'><a href='/site-map'>Sitemap</a> > <a href='/site-map?country=United+States'>United States</a> > <a href='/site-map?country=United+States&state=Tennessee'>Tennessee</a> > <a href='/cemetery/Bethesda-Cemetery/100295'>Bethesda Cemetery</a></div><div><div class='record'><a href='/grave/William-R-Brooks/31780628' alt='Brooks, William R. (1833 - 1864)' title='Brooks, William R. (1833 - 1864)'>Brooks, William R. (1833 - 1864)</a></div><div class='record'><a href='/grave/Nathan-Andrew-Jackson/31709567' alt='Jackson, Nathan Andrew (1838 - 1864)' title='Jackson, Nathan Andrew (1838 - 1864)'>Jackson, Nathan Andrew (1838 - 1864)</a></div><div class='record'><a href='/grave/Josiah-S-Price/31694361' alt='Price, Josiah S (1838 - 1862)' title='Price, Josiah S (1838 - 1862)'>Price, Josiah S (1838 - 1862)</a></div><div class='record'><a href='/grave/Charles-J-Shropshire/31780629' alt='Shropshire, Charles J. (1841 - 1863)' title='Shropshire, Charles J. (1841 - 1863)'>Shropshire, Charles J. (1841 - 1863)</a></div><div class='record'><a href='/grave/William-A-Wingard/31709460' alt='Wingard, William  A. (1839 - 1864)' title='Wingard, William  A. (1839 - 1864)'>Wingard, William  A. (1839 - 1864)</a></div></div><br/><br/>Pages: <span>1</span>&nbsp;                </div>
            </div>
        </div>
My hope is that in selecting 'the entire block' - i.e. the whole <div> section - I can copy all the contents in one shot, rather than having macros scroll to the bottom to capture all the lines, which is not yet working 100% perfectly.

Is iMacros able to SELECT this specific section (in testing, I know it CAN select the section) and copy the contents to clipboard or a file? I see that it can extract content to a CSV file, for example. This could work for me (if it creates a 2-column file, with the TEXT and also the LINK - really NEED both!), as I could later combine the CSVs and import to Excel in bulk.

Any help or direction appreciated.


I just installed

Code: Select all

iMacros for Firefox - v. 10.1.0.1485, on Windows10.  Current version of Firefox.

Alright, you "found" your Thread, ... that I moved directly to the "correct" Sub-Forum (=> 'Data Extraction'), you were first reading some Threads in the 'Data Extraction' Sub-Forum, + 'How-To' Threads about "Data Extraction", but you opened your Thread in the 'iMacros for FF' Sub-Forum, while it has nothing specific to iMacros for FF... :o

Then hum, we don't scream in huge big Red Bold Letters on this Forum...!, => can you edit that part and use some "normal" Formatting/Layout...? (I refused to read that part..., and I kind of "obfuscated" it in my Quote in a ]code[ Block..., I'll edit back later... The whole Post is actually annoying to read, because of that "screaming" part, screaming for attention..., grrr...!) :shock:

(And/or I'll edit your Post myself "in a few hours" when I check the Thread again, if you haven't reacted/complied yet then... :| )
A bit of Bold is perfectly fine, and Size, up to Max=120 for a short part/sentence can be OK, but Bold + Size=150 + Red is really annoying... :roll:
And I would say/think, => Bold only is more than "good enough", I'll be "the one" (and probably the only one) reading your Post and helping you in this Thread, I'm "clever" enough to read and understand your whole Post and "extirpate" myself what Info is important or not, and to ask the "right" Qt's if anything is not clear to me to understand your Scenario precisely... :P 8) :shock: :twisted:

FCI mentioned, good, (even if you read the Forum Rules only after posting your Thread, ah-ah...! :? ), but hum, "I just installed" => 'Free'/'Trial' is still missing from your iMacros Version...?, + "Current" in "Current version of Firefox." doesn't mean anything on a Tech Forum, => simply mention the Version..., => FF100 I think...?

And from a quick Look at the Page is this the Data you want to extract...?:

Code: Select all

Sitemap > United States > Tennessee > Bethesda Cemetery
Brooks, William R. (1833 - 1864)
Jackson, Nathan Andrew (1838 - 1864)
Price, Josiah S (1838 - 1862)
Shropshire, Charles J. (1841 - 1863)
Wingard, William A. (1839 - 1864)
+ Do you want the Links on the 5 Names...?
- (F)CI(M) = (Full) Config Info (Missing): iMacros + Browser + OS (+ all 3 Versions + 'Free'/'PE'/'Trial').
- FCI not mentioned: I don't even read the Qt...! (or only to catch Spam!)
- Script & URL help a lot for more "educated" Help...
DavidRTurner
Posts: 3
Joined: Mon May 16, 2022 6:27 pm

Post by DavidRTurner » Mon May 16, 2022 9:57 pm

chivracq wrote:
Mon May 16, 2022 8:11 pm
Alright, you "found" your Thread, ... that I moved directly to the "correct" Sub-Forum (=> 'Data Extraction'), you were first reading some Threads in the 'Data Extraction' Sub-Forum, + 'How-To' Threads about "Data Extraction", but you opened your Thread in the 'iMacros for FF' Sub-Forum, while it has nothing specific to iMacros for FF... :o
OK, yes, I am using the Firefox add-in and I supposed that was the best forum for it, as I'm not interested in using an alternative browser, which may be someone's response. I'll take your lead and keep it in Data Extraction.
Then hum, we don't scream in huge big Red Bold Letters on this Forum...!, => can you edit that part and use some "normal" Formatting/Layout...? (I refused to read that part..., and I kind of "obfuscated" it in my Quote in a ]code[ Block..., I'll edit back later... The whole Post is actually annoying to read, because of that "screaming" part, screaming for attention..., grrr...!) :shock:
I've learned to do this on essentially every other forum I've used, because people don't actually read more than a few sentences unless their attention is grabbed. So I tend to highlight the simplest, most relevant line like this. Again, I'll take your lead on this here.
(And/or I'll edit your Post myself "in a few hours" when I check the Thread again, if you haven't reacted/complied yet then... :| )
A bit of Bold is perfectly fine, and Size, up to Max=120 for a short part/sentence can be OK, but Bold + Size=150 + Red is really annoying... :roll:
And I would say/think, => Bold only is more than "good enough", I'll be "the one" (and probably the only one) reading your Post and helping you in this Thread, I'm "clever" enough to read and understand your whole Post and "extirpate" myself what Info is important or not, and to ask the "right" Qt's if anything is not clear to me to understand your Scenario precisely... :P 8) :shock: :twisted:
I've edited the post and changed a few bits of it to be more clear.
FCI mentioned, good, (even if you read the Forum Rules only after posting your Thread, ah-ah...! :? ), but hum, "I just installed" => 'Free'/'Trial' is still missing from your iMacros Version...?, + "Current" in "Current version of Firefox." doesn't mean anything on a Tech Forum, => simply mention the Version..., => FF100 I think...?
Updated.
And from a quick Look at the Page is this the Data you want to extract...?:

Code: Select all

Sitemap > United States > Tennessee > Bethesda Cemetery
Brooks, William R. (1833 - 1864)
Jackson, Nathan Andrew (1838 - 1864)
Price, Josiah S (1838 - 1862)
Shropshire, Charles J. (1841 - 1863)
Wingard, William A. (1839 - 1864)
+ Do you want the Links on the 5 Names...?
Yes, I need to extract the text rows with their links.
When I copy/paste into Excel (say, column A), I run a macro that extracts the URLs and puts them in column B, then pastes the text "as text" back to column A.
Both columns end up as text-only (Excel can only handle 65k URLs per sheet). But I fill a million rows of data before moving to the next sheet.
chivracq
Posts: 10301
Joined: Sat Apr 13, 2013 1:07 pm
Location: Amsterdam (NL)

Post by chivracq » Mon May 16, 2022 10:45 pm

OK, Thanks for the Edit and (finishing) mentioning your FCI... :D

And, ah-ah...!, as you can see, the Qt's I asked about what Data you wanted to extract were actually contained/explained exactly in that ugly red-bold-shouting Section of yours that I had refused to read... :wink:

Then well, if what you really want is to extract the HTML Source Code of the '<div id="multiple">', then that's pretty straightforward...:

Code: Select all

SET !EXTRACT_TEST_POPUP NO
TAG POS=1 TYPE=DIV ATTR=ID:multiple EXTRACT=HTM
SET !CLIPBOARD {{!EXTRACT}}
PROMPT {{!CLIPBOARD}}
... which displays in the 'PROMPT', and will already be copied to your OS Clipboard:

Code: Select all

<div style="outline: 1px solid blue;" id="multiple">                     <div class="backlinks"><a href="/site-map">Sitemap</a> &gt; <a href="/site-map?country=United+States">United States</a> &gt; <a href="/site-map?country=United+States&amp;state=Tennessee">Tennessee</a> &gt; <a href="/cemetery/Bethesda-Cemetery/100295">Bethesda Cemetery</a></div><div><div class="record"><a href="/grave/William-R-Brooks/31780628" alt="Brooks, William R. (1833 - 1864)" title="Brooks, William R. (1833 - 1864)">Brooks, William R. (1833 - 1864)</a></div><div class="record"><a href="/grave/Nathan-Andrew-Jackson/31709567" alt="Jackson, Nathan Andrew (1838 - 1864)" title="Jackson, Nathan Andrew (1838 - 1864)">Jackson, Nathan Andrew (1838 - 1864)</a></div><div class="record"><a href="/grave/Josiah-S-Price/31694361" alt="Price, Josiah S (1838 - 1862)" title="Price, Josiah S (1838 - 1862)">Price, Josiah S (1838 - 1862)</a></div><div class="record"><a href="/grave/Charles-J-Shropshire/31780629" alt="Shropshire, Charles J. (1841 - 1863)" title="Shropshire, Charles J. (1841 - 1863)">Shropshire, Charles J. (1841 - 1863)</a></div><div class="record"><a href="/grave/William-A-Wingard/31709460" alt="Wingard, William  A. (1839 - 1864)" title="Wingard, William  A. (1839 - 1864)">Wingard, William  A. (1839 - 1864)</a></div></div><br><br>Pages: <span>1</span>&nbsp;                </div>
(Tested on iMacros for FF v8.8.2, PM v26.3.3, Win10_x64_21H2_Pro.)

Then OK, if you've already put all the "Data Cleaning" Logic in your Macro(s) in 'Excel', and you are "happy" with the raw HTML Source from that 'DIV', fair enough..., but iMacros could do much more, and could do the "Data Cleaning" also..., and even save the Data directly as '.csv', I'm not sure about the Speed, as it would depend on how many Records (on Average) get displayed per Page, and loading each new Page is what takes the most time, but I think it would be at least 100 Records per Minute... (But that Func (Looping >100 + Saving as '.csv') is not included in the 'Free' Version, only in the 'PE' Version, => about 100US$ for the License...)
But I guess "we" are all more "confident" with the Tools and Technology that we already master best, ah-ah...! :wink:
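
For anyone who would rather post-process the raw 'EXTRACT=HTM' result outside iMacros and Excel, here is a minimal sketch in Python (standard library only). It assumes the HTML of the 'multiple' DIV has been saved into a string; the fragment below is a shortened copy of the one shown above, and the names RecordLinkParser and extract_records are made up for this illustration, not part of iMacros.

```python
from html.parser import HTMLParser


class RecordLinkParser(HTMLParser):
    """Collect (text, href) pairs for <a> tags inside <div class="record">."""

    def __init__(self):
        super().__init__()
        self.rows = []            # finished (text, href) pairs
        self._in_record = False   # True while inside a <div class="record">
        self._href = None         # href of the <a> currently being read
        self._text = []           # text chunks of that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "record":
            self._in_record = True
        elif tag == "a" and self._in_record:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace, as a browser does when rendering.
            text = " ".join("".join(self._text).split())
            self.rows.append((text, self._href))
            self._href = None
        elif tag == "div":
            self._in_record = False


def extract_records(html):
    """Return a list of (name/date text, relative URL) tuples."""
    parser = RecordLinkParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Shortened sample of the extracted DIV (two of the five records).
fragment = (
    '<div id="multiple">'
    '<div class="backlinks"><a href="/site-map">Sitemap</a></div>'
    '<div><div class="record"><a href="/grave/William-R-Brooks/31780628" '
    'title="Brooks, William R. (1833 - 1864)">Brooks, William R. (1833 - 1864)</a></div>'
    '<div class="record"><a href="/grave/William-A-Wingard/31709460" '
    'title="Wingard, William  A. (1839 - 1864)">Wingard, William  A. (1839 - 1864)</a></div>'
    '</div><br><br>Pages: <span>1</span>&nbsp;</div>'
)

rows = extract_records(fragment)
for text, href in rows:
    print(text, href, sep="\t")
```

The 'backlinks' links are skipped because only anchors inside a 'record' DIV are collected, so the result is one (text, link) pair per grave.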
DavidRTurner
Posts: 3
Joined: Mon May 16, 2022 6:27 pm

Post by DavidRTurner » Tue May 17, 2022 12:40 am

I appreciate your reply, but I want the text of the lines of data, as well as their URLs (from the 'Multiple' DIV ID).
I do not want the HTML code. I'll try re-wording some of this...

I want to copy the list of Names (and their hyperlinks). I want the end result to be a column of names/dates and a column of their URLs, in Excel.
What I do now: I simply mouse-select the range of data and CTRL-C; then CTRL-V into Excel. Very basic.
The Windows clipboard holds the whole element structure (the text and its hyperlink), but pasting into a CSV or a text editor gives only the name/date, as plain text can't carry the URL.
But Excel receives the name/date lines with their links. I then use an Excel function in a macro to copy the URL info into a second column.

Example of the result of the sample page https://billiongraves.com/site-map?ceme ... 295&page=0, after my current copy/paste/Excelmacro:
[Attached screenshot: 2022-05-16 6-25-38 PM.png]
Hopefully this helps visualize, better than my descriptions...


I have been digging into iMacros, and can see it's a bit more complex than I first thought.

Since iMacros isn't working on the "surface data of the page", and is looking at the HTML code itself:
It needs to pull the text and its matching URL, for each name line.
The more I think about it, I suppose if it is reading the HTML code directly, it would need to extract the text of each link (the same string as its title attribute) and then its URL (the href of the <a> inside each <div class='record'>), and repeat until the end of the list... since the whole page's list is just one long HTML line, it can't do the whole-page-copy-paste that I'm currently doing.


You've helped me get started on the ATTR=ID piece, to select the right section; I'll dig into it some more, but if anyone has a code snippet or direction to a better topic, much appreciated!
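
As a rough illustration of the two-column result described above (name/date in one column, URL in the other), here is a small Python sketch that writes such pairs as CSV. It is only a sketch under assumptions: the function name rows_to_csv and the header names are invented for the example, the sample pairs are copied from the example page, and the base URL is the site root taken from that page's address.

```python
import csv
import io
from urllib.parse import urljoin

# Site root taken from the example page URL; used to make the links absolute.
BASE_URL = "https://billiongraves.com"


def rows_to_csv(rows, base_url=BASE_URL):
    """Return CSV text with one 'name,url' line per (text, href) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "url"])  # header names are illustrative only
    for text, href in rows:
        # The names contain commas, so csv.writer quotes them automatically.
        writer.writerow([text, urljoin(base_url, href)])
    return buffer.getvalue()


# Two (text, relative link) pairs copied from the example page.
sample_rows = [
    ("Brooks, William R. (1833 - 1864)", "/grave/William-R-Brooks/31780628"),
    ("Price, Josiah S (1838 - 1862)", "/grave/Josiah-S-Price/31694361"),
]

csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

Because every name has the form "Surname, Given (dates)", the quoting done by the csv module is what keeps each record in exactly two columns when the file is opened in Excel.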