Site now uses infinite scrolling - what to do?

by mfletcher on Wed Sep 20, 2017 8:40 am

I'm sorry, I'm very new to this. A friend wrote this a long time ago. I reckon I have to tell `extractFromLibrary` to use some scroll mechanism instead of parsing through pages? Thanks!

Code: Select all
// Instead of just extracting all the books from one library, this script imports a list of libraries from CSV
// and for each, saves all of their books to the output CSV

// To speed up the running of this script, the following iMacros preferences are recommended...
// Go to iMacros options, then the "general" tab
// Set "Replay Speed" to fast
// Under "Visual Effects", untick "scroll to object when found" as well as "Highlight object when found"
// Under "Javascript scripting settings", untick "Show Javascript during replay"

// File is read from the "datasources" path set in iMacros prefs, not "downloads" path pref
const inputFileName= "Libraries-to-extract-from.csv";
// The starting row number (to enable importing just part of a large CSV)
const startRowID = 1;
// Name of the file where the results are output (is saved into the "Downloads" folder set in iMacros prefs)
// NB: Every time this script is run, the results are just added to the end of this file.
// So delete/rename the output file if needed - to avoid duplicate entries.
const outputFileName= "Books-url-list.csv";

// ###################

// Global variable for status message, since using iimDisplay() clears previous messages
var statusMessage;

addStatusMessage("Importing " + inputFileName + ", starting at line " + startRowID);

// For each library in the CSV file (starting at startRowID), import all of their books
var rowID = startRowID;
while (true) {
    // Not using addStatusMessage() directly, since want to throw away this last message afterwards
    iimDisplay(statusMessage + "\n-Processing row " + rowID);
    var currentRowContents = getCSVRow(rowID);
    if (!currentRowContents) {
        // Break if end of file reached, or if there was an error reading the file (eg file not found)
        addStatusMessage("Exiting on row " + rowID + ". Either an error has occurred, or the end of file was reached.");
        break;
    }
    extractFromLibrary(currentRowContents);
    rowID++;
}

function extractFromLibrary(targetLibrary) {
    // URL of the books page to process
    var targetBooksPage = targetLibrary + "/books";
    goToPage(targetBooksPage);

    var lastPageID = getLastPageID();
    addStatusMessage("Saving pages 1->" + lastPageID + " for " + targetBooksPage);

    for (var i = 1; i <= lastPageID; i++) {
        // Not using addStatusMessage() directly, since want to throw away this last message afterwards
        iimDisplay(statusMessage + "\n-Processing page " + i);
        // Start of script navigated to page 1 already, so only need to change if i is not 1
        if (i != 1) goToPage(targetBooksPage + "?page=" + i);
        processCurrentPage();
    }
}


/* Helper Functions */

function runMacro(macro) {
    // Runs the specified macro with a reduced tag timeout of 3 seconds (default is 60)
    return iimPlay("CODE:" + "SET !TIMEOUT_TAG 3\n" + macro);
}
function addStatusMessage(newMessage) {
    // Using iimDisplay() clears previous messages, so global statusMessage variable used to save them
    if (!statusMessage) {
        statusMessage = "Starting script...";
    }
    statusMessage += "\n-" + newMessage;
    iimDisplay(statusMessage);
}
function getCSVRow(rowID) {
    var result = runMacro("SET !DATASOURCE " + inputFileName +
    "\nSET !DATASOURCE_COLUMNS 1" +
    "\nSET !DATASOURCE_LINE " + rowID +
    "\nSET !EXTRACT {{!COL1}}");
    if (result < 0) {
        // Fetching the row failed. Could be due to end of file or else file not found.
        return null;
    } else {
        return iimGetLastExtract(1);
    }
}
function goToPage(url) {
    // Navigates to the desired URL with images turned off, to decrease pageload time
    runMacro("FILTER TYPE=IMAGES STATUS=ON" +
            "\nURL GOTO=" + url);
}
function getLastPageID() {
    // Extract the page ID of the last page of books, using relative positioning numbering
    // The site uses "Page 1", "Page 2", "...", "Page N", "Next" type site navigation
    // First finds the "Next" link, then extracts the link text immediately prior to it, to get last page ID
    runMacro("TAG POS=1 TYPE=A ATTR=TXT:Next EXTRACT=TXT" +
            "\nTAG POS=R-1 TYPE=A ATTR=TXT:* EXTRACT=TXT");
    // Declared locally so the assignments below do not create an implicit global
    var lastPageID;
    if (iimGetLastExtract(2) == "#EANF#") {
        // Tags not found, or timeout reached
        addStatusMessage("No next page button found, so there must only be one page total" +
                " (or else the page didn't finish loading in 60s).");
        lastPageID = 1;
    } else {
        // Tags found, so use the link text value
        lastPageID = iimGetLastExtract(2);
    }
    return lastPageID;
}
function processCurrentPage() {
    var i = 0;
    while (true) {
        i++;
        // Attempt extraction of next library book link
        // Note: extraction and saving to CSV were not combined, since hard/impossible to know when to stop,
        // since logic not possible inside macros - and whenever SAVEAS TYPE=EXTRACT is used, the
        // EXTRACT variable is cleared. So iimGetLastExtract(1) always returns null, regardless of success or
        // failure. Even if the iimPlay return code was checked instead, #EANF# junk would still have been added
        // to the last row of the CSV, which isn't desired.
        // To reduce the slowdown caused by splitting the steps, the EXTRACT variable is manually set before
        // using SAVEAS, rather than wasting time using TAG again.
        runMacro("TAG POS=" + i + " TYPE=A ATTR=CLASS:library-link&&TITLE: EXTRACT=HREF");
        var currentLibraryURL = iimGetLastExtract(1);
        // If that link was found, save to the next line of the CSV, otherwise break out of loop
        if (currentLibraryURL == "#EANF#") {
            break;
        } else {
            runMacro("SET !EXTRACT " + currentLibraryURL +
                    "\nSAVEAS TYPE=EXTRACT FOLDER=* FILE=" + outputFileName);
        }
    }
}
mfletcher
 
Posts: 3
Joined: Wed Sep 20, 2017 8:22 am

Re: Site now uses infinite scrolling - what to do?

by chivracq on Wed Sep 20, 2017 8:56 am

mfletcher wrote:I'm sorry, I'm very new to this. A friend wrote this a long time ago. I reckon I have to tell `extractFromLibrary` to use some scroll mechanism instead of parsing through pages? Thanks!


FCIM...! :mrgreen: (Read my Sig...)

But Compliments already to your Friend, the Script is nicely written, probably by a Professional Programmer used to working in a Team with other Programmers on the same Project/Code... We don't see that "Quality" very often on the Forum, ah-ah...! :oops:
Script is a bit old indeed, maybe from 5 or 6 years ago as I see some deprecated Command(s), ah-ah...!
But OK, mention your FCI for me to "elaborate"... :idea:
- (F)CIM = (Full) Config Info Missing: iMacros + Browser + OS with all 3 Versions...
- I usually don't even read the Question if that (required) Info is not mentioned...
- Script & URL usually help a lot for a more "educated" Help...
chivracq
 
Posts: 6477
Joined: Sat Apr 13, 2013 6:07 am
Location: Amsterdam (NL)

Re: Site now uses infinite scrolling - what to do?

by mfletcher on Wed Sep 20, 2017 9:44 am

Yeah, he's very good isn't he :)

I'm on Firefox 55.0.3 with iMacros for Firefox 9.0.3 on Windows 10.

Re: Site now uses infinite scrolling - what to do?

by chivracq on Wed Sep 20, 2017 11:29 am

mfletcher wrote:Yeah, he's very good isn't he :)

I'm on Firefox 55.0.3 with iMacros for Firefox 9.0.3 on Windows 10.

OK, FCI mentioned, perfect, now we can talk...! :D
(Always mention your FCI when you open a Thread (or post for the first time in some existing Thread), I don't react otherwise... :idea: , and many Commands are not implemented for all Browsers/Versions...)

Answering first your original Qt, the Scrolling is usually achieved using the following Statement/Syntax (from an '.iim' (on-the-fly or native) Macro):
Code: Select all
URL GOTO=javascript:window.scrollBy(0,500)

But you are on iMacros for FF v9.0.3 and I'm not sure if this Syntax still works in that Version: several Threads appeared on the Forum right after v9.0.3 for FF got released (Aug. 2016) about this Version breaking the 'scrollBy()' Syntax...
I don't know myself actually because I've never installed v9.0.3 which was a bit too buggy and limited from Day_1, so I've never had a chance to test it myself... The previous Version (v8.9.7 for FF) is still the advised/stable Version to use, and it still works on FF v55.0.3. (I run it myself in this exact same FCI, + Win10-x64 as well.)
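A minimal sketch of how the 'scrollBy()' Statement above could replace the page loop in `extractFromLibrary`: keep scrolling until a scroll brings no new links. Here `play` and `getExtract` are stand-ins for `iimPlay` and `iimGetLastExtract`, and the function names, the 5000px step, the 2 second wait and the simplified `CLASS:library-link` attribute are all assumptions. It has not been tested against the site or against v9.0.3.

```javascript
// Sketch only: "play" and "getExtract" stand in for iimPlay and iimGetLastExtract,
// so the loop can be tried outside iMacros. All function names here are hypothetical.

// Counts links by probing TAG positions after the last known one until #EANF# comes back.
function countLinks(known, play, getExtract) {
    while (true) {
        play("CODE:SET !TIMEOUT_TAG 3\n" +
             "TAG POS=" + (known + 1) + " TYPE=A ATTR=CLASS:library-link EXTRACT=HREF");
        if (getExtract(1) == "#EANF#") return known;
        known++;
    }
}

// Scrolls until a scroll no longer adds links, or until maxScrolls is reached.
function scrollUntilNoNewLinks(play, getExtract, maxScrolls) {
    var count = countLinks(0, play, getExtract);
    for (var scrolls = 0; scrolls < maxScrolls; scrolls++) {
        // Scroll statement quoted in this thread; the 5000px step and 2s wait are guesses.
        play("CODE:URL GOTO=javascript:window.scrollBy(0,5000)\nWAIT SECONDS=2");
        var newCount = countLinks(count, play, getExtract);
        if (newCount == count) break; // nothing new was loaded, so assume the end of the list
        count = newCount;
    }
    return count;
}
```

Inside the Script this would be called as `scrollUntilNoNewLinks(iimPlay, iimGetLastExtract, 50)` after `goToPage(targetBooksPage)`. Once the count stops growing, the existing `processCurrentPage()` can be run once per library instead of once per page.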

If you don't want to revert to v8.9.7, Scrolling (down) can also be achieved using the Keyboard 'Spacebar' which can be input from a Macro using the 'EVENT' Mode. That could be a Workaround. :idea:
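A rough sketch of the 'Spacebar' Workaround, building the Macro text from the '.js' Script. The exact 'EVENT' parameters used here (`TYPE=KEYPRESS`, `SELECTOR="HTML>BODY"`, `KEY=32` for the Spacebar key code) are an assumption based on the usual recorded form of 'EVENT' Macros, and the helper name is hypothetical, so record one key press in your own Version to confirm the Syntax first.

```javascript
// Hypothetical helper: builds a macro that presses the Spacebar "presses" times,
// waiting "waitSeconds" after each press so newly loaded items can appear.
// Assumption: EVENT TYPE=KEYPRESS with KEY=32 (Spacebar key code) sent to the page body.
function buildSpacebarMacro(presses, waitSeconds) {
    var lines = [];
    for (var i = 0; i < presses; i++) {
        lines.push('EVENT TYPE=KEYPRESS SELECTOR="HTML>BODY" KEY=32');
        lines.push("WAIT SECONDS=" + waitSeconds);
    }
    return lines.join("\n");
}

// Inside an iMacros script this would be played with:
//   iimPlay("CODE:" + buildSpacebarMacro(5, 1));
var exampleMacro = buildSpacebarMacro(2, 1);
```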

I mentioned that your Script was using a few deprecated Commands, '!TIMEOUT_TAG' is one of them (replaced by '!TIMEOUT_STEP'), even if it still works, at least until v8.9.7, and I would actually be curious to know if it still works in v9.0.3...?, as several deprecated Commands were actually completely removed from the Code for v9.0.3...?
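One way to find out which of the two timeout Variables a given Version accepts would be to try each one and look at the return code. This is only a sketch: `play` stands in for `iimPlay`, the helper name is hypothetical, and it assumes that `iimPlay` returns a negative code when a 'SET' on an unsupported Variable fails.

```javascript
// Hypothetical helper: returns the first timeout variable the installed version accepts.
// Assumption: play() (i.e. iimPlay) returns a negative code when the SET fails.
function pickTimeoutVariable(play) {
    var candidates = ["!TIMEOUT_STEP", "!TIMEOUT_TAG"];
    for (var i = 0; i < candidates.length; i++) {
        if (play("CODE:SET " + candidates[i] + " 3") >= 0) {
            return candidates[i];
        }
    }
    return null; // neither variable was accepted
}
```

The result could then be used in `runMacro()` in place of the hard-coded `!TIMEOUT_TAG`.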

Another deprecated Command would be 'iimGetLastExtract()' (replaced by 'iimGetExtract()'), but I know that this one still works in v9.0.3.
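If the Script has to run on Versions with either name, a small wrapper could hide the difference. A sketch, with a hypothetical helper name, that simply checks which of the two functions is defined:

```javascript
// Hypothetical wrapper: prefers iimGetExtract() when the installed version defines it,
// and falls back to the older iimGetLastExtract() otherwise.
function getExtractCompat(n) {
    if (typeof iimGetExtract === "function") {
        return iimGetExtract(n);
    }
    return iimGetLastExtract(n);
}
```

The calls to `iimGetLastExtract(1)` and `iimGetLastExtract(2)` in the Script would then become `getExtractCompat(1)` and `getExtractCompat(2)`.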

Another Command that your Script uses and that might cause a Pb is 'FILTER'. :oops:
It is not deprecated at all in v9.0.3, but I have reported that it got broken (using v8.9.7 for FF) from FF53 onwards (possibly already from FF52: I went straight from FF51 to FF53 and then directly to FF54, and it was still working fine in FF v51.0.1), but absolutely no other User(s) has/have confirmed my Report.
The Script ('.iim', I only use '.iim', I don't use any '.js') just hangs, and the Browser (FF53/54/55) hangs for ever and needs to be killed from Task Manager. I'm talking about a "large" Page with 1000 Images. A "small" Page with a few Images will eventually still manage to load (after 40 sec / 1 min / 2 min) and the Browser will not crash, but the Purpose of using 'FILTER' is a bit gone anyway as a Page is supposed to load instantly when using this Command... :shock:
- 'FILTER TYPE=IMAGES' crashes FF53/54...!
=> I don't know if v9.0.3 for FF is impacted as well but I would be interested to know if it gets broken as well in that Version (+ your current FF v55.0.3).
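To make that test easy, the 'FILTER' line in `goToPage()` could be made optional, so the same Script can be run with and without it. A sketch with a hypothetical helper name and flag, otherwise reusing the two Macro lines from the Script:

```javascript
// Hypothetical helper: builds the navigation macro, adding the FILTER line only on request.
function buildGoToMacro(url, blockImages) {
    var lines = [];
    if (blockImages) {
        lines.push("FILTER TYPE=IMAGES STATUS=ON");
    }
    lines.push("URL GOTO=" + url);
    return lines.join("\n");
}
```

`goToPage(url)` would then call `runMacro(buildGoToMacro(url, false))` while testing whether 'FILTER' is what makes the Browser hang.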

Re: Site now uses infinite scrolling - what to do?

by mfletcher on Fri Sep 22, 2017 6:12 am

Thank you so much for your answer :)

Please allow for some time to pass while I try to update this script.

Have a good weekend!

Re: Site now uses infinite scrolling - what to do?

by chivracq on Fri Sep 22, 2017 9:01 am

mfletcher wrote:Thank you so much for your answer :)

Please allow for some time to pass while I try to update this script.

Have a good weekend!

Yeah-yeah, no Pb, don't worry, I'll notice your Reply once you'll have had (hum, funny grammatical Construction...!) the time to adjust your Script. And if you decide to revert to v8.9.7, please first test/confirm '!TIMEOUT_TAG' + 'FILTER' in v9.0.3 before reverting. :wink: