Search list > Extract List > Download page

Re: Search list > Extract List > Download page

by chivracq on Sun Feb 04, 2018 11:42 am

ccvle wrote: My current script is posted below, but my question is not really about my script. I'm able to auto-search, click on the result and get to the final page with the company information. However, the company profile information looks like it is displayed as a GIF image. See example -> http://apps.dor.wa.gov/BRD/Utilities/Br ... W1D9J&rsp=

Code:
SET !ERRORIGNORE YES
TAB T=1
URL GOTO=http://apps.dor.wa.gov/BRD/default.aspx
TAG POS=1 TYPE=INPUT:TEXT FORM=ID:form1 ATTR=ID:MainContent_SearchControl_txtCriteria CONTENT="123 1ST Call Bail Bonds INC"
TAG POS=1 TYPE=A ATTR=ID:MainContent_SearchControl_lnkSearch
TAG POS=1 TYPE=INPUT:IMAGE ATTR=NAME:ctl00$MainContent$SearchControl$dgListResults$ctl03$TextImage1
SAVEAS TYPE=PNG FOLDER=C:\Users\Desktop\Screenshots\ FILE="123 1ST Call Bail Bonds INC"

Euh..., I see your Screenshot as a '.JPEG', both from 'Save Image as' and from 'View Image Info' in Pale Moon (PM v26.3.3).
That is a little bit "strange" anyway: even if it's not a '.GIF' Image, I would expect a '.PNG' Screenshot from your 'SAVEAS TYPE=PNG' Statement...

But hum..., are you sure you want to save all that Data as Screenshots? They may be easy to print, but you won't be able to search the Content of those Records or do any further Data Processing. You would be better off extracting all Fields one by one and storing them in a '.CSV' File, I would think...

ccvle wrote: Here is the full source code of that last page with the company information.

Code:
<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta name="viewport" content="width=device-width, initial-scale=1" /><title>
   Lookup Business Information
</title>
    <link rel="stylesheet" href="css/bootstrap.min.css" />
    <link rel="stylesheet" href="css/master.css" />
    <link rel="stylesheet" href="css/brd.css" />
    </head>
<body>
    <form method="post" action="./default.aspx" id="form1">
<div class="aspNetHidden">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="A+mFbjXZhb7rejjTPhoW+bnTihto4owFaLG+9LhHYQmXHuXO6F99vMKgpUd6SRYDSMYjHOFRJT/0Fq41aK4+CtfQISHa7B4NNa7a6a/AokHGH1im78mYZBWgauLqkH0gVLTtvB/ou4FEElSlaP2BXbGdAy+YoRbm4Uj5cRi1zWbSkSHsm+AZ22+S5L9DT9R5pS5TFZ/jn6qglt7/yz/uAVnFnis1wAIvR5cJPSt8q+nr9h5IeWcMtzJ1Y6JttY/h2vhO3HEHRnOZD1rHGc5q5XLuIfDaxSaMo/fM5RgI5rU/W+d231MAJ2BfBiP181rJSzUhYCwONUSRWSHc2oUd2gYrBk8oSdX+Vg9x4ygqNTxzTXxQcJOa3UVBF6tYcjzpcZnS5d59jA80fjooQ9t3Cg9FzMrZd08W+gIqkuDBd1PmPdDMe/hQXucXwhNvGylSHNi3VZlbmTwGsj3oDZnkCzvX4xLogHDMSTrguEv70ZCDt7Xn9BoI/DCnWb/PeC+adkuQU8OUfDYK2IQEvrJ5iIJfnVKPlQMSjxv7ym1geFb2qqjbt9SBuoE45aP4miNrCl2TGdGFVEfTzTkVDNwKYqcPpodzehxluZArCFCvyzJ/rfwzpHVwBBdb7WSKOBMYKamH1Zknc9nmrS/U3/wXL6vsBbkiaXkZ95xOB8/FhWR9i5LU9Xya88u2x+/zWJgi8t4timIDQwECIn89Nmpph50vCnmvofn8kFlJOhd5AcgZIqQQykkKxgpJtNwh0D4wow30uSfDOeFz8sXYVJixK6eBKpCLqKIQE5Xcaugww9hqWQU4zgfGiTiMkErd8Htu3E1UWRGdWYDc0XQmC5+7WxzzJjpzVs7LbEdbhapqW50hCKHm51mzJLxLVWV++15aJtqu9xmMngxrOdvIuI2CbPU5Fom4juI6im+cJuvFqIHecrMixs3i0ydVcweEDcTIeNHtmV2CxcTXeQmV2eWHrBuqAJfboII9rg+P6WdVszBzdgrUWpzpn4FfeeEyMqhClvqCfNkwWcGkkUbmpDvkxgmiVHuGkSTF3Tocjjjf0o5MGSxoJeHLrqkCw2am9QvbqnPuxQQf2/y7LxhYQ1nF13fHjW5JArsz7gq3xkZKgge4ufTrEesev9zMdjl6kwb5Ts+/08f1fRDykAyaEDe1WhDaQIq0TnWwQ2OQzvm+YjVBXkb8L9/hr9dYfO0YA/rk31c9MZKpcjDN/bq8lu8x8zb4JMFPLsMNXbIDoYn71dhoj4mjUjH15V6FpZ1wUI6ab7fibLZO0OZFgCT6KLlkcC14kBnMWLbyYj4e427xfCe1o6BVO8aR0i6xZ93ZmsAi731frboiA1eSQ6o1B/hnGERjEg5gPYjbtEsRwuPeXDxR1qvdyEtIzGcHiXn+EZkGBtBypj+6YKSiu9tpLkSGC6j+WdvLg+frn21vTgVGz0weX6q5ogeokzh3ctsjV+NuQ31cOzICulfqWQCymHcyx+vdpti1+NguIGr33UvT0l/fhRIMNUIzAz55rLwdLzTBrkaSFuYNTcweijpVFhHNXWUalNaz7QIKV2dX8m9UjbM95+x8yZsv/Qn06vqI39U6y5jd4rZPHkUpZOr7HZiIXtT0K/8cRNDuH3P1aSWHvOcp0Nx+1fXb2bYG2WDIz29Z89GuwQ==" />
</div>

<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['form1'];
if (!theForm) {
    theForm = document.form1;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>
</script>


<script src="/BRD/WebResource.axd?d=pynGkmcFUV13He1Qd6_TZO2qnHC0dZbS7lt4UbjWKfsH9VhZ06wn2tWzASfS6QQxfyM8gQ2&amp;t=636414544201228773" type="text/javascript"></script>

<script language=javascript>function Name() {if (document.getElementById('MainContent_SearchControl_rdoUBITRA').checked|| document.getElementById('MainContent_SearchControl_rdoRSPNumber').checked){document.getElementById('MainContent_SearchControl_txtUBITRA').value='';document.getElementById('MainContent_SearchControl_txtRSPNumber').value='';document.getElementById('MainContent_SearchControl_rdoBusinessOrOwnerName').checked=true;document.getElementById('MainContent_SearchControl_rdoRSPNumber').checked=false;document.getElementById('MainContent_SearchControl_rdoUBITRA').checked=false;}}</script><script language=javascript>function UBI() {if (document.getElementById('MainContent_SearchControl_rdoBusinessOrOwnerName').checked || document.getElementById('MainContent_SearchControl_rdoRSPNumber').checked){document.getElementById('MainContent_SearchControl_txtCriteria').value='';document.getElementById('MainContent_SearchControl_txtCity').value='';document.getElementById('MainContent_SearchControl_txtRSPNumber').value='';document.getElementById('MainContent_SearchControl_rdoUBITRA').checked=true;document.getElementById('MainContent_SearchControl_rdoRSPNumber').checked=false;document.getElementById('MainContent_SearchControl_rdoBusinessOrOwnerName').checked=false;}}</script><script language=javascript>function RSP() {if (document.getElementById('MainContent_SearchControl_rdoUBITRA').checked|| document.getElementById('MainContent_SearchControl_rdoBusinessOrOwnerName').checked){document.getElementById('MainContent_SearchControl_txtCriteria').value='';document.getElementById('MainContent_SearchControl_txtCity').value='';document.getElementById('MainContent_SearchControl_txtUBITRA').value='';document.getElementById('MainContent_SearchControl_rdoRSPNumber').checked=true;document.getElementById('MainContent_SearchControl_rdoBusinessOrOwnerName').checked=false;document.getElementById('MainContent_SearchControl_rdoUBITRA').checked=false;}}</script>
<div class="aspNetHidden">

   <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="9CD75CFB" />
   <input type="hidden" name="__PREVIOUSPAGE" id="__PREVIOUSPAGE" value="QUTbpk4IajPLmbA5OX51k72AdQNIY8ohBvx8SwVaI-jjOotySRGSvlAOYDbQUmXIRJn85YuOEuOwpdpNjz0IuPBxfxE1" />
   <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="wmQAg3Pn1rty2Njfo3myeM5Iaapd564s3bB7UFh8miKaDj17AllvaP2+N1Y7u2DU7SHg0J5iqAXQYqCCr2t7nl5PCeV7/LgxwE+OWZd7Cp88ZiVy" />
</div>
        <div id="pageWrap" class="container-fluid">

            <div class="page-header">
                <a href="http://dor.wa.gov/">
                    <img src="Images/DOR_Logo_2.png" /></a>
            </div>

            <div class="navbar navbar-default">
                <div class="navbar-header">
                    <button type="button" class="pull-left navbar-toggle" data-toggle="collapse" data-target=".navbar-collapse">
                        <span class="icon-bar"></span>
                        <span class="icon-bar"></span>
                        <span class="icon-bar"></span>
                    </button>
                </div>
                <div class="navbar-collapse collapse">
                    <ul class="nav navbar-nav">
                        <li><a href="http://dor.wa.gov/">DOR Home</a></li>
                        <li><a href="Default.aspx">Business Lookup</a></li>
                        <li><a href="SearchTips.aspx">Search Tips</a></li>
                    </ul>
                </div>
            </div>

            <div class="background">
                <div id="backgroundImage"></div>
                <div id="contentWrap" class="row body-content">
                    <div id="content" class="col-xs-12">
                       
   
   


<div id="MainContent_SearchControl_pnlInfo">
   
   
    <div class="row">
        <div class="col-xs-12 text-center">
           
            <p>
                <a id="MainContent_SearchControl_lnkListResults" class="btn btn-primary btn-main" href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$MainContent$SearchControl$lnkListResults&quot;, &quot;&quot;, false, &quot;&quot;, &quot;/BRD/default.aspx#brdResults&quot;, false, true))">Back to search results</a>
            </p>
        </div>
    </div>
   
    <div class="row">
        <div class="col-xs-12 text-center">
            <img style="border:black 1px solid;" alt="Business registration information" src="Utilities/BrdImagePage.aspx?tra=lXxIVwXs13m4fgI9kywUnc9DYN6W1D9J&rsp=" />
            <p>
                <span id="MainContent_SearchControl_lblNameCityResellerMsg" style="display:inline-block;width:60%;">If you are unable to find the reseller permit you are looking for, try searching by tax registration/UBI number.</span>
            </p>
        </div>
    </div>

</div>



                    </div>
                </div>
            </div>

            <footer class="footer text-center">
                <span class="text-muted">Working together to fund Washington's future</span>
            </footer>
        </div>

    </form>
    <script src="Scripts/jquery-1.12.3.min.js"></script>
    <script src="Scripts/bootstrap.min.js"></script>
    <script src="Scripts/master.js"></script>
   
    <script src="Scripts/brd.js"></script>

</body>
</html>

Ouf-ouf, try to use the Forum [CODE] Tags when posting a Script or a Page Source (like I do in my Quotes), that makes the Thread easier to read...

And hum, you didn't use any Relative Positioning on the 'Results' Field in your Script, like I had suggested in the Thread(s) I had referred you to...
The '!ERRORIGNORE' handles the Case where there are no Results, and you've managed to identify the Link without needing the URL, very good... But you don't handle the Case where there is more than 1 Result, unless you are "happy" with following the first Link in that Case...

ccvle wrote: I guess my question is:
1) Is the company profile information truly in GIF format? If so, is the GIF picture generated from a master copy that is saved on the https://ccfs.sos.wa.gov website (not sure if this question makes sense)?
Or is the GIF generated by first "pulling the company information" (e.g., account opened, UBI ID, entity type, mailing address, etc.) and then saving the completed profile as a GIF?

Hum, I don't really know. I don't use that Functionality in any of my Scripts, so I never really experimented with it. What I understand is that iMacros itself takes a "Screenshot" of the Page, like many other Add-ons are able to do as well...
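
One way to settle the GIF/JPEG/PNG question for any file you have saved is to look at its first bytes instead of trusting the extension or the browser dialog. Below is a minimal Python sketch of that check, run outside iMacros. The helper names `sniff_image_format` and `sniff_file` are made up for this example; only the three leading byte signatures are standard.

```python
# Leading "magic" bytes of the three formats discussed in this thread.
SIGNATURES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpeg": (b"\xff\xd8\xff",),
    "gif": (b"GIF87a", b"GIF89a"),
}


def sniff_image_format(data: bytes) -> str:
    """Return 'png', 'jpeg', 'gif' or 'unknown' based on the leading bytes."""
    for name, prefixes in SIGNATURES.items():
        # bytes.startswith accepts a tuple of candidate prefixes
        if data.startswith(prefixes):
            return name
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 8 bytes of a file and classify it (hypothetical helper)."""
    with open(path, "rb") as fh:
        return sniff_image_format(fh.read(8))
```

Calling `sniff_file` on a file saved from the site and on a file produced by 'SAVEAS TYPE=PNG' should show whether the two really differ in format, whatever their extensions say.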
- (F)CIM = (Full) Config Info Missing: iMacros + Browser + OS with all 3 Versions...
- I usually don't even read the Question if that (required) Info is not mentioned...
- Script & URL usually help a lot for a more "educated" Help...
chivracq
 
Posts: 6957
Joined: Sat Apr 13, 2013 6:07 am
Location: Amsterdam (NL)

Re: Search list > Extract List > Download page

by chivracq on Sun Feb 04, 2018 1:15 pm

Oh, but OK, I see what you mean... The "Data" on the Site itself is already displayed as a '.JPEG' Image, probably from a Scan of their Archives, and not as HTML with Text. Then forget what I mentioned about extracting the Data Field by Field, that cannot be done on Images, at least not directly from iMacros. I guess you would need some extra Tool with OCR Functionality...

Then I guess the 'SAVEAS TYPE=PNG' from iMacros will take a Screenshot of the Page in the Image Format you have selected, and that Process takes place locally on your own PC, I would think...

As an Alternative to the 'SAVEAS' Command you are using, you could use the 'SAVEITEM' Command/Mechanism, which saves only the "original" '.JPEG' Image from the Site, taken from your Browser Cache which will already have downloaded it. But it will be in '.JPG' Format and not '.PNG' like you apparently want...
You'll need to test both Methods to check which one is "quicker" if Performance plays a role for you, and maybe compare as well the Size (and Resolution) of the Image between the 2 Image Formats, and the Content between a Screenshot of the Full Page and only the Image from the Site... I never used that Functionality, so I don't know...

Re: Search list > Extract List > Download page

by chivracq on Sun Feb 04, 2018 1:25 pm

ccvle wrote:
chivracq wrote:But hum..., are you sure you want to save all that Data as Screenshots...? Maybe easy to print, but you won't be able to search the Content of those Records or to do any further Data Processing... You could better extract all Fields one by one and store them in a '.CSV' File I would think...


I don't want to save all that data as screenshots. What I'm saying is that the government website is GIVING me a GIF picture of the company profile. I don't see any data to extract. All the data that I see is in a GIF picture.

For example, this is the company profile page that the government website returned. I didn't take a screenshot of this. This is what the website returned when I clicked on the result link. I could be wrong, but I just don't see any extractable data in there. It looks just like a picture.

http://apps.dor.wa.gov/BRD/Utilities/Br ... W1D9J&rsp=

Yep, I realized that..., see my previous Reply, I posted it just a few seconds before you, ah-ah...!

Re: Search list > Extract List > Download page

by ccvle on Sun Feb 04, 2018 7:21 pm

I think I figured out the relative position tag. I'm able to get it to work to do the following:
1) If the text "1 to 1 results" appears, then click on the result link and take a screenshot, or
2) If there are no results or more than one result, then the text "1 to 1 results" will not appear, !ERRORIGNORE will ignore the error, and the next line will take a screenshot.




SET !ERRORIGNORE YES
TAB T=1
URL GOTO=http://apps.dor.wa.gov/BRD/default.aspx
TAG POS=1 TYPE=INPUT:TEXT FORM=ID:form1 ATTR=ID:MainContent_SearchControl_txtCriteria CONTENT="XXXX LLC"
TAG POS=1 TYPE=A ATTR=ID:MainContent_SearchControl_lnkSearch
TAG POS=1 TYPE=H2 ATTR=TXT:1<SP>to<SP>1<SP>of<SP>1<SP>Results
TAG POS=R1 TYPE=INPUT:IMAGE ATTR=NAME:ctl00$MainContent$SearchControl$dgListResults$ctl03$TextImage1
SAVEAS TYPE=PNG FOLDER=C:\Users\Desktop\Screenshots\ FILE="XXX LLC"
URL GOTO=http://apps.dor.wa.gov/BRD/default.aspx
TAG POS=1 TYPE=INPUT:TEXT FORM=ID:form1 ATTR=ID:MainContent_SearchControl_txtCriteria CONTENT="XXX LLC"
TAG POS=1 TYPE=A ATTR=ID:MainContent_SearchControl_lnkSearch
TAG POS=1 TYPE=H2 ATTR=TXT:1<SP>to<SP>1<SP>of<SP>1<SP>Results
TAG POS=R1 TYPE=INPUT:IMAGE ATTR=NAME:ctl00$MainContent$SearchControl$dgListResults$ctl03$TextImage1
SAVEAS TYPE=PNG FOLDER=C:\Users\Desktop\Screenshots\ FILE="XXX LLC"
ccvle
 
Posts: 13
Joined: Fri Feb 02, 2018 9:45 am

Re: Search list > Extract List > Download page

by chivracq on Mon Feb 05, 2018 10:13 am

ccvle wrote: I think I figured out the relative position tag. I'm able to get it to work to do the following:
-1) If the text "1 to 1 results" appears, then click on the result link and take a screenshot, or
-2) If there are no results or more than one result, then the text "1 to 1 results" will not appear, !ERRORIGNORE will ignore the error, and the next line will take a screenshot.

Code:
SET !ERRORIGNORE YES
TAB T=1

URL GOTO=http://apps.dor.wa.gov/BRD/default.aspx
TAG POS=1 TYPE=INPUT:TEXT FORM=ID:form1 ATTR=ID:MainContent_SearchControl_txtCriteria CONTENT="XXXX LLC"
TAG POS=1 TYPE=A ATTR=ID:MainContent_SearchControl_lnkSearch
TAG POS=1 TYPE=H2 ATTR=TXT:1<SP>to<SP>1<SP>of<SP>1<SP>Results
TAG POS=R1 TYPE=INPUT:IMAGE ATTR=NAME:ctl00$MainContent$SearchControl$dgListResults$ctl03$TextImage1
SAVEAS TYPE=PNG FOLDER=C:\Users\Desktop\Screenshots\ FILE="XXX LLC"

URL GOTO=http://apps.dor.wa.gov/BRD/default.aspx
TAG POS=1 TYPE=INPUT:TEXT FORM=ID:form1 ATTR=ID:MainContent_SearchControl_txtCriteria CONTENT="XXX LLC"
TAG POS=1 TYPE=A ATTR=ID:MainContent_SearchControl_lnkSearch
TAG POS=1 TYPE=H2 ATTR=TXT:1<SP>to<SP>1<SP>of<SP>1<SP>Results
TAG POS=R1 TYPE=INPUT:IMAGE ATTR=NAME:ctl00$MainContent$SearchControl$dgListResults$ctl03$TextImage1
SAVEAS TYPE=PNG FOLDER=C:\Users\Desktop\Screenshots\ FILE="XXX LLC"

Yep indeed, that's what I meant...

Not sure btw why you kind of "double" your Code, first with "XXXX LLC" then with "XXX LLC", but save both Screenshots with the same "XXX LLC" File Name...
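
To keep the searched Name and the Screenshot File Name in sync, the repeated Blocks could be generated from one list of Names instead of being copied by hand. The following is only a sketch in Python, assuming you build the '.iim' text outside iMacros and paste or save it afterwards. `build_macro`, `BLOCK` and `FOLDER` are names invented for this example, and the macro lines are copied from the Script quoted above.

```python
# Folder taken from the script quoted above.
FOLDER = "C:\\Users\\Desktop\\Screenshots\\"

# One search-and-screenshot block; {name} is used for both CONTENT and FILE.
BLOCK = """URL GOTO=http://apps.dor.wa.gov/BRD/default.aspx
TAG POS=1 TYPE=INPUT:TEXT FORM=ID:form1 ATTR=ID:MainContent_SearchControl_txtCriteria CONTENT="{name}"
TAG POS=1 TYPE=A ATTR=ID:MainContent_SearchControl_lnkSearch
TAG POS=1 TYPE=H2 ATTR=TXT:1<SP>to<SP>1<SP>of<SP>1<SP>Results
TAG POS=R1 TYPE=INPUT:IMAGE ATTR=NAME:ctl00$MainContent$SearchControl$dgListResults$ctl03$TextImage1
SAVEAS TYPE=PNG FOLDER={folder} FILE="{name}"
"""


def build_macro(names):
    """Return macro text with one block per company name (hypothetical helper)."""
    header = "SET !ERRORIGNORE YES\nTAB T=1\n"
    blocks = [BLOCK.format(name=n, folder=FOLDER) for n in names]
    return header + "\n".join(blocks)


if __name__ == "__main__":
    print(build_macro(["XXXX LLC", "XXX LLC"]))
```

Because each Name is substituted in both places, a mismatch like "XXXX LLC" searched but "XXX LLC" saved cannot happen. Names containing double quotes are not handled in this sketch.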

I could understand that you might want to make the Screenshot "Conditional", taken only if there is exactly 1 Result, instead of "collecting" numerous similar Screenshots of the Main Page with the Search Results, unless you intend to check them manually afterwards and delete them one by one...

But if you want to get rid of them, that can be done as explained in the "Workarounds for Conditional Logic..." Thread I already referred you to: first extract that "1 to 1 results" Field to check if it's there, then use 'EVAL()' to compute dynamically the Name of your File for the 'SAVEAS', with the "real" File Name if you want the Screenshot, or some "Dummy" Name or an empty String otherwise. I would think the 'SAVEAS' with an empty File Name will do nothing.

I'm not sure about the Behaviour of 'SAVEAS' when reusing the same File Name, whether the File is automatically replaced each time or you get some "Dummy (1).png" / "Dummy (2).png" etc. There is some mention of Changes to this Functionality between different iMB Versions in the Release Notes for iMB..., [checking]... => Yep, for iMB v12.0 indeed...

Or instead of the 'FILE' Name, you can apply the same Method to the 'FOLDER' Path/Name and save all "faulty" Screenshots to a separate Folder, which you can then empty manually from time to time...

The 'SAVEITEM' Method I mentioned earlier might be a bit more "straightforward", as it might be easier to compute a "1"/"0" for the 'TAG POS=n' of the 'SAVEITEM', since 'TAG POS=0' won't do anything...
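
The 'FILE' Name Variant of the conditional 'SAVEAS' described above could look roughly like the Block generated by this Python sketch. It is untested against a live iMacros install. `conditional_block` and the "Dummy" Name are assumptions made for this example, and you would need to check how your iMacros Version handles a 'FILE' Value with Spaces coming from a Variable. The sketch relies on '{{!EXTRACT}}' not containing "Results" when the Heading is not found.

```python
def conditional_block(name: str, dummy: str = "Dummy") -> str:
    """Build macro lines that save under `name` only if the heading was extracted."""
    # JavaScript for EVAL(): the value of the last expression becomes !VAR1.
    js = (
        'var s="{{!EXTRACT}}"; '
        '(s.indexOf("Results") >= 0) ? "' + name + '" : "' + dummy + '";'
    )
    # Inside EVAL("...") every inner double quote has to be escaped as \"
    escaped = js.replace('"', '\\"')
    lines = [
        "SET !EXTRACT NULL",
        # Extract the heading instead of only locating it
        "TAG POS=1 TYPE=H2 ATTR=TXT:1<SP>to<SP>1<SP>of<SP>1<SP>Results EXTRACT=TXT",
        'SET !VAR1 EVAL("' + escaped + '")',
        "TAG POS=R1 TYPE=INPUT:IMAGE ATTR=NAME:ctl00$MainContent$SearchControl$dgListResults$ctl03$TextImage1",
        "SAVEAS TYPE=PNG FOLDER=C:\\Users\\Desktop\\Screenshots\\ FILE={{!VAR1}}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(conditional_block("XXX LLC"))
```

All "faulty" Screenshots then end up under the one "Dummy" Name, which is easy to delete afterwards. Whether they overwrite each other or get numbered depends on the 'SAVEAS' Behaviour mentioned above.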
