How to limit Apify web crawler scope to first three list pages?

I have written the following web scraper in Apify (jQuery), but I am struggling to limit it to only look at certain list pages.



The crawler scrapes articles I have published at https://www.beet.tv/author/randrews, an author page which paginates into 102 index pages, each containing 20 article links. The crawler works fine when executed manually and in full; it gets everything, 2,000+ articles.



However, I wish to use Apify's scheduler to trigger an occasional crawl that only scrapes articles from the first three of those index (LIST) pages (i.e. 60 articles).



The scheduler uses cron and allows settings to be passed via input JSON. As advised, I am using "customData"...



{
    "customData": 3
}


... and then the snippet below to read that value and use it as a limit...



var maxListDepth = parseInt(context.customData); // Jakub's suggestion, Nov 20 2018
if (!maxListDepth || (maxListDepth && pageNumber <= maxListDepth)) {
    context.enqueuePage({


This should allow the script to limit the scope when executed via the scheduler, but to carry on as normal and get everything in full when executed manually.



However, whilst the scheduler successfully fires the crawler, the crawler still runs right through the whole set again; it doesn't stop at /page/3.



How can I ensure I only get the first three pages up to /page/3?



Have I malformed something?



In the code below you can see, now commented out, my previous version of this addition.





Those LIST pages should only be...




  1. The STARTing one, with an implied "/page/1" URL (https://www.beet.tv/author/randrews)

  2. https://www.beet.tv/author/randrews/page/2

  3. https://www.beet.tv/author/randrews/page/3


... and not the likes of /page/101 or /page/102, which may surface.





Here are the key terms...



START:              https://www.beet.tv/author/randrews
LIST:               https://www.beet.tv/author/randrews/page/[d+]
DETAIL:             https://www.beet.tv/*
Clickable elements: a.page-numbers
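
(Side note: the LIST pattern above reads "[d+]", which as a regex character class would only match a literal "d" or "+". Assuming a backslash was lost when posting, the pseudo-URL was presumably entered along these lines; this is an assumption, not a quote from the original configuration.)

    https://www.beet.tv/author/randrews/page/[\d+]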


And here is the crawler script...



function pageFunction(context) {

    // Called on every page the crawler visits, use it to extract data from it
    var $ = context.jQuery;

    // Holds the scraped record for DETAIL pages; stays undefined for START/LIST pages
    var result;

    // If page is START or a LIST,
    if (context.request.label === 'START' || context.request.label === 'LIST') {

        context.skipOutput();

        // First, gather LIST pages
        $('a.page-numbers').each(function() {
            // lines added to accept number of pages via customData in Scheduler...
            var pageNumber = parseInt($(this).text());
            // var maxListDepth = context.customData;
            var maxListDepth = parseInt(context.customData); // Jakub's suggestion, Nov 20 2018
            if (!maxListDepth || (maxListDepth && pageNumber <= maxListDepth)) {
                context.enqueuePage({
                    url: /*window.location.origin +*/ $(this).attr('href'),
                    label: 'LIST'
                });
            }
        });

        // Then, gather every DETAIL page
        $('h3>a').each(function() {
            context.enqueuePage({
                url: /*window.location.origin +*/ $(this).attr('href'),
                label: 'DETAIL'
            });
        });

    // If page is actually a DETAIL target page
    } else if (context.request.label === 'DETAIL') {

        /* context.skipLinks(); */

        var categories = [];
        $('span.cat-links a').each(function() {
            categories.push($(this).text());
        });
        var tags = [];
        $('span.tags-links a').each(function() {
            tags.push($(this).text());
        });

        result = {
            "title": $('h1').text(),
            "entry": $('div.entry-content').html().trim(),
            "datestamp": $('time').attr('datetime'),
            "photo": $('meta[name="twitter:image"]').attr("content"),
            categories: categories,
            tags: tags
        };

    }
    return result;
}
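
As an aside (and consistent with the resolution mentioned in the comments below), links matched by the "Clickable elements" setting (here a.page-numbers) are followed by the crawler itself, outside pageFunction, so a guard inside the page function alone cannot cap those. For reference, here is a minimal sketch of the same LIST guard with the page number read from the link's href rather than its visible text (an assumption on my part, since pagination widgets often contain non-numeric labels such as "Next"); it is a sketch, not the original script:

var maxListDepth = parseInt(context.customData, 10); // e.g. 3 when passed from the scheduler

$('a.page-numbers').each(function() {
    var href = $(this).attr('href');
    var match = href && href.match(/\/page\/(\d+)/);     // pull the page number out of the URL
    var pageNumber = match ? parseInt(match[1], 10) : 1; // the bare author URL counts as page 1

    // Enqueue only when no limit was passed (manual run) or the page is within the limit
    if (!maxListDepth || pageNumber <= maxListDepth) {
        context.enqueuePage({ url: href, label: 'LIST' });
    }
});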
Tags: javascript, jquery, web-crawler, apify

asked Nov 23 '18 at 17:05 by Robert Andrews


1 Answer


There are two options in the crawler's advanced settings which can help: "Max pages per crawl" and "Max result records". In your case, I would set "Max result records" to 60, and the crawler will then stop after outputting 60 pages (the articles from the first three LIST pages).

answered Nov 23 '18 at 21:14 by Jakub Balada






Comments:

Hi. Can I leave this switched off as standard but pass something only for the scheduled crawl (i.e. manual execution gets everything, but scheduled execution only the first 60)? If I can pass {maxCrawledPages: Number} or {maxOutputPages: Number} via input JSON in the Scheduler, a) can I delete my current maxListDepth code, and b) do I need some code to handle that also in the crawler code?
– Robert Andrews, Nov 23 '18 at 21:58

Yes, when starting the crawler via the scheduler (or API) you can override any crawler setting, so you can use something like { "maxCrawledPages": 60, "maxOutputPages": 60 }. And yes, you can delete your maxListDepth code; you don't need to handle it in the pageFunction.
– Jakub Balada, Nov 24 '18 at 0:19

I think the solution in my case was to remove "Clickable elements" from the GUI. You have previously advised this, but it seems I let it creep back in. However, the Max result records route also seems like a better solution than the LIST pages route. That works, too, and allows me to remove some code. Thanks!
– Robert Andrews, Nov 24 '18 at 8:09
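For completeness, a scheduled-run input along the lines Jakub describes might look roughly like this (a sketch using the setting names from his comment; the exact keys depend on the crawler version, and customData is no longer needed once these limits are used):

{
    "maxCrawledPages": 60,
    "maxOutputPages": 60
}

A manual run would simply omit these overrides and crawl the full set.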