How to limit Apify web crawler scope to first three list pages?
I have written the following web scraper in Apify (jQuery), but I am struggling to limit it to looking only at certain list pages.
The crawler scrapes articles I have published at https://www.beet.tv/author/randrews, an author page that is paginated across 102 index pages, each containing 20 article links. The crawler works fine when executed manually and in full; it gets everything, 2,000+ articles.
However, I wish to use Apify's scheduler to trigger an occasional crawl that only scrapes articles from the first three of those index (LIST) pages (i.e. 60 articles).
The scheduler uses cron and allows settings to be passed via input JSON. As advised, I am using "customData"...
{
"customData": 3
}
... and then the snippet below to take that value and use it as a limit...
var maxListDepth = parseInt(context.customData); // Jakub's suggestion, Nov 20 2018
if(!maxListDepth || (maxListDepth && pageNumber <= maxListDepth)) {
context.enqueuePage({
This should allow the script to limit the scope when executed via the scheduler, but to carry on as normal and get everything in full when executed manually.
However, whilst the scheduler successfully fires the crawler, the crawler still runs right through the whole set again; it doesn't stop at /page/3.
How can I ensure I only get the first three pages up to /page/3?
Have I malformed something?
In the code below, you can see my previous version of the above addition, now commented out.
Those LIST pages should only be...
- The STARTing one, with an implied "/page/1" URL (https://www.beet.tv/author/randrews)
- https://www.beet.tv/author/randrews/page/2
- https://www.beet.tv/author/randrews/page/3
... and not the likes of /page/101 or /page/102, which may surface.
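For what it's worth, a more defensive version of that check (a sketch only, not what currently runs; it assumes the same jQuery $ and context as the script below, and that customData carries the page cap) would read the page number from each link's href rather than its text, so that non-numeric pagination links can never slip through:

var maxListDepth = parseInt(context.customData, 10); // NaN when no customData is supplied
$('a.page-numbers').each(function () {
    var href = $(this).attr('href');
    // Pull the page number out of URLs like .../author/randrews/page/2
    var match = href ? href.match(/\/page\/(\d+)/) : null;
    var pageNumber = match ? parseInt(match[1], 10) : NaN;
    // Enqueue when no cap was supplied, or when this LIST page is within the cap
    if (isNaN(maxListDepth) || (!isNaN(pageNumber) && pageNumber <= maxListDepth)) {
        context.enqueuePage({ url: href, label: 'LIST' });
    }
});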
Here are the key terms...
START https://www.beet.tv/author/randrews
LIST https://www.beet.tv/author/randrews/page/[\d+]
DETAIL https://www.beet.tv/*
Clickable elements a.page-numbers
And here is the crawler script...
function pageFunction(context) {
    // Called on every page the crawler visits; use it to extract data from it
    var $ = context.jQuery;
    var result;

    // If page is START or a LIST,
    if (context.request.label === 'START' || context.request.label === 'LIST') {
        context.skipOutput();

        // First, gather LIST pages
        $('a.page-numbers').each(function() {
            // lines added to accept number of pages via customData in Scheduler...
            var pageNumber = parseInt($(this).text());
            // var maxListDepth = context.customData;
            var maxListDepth = parseInt(context.customData); // Jakub's suggestion, Nov 20 2018
            if (!maxListDepth || (maxListDepth && pageNumber <= maxListDepth)) {
                context.enqueuePage({
                    url: /*window.location.origin +*/ $(this).attr('href'),
                    label: 'LIST'
                });
            }
        });

        // Then, gather every DETAIL page
        $('h3>a').each(function() {
            context.enqueuePage({
                url: /*window.location.origin +*/ $(this).attr('href'),
                label: 'DETAIL'
            });
        });

    // If page is actually a DETAIL target page
    } else if (context.request.label === 'DETAIL') {
        /* context.skipLinks(); */
        var categories = [];
        $('span.cat-links a').each(function() {
            categories.push($(this).text());
        });
        var tags = [];
        $('span.tags-links a').each(function() {
            tags.push($(this).text());
        });
        result = {
            "title": $('h1').text(),
            "entry": $('div.entry-content').html().trim(),
            "datestamp": $('time').attr('datetime'),
            "photo": $('meta[name="twitter:image"]').attr("content"),
            categories: categories,
            tags: tags
        };
    }
    return result;
}
javascript jquery web-crawler apify
asked Nov 23 '18 at 17:05 by Robert Andrews
1 Answer
There are two options in advanced settings which can help: Max pages per crawl and Max result records. In your case, I would set Max result records to 60, and the crawler then stops after outputting 60 pages (from the first 3 lists).
answered Nov 23 '18 at 21:14 by Jakub Balada
Hi. Can I leave this switched off as standard but pass something only for the scheduled crawl (i.e. manual execution gets everything, but scheduled execution gets only the 60)? If I can pass {maxCrawledPages: Number} or {maxOutputPages: Number} via Input JSON in the Scheduler, a) can I delete my current maxListDepth code, and b) do I need some code to handle that also in the crawler code?
– Robert Andrews, Nov 23 '18 at 21:58
Yes, when starting the crawler via the scheduler (or API) you can override any crawler setting. So you can use something like { "maxCrawledPages": 60, "maxOutputPages": 60 }. And yes, you can delete your maxListDepth code and you don't need to handle it in a pageFunction.
– Jakub Balada, Nov 24 '18 at 0:19
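For reference, the scheduler input JSON for the capped run would then look something like the block below. This is only a sketch assembled from the comment above; whether further crawler settings can be overridden in the same way depends on the legacy crawler's settings schema.

{
    "maxCrawledPages": 60,
    "maxOutputPages": 60
}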
I think the solution in my case was to remove "Clickable elements" from the GUI. You had previously advised this, but it seems I let it creep back in. However, the Max result records route also seems like a better solution than the LIST pages route. That works, too, and allows me to remove some code. Thanks!
– Robert Andrews, Nov 24 '18 at 8:09