How to yield several requests in order in Scrapy?

I need to send my requests in order with Scrapy.

def n1(self, response):
    # self.input = [elem1, elem2, elem3, elem4, elem5, ...., elem100000]
    for (elem,) in self.input:
        link = urljoin(path, elem)
        yield Request(link)

My problem is that the requests are not sent in order.
I read this question, but it has no correct answer.

How should I change my code for sending the requests in order?

UPDATE 1

I used priority and changed my code to

def n1(self, response):
    # self.input = [elem1, elem2, elem3, elem4, elem5, ...., elem100000]
    self.prio = len(self.input)
    for (elem,) in self.input:
        self.prio -= 1
        link = urljoin(path, elem)
        yield Request(link, priority=self.prio)

And my settings for this spider are

custom_settings = {
    'DOWNLOAD_DELAY': 0,
    'COOKIES_ENABLED': True,
    'CONCURRENT_REQUESTS': 1,
    'AUTOTHROTTLE_ENABLED': False,
}

Now the order has changed, but it still doesn't follow the order of the elements in the array.

edited Nov 19 at 20:56

asked Nov 16 at 10:50

parik

32162048

Try to check priority in Request: doc.scrapy.org/en/latest/topics/request-response.html
– vezunchik
Nov 16 at 11:01

@ElenaSh. Thanks, I tried your suggestion, and updated my question
– parik
Nov 16 at 11:27

scrapy python-requests yield

3 Answers

accepted

Use a return statement instead of yield.

You don't even need to change any settings:

from scrapy import Request, Spider


class MySpider(Spider):
    name = 'toscrape.com'
    start_urls = ['http://books.toscrape.com/catalogue/page-1.html']

    # A generator expression: it remembers its position between calls.
    urls = (
        'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
    )

    def parse(self, response):
        # Returns on the first iteration, so each response triggers
        # exactly one new request: the next URL from the generator.
        for url in self.urls:
            return Request(url)

Output:

2018-11-20 03:35:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-1.html> (referer: http://books.toscrape.com/catalogue/page-1.html)

2018-11-20 03:35:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-2.html> (referer: http://books.toscrape.com/catalogue/page-1.html)

2018-11-20 03:35:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-3.html> (referer: http://books.toscrape.com/catalogue/page-2.html)

2018-11-20 03:35:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-4.html> (referer: http://books.toscrape.com/catalogue/page-3.html)

2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-5.html> (referer: http://books.toscrape.com/catalogue/page-4.html)

2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-6.html> (referer: http://books.toscrape.com/catalogue/page-5.html)

2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-7.html> (referer: http://books.toscrape.com/catalogue/page-6.html)

2018-11-20 03:35:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-8.html> (referer: http://books.toscrape.com/catalogue/page-7.html)

2018-11-20 03:35:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-9.html> (referer: http://books.toscrape.com/catalogue/page-8.html)

2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-10.html> (referer: http://books.toscrape.com/catalogue/page-9.html)

2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-11.html> (referer: http://books.toscrape.com/catalogue/page-10.html)

2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-12.html> (referer: http://books.toscrape.com/catalogue/page-11.html)

2018-11-20 03:35:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-13.html> (referer: http://books.toscrape.com/catalogue/page-12.html)

2018-11-20 03:35:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-14.html> (referer: http://books.toscrape.com/catalogue/page-13.html)

2018-11-20 03:35:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-15.html> (referer: http://books.toscrape.com/catalogue/page-14.html)

2018-11-20 03:35:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-16.html> (referer: http://books.toscrape.com/catalogue/page-15.html)

2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-17.html> (referer: http://books.toscrape.com/catalogue/page-16.html)

2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-18.html> (referer: http://books.toscrape.com/catalogue/page-17.html)

2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-19.html> (referer: http://books.toscrape.com/catalogue/page-18.html)

2018-11-20 03:35:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-20.html> (referer: http://books.toscrape.com/catalogue/page-19.html)

2018-11-20 03:35:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-21.html> (referer: http://books.toscrape.com/catalogue/page-20.html)

2018-11-20 03:35:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-22.html> (referer: http://books.toscrape.com/catalogue/page-21.html)

2018-11-20 03:35:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-23.html> (referer: http://books.toscrape.com/catalogue/page-22.html)

2018-11-20 03:35:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-24.html> (referer: http://books.toscrape.com/catalogue/page-23.html)

2018-11-20 03:35:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-25.html> (referer: http://books.toscrape.com/catalogue/page-24.html)

With a yield statement, the engine consumes the generator and hands every request to the scheduler, which by default keeps pending requests in a LIFO queue and may download several of them concurrently, so the download order need not match the order in which the requests were yielded.
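
To see why the return version crawls one page at a time, it may help to look at the same pattern in plain Python, without Scrapy. The sketch below is only an illustration: next_item and items are hypothetical names standing in for parse and urls. Because a generator keeps its position between calls, a for loop that returns on its first iteration hands out one item per call, and returns None once the generator is exhausted.

```python
# Plain-Python illustration (no Scrapy needed); all names are hypothetical.
items = ('page-{}'.format(i + 1) for i in range(3))  # a generator, like `urls`


def next_item():
    """Mimics `parse`: return on the first loop iteration."""
    for item in items:
        return item  # leaves the loop; the generator keeps its position
    return None      # generator exhausted: nothing left to request


# Four calls against a three-item generator: three items, then None.
results = [next_item() for _ in range(4)]
print(results)
```

In the spider, this means each response produces exactly one new request, so the next page is requested only after the previous one has been downloaded. It only works because urls is a generator (or another iterator); with a list, the loop would restart from the first element on every call.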

answered Nov 20 at 16:56

Guillaume

1,0951724

I think concurrent requests are at play here. You can try setting

custom_settings = {
    'CONCURRENT_REQUESTS': 1
}

The default for CONCURRENT_REQUESTS is 16 (CONCURRENT_REQUESTS_PER_DOMAIN defaults to 8). That would explain why priority is not honoured: other download slots are free to pick up work.
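
Concurrency is probably not the only factor. By default the Scrapy scheduler stores pending requests in LIFO queues, so requests with equal priority tend to leave the queue in the reverse of the order in which they were added. One combination that might be worth trying, sketched below, is a single concurrent request plus the FIFO queue classes that the Scrapy FAQ lists for breadth-first crawling. Treat it as a starting point under those assumptions rather than a guarantee of strict ordering: retries and redirects can still reorder requests.

```python
# Sketch of spider-level settings; assumes the stock Scrapy scheduler.
custom_settings = {
    # Download one request at a time.
    'CONCURRENT_REQUESTS': 1,
    # Pop pending requests first-in, first-out instead of the default LIFO.
    'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
}
```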

answered Nov 17 at 7:18

Biswanath

5,136103756

I have 'CONCURRENT_REQUESTS' : 1 ,'AUTOTHROTTLE_ENABLED' : False, in my settings
– parik
Nov 17 at 17:54

You can send the next request only after receiving the response to the previous one:

from scrapy import Request, Spider


class MainSpider(Spider):
    name = 'main'  # placeholder name; Scrapy requires one
    urls = [
        'https://www.url1...',
        'https://www.url2...',
        'https://www.url3...',
    ]

    def start_requests(self):
        # Only the first URL is requested up front.
        yield Request(
            url=self.urls[0],
            callback=self.parse,
            meta={'next_index': 1},
        )

    def parse(self, response):
        next_index = response.meta['next_index']

        # do something with response...

        # Request the next URL, if any, now that this response has arrived
        if next_index < len(self.urls):
            yield Request(
                url=self.urls[next_index],
                callback=self.parse,
                meta={'next_index': next_index + 1},
            )

answered Nov 20 at 1:07

nicolas

1,436813


2018-11-20 03:35:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-23.html> (referer: http://books.toscrape.com/catalogue/page-22.html)

2018-11-20 03:35:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-24.html> (referer: http://books.toscrape.com/catalogue/page-23.html)

2018-11-20 03:35:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-25.html> (referer: http://books.toscrape.com/catalogue/page-24.html)

With a yield statement, the engine pulls all the requests from the generator and schedules them, and they are then downloaded in an order that need not match the order in which they were yielded (I suspect they might be stored in some sort of set to remove duplicates).
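One way to picture the reordering: the Scrapy FAQ states that pending requests are stored in a LIFO queue by default, so a batch of requests yielded together tends to come back out newest first. The following is a toy model in plain Python, not Scrapy's actual scheduler; `make_requests`, `drain` and the `lifo` flag are hypothetical names used only for illustration.

```python
from collections import deque


def make_requests(n):
    """Yield page names in the order a spider callback would yield them."""
    for i in range(1, n + 1):
        yield 'page-{}.html'.format(i)


def drain(requests, lifo=True):
    """Toy scheduler: enqueue every yielded request, then hand them out.

    lifo=True pops from the end that was pushed to (a stack);
    lifo=False pops from the opposite end (a queue).
    """
    pending = deque(requests)  # the whole generator is consumed up front
    order = []
    while pending:
        order.append(pending.pop() if lifo else pending.popleft())
    return order


lifo_order = drain(make_requests(5), lifo=True)
fifo_order = drain(make_requests(5), lifo=False)
print(lifo_order)
print(fifo_order)
```

In a real crawl, downloads interleave with scheduling, so the observed order is usually messier than a clean reversal; the sketch only shows why "yielded first" does not imply "downloaded first".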

answered Nov 20 at 16:56

Guillaume

1,0951724


up vote
0
down vote

I think concurrent requests are at play here. You can try setting:

custom_settings = {
    'CONCURRENT_REQUESTS': 1
}

The default for CONCURRENT_REQUESTS is 16 (CONCURRENT_REQUESTS_PER_DOMAIN defaults to 8). That would explain why priority is not honoured when other download slots are free to pick up work.
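If the aim is for requests to leave the scheduler in the order they were yielded, the Scrapy FAQ describes switching the scheduler queues from the default LIFO ones to FIFO ones. A sketch combining that with a single concurrent request is below. `ORDERED_SETTINGS` is a hypothetical name, and whether this gives strict ordering in a given project is an assumption to verify, since retries and redirects can still reorder requests.

```python
# Hypothetical settings dict; assign it to `custom_settings` on the spider class.
ORDERED_SETTINGS = {
    # Only one request in flight at a time.
    'CONCURRENT_REQUESTS': 1,
    # FIFO queues instead of the default LIFO ones (breadth-first order),
    # as listed in the Scrapy FAQ.
    'DEPTH_PRIORITY': 1,
    'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
    'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
}
```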

answered Nov 17 at 7:18

Biswanath

5,136103756

I have 'CONCURRENT_REQUESTS' : 1 ,'AUTOTHROTTLE_ENABLED' : False, in my settings
– parik
Nov 17 at 17:54


up vote
0
down vote

You can send the next request only after receiving the response to the previous one:

from scrapy import Request, Spider


class MainSpider(Spider):
    name = 'main'  # placeholder; Scrapy requires every spider to have a name

    urls = [
        'https://www.url1...',
        'https://www.url2...',
        'https://www.url3...',
    ]

    def start_requests(self):
        # Only the first URL is requested up front.
        yield Request(
            url=self.urls[0],
            callback=self.parse,
            meta={'next_index': 1},
        )

    def parse(self, response):
        next_index = response.meta['next_index']

        # do something with response...

        # Request the next URL only now that this response has arrived.
        if next_index < len(self.urls):
            yield Request(
                url=self.urls[next_index],
                callback=self.parse,
                meta={'next_index': next_index + 1},
            )
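Why this chaining pattern keeps the order can be seen without Scrapy at all: each callback yields at most one follow-up request, so the pending queue never holds more than one item and the queue discipline stops mattering. Below is a plain-Python sketch of that control flow; `run_chain` and the tuple-based "request" are hypothetical stand-ins for the engine and `Request`, not Scrapy APIs.

```python
def start_requests(urls):
    """Yield only the first URL, tagged with the index of the next one."""
    yield (urls[0], 1)


def parse(urls, url, next_index):
    """Handle one 'response' and yield at most one follow-up request."""
    if next_index < len(urls):
        yield (urls[next_index], next_index + 1)


def run_chain(urls):
    """Stand-in engine: a LIFO stack of pending (url, next_index) pairs."""
    pending = list(start_requests(urls))
    visited = []
    while pending:
        url, next_index = pending.pop()   # LIFO on purpose
        visited.append(url)               # "download" the page
        pending.extend(parse(urls, url, next_index))
        assert len(pending) <= 1          # never more than one request waiting
    return visited


urls = ['url1', 'url2', 'url3']
visited = run_chain(urls)
print(visited)
```

The trade-off is that nothing is downloaded in parallel, and if one request fails its callback never runs, so the chain stops there unless an errback continues it.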

answered Nov 20 at 1:07

nicolas

1,436813
