Extract name, address and phone number from some web pages using multiprocessing

I've written a script in Python using the multiprocessing module to scrape values from web pages (one page per subprocess). As I'm very new to multiprocessing, I'm not sure whether I did everything in the right way. It works without error, though.

Here goes the full script:

import requests
from lxml.html import fromstring
from multiprocessing import Pool

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"


def create_links(url):
    response = requests.get(url).text
    tree = fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span")[0].text
        try:
            street = title.cssselect("span.street-address")[0].text
        except IndexError:
            street = ""
        try:
            phone = title.cssselect("div[class^=phones]")[0].text
        except IndexError:
            phone = ""
        print(name, street, phone)


if __name__ == '__main__':
    links = [link.format(page) for page in range(1, 4)]
    with Pool(4) as p:
        p.map(create_links, links)

Any ideas for making it more robust would be highly appreciated.

asked 22 hours ago by robots.txt (new contributor); edited 21 hours ago by Toby Speight

Were you intending to include page 4? If so, you need to change the range's (exclusive) stop value to 5.
– Solomon Ucko
19 hours ago

python python-3.x web-scraping multiprocessing


2 Answers

Proper use of a Pool p should include p.close() and p.join() (in this order).
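As a minimal sketch of that shutdown order: the helper name run_pool and the placeholder worker len are illustrative assumptions, not part of the question's script. Note that map() blocks until every result is back, so the original with block also finishes its work; the difference is that leaving a with block calls terminate() on the pool, while close() followed by join() lets the workers exit on their own.

```python
from multiprocessing import Pool


def run_pool(worker, items, processes=4):
    """Map worker over items, then shut the pool down with close() and join()."""
    pool = Pool(processes)
    try:
        results = pool.map(worker, items)
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the worker processes to exit
    return results


if __name__ == "__main__":
    # len is only a placeholder for a real worker such as create_links.
    print(run_pool(len, ["a", "bb", "ccc"], processes=2))
```

join() may only be called after close() or terminate(), which is why the order matters.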

Cases where the website does not respond should be handled: requests should have a timeout, the timeout exception should be caught, and non-200 responses should be handled as well.
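A rough sketch of those three checks, assuming a hypothetical helper named fetch_html and an arbitrary 10-second timeout (neither is in the original script). It returns None on failure so the caller can skip that page instead of crashing the worker.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raises requests.exceptions.HTTPError for 4xx and 5xx responses.
        response.raise_for_status()
    except requests.exceptions.Timeout:
        print(f"Timed out: {url}")
        return None
    except requests.exceptions.RequestException as exc:
        # Covers connection errors and the HTTPError raised above.
        print(f"Request failed for {url}: {exc}")
        return None
    return response.text
```

In create_links, one option would be to call this helper in place of requests.get(url).text and return early when it gives back None.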

Other than that, the script is correct for a one-time extract of a few pages, but it will need to be extended if you intend to build a high-volume daemon. In that case, using Pool is questionable, and there are several alternatives, such as the Scrapy framework's scheduler or a Celery broker for high-level handling of workers (this keeps your workers from crashing entirely on exceptions, among a few other benefits).

answered 14 hours ago by Arthur Havlicek (new contributor); edited 13 hours ago


You've (manually?) escaped your query params in your URL string. This is OK, and technically a few nanoseconds faster, but less legible than the alternative:

requests.get('https://www.yellowpages.com/search',
             params={'search_terms': 'coffee',
                     'geo_location_terms': 'Los Angeles, CA',
                     'page': page})

Then, rather than calling format, you simply pass in the page parameter.
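To see what requests does with those params without touching the network, here is a small sketch. The names SEARCH_URL and build_search_url are made up for illustration; requests.Request(...).prepare() builds the final URL the same way requests.get would, but does not send anything.

```python
import requests

SEARCH_URL = "https://www.yellowpages.com/search"


def build_search_url(page):
    """Return the fully encoded search URL for one results page."""
    request = requests.Request(
        "GET",
        SEARCH_URL,
        params={
            "search_terms": "coffee",
            "geo_location_terms": "Los Angeles, CA",
            "page": page,
        },
    )
    # prepare() encodes the query string; no request is sent here.
    return request.prepare().url


if __name__ == "__main__":
    for page in range(1, 4):
        print(build_search_url(page))
```

The space and comma in the location are escaped by the library, so the human-readable value can stay in the source.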

answered 31 mins ago by Reinderien
