Extract name, address and phone number from some web pages using multiprocessing
I've written a script in Python using the multiprocessing module to scrape values from web pages (one page per subprocess). As I'm very new to multiprocessing, I'm not sure whether I did everything in the right way. It works without error, though.
Here goes the full script:
import requests
from lxml.html import fromstring
from multiprocessing import Pool

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def create_links(url):
    response = requests.get(url).text
    tree = fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span")[0].text
        try:
            street = title.cssselect("span.street-address")[0].text
        except IndexError:
            street = ""
        try:
            phone = title.cssselect("div[class^=phones]")[0].text
        except IndexError:
            phone = ""
        print(name, street, phone)

if __name__ == '__main__':
    links = [link.format(page) for page in range(1,4)]
    with Pool(4) as p:
        p.map(create_links, links)
Any idea to make it more robust will be highly appreciated.
python python-3.x web-scraping multiprocessing
edited 21 hours ago
Toby Speight
asked 22 hours ago
robots.txt
Were you intending to include page 4? If so, you need to change the range's (exclusive) stop value to 5.
– Solomon Ucko
19 hours ago
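To make the off-by-one concrete, here is a small sketch of how range treats its stop value. page_numbers is a hypothetical helper for illustration, not part of the original script.

```python
def page_numbers(first, last):
    """Return the page numbers from first to last, both included."""
    # range() excludes its stop value, so add 1 to include the last page.
    return list(range(first, last + 1))


print(list(range(1, 4)))   # [1, 2, 3]
print(page_numbers(1, 4))  # [1, 2, 3, 4]
```

So the script as posted fetches pages 1 to 3 only.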
2 Answers
Proper use of a Pool p should include calls to p.close() and p.join(), in that order.
Cases where the website does not respond should be handled: requests should have a timeout, the timeout exception should be caught, and non-200 responses should be handled as well.
Other than that, the script is fine for a one-time extract of a few pages, but it will need to be extended if you intend to build a high-volume daemon. In that case, using Pool is questionable, and there are several alternatives, such as the Scrapy framework's scheduler or a Celery broker for high-level handling of workers (this prevents your workers from crashing entirely on exceptions, among a few other benefits).
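As a rough sketch of the first two points, the snippet below fetches a page with a timeout, treats HTTP errors and unreachable hosts as a missing page, and shuts the pool down with close() and join(). It is an illustration under assumptions, not the asker's code: fetch_html and scrape_all are hypothetical names, the 10-second timeout is arbitrary, and it uses urllib from the standard library so that it runs without extra packages. With requests, the corresponding pieces would be the timeout= argument to requests.get and response.raise_for_status().

```python
import urllib.error
import urllib.request
from multiprocessing import Pool


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if response.status != 200:
                # Any other success code is treated as "no page" here.
                print(f"{url}: unexpected status {response.status}")
                return None
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        print(f"{url}: HTTP {exc.code}")
    except (OSError, ValueError) as exc:
        # OSError covers timeouts and unreachable hosts;
        # ValueError is raised for a malformed URL.
        print(f"{url}: {exc}")
    return None


def scrape_all(urls, processes=4):
    """Fetch every URL in a worker pool and return the list of bodies."""
    pool = Pool(processes)
    try:
        pages = pool.map(fetch_html, urls)
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the worker processes to exit
    return pages
```

Call scrape_all from under the `if __name__ == '__main__':` guard, as the original script does. Because Pool.map blocks until every result is back, the original with-block already collects all output; leaving a with-block calls terminate(), so the explicit close() and join() here are the graceful variant.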
edited 13 hours ago
answered 14 hours ago
Arthur Havlicek
You've (manually?) escaped your query params in your URL string. This is OK, and technically a few nanoseconds faster, but less legible than the alternative:
requests.get('https://www.yellowpages.com/search',
             params={'search_terms': 'coffee',
                     'geo_location_terms': 'Los Angeles, CA',
                     'page': page})
Then, rather than calling format, you simply pass in the page parameter.
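To show the encoding step that params takes off your hands, here is a stdlib-only sketch that rebuilds the hand-escaped URL from the plain values. search_params and search_url are hypothetical helper names, and urllib.parse stands in for requests here so the example runs without a network call.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.yellowpages.com/search"


def search_params(page):
    """Return the unescaped query parameters for one results page."""
    return {
        "search_terms": "coffee",
        "geo_location_terms": "Los Angeles, CA",
        "page": page,
    }


def search_url(page):
    """Build the full URL, percent-encoding the query values."""
    # quote_via=quote encodes spaces as %20, as in the hand-written URL.
    return BASE_URL + "?" + urlencode(search_params(page), quote_via=quote)


print(search_url(1))
```

The dictionary returned by search_params is the same shape you would hand to requests.get as params.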
answered 31 mins ago
Reinderien