Extract name, address and phone number from some web pages using multiprocessing











I've written a script in Python using the multiprocessing module to scrape values from web pages (one page per subprocess). As I'm very new to multiprocessing, I'm not sure whether I did everything in the right way. It works without error, though.



Here is the full script:



import requests
from lxml.html import fromstring
from multiprocessing import Pool

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def create_links(url):
    response = requests.get(url).text
    tree = fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span")[0].text
        try:
            street = title.cssselect("span.street-address")[0].text
        except IndexError: street = ""
        try:
            phone = title.cssselect("div[class^=phones]")[0].text
        except IndexError: phone = ""
        print(name, street, phone)

if __name__ == '__main__':
    links = [link.format(page) for page in range(1,4)]
    with Pool(4) as p:
        p.map(create_links, links)


Any ideas for making it more robust would be highly appreciated.







python python-3.x web-scraping multiprocessing






edited 21 hours ago by Toby Speight

asked 22 hours ago by robots.txt

Were you intending to include page 4? If so, you need to change the range's (exclusive) stop value to 5.
– Solomon Ucko
19 hours ago
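
Python's range excludes its stop value, which is what this comment is pointing at. A quick check of both variants (the names pages_now and pages_with_four are only illustrative):

```python
# range(start, stop) yields start, start + 1, ..., stop - 1.
pages_now = list(range(1, 4))        # what the script currently requests
pages_with_four = list(range(1, 5))  # stop value raised to 5 to include page 4

print(pages_now)        # [1, 2, 3]
print(pages_with_four)  # [1, 2, 3, 4]
```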














2 Answers
Proper use of a Pool p should include p.close() followed by p.join(), in that order.
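
A minimal sketch of that shutdown sequence, for comparison with the `with Pool(4) as p:` form in the question (per the multiprocessing documentation, leaving the `with` block calls terminate() rather than close() and join()). The helper run_in_pool, its pool_factory parameter, and the stand-in worker square are hypothetical names introduced only for this illustration:

```python
from multiprocessing import Pool


def square(n):
    """Stand-in for the scraping function; must be defined at module level."""
    return n * n


def run_in_pool(func, items, processes=4, pool_factory=Pool):
    """Map func over items, then shut the pool down with close() and join().

    pool_factory is a hypothetical hook so that the same code can be tried
    with multiprocessing.pool.ThreadPool, which offers the same methods.
    """
    pool = pool_factory(processes)
    try:
        results = pool.map(func, items)
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the workers to exit (must follow close())
    return results
```

As in the original script, the call itself should sit under an `if __name__ == '__main__':` guard, for example `print(run_in_pool(square, [1, 2, 3]))`.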



The case where the website does not respond should be handled: give each request a timeout, catch the timeout exception, and handle non-200 responses as well.
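
One possible shape for this, as a sketch rather than a definitive implementation. The function name fetch_html, the 10-second default, and the injectable get argument are assumptions made for this example; requests.get's timeout argument, requests.exceptions.Timeout, requests.exceptions.RequestException, and response.status_code are part of the requests library:

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page body, or None if the request fails.

    The 10-second default and the injectable `get` are assumptions made
    for this sketch; they are not part of the original script.
    """
    try:
        response = get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        print("timed out:", url)
        return None
    except requests.exceptions.RequestException as exc:
        # Connection errors, invalid URLs and similar failures.
        print("request failed:", url, exc)
        return None
    if response.status_code != 200:
        print("unexpected status", response.status_code, "for", url)
        return None
    return response.text
```

The scraping function could then call fetch_html(url) and return early when it gets None, instead of passing an error page to fromstring.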



Other than that, the script is fine for a one-time extraction of a few pages, but it will need to be extended if you intend to build a high-volume daemon. In that case, using Pool becomes questionable, and there are several alternatives, such as the Scrapy framework's scheduler or Celery with a broker for high-level handling of workers (this keeps your workers from crashing entirely on exceptions, among a few other benefits).
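
Short of moving to Scrapy or Celery, one low-tech way to keep a single failing page from aborting a whole Pool.map call is to catch the exception inside the worker and return it as data. This is only a sketch of that idea, not what either framework does; safe_call is a hypothetical helper name:

```python
def safe_call(func, url):
    """Run func(url) and report a failure as data instead of raising.

    Returns a (url, result, error) tuple; error is None on success.
    """
    try:
        return url, func(url), None
    except Exception as exc:  # deliberately broad: one bad page must not stop the rest
        return url, None, repr(exc)
```

With the question's script this could be used as `p.map(functools.partial(safe_call, create_links), links)`, since a functools.partial built from module-level functions can be pickled and sent to the workers.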







answered 14 hours ago by Arthur Havlicek (edited 13 hours ago)

You've (manually?) escaped your query parameters in your URL string. This is OK, and technically a few nanoseconds faster, but less legible than the alternative:

requests.get('https://www.yellowpages.com/search',
             params={'search_terms': 'coffee',
                     'geo_location_terms': 'Los Angeles, CA',
                     'page': page})

Then, rather than calling format, you simply pass in the page parameter.
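
To see the kind of escaping that is being delegated, the standard library's urllib.parse.urlencode can be applied to the same dictionary. This is only an illustration: requests performs its own encoding, and details can differ from a hand-written URL (for example, `+` versus `%20` for spaces). The page value 1 is an arbitrary example:

```python
from urllib.parse import urlencode

params = {
    'search_terms': 'coffee',
    'geo_location_terms': 'Los Angeles, CA',
    'page': 1,  # arbitrary example page
}

# Spaces and the comma are escaped automatically.
query = urlencode(params)
print(query)
```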







answered 31 mins ago by Reinderien
