Webscraping Amazon with BeautifulSoup











I am trying to webscrape Amazon's reviews: https://www.amazon.com/Python-Crash-Course-Hands-Project-Based/dp/1593276036/ref=sr_1_3?ie=UTF8&qid=1541450645&sr=8-3&keywords=python

Here is my code:

import requests as req
from bs4 import BeautifulSoup

headers = {'User-Agent': "Kevin's_request"}  # double quotes: the apostrophe would otherwise end the string
r = req.get('https://www.amazon.com/Python-Crash-Course-Hands-Project-Based/dp/1593276036/ref=sr_1_3?ie=UTF8&qid=1541450645&sr=8-3&keywords=python', headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
soup.find(class_="a-expander-content a-expander-partial-collapse-content")

I only end up with an empty result. I am using Python 3.6.4 in a Jupyter Notebook with BeautifulSoup 4.
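One thing worth checking first: soup.find returns None when nothing matches, while soup.find_all returns an empty list, so an "empty" result can mean either that the element is missing from the fetched HTML or that Amazon served an error page instead of the product page. A minimal offline sketch of the difference (the toy HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

# A tiny stand-in document; the real page would come from requests
html = "<div class='review'>Great book!</div>"
soup = BeautifulSoup(html, "html.parser")

# find() -> the first match, or None
print(soup.find(class_="review").text)   # Great book!
print(soup.find(class_="missing"))       # None

# find_all() -> a (possibly empty) list of matches
print(soup.find_all(class_="missing"))   # []
```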










  • When sending requests on my end to that URL I receive a 503 HTTP status code. Check the status code on your side; if you are using the requests library you can do r.status_code, where r is the result of requests.get().
    – kstullich
    Nov 20 at 0:18












  • I did that but left that part out earlier; I edited my question to reflect it. Everything seems to be working fine except the .find_all(...) call to get the text, which just returns an empty list. When I use the same code with another website it works.
    – KPH3802
    Nov 20 at 0:29












    The reviews are wrapped in another div called a-section review. See here. When trying len(soup.findAll(class_='a-section review')) the result is 8, which is how many reviews are displayed.
    – kstullich
    Nov 20 at 0:39
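A side note on that selector: passing a space-separated string to class_ only matches tags whose class attribute is written in exactly that order, while a CSS selector via select() matches regardless of order, which can explain count discrepancies between machines. An offline sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = ("<div class='review a-section'>r1</div>"
        "<div class='a-section review'>r2</div>")
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: only the second div matches
print(len(soup.find_all(class_="a-section review")))  # 1

# CSS selector: order-independent, both divs match
print(len(soup.select("div.a-section.review")))       # 2
```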












  • I see what you're saying, but when I do len(soup.findAll(class_='a-section review')) I get a length of 0. I must be doing something wrong.
    – KPH3802
    Nov 20 at 0:51










    I initialized it with soup = BeautifulSoup(r.text, "html.parser") Should I remove the "html.parser"?
    – KPH3802
    Nov 20 at 1:16















python web-scraping beautifulsoup findall






asked Nov 20 at 0:10 by KPH3802 (edited Nov 20 at 1:40)

2 Answers
Accepted answer (score 1), answered Nov 20 at 3:17 by SIM
Try this approach. It turns out that your selector could not find anything; I've fixed it to serve the purpose:

import requests
from bs4 import BeautifulSoup

def get_reviews(s, url):
    s.headers['User-Agent'] = 'Mozilla/5.0'
    response = s.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    return soup.find_all("div", {"data-hook": "review-collapsed"})

if __name__ == '__main__':
    link = 'https://www.amazon.com/Python-Crash-Course-Hands-Project-Based/dp/1593276036/ref=sr_1_3?ie=UTF8&qid=1541450645&sr=8-3&keywords=python'
    with requests.Session() as s:
        for review in get_reviews(s, link):
            print(f'{review.text}\n')





  • With this syntax I assume there is only the one get, and as you loop here (for review in get_reviews(s, link)) you are re-using the returned object? I am not familiar enough with this library and syntax yet to understand the efficiency I assume is here.
    – QHarr
    Nov 20 at 4:42
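For context on that question: a requests.Session applies its headers to every request it prepares, and get_reviews above performs a single GET whose parsed results are then iterated. A small offline sketch of the header persistence (the example.com URL is just a placeholder):

```python
import requests

with requests.Session() as s:
    # Headers set on the session are merged into every request it makes
    s.headers['User-Agent'] = 'Mozilla/5.0'
    prepared = s.prepare_request(requests.Request('GET', 'https://example.com'))
    print(prepared.headers['User-Agent'])  # Mozilla/5.0
```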












  • This worked! Thank you all for your help!
    – KPH3802
    Nov 20 at 23:45


















Answer (score 0), answered Nov 20 at 1:07 by AResem
Not sure what's happening on your side, but this code works fine for me. Here it goes (Python 3.6, BeautifulSoup 4.6.3):

import requests
from bs4 import BeautifulSoup

def s_comments(url):
    headers = {'User-Agent': "Bob's_request"}  # double quotes so the apostrophe doesn't end the string
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        raise ConnectionError
    soup = BeautifulSoup(response.content, "html.parser")
    return soup.find_all(class_="a-expander-content a-expander-partial-collapse-content")

url = 'https://www.amazon.com/dp/1593276036'
reviews = s_comments(url)
for i, review in enumerate(reviews):
    print('---- {} ----'.format(i))
    print(review.text)
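An aside on the error handling in this answer: instead of raising a bare ConnectionError, requests ships response.raise_for_status(), which raises requests.exceptions.HTTPError for 4xx/5xx responses and includes the status code in the message. A sketch with a synthetic Response object (no network needed):

```python
import requests

response = requests.Response()
response.status_code = 503  # simulate Amazon's bot-blocking reply

try:
    response.raise_for_status()
except requests.exceptions.HTTPError as err:
    # The exception message starts with the status code
    print(err)
```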





  • When I use this code I still get an empty list. Could it be something on my end?
    – KPH3802
    Nov 20 at 1:20










  • Yes. Try printing the response you are getting back, like:
    – AResem
    Nov 20 at 1:31










  • print(response.content), and also your parsed object, like soup.prettify()
    – AResem
    Nov 20 at 1:33
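To expand that debugging tip: prettify() re-indents whatever markup the parser actually received, which makes it obvious at a glance when you were served a captcha or robot-check page rather than the product page. A minimal offline sketch:

```python
from bs4 import BeautifulSoup

html = "<div class='a-section review'><span>Great book!</span></div>"
soup = BeautifulSoup(html, "html.parser")

# prettify() renders the parsed tree with one tag per line
print(soup.prettify())
```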










