Webscraping Amazon with BeautifulSoup
I am trying to webscrape Amazon's reviews: https://www.amazon.com/Python-Crash-Course-Hands-Project-Based/dp/1593276036/ref=sr_1_3?ie=UTF8&qid=1541450645&sr=8-3&keywords=python
Here is my code:

    import requests as req
    from bs4 import BeautifulSoup

    headers = {'User-Agent': "Kevin's_request"}
    r = req.get('https://www.amazon.com/Python-Crash-Course-Hands-Project-Based/dp/1593276036/ref=sr_1_3?ie=UTF8&qid=1541450645&sr=8-3&keywords=python', headers=headers)
    soup = BeautifulSoup(r.text, "html.parser")
    soup.find(class_="a-expander-content a-expander-partial-collapse-content")

I only end up with an empty list from .find_all(...) (and None from .find). I am using Python 3.6.4 in Jupyter Notebook with BeautifulSoup 4.
Tags: python, web-scraping, beautifulsoup, findall
asked Nov 20 at 0:10, edited Nov 20 at 1:40 – KPH3802
When sending requests on my end to the URL I receive a 503 HTTP status code. Check the status code on your side. If you are using the requests library you can check r.status_code, where r is the object returned by requests.get().
– kstullich
Nov 20 at 0:18
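For reference, a minimal version of the check kstullich describes might look like this (a sketch; the short /dp/ URL stands in for the full product link from the question, and the browser-style User-Agent is an assumption):

    import requests

    url = 'https://www.amazon.com/dp/1593276036'  # short form of the product URL
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    print(r.status_code)  # anything other than 200 (e.g. 503) means you did not get the product page
    print(r.reason)       # human-readable status text, e.g. 'Service Unavailable'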
I did that but left that part out earlier; I have edited my question to reflect it. Everything seems to be working fine except the .find_all(...) call to get the text: it just returns an empty list. When I use the same code with another website it works.
– KPH3802
Nov 20 at 0:29
The reviews are wrapped in another div with the class a-section review. When trying len(soup.findAll(class_='a-section review')) the result is 8, which is how many reviews are displayed.
– kstullich
Nov 20 at 0:39
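Worth knowing here: when you pass class_ a string containing a space, BeautifulSoup matches it against the tag's class attribute as one exact string, so any extra class on the tag (or a different order) makes it miss. A CSS selector requires each class independently. A minimal sketch, using a hypothetical snippet of markup:

    from bs4 import BeautifulSoup

    html = '<div class="a-section review aok-relative">a review</div>'  # hypothetical markup
    soup = BeautifulSoup(html, 'html.parser')

    # exact-string match on the class attribute; the extra class makes this miss
    print(len(soup.find_all(class_='a-section review')))  # 0
    # CSS selector requires both classes regardless of order or extras
    print(len(soup.select('div.a-section.review')))       # 1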
I see what you're saying, but when I do len(soup.findAll(class_='a-section review')) I get a length of 0. I must be doing something wrong.
– KPH3802
Nov 20 at 0:51
I initialized it with soup = BeautifulSoup(r.text, "html.parser"). Should I remove the "html.parser"?
– KPH3802
Nov 20 at 1:16
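For what it's worth, "html.parser" is a valid choice and is unlikely to be the cause of an empty result by itself; parsers mainly differ in speed and in how they repair malformed HTML. A quick way to compare them on the same input (a sketch; lxml must be installed separately, e.g. via pip install lxml):

    from bs4 import BeautifulSoup

    broken = '<ul><li>one<li>two'  # deliberately unclosed tags
    for parser in ('html.parser', 'lxml'):
        soup = BeautifulSoup(broken, parser)
        print(parser, '->', soup.find_all('li'))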
2 Answers
Accepted answer (score 1)
Try this approach. It turns out that your selector could not find anything, so I've changed it to one that does:

    import requests
    from bs4 import BeautifulSoup

    def get_reviews(s, url):
        s.headers['User-Agent'] = 'Mozilla/5.0'
        response = s.get(url)
        soup = BeautifulSoup(response.text, "lxml")
        return soup.find_all("div", {"data-hook": "review-collapsed"})

    if __name__ == '__main__':
        link = 'https://www.amazon.com/Python-Crash-Course-Hands-Project-Based/dp/1593276036/ref=sr_1_3?ie=UTF8&qid=1541450645&sr=8-3&keywords=python'
        with requests.Session() as s:
            for review in get_reviews(s, link):
                print(f'{review.text}\n')

answered Nov 20 at 3:17 – SIM
With this syntax I assume there is only the one GET, and as you loop here (for review in get_reviews(s, link)) you are re-using the returned object? I am not familiar enough with this library and syntax yet to understand the efficiency I assume is here.
– QHarr
Nov 20 at 4:42
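To QHarr's question: yes, get_reviews performs a single GET per call; the with requests.Session() as s block keeps one connection pool, cookie jar and set of default headers alive, so repeated calls on the same session reuse them instead of reconnecting. A sketch of that reuse (the second product URL is a hypothetical example):

    import requests

    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0'
        # both requests share the session's headers, cookies and pooled connection
        r1 = s.get('https://www.amazon.com/dp/1593276036')
        r2 = s.get('https://www.amazon.com/dp/1491946008')  # hypothetical second product
        print(r1.status_code, r2.status_code)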
This worked! Thank you all for your help!
– KPH3802
Nov 20 at 23:45
Answer (score 0)
Not sure what's happening on your side, but this code works fine.
Here it goes (Python 3.6, BeautifulSoup 4.6.3):

    import requests
    from bs4 import BeautifulSoup

    def s_comments(url):
        headers = {'User-Agent': "Bob's_request"}
        response = requests.get(url, headers=headers)
        if response.status_code != 200:
            raise ConnectionError
        # naming the parser explicitly avoids bs4's "no parser specified" warning
        soup = BeautifulSoup(response.content, "html.parser")
        return soup.find_all(class_="a-expander-content a-expander-partial-collapse-content")

    url = 'https://www.amazon.com/dp/1593276036'
    reviews = s_comments(url)
    for i, review in enumerate(reviews):
        print('---- {} ----'.format(i))
        print(review.text)

answered Nov 20 at 1:07 – AResem
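A small variation on the error handling above, not part of the original answer: requests can raise the error for you, and its exception message includes the actual status code, which is more informative than a bare ConnectionError:

    import requests

    response = requests.get('https://www.amazon.com/dp/1593276036',
                            headers={'User-Agent': "Bob's_request"})
    # raises requests.exceptions.HTTPError for any 4xx/5xx response
    response.raise_for_status()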
When I use this code I still get an empty list. Could it be something on my end?
– KPH3802
Nov 20 at 1:20
Yes. Try to print the response you are getting back, like print(response.content), and also your parsed object, like soup.prettify().
– AResem
Nov 20 at 1:31
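Concretely, the debugging AResem suggests might look like the sketch below; if Amazon returned a 503 or a robot-check page instead of the product page, it shows up immediately in the printed markup (self-contained version, with an assumed browser-style User-Agent):

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.amazon.com/dp/1593276036'
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, 'html.parser')

    print(response.status_code)    # was the request even accepted?
    print(response.content[:300])  # raw start of the page; a CAPTCHA page is obvious here
    print(soup.prettify()[:300])   # the same content as BeautifulSoup parsed it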