Webscraping Amazon with BeautifulSoup











I am trying to webscrape Amazon's reviews: https://www.amazon.com/Python-Crash-Course-Hands-Project-Based/dp/1593276036/ref=sr_1_3?ie=UTF8&qid=1541450645&sr=8-3&keywords=python

Here is my code:

import requests as req
from bs4 import BeautifulSoup

headers = {'User-Agent': "Kevin's_request"}  # double quotes: the apostrophe would otherwise end the string
r = req.get('https://www.amazon.com/Python-Crash-Course-Hands-Project-Based/dp/1593276036/ref=sr_1_3?ie=UTF8&qid=1541450645&sr=8-3&keywords=python', headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
soup.find(class_="a-expander-content a-expander-partial-collapse-content")

I only end up with an empty result. I am using Python 3.6.4 in a Jupyter Notebook with BeautifulSoup 4.
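One thing worth checking first: soup.find returns None when nothing matches, while soup.find_all returns an empty list, so an "empty" result can mean either that the element is missing from the fetched HTML or that Amazon served an error page instead of the product page. A minimal offline sketch of the difference (the toy HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

# A tiny stand-in document; the real page would come from requests
html = "<div class='review'>Great book!</div>"
soup = BeautifulSoup(html, "html.parser")

# find() -> the first match, or None
print(soup.find(class_="review").text)   # Great book!
print(soup.find(class_="missing"))       # None

# find_all() -> a (possibly empty) list of matches
print(soup.find_all(class_="missing"))   # []
```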










  • When sending requests on my end to that URL I receive a 503 HTTP status code. Check the status code on your side; if you are using the requests library you can do r.status_code, where r is the result of requests.get().
    – kstullich
    Nov 20 at 0:18












  • I did that but left that part out earlier; I edited my question to reflect it. Everything seems to be working fine except the .find_all(...) call to get the text, which just returns an empty list. When I use the same code with another website it works.
    – KPH3802
    Nov 20 at 0:29












    The reviews are wrapped in another div called a-section review. See here. When trying len(soup.findAll(class_='a-section review')) the result is 8, which is how many reviews are displayed.
    – kstullich
    Nov 20 at 0:39
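A side note on that selector: passing a space-separated string to class_ only matches tags whose class attribute is written in exactly that order, while a CSS selector via select() matches regardless of order, which can explain count discrepancies between machines. An offline sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = ("<div class='review a-section'>r1</div>"
        "<div class='a-section review'>r2</div>")
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: only the second div matches
print(len(soup.find_all(class_="a-section review")))  # 1

# CSS selector: order-independent, both divs match
print(len(soup.select("div.a-section.review")))       # 2
```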












  • I see what you're saying, but when I do len(soup.findAll(class_='a-section review')) I get a length of 0. I must be doing something wrong.
    – KPH3802
    Nov 20 at 0:51










    I initialized it with soup = BeautifulSoup(r.text, "html.parser") Should I remove the "html.parser"?
    – KPH3802
    Nov 20 at 1:16















python web-scraping beautifulsoup findall






asked Nov 20 at 0:10 by KPH3802 (edited Nov 20 at 1:40)

2 Answers
Accepted answer (score 1), answered Nov 20 at 3:17 by SIM
Try this approach. It turns out that your selector could not find anything; I've fixed it to serve the purpose:

import requests
from bs4 import BeautifulSoup

def get_reviews(s, url):
    s.headers['User-Agent'] = 'Mozilla/5.0'
    response = s.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    return soup.find_all("div", {"data-hook": "review-collapsed"})

if __name__ == '__main__':
    link = 'https://www.amazon.com/Python-Crash-Course-Hands-Project-Based/dp/1593276036/ref=sr_1_3?ie=UTF8&qid=1541450645&sr=8-3&keywords=python'
    with requests.Session() as s:
        for review in get_reviews(s, link):
            print(f'{review.text}\n')





  • With this syntax I assume there is only the one get, and as you loop here (for review in get_reviews(s, link)) you are re-using the returned object? I am not familiar enough with this library and syntax yet to understand the efficiency I assume is here.
    – QHarr
    Nov 20 at 4:42
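For context on that question: a requests.Session applies its headers to every request it prepares, and get_reviews above performs a single GET whose parsed results are then iterated. A small offline sketch of the header persistence (the example.com URL is just a placeholder):

```python
import requests

with requests.Session() as s:
    # Headers set on the session are merged into every request it makes
    s.headers['User-Agent'] = 'Mozilla/5.0'
    prepared = s.prepare_request(requests.Request('GET', 'https://example.com'))
    print(prepared.headers['User-Agent'])  # Mozilla/5.0
```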












  • This worked! Thank you all for your help!
    – KPH3802
    Nov 20 at 23:45


















Answer (score 0), answered Nov 20 at 1:07 by AResem
Not sure what's happening on your side, but this code works fine for me. Here it goes (Python 3.6, BeautifulSoup 4.6.3):

import requests
from bs4 import BeautifulSoup

def s_comments(url):
    headers = {'User-Agent': "Bob's_request"}  # double quotes so the apostrophe doesn't end the string
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        raise ConnectionError
    soup = BeautifulSoup(response.content, "html.parser")
    return soup.find_all(class_="a-expander-content a-expander-partial-collapse-content")

url = 'https://www.amazon.com/dp/1593276036'
reviews = s_comments(url)
for i, review in enumerate(reviews):
    print('---- {} ----'.format(i))
    print(review.text)
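An aside on the error handling in this answer: instead of raising a bare ConnectionError, requests ships response.raise_for_status(), which raises requests.exceptions.HTTPError for 4xx/5xx responses and includes the status code in the message. A sketch with a synthetic Response object (no network needed):

```python
import requests

response = requests.Response()
response.status_code = 503  # simulate Amazon's bot-blocking reply

try:
    response.raise_for_status()
except requests.exceptions.HTTPError as err:
    # The exception message starts with the status code
    print(err)
```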





  • When I use this code I still get an empty list. Could it be something on my end?
    – KPH3802
    Nov 20 at 1:20










  • Yes. Try printing the response you are getting back, like:
    – AResem
    Nov 20 at 1:31










  • print(response.content), and also your parsed object, like soup.prettify()
    – AResem
    Nov 20 at 1:33
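To expand that debugging tip: prettify() re-indents whatever markup the parser actually received, which makes it obvious at a glance when you were served a captcha or robot-check page rather than the product page. A minimal offline sketch:

```python
from bs4 import BeautifulSoup

html = "<div class='a-section review'><span>Great book!</span></div>"
soup = BeautifulSoup(html, "html.parser")

# prettify() renders the parsed tree with one tag per line
print(soup.prettify())
```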










