Can't create an xpath capable of meeting certain condition
I've created a script which extracts the links ending with .html available under the class tableFile from a webpage. The script can do its job. However, my intention at this point is to get only those .html links which have EX- in their Type field. I'm looking for a pure XPath solution (not one using .getparent() or the like).
Link to that site
The script I've tried so far:
import requests
from lxml.html import fromstring

res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
root = fromstring(res.text)
for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
    if ".htm" in item:
        print(item)
When I try to get the links meeting the above condition with the approach below, I get an error:
for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
    if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
        print(item)
Error I get:
if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'xpath'
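For context, here is a minimal sketch of why this error occurs (the HTML below is a hypothetical stand-in for the real page): an XPath query ending in /@href returns string results, lxml's _ElementUnicodeResult, which is a str subclass with no .xpath() method, although it does remember its origin element via .getparent():

```python
from lxml.html import fromstring

# Hypothetical miniature of the index table, just to reproduce the behaviour.
snippet = """
<table summary="Document Format Files">
  <tr><td scope="row"><a href="ex31_1apg.htm">exhibit</a></td><td>EX-31.1</td></tr>
</table>
"""
root = fromstring(snippet)

href = root.xpath('//td[@scope="row"]/a/@href')[0]
print(type(href).__name__)     # '_ElementUnicodeResult' - a str subclass
print(hasattr(href, "xpath"))  # False - hence the AttributeError above
print(href.getparent().tag)    # 'a' - the owning element is still reachable
```

This is why filtering has to happen inside the XPath expression itself (or via .getparent(), which the question rules out).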
This is how the files look:
python python-3.x xpath web-scraping
asked yesterday by robots.txt
3 Answers

Accepted answer:
If you need a pure XPath solution, you can use the one below:
import requests
from lxml.html import fromstring

res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
root = fromstring(res.text)
for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):
    print(item)
Output:
/Archives/edgar/data/1085596/000146970918000185/ex31_1apg.htm
/Archives/edgar/data/1085596/000146970918000185/ex31_2apg.htm
/Archives/edgar/data/1085596/000146970918000185/ex32_1apg.htm
/Archives/edgar/data/1085596/000146970918000185/ex32_2apg.htm
answered yesterday by Andersson
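The key idea here is to filter at the row (tr) level: keep only rows whose Type cell starts with EX-, then descend to the link. A self-contained check against a hypothetical two-row stand-in for the SEC table:

```python
from lxml.html import fromstring

# Hypothetical miniature of the index table: one EX- row, one non-EX row.
snippet = """
<table summary="Document Format Files">
  <tr><td scope="row"><a href="ex31_1apg.htm">exhibit</a></td><td>EX-31.1</td></tr>
  <tr><td scope="row"><a href="f10q.htm">report</a></td><td>10-Q</td></tr>
</table>
"""
root = fromstring(snippet)

hrefs = root.xpath(
    '//table[contains(@summary,"Document")]'
    '//tr[td[starts-with(., "EX-")]]'       # keep rows with an EX- Type cell
    '/td/a[contains(@href, ".htm")]/@href'  # then take the .htm link in that row
)
print(hrefs)  # only the EX- row's link survives: ['ex31_1apg.htm']
```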
Apologies for the delayed response, @Andersson. Thanks for your effective solution. – robots.txt, yesterday
Is there any way to do the same using .cssselect(), @Andersson? I hope you will take a look in your spare time. – robots.txt, yesterday
@robots.txt, you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')], but CSS selectors are not so flexible IMHO, so it's not the same as the provided XPath. – Andersson, yesterday
It looks like you want:
//td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]
There are a lot of different ways to do this with XPath. CSS is probably much simpler.
answered yesterday by pguardiario
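Unlike the accepted expression, this one selects the a elements themselves rather than @href strings, so the attribute has to be read off each element afterwards. A sketch on a hypothetical stand-in table:

```python
from lxml.html import fromstring

# Hypothetical miniature of the index table.
snippet = """
<table summary="Document Format Files">
  <tr><td scope="row"><a href="ex31_1apg.htm">exhibit</a></td><td>EX-31.1</td></tr>
  <tr><td scope="row"><a href="f10q.htm">report</a></td><td>10-Q</td></tr>
</table>
"""
root = fromstring(snippet)

# Select <a> elements whose row has a later <td> starting with "EX".
anchors = root.xpath('//td[following-sibling::td[starts-with(text(), "EX")]]'
                     '/a[contains(@href, ".htm")]')
hrefs = [a.get('href') for a in anchors]
print(hrefs)
```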
Here is a way using pandas DataFrames:
import pandas as pd

tables = pd.read_html("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"
results = [base + row[1][2] for row in tables[0].iterrows()
           if row[1][2].endswith(('.htm', '.txt')) and str(row[1][3]).startswith('EX')]
print(results)
edited yesterday; answered yesterday by QHarr
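The same row-wise filter can be checked offline, since pd.read_html parses any HTML table; the markup below is a hypothetical miniature of the index table (pandas 2.x expects a file-like object for literal HTML, hence StringIO). Column-name access is used instead of the positional row[1][2] / row[1][3] lookups, which is sturdier if the column order ever changes:

```python
from io import StringIO
import pandas as pd

# Hypothetical miniature of the SEC index table.
html = """
<table>
  <tr><th>Seq</th><th>Description</th><th>Document</th><th>Type</th></tr>
  <tr><td>1</td><td>exhibit</td><td>ex31_1apg.htm</td><td>EX-31.1</td></tr>
  <tr><td>2</td><td>report</td><td>f10q.htm</td><td>10-Q</td></tr>
</table>
"""
df = pd.read_html(StringIO(html))[0]

# Keep .htm documents whose Type starts with "EX".
mask = df["Document"].str.endswith(".htm") & df["Type"].astype(str).str.startswith("EX")
results = df.loc[mask, "Document"].tolist()
print(results)
```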