Can't create an xpath capable of meeting certain condition

up vote
2
down vote

favorite

I've created a script which is able to extract the links ending with .html extention available under class tableFile from a webpage. The script can do it's job. However, my intention at this point is to get only those .html links which have EX- in its type field. I'm looking for any pure xpath solution (by not using .getparent() or something).

Link to that site

Script I've tried with so far:

import requests

from lxml.html import fromstring



res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")

root = fromstring(res.text)



for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):

    if ".htm" in item:

        print(item)

When I try to get the links meeting above condition with the below approach, I get an error:

for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):

    if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):

        print(item)

Error I get:

if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):

AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'xpath'

This is how the files look like:

enter image description here

asked yesterday

robots.txt

945

add a comment |

up vote
2
down vote

favorite

Link to that site

Script I've tried with so far:

import requests

from lxml.html import fromstring



res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")

root = fromstring(res.text)



for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):

    if ".htm" in item:

        print(item)

When I try to get the links meeting above condition with the below approach, I get an error:

for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):

    if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):

        print(item)

Error I get:

if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):

AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'xpath'

This is how the files look like:

enter image description here

asked yesterday

robots.txt

945

add a comment |

up vote
2
down vote

favorite

Link to that site

Script I've tried with so far:

import requests

from lxml.html import fromstring



res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")

root = fromstring(res.text)



for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):

    if ".htm" in item:

        print(item)

When I try to get the links meeting above condition with the below approach, I get an error:

for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):

    if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):

        print(item)

Error I get:

if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):

AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'xpath'

This is how the files look like:

enter image description here

asked yesterday

robots.txt

945

Link to that site

Script I've tried with so far:

import requests

from lxml.html import fromstring



res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")

root = fromstring(res.text)



for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):

    if ".htm" in item:

        print(item)

When I try to get the links meeting above condition with the below approach, I get an error:

for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):

    if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):

        print(item)

Error I get:

if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):

AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'xpath'

This is how the files look like:

enter image description here

python python-3.x xpath web-scraping

asked yesterday

robots.txt

945

asked yesterday

robots.txt

945

asked yesterday

robots.txt

945

asked yesterday

robots.txt

945

asked yesterday

robots.txt

945

add a comment |

3 Answers
3

active

oldest

votes

up vote
2
down vote

accepted

If you need pure XPath solution, you can use below:

import requests

from lxml.html import fromstring



res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")

root = fromstring(res.text)

for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):

    print(item)





/Archives/edgar/data/1085596/000146970918000185/ex31_1apg.htm

/Archives/edgar/data/1085596/000146970918000185/ex31_2apg.htm

/Archives/edgar/data/1085596/000146970918000185/ex32_1apg.htm

/Archives/edgar/data/1085596/000146970918000185/ex32_2apg.htm

answered yesterday

Andersson

34.5k103063

Apology for the delayed response @sir Andersson. Thanks for your effective solution.
– robots.txt
yesterday

Is there any way to do the same using .cssselect() @sir Andersson? I hope you will take a look in your spare time.
– robots.txt
yesterday

1

@robots.txt , you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')] , but CSS selectors are not so flexible IMHO, so it's not the same as provided XPath
– Andersson
yesterday

add a comment |

up vote
1
down vote

It looks like you want:

//td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]

There's a lot of different ways to do this with xpath. Css is probalby much simpler.

answered yesterday

pguardiario

35.3k978112

add a comment |

up vote
1
down vote

Here is a way using dataframes and pandas

import pandas as pd

tables = pd.read_html("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")

base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"

results = [base + row[1][2] for row in tables[0].iterrows() if row[1][2].endswith(('.htm', '.txt')) and str(row[1][3]).startswith('EX')]

print(results)

edited yesterday

answered yesterday

QHarr

25.6k81839

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53349062%2fcant-create-an-xpath-capable-of-meeting-certain-condition%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

up vote
2
down vote

accepted

If you need pure XPath solution, you can use below:

import requests

from lxml.html import fromstring



res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")

root = fromstring(res.text)

for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):

    print(item)





/Archives/edgar/data/1085596/000146970918000185/ex31_1apg.htm

/Archives/edgar/data/1085596/000146970918000185/ex31_2apg.htm

/Archives/edgar/data/1085596/000146970918000185/ex32_1apg.htm

/Archives/edgar/data/1085596/000146970918000185/ex32_2apg.htm

answered yesterday

Andersson

34.5k103063

Apology for the delayed response @sir Andersson. Thanks for your effective solution.
– robots.txt
yesterday

Is there any way to do the same using .cssselect() @sir Andersson? I hope you will take a look in your spare time.
– robots.txt
yesterday

1

@robots.txt , you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')] , but CSS selectors are not so flexible IMHO, so it's not the same as provided XPath
– Andersson
yesterday

add a comment |

up vote
2
down vote

accepted

If you need pure XPath solution, you can use below:

import requests

from lxml.html import fromstring



res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")

root = fromstring(res.text)

for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):

    print(item)





/Archives/edgar/data/1085596/000146970918000185/ex31_1apg.htm

/Archives/edgar/data/1085596/000146970918000185/ex31_2apg.htm

/Archives/edgar/data/1085596/000146970918000185/ex32_1apg.htm

/Archives/edgar/data/1085596/000146970918000185/ex32_2apg.htm

answered yesterday

Andersson

34.5k103063

Apology for the delayed response @sir Andersson. Thanks for your effective solution.
– robots.txt
yesterday

Is there any way to do the same using .cssselect() @sir Andersson? I hope you will take a look in your spare time.
– robots.txt
yesterday

1

@robots.txt , you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')] , but CSS selectors are not so flexible IMHO, so it's not the same as provided XPath
– Andersson
yesterday

add a comment |

up vote
2
down vote

accepted

If you need pure XPath solution, you can use below:

import requests

from lxml.html import fromstring



res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")

root = fromstring(res.text)

for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):

    print(item)





/Archives/edgar/data/1085596/000146970918000185/ex31_1apg.htm

/Archives/edgar/data/1085596/000146970918000185/ex31_2apg.htm

/Archives/edgar/data/1085596/000146970918000185/ex32_1apg.htm

/Archives/edgar/data/1085596/000146970918000185/ex32_2apg.htm

answered yesterday

Andersson

34.5k103063

If you need pure XPath solution, you can use below:

import requests

from lxml.html import fromstring



res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")

root = fromstring(res.text)

for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):

    print(item)





/Archives/edgar/data/1085596/000146970918000185/ex31_1apg.htm

/Archives/edgar/data/1085596/000146970918000185/ex31_2apg.htm

/Archives/edgar/data/1085596/000146970918000185/ex32_1apg.htm

/Archives/edgar/data/1085596/000146970918000185/ex32_2apg.htm

answered yesterday

Andersson

34.5k103063

answered yesterday

Andersson

34.5k103063

answered yesterday

Andersson

34.5k103063

answered yesterday

Andersson

34.5k103063

Apology for the delayed response @sir Andersson. Thanks for your effective solution.
– robots.txt
yesterday

Is there any way to do the same using .cssselect() @sir Andersson? I hope you will take a look in your spare time.
– robots.txt
yesterday

1

@robots.txt , you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')] , but CSS selectors are not so flexible IMHO, so it's not the same as provided XPath
– Andersson
yesterday

add a comment |

Apology for the delayed response @sir Andersson. Thanks for your effective solution.
– robots.txt
yesterday

Is there any way to do the same using .cssselect() @sir Andersson? I hope you will take a look in your spare time.
– robots.txt
yesterday

1

@robots.txt , you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')] , but CSS selectors are not so flexible IMHO, so it's not the same as provided XPath
– Andersson
yesterday

Apology for the delayed response @sir Andersson. Thanks for your effective solution.
– robots.txt
yesterday

Is there any way to do the same using .cssselect() @sir Andersson? I hope you will take a look in your spare time.
– robots.txt
yesterday

@robots.txt , you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')] , but CSS selectors are not so flexible IMHO, so it's not the same as provided XPath
– Andersson
yesterday

add a comment |

up vote
1
down vote

It looks like you want:

//td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]

There's a lot of different ways to do this with xpath. Css is probalby much simpler.

answered yesterday

pguardiario

35.3k978112

add a comment |

up vote
1
down vote

It looks like you want:

//td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]

There's a lot of different ways to do this with xpath. Css is probalby much simpler.

answered yesterday

pguardiario

35.3k978112

add a comment |

up vote
1
down vote

It looks like you want:

//td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]

There's a lot of different ways to do this with xpath. Css is probalby much simpler.

answered yesterday

pguardiario

35.3k978112

It looks like you want:

//td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]

There's a lot of different ways to do this with xpath. Css is probalby much simpler.

answered yesterday

pguardiario

35.3k978112

answered yesterday

pguardiario

35.3k978112

answered yesterday

pguardiario

35.3k978112

answered yesterday

pguardiario

35.3k978112

add a comment |

up vote
1
down vote

Here is a way using dataframes and pandas

import pandas as pd

tables = pd.read_html("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")

base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"

results = [base + row[1][2] for row in tables[0].iterrows() if row[1][2].endswith(('.htm', '.txt')) and str(row[1][3]).startswith('EX')]

print(results)

edited yesterday

answered yesterday

QHarr

25.6k81839

add a comment |

up vote
1
down vote

Here is a way using dataframes and pandas

import pandas as pd

tables = pd.read_html("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")

base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"

results = [base + row[1][2] for row in tables[0].iterrows() if row[1][2].endswith(('.htm', '.txt')) and str(row[1][3]).startswith('EX')]

print(results)

edited yesterday

answered yesterday

QHarr

25.6k81839

add a comment |

up vote
1
down vote

Here is a way using dataframes and pandas

import pandas as pd

tables = pd.read_html("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")

base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"

results = [base + row[1][2] for row in tables[0].iterrows() if row[1][2].endswith(('.htm', '.txt')) and str(row[1][3]).startswith('EX')]

print(results)

edited yesterday

answered yesterday

QHarr

25.6k81839

Here is a way using dataframes and pandas

import pandas as pd

tables = pd.read_html("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")

base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"

results = [base + row[1][2] for row in tables[0].iterrows() if row[1][2].endswith(('.htm', '.txt')) and str(row[1][3]).startswith('EX')]

print(results)

edited yesterday

answered yesterday

QHarr

25.6k81839

edited yesterday

answered yesterday

QHarr

25.6k81839

answered yesterday

QHarr

25.6k81839

answered yesterday

QHarr

25.6k81839

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Nsryjdtyk