Can't create an xpath capable of meeting certain condition























I've written a script that extracts the links ending with the .htm extension from the table with class tableFile on a webpage. The script does its job. However, what I want now is to get only those .htm links that have EX- in their Type field. I'm looking for a pure XPath solution (that is, without using .getparent() or anything similar).



Link to that site



The script I've tried so far:



import requests
from lxml.html import fromstring

res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
root = fromstring(res.text)

for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
    if ".htm" in item:
        print(item)


When I try to get only the links that meet the above condition with the approach below, I get an error:



for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
    if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
        print(item)


Error I get:



if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'xpath'


This is what the files look like: [screenshot not preserved]










      python python-3.x xpath web-scraping
















      asked yesterday by robots.txt
























          3 Answers














































              accepted






              If you need a pure XPath solution, you can use the one below:



              import requests
              from lxml.html import fromstring

              res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
              root = fromstring(res.text)
              for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):
                  print(item)


              /Archives/edgar/data/1085596/000146970918000185/ex31_1apg.htm
              /Archives/edgar/data/1085596/000146970918000185/ex31_2apg.htm
              /Archives/edgar/data/1085596/000146970918000185/ex32_1apg.htm
              /Archives/edgar/data/1085596/000146970918000185/ex32_2apg.htm















              answered yesterday by Andersson












              • Apology for the delayed response @sir Andersson. Thanks for your effective solution.
                – robots.txt
                yesterday










              • Is there any way to do the same using .cssselect() @sir Andersson? I hope you will take a look in your spare time.
                – robots.txt
                yesterday






              • @robots.txt, you can try [link.attrib['href'] for link in root.cssselect('table[summary*="Document"] td>a:contains("ex")[href*="htm"]')], but CSS selectors are not as flexible IMHO, so it's not the same as the provided XPath
                – Andersson
                yesterday
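
The row-predicate XPath from the accepted answer can be tried offline. Below is a minimal sketch that runs it against a small hand-written table instead of the live page. The sample HTML, its file names, and the variable names are assumptions made for illustration; only the XPath expression is taken from the answer.

```python
from lxml.html import fromstring

# Hypothetical stand-in for the filing index table (not the real page).
SAMPLE = """<html><body>
<table summary="Document Format Files">
  <tr><th>Seq</th><th>Description</th><th>Document</th><th>Type</th></tr>
  <tr><td>1</td><td>Annual report</td>
      <td scope="row"><a href="/docs/main10k.htm">main10k.htm</a></td><td>10-K</td></tr>
  <tr><td>2</td><td>Certification</td>
      <td scope="row"><a href="/docs/ex31_1.htm">ex31_1.htm</a></td><td>EX-31.1</td></tr>
  <tr><td>3</td><td>Graphic</td>
      <td scope="row"><a href="/docs/logo.jpg">logo.jpg</a></td><td>EX-99</td></tr>
</table>
</body></html>"""

root = fromstring(SAMPLE)

# Keep only rows that contain a cell starting with "EX-",
# then take the .htm links found in those rows.
XPATH = ('//table[contains(@summary,"Document")]'
         '//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href')
links = root.xpath(XPATH)
print(links)
```

Only the second data row passes both conditions, so the sketch prints `['/docs/ex31_1.htm']`. Because the expression ends in `/@href`, each result is a string rather than an element, which is why calling `.xpath()` on it fails in the question's second attempt.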














































                  It looks like you want:



                  //td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]


                  There are many different ways to do this with XPath. CSS is probably much simpler.

















                  answered yesterday by pguardiario
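
The sibling-based expression above selects `<a>` elements rather than attribute strings. The following sketch, assuming lxml and a made-up three-row table, shows one way to use it from Python. The sample HTML and variable names are hypothetical; the XPath is the one given in this answer.

```python
from lxml.html import fromstring

# Hypothetical rows: a link cell followed by a type cell.
SAMPLE = """<html><body><table>
  <tr><td><a href="/docs/main10k.htm">main10k.htm</a></td><td>10-K</td></tr>
  <tr><td><a href="/docs/ex32_1.htm">ex32_1.htm</a></td><td>EX-32.1</td></tr>
  <tr><td><a href="/docs/ex99.pdf">ex99.pdf</a></td><td>EX-99</td></tr>
</table></body></html>"""

root = fromstring(SAMPLE)

EXPR = '//td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]'

# The expression yields elements, so read the attribute from each one.
anchors = root.xpath(EXPR)
hrefs = [a.get("href") for a in anchors]

# Appending /@href makes the same query return the strings directly.
hrefs_direct = root.xpath(EXPR + "/@href")

print(hrefs)
print(hrefs_direct)
```

Both lists contain only `/docs/ex32_1.htm`: the first row has no sibling cell starting with "EX", and the third row's link is not an .htm file.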




































                          Here is a way using pandas DataFrames:



                          import pandas as pd

                          tables = pd.read_html("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
                          base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"
                          # Positional access: column 2 is the document name, column 3 is its type.
                          results = [base + str(row.iloc[2]) for _, row in tables[0].iterrows()
                                     if str(row.iloc[2]).endswith(('.htm', '.txt')) and str(row.iloc[3]).startswith('EX')]
                          print(results)














                          answered yesterday by QHarr (edited yesterday)
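
The pandas answer builds absolute URLs by concatenating a base string with a file name. As an alternative sketch, the standard library's `urllib.parse.urljoin` resolves both bare file names and the root-relative hrefs that the XPath answers print. The index URL and the first href come from this page; the bare file name input and the variable names are assumptions for illustration.

```python
from urllib.parse import urljoin

# Index page URL from the question; relative hrefs are resolved against it.
INDEX_URL = ("https://www.sec.gov/Archives/edgar/data/1085596/"
             "000146970918000185/0001469709-18-000185-index.htm")

hrefs = [
    # Root-relative href, as printed by the accepted XPath answer.
    "/Archives/edgar/data/1085596/000146970918000185/ex31_1apg.htm",
    # Bare file name, as a table cell's text would give (hypothetical input).
    "ex32_1apg.htm",
]

absolute = [urljoin(INDEX_URL, h) for h in hrefs]
for url in absolute:
    print(url)
```

Both inputs resolve to URLs under `https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/`, so the same call works whichever extraction method produced the href.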






























                               
