python requests enable cookies/javascript












I'm trying to download an Excel file from a specific website. On my local computer it works perfectly:



>>> r = requests.get('http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls')
>>> r.status_code
200
>>> r.content
b'\xd0\xcf\x11\xe0\xa1\xb1...\x00\x00' # Long binary string


But when I run the same request from a remote Ubuntu server, I get a page telling me to enable cookies/JavaScript.



>>> r = requests.get('http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls')
>>> r.status_code
200
>>> r.content
b'<HTML>\n<head>\n<script>\nChallenge=141020;\nChallengeId=120854618;\nGenericErrorMessageCookies="Cookies must be enabled in order to view this page.";\n</script>\n<script>\nfunction test(var1)\n{\n\tvar var_str=""+Challenge;\n\tvar var_arr=var_str.split("");\n\tvar LastDig=var_arr.reverse()[0];\n\tvar minDig=var_arr.sort()[0];\n\tvar subvar1 = (2 * (var_arr[2]))+(var_arr[1]*1);\n\tvar subvar2 = (2 * var_arr[2])+var_arr[1];\n\tvar my_pow=Math.pow(((var_arr[0]*1)+2),var_arr[1]);\n\tvar x=(var1*3+subvar1)*1;\n\tvar y=Math.cos(Math.PI*subvar2);\n\tvar answer=x*y;\n\tanswer-=my_pow*1;\n\tanswer+=(minDig*1)-(LastDig*1);\n\tanswer=answer+subvar2;\n\treturn answer;\n}\n</script>\n<script>\nclient = null;\nif (window.XMLHttpRequest)\n{\n\tvar client=new XMLHttpRequest();\n}\nelse\n{\n\tif (window.ActiveXObject)\n\t{\n\t\tclient = new ActiveXObject(\'MSXML2.XMLHTTP.3.0\');\n\t};\n}\nif (!((!!client)&&(!!Math.pow)&&(!!Math.cos)&&(!![].sort)&&(!![].reverse)))\n{\n\tdocument.write("Not all needed JavaScript methods are supported.<BR>");\n\n}\nelse\n{\n\tclient.onreadystatechange = function()\n\t{\n\t\tif(client.readyState == 4)\n\t\t{\n\t\t\tvar MyCookie=client.getResponseHeader("X-AA-Cookie-Value");\n\t\t\tif ((MyCookie == null) || (MyCookie==""))\n\t\t\t{\n\t\t\t\tdocument.write(client.responseText);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\t\n\t\t\tvar cookieName = MyCookie.split(\'=\')[0];\n\t\t\tif (document.cookie.indexOf(cookieName)==-1)\n\t\t\t{\n\t\t\t\tdocument.write(GenericErrorMessageCookies);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\twindow.location.reload(true);\n\t\t}\n\t};\n\ty=test(Challenge);\n\tclient.open("POST",window.location,true);\n\tclient.setRequestHeader(\'X-AA-Challenge-ID\', ChallengeId);\n\tclient.setRequestHeader(\'X-AA-Challenge-Result\',y);\n\tclient.setRequestHeader(\'X-AA-Challenge\',Challenge);\n\tclient.setRequestHeader(\'Content-Type\' , \'text/plain\');\n\tclient.send();\n}\n</script>\n</head>\n<body>\n<noscript>JavaScript must be enabled in order to view this page.</noscript>\n</body>\n</HTML>'


Locally I run on macOS, which has Chrome installed (I'm not actively using it for the script, but maybe it's related?). The remote machine runs Ubuntu on DigitalOcean without any GUI browser installed.










      python cookies browser python-requests






      asked Nov 22 '18 at 15:56









      DeanLa

          1 Answer
          The behavior of requests has nothing to do with which browsers are installed on the system; it does not depend on or interact with them in any way.



          The problem here is that the resource you are requesting has some kind of "bot mitigation" mechanism enabled to prevent just this kind of access. Instead of the file, it returns a page of JavaScript containing a small arithmetic challenge. The browser is expected to evaluate it and send the result back in the X-AA-Challenge headers of an additional request to "prove" you're not a bot, after which the server sets a cookie and serves the real content.



          Luckily, it appears that this specific mitigation mechanism has been solved before, and I was able to get this request working quickly by reusing the challenge-solving functions from that earlier solution:



          from math import cos, pi, floor

          import requests

          URL = 'http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls'


          def parse_challenge(page):
              """
              Parse a challenge given by mmi and mavat's web servers, forcing us to solve
              some math stuff and send the result as a header to actually get the page.
              This logic is pretty much copied from https://github.com/R3dy/jigsaw-rails/blob/master/lib/breakbot.rb
              """
              # The first <script> block holds "Challenge=...;" and "ChallengeId=...;"
              # on its second and third lines.
              top = page.split('<script>')[1].split('\n')
              challenge = top[1].split(';')[0].split('=')[1]
              challenge_id = top[2].split(';')[0].split('=')[1]
              return {'challenge': challenge, 'challenge_id': challenge_id, 'challenge_result': get_challenge_answer(challenge)}


          def get_challenge_answer(challenge):
              """
              Solve the math part of the challenge and get the result
              """
              arr = list(challenge)
              last_digit = int(arr[-1])
              arr.sort()  # the page's JavaScript also works on the sorted digits
              min_digit = int(arr[0])
              subvar1 = (2 * int(arr[2])) + int(arr[1])
              subvar2 = str(2 * int(arr[2])) + arr[1]  # string concatenation, as in the JavaScript
              power = ((int(arr[0]) * 1) + 2) ** int(arr[1])
              x = (int(challenge) * 3 + subvar1)
              y = cos(pi * subvar1)
              answer = x * y
              answer -= power
              answer += (min_digit - last_digit)
              answer = str(int(floor(answer))) + subvar2
              return answer


          def main():
              s = requests.Session()
              r = s.get(URL)

              # Only solve the challenge if the server actually sent one.
              if 'X-AA-Challenge' in r.text:
                  challenge = parse_challenge(r.text)
                  r = s.get(URL, headers={
                      'X-AA-Challenge': challenge['challenge'],
                      'X-AA-Challenge-ID': challenge['challenge_id'],
                      'X-AA-Challenge-Result': challenge['challenge_result']
                  })

                  # Repeat the request with the cookie the server just issued.
                  yum = r.cookies
                  r = s.get(URL, cookies=yum)

              print(r.content)


          if __name__ == '__main__':
              main()
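
One fragile spot in the code above is that parse_challenge relies on the exact line layout of the challenge page. If that ever becomes a problem, the two values could probably be pulled out with regular expressions instead. This is only a minimal sketch: the helper name extract_challenge and the abbreviated sample page are my own assumptions for illustration, built from the values shown in the question, and I have not run it against the live site.

```python
import re

# Hypothetical helper (not part of the answer's code): locate the two
# assignments anywhere in the page instead of relying on line positions.
CHALLENGE_RE = re.compile(r"\bChallenge=(\d+);")
CHALLENGE_ID_RE = re.compile(r"\bChallengeId=(\d+);")


def extract_challenge(page):
    """Return (challenge, challenge_id) as strings, or None if either is missing."""
    challenge = CHALLENGE_RE.search(page)
    challenge_id = CHALLENGE_ID_RE.search(page)
    if challenge is None or challenge_id is None:
        return None
    return challenge.group(1), challenge_id.group(1)


# Abbreviated sample page using the values from the question.
SAMPLE_PAGE = (
    "<HTML>\n<head>\n<script>\n"
    "Challenge=141020;\n"
    "ChallengeId=120854618;\n"
    "</script>\n</head>\n</HTML>"
)

print(extract_challenge(SAMPLE_PAGE))
print(extract_challenge("<html>no challenge here</html>"))
```

The values are kept as strings because get_challenge_answer works on the individual digit characters, and returning None gives the caller a simple way to tell that no challenge was served.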





          • I'm curious about this question. Can you explain why the request succeeds on the local machine but fails on the remote one with the same code?

            – kcorlidy
            Nov 23 '18 at 5:59











          • @kcorlidy That is a great question, but I'm really not sure. I suspect that the mitigation mechanism doesn't kick in on every request, only on ones it deems "suspicious". It may be based on factors like the requester's IP address, whether that IP address has successfully accessed other resources recently, etc. You see that kind of behavior a lot with things like CAPTCHAs.

            – cody
            Nov 23 '18 at 11:25











          • Thanks @cody, this seems like an elegant solution, but for some reason it doesn't work. When I debug, after the GET with the headers, I check r.cookies and get an empty cookie jar: <RequestsCookieJar[]>.

            – DeanLa
            Nov 23 '18 at 14:21











          • @DeanLa Hmm.. did you try running the code I provided verbatim? I get the full binary content of the XLS file as output. If I have it print the value of yum, the cookie, I get something like <RequestsCookieJar[<Cookie BotMitigationCookie_1....for www.health.gov.il/>]>

            – cody
            Nov 23 '18 at 15:10











          • My bad, I hadn't considered non-existent files. They no longer return a 404 as they did before; instead, the response comes back with an empty cookie jar.

            – DeanLa
            Nov 23 '18 at 17:30
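
Following up on that last comment: since a missing file apparently no longer shows up as a 404, one way to tell the cases apart might be to look at the response body itself. Legacy .xls files are OLE2 compound files and start with a fixed 8-byte signature, the first bytes of which are visible in the question's local output. The sketch below is an assumption-laden illustration: classify_body is a made-up helper name, the 'challenge' check simply reuses the X-AA-Challenge marker from the answer, and what the server really returns for a missing file is not known here.

```python
# OLE2 compound-file signature that legacy .xls files start with.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def classify_body(body):
    """Classify raw response bytes as 'xls', 'challenge', or 'other'."""
    if body.startswith(OLE2_MAGIC):
        return "xls"
    # Same marker the answer's main() looks for in r.text.
    if b"X-AA-Challenge" in body:
        return "challenge"
    return "other"


print(classify_body(OLE2_MAGIC + b"\x00" * 8))
print(classify_body(b"<script>client.setRequestHeader('X-AA-Challenge',Challenge);</script>"))
print(classify_body(b""))
```

With requests, the input would be r.content. Anything classified as 'other' would then need a closer look, for example because the file does not exist.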











          The behavior of requests has nothing to do with what browsers are installed on the system, it does not depend on or interact with them in any way.



          The problem here is that the resource you are requesting has some kind of "bot mitigation" mechanism enabled to prevent just this kind of access. It returns some javascript with logic that needs to be evaluated, and the results of that logic are then used for an additional request to "prove" you're not a bot.



          Luckily, it appears that this specific mitigation mechanism has been solved before, and I was able to quickly get this request working utilizing the challenge-solving functions from that code:



          from math import cos, pi, floor

          import requests

          URL = 'http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls'


          def parse_challenge(page):
              """
              Parse a challenge given by mmi and mavat's web servers, forcing us to solve
              some math stuff and send the result as a header to actually get the page.
              This logic is pretty much copied from https://github.com/R3dy/jigsaw-rails/blob/master/lib/breakbot.rb
              """
              top = page.split('<script>')[1].split('\n')
              challenge = top[1].split(';')[0].split('=')[1]
              challenge_id = top[2].split(';')[0].split('=')[1]
              return {'challenge': challenge, 'challenge_id': challenge_id, 'challenge_result': get_challenge_answer(challenge)}


          def get_challenge_answer(challenge):
              """
              Solve the math part of the challenge and get the result
              """
              arr = list(challenge)
              last_digit = int(arr[-1])
              arr.sort()
              min_digit = int(arr[0])
              subvar1 = (2 * int(arr[2])) + int(arr[1])
              subvar2 = str(2 * int(arr[2])) + arr[1]
              power = ((int(arr[0]) * 1) + 2) ** int(arr[1])
              x = (int(challenge) * 3 + subvar1)
              y = cos(pi * subvar1)
              answer = x * y
              answer -= power
              answer += (min_digit - last_digit)
              answer = str(int(floor(answer))) + subvar2
              return answer


          def main():
              s = requests.Session()
              r = s.get(URL)

              if 'X-AA-Challenge' in r.text:
                  challenge = parse_challenge(r.text)
                  r = s.get(URL, headers={
                      'X-AA-Challenge': challenge['challenge'],
                      'X-AA-Challenge-ID': challenge['challenge_id'],
                      'X-AA-Challenge-Result': challenge['challenge_result']
                  })

                  yum = r.cookies
                  r = s.get(URL, cookies=yum)

              print(r.content)


          if __name__ == '__main__':
              main()
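
          As a sanity check, the challenge from the page in the question (Challenge=141020, ChallengeId=120854618) can be worked through offline. The sketch below is not part of the original solution: `extract_challenge` and `solve` are hypothetical names, SAMPLE is a shortened copy of the page from the question, and the parsing uses a regular expression instead of fixed line positions. It also assumes that the cosine term is meant to be exactly +1 or -1, so it replaces cos(pi * n) with the sign given by the parity of n and stays in integer arithmetic.

```python
import re

# Shortened copy of the challenge page shown in the question.
SAMPLE = (
    "<HTML>\n<head>\n<script>\n"
    "Challenge=141020;\n"
    "ChallengeId=120854618;\n"
    'GenericErrorMessageCookies="Cookies must be enabled in order to view this page.";\n'
    "</script>\n</head>\n</HTML>"
)


def extract_challenge(page):
    """Return (challenge, challenge_id) as strings, or None if either is missing."""
    challenge = re.search(r"\bChallenge=(\d+);", page)
    challenge_id = re.search(r"\bChallengeId=(\d+);", page)
    if not (challenge and challenge_id):
        return None
    return challenge.group(1), challenge_id.group(1)


def solve(challenge):
    """Same arithmetic as the page's test() function, in integers only."""
    digits = list(challenge)
    last_digit = int(digits[-1])      # last digit of the unsorted challenge
    digits.sort()                     # everything below uses the sorted digits
    min_digit = int(digits[0])
    subvar1 = 2 * int(digits[2]) + int(digits[1])
    subvar2 = str(2 * int(digits[2])) + digits[1]   # string concatenation
    power = (int(digits[0]) + 2) ** int(digits[1])
    # Assumption: cos(pi * n) for an integer n is +1 if n is even, -1 if odd.
    sign = -1 if subvar1 % 2 else 1
    answer = (int(challenge) * 3 + subvar1) * sign - power + (min_digit - last_digit)
    return str(answer) + subvar2


if __name__ == "__main__":
    challenge, challenge_id = extract_challenge(SAMPLE)
    print(challenge, challenge_id, solve(challenge))
```

          For the sample page this gives the header values X-AA-Challenge: 141020, X-AA-Challenge-ID: 120854618 and X-AA-Challenge-Result: 42306120. Whether the server accepts the integer-only result in every case is untested.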














          edited Nov 23 '18 at 11:27

























          answered Nov 22 '18 at 16:38









          cody













          • I'm curious about this question: can you explain why the request succeeds on his local machine but fails on the remote one with the same code?

            – kcorlidy
            Nov 23 '18 at 5:59











          • @kcorlidy That is a great question, but I'm really not sure. I suspect that the mitigation mechanism might not kick in on every request, but only on ones it deems "suspicious". It may be based on factors like the requester's IP address, whether that IP address has successfully accessed other resources recently, etc. You see that kind of behavior a lot with things like CAPTCHAs.

            – cody
            Nov 23 '18 at 11:25











          • Thanks @cody, this seems like an elegant solution, but for some reason it doesn't work. When I debug, r.cookies after the GET with the headers is an empty cookie jar, <RequestsCookieJar>.

            – DeanLa
            Nov 23 '18 at 14:21











          • @DeanLa Hmm.. did you try running the code I provided verbatim? I get the full binary content of the XLS file as output. If I have it print the value of yum, the cookie, I get something like <RequestsCookieJar[<Cookie BotMitigationCookie_1....for www.health.gov.il/>]>

            – cody
            Nov 23 '18 at 15:10











          • My bad, I wasn't accounting for nonexistent files. They no longer return a 404 as they did before; they return an empty cookie jar.

            – DeanLa
            Nov 23 '18 at 17:30
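
          Since the challenge may not appear on every request, and missing files apparently come back without a 404, it may be worth checking the response body itself before saving it. A minimal sketch, assuming the target is a legacy .xls file (an OLE2 compound file, whose first eight bytes are D0 CF 11 E0 A1 B1 1A E1, consistent with the start of the content shown in the question); `classify_body` is a hypothetical helper and not part of requests:

```python
# Signature at the start of every OLE2 compound file (legacy .xls, .doc, ...).
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def classify_body(content):
    """Classify a response body (bytes) as 'xls', 'challenge', or 'unknown'."""
    if content.startswith(OLE2_MAGIC):
        return "xls"
    # The challenge page's script sets headers whose names start with X-AA-Challenge.
    if b"X-AA-Challenge" in content:
        return "challenge"
    return "unknown"
```

          With a requests response r, this would be called as classify_body(r.content); anything other than 'xls' suggests the file was not actually delivered.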




































