Problem with attachments' character encoding using gmail gem in ruby/rails











What I am doing:
I am using the gmail gem in a Rails 4 app to fetch email attachments from a specific account at regular intervals. Here is an extract of the core part (for simplicity, it only handles the first email and its first attachment):



require 'gmail'

Gmail.connect(@user_email, @user_password) do |gmail|
  if gmail.logged_in?
    emails = gmail.inbox.emails(:from => @sender_email)
    email = emails[0]
    attachment = email.message.attachments[0]
    File.open("~/temp.csv", 'w') do |file|
      file.write(
        StringIO.new(attachment.decoded.to_s[2..-2].force_encoding("ISO-8859-15").encode!('UTF-8')).read
      )
    end
  end
end


The encoding of the attached file can vary. The particular one I am currently having issues with is in Finnish: it contains Finnish characters and a superscript 3 character.



This is what I expect to get when I run the above code (it is also what I get when I download the attachment manually through the Gmail user interface):
[Image: expected file contents]



What the problem is:



However, I am getting the following odd results.



From cat temp.csv (looks good to me):
[Image: output of cat temp.csv]

With nano temp.csv (here I have no idea what I am looking at):
[Image: temp.csv opened in nano]

Opened in Sublime Text (directly via WinSCP). The first line and small parts look OK, but the rest shows up as Chinese/Japanese characters:
[Image: temp.csv opened in Sublime Text]

Opened in Notepad (after download via WinSCP). It looks OK, except that a blank space has been inserted between each character and the newlines seem to be missing:
[Image: temp.csv opened in Notepad]
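
Since each viewer above renders the same file differently, it may help to look at the raw bytes instead. The following is only a sketch: leading_bytes_hex is a hypothetical helper name, and the sample file contents are made up for illustration.

```ruby
require 'tempfile'

# Hypothetical helper: returns the first `count` bytes of a file as
# space-separated hex pairs, without applying any character encoding.
def leading_bytes_hex(path, count = 16)
  # binread returns nil for an empty file when a length is given, hence to_s
  File.binread(path, count).to_s.bytes.map { |byte| format('%02x', byte) }.join(' ')
end

# Made-up sample content, written in binary mode so nothing is transcoded
Tempfile.create('sample') do |file|
  file.binmode
  file.write("a;b\n")
  file.flush
  puts leading_bytes_hex(file.path) # prints "61 3b 62 0a"
end
```

Bytes that viewers hide or misrender, such as a byte order mark or NUL bytes, would be visible in a dump like this.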



What I have tried:



I have tried the following, without success:

  • .force_encoding(...) with all the different "ISO-8859-x" character sets

  • moving force_encoding("ISO-8859-15").encode!('UTF-8') outside the .read (this runs, but doesn't solve the problem)

  • encoding to UTF-8 without first forcing another encoding, which raises Encoding::UndefinedConversionError: "\xC4" from ASCII-8BIT to UTF-8

  • writing as binary with 'wb' and 'w+b' in File.open() (which, oddly, makes no difference to the outcome)

  • searching Stack Overflow and the web for other ideas


Any ideas would be much appreciated!
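
The Encoding::UndefinedConversionError mentioned in the list of attempts can be reproduced without the gmail gem. This is a minimal sketch, assuming the attachment bytes arrive tagged as ASCII-8BIT (binary); the sample bytes and the method names are made up for illustration.

```ruby
# Hypothetical sample: "Äiti" in ISO-8859-15, where "Ä" is the single byte
# 0xC4. The .b call tags the string as ASCII-8BIT (binary).
def binary_sample
  "\xC4iti".b
end

# encode on its own fails: ASCII-8BIT defines no mapping for bytes above 0x7F.
def naive_encode(bytes)
  bytes.encode('UTF-8')
rescue Encoding::UndefinedConversionError => e
  e.class
end

# force_encoding only relabels the bytes (no bytes change);
# encode then transcodes them from that encoding to UTF-8.
def relabel_then_encode(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode('UTF-8')
end

puts naive_encode(binary_sample)                       # prints "Encoding::UndefinedConversionError"
puts relabel_then_encode(binary_sample, 'ISO-8859-15') # prints "Äiti"
```

The second call only gives a sensible result when the label passed to force_encoding matches the encoding the bytes were actually written in.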










      ruby-on-rails ruby character-encoding gmail






      edited Nov 18 at 10:41

























      asked Nov 17 at 19:27









      Morten Grum

          2 Answers
Not beautiful, but it works for me for now.

After re-encoding, I convert the string to a character array, remove the characters I do not want, and join the remaining elements back into a string.

decoded_att = attachment.decoded
# Treat the raw bytes as ISO-8859-1, transcode to UTF-8, normalize line endings
data = decoded_att.encode("UTF-8", "ISO-8859-1", invalid: :replace, undef: :replace).gsub("\r\n", "\n")

# Drop the NUL characters and the two stray leading characters
data_as_array = data.chars
data_as_array = data_as_array.delete_if { |i| i == "\u0000" || i == "ÿ" || i == "þ" }
data = data_as_array.join('').to_s

# File.write does not expand "~", so expand the path explicitly
File.write(File.expand_path("~/temp.csv"), data.to_s)

This works for me for now. However, I have no idea how these characters ended up in the attachment ("ÿ" and "þ" at the start of the document and "\u0000" between all remaining characters).







                answered Nov 18 at 18:36









                Morten Grum
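
One possible explanation for those characters, offered as an assumption since nothing in this thread confirms it: the bytes 0xFF 0xFE are the UTF-16LE byte order mark, and when read as ISO-8859-1 they display as "ÿþ". UTF-16LE also stores each ASCII character as its byte followed by 0x00, which would account for the "\u0000" between characters. If that is the case, the attachment could be decoded as UTF-16 instead of being cleaned up afterwards. The sketch below uses a hypothetical helper, decode_attachment, which is not part of the gmail or mail gems, and a made-up sample.

```ruby
UTF16LE_BOM = "\xFF\xFE".b
UTF16BE_BOM = "\xFE\xFF".b

# Hypothetical helper: picks the source encoding from a leading byte order
# mark and falls back to a caller-supplied single-byte encoding otherwise.
def decode_attachment(raw, fallback: 'ISO-8859-15')
  bytes = raw.b # binary copy, leaves the caller's string untouched
  text =
    if bytes.start_with?(UTF16LE_BOM)
      bytes.byteslice(2..-1).force_encoding('UTF-16LE')
    elsif bytes.start_with?(UTF16BE_BOM)
      bytes.byteslice(2..-1).force_encoding('UTF-16BE')
    else
      bytes.force_encoding(fallback)
    end
  # Transcode to UTF-8 and normalize Windows line endings
  text.encode('UTF-8', invalid: :replace, undef: :replace).gsub("\r\n", "\n")
end

# Made-up sample: a UTF-16LE byte order mark followed by UTF-16LE text
SAMPLE = UTF16LE_BOM + "Ä;m³\r\n".encode('UTF-16LE').b
puts decode_attachment(SAMPLE) # prints "Ä;m³"
```

With this approach the byte order mark and the NUL bytes are consumed by the decoder, so no characters need to be deleted by hand. Input without a byte order mark still depends on the fallback guess being right.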

It seems you need to call attachment.body.decoded instead of attachment.decoded.







                    answered Nov 18 at 21:11









                    Dorian

• Thanks. Actually, it seems that attachment.body.decoded and attachment.decoded return the exact same string. I checked both the strings and their byte arrays.
  – Morten Grum
  Nov 19 at 6:28

















