Select columns which contain a string in PySpark












1














I have a PySpark DataFrame with a lot of columns, and I want to select the ones whose names contain a certain string, plus a few others. For example:



df.columns = ['hello_world','hello_country','hello_everyone','byebye','ciao','index']


I want to select the ones which contain 'hello' and also the column named 'index', so the result will be:



['hello_world','hello_country','hello_everyone','index']


I want something like df.select('hello*','index')



Thanks in advance :)



EDIT:



I found a quick way to solve it, so I answered my own question, Q&A style. If someone sees my solution and can provide a better one, I would appreciate it.










      python pyspark pyspark-sql






      edited Nov 21 '18 at 11:07







      Manrique

















      asked Nov 21 '18 at 9:23









      Manrique
























          3 Answers
            2






            I've found a quick and elegant way:



            selected = [s for s in df.columns if 'hello' in s]+['index']
            df.select(selected)


            With this solution I can add any other columns I want without editing the for loop that Ali AzG suggested.
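A minimal sketch of the same idea as a reusable helper, in case the list of extra columns grows. The name `pick_columns` and its parameters are hypothetical, not part of PySpark; the function only works on the list of column names, so it can be tried without a Spark session. It also skips an extra name that was already picked, since selecting the same column twice would give a duplicated column in the result.

```python
def pick_columns(columns, substring, extras=()):
    """Return names containing `substring`, followed by any `extras` not already picked."""
    picked = [name for name in columns if substring in name]
    # Append explicitly requested columns, skipping ones already selected
    picked += [name for name in extras if name not in picked]
    return picked


# Column names taken from the question
columns = ['hello_world', 'hello_country', 'hello_everyone', 'byebye', 'ciao', 'index']
selected = pick_columns(columns, 'hello', extras=['index'])
print(selected)
```

With a real DataFrame this would be used as `df.select(pick_columns(df.columns, 'hello', ['index']))`. `select` accepts either a single list of names or the names unpacked with `*`, so both spellings should behave the same.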







            answered Nov 21 '18 at 9:49









            Manrique












            • Great solution. Don't you need * before selected?
              – Ali AzG
              Nov 21 '18 at 9:52












            • Thanks! I don't :)
              – Manrique
              Nov 21 '18 at 9:58


















                2






                You can also try the colRegex function, introduced in Spark 2.3, which lets you specify the column name as a regular expression.
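A rough sketch of how that could look for the columns in the question. `DataFrame.colRegex` takes the regular expression wrapped in backticks; the helper names `matching_columns` and `select_by_regex` and the constant `PATTERN` are hypothetical. The sketch assumes Spark matches the pattern against the whole column name, which is worth checking on your Spark version, so the first helper previews the match in plain Python without a Spark session.

```python
import re

# Column names taken from the question
COLUMNS = ['hello_world', 'hello_country', 'hello_everyone', 'byebye', 'ciao', 'index']

# Any name containing 'hello', or exactly 'index'
PATTERN = r".*hello.*|index"


def matching_columns(columns, pattern):
    """Pure-Python preview of which names the pattern fully matches."""
    return [name for name in columns if re.fullmatch(pattern, name)]


def select_by_regex(df, pattern):
    """Spark version: colRegex expects the regex wrapped in backticks."""
    return df.select(df.colRegex("`" + pattern + "`"))


print(matching_columns(COLUMNS, PATTERN))
```

With a real DataFrame, `select_by_regex(df, PATTERN)` is the call to try. Python and Java regex syntax differ in places, so the preview is only a sanity check for simple patterns like this one.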



                Hope it helps.



                Regards,



                Neeraj







                answered Nov 21 '18 at 13:59









                neeraj bhadani























                    1






                    This sample code does what you want:



                    hello_cols = []

                    for col in df.columns:
                        # keep any column whose name contains 'index' or 'hello'
                        if 'index' in col or 'hello' in col:
                            hello_cols.append(col)

                    df.select(*hello_cols)
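One thing to watch with the loop above: `'index' in col` is a substring test, so it would also keep a column such as `reindexed`. A small sketch of a stricter variant follows, assuming 'index' should match by exact name only. The helper `select_names` and the extra column `reindexed` are hypothetical and only there to show the difference.

```python
def select_names(columns, contains=(), exact=()):
    """Keep a name if it contains any `contains` fragment or equals any `exact` name."""
    return [
        name for name in columns
        if any(fragment in name for fragment in contains) or name in exact
    ]


# 'reindexed' is an invented column that a substring test on 'index' would also catch
columns = ['hello_world', 'hello_country', 'byebye', 'index', 'reindexed']

loose = [c for c in columns if 'index' in c or 'hello' in c]
strict = select_names(columns, contains=['hello'], exact=['index'])
print(loose)
print(strict)
```

With a real DataFrame the result can be passed on as `df.select(*select_names(df.columns, ['hello'], ['index']))`.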






                    edited Nov 21 '18 at 9:44









                    Manrique











                    answered Nov 21 '18 at 9:39









                    Ali AzG












                    • Thanks, I fixed an error in your code and it worked.
                      – Manrique
                      Nov 21 '18 at 9:45










                    • @Antonio Manrique You're right. I wrote my code in Python 2.7. Please accept my answer if it was helpful.
                      – Ali AzG
                      Nov 21 '18 at 9:46










                    • I will give it an upvote! But I've found a better option for what I am doing; I'll post it as an answer and accept it. Thank you so much!
                      – Manrique
                      Nov 21 '18 at 9:47







































