SQL: select data before first occurence of a certain value












1















For example, I have order data come from customers, like this



test = spark.createDataFrame([
(0, 1, 1, "2018-06-03"),
(1, 1, 1, "2018-06-04"),
(2, 1, 3, "2018-06-04"),
(3, 1, 2, "2018-06-05"),
(4, 1, 1, "2018-06-06"),
(5, 2, 3, "2018-06-01"),
(6, 2, 1, "2018-06-01"),
(7, 3, 1, "2018-06-02"),
(8, 3, 1, "2018-06-02"),
(9, 3, 1, "2018-06-05")
])
.toDF("order_id", "customer_id", "order_status", "created_at")
test.show()


enter image description here



Each order has its own status, 1 means newly created but not finished, 3 means it's payed and finished.



Now, I want to do analysis for order comes from




  • new customers (who has not made purchase before)

  • old customers (who has finished purchase before)


so I want to add a feature to the above the data, turn into like this



enter image description here



The logic is for every customer, every order created before first order with status 3 (include itself) is counted as come from new customer, and every order after that is counted as old customer.



Or put it into another way, select the data before the first occurance of value 3 (for each customer's order, sort by date asc)



How can I do this, in SQL?



I searched around but didn't find good solution. If in Python, I think maybe I'll simply do some loop to get the values.










share|improve this question























  • You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).

    – Alan
    Nov 25 '18 at 9:08
















1















For example, I have order data come from customers, like this



test = spark.createDataFrame([
(0, 1, 1, "2018-06-03"),
(1, 1, 1, "2018-06-04"),
(2, 1, 3, "2018-06-04"),
(3, 1, 2, "2018-06-05"),
(4, 1, 1, "2018-06-06"),
(5, 2, 3, "2018-06-01"),
(6, 2, 1, "2018-06-01"),
(7, 3, 1, "2018-06-02"),
(8, 3, 1, "2018-06-02"),
(9, 3, 1, "2018-06-05")
])
.toDF("order_id", "customer_id", "order_status", "created_at")
test.show()


enter image description here



Each order has its own status, 1 means newly created but not finished, 3 means it's payed and finished.



Now, I want to do analysis for order comes from




  • new customers (who has not made purchase before)

  • old customers (who has finished purchase before)


so I want to add a feature to the above the data, turn into like this



enter image description here



The logic is for every customer, every order created before first order with status 3 (include itself) is counted as come from new customer, and every order after that is counted as old customer.



Or put it into another way, select the data before the first occurance of value 3 (for each customer's order, sort by date asc)



How can I do this, in SQL?



I searched around but didn't find good solution. If in Python, I think maybe I'll simply do some loop to get the values.










share|improve this question























  • You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).

    – Alan
    Nov 25 '18 at 9:08














1












1








1








For example, I have order data come from customers, like this



test = spark.createDataFrame([
(0, 1, 1, "2018-06-03"),
(1, 1, 1, "2018-06-04"),
(2, 1, 3, "2018-06-04"),
(3, 1, 2, "2018-06-05"),
(4, 1, 1, "2018-06-06"),
(5, 2, 3, "2018-06-01"),
(6, 2, 1, "2018-06-01"),
(7, 3, 1, "2018-06-02"),
(8, 3, 1, "2018-06-02"),
(9, 3, 1, "2018-06-05")
])
.toDF("order_id", "customer_id", "order_status", "created_at")
test.show()


enter image description here



Each order has its own status, 1 means newly created but not finished, 3 means it's payed and finished.



Now, I want to do analysis for order comes from




  • new customers (who has not made purchase before)

  • old customers (who has finished purchase before)


so I want to add a feature to the above the data, turn into like this



enter image description here



The logic is for every customer, every order created before first order with status 3 (include itself) is counted as come from new customer, and every order after that is counted as old customer.



Or put it into another way, select the data before the first occurance of value 3 (for each customer's order, sort by date asc)



How can I do this, in SQL?



I searched around but didn't find good solution. If in Python, I think maybe I'll simply do some loop to get the values.










share|improve this question














For example, I have order data come from customers, like this



test = spark.createDataFrame([
(0, 1, 1, "2018-06-03"),
(1, 1, 1, "2018-06-04"),
(2, 1, 3, "2018-06-04"),
(3, 1, 2, "2018-06-05"),
(4, 1, 1, "2018-06-06"),
(5, 2, 3, "2018-06-01"),
(6, 2, 1, "2018-06-01"),
(7, 3, 1, "2018-06-02"),
(8, 3, 1, "2018-06-02"),
(9, 3, 1, "2018-06-05")
])
.toDF("order_id", "customer_id", "order_status", "created_at")
test.show()


enter image description here



Each order has its own status, 1 means newly created but not finished, 3 means it's payed and finished.



Now, I want to do analysis for order comes from




  • new customers (who has not made purchase before)

  • old customers (who has finished purchase before)


so I want to add a feature to the above the data, turn into like this



enter image description here



The logic is for every customer, every order created before first order with status 3 (include itself) is counted as come from new customer, and every order after that is counted as old customer.



Or put it into another way, select the data before the first occurance of value 3 (for each customer's order, sort by date asc)



How can I do this, in SQL?



I searched around but didn't find good solution. If in Python, I think maybe I'll simply do some loop to get the values.







sql pyspark






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 25 '18 at 8:13









cqcn1991cqcn1991

4,7061961122




4,7061961122













  • You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).

    – Alan
    Nov 25 '18 at 9:08



















  • You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).

    – Alan
    Nov 25 '18 at 9:08

















You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).

– Alan
Nov 25 '18 at 9:08





You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).

– Alan
Nov 25 '18 at 9:08












2 Answers
2






active

oldest

votes


















0














This is tested for SQLite:



SELECT order_id, customer_id, order_status, created_at, 
CASE
WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'
ELSE 'new'
END AS customer_status
FROM orders o





share|improve this answer
























  • Why it uses order_id?

    – cqcn1991
    Nov 26 '18 at 1:18











  • Take for example the 2 rows for customer_id=2, they both have created_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based on created_at. If I replace in my code order_id with created_at then both these rows will give new. I think that it's safer to use the order_id unless it is not related to the date of the placement of the order.

    – forpas
    Nov 26 '18 at 12:08



















0














You can do this using window functions in Spark:



select t.*,
(case when created_at > min(case when status = 3 then created_at end) over (partition by customer_id)
then 'old'
else 'new'
end) as customer_status
from test t;


Note that this assigns "new" to customers with no order with status "3".



You can also write this using join and group by:



select t.*,
coalesce(t3.customer_status, 'old') as customer_status
from test t left join
(select t.customer_id, min(created_at) as min_created_at,
'new' as customer_status
from t
where status = 3
group by t.customer_id
) t3
on t.customer_id = t3.customer_id and
t.created_at <= t3.min_created_at;





share|improve this answer























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53465745%2fsql-select-data-before-first-occurence-of-a-certain-value%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    0














    This is tested for SQLite:



    SELECT order_id, customer_id, order_status, created_at, 
    CASE
    WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'
    ELSE 'new'
    END AS customer_status
    FROM orders o





    share|improve this answer
























    • Why it uses order_id?

      – cqcn1991
      Nov 26 '18 at 1:18











    • Take for example the 2 rows for customer_id=2, they both have created_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based on created_at. If I replace in my code order_id with created_at then both these rows will give new. I think that it's safer to use the order_id unless it is not related to the date of the placement of the order.

      – forpas
      Nov 26 '18 at 12:08
















    0














    This is tested for SQLite:



    SELECT order_id, customer_id, order_status, created_at, 
    CASE
    WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'
    ELSE 'new'
    END AS customer_status
    FROM orders o





    share|improve this answer
























    • Why it uses order_id?

      – cqcn1991
      Nov 26 '18 at 1:18











    • Take for example the 2 rows for customer_id=2, they both have created_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based on created_at. If I replace in my code order_id with created_at then both these rows will give new. I think that it's safer to use the order_id unless it is not related to the date of the placement of the order.

      – forpas
      Nov 26 '18 at 12:08














    0












    0








    0







    This is tested for SQLite:



    SELECT order_id, customer_id, order_status, created_at, 
    CASE
    WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'
    ELSE 'new'
    END AS customer_status
    FROM orders o





    share|improve this answer













    This is tested for SQLite:



    SELECT order_id, customer_id, order_status, created_at, 
    CASE
    WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'
    ELSE 'new'
    END AS customer_status
    FROM orders o






    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered Nov 25 '18 at 9:09









    forpasforpas

    16.2k3627




    16.2k3627













    • Why it uses order_id?

      – cqcn1991
      Nov 26 '18 at 1:18











    • Take for example the 2 rows for customer_id=2, they both have created_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based on created_at. If I replace in my code order_id with created_at then both these rows will give new. I think that it's safer to use the order_id unless it is not related to the date of the placement of the order.

      – forpas
      Nov 26 '18 at 12:08



















    • Why it uses order_id?

      – cqcn1991
      Nov 26 '18 at 1:18











    • Take for example the 2 rows for customer_id=2, they both have created_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based on created_at. If I replace in my code order_id with created_at then both these rows will give new. I think that it's safer to use the order_id unless it is not related to the date of the placement of the order.

      – forpas
      Nov 26 '18 at 12:08

















    Why it uses order_id?

    – cqcn1991
    Nov 26 '18 at 1:18





    Why it uses order_id?

    – cqcn1991
    Nov 26 '18 at 1:18













    Take for example the 2 rows for customer_id=2, they both have created_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based on created_at. If I replace in my code order_id with created_at then both these rows will give new. I think that it's safer to use the order_id unless it is not related to the date of the placement of the order.

    – forpas
    Nov 26 '18 at 12:08





    Take for example the 2 rows for customer_id=2, they both have created_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based on created_at. If I replace in my code order_id with created_at then both these rows will give new. I think that it's safer to use the order_id unless it is not related to the date of the placement of the order.

    – forpas
    Nov 26 '18 at 12:08













    0














    You can do this using window functions in Spark:



    select t.*,
    (case when created_at > min(case when status = 3 then created_at end) over (partition by customer_id)
    then 'old'
    else 'new'
    end) as customer_status
    from test t;


    Note that this assigns "new" to customers with no order with status "3".



    You can also write this using join and group by:



    select t.*,
    coalesce(t3.customer_status, 'old') as customer_status
    from test t left join
    (select t.customer_id, min(created_at) as min_created_at,
    'new' as customer_status
    from t
    where status = 3
    group by t.customer_id
    ) t3
    on t.customer_id = t3.customer_id and
    t.created_at <= t3.min_created_at;





    share|improve this answer




























      0














      You can do this using window functions in Spark:



      select t.*,
      (case when created_at > min(case when status = 3 then created_at end) over (partition by customer_id)
      then 'old'
      else 'new'
      end) as customer_status
      from test t;


      Note that this assigns "new" to customers with no order with status "3".



      You can also write this using join and group by:



      select t.*,
      coalesce(t3.customer_status, 'old') as customer_status
      from test t left join
      (select t.customer_id, min(created_at) as min_created_at,
      'new' as customer_status
      from t
      where status = 3
      group by t.customer_id
      ) t3
      on t.customer_id = t3.customer_id and
      t.created_at <= t3.min_created_at;





      share|improve this answer


























        0












        0








        0







        You can do this using window functions in Spark:



        select t.*,
        (case when created_at > min(case when status = 3 then created_at end) over (partition by customer_id)
        then 'old'
        else 'new'
        end) as customer_status
        from test t;


        Note that this assigns "new" to customers with no order with status "3".



        You can also write this using join and group by:



        select t.*,
        coalesce(t3.customer_status, 'old') as customer_status
        from test t left join
        (select t.customer_id, min(created_at) as min_created_at,
        'new' as customer_status
        from t
        where status = 3
        group by t.customer_id
        ) t3
        on t.customer_id = t3.customer_id and
        t.created_at <= t3.min_created_at;





        share|improve this answer













        You can do this using window functions in Spark:



        select t.*,
        (case when created_at > min(case when status = 3 then created_at end) over (partition by customer_id)
        then 'old'
        else 'new'
        end) as customer_status
        from test t;


        Note that this assigns "new" to customers with no order with status "3".



        You can also write this using join and group by:



        select t.*,
        coalesce(t3.customer_status, 'old') as customer_status
        from test t left join
        (select t.customer_id, min(created_at) as min_created_at,
        'new' as customer_status
        from t
        where status = 3
        group by t.customer_id
        ) t3
        on t.customer_id = t3.customer_id and
        t.created_at <= t3.min_created_at;






        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Nov 25 '18 at 12:33









        Gordon LinoffGordon Linoff

        784k35310415




        784k35310415






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53465745%2fsql-select-data-before-first-occurence-of-a-certain-value%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Ottavio Pratesi

            Tricia Helfer

            15 giugno