SQL: select data before first occurence of a certain value

For example, I have order data come from customers, like this

test = spark.createDataFrame([

    (0, 1, 1, "2018-06-03"),

    (1, 1, 1, "2018-06-04"),

    (2, 1, 3, "2018-06-04"),

    (3, 1, 2, "2018-06-05"),

    (4, 1, 1, "2018-06-06"),

    (5, 2, 3, "2018-06-01"),

    (6, 2, 1, "2018-06-01"),

    (7, 3, 1, "2018-06-02"),

    (8, 3, 1, "2018-06-02"),

    (9, 3, 1, "2018-06-05")

])

  .toDF("order_id", "customer_id", "order_status", "created_at")

test.show()

enter image description here

Each order has its own status, 1 means newly created but not finished, 3 means it's payed and finished.

Now, I want to do analysis for order comes from

new customers (who has not made purchase before)

old customers (who has finished purchase before)

so I want to add a feature to the above the data, turn into like this

enter image description here

The logic is for every customer, every order created before first order with status 3 (include itself) is counted as come from new customer, and every order after that is counted as old customer.

Or put it into another way, select the data before the first occurance of value 3 (for each customer's order, sort by date asc)

How can I do this, in SQL?

I searched around but didn't find good solution. If in Python, I think maybe I'll simply do some loop to get the values.

asked Nov 25 '18 at 8:13

cqcn1991

4,7061961122

You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).

– Alan
Nov 25 '18 at 9:08

add a comment |

For example, I have order data come from customers, like this

test = spark.createDataFrame([

    (0, 1, 1, "2018-06-03"),

    (1, 1, 1, "2018-06-04"),

    (2, 1, 3, "2018-06-04"),

    (3, 1, 2, "2018-06-05"),

    (4, 1, 1, "2018-06-06"),

    (5, 2, 3, "2018-06-01"),

    (6, 2, 1, "2018-06-01"),

    (7, 3, 1, "2018-06-02"),

    (8, 3, 1, "2018-06-02"),

    (9, 3, 1, "2018-06-05")

])

  .toDF("order_id", "customer_id", "order_status", "created_at")

test.show()

enter image description here

Each order has its own status, 1 means newly created but not finished, 3 means it's payed and finished.

Now, I want to do analysis for order comes from

new customers (who has not made purchase before)

old customers (who has finished purchase before)

so I want to add a feature to the above the data, turn into like this

enter image description here

Or put it into another way, select the data before the first occurance of value 3 (for each customer's order, sort by date asc)

How can I do this, in SQL?

I searched around but didn't find good solution. If in Python, I think maybe I'll simply do some loop to get the values.

asked Nov 25 '18 at 8:13

cqcn1991

4,7061961122

You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).

– Alan
Nov 25 '18 at 9:08

add a comment |

For example, I have order data come from customers, like this

test = spark.createDataFrame([

    (0, 1, 1, "2018-06-03"),

    (1, 1, 1, "2018-06-04"),

    (2, 1, 3, "2018-06-04"),

    (3, 1, 2, "2018-06-05"),

    (4, 1, 1, "2018-06-06"),

    (5, 2, 3, "2018-06-01"),

    (6, 2, 1, "2018-06-01"),

    (7, 3, 1, "2018-06-02"),

    (8, 3, 1, "2018-06-02"),

    (9, 3, 1, "2018-06-05")

])

  .toDF("order_id", "customer_id", "order_status", "created_at")

test.show()

enter image description here

Each order has its own status, 1 means newly created but not finished, 3 means it's payed and finished.

Now, I want to do analysis for order comes from

new customers (who has not made purchase before)

old customers (who has finished purchase before)

so I want to add a feature to the above the data, turn into like this

enter image description here

Or put it into another way, select the data before the first occurance of value 3 (for each customer's order, sort by date asc)

How can I do this, in SQL?

I searched around but didn't find good solution. If in Python, I think maybe I'll simply do some loop to get the values.

asked Nov 25 '18 at 8:13

cqcn1991

4,7061961122

For example, I have order data come from customers, like this

test = spark.createDataFrame([

    (0, 1, 1, "2018-06-03"),

    (1, 1, 1, "2018-06-04"),

    (2, 1, 3, "2018-06-04"),

    (3, 1, 2, "2018-06-05"),

    (4, 1, 1, "2018-06-06"),

    (5, 2, 3, "2018-06-01"),

    (6, 2, 1, "2018-06-01"),

    (7, 3, 1, "2018-06-02"),

    (8, 3, 1, "2018-06-02"),

    (9, 3, 1, "2018-06-05")

])

  .toDF("order_id", "customer_id", "order_status", "created_at")

test.show()

enter image description here

Each order has its own status, 1 means newly created but not finished, 3 means it's payed and finished.

Now, I want to do analysis for order comes from

new customers (who has not made purchase before)

old customers (who has finished purchase before)

so I want to add a feature to the above the data, turn into like this

enter image description here

Or put it into another way, select the data before the first occurance of value 3 (for each customer's order, sort by date asc)

How can I do this, in SQL?

I searched around but didn't find good solution. If in Python, I think maybe I'll simply do some loop to get the values.

sql pyspark

asked Nov 25 '18 at 8:13

cqcn1991

4,7061961122

asked Nov 25 '18 at 8:13

cqcn1991

4,7061961122

asked Nov 25 '18 at 8:13

cqcn1991

4,7061961122

asked Nov 25 '18 at 8:13

cqcn1991

4,7061961122

asked Nov 25 '18 at 8:13

cqcn1991

4,7061961122

You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).

– Alan
Nov 25 '18 at 9:08

add a comment |

You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).

– Alan
Nov 25 '18 at 9:08

You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).

– Alan
Nov 25 '18 at 9:08

add a comment |

2 Answers
2

active

oldest

votes

This is tested for SQLite:

SELECT order_id, customer_id, order_status, created_at, 

CASE

     WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'

     ELSE 'new'  

END AS customer_status

FROM orders o

answered Nov 25 '18 at 9:09

forpas

16.2k3627

Why it uses order_id?

– cqcn1991
Nov 26 '18 at 1:18

Take for example the 2 rows for customer_id=2, they both have created_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based on created_at. If I replace in my code order_id with created_at then both these rows will give new. I think that it's safer to use the order_id unless it is not related to the date of the placement of the order.

– forpas
Nov 26 '18 at 12:08

add a comment |

You can do this using window functions in Spark:

select t.*,

       (case when created_at > min(case when status = 3 then created_at end) over (partition by customer_id)

             then 'old'

             else 'new'

        end) as customer_status

from test t;

Note that this assigns "new" to customers with no order with status "3".

You can also write this using join and group by:

select t.*,

       coalesce(t3.customer_status, 'old') as customer_status

from test t left join

     (select t.customer_id, min(created_at) as min_created_at,

             'new' as customer_status

      from t

      where status = 3

      group by t.customer_id

     ) t3

     on t.customer_id = t3.customer_id and

        t.created_at <= t3.min_created_at;

answered Nov 25 '18 at 12:33

Gordon Linoff

784k35310415

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53465745%2fsql-select-data-before-first-occurence-of-a-certain-value%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

This is tested for SQLite:

SELECT order_id, customer_id, order_status, created_at, 

CASE

     WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'

     ELSE 'new'  

END AS customer_status

FROM orders o

answered Nov 25 '18 at 9:09

forpas

16.2k3627

Why it uses order_id?

– cqcn1991
Nov 26 '18 at 1:18

Take for example the 2 rows for customer_id=2, they both have created_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based on created_at. If I replace in my code order_id with created_at then both these rows will give new. I think that it's safer to use the order_id unless it is not related to the date of the placement of the order.

– forpas
Nov 26 '18 at 12:08

add a comment |

This is tested for SQLite:

SELECT order_id, customer_id, order_status, created_at, 

CASE

     WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'

     ELSE 'new'  

END AS customer_status

FROM orders o

answered Nov 25 '18 at 9:09

forpas

16.2k3627

Why it uses order_id?

– cqcn1991
Nov 26 '18 at 1:18

Take for example the 2 rows for customer_id=2, they both have created_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based on created_at. If I replace in my code order_id with created_at then both these rows will give new. I think that it's safer to use the order_id unless it is not related to the date of the placement of the order.

– forpas
Nov 26 '18 at 12:08

add a comment |

This is tested for SQLite:

SELECT order_id, customer_id, order_status, created_at, 

CASE

     WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'

     ELSE 'new'  

END AS customer_status

FROM orders o

answered Nov 25 '18 at 9:09

forpas

16.2k3627

This is tested for SQLite:

SELECT order_id, customer_id, order_status, created_at, 

CASE

     WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'

     ELSE 'new'  

END AS customer_status

FROM orders o

answered Nov 25 '18 at 9:09

forpas

16.2k3627

answered Nov 25 '18 at 9:09

forpas

16.2k3627

answered Nov 25 '18 at 9:09

forpas

16.2k3627

answered Nov 25 '18 at 9:09

forpas

16.2k3627

Why it uses order_id?

– cqcn1991
Nov 26 '18 at 1:18

Take for example the 2 rows for customer_id=2, they both have created_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based on created_at. If I replace in my code order_id with created_at then both these rows will give new. I think that it's safer to use the order_id unless it is not related to the date of the placement of the order.

– forpas
Nov 26 '18 at 12:08

add a comment |

Why it uses order_id?

– cqcn1991
Nov 26 '18 at 1:18

Take for example the 2 rows for customer_id=2, they both have created_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based on created_at. If I replace in my code order_id with created_at then both these rows will give new. I think that it's safer to use the order_id unless it is not related to the date of the placement of the order.

– forpas
Nov 26 '18 at 12:08

Why it uses order_id?

– cqcn1991
Nov 26 '18 at 1:18

Take for example the 2 rows for customer_id=2, they both have created_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based on created_at. If I replace in my code order_id with created_at then both these rows will give new. I think that it's safer to use the order_id unless it is not related to the date of the placement of the order.

– forpas
Nov 26 '18 at 12:08

add a comment |

You can do this using window functions in Spark:

select t.*,

       (case when created_at > min(case when status = 3 then created_at end) over (partition by customer_id)

             then 'old'

             else 'new'

        end) as customer_status

from test t;

Note that this assigns "new" to customers with no order with status "3".

You can also write this using join and group by:

select t.*,

       coalesce(t3.customer_status, 'old') as customer_status

from test t left join

     (select t.customer_id, min(created_at) as min_created_at,

             'new' as customer_status

      from t

      where status = 3

      group by t.customer_id

     ) t3

     on t.customer_id = t3.customer_id and

        t.created_at <= t3.min_created_at;

answered Nov 25 '18 at 12:33

Gordon Linoff

784k35310415

add a comment |

You can do this using window functions in Spark:

select t.*,

       (case when created_at > min(case when status = 3 then created_at end) over (partition by customer_id)

             then 'old'

             else 'new'

        end) as customer_status

from test t;

Note that this assigns "new" to customers with no order with status "3".

You can also write this using join and group by:

select t.*,

       coalesce(t3.customer_status, 'old') as customer_status

from test t left join

     (select t.customer_id, min(created_at) as min_created_at,

             'new' as customer_status

      from t

      where status = 3

      group by t.customer_id

     ) t3

     on t.customer_id = t3.customer_id and

        t.created_at <= t3.min_created_at;

answered Nov 25 '18 at 12:33

Gordon Linoff

784k35310415

add a comment |

You can do this using window functions in Spark:

select t.*,

       (case when created_at > min(case when status = 3 then created_at end) over (partition by customer_id)

             then 'old'

             else 'new'

        end) as customer_status

from test t;

Note that this assigns "new" to customers with no order with status "3".

You can also write this using join and group by:

select t.*,

       coalesce(t3.customer_status, 'old') as customer_status

from test t left join

     (select t.customer_id, min(created_at) as min_created_at,

             'new' as customer_status

      from t

      where status = 3

      group by t.customer_id

     ) t3

     on t.customer_id = t3.customer_id and

        t.created_at <= t3.min_created_at;

answered Nov 25 '18 at 12:33

Gordon Linoff

784k35310415

You can do this using window functions in Spark:

select t.*,

       (case when created_at > min(case when status = 3 then created_at end) over (partition by customer_id)

             then 'old'

             else 'new'

        end) as customer_status

from test t;

Note that this assigns "new" to customers with no order with status "3".

You can also write this using join and group by:

select t.*,

       coalesce(t3.customer_status, 'old') as customer_status

from test t left join

     (select t.customer_id, min(created_at) as min_created_at,

             'new' as customer_status

      from t

      where status = 3

      group by t.customer_id

     ) t3

     on t.customer_id = t3.customer_id and

        t.created_at <= t3.min_created_at;

answered Nov 25 '18 at 12:33

Gordon Linoff

784k35310415

answered Nov 25 '18 at 12:33

Gordon Linoff

784k35310415

answered Nov 25 '18 at 12:33

Gordon Linoff

784k35310415

answered Nov 25 '18 at 12:33

Gordon Linoff

784k35310415

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Nsryjdtyk