SQL: select data before first occurence of a certain value
For example, I have order data come from customers, like this
test = spark.createDataFrame([
(0, 1, 1, "2018-06-03"),
(1, 1, 1, "2018-06-04"),
(2, 1, 3, "2018-06-04"),
(3, 1, 2, "2018-06-05"),
(4, 1, 1, "2018-06-06"),
(5, 2, 3, "2018-06-01"),
(6, 2, 1, "2018-06-01"),
(7, 3, 1, "2018-06-02"),
(8, 3, 1, "2018-06-02"),
(9, 3, 1, "2018-06-05")
])
.toDF("order_id", "customer_id", "order_status", "created_at")
test.show()

Each order has its own status, 1 means newly created but not finished, 3 means it's payed and finished.
Now, I want to do analysis for order comes from
- new customers (who has not made purchase before)
- old customers (who has finished purchase before)
so I want to add a feature to the above the data, turn into like this

The logic is for every customer, every order created before first order with status 3 (include itself) is counted as come from new customer, and every order after that is counted as old customer.
Or put it into another way, select the data before the first occurance of value 3 (for each customer's order, sort by date asc)
How can I do this, in SQL?
I searched around but didn't find good solution. If in Python, I think maybe I'll simply do some loop to get the values.
sql pyspark
add a comment |
For example, I have order data come from customers, like this
test = spark.createDataFrame([
(0, 1, 1, "2018-06-03"),
(1, 1, 1, "2018-06-04"),
(2, 1, 3, "2018-06-04"),
(3, 1, 2, "2018-06-05"),
(4, 1, 1, "2018-06-06"),
(5, 2, 3, "2018-06-01"),
(6, 2, 1, "2018-06-01"),
(7, 3, 1, "2018-06-02"),
(8, 3, 1, "2018-06-02"),
(9, 3, 1, "2018-06-05")
])
.toDF("order_id", "customer_id", "order_status", "created_at")
test.show()

Each order has its own status, 1 means newly created but not finished, 3 means it's payed and finished.
Now, I want to do analysis for order comes from
- new customers (who has not made purchase before)
- old customers (who has finished purchase before)
so I want to add a feature to the above the data, turn into like this

The logic is for every customer, every order created before first order with status 3 (include itself) is counted as come from new customer, and every order after that is counted as old customer.
Or put it into another way, select the data before the first occurance of value 3 (for each customer's order, sort by date asc)
How can I do this, in SQL?
I searched around but didn't find good solution. If in Python, I think maybe I'll simply do some loop to get the values.
sql pyspark
You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).
– Alan
Nov 25 '18 at 9:08
add a comment |
For example, I have order data come from customers, like this
test = spark.createDataFrame([
(0, 1, 1, "2018-06-03"),
(1, 1, 1, "2018-06-04"),
(2, 1, 3, "2018-06-04"),
(3, 1, 2, "2018-06-05"),
(4, 1, 1, "2018-06-06"),
(5, 2, 3, "2018-06-01"),
(6, 2, 1, "2018-06-01"),
(7, 3, 1, "2018-06-02"),
(8, 3, 1, "2018-06-02"),
(9, 3, 1, "2018-06-05")
])
.toDF("order_id", "customer_id", "order_status", "created_at")
test.show()

Each order has its own status, 1 means newly created but not finished, 3 means it's payed and finished.
Now, I want to do analysis for order comes from
- new customers (who has not made purchase before)
- old customers (who has finished purchase before)
so I want to add a feature to the above the data, turn into like this

The logic is for every customer, every order created before first order with status 3 (include itself) is counted as come from new customer, and every order after that is counted as old customer.
Or put it into another way, select the data before the first occurance of value 3 (for each customer's order, sort by date asc)
How can I do this, in SQL?
I searched around but didn't find good solution. If in Python, I think maybe I'll simply do some loop to get the values.
sql pyspark
For example, I have order data come from customers, like this
test = spark.createDataFrame([
(0, 1, 1, "2018-06-03"),
(1, 1, 1, "2018-06-04"),
(2, 1, 3, "2018-06-04"),
(3, 1, 2, "2018-06-05"),
(4, 1, 1, "2018-06-06"),
(5, 2, 3, "2018-06-01"),
(6, 2, 1, "2018-06-01"),
(7, 3, 1, "2018-06-02"),
(8, 3, 1, "2018-06-02"),
(9, 3, 1, "2018-06-05")
])
.toDF("order_id", "customer_id", "order_status", "created_at")
test.show()

Each order has its own status, 1 means newly created but not finished, 3 means it's payed and finished.
Now, I want to do analysis for order comes from
- new customers (who has not made purchase before)
- old customers (who has finished purchase before)
so I want to add a feature to the above the data, turn into like this

The logic is for every customer, every order created before first order with status 3 (include itself) is counted as come from new customer, and every order after that is counted as old customer.
Or put it into another way, select the data before the first occurance of value 3 (for each customer's order, sort by date asc)
How can I do this, in SQL?
I searched around but didn't find good solution. If in Python, I think maybe I'll simply do some loop to get the values.
sql pyspark
sql pyspark
asked Nov 25 '18 at 8:13
cqcn1991cqcn1991
4,7061961122
4,7061961122
You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).
– Alan
Nov 25 '18 at 9:08
add a comment |
You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).
– Alan
Nov 25 '18 at 9:08
You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).
– Alan
Nov 25 '18 at 9:08
You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).
– Alan
Nov 25 '18 at 9:08
add a comment |
2 Answers
2
active
oldest
votes
This is tested for SQLite:
SELECT order_id, customer_id, order_status, created_at,
CASE
WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'
ELSE 'new'
END AS customer_status
FROM orders o
Why it usesorder_id?
– cqcn1991
Nov 26 '18 at 1:18
Take for example the 2 rows forcustomer_id=2, they both havecreated_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based oncreated_at. If I replace in my codeorder_idwithcreated_atthen both these rows will givenew. I think that it's safer to use theorder_idunless it is not related to the date of the placement of the order.
– forpas
Nov 26 '18 at 12:08
add a comment |
You can do this using window functions in Spark:
select t.*,
(case when created_at > min(case when status = 3 then created_at end) over (partition by customer_id)
then 'old'
else 'new'
end) as customer_status
from test t;
Note that this assigns "new" to customers with no order with status "3".
You can also write this using join and group by:
select t.*,
coalesce(t3.customer_status, 'old') as customer_status
from test t left join
(select t.customer_id, min(created_at) as min_created_at,
'new' as customer_status
from t
where status = 3
group by t.customer_id
) t3
on t.customer_id = t3.customer_id and
t.created_at <= t3.min_created_at;
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53465745%2fsql-select-data-before-first-occurence-of-a-certain-value%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
This is tested for SQLite:
SELECT order_id, customer_id, order_status, created_at,
CASE
WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'
ELSE 'new'
END AS customer_status
FROM orders o
Why it usesorder_id?
– cqcn1991
Nov 26 '18 at 1:18
Take for example the 2 rows forcustomer_id=2, they both havecreated_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based oncreated_at. If I replace in my codeorder_idwithcreated_atthen both these rows will givenew. I think that it's safer to use theorder_idunless it is not related to the date of the placement of the order.
– forpas
Nov 26 '18 at 12:08
add a comment |
This is tested for SQLite:
SELECT order_id, customer_id, order_status, created_at,
CASE
WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'
ELSE 'new'
END AS customer_status
FROM orders o
Why it usesorder_id?
– cqcn1991
Nov 26 '18 at 1:18
Take for example the 2 rows forcustomer_id=2, they both havecreated_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based oncreated_at. If I replace in my codeorder_idwithcreated_atthen both these rows will givenew. I think that it's safer to use theorder_idunless it is not related to the date of the placement of the order.
– forpas
Nov 26 '18 at 12:08
add a comment |
This is tested for SQLite:
SELECT order_id, customer_id, order_status, created_at,
CASE
WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'
ELSE 'new'
END AS customer_status
FROM orders o
This is tested for SQLite:
SELECT order_id, customer_id, order_status, created_at,
CASE
WHEN order_id > (SELECT MIN(order_id) FROM orders WHERE customer_id = o.customer_id AND order_status = 3) THEN 'old'
ELSE 'new'
END AS customer_status
FROM orders o
answered Nov 25 '18 at 9:09
forpasforpas
16.2k3627
16.2k3627
Why it usesorder_id?
– cqcn1991
Nov 26 '18 at 1:18
Take for example the 2 rows forcustomer_id=2, they both havecreated_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based oncreated_at. If I replace in my codeorder_idwithcreated_atthen both these rows will givenew. I think that it's safer to use theorder_idunless it is not related to the date of the placement of the order.
– forpas
Nov 26 '18 at 12:08
add a comment |
Why it usesorder_id?
– cqcn1991
Nov 26 '18 at 1:18
Take for example the 2 rows forcustomer_id=2, they both havecreated_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based oncreated_at. If I replace in my codeorder_idwithcreated_atthen both these rows will givenew. I think that it's safer to use theorder_idunless it is not related to the date of the placement of the order.
– forpas
Nov 26 '18 at 12:08
Why it uses
order_id?– cqcn1991
Nov 26 '18 at 1:18
Why it uses
order_id?– cqcn1991
Nov 26 '18 at 1:18
Take for example the 2 rows for
customer_id=2, they both have created_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based on created_at. If I replace in my code order_id with created_at then both these rows will give new. I think that it's safer to use the order_id unless it is not related to the date of the placement of the order.– forpas
Nov 26 '18 at 12:08
Take for example the 2 rows for
customer_id=2, they both have created_at=2018-06-01. So there is no before or after for these 2 rows if we compare them based on created_at. If I replace in my code order_id with created_at then both these rows will give new. I think that it's safer to use the order_id unless it is not related to the date of the placement of the order.– forpas
Nov 26 '18 at 12:08
add a comment |
You can do this using window functions in Spark:
select t.*,
(case when created_at > min(case when status = 3 then created_at end) over (partition by customer_id)
then 'old'
else 'new'
end) as customer_status
from test t;
Note that this assigns "new" to customers with no order with status "3".
You can also write this using join and group by:
select t.*,
coalesce(t3.customer_status, 'old') as customer_status
from test t left join
(select t.customer_id, min(created_at) as min_created_at,
'new' as customer_status
from t
where status = 3
group by t.customer_id
) t3
on t.customer_id = t3.customer_id and
t.created_at <= t3.min_created_at;
add a comment |
You can do this using window functions in Spark:
select t.*,
(case when created_at > min(case when status = 3 then created_at end) over (partition by customer_id)
then 'old'
else 'new'
end) as customer_status
from test t;
Note that this assigns "new" to customers with no order with status "3".
You can also write this using join and group by:
select t.*,
coalesce(t3.customer_status, 'old') as customer_status
from test t left join
(select t.customer_id, min(created_at) as min_created_at,
'new' as customer_status
from t
where status = 3
group by t.customer_id
) t3
on t.customer_id = t3.customer_id and
t.created_at <= t3.min_created_at;
add a comment |
You can do this using window functions in Spark:
select t.*,
(case when created_at > min(case when status = 3 then created_at end) over (partition by customer_id)
then 'old'
else 'new'
end) as customer_status
from test t;
Note that this assigns "new" to customers with no order with status "3".
You can also write this using join and group by:
select t.*,
coalesce(t3.customer_status, 'old') as customer_status
from test t left join
(select t.customer_id, min(created_at) as min_created_at,
'new' as customer_status
from t
where status = 3
group by t.customer_id
) t3
on t.customer_id = t3.customer_id and
t.created_at <= t3.min_created_at;
You can do this using window functions in Spark:
select t.*,
(case when created_at > min(case when status = 3 then created_at end) over (partition by customer_id)
then 'old'
else 'new'
end) as customer_status
from test t;
Note that this assigns "new" to customers with no order with status "3".
You can also write this using join and group by:
select t.*,
coalesce(t3.customer_status, 'old') as customer_status
from test t left join
(select t.customer_id, min(created_at) as min_created_at,
'new' as customer_status
from t
where status = 3
group by t.customer_id
) t3
on t.customer_id = t3.customer_id and
t.created_at <= t3.min_created_at;
answered Nov 25 '18 at 12:33
Gordon LinoffGordon Linoff
784k35310415
784k35310415
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53465745%2fsql-select-data-before-first-occurence-of-a-certain-value%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
You could try using where exists or not exists to join to the same table on matching customer id and a theta join (order_id < order_id).
– Alan
Nov 25 '18 at 9:08