How to combine a frequency count of multiple columns with a transposition of categorical variables linked to...
up vote
0
down vote
favorite
Let's assume I have a very large dataset storing thousands of households profiles which sizes are up to 11 members.
The order of the data is examplified in the table below where I have gender of each member of the household, his/her profession (let's say 20 types of predefined categories) and his/her revenue for each income source.
Gender1 <- c("M","F")
Gender2 <- c("F", "M")
Work1 <- c("A", "B")
Work2 <- c("B","A")
RevenueA <- c(10,20)
RevenueB <- c(20,10)
df <- data.frame(Gender1, Gender2, Work1, Work2, RevenueA, RevenueB)
Now, my R code challange is to get a frequency count of how many males and females work in each codified sector (work 1, work 2 up to 20 categories) and the average revenue value declared by each gender across all the predefined categories. I wish to keep the types of sectors as labels in the output table. The exemplification of the output is shown in the table below:
What is the most efficient way to get the proposed output without entering the label for each work category in the code? I would also like to repeat the same logic while considering location as primary aggregation, like in the following table:
On a last note, the dataframe has multiple NAs values as well.
Thank you for your support!
r frequency transpose summary frequency-distribution
add a comment |
up vote
0
down vote
favorite
Let's assume I have a very large dataset storing thousands of households profiles which sizes are up to 11 members.
The order of the data is examplified in the table below where I have gender of each member of the household, his/her profession (let's say 20 types of predefined categories) and his/her revenue for each income source.
Gender1 <- c("M","F")
Gender2 <- c("F", "M")
Work1 <- c("A", "B")
Work2 <- c("B","A")
RevenueA <- c(10,20)
RevenueB <- c(20,10)
df <- data.frame(Gender1, Gender2, Work1, Work2, RevenueA, RevenueB)
Now, my R code challange is to get a frequency count of how many males and females work in each codified sector (work 1, work 2 up to 20 categories) and the average revenue value declared by each gender across all the predefined categories. I wish to keep the types of sectors as labels in the output table. The exemplification of the output is shown in the table below:
What is the most efficient way to get the proposed output without entering the label for each work category in the code? I would also like to repeat the same logic while considering location as primary aggregation, like in the following table:
On a last note, the dataframe has multiple NAs values as well.
Thank you for your support!
r frequency transpose summary frequency-distribution
can u dput() sample data
– sai saran
Nov 17 at 14:26
add a comment |
up vote
0
down vote
favorite
up vote
0
down vote
favorite
Let's assume I have a very large dataset storing thousands of households profiles which sizes are up to 11 members.
The order of the data is examplified in the table below where I have gender of each member of the household, his/her profession (let's say 20 types of predefined categories) and his/her revenue for each income source.
Gender1 <- c("M","F")
Gender2 <- c("F", "M")
Work1 <- c("A", "B")
Work2 <- c("B","A")
RevenueA <- c(10,20)
RevenueB <- c(20,10)
df <- data.frame(Gender1, Gender2, Work1, Work2, RevenueA, RevenueB)
Now, my R code challange is to get a frequency count of how many males and females work in each codified sector (work 1, work 2 up to 20 categories) and the average revenue value declared by each gender across all the predefined categories. I wish to keep the types of sectors as labels in the output table. The exemplification of the output is shown in the table below:
What is the most efficient way to get the proposed output without entering the label for each work category in the code? I would also like to repeat the same logic while considering location as primary aggregation, like in the following table:
On a last note, the dataframe has multiple NAs values as well.
Thank you for your support!
r frequency transpose summary frequency-distribution
Let's assume I have a very large dataset storing thousands of households profiles which sizes are up to 11 members.
The order of the data is examplified in the table below where I have gender of each member of the household, his/her profession (let's say 20 types of predefined categories) and his/her revenue for each income source.
Gender1 <- c("M","F")
Gender2 <- c("F", "M")
Work1 <- c("A", "B")
Work2 <- c("B","A")
RevenueA <- c(10,20)
RevenueB <- c(20,10)
df <- data.frame(Gender1, Gender2, Work1, Work2, RevenueA, RevenueB)
Now, my R code challange is to get a frequency count of how many males and females work in each codified sector (work 1, work 2 up to 20 categories) and the average revenue value declared by each gender across all the predefined categories. I wish to keep the types of sectors as labels in the output table. The exemplification of the output is shown in the table below:
What is the most efficient way to get the proposed output without entering the label for each work category in the code? I would also like to repeat the same logic while considering location as primary aggregation, like in the following table:
On a last note, the dataframe has multiple NAs values as well.
Thank you for your support!
r frequency transpose summary frequency-distribution
r frequency transpose summary frequency-distribution
edited Nov 17 at 15:02
asked Nov 17 at 14:18
Nicola
11518
11518
can u dput() sample data
– sai saran
Nov 17 at 14:26
add a comment |
can u dput() sample data
– sai saran
Nov 17 at 14:26
can u dput() sample data
– sai saran
Nov 17 at 14:26
can u dput() sample data
– sai saran
Nov 17 at 14:26
add a comment |
1 Answer
1
active
oldest
votes
up vote
0
down vote
Something like that would work on your example (I've added a location to the dataframe):
library(tidyverse)
Gender1 <- c("M","F")
Gender2 <- c("F", "M")
Work1 <- c("A", "B")
Work2 <- c("B","A")
RevenueA <- c(10,20)
RevenueB <- c(20,10)
Location <- c("ABC", "ABC")
df <- data.frame(Location, Gender1, Gender2, Work1, Work2, RevenueA, RevenueB)
df %>%
gather(key_gen, value_gen, which(grepl("Gender", colnames(.)))) %>%
gather(key_work, value_work, which(grepl("Work", colnames(.)))) %>%
gather(key_reven, value_reven, which(grepl("Revenue", colnames(.)))) %>%
mutate(
gen_id = gsub(".*(\d+$)", "\1", key_gen),
work_id = gsub(".*(\d+$)", "\1", key_work),
reven_id = gsub("Revenue", "", key_reven),
key_work = "Work", key_gen = "Gender", key_reven = "Revenue"
) %>%
filter(gen_id == work_id & value_work == reven_id) %>%
select(-contains("_id"), -key_reven) %>%
add_count(value_gen, value_work) %>%
group_by(value_gen, value_work) %>%
mutate(
mean_reven = paste0("MeanRevenue", value_work),
mean_reven_n = mean(value_reven, na.rm = TRUE),
key_work = paste0(key_work, value_work)
) %>% ungroup() %>%
distinct(Location, key_gen, value_gen, key_work, n, mean_reven, mean_reven_n) %>%
spread(key_gen, value_gen) %>%
spread(key_work, n) %>%
spread(mean_reven, mean_reven_n) %>%
mutate_at(vars(contains("Work"), contains("MeanRevenue")), funs(replace(., is.na(.), 0)))
Output:
Location Gender WorkA WorkB MeanRevenueA MeanRevenueB
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 ABC F 0 2 0 15
2 ABC M 2 0 15 0
However, I believe your data may be more complex. If this doesn't scale well to your dataset, it would be helpful if you could provide us a more complex example that better resembles your original dataframe.
Definetly helpful contribution, but I got the following error ~Error in filter(., gen_id == work_id & value_work == reven_id): object 'gen_id' not found
– Nicola
Nov 17 at 15:48
It may be that your dataset is different than your example, e.g. that column names are not the same, or something similar. Could you check if there are substantial differences between your dataframe and the example you've provided?
– arg0naut
Nov 17 at 15:52
That's useful, I had to apply dplyr:: in front of the filter function and it worked on the small sample that was here proposed. Yet, when running it on a larget set of structured data I get into another errorError: Duplicate identifiers for rows (1, 2), (8, 9, 10), (12, 15), (3, 5), (11, 14)
. I wonder if it has something to do with the way to spread the data when there are repeated identifiers and how to sort it.
– Nicola
Nov 17 at 16:41
You can insert row number as a variable but wouldn't advise so unless we can see the dataframe. Can you augment your data example with more data points resembling your data?
– arg0naut
Nov 17 at 17:26
I actually just kept the proposed example and added a few more random data points but that's when I got the duplicate identifiers error.
– Nicola
Nov 17 at 22:06
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
0
down vote
Something like that would work on your example (I've added a location to the dataframe):
library(tidyverse)
Gender1 <- c("M","F")
Gender2 <- c("F", "M")
Work1 <- c("A", "B")
Work2 <- c("B","A")
RevenueA <- c(10,20)
RevenueB <- c(20,10)
Location <- c("ABC", "ABC")
df <- data.frame(Location, Gender1, Gender2, Work1, Work2, RevenueA, RevenueB)
df %>%
gather(key_gen, value_gen, which(grepl("Gender", colnames(.)))) %>%
gather(key_work, value_work, which(grepl("Work", colnames(.)))) %>%
gather(key_reven, value_reven, which(grepl("Revenue", colnames(.)))) %>%
mutate(
gen_id = gsub(".*(\d+$)", "\1", key_gen),
work_id = gsub(".*(\d+$)", "\1", key_work),
reven_id = gsub("Revenue", "", key_reven),
key_work = "Work", key_gen = "Gender", key_reven = "Revenue"
) %>%
filter(gen_id == work_id & value_work == reven_id) %>%
select(-contains("_id"), -key_reven) %>%
add_count(value_gen, value_work) %>%
group_by(value_gen, value_work) %>%
mutate(
mean_reven = paste0("MeanRevenue", value_work),
mean_reven_n = mean(value_reven, na.rm = TRUE),
key_work = paste0(key_work, value_work)
) %>% ungroup() %>%
distinct(Location, key_gen, value_gen, key_work, n, mean_reven, mean_reven_n) %>%
spread(key_gen, value_gen) %>%
spread(key_work, n) %>%
spread(mean_reven, mean_reven_n) %>%
mutate_at(vars(contains("Work"), contains("MeanRevenue")), funs(replace(., is.na(.), 0)))
Output:
Location Gender WorkA WorkB MeanRevenueA MeanRevenueB
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 ABC F 0 2 0 15
2 ABC M 2 0 15 0
However, I believe your data may be more complex. If this doesn't scale well to your dataset, it would be helpful if you could provide us a more complex example that better resembles your original dataframe.
Definetly helpful contribution, but I got the following error ~Error in filter(., gen_id == work_id & value_work == reven_id): object 'gen_id' not found
– Nicola
Nov 17 at 15:48
It may be that your dataset is different than your example, e.g. that column names are not the same, or something similar. Could you check if there are substantial differences between your dataframe and the example you've provided?
– arg0naut
Nov 17 at 15:52
That's useful, I had to apply dplyr:: in front of the filter function and it worked on the small sample that was here proposed. Yet, when running it on a larget set of structured data I get into another errorError: Duplicate identifiers for rows (1, 2), (8, 9, 10), (12, 15), (3, 5), (11, 14)
. I wonder if it has something to do with the way to spread the data when there are repeated identifiers and how to sort it.
– Nicola
Nov 17 at 16:41
You can insert row number as a variable but wouldn't advise so unless we can see the dataframe. Can you augment your data example with more data points resembling your data?
– arg0naut
Nov 17 at 17:26
I actually just kept the proposed example and added a few more random data points but that's when I got the duplicate identifiers error.
– Nicola
Nov 17 at 22:06
add a comment |
up vote
0
down vote
Something like that would work on your example (I've added a location to the dataframe):
library(tidyverse)
Gender1 <- c("M","F")
Gender2 <- c("F", "M")
Work1 <- c("A", "B")
Work2 <- c("B","A")
RevenueA <- c(10,20)
RevenueB <- c(20,10)
Location <- c("ABC", "ABC")
df <- data.frame(Location, Gender1, Gender2, Work1, Work2, RevenueA, RevenueB)
df %>%
gather(key_gen, value_gen, which(grepl("Gender", colnames(.)))) %>%
gather(key_work, value_work, which(grepl("Work", colnames(.)))) %>%
gather(key_reven, value_reven, which(grepl("Revenue", colnames(.)))) %>%
mutate(
gen_id = gsub(".*(\d+$)", "\1", key_gen),
work_id = gsub(".*(\d+$)", "\1", key_work),
reven_id = gsub("Revenue", "", key_reven),
key_work = "Work", key_gen = "Gender", key_reven = "Revenue"
) %>%
filter(gen_id == work_id & value_work == reven_id) %>%
select(-contains("_id"), -key_reven) %>%
add_count(value_gen, value_work) %>%
group_by(value_gen, value_work) %>%
mutate(
mean_reven = paste0("MeanRevenue", value_work),
mean_reven_n = mean(value_reven, na.rm = TRUE),
key_work = paste0(key_work, value_work)
) %>% ungroup() %>%
distinct(Location, key_gen, value_gen, key_work, n, mean_reven, mean_reven_n) %>%
spread(key_gen, value_gen) %>%
spread(key_work, n) %>%
spread(mean_reven, mean_reven_n) %>%
mutate_at(vars(contains("Work"), contains("MeanRevenue")), funs(replace(., is.na(.), 0)))
Output:
Location Gender WorkA WorkB MeanRevenueA MeanRevenueB
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 ABC F 0 2 0 15
2 ABC M 2 0 15 0
However, I believe your data may be more complex. If this doesn't scale well to your dataset, it would be helpful if you could provide us a more complex example that better resembles your original dataframe.
Definetly helpful contribution, but I got the following error ~Error in filter(., gen_id == work_id & value_work == reven_id): object 'gen_id' not found
– Nicola
Nov 17 at 15:48
It may be that your dataset is different than your example, e.g. that column names are not the same, or something similar. Could you check if there are substantial differences between your dataframe and the example you've provided?
– arg0naut
Nov 17 at 15:52
That's useful, I had to apply dplyr:: in front of the filter function and it worked on the small sample that was here proposed. Yet, when running it on a larget set of structured data I get into another errorError: Duplicate identifiers for rows (1, 2), (8, 9, 10), (12, 15), (3, 5), (11, 14)
. I wonder if it has something to do with the way to spread the data when there are repeated identifiers and how to sort it.
– Nicola
Nov 17 at 16:41
You can insert row number as a variable but wouldn't advise so unless we can see the dataframe. Can you augment your data example with more data points resembling your data?
– arg0naut
Nov 17 at 17:26
I actually just kept the proposed example and added a few more random data points but that's when I got the duplicate identifiers error.
– Nicola
Nov 17 at 22:06
add a comment |
up vote
0
down vote
up vote
0
down vote
Something like that would work on your example (I've added a location to the dataframe):
library(tidyverse)
Gender1 <- c("M","F")
Gender2 <- c("F", "M")
Work1 <- c("A", "B")
Work2 <- c("B","A")
RevenueA <- c(10,20)
RevenueB <- c(20,10)
Location <- c("ABC", "ABC")
df <- data.frame(Location, Gender1, Gender2, Work1, Work2, RevenueA, RevenueB)
df %>%
gather(key_gen, value_gen, which(grepl("Gender", colnames(.)))) %>%
gather(key_work, value_work, which(grepl("Work", colnames(.)))) %>%
gather(key_reven, value_reven, which(grepl("Revenue", colnames(.)))) %>%
mutate(
gen_id = gsub(".*(\d+$)", "\1", key_gen),
work_id = gsub(".*(\d+$)", "\1", key_work),
reven_id = gsub("Revenue", "", key_reven),
key_work = "Work", key_gen = "Gender", key_reven = "Revenue"
) %>%
filter(gen_id == work_id & value_work == reven_id) %>%
select(-contains("_id"), -key_reven) %>%
add_count(value_gen, value_work) %>%
group_by(value_gen, value_work) %>%
mutate(
mean_reven = paste0("MeanRevenue", value_work),
mean_reven_n = mean(value_reven, na.rm = TRUE),
key_work = paste0(key_work, value_work)
) %>% ungroup() %>%
distinct(Location, key_gen, value_gen, key_work, n, mean_reven, mean_reven_n) %>%
spread(key_gen, value_gen) %>%
spread(key_work, n) %>%
spread(mean_reven, mean_reven_n) %>%
mutate_at(vars(contains("Work"), contains("MeanRevenue")), funs(replace(., is.na(.), 0)))
Output:
Location Gender WorkA WorkB MeanRevenueA MeanRevenueB
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 ABC F 0 2 0 15
2 ABC M 2 0 15 0
However, I believe your data may be more complex. If this doesn't scale well to your dataset, it would be helpful if you could provide us a more complex example that better resembles your original dataframe.
Something like that would work on your example (I've added a location to the dataframe):
library(tidyverse)
Gender1 <- c("M","F")
Gender2 <- c("F", "M")
Work1 <- c("A", "B")
Work2 <- c("B","A")
RevenueA <- c(10,20)
RevenueB <- c(20,10)
Location <- c("ABC", "ABC")
df <- data.frame(Location, Gender1, Gender2, Work1, Work2, RevenueA, RevenueB)
df %>%
gather(key_gen, value_gen, which(grepl("Gender", colnames(.)))) %>%
gather(key_work, value_work, which(grepl("Work", colnames(.)))) %>%
gather(key_reven, value_reven, which(grepl("Revenue", colnames(.)))) %>%
mutate(
gen_id = gsub(".*(\d+$)", "\1", key_gen),
work_id = gsub(".*(\d+$)", "\1", key_work),
reven_id = gsub("Revenue", "", key_reven),
key_work = "Work", key_gen = "Gender", key_reven = "Revenue"
) %>%
filter(gen_id == work_id & value_work == reven_id) %>%
select(-contains("_id"), -key_reven) %>%
add_count(value_gen, value_work) %>%
group_by(value_gen, value_work) %>%
mutate(
mean_reven = paste0("MeanRevenue", value_work),
mean_reven_n = mean(value_reven, na.rm = TRUE),
key_work = paste0(key_work, value_work)
) %>% ungroup() %>%
distinct(Location, key_gen, value_gen, key_work, n, mean_reven, mean_reven_n) %>%
spread(key_gen, value_gen) %>%
spread(key_work, n) %>%
spread(mean_reven, mean_reven_n) %>%
mutate_at(vars(contains("Work"), contains("MeanRevenue")), funs(replace(., is.na(.), 0)))
Output:
Location Gender WorkA WorkB MeanRevenueA MeanRevenueB
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 ABC F 0 2 0 15
2 ABC M 2 0 15 0
However, I believe your data may be more complex. If this doesn't scale well to your dataset, it would be helpful if you could provide us a more complex example that better resembles your original dataframe.
edited Nov 17 at 19:19
answered Nov 17 at 15:30
arg0naut
1,407212
1,407212
Definetly helpful contribution, but I got the following error ~Error in filter(., gen_id == work_id & value_work == reven_id): object 'gen_id' not found
– Nicola
Nov 17 at 15:48
It may be that your dataset is different than your example, e.g. that column names are not the same, or something similar. Could you check if there are substantial differences between your dataframe and the example you've provided?
– arg0naut
Nov 17 at 15:52
That's useful, I had to apply dplyr:: in front of the filter function and it worked on the small sample that was here proposed. Yet, when running it on a larget set of structured data I get into another errorError: Duplicate identifiers for rows (1, 2), (8, 9, 10), (12, 15), (3, 5), (11, 14)
. I wonder if it has something to do with the way to spread the data when there are repeated identifiers and how to sort it.
– Nicola
Nov 17 at 16:41
You can insert row number as a variable but wouldn't advise so unless we can see the dataframe. Can you augment your data example with more data points resembling your data?
– arg0naut
Nov 17 at 17:26
I actually just kept the proposed example and added a few more random data points but that's when I got the duplicate identifiers error.
– Nicola
Nov 17 at 22:06
add a comment |
Definetly helpful contribution, but I got the following error ~Error in filter(., gen_id == work_id & value_work == reven_id): object 'gen_id' not found
– Nicola
Nov 17 at 15:48
It may be that your dataset is different than your example, e.g. that column names are not the same, or something similar. Could you check if there are substantial differences between your dataframe and the example you've provided?
– arg0naut
Nov 17 at 15:52
That's useful, I had to apply dplyr:: in front of the filter function and it worked on the small sample that was here proposed. Yet, when running it on a larget set of structured data I get into another errorError: Duplicate identifiers for rows (1, 2), (8, 9, 10), (12, 15), (3, 5), (11, 14)
. I wonder if it has something to do with the way to spread the data when there are repeated identifiers and how to sort it.
– Nicola
Nov 17 at 16:41
You can insert row number as a variable but wouldn't advise so unless we can see the dataframe. Can you augment your data example with more data points resembling your data?
– arg0naut
Nov 17 at 17:26
I actually just kept the proposed example and added a few more random data points but that's when I got the duplicate identifiers error.
– Nicola
Nov 17 at 22:06
Definetly helpful contribution, but I got the following error ~Error in filter(., gen_id == work_id & value_work == reven_id): object 'gen_id' not found
– Nicola
Nov 17 at 15:48
Definetly helpful contribution, but I got the following error ~Error in filter(., gen_id == work_id & value_work == reven_id): object 'gen_id' not found
– Nicola
Nov 17 at 15:48
It may be that your dataset is different than your example, e.g. that column names are not the same, or something similar. Could you check if there are substantial differences between your dataframe and the example you've provided?
– arg0naut
Nov 17 at 15:52
It may be that your dataset is different than your example, e.g. that column names are not the same, or something similar. Could you check if there are substantial differences between your dataframe and the example you've provided?
– arg0naut
Nov 17 at 15:52
That's useful, I had to apply dplyr:: in front of the filter function and it worked on the small sample that was here proposed. Yet, when running it on a larget set of structured data I get into another error
Error: Duplicate identifiers for rows (1, 2), (8, 9, 10), (12, 15), (3, 5), (11, 14)
. I wonder if it has something to do with the way to spread the data when there are repeated identifiers and how to sort it.– Nicola
Nov 17 at 16:41
That's useful, I had to apply dplyr:: in front of the filter function and it worked on the small sample that was here proposed. Yet, when running it on a larget set of structured data I get into another error
Error: Duplicate identifiers for rows (1, 2), (8, 9, 10), (12, 15), (3, 5), (11, 14)
. I wonder if it has something to do with the way to spread the data when there are repeated identifiers and how to sort it.– Nicola
Nov 17 at 16:41
You can insert row number as a variable but wouldn't advise so unless we can see the dataframe. Can you augment your data example with more data points resembling your data?
– arg0naut
Nov 17 at 17:26
You can insert row number as a variable but wouldn't advise so unless we can see the dataframe. Can you augment your data example with more data points resembling your data?
– arg0naut
Nov 17 at 17:26
I actually just kept the proposed example and added a few more random data points but that's when I got the duplicate identifiers error.
– Nicola
Nov 17 at 22:06
I actually just kept the proposed example and added a few more random data points but that's when I got the duplicate identifiers error.
– Nicola
Nov 17 at 22:06
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53352065%2fhow-to-combine-a-frequency-count-of-multiple-columns-with-a-transposition-of-cat%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
can u dput() sample data
– sai saran
Nov 17 at 14:26