How to fill in NAs of various columns grouped by duplicated IDs in R

up vote
0
down vote

favorite

I have a table with columns id, colA, and colB. The data contains duplicated id columns where for some rows, colA or colB is null, but its duplicated id has valid values. I want to clean the data so that I remove duplicates, but have complete data. For example my data looks like

id | colA | colB

 1     NA    X

 1      Y    X

 2      Z    NA

 2      Z    Y

 3      Z    Y

 3      Z    Y

 4     NA    NA

 4     NA    NA

and I want my dataframe to look like

id | colA | colB

1     Y      X

2     Z      Y

3     Z      Y

4     NA     NA

I usually use the ifelse statement to replace missing values, but I am confused on how to use this in the context of having duplicated ids.

asked Nov 19 at 23:56

Kevin Sun

11818

In the case where only one value per id is NA, is it always the first row with that id, as in your example data?
– neilfws
Nov 20 at 0:40

add a comment |

up vote
0
down vote

favorite

id | colA | colB

 1     NA    X

 1      Y    X

 2      Z    NA

 2      Z    Y

 3      Z    Y

 3      Z    Y

 4     NA    NA

 4     NA    NA

and I want my dataframe to look like

id | colA | colB

1     Y      X

2     Z      Y

3     Z      Y

4     NA     NA

I usually use the ifelse statement to replace missing values, but I am confused on how to use this in the context of having duplicated ids.

asked Nov 19 at 23:56

Kevin Sun

11818

In the case where only one value per id is NA, is it always the first row with that id, as in your example data?
– neilfws
Nov 20 at 0:40

add a comment |

up vote
0
down vote

favorite

id | colA | colB

 1     NA    X

 1      Y    X

 2      Z    NA

 2      Z    Y

 3      Z    Y

 3      Z    Y

 4     NA    NA

 4     NA    NA

and I want my dataframe to look like

id | colA | colB

1     Y      X

2     Z      Y

3     Z      Y

4     NA     NA

I usually use the ifelse statement to replace missing values, but I am confused on how to use this in the context of having duplicated ids.

asked Nov 19 at 23:56

Kevin Sun

11818

id | colA | colB

 1     NA    X

 1      Y    X

 2      Z    NA

 2      Z    Y

 3      Z    Y

 3      Z    Y

 4     NA    NA

 4     NA    NA

and I want my dataframe to look like

id | colA | colB

1     Y      X

2     Z      Y

3     Z      Y

4     NA     NA

I usually use the ifelse statement to replace missing values, but I am confused on how to use this in the context of having duplicated ids.

r merge na

asked Nov 19 at 23:56

Kevin Sun

11818

asked Nov 19 at 23:56

Kevin Sun

11818

asked Nov 19 at 23:56

Kevin Sun

11818

asked Nov 19 at 23:56

Kevin Sun

11818

asked Nov 19 at 23:56

Kevin Sun

11818

In the case where only one value per id is NA, is it always the first row with that id, as in your example data?
– neilfws
Nov 20 at 0:40

add a comment |

In the case where only one value per id is NA, is it always the first row with that id, as in your example data?
– neilfws
Nov 20 at 0:40

In the case where only one value per id is NA, is it always the first row with that id, as in your example data?
– neilfws
Nov 20 at 0:40

add a comment |

3 Answers
3

active

oldest

votes

up vote
2
down vote

accepted

This answer is very dependent on your actual data being similar in structure to your example data.

Your data:

df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), 

                      colA = c(NA, "Y", "Z", "Z", "Z", "Z", NA, NA), 

                      colB = c("X", "X", NA, "Y", "Y", "Y", NA, NA)), 

                      class = "data.frame", 

                      row.names = c(NA, -8L))

Assuming, as in your example, that each id occurs twice and that where one observation is NA, it is the first observation for that id, then this works:

library(dplyr)

library(tidyr)



df1 %>% 

  group_by(id) %>% 

  fill(colA, colB, .direction = "up") %>%  

  ungroup() %>% 

  distinct()



# A tibble: 4 x 3

     id colA  colB 

  <int> <chr> <chr>

1     1 Y     X    

2     2 Z     Y    

3     3 Z     Y    

4     4 NA    NA

If the second observation for an id can be NA, you could try adding a second fill after the first one, but this time fill down:

df1 %>% 

  group_by(id) %>% 

  fill(colA, colB, .direction = "up") %>%  

  fill(colA, colB, .direction = "down") %>%  

  ungroup() %>% 

  distinct()

answered Nov 20 at 0:47

neilfws

17.5k53648

add a comment |

up vote
4
down vote

First add a column that tells how many NAs in each row. Then using dplyr, remove duplicated rows first and then for each id keep the row with least missing values -

df$test <- rowSums(is.na(df))



df %>% 

  filter(!duplicated(.)) %>% 

  arrange(id, test) %>% 

  group_by(id) %>% 

  filter(row_number() == 1) %>% 

  ungroup() %>% 

  select(-test)



# A tibble: 4 x 3

     id colA  colB 

  <int> <chr> <chr>

1     1 y     x    

2     2 z     y    

3     3 z     y    

4     4 <NA>  <NA>

EDIT:
Actually no need to remove duplicates first. Just keeping the row with least missing values for each id should also work -

df$test <- rowSums(is.na(df))



df %>% 

  arrange(id, test) %>% 

  group_by(id) %>% 

  filter(row_number() == 1) %>% 

  ungroup() %>% 

  select(-test)

Data -

df <- data.frame(

  id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), 

  colB = c("x", "x", NA, "y", "y", "y", NA, NA), stringsAsFactors = F)

edited Nov 20 at 14:23

answered Nov 20 at 0:44

Shree

3,2861323

add a comment |

up vote
1
down vote

Creating dataframe - it helps if you post the code to make the sample data

df <- data.frame(id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), colB = c("x", "x", NA, "y", "y", "y", NA, NA))

Removing rows with single NAs

for(i in 1:nrow(df)){



  if(is.na(df[i,]$colA) & !is.na(df[i,]$colB) | !is.na(df[i,]$colA) & is.na(df[i,]$colB)){



df <- df[-i,]



  }

}

Removing remaining duplicates (i.e. double NA rows)

df  <- df[!duplicated(df), ]

Output

df

Probably a more computationally efficient way of doing this but this ought to work.

answered Nov 20 at 0:32

André.B

508

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53384336%2fhow-to-fill-in-nas-of-various-columns-grouped-by-duplicated-ids-in-r%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

up vote
2
down vote

accepted

This answer is very dependent on your actual data being similar in structure to your example data.

Your data:

df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), 

                      colA = c(NA, "Y", "Z", "Z", "Z", "Z", NA, NA), 

                      colB = c("X", "X", NA, "Y", "Y", "Y", NA, NA)), 

                      class = "data.frame", 

                      row.names = c(NA, -8L))

Assuming, as in your example, that each id occurs twice and that where one observation is NA, it is the first observation for that id, then this works:

library(dplyr)

library(tidyr)



df1 %>% 

  group_by(id) %>% 

  fill(colA, colB, .direction = "up") %>%  

  ungroup() %>% 

  distinct()



# A tibble: 4 x 3

     id colA  colB 

  <int> <chr> <chr>

1     1 Y     X    

2     2 Z     Y    

3     3 Z     Y    

4     4 NA    NA

If the second observation for an id can be NA, you could try adding a second fill after the first one, but this time fill down:

df1 %>% 

  group_by(id) %>% 

  fill(colA, colB, .direction = "up") %>%  

  fill(colA, colB, .direction = "down") %>%  

  ungroup() %>% 

  distinct()

answered Nov 20 at 0:47

neilfws

17.5k53648

add a comment |

up vote
2
down vote

accepted

This answer is very dependent on your actual data being similar in structure to your example data.

Your data:

df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), 

                      colA = c(NA, "Y", "Z", "Z", "Z", "Z", NA, NA), 

                      colB = c("X", "X", NA, "Y", "Y", "Y", NA, NA)), 

                      class = "data.frame", 

                      row.names = c(NA, -8L))

Assuming, as in your example, that each id occurs twice and that where one observation is NA, it is the first observation for that id, then this works:

library(dplyr)

library(tidyr)



df1 %>% 

  group_by(id) %>% 

  fill(colA, colB, .direction = "up") %>%  

  ungroup() %>% 

  distinct()



# A tibble: 4 x 3

     id colA  colB 

  <int> <chr> <chr>

1     1 Y     X    

2     2 Z     Y    

3     3 Z     Y    

4     4 NA    NA

If the second observation for an id can be NA, you could try adding a second fill after the first one, but this time fill down:

df1 %>% 

  group_by(id) %>% 

  fill(colA, colB, .direction = "up") %>%  

  fill(colA, colB, .direction = "down") %>%  

  ungroup() %>% 

  distinct()

answered Nov 20 at 0:47

neilfws

17.5k53648

add a comment |

up vote
2
down vote

accepted

This answer is very dependent on your actual data being similar in structure to your example data.

Your data:

df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), 

                      colA = c(NA, "Y", "Z", "Z", "Z", "Z", NA, NA), 

                      colB = c("X", "X", NA, "Y", "Y", "Y", NA, NA)), 

                      class = "data.frame", 

                      row.names = c(NA, -8L))

Assuming, as in your example, that each id occurs twice and that where one observation is NA, it is the first observation for that id, then this works:

library(dplyr)

library(tidyr)



df1 %>% 

  group_by(id) %>% 

  fill(colA, colB, .direction = "up") %>%  

  ungroup() %>% 

  distinct()



# A tibble: 4 x 3

     id colA  colB 

  <int> <chr> <chr>

1     1 Y     X    

2     2 Z     Y    

3     3 Z     Y    

4     4 NA    NA

If the second observation for an id can be NA, you could try adding a second fill after the first one, but this time fill down:

df1 %>% 

  group_by(id) %>% 

  fill(colA, colB, .direction = "up") %>%  

  fill(colA, colB, .direction = "down") %>%  

  ungroup() %>% 

  distinct()

answered Nov 20 at 0:47

neilfws

17.5k53648

This answer is very dependent on your actual data being similar in structure to your example data.

Your data:

df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), 

                      colA = c(NA, "Y", "Z", "Z", "Z", "Z", NA, NA), 

                      colB = c("X", "X", NA, "Y", "Y", "Y", NA, NA)), 

                      class = "data.frame", 

                      row.names = c(NA, -8L))

Assuming, as in your example, that each id occurs twice and that where one observation is NA, it is the first observation for that id, then this works:

library(dplyr)

library(tidyr)



df1 %>% 

  group_by(id) %>% 

  fill(colA, colB, .direction = "up") %>%  

  ungroup() %>% 

  distinct()



# A tibble: 4 x 3

     id colA  colB 

  <int> <chr> <chr>

1     1 Y     X    

2     2 Z     Y    

3     3 Z     Y    

4     4 NA    NA

If the second observation for an id can be NA, you could try adding a second fill after the first one, but this time fill down:

df1 %>% 

  group_by(id) %>% 

  fill(colA, colB, .direction = "up") %>%  

  fill(colA, colB, .direction = "down") %>%  

  ungroup() %>% 

  distinct()

answered Nov 20 at 0:47

neilfws

17.5k53648

answered Nov 20 at 0:47

neilfws

17.5k53648

answered Nov 20 at 0:47

neilfws

17.5k53648

answered Nov 20 at 0:47

neilfws

17.5k53648

add a comment |

up vote
4
down vote

First add a column that tells how many NAs in each row. Then using dplyr, remove duplicated rows first and then for each id keep the row with least missing values -

df$test <- rowSums(is.na(df))



df %>% 

  filter(!duplicated(.)) %>% 

  arrange(id, test) %>% 

  group_by(id) %>% 

  filter(row_number() == 1) %>% 

  ungroup() %>% 

  select(-test)



# A tibble: 4 x 3

     id colA  colB 

  <int> <chr> <chr>

1     1 y     x    

2     2 z     y    

3     3 z     y    

4     4 <NA>  <NA>

EDIT:
Actually no need to remove duplicates first. Just keeping the row with least missing values for each id should also work -

df$test <- rowSums(is.na(df))



df %>% 

  arrange(id, test) %>% 

  group_by(id) %>% 

  filter(row_number() == 1) %>% 

  ungroup() %>% 

  select(-test)

Data -

df <- data.frame(

  id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), 

  colB = c("x", "x", NA, "y", "y", "y", NA, NA), stringsAsFactors = F)

edited Nov 20 at 14:23

answered Nov 20 at 0:44

Shree

3,2861323

add a comment |

up vote
4
down vote

First add a column that tells how many NAs in each row. Then using dplyr, remove duplicated rows first and then for each id keep the row with least missing values -

df$test <- rowSums(is.na(df))



df %>% 

  filter(!duplicated(.)) %>% 

  arrange(id, test) %>% 

  group_by(id) %>% 

  filter(row_number() == 1) %>% 

  ungroup() %>% 

  select(-test)



# A tibble: 4 x 3

     id colA  colB 

  <int> <chr> <chr>

1     1 y     x    

2     2 z     y    

3     3 z     y    

4     4 <NA>  <NA>

EDIT:
Actually no need to remove duplicates first. Just keeping the row with least missing values for each id should also work -

df$test <- rowSums(is.na(df))



df %>% 

  arrange(id, test) %>% 

  group_by(id) %>% 

  filter(row_number() == 1) %>% 

  ungroup() %>% 

  select(-test)

Data -

df <- data.frame(

  id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), 

  colB = c("x", "x", NA, "y", "y", "y", NA, NA), stringsAsFactors = F)

edited Nov 20 at 14:23

answered Nov 20 at 0:44

Shree

3,2861323

add a comment |

up vote
4
down vote

First add a column that tells how many NAs in each row. Then using dplyr, remove duplicated rows first and then for each id keep the row with least missing values -

df$test <- rowSums(is.na(df))



df %>% 

  filter(!duplicated(.)) %>% 

  arrange(id, test) %>% 

  group_by(id) %>% 

  filter(row_number() == 1) %>% 

  ungroup() %>% 

  select(-test)



# A tibble: 4 x 3

     id colA  colB 

  <int> <chr> <chr>

1     1 y     x    

2     2 z     y    

3     3 z     y    

4     4 <NA>  <NA>

EDIT:
Actually no need to remove duplicates first. Just keeping the row with least missing values for each id should also work -

df$test <- rowSums(is.na(df))



df %>% 

  arrange(id, test) %>% 

  group_by(id) %>% 

  filter(row_number() == 1) %>% 

  ungroup() %>% 

  select(-test)

Data -

df <- data.frame(

  id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), 

  colB = c("x", "x", NA, "y", "y", "y", NA, NA), stringsAsFactors = F)

edited Nov 20 at 14:23

answered Nov 20 at 0:44

Shree

3,2861323

First add a column that tells how many NAs in each row. Then using dplyr, remove duplicated rows first and then for each id keep the row with least missing values -

df$test <- rowSums(is.na(df))



df %>% 

  filter(!duplicated(.)) %>% 

  arrange(id, test) %>% 

  group_by(id) %>% 

  filter(row_number() == 1) %>% 

  ungroup() %>% 

  select(-test)



# A tibble: 4 x 3

     id colA  colB 

  <int> <chr> <chr>

1     1 y     x    

2     2 z     y    

3     3 z     y    

4     4 <NA>  <NA>

EDIT:
Actually no need to remove duplicates first. Just keeping the row with least missing values for each id should also work -

df$test <- rowSums(is.na(df))



df %>% 

  arrange(id, test) %>% 

  group_by(id) %>% 

  filter(row_number() == 1) %>% 

  ungroup() %>% 

  select(-test)

Data -

df <- data.frame(

  id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), 

  colB = c("x", "x", NA, "y", "y", "y", NA, NA), stringsAsFactors = F)

edited Nov 20 at 14:23

answered Nov 20 at 0:44

Shree

3,2861323

edited Nov 20 at 14:23

answered Nov 20 at 0:44

Shree

3,2861323

answered Nov 20 at 0:44

Shree

3,2861323

answered Nov 20 at 0:44

Shree

3,2861323

add a comment |

up vote
1
down vote

Creating dataframe - it helps if you post the code to make the sample data

df <- data.frame(id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), colB = c("x", "x", NA, "y", "y", "y", NA, NA))

Removing rows with single NAs

for(i in 1:nrow(df)){



  if(is.na(df[i,]$colA) & !is.na(df[i,]$colB) | !is.na(df[i,]$colA) & is.na(df[i,]$colB)){



df <- df[-i,]



  }

}

Removing remaining duplicates (i.e. double NA rows)

df  <- df[!duplicated(df), ]

Output

df

Probably a more computationally efficient way of doing this but this ought to work.

answered Nov 20 at 0:32

André.B

508

add a comment |

up vote
1
down vote

Creating dataframe - it helps if you post the code to make the sample data

df <- data.frame(id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), colB = c("x", "x", NA, "y", "y", "y", NA, NA))

Removing rows with single NAs

for(i in 1:nrow(df)){



  if(is.na(df[i,]$colA) & !is.na(df[i,]$colB) | !is.na(df[i,]$colA) & is.na(df[i,]$colB)){



df <- df[-i,]



  }

}

Removing remaining duplicates (i.e. double NA rows)

df  <- df[!duplicated(df), ]

Output

df

Probably a more computationally efficient way of doing this but this ought to work.

answered Nov 20 at 0:32

André.B

508

add a comment |

up vote
1
down vote

Creating dataframe - it helps if you post the code to make the sample data

df <- data.frame(id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), colB = c("x", "x", NA, "y", "y", "y", NA, NA))

Removing rows with single NAs

for(i in 1:nrow(df)){



  if(is.na(df[i,]$colA) & !is.na(df[i,]$colB) | !is.na(df[i,]$colA) & is.na(df[i,]$colB)){



df <- df[-i,]



  }

}

Removing remaining duplicates (i.e. double NA rows)

df  <- df[!duplicated(df), ]

Output

df

Probably a more computationally efficient way of doing this but this ought to work.

answered Nov 20 at 0:32

André.B

508

Creating dataframe - it helps if you post the code to make the sample data

df <- data.frame(id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), colB = c("x", "x", NA, "y", "y", "y", NA, NA))

Removing rows with single NAs

for(i in 1:nrow(df)){



  if(is.na(df[i,]$colA) & !is.na(df[i,]$colB) | !is.na(df[i,]$colA) & is.na(df[i,]$colB)){



df <- df[-i,]



  }

}

Removing remaining duplicates (i.e. double NA rows)

df  <- df[!duplicated(df), ]

Output

df

Probably a more computationally efficient way of doing this but this ought to work.

answered Nov 20 at 0:32

André.B

508

answered Nov 20 at 0:32

André.B

508

answered Nov 20 at 0:32

André.B

508

answered Nov 20 at 0:32

André.B

508

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Nsryjdtyk