DataCamp Course <Joining Data with dplyr> Chapter 2

Author: Jason数据分析生信教室 | Published: 2021-08-11 12:56

Joining Data with dplyr

Chapter 1. Joining tables
Chapter 2. Left and right joins
Chapter 3. Full and semi joins
Chapter 4. Practice problems

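Chapter 2. Left and right joins

left_join: left join

A left join (left_join), as the name suggests, aligns the result to the left-hand table: every row of the first table is kept. Rows with no match in the second table get NA in the columns that come from the second table.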
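As a minimal sketch of this behavior, here are two small made-up tables (left_tbl and right_tbl are hypothetical names, not part of the course data):

```r
library(dplyr)

# Hypothetical toy tables, invented for illustration only
left_tbl  <- tibble(key = c("a", "b", "c"), x = 1:3)
right_tbl <- tibble(key = c("a", "c", "d"), y = c(10, 30, 40))

joined <- left_tbl %>%
  left_join(right_tbl, by = "key")

# All three rows of left_tbl are kept.
# Key "b" has no match in right_tbl, so its y is NA.
# Key "d" exists only in right_tbl, so it does not appear.
joined
```

The row count of the result equals the row count of left_tbl here because each key matches at most one row on the right.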


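For example, left-join "millennium_falcon" and "star_destroyer" on the two variables "part_num" and "color_id", and use suffix to rename the non-key columns that have the same name in both tables.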
# Combine the star_destroyer and millennium_falcon tables
millennium_falcon %>%
  left_join(star_destroyer,by=c("part_num","color_id"),
  suffix=c("_falcon","_star_destroyer"))
# A tibble: 263 x 6
   set_num_falcon part_num color_id quantity_falcon set_num_star_de~
   <chr>          <chr>       <dbl>           <dbl> <chr>           
 1 7965-1         63868          71              62 <NA>            
 2 7965-1         3023            0              60 <NA>            
 3 7965-1         3021           72              46 75190-1         
 4 7965-1         2780            0              37 75190-1         
 5 7965-1         60478          72              36 <NA>            
 6 7965-1         6636           71              34 75190-1         
 7 7965-1         3009           71              28 75190-1         
 8 7965-1         3665           71              22 <NA>            
 9 7965-1         2412b          72              20 75190-1         
10 7965-1         3010           71              19 <NA>            
# ... with 253 more rows, and 1 more variable: quantity_star_destroyer <dbl>
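To see which columns suffix actually touches, here is a small sketch on made-up data (falcon_toy and destroyer_toy are hypothetical stand-ins, and their values are invented):

```r
library(dplyr)

# Hypothetical miniature versions of the two tables
falcon_toy    <- tibble(part_num = c("3021", "3023"),
                        color_id = c(72, 0),
                        quantity = c(46, 60))
destroyer_toy <- tibble(part_num = c("3021", "2780"),
                        color_id = c(72, 0),
                        quantity = c(6, 12))

suffixed <- falcon_toy %>%
  left_join(destroyer_toy, by = c("part_num", "color_id"),
            suffix = c("_falcon", "_star_destroyer"))

# The join keys keep their names; only the clashing non-key
# column "quantity" receives the suffixes.
names(suffixed)
```

Both key columns must match for a row to pair up, which is why part "3023" gets NA for quantity_star_destroyer in this toy data.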

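The next example is slightly more involved and draws on material from other courses.

    1. Compute summary statistics for each of the two tables, grouped by a variable (using group_by and summarize)
    2. Join the two summary tables
# Aggregate Millennium Falcon for the total quantity of each color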
millennium_falcon_colors <- millennium_falcon %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))

# Aggregate Star Destroyer for the total quantity of each color
star_destroyer_colors <- star_destroyer %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))

# Left join the Millennium Falcon colors to the Star Destroyer colors
millennium_falcon_colors %>%
  left_join(star_destroyer_colors,by="color_id",
  suffix=c("_falcon","_star_destroyer"))
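The original shows no output for this step, so here is a sketch of the same aggregate-then-join pattern on invented data (set_a and set_b are hypothetical tables):

```r
library(dplyr)

# Hypothetical part lists for two sets
set_a <- tibble(color_id = c(0, 0, 72, 71), quantity = c(2, 3, 5, 1))
set_b <- tibble(color_id = c(0, 72, 72),    quantity = c(4, 1, 1))

# One row per color, with the summed quantity
a_colors <- set_a %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))

b_colors <- set_b %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))

# Keep every color of set_a; colors missing from set_b get NA
color_totals <- a_colors %>%
  left_join(b_colors, by = "color_id", suffix = c("_a", "_b"))

color_totals
```

Color 71 occurs only in set_a, so its total_quantity_b is NA after the join.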

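The next example uses filter, covered earlier. First extract the rows of inventories where "version" is 1. Then left-join sets to that result on the shared variable "set_num", so that every set is kept. Finally, keep the sets that have no version 1 inventory, that is, the rows where "version" is NA after the join. This step uses is.na().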

inventory_version_1 <- inventories %>%
  filter(version == 1)

# Join versions to sets
sets %>%
  left_join(inventory_version_1,by="set_num") %>%
  # Filter for where version is na
  filter(is.na(version))
# A tibble: 1 x 6
  set_num name       year theme_id    id version
  <chr>   <chr>     <dbl>    <dbl> <dbl>   <dbl>
1 40198-1 Ludo game  2018      598    NA      NA
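The same "join, then keep the NA rows" pattern can be sketched on toy data (sets_toy and inv_toy are hypothetical tables with invented values):

```r
library(dplyr)

# Hypothetical sets and inventories
sets_toy <- tibble(set_num = c("s1", "s2", "s3"),
                   name    = c("A", "B", "C"))
inv_toy  <- tibble(set_num = c("s1", "s2", "s2"),
                   version = c(1, 1, 2))

missing_v1 <- sets_toy %>%
  # Join only the version 1 inventories onto the full list of sets
  left_join(filter(inv_toy, version == 1), by = "set_num") %>%
  # Sets with no version 1 inventory have NA in the joined column
  filter(is.na(version))

missing_v1
```

This works because version has no NA values in inv_toy itself, so an NA after the join can only mean "no match".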

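right_join: right join


A right join is the mirror image of a left join: the result keeps every row of the second table. As an example, use count to tally how often each value of "part_cat_id" occurs (count creates a frequency column named n by default), then right-join the result to "part_categories" and keep the rows where n is NA.
This uses the syntax for matching key columns with different names introduced earlier: by=c("A"="B")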
parts %>%
    count(part_cat_id) %>%
    right_join(part_categories, by = c("part_cat_id" = "id")) %>%
    # Filter for NA
    filter(is.na(n))
# A tibble: 1 x 3
  part_cat_id     n name   
        <dbl> <int> <chr>  
1          66    NA Modulex
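A small sketch of a right join with differently named keys, using invented data (parts_toy and categories_toy are hypothetical):

```r
library(dplyr)

# Hypothetical parts and categories
parts_toy      <- tibble(part_cat_id = c(1, 1, 3))
categories_toy <- tibble(id   = c(1, 3, 66),
                         name = c("Baseplates", "Bricks Sloped", "Modulex"))

counts <- parts_toy %>%
  count(part_cat_id) %>%
  # "part_cat_id" on the left is matched against "id" on the right
  right_join(categories_toy, by = c("part_cat_id" = "id"))

# Every category is kept; category 66 has no parts, so n is NA.
# The key column keeps the left-hand name, part_cat_id.
counts
```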

The course also shows how to replace NA values: replace_na (from the tidyr package) replaces the NA values in n with 0.

parts %>%
    count(part_cat_id) %>%
    right_join(part_categories, by = c("part_cat_id" = "id")) %>%
    # Use replace_na to replace missing values in the n column
    replace_na(list(n=0))
# A tibble: 64 x 3
   part_cat_id     n name                   
         <dbl> <dbl> <chr>                  
 1           1   135 Baseplates             
 2           3   303 Bricks Sloped          
 3           4  1900 Duplo, Quatro and Primo
 4           5   107 Bricks Special         
 5           6   128 Bricks Wedged          
 6           7    97 Containers             
 7           8    24 Technic Bricks         
 8           9   167 Plates Special         
 9          11   490 Bricks                 
10          12    85 Technic Connectors     
# ... with 54 more rows
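As a self-contained sketch of replace_na, assuming tidyr is installed (counts_toy is a hypothetical table standing in for the joined result):

```r
library(dplyr)
library(tidyr)

# Hypothetical counts with one missing value, as a right join might produce
counts_toy <- tibble(part_cat_id = c(1, 3, 66),
                     n           = c(2L, 1L, NA))

filled <- counts_toy %>%
  # The list names the column to fix and the value to use for its NAs;
  # 0L is an integer zero, matching the integer type of n
  replace_na(list(n = 0L))

filled
```

Columns not named in the list are left untouched, so NA values elsewhere in the table would survive.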
