DataCamp course <Data Manipulation with dplyr> Chapter 1

Author: Jason数据分析生信教室 | Published: 2021-07-12 14:49

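Course outline: Data Manipulation with dplyr

Chapter 1. Transforming data
Chapter 2. Aggregating data
Chapter 3. Selecting and transforming data
Chapter 4. Case study

Chapter 1. Transforming data

This chapter is mostly a review: most of its material already appeared in earlier courses.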

Common verbs

select()
filter()
arrange()
mutate()

select(): choose specific columns

From the counties dataset, select the columns state, county, population, and poverty.

counties %>%
 select(state,county,population,poverty) 
# A tibble: 3,138 x 4
   state   county   population poverty
   <chr>   <chr>         <dbl>   <dbl>
 1 Alabama Autauga       55221    12.9
 2 Alabama Baldwin      195121    13.4
 3 Alabama Barbour       26932    26.7
 4 Alabama Bibb          22604    16.8
 5 Alabama Blount        57710    16.7
 6 Alabama Bullock       10678    24.6
 7 Alabama Butler        20354    25.4
 8 Alabama Calhoun      116648    20.5
 9 Alabama Chambers      34079    21.6
10 Alabama Cherokee      26008    19.2
# ... with 3,128 more rows
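
The counties dataset comes from the DataCamp environment and is not bundled with dplyr, so the call above cannot be run as-is elsewhere. A minimal sketch of the same select() call, assuming a stand-in tibble named counties_toy (a hypothetical name) built from three rows copied from the outputs in this chapter:

```r
library(dplyr)

# Toy stand-in for counties (hypothetical name); the values are copied
# from the first three Alabama rows printed in this chapter.
counties_toy <- tibble(
  state       = c("Alabama", "Alabama", "Alabama"),
  county      = c("Autauga", "Baldwin", "Barbour"),
  population  = c(55221, 195121, 26932),
  poverty     = c(12.9, 13.4, 26.7),
  public_work = c(20.9, 12.3, 20.8)
)

# Keep only the named columns, in the order they are listed
selected <- counties_toy %>%
  select(state, county, population, poverty)

names(selected)
```

Here public_work is dropped because it is not named in select(); the number of rows is unchanged.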

arrange(): sort rows

From counties, select private_work, public_work, and self_employed into a new dataset named counties_selected, then sort that dataset by public_work in descending order.

counties_selected <- counties %>%
  select(private_work, public_work, self_employed)
# Add a verb to sort in descending order of public_work
counties_selected %>% 
 arrange(desc(public_work))
# A tibble: 3,138 x 3
   private_work public_work self_employed
          <dbl>       <dbl>         <dbl>
 1         25          64.1          10.9
 2         33.3        61.7           5.1
 3         36.8        59.1           3.7
 4         32.9        56.8          10.2
 5         34.4        55             9.8
 6         42.2        51.6           6.1
 7         42.6        50.5           6.8
 8         48.4        49.5           1.8
 9         34.9        49.2          14.7
10         51.9        48.1           0  
# ... with 3,128 more rows
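
To make the effect of desc() visible, here is a small sketch that sorts the same toy data both ways. The tibble work_toy is a hypothetical stand-in whose three rows are copied from the output above, deliberately stored out of order:

```r
library(dplyr)

# Toy data (hypothetical name); rows copied from the output above, shuffled
work_toy <- tibble(
  private_work  = c(33.3, 36.8, 25),
  public_work   = c(61.7, 59.1, 64.1),
  self_employed = c(5.1, 3.7, 10.9)
)

# arrange() sorts in ascending order by default
ascending <- work_toy %>%
  arrange(public_work)

# Wrapping the column in desc() reverses the order
descending <- work_toy %>%
  arrange(desc(public_work))

ascending$public_work
descending$public_work
```

Whole rows move together, so private_work and self_employed stay attached to their public_work value.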

filter(): keep rows that match a condition

Keep the rows where population is greater than 1000000.

counties_selected <- counties %>%
  select(state, county, population)
# Filter for counties with a population above 1000000
counties_selected %>% filter(population>1000000)
# A tibble: 41 x 3
   state      county         population
   <chr>      <chr>               <dbl>
 1 Arizona    Maricopa          4018143
 2 California Alameda           1584983
 3 California Contra Costa      1096068
 4 California Los Angeles      10038388
 5 California Orange            3116069
 6 California Riverside         2298032
 7 California Sacramento        1465832
 8 California San Bernardino    2094769
 9 California San Diego         3223096
10 California Santa Clara       1868149
# ... with 31 more rows

Keep the rows where state is California and population is greater than 1000000.

counties_selected %>%
  filter(state=="California",population>1000000)
# A tibble: 9 x 3
  state      county         population
  <chr>      <chr>               <dbl>
1 California Alameda           1584983
2 California Contra Costa      1096068
3 California Los Angeles      10038388
4 California Orange            3116069
5 California Riverside         2298032
6 California Sacramento        1465832
7 California San Bernardino    2094769
8 California San Diego         3223096
9 California Santa Clara       1868149
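
Conditions separated by commas inside filter() are combined with a logical AND, so a row must satisfy all of them to be kept. A short sketch that checks this against the explicit & form, assuming a hypothetical toy tibble pop_toy with four rows copied from the outputs in this chapter:

```r
library(dplyr)

# Toy data (hypothetical name); rows copied from outputs in this chapter
pop_toy <- tibble(
  state      = c("Alabama", "Arizona", "California", "California"),
  county     = c("Autauga", "Maricopa", "Alameda", "Los Angeles"),
  population = c(55221, 4018143, 1584983, 10038388)
)

# Comma-separated conditions: both must be TRUE for a row to be kept
with_comma <- pop_toy %>%
  filter(state == "California", population > 1000000)

# The same filter written with an explicit logical AND
with_and <- pop_toy %>%
  filter(state == "California" & population > 1000000)

with_comma$county
with_and$county
```

Maricopa passes the population condition but fails the state condition, so it is dropped in both versions.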

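Combining select(), filter(), and arrange()

Select several variables into a new dataset, then keep the rows where state is Texas and population is greater than 10000, and finally sort by private_work in descending order.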

counties_selected <- counties %>%
  select(state, county, population, private_work, public_work, self_employed)

# Filter for Texas and more than 10000 people; sort in descending order of private_work
counties_selected %>%
  filter(state=="Texas", population>10000) %>%
  arrange(desc(private_work))

# A tibble: 169 x 6
   state county  population private_work public_work self_employed
   <chr> <chr>        <dbl>        <dbl>       <dbl>         <dbl>
 1 Texas Gregg       123178         84.7         9.8           5.4
 2 Texas Collin      862215         84.1        10             5.8
 3 Texas Dallas     2485003         83.9         9.5           6.4
 4 Texas Harris     4356362         83.4        10.1           6.3
 5 Texas Andrews      16775         83.1         9.6           6.8
 6 Texas Tarrant    1914526         83.1        11.4           5.4
 7 Texas Titus        32553         82.5        10             7.4
 8 Texas Denton      731851         82.2        11.9           5.7
 9 Texas Ector       149557         82          11.2           6.7
10 Texas Moore        22281         82          11.7           5.9
# ... with 159 more rows
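
The intermediate counties_selected object is optional: the three verbs can also be chained in one pipeline. A sketch under stated assumptions: texas_toy is a hypothetical stand-in holding four Texas rows copied from the output above, and the population threshold is raised to 100000 (a toy choice, not the course's value) so that the filter actually removes rows from such a small sample:

```r
library(dplyr)

# Toy data (hypothetical name); four Texas rows copied from the output above
texas_toy <- tibble(
  state         = c("Texas", "Texas", "Texas", "Texas"),
  county        = c("Moore", "Collin", "Andrews", "Gregg"),
  population    = c(22281, 862215, 16775, 123178),
  private_work  = c(82, 84.1, 83.1, 84.7),
  public_work   = c(11.7, 10, 9.6, 9.8),
  self_employed = c(5.9, 5.8, 6.8, 5.4)
)

# select -> filter -> arrange in a single chain.
# The 100000 threshold is a toy value chosen for this small sample.
result <- texas_toy %>%
  select(state, county, population, private_work) %>%
  filter(state == "Texas", population > 100000) %>%
  arrange(desc(private_work))

result$county
```

Each verb receives the output of the previous one, so filter() can only use columns that select() kept.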

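mutate(): create new variables

Select variables into a new dataset, then compute population * public_work / 100 and store the result in a new variable named public_workers. Because public_work is a percentage, the result is the estimated number of people employed in public work.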

counties_selected <- counties %>%
  select(state, county, population, public_work)
# Add a new column public_workers with the number of people employed in public work
counties_selected %>% 
  mutate(public_workers = population*public_work/100)  
# A tibble: 3,138 x 5
   state   county   population public_work public_workers
   <chr>   <chr>         <dbl>       <dbl>          <dbl>
 1 Alabama Autauga       55221        20.9         11541.
 2 Alabama Baldwin      195121        12.3         24000.
 3 Alabama Barbour       26932        20.8          5602.
 4 Alabama Bibb          22604        16.1          3639.
 5 Alabama Blount        57710        13.5          7791.
 6 Alabama Bullock       10678        15.1          1612.
 7 Alabama Butler        20354        16.2          3297.
 8 Alabama Calhoun      116648        20.8         24263.
 9 Alabama Chambers      34079        12.1          4124.
10 Alabama Cherokee      26008        18.5          4811.
# ... with 3,128 more rows
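
The printed public_workers values are shortened for display, so it can help to redo the arithmetic on a few rows. A minimal sketch, assuming a hypothetical toy tibble alabama_toy with three rows copied from the output above; the extra column public_workers_rounded is an illustration that is not part of the course exercise:

```r
library(dplyr)

# Toy data (hypothetical name); rows copied from the output above
alabama_toy <- tibble(
  state       = c("Alabama", "Alabama", "Alabama"),
  county      = c("Autauga", "Baldwin", "Barbour"),
  population  = c(55221, 195121, 26932),
  public_work = c(20.9, 12.3, 20.8)
)

with_workers <- alabama_toy %>%
  mutate(
    # public_work is a percentage, so divide by 100 to get a head count
    public_workers = population * public_work / 100,
    # a column created earlier in the same mutate() call can be reused
    public_workers_rounded = round(public_workers)
  )

with_workers$public_workers
with_workers$public_workers_rounded
```

For Autauga, 55221 * 20.9 / 100 is 11541.189, which matches the 11541. shown in the tibble output.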


Article link: https://www.haomeiwen.com/subject/podtpltx.html