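Data Manipulation with dplyr
Course outline
Chapter 1. Transforming Data
Chapter 2. Aggregating Data
Chapter 3. Selecting and Transforming Data
Chapter 4. Case Study
Chapter 1. Transforming Data
This chapter is mostly a review: most of the material already appeared in earlier courses.
Common verbs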
select()
filter()
arrange()
mutate()
Selecting columns with select()
From the counties data, select the columns state, county, population, and poverty.
counties %>%
  select(state, county, population, poverty)
# A tibble: 3,138 x 4
state county population poverty
<chr> <chr> <dbl> <dbl>
1 Alabama Autauga 55221 12.9
2 Alabama Baldwin 195121 13.4
3 Alabama Barbour 26932 26.7
4 Alabama Bibb 22604 16.8
5 Alabama Blount 57710 16.7
6 Alabama Bullock 10678 24.6
7 Alabama Butler 20354 25.4
8 Alabama Calhoun 116648 20.5
9 Alabama Chambers 34079 21.6
10 Alabama Cherokee 26008 19.2
# ... with 3,128 more rows
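The counties data set is not included in these notes, so here is a minimal sketch of the same select() call on a small hand-built tibble. The name counties_demo is hypothetical, and its three rows are copied from the output shown in this chapter. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical stand-in for counties: three rows copied from this chapter's output,
# with one extra column (public_work) so that select() has something to drop.
counties_demo <- tibble(
  state = c("Alabama", "Alabama", "Alabama"),
  county = c("Autauga", "Baldwin", "Barbour"),
  population = c(55221, 195121, 26932),
  poverty = c(12.9, 13.4, 26.7),
  public_work = c(20.9, 12.3, 20.8)
)

# Keep only the four named columns, in the order they are listed.
selected <- counties_demo %>%
  select(state, county, population, poverty)

names(selected)
```

select() keeps every row and only changes which columns are present, which is why the row count in the output above is still 3,138.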
Sorting data with arrange()
From counties, select private_work, public_work, and self_employed into a new data set called counties_selected, then sort it in descending order of public_work.
counties_selected <- counties %>%
  select(private_work, public_work, self_employed)
# Sort in descending order of public_work
counties_selected %>%
  arrange(desc(public_work))
# A tibble: 3,138 x 3
private_work public_work self_employed
<dbl> <dbl> <dbl>
1 25 64.1 10.9
2 33.3 61.7 5.1
3 36.8 59.1 3.7
4 32.9 56.8 10.2
5 34.4 55 9.8
6 42.2 51.6 6.1
7 42.6 50.5 6.8
8 48.4 49.5 1.8
9 34.9 49.2 14.7
10 51.9 48.1 0
# ... with 3,128 more rows
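arrange() sorts in ascending order by default, and wrapping a column in desc() reverses that. The sketch below shows both on a toy tibble (work_demo is a hypothetical name; the three rows are taken from the output above and deliberately shuffled).

```r
library(dplyr)

# Hypothetical toy data: the first three rows of the output above, shuffled.
work_demo <- tibble(
  private_work = c(33.3, 25, 36.8),
  public_work = c(61.7, 64.1, 59.1),
  self_employed = c(5.1, 10.9, 3.7)
)

# Default: smallest public_work first.
ascending <- work_demo %>%
  arrange(public_work)

# desc(): largest public_work first, as in the example above.
descending <- work_demo %>%
  arrange(desc(public_work))

ascending$public_work
descending$public_work
```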
Filtering rows by condition with filter()
Keep only the rows where population is greater than 1,000,000.
counties_selected <- counties %>%
  select(state, county, population)
# Filter for counties with a population above 1000000
counties_selected %>%
  filter(population > 1000000)
# A tibble: 41 x 3
state county population
<chr> <chr> <dbl>
1 Arizona Maricopa 4018143
2 California Alameda 1584983
3 California Contra Costa 1096068
4 California Los Angeles 10038388
5 California Orange 3116069
6 California Riverside 2298032
7 California Sacramento 1465832
8 California San Bernardino 2094769
9 California San Diego 3223096
10 California Santa Clara 1868149
# ... with 31 more rows
Keep only the rows where state is California and population is greater than 1,000,000.
counties_selected %>%
  filter(state == "California", population > 1000000)
# A tibble: 9 x 3
state county population
<chr> <chr> <dbl>
1 California Alameda 1584983
2 California Contra Costa 1096068
3 California Los Angeles 10038388
4 California Orange 3116069
5 California Riverside 2298032
6 California Sacramento 1465832
7 California San Bernardino 2094769
8 California San Diego 3223096
9 California Santa Clara 1868149
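When filter() receives several conditions separated by commas, a row is kept only if all of them are true, which is the same as joining them with &. The sketch below checks this on a toy tibble. The name filter_demo is hypothetical, and the example filters on Texas rather than California because the rows shown in this chapter include a Texas county below the threshold (Gregg), so both conditions visibly matter.

```r
library(dplyr)

# Hypothetical toy data built from rows that appear in this chapter's outputs.
filter_demo <- tibble(
  state = c("Arizona", "Texas", "Texas", "Texas"),
  county = c("Maricopa", "Gregg", "Dallas", "Harris"),
  population = c(4018143, 123178, 2485003, 4356362)
)

# Comma-separated conditions: both must hold.
with_comma <- filter_demo %>%
  filter(state == "Texas", population > 1000000)

# The same filter written with an explicit logical AND.
with_and <- filter_demo %>%
  filter(state == "Texas" & population > 1000000)

with_comma$county
with_and$county
```

Maricopa fails the state condition and Gregg fails the population condition, so only Dallas and Harris remain in both results.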
Combining select(), filter(), and arrange()
Select several variables into a new data set, keep the rows where state is Texas and population is greater than 10,000, and finally sort by private_work in descending order.
counties_selected <- counties %>%
  select(state, county, population, private_work, public_work, self_employed)
# Filter for Texas and more than 10000 people; sort in descending order of private_work
counties_selected %>%
  filter(state == "Texas", population > 10000) %>%
  arrange(desc(private_work))
# A tibble: 169 x 6
state county population private_work public_work self_employed
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Texas Gregg 123178 84.7 9.8 5.4
2 Texas Collin 862215 84.1 10 5.8
3 Texas Dallas 2485003 83.9 9.5 6.4
4 Texas Harris 4356362 83.4 10.1 6.3
5 Texas Andrews 16775 83.1 9.6 6.8
6 Texas Tarrant 1914526 83.1 11.4 5.4
7 Texas Titus 32553 82.5 10 7.4
8 Texas Denton 731851 82.2 11.9 5.7
9 Texas Ector 149557 82 11.2 6.7
10 Texas Moore 22281 82 11.7 5.9
# ... with 159 more rows
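The pipe passes the result on its left as the first argument of the call on its right, so a chain of verbs is equivalent to nested function calls read from the inside out. The sketch below compares the two forms on a toy tibble. The name texas_demo is hypothetical, the rows are copied from the output above, and the population threshold is raised to 100,000 (an assumption made only for this example) so that the filter actually drops a row.

```r
library(dplyr)

# Hypothetical toy data: four Texas rows from the output above, shuffled.
texas_demo <- tibble(
  state = "Texas",
  county = c("Andrews", "Dallas", "Gregg", "Collin"),
  population = c(16775, 2485003, 123178, 862215),
  private_work = c(83.1, 83.9, 84.7, 84.1)
)

# Piped form: read top to bottom.
piped <- texas_demo %>%
  filter(state == "Texas", population > 100000) %>%
  arrange(desc(private_work))

# Nested form: the same steps, read from the inside out.
nested <- arrange(
  filter(texas_demo, state == "Texas", population > 100000),
  desc(private_work)
)

piped$county
nested$county
```

Both forms drop Andrews and return the remaining counties ordered by private_work; the piped form is usually easier to read once more than two verbs are involved.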
Creating new variables with mutate()
Select a few variables into a new data set, then create a new variable named public_workers, computed as population * public_work / 100.
counties_selected <- counties %>%
  select(state, county, population, public_work)
# Add a new column public_workers with the number of people employed in public work
counties_selected %>%
  mutate(public_workers = population * public_work / 100)
# A tibble: 3,138 x 5
state county population public_work public_workers
<chr> <chr> <dbl> <dbl> <dbl>
1 Alabama Autauga 55221 20.9 11541.
2 Alabama Baldwin 195121 12.3 24000.
3 Alabama Barbour 26932 20.8 5602.
4 Alabama Bibb 22604 16.1 3639.
5 Alabama Blount 57710 13.5 7791.
6 Alabama Bullock 10678 15.1 1612.
7 Alabama Butler 20354 16.2 3297.
8 Alabama Calhoun 116648 20.8 24263.
9 Alabama Chambers 34079 12.1 4124.
10 Alabama Cherokee 26008 18.5 4811.
# ... with 3,128 more rows
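The division by 100 is there because public_work is a percentage, so population * public_work / 100 gives a head count. The sketch below reproduces the first three rows of the output above on a toy tibble (mutate_demo is a hypothetical name; the values are copied from that output).

```r
library(dplyr)

# Hypothetical toy data: the first three rows of the output above.
mutate_demo <- tibble(
  state = "Alabama",
  county = c("Autauga", "Baldwin", "Barbour"),
  population = c(55221, 195121, 26932),
  public_work = c(20.9, 12.3, 20.8)
)

# public_work is a percentage, so divide by 100 to get a number of people.
with_workers <- mutate_demo %>%
  mutate(public_workers = population * public_work / 100)

# Rounded to whole people for comparison with the printed output above.
round(with_workers$public_workers)
```

For Autauga this is 55221 * 20.9 / 100 = 11541.189, which the tibble printout above shows as 11541.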










