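Series, Lesson 6: apply, groupby, window
This lesson covers how to use the apply, grouping, and window methods in pandas.
apply, grouping, and window methods
- Series.apply()
- Series.agg()
- Series.aggregate()
- Series.transform()
- Series.map()
- Series.groupby()
- Series.rolling()
- Series.expanding()
- Series.pipe()
Details
First, create a Series. The examples assume pandas is imported as pd and NumPy as np.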
In [4]: s = pd.Series([1, 2, 3, None, 5, None, None, None, 9])
In [5]: s
Out[5]:
0 1.0
1 2.0
2 3.0
3 NaN
4 5.0
5 NaN
6 NaN
7 NaN
8 9.0
dtype: float64
1. Series.apply()
Series.apply(func, convert_dtype=True, args=(), **kwds)
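Calls func on each value of the Series.
Common parameters:
- func: Python function or NumPy ufunc to apply
- convert_dtype: bool, default True. Whether to try to find a better dtype for the results of func; if False, the result is left as dtype=object
- args: tuple. Positional arguments passed to func after the Series value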
In [6]: s.apply(lambda x: x ** 2)
Out[6]:
0 1.0
1 4.0
2 9.0
3 NaN
4 25.0
5 NaN
6 NaN
7 NaN
8 81.0
dtype: float64
In [7]: def subtract_custom_value(x, custom_value):
   ...:     return x - custom_value
In [8]: s.apply(subtract_custom_value, args=(5,))
Out[8]:
0 -4.0
1 -3.0
2 -2.0
3 NaN
4 0.0
5 NaN
6 NaN
7 NaN
8 4.0
dtype: float64
In [9]: s.apply(np.log)
Out[9]:
0 0.000000
1 0.693147
2 1.098612
3 NaN
4 1.609438
5 NaN
6 NaN
7 NaN
8 2.197225
dtype: float64
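As a minimal sketch of the calls above in script form: apply hands each value to func, followed by anything supplied through args or as keyword arguments. The helper scale_and_shift and its parameters are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, None, 5, None, None, None, 9])

# Hypothetical helper: the first argument is the element,
# the remaining ones come from args= and from keyword arguments.
def scale_and_shift(x, factor, offset=0):
    return x * factor + offset

result = s.apply(scale_and_shift, args=(2,), offset=1)

# For simple arithmetic, the vectorized expression gives the same values
# without calling a Python function once per element.
vectorized = s * 2 + 1

print(result.equals(vectorized))
```

For plain arithmetic the vectorized form is usually the better choice; apply is most useful when the logic cannot be written as a vectorized expression.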
2. Series.agg()
Series.agg(func=None, axis=0, *args, **kwargs)
Aggregates using one or more operations over the specified axis.
Common parameters:
- func: function, str, list or dict. Accepted forms:
  - function
  - string function name
  - list of functions and/or function names, e.g. [np.sum, 'mean']
  - dict of axis labels -> functions, function names or list of such
In [11]: s.agg('min')
Out[11]: 1.0
In [12]: s.agg(['min', 'max'])
Out[12]:
min 1.0
max 9.0
dtype: float64
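A short sketch of several aggregations in one call, using the same s as above. It also shows that missing values are skipped by these reductions.

```python
import pandas as pd

s = pd.Series([1, 2, 3, None, 5, None, None, None, 9])

# A list of function names returns a Series with one entry per aggregation.
summary = s.agg(['min', 'max', 'mean', 'count'])
print(summary)

# NaN values are skipped: only the 5 non-missing values are counted,
# so the mean is (1 + 2 + 3 + 5 + 9) / 5 = 4.0.
```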
3. Series.aggregate()
Series.aggregate(func=None, axis=0, *args, **kwargs)
Identical to Series.agg(); agg is an alias for aggregate.
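A quick check, as a sketch, that the two spellings behave the same on the example Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, None, 5, None, None, None, 9])

# Both names should produce the same summary.
via_agg = s.agg(['min', 'max'])
via_aggregate = s.aggregate(['min', 'max'])

print(via_agg.equals(via_aggregate))
```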
4. Series.transform()
Series.transform(func, axis=0, *args, **kwargs)
Calls func on the Series and returns a result with the same length as the original Series.
Common parameters:
- func: function, str, list or dict. Accepted forms:
  - function
  - string function name
  - list of functions and/or function names, e.g. [np.sum, 'mean']
  - dict of axis labels -> functions, function names or list of such
In [13]: s.transform([np.sqrt, np.exp])
Out[13]:
sqrt exp
0 1.000000 2.718282
1 1.414214 7.389056
2 1.732051 20.085537
3 NaN NaN
4 2.236068 148.413159
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 3.000000 8103.083928
For element-wise functions like these, and in the pandas version used here, s.transform([np.sqrt, np.exp]) gives the same result as s.apply([np.sqrt, np.exp]) and s.agg([np.sqrt, np.exp]).
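What sets transform apart is the length guarantee. The sketch below, which assumes a reasonably recent pandas, shows that a reduction such as 'sum' is rejected by transform but accepted by agg.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, None, 5, None, None, None, 9])

# transform returns a result with the same length as the input.
roots = s.transform(np.sqrt)
print(len(roots) == len(s))

# A function that collapses the Series to a single value is rejected...
try:
    s.transform('sum')
except ValueError as exc:
    print('transform rejected the reduction:', exc)

# ...whereas agg accepts it.
total = s.agg('sum')
print(total)
```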
5. Series.map()
Series.map(arg, na_action=None)
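Replaces each value in the Series with the value that map assigns to it; in other words, it maps values.
map accepts a function, a dict, or a Series. With a dict, values that have no matching key become NaN, unless the dict supplies a default value (e.g. defaultdict).
Common parameters:
- na_action: {None, 'ignore'}, default None. If 'ignore', NaN values are propagated as NaN without being passed to the mapping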
In [17]: s.map({1.0: 'a'})
Out[17]:
0 a
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
dtype: object
# NaN values are mapped too
In [20]: s.map('I am a {}'.format)
Out[20]:
0 I am a 1.0
1 I am a 2.0
2 I am a 3.0
3 I am a nan
4 I am a 5.0
5 I am a nan
6 I am a nan
7 I am a nan
8 I am a 9.0
dtype: object
# NaN values are not mapped
In [21]: s.map('I am a {}'.format, na_action='ignore')
Out[21]:
0 I am a 1.0
1 I am a 2.0
2 I am a 3.0
3 NaN
4 I am a 5.0
5 NaN
6 NaN
7 NaN
8 I am a 9.0
dtype: object
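The three kinds of dict-like mapping described above can be sketched side by side. The labels 'one', 'two', and 'other' are arbitrary values chosen for illustration.

```python
from collections import defaultdict

import pandas as pd

s = pd.Series([1, 2, 3, None, 5, None, None, None, 9])

# A plain dict leaves values without a matching key as NaN.
partial = s.map({1.0: 'one', 2.0: 'two'})

# A defaultdict supplies its default for values missing from its keys.
with_default = s.map(defaultdict(lambda: 'other', {1.0: 'one', 2.0: 'two'}))

# A Series works like a dict: its index holds the keys.
lookup = pd.Series(['one', 'two'], index=[1.0, 2.0])
from_series = s.map(lookup)

print(pd.DataFrame({'dict': partial,
                    'defaultdict': with_default,
                    'series': from_series}))
```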
6. Series.groupby()
Series.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=<object object>, observed=False, dropna=True)
Groups the Series.
Common parameters:
- by: mapping, function, label, or list of labels. Determines how the groups are formed
In [27]: s.groupby(['a', 'b', 'a', 'b', 'a', 'a', 'b', 'b', 'a']).mean()
Out[27]:
a 4.5
b 2.0
dtype: float64
In [28]: s.groupby(s>3).mean()
Out[28]:
False 2.0
True 7.0
dtype: float64
# The last key is NaN; with the default dropna=True that row (value 9) is left out of every group
In [29]: s.groupby(['a', 'b', 'a', 'b', 'a', 'a', 'b', 'b', np.nan]).mean()
Out[29]:
a 3.0
b 2.0
dtype: float64
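The dropna parameter in the signature controls what happens to NaN keys. As a sketch, assuming a pandas version that supports dropna (1.1 or later), the two settings compare like this:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, None, 5, None, None, None, 9])
keys = ['a', 'b', 'a', 'b', 'a', 'a', 'b', 'b', np.nan]

# Default dropna=True: the row whose key is NaN (value 9) belongs to no group.
dropped = s.groupby(keys).mean()

# dropna=False keeps a separate group for the NaN key.
kept = s.groupby(keys, dropna=False).mean()

print(dropped)
print(kept)
```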
# groupby works the same way on a DataFrame, here grouping by column A
In [43]: df = pd.DataFrame({'A':['a', 'a', 'b'], 'B':[1, 2, 3]})
In [44]: df
Out[44]:
A B
0 a 1
1 a 2
2 b 3
In [45]: df.groupby('A').sum()
Out[45]:
B
A
a 3
b 3
7. Series.rolling()
Series.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None)
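Rolling window calculations.
Common parameters:
- window: int, offset, or BaseIndexer subclass. Size of the moving window
- min_periods: int, default None. Minimum number of observations in the window required to have a value; otherwise the result is NA. For a window specified by an offset, min_periods defaults to 1; otherwise it defaults to the size of the window
# With window=2, min_periods defaults to 2, so any window that contains a NaN yields NaN
In [50]: s.rolling(2).sum()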
Out[50]:
0 NaN
1 3.0
2 5.0
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
dtype: float64
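Lowering min_periods changes which windows produce a value. A small sketch on the same data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, None, 5, None, None, None, 9])

# Default for an integer window: min_periods equals the window size (2 here),
# so a window needs two non-missing values to produce a result.
strict = s.rolling(2).sum()

# min_periods=1: one non-missing value in the window is enough;
# only windows that are entirely NaN stay NaN.
relaxed = s.rolling(2, min_periods=1).sum()

print(pd.DataFrame({'strict': strict, 'relaxed': relaxed}))
```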
8. Series.expanding()
Series.expanding(min_periods=1, center=None, axis=0)
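Expanding window calculations.
Similar to rolling(), except that expanding() takes no window size: the window starts at the first value and grows by one row at each step.
Common parameters:
- min_periods: int, default 1. Minimum number of observations in the window required to have a value; otherwise the result is NA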
In [54]: s.expanding(min_periods=2).sum()
Out[54]:
0 NaN
1 3.0
2 6.0
3 6.0
4 11.0
5 11.0
6 11.0
7 11.0
8 20.0
dtype: float64
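Because the window must contain at least two observations, the first result is NaN. From then on the window keeps growing downward, and NaN values are skipped in the sum.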
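For comparison, a sketch with the default min_periods=1, plus a second reduction over the same growing window:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, None, 5, None, None, None, 9])

# With the default min_periods=1 a single observation is enough,
# so the first row already has a value.
running_sum = s.expanding().sum()

# Other reductions work on the growing window too, for example the mean
# of all non-missing values seen so far.
running_mean = s.expanding().mean()

print(pd.DataFrame({'sum': running_sum, 'mean': running_mean}))
```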
9. Series.pipe()
Series.pipe(func, *args, **kwargs)
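Applies func(self, *args, **kwargs): func receives the whole Series as its first argument, not the individual values.
Common parameters:
- func: function. The function to apply; args and kwargs are passed on to it
# Add 1 to every value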
In [59]: s.pipe(lambda x: x+1)
Out[59]:
0 2.0
1 3.0
2 4.0
3 NaN
4 6.0
5 NaN
6 NaN
7 NaN
8 10.0
dtype: float64
# Chained calls
In [61]: s.pipe(lambda x: x+1).pipe(lambda x: x+1)
Out[61]:
0 3.0
1 4.0
2 5.0
3 NaN
4 7.0
5 NaN
6 NaN
7 NaN
8 11.0
dtype: float64
# Pass an argument: add 10
In [62]: s.pipe(lambda x, y: x+y, y=10)
Out[62]:
0 11.0
1 12.0
2 13.0
3 NaN
4 15.0
5 NaN
6 NaN
7 NaN
8 19.0
dtype: float64
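pipe is most useful with named functions that take a Series and return one. The sketch below uses two hypothetical helpers, fill_missing and scale, which are not part of pandas.

```python
import pandas as pd

s = pd.Series([1, 2, 3, None, 5, None, None, None, 9])

# Hypothetical helpers; each takes the whole Series as its first argument.
def fill_missing(series, value):
    return series.fillna(value)

def scale(series, factor=1):
    return series * factor

# Extra positional and keyword arguments are forwarded to the function.
result = s.pipe(fill_missing, 0).pipe(scale, factor=10)

# Same as the nested call, but the steps read left to right.
nested = scale(fill_missing(s, 0), factor=10)

print(result.equals(nested))
```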
Keep at it, even on weekends ✊ ✊ ✊!!!










