pandas练习

作者: 小T数据站 | 来源:发表于2019-01-27 17:58 被阅读0次

爬虫分析之数据存储——基于MySQL,Scrapy
Pandas练习
pandas练习
pandas练习
pandas实例-总结
pandas实例-Visualization-Titanic_D
pandas实例-Visualization-Scores
pandas实例-merge-House Market
pandas实例-Stats-US_Baby_Names
pandas实例-Stats-Wind Statistics

https://github.com/ajcr/100-pandas-puzzles/blob/master/100-pandas-puzzles.ipynb

导入pandas

将pandas以pd的名字引入

import pandas

查看导入的pandas的版本

pd.version

打印出pandas库需要的库的所有版本信息

pd.show_versions()

数据库基础
一些基础的选择、排序、添加、聚合数据的方法

import numpy as  np
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

创建索引为labels的数据框df

df = pd.DataFrame(data=data,index=labels)

展示df基础信息和它的数据

df.describe()

返回df的前三行

df.head(3)

只选择 'animal' 和'age' 列

df[['animal','age']]

选择['animal', 'age']的 [3, 4, 8]行

df.loc[df.index[[3,4,8]],['animal','age']]

选择visits大于 2的数据

df.query('visits > 2')

选择有缺失值的行, i.e. is NaN.

df[df['age'].isnull()]

选择 animal 是cat和age小于3.

df.query('animal=="cat" and age < 3') # 单=是赋值，双==才是判别

选择 age is between 2 and 4 (inclusive)的行

df.query('age>= 2 and age<=4')

将age里的'f'改成1.5.

df.loc['f','age'] = 1.5

计算visits的和 (the total number of visits).

df['visits'].sum()

计算每一个动物的平均年龄

df.groupby(by='animal')['age'].mean()
16.添加新的一行 'k' ，数据自己填，然后再将此行删除返回原数据框
df.loc['k'] = [5,'dog',2,'no']
df = df.drop('k')
df

对每个动物的数量进行计数

df['animal'].value_counts()

Sort df first by the values in the 'age' in decending order, then by the value in the 'visit' column in ascending order.

df.sort_values(by=['age','visits'],ascending=[False,True])

The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be True and 'no' should be False.

df['priority'] = df['priority'].map({'yes': True, 'no': False})
df

In the 'animal' column, change the 'snake' entries to 'python'.

df['animal']=df['animal'].replace('snake','python')
df

For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (hint: use a pivot table).

df.pivot_table(values='age',index='animal',columns='visits',aggfunc='mean')

数据框：超越基础
你可能需要结合使用两个或者更多的方法去得到正确的答案

You have a DataFrame df with a column 'A' of integers. For example:
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
How do you filter out rows which contain the same integer as the row immediately above?

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df.drop_duplicates(subset='A',keep='first')

Given a DataFrame of numeric values, say
df = pd.DataFrame(np.random.random(size=(5, 3))) # a 5x3 frame of float values
how do you subtract the row mean from each element in the row?

df = pd.DataFrame(np.random.random(size=(5, 3)))
df.sub(df.mean(axis=1),axis=0)

Suppose you have DataFrame with 10 columns of real numbers, for example:
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
Which column of numbers has the smallest sum? (Find that column's label.)

df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
df.sum().idxmin()

How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?

len(df.drop_duplicates(keep=False))

You have a DataFrame that consists of 10 columns of floating--point numbers. Suppose that exactly 5 entries in each row are NaN values. For each row of the DataFrame, find the column which contains the third NaN value.
(You should return a Series of column labels.)

(df.isnill().cumsum(axis=1)==3).idxmax(axis=1)

A DataFrame has a column of groups 'grps' and and column of numbers 'vals'. For example:
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
For each group, find the sum of the three greatest values.

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
df.groupby(by='grps')['vals'].nlargest(3).sum(level=0)

A DataFrame has two integer columns 'A' and 'B'. The values in 'A' are between 1 and 100 (inclusive). For each group of 10 consecutive integers in 'A' (i.e. (0, 10], (10, 20], ...), calculate the sum of the corresponding values in column 'B'.