Introduce
前几周lab主要在网站上完成一些基本函数操作,在这里记录一些笔记,便于查阅。
Reading Data
#查看有哪些文件在数据库中
file()
#通过read()读取数据 赋值给data
data = read('ks-projects-201801.csv')
#读取data的前五行数据
head(data,5)

# in Python, getting 1 column is used by indexing[] and ["..."]
head (data['id'])
# describe 函数是对data的一个summary
#Describe gives you information about the:
#Data Type (float? 1.002, - decimals), (int? integer 1,2,3), (object? string)
#Missing% (how much of the data is NAN values - missing)
#Nunique (how many unique values are in the data - [1,1,1,2,2,3] has 3 unique values == [1,2,3]
#Top3 (the top 3 values which are the most occuring)
#Min (minimum value in the data)
#Mean (average value in the data)
#Median (the middle value in the data)
#Max (the maximum value in the data)
#Sample (an example of what the data looks like
describe(data)

kuckstarter Data
Our aim for the Kickstarter data is to see how the STATE (failed, successful) is correlated with the other variables.
This means, first we need to check the STATE column.
Note - all datasets in real life are very corrupt. In fact, there is a statistic which says that over 80% of a data scientists' job is just data cleaning and data manipulation.
First, let just inspect the STATE column.
Since it's over 300,000 rows, it's infeasible to check all rows.
Let's use the TALLY command. This counts how many occurences of each unique value is in the data.
If you want more information on a function, type HELP
help(tally)

#统计data数据下state的情况,
tally(data['state'])

First, let us see how the GOAL relates to PERCENTAGE SUCCESS.
Using TABULATE, we can summarise a dataset's information into a table. The goal is to find the AVERAGE GOAL per STATE.
Let us check what TABULATE does after the code below usin HELP
tabulate(data['state'], data['goal'], method = 'mean')

help(tabulate)

goal = tabulate(data['state'], data['goal'], method = 'mean')
#x的名字是state,y的名字叫mean,数据源来自于goal,默认画图是折线图
plot(x = 'state', y = 'mean', data = goal)

#和上面一样,改变了图表的样式,柱状图
plot(x = 'state', y = 'mean', data = goal, style = 'barplot')
Clearly, a SUCCESSFUL campaign has a very very low GOAL.
This essentially means more expensive projects are more likely to fail.
You can see SUSPENDED has a very high GOAL. This is also not good, and is indicative of failure.
Let's check the country vs success rate.
Country is "country". Let us first inspect the column for data issues using the tally command.
Now, we use TABULATE again to summarise the country and success rate.
We want to know the success rate per country, so we use the PERCENT mode in TABULATE
tabulate(data['country'], data['state'], method = 'count_percent')

But, we only want the SUCCESSFUL %.
So, use QUERY, and filter == "successful" out.
country = tabulate(data['country'], data['state'], method = 'count_percent')
query(data = country, column = 'state', condition = '=="successful"')

country_success = query(data = country, column = 'state', condition = '=="successful"')
#Choose to plot ALL countries, so I placed top = 30. You can do top = 10000 ur choice,
# but the X Axis will be longer IF the data has 10000 rows
plot(data = country_success, x = 'country', y = 'Percent%', top = 30, style = 'barplot')
Clearly, US wins, with over 35% success rate. Next is Great Britain 35% ish. Worst is IT, JP (Japan and Italy) at around 15-17%

tabulate(data['currency'], data['state'], method = 'count_percent')

#only choose state =='successful'
query(data = currency, column = 'state', condition = '=="successful"')

Analysis of Time Data
head( year(data['launched']) , 10)

You can see we only got the years.
Now, do the same, check how much MAIN_CATEGORY has changed.
Which was most popular in 2014? 2015?
launched = tabulate(year(data['launched']), data['main_category'], method = 'count_percent')
head(launched)

Also it's a bit strange --> why is there 1970?
1970 is wrong data.
Let's remove it using QUERY and YEAR
data = query(data, column = year(data['launched']), condition = '>1970')
plot(data = launched, x = 'main_category', y = 'Percent%', colour = 'launched', style = 'barplot')

You can also instead of plotting multiple lines, split it into multiple plots using COLUMN instead of COLOUR
plot(data = launched, x = 'main_category', y = 'Percent%', column = 'launched', style = 'lineplot')

Finally, let's check what happens to the SUCCESS RATE over time. This time, let's not just use YEARS, but MONTH + YEARS.
Use the year_month command.
So, 2012 July (7th month) will be == 2012+7/12.
ps:我也不是很明白这样做的意义是什么
Processing and Using Dates
head( year_month(data['launched']) )

months = tabulate(year_month(data['launched']), data['state'], method = 'count_percent')
months_percent = query(data = months, column = 'state', condition = '=="successful"')
head(months_percent)

plot(data = months_percent, x = 'launched', y = 'Percent%', style = 'scatterplot')

Clearly, % Success have been decreasing over the years!
Let's draw a REGRESSION LINE to forecast what might happen! Use regplot in STYLE
plot(data = months_percent, x = 'launched', y = 'Percent%', style = 'regplot')

We can see a clearer trend if we draw a CUBIC graph (degree 3)
plot(data = months_percent, x = 'launched', y = 'Percent%', style = 'regplot', power = 3)

网友评论