MARK5826 Week 2 lab

作者: GhostintheCode | 来源:发表于2018-08-17 08:56 被阅读0次

MARK5826 Week 2 lab
MARK5826 Week4 lab
Introduction to Data S
My Leetcode Solutions [week1 - w
代做Lab Exercises 语言程序、JSP编程作业代写留学
gcc题文章
The Lab(2)
Daily Memo &February 2020
2020年week 9,10,11,12,13 迟到的复盘
Daily Memo &April 2020

Introduce

前几周lab主要在网站上完成一些基本函数操作，在这里记录一些笔记，便于查阅。

Reading Data

#查看有哪些文件在数据库中
file()
#通过read（）读取数据 赋值给data
data = read('ks-projects-201801.csv')
#读取data的前五行数据
head(data,5)

head(data,5)

# in Python, getting 1 column is used by indexing[] and ["..."]
head (data['id'])
# describe 函数是对data的一个summary
#Describe gives you information about the:
#Data Type (float? 1.002, - decimals), (int? integer 1,2,3), (object? string)
#Missing% (how much of the data is NAN values - missing)
#Nunique (how many unique values are in the data - [1,1,1,2,2,3] has 3 unique values == [1,2,3]
#Top3 (the top 3 values which are the most occuring)
#Min (minimum value in the data)
#Mean (average value in the data)
#Median (the middle value in the data)
#Max (the maximum value in the data)
#Sample (an example of what the data looks like
describe(data)

describe(data)

kuckstarter Data

Our aim for the Kickstarter data is to see how the STATE (failed, successful) is correlated with the other variables.
This means, first we need to check the STATE column.
Note - all datasets in real life are very corrupt. In fact, there is a statistic which says that over 80% of a data scientists' job is just data cleaning and data manipulation.
First, let just inspect the STATE column.
Since it's over 300,000 rows, it's infeasible to check all rows.
Let's use the TALLY command. This counts how many occurences of each unique value is in the data.
If you want more information on a function, type HELP

help(tally)

help(tally)

#统计data数据下state的情况，
tally(data['state'])

tally(data['state'])

First, let us see how the GOAL relates to PERCENTAGE SUCCESS.
Using TABULATE, we can summarise a dataset's information into a table. The goal is to find the AVERAGE GOAL per STATE.
Let us check what TABULATE does after the code below usin HELP

tabulate(data['state'], data['goal'], method = 'mean')

average goal per state

help(tabulate)

tabulate

goal = tabulate(data['state'], data['goal'], method = 'mean')
#x的名字是state，y的名字叫mean，数据源来自于goal，默认画图是折线图
plot(x = 'state', y = 'mean', data = goal)

plot(x = 'state', y = 'mean', data = goal)

#和上面一样，改变了图表的样式，柱状图
plot(x = 'state', y = 'mean', data = goal, style = 'barplot')

Clearly, a SUCCESSFUL campaign has a very very low GOAL.
This essentially means more expensive projects are more likely to fail.
You can see SUSPENDED has a very high GOAL. This is also not good, and is indicative of failure.

Let's check the country vs success rate.
Country is "country". Let us first inspect the column for data issues using the tally command.

Now, we use TABULATE again to summarise the country and success rate.
We want to know the success rate per country, so we use the PERCENT mode in TABULATE

tabulate(data['country'], data['state'], method = 'count_percent')

tabulate(data['country'], data['state'], method = 'count_percent')

But, we only want the SUCCESSFUL %.
So, use QUERY, and filter == "successful" out.

country = tabulate(data['country'], data['state'], method = 'count_percent')
query(data = country, column = 'state', condition = '=="successful"')

只包括成功

country_success = query(data = country, column = 'state', condition = '=="successful"')
#Choose to plot ALL countries, so I placed top = 30. You can do top = 10000 ur choice,
# but the X Axis will be longer IF the data has 10000 rows
plot(data = country_success, x = 'country', y = 'Percent%', top = 30, style = 'barplot')

Clearly, US wins, with over 35% success rate. Next is Great Britain 35% ish. Worst is IT, JP (Japan and Italy) at around 15-17%

top

tabulate(data['currency'], data['state'], method = 'count_percent')

tabulate(data['currency'], data['state'], method = 'count_percent')

#only choose state =='successful'
query(data = currency, column = 'state', condition = '=="successful"')

only choose state =='successful'

Analysis of Time Data

head(   year(data['launched'])  , 10)

head( year(data['launched']) , 10)

You can see we only got the years.
Now, do the same, check how much MAIN_CATEGORY has changed.
Which was most popular in 2014? 2015?

launched = tabulate(year(data['launched']), data['main_category'], method = 'count_percent')
head(launched)

每一年的main_category的占比

Also it's a bit strange --> why is there 1970?
1970 is wrong data.
Let's remove it using QUERY and YEAR

data = query(data, column = year(data['launched']), condition = '>1970')

plot(data = launched, x = 'main_category', y = 'Percent%', colour = 'launched', style = 'barplot')

没有去掉数据之前的柱状图，数据源是launched

You can also instead of plotting multiple lines, split it into multiple plots using COLUMN instead of COLOUR

plot(data = launched, x = 'main_category', y = 'Percent%', column = 'launched', style = 'lineplot')

plot(data = launched, x = 'main_category', y = 'Percent%', column = 'launched', style = 'lineplot')

Finally, let's check what happens to the SUCCESS RATE over time. This time, let's not just use YEARS, but MONTH + YEARS.
Use the year_month command.
So, 2012 July (7th month) will be == 2012+7/12.
ps:我也不是很明白这样做的意义是什么

Processing and Using Dates

head(    year_month(data['launched'])  )

head( year_month(data['launched']) )

months = tabulate(year_month(data['launched']), data['state'], method = 'count_percent')
months_percent = query(data = months, column = 'state', condition = '=="successful"')
head(months_percent)

head(months_percent)

plot(data = months_percent, x = 'launched', y = 'Percent%', style = 'scatterplot')

plot(data = months_percent, x = 'launched', y = 'Percent%', style = 'scatterplot')

Clearly, % Success have been decreasing over the years!
Let's draw a REGRESSION LINE to forecast what might happen! Use regplot in STYLE

plot(data = months_percent, x = 'launched', y = 'Percent%', style = 'regplot')

plot(data = months_percent, x = 'launched', y = 'Percent%', style = 'regplot')

We can see a clearer trend if we draw a CUBIC graph (degree 3)

plot(data = months_percent, x = 'launched', y = 'Percent%', style = 'regplot', power = 3)

plot(data = months_percent, x = 'launched', y = 'Percent%', style = 'regplot', power = 3)

Questions:year_month form, and CUBIC graph why power =3?

网友评论

本文标题：MARK5826 Week 2 lab

本文链接：https://www.haomeiwen.com/subject/wprbbftx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！

MARK5826 Week 2 lab

Introduce

Reading Data

kuckstarter Data

Analysis of Time Data

Processing and Using Dates

Questions:year_month form, and CUBIC graph why power =3?

相关文章

MARK5826 Week 2 lab

MARK5826 Week4 lab

Introduction to Data S

My Leetcode Solutions [week1 - w

代做Lab Exercises 语言程序、JSP编程作业代写留学

gcc题文章

The Lab(2)

Daily Memo &February 2020

2020年week 9,10,11,12,13 迟到的复盘

Daily Memo &April 2020

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读