DS

Author: casey0808 | Published 2019-06-03 15:48
  1. Data analyst learning path?
  • Flask is super easy and used for a lot of API development in data engineering and for productionizing machine learning models.

  • Start: pick something you are interested in and solve a problem (it can be soccer betting for all I know).

    • Step 1: Learn web scraping using Python. This lets you obtain data on your own, so you can analyze any data source you wish. Learn Selenium, requests, and BeautifulSoup, and learn how to interact with APIs (a minimal scraping sketch follows this list).
    • Step 2: Learn how to take that data and build a SQL database around it. Make multiple trimmed-down tables that each hold one type of data and can be joined with the others to form a master view. E.g. a "human" database could have a demographics table, a school-history table, a work-experience table, etc. (a SQLite sketch also follows this list).
    • Step 3: After creating the database, use the data to produce an analysis. For example, analyze the "humans" to figure out the key components of what makes them successful (or whatever your data is about).
    • Step 4: Automate the data collection, transformation, upload, and analysis.
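
A minimal sketch of Step 1, assuming the data sits in an HTML table on some page; the URL and the selectors are placeholders to adapt to whatever source you pick:

```python
# Minimal Step 1 sketch: scrape a page with requests + BeautifulSoup.
# The URL and the CSS selectors are placeholders -- adapt them to your source.
import requests
from bs4 import BeautifulSoup

def scrape_table(url):
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    rows = []
    for tr in soup.select("table tr"):          # assumes the data lives in an HTML table
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows

if __name__ == "__main__":
    for row in scrape_table("https://example.com/stats"):  # placeholder URL
        print(row)
```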
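
And a minimal sketch of Steps 2–3, assuming SQLite and pandas; the table and column names are invented purely for illustration:

```python
# Minimal Step 2/3 sketch: a "human" database with trimmed-down tables in SQLite,
# then a joined analysis with pandas. Table and column names are illustrative only.
import sqlite3
import pandas as pd

conn = sqlite3.connect("humans.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS demographics (
    person_id INTEGER PRIMARY KEY,
    name TEXT, birth_year INTEGER, country TEXT
);
CREATE TABLE IF NOT EXISTS school_history (
    person_id INTEGER REFERENCES demographics(person_id),
    school TEXT, degree TEXT, grad_year INTEGER
);
CREATE TABLE IF NOT EXISTS work_experience (
    person_id INTEGER REFERENCES demographics(person_id),
    employer TEXT, title TEXT, years REAL, salary REAL
);
""")
conn.commit()

# Step 3: join the trimmed-down tables into a master view and look for patterns.
master = pd.read_sql_query("""
    SELECT d.person_id, d.country, s.degree, w.years, w.salary
    FROM demographics d
    LEFT JOIN school_history s ON s.person_id = d.person_id
    LEFT JOIN work_experience w ON w.person_id = d.person_id
""", conn)
print(master.groupby("degree")["salary"].mean())  # e.g. which degrees go with higher pay
```
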

Bonus Step: Create a web app using Node.js / AngularJS / Flask (just to throw out one stack this can be done with and keep you from the 50 other options). The web app will show your automated analysis in real time to anyone who wants to see it (a minimal Flask sketch is below).
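
A minimal Flask sketch for the bonus step, assuming the analysis results are persisted somewhere the app can read; load_latest_analysis() is a placeholder:

```python
# Minimal Flask sketch for the bonus step: expose the latest analysis as JSON.
# load_latest_analysis() is a placeholder for however you persist your results.
from flask import Flask, jsonify

app = Flask(__name__)

def load_latest_analysis():
    # placeholder: read from your database / results file instead
    return {"top_factor": "education", "updated": "2019-06-03"}

@app.route("/analysis")
def analysis():
    return jsonify(load_latest_analysis())

if __name__ == "__main__":
    app.run(debug=True)   # then visit http://127.0.0.1:5000/analysis
```
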
Now you will use this project to market yourself.
You explain all the things necessary and all the complexities it took to do this project and you’ll be hired in literally no time.
You will have shown capability in multiple programming languages, an ability to problem-solve and to research, an ability to learn rapidly, and an ability to get things done independently. This last one is very important; I can't tell you how many people need their hands held every step of the way to figure out how to do something. Prove you can work on your own, and you'll be worth your weight in gold.


  1. Two trending series may show a strong correlation even if they are completely unrelated. This is referred to as "spurious correlation". That's why when you look at the correlation of, say, two stocks, you should look at the correlation of their returns/changes and not their levels (pct_change() in pandas).
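
A small illustration with two simulated, independent random walks (not real prices): the levels can look correlated while the returns, via pct_change(), are not:

```python
# Two independent random walks often show sizeable correlation in levels
# but near-zero correlation in returns. Data here is simulated, not real prices.
import numpy as np
import pandas as pd

np.random.seed(0)
n = 500
a = pd.Series(100 + np.random.randn(n).cumsum())   # random walk "stock A"
b = pd.Series(100 + np.random.randn(n).cumsum())   # random walk "stock B"

print("levels:  ", a.corr(b))                             # frequently far from zero
print("returns: ", a.pct_change().corr(b.pct_change()))   # close to zero
```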

  2. Autocorrelation is a mathematical representation of the degree of similarity between a given time series and a lagged version of itself over successive time intervals. It's the same as calculating the correlation between two different time series, except autocorrelation uses the same time series twice: once in its original form and once lagged one or more time periods.
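
In pandas this is literally correlating the series with a shifted copy of itself; Series.autocorr() is the shortcut. A small sketch with simulated data:

```python
# Autocorrelation = correlation of a series with a lagged copy of itself.
import numpy as np
import pandas as pd

np.random.seed(1)
# smoothed noise, so the series has some positive autocorrelation to find
s = pd.Series(np.random.randn(300)).rolling(5).mean().dropna()

lag = 1
manual = s.corr(s.shift(lag))   # correlate the series with itself shifted by `lag`
builtin = s.autocorr(lag=lag)   # pandas' shortcut for the same computation
print(manual, builtin)
```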

  3. Mean reversion is a theory used in finance suggesting that asset prices and historical returns will eventually revert to the long-run mean or average level of the entire dataset. This mean can also pertain to another relevant average, such as economic growth or the average return of an industry.

  4. Even if the true autocorrelations were zero at all lags, in a finite sample of returns you won't see the estimated autocorrelations come out exactly zero. In fact, the standard deviation of the sample autocorrelation is 1/sqrt(N), where N is the number of observations. Since 95% of a normal curve lies between +1.96 and -1.96 standard deviations from the mean, the 95% confidence interval is ±1.96/sqrt(N). This approximation only holds when the true autocorrelations are all zero. (If the autocorrelations at all lags are zero, we cannot forecast future observations from the past.)
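
A quick check with simulated white noise, where the true autocorrelations are zero, so roughly 95% of the sample autocorrelations should fall inside ±1.96/sqrt(N):

```python
# For white noise, sample autocorrelations should mostly fall inside +/-1.96/sqrt(N).
import numpy as np
import pandas as pd

np.random.seed(2)
N = 1000
returns = pd.Series(np.random.randn(N))   # i.i.d. noise: true autocorrelations are zero
band = 1.96 / np.sqrt(N)                  # approximate 95% confidence band

for lag in range(1, 6):
    r = returns.autocorr(lag=lag)
    flag = "outside" if abs(r) > band else "inside"
    print(f"lag {lag}: {r:+.3f}  ({flag} +/-{band:.3f})")
```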

  5. ADF (augmented Dickey-Fuller test) tests the null hypothesis that a unit root is present in a time series sample.
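
A short sketch using statsmodels' adfuller, comparing a simulated random walk (unit root present, null not rejected) with white noise (stationary, null rejected):

```python
# ADF test sketch with statsmodels on simulated series.
import numpy as np
from statsmodels.tsa.stattools import adfuller

np.random.seed(3)
random_walk = np.random.randn(500).cumsum()   # has a unit root
white_noise = np.random.randn(500)            # stationary

for name, series in [("random walk", random_walk), ("white noise", white_noise)]:
    stat, pvalue = adfuller(series)[:2]
    print(f"{name}: ADF statistic = {stat:.2f}, p-value = {pvalue:.3f}")
```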

  6. Null Hypothesis is a general statement or default position that there is no relationship between two measured phenomena, or no association among groups. It is generally assumed to be true until evidence indicates otherwise.
    The null hypothesis is generally the statement one hopes to prove wrong. For example, in a correlation test, "there is no association between the two variables" is usually taken as the null hypothesis, and in a test of independence, "the two variables are independent" is usually taken as the null hypothesis.

  7. p-value is, for a given statistical model, the probability that, when the null hypothesis is true, the statistical summary (such as the absolute value of the sample mean difference between two compared groups) would be greater than or equal to the actual observed results.
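
A small illustration with a two-sample t-test on simulated data, where the null hypothesis is that the two group means are equal:

```python
# p-value illustration: two-sample t-test where the null hypothesis is "equal means".
# The samples are simulated; group_b is shifted, so the groups really do differ.
import numpy as np
from scipy import stats

np.random.seed(4)
group_a = np.random.randn(100)
group_b = np.random.randn(100) + 0.5   # true mean difference of 0.5

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value means a difference at least this large would be unlikely if the null were true.
```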

  8. Extract, transform, load (ETL) is the process of extraction, transformation, and loading of data, used with databases and particularly with data warehouses. It includes the following sub-processes (a toy sketch follows the list):

  • Retrieving data from external data storage or transmission sources
  • Transforming the data into an understandable, consistent format (typically stored together with error-detection and correction information) so that it meets operational needs
  • Transmitting and loading the data into the receiving system
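
A toy ETL sketch with pandas and SQLite; the file name and column names are placeholders:

```python
# Toy ETL sketch: extract from a CSV, transform with pandas, load into SQLite.
# "raw_events.csv" and the column names are placeholders for your own source.
import sqlite3
import pandas as pd

def extract(path):
    return pd.read_csv(path)                      # retrieve data from an external source

def transform(df):
    df = df.dropna()                              # clean into an analyzable format
    df["event_date"] = pd.to_datetime(df["event_date"])
    return df

def load(df, db_path="warehouse.db"):
    with sqlite3.connect(db_path) as conn:        # load into the receiving system
        df.to_sql("events", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    load(transform(extract("raw_events.csv")))
```
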
  9. Data warehousing is a technology that aggregates structured data from one or more sources so that it can be compared and analyzed for greater business intelligence. Data warehouses are typically used to correlate broad business data to provide greater executive insight into corporate performance.
