Explanation: data, matlab, matlab, LDLWeb|Mat


Author: jy07043 | Published 2020-01-12 16:07

Chapter 7
Assignment

Due: To Be Determined by Class Vote by the End of the First Week of Teaching

The penalty for late hand-in of coursework is 10% of the total marks per working day. No credit will be given after more than 5 working days. A working day is a 24-hour period starting from the original hand-in deadline, but skipping days on which the School is closed.

7.1 Individual Assignment (40%)

We will study the so-called South-African Heart-Disease (SAHD) data. Necessary background is in [3, Section 4.4.2].

Briefly, observations on n = 462 individuals are available. On each individual, the data consists of: a binary indicator (value 0 or 1) of the presence of Coronary Heart Disease (1 = have disease, 0 = does not have disease), viewed as a response; and the following raw inputs: systolic blood pressure; tobacco use; LDL, also known as “bad cholesterol”; (binary) indicator of family history of heart disease; (measure of) obesity; (measure of) alcohol use; and age. In data file SAHD.mat, the response variable is chd, and the inputs are sbp, tobacco, ldl, famhist, obesity, alcohol, age respectively.

We want to regress the response on inputs, where an input may be a raw input or a function of it (e.g., a quadratic function). The main tool is a logit (logistic) regression of the form

$$Y_i \sim \text{Bernoulli}(\mu_i) \text{ and independent, where } \operatorname{logit}(\mu_i) = \sum_{j=1}^{p} X_{j,i}\beta_j \tag{7.1}$$

where $\operatorname{logit}(\mu) = \log(\mu/(1-\mu))$, and $(Y_i, X_{1,i}, \ldots, X_{p,i})$ is the i-th observation of the response and the inputs. Another possibility is probit regression of the form

$$Y_i \sim \text{Bernoulli}(\mu_i) \text{ and independent, where } \mu_i = \Phi\Big(\sum_{j=1}^{p} X_{j,i}\beta_j\Big), \tag{7.2}$$

where $\Phi()$ is the Normal(0,1) cdf.

[3, Section 4.4.2] fit logit models of the form (7.1). In [3, Section 5.2.2, pages 146-148], a more advanced model in equation (5.6) there captures the effect of a raw input $X_j$ on the response by a spline (piecewise-polynomial) function $h_j(X_j)$; the (estimated) functions are shown in Figure 5.4 for selected inputs.
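
To make the two link functions concrete, here is a minimal sketch of the means in (7.1) and (7.2) and of the Bernoulli log-likelihood they imply. It is written in Python rather than the course's MATLAB scripts, it is not taken from sahd.m, and the function names, the tiny data set, and the coefficient values are all made up for illustration.

```python
import math
from statistics import NormalDist


def logit(mu):
    """logit(mu) = log(mu / (1 - mu)), defined for 0 < mu < 1."""
    return math.log(mu / (1.0 - mu))


def linear_predictor(x, beta):
    """sum_j x_j * beta_j for a single observation."""
    return sum(xj * bj for xj, bj in zip(x, beta))


def mu_logit(x, beta):
    """Mean under (7.1): the inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(x, beta)))


def mu_probit(x, beta):
    """Mean under (7.2): Normal(0,1) cdf of the linear predictor."""
    return NormalDist().cdf(linear_predictor(x, beta))


def log_likelihood(y, X, beta, mean_fn):
    """Bernoulli log-likelihood of independent 0/1 responses y."""
    total = 0.0
    for yi, xi in zip(y, X):
        mu = mean_fn(xi, beta)
        total += yi * math.log(mu) + (1 - yi) * math.log(1.0 - mu)
    return total


# Made-up illustration: an intercept column plus one input.
X = [[1.0, -1.0], [1.0, 0.0], [1.0, 2.0]]
y = [0, 0, 1]
beta = [-0.5, 1.0]  # hypothetical coefficients, not estimates from SAHD

for xi in X:
    print(round(mu_logit(xi, beta), 3), round(mu_probit(xi, beta), 3))
print(log_likelihood(y, X, beta, mu_logit))
```

Estimation would then choose the coefficients that maximize this log-likelihood; the sketch only evaluates it at fixed values.
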
These findings suggest the following:

- Variables sbp, tobacco, and obesity may appear in (7.1) in quadratic form ($b_1 x + b_2 x^2$, where $x$ is the raw input) or in cubic form ($b_1 x + b_2 x^2 + b_3 x^3$).
- To capture the effect of age, which does not seem to resemble a low-order polynomial, consider as inputs the binary indicators of categorized age, similar to what is done in [5, Example 10]. For example, the sample deciles of age, namely 18, 28, 34, 40, 45, 49, 54, 58, 61, and 64, define the intervals (0, 18], (18, 28], (28, 34], ..., (61, 64], which give 10 categories of age.

Standard tasks (model estimation; testing the significance of a certain coefficient; computing the AIC; etc.) may be done via sahd.m Part 1, which is similar to sco.m (lab 3, Section 6.2.2). (Part 2 supports the group assignment.) A very thorough exploration could involve transformed raw inputs beyond those seen there.

Those wishing to have matlab on a non-University computer should start at software.soton.ac.uk. Install at least matlab (the basic product), and the toolboxes on statistics and optimization.

7.1.1 Deliverables and Marking Scheme

Submit a typed report to the Faculty Office, with the usual cover about academic integrity. The audience is a hypothetical manager or analyst familiar with our notes, the brief and the analyses in [3] cited above.

Discuss results, and analysis as necessary, on the following:

1. Develop a preferred model of the response. A better (more parsimonious) model is often found by removing an input $X_j$ if the hypothesis “$\beta_j = 0$” cannot be rejected at some level $\alpha$ (the corresponding t value satisfies $|t| < \Phi^{-1}(1 - \alpha/2)$, where $\Phi^{-1}()$ is the Normal(0,1) inverse cdf); and then re-fitting the model. Use of a model-selection criterion, such as Akaike Information (AIC), is recommended.
2. Use the preferred model to describe the effect of each of obesity and age on the likelihood of heart disease. For background on interpretation, see Section 4.5.4 and the second-from-last paragraph in [3, Section 4.4.2].
3. Brief explanation of how results were obtained can appear in an appendix. If significant extensions are made to any codes provided to you during the course (sahd.m or other), then: (a) submit the code on blackboard (individual assignment), and state in the report the submitting user (e.g., “xy1g09”) (so the code can be matched to the report); and (b) provide instructions on how the code can be used.

Avoid re-stating the brief or the codes provided. Length: up to 400 words, excluding tables, figures, and their captions.

Marking Scheme:

- 40%. Plausibility of the proposed model. The plausibility will be judged against the findings in [3] and against experiments the manager could attempt; thus your analysis should be informed by the above. For example, while Table 4.3 of [3] suggests that obesity and systolic blood pressure do not have a significant effect, the reverse is found in [3, Figure 5.4]; the latter suggests that appropriate functions of these inputs should enter the regression (e.g., as a quadratic function, or a categorization).
- 30%. Correctness of estimation of any model being proposed.
- 30%. Quality of presentation (clarity, coherence, ease of reading).

7.2 Group Assignment (60%)

We wish to develop a classification (prediction) method of the response (0/1 indicator of heart disease) based on the inputs. The aim is to minimize the expected loss (per observation, as in (5.7)) under the following cost structure: correct classification costs nothing; misclassification of class 0 (true is 0, prediction is 1) costs 1 unit; and misclassification of class 1 (true is 1, prediction is 0) costs 10 units.

Similar to the analysis seen in lab 4, we consider classifiers defined by a Bernoulli (GLM) model of the response and a classification threshold t, as in (6.1); and the main aim is to minimize the expected loss by choice of a classifier. The more extensive the set of models and thresholds considered, the smaller the loss we can expect.
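
The cost structure and the role of the threshold t can be illustrated with a short sketch. It is written in Python rather than MATLAB and is not part of sahd.m; the rule "predict 1 when the fitted probability exceeds t" is an assumption about the form of (6.1), and the responses and fitted probabilities are made-up values. The loss computed here is in-sample, whereas the brief calls for an out-of-sample estimate.

```python
COST_FALSE_POSITIVE = 1.0   # true is 0, prediction is 1
COST_FALSE_NEGATIVE = 10.0  # true is 1, prediction is 0


def classify(p, t):
    """Predict class 1 when the fitted probability p exceeds t (assumed rule)."""
    return 1 if p > t else 0


def average_loss(y, p, t):
    """Average cost per observation of the threshold-t classifier."""
    total = 0.0
    for yi, pi in zip(y, p):
        pred = classify(pi, t)
        if yi == 0 and pred == 1:
            total += COST_FALSE_POSITIVE
        elif yi == 1 and pred == 0:
            total += COST_FALSE_NEGATIVE
    return total / len(y)


def best_threshold(y, p, grid):
    """Return (threshold, loss) with the smallest average loss.

    Ties are resolved in favour of the first threshold in the grid.
    """
    return min(((t, average_loss(y, p, t)) for t in grid),
               key=lambda pair: pair[1])


# Made-up responses and fitted probabilities, for illustration only.
y = [0, 0, 0, 1, 0, 1]
p = [0.03, 0.22, 0.41, 0.13, 0.62, 0.71]
grid = [k / 20 for k in range(1, 20)]  # 0.05, 0.10, ..., 0.95

t_star, loss_star = best_threshold(y, p, grid)
print(t_star, loss_star)
```

If the fitted probabilities were exactly correct, predicting 1 would cost 1·(1 − p) in expectation and predicting 0 would cost 10·p, so predicting 1 would be preferable whenever p > 1/11. This is why low thresholds tend to do well under such asymmetric costs, although with an estimated model the best threshold still has to be found empirically.
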
Moreover, a “better” model (with an appropriate threshold) is more likely to minimize the loss; thus, the model chosen for the individual assignment seems (a priori) a strong contender.

Script sahd.m, Part 2, implements an analysis of the form above, similar to cvsco.m (lab 4). Specifically, a set of classifiers is developed; the expected (out-of-sample) loss is estimated via cross-validation (Section 5.3, equation (5.10)); and the classifier minimizing the estimated loss is selected.

A second (subsequent) task is to estimate the expected loss, L say, of the selected classifier. The CV estimate for the selected classifier (the minimum CV estimate across all classifiers considered) is an optimistic (biased low) estimate of L [3, Section 7.2, page 222]. To obtain an unbiased estimate, the following is proposed: (a) on the first two thirds of the data, apply cross-validation for classifier selection; and (b) on the last third of the data, apply the selected classifier; the average loss there is an unbiased estimate of L. For this task, minor modification of sahd.m (or a new code) may be necessary.

7.2.1 Deliverables and Marking Scheme

Submit a typed report to the Faculty Office, with the usual cover about academic integrity. The audience is a hypothetical manager or analyst familiar with our notes, the brief and the analyses in [3] cited above.

Discuss results, and analysis as necessary, on the following:

1. State a preferred classifier, with the aim of minimizing the expected loss. In estimating the loss, leave-one-out cross-validation (Section 5.3) seems preferable (the modest size of the sample makes this practical).
2. Provide an unbiased estimate of the expected loss L of the selected classifier.
3. Brief explanation of how results were obtained can appear in an appendix. If significant extensions are made to any codes provided to you during the course (sahd.m or other), then: (a) submit the code on blackboard (group assignment), and state in the report the submitting user (e.g., “xy1g09”) (so the code can be matched to the report); and (b) provide instructions on how the code can be used.

Avoid re-stating the brief or the codes provided. Length: up to 500 words, excluding tables, figures, and their captions.

Marking Scheme:

- 50%. Thoroughness of exploration of potential classifiers, and success in minimizing the expected loss.
- 20%. Correctness of estimation of the expected loss L of the selected classifier.
- 30%. Quality of presentation: clarity, coherence, ease of reading.

Group Work

Work in a group of size up to 4. Each group makes a single submission of the deliverables. Group members are responsible for functioning as a team. A common mark will be given to the members of a group. In exceptional cases where there is strong evidence of lack of contribution by a member, a correction may be made. A key requirement for making such a correction is majority opinion (i.e., 2 or more against 1). To help resolve such cases, keeping minutes of work meetings may be appropriate.

Reposted from: http://www.7daixie.com/2019043020195731.html
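
The two-stage procedure of Section 7.2 (leave-one-out cross-validation for classifier selection on the first two thirds of the data, then the average loss on the last third) can be sketched as follows. This is a Python illustration under strong simplifying assumptions, not the logic of sahd.m: a single made-up input, a deterministic made-up data set, one logit model so that classifiers differ only in their threshold, a hand-written Newton-Raphson fit, and a refit of the model on the whole selection set before applying it to the holdout set. All names are hypothetical.

```python
import math

COST_FP, COST_FN = 1.0, 10.0  # (true 0, predicted 1) and (true 1, predicted 0)


def sigmoid(z):
    """Numerically stable inverse logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logit(xs, ys, iterations=25):
    """Newton-Raphson fit of logit(mu) = b0 + b1 * x (intercept plus one input)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            mu = sigmoid(b0 + b1 * x)
            w = mu * (1.0 - mu)
            g0 += y - mu
            g1 += (y - mu) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:  # guard against a degenerate information matrix
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def loss(y, p, t):
    """Cost of predicting 1 when p > t (assumed rule), else 0."""
    pred = 1 if p > t else 0
    if y == 0 and pred == 1:
        return COST_FP
    if y == 1 and pred == 0:
        return COST_FN
    return 0.0


def loo_probabilities(xs, ys):
    """Fitted probability for each point from a model fitted without it."""
    probs = []
    for i in range(len(xs)):
        b0, b1 = fit_logit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        probs.append(sigmoid(b0 + b1 * xs[i]))
    return probs


def select_threshold(xs, ys, grid):
    """Pick the threshold with the smallest leave-one-out CV loss."""
    probs = loo_probabilities(xs, ys)
    cv = {t: sum(loss(y, p, t) for y, p in zip(ys, probs)) / len(ys)
          for t in grid}
    t_best = min(grid, key=lambda t: cv[t])
    return t_best, cv[t_best]


# Deterministic made-up data (not SAHD): 30 points with overlapping classes.
xs = [((7 * i) % 30) / 10 - 1.5 for i in range(30)]
shift = [0.8, -0.8, 0.0]
ys = [1 if x + shift[i % 3] > 0 else 0 for i, x in enumerate(xs)]

n_sel = 2 * len(xs) // 3               # first two thirds: selection
grid = [k / 20 for k in range(1, 20)]  # candidate thresholds
t_sel, cv_loss = select_threshold(xs[:n_sel], ys[:n_sel], grid)

# Last third: apply the selected classifier, refitted on the selection set.
b0, b1 = fit_logit(xs[:n_sel], ys[:n_sel])
holdout = [loss(y, sigmoid(b0 + b1 * x), t_sel)
           for x, y in zip(xs[n_sel:], ys[n_sel:])]
holdout_loss = sum(holdout) / len(holdout)
print(t_sel, cv_loss, holdout_loss)
```

The value cv_loss is the minimum over the candidates and is therefore expected to be optimistic, while holdout_loss uses data that played no part in the selection. In a fuller exploration the candidate set would range over several models as well as thresholds.
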


Article link: https://www.haomeiwen.com/subject/nptwactx.html