*Thyne, Clayton
*Intro to Stats, Chapter 12 examples
clear
set mem 80m
cd [your directory]
* NOTE: the above command asks you to change the directory. The directory will depend on how you've set up
* the folders on your computer. For example, my command is...
* cd F:\teaching\Fall_2008\PS572-401-baby_stats\do_file_directories\chapter12
* This is necessary so you know where Stata is saving things.

****Part 1a: test for linearity (#1)
*scatter Y on each X
*We'll take data from John Fox's website. Use the following model:
clear
use http://www.ats.ucla.edu/stat/stata/examples/ara/prestige, clear
regress prestige educat inc
*So, we're predicting that occupational prestige = education + income
graph twoway scatter prest educat if e(sample), saving(ed, replace)
graph twoway scatter prest inc if e(sample), saving(inc, replace)
graph combine ed.gph inc.gph
*We see some possible non-linearity with prestige=income

****Part 1b: test for linearity (#2)
*Now, we'll do the more sophisticated approach using partial residual plots
regress prestige educat inc
predict e1, resid
gen edhat=e1 + 4.137444*educat
scatter edhat educat, ytitle(partial resid) xtitle(education)
*We see no problem with linearity for education.
*Now, let's try income
drop e1 edhat
regress prestige educat inc
predict e2, resid
gen inchat=e2 + .0013612*income
scatter inchat income, ytitle(partial resid) xtitle(income)
*Or, see this better with...
lowess inchat income, ytitle(partial resid) xtitle(income)
*Again, perhaps a bit of non-linearity, but no big problem.

****Part 1c: test for perfect collinearity
clear
use http://www.ats.ucla.edu/stat/stata/examples/ara/prestige, clear
regress prestige educat inc
corr educat inc
graph matrix educat inc
*also check the variance inflation factor; above 10 is a problem
regress prestige educat inc
vif

****Part 2a: Homoskedasticity test 1
clear
use "http://www.uky.edu/~clthyn2/intro_stats/data_set_1.dta"
scatter milper tpop if tpop<500000 & milper<2000

****Part 2b: Homoskedasticity test 2
regress milper tpop if tpop<500000 & milper<2000
predict yhat if e(sample)
predict ehat if e(sample), resid
scatter yhat ehat
drop yhat ehat

****Part 2c: Homoskedasticity test 3
regress milper tpop if tpop<500000 & milper<2000
hettest tpop
*Note: Ho=homoskedasticity. Because we can reject Ho, we have a problem.

****Part 2d: Homoskedasticity test 4
regress milper tpop if tpop<500000 & milper<2000
rvfplot, yline(0)

****Part 3: fixing heteroskedasticity
*First, check which transformation makes the most sense.
gladder milper
gladder tpop
*Next, transform both variables.
gen lmilper=log10(milper)
gen ltpop=log10(tpop)
*Next, see if this has fixed the problem
regress lmilper ltpop if tpop<500000 & milper<2000
hettest ltpop
*As we can see, we cannot reject the Ho, so we're okay.
*You could have used any of the 4 tests above.
drop ltpop lmilper

****Part 4: test for autocorrelation
clear
use "http://www.uky.edu/~clthyn2/intro_stats/dwstat_example.dta"
*These data are for the US only from 1980 to 2000
tsset year
regress milper tpop upop milex
dwstat
*We want a DW stat of 2, so we have a problem here. Stata doesn't
*report a p-value for the DW stat - the threshold is set by the researcher.
*The nature of the data usually tells us if there will be a problem.
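
****Aside on Part 1c: where the VIF numbers come from
*A minimal sketch (added illustration, not part of the chapter's steps): the VIF for
*predictor j is 1/(1-R2_j), where R2_j comes from regressing predictor j on the other
*predictors. The prestige data are reloaded so this aside runs on its own; with only
*two predictors, the "other predictors" regression is just educat on inc.
use http://www.ats.ucla.edu/stat/stata/examples/ara/prestige, clear
regress prestige educat inc
vif
*now reproduce the VIF for education by hand
regress educat inc
di "VIF for education by hand = " 1/(1-e(r2))
*the two numbers should match; the same logic gives the VIF for income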
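
****Aside on Part 4: computing the DW statistic by hand
*A minimal sketch (added illustration) of where -dwstat- comes from:
*d = sum[(e_t - e_t-1)^2] / sum[e_t^2], so positively correlated neighboring residuals
*push d below 2. The data and model are reloaded so this aside runs on its own;
*dw_ehat, dw_num, and dw_den are scratch variables created just for this check.
use "http://www.uky.edu/~clthyn2/intro_stats/dwstat_example.dta", clear
tsset year
regress milper tpop upop milex
predict dw_ehat if e(sample), resid
gen dw_num=(dw_ehat-L.dw_ehat)^2
gen dw_den=dw_ehat^2
sum dw_num
scalar dw_top=r(sum)
sum dw_den
scalar dw_bottom=r(sum)
di "Durbin-Watson computed by hand = " dw_top/dw_bottom
drop dw_ehat dw_num dw_den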
****Part 5: normality, test 1
clear
use http://www.ats.ucla.edu/stat/stata/examples/ara/prestige, clear
regress prestige educat inc
predict ehat, resid
kdensity ehat, normal
qnorm ehat

****Part 5: normality, test 2
clear
use http://www.ats.ucla.edu/stat/stata/examples/ara/prestige, clear
regress prestige educat inc
predict ehat, resid
swilk ehat

****Part 5: normality, test 3...an abnormal example
drop ehat
gen prestige2=1 if prestige>46
replace prestige2=0 if prestige2==.
regress prestige2 educat inc
predict ehat, resid
kdensity ehat, normal
qnorm ehat
swilk ehat

****Part 6 (skipped)

****Part 7: confidence intervals
clear
use "http://www.uky.edu/~clthyn2/intro_stats/GPA_example.dta"
*now, consider the uncertainty around our estimated
*GPA of a student with ACT=24...we'll just do 1 variable for this example
*first, let's get our point estimate
regress GPA ACT
di .1564924*24 -1.252839
*so, our model predicts a GPA of 2.503 for this student
*second, let's look at our C.I. for ACT
graph twoway lfitci GPA ACT if e(sample) || scatter GPA ACT if e(sample)
*why is the CI wider at the ends?
*third, let's simulate 100 parameter estimates to see how
*certain we are of this prediction
*(note: -estsimp- comes from the user-written Clarify package; type -findit clarify- to install it)
estsimp regress GPA ACT, sims(100)
rename b1 ACT_sims
rename b2 constant_sims
gen estimate=ACT_sims*24 + constant_sims
sum estimate
hist estimate
*so, the mean is as expected, but we see a somewhat wide confidence interval

****Part 8: extrapolating the model
clear
use "http://www.uky.edu/~clthyn2/intro_stats/GPA_example.dta"
*first, let's estimate our model: GPA = ACT + study_hrs + e
regress GPA ACT study_hrs
*second, let's try to estimate the predicted GPA for the following student:
*ACT=36, study_hrs=100
di -.6093642 + .097656*36 + .0387667*100
*So, our estimate is 6.783, which doesn't make much sense.

****Part 9: examining leverage
clear
use "http://www.uky.edu/~clthyn2/intro_stats/GPA_example.dta"
*first, let's estimate our model: GPA = ACT + e
regress GPA ACT
*second, predict the leverage of each observation and compare to the cut-off
predict leverage, leverage
sum
di 2*.1666
list
*we see that one of the UK observations is above the .3332 cut-off
*we can also see this using a figure...
lvr2plot, mlabel(school)

****Part 10: examining outliers
regress GPA ACT
predict outlier, rstudent
*Iowa's 3rd observation looks like a potential problem...
*we also saw this on the leverage plot in Part 9

****Part 11: examining influence using Cook's D
clear
use "http://www.uky.edu/~clthyn2/intro_stats/GPA_example.dta"
*generate the cut-off
di 4/(12-1-1)
*run the model, check to see if any obs are above the cut-off
regress GPA ACT
predict cooks, cooksd
list
*We again see that the 4th UK observation seems to be problematic.

****Part 12: examining influence graphically
clear
use "http://www.uky.edu/~clthyn2/intro_stats/GPA_example.dta"
regress GPA ACT
predict leverage, leverage
di 2*.1666
label var leverage "leverage, cut-off=.3332"
predict outliers, rstudent
label var outliers "outliers, cut-off=2"
scatter outliers leverage, xline(.3332) yline(2) mlabel(school)
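
****Aside on Parts 9 and 11: pulling the cut-offs from stored results
*A minimal sketch (added illustration): the leverage rule of thumb is 2*(k+1)/n and the
*Cook's D rule of thumb is 4/(n-k-1), so both can be computed from e() after -regress-
*instead of typed by hand (here k=1 predictor and n=12 observations). The data are
*reloaded so this aside runs on its own; lev2 and cd2 are scratch variables.
use "http://www.uky.edu/~clthyn2/intro_stats/GPA_example.dta", clear
regress GPA ACT
di "leverage cut-off = " 2*(e(df_m)+1)/e(N)
di "Cook's D cut-off = " 4/(e(N)-e(df_m)-1)
predict lev2, leverage
predict cd2, cooksd
*flag any observation beyond either rule of thumb
list school lev2 cd2 if lev2>2*(e(df_m)+1)/e(N) | cd2>4/(e(N)-e(df_m)-1)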
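
****Aside on Part 7: an analytic confidence interval for the same prediction
*A minimal sketch (added illustration): the hand-typed point estimate
*(.1564924*24 - 1.252839) is just _b[ACT]*24 + _b[_cons], and -lincom- reports that
*linear combination along with a standard error and confidence interval, which can be
*compared with the simulated spread from -estsimp-. The data are reloaded so this
*aside runs on its own.
use "http://www.uky.edu/~clthyn2/intro_stats/GPA_example.dta", clear
regress GPA ACT
di "predicted GPA for ACT=24 = " _b[ACT]*24 + _b[_cons]
lincom 24*ACT + _cons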