*Thyne, Clayton
*Intro to Stats, Chapter 12 examples
clear
set mem 80m
cd [your directory]
* NOTE: the above command asks you to change the directory. The directory will depend on how you've set up
* the folders on your computer. For example, my command is...
* cd F:\teaching\Fall_2008\PS572-401-baby_stats\do_file_directories\chapter12
* This is necessary so you know where Stata is saving things.

****Part 1a: test for linearity (#1)
*scatter Y on each X
*We'll take data from John Fox's website. Use the following model:
clear
use http://www.ats.ucla.edu/stat/stata/examples/ara/prestige, clear
regress prestige educat inc
*So, we're predicting that occupational prestige = education + income
graph twoway scatter prest educat if e(sample), saving(ed, replace)
graph twoway scatter prest inc if e(sample), saving(inc, replace)
graph combine ed.gph inc.gph
*We see some possible non-linearity with prestige=income

****Part 1b: test for linearity (#2)
*Now, we'll do the more sophisticated approach using partial residual plots
regress prestige educat inc
predict e1, resid
gen edhat=e1 + 4.137444*educat
scatter edhat educat, ytitle(partial resid) xtitle(education)
*We see no problem with linearity for education.
*Now, let's try income
drop e1 edhat
regress prestige educat inc
predict e2, resid
gen inchat=e2 + .0013612*income
scatter inchat income, ytitle(partial resid) xtitle(income)
*Or, see this better with...
lowess inchat income, ytitle(partial resid) xtitle(income)
*Again, perhaps a bit of non-linearity, but no big problem.

****Part 1c: test for perfect collinearity
clear
use http://www.ats.ucla.edu/stat/stata/examples/ara/prestige, clear
regress prestige educat inc
corr educat inc
graph matrix educat inc
*also check the variance inflation factor; above 10 is a problem
regress prestige educat inc
vif

****Part 2a: Homoskedasticity test 1
clear
use "http://www.uky.edu/~clthyn2/intro_stats/data_set_1.dta"
scatter milper tpop if tpop<500000 & milper<2000

****Part 2b: Homoskedasticity test 2
regress milper tpop if tpop<500000 & milper<2000
predict yhat if e(sample)
predict ehat if e(sample), resid
scatter yhat ehat
drop yhat ehat

****Part 2c: Homoskedasticity test 3
regress milper tpop if tpop<500000 & milper<2000
hettest tpop
*Note: Ho=homoskedasticity. Because we can reject Ho, we have a problem.

****Part 2d: Homoskedasticity test 4
regress milper tpop if tpop<500000 & milper<2000
rvfplot, yline(0)

****Part 3: fixing heteroskedasticity
*First, check which transformation makes the most sense.
gladder milper
gladder tpop
*Next, transform both variables.
gen lmilper=log10(milper)
gen ltpop=log10(tpop)
*Next, see if this has fixed the problem
regress lmilper ltpop if tpop<500000 & milper<2000
hettest ltpop
*As we can see, we cannot reject the Ho, so we're okay.
*You could have used any of the 4 tests above.
drop ltpop lmilper

****Part 4: test for autocorrelation
clear
use "http://www.uky.edu/~clthyn2/intro_stats/dwstat_example.dta"
*These data are for the US only from 1980 to 2000
tsset year
regress milper tpop upop milex
dwstat
*We want a DW stat of 2, so we have a problem here. Stata doesn't
*report a p-value for the DW stat - the threshold is set by the researcher.
*The nature of the data usually tells us if there will be a problem.
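
****Aside on Part 1c: where the VIF numbers come from
*A minimal sketch (added illustration, not part of the chapter's steps): the VIF for
*predictor j is 1/(1-R2_j), where R2_j comes from regressing predictor j on the other
*predictors. The prestige data are reloaded so this aside runs on its own; with only
*two predictors, the "other predictors" regression is just educat on inc.
use http://www.ats.ucla.edu/stat/stata/examples/ara/prestige, clear
regress prestige educat inc
vif
*now reproduce the VIF for education by hand
regress educat inc
di "VIF for education by hand = " 1/(1-e(r2))
*the two numbers should match; the same logic gives the VIF for income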
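
****Aside on Part 4: computing the DW statistic by hand
*A minimal sketch (added illustration) of where -dwstat- comes from:
*d = sum[(e_t - e_t-1)^2] / sum[e_t^2], so positively correlated neighboring residuals
*push d below 2. The data and model are reloaded so this aside runs on its own;
*dw_ehat, dw_num, and dw_den are scratch variables created just for this check.
use "http://www.uky.edu/~clthyn2/intro_stats/dwstat_example.dta", clear
tsset year
regress milper tpop upop milex
predict dw_ehat if e(sample), resid
gen dw_num=(dw_ehat-L.dw_ehat)^2
gen dw_den=dw_ehat^2
sum dw_num
scalar dw_top=r(sum)
sum dw_den
scalar dw_bottom=r(sum)
di "Durbin-Watson computed by hand = " dw_top/dw_bottom
drop dw_ehat dw_num dw_den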
****Part 5: normality, test 1
clear
use http://www.ats.ucla.edu/stat/stata/examples/ara/prestige, clear
regress prestige educat inc
predict ehat, resid
kdensity ehat, normal
qnorm ehat

****Part 5: normality, test 2
clear
use http://www.ats.ucla.edu/stat/stata/examples/ara/prestige, clear
regress prestige educat inc
predict ehat, resid
swilk ehat

****Part 5: normality, test 3...an abnormal example
drop ehat
gen prestige2=1 if prestige>46
replace prestige2=0 if prestige2==.
regress prestige2 educat inc
predict ehat, resid
kdensity ehat, normal
qnorm ehat
swilk ehat

****Part 6 (skipped)

****Part 7: confidence intervals
clear
use "http://www.uky.edu/~clthyn2/intro_stats/GPA_example.dta"
*now, consider the uncertainty around our estimated
*GPA of a student with ACT=24...we'll just do 1 variable for this example
*first, let's get our point estimate
regress GPA ACT
di .1564924*24 -1.252839
*so, our model predicts a GPA of 2.503 for this student
*second, let's look at our C.I. for ACT
graph twoway lfitci GPA ACT if e(sample) || scatter GPA ACT if e(sample)
*why is the CI wider at the ends?
*third, let's simulate 100 parameter estimates to see how
*certain we are of this prediction
*(note: -estsimp- comes from the user-written Clarify package; type -findit clarify- to install it)
estsimp regress GPA ACT, sims(100)
rename b1 ACT_sims
rename b2 constant_sims
gen estimate=ACT_sims*24 + constant_sims
sum estimate
hist estimate
*so, the mean is as expected, but we see a somewhat wide confidence interval

****Part 8: extrapolating the model
clear
use "http://www.uky.edu/~clthyn2/intro_stats/GPA_example.dta"
*first, let's estimate our model: GPA = ACT + study_hrs + e
regress GPA ACT study_hrs
*second, let's try to estimate the predicted GPA for the following student:
*ACT=36, study_hrs=100
di -.6093642 + .097656*36 + .0387667*100
*So, our estimate is 6.783, which doesn't make much sense.

****Part 9: examining leverage
clear
use "http://www.uky.edu/~clthyn2/intro_stats/GPA_example.dta"
*first, let's estimate our model: GPA = ACT + e
regress GPA ACT
*second, predict the leverage of each observation and compare to the cut-off
predict leverage, leverage
sum
di 2*.1666
list
*we see that one of the UK observations is above the .3332 cut-off
*we can also see this using a figure...
lvr2plot, mlabel(school)

****Part 10: examining outliers
regress GPA ACT
predict outlier, rstudent
*Iowa's 3rd observation looks like a potential problem...
*we also saw this on the leverage plot in Part 9

****Part 11: examining influence using Cook's D
clear
use "http://www.uky.edu/~clthyn2/intro_stats/GPA_example.dta"
*generate the cut-off
di 4/(12-1-1)
*run the model, check to see if any obs are above the cut-off
regress GPA ACT
predict cooks, cooksd
list
*We again see that the 4th UK observation seems to be problematic.

****Part 12: examining influence graphically
clear
use "http://www.uky.edu/~clthyn2/intro_stats/GPA_example.dta"
regress GPA ACT
predict leverage, leverage
di 2*.1666
label var leverage "leverage, cut-off=.3332"
predict outliers, rstudent
label var outliers "outliers, cut-off=2"
scatter outliers leverage, xline(.3332) yline(2) mlabel(school)
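
****Aside on Parts 9 and 11: pulling the cut-offs from stored results
*A minimal sketch (added illustration): the leverage rule of thumb is 2*(k+1)/n and the
*Cook's D rule of thumb is 4/(n-k-1), so both can be computed from e() after -regress-
*instead of typed by hand (here k=1 predictor and n=12 observations). The data are
*reloaded so this aside runs on its own; lev2 and cd2 are scratch variables.
use "http://www.uky.edu/~clthyn2/intro_stats/GPA_example.dta", clear
regress GPA ACT
di "leverage cut-off = " 2*(e(df_m)+1)/e(N)
di "Cook's D cut-off = " 4/(e(N)-e(df_m)-1)
predict lev2, leverage
predict cd2, cooksd
*flag any observation beyond either rule of thumb
list school lev2 cd2 if lev2>2*(e(df_m)+1)/e(N) | cd2>4/(e(N)-e(df_m)-1)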
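
****Aside on Part 7: an analytic confidence interval for the same prediction
*A minimal sketch (added illustration): the hand-typed point estimate
*(.1564924*24 - 1.252839) is just _b[ACT]*24 + _b[_cons], and -lincom- reports that
*linear combination along with a standard error and confidence interval, which can be
*compared with the simulated spread from -estsimp-. The data are reloaded so this
*aside runs on its own.
use "http://www.uky.edu/~clthyn2/intro_stats/GPA_example.dta", clear
regress GPA ACT
di "predicted GPA for ACT=24 = " _b[ACT]*24 + _b[_cons]
lincom 24*ACT + _cons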