READING IN DATA WITH STATA -- PS 245 handout

If you get data from almost any source other than me for your final data project, this data will not come in a .DTA format directly readable by STATA. The Harvard Data Center, for example, never distributed data in STATA format. This note is a long explanation of reading data into STATA. It is not easy to read; do not expect to skim this while watching "Seinfeld" and then understand it. The concepts here are quite difficult for anyone with limited computer experience. Yet many methodology students have found this document useful; I found someone printing it in the Data Center two years after I had first typed it up. You are unlikely to find a more accessible introduction to this part of data analysis before writing your senior thesis.

Let's begin by establishing a common vocabulary. In any dataset, you will have several "variables" -- maybe income, age and party identification if the dataset comes from a survey. Each variable will have a series of values -- maybe a series of ages or incomes. Across variables these values are parallel. That is, the fifth value for "AGE" will come from the same respondent as the fifth value for "INCOME". Therefore, if we placed these variables into a spreadsheet or table as columns side by side, each row in that spreadsheet would correspond to a single "observation" -- maybe one respondent, or one state, or one country. This concept of your dataset, where variables represent columns and observations represent rows, is the one used by STATA; the commands often are easier to remember with this concept in mind.
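
To make the rows-and-columns picture concrete, here is a minimal sketch that types a tiny dataset straight into STATA with the "input" command and then lists it. The five rows are illustrative values (the same ones used in the examples later in this handout), and nothing here depends on an outside file.

```stata
* Start with an empty dataset in memory
clear

* Each name after "input" becomes a variable (a column);
* each line of numbers becomes an observation (a row)
input age income partyid
24 12345 0
45 24118 1
34 45678 0
55 112444 1
29 9999 0
end

* Show the whole table, then only the fifth observation
list
list age income in 5

* Report the number of observations and the variables' storage types
describe
```

The second "list" command shows that the fifth age and the fifth income sit on the same row, that is, they belong to the same respondent.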

To turn data into a STATA .DTA file, STATA requires that your data start in ASCII format. ASCII, which stands for American Standard Code for Information Interchange, provides a representation for most commonly used numbers, letters and symbols so that computer programs of all sorts can pass information back and forth (a sort of Esperanto for diverse software). If you type in your own data, you will have to figure out how to produce an ASCII file from whatever software you use. Fortunately, with most common spreadsheet programs on the market, this is a simple procedure.
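
One quick check that a file really is plain ASCII is to display it from inside STATA with the "type" command. This is only a small sketch; it assumes a file named method1.raw (a hypothetical name) sits in STATA's current working directory.

```stata
* Show which directory STATA is currently looking in
pwd

* Print the raw contents of the file to the Results window;
* readable rows of numbers suggest a plain ASCII file
type method1.raw
```

If the screen fills with odd symbols instead of readable rows, the file was probably saved in the spreadsheet's or word processor's own format rather than as ASCII.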

Functional ASCII data files come in two forms: fixed format and free (or delimited) format. In fixed form, the values for a particular variable have to appear in the same place for each observation. So if your "AGE" value appears as characters 4-6 in a row of data for the first observation, it must do so in later rows as well. This is difficult because, to read the data into a program such as STATA, you have to identify where the values for each variable are located in the ASCII file. Therefore, if you create your own dataset you almost certainly want to use the easier free format.

With free format, some special character(s) (such as commas, blank spaces or TABs) delimit the value for "AGE" from that for "INCOME", and so on; as long as the age value always precedes the income value you need not line them up row by row. The main drawback to this method of organization is that the extra characters greatly add to the size of an ASCII file for large datasets. Therefore, if you receive your dataset from some outside source, such as the Harvard Data Center, the information might be packed into fixed format.

Example: A survey with 5 observations (respondents) and three variables (question responses): AGE, INCOME and PARTYID. In free format the data could appear in a number of ways.

METHOD 1:    METHOD 2:    METHOD 3:
24,12345,0   24 12345 0   24  12345 0 | High school civics teacher
45,24118,1   45 24118 1   45  24118 1 | Public university professor
34,45678,0   34 45678 0   34  45678 0 | Harvard prof (w/o consulting fees)
55,112444,1  55 112444 1  55 112444 1 | Harvard prof (w/ consulting fees)
29,9999,0    29 9999 0    29   9999 0 | Political theorist
Note that Method 3 is also in fixed format, since the first two characters represent age, the next seven (including blank spaces) represent income, and, after one more blank, the last character is a dummy variable representing party. However, another fixed form example might be:
METHOD 4:    METHOD 5:
240123450    24012345demo
450241181    45024118repb
340456780    34045678demo
551124441    55112444repb
290099990    29009999demo
Unlike Method 3, these versions of fixed form data cannot be treated as free form; no simple delimiter separates the values of one variable from those of the next along a row (also called a record in many codebooks). This is very much how HDC data often appears -- it is gibberish without a "data dictionary" telling you and the computer what the numbers represent.

So on to the question at hand. The command STATA uses for data entry is "infile". The syntax differs depending upon the data format. With free format data, such as Method 1 above, you would type:

        infile age income partyid using method1.raw

In this example:
        "infile" is the command, sort of the verb telling STATA what to do
        "age income partyid" are the variable names, in the order of the dataset
        "using" is a second part of the command, announcing the data file
        "method1.raw" is one possible name we could give the ASCII text file

If you were not interested in the income variable, you could type:

        infile age _skip partyid using method3.raw

STATA assumes that each variable is a "floating point" number (basically, that it will sometimes have a decimal place, and not always in the same location among the digits). This takes up memory, so with large datasets you might want to specify that something is an "integer" variable or a "byte" variable (a very small integer, between -127 and 100). Note that STATA's "int" type holds values only up to 32,740, so the incomes in our example need the larger "long" type:

        infile int age long income byte partyid using method2.raw

If a delimited file spelled out party names ("demo" or "repb") rather than using numbers, you also must indicate that the name applies to a "string" variable that contains letters and symbols:

        infile int age long income str4 partyid using method5.raw

(This command assumes a version of method5.raw in which a blank or comma separates each value from the next; the packed Method 5 layout shown above needs a dictionary instead.)

Note that this last method only works with certain types of string data -- those with no delimit characters in the middle (e.g. blank spaces, commas) or those set apart in the dataset by single or double quotation marks. Cases such as people's full names, which will have spaces and commas, cannot be entered in this simple form unless surrounded by quotation marks.
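
As a sketch of how these pieces fit together, the short do-file below reads the Method 1 data and then a second file that contains quoted full names. The file names.raw and its layout are hypothetical, invented only to illustrate the quoting rule.

```stata
* Read the comma-delimited Method 1 file; all three variables
* default to floating point because no storage type is given
clear
infile age income partyid using method1.raw
list

* Hypothetical file names.raw, one respondent per line, e.g.:
*   "Smith, John" 24 12345
* The quotation marks keep the comma and blank inside the name
* from being treated as delimiters
clear
infile str20 name int age long income using names.raw
list
```

The "clear" before each "infile" matters because STATA will not read a new file over data already in memory.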
This restriction on delimiters inside unquoted strings holds even though we've told STATA that the string has four characters with "str4"; if it reaches a blank space or comma before the fourth character, STATA will stop reading.

For fixed format data files that have no simple delimiters, and so cannot be treated as free format files, you need to create a "data dictionary." Let's start by assuming we wanted to read in the Method 3 data. We would open some type of text editor (such as DOS EDIT or Notepad), or failing that a word processor that can save ASCII files (most can, with varying levels of ease and probable pitfalls). Then we might type:
dictionary using method3.raw {
        age 
        income 
        partyid
}
Perhaps we could save it as "method3.dct". Then to access this file within STATA itself, the infile command is easy because we can skip variable names. We would type:

        infile using method3.dct

plus any of the qualifiers desired (see "help infile"). Since we didn't list variable names, STATA knows that method3.dct is a dictionary file and not a data file.

You should note that I am only listing the file names, not their paths. If you were using a campus computer, to make this method work you would need to put everything in one location (such as the desktop) and make sure this is the location where STATA is looking.

This dictionary file is very simple. It tells STATA nothing other than the variable names; STATA figures out where each variable's values are in the file using the delimiters, same as before. But the dictionary can be used in a more complicated fashion. Let's say, for example, that we wanted to read in the Method 4 data. Then the dictionary file, with all of its accouterments, could look like this:
dictionary using method4.raw {
* Note: I am not italicizing the unique items such as varnames and filenames
int     age     %2f     "Respondent age"
* Note: income is "long" because an "int" cannot hold values above 32,740
long    income  %6f     "Family income of respondent"
byte    partyid %1f     "Party ID -- 0=Democrat  1=Republican"
}
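Assuming the Method 4 dictionary above is saved as method4.dct (a hypothetical file name) in the same directory as method4.raw, a session that reads and checks the data might look like the following sketch. The output file name method4.dta is likewise an assumption.

```stata
* Read the data through the dictionary; no variable names are
* listed, so STATA treats method4.dct as a dictionary file
clear
infile using method4.dct

* Confirm the variable names, storage types and labels
describe

* Eyeball the values against the raw file
list

* Save the result as a STATA dataset
save method4.dta
```

Comparing the output of "list" with the raw file is a cheap way to catch a wrong width in the dictionary before you start any analysis.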
For Method 5 data we could write:
dictionary using method5.raw {
* Note: You can put comments behind an asterisk and STATA will ignore them
int     age     %2f     "Respondent age"
long    income  %6f     "Family income of respondent"
str4    partyid %4s     "Party ID -- Demo or Repb"
}
You've seen most of this before, and the labels are self-explanatory, so the only new thing here is the stuff starting with the percentage sign. This tells STATA the data format in the file. The percentage sign starts the formatting information, the number tells how many characters are part of the variable (i.e., how wide it is in the file), and the letter tells whether the variable should be read in as a number (f) or a string (s). You also could write a number such as 6.3 if STATA had to add a decimal point; in this case, %6.3f would read 012345 as 12.345, giving you income in thousands rather than dollars.

If you get a huge dataset from the Harvard Data Center, you might not want to bother reading in all of the variables. Doing so would make an extremely large file that was hard to manage, both during analysis and in terms of computer storage. Say, for example, that between the age and income variables was a 30-character string representing the respondent's name, and you did not want this. Then the dictionary file would be:
dictionary using method5.raw {
int     age     %2f     "Respondent age"
    _skip(30)
long    income  %6f     "Family income of respondent"
str4    partyid %4s     "Party ID -- Demo or Repb"
}
Note that, for any given variable, STATA always starts reading the value with the first character after the point where it left off. Therefore, age is assumed to be the first two characters, then income is assumed to start with character 33 (3+30) and partyid with character 39 (33+6).

Finally, I should mention that some large datasets are so big that the values for a single observation take up more than one line. This is not a problem with free format data, since STATA treats carriage returns (i.e., what you get when you hit the enter key or a line wraps around) the same way it treats blank spaces or commas, and simply moves on to the next valid value. With fixed format, however, you need to tell STATA to skip a line to continue entering data for the same observation (otherwise it will assume that the rest of the data for that observation is missing). At the point in the dictionary file where STATA needs to jump to the start of a second line, merely type "_newline".
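
To illustrate, here is a sketch of a dictionary for a hypothetical file twoline.raw in which each respondent's age and income sit on one line and the party code sits at the start of the next line. The file name and the layout are assumptions made for the example.

```stata
dictionary using twoline.raw {
* First line of each observation: age (2 characters), income (6)
int     age     %2f     "Respondent age"
long    income  %6f     "Family income of respondent"
* Jump to the start of the next line of the same observation
    _newline
byte    partyid %1f     "Party ID -- 0=Democrat  1=Republican"
}
```

With this saved as, say, twoline.dct, the same kind of command as before ("infile using twoline.dct") would read two lines of the file for each observation.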