Mathematical and Statistical Application - Stata - Introduction to Stata
Getting Started
What is Stata?
Stata is a fast and user-friendly statistical package, which provides comprehensive data management and analysis. Stata has pull down menus which let you submit commands with no knowledge of Stata commands. Stata is available for a variety of platforms. Stata offers a wide array of pre-defined statistical procedures, yet its programming features allow for much flexibility. Stata has a broad suite of capabilities for professionals of all disciplines, including statistics for economists, political scientists, and other social scientists (including a range of panel-data models); statistics for biostatisticians, epidemiologists, and other health scientists (including a range of survival analysis models); and so on.
Stata at UNC-CH:
Stata for Windows is accessible from all ITS computer labs. It can also be used on the Research Computing server [http://help.unc.edu/?id=6020] Emerald. Reference manuals are available at several ITS labs, the statistics lab in Manning Hall, as well as at the Statistical Applications Support Group library located in ITS Manning 211 Manning Drive. Even though Stata for Windows will be introduced in this document, the commands in these notes work for UNIX, Linux and Macintosh Stata users.
Stata's Online HELP is very helpful!
Pull down Help and select Contents. You will be offered five categories: Basics, Data Management, Statistics, Graphics, and Programming and matrices. Once you have selected a category, you will be offered more options. Choose what you need, and in the end you will be presented with a description of the command or concept you were searching for, along with examples. A number of the help documents contain a link that launches a dialog box that can be used to submit the Stata command. To learn about a command you already know exists, you can view Stata's help documentation by typing at the Stata command prompt:
help some_command
For example, since you now know about the help command:
help help
If you need to search, then pull down the Help menu and select Search or execute the findit command:
findit survey
Stata will search its Help documents, the Stata web site, the FAQs, the Stata Journal, and even user-written programs available on the web.
Pull down menus are very helpful!
You do not need to know Stata commands to use Stata thanks to the dialog boxes. There are three ways to launch a dialog box:
1. use the menus (pull down Data, Graphics, or Statistics)
2. use the db command
3. pick the command from the online help
Language Syntax:
Items in square brackets "[]" are optional. Options come after a comma.
[prefix :] command [varlist] [=exp] [if exp] [in range] [weight] [ using my_file ] [, options]
Example:
summarize price mpg-length foreign if mpg == 30 in 1/74, detail
help language
Varlists:
Stata has a number of ways to list multiple variables. In the above example "mpg-length" refers to all the variables in the dataset between mpg and length, including mpg and length themselves.
help varlist
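As a small illustration of the varlist shorthand described above, the sketch below uses the auto.dta example dataset that ships with Stata. The local macro name expanded is an arbitrary choice for this example.

```stata
sysuse auto, clear
* a dash picks a run of adjacent variables, in dataset order
summarize price-rep78
* an asterisk matches any run of characters, so t* picks trunk and turn
describe t*
* unab expands a varlist into a local macro so you can inspect it
unab expanded : price-rep78
display "`expanded'"
```

Because a dash range follows the order of variables in the dataset, the same varlist can pick different variables after the order command has been used.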
Abbreviations:
You can abbreviate commands in Stata, but there is no general rule for abbreviations. Some commands are uniquely identified by a single letter; others require the full name and will not accept abbreviations. To see a command's abbreviation, type:
help some_command
The shortest abbreviation is the underlined part of the command name in the help file. For example, the command generate can be invoked with the single letter g, but gen or gene are most commonly used. You can also abbreviate variable names: if you type enough characters to uniquely identify a variable, Stata will assume that variable is meant. This is not recommended, but it is good to know.
Operators:
Stata uses two equal signs in a row to test for equality " == " but only one equal sign when assigning a value to a variable. Check out Stata's help page on the use of other operators:
help operator
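The difference between assignment and comparison can be sketched as follows, again assuming the auto.dta example dataset. The variable name pricey and the cutoff of 6000 are made up for this illustration.

```stata
sysuse auto, clear
* one equal sign assigns a value
generate byte pricey = 0
* two equal signs test for equality inside an if qualifier
replace pricey = 1 if foreign == 1 & price > 6000
* != means "not equal" and | means "or"
count if foreign != 1 | mpg >= 30
tabulate pricey foreign
```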
Log files:
When you start a logging session Stata will save all of the commands and the output in a text file (actually everything you see in the results window) so that at a later date you can see what you did now and how Stata responded to your commands. In general, it is a good idea to keep logs about what you do in a particular Stata session, so that when you try to remember what you did one month from now, you can simply look at the log file.
log using "introstata.log", replace
You can even log only the commands without the output. That may be a preferable option if you only want to remember the commands you used in the past, and you can always re-run those commands and get exactly the same result again.
cmdlog using "introstatacommands.log"
You can suspend logging by typing “log off” and close the log file by typing “log close”. Once you close the log file, you can continue logging with the same file by typing:
log using "introstata.log", append
You can later view any of these log files using any text editor (NotePad, WordPad, MS Word), but if you have SAS also installed on your computer it may be that SAS has claimed files ending in " .log" as SAS files so double-clicking on log files to open them will open them in SAS. To avoid that, right-click on the file and choose "open with..." and choose the editor you want to open the file. Stata's viewer will also open a log file. Under Stata's File menu, choose Log, View... and then browse to your log file.
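The logging commands above can be strung together as in the following sketch. The file name logdemo.log is a placeholder, and capture log close is added only so the example does not fail if a log happens to be open already.

```stata
* close any log that is already open; capture suppresses the error if none is
capture log close
log using "logdemo.log", replace
sysuse auto, clear
* output produced while logging is suspended is not written to the file
log off
describe
* resume logging
log on
summarize price
log close
```

Opening logdemo.log afterwards should show the summarize output but not the describe output.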
Loading Data Into Stata:
Commands used in this section:
| Command | Explanation |
| --- | --- |
| log | Create a log file |
| cd | Change directory |
| dir or ls | Show the files in the current directory |
| insheet | Read ASCII (text) data created by a spreadsheet |
| infile | Read unformatted ASCII (text) data |
| infix | Read ASCII (text) data in fixed format |
| input | Enter data from the keyboard |
| edit | Open Stata's data editor window |
| compress | Compress data in memory |
| save | Store the dataset currently in memory on disk in Stata data format |
| use | Load/open a Stata-formatted dataset |
| sysuse | Load an example Stata dataset that came with your Stata installation |
| count | Show the number of observations |
| list | List the values of variables |
| clear | Clear the entire dataset and everything else out of Stata's memory |
| memory | Display a report on memory usage |
| set memory | Set the size of memory |
Let's start with loading a Stata dataset. As can be expected, it is easier for Stata to read in a Stata dataset.
Stata data files end with the .dta extension. Stata data files are loaded into memory using the use command, but Stata has an example dataset called auto.dta and it can be loaded using the sysuse command no matter what directory you are currently in:
clear
sysuse "auto.dta"
To list all the example datasets available type “sysuse dir”.
We will continue with inputting the spreadsheet type of data file into Stata. A spreadsheet type of file is created by programs such as Excel. For example, in Excel, we can save a file into comma-separated-values format (CSV) file. Stata reads in this type of data using the insheet command. On your C:\ drive first create an Excel file that has variable names on the first line, example.xls and then save it as example.csv. Then in Stata do the following:
cd c:\
dir example.csv
insheet using "example.csv"
list
describe
What if the data file does not have the variable names on the first line? Create such a file and name it example_nonames.csv. Here is what we can do:
insheet variable1 variable2 variable3 using "example_nonames.csv", clear
count
list
describe
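To try insheet without creating a spreadsheet first, the sketch below writes a tiny comma-separated file from within Stata and reads it back. The file name example_small.csv, the file handle fh, and the data values are invented for this illustration. In Stata 13 and later, import delimited is the newer command for the same task, though insheet continues to work.

```stata
* write a tiny comma-separated file so the example needs no spreadsheet
capture file close fh
file open fh using "example_small.csv", write text replace
file write fh "id,score" _n
file write fh "1,70" _n
file write fh "2,86" _n
file close fh
* names: the first row holds variable names; comma: values are comma-separated
insheet using "example_small.csv", comma names clear
list
describe
```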
To read a space-delimited file, use the infile command. Enter the following using any text editor and save it as phd.txt.
0 70 Hamilton 2
1 121 Betty 2
0 86 Arnie 2
0 141 Zach 2
1 . Joy 2
1 100 Alica 5
0 99 Ron 5
0 55 Matt 4
Notice how we specify that “name” is a character variable of length 10 using “str10”.
infile gender score str10 name year using "phd.txt", clear
list if year==2
The other commonly used ASCII data format is fixed format. It always requires a codebook to specify which columns correspond to which variable. Here is a small example of this type of data with a codebook. Notice how we make use of the codebook in the infix command below. Copy and paste the following into a text file and name it “grades.txt”.
0195  094951
026386161941
038780081841
0479700  870
056878163690
066487182960
0786  069  0
088194193921
098979090781
107868180801
The variables to be created will all be numeric and exist in these columns:
| variable name | column number |
| --- | --- |
| id | 1-2 |
| hw | 3-4 |
| quiz | 5-6 |
| gender | 7 |
| mt | 8-9 |
| final | 10-11 |
| hispanic | 12 |
Now read in the data with the infix command:
infix id 1-2 hw 3-4 quiz 5-6 gender 7 mt 8-9 final 10-11 hispanic 12 using "grades.txt", clear
The compress command reduces the size of the dataset by changing each variable to the most efficient storage type that can hold its values without loss of information.
compress
To learn more about the data types Stata uses, check out the help page:
help data_types
We can save the data to disk by issuing the save command.
save "grades.dta", replace
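To see compress change a storage type, the following sketch deliberately stores a copy of price from auto.dta as a double and then compresses it. The variable name price_dbl and the file name auto_compressed.dta are placeholders chosen for this example.

```stata
sysuse auto, clear
* store whole-dollar prices in the largest numeric type on purpose
generate double price_dbl = price
describe price_dbl
* compress picks the smallest type that holds every value exactly
compress price_dbl
describe price_dbl
* save the result and load it back to confirm the round trip
save "auto_compressed.dta", replace
use "auto_compressed.dta", clear
```

The second describe should report a smaller storage type than the first, since all the prices are whole numbers that fit in an integer type.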
When we are loading data into Stata, the data file may be too big to be read in. We will have to reset the size of memory in Stata.
memory
Stata will not allow the change in the memory if a dataset is in use so use the clear command to clear out any data that might be in use. (Make sure you have saved your data first.):
clear
set memory 40m
The "m" of "40m" stands for "megabytes." This is a reasonable standard amount of memory to request. (In Stata 12 and later, memory is managed automatically and set memory is no longer needed.)
There are 2 interactive ways to enter data:
1. using the input command
2. using the edit command or equivalently clicking on the data editor icon on the menu bar
input var1 var3-var5
1 3 4 5
3 5 7 8
4 7 6 7
end
edit
Exploring and Modifying Data
Commands used in this section:
| Data Management Commands | Description |
| --- | --- |
| describe | Describe a dataset |
| codebook | Detailed contents of a dataset |
| sort | Sort observations in a dataset |
| label define | Define a set of labels for the levels of a categorical variable |
| label values | Apply value labels to a variable |
| generate | Create a new variable |
| egen | Extended generate - has special functions that can be used when creating a new variable |
| replace | Modify a variable |
| rename | Rename a variable |
| recode | Recode a value or range of values to another value, sometimes simpler than a series of replace commands with if conditions |
| order | Order the variables in a dataset |
| keep | Keep variables in varlist (and drop all the rest) |
| keep if | Keep observations if specified condition is met |
| drop | Drop variables in varlist (and keep all the rest) |

| Data Analysis Commands | Description |
| --- | --- |
| summarize | Descriptive statistics |
| tabstat | Table of descriptive statistics |
| table | Create a table of statistics |
| graph | Create high resolution graphs |
| kdensity | Create kernel density estimates and graph results |
| histogram | Histogram of a continuous or categorical variable |
| tabulate | One- and two-way frequency tables |
| correlate | Correlations |
| pwcorr | Pairwise correlations |
Load auto.dta:
sysuse "auto.dta", clear
Before we start our statistical exploration we will look at the data using the describe, list and codebook commands. Note that the variable make is a string variable.
describe
codebook
list
list make price mpg weight
List the third, fourth, fifth, and sixth records:
list in 3/6
Sort the data by the variable mpg :
sort mpg
List the first and last records (_n is an internal Stata variable holding the current record number; _N is the total number of records in the dataset):
list if _n==1 | _n==_N
List the very last record ("l" is a lowercase L and stands for "last")
list in l
List the last 5 records:
list in -5/l
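The internal variables _n and _N are also evaluated within groups when combined with the by prefix. The sketch below is one way to see this. It wraps the demonstration in preserve and restore so that the dataset you are working on is left untouched. The names within_id, group_size, and n_groups are invented for this example.

```stata
* set the working dataset aside; restore brings it back at the end
preserve
sysuse auto, clear
* within each origin group, number the cars from lowest to highest mpg
bysort foreign (mpg): generate within_id = _n
* under by, _N is the number of records in the current group
bysort foreign: generate group_size = _N
list make foreign mpg within_id group_size if within_id == 1
quietly count if within_id == 1
scalar n_groups = r(N)
display "number of groups: " n_groups
restore
```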
The basic descriptive statistics command in Stata is summarize . Along with summarize, we also use the tabstat and table commands for displaying descriptive statistics within groups.
summarize
summarize mpg weight price
summarize weight, detail
In the following example "sum" is an abbreviation of "summarize":
by foreign, sort: sum mpg price
tabulate foreign, summarize(mpg)
table foreign, contents(freq mean mpg p50 mpg)
list if make=="Toyota Corolla"
sum mpg in 1/10
tabstat weight price mpg , by(rep78) statistics(n mean sd)
tabstat mpg, statistics(n mean sd p25 p50 p75) by(foreign)
In the following example "tab" is an abbreviation of "tabulate":
tab mpg
Create a one-way table for each variable in the varlist:
tab1 foreign rep78
sum price, detail
In the following example "!" means "not" so "& !missing(price)" means "and variable price does not contain a missing value":
generate class = ((price > 5000) & !missing(price))
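Perhaps a more obvious way of doing this is the following (run "drop class" first if you have already created the variable, since generate will not overwrite it). It relies on the fact that Stata treats a missing value, written as a period, as larger than any number, so "price < ." is true only for nonmissing prices: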
generate class = 0
replace class = 1 if price > 5000 & price < .
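Why the missing-value check matters is easier to see with a variable that actually has missing values. In auto.dta, rep78 is such a variable (price has none). The sketch below compares a naive count with a guarded one. It uses preserve and restore so the class variable you just created is not lost, and the scalar names are invented for this example.

```stata
* set the working dataset aside; restore brings it back at the end
preserve
sysuse auto, clear
* a missing value compares as larger than any number, so this count includes them
quietly count if rep78 > 3
scalar n_naive = r(N)
* excluding missing values gives the count that was probably intended
quietly count if rep78 > 3 & !missing(rep78)
scalar n_safe = r(N)
quietly count if missing(rep78)
scalar n_missing = r(N)
display "naive: " n_naive "  guarded: " n_safe "  missing: " n_missing
restore
```

The naive count exceeds the guarded count by exactly the number of missing values.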
Create value label expen with 0 meaning "cheap" and 1 "expensive":
label define expen 0 cheap 1 expensive
label list expen
Assign value label expen to variable class:
label values class expen
table foreign class, contents(freq mean mpg mean price mean weight) row col
table foreign ,by(class) contents(freq mean mpg mean price mean weight) row col
The correlate command displays the correlation or covariance matrix for varlist or, if varlist is not specified, for all variables in the data. Observations with a missing value on any of the variables are excluded from the calculation (casewise deletion):
correlate foreign mpg price weight
The command pwcorr displays all the pairwise correlation coefficients between the variables in varlist or, if varlist is not specified, all the variables in the dataset.
pwcorr foreign mpg price weight
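The two commands only differ when some variables have missing values, which is not the case for the four variables used above. The following sketch adds rep78, which has missing values in auto.dta, to make the difference visible. It uses preserve and restore to leave your working data alone, and the scalar names are invented for this example.

```stata
* set the working dataset aside; restore brings it back at the end
preserve
sysuse auto, clear
* correlate drops a car if ANY of the listed variables is missing
correlate price mpg rep78
scalar n_casewise = r(N)
* pwcorr uses every car available for each pair; obs prints the count per pair
pwcorr price mpg rep78, obs
quietly count if !missing(price, mpg)
scalar n_price_mpg = r(N)
display "casewise N: " n_casewise "   N for the price-mpg pair: " n_price_mpg
restore
```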
Graphing:
histogram mpg, normal bin(10)
kdensity mpg, normal
graph twoway scatter mpg weight
Graphs for separate groups can be combined into one figure with the by() option:
graph twoway scatter mpg weight, by(foreign)
In the following example "gen" is short for "generate":
gen score = mpg
rename score grade
recode grade 0/14=0 15/19=1 20/29=2 30/34=3 35/39=4 40/50=5
label define abcdf 0 "F" 1 "D" 2 "C" 3 "B" 4 "A" 5 "A+"
label variable grade "These are the grades for miles per gallon score."
label values grade abcdf
codebook grade
Let's label the dataset itself so that we will remember what the data are. We can also add some notes to the dataset using the notes command.
label data "Domestic and Foreign Cars Data"
There is another way to create variables in Stata that uses special functions. Some of the functions available to you are listed in the table below. Some examples of the use of the functions follow.
egen wmean = mean(price), by(grade)
list weight grade wmean
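A few more egen functions are sketched below: mean(), rank(), group(), and max(). The new variable names are invented for this example, and the block uses preserve and restore so that the grade and wmean variables you just created are still there afterwards.

```stata
* set the working dataset aside; restore brings it back at the end
preserve
sysuse auto, clear
* group mean, repeated on every record of the group
egen mean_mpg = mean(mpg), by(foreign)
* rank of each car by price (1 = cheapest)
egen price_rank = rank(price)
* one integer code per combination of the listed variables
egen origin_rep = group(foreign rep78)
* overall maximum, repeated on every record
egen max_price = max(price)
list make foreign mpg mean_mpg price price_rank in 1/5
* record a check value before the demonstration data are discarded
quietly summarize price
scalar max_matches = (max_price[1] == r(max))
restore
```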
save "auto1.dta"
This use of the keep command keeps variables:
keep make mpg price weight foreign grade
In the following example "keep if !foreign" means "keep if variable foreign == 0", since the variable foreign has only the values 1 and 0. Note that this use of the keep command keeps observations (a.k.a. records or rows):
keep if !foreign
save "autodom.dta", replace
drop if grade==0
Analyzing Data
Commands used in this section:
| Command | Description |
| --- | --- |
| ttest | T-test |
| regress | Regression |
| predict | Predict after model estimation |
| twoway | Two-way graphs |
| probit | Probit regression |
| dprobit | Probit regression reporting marginal effects |
T-test:
sysuse "auto.dta", clear
This is the one-sample t-test, testing whether the sample of gas mileage was drawn from a population with a mean of 20.
ttest mpg = 20
This is the paired t-test, testing whether or not the mean of weight equals the mean of price .
ttest weight = price
This is the two-sample independent t-test with pooled (equal) variances.
ttest mpg, by(foreign)
This is the two-sample independent t-test with separate (unequal) variances.
ttest mpg, by(foreign) unequal
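After a t-test, Stata keeps the numbers from the output table as stored results, which can be reused in later commands. A minimal sketch, assuming auto.dta:

```stata
sysuse auto, clear
ttest mpg, by(foreign)
* show everything the command left behind
return list
* pick out individual results
display "two-sided p-value:   " %6.4f r(p)
display "difference in means: " %6.2f r(mu_1) - r(mu_2)
display "group sizes:         " r(N_1) " and " r(N_2)
```

Stored results are overwritten by the next command that returns results, so copy anything you need into a local macro or scalar first.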
Regression:
regress mpg weight foreign
regress mpg weight
Regression with robust standard errors. This is very useful when there is heterogeneity of variance. This option does not affect the estimates of the regression coefficients.
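No command accompanies the description of robust standard errors above. A likely form, assuming the same regression of mpg on weight as in the previous command, is to add the robust option:

```stata
sysuse auto, clear
* ordinary standard errors
regress mpg weight
* same coefficients, robust standard errors
regress mpg weight, robust
```

Comparing the two outputs shows identical coefficients with different standard errors, t statistics, and confidence intervals.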
predict p
With the resid option, the predict command calculates the residuals; with no option, as in the command above, it calculates the fitted values.
predict r, resid
The list command displays the values of the variables that we have generated. The in 1/20 qualifier restricts the display to the first 20 observations.
list mpg p r in 1/20
In order to demonstrate probit estimation we will create a dichotomous variable called goodmpg. This is purely for illustration!
gen goodmpg = (mpg == 30)
In the above example missing values of the variable mpg are not expected to exist. If mpg did contain missing values, goodmpg would equal 0 rather than missing for those observations. So the example is more safely coded as follows (use this in place of the command above, since generate will not overwrite an existing variable):
gen goodmpg = (mpg == 30) if mpg < .
probit goodmpg weight
dprobit goodmpg weight
Graphing:
twoway scatter mpg weight || line p weight
An easier way to do this without explicitly running a regression:
twoway scatter mpg weight || lfit mpg weight
twoway (qfitci mpg weight, stdf) (sc mpg weight), by(foreign)
The kdensity command with the normal option displays a density graph of the residuals with a normal distribution superimposed on the graph. This is particularly useful for checking whether the residuals are approximately normally distributed, which is an important assumption for regression.
kdensity r, normal
The Next Level in Programming in Stata:
Do-files:
Rather than typing commands at the keyboard, you can create a text file containing commands and instruct Stata to execute the commands stored in that file. Such files are called do-files, since the command that causes them to be executed is do and the preferred file extension is ".do".
- A do-file is a standard ASCII text file.
- A do-file is executed by Stata when you type do filename .
You can use any text editor to create do-files, or you can use the built-in do-file editor by typing doedit, or by clicking on the do-file editor icon on the menu bar at the top.
do "dofile.do"
The following illustrates some features of do-files. Stata reads each line as a separate command because, by default, it uses the carriage return (invisible to our eyes) as the end-of-command delimiter. The #delimit command can change the delimiter from a carriage return to a semicolon and back again.
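A short sketch of #delimit in use, assuming auto.dta. Note that #delimit works only inside a do-file, so this needs to be saved and run with do rather than typed at the command prompt.

```stata
* run this from a do-file; #delimit is not available interactively
sysuse auto, clear
#delimit ;
summarize price mpg weight
    if foreign == 1,
    detail ;
#delimit cr
* back to one command per line
count
```

While the semicolon delimiter is active, every command must end with a semicolon, including one-line commands.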
Copy and paste the following into the do-file editor window and "do" it, i.e., click the icon that looks like a sheet of paper with lines on it.
// This is a comment.
/* This is also a comment. */
* This is a comment as well.
version 10 // This tells Stata the version under which this do-file was written.
// This helps ensure that it will work in all future versions of Stata.
set more off // Now Stata does not pause every time the screen is full.
/* unless you specify an explicit directory path,
* myjob.log is saved in the current directory */
log using "myjob.log", replace
// Stata can even use datasets that are stored on the web:
use "http://www.stata-press.com/data/r8/census", clear
tab region
tab region, nolabel
summarize if region==3
log close
// Stata has internal macro variables. The following command lists all that are available:
creturn list
display "Today's date is: `c(current_date)'"
// Using three forward slashes in a row tells Stata that the command continues to the next line:
local logfilename1 = upper(word("`c(current_date)'",1) + word("`c(current_date)'",2) ///
+ word("`c(current_date)'",3))
log using "`logfilename1'.log", replace
Macros:
Macro variables are Stata variables that do not exist in the data, so they do not change from record to record as a data variable does. To evaluate a local macro variable you need to surround its name with left and right quotes. The left quote ( ` ) is on the left-hand side of your keyboard and is on the same key as the tilde ( ~ ). The right quote is the normal single quote ( ' ), which is on the same key as the double quote ( " ). Local macro variables are called "local" because they only have a value in the do-file, loop, or program in which they were created. A global macro variable maintains its value during your Stata session no matter where it was created. To evaluate a global macro you need to add a dollar sign ( $ ) to the beginning of its name.
For the most part, consider macros as a way of pasting in text to your Stata code so that you can reduce how much typing you need to do to get Stata to do what you want. Learn macros when you get to a point where you see patterns in your code and you would like to come up with a more efficient way of writing your code.
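The "pasting in text" idea can be sketched in a few lines. The macro names greeting, course, total, and text are invented for this example.

```stata
* a local macro is evaluated with `name', a global macro with $name
local greeting "hello"
global course "Intro to Stata"
display "`greeting' from $course"
* with an equal sign, the expression is evaluated before it is stored
local total = 2 + 3
* without an equal sign, the text itself is stored
local text 2 + 3
display "`total' versus `text'"
```

The last line displays 5 versus 2 + 3, which shows that a macro holds text and that the equal sign decides whether that text is computed first.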
clear
set seed 1234
set obs 30 // this creates a blank dataset of 30 observations
foreach var of newlist var1-var10 { // "newlist" checks to make sure the variables in the varlist do not already exist in your data.
gen `var' = int(uniform()*10)
} // this is the end of the foreach loop
list var1-var10
local set1 "var2 var3 var6"
global set2 "var4 var5 var8 var10"
regress var1 `set1'
regress var1 $set2
forvalues x = 1/10 {
if mod(`x',2) {
display "`x' is odd"
continue
}
display "`x' is even"
}
foreach var of newlist z1-z20 {
gen `var' = uniform()
}
foreach num of numlist 1(1)4 6(2)13 {
if mod(`num',2) {
display "`num' is odd"
continue
}
display "`num' is even"
}
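Besides newlist and numlist, foreach can also loop over variables that already exist by using "of varlist". A minimal sketch, assuming auto.dta (the loop macro name v is arbitrary):

```stata
sysuse auto, clear
* "of varlist" checks that each variable exists before the loop starts
foreach v of varlist price mpg weight {
    quietly summarize `v'
    display "`v': mean = " %9.2f r(mean)
}
```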
clear
set obs 100
Generate 10 uniform random variables named x1, x2, ..., x10.
set seed 12345
forvalues i = 1(1)10 { // equivalently 1/10, or 1 2 to 10
generate x`i' = uniform()
quietly count if x`i' > .1
display " % of x`i' > 1/10 = " round(100*r(N)/_N,.01)
gen x`i'ltdec = x`i' < .1
}
sum x*dec
Try the same with 1,000 and with 1,000,000 observations. We can do exactly the same thing by using a while loop:
clear
set obs 1000
set seed 12345
local i = 1
while `i' < 11 {
generate x`i' = uniform()
quietly count if x`i' > .1
display " % of x`i' > 1/10 = " round(100*r(N)/_N,.01)
gen x`i'ltdec = x`i' < .1
local i = `i' + 1 // you need to increment the local macro i, otherwise the loop is endless.
}
sum x*dec
Ado-files:
Consider ado-files to be "automatic do-files" which contain canned Stata code that you can reference in other do-files or interactive Stata usage. A lot of Stata is written in ado-files. So, if you really do the same thing a lot, consider putting that Stata code in an ado-file.
Stata looks for ado-files in seven places, which fall into two groups:
I. The official ado directories
| (UPDATES) | the official updates directory |
| (BASE) | the official base directory |
II. Your personal ado directories, meaning
| (SITE) | the directory for ado-files your system administrator might have installed |
| (.) | the ado-files you have written just this instant or for just this project |
| (PERSONAL) | the directory for ado-files you personally might have written |
| (PLUS) | the directory for ado-files you personally might have installed |
| (OLDPLACE) | the directory where Stata users used to save their personally written ado-files |
. adopath
[1] (UPDATES) "C:\Program Files\Stata-9\ado\updates/"
[2] (BASE) "C:\Program Files\Stata-9\ado\base/"
[3] (SITE) "C:\Program Files\Stata-9\ado\site/"
[4] "."
[5] (PERSONAL) "c:\ado\personal/"
[6] (PLUS) "c:\ado\plus/"
[7] (OLDPLACE) "c:\ado/"
program rangeours // arguments are n a b
drop _all
args n a b
set obs `n'
gen x = (_n-1)/(_N-1)*(`b'-`a') + `a'
end
Copy and paste the above into a text editor and then save it as a plain text file rangeours.ado in the current directory.
Now, type:
rangeours 100 1 2
list
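While a program is still being developed, it can be convenient to define it in a do-file or at the command prompt instead of saving an ado-file each time. The sketch below is a variant of the program above under the made-up name rangeours_demo. The capture program drop line removes any earlier definition so the block can be rerun, and the version line is an addition that was not in the original program.

```stata
* remove an earlier definition, if any, so the block can be run repeatedly
capture program drop rangeours_demo
program rangeours_demo
    version 10
    * arguments: number of observations, lower bound, upper bound
    args n a b
    drop _all
    set obs `n'
    * evenly spaced values from a to b
    generate x = (_n-1)/(_N-1)*(`b'-`a') + `a'
end
rangeours_demo 5 0 1
list
```

With these arguments x runs from 0 to 1 in steps of 0.25. Like the original, the program drops the data in memory, so save your work before calling it.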


