Mathematical and Statistical Application - Stata - Introduction to Stata

Table of Contents

Getting Started

Loading Data Into Stata:

Exploring and Modifying Data

Analyzing Data

The Next Level in Programming in Stata:

Additional help

Getting Started

What is Stata?

Stata is a fast and user-friendly statistical package, which provides comprehensive data management and analysis. Stata has pull down menus which let you submit commands with no knowledge of Stata commands. Stata is available for a variety of platforms. Stata offers a wide array of pre-defined statistical procedures, yet its programming features allow for much flexibility. Stata has a broad suite of capabilities for professionals of all disciplines, including statistics for economists, political scientists, and other social scientists (including a range of panel-data models); statistics for biostatisticians, epidemiologists, and other health scientists (including a range of survival analysis models); and so on.

Note: Stata is case sensitive, just like UNIX/Linux. Most commands are in lower case. Variable names are case sensitive too: age, Age, and AGE can all exist in the same dataset, but creating variables that differ only by case is not a good idea. Try to use only lower-case letters.

Stata at UNC-CH:

Stata for Windows is accessible from all ITS computer labs. It can also be used on the Research Computing server [http://help.unc.edu/?id=6020] Emerald. Reference manuals are available at several ITS labs, the statistics lab in Manning Hall, as well as at the Statistical Applications Support Group library located in ITS Manning 211 Manning Drive. Even though Stata for Windows will be introduced in this document, the commands in these notes work for UNIX, Linux and Macintosh Stata users.

Stata's Online HELP is very helpful!

Pull down Help and select Contents. You will be offered five categories: Basics, Data Management, Statistics, Graphics, and Programming and matrices. Once you have selected a category, you will be offered more options. Choose what you need, and in the end you will be presented with a description of the command or concept you were searching for, along with examples. A number of the help documents contain a link that launches a dialog box that can be used to submit the Stata command. To learn about a command you already know exists, you can view Stata's help documentation by typing at the Stata command prompt:

 help some_command

For example, since you now know about the help command:

 help help

If you need to search, then pull down the Help menu and select Search or execute the findit command:

 findit survey

Stata will search its Help documents, the Stata web site, the FAQs, the Stata Journal, and even user-written programs available on the web.

Pull down menus are very helpful!

You do not need to know Stata commands to use Stata thanks to the dialog boxes. There are three ways to launch a dialog box:

1. use the menus (pull down Data, Graphics, or Statistics)

2. use the db command

3. pick the command from the online help

Note: This document is prepared just to get you familiar with Stata. Stata has lots of other commands and features that you may want to know.

Language Syntax:

Code in square brackets "[]" is optional. Command options come after a comma.

[prefix :] command [varlist] [=exp] [if exp] [in range] [weight] [ using my_file ] [, options]

Example:

 summarize price mpg-length foreign if mpg == 30 in 1/74, detail
 help language

Varlists:

Stata has a number of ways to list multiple variables. In the above example, "mpg-length" means all the variables in the dataset from mpg through length, inclusive, in the order in which they are stored.

 help varlist
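A few other varlist forms, sketched here with the auto.dta example dataset that ships with Stata. The particular variables chosen are only for illustration.

```stata
sysuse auto, clear
describe mpg-length          // every variable stored from mpg through length
summarize m*                 // every variable whose name starts with m
summarize price mpg-weight   // a single name mixed with a range
describe _all                // every variable in the dataset
```

String variables such as make can appear in a varlist passed to summarize, but they are reported with zero observations.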

Abbreviations:

You can abbreviate commands in Stata. However, there is no general rule for abbreviations. Some commands are uniquely identified by a single letter. Some require the full name and will not accept abbreviations. To see the shortest abbreviation of a command, type:

 help some_command

The shortest abbreviation is the part of the command name that is underlined in the help file. For example, the command generate can be invoked with just the letter g, but gen is most commonly used. You can also abbreviate variable names: if you type enough characters to uniquely identify a variable, Stata will assume that is the variable you mean. This is not recommended, but it is worth knowing.

Operators:

Stata uses two equal signs in a row to test for equality " == " but only one equal sign when assigning a value to a variable. Check out Stata's help page on the use of other operators:

 help operator
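As a small sketch of the most common operators, again assuming the auto.dta example dataset (the variable heavy is a hypothetical name made up for this illustration):

```stata
sysuse auto, clear
count if mpg == 30                      // == tests for equality
generate byte heavy = weight > 3000     // = assigns; the comparison evaluates to 1 or 0
count if foreign == 1 & price > 10000   // & means "and"
count if mpg < 15 | mpg > 35            // | means "or"
count if rep78 != 3 & !missing(rep78)   // != means "not equal", ! means "not"
```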

Log files:

When you start a logging session Stata will save all of the commands and the output in a text file (actually everything you see in the results window) so that at a later date you can see what you did now and how Stata responded to your commands. In general, it is a good idea to keep logs about what you do in a particular Stata session, so that when you try to remember what you did one month from now, you can simply look at the log file.

 log using "introstata.log", replace

Note: When you do not specify a directory path, Stata uses the current working directory, which is printed in the bottom left-hand side of the Stata window. You can also use the pwd (print working directory) command to see what your current working directory is.

You can even log only the commands without the output. That may be a preferable option if you only want to remember the commands you used in the past, and you can always re-run those commands and get exactly the same result again.

 cmdlog using "introstatacommands.log"

You can suspend logging by typing “log off” (and resume it by typing “log on”), and close the log file by typing “log close”. Once you close the log file, you can continue logging with the same file by typing:

 log using "introstata.log", append

You can later view any of these log files using any text editor (NotePad, WordPad, MS Word), but if you have SAS also installed on your computer it may be that SAS has claimed files ending in " .log" as SAS files so double-clicking on log files to open them will open them in SAS. To avoid that, right-click on the file and choose "open with..." and choose the editor you want to open the file. Stata's viewer will also open a log file. Under Stata's File menu, choose Log, View... and then browse to your log file.
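The logging commands described above can be sketched as one short session. The file name is the one used earlier; the summarize commands are only placeholders for whatever work you want logged.

```stata
capture log close            // close any log that is already open; ignored if none is
log using "introstata.log", replace
sysuse auto, clear
log off                      // suspend logging; the file stays open
summarize price              // not written to the log
log on                       // resume logging
summarize mpg                // written to the log
log close
```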

Loading Data Into Stata:

Commands used in this section:

Command      Explanation
-----------  --------------------------------------------------------------------
log          Create a log file
cd           Change directory
dir or ls    Show the files in the current directory
insheet      Read ASCII (text) data created by a spreadsheet
infile       Read unformatted ASCII (text) data
infix        Read ASCII (text) data in fixed format
input        Enter data from the keyboard
edit         Open Stata's data editor window
compress     Compress data in memory
save         Store the dataset currently in memory on disk in Stata data format
use          Load/open a Stata-formatted dataset
sysuse       Load an example Stata dataset that came with your Stata installation
count        Show the number of observations
list         List the values of variables
clear        Clear the entire dataset and everything else out of Stata's memory
memory       Display a report on memory usage
set memory   Set the size of memory

Let's start with loading a Stata dataset. As can be expected, it is easier for Stata to read in a Stata dataset.

Stata data files end with the .dta extension. Stata data files are loaded into memory using the use command, but Stata has an example dataset called auto.dta and it can be loaded using the sysuse command no matter what directory you are currently in:

 clear
 sysuse "auto.dta"

To list all the example datasets available type “sysuse dir”.

We will continue with reading a spreadsheet-type data file into Stata. Such files are created by programs such as Excel. For example, in Excel we can save a file in comma-separated values (CSV) format. Stata reads this type of data with the insheet command. First create an Excel file, example.xls, on your C:\ drive with the variable names on the first line, and then save it as example.csv. Then in Stata do the following:

 cd c:\
 dir example.csv
 insheet using "example.csv"
 list
 describe
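In newer releases (Stata 13 and later, as far as this sketch assumes) the import delimited command covers the same ground as insheet. The example below writes a small CSV file from auto.dta first so that it can be run without Excel; the file name auto_sample.csv is a hypothetical name chosen for the illustration.

```stata
sysuse auto, clear
keep make price mpg
keep in 1/5
export delimited using "auto_sample.csv", replace         // write a CSV with names on line 1
import delimited using "auto_sample.csv", varnames(1) clear
describe
list
```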

What if the data file does not have the variable names on the first line? Create such a file and name it example_nonames.csv. Here is what we can do:

 insheet variable1 variable2 variable3 using "example_nonames.csv", clear
 count
 list
 describe

To read a space-delimited file, use the infile command. Enter the following using any text editor and save it as phd.txt.

0 70 Hamilton 2
1 121 Betty 2
0 86 Arnie 2
0 141 Zach 2
1 . Joy 2
1 100 Alica 5
0 99 Ron 5
0 55 Matt 4

In the command below, notice how we specify that “name” is a string variable of length 10 using “str10”.

 infile gender score str10 name year using "phd.txt", clear
 list if year==2

The other commonly used ASCII data format is fixed format. It always requires a codebook to specify which columns correspond to which variable. Here is a small example of this type of data with a codebook. Notice how we make use of the codebook in the infix command below. Copy and paste the following into a text file and name it “grades.txt”.

0195  094951
026386161941
038780081841
0479700  870
056878163690
066487182960
0786  069  0
088194193921
098979090781
107868180801

The variables to be created will all be numeric and exist in these columns:

variable name   column number
-------------   -------------
id              1-2
hw              3-4
quiz            5-6
gender          7
mt              8-9
final           10-11
hispanic        12

Now read in the data with the infix command:

 infix id 1-2 hw 3-4 quiz 5-6 gender 7 mt 8-9 final 10-11 hispanic 12 using "grades.txt", clear

The compress command reduces the size of the dataset by changing each variable to the most efficient storage type that can hold its values without losing information.

 compress

To learn more about the data types Stata uses, check out the help page:

 help data_types

We can save the data to disk by issuing the save command.

 save "grades.dta", replace
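To see what compress does to storage types, here is a minimal sketch on a made-up five-observation dataset (the variable x is hypothetical). It clears the data in memory, so run it only after saving your work.

```stata
clear
set obs 5
generate double x = _n     // stored as an 8-byte double
describe x
compress                   // x holds only small integers, so one byte per value is enough
describe x                 // the storage type is now byte
```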

When we are loading data into Stata, the data file may be too big to be read in. We will have to reset the size of memory in Stata.

 memory

Stata will not allow the change in the memory if a dataset is in use so use the clear command to clear out any data that might be in use. (Make sure you have saved your data first.):

 clear 
 set memory 40m

The "m" of "40m" stands for "megabytes." This setting is a decent standard amount of memory to make available to Stata.

Note: Stata stores the data in memory, so how large a dataset you can analyze depends on how much memory your computer can allot to Stata. This is not often a problem, but it is worth knowing. Setting your memory too large can actually slow down Stata. In Stata 12 and later, memory is managed automatically and set memory is no longer needed.

There are two interactive ways to enter data:

1. using the input command

2. using the edit command or equivalently clicking on the data editor icon on the menu bar

 input var1 var3-var5
 1 3 4 5
 3 5 7 8
 4 7 6 7
 end
 edit

Exploring and Modifying Data

Commands used in this section:

Data Management Commands   Description
------------------------   ----------------------------------------------------------
describe                   Describe a dataset
codebook                   Detailed contents of a dataset
sort                       Sort observations in a dataset
label define               Define a set of labels for the levels of a categorical variable
label values               Apply value labels to a variable
generate                   Create a new variable
egen                       Extended generate: has special functions that can be used when creating a new variable
replace                    Modify a variable
rename                     Rename a variable
recode                     Recode a value or range of values to another value; sometimes simpler than a series of replace commands with if conditions
order                      Order the variables in a dataset
keep                       Keep variables in varlist (and drop all the rest)
keep if                    Keep observations if the specified condition is met
drop                       Drop variables in varlist (and keep all the rest)

Data Analysis Commands     Description
------------------------   ----------------------------------------------------------
summarize                  Descriptive statistics
tabstat                    Table of descriptive statistics
table                      Create a table of statistics
graph                      Create high-resolution graphs
kdensity                   Create kernel density estimates and graph the results
histogram                  Histogram of a continuous or categorical variable
tabulate                   One- and two-way frequency tables
correlate                  Correlations
pwcorr                     Pairwise correlations

Load auto.dta:

 sysuse "auto.dta", clear

Before we start our statistical exploration we will look at the data using the describe, list and codebook commands. Note that the variable make is a string variable.

 describe
 codebook
 list
 list make price mpg weight

List the third, fourth, fifth, and sixth records:

 list in 3/6   

Sort the data by the variable mpg :

 sort mpg 

List the first and last records ( _n is an internal Stata variable holding the number of the current record, and _N is the total number of records in the dataset):

 list if _n==1 | _n==_N 

List the very last record ("l" is a lowercase L and stands for "last")

 list in l 

List the last 5 records:

 list in -5/l 

The basic descriptive statistics command in Stata is summarize . Along with summarize, we also use the tabstat and table commands for displaying descriptive statistics within groups.

 summarize
 summarize mpg weight price
 summarize weight, detail
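After summarize runs, Stata keeps the statistics it computed as stored results that later commands can use. A minimal sketch, assuming the auto.dta example data (the variable mpg_centered is a hypothetical name for the illustration):

```stata
sysuse auto, clear
summarize mpg
return list                             // shows the stored results, such as r(N), r(mean), r(sd)
display "mean mpg = " r(mean)
generate mpg_centered = mpg - r(mean)   // use a stored result in an expression
summarize mpg_centered
```

Stored results are overwritten by the next command that stores its own, so use them right away or copy them into a macro.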

In the following example "sum" is an abbreviation of "summarize":

 by foreign, sort: sum mpg price 
 tabulate foreign, summarize(mpg)
 table foreign, contents(freq mean mpg p50 mpg)
 list if make=="Toyota Corolla"
 sum mpg in 1/10
 tabstat weight price mpg , by(rep78) statistics(n mean sd)
 tabstat mpg, statistics(n mean sd p25 p50 p75) by(foreign)

In the following example "tab" is abbreviation of "tabulate":

 tab mpg

Create a one-way table for each variable in the varlist:

 tab1 foreign rep78
 sum price, detail

In the following example "!" means "not" so "& !missing(price)" means "and variable price does not contain a missing value":

 generate class = ((price > 5000) & !missing(price))

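Perhaps a more obvious way of doing this is the following. It relies on the fact that Stata treats a missing value, which is written as a period, as larger than any number, so "price < ." is true only for nonmissing prices. (The class variable created above has to be dropped before it can be created again.)

 drop class
 generate class = 0
 replace class = 1 if price > 5000 & price < .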
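The way missing values sort above every number can be checked on a tiny made-up dataset. The sketch below uses preserve and restore so that the data currently in memory come back afterwards; the three prices are invented for the illustration.

```stata
preserve                                  // set aside the data currently in memory
clear
input price
4000
6000
.
end
count if price > 5000                     // counts 2: the missing price is treated as larger than 5000
count if price > 5000 & !missing(price)   // counts 1: only the 6000
restore                                   // bring the earlier data back
```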

Create value label expen with 0 meaning "cheap" and 1 "expensive":

 label define expen 0 cheap 1 expensive  
 label list expen 

Assign value label expen to variable class:

 label values class expen 
 table foreign class, contents(freq mean mpg mean price mean weight) row col
 table foreign ,by(class) contents(freq mean mpg mean price mean weight) row col

The correlate command displays the correlation or covariance matrix for varlist or, if varlist is not specified, for all variables in the data. Observations with a missing value on any of the variables are excluded from the calculation (casewise deletion):

 correlate foreign mpg price weight

The command pwcorr displays all the pairwise correlation coefficients between the variables in varlist or, if varlist is not specified, all the variables in the dataset. Each coefficient is computed from all observations that are nonmissing for that pair of variables.

 pwcorr foreign mpg price weight
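The difference between the two commands only shows up when a variable has missing values. In auto.dta the variable rep78 does, so a sketch of the comparison might look like this (preserve and restore keep the data in memory untouched):

```stata
preserve
sysuse auto, clear
correlate price mpg rep78      // casewise: only observations complete on all three variables
pwcorr price mpg rep78, obs    // pairwise: obs prints the number of observations behind each coefficient
restore
```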

Graphing:

 histogram mpg, normal bin(10)
 kdensity mpg, normal
 graph twoway scatter mpg weight

Separate graphs for each group can be combined into one figure with the by() option:

 graph twoway scatter mpg weight, by(foreign)

In the following example "gen" is short for "generate":

 gen score = mpg    
 rename score grade
 recode grade 0/14=0 15/19=1 20/29=2 30/34=3 35/39=4 40/50=5
 label define abcdf 0 "F" 1 "D" 2 "C" 3 "B" 4 "A+"
 label variable grade "These are the grades for miles per gallon score."
 label values grade abcdf
 codebook grade

Let's label the dataset itself so that we will remember what the data are. We can also add some notes to the dataset using the notes command.

 label data "Domestic and Foreign Cars Data"
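A short sketch of the notes command, assuming auto.dta. The note texts are invented for the illustration, and preserve and restore keep the data in memory unchanged.

```stata
preserve
sysuse auto, clear
label data "Domestic and Foreign Cars Data"
* A note attached to the dataset as a whole:
notes: Example data shipped with Stata, used for the introductory exercises.
* A note attached to a single variable:
notes mpg: Miles per gallon.
* With no arguments, notes lists every note in the dataset:
notes
restore
```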

There is another way to create variables in Stata: the egen command, which offers special functions that generate does not. For the list of available functions, type help egen. An example of its use follows.

 egen wmean = mean(price), by(grade)
 list weight grade wmean
 save "auto1.dta", replace
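A few more egen functions, sketched on auto.dta. The new variable names are hypothetical, and preserve and restore leave the data in memory as they were.

```stata
preserve
sysuse auto, clear
egen price_z  = std(price)                // standardized price (mean 0, standard deviation 1)
egen mpg_rank = rank(mpg)                 // rank of mpg within the dataset
egen size_avg = rowmean(headroom trunk)   // mean across variables within each row
bysort foreign: egen mpg_max = max(mpg)   // group maximum, repeated on every row of the group
list make foreign mpg mpg_max in 1/5
restore
```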

This use of the keep command keeps variables:

 keep make mpg price weight foreign grade

In the following example "keep if !foreign" means "keep if foreign == 0", since the variable foreign takes only the values 1 and 0. This use of the keep command keeps observations (a.k.a. records or rows):

 keep if !foreign  
 save "autodom.dta", replace
 drop if grade==0

Analyzing Data

Commands used in this section:

Command    Description
--------   --------------------------------------------
ttest      T-test
regress    Regression
predict    Predict after model estimation
twoway     Two-way graphs
probit     Probit regression
dprobit    Probit regression reporting marginal effects

T-test:

 sysuse "auto.dta", clear

This is the one-sample t-test, testing whether the sample of gas mileage was drawn from a population with a mean of 20.

 ttest mpg = 20

This is the paired t-test, testing whether or not the mean of weight equals the mean of price .

 ttest weight = price

This is the two-sample independent t-test with pooled (equal) variances.

 ttest mpg, by(foreign)

This is the two-sample independent t-test with separate (unequal) variances.

 ttest mpg, by(foreign) unequal
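The numbers printed by ttest are also available as stored results, which is handy when you want to reuse them. A minimal sketch, assuming auto.dta:

```stata
sysuse auto, clear
ttest mpg, by(foreign)
return list                   // lists everything ttest stored
display "t = " r(t) ", df = " r(df_t) ", two-sided p = " r(p)
```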

Regression:

 regress mpg weight foreign
 regress mpg weight

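Regression with robust standard errors. This is very useful when the variance of the errors is not constant (heteroskedasticity). The robust option changes the standard errors but does not affect the estimates of the regression coefficients.

 regress mpg weight, robust

With no options, the predict command calculates the predicted (fitted) values from the most recently estimated model.

 predict p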

With the resid option, the predict command calculates the residuals.

 predict r, resid

The list command displays the values of the variables that we have generated. The in 1/20 qualifier restricts the listing to the first 20 observations.

 list mpg p r  in 1/20
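To see what predict is doing, the fitted values and residuals can be rebuilt by hand from the stored coefficients. This is only a sketch on auto.dta with hypothetical variable names; preserve and restore protect the variables already created in the session.

```stata
preserve
sysuse auto, clear
regress mpg weight
predict p_hat                                       // fitted values
predict r_hat, resid                                // residuals
generate p_byhand = _b[_cons] + _b[weight]*weight   // fitted values from the coefficients
generate r_byhand = mpg - p_byhand                  // residual = observed minus fitted
list mpg p_hat p_byhand r_hat r_byhand in 1/5
restore
```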

In order to demonstrate probit estimation we will create a dichotomous variable called goodmpg. This is purely for illustration!

 gen goodmpg = (mpg == 30) 

In the above example, missing values of the variable mpg are not expected to exist. If mpg did contain missing values, goodmpg would equal 0 rather than missing for those observations. So the variable is more safely created as follows (the first version of goodmpg has to be dropped before it can be created again):

 drop goodmpg
 gen goodmpg = (mpg == 30) if mpg < .
 probit goodmpg weight
 dprobit goodmpg weight
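In Stata 11 and later, marginal effects after probit are usually obtained with the margins command rather than dprobit. The following is a sketch under that assumption, reusing the goodmpg definition from above; preserve and restore keep the session's data intact.

```stata
preserve
sysuse auto, clear
generate goodmpg = (mpg == 30) if mpg < .
probit goodmpg weight
margins, dydx(weight) atmeans   // marginal effect of weight, evaluated at the sample means
restore
```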

Graphing:

 twoway scatter mpg weight || line p weight

An easier way to do this without explicitly running a regression:

 twoway scatter mpg weight || lfit mpg weight
 twoway (qfitci mpg weight, stdf) (sc mpg weight), by(foreign)

The kdensity command with the normal option displays a density graph of the residuals with a normal distribution superimposed on the graph. This is particularly useful for checking whether the residuals are approximately normally distributed, which is an assumption behind the usual inference in regression.

 kdensity r, normal

The Next Level in Programming in Stata:

Do-files:

Rather than typing commands at the keyboard, you can create a text file containing commands and instruct Stata to execute the commands stored in that file. Such files are called do-files, since the command that causes them to be executed is do and the preferred file extension is ".do".

  • A do-file is a standard ASCII text file.
  • A do-file is executed by Stata when you type do filename .

You can use any text editor to create do-files, or you can use the built-in do-file editor by typing doedit, or by clicking on the do-file editor icon on the menu bar at the top.

 do "dofile.do"

The following illustrates some features of do-files. Stata reads each line as a separate command because, by default, the carriage return at the end of a line (invisible to our eyes) marks the end of a command. The #delimit command can change the command delimiter from a carriage return to a semicolon and back again.
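A minimal sketch of #delimit, assuming auto.dta. It is meant to be run from a do-file, since #delimit is a do-file feature.

```stata
* By default every line is one command.
sysuse auto, clear
#delimit ;
summarize price mpg
          weight length ;
describe make ;
#delimit cr
* Back to one command per line.
count
```

While the semicolon delimiter is in effect, the summarize command above spans two lines and ends only at the semicolon.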

Copy and paste the following into the do-file editor window and "do" it, i.e., click the icon that looks like a sheet of paper with lines on it.

 // This is a comment.
 /* This is also a comment. */
  * This is a comment as well.
 version 10  // This tells Stata the version under which this do-file was written.
             // This ensures that it will work in all future versions of Stata.
 set more off  // Now Stata does not pause every time the screen is full.
 /* unless you specify an explicit directory path, 
  *  myjob.log is saved in the current directory */ 
 log using "myjob.log", replace 
 // Stata can even use datasets that are stored on the web:
 use "http://www.stata-press.com/data/r8/census", clear
 tab region
 tab region, nolabel
 summarize if region==3
 log close
 // Stata has internal macro variables.  Use the creturn list
 // command to see all that are available.
 display "Today's date is: `c(current_date)'"
 // Using three forward slashes in a row tells Stata that the command continues to the next line:
 local logfilename1 = upper(word("`c(current_date)'",1) + word("`c(current_date)'",2)  ///
                            + word("`c(current_date)'",3))
 log using "`logfilename1'.log", replace
 log close

Macros:

Macros are named pieces of text that do not exist in the data, so they do not change from record to record the way a data variable does. To evaluate a local macro, surround its name with a left quote and a right quote. The left quote ( ` ) is at the upper left of most keyboards, on the same key as the tilde ( ~ ). The right quote is the ordinary single quote ( ' ), on the same key as the double quote ( " ). Local macros are called "local" because they have a value only inside the do-file or program in which they were created. A global macro keeps its value throughout your Stata session no matter where it was created. To evaluate a global macro, put a dollar sign ( $ ) in front of its name.

For the most part, consider macros as a way of pasting in text to your Stata code so that you can reduce how much typing you need to do to get Stata to do what you want. Learn macros when you get to a point where you see patterns in your code and you would like to come up with a more efficient way of writing your code.
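Before the longer example, here is a minimal sketch of defining and evaluating macros. The macro names greeting, n, and project are made up for the illustration. Run the lines together from a do-file, because a local macro disappears when the do-file that created it ends.

```stata
local greeting "hello"         // define a local macro
display "`greeting', world"    // `greeting' is replaced by the text hello
local n = 2 + 3                // with an equal sign, the expression is evaluated first
display `n'
global project "intro"         // define a global macro
display "$project"
macro list                     // show the macros that are currently defined
```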

 clear
 set seed 1234  
 set obs 30   // this creates a blank dataset of 30 observations
 foreach var of newlist var1-var10 {  //  "newlist" checks to make sure the variables in the varlist do not already exist in your data.
           gen `var' = int(uniform()*10)
 }  // this is the end of the foreach loop
 list var1-var10
 local set1 "var2 var3 var6"
 global set2 "var4 var5 var8 var10"
 regress var1 `set1'
 regress var1 $set2
 forvalues x = 1/10 {
           if mod(`x',2) {
                display "`x' is odd"
                continue
           }
           display "`x' is even"
 }
 foreach var of newlist z1-z20 {
           gen `var' = uniform()
 }
 foreach num of numlist 1(1)4 6(2)13 {
           if mod(`num',2) {
                display "`num' is odd"
                continue
           }
           display "`num' is even"
 }
 clear
 set obs 100

Generate 10 uniform random variables named x1, x2, ..., x10.

 set seed 12345
 forvalues i = 1(1)10 { // equivalently 1/10, or 1 2 to 10
           generate x`i' = uniform()
           quietly count if x`i' > .1
           display " % of x`i' > 1/10 = " round(100*r(N)/_N,.01)
           gen x`i'ltdec = x`i' < .1
 }
 sum x*dec

Try the same with 1,000 observations, and with 1,000,000. We can do exactly the same thing by using a while loop:

 clear
 set obs 1000
 set seed 12345
 local i = 1
 while `i' < 11 {
        generate x`i' = uniform()
        quietly count if x`i' > .1
        display " % of x`i' > 1/10 = " round(100*r(N)/_N,.01)
        gen x`i'ltdec = x`i' < .1
        local i = `i' + 1   // you need to increment the local macro i, otherwise the loop is endless.
 }
 sum x*dec

Ado-files:

Consider ado-files to be "automatic do-files" which contain canned Stata code that you can call from other do-files or use interactively. A lot of Stata itself is written in ado-files. So, if you do the same thing often, consider putting that Stata code in an ado-file.

Stata looks for ado-files in seven places, which fall into two groups:

I. The official ado-directories

(UPDATES)    the official updates directory
(BASE)       the official base directory

II. Your personal ado-directories

(SITE)       the directory for ado-files your system administrator might have installed
(.)          the ado-files you have written just this instant or for just this project
(PERSONAL)   the directory for ado-files you personally might have written
(PLUS)       the directory for ado-files you personally might have installed
(OLDPLACE)   the directory where Stata users used to save their personally written ado-files

 . adopath
   [1]  (UPDATES)   "C:\Program Files\Stata-9\ado\updates/"
   [2]  (BASE)      "C:\Program Files\Stata-9\ado\base/"
   [3]  (SITE)      "C:\Program Files\Stata-9\ado\site/"
   [4]              "."
   [5]  (PERSONAL)  "c:\ado\personal/"
   [6]  (PLUS)      "c:\ado\plus/"
   [7]  (OLDPLACE)  "c:\ado/"

Here is a small example of a program that can be stored in an ado-file:

 program rangeours // arguments are n a b
         drop _all
         args n a b
         set obs `n'
         gen x = (_n-1)/(_N-1)*(`b'-`a') + `a'
 end

Copy and paste the program above (the lines from program through end) into a text editor and then save it as a plain text file named rangeours.ado in the current directory.

Now, type:

 rangeours 100 1 2
 list
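While developing a program it can be convenient to define it inside a do-file and test it there before saving it as an ado-file. The sketch below assumes that workflow; note that the program drops whatever data are in memory.

```stata
capture program drop rangeours   // remove an earlier definition from memory, if any
program rangeours                // arguments are n a b
        args n a b
        drop _all
        set obs `n'
        generate x = (_n-1)/(_N-1)*(`b'-`a') + `a'
end

rangeours 5 0 1                  // 5 evenly spaced values from 0 to 1
list
assert x[1] == 0 & x[5] == 1     // the endpoints equal a and b
```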

Additional help

More on Stata

Research Computing home page


University of North Carolina - Chapel Hill