Stata Command Syntax - Duke University Libraries

Stata – Commonly Used Commands and Useful Information
Stata Files
.dta files – Stata data files. Any time Stata saves data, it saves as a Stata data file.
.do files – Do files store Stata commands. These commands are the same as those typed into the
Command window.
.smcl and .log files – These are log files that store what appears in the output window. Start a log
before running most of your commands.
Stata Command Syntax
Stata commands, with few exceptions, follow this template. Bracketed items are optional. Bolded items
are most common.
[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [,options]
- All commands must contain command, which is the name of a Stata command. For example, regress
  runs an OLS regression.
- A prefix may precede the command and is followed by a colon. Common prefixes are discussed
  below and include by, bysort, xi, and quietly.
- A varlist is a list of one or more variables. Some commands allow only a single variable. In
  many cases, the order of the variables is important. In estimation commands, the dependent
  variable precedes the independent variables.
- The item =exp is an algebraic expression. These are typically found with the generate and
  replace commands.
- The if exp component evaluates to true or false for each observation. The command is applied
  only to observations where it is true.
- The in range component denotes an observation range (for example, in 1/100 for the first 100
  observations).
- weight denotes a weighting expression if one is needed.
- Most commands have one or more options, which follow a single comma. These allow for
  additional information or for non-default operation of the command and can be found in the
  help documentation. ALWAYS CHECK THE OPTIONS BY TYPING help command.
Example
by gender: regress income educ paeduc if marital==1, vce(robust)
In English: for each gender, run a separate regression. Income is the dependent variable; education
and father’s education are the independent variables. Only observations where marital equals 1 are
used. Use robust standard errors. The by prefix requires the data to be sorted by gender; bysort
sorts first.
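The template can be tried on data that everyone has. The sketch below assumes the auto example dataset that ships with Stata, so the variable names (price, mpg, weight, rep78, foreign) come from that dataset and not from this handout. It is one illustration of the syntax, not the only way to write it.

```stata
* Load the example dataset shipped with Stata (assumed to be installed)
sysuse auto, clear

* prefix:   bysort foreign:    sort, then repeat the command for each group
* command:  regress            OLS regression
* varlist:  price mpg weight   dependent variable first, then regressors
* if exp:   rep78 >= 3 & !missing(rep78)
*           missing values count as larger than any number, so exclude them
* options:  vce(robust)        robust standard errors
bysort foreign: regress price mpg weight if rep78 >= 3 & !missing(rep78), vce(robust)

* in range: restrict a command to observations 1 through 10
list make price mpg in 1/10
```

The explicit !missing() check matters because a condition such as rep78 >= 3 is also true for missing values.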
Stata Operators
Arithmetic                   Logical      Relational
+   Addition                 &   And      >    Greater Than
-   Subtraction              |   Or       <    Less Than
*   Multiplication           !   Not      >=   Greater Than or Equal
/   Division                 ~   Not      <=   Less Than or Equal
^   Power                                 ==   Equal
-   Negation                              !=   Not Equal
+   String Concatenation                  ~=   Not Equal
The order of evaluation (from first to last) of all operators is ! (or ~), ^, - (negation), /, *,
- (subtraction), +, != (or ~=), >, <, <=, >=, ==, &, and |.
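A few lines are enough to see the operators and their order of evaluation at work. This sketch again assumes the auto example dataset shipped with Stata; the thresholds are arbitrary illustrations.

```stata
sysuse auto, clear

* Arithmetic: ^ is evaluated before *, which is evaluated before +
* so this is 2 + (3 * 16)
display 2 + 3 * 4^2

* Negation is evaluated after ^, so this is -(2^2)
display -2^2

* + also concatenates strings
display "Stata" + " " + "syntax"

* Relational and logical operators inside an if expression.
* & is evaluated before |, so parentheses make the intent explicit.
count if price > 5000 & (foreign == 1 | mpg > 25)

* != and ~= are interchangeable
count if rep78 != 3
count if rep78 ~= 3
```

When in doubt about precedence, adding parentheses is safer than relying on the evaluation order.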
Time Series Operators
Operator  Meaning                            Definition
Lag / Lead
L.var     Lag var one period                 var_{t-1}
L2.var    Lag var 2 periods                  var_{t-2}
…
F.var     Lead var one period                var_{t+1}
F2.var    Lead var 2 periods                 var_{t+2}
…
Differences
D.var     Difference                         var_t - var_{t-1}
D2.var    Difference of difference           (var_t - var_{t-1}) - (var_{t-1} - var_{t-2})
…
S.var     Seasonal difference, one period    var_t - var_{t-1}
S2.var    Seasonal difference, two periods   var_t - var_{t-2}
…
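Time series operators only work after the data have been declared as time series with tsset. The following sketch builds a tiny made-up series (the variable names t and y are hypothetical) so the lag, lead, and difference values can be checked by hand.

```stata
* Build a five-period toy series: y = 1, 4, 9, 16, 25
clear
set obs 5
generate t = _n
generate y = t^2

* Declare t as the time variable before using L., F., D., or S.
tsset t

* Each new variable is missing where the needed period does not exist
generate lag1  = L.y
generate lead1 = F.y
generate d1    = D.y
generate d2    = D2.y
generate s2    = S2.y
list
```

With panel data, tsset takes the panel identifier first and the time variable second, and the operators then respect panel boundaries.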
File Access and Set-Up Commands
Working directory
- The working directory is the folder where Stata looks to find data and save data and log files.
Setting a working directory means that only the file name, not the full path, will be needed.
Command – Description
pwd – Print the current working directory to the output window.
cd – Change the working directory.
    Example: cd "C:\Users\Ryan\myproject\"
dir – Display the contents of the working directory.
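A short sketch of moving between directories follows. It assumes nothing about your folders: it uses the system values c(pwd) and c(tmpdir) as stand-ins for real project paths, and the local macro name start is hypothetical.

```stata
* Remember the starting point; c(pwd) holds the current working directory
local start "`c(pwd)'"

* Move to the temporary directory used by Stata, purely as an example target
cd "`c(tmpdir)'"
pwd

* Return to the original working directory
cd "`start'"
pwd
```

In a real project the cd line would usually appear once, at the top of the do file.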
Opening and Importing Data
Stata stores data in a proprietary format, Stata dta files. These can be directly opened with the use
command, and all saved data is saved in this form with the save command. Stata also has the ability to
import data from a variety of formats, including Excel and comma separated value text files. The import
command is used to do this.
Command – Description
use – Open a Stata dataset.
    Example: use mydata.dta, clear
save – Save as a Stata dataset.
    Example: save mydatacopy.dta, replace
clear – Clear all data in memory.
    Example: clear
import delimited – Import a delimited text file (Note: options exist to define the delimiter).
    Example: import delimited mydata.csv, clear
import excel – Import an Excel sheet (Note: use the sheet() option to name the sheet; the first
    sheet is read by default).
    Example: import excel "mydata.xlsx", sheet("Sheet1") firstrow clear
infix – Import a fixed width text file (Note: you may specify the fields, data types, and lengths
    here or in a dictionary file).
    Example: infix id 1-2 str name 3-6 education 7-8 income 9-14 using "mydata.txt"
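The round trip between Stata and text formats can be sketched as follows. The example assumes the auto dataset shipped with Stata and writes two files with hypothetical names (auto_copy.dta and auto_copy.csv) into the working directory, then removes them. The export delimited command is not listed in the table above; it is the counterpart of import delimited.

```stata
* Start from the example dataset shipped with Stata
sysuse auto, clear

* Save a copy as a Stata dataset in the working directory
save auto_copy.dta, replace

* Write the same data as comma-separated text, then read it back
export delimited using auto_copy.csv, replace
import delimited auto_copy.csv, clear
describe

* Reopen the Stata dataset; the clear option discards the imported data
use auto_copy.dta, clear

* Remove the two example files again
erase auto_copy.dta
erase auto_copy.csv
```

Comparing the describe output with the original shows what is lost in text formats: variable labels and value labels do not survive the trip through a .csv file.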
Other Set-Up Commands Prior to Analysis
Command – Description
Log files and do files
log using – Starts a new log.
    Example: log using mylog.log, replace
log close – Closes an open log.
doedit – Opens a do file editor.
Set-Up (Defaults)
set maxvar – Change default maximum number of variables in dataset.
    Example: set maxvar 5000
set level – Change default level for confidence intervals.
    Example: set level 90
set logtype – Change default format for log files.
    Example: set logtype text
set maxiter – Change default number of iterations for maximum likelihood. Default is 16000.
    Example: set maxiter 8000
set matsize – Change default maximum number of variables in a model. Default is 400.
    Example: set matsize 800
set more off – Change default behavior of more prompt for multiple output screens. Default is on.
    Example: set more off
Data
browse – Opens the data browser.
edit – Opens the data editor (you should NEVER do this).
Help
search – Search for a concept or term.
    Example: search postestimation
help – Get help on a specific command.
    Example: help reg
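A typical do file opens with a few of these set-up commands. The skeleton below is one possible arrangement, not a required one; mylog.log is a hypothetical file name and the auto dataset shipped with Stata stands in for real data.

```stata
* Close any log left open by an earlier run
* (capture ignores the error if no log is open)
capture log close

* Do not pause for the more prompt
set more off

* Start a plain-text log, overwriting an existing file of the same name
log using mylog.log, replace text

sysuse auto, clear

* Report 90% confidence intervals for the commands that follow
set level 90
summarize price
regress price mpg

* Return to 95% intervals and stop logging
set level 95
log close
```

Everything between log using and log close is written to the log file, which makes the run reproducible and easy to share.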
Basic Information, Descriptive Statistics, and Plots
This section covers basic information regarding the data types used by your data, size of memory
allocated for data storage, variable and value labels, and descriptive statistics and plots.
Command – Description
Basic Information
describe – Lists variables, data type, size, any variable labels, and any value label sets.
    Examples: describe
              describe var1 var2 var3
codebook – Lists variable name and label, type, value labels, number of missing, and tabulation
    if few values are present.
    Examples: codebook
              codebook var1 var2
list – List variables for some or all observations.
    Example: list var1 var2 in 1/20
Descriptive Statistics
summarize – Summary statistics for one or more variables.
    Examples: summarize
              summarize var1, detail
tabulate – One- and two-way tabulations.
    Examples: tab var1 var2
              tab var3, missing
tabstat – Table of summary statistics for one or more variables.
    Example: tabstat var1, statistics(mean median min max count)
correlate – Correlation matrix of two or more variables.
    Example: correlate
misstable – Table of present and missing value counts.
    Example: misstable summarize var1 var2
ttest – T tests.
    Examples: ttest var1==0 (one sample)
              ttest var1==var2 (paired)
              ttest var1, by(group) (two-sample, grouped)
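The commands in this section can be run end to end on the auto example dataset that ships with Stata (an assumption; substitute your own variables). The test value of 20 in the one-sample t test is arbitrary.

```stata
sysuse auto, clear

* Basic information
describe price mpg rep78 foreign
codebook rep78
list make price mpg in 1/5

* Descriptive statistics
summarize price, detail
tabulate foreign rep78, missing
tabstat price mpg, statistics(mean median min max count) by(foreign)
correlate price mpg weight
misstable summarize rep78

* One-sample test that mean mpg equals 20
ttest mpg == 20

* Two-sample test of price across groups; by() is an option, so it follows a comma
ttest price, by(foreign)
```

Most of these commands also leave their results in r(), which can be inspected with return list.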
Generalized Linear Models and Post-Estimation
There are dozens of commands that each specify a particular model. The table below lists a few of the
common models. The table is principally focused on post-estimation commands, which allow for
assessment of the model once run.
Post-estimation commands use the stored estimates and model information from the most recently run
model, so run them before fitting another model (or store the estimates and restore them later). The
post-estimation commands listed here apply to the regress command unless noted otherwise.
Models may store different pieces of information, and post-estimation commands will necessarily vary.
Command – Description
OLS Regression and Generalized Linear Models
regress – OLS regression.
    Example: regress var1 var2 var3
logit – Logit model.
probit – Probit model.
poisson – Poisson model.
mlogit – Multinomial logistic regression.
Post-estimation Commands
estimates store – Stores model estimates for later use and recall. Use immediately after model.
    Example: regress var1 var2 var3
             estimates store my_results
estimates replay – Replays model estimates.
    Example: estimates replay my_results
estimates restore – Makes the specified stored estimates active.
    Example: estimates restore my_results
estat Examples
estat contains many of the post-estimation statistics. The command requires a second word to
specify the particular statistic.
estat ovtest – Ramsey RESET test for omitted variables.
    Example: estat ovtest
estat gof – Goodness of fit chi-square test for model form (available after models such as logit
    and poisson, not after regress).
    Example: estat gof
estat vif – Variance inflation factor test.
    Example: estat vif
estat hettest – Test for heteroskedasticity.
    Example: estat hettest
estat ic – Akaike’s and Schwarz’s Bayesian Information Criteria (AIC and BIC).
    Example: estat ic
estat vce – Variance-covariance matrix.
    Example: estat vce
Other Post-estimation Commands
predict – Predict values as new variable (fitted values, residuals, etc.). Fitted values is default.
    Examples: predict myfit
              predict r, resid
margins – Marginal means and effects.
Post-Estimation Plots
rvfplot – Residuals versus fitted plot.
    Example: rvfplot
avplot – Added variable plot (requires the name of one independent variable).
    Example: avplot var2
lvr2plot – Leverage versus squared residual plot.
    Example: lvr2plot
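The workflow of fitting, storing, and revisiting a model can be sketched on the auto example dataset shipped with Stata (an assumption). The stored names ols_full and ols_small and the new variables fit and res are hypothetical.

```stata
sysuse auto, clear

* Fit a model and store its estimates right away
regress price mpg weight
estimates store ols_full

* Post-estimation statistics for the model just fitted
estat ovtest
estat hettest
estat vif
estat ic

* Fitted values and residuals as new variables (double for precision)
predict double fit, xb
predict double res, residuals

* Fitting a second model replaces the active estimates
regress price mpg
estimates store ols_small

* Bring the first model back and redisplay it
estimates restore ols_full
estimates replay ols_full
```

After estimates restore, post-estimation commands again refer to the first model, which is why storing estimates immediately after fitting is a useful habit.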
Data Manipulation
Data manipulation can involve anything from the generation of a new variable to complex
transformation of your data (aggregation, transposition, or preparation for time series analysis).
Command – Description
Data Browser
browse – Browse a view of the data.
    Example: browse
edit – Like browse, except the data can be changed by hand; you don’t want to use it.
    Example: edit
Variable Creation, Population, and Deletion
generate – Generate a new variable.
    Example: gen mynewvar=.
egen – Extensions to generate. Use if gen does not work. This is often used when you need to
    generate a variable based on all observations.
    Examples: egen myvar = max(var)
              egen myvar = count(var)
replace – Replace the values of a variable with a constant or expression.
    Examples: replace mynewvar=1
              replace mynewvar=2 if var2==2
recode – Recode one or more variables.
    Example: recode myvar myvar2 (1 2 = 1) (3 = 2) (4 5 = 3), gen(mynewvar mynewvar2)
drop – Drop variables or observations.
    Examples: drop var1 var2
              drop if var3==.
keep – Inverse of drop. This will only keep what is specified.
    Example: keep if var3!=.
Variable Names and Labels
rename – Change the name of a variable. This is the name, not the longer label that is often
    attached to a variable. Old name goes first, new name second.
    Example: rename oldvar1 newvar1
label variable – Change the label attached to a variable.
    Example: label variable newvar1 "Marital Status at Age 30"
label define – Defines a value label set. A label set can be attached to the values contained
    within one or more variables by the label values command.
    Example: label define lab_mar 1 "Married" 2 "Divorced" 3 "Separated" 4 "Widowed" 5 "Never married"
label values – Attaches a label set (last object listed in command) to the values in one or more
    variables.
    Example: label values newvar1 lab_mar
encode – Creates numeric version of a string variable. For example, turns male/female to 1 or 2.
    Example: encode var1, gen(newvar1)
decode – Creates string version of a numeric variable. Note: the variable needs attached value
    labels for this to work.
    Example: decode var1, gen(stringvar)
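The variable commands are easiest to follow as one sequence. This sketch assumes the auto example dataset shipped with Stata; the new variable names and the label text ("Poor", "Average", "Good") are hypothetical choices made for the illustration.

```stata
sysuse auto, clear

* generate an empty variable, then fill it with replace
generate mpg_group = .
replace mpg_group = 1 if mpg < 20
replace mpg_group = 2 if mpg >= 20 & !missing(mpg)

* egen computes a statistic over all observations
egen max_price = max(price)

* recode into a new variable, leaving rep78 unchanged
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), generate(rep3)

* Label the new variable, define a value label set, and attach it
label variable rep3 "Repair record, three categories"
label define lab_rep3 1 "Poor" 2 "Average" 3 "Good"
label values rep3 lab_rep3

* decode builds a string from the value labels attached to foreign
decode foreign, generate(foreign_str)
* encode goes the other way, from string to labeled numeric
encode foreign_str, generate(foreign_num)

* Old name first, new name second
rename max_price price_max

* Drop observations with a missing repair record, then keep selected variables
drop if rep78 == .
keep make price price_max mpg_group rep3 foreign foreign_str foreign_num
```

Recoding into a new variable with generate() keeps the original intact, so a mistake in the recode rules can be fixed without reloading the data.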
Command – Description
Advanced Data manipulation
preserve – Saves a temporary copy of the data currently in memory so that it can be brought back
    with restore.
    Example: preserve
restore – Brings back the data saved by preserve, discarding any changes made since.
    Example: restore
merge – Merge saved dataset to currently loaded dataset. One or more key variables must be shared.
    This adds variables.
    Example: merge 1:1 id_variable using data2.dta
append – Append saved dataset to currently loaded. This adds observations.
    Example: append using data2.dta
collapse – Collapses data into larger unit of analysis. For example, counties may be collapsed
    into fewer states. Summary statistics for data fields must be specified.
    Example: collapse (median) var1 var2 var3, by(state)
reshape – Transforms data from wide to long format and vice versa. For example, if you have time
    series data but want each time point to be a separate column, this is moving from long to
    wide data.
    Example: reshape wide var1, i(state) j(year)
cross – Form every pairwise combination of saved dataset and the currently loaded dataset. If each
    dataset has 10 records, the output will contain 100 records.
    Example: cross using data2.dta
compress – Changes variable types to smallest data type that will retain all information. Makes
    strings smaller and number types smaller to save space.
    Example: compress
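The following sketch strings several of these commands together on a tiny made-up dataset, so every result can be checked by eye. All variable names, values, and the temporary file name extra are hypothetical. The tempfile command creates a file that Stata deletes when the do file ends, so run the block as a do file.

```stata
* A four-row example dataset typed in directly
clear
input id str2 state year income
1 "NC" 2000 10
2 "NC" 2001 20
3 "VA" 2000 30
4 "VA" 2001 50
end

* Build a second dataset with one extra variable per id and save it
* to a temporary file; preserve and restore protect the main data
preserve
keep id
generate region = "South"
tempfile extra
save `extra'
restore

* merge adds variables; _merge records where each observation came from
merge 1:1 id using `extra'
drop _merge

* collapse to one row per state, look at it, then return to the merged data
preserve
collapse (median) income, by(state)
list
restore

* reshape long to wide: one row per state, one income column per year
drop id
reshape wide income, i(state) j(year)
compress
list
```

Wrapping collapse in preserve and restore is a common pattern when an aggregated view is needed only briefly.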
Graphs
The easiest way to make graphs is to use the clickable interface. Most commands use the graph
command, followed by the type, variables, and options. The options allow for manipulation of almost all
aspects of the graph.
Command – Description
graph box – Box plot
graph bar – Bar plot
graph pie – Pie chart
graph twoway line – Twoway line chart
graph twoway scatter – Twoway scatter chart
graph twoway area – Twoway area chart
histogram – Histogram of a variable
gladder – Ladder of powers plot (one variable). This presents histograms of mathematical
    transformations of the specified variable.
quantile – Quantile plot
kdensity – Kernel density plot
Graphs can also be overlaid, and options control what is shown. The following command combines a
scatterplot of some data with a linear fit between the two variables (and its confidence interval).
Each graph is enclosed in its own set of parentheses.
graph twoway (lfitci d_emp_man d_emp_all) (scatter d_emp_man d_emp_all)
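The same overlay can be reproduced with the auto example dataset shipped with Stata (an assumption; the graph names and the title are hypothetical). The /// line continuation works in do files only, so run this as a do file or put the command on one line.

```stata
sysuse auto, clear

* Single-variable plots, each kept in memory under its own name
histogram mpg, name(hist_mpg, replace)
graph box price, over(foreign) name(box_price, replace)

* Overlay: fitted line with confidence band first, points drawn on top
graph twoway (lfitci price weight) (scatter price weight), ///
    name(fit_overlay, replace) title("Price versus weight")
```

Plots are drawn in the order listed, so placing the scatter last keeps the points visible on top of the confidence band.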
Prefixes
The following list is not comprehensive and only includes some of the more common prefixes that can
be used in Stata. Note that not all Stata commands allow for the use of prefixes. For example, egen is a
command that allows for limited use of the by prefix. Refer to the help for your specific command to
see if prefixes are allowed.
Prefix – Description
by – Run a command for each group identified in the by variable.
    Example: by gender: reg var1 var2 var3
bysort – Same as by, but sorts by the by variable first. Use it when the data are not already
    sorted.
    Example: bysort gender: tab var1
quietly / noisily – Suppress / force display of output.
    Example: quietly: reg var1 var2 var3
             estimates store model1
svy – Survey prefix command. Use this when using survey data. The svyset command must first be
    used to set the data as survey data.
    Example: svy: mean var1
xi – Expand interaction terms. Also used for categorical data. This example uses i. to note the
    single categorical variable.
    Example: xi: reg depvar i.marriage age
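A short sketch of the prefixes in use, again assuming the auto example dataset shipped with Stata; mean_price and model1 are hypothetical names. The svy prefix is left out because it needs survey design information that this dataset does not have.

```stata
sysuse auto, clear

* bysort sorts by foreign, then runs the command once per group
bysort foreign: summarize price

* by with egen: the group mean is attached to every observation
* (the data are already sorted by foreign at this point)
by foreign: egen mean_price = mean(price)

* quietly suppresses the output; the results are still stored
quietly regress price mpg weight
estimates store model1

* xi expands the categorical variable rep78 into indicator variables
xi: regress price i.rep78 mpg
```

In Stata 11 and later, most estimation commands accept i. variables directly, so the last line also runs without the xi prefix.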
References and Resources
Data and Visualization Services – http://library.duke.edu/data/
UCLA IDRE - http://www.ats.ucla.edu/stat/stata/
This workshop - http://library.duke.edu/data/sites/default/files/datagis/store/stata_0.zip