Separation of Concerns Part 1: Separating Analysis from Presentation¶
Review of Introductory Example¶
In the first part of this two-part lesson, we revisit the Introductory Example of the main paper to implement the principle of separation of concerns. That example proceeded in two steps. First, in dataprep.do
, we prepared our dataset auto-modified.dta
for analysis. Then, in analysis.do
, we ran our regressions and produced tables and figures. We reproduce the do-files, SConstruct, and workflow below.
// dataprep.do
version 16.1
use "inputs/auto-original.dta", clear
generate mpg_sqd = mpg^2
label variable mpg_sqd "Mileage (mpg) squared"
save "outputs/auto-modified.dta", replace
// analysis.do
version 16.1
use "outputs/auto-modified.dta", clear
twoway scatter price mpg, title("Price vs. MPG")
graph export "outputs/scatterplot.pdf", replace
regress price mpg
eststo linear
regress price mpg mpg_sqd
eststo quadratic
esttab linear quadratic using "outputs/regressionTable.tex", ///
se r2 replace
# SConstruct-introExample
# **** Setup from pystatacons package *****
import pystatacons
env = pystatacons.init_env()
# use sconsign specific to this exercise
SConsignFile(".sconsignSeparation")
# **** Substance begins *****
# analysis
cmd_analysis = env.StataBuild(
target = ['outputs/scatterplot.pdf',
'outputs/regressionTable.tex'],
source = 'code/analysis.do',
depends = ['outputs/auto-modified.dta'])
# dataprep
cmd_dataprep = env.StataBuild(
target = ['outputs/auto-modified.dta'],
source = 'code/dataprep.do',
depends = ['inputs/auto-original.dta'])
. statacons, file(SConstruct-introExample) debug(explain)
scons: Reading SConscript files ...
Using 'LabelsFormatsOnly' custom_datasignature.
Calculates timestamp-independent checksum of dataset,
including variable formats, variable labels and value labels.
Edit use_custom_datasignature in config_project.ini to change.
(other options are Strict, DataOnly, False)
scons: done reading SConscript files.
scons: Building targets ...
scons: building `outputs\auto-modified.dta' because it doesn't exist
stata_run(["outputs\auto-modified.dta"], ["code\dataprep.do"])
Running: "C:\Program Files\Stata16\StataMP-64.exe" /e do "code\dataprep.do".
Starting in hidden desktop (pid=26836).
scons: building `outputs\scatterplot.pdf' because it doesn't exist
stata_run(["outputs\scatterplot.pdf", "outputs\regressionTable.tex"], ["code\an
> alysis.do"])
Running: "C:\Program Files\Stata16\StataMP-64.exe" /e do "code\analysis.do".
Starting in hidden desktop (pid=28676).
scons: done building targets.
Separation of Concerns: The Problem¶
The issue with this workflow can be seen in the following example. Let's suppose that the regressions in analysis.do
take a very long time to run. Furthermore, let's suppose that we wanted to improve the formatting of our graph, scatterplot.pdf
, so that the y-axis labels are horizontal. We edit analysis.do
to add ylabels(, angle(horizontal))
to the command creating scatterplot.gph
:
// analysis.do
version 16.1
use "outputs/auto-modified.dta", clear
twoway scatter price mpg, title("Price vs. MPG") ///
ylabels(, angle(horizontal))
graph export "outputs/scatterplot.pdf", replace
regress price mpg
eststo linear
regress price mpg mpg_sqd
eststo quadratic
esttab linear quadratic using "outputs/regressionTable.tex", ///
se r2 replace
The problem will now become apparent if we ask statacons
for the status of the project:
. statacons, file(SConstruct-introExample) dry_run debug(explain)
scons: Reading SConscript files ...
Using 'LabelsFormatsOnly' custom_datasignature.
Calculates timestamp-independent checksum of dataset,
including variable formats, variable labels and value labels.
Edit use_custom_datasignature in config_project.ini to change.
(other options are Strict, DataOnly, False)
scons: done reading SConscript files.
scons: Building targets ...
scons: rebuilding `outputs\scatterplot.pdf' because `code\analysis.do' changed
stata_run(["outputs\scatterplot.pdf", "outputs\regressionTable.tex"], ["code\an
> alysis.do"])
scons: done building targets.
As expected, statacons
tells us that it will need to re-run analysis.do
. This is wasteful -- we should not re-run all these regressions just to change the formatting of tables or figures.
Separating Analysis from Presentation¶
The principle of separation of concerns suggests that we should split analysis.do
: regressions.do
should handle regressions; tabfig.do
should take the regression results and produce tables and figures. To implement this, we need regressions.do
to save regression results in regressions.sters
, which tabfig.do
can then use as an input.
Here are the new do-files, regressions.do
and tabfig.do
:
// regressions.do
version 16.1
use "outputs/auto-modified.dta", clear
// Linear regression
regress price mpg
eststo linear
// Quadratic regression
regress price mpg mpg_sqd
eststo quadratic
// save linear and quadratic regression results in .sters file
estwrite linear quadratic using "outputs/regressions.sters", ///
reproducible replace
exit
// tabfig.do
use "outputs/auto-modified.dta", clear
// read previous regression results from saved .sters file.
estimates clear
estread using "outputs/regressions.sters"
// create scatter plot
twoway scatter price mpg, title("Price vs. MPG") ///
ylabels(, angle(horizontal))
graph export "outputs/scatterplot.pdf", replace
// create *.tex file
esttab linear quadratic using "outputs/regressionTable.tex", ///
se r2 label replace
We create a new SConstruct, SConstruct-separation
, with this new workflow.
# SConstruct-separation
# **** Setup from pystatacons package *****
import pystatacons
env = pystatacons.init_env()
# use sconsign specific to this exercise
SConsignFile(".sconsignSeparation")
# separate analysis from tables and figures
# **** Substance begins *****
# tables and figures
cmd_tabfig = env.StataBuild(
target = ['outputs/scatterplot.pdf',
'outputs/regressionTable.tex'],
source = 'code/tabfig.do',
depends = ['outputs/auto-modified.dta',
'outputs/regressions.sters']
)
# regressions
cmd_regressions = env.StataBuild(
target = ['outputs/regressions.sters'],
source = 'code/regressions.do',
depends = ['outputs/auto-modified.dta']
)
# dataprep
cmd_dataprep = env.StataBuild(
target = ['outputs/auto-modified.dta'],
source = 'code/dataprep.do',
depends = ['inputs/auto-original.dta']
)
This workflow is illustrated in the following figure:
We build the project from scratch:
. statacons, file(SConstruct-separation) clean
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Cleaning targets ...
Removed outputs\auto-modified.dta
Removed outputs\regressions.sters
Removed outputs\scatterplot.pdf
Removed outputs\regressionTable.tex
scons: done cleaning targets.
. statacons, file(SConstruct-separation) debug(explain)
scons: Reading SConscript files ...
Using 'LabelsFormatsOnly' custom_datasignature.
Calculates timestamp-independent checksum of dataset,
including variable formats, variable labels and value labels.
Edit use_custom_datasignature in config_project.ini to change.
(other options are Strict, DataOnly, False)
scons: done reading SConscript files.
scons: Building targets ...
scons: building `outputs\auto-modified.dta' because it doesn't exist
stata_run(["outputs\auto-modified.dta"], ["code\dataprep.do"])
Running: "C:\Program Files\Stata16\StataMP-64.exe" /e do "code\dataprep.do".
Starting in hidden desktop (pid=26172).
scons: building `outputs\regressions.sters' because it doesn't exist
stata_run(["outputs\regressions.sters"], ["code\regressions.do"])
Running: "C:\Program Files\Stata16\StataMP-64.exe" /e do "code\regressions.do".
Starting in hidden desktop (pid=34468).
scons: building `outputs\scatterplot.pdf' because it doesn't exist
stata_run(["outputs\scatterplot.pdf", "outputs\regressionTable.tex"], ["code\ta
> bfig.do"])
Running: "C:\Program Files\Stata16\StataMP-64.exe" /e do "code\tabfig.do".
Starting in hidden desktop (pid=11144).
scons: done building targets.
Testing separation¶
Now let's test whether our separation of concerns has been successful. Let's change the number of significant digits in our regression table from the default to 2.
// tabfig.do
use "outputs/auto-modified.dta", clear
// read previous regression results from saved .sters file.
estimates clear
estread using "outputs/regressions.sters"
// create scatter plot
twoway scatter price mpg, title("Price vs. MPG") ///
ylabels(, angle(horizontal))
graph export "outputs/scatterplot.pdf", replace
// create *.tex file
esttab linear quadratic using "outputs/regressionTable.tex", ///
se r2 label b(a2) replace
Let's see what statacons
does and does not rebuild.
. statacons, file(SConstruct-separation) debug(explain)
scons: Reading SConscript files ...
Using 'LabelsFormatsOnly' custom_datasignature.
Calculates timestamp-independent checksum of dataset,
including variable formats, variable labels and value labels.
Edit use_custom_datasignature in config_project.ini to change.
(other options are Strict, DataOnly, False)
scons: done reading SConscript files.
scons: Building targets ...
scons: rebuilding `outputs\scatterplot.pdf' because `code\tabfig.do' changed
stata_run(["outputs\scatterplot.pdf", "outputs\regressionTable.tex"], ["code\ta
> bfig.do"])
Running: "C:\Program Files\Stata16\StataMP-64.exe" /e do "code\tabfig.do".
Starting in hidden desktop (pid=33404).
scons: done building targets.
Notice that statacons
does rebuild the outputs of tabfig.do
because tabfig.do
has changed, but does not re-run any regressions.
Metadata (variable labels)¶
Now let's suppose that we want to edit the labels of our variables, for example to change the label of price from "Price" to "Price (USD)".
// dataprep.do
version 16.1
use "inputs/auto-original.dta", clear
generate mpg_sqd = mpg^2
label variable mpg_sqd "Mileage (mpg) squared"
label variable price "Price (USD)"
save "outputs/auto-modified.dta", replace
Through statacons
, we can see the issue this will create: because we are changing dataprep.do
and its target auto-modified.dta
, we need to rebuild regTable.tex
, which depends on auto-modified.dta
:
. statacons, file(SConstruct-separation) debug(explain)
scons: Reading SConscript files ...
Using 'LabelsFormatsOnly' custom_datasignature.
Calculates timestamp-independent checksum of dataset,
including variable formats, variable labels and value labels.
Edit use_custom_datasignature in config_project.ini to change.
(other options are Strict, DataOnly, False)
scons: done reading SConscript files.
scons: Building targets ...
scons: rebuilding `outputs\auto-modified.dta' because `code\dataprep.do' change
> d
stata_run(["outputs\auto-modified.dta"], ["code\dataprep.do"])
Running: "C:\Program Files\Stata16\StataMP-64.exe" /e do "code\dataprep.do".
Starting in hidden desktop (pid=33692).
scons: rebuilding `outputs\regressions.sters' because `outputs\auto-modified.dt
> a' changed
stata_run(["outputs\regressions.sters"], ["code\regressions.do"])
Running: "C:\Program Files\Stata16\StataMP-64.exe" /e do "code\regressions.do".
Starting in hidden desktop (pid=31340).
scons: rebuilding `outputs\scatterplot.pdf' because `outputs\auto-modified.dta'
> changed
stata_run(["outputs\scatterplot.pdf", "outputs\regressionTable.tex"], ["code\ta
> bfig.do"])
Running: "C:\Program Files\Stata16\StataMP-64.exe" /e do "code\tabfig.do".
Starting in hidden desktop (pid=14344).
scons: done building targets.
.
To avoid re-running regressions.do
, we could manually adjust the label of price in tabfig.do
. However, this could lead to inconsistency across analyses and contradicts the goal of an automatic workflow. We will take up this problem in the next lesson.