A Simple SConstruct file
We concluded the previous module with:
"What we really want is an executable description of our pipeline that allows software to do the tricky part for us: figuring out what steps need to be rerun."
This is where SCons comes in -- we write an SConstruct file that codifies our workflow: what are the outputs, what are the inputs (including data and code), and what must be done to create the outputs from the inputs. Furthermore, SCons can tell which steps need to be re-run as the inputs change.
A first example
In your text editor, open the file called SConstruct
(no extension).
Skip the section at the top marked
# **** Setup from pystatacons package ***
and start reading at the section marked
# **** Substance begins *****
You will see the following:
cmd_isles_data = env.StataBuild(
    target = 'outputs/data/dta/isles.dta',
    source = 'code/count_isles.do'
)
Depends(cmd_isles_data,
    ['inputs/txt/isles.txt',
     'code/ado/personal/countWords.ado',
     'code/ado/plus/w/wordfreq.ado']
)
Let's walk through this one piece at a time.
cmd_isles_data is the name we are giving to this task. It will allow us to refer to this task elsewhere in the SConstruct if we need to.
env.StataBuild tells SCons that this task will be performed according to StataBuild. Essentially, StataBuild tells Stata "Do the do-file source in batch mode."
The target is the file to be created or built, so this task will be building outputs/data/dta/isles.dta.
source is the do-file we will use to build our target. In general, source can include other dependencies. However, we have set up StataBuild to do the first thing it finds in source, so we list only the do-file we want to run.
In Depends, we list the remaining inputs (or "dependencies") for our task cmd_isles_data. In this case, we have three additional dependencies. It is clear why the input file inputs/txt/isles.txt is a dependency -- if the input data change, the output may change, so we will need to re-run our analysis. The next two are the two ado-files: countWords.ado, which is called by count_isles.do, and wordfreq.ado, which is called by countWords.ado. Our target isles.dta also depends on these since, just like the input data, if either of them changes, the output could change.
So, in words, the code above reads:
"The task named cmd_isles_data is to create outputs/data/dta/isles.dta. StataBuild tells us that we do that by running the source do-file count_isles.do in batch mode. The files this task depends on are isles.txt, countWords.ado, and wordfreq.ado. Check whether the source or any of the dependencies have changed since the last time we built this target. If any have changed, we must re-build the target. If none have changed, we do not need to re-build this target."
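To make the "have any of the inputs changed?" check concrete, here is a minimal Python sketch of a content-based comparison. It is only an illustration of the idea, not SCons's actual implementation: the helper names content_signature and changed_inputs are hypothetical, the files are throwaway temporary files with placeholder contents, and SHA-256 is chosen here purely for convenience.

```python
import hashlib
import tempfile
from pathlib import Path


def content_signature(path):
    """Return a hex digest of the file's bytes (timestamps are ignored)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_inputs(inputs, stored):
    """List the inputs whose current signature differs from the stored one."""
    return [str(p) for p in inputs if stored.get(str(p)) != content_signature(p)]


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder files standing in for the do-file and the input text.
    do_file = Path(tmp) / "count_isles.do"
    txt_file = Path(tmp) / "isles.txt"
    do_file.write_text("* placeholder do-file\n")
    txt_file.write_text("a tale of two isles\n")
    inputs = [do_file, txt_file]

    # Pretend we just built the target: remember every input's signature.
    stored = {str(p): content_signature(p) for p in inputs}
    unchanged = changed_inputs(inputs, stored)

    # Rewriting a file with identical content does not count as a change.
    txt_file.write_text("a tale of two isles\n")
    after_rewrite = changed_inputs(inputs, stored)

    # Editing the input data does.
    txt_file.write_text("a tale of three isles\n")
    after_edit = [Path(p).name for p in changed_inputs(inputs, stored)]

print(unchanged, after_rewrite, after_edit)
```

Comparing contents rather than timestamps means that merely re-saving an unchanged file does not trigger a rebuild, while any real edit to the do-file, the ado-files, or the input text does.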
Putting our first example to work
Let's see how our script works.
First, let's start fresh by erasing our target isles.dta
(if it already exists) using the clean
option.
. statacons, clean
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Cleaning targets ...
Removed outputs\data\dta\isles.dta
scons: done cleaning targets.
We can use the dry_run
option to statacons
to get a preview of what statacons
will do:
. statacons, dry_run
scons: Reading SConscript files ...
Using 'LabelsFormatsOnly' custom_datasignature.
Calculates timestamp-independent checksum of dataset,
including variable formats, variable labels and value labels.
Edit use_custom_datasignature in config_project.ini to change.
(other options are Strict, DataOnly, False)
scons: done reading SConscript files.
scons: Building targets ...
stata_run(["outputs\data\dta\isles.dta"], ["code\count_isles.do"])
scons: done building targets.
statacons
is telling us that it will do the action stata_run(["outputs\data\dta\isles.dta"], ["code\count_isles.do"])
, but we can get even more information about why it will do this by adding the option debug(explain)
(we still include dry_run
):
. statacons, dry_run debug(explain)
scons: Reading SConscript files ...
Using 'LabelsFormatsOnly' custom_datasignature.
Calculates timestamp-independent checksum of dataset,
including variable formats, variable labels and value labels.
Edit use_custom_datasignature in config_project.ini to change.
(other options are Strict, DataOnly, False)
scons: done reading SConscript files.
scons: Building targets ...
scons: building `outputs\data\dta\isles.dta' because it doesn't exist
stata_run(["outputs\data\dta\isles.dta"], ["code\count_isles.do"])
scons: done building targets.
statacons
tells us that it needs to rebuild outputs\data\dta\isles.dta
because it does not exist.
As one last check before rebuilding, we will get statacons
to tell us about the status of each file it is considering by using the option tree(status, prune)
:
. statacons, dry_run debug(explain) tree(status,prune)
scons: Reading SConscript files ...
Using 'LabelsFormatsOnly' custom_datasignature.
Calculates timestamp-independent checksum of dataset,
including variable formats, variable labels and value labels.
Edit use_custom_datasignature in config_project.ini to change.
(other options are Strict, DataOnly, False)
scons: done reading SConscript files.
scons: Building targets ...
scons: building `outputs\data\dta\isles.dta' because it doesn't exist
stata_run(["outputs\data\dta\isles.dta"], ["code\count_isles.do"])
E = exists
R = exists in repository only
b = implicit builder
B = explicit builder
S = side effect
P = precious
A = always build
C = current
N = no clean
H = no cache
[E b ]+-.
[E b C ] +-code
[E b C ] | +-code\ado
[E b C ] | | +-code\ado\personal
[E C ] | | | +-code\ado\personal\countWords.ado
[E b C ] | | +-code\ado\plus
[E b C ] | | +-code\ado\plus\w
[E C ] | | +-code\ado\plus\w\wordfreq.ado
[E C ] | +-code\count_isles.do
[E b C ] +-inputs
[E b C ] | +-inputs\txt
[E C ] | +-inputs\txt\isles.txt
[E b ] +-outputs
[E b ] | +-outputs\data
[E b ] | +-outputs\data\dta
[ B P ] | +-outputs\data\dta\isles.dta
[E C ] | +-code\count_isles.do
[E C ] | +-inputs\txt\isles.txt
[E C ] | +-code\ado\personal\countWords.ado
[E C ] | +-code\ado\plus\w\wordfreq.ado
[E C ] +-SConstruct
scons: done building targets.
Notice the entry in the tree for outputs\data\dta\isles.dta: statacons knows to look for it, since it is a target in our SConstruct, but does not find it (no E in the leftmost column). It also knows how to build it, because the file has an explicit builder (B in the second column). Notice that statacons has checked all the dependencies for this target and has found them all to be current (capital C in the third column). The prune option makes the tree easier to read by not repeating dependencies -- if a file's dependencies have already been listed, that file will be enclosed in brackets.
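When reading many tree lines, it can help to translate the status letters back into words. The following Python sketch does that for a single line. The legend is copied from the output above; decode_status is a hypothetical helper written for this illustration and is not part of SCons or statacons. It identifies flags by letter rather than by column, since every letter in the legend is unique.

```python
# Legend copied from the tree(status) output above.
LEGEND = {
    "E": "exists",
    "R": "exists in repository only",
    "b": "implicit builder",
    "B": "explicit builder",
    "S": "side effect",
    "P": "precious",
    "A": "always build",
    "C": "current",
    "N": "no clean",
    "H": "no cache",
}


def decode_status(line):
    """Return the legend meanings for the flags inside the leading [...]."""
    start, end = line.index("["), line.index("]")
    flags = line[start + 1:end]
    return [LEGEND[ch] for ch in flags if ch in LEGEND]


# The isles.dta entry before the build (from the dry run) ...
before_build = decode_status(r"[ B P ] | +-outputs\data\dta\isles.dta")
# ... and the form the same entry takes once the target exists and is current.
after_build = decode_status(r"[E B P C ] | +-outputs\data\dta\isles.dta")

print(before_build)
print(after_build)
```

The first result lacks "exists" and "current", which is exactly why the target is scheduled to be built.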
Now that we have had a detailed look at what statacons
is planning to do, let's see what it actually does:
. statacons
scons: Reading SConscript files ...
Using 'LabelsFormatsOnly' custom_datasignature.
Calculates timestamp-independent checksum of dataset,
including variable formats, variable labels and value labels.
Edit use_custom_datasignature in config_project.ini to change.
(other options are Strict, DataOnly, False)
scons: done reading SConscript files.
scons: Building targets ...
stata_run(["outputs\data\dta\isles.dta"], ["code\count_isles.do"])
Running: "C:\Program Files\Stata16\StataMP-64.exe" /e do "code\count_isles.do".
Starting in hidden desktop (pid=23840).
scons: done building targets.
As expected, statacons has done the action stata_run(["outputs\data\dta\isles.dta"], ["code\count_isles.do"]) by sending the source (code\count_isles.do) to Stata in batch mode.
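The "Running:" line in the log shows the shape of the batch-mode command. As a rough Python sketch, such a command line could be assembled as follows. The /e flag is taken from the log above; the -b flag for non-Windows systems is an assumption based on Stata's documented batch mode, and batch_command is a hypothetical helper, not the function statacons itself uses.

```python
import sys


def batch_command(stata_exe, do_file, windows=None):
    """Build the argument list for running a do-file in Stata batch mode."""
    if windows is None:
        windows = sys.platform.startswith("win")
    # "/e" appears in the log above; "-b" is assumed for other platforms.
    flag = "/e" if windows else "-b"
    return [stata_exe, flag, "do", do_file]


cmd = batch_command(
    r"C:\Program Files\Stata16\StataMP-64.exe",
    r"code\count_isles.do",
    windows=True,
)
print(cmd)
```

A list like this could be handed to subprocess.run; it is only printed here so that the sketch runs without Stata installed.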
We see that our target has been created and is the same as before:
. use outputs/data/dta/isles.dta, clear
. desc, s
Contains data from outputs/data/dta/isles.dta
obs: 9,215
vars: 3 10 May 2022 12:59
Sorted by:
. li in 1/5
+------------------------+
| word freq share |
|------------------------|
1. | the 3315 .0625483 |
2. | of 2185 .0412272 |
3. | and 1530 .0288685 |
4. | to 1323 .0249627 |
5. | a 1132 .0213589 |
+------------------------+
. clear
If we ask statacons
to rebuild our project now, it will tell us that no rebuilding is necessary because all files are up to date:
. statacons, debug(explain) tree(status,prune)
scons: Reading SConscript files ...
Using 'LabelsFormatsOnly' custom_datasignature.
Calculates timestamp-independent checksum of dataset,
including variable formats, variable labels and value labels.
Edit use_custom_datasignature in config_project.ini to change.
(other options are Strict, DataOnly, False)
scons: done reading SConscript files.
scons: Building targets ...
scons: `.' is up to date.
E = exists
R = exists in repository only
b = implicit builder
B = explicit builder
S = side effect
P = precious
A = always build
C = current
N = no clean
H = no cache
[E b C ]+-.
[E b C ] +-code
[E b C ] | +-code\ado
[E b C ] | | +-code\ado\personal
[E C ] | | | +-code\ado\personal\countWords.ado
[E b C ] | | +-code\ado\plus
[E b C ] | | +-code\ado\plus\w
[E C ] | | +-code\ado\plus\w\wordfreq.ado
[E C ] | +-code\count_isles.do
[E b C ] +-inputs
[E b C ] | +-inputs\txt
[E C ] | +-inputs\txt\isles.txt
[E b C ] +-outputs
[E b C ] | +-outputs\data
[E b C ] | +-outputs\data\dta
[E B P C ] | +-outputs\data\dta\isles.dta
[E C ] | +-code\count_isles.do
[E C ] | +-inputs\txt\isles.txt
[E C ] | +-code\ado\personal\countWords.ado
[E C ] | +-code\ado\plus\w\wordfreq.ado
[E C ] +-SConstruct
scons: done building targets.
Adding a second target
We have successfully built isles.dta
using statacons
, but now we would like to add abyss.dta
to the same build process.
We copy SConstruct
to a new file SConstruct-addAbyss
and add a second task:
cmd_abyss_data = env.StataBuild(
target = 'outputs/data/dta/abyss.dta',
source = 'code/count_abyss.do'
)
Depends(cmd_abyss_data,
['inputs/txt/abyss.txt',
'code/ado/personal/countWords.ado',
'code/ado/plus/w/wordfreq.ado']
)
This is essentially the same as our previous task, and a useful exercise is to try to translate it into words as we did in the first subsection above ("A first example").
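Because the two tasks differ only in the name of the text being counted, the shared pattern can be written down once. The following is a minimal Python sketch of that idea; word_count_rule is a hypothetical helper introduced here for illustration and does not appear in the tutorial's SConstruct files. The paths are the ones used in the two rules above.

```python
def word_count_rule(book):
    """Return target, source and extra dependencies for one book's dataset."""
    return {
        "target": f"outputs/data/dta/{book}.dta",
        "source": f"code/count_{book}.do",
        "depends": [
            f"inputs/txt/{book}.txt",
            "code/ado/personal/countWords.ado",
            "code/ado/plus/w/wordfreq.ado",
        ],
    }


rules = {book: word_count_rule(book) for book in ("isles", "abyss")}

for book, rule in rules.items():
    print(book, "->", rule["target"])

# Inside an SConstruct (where env and Depends are available), each rule
# could then be registered along these lines:
#     cmd = env.StataBuild(target=rule["target"], source=rule["source"])
#     Depends(cmd, rule["depends"])
```

Writing each rule out by hand, as the tutorial does, is easier to read when there are only two; a helper like this mainly pays off once many similar targets are involved.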
Let's see what statacons
thinks about this task. We use the option file(SConstruct-addAbyss)
to tell statacons
to examine our new SConstruct file instead of the default (which is just to use the file called SConstruct
):
. statacons, file(SConstruct-addAbyss) dry_run debug(explain) tree(status,prune
> )
scons: Reading SConscript files ...
Using 'LabelsFormatsOnly' custom_datasignature.
Calculates timestamp-independent checksum of dataset,
including variable formats, variable labels and value labels.
Edit use_custom_datasignature in config_project.ini to change.
(other options are Strict, DataOnly, False)
scons: done reading SConscript files.
scons: Building targets ...
scons: Cannot explain why `outputs\data\dta\abyss.dta' is being rebuilt: No pre
> vious build information found
stata_run(["outputs\data\dta\abyss.dta"], ["code\count_abyss.do"])
E = exists
R = exists in repository only
b = implicit builder
B = explicit builder
S = side effect
P = precious
A = always build
C = current
N = no clean
H = no cache
[E b ]+-.
[E b C ] +-code
[E b C ] | +-code\ado
[E b C ] | | +-code\ado\personal
[E C ] | | | +-code\ado\personal\countWords.ado
[E b C ] | | +-code\ado\plus
[E b C ] | | +-code\ado\plus\w
[E C ] | | +-code\ado\plus\w\wordfreq.ado
[E C ] | +-code\count_abyss.do
[E C ] | +-code\count_isles.do
[E b C ] +-inputs
[E b C ] | +-inputs\txt
[E C ] | +-inputs\txt\abyss.txt
[E C ] | +-inputs\txt\isles.txt
[E b ] +-outputs
[E b ] | +-outputs\data
[E b ] | +-outputs\data\dta
[E B P ] | +-outputs\data\dta\abyss.dta
[E C ] | | +-code\count_abyss.do
[E C ] | | +-inputs\txt\abyss.txt
[E C ] | | +-code\ado\personal\countWords.ado
[E C ] | | +-code\ado\plus\w\wordfreq.ado
[E B P C ] | +-outputs\data\dta\isles.dta
[E C ] | +-code\count_isles.do
[E C ] | +-inputs\txt\isles.txt
[E C ] | +-code\ado\personal\countWords.ado
[E C ] | +-code\ado\plus\w\wordfreq.ado
[E C ] +-SConstruct-addAbyss
scons: done building targets.
There is an interesting line from the debug: Cannot explain why outputs/data/dta/abyss.dta is being rebuilt: No previous build information found
. This is a reminder of how scons
decides whether to rebuild a target: it checks whether the target's dependencies have changed since the last time scons
built the target. We created abyss.dta
in the previous module, and have not erased it, so statacons
sees that it exists. However, since we have not built abyss.dta
using statacons
before, statacons
cannot answer the question "have the dependencies changed since the last time you built abyss.dta
?", and so chooses to rebuild it.
Notice that there is no C
in the status report for abyss.dta
from the tree, since its status is not known to be Current.
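The different situations we have seen so far (target missing, target present but never built by the tool, everything current) can be summarized in a small decision function. This Python sketch is a simplified model of that reasoning, not SCons's real logic: the function explain, the reason strings, and the short signature values such as "s1" are all made up for illustration.

```python
def explain(target_exists, previous, current):
    """Return (rebuild?, reason) for one target.

    previous: input signatures recorded at the last build, or None if the
              build tool has never built this target.
    current:  input signatures as they are now.
    """
    if not target_exists:
        return True, "target does not exist"
    if previous is None:
        return True, "no previous build information"
    changed = sorted(n for n in current if previous.get(n) != current[n])
    if changed:
        return True, "changed: " + ", ".join(changed)
    return False, "up to date"


# Made-up signatures for the inputs of abyss.dta.
now = {"count_abyss.do": "s1", "abyss.txt": "s2"}

print(explain(False, None, now))  # like isles.dta after "statacons, clean"
print(explain(True, None, now))   # like abyss.dta: exists, never built by the tool
print(explain(True, now, now))    # like isles.dta after a successful build
print(explain(True, {"count_abyss.do": "s1", "abyss.txt": "old"}, now))
```

The second case is the one that applies to abyss.dta here: the file exists, but with nothing recorded to compare against, rebuilding is the only safe choice.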
Now we rebuild:
. statacons, file(SConstruct-addAbyss)
scons: Reading SConscript files ...
Using 'LabelsFormatsOnly' custom_datasignature.
Calculates timestamp-independent checksum of dataset,
including variable formats, variable labels and value labels.
Edit use_custom_datasignature in config_project.ini to change.
(other options are Strict, DataOnly, False)
scons: done reading SConscript files.
scons: Building targets ...
stata_run(["outputs\data\dta\abyss.dta"], ["code\count_abyss.do"])
Running: "C:\Program Files\Stata16\StataMP-64.exe" /e do "code\count_abyss.do".
Starting in hidden desktop (pid=1816).
scons: done building targets.
Notice that statacons
has built abyss.dta
but not isles.dta
. (Question for the user: why has statacons
not rebuilt isles.dta
? Hint: check the tree
above.)
Exercise: Write Two New Rules
Create a copy of SConstruct-addAbyss and rename it SConstruct-2NewRules.
Write a new do-file count_last.do to create a frequency count dataset last.dta from the input text last.txt.
Add a rule to our new SConstruct to build last.dta.
Update testZipf.do to incorporate last.dta, but do not re-create testZipf.txt.
Add a rule in our new SConstruct to build testZipf.txt.
Use SConstruct-2NewRules to see if any of the targets need to be rebuilt.