“Hey, I wonder how hard it would be to generate a stacked graph of the word count of my thesis over time.” And bang goes an evening.
How are we to find a good word count of a LaTeX file? There’s a whole post in this question itself. For now, let’s just say that we are going to be using texcount.
We are going to be word-counting each file by calling texcount with a few flags. The -q flag suppresses errors which would otherwise mess up our output, the -sum flag tells texcount that we only want the sum of words in text, headings, footnotes and so on, and the -nosub flag tells texcount not to bother doing subcounts (how many words in each section). That last one doesn’t seem to have any effect on performance, but better safe than sorry.
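For concreteness, the call described above might be wrapped up something like this. It is a minimal sketch: the function names count_words and count_file are made up for illustration, the -1 flag (one line of output) is an assumption, and so is the trailing -, which is assumed to make texcount read from standard input (worth checking against texcount -h on your machine).

```shell
# Count the words of the LaTeX source arriving on standard input.
#   -q      quiet: keep error messages out of the output
#   -1      a single line of output (assumed flag)
#   -sum    one total for text, headings, footnotes and so on
#   -nosub  no per-section subcounts
#   -       read from standard input (assumed)
count_words() {
    texcount -q -1 -sum -nosub -
}

# Count the words in the file named by the first argument.
count_file() {
    count_words < "$1"
}
```

Calling count_file on one of your .tex files should then print a single number.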
How do we find out what the word count was at some point in the past? What, you mean you don’t have version control for your thesis? I use git to keep track of changes to my thesis. I can find out what a previous version of a file looked like with git show, giving it the name of the commit and the file I am interested in. So if I pipe this into texcount, I can find out what the word count was at various past times.
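As a rough sketch of that pipe: git show accepts a commit and a path joined by a colon, and prints that version of the file to standard output. The function names here are invented for the example, and the texcount flags (including the trailing - for standard input) are assumptions.

```shell
# Word counter reading LaTeX source on standard input
# (texcount assumed installed; "-" assumed to mean "read standard input").
count_words() {
    texcount -q -1 -sum -nosub -
}

# Print the word count of a file as it was at a given commit.
#   $1  a commit name, e.g. HEAD~5 or a hash
#   $2  the file's path relative to the top of the repository
count_at_commit() {
    git show "$1:$2" | count_words
}
```

Something like count_at_commit HEAD~5 thesis.tex (with thesis.tex standing in for your own file) would give the count from five commits ago.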
Next, we want to loop over a bunch of different commits, and get the word count for each previous version of the file. We’ll start by doing that for a single file.
We use git rev-list to get us the last ten commits on the branch, and then pipe the output of git show into texcount. This we output to the terminal with printf. We also send error messages to /dev/null to avoid errors cluttering up our output. There will be errors if you try to show a file from a commit before the file was created. This isn’t a problem, since then texcount will output 0 as expected, so it’s best just to ignore the errors. Once we get this working, we can get information on all the commits by removing the -n 10 flag.
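The loop just described could be sketched roughly as follows. Treat it as an illustration under assumptions rather than the exact original: counts_for_file is a made-up name, HEAD stands in for whichever branch you want to walk, and the texcount flags are as assumed earlier.

```shell
# Word counter reading LaTeX source on standard input
# (texcount assumed installed; "-" assumed to mean "read standard input").
count_words() {
    texcount -q -1 -sum -nosub -
}

# Print "COUNT, " for one file at each of the last few commits, newest first.
#   $1  the file's path relative to the top of the repository
#   $2  how many commits to look at (defaults to 10)
counts_for_file() {
    file=$1
    limit=${2:-10}
    for commit in $(git rev-list -n "$limit" HEAD); do
        # Errors from commits that predate the file go to /dev/null;
        # the counter then just sees empty input.
        printf '%s, ' "$(git show "$commit:$file" 2>/dev/null | count_words)"
    done
}
```

Dropping -n "$limit" from the git rev-list call walks the whole history instead of the last few commits.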
As well as each word count, we want the time the commit was made, which git can also give us. Note that our previous printf (in the git show line) ends with a comma and then a space. This means that our output will have the word count and the time on one line, separated by a comma. We’ll want the next record to be on the next line, so we put an explicit \n newline at the end of this command.
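One way to fetch the commit time is sketched below. The format is an assumption: %ct (the committer date as Unix epoch seconds) is picked here only because it is easy for plotting tools to parse, and commit_time is a made-up name.

```shell
# Print the committer date of the given commit as Unix epoch seconds,
# followed by the explicit newline that ends the record.
commit_time() {
    printf '%s\n' "$(git show -s --format=%ct "$1")"
}
```

Here -s tells git show to leave out the diff, so only the formatted date is printed.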
We want a word count for several files, so let’s work out how to loop over them. I don’t know much about how bash scripting works, so I’m pretty much stumbling around experimenting until something seems to work. The approach described below seems to work.
I define a list of files. I don’t know what sort of list structures bash understands, so I basically worked out by trial and error that putting each file on a line works. There’s probably a nicer way, but whatever. I have three files in this example. Note also the commented out line that pipes the output into wc -w instead of texcount. For testing purposes, this is useful, since it’s way quicker. Note also the commented out stuff at the top, which would put a header row on your output.
Now obviously you’ll want to put this in a file with an appropriate header, e.g. #! /bin/bash or similar, and then run it, and send the output to a file for the plotting step to read.
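Putting the pieces together, a script along these lines would do the job. It is a sketch under assumptions, not the original script: the three file names are placeholders, HEAD stands in for your branch, the texcount flags and the %ct date format are assumed, and word_count_history is an invented name.

```shell
#!/bin/bash
# Sketch: print one row per commit, with one word-count column per file
# and the commit time (Unix epoch seconds) as the last column.

# One file per line. These three names are placeholders; names containing
# spaces would break this simple list.
FILES="
intro.tex
middle.tex
conclusion.tex
"

# Word counter reading LaTeX source on standard input.
count_words() {
    texcount -q -1 -sum -nosub -
}
# For quick tests, swap in the much faster: count_words() { wc -w; }

# $1 is how many commits to look at (defaults to 10).
word_count_history() {
    limit=${1:-10}
    # printf 'intro, middle, conclusion, time\n'    # optional header row
    for commit in $(git rev-list -n "$limit" HEAD); do
        for file in $FILES; do
            printf '%s, ' "$(git show "$commit:$file" 2>/dev/null | count_words)"
        done
        printf '%s\n' "$(git show -s --format=%ct "$commit")"
    done
}
```

To use it, add a final line such as word_count_history 10 and redirect the script’s output to a file from the top of your repository.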
The output from one version of the script has one column of word counts per file, followed by the date of the commit.
Wouldn’t it make more sense to put the date column first rather than last? Maybe, but I didn’t. This way you don’t have to worry about trailing commas after the last column, which is an advantage, I guess?
Now we want to use gnuplot to create some funky graphs. Rather than explain how to build the graph, I’ll point you to another post which shows how to draw stacked charts of the kind we want. That post recommends pre-processing the data to get it to work, but it’s much easier to use gnuplot’s built-in capabilities.
I won’t explain much about the gnuplot script, but note that you can use $1+$2 to get a graph of the sum of columns 1 and 2: this is easier than the pre-processing that the post mentioned above recommends. At least, that’s my opinion.
Save your gnuplot commands as gwc.gp or whatever, and then call gnuplot, feeding in this file as input and telling it where to put the output.
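To make the $1+$2 trick concrete, here is a hedged sketch that writes out a gnuplot file of roughly the right shape. Everything specific in it is an assumption: three word-count columns followed by a Unix-epoch time column, a data file called counts.csv (a placeholder name), and a gnuplot build that has the png terminal. The helper name write_plot_script is made up too.

```shell
# Write a gnuplot script that stacks three word-count columns over time.
#   $1  where to write the script (defaults to gwc.gp)
write_plot_script() {
    cat > "${1:-gwc.gp}" <<'EOF'
set datafile separator ","
set xdata time
set timefmt "%s"
set format x "%Y-%m"
set terminal png size 900,500
set style fill solid 0.7
set key left top
# Draw the tallest sum first so the smaller ones are painted over it.
plot "counts.csv" using 4:($1+$2+$3) with filledcurves x1 title "all three files", \
     ""           using 4:($1+$2)    with filledcurves x1 title "first two files", \
     ""           using 4:1          with filledcurves x1 title "first file"
EOF
}
```

After write_plot_script gwc.gp, running gnuplot gwc.gp > wordcount.png should send the image to wordcount.png, since with no output set gnuplot writes the plot to standard output.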
And that’s about all you need to know to create a stacked chart of change in word count over time.
Sadly, I didn’t start my git repo when I started writing, but only some time later. So I don’t have the gratifying experience of seeing the whole growth of the thing. But here’s the final product:
A word of warning: the bash script is quite slow. It takes about 30 seconds to do 3 files on 10 commits. Or 30 seconds to count 30k-ish words 10 times. Now my actual thesis is approaching 70k words and there are at least 70 commits. It takes over six minutes to run. So if you are going to try this, use the wc -w line instead, and make sure to limit the number of commits that rev-list spits out until you’re sure it’s working.
How’s that for structured procrastination!