Git statistics
This post covers generating statistics from a Git repository. All of the examples were run against the Git project's own repository.
Total repository commits
Counting the total number of commits in a repository is a matter of piping git log into wc:
$ git log --all --format=oneline | wc -l
46680
Alternatively you can also use git rev-list:
$ git rev-list --all | wc -l
46680
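If only the number is needed, git rev-list can do the counting itself through its --count option, which saves the extra wc process. A minimal sketch follows; the count_commits function name is purely illustrative and not something Git provides.

```shell
# count_commits: print the number of commits reachable from any ref.
# The function name is hypothetical; --count and --all are standard
# git rev-list options. Run it from inside a repository.
count_commits() {
  git rev-list --count --all
}
```

Run from inside a repository, it prints a single number:
$ count_commits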
Total contributors
git log can also be used to list contributors:
$ git log --all --format='%aN' | sort -u
A Large Angry SCM
Aaron Crane
Aaron Schrab
...
At the time of writing there are 1437 contributors:
$ git log --all --format='%aN' | sort -u | wc -l
1437
Top committers
Working out the top committers is also relatively straightforward. You can use git log with a format string:
$ git log --all --format='%aN' | sort | uniq -c | sort -nr | head -n 5
18794 Junio C Hamano
2341 Jeff King
1404 Shawn O. Pearce
1112 Linus Torvalds
1008 Nguyễn Thái Ngọc Duy
Alternatively git shortlog can also be used:
$ git shortlog --all -sn | head -n 5
18794 Junio C Hamano
2341 Jeff King
1404 Shawn O. Pearce
1112 Linus Torvalds
1008 Nguyễn Thái Ngọc Duy
Note: In both of the examples above the .mailmap file is read to cope with alternative names and/or email addresses. Using %an instead of %aN to ignore .mailmap will produce slightly different results:
$ git log --all --format='%an' | sort | uniq -c | sort -nr | head -n 5
18790 Junio C Hamano
2341 Jeff King
1334 Shawn O. Pearce
1112 Linus Torvalds
993 Nguyễn Thái Ngọc Duy
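The sort, uniq -c, sort -nr, head pipeline recurs throughout this post, so it can be handy to wrap it in a small shell function. The sketch below is one way to do that; the name top_names and the tab-separated output are choices made for this example, not Git features.

```shell
# top_names: read one name per line on stdin and print the N most
# frequent as "count<TAB>name". N defaults to 5.
# The function name and output format are assumptions for this sketch.
top_names() {
  sort | uniq -c | sort -rn | head -n "${1:-5}" |
    # uniq -c pads the count with spaces; strip the padding and the
    # count from the line, then print the count and the name.
    awk '{ count = $1; sub(/^ *[0-9]+ /, ""); print count "\t" $0 }'
}
```

It slots in after any command that prints one author per line:
$ git log --all --format='%aN' | top_names 5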
Top committers on a file
A very similar command can be used to calculate commit totals for a single file:
$ git log --all --format='%aN' -- README.md | sort | uniq -c | sort -nr | head
6 Matthieu Moy
1 Benjamin Dopplinger
Top committers this year
The --since option can be used to limit commits to a time period:
$ git log --all --format='%aN' --since='2016-01-01' | sort \
| uniq -c | sort -nr | head -n5
1492 Junio C Hamano
339 Jeff King
183 Johannes Schindelin
166 Nguyễn Thái Ngọc Duy
107 Vasco Almeida
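The date above is hard-coded, so the command stops meaning "this year" once the year rolls over. One possible fix is to build the date from the system clock. The year_start helper below is a hypothetical name used only for this sketch.

```shell
# year_start: print January 1st of the current year as YYYY-01-01,
# based on the system clock. Hypothetical helper for this sketch.
year_start() {
  printf '%s-01-01\n' "$(date +%Y)"
}
```

The result can then be passed to --since:
$ git log --all --format='%aN' --since="$(year_start)" | sort \
| uniq -c | sort -nr | head -n5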
Commits over time
It's often interesting to know how active a codebase is. The following command shows total commits by year:
$ git log --all --format='%cd' --date='format:%Y' | sort | uniq -c \
| awk 'BEGIN{print "year","commits"}{print $2, " ", $1}'
year commits
2005 3215
2006 4601
2007 5496
2008 4120
2009 3835
2010 3883
2011 3521
2012 3782
2013 4319
2014 3103
2015 3289
2016 3516
A similar command shows which hour of the day most commits are made in (hours are in the time zone recorded in each commit):
$ git log --all --format='%cd' --date='format:%H' | sort | uniq -c \
| awk 'BEGIN{print "hour","commits"}{print $2, " ", $1}'
hour commits
00 1954
01 1292
02 780
03 415
04 177
05 67
06 108
07 340
08 878
09 1942
10 3284
11 4247
12 3983
13 3865
14 4319
15 3783
16 2573
17 1807
18 1389
19 1186
20 1146
21 2096
22 2577
23 2472
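Columns of counts like these are easier to scan as a bar chart. The following is a rough sketch of a text histogram that reads the "count label" lines produced by uniq -c. The histogram name and the default width of 50 characters are assumptions made for this example.

```shell
# histogram: read "count label" lines (as printed by uniq -c) on stdin
# and draw one bar of '#' characters per line, scaled so that the
# largest count is WIDTH characters wide. WIDTH defaults to 50.
# The function name and default width are assumptions for this sketch.
histogram() {
  awk -v width="${1:-50}" '
    { count[NR] = $1; label[NR] = $2; if ($1 > max) max = $1 }
    END {
      for (i = 1; i <= NR; i++) {
        n = (max > 0) ? int(count[i] * width / max) : 0
        bar = ""
        for (j = 0; j < n; j++) bar = bar "#"
        printf "%s %6d %s\n", label[i], count[i], bar
      }
    }'
}
```

For example, for commits per hour:
$ git log --all --format='%cd' --date='format:%H' | sort | uniq -c | histogram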
File level statistics
Looking at commits is fairly straightforward, but it's often more interesting to look at file-based statistics. The git blame command is a great tool for this.
Lines per author
The following command will find the top five authors, based on the number of lines attributed to them in the HEAD revision of the repository:
$ git ls-tree -r --name-only HEAD \
| xargs -d "\n" -n 1 git blame --line-porcelain \
| sed -n 's/^author //p' | sort | uniq -c | sort -rn \
| head -n 5
118793 Junio C Hamano
36583 Jeff King
27594 Jiang Xin
23174 Shawn O. Pearce
22671 Nguyễn Thái Ngọc Duy
Note: At the time of writing there are 840583 lines split across 3000 files in the Git repository. As a result, the command above took just over 15 minutes to run.
Lines per author for a file
Looking at attribution for a single file is slightly easier:
$ git blame --line-porcelain README.md \
| sed -n 's/^author //p' | sort | uniq -c | sort -rn
32 Matthieu Moy
9 Nicolas Pitre
9 Junio C Hamano
5 Benjamin Dopplinger
4 Christian Couder
2 Stefano Lattarini
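The sed filter in these blame pipelines relies on the --line-porcelain layout: every blamed line is preceded by header lines such as author, author-mail and author-time, and the file's own content line is prefixed with a tab. Matching "author " with the trailing space therefore picks out only the author name. Here is a small sketch of that step as a function reading porcelain output on stdin; count_blame_authors is a made-up name.

```shell
# count_blame_authors: read `git blame --line-porcelain` output on
# stdin and print the number of lines attributed to each author,
# highest first. The function name is hypothetical.
count_blame_authors() {
  # "author " (with the space) skips author-mail, author-time and
  # author-tz; content lines start with a tab, so they never match.
  sed -n 's/^author //p' | sort | uniq -c | sort -rn
}
```

Usage for a single file:
$ git blame --line-porcelain README.md | count_blame_authors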
Changes by author
It's also possible to rank authors by the number of lines they've changed (added plus deleted). The following command does this:
$ git log --all --format='%aN' | sort -u | xargs -d "\n" -I {} \
bash -c 'echo "$(git log --format='' --author="{}" --numstat | awk "{total += (\$1 + \$2)}END{print total}") {}"' \
| sort -rn | head -n 5
366281 Junio C Hamano
154231 Linus Torvalds
145731 Jiang Xin
78303 Peter Krefting
78085 Shawn O. Pearce
Note: This is another slow command; it took about twelve minutes to run.
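Much of that time probably goes on running git log once per author. A single-pass alternative is sketched below: git log prints a marker line with the author name before each commit's --numstat output, and awk keeps a running total per author. The sum_numstat name and the "@" marker are assumptions made for this example. Totals may differ slightly from the ones above, because --author matches names as a pattern rather than exactly.

```shell
# sum_numstat: read `git log --format='@%aN' --numstat` output on
# stdin and print "total author" lines, largest total first.
# The function name and the "@" marker are assumptions for this sketch.
sum_numstat() {
  awk -F'\t' '
    /^@/    { author = substr($0, 2); next }  # marker line: current author
    NF == 3 { total[author] += $1 + $2 }      # added, deleted, path
    END     { for (a in total) printf "%d %s\n", total[a], a }
  ' | sort -rn
}
```

Binary files are reported by --numstat as "-" for both counts, which awk treats as zero. Usage:
$ git log --format='@%aN' --numstat | sum_numstat | head -n 5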
A word of warning
It's nice how easy it is to pull statistics from Git. However, it's important to remember that lines of code and commit counts are often very poor measures of quality.
As always with statistics, this quote is relevant:
There are three kinds of lies: lies, damned lies, and statistics.