distribution file statistics

I’ve recently packaged and released a few small programs, and I spent a little time thinking about what files to include. If you’ve used any *nix OS before, the following shell session will look very familiar:

heaven:~> tar xzvf foo-1.2.3.tar.gz
heaven:~> ls foo-1.2.3/
bin     configure  doc      include  Makefile  src
build   COPYING    etc      INSTALL  MANIFEST  test
config  COPYRIGHTS HISTORY  lib      README    UPDATE
heaven:~>

If you haven’t used *nix much, this is a typical list of files and directories that a program comes with. Most programs have a README file. Other common files include CHANGELOG, NEWS, and AUTHORS. Also, some programs have different names for the same type of file, such as LICENSE, COPYING, and COPYRIGHT.

I was curious to see how common each file is, so I looked at many of the programs that ship with Red Hat Linux 9 and calculated some basic statistics. Out of 412 programs total, here’s the frequency of each file, grouped by type:

Filename         Percent of projects with this file  Percent of projects with this type of file
README           73%                                 75%
MANUAL           1%
USAGE            0%
COPYING          49%                                 59%
LICENSE          5%
LICENCE          1%
License          0%
COPYRIGHT        3%
Copyright        2%
ChangeLog        41%                                 56%
CHANGES          9%
Changelog        1%
CHANGELOG        1%
Changes          0%
changelog        0%
NOTES            1%
RELNOTES         1%
VERSION          1%
RELEASE          0%
NEWS             39%                                 42%
ANNOUNCE         2%
WHATSNEW         0%
WhatsNew         0%
announce         0%
AUTHORS          33%                                 42%
THANKS           5%
CREDITS          3%
MAINTAINERS      0%
TODO             24%                                 24%
ToDo             0%
INSTALL          12%
Install          0%
BUGS             5%                                  7%
PROBLEMS         1%
Problems         0%
TROUBLESHOOTING  0%
FAQ              4%                                  4%
HACKING          2%                                  2%
HISTORY          1%                                  1%
PROJECTS         1%                                  1%

It’s not surprising to see that README is by far the most common file. I was surprised, though, at the number of different names for the same types of files, especially for license and changelog files. Still, it’s reassuring that the most common names, COPYING and ChangeLog respectively, are used about 90% and 80% of the time. For the license files specifically, COPYING is the GNU standard. (Personally, I prefer the more straightforward LICENSE.)

Judging from this lineup, a de facto standard set of files would include README, COPYING, ChangeLog, NEWS, and for larger projects, AUTHORS.
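
A quick way to check a release tree against this list is sketched below. The function name `missing_files` and the split into EXPECTED and OPTIONAL are illustrative assumptions, not part of any standard tool; treating AUTHORS as optional simply mirrors the "for larger projects" remark.

```python
from pathlib import Path

# File names taken from the lineup above.
EXPECTED = ["README", "COPYING", "ChangeLog", "NEWS"]
# Assumption: AUTHORS is only expected for larger projects.
OPTIONAL = ["AUTHORS"]


def missing_files(project_dir, names=EXPECTED):
    """Return the entries of `names` that are not regular files in project_dir."""
    root = Path(project_dir)
    return [name for name in names if not (root / name).is_file()]
```

Calling `missing_files("foo-1.2.3")` returns the list of expected names that the unpacked directory lacks, and `missing_files("foo-1.2.3", OPTIONAL)` does the same for the optional ones. On a case-insensitive file system a name such as Changelog would also satisfy the ChangeLog check.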

Also, note that the percentages for the individual files don’t always add up to the total for their type. This is due to rounding.
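
For anyone who wants to repeat the count on another set of programs, here is a minimal sketch of one way to do it. It is not the script behind the table above. It assumes each program is already unpacked into its own directory, looks only at the top level of each directory, and uses a hypothetical `FILE_TYPES` grouping that covers just three of the types from the table.

```python
from collections import Counter
from pathlib import Path

# Hypothetical grouping of file names by type (type labels are made up here;
# the file names come from the table).
FILE_TYPES = {
    "readme": ["README", "MANUAL", "USAGE"],
    "license": ["COPYING", "LICENSE", "LICENCE", "License", "COPYRIGHT", "Copyright"],
    "news": ["NEWS", "ANNOUNCE", "WHATSNEW", "WhatsNew", "announce"],
}


def file_statistics(project_dirs, file_types=FILE_TYPES):
    """Return (per_file, per_type) dicts of whole-number percentages.

    A project counts toward a type once, however many files of that type it has.
    """
    per_file, per_type = Counter(), Counter()
    total = 0
    for project in project_dirs:
        total += 1
        # Compare exact names so that ChangeLog and Changelog stay distinct.
        present = {p.name for p in Path(project).iterdir() if p.is_file()}
        for type_name, names in file_types.items():
            found = [n for n in names if n in present]
            per_file.update(found)
            if found:
                per_type[type_name] += 1
    if total == 0:
        return {}, {}

    def pct(count):
        return round(100 * count / total)

    return ({n: pct(c) for n, c in per_file.items()},
            {t: pct(c) for t, c in per_type.items()})
```

With three projects, where one has only README, one has README and COPYING, and one has only LICENSE, this reports COPYING and LICENSE at 33% each but the license type at 67%. That is the same kind of rounding mismatch as in the table.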

This was inspired by Eric Raymond’s new book The Art of Unix Programming, which discusses best practices for releasing Open Source software. The “Good distribution-making practice” section of his Software Release Practice HOWTO is also very relevant.
