I want to create a Github repository with 1,700,000 files. Yes, more than one million files in Git. Even worse, I want to put them all in the same directory.
Why? To create a scientific dataset of software engineering data: a dataset of all commits of the Apache repository. Each commit is represented by a JSON file containing the author, date, and commit message, extracted with `svn log`. And Apache's SVN has almost 2 million commits.
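For the record, extracting one JSON file per revision could be done roughly as follows. This is only a sketch under assumptions: the repository URL, the revision range, and the output layout are illustrative, and it assumes `svn` and `jq` are available.

```bash
#!/bin/bash
# Rough sketch: dump each SVN revision to one JSON file under data/.
# The repository URL and revision range below are illustrative.
REPO=https://svn.apache.org/repos/asf
mkdir -p data
for rev in $(seq 1 1700000); do
  log=$(svn log -r "$rev" "$REPO") || continue
  # The second line of 'svn log' output looks like:
  #   r42 | author | 2015-10-01 12:00:00 +0000 (Thu, 01 Oct 2015) | 1 line
  header=$(printf '%s\n' "$log" | sed -n '2p')
  author=$(printf '%s\n' "$header" | awk -F' [|] ' '{print $2}')
  date=$(printf '%s\n' "$header" | awk -F' [|] ' '{print $3}')
  # The message is everything between the header block and the final dashes.
  message=$(printf '%s\n' "$log" | sed '1,3d;$d')
  jq -n --arg author "$author" --arg date "$date" --arg message "$message" \
    '{author: $author, date: $date, message: $message}' > "data/$rev.json"
done
```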
However, one million files is a huge number, even if the files themselves are rather small (approx. 1 KB on average). And I hit a number of problems.
- File systems are not good at handling directories with a huge number of files [1] [2]. A modern file system such as ext4 has no hard limit on the number of files per directory; the limitation is practical, due to slowness.
- Git is not good at handling a large number of files, because it performs a lot of `stat` system calls on the directory, for instance for each `git add` and `git status` [3] [4]
- Some file systems have a limited number of possible files (the number of inodes). Since Git creates one object per file, if you create 1M files, you need at least 2M free inodes. To check the number of inodes, run `df -i` (see the example after this list).
(Stack Overflow is full of Q&As about these problems.)
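For example, a quick check before starting (the output below is purely illustrative; the numbers depend entirely on your file system):

```
$ df -i .
Filesystem      Inodes  IUsed    IFree IUse% Mounted on
/dev/sda1     60000000 812345 59187655    2% /
```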
In the end, it takes hours and hours to create a repository with more than one million files in the same directory.
However, there are solutions:
- Instead of doing 1 million `git add data/x` calls, do one single `git add data` (`git update-index --add x` is also an alternative)
- Work in a RAM-based file system such as tmpfs on Linux (`mkdir /tmp/exp; mount -t tmpfs none /tmp/exp`). This goes 100x if not 1000x faster, depending on the performance of your hard disk. To overcome the inode limit, set `nr_inodes` in tmpfs (`mount -t tmpfs -o 'size=90%,nr_inodes=4000000' none /tmp/exp`); see the sketch after this list.
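Putting the two solutions together, here is a minimal sketch of the setup (the mount point, size, and inode count are just examples; mounting tmpfs requires root):

```bash
# Create a RAM-based working area with enough inodes for ~2M Git objects
mkdir -p /tmp/exp
sudo mount -t tmpfs -o 'size=90%,nr_inodes=4000000' none /tmp/exp

cd /tmp/exp
git init
# ... generate the JSON files under data/ ...

# One single add of the whole directory instead of 1M individual adds
git add data
git commit -m "add Apache commit metadata as JSON"
```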
Let’s assume that you want to add 10,000 files to Github. Such a batch consists of 10,000 `git update-index` calls, one `git commit`, and one `git push`. In a RAM-based file system, such a batch takes 5 minutes (with files of 300 bytes on average).
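A batching loop along these lines might look as follows (the file list, batch size, and branch name are illustrative):

```bash
# Hypothetical sketch: stage, commit, and push the files 10,000 at a time
find data -name '*.json' | sort > filelist.txt
split -l 10000 filelist.txt batch_
for batch in batch_*; do
  # --stdin reads one path per line, avoiding one process per file
  git update-index --add --stdin < "$batch"
  git commit -m "add batch $batch"
  git push origin master
done
```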
Finally, I note that:
- Pushing to Github is not an issue
- Github handles such a repository very well from the viewpoint of pushing and browsing (congrats, Github engineers!)
- However, Github search is disabled: their support told me that there is a 500,000-files-per-repository limit for code search (info from October 1st, 2015).
- Disabling delta compression is important for pushing: `echo "* -delta" >> .gitattributes`
To conclude, I’ve set up a Github repository with 1,705,052 objects in October 2015.
```
$ git rev-list --objects -g --no-walk --all | wc -l
1705052
```
I hope to be the first to hit this record on Github :-) ([4] reports 1.3 million files, [5] mentions 800,000 objects).
Feedback welcome,
–Martin Monperrus
Lille, October 2015
[1] Large Directory Causes “ls” to Hang http://unixetc.co.uk/2012/05/20/large-directory-causes-ls-to-hang/
[2] One billion files on Linux https://lwn.net/Articles/400629/
[3] How Fast is Git? https://gist.github.com/emanuelez/1758346
[4] Git performance results on a large repository http://thread.gmane.org/gmane.comp.version-control.git/189776
[5] What are the file limits in Git (number and size)? http://stackoverflow.com/questions/984707/what-are-the-file-limits-in-git-number-and-size