I want to create a Github repository with 1,700,000 files. Yes, more than one million files in Git. Even worse, I want to put them all in the same directory.
Why? To create a scientific dataset of software engineering data: a dataset of all commits of the Apache repository. Each commit is represented by a JSON file containing the author, date, and commit message, extracted with `svn log`. And Apache's SVN has almost 2 million commits.
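For the record, extracting one JSON file per revision could be done roughly as follows. This is only a sketch under assumptions: the repository URL, the revision range, and the output layout are illustrative, and it assumes `svn` and `jq` are available.

```bash
#!/bin/bash
# Rough sketch: dump each SVN revision to one JSON file under data/.
# The repository URL and revision range below are illustrative.
REPO=https://svn.apache.org/repos/asf
mkdir -p data
for rev in $(seq 1 1700000); do
  log=$(svn log -r "$rev" "$REPO") || continue
  # The second line of 'svn log' output looks like:
  #   r42 | author | 2015-10-01 12:00:00 +0000 (Thu, 01 Oct 2015) | 1 line
  header=$(printf '%s\n' "$log" | sed -n '2p')
  author=$(printf '%s\n' "$header" | awk -F' [|] ' '{print $2}')
  date=$(printf '%s\n' "$header" | awk -F' [|] ' '{print $3}')
  # The message is everything between the header block and the final dashes.
  message=$(printf '%s\n' "$log" | sed '1,3d;$d')
  jq -n --arg author "$author" --arg date "$date" --arg message "$message" \
    '{author: $author, date: $date, message: $message}' > "data/$rev.json"
done
```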
However, one million files is a huge number, even if the files themselves are rather small (approx. 1 KB on average). And I hit a number of problems.
- File systems are not good at handling directories with a huge number of files [1] [2]. A modern file system such as ext4 has no hard limit on the number of files per directory; the limitation is practical, due to slowness.
- Git is not good at handling a large number of files, because it performs a lot of `stat` system calls on the directory, for instance for each `git add` and `git status` [3] [4]
- Some file systems have a limited number of possible files (the number of inodes). Since Git creates one object per file, if you create 1M files, you need at least 2M free inodes. To check the number of inodes, run `df -i` (see the example after this list).
(Stack Overflow is full of Q&As about these problems.)
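For example, a quick check before starting (the output below is purely illustrative; the numbers depend entirely on your file system):

```
$ df -i .
Filesystem      Inodes  IUsed    IFree IUse% Mounted on
/dev/sda1     60000000 812345 59187655    2% /
```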
In the end, it takes hours and hours to create a repository with more than one million files in the same directory.
However, there are solutions:
- Instead of doing 1 million `git add data/x` calls, do one single `git add data` (`git update-index --add x` is also an alternative)
- Work in a RAM-based file system such as tmpfs on Linux (`mkdir /tmp/exp; mount -t tmpfs none /tmp/exp`). This goes 100x if not 1000x faster, depending on the performance of your hard disk. To overcome the inode limit, set `nr_inodes` in tmpfs (`mount -t tmpfs -o 'size=90%,nr_inodes=4000000' none /tmp/exp`); see the sketch after this list.
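Putting the two solutions together, here is a minimal sketch of the setup (the mount point, size, and inode count are just examples; mounting tmpfs requires root):

```bash
# Create a RAM-based working area with enough inodes for ~2M Git objects
mkdir -p /tmp/exp
sudo mount -t tmpfs -o 'size=90%,nr_inodes=4000000' none /tmp/exp

cd /tmp/exp
git init
# ... generate the JSON files under data/ ...

# One single add of the whole directory instead of 1M individual adds
git add data
git commit -m "add Apache commit metadata as JSON"
```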
Let’s assume that you want to add 10,000 files to Github. Such a batch consists of 10,000 `git update-index` calls, one `git commit`, and one `git push`. In a RAM-based file system, such a batch takes 5 minutes (with files of 300 bytes on average).
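A batching loop along these lines might look as follows (the file list, batch size, and branch name are illustrative):

```bash
# Hypothetical sketch: stage, commit, and push the files 10,000 at a time
find data -name '*.json' | sort > filelist.txt
split -l 10000 filelist.txt batch_
for batch in batch_*; do
  # --stdin reads one path per line, avoiding one process per file
  git update-index --add --stdin < "$batch"
  git commit -m "add batch $batch"
  git push origin master
done
```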
Finally, I note that:
- Pushing to Github is not an issue
- Github handles such a repository very well from the viewpoint of pushing and browsing (congrats, Github engineers!)
- However, Github search is disabled: their support told me that there is a 500,000-files-per-repository limit for code search (info from October 1st, 2015).
- Disabling delta compression is important for pushing: `echo "* -delta" >> .gitattributes`
To conclude, I’ve set up a Github repository with 1,705,052 objects in October 2015.
```
$ git rev-list --objects -g --no-walk --all | wc -l
1705052
```
I hope to be the first to hit this record on Github :-) ([4] reports 1.3 million files, [5] mentions 800,000 objects).
Feedback welcome,
–Martin Monperrus
Lille, October 2015
[1] Large Directory Causes “ls” to Hang http://unixetc.co.uk/2012/05/20/large-directory-causes-ls-to-hang/
[2] One billion files on Linux https://lwn.net/Articles/400629/
[3] How Fast is Git? https://gist.github.com/emanuelez/1758346
[4] Git performance results on a large repository http://thread.gmane.org/gmane.comp.version-control.git/189776
[5] What are the file limits in Git (number and size)? http://stackoverflow.com/questions/984707/what-are-the-file-limits-in-git-number-and-size