One million files on Git and Github

by Martin Monperrus

I want to create a Github repository with 1,700,000 files. Yes, more than one million files on Git. Even worse, I want to put them in the same directory.

Why? To create a scientific dataset of software engineering data: a dataset of all commits of the Apache repository. Each commit is represented by a JSON file that contains the author, date and commit message, extracted with svn log. And Apache’s SVN has almost 2 million commits.
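To make the format concrete, here is a minimal sketch of how one JSON document per commit could be built from the output of `svn log --xml`. The function name, the JSON key names and the extra `revision` field are my assumptions for illustration, not necessarily what the dataset uses.

```python
import json
import xml.etree.ElementTree as ET


def commits_from_svn_xml(xml_text):
    """Turn the output of `svn log --xml` into one dict per commit."""
    root = ET.fromstring(xml_text)
    commits = []
    for entry in root.iter("logentry"):
        commits.append({
            # Key names are illustrative assumptions.
            "revision": int(entry.get("revision")),
            "author": entry.findtext("author", default=""),
            "date": entry.findtext("date", default=""),
            "message": entry.findtext("msg", default=""),
        })
    return commits


# A hand-written sample in the shape produced by `svn log --xml`.
SAMPLE = """<log>
<logentry revision="2">
<author>alice</author>
<date>2015-10-01T12:00:00.000000Z</date>
<msg>Fix typo</msg>
</logentry>
</log>"""

for commit in commits_from_svn_xml(SAMPLE):
    # Each commit would be written to its own small file, e.g. "2.json".
    print(json.dumps(commit))
```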

However, one million files is a huge number of files, even if they are rather small (approx. 1 kB on average). And I hit a large number of problems.

(Stackoverflow is full of Q&As about those problems)

In the end, it takes hours and hours to create a repository with more than one million files in the same directory.

However, there are solutions:

Let’s assume that you want to add 10000 files to Github. Such a batch consists of 10000 calls to git update-index, one git commit and one git push. On RamFS, such a batch takes 5 minutes (with files of 300 bytes on average).
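The batch described above can be sketched as follows. This is only an illustration under assumptions: the helper names `batches` and `add_batch` are hypothetical, the repository is assumed to already exist with a configured remote, and the file paths are assumed to be relative to the repository root.

```python
import subprocess
from itertools import islice


def batches(paths, size):
    """Yield successive lists of at most `size` paths."""
    it = iter(paths)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def add_batch(repo_dir, chunk, message, push=True):
    """Stage every file of the batch, then commit once and push once."""
    for path in chunk:
        # One `git update-index --add` call per file, as in the text.
        subprocess.run(["git", "update-index", "--add", "--", path],
                       cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-q", "-m", message],
                   cwd=repo_dir, check=True)
    if push:
        subprocess.run(["git", "push", "-q"], cwd=repo_dir, check=True)


# Hypothetical usage, with `all_json_files` an iterable of relative paths:
# for n, chunk in enumerate(batches(all_json_files, 10000)):
#     add_batch("/path/to/repo", chunk, "batch %d" % n)
```

With 1,700,000 files and batches of 10000, this means 170 batches. If the 5-minute figure held for every batch, which is an assumption, that would be about 850 minutes, i.e. roughly 14 hours.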

Finally, I note that:

To conclude, I’ve set up a Github repository with 1,705,052 objects in October 2015. I hope to be the first one to hit this record on Github :-) ([4] has 1.3 million files, [5] mentions 800,000 objects).

Feedback welcome,

–Martin Monperrus
Lille, October 2015

[1] Large Directory Causes “ls” to Hang
[2] One billion files on Linux
[3] How Fast is Git?
[4] Git performance results on a large repository
[5] What are the file limits in Git (number and size)?
