The hidden overhead when creating thousands of empty folders

Over the past few weeks I've been looking for a way to store a few hundred million files inside a desktop file system such as Ext3 or NTFS.

In previous experiments I noticed that once a single folder holds more than 30 000 files it is no longer practical to use a GUI file browser, and even listing the files from the command line becomes sluggish.

The next hypothesis was to avoid placing a huge number of files/folders under the same root. For example, for each file I'd drop the name portion and compute its SHA1 checksum, a signature of 40 hexadecimal characters. From this signature I'd use the first 4 characters to create a folder, then the next 4 characters to create a sub-folder inside the first one, and so forth until I had 10 nested folders representing the unique signature of the file.

Inside the last folder I would add a text file containing the path and name information. If more files with the same signature were found in the future, their path information would be appended to this text file. I expected that splitting the signature into 4-character chunks would limit the number of possible combinations at each level and allow folders to be re-used, letting the scheme scale to millions of files.
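To make the scheme concrete, here is a minimal sketch of how such a nested path can be derived. The class and method names are mine, not taken from the original project:

    import java.io.File;
    import java.nio.file.Files;
    import java.security.MessageDigest;

    public class HashFolders {

        /** Computes the SHA1 of a file as a 40-character hex string. */
        static String sha1Hex(File file) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(Files.readAllBytes(file.toPath()));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }

        /** Splits the 40-character signature into 10 nested folders of 4 characters each. */
        static File nestedFolder(File root, String sha1) {
            File folder = root;
            for (int i = 0; i < 40; i += 4) {
                folder = new File(folder, sha1.substring(i, i + 4));
            }
            return folder;
        }

        public static void main(String[] args) throws Exception {
            File source = new File(args[0]);
            File target = nestedFolder(new File("index"), sha1Hex(source));
            target.mkdirs(); // creates up to 10 new directories per signature
            System.out.println(target.getPath());
        }
    }

That mkdirs() call is the crux of the story: every stored file can create up to ten brand new directories.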

In theory it worked OK, and the first tests with a thousand files went smoothly. The problem was a bit more unexpected: folders, despite containing nothing other than other folders, still require disk space to exist. So while the initial approach was a success in the sense that we could store and access thousands of files without noticeable latency, it brought aboard a disk storage problem.

Indexing something like 40 000 files with this SHA1/folder combination was already using 2.2 GB of disk space. This was a hidden overhead, badly underestimated in my early calculations: each directory occupies at least one file-system block (typically 4 KB on Ext3), and at ten new directories per file that adds up quickly.

I didn't test under Windows NTFS; hitting this problem under Linux was already a no-go for this option. Fortunately, an alternative was already being planned on the bench.

Our file system needs (at a minimum) only to write files once and then read them when needed. So a very simple solution is to write all the files back to back into one very large file, and then keep a second file indicating the position of each file within this large binary block.

There are some problems with this approach:
  • Easy to corrupt. If a single byte within the large block is removed or added, the whole index structure becomes invalid
  • Error-checking overhead. Right now files are copied directly onto the big binary block (BBB); if an exception occurs while copying a file, the whole big file is corrupted
  • Limited to operating systems and file systems capable of handling files bigger than 4 GB
  • (more exist, these are just the most troublesome at the moment)
Still, as a storage solution this is so far working as desired. It removed the folder overhead and is extremely fast. The index file records the coordinates of each file within the big file, along with its original path/name combination. An example can be found at https://raw.githubusercontent.com/triplecheck/tdf/57a776d7ea065b686cb024f1dc924b5b06f8752d/run/test.tdf-index
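As an illustration of the idea, not the exact tdf-index layout (that is best seen in the example linked above; the file names and line format below are made up), a minimal sketch of appending files to the big block and recording their coordinates could look like this:

    import java.io.File;
    import java.io.FileWriter;
    import java.io.RandomAccessFile;
    import java.nio.file.Files;

    public class BigBlockWriter {

        public static void main(String[] args) throws Exception {
            File block = new File("storage.bbb");        // the big binary block
            File index = new File("storage.bbb-index");  // one line per stored file

            try (RandomAccessFile out = new RandomAccessFile(block, "rw");
                 FileWriter idx = new FileWriter(index, true)) {

                for (String name : args) {
                    File source = new File(name);
                    long offset = out.length();  // current end of the block
                    out.seek(offset);
                    out.write(Files.readAllBytes(source.toPath()));
                    // record offset, length and the original path/name
                    idx.write(offset + " " + source.length() + " " + source.getPath() + "\n");
                }
            }
        }
    }

Reading a file back is then just a matter of seeking to its offset and reading the recorded number of bytes, with no directory traversal involved.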

As far as I'm concerned, creating millions of files on a desktop-grade file system is something I won't be pursuing further. It is tempting to have all the files ready for processing in their original format, but I'm simply not finding a feasible way of accessing millions of files in their original form without resorting to file systems such as XFS and HDFS, or to some kind of database.

So, I'm opting for a big binary file as simple storage for now.






GitHub: Indexing all users non-stop

Last week was dedicated to indexing the users registered on GitHub.

Getting the users was a necessary step in order to test how difficult it would be to extract information from the self-appointed largest (public) open source repository on earth.

To my surprise, it was relatively straightforward and quick. Over the past months I've come to understand why GitHub is becoming the hosting platform of choice. It has a neat user interface, provides incentives for developers to commit code and simply goes straight to what really matters.

As for accessing the information, GitHub provides a pretty amazing and approachable API. Looking at SourceForge or Google Code, I don't feel remotely invited to work with them when it comes to analysing their open source data. I'm sure they have their valid reasons; on my side I'm just looking for a straightforward (and legal) way to access this kind of information.

To access the API there is a set of Java libraries available (or you can just use the raw JSON format). The one I found best suited is jcabi (freeware). The reason is that every now and then I'd hit the API rate limit (5000 requests/hour), and this library can keep the connection paused until the limit is lifted. I found some defects while using the library, but the developers behind it were pretty amazing: they solved most of the reported issues at light speed.
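The pausing itself is not magic: every GitHub API response carries rate-limit headers, so a client can simply sleep until the reset time. The sketch below shows that general mechanism, not necessarily how jcabi implements it internally:

    import java.net.HttpURLConnection;

    public class RateLimitPause {

        /** Sleeps until the rate-limit window resets when no requests are left. */
        static void pauseIfExhausted(HttpURLConnection conn) throws InterruptedException {
            String remaining = conn.getHeaderField("X-RateLimit-Remaining");
            String reset = conn.getHeaderField("X-RateLimit-Reset");
            if (remaining != null && Long.parseLong(remaining) == 0 && reset != null) {
                // the reset header is a Unix timestamp in seconds
                long waitMillis = Long.parseLong(reset) * 1000 - System.currentTimeMillis();
                if (waitMillis > 0) {
                    Thread.sleep(waitMillis);
                }
            }
        }
    }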


With the library in place, I wrote a Java project to iterate through all the users registered on GitHub using authenticated API calls. At first I noticed that the Internet connection broke very often, usually at around 200k indexed users. I was running these tests from the office network, which was not stable enough, so I moved the software to a dedicated server that is continuously online at a datacenter. From there the connection lasted longer, but it still failed at around 800k indexed users.
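The heart of that iteration can be sketched against the raw JSON endpoint (my own project went through jcabi, so take this strictly as an illustration; the token argument, output file name and regex-based field extraction are simplifications):

    import java.io.BufferedReader;
    import java.io.FileWriter;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class UserIndexer {

        static final Pattern LOGIN = Pattern.compile("\"login\"\\s*:\\s*\"([^\"]+)\"");
        static final Pattern ID = Pattern.compile("\"id\"\\s*:\\s*(\\d+)");

        public static void main(String[] args) throws Exception {
            String token = args[0];   // personal API token for the 5000 requests/hour limit
            long since = 0;           // id of the last user already indexed
            try (FileWriter out = new FileWriter("users.txt", true)) {
                while (true) {
                    String page = fetch("https://api.github.com/users?since=" + since, token);

                    // write every login found on this page, one user name per line
                    Matcher login = LOGIN.matcher(page);
                    while (login.find()) {
                        out.write(login.group(1) + "\n");
                    }
                    // the highest id on the page becomes the next "since" value
                    long last = since;
                    Matcher id = ID.matcher(page);
                    while (id.find()) {
                        last = Long.parseLong(id.group(1));
                    }
                    if (last == since) {
                        break; // empty page, the end of the user list was reached
                    }
                    since = last;
                    out.flush();
                }
            }
        }

        /** Downloads one page of the raw JSON user listing. */
        static String fetch(String address, String token) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
            conn.setRequestProperty("Authorization", "token " + token);
            StringBuilder result = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    result.append(line);
                }
            }
            return result.toString();
        }
    }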

Being a stubborn person by nature, I got to thinking about how to resume the operation from the point where it had stopped. The API does provide a resume option from a given user, but the library didn't yet cover this point; to my luck, the jcabi team was very prompt in explaining how the API call could be used. With support for resuming available, it was then a simple matter of reading back the text file, with one user per line, and getting the last one.
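The file-reading half of that step is simple enough to sketch (again with a made-up file name); how the recovered name is handed back to the API is library-specific:

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class ResumePoint {

        /** Returns the last user name already written to the index file, or null if it is empty. */
        static String lastIndexedUser(String indexFile) throws Exception {
            String last = null;
            try (BufferedReader reader = new BufferedReader(new FileReader(indexFile))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (!line.isEmpty()) {
                        last = line;
                    }
                }
            }
            return last;
        }
    }

With the raw endpoints, the equivalent would be to look up that login's numeric id via /users/<login> and feed it into the since parameter shown in the earlier sketch.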

I didn't say it before, but the storage medium I'm using is plain text files (a.k.a. flat files). I'm not using a relational database, nor have I looked much into "NoSQL" databases. Of all the options I've tested over the years, nothing beats the processing simplicity of a plain text file for storing large arrays of data. Please note that I emphasize simplicity: if performance had suffered significantly, I'd use a more suitable method. However, it turns out that normal desktop computers in 2014 can go through text files with millions of lines in under a second.
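As a rough illustration of that point, a plain sequential scan of the user list (hypothetical file name again) is only a handful of lines:

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class CountUsers {
        public static void main(String[] args) throws Exception {
            long count = 0;
            // streams the file line by line, so memory use stays flat regardless of size
            try (BufferedReader reader = new BufferedReader(new FileReader("users.txt"))) {
                while (reader.readLine() != null) {
                    count++;
                }
            }
            System.out.println(count + " indexed users");
        }
    }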

With the resume feature implemented, it was then possible to leave a single machine indexing the user names non-stop. After two days of processing, the indexing was complete and showed a total of 7,709,288 registered users.

At the time of this writing (July 2014), GitHub's press page mentions a total of 6 million users. The text file containing the user names weighs in at 81 MB, with one user name per line.

Perhaps in the future it would be interesting to see a study covering the demographics, gender distribution and other details of these user accounts. In my context, this is part of a triplecheck goal: open source is open, but it is not transparent enough, and that is what we are working to improve.

Following the spirit of open source and free software, the software used to generate this index of users is also available on GitHub. It is released as freeware under modified EUPL (European Union Public Licence) terms. You can find the code and compilation instructions at https://github.com/triplecheck/gitfinder


This code does more than just index users; I'll be writing about what else it does in future posts. In the meantime you might be asking yourself: where can I get this text file?

For the moment I haven't made the text file available at a definitive location. I will likely talk with the good people at FLOSSmole to see if they would be interested in hosting it and keeping it easily reachable for other folks.

Forward we move.
:-)