Flat files are faster than databases (for my purpose)

I've tried. I've honestly tried (again) to use databases to store tens of millions of entries, but they are just too slow.

Over the past week I've been trying different approaches for storing file hashes inside a database, so that it would become possible to do neat queries over the data without needing to re-invent the wheel like we have done in the past.

Our requirements are tough: we are talking about tens of millions, if not billions, of entries, and the queries have to run on plain, normal laptops without any kind of installation. The best candidate for the job seemed to be the H2 database, after previously trying SQLite and HSQLDB.

The first tries were OK with a small sample, but things became sluggish when adding data at larger scale. Performance peaked at the beginning and then crept towards awful slowness as the database grew. Further investigation helped: disabling indexing, tuning the cache and removing other details that get in the way of a mass insert. Still, after waiting 1-2 days the slow speed gave no confidence that the data sample (some 4 TB spanning 19 million computer files) could be processed in useful time.
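For reference, this is roughly the shape of the bulk-insert tuning that was attempted: a minimal sketch, assuming an embedded H2 file database accessed over plain JDBC, with auto-commit off, batched inserts, and the index built only after the load. The table layout, connection URL and the row source are illustrative assumptions, not the actual schema of the project.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class BulkInsert {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./filehashes", "sa", "")) {
                conn.setAutoCommit(false); // commit in large chunks, not per row

                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS files(path VARCHAR, sha1 CHAR(40), size BIGINT)");
                    // no index yet: build it once, after the mass insert
                }

                String sql = "INSERT INTO files(path, sha1, size) VALUES (?, ?, ?)";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    int count = 0;
                    for (String[] row : readRowsSomehow()) {   // hypothetical source of hashed files
                        ps.setString(1, row[0]);
                        ps.setString(2, row[1]);
                        ps.setLong(3, Long.parseLong(row[2]));
                        ps.addBatch();
                        if (++count % 10_000 == 0) {           // flush every 10k rows
                            ps.executeBatch();
                            conn.commit();
                        }
                    }
                    ps.executeBatch();
                    conn.commit();
                }

                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE INDEX idx_sha1 ON files(sha1)"); // index once, at the end
                }
                conn.commit();
            }
        }

        // placeholder for whatever walks the filesystem and hashes each file
        static Iterable<String[]> readRowsSomehow() {
            return java.util.List.of();
        }
    }

Even with this kind of tuning, the insert rate kept degrading as the database file grew.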

So I went back to the old-school method and used plain flat files in CSV format. It was simpler to add new entries, it was easy to check with a text editor that things were being written as intended, and it was simpler to count how much data had already been recorded during the extraction. And it was fast. Not just fast, it was surprisingly fast: the whole operation completed in under 6 hours.
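The flat-file approach amounts to little more than appending one line per file, which is also why progress is trivial to check. A minimal sketch, assuming a simple path/hash/size layout per line (the file name, separator and column order are illustrative assumptions):

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class CsvIndex implements AutoCloseable {
        private final BufferedWriter out;

        CsvIndex(Path csv) throws IOException {
            // append mode: restarting the extraction does not lose earlier rows
            out = Files.newBufferedWriter(csv, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        void add(String path, String sha1, long size) throws IOException {
            out.write(path + ";" + sha1 + ";" + size);
            out.newLine();
        }

        @Override
        public void close() throws IOException {
            out.close();
        }

        // progress check: one row per line, so counting lines counts entries
        static long countEntries(Path csv) throws IOException {
            try (var lines = Files.lines(csv)) {
                return lines.count();
            }
        }
    }

Because every row is a plain text line, a text editor or a quick line count answers "is it writing what I expect?" and "how far along are we?" without any database tooling.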


It is frustrating to try a fancy database and end up back with traditional methods simply because they are reliable and faster. I've tried, but for now I will continue using flat files to index large volumes of data.
