Flat files are faster than databases (for my purpose)

I've tried. I've honestly tried (again) to use databases for the purpose of storing tens of million entries but it is just too slow.

Over the past week I've been trying different approaches for storing file hashes inside a database, so that it would become possible to do neat queries over the data without needing to re-invent the wheel like we have done in past.

Our requirements are tough because we are talking about tens of million, if not billion entries and being capable of providing database answers on plain normal laptops without any kind of installation happening. The best candidate for this job seemed to be H2 database, after previously trying SQLite and HSQLDB.

The first tries were OK with a small sample but then became sluggish when adding data in larger scale. There was a peak of performance at the beginning and then it would creep to awful slowness as the database got bigger. Further investigation helped with the performance, disabling the indexation, cache and other details that would get on the way of a mass data insert. Still, I would wait 1~2 days and the slow speed wouldn't give confidence that the data sample (some 4Tb with 19 million computer files) could be iterated in useful time.

So, went back to the old school methods and used plain flat files using CSV format. It was simpler to add new files. It was easy to view with a text editor if things were being written as intended, it was simpler to count how much data had already been inserted during the data extraction. And it was fast, not simply fast, it was surprisingly fast and completed the whole operation under 6 hours.


It is frustrating to try using a fancy database and then end up using the traditional methods, simply because they are reliable and faster than other options. I've tried, but for now will continue using flat files to index large volumes of data.

Getting started with the H2 database

H2 database is a small and compact way to store data directly from Java, especially because it can use simple binary files as storage. My goal with these things is performance and large scale indexing of data. When speaking about large scale, I'm talking about hundreds of million rows.

On this blog post I'm adding some of the things that are useful for those interested in debugging this database.

Starting an interactive shell

java -cp h2.jar org.h2.tools.Shell -url jdbc:h2:./test.mv.db -user xyz -password 123456

Where you should adjust:
  • h2.jar is the library jar file from H2 (just a single file)
  • ./test.mv.db is the filename for your database
  • "xyz" and "123456" are the user and password

 

Hello World (writing your first data and database to disk)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class Start {
    /**
     * @param args
     */
    public static void main(String[] args)
    {
        try
        {
            Class.forName("org.h2.Driver");
            Connection con = DriverManager.getConnection("jdbc:h2:~/test", "test", "" );
            Statement stmt = con.createStatement();
            //stmt.executeUpdate( "DROP TABLE table1" );
            stmt.executeUpdate( "CREATE TABLE table1 ( user varchar(50) )" );
            stmt.executeUpdate( "INSERT INTO table1 ( user ) VALUES ( 'Claudio' )" );
            stmt.executeUpdate( "INSERT INTO table1 ( user ) VALUES ( 'Bernasconi' )" );
            ResultSet rs = stmt.executeQuery("SELECT * FROM table1");
            while( rs.next() )
            {
                String name = rs.getString("user");
                System.out.println( name );
            }
            stmt.close();
            con.close();
        }
        catch( Exception e )
        {
            System.out.println( e.getMessage() );
        }
    }
}  

This example was retrieved from Claudio Bernasconi (thank you!):
http://www.claudiobernasconi.ch/2010/08/17/h2-embedded-java-db-getting-started/

Relevant commands

To list your tables inside the database:
SHOW TABLES;

To list everything inside a table:
SELECT * FROM MyTABLE;

Where "MyTable" is naturally the name for your table.

Showing which INDEXES are associated with a table:
SELECT * FROM information_schema.indexes WHERE table_schema = 'PUBLIC' AND table_name='MyTable';


Relevant links related to performance

Several switches to improve performance: 
http://iliachemodanov.ru/en/blog-en/21-databases/42-h2-performance-en

Some people say that an active INDEX will slow down large scale data entry:
https://groups.google.com/forum/#!msg/h2-database/7U99fWMaMw0/FbpkXgWIrDcJ

Learn to add multiple data fields per SQL call to speed things up further:
https://stackoverflow.com/a/19682518