Flat files are faster than databases (for my purpose)

I've tried. I've honestly tried (again) to use databases for the purpose of storing tens of million entries but it is just too slow.

Over the past week I've been trying different approaches for storing file hashes inside a database, so that it would become possible to do neat queries over the data without needing to re-invent the wheel like we have done in past.

Our requirements are tough because we are talking about tens of million, if not billion entries and being capable of providing database answers on plain normal laptops without any kind of installation happening. The best candidate for this job seemed to be H2 database, after previously trying SQLite and HSQLDB.

The first tries were OK with a small sample but then became sluggish when adding data in larger scale. There was a peak of performance at the beginning and then it would creep to awful slowness as the database got bigger. Further investigation helped with the performance, disabling the indexation, cache and other details that would get on the way of a mass data insert. Still, I would wait 1~2 days and the slow speed wouldn't give confidence that the data sample (some 4Tb with 19 million computer files) could be iterated in useful time.

So, went back to the old school methods and used plain flat files using CSV format. It was simpler to add new files. It was easy to view with a text editor if things were being written as intended, it was simpler to count how much data had already been inserted during the data extraction. And it was fast, not simply fast, it was surprisingly fast and completed the whole operation under 6 hours.


It is frustrating to try using a fancy database and then end up using the traditional methods, simply because they are reliable and faster than other options. I've tried, but for now will continue using flat files to index large volumes of data.

Getting started with the H2 database

H2 database is a small and compact way to store data directly from Java, especially because it can use simple binary files as storage. My goal with these things is performance and large scale indexing of data. When speaking about large scale, I'm talking about hundreds of million rows.

On this blog post I'm adding some of the things that are useful for those interested in debugging this database.

Starting an interactive shell

java -cp h2.jar org.h2.tools.Shell -url jdbc:h2:./test.mv.db -user xyz -password 123456

Where you should adjust:
  • h2.jar is the library jar file from H2 (just a single file)
  • ./test.mv.db is the filename for your database
  • "xyz" and "123456" are the user and password

 

Hello World (writing your first data and database to disk)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class Start {
    /**
     * @param args
     */
    public static void main(String[] args)
    {
        try
        {
            Class.forName("org.h2.Driver");
            Connection con = DriverManager.getConnection("jdbc:h2:~/test", "test", "" );
            Statement stmt = con.createStatement();
            //stmt.executeUpdate( "DROP TABLE table1" );
            stmt.executeUpdate( "CREATE TABLE table1 ( user varchar(50) )" );
            stmt.executeUpdate( "INSERT INTO table1 ( user ) VALUES ( 'Claudio' )" );
            stmt.executeUpdate( "INSERT INTO table1 ( user ) VALUES ( 'Bernasconi' )" );
            ResultSet rs = stmt.executeQuery("SELECT * FROM table1");
            while( rs.next() )
            {
                String name = rs.getString("user");
                System.out.println( name );
            }
            stmt.close();
            con.close();
        }
        catch( Exception e )
        {
            System.out.println( e.getMessage() );
        }
    }
}  

This example was retrieved from Claudio Bernasconi (thank you!):
http://www.claudiobernasconi.ch/2010/08/17/h2-embedded-java-db-getting-started/

Relevant commands

To list your tables inside the database:
SHOW TABLES;

To list everything inside a table:
SELECT * FROM MyTABLE;

Where "MyTable" is naturally the name for your table.

Showing which INDEXES are associated with a table:
SELECT * FROM information_schema.indexes WHERE table_schema = 'PUBLIC' AND table_name='MyTable';


Relevant links related to performance

Several switches to improve performance: 
http://iliachemodanov.ru/en/blog-en/21-databases/42-h2-performance-en

Some people say that an active INDEX will slow down large scale data entry:
https://groups.google.com/forum/#!msg/h2-database/7U99fWMaMw0/FbpkXgWIrDcJ

Learn to add multiple data fields per SQL call to speed things up further:
https://stackoverflow.com/a/19682518



Compiling Java apps as native x86 Windows executables

This was quite a holy grail some years ago.

The flexibility of Java turned into an easy to run application for Windows. If you are interested in compiling your Java-based software into native Windows applications, then you probably looked into options such as GCJ and wrappers. The problem is that GCJ is more like a myth nowadays where it is not clear if that thing works and wrappers, well. They are wrappers.


There is an alternative. A way to compile Java code directly into x86 binaries that run under Windows without needing .NET or any other intermediate platform. You are welcome to enter the world of Mono and IKVM.

You should read this page:
http://www.mono-project.com/docs/about-mono/languages/java/

The first step is compiling your jar file into a .NET executable using IKVMC, this is as simple as downloading the most recent tools from:
http://weblog.ikvm.net/default.aspx

Attention: these tools only run on Windows 7 and above (you can likely run in XP by installing a modern .NET version).

Then place your jar file inside the .\bin folder. Head out to the command prompt on that folder and type:
ikvmc myApp.jar

Where "myApp.jar" is obviously the jar you want to transform. If you have additional jars to be added as classpath, look on the syntax page for instructions. On my case I already ship a "fat jar" where every other jar files get merged into a single one in any case.

The compilation will likely show you that some classes are missing, that is mostly references on the code for things that were not found. On my case none of them were used so not really a problem, they were just warnings and you get a Windows .NET executable as output.

When running for the first time it failed. Was complaining that the a java.util.date method was returning a null value.  There is something to know about IKVM, it is an implementation of the JVM but it is not perfect. So you might need to adapt slightly the code so that it doesn't touch on the methods not supported by IKVM. On this case, I just removed the part of the code that was failing since it was not critical and then everything worked.

Now you have a .NET executable.

You will need to ship it together with some of the IKVM dll files. The minimum files to include are IKVM.OpenJDK.Core.dll and IKVM.Runtime.dll. However, you might need additional DLL files from the IKVM distribution depending on the functionalities that you are using. This is mostly a trial and error approach until you find which files are really needed. The end result is that you get an executable that you won't need to ship with a JVM or ask end-users to install one (under Windows).


The final step in case you are interested, is to remove the .NET dependency. This might be useful for the cases where you want to run on bare-metal Windows from older versions such as Windows 95, all the way up to newer versions where x86 code is supported. That is where Mono helps. Take a look on this section: http://www.mono-project.com/docs/about-mono/languages/java/#compiling-java-code-to-x86-code

Taking the executable created on the previous steps,  basically install the Mono setup (about 1Gb of disk space after installation) and type:
mono --aot myApp.exe

This will generate a set of files, refer to the previous link for additional details. Hopefully making your app run completely native under Windows. I'm yet to test the performance of native x86 Java binaries vs the bytecode versions running on the JVM. My initial expectation is that the Oracle/OpenJDK JVM is faster because but this might be a false impression.  If you manage to test the performance just let me know.

Have fun! :-)