Java: Parsing text files with minimum delay

In this blog post I'm gathering some of the things I've learned from practice while making my own code process a significant number of lines inside a text file faster.

In the most recent work I'm parsing 440 000 lines inside a text file in under a second. My hardware is nothing special by today's standards: a Toshiba R630 laptop with an x64 CPU and 4 GB of RAM, running Windows 7.

The result can currently be seen at https://github.com/triplecheck/reporter/blob/b5ca1781dc511b3a823724606a07d7fc61e1d9e3/src/spdxlib/SPDXfile2.java#L197 in the processFileLine method. The link might change in the future, so please do write me a message in case I don't update it.

During this parsing, quite a number of time-consuming tasks take place:
- Discovering the type of file extension (is it a source code file? an image? an archive? ...)
- What kind of line are we reading (a checksum line? the file size? the license details? the next file? ...)
- Creating a node to place the current file on a treeview under a specific folder

All in all, the first edition worked fairly well for smaller text files but failed on reports that contained thousands of lines. Either the memory wouldn't be enough (not enough heap size) or it would simply be too slow (generating the treeview, for example, was really slow). There was a serious effort to merge the existing loops into a single loop and to drastically reduce the calculation and filtering time. I started by testing different ways of reading the files one by one, and BufferedReader came out as the fastest option.
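
For reference, this is the shape of the reading loop; a minimal sketch, where "report.spdx" is a placeholder file name and processFileLine() stands in for the method linked above:

        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.io.IOException;

        // read the report line by line; BufferedReader came out as
        // the fastest of the reading methods tested
        try (BufferedReader reader = new BufferedReader(new FileReader("report.spdx"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                processFileLine(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }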

Then, step by step, I'd introduce a new value being parsed, test how much the performance degraded, and try to optimize. The end result was satisfactory, and below are some of the field notes that I kept around for future reference. It is neither an exhaustive nor a definitive list of things you should or can do. It worked for my own goals, and it is my hope that it can somehow be useful for you too.

If you know any further optimizations that could take place, I'd be happy to include them here and give proper credit.

Compiler not compiling

The Java compiler is not perfect. Although it seems to always do a good job, there are some (rare) cases where you will need to ask for a full recompile of all the classes. This has sometimes worked to address strange behaviors in code that I was having trouble figuring out, until a full compilation was requested.

ASCII: party like it's 1998

One significant code speed-up happened after going back to comparing characters instead of strings. There is a heavy performance cost in comparing strings, but I guess machines have grown so fast nowadays that it gets easy not to care and just compare strings. When looking for performance, you need to go back to the days when an ASCII table was always at hand (if you didn't know the ASCII codes by memory, like Mr. Bill Gates once said he did). So, back to the 90's and looking online at places like http://www.asciitable.com/ to refresh one's memory of the character codes.

There is a dramatic speed-up when looking for a char inside a string. Here is a working example:

        // find the "(" char, which is 40 in the ASCII table
        final int index = value.indexOf(40);
        if (index == -1) {
            // no "(" present on this line, handle that case here
        }
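
The same trick applies when deciding what kind of line we are reading: peek at a single char before doing any string work. A minimal sketch, assuming the line is not empty (the chars themselves are illustrative):

        // cheap first test: look at the first char before any string comparison
        final char first = line.charAt(0);
        if (first == '#') {
            // a comment line, skip it
        } else if (first == 'F') {
            // possibly a "File..." tag, only now compare the full string
        }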

Nested IF's

I'm not a friend of spaghetti code. However, from my experiments it was possible to note that nested IF's can help to improve performance too. In my context, I was reading a large text file line by line and trying to recognize tags, which required quite a number of IF's.

In the past, perhaps due to my dislike for nested IF's, I overlooked the simple fact that they should be chained in an "IF ELSE, IF ELSE, IF..." format. This is important: once a condition is matched, the other IF statements are never evaluated.
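
In practice the tag recognition looks something like this; once the checksum line matches, none of the later comparisons run (the tag names here are illustrative):

        if (line.startsWith("FileChecksum:")) {
            // a checksum line
        } else if (line.startsWith("FileSize:")) {
            // the file size
        } else if (line.startsWith("LicenseConcluded:")) {
            // the license details
        } else if (line.startsWith("FileName:")) {
            // the next file begins
        }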

Breaking large code into smaller methods

I have no scientific evidence for this recommendation, but besides making a large method easier to read, I've noted that moving a piece of code onto its own method would give me a slight performance boost. My impression is that the compiler is better prepared to optimize smaller methods than large ones.
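
As an illustration of the shape this takes: the loop stays in one small method and the per-line work moves into its own method, which the JIT can then optimize on its own.

        // the reading loop stays small...
        private void processFile(final BufferedReader reader) throws IOException {
            String line;
            while ((line = reader.readLine()) != null) {
                processFileLine(line); // small, hot method
            }
        }

        // ...and the parsing logic lives in its own method
        private void processFileLine(final String line) {
            // tag recognition and parsing happen here
        }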

ENUM, ENUM

When possible, use and abuse ENUMs. A comparison between enums is faster than comparing strings. In the past I too often neglected this kind of performance magic, simply because it was easy to compare strings, even when using statically defined strings. Looking into the code, I replaced everything where possible with an enum. Whenever the value for a given object would only fall into a couple of categories, an enum it became.
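
A minimal sketch of the idea; FileCategory is the enum used in the example further down, and its values here are invented for illustration:

        // values constrained to a handful of categories become an enum
        enum FileCategory { SOURCE, IMAGE, ARCHIVE, OTHER }

        // an enum comparison is a simple reference check, with none of
        // the char-by-char work of String.equals()
        if (category == FileCategory.SOURCE) {
            // handle source code files
        }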

How do you convert from a piece of String to an Enum?

Use the .valueOf("sometext") method that is available by default on enums. This turned out to be a very efficient way of converting the set of possible values onto an enum that we can process.

Here is a practical example where I get the value from a text line and then place it directly inside an object that expects an enum. As you'll note, there is no error checking present:

        final String temp = tagGetValue(is.tagFileType, line);
        tempInfo.setFileType(FileCategory.valueOf(temp));

Trust the data

Another thing that I had always tried to do was prevent errors from happening. I put quite an effort into detecting whether a given piece of data was within a set of restrictions before being used. For performance, I can't afford to do these checks. You have to trust that the data being provided can be fed directly onto the object.

What happens when data has a problem? Things fail. There is an exception to be caught, and that should be the only safety net you have around to ensure that you can process the data. Thinking about our own context, this made sense. For large text files, most of the effort is automated by tooling and not so much by humans. If something is wrong in the file, it should cause quite a fuss in order to be corrected.

The end result is a performance boost, since we removed the need for so many IFs: "Is the text different from null? Is the value an integer? ...". Trust the data.
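
Tying this to the enum example above: the conversion goes in directly, and the single safety net is the exception that valueOf() throws when the text matches no enum value. A sketch:

        try {
            final String temp = tagGetValue(is.tagFileType, line);
            tempInfo.setFileType(FileCategory.valueOf(temp));
        } catch (IllegalArgumentException e) {
            // valueOf() throws when the text matches no enum value;
            // a broken report should cause a fuss, so report it loudly
            System.err.println("Unexpected file type: " + line);
        }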

The Final countdown

I never realized how using "final" on strings could have such an impact on performance. Only now have I understood how this tells the compiler: "This string doesn't change EVER AGAIN", which in return blessed the program execution with a whopping speed boost. I'm unable to explain the intricacies behind this mystery, but it is something that I'm now using with good results.

This is valid not just for strings but for any object out there.
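
A minimal example of the pattern; the tag constant is invented for illustration:

        // declared once and never reassigned; the compiler knows
        // this value doesn't change EVER AGAIN
        private static final String TAG_FILENAME = "FileName:";

        // local values get the same treatment
        final String value = tagGetValue(TAG_FILENAME, line);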


GPU vs CPU

I've considered using the GPU (processing through graphics cards). This is wonderful for speeding up computation, but then I noted that my own code needed to be optimized first; throwing more hardware power at the problem wouldn't really be a solution. Our machines are so much more powerful than most servers back in the 90's. Somehow they managed to get things done efficiently back then, and so must we.

One big loop to rule them all

When looking again into the working code, I noted too many loops. There was one to find all files, another to place them on a treeview, and sometimes a full inner loop for each item, resulting in quadratic work that was not really efficient. So I decided to use a single loop. No excuses, just one big loop to do all the math for each file item.

It wasn't easy at first (a lot of the code had to be rewritten), but guess what? The code previously taking some 8 seconds to compute the treeview paths now takes less than a second. One big loop forces the code to be simpler and helps you notice immediately if loading performance is degrading somewhere.
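
Roughly, the change has this shape; FileInfo and the helper methods are placeholders for the real code:

        // a single pass does everything for each file: discover the
        // extension and place the node on the treeview in one go,
        // instead of one full loop per task
        for (final FileInfo file : files) {
            discoverExtension(file);
            addTreeNode(file);
        }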

Keep strings small

Another thing that helped was keeping the strings to be compared as small as possible. Avoid a full-text search across the string; instead, do a search that starts from the beginning of the text and runs only up to a specific number of characters.
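
In Java this can be done with startsWith(), which stops at the first mismatch, or with regionMatches() when you want to bound the comparison explicitly:

        // compares at most the first 9 chars instead of scanning the line
        if (line.startsWith("FileName:")) {
            // ...
        }

        // the same bounded comparison with an explicit length
        if (line.regionMatches(0, "FileName:", 0, 9)) {
            // ...
        }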

Code will be deleted

The side effect of optimizing code is removing a lot of the fat that existed before. The hardest part is actually deleting source code files, because it kind of raises an alarm. From a psychological point of view, deleting files means losing code. I guess it is counter-intuitive that an application gets better by removing code; it just feels wrong to delete old files. Not easy, but a clean-up needs to be done.

If it is not advisable to delete the source code file, consider moving it into an "old" folder where it can be retrieved if/when necessary.


Leverage the knowledge of others

My thanks to Michael Schierl for the extra tip added to this post. In his comment, Michael brings up the advantage of using already available tools and frameworks to help profile performance. I quote and agree with his comment:
There are lots of code style and performance checking tools available (PMD, CheckStyle, FindBugs) - run them on your code and have a look at least at the "high priority" issues.

If you are too lazy to set up multiple tools and wade through multiple reports, get the free edition of SonarQube, add all the free Java analysis plugins you can find, let it analyze your code, and then look at the resulting report. (In case you are using CI like Hudson or Jenkins - most likely for larger projects or commercial ones - add a SonarQube job to it. SonarQube has ways to filter "new violations" - even by committer - if run often enough, so that you don't have to look at the same issues every time.)
