brito: November 2014

Java: Reading the last line on a large text file

Following the recent batch of posts related to text files, sometimes it is necessary to retrieve the last line of a given text file.

My traditional way of doing this is to use a BufferedReader and iterate over all lines until the last one is reached. This works relatively fast, at around 400k lines per second on a typical i7 CPU (4 cores) at 2.4 GHz in 2014.
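A minimal sketch of that traditional approach, for comparison. The class and method names (`LastLineSlow`, `getLastLineSlow`) are made up for illustration, and ISO-8859-1 is assumed only so that any byte sequence decodes without errors:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class LastLineSlow {

    /**
     * Reads every line of the file and keeps only the most recent one.
     * Returns an empty string when the file has no lines.
     */
    static String getLastLineSlow(File file) throws IOException {
        String last = "";
        try (BufferedReader reader =
                Files.newBufferedReader(file.toPath(), StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = reader.readLine()) != null) {
                last = line; // overwrite until the end of the file is reached
            }
        }
        return last;
    }
}
```

The cost grows with the number of lines because the whole file has to pass through the reader.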

However, for text files with hundreds of millions of lines this approach becomes too slow.

One solution is to use RandomAccessFile, which can access any point of the file without delay. Since we don't know the exact position where the last line starts, a possible solution is to step backwards from the end of the file, one character at a time, until a line break is found.

Seeking one position at a time and reading just a single char might not be the most efficient approach, so reading a buffer of 1000 characters at a time is a possible improvement for a future implementation.
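The one-position-at-a-time idea can be sketched roughly as follows. This is an illustration under assumptions rather than the final method: the names `LastLineBySeek`, `getLastLineBySeek` and `byteAt` are hypothetical, and unlike the method further down it returns the whole content when the file holds a single line.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class LastLineBySeek {

    /** Walks backwards from the end of the file, one byte per seek. */
    static String getLastLineBySeek(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long end = raf.length();
            // skip a trailing "\n" or "\r\n" so the result is not an empty line
            if (end > 0 && byteAt(raf, end - 1) == '\n') {
                end--;
                if (end > 0 && byteAt(raf, end - 1) == '\r') {
                    end--;
                }
            }
            // move towards the start until the previous line break is found
            long start = end;
            while (start > 0 && byteAt(raf, start - 1) != '\n') {
                start--;
            }
            byte[] line = new byte[(int) (end - start)];
            raf.seek(start);
            raf.readFully(line);
            return new String(line, StandardCharsets.ISO_8859_1);
        }
    }

    /** Reads the single byte stored at the given file offset. */
    private static int byteAt(RandomAccessFile raf, long pos) throws IOException {
        raf.seek(pos);
        return raf.read();
    }
}
```

Each byte costs one seek and one read, which is why reading a larger buffer per seek is the natural next step.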

Nevertheless, the code snippet below solves my issue and gets the last line of a large text file in under a millisecond, regardless of the number of lines in the file.

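    import java.io.File;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    /**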
     * Returns the last line from a given text file. This method is particularly
     * well suited for very large text files that contain millions of text lines
     * since it will just seek the end of the text file and seek the last line
     * indicator. Please use only for large sized text files.
     * 
     * @param file A file on disk
     * @return The last line or an empty string if nothing was found
     * 
     * @author Nuno Brito
     * @author Michael Schierl
     * @license MIT
     * @date 2014-11-01
     */
    public static String getLastLineFast(final File file) {
        // file needs to exist and must not be a directory
        if (!file.exists() || file.isDirectory()) {
            return "";
        }

        // avoid empty or near-empty files (2 bytes or fewer)
        if (file.length() <= 2) {
            return "";
        }

        // open the file in read-only mode; try-with-resources closes it on exit
        try (RandomAccessFile fileAccess = new RandomAccessFile(file, "r")) {
            char breakLine = '\n';
            // offset of the current filesystem block - start with the last one
            long blockStart = (file.length() - 1) / 4096 * 4096;
            // hold the current block
            byte[] currentBlock = new byte[(int) (file.length() - blockStart)];
            // later (previously read) blocks
            List<byte[]> laterBlocks = new ArrayList<byte[]>();
            while (blockStart >= 0) {
                fileAccess.seek(blockStart);
                fileAccess.readFully(currentBlock);
                // ignore the last 2 bytes of the block if it is the first one
                int lengthToScan = currentBlock.length - (laterBlocks.isEmpty() ? 2 : 0);
                for (int i = lengthToScan - 1; i >= 0; i--) {
                    if (currentBlock[i] == breakLine) {
                        // we found our end of line!
                        StringBuilder result = new StringBuilder();
                        // RandomAccessFile#readLine uses ISO-8859-1, therefore
                        // we do here too
                        result.append(new String(currentBlock, i + 1, currentBlock.length - (i + 1), "ISO-8859-1"));
                        for (byte[] laterBlock : laterBlocks) {
                            result.append(new String(laterBlock, "ISO-8859-1"));
                        }
                        // maybe we had a newline at end of file? Strip it.
                        if (result.charAt(result.length() - 1) == breakLine) {
                            // newline can be \r\n or \n, so check which one to strip
                            // (the length guard avoids an out-of-bounds index on a lone "\n")
                            int newlineLength = result.length() >= 2
                                    && result.charAt(result.length() - 2) == '\r' ? 2 : 1;
                            result.setLength(result.length() - newlineLength);
                        }
                        return result.toString();
                    }
                }
                // no end of line found - we need to read more
                laterBlocks.add(0, currentBlock);
                blockStart -= 4096;
                currentBlock = new byte[4096];
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        // oops, no line break found or some exception happened
        return "";
    }

If you're worried about re-using this method, it might help to know that I've authored this code snippet and that you are welcome to reuse it under the MIT license terms. You are also welcome to improve the code; there is certainly room for optimization.

Hope this helps.

:-)

Java: Writing lines in text files with a buffer

Text files are simple. The problem is when you need to write them a few million times.

A strange error from the file system (or operating system) causes the Java Virtual Machine to crash when you keep appending data to a given file.

The quick solution was to reduce the number of writes by using a buffer that caches a number of text lines before writing the data to disk.

This way it was possible to avoid such frequent and numerous write operations.

You can find the most up-to-date source code on GitHub at this link.

It is not difficult to use: initialize the class with the file you want to write to. Then call the write() method for each event you want to store, and at the end call the close() method to finish any pending operations.
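As a rough illustration of that usage pattern, here is a minimal sketch. It is not the class published on GitHub: the name `BufferedLineWriter`, the `maxLines` parameter and the choice of UTF-8 are assumptions made for this example.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

class BufferedLineWriter implements AutoCloseable {

    private final File file;
    private final int maxLines; // hypothetical threshold: lines cached per disk write
    private final List<String> pending = new ArrayList<>();

    BufferedLineWriter(File file, int maxLines) {
        this.file = file;
        this.maxLines = maxLines;
    }

    /** Caches one line and only touches the disk once the buffer is full. */
    void write(String line) throws IOException {
        pending.add(line);
        if (pending.size() >= maxLines) {
            flush();
        }
    }

    /** Appends all cached lines to the file in a single write operation. */
    void flush() throws IOException {
        if (pending.isEmpty()) {
            return;
        }
        Files.write(file.toPath(), pending, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        pending.clear();
    }

    /** Writes whatever is still pending. */
    @Override
    public void close() throws IOException {
        flush();
    }
}
```

With a threshold of 1000, a million calls to write() would result in about a thousand disk writes instead of a million.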

That's it. :-)

Java: Reading millions of text lines at top speed

There is one thing to say about Java (as a platform): its performance while reading from disk using the default classes is impressive.

I'm sharing code to read the lines of a large text file from disk as fast as possible and to process each line through a custom method.

The implementation is very simple and has no external dependencies. You can find the code for download at this link on GitHub.

By the way, the above link will get you the most up-to-date version of the source code file.

Performance on my laptop (i7 CPU, 8 GB RAM, 500 GB HDD, Linux) was measured with a text file containing ~30 million lines of text (around 300 characters each), which is read in under 2 minutes.

A practical, real-world example of the code is at this link. Basically, just copy the class to your project and then use "extends FileReadLines". Let the IDE create the needed methods and off you go to process each line or adapt the progress messages.
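The "extend a class and implement the per-line method" pattern could look roughly like the sketch below. It is not the FileReadLines class itself: the names `LineProcessor`, `processLine`, `onProgress` and `read`, as well as the UTF-8 charset, are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

abstract class LineProcessor {

    /** Called once for every line in the file. */
    protected abstract void processLine(String line);

    /** Optional hook for progress messages; does nothing by default. */
    protected void onProgress(long linesRead) {
    }

    /**
     * Streams the file line by line and returns the number of lines read.
     * A progressInterval of 0 disables the progress callback.
     */
    long read(File file, long progressInterval) throws IOException {
        long count = 0;
        try (BufferedReader reader =
                Files.newBufferedReader(file.toPath(), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                processLine(line);
                count++;
                if (progressInterval > 0 && count % progressInterval == 0) {
                    onProgress(count);
                }
            }
        }
        return count;
    }
}
```

A subclass only has to implement processLine(); only one line is held in memory at a time, so the file size does not matter.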

That's it.

While it is true that anyone can use BufferedReader on its own, I found myself repeating this kind of code, so I created a library to keep it in a single location. Hope you find it useful if you're struggling to tackle large flat files.

To the best of my knowledge, this is the fastest possible way of reading a massive number of lines that uses nothing more than the Java platform.

If you have suggestions on how to reach faster results, please leave them in the comments box and I'll update this post accordingly. My thanks in advance.

Also, feel free to change the code on GitHub as you see fit.

Looking into the .NET license compliance practice

This week Microsoft released the source code for the .NET platform to the public. The code is published on GitHub and released under the MIT license terms.

Better than just speaking about the advantages of open source, one should share their own code in the open, and this is when licensing becomes a matter of scrutiny.

Now that this once proprietary code is free and open to the public eye, I took the liberty of looking inside the source to evaluate how well it would score in regard to licensing compliance.

To proceed with a licensing compliance check, I downloaded a copy of the source code and created a report using the triplecheck tool.

The result shows a project containing 630 files and 128 thousand lines of code. At the time of this evaluation, 534 files were identified with an MIT license reference and 540 files had a reference to Microsoft as copyright holder. No other licenses were present.

The result is in the screenshot below.

I was happy to see that the current code base almost reached a 10/10 score. Only a few source code files that still needed a reference to the author and applicable license kept the 10 out of reach.

Being hosted on GitHub means that it was straightforward to submit a bug report and the issue was picked up a few hours later by a developer.

In the second screenshot you see an overview of the code structure and what can be expected in regard to licensing.

The example highlights one of the files that still needs to be corrected.

This is a surface analysis; a deep analysis involves evaluating the code originality and the third-party dependencies. Please write a comment if you find a deeper analysis to be of interest and I'll proceed.

The report is available in the SPDX format from the Linux Foundation. You can view this document directly from a browser at this link on GitHub.

In conclusion, after the first week of public releases it is clear that this code base demonstrates quality from a licensing perspective. Furthermore, the developer team is participative and conveys a serious commitment, which is much needed if the intention is to bridge the .NET platform with a healthy community spirit.

This is a sign of how the times are changing. And developing code in the open is certainly a good way to make these changes happen.

Simple way to order an ArrayList in Java

Often in Java one ends up using ArrayLists to store objects.

Every now and then it becomes necessary to sort the elements inside such a list, either in descending or ascending order, according to some attribute of the object.

This is a somewhat tricky operation that gets forgotten too often, therefore I'm documenting it for future reference to myself. Much of what is written here was based on the tutorial from Mkyong: http://www.mkyong.com/java/java-object-sorting-example-comparable-and-comparator/

First step:
Make the class of the objects that you want to order implement the "Comparable" interface.

In my case, I want to apply the ordering to the SourceCodeFile class, so I add the line:
        implements Comparable<SourceCodeFile>

And it looks like this:
https://github.com/triplecheck/f2f/blob/26db102366be6b1eaeae2fcbb8bb0e966d951d6a/src/structure/SourceCodeFile.java#L25

Second step:
Override the "compareTo" method. In this case the comparison can use any attribute of the two objects being compared. I find it difficult to understand in detail how it works, but the method is expected to return a negative number, zero, or a positive number when this object sorts before, equal to, or after the other one. For my case it was enough to subtract two values and return the result. You can find the code example at:
https://github.com/triplecheck/f2f/blob/26db102366be6b1eaeae2fcbb8bb0e966d951d6a/src/structure/SourceCodeFile.java#L150-L153

Third step:
Convert the ArrayList to an array and then sort the array. This is the part that I find tricky because the syntax to convert the ArrayList to a plain array is not straightforward to understand, even with the help documentation.

It should be something like:
        SourceCodeFile[] array = myArrayList.toArray(new SourceCodeFile[myArrayList.size()]);
        Arrays.sort(array);

You find the code example at:
https://github.com/triplecheck/f2f/blob/26db102366be6b1eaeae2fcbb8bb0e966d951d6a/src/structure/AnalysisResult.java#L223-L225
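Put together, the three steps could look like the sketch below. The `SourceCodeFile` shown here is a simplified stand-in and not the class from the repository: the fields `name` and `linesOfCode` and the helper `SortWithArray` are assumptions for illustration. It uses Integer.compare instead of a plain subtraction, since a subtraction can overflow for very large or negative values.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in; "linesOfCode" is a hypothetical attribute to sort by
class SourceCodeFile implements Comparable<SourceCodeFile> {

    final String name;
    final int linesOfCode;

    SourceCodeFile(String name, int linesOfCode) {
        this.name = name;
        this.linesOfCode = linesOfCode;
    }

    /** Negative, zero or positive: this object sorts before, equal to or after the other. */
    @Override
    public int compareTo(SourceCodeFile other) {
        return Integer.compare(this.linesOfCode, other.linesOfCode);
    }
}

class SortWithArray {

    /** Copies the list into a plain array and sorts it in ascending order. */
    static SourceCodeFile[] sorted(List<SourceCodeFile> list) {
        SourceCodeFile[] array = list.toArray(new SourceCodeFile[list.size()]);
        Arrays.sort(array);
        return array;
    }
}
```

The original list is left untouched; only the new array is sorted.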

That's it. Enjoy the sorted array!

In summary:
- Implement the "Comparable" interface on the class you want to compare
- Add the "compareTo" method
- Apply Arrays.sort to order the items in a new array.

Simpler ways?
This method worked for me and I found it ok. Surely other ways exist. What do you think? How would you have implemented ordering?
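One other way, sketched under assumptions, is to sort the ArrayList in place with Collections.sort and a Comparator, which avoids both the array conversion and the need to change the class itself. The class `SourceEntry`, its fields and the helper `SortInPlace` are made up for this example, and Comparator.comparingInt requires Java 8.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical element type; it does not need to implement Comparable
class SourceEntry {

    final String name;
    final int linesOfCode;

    SourceEntry(String name, int linesOfCode) {
        this.name = name;
        this.linesOfCode = linesOfCode;
    }
}

class SortInPlace {

    // Orders entries by their linesOfCode attribute, smallest first
    static final Comparator<SourceEntry> BY_LINES =
            Comparator.comparingInt(entry -> entry.linesOfCode);

    /** Sorts the given list itself in ascending order. */
    static void ascending(List<SourceEntry> list) {
        Collections.sort(list, BY_LINES);
    }

    /** Sorts the given list itself in descending order. */
    static void descending(List<SourceEntry> list) {
        Collections.sort(list, BY_LINES.reversed());
    }
}
```

Different comparators can be defined for different attributes, so the same list can be ordered in several ways.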
