Java: Reading the last line of a large text file

Following the recent batch of posts related to text files, it is sometimes necessary to retrieve the last line of a given text file.

My traditional way of doing this is to use a BufferedReader and iterate over all lines until the last one is reached. This works relatively fast, at around 400k lines per second on a typical i7 CPU (4 cores) at 2.4 GHz in 2014.
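For reference, the line-by-line approach can be sketched roughly as follows. This is only a minimal illustration: the class name `LastLineSlow` and the method name `getLastLineSlow` are made up for this example, and ISO-8859-1 is assumed as the encoding so that any byte sequence decodes without errors.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper class, named here only for illustration.
final class LastLineSlow {

    /** Reads every line in turn and keeps only the most recent one. */
    static String getLastLineSlow(final File file) throws IOException {
        String last = "";
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader =
                Files.newBufferedReader(file.toPath(), StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = reader.readLine()) != null) {
                last = line; // overwritten until the end of the file is reached
            }
        }
        return last;
    }
}
```

The cost of this version grows with the size of the file, since every line has to be read and decoded before the last one is known.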

However, for text files with hundreds of millions of lines this approach becomes too slow.

One solution is to use RandomAccessFile, which can jump to any position in the file without reading what comes before it. Since we don't know the exact position where the last line starts, a possible solution is to step backwards through the last characters until a line break is found.
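A minimal sketch of that idea, seeking backwards one byte at a time, could look like the code below. The names `LastLineBytewise`, `getLastLineBytewise` and `byteAt` are hypothetical and exist only for this illustration. The sketch also assumes a single-byte encoding (ISO-8859-1) and a last line short enough to fit in one array. It is not the method presented later in this post, and it behaves differently in edge cases: for instance, it returns the whole content of a file that has no line break at all.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here only for illustration.
final class LastLineBytewise {

    /** Walks backwards from the end of the file, one byte per seek. */
    static String getLastLineBytewise(final File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long end = raf.length();
            // skip a single trailing line break ("\n" or "\r\n"), if present
            if (end > 0 && byteAt(raf, end - 1) == '\n') {
                end--;
                if (end > 0 && byteAt(raf, end - 1) == '\r') {
                    end--;
                }
            }
            // move left until the previous line break or the start of the file
            long start = end;
            while (start > 0 && byteAt(raf, start - 1) != '\n') {
                start--;
            }
            // read the bytes between the two positions and decode them
            byte[] line = new byte[(int) (end - start)];
            raf.seek(start);
            raf.readFully(line);
            return new String(line, StandardCharsets.ISO_8859_1);
        }
    }

    /** Seeks to the given position and returns the byte stored there. */
    private static int byteAt(final RandomAccessFile raf, final long position)
            throws IOException {
        raf.seek(position);
        return raf.read();
    }
}
```

The work done here depends on the length of the last line rather than on the number of lines in the file, which is the property that makes the approach attractive for very large files.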

Seeking one position at a time and reading just a single character is not the most efficient approach, so it is better to read a whole buffer at a time. The snippet below does this by reading the file backwards in blocks of 4096 bytes.

The code snippet below solves my issue: it gets the last line of a large text file in under a millisecond, regardless of the number of lines in the file.

    /**
     * Returns the last line from a given text file. This method is particularly
     * well suited for very large text files that contain millions of text lines
     * since it seeks to the end of the file and scans backwards for the last
     * line break. Please use only for large sized text files.
     * 
     * Needs java.io.File, java.io.RandomAccessFile, java.util.ArrayList and
     * java.util.List to be imported.
     * 
     * @param file A file on disk
     * @return The last line or an empty string if nothing was found
     * 
     * @author Nuno Brito
     * @author Michael Schierl
     * @license MIT
     * @date 2014-11-01
     */
    public static String getLastLineFast(final File file) {
        // file needs to exist
        if (!file.exists() || file.isDirectory()) {
            return "";
        }

        // avoid empty and near-empty files: the scan below skips the last 2 bytes
        if (file.length() <= 2) {
            return "";
        }

        // open the file in read-only mode; try-with-resources closes it on exit
        try (RandomAccessFile fileAccess = new RandomAccessFile(file, "r")) {
            char breakLine = '\n';
            // offset of the current filesystem block - start with the last one
            long blockStart = (file.length() - 1) / 4096 * 4096;
            // hold the current block
            byte[] currentBlock = new byte[(int) (file.length() - blockStart)];
            // later (previously read) blocks
            List<byte[]> laterBlocks = new ArrayList<byte[]>();
            while (blockStart >= 0) {
                fileAccess.seek(blockStart);
                fileAccess.readFully(currentBlock);
                // ignore the last 2 bytes of the block if it is the first one
                int lengthToScan = currentBlock.length - (laterBlocks.isEmpty() ? 2 : 0);
                for (int i = lengthToScan - 1; i >= 0; i--) {
                    if (currentBlock[i] == breakLine) {
                        // we found our end of line!
                        StringBuilder result = new StringBuilder();
                        // RandomAccessFile#readLine uses ISO-8859-1, therefore
                        // we do here too
                        result.append(new String(currentBlock, i + 1, currentBlock.length - (i + 1), "ISO-8859-1"));
                        for (byte[] laterBlock : laterBlocks) {
                            result.append(new String(laterBlock, "ISO-8859-1"));
                        }
                        // maybe we had a newline at end of file? Strip it.
                        if (result.charAt(result.length() - 1) == breakLine) {
                            // newline can be \r\n or \n, so check which one to strip
                            // (the length check guards against a one-character result)
                            int newlineLength = result.length() >= 2
                                    && result.charAt(result.length() - 2) == '\r' ? 2 : 1;
                            result.setLength(result.length() - newlineLength);
                        }
                        return result.toString();
                    }
                }
                // no end of line found - we need to read more
                laterBlocks.add(0, currentBlock);
                blockStart -= 4096;
                currentBlock = new byte[4096];
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        // oops, no line break found or some exception happened
        return "";
    }

If you're worried about reusing this method, it might help to know that I authored this code snippet and that you are welcome to reuse it under the MIT license terms. You are also welcome to improve the code; there is certainly room for optimization.

Hope this helps.

:-)
