Java hidden gem: CopyOnWriteArrayList()

CopyOnWriteArrayList() is a cousin of the well-known ArrayList() class.

ArrayList is often used for storing items. In my case, I had been working on a multi-threaded program whose threads shared a common ArrayList.

To improve performance, every now and then I wanted to remove some of the items from this list when they matched certain criteria. In the past I would use the Iterator() class to step through the items with iterator.next().

To remove an item I'd just call iterator.remove(). However, this approach kept failing for some odd reason:
java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification (AbstractList.java:372)
I tried placing synchronized on the relevant methods, but processing just got slower and the error did not go away.
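For illustration only, here is a stripped-down example of the failing pattern (hypothetical names, not the actual program): one thread iterates the shared ArrayList while another mutates it, and the iterator eventually throws the exception above.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ComodificationDemo {
    public static void main(String[] args) throws InterruptedException {
        List<String> shared = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            shared.add("item-" + i);
        }

        // One thread keeps mutating the list...
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) {
                shared.remove(0);
            }
        });
        writer.start();

        // ...while this thread iterates it: sooner or later the iterator
        // notices the structural change and throws ConcurrentModificationException.
        Iterator<String> it = shared.iterator();
        while (it.hasNext()) {
            if (it.next().endsWith("42")) {
                it.remove();
            }
        }
        writer.join();
    }
}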

So, what else can one try? Looking around the web I found the not-so-well-known CopyOnWriteArrayList(), which to my surprise solved the problem and brought a nice performance boost.

It works in the same manner as a typical ArrayList, but every modification creates a fresh copy of the underlying array, so iterators operate on a snapshot and never throw ConcurrentModificationException. To remove items I use a second, decoupled ArrayList and place the items to be removed there. An independent status thread then runs on a three-second interval, checks whether this second list has any items, and removes them from the main list asynchronously.
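A minimal sketch of this setup (the names, the scheduler, and the use of a thread-safe list for the pending removals are my own illustration, not the actual project code):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DeferredRemoval {

    // Main list shared by the worker threads; its iterators work on a
    // snapshot, so they never throw ConcurrentModificationException.
    private final List<String> items = new CopyOnWriteArrayList<>();

    // Decoupled list that only collects items scheduled for removal.
    // (Using a thread-safe list here is my assumption; the post only
    // says a second ArrayList is used.)
    private final List<String> pendingRemovals = new CopyOnWriteArrayList<>();

    public void startCleaner() {
        ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();
        // Independent status thread: every three seconds, drain the pending
        // removals and take them out of the main list asynchronously.
        cleaner.scheduleAtFixedRate(() -> {
            while (!pendingRemovals.isEmpty()) {
                String victim = pendingRemovals.remove(0);
                items.remove(victim);
            }
        }, 3, 3, TimeUnit.SECONDS);
    }

    public void add(String item) {
        items.add(item);
    }

    // Called from the worker threads: cheap, the real removal happens later.
    public void markForRemoval(String item) {
        pendingRemovals.add(item);
    }
}

Since every mutation of a CopyOnWriteArrayList copies the whole backing array, batching the removals every few seconds keeps that copying cost off the hot path.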

All in all, running the code in multi-threaded mode and adopting CopyOnWriteArrayList() reduced the overall processing time for 17 million lines of data from 30 minutes to around 10 minutes, an average of 30k lines/second. The text database used as an example is 12.3 GB in size and contains 2.5 billion snippets that are compared against 164 methods from my test sample.

This translates to roughly 41 billion comparisons taking place in 10 minutes.

For reference, when my computer just reads the lines without any processing, it reaches an average speed of 140k lines/second; this value reveals the upper limit imposed by disk I/O bandwidth. The speed of 30k lines/second is (probably) due to CPU limitations (an i7) when doing similarity comparisons between strings.


The performance is not bad, but at this point I'm running out of ideas on how to further bring down the processing time. The bottleneck is still the comparison algorithm; I've already written a cheaper/dirty version of Levenshtein's algorithm for faster comparisons, but it's still not enough.
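That cheaper/dirty variant isn't shown in this post; purely as an illustrative sketch (my own names and threshold, not the project's code), here is one common way to cut the cost: a two-row Levenshtein distance that gives up as soon as the result can no longer stay within a chosen bound.

public final class BoundedLevenshtein {

    /**
     * Returns the Levenshtein distance between a and b, or -1 as soon as
     * it is certain the distance exceeds maxDistance (early exit).
     */
    public static int distance(char[] a, char[] b, int maxDistance) {
        // Quick rejection: the distance is at least the length difference.
        if (Math.abs(a.length - b.length) > maxDistance) {
            return -1;
        }
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // If every value in this row already exceeds the threshold,
            // the final distance will too: stop early.
            if (rowMin > maxDistance) {
                return -1;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length] <= maxDistance ? prev[b.length] : -1;
    }
}

Callers that only care whether two snippets are "close enough" can pass a small maxDistance and skip most of the work for clearly dissimilar strings.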


Any ideas?


EDIT

After some more time looking at performance, I noticed that the comparison of two strings was being made using String objects, with redundant conversions back and forth between char[] and String. The code was modified to work only with char[] arrays. Speed doubled, now averaging 60k lines/second and taking 5 minutes to complete the same processing, because less stress is placed on the CPU.
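The post doesn't show the exact change; as a hypothetical before/after sketch (lines, methods and similarity() are placeholder names, not the real code), the idea is simply to hoist the conversions out of the hot loop:

// Before (sketch): the hot loop kept converting the same data, so every
// comparison paid for fresh String/char[] copies.
for (String line : lines) {
    for (String method : methods) {
        similarity(line.toCharArray(), method.toCharArray());  // repeated conversions
    }
}

// After (sketch): convert each reference method once up front and each
// line once per iteration, then compare char[] against char[] directly.
List<char[]> methodChars = new ArrayList<>();
for (String method : methods) {
    methodChars.add(method.toCharArray());
}
for (String line : lines) {
    char[] lineChars = line.toCharArray();        // single conversion per line
    for (char[] method : methodChars) {
        similarity(lineChars, method);            // no String churn in the loop
    }
}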




Java: RandomAccessFile + BufferedReader = FileRandomReadLines

In the Java world when reading large text files you are usually left with two options:
  1. RandomAccessFile
  2. BufferedReader

Option 1) allows reading text from any given part of the file, but it is not buffered, meaning that reading lines will be slow.

Option 2) is buffered and therefore fast, but you need to read every line from the beginning of the text file until you reach the position where you really want to read data.

There are strategies to cope with these mutually exclusive options: one is to read data sequentially, another is to partition the data into different files. However, sometimes you just have that case where you need to resume a time-consuming operation (think on a scale of days) involving billions of normally sized lines. Neither option 1) nor option 2) will suffice.

Up to this point I had been trying to improve performance, removing any IFs and any code I could to squeeze out a few more ounces of speed, but the problem remained the same: we need an option 3) that mixes the best of both. I couldn't find one readily available anywhere on the Internet.

In the meanwhile I found a hint that it might be possible to feed a BufferedReader directly from a RandomAccessFile. I tested this idea and it was indeed possible, albeit with some rough edges.

For example, if we are already reading data from the BufferedReader and decide to change the file position on the RandomAccessFile object, the BufferedReader will be left with stale data in its buffer. The solution I applied is to simply create a new BufferedReader, forcing the buffer to be reset.
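The class linked below is the real implementation; what follows is only a rough sketch of the idea, assuming a BufferedReader can be layered over the RandomAccessFile's file descriptor and must be rebuilt on every seek (class and method names here are mine):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;

/** Rough sketch: buffered line reading that can jump to arbitrary offsets. */
public class RandomReadLinesSketch implements AutoCloseable {

    private final RandomAccessFile file;
    private BufferedReader reader;

    public RandomReadLinesSketch(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        reader = createReader();
    }

    /** Jump to an absolute byte offset, then rebuild the reader so its buffer is reset. */
    public void seek(long offset) throws IOException {
        file.seek(offset);
        // The old reader is simply dropped; closing it would also close
        // the shared file descriptor.
        reader = createReader();
    }

    /** Buffered, so sequential calls stay fast. */
    public String getNextLine() throws IOException {
        return reader.readLine();
    }

    private BufferedReader createReader() throws IOException {
        // The stream shares the RandomAccessFile's descriptor, so it starts
        // reading at whatever position seek() last set.
        return new BufferedReader(new InputStreamReader(
                new FileInputStream(file.getFD()), "UTF-8"));
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}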


Now, I'm making available the code that combines these two approaches. You can find the FileRandomReadLines class at https://github.com/nunobrito/utils/blob/master/Utils/src/utils/ReadWrite/FileRandomReadLines.java

It has no third-party dependencies, so you are likely fine just downloading it and including it in your code. Maybe a similar implementation has already been published elsewhere; I tried as much as possible to find some ready-made code but didn't find any.

If you see any possible improvements, do let me know and I'll include your name in the credits.

A trillion files

2014 has come to an end, so I'm writing a retrospective about what happened and what might be coming down the road in 2015.

For me, the year's first milestone was reached in February with a talk about SPDX and open source at FOSDEM. At that time I was applying for a position as co-chair of the SPDX working group, but another candidate in Europe, apparently more suited, was chosen.

Nevertheless, I kept up my work related to the SPDX open format throughout the year. At FOSDEM the first graphical visualizer for SPDX documents was debuted; in the process, a license detection engine was written to find common software licenses and place this information in newly generated SPDX documents.

On the TripleCheck side, funding was a growing difficulty across the year. After FOSDEM there was urgency in raising funds to keep the company running. At that point we had no MVP (minimum viable prototype) to show, and investors had no interest in joining the project. Despite our good intentions and attempts to explain the business concept, we didn't have the presentation and business skills needed to move forward. The alternative option for funding without depending on investors was EUREKA funding from the EuroStars program.

For this purpose a partnership was formed with an aerospace organization and another company well established in the open source field. We aimed to take a step forward in open source licensing analysis. After months of preparation, iteration and project submission we got a reply: not accepted. The critique that pained me the most was reading that because our project would be open source, we would be unable to maintain a sustainable business, since competitors would copy our work. Maybe they have a point, but being open source ourselves is our leverage against competitors: it is a path they will not cross, and it opened the doors of the enterprise industry to what we do. Open source companies have a hard time succeeding, but despite the hard path I wasn't willing to see us become like the others.

In parallel, people had been hired in the previous months to work on the business side of TripleCheck, but it just wasn't working as we hoped. The focus then moved strictly to code development and reaching an MVP, but this wasn't working from a financial perspective either. At this point my own bank savings were depleted, the company was reduced back to the two original founding members, and it seemed like the end of the story for yet another startup that had tried its luck. We did not have the finances, the team, or the infrastructure to process open source software at large scale.


Failure was here; it was time to quit and go home. So, as an engineer, I simply accepted failure as a consolidated fact. Now, with everything failed, there was nothing to lose. The question was "what now?"

There was enough money in the bank to pay rent and stay at home for a couple of months. Finding a new job is relatively easy when you know your way around computers. It was a moment very much like a certain song, where the only thing occupying my mind was not really failure but the fact that I remained passionate about solving the licensing problem and wanted to get this project done.

So, let's clear the mind and start fresh. No outside financing from VCs, no ambitious business plans, no business experts, no infrastructure, and no resources other than what is available right now. Let's make things work.


I kept working on the tooling, kept moving forward, and eventually got approached by companies that needed consulting. TripleCheck was no longer a startup looking for explosive growth; it now had the modest ambition of making enough to pay the bills and keep working with open source.

Consulting in the field of open source compliance is not easy when you're a small company. While bigger consulting companies in this field can afford to just hand back a report listing what is wrong with a client's code, we had to do the same, plus get our hands dirty changing the code to make it compliant. Looking back, this was one heck of a way to gain expertise in a complete and fast-forward manner.

Each client became a beta-tester for the tooling being developed at the same time. This meant that the manual process was incrementally replaced with an automated method. Our tooling improved with each analysis, which brought different code to analyze, different requirements and different licenses to interpret. At some point the tooling got so accurate that it could detect licensing faults in the open source code of companies such as Microsoft.

At this time our first investor surfaced. A client was selling his company and was amazed by the work done while inspecting his code. For me this was one of those turning points: now we had a business expert on our side. Our old PowerPoint pitch decks were crap; nobody really understood why someone needed a compliance check. But this investor had lived through the pain of not having his code ready for acquisition, and he knew how relevant this code repair had been. It became an opportunity to bring aboard a person with first-hand experience as a client, someone to whom we didn't have to explain why it mattered to fix licensing with a tool rather than an expert human.



With his support more business got done and our presentation improved. It was now possible to move forward. One of the goals in mind was the creation of an independent open source archive. In August we reached the mark of 100 million source code files archived. A new type of technology dubbed "BigZip" was developed for this purpose, since normal file systems and archive formats were ill suited to processing an archive of this scale. A good friend of mine nicely described this concept as a "reversed zipped tar": we create millions of zip files inside a single file, the reverse of what tar.gz does on Linux.
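BigZip itself isn't shown in this post; just to make the "reversed zipped tar" idea concrete, here is a toy sketch (names and the on-disk format are mine, not the real BigZip) that appends each file as an individually compressed block to one container file and records its offset in an index:

import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class BigContainerSketch implements AutoCloseable {

    private final FileOutputStream container;  // one big file holding millions of compressed blobs
    private final PrintWriter index;           // offset + length per blob, for later retrieval
    private long position = 0;

    public BigContainerSketch(String containerPath, String indexPath) throws IOException {
        container = new FileOutputStream(containerPath);
        index = new PrintWriter(indexPath, "UTF-8");
    }

    /** Compress one source file and append it as an independent block. */
    public void add(Path sourceFile) throws IOException {
        byte[] raw = Files.readAllBytes(sourceFile);

        // Compress in memory so the block size is known before writing.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflate =
                 new DeflaterOutputStream(buffer, new Deflater(Deflater.BEST_SPEED))) {
            deflate.write(raw);
        }
        byte[] block = buffer.toByteArray();

        container.write(block);
        index.println(sourceFile + "\t" + position + "\t" + block.length);
        position += block.length;
    }

    @Override
    public void close() throws IOException {
        container.close();
        index.close();
    }
}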

This solved the problem of processing files in large numbers. To get files from the Internet, a project called "gitFinder" was developed, which retrieved over 7 million open source projects. Our first significant data set had been achieved.

August was also the time for TripleCheck's first presence with a stand at a conference, FrOSCon. By this event we had already developed a new technology able to find snippets of code that were not original. It was dubbed "F2F", based on a humorous motto: "Hashes to Hashes, FOSS to FOSS", a nod to the fact that file hashes (MD5, SHA1, ...) were used to expose FOSS source code files inside proprietary code.


This code created a report indicating which snippets of code were not original and where else on the Internet they could be found. For this purpose I wrote a token translator/comparator and a few other algorithms to detect code similarity. The best memory I have from this development is writing part of the code on a boat directly in front of the Eiffel Tower. When you're a developer, these are the moments one remembers with a smile as the years pass.

Shortly after, in October, TripleCheck got attention at LinuxCon Europe. For this event we brought aboard a partnership with http://searchcode.com to create or view online an SPDX document for a given repository on GitHub. At the same event we made available a DIY project that enabled anyone to generate 1 million SPDX documents. To provide context, the SPDX format is criticized for the lack of example documents available to the public; the goal was to make as many documents available as possible. Sadly, no public endorsement from the SPDX working group came for these kinds of activities. To make matters worse, my emails to the mailing list too often went silently ignored whenever I proposed improvements. That was sad; I had real hopes of seeing this open standard rise.


I can only wonder if the Linux Foundation will ever react. I'm disenchanted with the SPDX direction, but I believe we (the community) very much need this open standard for code licensing to exist, so I keep working to make it reachable and free of cost.


From November to December the focus was scaling our infrastructure. This meant a code rewrite to apply the lessons learned. The code was simplified to a level where we can keep using inexpensive hardware and software, and only 1~2 developers are needed to improve it.

The result was a platform that, by the end of December, reached the milestone of one trillion files archived. In this sense we achieved what others said was impossible without the proper funds. These files belong to several million projects around the web that are now ready for use in future code analysis. For example, upcoming in 2015 is the introduction of two similarity matching algorithms converted to Java: one is TLSH from Trend Micro and the second is SDHash. This is code that we are donating back to the original authors after conversion, and we will be testing it to see how it performs on code comparisons.


In retrospect I am happy. We passed through great moments and collapsing ones, and lived through a journey that produced code others can reuse. I'm happy that TripleCheck published more code in a single year than any other licensing compliance provider has ever done over the course of their existence, which in most cases is above a decade.

At the end of the day, after TripleCheck is long gone, it is this same code that will remain public and reachable for other folks to reuse. Isn't knowledge sharing one of the cornerstones of human evolution? In 2014 we helped human knowledge about source code move forward; now let's start 2015.