A trillion files

2014 has come to an end, so I'm writing a retrospective about what happened and what might be coming down the road in 2015.

For me, the year had the first milestone reached in February with a talk about SPDX and open source in FOSDEM. At that time I was applying to a position as co-chair for the SPDX working group but another candidate in Europe was chosen, apparently more suited.

Nevertheless, I kept throughout the year with my work related to the SPDX open format. In FOSDEM was debuted the first graphical visualizer for SPDX documents, in the process was written a license detection engine to find common software licenses and place this information on newly generated SPDX documents.

On the TripleCheck side, funding was a growing difficulty across the year. After FOSDEM there was urgency in raising funds to keep the company running. At that point we had no MVP (minimum viable prototype) to show and investors had no interest in joining the project. Despite our good intentions and attempts to explain the business concept, we didn't had the needed presentation and business skills to move forward. The alternative option for funding without depending on investors was the EUREKA funding from the EuroStars program.

For this purpose was formed a partnership with an aerospace organization and another company well matured in the open source field. We aimed to move a step forward in terms of open source licensing analysis. After months of preparation, iteration and project submission we got a reply: not accepted. The critique that pained me the most was reading that our project would be open source, therefore unable to maintain a sustainable business because competitors would copy our work. Maybe they have a point, but being open source ourselves is our leverage against competitors since this is a path they will not cross and that opened the doors of the enterprise industry to what we do. Open sourced companies are hard to succeed, despite the hard path I wasn't willing to see us become like the others.

In parallel, people had been hired in previous months to work on the business side of TripleCheck but it just wasn't working as we hoped. The focus then moved strictly to code development and reach an MVP but this wasn't working from a financial perspective either. At this point my own bank savings were depleted, the company reduced back to the two original founding members and seemed the end of the story for yet another startup that tried their luck. We did not had the finances, nor the team, nor the infrastructure to process open source software in large scale.


Failure was here, was time to quit and go home. So, as an engineer I just assumed failure as a consolidated fact. Now with everything failed, there was nothing to lose. The question was "what now?"

There was enough money in the bank to pay rent and stay at home for a couple of months. Finding a new job is relatively easy when you know your way around computers. It was a moment very much like a certain song where the only thing occupying the mind was not really failure, but the fact that I remained passionate about solving the licensing problem and wanted to get this project done.

So, let's clear the mind and start fresh. No outside financing from VC, no ambitious business plans, no business experts, no infrastructure nor any resources other than what is available right now. Let's make things work.


Kept working on the tooling, kept moving forward and eventually got approached by companies that needed consulting. TripleCheck was no longer a startup looking for explosive growth, it had now the modest ambition of making enough to pay the bills and keep working with open source.

Consulting on the field of open source compliance is not easy when you're a small company. While bigger consulting companies on this field can afford to just give back a report listing what is wrong with the code of a client, we had to do the same, plus putting our hands to change the code and make it compliant. Looking back in time, this was one heck of way to get expertise in complete and fast-forward manner. 

Each client became a beta-tester for the tooling developed at the same time. This meant that the manual process was incrementally replaced with an automated method. Our tooling got improved with each analysis that brought different code to analyze, different requirements and different licenses to interpret. At the some point the tooling got so accurate that could now detect licensing faults on the open source code from companies such as Microsoft.

At this time surfaced our first investor. A client was selling his company and he got amazed with the work done while inspecting his code. For me this was one of those turning points, now we had a business expert on our side. Our old powerpoint pitch-decks were crap, nobody really understood why someone needed a compliance check. But this investor had lived through the pain of not having his code ready for acquisition and how relevant this code repair had been. This had become an opportunity to bring aboard a person with first hand experience as a client that we didn't had to explain why it mattered to fix licensing with a tool, not an expert human.



With his support more business got done and our presentation improved. Was now possible to move forward. One of the goals in mind was the creation of an independent open source archive. In August we reached the mark of 100 million source code files archived. A new type of technology dubbed "BigZip" was developed for this purpose since normal file systems and archives were ill suited for this scale of archive processing. A good friend of mine described nicely this concept as a "reversed zipped tar". Meaning that we create millions of zip files inside a single file, the reverse action of what tar.gz does in Linux.

This way got solved the problem of processing files in large numbers. To get files from the Internet was developed a project called "gitFinder" that retrieved over 7 million open source projects. Our first significant data-set had been achieved.

In August was time for the first presence with a stand for TripleCheck on a conference, the FrOSCon. At this event we already had developed a new technology that was able to find snippets of code which were not original. It was dubbed "F2F", based on a humour inspired motto: "Hashes to Hashes, FOSS to FOSS" as a mock to the fact that file hashes (MD5, SHA1, ..) were used for exposing FOSS source code files inside proprietary code.


This code created a report indicating the snippets of code that were not original and where else on the Internet they could be found. For this purpose I wrote a token translator/comparator and a few other algorithms to detect code similarity. The best memory that I have from this development happened when writing part of the code on a boat directly in front of the Eiffel tower. When you're a developer, these are the memories that one remembers with a smile as years pass.

Shortly later in October, TripleCheck got attention at LinuxCon in Europe. For this event we brought aboard a partnership with http://searchcode.com to create or view online an SPDX document from a given repository on GitHub. In the same event we made available a DIY project that enabled anyone to generate 1 million SPDX documents. To provide context, the SPDX format is criticized by the lack of example documents available to public. The goal was making available as many documents as possible. Sadly, no public endorsement from the SPDX working group came to this kind of activities. To make matters worse, too often my emails went silently ignored on the mailing list whenever proposing improvements. That was sad, had real hopes to see this open standard rise.


Can only wonder if the Linux Foundation will ever react. I'm disenchanted with the SPDX direction but believe we (community) very much need this open standard for code licensing to exist, so I keep working to make it reachable and free of costs.


From November to December the focus was scaling our infrastructure. This meant a code rewrite to apply lessons learned. The code complexity was simplified to a level where we can keep using inexpensive hardware and software where only 1~2 developers are needed to improve the code.

The result was a platform that reached by the end of December the milestone of one trillion files archived. In this sense we achieved what others said to be impossible without the proper funds. These files belong to several million projects around the web that are now ready for use in future code analysis. For example, upcoming in 2015 is the introduction of two similarity matching algorithms converted to Java. One of them is TLSH from TrendMicro and the second is SDHash. This is code that we are directly donating back to the original authors after conversion and will be testing to see how it performs on code comparisons.


In retrospective I am happy. We passed through great and collapsing moments, lived through a journey that builds code that others can reuse. I'm happy that TripleCheck published more code in a single year than any other licensing compliance provider has ever done over the term of their existence, which in most cases is above a decade.

At the end of day after TripleCheck is long gone, it is this same code that will remain public and reachable for other folks to re-use. Isn't knowledge sharing one of the cornerstones of human revolution? In 2014 we have helped human knowledge about source code to move forward, let's now start 2015.



No comments:

Post a Comment