Linux Mint 17 with Windows 10 look

This weekend I finally took the time to upgrade Windows 7 on my old laptop and try out that button on the system tray offering the free Windows 10 install.

I was surprised: this is an old laptop from 2009 that came with the stock Windows 7 version, and it still works fairly OK. I have to say that the new interface is indeed looking better and simpler. The desktop is enjoyable, but the fact that this Windows version beams up to Microsoft whatever I'm doing on my own laptop is still a bother and sends a cold chill down the spine.

On my newer laptop I run Linux Mint. This is an old version installed back in 2013 and could really use an update. So, since it was upgrade weekend, I decided to simply go ahead and bring this Linux machine up to a more recent version of Mint and see what had changed over the past years. While doing this upgrade, a question popped up: "how about adding the design of Windows 10 with Linux underneath, would it work?"

And this is the result:
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWwOTCowk11nFpH-wJboBfi3FCuZdqiwIYidSvPDcMtzyGxrCHwEDWty20uUeNKOxX4eLZggzqSj3kTItqRTOGv-vag6NpVh7a8-vJtvpHBlIJVG3ZQZwz4wKybraCqQ_gezPgVEEDKOc/s1600/Screenshot_2015-10-18_16-59-49.png


The intention wasn't to create a perfect look-alike but (in my opinion) to mix things up and get a relatively fresh-looking design based on Windows, without giving up our privacy.


Operating System

I've got Linux Mint 17.2 (codename Rafaela, Cinnamon edition for x64) downloaded from http://www.linuxmint...tion.php?id=197

Instead of installing to disk, this time I've installed and now run the operating system from a MicroSD card connected to the laptop through the SD reader using an SD adapter. The MicroSD is a Samsung 64 GB card with an advertised speed of 40 MB/s for read operations. Cost was ~30 EUR.

Installing the operating system followed the routine steps one would expect. There is a GUI tool within Linux Mint to write the DVD ISO onto a USB pen drive connected to your laptop. Then boot from the USB drive and install the operating system on the MicroSD; the boot entry gets added automatically.


Windows 10 theme and icons

Now that the new operating system is running, we can start the customization.

The window style you see in the screenshot can be downloaded from: http://gnome-look.or...?content=171327

This theme comes with icons that look exactly like Windows 10, but that didn't look balanced, nor was it our intention to copy the icons pixel by pixel. Rather, the intention was to re-use the design guidelines. While looking for options, I found Sigma Metro, which resembled what was needed: http://gnome-look.or...?content=167327

If you look around the web, you'll find instructions on how to change the window themes and icons. Otherwise if you get into difficulties, just write me a message and I'll help.


Firefox update and customization

Install Ubuntu Tweak. From there, go to the Apps tab and install the most recent edition of Firefox, because the one included in the distro is a bit old.

Start changing Firefox by opening it up and going to "Addons" -> "Get Addons". Type "Simple White Compact" in the search box. This was the simplest theme that I found and it changes the browser's looks, from icons to tab position, as you can see in the screenshot. Other extensions that you might enjoy adding while making these changes are "Adblock Plus" to remove ads, "Tab Scope" to show thumbnails when browsing tabs and "Youtube ALL HTML5" to force YouTube to run without the Adobe Flash Player.


Office alternative and customization

Then we arrive at Office. I only keep that oldish laptop because it has Adobe Reader (which I use for signing PDF documents) and Microsoft Office for the cases when I need to modify documents and presentations without getting them to look broken. So, this time I was prepared to run both apps using Wine (it is possible) but decided to first check up on the alternatives and try using only native Linux apps. I was pleasantly surprised.

LibreOffice 4.x is included by default in the distro. Whenever I used it, my slides formatted in MS Office would look broken and unusable. I decided to download and try out version 5.x and, to my surprise, noticed that these issues are gone. Both the slides and Word documents are now properly displayed, with just about the same results that I'd expect from Microsoft Office. I'm happy.

To install LibreOffice 5.x visit https://www.libreoff...reoffice-fresh/

For the Linux edition, read the text document with instructions. It is quite straightforward, just one command line to launch the setup. So, I was happy with LibreOffice as a complete replacement for Microsoft Office (no need to acquire licenses nor run Office through Wine). However, the icons inside LibreOffice still didn't look good; they looked old. In this respect the most recent version of Microsoft Office simply "looks" better. I wanted LibreOffice to look that way too. So, I got icons from here: http://gnome-look.or...?content=167958

It wasn't straightforward to find out where the icons should be placed because the instructions for version 4.x no longer apply. To help you, the zip file with the icons needs to be placed inside:
/opt/libreoffice5.0/share/config/

Then you can open up "Writer" and under "Tools" -> "Options" -> "View" choose "Office2013" to get the new icons in use. The startup logo of LibreOffice also seemed too flashy and could be changed. So I replaced it with the one available at http://gnome-look.or...?content=166590

Just a matter of overwriting the intro.png image found at:
/opt/libreoffice5.0/program


Alternative to Adobe Reader for signing PDF

Every now and then comes a PDF that needs to be printed, signed by pen and then scanned to send back to the other person. I stopped doing this kind of thing some time ago by adding a digital signature that includes an image of my handwritten signature to the document. This way there's no need to print or scan any papers. Adobe Reader did a good job on this task, but getting it to run on Wine with the signature function was not straightforward.

I started looking for a native Linux alternative and found "Master PDF Editor". The code for this software is not public, but it was the only option I could find that provided a native Linux install supporting digital handwritten signatures: https://code-industr...asterpdfeditor/

If you're using this tool for business, you need to acquire a license. For home use it is free of cost. Head over to the download page and install the app. I was surprised because it looked very modern, simple and customizable. I'll buy a license for this tool since it does exactly what I needed. With LibreOffice and Master PDF Editor as complete alternatives to MS Office and Acrobat, there is no longer a valid reason (in my case) to switch back to the old laptop whenever editing documents. This can now be done with the same (or even better) quality from Linux.


Command line

A relevant part of my day-to-day involves the use of the command line. In Linux this is a relatively pleasant task because the terminal window can be adjusted and customized, and never feels like a second-class citizen inside the desktop. With the recent changes applied, it was now possible to further improve the terminal window by showing the tool bar (see the screenshot).

Open a terminal, click on "View" -> "Show tool bar". Usually I'm against adding buttons, but that tool bar has a button for pasting clipboard text directly onto the console. I know that can be done from the keyboard using "Ctrl + Shift + V", but I found it very practical to just click a single button and paste the text.


Non-Windows tweaks

There are tweaks only possible on Linux. One of my favorites is still "Wobbly windows". Enable Compiz on the default desktop environment: http://askubuntu.com...-wobbly-windows

With Compiz there are many tweaks possible. I've kept them to a minimum, but it certainly is refreshing to use some animations rather than the plain window frames. If you never saw this in action, here is a video example: https://www.youtube.com/watch?v=jDDqsdrb4MU


Skype alternatives

Many of my friends and business contacts use Skype. It is not safe, it is not private, and I'd prefer to use a non-Microsoft service because the Skype client gets installed on my desktop. Who knows what it can do on my machine when it is running in the background. One interesting alternative that I've found is launching the web edition of Skype that you find at https://web.skype.com/

In Firefox, there is the option to "Pin" a given tab. So I've pinned Skype, as you can see in the screenshot, and it now opens automatically whenever the browser is opened, in practice bringing it online when I want to be reachable. A safe desktop client would be a better alternative. This is nowhere near a perfect solution, but rather a compromise that avoids installing the Skype client.


Finishing

There are more small tweaks happening to adjust the desktop for my case, but what is described above are the big blocks to help you reach this kind of design in case you'd like to do something similar. If you have any questions or get stuck at any part of customization, just let me know.

Have fun!
:-)

.ABOUT format to document third-party software

If you are a software developer, you know that every now and then someone asks you to create a list of the third-party things that you are using on some project.

This is a boring task. Ask any motivated developer and try to find a single one who will not roll their eyes whenever asked to do this kind of thing. We (engineers) don't like it, yet are doomed to get this question every now and then. It is not productive to repeat the same thing over and over again, so why can't someone make it simpler?

Waiting a couple of years didn't work, so it is time to roll up the sleeves and find an easier way of getting this sorted. To date, one needs to manually list each and every portion of code that is not original (e.g. libraries, icons, translations, etc.) and this will end up either in a text file or a spreadsheet (pick your poison).

There are ways to manage dependencies. Think of npm, maven and similar. However, you need to be using a dependency manager and this doesn't solve the case of non-code items. For example, when you want to list that package of icons from someone else, or just list dependencies that are part of the project, but not really part of the source code (e.g. servers, firewalls, etc).

For these cases, you still need to do things manually and it is painful. At TripleCheck, we don't like writing these lists ourselves, so we started looking into how to automate this step once and for all. Our requirements: 1) simple, 2) tool-agnostic and 3) portable.

So we leaned towards the way configuration files work, because they are plain text files that are easy for humans to read or edit, and straightforward for machines to parse. We are big fans of SPDX because it permits describing third-party items in intricate detail, but a drawback of being so detailed is that sometimes we only have coarse information. For example, we know that the files in a given folder belong to some person and have a specific license (maybe we even know the version), but we don't want to compute the SHA1 checksum for each and every file in that folder (either because the files might change often, or simply because the engineer can't do it easily and quickly).

Turns out we were not alone on this kind of quest. NexB had already pioneered in previous years a text format specifically for this kind of task, defining the ".ABOUT" file extension to describe third-party copyrights and applicable licenses: http://www.aboutcode.org/


The text format is fairly simple, here is an example we use ourselves:
 
name: jsTree
license_spdx: MIT
copyright: Ivan Bozhanov
version: 3.0.9

spec_version: 1.0
download_url: none
home_url: http://jstree.com/

# when was this ABOUT file created or last updated?
date: 2015-09-14

# files inside this folder and sub-folders
about_resource: ./

Basically, it follows the SPDX license abbreviations to ensure we use a common way of talking about the same license, and you can add or omit information depending on how much is available. Pay attention to the "about_resource" field, which describes what is covered by this ABOUT file. Using "./" means all files in this folder and in its sub-folders.

One interesting point is the possibility of nesting multiple ABOUT files. For example, place one ABOUT file at the root of your project to describe the license terms generally applicable to the project, and then create specific ABOUT files for specific third-party libraries/items to describe what is applicable in those cases.

When done with the text file, place it in the same folder as what you want to cover. The "about_resource" field can also be used for a single file, or repeated on several lines to cover a very specific set of files.
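Since the format is just "key: value" lines with "#" comments, reading it from code takes very little. Here is a minimal sketch in Java, assuming only the syntax visible in the example above (the class and method names are made up for illustration, and any further syntax that the ABOUT specification may define is ignored):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative reader for the "key: value" lines of an .ABOUT file. */
class AboutFileSketch {

    /** Parses ABOUT text into field name -> values; repeated fields keep every value. */
    static Map<String, List<String>> parse(String text) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (String rawLine : text.split("\\R")) {
            String line = rawLine.trim();
            // skip blank lines and "#" comments
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            // split on the first colon only, so URLs in the value stay intact
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a "key: value" line
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            fields.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "name: jsTree",
                "license_spdx: MIT",
                "# files inside this folder and sub-folders",
                "home_url: http://jstree.com/",
                "about_resource: ./");
        Map<String, List<String>> about = parse(sample);
        System.out.println(about.get("name") + " is licensed under " + about.get("license_spdx"));
    }
}
```

Values are kept in a list per field so that an "about_resource" repeated on several lines is not lost.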

NexB made tooling available to collect ABOUT files and generate documentation. Unfortunately, this text format is not as well known as it should be. Still, it fits like a glove as an easy solution for listing third-party software, so we started using it for automating the code detection.

Our own TripleCheck engine is now supporting the recognition of .ABOUT files and adding this information automatically to the report generation. There is even a simple web frontend for creating .ABOUT files at http://triplecheck.net/components/

From that page, you can either create your own .ABOUT files or simply browse through the collection of already created files. The backend of that web page is powered by GitHub, you find the repository at https://github.com/dot-about/components/tree/master/samples


So, no more excuses to keep listing third-party software manually on spreadsheets.

Have fun! :-)









Something is cooking in Portugal

I don't usually write about politics, for me that is more often a never-ending discussion about tastes, rather than facts.

However, one senses a disturbance in the Force in Portugal. For the first time in the last (35?) years we see a change in the landscape. For those not familiar with Portuguese politics, the country has historically been ruled by one of the two large parties. Basically, one "misbehaves" and then comes the other to "repair". Vice versa at the next elections, as voters grow apathetic and disconnected from whoever gets elected.

This year that wasn't the case. The ruling party is seen as "misbehaving" and the other party didn't get a majority; in other words, it didn't convince a significant part of the population to vote for it. This isn't unusual. What was different was the large number of votes going to two other minor parties, and the fact that most citizens got up from their sofas to vote on who "rules" them for the next years.

For the first time, I'm watching how the second largest party is now forced to negotiate with these smaller parties to reach an agreement, and how, for the first time in a long while, they have to review what was promised during election time and get audited by other parties to ensure they keep those promises.

In other words, for the first time I'm watching what I'd describe as a realistic democratic process happening in our corner of Europe. These might seem strong words, but the fact is that ruling a government by majority (in our context) is a carte blanche to rule over public interests. Go to Portugal and ask people if they feel the government works on their behalf or against them. Ask them for specific examples from recent years that support their claim, and they quickly remember epic fights to prevent expensive airports from being built by the government (Ota), or the extensive (and expensive) network of highways that got built with EU money and are empty today, still serving only the private interest of the companies charging tolls on them.

There was (and still is) too high a level of corruption in the higher instances of government (just look at our former prime minister, recently in jail, or the current prime minister: ask him about "tecnoforma" or about his friend "Dr. Relvas"), and so there is a positive impact when small parties get higher voting representation, forcing majority administrations to be audited and checked in public.

You see, most of this situation derives from a control of mind-share. In previous centuries you'd get support from local cities by promoting your party followers to administrative positions. Later came newspapers (which got tightly controlled), then radio (eventually regulated to forbid rogue senders), then TV (which to date has only two private channels and two state-owned channels) and now comes the Internet.

With the Internet there is a problem. The local government parties with a majority do not control the platforms where people exchange their thoughts. The Portuguese use Facebook (hate it or like it, that's what ordinary families and friends use among themselves) and Facebook couldn't (currently) care less about elections in Portugal, nor would either of the large parties have the resources to make Facebook biased towards their interests. So what we have is a large platform where public news can be debunked as false or plainly biased, where you can see how other citizens really feel about the current state of affairs, where smaller parties get a balanced chance to be read, heard and now even voted for by people who support what they stand for.

For the first time I see the Internet making a real difference in enabling people to connect with each other and enabling the population to collectively learn and change the course of their history, together. As for the Portuguese, you see the big parties worried that this thing of re-election on autopilot is no longer assured. They too need to work together now. Portuguese, please do keep voting. For me this is democracy in action. Today I'm happy.


TripleCheck as a Top 20 Frankfurt startup to watch in 2015

Quite an honor and a surprise: we were awarded this distinction despite the fact that we don't see ourselves so much as a startup, but rather as a plain normal company worried about getting to the next month and growing with its own resources.

Looking back, things are much better today than a year ago. Our schedule is busy at 150% of client allocation and we managed to survive through plain normal consulting, finally moving to product sales this year with a good market reception so far. The team grew, we finally have a normal office location and I keep worrying each month that the funds in the bank are not enough to cover expenses. Somehow, on that brink between failure and success, we work hard to pay the bills and invest in material or people that let us move a bit further each month.

It is not easy, this is not your dream story and we don't know what will happen next year. What I know is that we are pushed to learn more and grow. That kind of experience has a value of its own.

The next step for TripleCheck is building our own petabyte-level datacenter in Frankfurt in 2015. Cost efficiency aside, we are building a safe house outside of the "clouds", where nobody really knows who has access to them.

I wish it was time for vacations or celebrations, but this is not yet the time. I'm happy that together with smart and competent people we are building a stable company.


List of >230 file extensions in plain JSON format

I've collected over the last year some 230 file extensions and manually curated their descriptions so that whenever I find a file extension, it becomes possible to give the end-user a slight idea about what the extension is about.


Most of my code nowadays is written in Java but there is interest in porting some of this information to web apps. So I have exported a JSON list that you are welcome to download and use in your projects.

The list is available on GitHub at this link.

One thing to keep in mind is that I'm looking at extensions from a software developer perspective. This means that when the same extension is used for different programs, I usually favor the programs related to programming.

The second thing is that I collect more information about file extensions than the info you find on this JSON list. For example, I populate for each extension the applicable programming languages. Here is an example for .h source code files. Other values include information if the data is plain binary or text readable, the category to which the extension belongs (archive, font, image, sourcecode, ..) and other meta data values that are useful for file filtering and processing.


If you need help or would like to suggest something to improve the list, just let me know.

Updating the header and footer on static web sites using Java

This year was the first time that I moved away from websites based on Wordpress, PHP and MySQL to embrace the simplicity of static HTML sites.

Simplicity is indeed a good reason. It means virtually no exploits, as there is no database nor script interpretation happening. It means speed, since there are no PHP, Java or Ruby scripts running on the server and the files are delivered directly. The last feature that I was curious to try is the site hosting provided by GitHub, which only supports static web sites.

The first site to convert was the TripleCheck company site. It had been developed over a year ago and badly needed an update. It was based on Wordpress and it wasn't easy to make changes to the theme or content. The site was quickly converted and placed online using GitHub.

However, not everything is roses with static websites. As you can imagine, one of the troubles is updating the text and links that you want to see on each page of the site. There are tools such as Jekyll that help to maintain blogs, but all that was needed here was a simple tool that would pick the header and footer tags and update them with whatever content was intended.

Easy enough, I wrote a simple app for this purpose. You can download the binaries from this link and the source code is available at https://github.com/triplecheck/site_update/


How to get started?

Place the site_update.jar file inside the folder where your web pages are located. Then also copy the html-header.txt and html-footer.txt files there and write inside them the content you want to use as header and footer.

Inside the HTML pages that you want to change, you need to include the following tags:
<header></header>
<footer></footer>

Once you have this ready, from the command line run the jar file using:
java -jar site_update.jar

Check your HTML pages to see if the changes were applied.


What happens when it is running?

It will look for all HTML files with .html extension that are found on the same folder where the .jar file is located. For each HTML file it will look for the HTML tags that were mentioned above and replace whatever is placed between them, effectively updating your pages as needed.

There is an added feature. If you have pages on a sub-folder, this software will automatically convert the links inside the tags so that they keep working. For example, a link pointing to index.html will be modified to ../index.html and this way preserve the link structure. This is done also for images.
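To make the two steps described above more concrete, here is a rough sketch in Java. It is not the actual site_update source code: the class and method names are hypothetical, the regular expressions are a simplification, and it works on strings instead of walking through the files on disk.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the actual site_update source code. */
class HeaderFooterSketch {

    /** Replaces whatever sits between <tag> and </tag> with newContent. */
    static String replaceTagBody(String html, String tag, String newContent) {
        Pattern pattern = Pattern.compile(
                "(?<open><" + tag + ">).*?(?<close></" + tag + ">)", Pattern.DOTALL);
        // quoteReplacement keeps any "$" or "\" inside the new content literal
        return pattern.matcher(html).replaceAll(
                "${open}" + Matcher.quoteReplacement(newContent) + "${close}");
    }

    /** Prefixes relative href/src values with "../" once per sub-folder level. */
    static String adjustLinks(String fragment, int depth) {
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            prefix.append("../");
        }
        // leave URLs with a scheme (http:, mailto:), absolute paths and #anchors alone
        Pattern pattern = Pattern.compile(
                "(?<attr>href|src)=\"(?![a-zA-Z][a-zA-Z0-9+.-]*:|/|#)(?<url>[^\"]*)\"");
        return pattern.matcher(fragment).replaceAll(
                "${attr}=\"" + prefix + "${url}\"");
    }

    public static void main(String[] args) {
        String header = "<a href=\"index.html\">Home</a>";
        String page = "<html><header>stale</header><p>Hello</p></html>";
        // a page located one folder below the site root gets depth 1
        System.out.println(replaceTagBody(page, "header", adjustLinks(header, 1)));
    }
}
```

The link adjustment is applied to the header or footer text before it is placed inside the page, so pages at the root (depth 0) receive the text unchanged.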

An example where this program is used can be found at the TripleCheck website, whose code is available on GitHub at https://github.com/triplecheck/triplecheck.github.io


Feedback, new features?

I'd be happy to help. Just let me know on the comment box here or write a post on Github.





List of 310 software licenses in JSON format

I recently needed a list of licenses to use inside a web page. The goal was presenting the end-user with a set of software licenses to choose from. However, I couldn't find one readily available as JSON or in some kind of format that could be embedded as part of Javascript code.

So I've created such a list, based on the nice SPDX documentation. This list contains 310 license variations and types. I'm explicitly mentioning "types" because you will find licenses called "Proprietary" to define some sort of customized terms, and a "Public domain" type, which is not a license per se but in practice denotes the lack of an applicable license, since copyright (in theory) is not considered applicable to such works.

In case you are ok with these nuances, you can download this json list from https://github.com/triplecheck/engine/blob/master/run/licenseList.js

The list was not crafted manually; I wrote a few lines of Java code to output the file. You find this file at https://github.com/triplecheck/engine/blob/master/src/provenance/javascript/OutputLicenseList.java

If you find the list useful and have feedback or need an updated version, just let me know.





SSDEEP in Java

If you are familiar with similarity hashing algorithms (a.k.a. fuzzy hash matching) and need an SSDEEP implementation in Java code, it is available directly from my Github account at this location: https://github.com/nunobrito/utils/tree/master/Utils/src/utils/hashing/ssdeep

The original page for SSDEEP can be found at http://ssdeep.sourceforge.net/

On that page you find also the binaries for Windows.

Have fun.

Preserving the soul of an old laptop

If you're like me and keep old laptops around the house as wannabe time capsules, this might interest you: I've recently started converting the physical operating systems into virtual machines that I can run from a PC emulator.

The concept is called P2V (Physical To Virtual) and has been made simpler over recent years. My favorite tool for this purpose is provided by VMware at http://www.vmware.com/products/converter

It is a freeware tool, although you have to provide an email address to access the download page. What I like about the tool is the fact that the most difficult steps are automated. All one needs to do is install, convert and run the new virtual machine through a wizard-driven menu with a few clicks.

Being a VMware tool, you'd think that it restricts running the virtual image to their line of products. However, I was able to use VirtualBox to run and see my old Windows 7 booting and running from a virtual machine.

It is very nice to be able to preserve the old look & feel, the apps, documents and working environment in such a quick manner as hardware moves forward.

Windows: Driver for logging the timing of drivers and services at startup

Sometimes it is good to measure how long a laptop with Windows takes to boot and which drivers or services might be bogging down the boot process. There are some ways of measuring the time using Microsoft-provided tooling, but those tools aren't redistributable.

To overcome this limitation, I wrote a simple driver that writes a time stamp to a text file each time another driver or service gets called. This way we can (more or less) expose which drivers or services are taking longer to load.

This is a sample of what to expect:
18/02/2015 13:16:40.437, Driver, 4, \SystemRoot\System32\Drivers\crashdmp.sys
18/02/2015 13:16:40.453, Driver, 4, \SystemRoot\System32\Drivers\iaStor.sys
18/02/2015 13:16:40.453, Driver, 4, \SystemRoot\System32\Drivers\dumpfve.sys
18/02/2015 13:16:40.812, Driver, 4, \SystemRoot\system32\DRIVERS\cdrom.sys
18/02/2015 13:16:40.812, Driver, 4, \SystemRoot\System32\Drivers\Null.SYS
18/02/2015 13:16:40.828, Driver, 4, \SystemRoot\System32\Drivers\Beep.SYS
18/02/2015 13:16:40.843, Driver, 4, \SystemRoot\System32\drivers\watchdog.sys
18/02/2015 13:16:40.843, Driver, 4, \SystemRoot\System32\drivers\VIDEOPRT.SYS
18/02/2015 13:16:40.843, Driver, 4, \SystemRoot\System32\drivers\vga.sys
18/02/2015 13:16:40.843, Driver, 4, \SystemRoot\System32\DRIVERS\RDPCDD.sys
18/02/2015 13:16:40.859, Driver, 4, \SystemRoot\system32\drivers\rdpencdd.sys
18/02/2015 13:16:40.859, Driver, 4, \SystemRoot\system32\drivers\rdprefmp.sys
18/02/2015 13:16:40.859, Driver, 4, \SystemRoot\System32\Drivers\Msfs.SYS
18/02/2015 13:16:40.875, Driver, 4, \SystemRoot\System32\Drivers\Npfs.SYS
18/02/2015 13:16:40.875, Driver, 4, \SystemRoot\system32\DRIVERS\TDI.SYS
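A log in this shape is easy to post-process. The sketch below, in Java, assumes the four comma-separated columns and the day/month/year time stamp seen in the sample above, and prints how many milliseconds passed between each entry and the next one. The class name is made up, and the gap is only a rough indication of load time since other activity during boot also counts towards it.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative post-processing of boot log lines; not part of the driver itself. */
class BootLogGapsSketch {

    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("dd/MM/uuuu HH:mm:ss.SSS");

    /** For every entry except the last: "millis<TAB>path", the time until the next entry. */
    static List<String> gaps(List<String> logLines) {
        List<String> result = new ArrayList<>();
        LocalDateTime previousTime = null;
        String previousPath = null;
        for (String line : logLines) {
            // columns: time stamp, type, number, path
            String[] parts = line.split(",\\s*", 4);
            if (parts.length < 4) {
                continue; // not a log entry
            }
            LocalDateTime time = LocalDateTime.parse(parts[0].trim(), STAMP);
            if (previousTime != null) {
                long millis = Duration.between(previousTime, time).toMillis();
                result.add(millis + "\t" + previousPath);
            }
            previousTime = time;
            previousPath = parts[3].trim();
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "18/02/2015 13:16:40.453, Driver, 4, \\SystemRoot\\System32\\Drivers\\dumpfve.sys",
                "18/02/2015 13:16:40.812, Driver, 4, \\SystemRoot\\system32\\DRIVERS\\cdrom.sys",
                "18/02/2015 13:16:40.812, Driver, 4, \\SystemRoot\\System32\\Drivers\\Null.SYS");
        for (String gap : gaps(sample)) {
            System.out.println(gap);
        }
    }
}
```

Sorting the result by the first column would put the slowest candidates on top.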

The code is available under the EUPL terms and hosted on GitHub at this location: https://github.com/nunobrito/BootLogger

In the download folder you find the compiled drivers (x86 and x64 versions) along with the instructions on how to use the driver on your machine.

Feedback from other users can be read at reboot.pro in this topic:
http://reboot.pro/topic/20345-driver-for-logging-windows-boot-drivers-and-services/

Each boot log report will be placed under c:\BootLogger. This location is configurable in case you want to change it.

Have fun!
:-)






Looking ahead

Looking ahead
there is a different course.
A course that dictates the future,
short on time and on breath
to escape the torment
that the brief moment brings.
So we have a year
hardly sane and profane
that, seen in such a state,
can only bring more harm.
There will be ten months to finish
this small work of enchantment,
that gave such pleasure to begin,
and so little time to savor.
I imagine how the day would be
when the weight disappeared.
A day running with joy,
I would cherish it, it would be magic.
Such a day will come
one day, hopefully.

Java hidden gem: CopyOnWriteArrayList()

CopyOnWriteArrayList is a cousin of the well-known ArrayList class.

ArrayList is often used for storing items. In my case, I had been working on a multi-threaded program that shared a common ArrayList.

In order to improve performance, every now and then I would remove some of the items on this list when they matched some criteria. In the past I would use an Iterator to walk through the items using the iterator.next() method.

To remove an item I'd just call iterator.remove(). However, this approach was failing for some odd reason:
java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification (AbstractList.java:372)
I tried to place synchronized on the relevant methods but processing just got slower, without solving the error.

So, what else can one try? Looking around the web I found the not-so-well-known CopyOnWriteArrayList and, to my surprise, it solved the problem with a nice performance boost.

It works in the same manner as a typical ArrayList, but every change is made on a fresh copy of the underlying array, so threads that are iterating never see the list change under them and the exception goes away. To remove items I use a second, decoupled ArrayList and place the items to remove there. Then an independent status thread runs in a loop every three seconds to check whether this second ArrayList has any items, removing them from the main list asynchronously.
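Here is a small, single-threaded illustration of the difference. It is not the code from my project, just a sketch with made-up names: removing items from a plain ArrayList in the middle of a for-each loop triggers the same ConcurrentModificationException, while a CopyOnWriteArrayList keeps iterating over its snapshot. The second method shows the deferred removal with a decoupled list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Illustrative comparison of ArrayList and CopyOnWriteArrayList during iteration. */
class CopyOnWriteSketch {

    /** Removes even numbers while iterating; returns false if the iteration blew up. */
    static boolean removeWhileIterating(List<Integer> shared) {
        try {
            for (Integer item : shared) {
                if (item % 2 == 0) {
                    shared.remove(item); // structural change in the middle of the loop
                }
            }
            return true;
        } catch (ConcurrentModificationException e) {
            return false;
        }
    }

    /** Deferred removal: apply everything collected in the pending list in one go. */
    static void removePending(List<Integer> shared, List<Integer> pending) {
        shared.removeAll(pending);
        pending.clear();
    }

    public static void main(String[] args) {
        List<Integer> plain = new ArrayList<>(Arrays.asList(1, 2, 3, 4, 5));
        List<Integer> copyOnWrite = new CopyOnWriteArrayList<>(Arrays.asList(1, 2, 3, 4, 5));
        System.out.println("ArrayList survived: " + removeWhileIterating(plain));
        System.out.println("CopyOnWriteArrayList survived: " + removeWhileIterating(copyOnWrite));

        List<Integer> pending = new ArrayList<>(Arrays.asList(1, 5));
        removePending(copyOnWrite, pending);
        System.out.println("After deferred removal: " + copyOnWrite);
    }
}
```

Keep in mind that each write copies the whole array, so this class pays off when reads are far more frequent than writes. Batching removals with removeAll() keeps the number of copies low. Also note that the iterator of a CopyOnWriteArrayList does not support remove(), which is one more reason to collect the items in a separate list.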

All in all, running the code in multi-threaded mode and adopting CopyOnWriteArrayList reduced the overall processing time for 17 million lines of data from 30 minutes to around 10 minutes, an average of 30k lines/second. The text database used as example is 12.3 GB in size and contains 2.5 billion snippets that are compared against 164 methods of my test sample.

This translates to roughly 41 billion comparisons taking place in 10 minutes.

As a reference, when my computer is just reading the lines without any processing it reaches an average speed of 140k lines/second. This value reveals the upper I/O limit to expect from the disk bandwidth. The speed of 30k lines/second is (probably) due to CPU limitations (an i7 core) when doing similarity comparisons between strings.


The performance is not bad, but at this point I'm running out of ideas on how to further bring down the processing time. The bottleneck is still the comparison algorithm. I've already written a cheaper/dirty version of Levenshtein's algorithm for faster comparisons, but it still is not enough.


Any ideas?


EDIT

After some more time looking at performance, I noticed that the comparison of two strings was being made using String objects. There was a redundant transformation back and forth between char[] and String objects. The code was modified to run using only char[] arrays. Speed doubled and is now averaging 60k lines/second, taking 5 minutes to complete the same processing because less stress is placed on the CPU.
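To give an idea of what comparing directly on char[] looks like, here is the textbook Levenshtein distance written for char[] input with two reusable rows. This is the standard algorithm, not the cheaper/dirty variant mentioned above, and the class name is only for illustration.

```java
/** Textbook Levenshtein distance on char[] input (illustrative sketch). */
class CharArrayLevenshteinSketch {

    /** Number of single-character insertions, deletions or substitutions to turn a into b. */
    static int distance(char[] a, char[] b) {
        // only two rows of the distance matrix are needed at any time
        int[] previous = new int[b.length + 1];
        int[] current = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            previous[j] = j; // turning an empty prefix into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length; i++) {
            current[0] = i; // turning a[0..i) into an empty prefix takes i deletions
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // swap the rows instead of allocating new arrays
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length];
    }

    public static void main(String[] args) {
        char[] first = "kitten".toCharArray();
        char[] second = "sitting".toCharArray();
        System.out.println(distance(first, second)); // prints 3
    }
}
```

When the same snippet is compared against many candidates, converting it to char[] once and reusing the array avoids the repeated String conversions described above.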




Java: RandomAccessFile + BufferedReader = FileRandomReadLines

In the Java world when reading large text files you are usually left with two options:
  1. RandomAccessFile
  2. BufferedReader

Option 1) allows reading text from any given part of the file but is not buffered, meaning that it will be slow to read lines.

Option 2) is buffered and therefore fast, but you need to read each line from the beginning of the text file until you reach the place where you really want to read data.

There are strategies to cope with these mutually exclusive options, one is to read data sequentially, another option is to partition data into different files. However, sometimes you just have that case where you need to resume some time consuming operation (think on a scale of days) where billions of default sized lines are involved. Neither option 1) nor option 2) will suffice.

Up to this point I was trying to improve performance, removing any IFs and any other code that stood in the way to squeeze out a few more ounces of speed, but the problem remained the same: we need an option 3) that mixes the best of both options. There wasn't one readily available that I could find around the Internet.

In the meantime I found a hint that it might be possible to feed a BufferedReader directly from a RandomAccessFile. I tested this idea and it was indeed possible, albeit still with some rough edges.

For example, if we are already reading data from the BufferedReader and decide to change the file position on the RandomAccessFile object, the BufferedReader will get erroneous data on the buffer. The solution that I've applied is to simply re-create a new BufferedReader, forcing the buffer to be reset.
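A minimal sketch of the idea, to make it concrete. This is not the published class: the name SeekableLineReader is made up for the example, UTF-8 is assumed, and the BufferedReader is fed through a FileInputStream opened on the RandomAccessFile's own file descriptor, which is one way of doing it.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: buffered line reading that can jump to any byte
// offset. The real FileRandomReadLines class may work differently.
class SeekableLineReader implements AutoCloseable {
    private final RandomAccessFile file;
    private BufferedReader reader;

    SeekableLineReader(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        reader = newReader();
    }

    // The stream shares the file descriptor of the RandomAccessFile,
    // so it starts reading at the current file position.
    private BufferedReader newReader() throws IOException {
        return new BufferedReader(new InputStreamReader(
                new FileInputStream(file.getFD()), StandardCharsets.UTF_8));
    }

    // Moves to a byte offset and throws away the old buffer by creating
    // a new BufferedReader; the old one would return stale data.
    void seek(long byteOffset) throws IOException {
        file.seek(byteOffset);
        reader = newReader();
    }

    String readLine() throws IOException {
        return reader.readLine();
    }

    @Override
    public void close() throws IOException {
        file.close(); // closes the shared file descriptor
    }
}
```

Two things to watch for. The offset given to seek() has to be the start of a line, otherwise the first line returned is a fragment. And because the BufferedReader reads ahead, getFilePointer() on the RandomAccessFile runs ahead of the lines actually consumed, so a resume position has to be tracked separately (for example by summing the byte length of each line read).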


Now, I'm making available the code that combines these two approaches. You can find the FileRandomReadLines class at https://github.com/nunobrito/utils/blob/master/Utils/src/utils/ReadWrite/FileRandomReadLines.java

It has no third-party dependencies, so you are likely fine by just downloading it and including it in your code. Maybe a similar implementation has already been published elsewhere; I didn't find one, and I tried as much as possible to find some ready-made code.

If you see any improvements possible, do let me know and I'll include your name on the credits.

A trillion files

2014 has come to an end, so I'm writing a retrospective about what happened and what might be coming down the road in 2015.

For me, the year had its first milestone in February with a talk about SPDX and open source at FOSDEM. At that time I was applying for a position as co-chair of the SPDX working group, but another candidate in Europe was chosen, apparently more suited.

Nevertheless, I kept up my work related to the SPDX open format throughout the year. At FOSDEM the first graphical visualizer for SPDX documents made its debut; in the process a license detection engine was written to find common software licenses and place this information in newly generated SPDX documents.

On the TripleCheck side, funding was a growing difficulty across the year. After FOSDEM there was urgency in raising funds to keep the company running. At that point we had no MVP (minimum viable product) to show and investors had no interest in joining the project. Despite our good intentions and attempts to explain the business concept, we didn't have the presentation and business skills needed to move forward. The alternative option for funding without depending on investors was the EUREKA funding from the EuroStars program.

For this purpose a partnership was formed with an aerospace organization and another company well matured in the open source field. We aimed to move a step forward in terms of open source licensing analysis. After months of preparation, iteration and project submission we got a reply: not accepted. The critique that pained me the most was reading that our project would be open source, and therefore unable to sustain a business because competitors would copy our work. Maybe they have a point, but being open source ourselves is our leverage against competitors, since this is a path they will not cross and it opened the doors of the enterprise industry to what we do. It is hard for open source companies to succeed, but despite the hard path I wasn't willing to see us become like the others.

In parallel, people had been hired in previous months to work on the business side of TripleCheck but it just wasn't working as we hoped. The focus then moved strictly to code development and reaching an MVP, but this wasn't working from a financial perspective either. At this point my own bank savings were depleted, the company was reduced back to the two original founding members and it seemed the end of the story for yet another startup that tried its luck. We did not have the finances, nor the team, nor the infrastructure to process open source software at large scale.


Failure was here, it was time to quit and go home. So, as an engineer I just accepted failure as a consolidated fact. Now with everything failed, there was nothing to lose. The question was "what now?"

There was enough money in the bank to pay rent and stay at home for a couple of months. Finding a new job is relatively easy when you know your way around computers. It was a moment very much like a certain song where the only thing occupying the mind was not really failure, but the fact that I remained passionate about solving the licensing problem and wanted to get this project done.

So, let's clear the mind and start fresh. No outside financing from VC, no ambitious business plans, no business experts, no infrastructure nor any resources other than what is available right now. Let's make things work.


Kept working on the tooling, kept moving forward and eventually got approached by companies that needed consulting. TripleCheck was no longer a startup looking for explosive growth, it had now the modest ambition of making enough to pay the bills and keep working with open source.

Consulting in the field of open source compliance is not easy when you're a small company. While bigger consulting companies in this field can afford to just give back a report listing what is wrong with the code of a client, we had to do the same, plus put our hands on the code to change it and make it compliant. Looking back in time, this was one heck of a way to get expertise in a complete and fast-forward manner.

Each client became a beta-tester for the tooling developed at the same time. This meant that the manual process was incrementally replaced with an automated method. Our tooling improved with each analysis, which brought different code to analyze, different requirements and different licenses to interpret. At some point the tooling got so accurate that it could detect licensing faults in the open source code from companies such as Microsoft.

At this time our first investor surfaced. A client was selling his company and was amazed by the work done while inspecting his code. For me this was one of those turning points: now we had a business expert on our side. Our old PowerPoint pitch decks were crap, nobody really understood why someone needed a compliance check. But this investor had lived through the pain of not having his code ready for acquisition and knew how relevant this code repair had been. This became an opportunity to bring aboard a person with first-hand experience as a client, to whom we didn't have to explain why it mattered to fix licensing with a tool, not an expert human.



With his support more business got done and our presentation improved. It was now possible to move forward. One of the goals in mind was the creation of an independent open source archive. In August we reached the mark of 100 million source code files archived. A new type of technology dubbed "BigZip" was developed for this purpose, since normal file systems and archives were ill suited for this scale of archive processing. A good friend of mine nicely described this concept as a "reversed zipped tar", meaning that we create millions of zip files inside a single file, the reverse of what tar.gz does in Linux.

This solved the problem of processing files in large numbers. To get files from the Internet, a project called "gitFinder" was developed that retrieved over 7 million open source projects. Our first significant data-set had been achieved.

August was also the time for TripleCheck's first presence with a stand at a conference, FrOSCon. By this event we had already developed a new technology that was able to find snippets of code which were not original. It was dubbed "F2F", based on a humour-inspired motto: "Hashes to Hashes, FOSS to FOSS", mocking the fact that file hashes (MD5, SHA1, ..) were used for exposing FOSS source code files inside proprietary code.


This code created a report indicating the snippets of code that were not original and where else on the Internet they could be found. For this purpose I wrote a token translator/comparator and a few other algorithms to detect code similarity. The best memory that I have from this development happened when writing part of the code on a boat directly in front of the Eiffel tower. When you're a developer, these are the memories that one remembers with a smile as years pass.

Shortly after, in October, TripleCheck got attention at LinuxCon in Europe. For this event we brought aboard a partnership with http://searchcode.com to create or view online an SPDX document from a given repository on GitHub. At the same event we made available a DIY project that enabled anyone to generate 1 million SPDX documents. To provide context, the SPDX format is criticized for the lack of example documents available to the public. The goal was making available as many documents as possible. Sadly, no public endorsement from the SPDX working group came for this kind of activity. To make matters worse, too often my emails were silently ignored on the mailing list whenever I proposed improvements. That was sad, I had real hopes of seeing this open standard rise.


I can only wonder if the Linux Foundation will ever react. I'm disenchanted with the SPDX direction but believe we (the community) very much need this open standard for code licensing to exist, so I keep working to make it reachable and free of cost.


From November to December the focus was scaling our infrastructure. This meant a code rewrite to apply lessons learned. The code complexity was simplified to a level where we can keep using inexpensive hardware and software where only 1~2 developers are needed to improve the code.

The result was a platform that reached by the end of December the milestone of one trillion files archived. In this sense we achieved what others said to be impossible without the proper funds. These files belong to several million projects around the web that are now ready for use in future code analysis. For example, upcoming in 2015 is the introduction of two similarity matching algorithms converted to Java. One of them is TLSH from TrendMicro and the second is SDHash. This is code that we are directly donating back to the original authors after conversion and will be testing to see how it performs on code comparisons.


In retrospect I am happy. We passed through great moments and collapsing ones, and lived through a journey that builds code that others can reuse. I'm happy that TripleCheck published more code in a single year than any other licensing compliance provider has ever done over the term of their existence, which in most cases is above a decade.

At the end of the day, after TripleCheck is long gone, it is this same code that will remain public and reachable for other folks to re-use. Isn't knowledge sharing one of the cornerstones of human revolution? In 2014 we helped human knowledge about source code to move forward, let's now start 2015.