You probably wouldn't know this, but I keep an old computer book on my table.
The title is "Análise e programação de computadores", published in 1970 in Portuguese language by Raúl Verde. Should be noted that the first computer book written in my native language had been published only two years earlier, by the same author.
I keep this book on my table because it teaches the basic principles that an aspiring computer programmer should aim to understand. It focuses on the programming of high-level languages such as Cobol and Fortran. Discusses the tradeoffs between using cards to store data vs paper/magnetic tapes or provide formulas to expose the strong points and make informed decisions.
This book could seem awkward if published today, mostly because it mixes software development with engineering, architecture and business interests that have grown specialized. Yet, these principles remain the same and should row on the same direction. Albeit technology grew and matured, truth is that we keep working to push into new boundaries, squeezing performance out of what is now reachable and to somehow make things work.
In this sense, I find it curious how much of the software I'm developing today is holding a strong resemblance with the old techniques on the book. The usage of flat-files, the console output that helps to monitor results, the considerations about storage and its impact on speed/capacity/costs.
Paraphrasing a blog title, I'm reminded that humans do old new things.
This month I had the task of downloading over a billion files from the Internet on a single run. Compare the writing on this book with the current tasks today, they are not different. Even the term "scale" is no excuse when we place these challenges into context.
And yet, as I look to a thousand lines per minute scrolling through that silent terminal, this gives me a pause of happiness. Data is getting processed, things are working, the system is fast enough.
Each line on this screen represents a repository of code and files for a given person. What kind of work or history was that person developing? Then I think on the millions of other repositories still left to process, this is something pretty to just watch as it scrolls.
In some weird kind of sense for a computer geek, looking at these lines on a terminal feels for me as looking at the stars at night. Imagining how each bright dot is around its own sun, planets and history.
Would these two things be comparable? Perhaps yes, perhaps not.
The old new thing is something to think about.
Java: Reading the last line on a large text file
Following the recent batch of posts related to text files, sometimes is necessary to retrieve the last line on a given text file.
My traditional way of doing this operation is to use the buffered reader and iterate all lines until the last one is reached. This works relatively fast, at around 400k lines per second on a typical i7 CPU (4 cores) at 2.4Ghz in 2014.
However, for text files with hundreds of million lines this approach grows increasingly too slow.
One solution is using RandomAccessFile to access any point of the file without delay. Since we don't know the exact position of the last line, a possible solution is to iterate each of the last characters until the break line is found.
Seeking one position at a time and just reading a single char might not be the most efficient approach. So, reading a buffer with 1000 characters at a time is a possible improvement on a future implementation.
Nevertheless, the code snippet below solves my issue and gets the last line on a large text file under a millisecond, regardless of the number of lines on the file.
If you're worried about re-using this method. Might help to assure that I've authored this code snippet and that you are welcome to reuse this code under the MIT license terms. You are welcome to improve the code, there is certainly room for optimization.
Hope this helps.
:-)
My traditional way of doing this operation is to use the buffered reader and iterate all lines until the last one is reached. This works relatively fast, at around 400k lines per second on a typical i7 CPU (4 cores) at 2.4Ghz in 2014.
However, for text files with hundreds of million lines this approach grows increasingly too slow.
One solution is using RandomAccessFile to access any point of the file without delay. Since we don't know the exact position of the last line, a possible solution is to iterate each of the last characters until the break line is found.
Seeking one position at a time and just reading a single char might not be the most efficient approach. So, reading a buffer with 1000 characters at a time is a possible improvement on a future implementation.
Nevertheless, the code snippet below solves my issue and gets the last line on a large text file under a millisecond, regardless of the number of lines on the file.
/** * Returns the last line from a given text file. This method is particularly * well suited for very large text files that contain millions of text lines * since it will just seek the end of the text file and seek the last line * indicator. Please use only for large sized text files. * * @param file A file on disk * @return The last line or an empty string if nothing was found * * @author Nuno Brito * @author Michael Schierl * @license MIT * @date 2014-11-01 */ public static String getLastLineFast(final File file) { // file needs to exist if (file.exists() == false || file.isDirectory()) { return ""; } // avoid empty files if (file.length() <= 2) { return ""; } // open the file for read-only mode try { RandomAccessFile fileAccess = new RandomAccessFile(file, "r"); char breakLine = '\n'; // offset of the current filesystem block - start with the last one long blockStart = (file.length() - 1) / 4096 * 4096; // hold the current block byte[] currentBlock = new byte[(int) (file.length() - blockStart)]; // later (previously read) blocks List<byte[]> laterBlocks = new ArrayList<byte[]>(); while (blockStart >= 0) { fileAccess.seek(blockStart); fileAccess.readFully(currentBlock); // ignore the last 2 bytes of the block if it is the first one int lengthToScan = currentBlock.length - (laterBlocks.isEmpty() ? 2 : 0); for (int i = lengthToScan - 1; i >= 0; i--) { if (currentBlock[i] == breakLine) { // we found our end of line! StringBuilder result = new StringBuilder(); // RandomAccessFile#readLine uses ISO-8859-1, therefore // we do here too result.append(new String(currentBlock, i + 1, currentBlock.length - (i + 1), "ISO-8859-1")); for (byte[] laterBlock : laterBlocks) { result.append(new String(laterBlock, "ISO-8859-1")); } // maybe we had a newline at end of file? Strip it. if (result.charAt(result.length() - 1) == breakLine) { // newline can be \r\n or \n, so check which one to strip int newlineLength = result.charAt(result.length() - 2) == '\r' ? 2 : 1; result.setLength(result.length() - newlineLength); } return result.toString(); } } // no end of line found - we need to read more laterBlocks.add(0, currentBlock); blockStart -= 4096; currentBlock = new byte[4096]; } } catch (Exception ex) { ex.printStackTrace(); } // oops, no line break found or some exception happened return ""; }
If you're worried about re-using this method. Might help to assure that I've authored this code snippet and that you are welcome to reuse this code under the MIT license terms. You are welcome to improve the code, there is certainly room for optimization.
Hope this helps.
:-)
Java: Writing lines in text files with a buffer
Text files are simple. The problem is when you need to write them a few million times.
A strange error from the file (or operating) system causes the Java Virtual Machine to crash when you keep adding data to a given file.
Quick solution was reducing the number of writes using a buffer to cache a number of text lines before writing the data on disk.
This way was possible to reduce the need for such frequent/numerous write operations.
You find the most up-to-date source code on GitHub at this link.
Not difficult to use, initialize the class with the file you want to write. Then use the write() method for each event you want to store and in the end call the close() method to finish any pending operations.
That's it. :-)
A strange error from the file (or operating) system causes the Java Virtual Machine to crash when you keep adding data to a given file.
Quick solution was reducing the number of writes using a buffer to cache a number of text lines before writing the data on disk.
This way was possible to reduce the need for such frequent/numerous write operations.
You find the most up-to-date source code on GitHub at this link.
Not difficult to use, initialize the class with the file you want to write. Then use the write() method for each event you want to store and in the end call the close() method to finish any pending operations.
That's it. :-)
Java: Reading millions of text lines at top speed
There is one thing to say about Java (as a platform), its performance while reading I/O from the disk using the default classes is impressive.
I'm sharing code to read as fast as possible the lines from a large-sized text file from disk and to process each line through a custom method.
The implementation is very simple, has no external dependencies. You find the code for download at this link on github.
btw. The above link will get you the most up-to-date version of the source code file.
Performance on my laptop (i7 CPU, 8Gb RAM, 500Gb HDD, Linux) is measured with a text file containing ~30 million lines of text (around 300 characters each) that is read under 2 minutes.
A practical example of the code in real-world is at this link. Basically, just copy the class to your project and then use "extends FileReadLines". Let the IDE create the needed methods and off you go to process each line or adapt the progress messages.
That's it.
While it is true that anyone can use bufferedReader on its own. The fact is that I found myself repeating these kind of things, therefore created a library to keep this code on a single location. Hope you find it useful if you're struggling to tackle large scaled flat-files.
To the best of my knowledge, this is the fastest possible way of reading a massive number of lines that uses nothing more than the Java platform.
If you have suggestions on how to reach faster results, please place them on the comments box and I'll update this post accordingly. My thanks in advance.
Also, feel free to change the code on GitHub as you see fit.
I'm sharing code to read as fast as possible the lines from a large-sized text file from disk and to process each line through a custom method.
The implementation is very simple, has no external dependencies. You find the code for download at this link on github.
btw. The above link will get you the most up-to-date version of the source code file.
Performance on my laptop (i7 CPU, 8Gb RAM, 500Gb HDD, Linux) is measured with a text file containing ~30 million lines of text (around 300 characters each) that is read under 2 minutes.
A practical example of the code in real-world is at this link. Basically, just copy the class to your project and then use "extends FileReadLines". Let the IDE create the needed methods and off you go to process each line or adapt the progress messages.
That's it.
While it is true that anyone can use bufferedReader on its own. The fact is that I found myself repeating these kind of things, therefore created a library to keep this code on a single location. Hope you find it useful if you're struggling to tackle large scaled flat-files.
To the best of my knowledge, this is the fastest possible way of reading a massive number of lines that uses nothing more than the Java platform.
If you have suggestions on how to reach faster results, please place them on the comments box and I'll update this post accordingly. My thanks in advance.
Also, feel free to change the code on GitHub as you see fit.
Looking into the .NET license compliance practice
This week Microsoft released to public the source code for the .NET platform. The code is published to the public at GitHub and released under the MIT license terms.
Better than just speaking about the advantages of open source, one should share their own code on the open sphere and this is when the licensing topic becomes a matter of scrutiny.
Now that this once proprietary code is now free and open to the public eye, I've took the liberty of looking inside the source to evaluate how well it would score in regards to licensing compliance.
To proceed with a licensing compliance check, I downloaded a copy of the source code and created a report using the triplecheck tool.
The result shows a project containing 630 files and 128 thousand lines of code. At the time of this evaluation, 534 files were identified with an MIT license reference, 540 files had reference to Microsoft as copyright holder. No other licenses were present.
The result is in the screenshot below.
I was happy to see how the current code base was almost reaching a 10/10 score. It was due to a few source code files that needed reference to author and applicable license that kept the 10 from reach.
Being hosted on GitHub means that it was straightforward to submit a bug report and the issue was picked up a few hours later by a developer.
On the second screenshot you see an overview of the code structure and what can be expected in regards to licensing.
The example highlights one of the files missing to correct.
This is a surface analysis, a deep analysis involves evaluating the code originality and the third-party dependencies. Please write a comment if you find a deeper analysis to be of interest and I'll proceed.
The report is available in the SPDX format by the Linux Foundation. You can view this document directly from a browser on this link at GitHub.
In conclusion, from a first week of releases to public it is clear that this code base demonstrates quality from a licensing perspective. Furthermore, the developer team is participative and transmits a notion of a serious commitment, which is much needed if the intention is to bridge the .NET platform with a healthy community spirit.
This is a sign of how the times are changing. And developing code open to public is certainly a good way to make these changes happen.
Better than just speaking about the advantages of open source, one should share their own code on the open sphere and this is when the licensing topic becomes a matter of scrutiny.
Now that this once proprietary code is now free and open to the public eye, I've took the liberty of looking inside the source to evaluate how well it would score in regards to licensing compliance.
To proceed with a licensing compliance check, I downloaded a copy of the source code and created a report using the triplecheck tool.
The result shows a project containing 630 files and 128 thousand lines of code. At the time of this evaluation, 534 files were identified with an MIT license reference, 540 files had reference to Microsoft as copyright holder. No other licenses were present.
The result is in the screenshot below.
I was happy to see how the current code base was almost reaching a 10/10 score. It was due to a few source code files that needed reference to author and applicable license that kept the 10 from reach.
Being hosted on GitHub means that it was straightforward to submit a bug report and the issue was picked up a few hours later by a developer.
On the second screenshot you see an overview of the code structure and what can be expected in regards to licensing.
The example highlights one of the files missing to correct.
This is a surface analysis, a deep analysis involves evaluating the code originality and the third-party dependencies. Please write a comment if you find a deeper analysis to be of interest and I'll proceed.
The report is available in the SPDX format by the Linux Foundation. You can view this document directly from a browser on this link at GitHub.
In conclusion, from a first week of releases to public it is clear that this code base demonstrates quality from a licensing perspective. Furthermore, the developer team is participative and transmits a notion of a serious commitment, which is much needed if the intention is to bridge the .NET platform with a healthy community spirit.
This is a sign of how the times are changing. And developing code open to public is certainly a good way to make these changes happen.
Simple way to order an ArrayList in Java
Often in Java one ends up using ArrayLists to store objects.
Every now and then becomes needed to sort the elements inside an array, either in descending or ascending order according to some attribute of the object.
This is a somewhat tricky operation that gets forgotten too often, therefore I'm documenting it for future reference to myself. Much of what is written here was based on the tutorial from Mkyong: http://www.mkyong.com/java/java-object-sorting-example-comparable-and-comparator/
First step:
Make the object that you want to order to extend the "Comparable" class.
On my case, I want to apply the ordering on the SourceCodeObject class, so I add the line:
implements Comparable<SourceCodeFile>
And it looks like this:
https://github.com/triplecheck/f2f/blob/26db102366be6b1eaeae2fcbb8bb0e966d951d6a/src/structure/SourceCodeFile.java#L25
Second step:
Override the "compareTo" method. On this case the comparison can occur between any two attributes of the objects that you are comparing. I find it difficult to understand in detail how it works. For my case it was enough to subtract two values and return the value. You find the code example at:
https://github.com/triplecheck/f2f/blob/26db102366be6b1eaeae2fcbb8bb0e966d951d6a/src/structure/SourceCodeFile.java#L150-L153
Third step:
Convert the ArrayList to an array and then sort the array. This is the part that I find tricky because the syntax to convert the ArrayList to a plain array is not straightforward to understand, even from the help menu.
It should be something like:
SourceCodeFile[] array = myArrayList.toArray(new SourceCodeFile[myArrayList.size()]);
Arrays.sort(array);
You find the code example at:
https://github.com/triplecheck/f2f/blob/26db102366be6b1eaeae2fcbb8bb0e966d951d6a/src/structure/AnalysisResult.java#L223-L225
That's it. Enjoy the sorted array!
In resume:
- Add the "Comparable" class to the object you want to compare
- Add the "toCompare" method
- Apply the Arrays.sort for ordering the items on a new array.
Simpler ways?
This method worked for me and I found it ok. Surely other ways exist. What do you think? How would you have implemented ordering?
Every now and then becomes needed to sort the elements inside an array, either in descending or ascending order according to some attribute of the object.
This is a somewhat tricky operation that gets forgotten too often, therefore I'm documenting it for future reference to myself. Much of what is written here was based on the tutorial from Mkyong: http://www.mkyong.com/java/java-object-sorting-example-comparable-and-comparator/
First step:
Make the object that you want to order to extend the "Comparable" class.
On my case, I want to apply the ordering on the SourceCodeObject class, so I add the line:
implements Comparable<SourceCodeFile>
And it looks like this:
https://github.com/triplecheck/f2f/blob/26db102366be6b1eaeae2fcbb8bb0e966d951d6a/src/structure/SourceCodeFile.java#L25
Second step:
Override the "compareTo" method. On this case the comparison can occur between any two attributes of the objects that you are comparing. I find it difficult to understand in detail how it works. For my case it was enough to subtract two values and return the value. You find the code example at:
https://github.com/triplecheck/f2f/blob/26db102366be6b1eaeae2fcbb8bb0e966d951d6a/src/structure/SourceCodeFile.java#L150-L153
Third step:
Convert the ArrayList to an array and then sort the array. This is the part that I find tricky because the syntax to convert the ArrayList to a plain array is not straightforward to understand, even from the help menu.
It should be something like:
SourceCodeFile[] array = myArrayList.toArray(new SourceCodeFile[myArrayList.size()]);
Arrays.sort(array);
You find the code example at:
https://github.com/triplecheck/f2f/blob/26db102366be6b1eaeae2fcbb8bb0e966d951d6a/src/structure/AnalysisResult.java#L223-L225
That's it. Enjoy the sorted array!
In resume:
- Add the "Comparable" class to the object you want to compare
- Add the "toCompare" method
- Apply the Arrays.sort for ordering the items on a new array.
Simpler ways?
This method worked for me and I found it ok. Surely other ways exist. What do you think? How would you have implemented ordering?
Using a simple memory cache to speed-up disk operations
A new cycle of work has now debuted on the F2F project.
In the first version I was going through a bigzip file containing some 30 million Java source code files to convert the code into tokens.
However, the conversion tool had no support to stop/resume the processing. This meant that it would require running the same process 24/7 without any interruptions to complete the task. At the time I was working under a tight schedule and used only a sub-set of the data with some 4 million source code files.
Now, there is time and the tool needs to go through these millions of source code files, being ready to resume the processing in case something goes wrong is a needed feature.
On the first version I'd basically read the contents of a source code file, convert the content into a set of tokens and then write them back on a destination file.
To add a stop/resume feature, one needs to "persist" the memory on disk of the last data which was recorded, in order to know from where to begin on the next re-start. Writing a file to disk doesn't scale under this stress. After some hundred thousands writes it will output "JVM error". This also wears down the HDD.
Another option I tried was MapDB, which is great for persistency with just a few lines of code. However, when calling MapDB to create a checkpoint on file-by-file basis this ended up reducing the performance from 4.000 files/minute to 300 files/minute. Clearly, this would make it difficult to index the whole millions of source code files before 2015.
Writing a single line of text on a file is a bottleneck, even if the line is relatively small.
A very simple solution to this problem was adding a memory cache.
Basically, set a normal string to hold the contents of 1000 token files to be written. Once that threshold is reached, the whole text is written back into the destination file just a single time, instead of 1000.
This simple measure has not only permitted to introduce the checkpoints with MapDB to record the current processing status, as it ended up speeding the process rate to ~16.000 files/minute. Really fast, just wow. I'm still mesmerized by the output speed, some 3 millions already reached in a few hours. The whole set should be converted by tomorrow night.
The tradeoff is memory usage. Some barriers were already in place to avoid processing large sized source code files but let's see how this cache holds on.
You find the relevant code snippet here.
:-)
In the first version I was going through a bigzip file containing some 30 million Java source code files to convert the code into tokens.
However, the conversion tool had no support to stop/resume the processing. This meant that it would require running the same process 24/7 without any interruptions to complete the task. At the time I was working under a tight schedule and used only a sub-set of the data with some 4 million source code files.
On the first version I'd basically read the contents of a source code file, convert the content into a set of tokens and then write them back on a destination file.
To add a stop/resume feature, one needs to "persist" the memory on disk of the last data which was recorded, in order to know from where to begin on the next re-start. Writing a file to disk doesn't scale under this stress. After some hundred thousands writes it will output "JVM error". This also wears down the HDD.
Another option I tried was MapDB, which is great for persistency with just a few lines of code. However, when calling MapDB to create a checkpoint on file-by-file basis this ended up reducing the performance from 4.000 files/minute to 300 files/minute. Clearly, this would make it difficult to index the whole millions of source code files before 2015.
Writing a single line of text on a file is a bottleneck, even if the line is relatively small.
A very simple solution to this problem was adding a memory cache.
Basically, set a normal string to hold the contents of 1000 token files to be written. Once that threshold is reached, the whole text is written back into the destination file just a single time, instead of 1000.
This simple measure has not only permitted to introduce the checkpoints with MapDB to record the current processing status, as it ended up speeding the process rate to ~16.000 files/minute. Really fast, just wow. I'm still mesmerized by the output speed, some 3 millions already reached in a few hours. The whole set should be converted by tomorrow night.
The tradeoff is memory usage. Some barriers were already in place to avoid processing large sized source code files but let's see how this cache holds on.
You find the relevant code snippet here.
:-)
First pull request on GitHub
Since a year to this part that I've been using increasingly more GitHub for everything related to source code hosting.
My impression with them has been very positive since the start. From the web interface that is straightforward, to the social "gamification" in place that motivates developers to commit code more often and to see how they are doing.
Another thing that got me surprised was the availability of a direct API to extract all kind of data from the platform with very little limitations, as if it was your own data to download. This revealed something very deep, that the site was made by developers for developers and that sharing was a keyword taken to practice.
Yesterday I got surprised, once again. Was looking around some code that I found interesting and ended up suggesting some minor improvements. A few days later got a message from the author, a github staff asking if I could do "pull request": https://github.com/benbalter/peer_review/pull/2#event-169874904
It was the first I had to deal with this kind of request. Was feeling a bit lost but he took the time to help me get around. How often do you get someone from the GitHub staff helping you directly to get code shared?
Yep. This is the kind of attitude that makes a platform living and thriving. Great people.
Thumbs up to GitHub.
My impression with them has been very positive since the start. From the web interface that is straightforward, to the social "gamification" in place that motivates developers to commit code more often and to see how they are doing.
Another thing that got me surprised was the availability of a direct API to extract all kind of data from the platform with very little limitations, as if it was your own data to download. This revealed something very deep, that the site was made by developers for developers and that sharing was a keyword taken to practice.
Yesterday I got surprised, once again. Was looking around some code that I found interesting and ended up suggesting some minor improvements. A few days later got a message from the author, a github staff asking if I could do "pull request": https://github.com/benbalter/peer_review/pull/2#event-169874904
It was the first I had to deal with this kind of request. Was feeling a bit lost but he took the time to help me get around. How often do you get someone from the GitHub staff helping you directly to get code shared?
Yep. This is the kind of attitude that makes a platform living and thriving. Great people.
Thumbs up to GitHub.
Quem não tem cão, caça com gato.
I've noted that each month has been entertaining but this one (August) is passing over the "busyness" level.
Some months ago we set as goal to prepare the company for the processing of open source code in a larger scale. It was no longer enough to just develop the tooling for discovery and marking the licenses of open source files, we now required our own knowledge base to improve the detection techniques.
To make this possible, we would then require an infrastructure capable of handling files in larger scale. "You need hadoop" they said. But our resources are scarce. The company income is barely enough to cover a salary. Scaling our technology reach to tackle millions of files seemed a long shot when considering what we have at hand.
But it wasn't.
I remember with a smile the Portuguese proverb: "Quem não tem cão, caça com gato" (in English: Those without a dog, hunt with a cat). And so was the case.
No money for server farms. No Amazon cloud processing, no distributed storage. No teams of engineers, scientists and mathematicians using Hadoop and other big data buzz. No time for peer review nor academic research. No public funding.
And perhaps it was better this way.
It is a lonely and daunting task to write something that involves parallel operations, files with hundreds of gigabytes and an ever closer deadline to deliver a working solution.
However, it does force a person to find a way through the mess. To look at what really matters, to simply say "no" and keep focus. It is a lonely task but at the same time you're left to decide without delay. To experiment without fear and motivated for each day getting a bit closer to the intended goal.
So was the case. What seemed difficult for a single person to implement, was split into pieces that together built an infrastructure. I always keep close to mind the wise saying by a professor at CMU: "You don't know what you don't know". Too often the working solution differed from the planned solution. Very true indeed!
To some point I'm happy. We got things working very fast with a simplicity that anyone can maintain. Several hundred millions files processed in a month, over a hundred million stored in our archive and now moving to the final stage at fast pace.
Turned up that we didn't needed a large team, nor a computer farm, nor the full stack of big data buzz. Surely they are relevant. Surely in our case they were non-affordable, simpler solutions got things working.
TL;DR:
If a big data challenge can be broken into smaller pieces, go for it.
Some months ago we set as goal to prepare the company for the processing of open source code in a larger scale. It was no longer enough to just develop the tooling for discovery and marking the licenses of open source files, we now required our own knowledge base to improve the detection techniques.
To make this possible, we would then require an infrastructure capable of handling files in larger scale. "You need hadoop" they said. But our resources are scarce. The company income is barely enough to cover a salary. Scaling our technology reach to tackle millions of files seemed a long shot when considering what we have at hand.
But it wasn't.
I remember with a smile the Portuguese proverb: "Quem não tem cão, caça com gato" (in English: Those without a dog, hunt with a cat). And so was the case.
No money for server farms. No Amazon cloud processing, no distributed storage. No teams of engineers, scientists and mathematicians using Hadoop and other big data buzz. No time for peer review nor academic research. No public funding.
And perhaps it was better this way.
It is a lonely and daunting task to write something that involves parallel operations, files with hundreds of gigabytes and an ever closer deadline to deliver a working solution.
However, it does force a person to find a way through the mess. To look at what really matters, to simply say "no" and keep focus. It is a lonely task but at the same time you're left to decide without delay. To experiment without fear and motivated for each day getting a bit closer to the intended goal.
So was the case. What seemed difficult for a single person to implement, was split into pieces that together built an infrastructure. I always keep close to mind the wise saying by a professor at CMU: "You don't know what you don't know". Too often the working solution differed from the planned solution. Very true indeed!
To some point I'm happy. We got things working very fast with a simplicity that anyone can maintain. Several hundred millions files processed in a month, over a hundred million stored in our archive and now moving to the final stage at fast pace.
Turned up that we didn't needed a large team, nor a computer farm, nor the full stack of big data buzz. Surely they are relevant. Surely in our case they were non-affordable, simpler solutions got things working.
TL;DR:
If a big data challenge can be broken into smaller pieces, go for it.
The hidden overhead when creating thousands of empty folders
Over the past weeks I've been looking for a way to store some hundred million files inside a desktop file system such as Ext3 or NTFS.
On previous experiments, I've noted that after >30 000 files on a single folder it would not be possible to use a GUI file browser and even the listing of these files using the command line becomes sluggish.
The next hypothesis was to avoid a huge number of files/folders on the same root. For example, for a file I'd remove the name portion and compute a SHA1 checksum signature of 40 bytes. With this signature I'd use the first 4 bytes to create a folder, then I'd use the next 4 bytes to create a sub-folder inside the first one and so forth until I have 10 folders that represent the unique signature of the file.
On the last folder I would add a text file containing the path and name information. If more files would be found in the future, the respective path information would be added on this text file. I thought that using the 4 bytes would reduce the number of possible combinations and permit re-use of the folders that would scale to millions.
In theory it worked OK. On my first tests it worked OK with the first thousand files. The problem was a bit more unexpected. Folders, despite containing nothing other than other folders will also require disk space to exist. While the initial project was a success, we could store/access thousands of files without noticeable latency, we brought aboard a problem of disk storage.
To index something like 40 000 files with this SHA1/folder combination was using 2.2Gb of disk space. This was a hidden overhead, very underestimated on my early estimation.
I didn't tested under the Windows NTFS, having this problem under Linux was already a no-go for this option. Fortunately, on the bench was already being planned an alternative.
Our file-system (at minimum) requires only to write files once and then read them when needed. So, a very simple solution would be writing all files together to create a very large file. And then have a second file indicating the position of each file within this large binary block.
There are some problems with this approach:
As far as I'm concerned, creating millions of files on a desktop-grade file system is something I won't be pursuing further. It is tempting to have all the files ready for processing in their original format but I'm simply not finding a feasible manner of accessing millions of files in their original form without resorting to file systems such as XFS and HDFS, or resorting to some kind of data base.
So, opting with a big-binary file as simple storage for now.
On previous experiments, I've noted that after >30 000 files on a single folder it would not be possible to use a GUI file browser and even the listing of these files using the command line becomes sluggish.
The next hypothesis was to avoid a huge number of files/folders on the same root. For example, for a file I'd remove the name portion and compute a SHA1 checksum signature of 40 bytes. With this signature I'd use the first 4 bytes to create a folder, then I'd use the next 4 bytes to create a sub-folder inside the first one and so forth until I have 10 folders that represent the unique signature of the file.
On the last folder I would add a text file containing the path and name information. If more files would be found in the future, the respective path information would be added on this text file. I thought that using the 4 bytes would reduce the number of possible combinations and permit re-use of the folders that would scale to millions.
In theory it worked OK. On my first tests it worked OK with the first thousand files. The problem was a bit more unexpected. Folders, despite containing nothing other than other folders will also require disk space to exist. While the initial project was a success, we could store/access thousands of files without noticeable latency, we brought aboard a problem of disk storage.
To index something like 40 000 files with this SHA1/folder combination was using 2.2Gb of disk space. This was a hidden overhead, very underestimated on my early estimation.
I didn't tested under the Windows NTFS, having this problem under Linux was already a no-go for this option. Fortunately, on the bench was already being planned an alternative.
Our file-system (at minimum) requires only to write files once and then read them when needed. So, a very simple solution would be writing all files together to create a very large file. And then have a second file indicating the position of each file within this large binary block.
There are some problems with this approach:
- Easy to corrupt. If one byte within the large block is removed/added then it invalidates the index structure completely
- Error checking overhead. Right now it copies files directly onto the big binary block (BBB). If an exception occurs while copying a file, this leads to corruption of the whole big file.
- Limited to operating systems capable of handling files bigger than 4Gb
- (..more exist, just pointing the ones most troublesome at the moment)
As far as I'm concerned, creating millions of files on a desktop-grade file system is something I won't be pursuing further. It is tempting to have all the files ready for processing in their original format but I'm simply not finding a feasible manner of accessing millions of files in their original form without resorting to file systems such as XFS and HDFS, or resorting to some kind of data base.
So, opting with a big-binary file as simple storage for now.
GitHub: Indexing all users non-stop
Last week was dedicated to index the users registered on GitHub.
Getting the users was a needed step in order to test how difficult it would be extracting information from the self-appointed largest (public) open source repository on earth.
To my surprise, it was relatively straight-forward and quick. Over the past months I've understood why GitHub is becoming the place of election for hosting. It was a neat user interface, provides incentives for developers to commit code and simply goes straight to what really matters.
In regards to acessing the information, GitHub provides a pretty amazing and reachable API access. When looking at sourceforge or googlecode, I don't fell minimally invited to work with them in regards to analyse the open source data. I'm sure they have their valid reasons, on my side I'm just looking for a straightforward (and legal) way to access this kind of information.
To access the API was available a set of Java libraries (or you can just use the raw JSON format). There is JCapi (freeware) which is the one that I found more suited. The reason is because every now and then we'd expire the API rate limit (5000 requests/hour) and these libraries permitted to keep the connection in pause until the limit was lifted. There were some defects noted while using the library, however, the developers behind the library were pretty amazing. They solved most of the reported issues in light-speed.
With the library in place, I've wrote a java project to iterate through all the users registered on GitHub using the authorized API calls. At first was noted that the Internet connection broke very often, usually around 200k indexed users. I was doing these tests from the office network and was not stable enough. Then moved the software to a dedicated server that is continuously online at a datacenter. From there the connection lasted longer but still failed at 800k indexed users.
Being a stubborn person by nature, got myself thinking on how to resume the operation from the point where it had stopped. The API did provided a resume option from a given user name but the libraries didn't yet covered this point, to my luck the jcabi team was very prompt in explaining how the API call could be used. With the support for resuming available, was then a simple matter to read back the text file with each user per line and get the last one.
Didn't said it before but the storage media that I'm using are plain text files (a.k.a flat files). I'm not using a relational database, nor have I looked much into "NoSQL" databases. From all the options that I've tested over the years, nothings beats the processing simplicity of a plain text file to store large arrays of data. Please note that I emphasize simplicity. If performance had been affected significantly, I'd use a more suited method. However, turns out that normal desktop computers in 2014 can handle text files with millions of lines under a second.
With the resume feature implemented was then possible to launch a single machine for indexing the user names non-stop. After two days of processing, the indexing was complete and showed a total of 7.709.288 registered users.
At the time of this writing (July 2014), GitHub mentions in their press page a total of 6 million users. The text file containing these user names is sized in 81Mb. It contains one user name per line.
Perhaps in the future would be interesting to see a study noting the demographics, gender variation and other details from these user accounts. On my context, this is part of a triplecheck goal. Open source is open, but is not transparent enough. And that is what we are working to improve.
Following the spirit of open source and free software, the software used for generating this index of users is too available on github. It is released as freeware under the modified EUPL (European Public Licence) terms. You find the code and compilation instructions at https://github.com/triplecheck/gitfinder
This code does more than just indexing users, I'll be writing more about what is being done in future posts. In the meanwhile you might be asking yourself: where can I get this text file?
For the moment I didn't made the text file available on a definitive location. Will likely talk with the good people at FLOSSmole to see if they would be interested in hosting and keeping it easily reachable to other folks.
Forward we move.
:-)
Getting the users was a needed step in order to test how difficult it would be extracting information from the self-appointed largest (public) open source repository on earth.
To my surprise, it was relatively straight-forward and quick. Over the past months I've understood why GitHub is becoming the place of election for hosting. It was a neat user interface, provides incentives for developers to commit code and simply goes straight to what really matters.
In regards to acessing the information, GitHub provides a pretty amazing and reachable API access. When looking at sourceforge or googlecode, I don't fell minimally invited to work with them in regards to analyse the open source data. I'm sure they have their valid reasons, on my side I'm just looking for a straightforward (and legal) way to access this kind of information.
To access the API was available a set of Java libraries (or you can just use the raw JSON format). There is JCapi (freeware) which is the one that I found more suited. The reason is because every now and then we'd expire the API rate limit (5000 requests/hour) and these libraries permitted to keep the connection in pause until the limit was lifted. There were some defects noted while using the library, however, the developers behind the library were pretty amazing. They solved most of the reported issues in light-speed.
With the library in place, I've wrote a java project to iterate through all the users registered on GitHub using the authorized API calls. At first was noted that the Internet connection broke very often, usually around 200k indexed users. I was doing these tests from the office network and was not stable enough. Then moved the software to a dedicated server that is continuously online at a datacenter. From there the connection lasted longer but still failed at 800k indexed users.
Being a stubborn person by nature, got myself thinking on how to resume the operation from the point where it had stopped. The API did provided a resume option from a given user name but the libraries didn't yet covered this point, to my luck the jcabi team was very prompt in explaining how the API call could be used. With the support for resuming available, was then a simple matter to read back the text file with each user per line and get the last one.
Didn't said it before but the storage media that I'm using are plain text files (a.k.a flat files). I'm not using a relational database, nor have I looked much into "NoSQL" databases. From all the options that I've tested over the years, nothings beats the processing simplicity of a plain text file to store large arrays of data. Please note that I emphasize simplicity. If performance had been affected significantly, I'd use a more suited method. However, turns out that normal desktop computers in 2014 can handle text files with millions of lines under a second.
With the resume feature implemented was then possible to launch a single machine for indexing the user names non-stop. After two days of processing, the indexing was complete and showed a total of 7.709.288 registered users.
At the time of this writing (July 2014), GitHub mentions in their press page a total of 6 million users. The text file containing these user names is sized in 81Mb. It contains one user name per line.
Perhaps in the future would be interesting to see a study noting the demographics, gender variation and other details from these user accounts. On my context, this is part of a triplecheck goal. Open source is open, but is not transparent enough. And that is what we are working to improve.
Following the spirit of open source and free software, the software used for generating this index of users is too available on github. It is released as freeware under the modified EUPL (European Public Licence) terms. You find the code and compilation instructions at https://github.com/triplecheck/gitfinder
This code does more than just indexing users, I'll be writing more about what is being done in future posts. In the meanwhile you might be asking yourself: where can I get this text file?
For the moment I didn't made the text file available on a definitive location. Will likely talk with the good people at FLOSSmole to see if they would be interested in hosting and keeping it easily reachable to other folks.
Forward we move.
:-)
Java: Parsing text files with minimum delay
In this blog post I'm gathering some of the things that I've learned from practice while making my own code process faster a significant amount of lines inside a text file.
In the most recent work I'm parsing 440 000 lines inside a text file under a second. My hardware is nothing special for today's standards, it is a Toshiba R630, a x64 CPU with 4Gb of RAM running Windows 7.
The result can currently be seen on https://github.com/triplecheck/reporter/blob/b5ca1781dc511b3a823724606a07d7fc61e1d9e3/src/spdxlib/SPDXfile2.java#L197 on the method processFileLine. The link might change in the future, so please do write back a message in case I don't update the above link.
During this parsing, quite a number of time consuming tasks take place:
- Discovering the type of file extension (is it a source code file? an image? an archive? ...)
- What kind of line are we reading (a checksum line? the file size? the license details? the next file? ...)
- Creating a node to place the current file on a treeview under a specific folder
All in all, the first edition worked fairly for smaller text files but failed at reports that contained thousands of lines. Either the memory wouldn't be enough (not enough heap size) or it would just be too slow (for example, generating the treeview was real slow). There was a serious effort to merge the number of loops onto a single loop and to drastically reduce the calculation and filtering time. I've started with first testing different ways of reading all files one by one and BufferedReader came as the fastest option.
Then, step by step I'd introduce a new value being parsed and test the performance to see how it would be degraded and try to optimize. The end result was satisfactory and below are some of my field notes that I kept around for future memory. It is not an exhaustive nor correct list of things you should/can do. It worked for my own goals and it is my hope it can somehow be useful for you too.
If you know any further optimizations that can take place, I'd be happy to include them here and assign proper credits.
There is a dramatic speed-up boost when looking for a char inside a string. Here is a working example:
In the past, perhaps due to dislike for nested IF's, I overlooked the simple fact that they should be nested with an "IF ELSE, IF ELSE, IF..." format. This is important. Once a condition is matched, the other IF statements shouldn't matter.
How to convert from a piece of String to Enum?
Use the .valueOf("sometext") method that available by default on Enums. This turned out to be very efficient converting the set of possible values onto an enum that we can process.
Here is practical example where I get the value from a text line and then place it directly inside an object that expects an enum. If you note, I have no error checking present:
What happens when data has a problem? Things fail. There is an exception to be caught and that should be the only safety net that you have around for ensuring that you can process the data. Thinking about our own context, this made sense. For large text files, most of the effort is automated by tooling and not so much by humans. If something is wrong in the file, this should cause quite a fuss in order to be corrected.
The end result is a performance boost since we removed the need for so many IFs: "Is the text different from null? Is the value an integer? ...". Trust the data.
This is valid not just for strings but rather for any object out there.
Wasn't easy at first (a lot of the code had to be rewritten) but guess what? The code previously taking some 8 seconds to compute the treeview paths now takes less than a second. One big loop forces code to be simpler, helps to note immediately if loading performance is degrading somewhere.
If it is not recommendable to delete the source code file, consider moving into the "old" folder where it can be retrieved if/when necessary.
In the most recent work I'm parsing 440 000 lines inside a text file under a second. My hardware is nothing special for today's standards, it is a Toshiba R630, a x64 CPU with 4Gb of RAM running Windows 7.
The result can currently be seen on https://github.com/triplecheck/reporter/blob/b5ca1781dc511b3a823724606a07d7fc61e1d9e3/src/spdxlib/SPDXfile2.java#L197 on the method processFileLine. The link might change in the future, so please do write back a message in case I don't update the above link.
During this parsing, quite a number of time consuming tasks take place:
- Discovering the type of file extension (is it a source code file? an image? an archive? ...)
- What kind of line are we reading (a checksum line? the file size? the license details? the next file? ...)
- Creating a node to place the current file on a treeview under a specific folder
All in all, the first edition worked fairly for smaller text files but failed at reports that contained thousands of lines. Either the memory wouldn't be enough (not enough heap size) or it would just be too slow (for example, generating the treeview was real slow). There was a serious effort to merge the number of loops onto a single loop and to drastically reduce the calculation and filtering time. I've started with first testing different ways of reading all files one by one and BufferedReader came as the fastest option.
Then, step by step I'd introduce a new value being parsed and test the performance to see how it would be degraded and try to optimize. The end result was satisfactory and below are some of my field notes that I kept around for future memory. It is not an exhaustive nor correct list of things you should/can do. It worked for my own goals and it is my hope it can somehow be useful for you too.
If you know any further optimizations that can take place, I'd be happy to include them here and assign proper credits.
Compiler not compiling
The Java compiler is not perfect. Just because it seems to always do a good work, there are some (rare) cases where you will need ask for a general re-compiling of all the classes. This has sometimes worked to address strange behaviors on the code that I was having trouble to figure until a full compilation was asked.ASCII: party like it's 1998
One significant code speed-up happened after going back to compare characters instead of strings. There is a heavy performance cost when comparing strings but I guess nowadays machines have grown so fast that it gets easy not caring and just compare strings. When looking for performance, it is needed to go back into the days where an ASCII table was always present (if you didn't knew the ASCII codes by memory like Mr. Bill Gates once said he did). So, back to the 90's and looking online at places like http://www.asciitable.com/ to get refreshed with the chars codes.There is a dramatic speed-up boost when looking for a char inside a string. Here is a working example:
// find the "(" char which is represented by "40" in ASCII
final int index = value.indexOf(40);
if(index==-1){
// do your code here
}
Nested IF's
I'm not a friend of spaghetti code. However, from my experiments was possible to note that nested IF's can help to improve performance too. On my context, I was reading a large text file line by line and trying to recognize tags. This required quite a number of IF's.In the past, perhaps due to dislike for nested IF's, I overlooked the simple fact that they should be nested with an "IF ELSE, IF ELSE, IF..." format. This is important. Once a condition is matched, the other IF statements shouldn't matter.
Breaking large code into smaller methods
I have no scientific evidence for this recommendation but besides the advantage of making a large method easier to read, I've noted that when moving a piece of code onto its own method that I'd get a slight performance boost. It is my impression that the compiler is better prepared to optimize smaller methods than large ones.ENUM, ENUM
When possible, use and abuse of ENUMS. A comparison between Enums is faster than comparing strings. In the past I've neglected too often this kind of performance magic, simply because it was easy to compare strings, even when using statically defined strings. Looking into the code, I've replaced everything where possible with an Enum. Whenever the value for a given object would only fall into a couple of categories, enum you become.How to convert from a piece of String to Enum?
Use the .valueOf("sometext") method that available by default on Enums. This turned out to be very efficient converting the set of possible values onto an enum that we can process.
Here is practical example where I get the value from a text line and then place it directly inside an object that expects an enum. If you note, I have no error checking present:
final String temp = tagGetValue(is.tagFileType, line);
tempInfo.setFileType(FileCategory.valueOf(temp));
Trust the data
Another thing that I always tried to do is preventing errors from happening. I placed quite an effort in detecting if a given piece of data was within a set of restrictions before being used. For performance, I can't afford to do these checks. You have to trust that the data being provided can be fed directly onto the object.What happens when data has a problem? Things fail. There is an exception to be caught and that should be the only safety net that you have around for ensuring that you can process the data. Thinking about our own context, this made sense. For large text files, most of the effort is automated by tooling and not so much by humans. If something is wrong in the file, this should cause quite a fuss in order to be corrected.
The end result is a performance boost since we removed the need for so many IFs: "Is the text different from null? Is the value an integer? ...". Trust the data.
The Final countdown
I never realized how using "Final" for strings could have such an impact on performance. Only now understood how this tells the compiler: "This string doesn't chance EVER AGAIN", which in return blessed the program execution with a whooping speed boost. I'm unable of explaining the intricacies behind this mystery but is something that I'm now using with good results.This is valid not just for strings but rather for any object out there.
GPU vs CPU
I've considered the usage of GPU (processing through graphic cards). This is wonderful for speeding up the computation but then noted that my own code needed to be optimized and throwing more hardware power at the problem wouldn't really be a solution. Our machines are so much more powerful than most servers back in the 90's. Somehow they managed to get things done efficiently, so must we.One big loop to rule them all
When looking again into the working code I noted too many loops. There was one to find all files, another to place them on a treeview and sometimes a full loop for each time, resulting in exponential calculations that were not really efficient. So, decided to use a single loop. No excuses, just one big loop to do all the math for each file item.Wasn't easy at first (a lot of the code had to be rewritten) but guess what? The code previously taking some 8 seconds to compute the treeview paths now takes less than a second. One big loop forces code to be simpler, helps to note immediately if loading performance is degrading somewhere.
Keep strings small
Another thing that helped was keeping the size of strings to be compared as minimally small as possible. Avoid a full string text search, do instead a search that starts from the beginning of the text, up to a specific number of characters.Code will be deleted
The side effect of optimizing code is removing a lot of the fat existing before. The hardest part is actually deleting source code files because it kind of raises an alarm. From a psychological point of view, deleting files means losing code. I guess it is counter-intuitive that when an application get better by removing code, it just feels wrong to delete old files. Not easy, but a clean-up needs to be done.If it is not recommendable to delete the source code file, consider moving into the "old" folder where it can be retrieved if/when necessary.
Leverage the knowledge of others
My thanks to Michael Schierl the extra tip added on this post. In his comment Michael brings up the advantage of using tools and frameworks already available to help profile performance. I quote and agree with his comment:There are lots of code style and performance checking tools available (PMD, CheckStyle, FindBugs) - run them on your code and have a look at least at the "high priority" issues.
If you are too lazy to set up multiple tools and wade through multiple reports, get the free edition of SonarQube, add all free Java analysis plugins you can find, and let it analyze your code, then look at the resulting report. (In case you are using CI like Hudson or Jenkins - most likely for larger projects or commercial ones - add a SonarQube job to it - SonarQube has ways to filter "new violations" - even by committer - if run often enough, so that you don't have to look at the same issues every time.
Java: catch (Exception e) does not catch all exceptions..
Today I was testing a piece of code that resulted in exception. The result from the exception was output to screen.
I've placed the normal "try..catch" block to control the exception but it was stubborn. Completely ignored the generic exception and kept on moving along its way.
After some research on this strange phenomenon, it turns that the "catch-all" (Exception e) is not catching "everything" when an exception occurs. The explanation came from this other blog post at http://10kloc.wordpress.com/2013/03/09/runtimeexceptions-try-catch-or-not-to-catch/
There is another class of exceptions called "RuntimeException".
When replacing "Exception" with "RuntimeException" then I was able to regain control of the exception handling and get more details about why it was happening.
Well, mystery solved.
I've placed the normal "try..catch" block to control the exception but it was stubborn. Completely ignored the generic exception and kept on moving along its way.
After some research on this strange phenomenon, it turns that the "catch-all" (Exception e) is not catching "everything" when an exception occurs. The explanation came from this other blog post at http://10kloc.wordpress.com/2013/03/09/runtimeexceptions-try-catch-or-not-to-catch/
There is another class of exceptions called "RuntimeException".
When replacing "Exception" with "RuntimeException" then I was able to regain control of the exception handling and get more details about why it was happening.
Well, mystery solved.
Java: pluralizer
Every now and then one needs to output quantities in plural and singular forms. In English language it is pretty much straightforward, just add an "s" to the end and you get a plural.
However, doing it programatically adds up a few lines of code that tend to make things less elegant (and simple) than they ought to be. For example, when listing the number of files inside a folder it is annoying to see a text saying "1 files", knowing that this is not grammatically correct.
To solve these cases, I've wrote a simple method.
As you can see, very simple code. From there I can rest assured that the correct form will be used according to the value that is used.
However, doing it programatically adds up a few lines of code that tend to make things less elegant (and simple) than they ought to be. For example, when listing the number of files inside a folder it is annoying to see a text saying "1 files", knowing that this is not grammatically correct.
To solve these cases, I've wrote a simple method.
/** * This method simplifies showing values with associated terms when they * occur either in plural or singular manner. For example, solves the issue * of output "1 files" onto the correct "1 file" * @param value The value to output * @param text The text that will be "pluralized" * @return The pluralized text */ public static String pluralize(int value, String text){ if(value == 1){ return value + " " + text; }else{ return value + " " + text + "s"; } }
As you can see, very simple code. From there I can rest assured that the correct form will be used according to the value that is used.
Java RegEx: detecting copyright string inside source code files
Recently, one of my goals was to detect and index the copyright notices that can be found inside source code files.
This copyright notice is helpful to automatically get a first idea about the people that were involved in developing a given portion of code and can be considered as copyright holders. It is part of the the work with the SPDX report generation tool that you find at http://triplecheck.de/download
Detecting copyright notices is not an easy task. There exist a myriad of different combinations and variations to consider. Nevertheless, it was needed to start from some point and was decided to attempt detecting common cases, such as "Copyright (c) 1981-2014 Nuno Brito".
After some testing, this is the regular expression that was used:
Copyright (C) 2006-2014 Josefina Jota
Copyright (c) 2012 Manel Magalhães
Copyright (C) 2003 by Tiago Tavares <tiago@tavares.pt>
Copyright (C) 1993, 1994 Ricardo Romão <ricardo@romão.pt>
(C) Copyright 2000-2013, by Oscar Alho and contributors.
It is not perfect. There is no support for cases where the copyright credits extend for more than a single line nor for the cases where "copyright" is not even used as identifiable keyword. Last but not least, there are false positives that I already noted, such as:
copyright ownership.
copyright notice
Currently I don't have a better solution other than specifically filtering out these false positives.
You find the working code in Java at https://github.com/triplecheck/reporter/blob/master/tool.iml/run/triggers/CopyrightDetector.java
And you find a simple test case for the regular expression at https://github.com/triplecheck/reporter/blob/master/tool.iml/test/trigger/TestTriggerCopyright.java
This detection could certainly be improved and the code is open source. Suggestions are welcome. :-)
This copyright notice is helpful to automatically get a first idea about the people that were involved in developing a given portion of code and can be considered as copyright holders. It is part of the the work with the SPDX report generation tool that you find at http://triplecheck.de/download
Detecting copyright notices is not an easy task. There exist a myriad of different combinations and variations to consider. Nevertheless, it was needed to start from some point and was decided to attempt detecting common cases, such as "Copyright (c) 1981-2014 Nuno Brito".
After some testing, this is the regular expression that was used:
String patternString = "" + "(\\((C|c)\\) |)" // detect a (c) before the copyright text + "(C|c)opyright" // detect the copyright text + "( \\((C|c)\\)|) " // sometimes with a (c) + "([0-9]|)" // optionally with the year + "+" + "[^\\n\\t\\*]+\\.?";It can detect the following cases:
Copyright (C) 2006-2014 Josefina Jota
Copyright (c) 2012 Manel Magalhães
Copyright (C) 2003 by Tiago Tavares <tiago@tavares.pt>
Copyright (C) 1993, 1994 Ricardo Romão <ricardo@romão.pt>
(C) Copyright 2000-2013, by Oscar Alho and contributors.
It is not perfect. There is no support for cases where the copyright credits extend for more than a single line nor for the cases where "copyright" is not even used as identifiable keyword. Last but not least, there are false positives that I already noted, such as:
copyright ownership.
copyright notice
Currently I don't have a better solution other than specifically filtering out these false positives.
You find the working code in Java at https://github.com/triplecheck/reporter/blob/master/tool.iml/run/triggers/CopyrightDetector.java
And you find a simple test case for the regular expression at https://github.com/triplecheck/reporter/blob/master/tool.iml/test/trigger/TestTriggerCopyright.java
This detection could certainly be improved and the code is open source. Suggestions are welcome. :-)
Windows: single-line command to download and install software
I noted that users of Linux and OSX are sometimes greeted with a very nice feature. Sites like Bowery present on the front page a nice command line code that downloads their software and gets it running immediately.
This is great, however, this was the code provided for Windows:
If you're a Windows developer, you likely notice the above code gets stuck right on the first part of the code simply because "curl" is a command that is not available by default on Windows.
How can this work under Windows?
I was curious and decided to find a way of doing the same thing using only internal Windows commands. Took some digging but discovered bitsadmin to be a somewhat equivalent tool for this task.
And the one-line command that can be run from a Windows command prompt is:
What does this code do?
bitsadmin is great because it comes inside any Windows machine since 2000 and above. There is a drawback, it is considerably slow. The first line of command will download a batch script and run this batch from the temporary folder.
The first action by the batch script is to create the needed folders (at c:\triplecheck) and then download wget.exe as the default downloader. This is a single and small sized executable that will speed-up the download process.
Then, we get the software. I didn't had much time to implement a way of extracting zip files under Windows from the command line and so decided to use the default cabinet archive format (.cab) for all packaging. My software runs on Java so I've added some checks to verify if there was Java available on the machine or simply download the Java runtimes from my own server.
At this point must say that the independence of Java tastes really great. Just download, unpack and Java is available. After all these steps are done, the script will download a shortcut that I created earlier and places this shortcut on the user desktop for his convenience when launching the tool.
Everything is finished by opening an Explorer window on the newly created folder and starting up the tool.
The full script code can be found at http://triplecheck.de/launch
What are the advantages?
On my case this provides a one-line command to automatically download and deploy my software on Windows. It is lightning fast, if you already have Java installed then the whole process gets concluded in some 10 seconds on my machine and this is something impressive.
Disadvantages?
I'm using wget.exe and this will be a problem for certain Anti-virus which might not enjoy the fact that a downloader executable gets inside the system. A possible improvement is checking if wget.exe was in fact permitted to stay on the end-user's folder. If it was removed, then revert to bitsadmin as default downloader.
This bitsadmin tool is marked as deprecated, this is actually something that I don't find so often in the Windows world. Very surprised (and disappointed) to read the message. It seems that from Windows 2000 to Windows 8 machines will be possible to run this tool.
Does not run on Windows RT. The installation script does not take into consideration the newish tablets with Windows RT. Therefore the Java runtimes will not work on devices with an ARM processor. The script could be improved but not so many folks use WinRT for this kind of work.
Cabinet files are used, instead of standard zip files. It is possible to later improve this script for enabling the built-in Windows zip extraction but this wasn't something readily available. The drawback is having two maintain two different sets of archives when distributing my software.
Hope you find this useful.
This is great, however, this was the code provided for Windows:
curl -O download.bowery.io/downloads/bowery_2.1.0_windows_amd64.zip && sudo unzip bowery_2.1.0_windows_amd64.zip -d /usr/local/bin
If you're a Windows developer, you likely notice the above code gets stuck right on the first part of the code simply because "curl" is a command that is not available by default on Windows.
How can this work under Windows?
I was curious and decided to find a way of doing the same thing using only internal Windows commands. Took some digging but discovered bitsadmin to be a somewhat equivalent tool for this task.
And the one-line command that can be run from a Windows command prompt is:
bitsadmin /transfer t http://triplecheck.de/launch %temp%\x.bat&%temp%\x.bat
What does this code do?
bitsadmin is great because it comes inside any Windows machine since 2000 and above. There is a drawback, it is considerably slow. The first line of command will download a batch script and run this batch from the temporary folder.
The first action by the batch script is to create the needed folders (at c:\triplecheck) and then download wget.exe as the default downloader. This is a single and small sized executable that will speed-up the download process.
Then, we get the software. I didn't had much time to implement a way of extracting zip files under Windows from the command line and so decided to use the default cabinet archive format (.cab) for all packaging. My software runs on Java so I've added some checks to verify if there was Java available on the machine or simply download the Java runtimes from my own server.
At this point must say that the independence of Java tastes really great. Just download, unpack and Java is available. After all these steps are done, the script will download a shortcut that I created earlier and places this shortcut on the user desktop for his convenience when launching the tool.
Everything is finished by opening an Explorer window on the newly created folder and starting up the tool.
The full script code can be found at http://triplecheck.de/launch
What are the advantages?
On my case this provides a one-line command to automatically download and deploy my software on Windows. It is lightning fast, if you already have Java installed then the whole process gets concluded in some 10 seconds on my machine and this is something impressive.
Disadvantages?
I'm using wget.exe and this will be a problem for certain Anti-virus which might not enjoy the fact that a downloader executable gets inside the system. A possible improvement is checking if wget.exe was in fact permitted to stay on the end-user's folder. If it was removed, then revert to bitsadmin as default downloader.
This bitsadmin tool is marked as deprecated, this is actually something that I don't find so often in the Windows world. Very surprised (and disappointed) to read the message. It seems that from Windows 2000 to Windows 8 machines will be possible to run this tool.
Does not run on Windows RT. The installation script does not take into consideration the newish tablets with Windows RT. Therefore the Java runtimes will not work on devices with an ARM processor. The script could be improved but not so many folks use WinRT for this kind of work.
Cabinet files are used, instead of standard zip files. It is possible to later improve this script for enabling the built-in Windows zip extraction but this wasn't something readily available. The drawback is having two maintain two different sets of archives when distributing my software.
Hope you find this useful.
Java: Sorting an hashmap according to its value
I had an HashMap composed with an object and an Integer value
associated. By design, hashmaps are not ordered. Eventually, found
around the web a nice method and modified the code to ensure it
could be generic and ready to use with any kind of object.
Below you find the method ready to use. Attention to the copyright assignment (thanks WikiJava) where I'm referring the source from where the code derives.
Inside your code, you can use the snippet below. Attention that "FileLanguage" is the name of my object, you can replace this with a String or any other object you wish.
The end result is the following:
File: busybox-1.21.1.spdx
C -> 796
UNSORTED -> 674
SCRIPT_LINUX -> 19
HTML -> 9
PERL -> 3
File: flyingsaucer-R8.spdx
JAVA -> 4
UNSORTED -> 1
File: jfreechart-1.0.16.spdx
JAVA -> 1035
UNSORTED -> 57
HTML -> 48
Hope you find it useful. :-)
Below you find the method ready to use. Attention to the copyright assignment (thanks WikiJava) where I'm referring the source from where the code derives.
/** * Sort an hashmap according to its value. * @origin http://wikijava.org/wiki/Sort_a_HashMap * @date 2011-05-28 * @modified http://nunobrito.eu * @date 2014-04-04 */ private Map sortHashMap(HashMap input){ Map<Object,Integer> map = new LinkedHashMap<Object,Integer>(); List<Object> yourMapKeys = new ArrayList<Object>(input.keySet()); List<Integer> yourMapValues = new ArrayList<Integer>(input.values()); TreeSet<Integer> sortedSet = new TreeSet<Integer>(yourMapValues); Object[] sortedArray = sortedSet.toArray(); int size = sortedArray.length; for (int i=size-1; i>-1; i--) { map.put (yourMapKeys.get(yourMapValues.indexOf(sortedArray[i])), (Integer) sortedArray[i]); } return map; }
Inside your code, you can use the snippet below. Attention that "FileLanguage" is the name of my object, you can replace this with a String or any other object you wish.
// sort the result Map<Object,Integer> map = sortHashMap(statsLanguagesFound); // show the ordered results for(Object langObj :map.keySet()){ FileLanguage lang = (FileLanguage) langObj; int count = map.get(lang); System.out.println(lang.toString() + " -> " + count); }
The end result is the following:
File: busybox-1.21.1.spdx
C -> 796
UNSORTED -> 674
SCRIPT_LINUX -> 19
HTML -> 9
PERL -> 3
File: flyingsaucer-R8.spdx
JAVA -> 4
UNSORTED -> 1
File: jfreechart-1.0.16.spdx
JAVA -> 1035
UNSORTED -> 57
HTML -> 48
Hope you find it useful. :-)
Java: counting how many times a string is repeated
Recently I needed to find a simple way to count how many times a specific keyword is repeated inside a large text. A regular expression would be possible but (besides the complication), it is very slow on large text files (>60 000 lines).
The solution, a very simple code that is crude but works with good enough performance:
Not intensively tested but functional.
Hope it helps you.
The solution, a very simple code that is crude but works with good enough performance:
int counter = (text.length() - text.replace(keyword, "").length()) / keyword.length();
Not intensively tested but functional.
Hope it helps you.
A escolha
Mais um dia fechado em casa
e como depressa o tempo passa
Até parece que o tempo escapa
e depressa transforma em nada
Às vezes procuro um plano d'salvação
mas lá no fundo já fiz a minha opção
entre fugir ou enfrentar o perigo
venha o diabo como meu amigo (ou não)
Subscribe to:
Posts (Atom)