brito: July 2012

reWIM ready for the public. Create WIM archives from Java

It was the year 2007 when Windows® Vista was released, and along with it came the dreaded WIM archive format.

To this day I can only wonder why Microsoft, in its infinite wisdom, decided to launch yet another file format for archiving files. Of course I can imagine the reason: this new archive format was only intended to be manipulated with MS tools, and five years passed without an efficient way of replicating these archives without them.

In the meantime, all the fans of Windows customization have been longing for a better way of doing things. Using the tools provided by Microsoft is fine and dandy but, let's face it, there is a lot more that can be tweaked, and there is no plausible reason to keep this format so closed when it would be even more in Microsoft's interest to have more people playing with Windows nowadays.

So, some years ago I decided to work on a way of changing this situation. I don't like installing Windows drivers to handle something that is nothing more than a glorified zip file. More friends joined the fun along the way, and today we are finally able to present an independent tool: reWIM.

Nowadays there are more tools from other developers that perform similar functions without recourse to Microsoft drivers, so this project aims to bring not only the originally intended functionality but also some nifty new features along the way.

For example:

It is available in pure Java. This means that it will run on any mainstream operating system without worries
It has a nice graphical interface and can also be used from the command line
It was specially tailored for Windows PE boot disks, meaning that it allows custom XML data and optimizations that make these archives load faster
Last but not least, it is multi-threaded. It runs several tasks in parallel that compress the data and create archives faster, especially on multi-core machines.
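
To give an idea of what the multi-threaded part can look like in Java, here is a minimal sketch. It is not the reWIM code: the class name, the chunk size and the use of DEFLATE from java.util.zip (instead of the compression used inside WIM archives) are all assumptions made for illustration. It only shows the general idea of splitting data into chunks and compressing them on a thread pool while keeping the original order.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch, not reWIM's implementation.
final class ParallelChunkCompressor {
    // Assumed chunk size, chosen only for this example.
    static final int CHUNK_SIZE = 32 * 1024;

    // Compresses one slice of the input with DEFLATE (a stand-in codec).
    static byte[] deflate(byte[] data, int offset, int length) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try {
            deflater.setInput(data, offset, length);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            while (!deflater.finished()) {
                out.write(buffer, 0, deflater.deflate(buffer));
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    // Splits the input into chunks, compresses them on a thread pool
    // and returns the compressed chunks in their original order.
    static List<byte[]> compress(byte[] data, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> pending = new ArrayList<>();
            for (int off = 0; off < data.length; off += CHUNK_SIZE) {
                final int offset = off;
                final int length = Math.min(CHUNK_SIZE, data.length - off);
                Callable<byte[]> task = () -> deflate(data, offset, length);
                pending.add(pool.submit(task));
            }
            List<byte[]> chunks = new ArrayList<>();
            for (Future<byte[]> future : pending) {
                chunks.add(future.get()); // waits for that chunk to finish
            }
            return chunks;
        } catch (Exception e) {
            throw new IllegalStateException("compression failed", e);
        } finally {
            pool.shutdown();
        }
    }

    // Decompresses one chunk; used here only to check the round trip.
    static byte[] inflate(byte[] chunk) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(chunk);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated input, stop instead of looping forever
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt chunk", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[1_000_000]; // all zeros, compresses very well
        int threads = Runtime.getRuntime().availableProcessors();
        List<byte[]> chunks = compress(data, threads);
        System.out.println("chunks: " + chunks.size());
    }
}
```

Each chunk is compressed independently, so the work spreads across the available cores, and collecting the results in submission order means the chunks can be written to the archive sequentially afterwards.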

I am happy. It was a long road to get this far, and the tool is finally available for everyone to enjoy completely free of charge. Some will always find a reason to argue and point out flaws. As far as I'm concerned, the mission was accomplished: we create archives without admin rights, drivers or silly workarounds.

Have fun!

MySQL as option for large scale file repository

Further improvements were made. There was interest in getting files onto the server so that later it would be possible to perform checksum analysis on them.

Initially I was thinking of storing all the files in the file system. Some time ago I had already tried to store a few hundred thousand files under the same folder, and the result was bad. After reading more about the matter, I saw that I could use a more suitable file system such as XFS, or go all the way with Hadoop on top of the existing file systems to compute hashes over the existing repository.

Both options looked good, but turned out to be disappointing at the implementation stage. Setting up XFS on a remote server would cause downtime and bring the risk of a disk space shortage, as I would have to set aside a large portion of the EXT3 file system.

Running Hadoop seemed nice from all the papers and articles that I was reading. However, it would force all files to be placed inside a clustered file system that would still be out of reach or require specific interaction. At this point I was happy with neither XFS nor Hadoop.

So, since MySQL is performing so well, why not give database storage a go?

At first I would have been quick to claim that file systems are faster than databases any day of the week. However, what is the use of speed if you then lose time (and hair) trying to traverse all the files quickly?

I decided to give MySQL a chance and started uploading files directly to the database. The result: I am happy!

This way I am adding all the files to a normalized table and can run queries to select files added on a specific date, or with a specific size or MIME type. Above all, I don't have to deal with a special file system and can keep all the data together.
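
As a rough illustration of that idea, here is a minimal Java sketch. Everything in it is an assumption for the sake of the example and not the schema actually used on the server: the table name files, its columns, the choice of SHA-1 as checksum and the class and method names are all hypothetical. It shows the kind of table and queries I mean, plus how a file could be bound to a JDBC insert.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical sketch: table layout and names are assumptions.
final class FileRepositorySketch {
    // One row per file, with the metadata we want to query on.
    static final String CREATE_TABLE =
        "CREATE TABLE IF NOT EXISTS files ("
        + " id BIGINT AUTO_INCREMENT PRIMARY KEY,"
        + " name VARCHAR(255) NOT NULL,"
        + " size_bytes BIGINT NOT NULL,"
        + " mime_type VARCHAR(127),"
        + " sha1 CHAR(40) NOT NULL,"
        + " added_on DATE NOT NULL,"
        + " content LONGBLOB NOT NULL,"
        + " INDEX idx_added (added_on),"
        + " INDEX idx_mime (mime_type),"
        + " INDEX idx_sha1 (sha1))";

    static final String INSERT_FILE =
        "INSERT INTO files (name, size_bytes, mime_type, sha1, added_on, content)"
        + " VALUES (?, ?, ?, ?, CURRENT_DATE, ?)";

    // Example query: files added on a given date with a given MIME type.
    static final String SELECT_BY_DATE_AND_MIME =
        "SELECT id, name, size_bytes FROM files"
        + " WHERE added_on = ? AND mime_type = ?";

    // Returns the SHA-1 checksum of the content as lowercase hex.
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Fills the parameters of INSERT_FILE for one file on disk.
    // The statement must come from a real MySQL connection, not shown here.
    static void bindInsert(PreparedStatement ps, Path file)
            throws IOException, SQLException {
        byte[] content = Files.readAllBytes(file);
        ps.setString(1, file.getFileName().toString());
        ps.setLong(2, content.length);
        ps.setString(3, Files.probeContentType(file)); // may be null
        ps.setString(4, sha1Hex(content));
        ps.setBytes(5, content);
    }
}
```

With the checksum stored next to the content, looking up a file by hash or pulling together a collection by date or type is a single indexed query instead of a walk through folders.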

Not everything is perfect, for sure. One might complain about the need to extract files from MySQL to somewhere else in order to apply an algorithm. However, think of it this way: now we can also connect directly to the database, and it will provide results faster than a typical file system. On top of that, we don't have to manage two systems, just one that can one day be clustered away in some cloud environment if desired.

There are tradeoffs for sure, but so far I am happy with this decision. If later we need a file system, we can still have one. If we want specific collections of files, it will be a breeze to put them together using MySQL.

Take care!

filename.pro

I'm proud to announce the opening of http://filename.pro/

This platform is my effort to make a security platform available. With this site it becomes possible to query file names and hashes, looking for more information about them from other sources.

It is not a grand opening; it is a work in progress, meaning that I will be adding more features and improving the site's usability as time allows. One of my goals is integration with ninjapendisk, opening the possibility for users to report malicious files and names.

For the moment, I am really happy with the current result. Today it was possible to improve MySQL performance to the point where one can run a query over 25 million records and get the result in under a second.

For the future, additional hashing algorithms are planned for inclusion. At the moment ssdeep is already confirmed to be possible both on the server (PHP/MySQL) and on the client side (Java), and I am still learning more about approaches that can be included next.
