Java RegEx: detecting copyright strings inside source code files

Recently, one of my goals was to detect and index the copyright notices that can be found inside source code files.

A copyright notice is a helpful way to automatically get a first idea of the people who were involved in developing a given portion of code and who can be considered copyright holders. This is part of the work on the SPDX report generation tool that you can find at http://triplecheck.de/download

Detecting copyright notices is not an easy task. There is a myriad of combinations and variations to consider. Nevertheless, I needed to start somewhere and decided to attempt detecting the common cases, such as "Copyright (c) 1981-2014 Nuno Brito".

After some testing, this is the regular expression that was used:

String patternString = ""
             + "(\\((C|c)\\) |)"    // optional "(c) " before the copyright text
             + "(C|c)opyright"      // the copyright text itself
             + "( \\((C|c)\\)|) "   // sometimes followed by " (c)", then a mandatory space
             + "([0-9]|)"           // a digit of the year, or nothing
             + "+"                  // repeated, so the year digits are optional
             + "[^\\n\\t\\*]+\\.?"; // the rest, up to a newline, tab or asterisk

It can detect the following cases:
Copyright (C) 2006-2014 Josefina Jota
Copyright (c) 2012 Manel Magalhães
Copyright (C) 2003 by Tiago Tavares <tiago@tavares.pt>
Copyright (C) 1993, 1994 Ricardo Romão <ricardo@romão.pt>
(C) Copyright 2000-2013, by Oscar Alho and contributors.
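
To see the expression at work on a file header, here is a minimal sketch that compiles it with java.util.regex and collects every match from a block of text. The class name CopyrightRegexDemo and the method findNotices are made up for this example and are not taken from the triplecheck code; only the expression itself comes from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the triplecheck reporter.
class CopyrightRegexDemo {

    // Same expression as in the post, compiled once and reused.
    static final Pattern COPYRIGHT = Pattern.compile(""
            + "(\\((C|c)\\) |)"
            + "(C|c)opyright"
            + "( \\((C|c)\\)|) "
            + "([0-9]|)"
            + "+"
            + "[^\\n\\t\\*]+\\.?");

    // Returns every match found in the text, in order of appearance,
    // with surrounding whitespace trimmed.
    static List<String> findNotices(String text) {
        List<String> notices = new ArrayList<>();
        Matcher matcher = COPYRIGHT.matcher(text);
        while (matcher.find()) {
            notices.add(matcher.group().trim());
        }
        return notices;
    }

    public static void main(String[] args) {
        String source = "/*\n"
                + " * Copyright (C) 2006-2014 Josefina Jota\n"
                + " * (C) Copyright 2000-2013, by Oscar Alho and contributors.\n"
                + " */\n"
                + "int x = 1;\n";
        for (String notice : findNotices(source)) {
            System.out.println(notice);
        }
    }
}
```

Because the last part of the expression stops at an asterisk, a tab or a newline, each match starts at the keyword and ends at the end of its line, so the leading " * " of the comment is not included.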

It is not perfect. There is no support for copyright credits that extend over more than a single line, nor for cases where "copyright" is not used as an identifiable keyword at all. Last but not least, there are false positives that I have already noticed, such as:
copyright ownership.
copyright notice


Currently I don't have a better solution than specifically filtering out these false positives.
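
One way such a filter could look is a small stop list that is checked after the regular expression has matched. This is only a sketch under my own assumptions: the class CopyrightFalsePositiveFilter, the method isLikelyNotice and the prefix comparison are hypothetical, and the list holds just the two phrases mentioned above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the triplecheck reporter.
class CopyrightFalsePositiveFilter {

    // Phrases that the expression matches but that are not copyright notices.
    static final List<String> KNOWN_FALSE_POSITIVES = Arrays.asList(
            "copyright ownership",
            "copyright notice");

    // Returns false when the match starts with a known false positive,
    // ignoring case and surrounding whitespace; true otherwise.
    static boolean isLikelyNotice(String match) {
        String normalized = match.trim().toLowerCase(Locale.ROOT);
        for (String phrase : KNOWN_FALSE_POSITIVES) {
            if (normalized.startsWith(phrase)) {
                return false;
            }
        }
        return true;
    }
}
```

The comparison is done on the start of the match because the expression captures everything up to the end of the line, so a false positive usually carries extra text after the phrase. The drawback is that every new false positive has to be added to the list by hand.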

You can find the working Java code at https://github.com/triplecheck/reporter/blob/master/tool.iml/run/triggers/CopyrightDetector.java

And you can find a simple test case for the regular expression at https://github.com/triplecheck/reporter/blob/master/tool.iml/test/trigger/TestTriggerCopyright.java

This detection could certainly be improved and the code is open source. Suggestions are welcome. :-)





