I have coded a method in Java that will recursively iterate through all folders and respective sub-folders on disk.
The code seemed to work as expected, but whenever it reached a certain folder it would simply throw a StackOverflowError.
Now, what is this stack overflow all about? Googling around, it seems to occur when a method keeps calling itself without ever returning, until the call stack runs out of space.
This occurred while crawling sub-folders, so even though my hard drive is quite filled up, it's not exactly filled with folders up to infinity.
So, what is wrong with this picture?
It turns out that this method is vulnerable to badly formed symbolic links. Whenever a link is found that points back to a folder closer to the root (one of its own parent folders), the crawler jumps back to that folder, walks down to the link again and loops ad infinitum.
I lost plenty of time trying to keep symbolic links from being crawled, but to no avail. I even reached the point of resolving the absolute path of each link and indexing every path in a database, checking one by one whether it had already been visited, to avoid these annoying loops (and losing a lot of performance in the process).
Fortunately, I found the solution to this riddle on this website - http://leepoint.net/notes-java/io/10file/20recursivelist.html
Although the code there is simple, it provides a very efficient way to deal with the "symbolic-link-limbo". All you need to do is define a maximum depth level. They mention 20 as the default value and I applied the same concept in my code.
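To illustrate the depth limit, here is a minimal sketch of the idea. It is not the code from the page above: the class name DepthLimitedCrawler, the crawl method and its parameters are made up for this example, and 20 is simply the value mentioned there.

```java
import java.io.File;
import java.util.List;

// Hypothetical helper for illustration only, not the original code.
class DepthLimitedCrawler {

    // Depth limit suggested by the referenced page.
    static final int DEFAULT_MAX_DEPTH = 20;

    // Collects regular files below dir, never descending deeper than maxDepth.
    // The starting folder counts as depth 0.
    static void crawl(File dir, int depth, int maxDepth, List<File> found) {
        if (depth > maxDepth) {
            return; // too deep: stop here instead of recursing forever
        }
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory, or it could not be read
        }
        for (File child : children) {
            if (child.isDirectory()) {
                crawl(child, depth + 1, maxDepth, found);
            } else {
                found.add(child);
            }
        }
    }
}
```

Calling crawl(new File("/some/folder"), 0, DepthLimitedCrawler.DEFAULT_MAX_DEPTH, list) fills the list with every file found up to 20 levels down. A looping link can still be followed a few times, but the recursion is guaranteed to end.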
The second piece of the puzzle is verifying, for each new directory that you want to crawl, that its absolute path and its canonical path match. If they differ, the directory is being reached through a symbolic link and should be skipped. I've used the code from the following page as inspiration: http://www.idiom.com/~zilla/Xfiles/javasymlinks.html
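The path comparison can be sketched roughly as follows. This is my own illustration of the idea, not a copy of the code from that page. The names SymlinkAwareCrawler, isSymlink and crawl are invented for the example, and it assumes a Unix-like file system where a canonical path differs from the absolute path only because of links.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

// Hypothetical helper for illustration only, not the original code.
class SymlinkAwareCrawler {

    // Returns true when the last element of the path is a symbolic link.
    // The parent is canonicalized first, so links higher up in the path
    // (for example in the location of a temp folder) do not count.
    static boolean isSymlink(File file) throws IOException {
        File parent = file.getParentFile();
        File candidate = (parent == null)
                ? file
                : new File(parent.getCanonicalFile(), file.getName());
        return !candidate.getCanonicalFile().equals(candidate.getAbsoluteFile());
    }

    // Depth-limited crawl that skips every symbolic link it meets.
    static void crawl(File dir, int depth, int maxDepth, List<File> found)
            throws IOException {
        if (depth > maxDepth) {
            return;
        }
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory, or it could not be read
        }
        for (File child : children) {
            if (isSymlink(child)) {
                continue; // do not follow links, they may loop back
            }
            if (child.isDirectory()) {
                crawl(child, depth + 1, maxDepth, found);
            } else {
                found.add(child);
            }
        }
    }
}
```

On Java 7 or newer, java.nio.file.Files.isSymbolicLink(file.toPath()) is a built-in alternative to the hand-written check.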
After applying these changes, the code WORKED LIKE A CHARM!
Now the same method is indexing all folders up to 20 levels deep and working as expected, without any more stack overflow messages. On a MacBook Pro it can index over 600 000 files in under 16 minutes using less than 10 MB of heap space.
:)