Back in December 2006, I ran a series of tests on Java-based file type detection. At the time, I was researching digital asset management systems and, in particular, the possibility of open source full-text search and semantic analytics.
Fast-forward to June 2011 and ReadWriteWeb’s “Head to Head Comparison of Text Extraction Algorithms.” It is amazing to look back at my old research and see which of the tools available now would have been part of the evaluation process. I also like Tomaž Kovačič’s thorough explanation of his testing methods and results.