Ferret is a copy-detection program that locates duplicated text or code across a set of text documents or source files. It is designed to detect copying (collusion) within a given set of files, which it does by looking for trigrams (sequences of three consecutive words or tokens) that the files have in common.
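The trigram idea can be illustrated with a short Python sketch. This is not Ferret's own code. The tokenisation (lower-cased runs of word characters), the use of the Jaccard coefficient as the similarity score, and the names `trigrams`, `resemblance` and `rank_pairs` are all assumptions made for this example.

```python
import re
from itertools import combinations


def trigrams(text):
    """Return the set of word trigrams in text (lower-cased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    # A text with fewer than three words yields an empty set.
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def resemblance(a, b):
    """Jaccard coefficient of two trigram sets (assumed measure, not taken from Ferret)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_pairs(documents):
    """Score every pair of documents; return (score, name1, name2), highest score first."""
    sets = {name: trigrams(text) for name, text in documents.items()}
    scores = [
        (resemblance(sets[x], sets[y]), x, y)
        for x, y in combinations(sorted(sets), 2)
    ]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    # Hypothetical file names and contents, for illustration only.
    docs = {
        "a.txt": "the quick brown fox jumps over the lazy dog",
        "b.txt": "a quick brown fox jumps over a sleeping cat",
        "c.txt": "completely unrelated words appear in this file",
    }
    for score, x, y in rank_pairs(docs):
        print(f"{x} {y} {score:.2f}")
```

Pairs that share many trigrams rise to the top of the ranking, so a reviewer can inspect the most suspicious pairs first. For source code, the regular expression would need to be replaced by a tokeniser suited to the programming language.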

Many people have been involved in the development of Ferret, including Caroline Lyon, James Malcolm, Robert Dickerson, and Jun-Peng Bao. The program has been extended to work with the Chinese language, and Pam Green has shown how to extract more precise information from Ferret’s detailed output.

My contribution was to implement a distributable version; see Ferret.

Main publications:

  1. P.D. Green, P.C.R. Lane, A.W. Rainer and S. Scholz, ‘Analysing Ferret XML reports to estimate the density of copied code’, Technical Report 501, Science and Technology Research Institute, University of Hertfordshire, 2010.
  2. P.D. Green, P.C.R. Lane, A.W. Rainer and S. Scholz, ‘Unscrambling code clones for one-to-one matching of duplicated code’, Technical Report 502, Science and Technology Research Institute, University of Hertfordshire, 2010.
  3. A.W. Rainer, P.C.R. Lane, J.A. Malcolm and S. Scholz, ‘Using n-grams to rapidly characterise the evolution of software code’, in Proceedings of the Fourth International ERCIM Workshop on Software Evolution and Evolvability (IEEE Computer Society), pp. 43-52, 2008.
  4. J.P. Bao, C.M. Lyon and P.C.R. Lane, ‘Copy detection in Chinese documents using Ferret’, Language Resources and Evaluation, 40:357-365, 2006.