Skip to end of metadata
Go to start of metadata

Performance issue: https://jira.codehaus.org/browse/SONARPLUGINS-1696

API and implementation

Currently sonar-duplications library (version 2.13) is mature enough to be reused for example to cover new languages or embed into IDE, but generally speaking does not provide stable API, because it's relatively new library and almost all public classes or methods are subject to incompatible changes - you can treat this as if all public members were marked with @Beta annotation. Thus API must be reviewed and separated on stable and unstable.

Required enhancements:

  • BlockChunker (algorithm to build blocks) should not depend on StatementChunker and should work for any unit of source code for which we can determine hash and position, e.g. Token. This will allow to simply reuse existing lexers/parsers, thus quickly cover languages other than Java - see SONAR-3139 and SONAR-3181.
  • Moreover, even more simpler pre-processing might be efficient enough, e.g. split by lines with removal of white-spaces and comments.
  • Stabilize API to create statements from tokens - SONAR-2843.
  • Class Statement (and related classes/methods/fields) must be renamed, because
    • it generates false positive issues of rule pmd:CloseResource,
    • doesn't really represent statement in terms of language - this is just some part of source code, thus possible new name can be something like a "Unit".
  • We prefer term "duplicate" ("duplication") over "clone", because at least it doesn't clash with a name of clone method in Java. Thus some classes must be renamed.

Possible enhancements:

  • Suffix tree relies on ByteArray#equals , which might be optimized by using hash-code (which is cached) as a primary comparison attribute.
  • No need to store nor list of tokens, nor text representation of Statement, because any next processing relies only on value of hash. Most probably this will decrease memory consumption.
  • No need to store text representation of Token, because in any case sonar-channel keeps whole file in memory, thus Token can simply point on position in this file. Maybe this will decrease memory consumption.
  • Investigate possibilities to build and store suffix-tree more efficiently in terms of memory, e.g. without suffix-links or by constructing on-disk, but not in-memory...
  • BlockChunker uses Rabin-Karp rolling hash, so in fact instead of processing of list, can process stream incrementally. It might decrease memory consumption in case if lexer/parser also works in stream mode.
  • API: Allow to use detection algorithm without index?
  • API: Hide concrete implementation from consumer? (e.g. original vs suffix-tree)
  • Currently for each Block we store index in file, number of first line and last line. During detection only first one is used as an absolute position of block in file, and two last are used only to represent results for the user. Whereas in fact we can try to use only two 32-bit integers (or even one) to represent both index and line number, if line+column would be packed into single 32-bit integer ( as mentioned by Dinesh on page About PEG ).
    • Just a quick calculation without real measurements :
      • current memory consumption on source code from Oracle JDK 1.6.0_29 when position in file not packed = 100 Mb to store index
      • with two integers = 80 Mb , so -20%
      • with one integer (aggressive packaging) = 60 Mb , so -40%
  • Generalize concept of "code fragment" :
    • Examples :
      • Token / Statement ( constructed from sequence of characters ) : file - list of tokens / statements, which are fragments without intersection, but possibly with gaps
      • Block ( constructed from sequence of tokens / statements ) : file - list of blocks, which are fragments with intersections and without gaps (each next fragment contains part of previous)
      • Duplication ( constructed from sequence of blocks ) : duplication group - set of fragments from different files
    • Should be possible to get start / end point, measured in different units, e.g. :
      • Block : index in file - to sort
      • Duplication :
        • start / end line - to represent in UI
        • start / end index of token / statement in file - to filter
    • Considerations about consumption of memory for implementation - seems that no need to store endpoint , because can be taken from next :
      • For sequential fragments without gaps and intersections - special fragment "end-of-file" will be required.
      • For fragments with gaps - special fragment "gap".
      • For fragments with intersections - from next fragment without overlap.

Removal of deprecated code:

  • Remove forked part of PMD CPD - SONAR-3182.
  • Remove original algorithm, which was implemented in 2.11 and used in 2.12, but not used since 2.13, and which is kept for historical reasons and to be able to quickly switch back on it.

Possible new features:

  • Ability to take advantage of multiple processors out of the box - files can be processed using several threads
  • Off-heap memory for index ( see Apache Direct Memory )
  • Implementation of CloneIndex, which is efficient in terms of memory and supports incremental updating. This will allow usage of library for online detection of duplications, e.g. inside of IDE. Note about current implementations :
    • MemoryCloneIndex can be updated incrementally, but not efficient in terms of memory,
    • PackedMemoryCloneIndex can't be updated incrementally, but efficient in terms of memory.
  • Implementation of CloneIndex , which is based on NoSQL database (key-value). Possible candidates:
  • Detection of duplications of higher types than 1 ( see A Survey on Software Clone Detection Research (2007 year)) :
    • Type 1 - identical code fragments except for variations in whitespace (may be also variations in layout) and comments. Type 1 is widely know as Exact clones.
    • Type 2 - structurally/syntactically identical fragments except for variations in identifiers, literals, types, layout and comments. Note that currently literals difference partially supported.
    • Type 3 - copied fragments with further modifications (statements can be changed, added and/or deleted in addition to variations in identifiers, literals, types, layout and comments). Note that currently this is partially supported - detection of different separated fragments.
  • Detection based on Java bytecode instead of source code. Thus for libraries with closed-sources and even better for libraries, which were obfuscated.

Research:

  • Difference from functional point of view (results) between detection based on tokens and based on statements. Note that PMD CPD is based on tokens and SonarQubeCPD for Java is based on statements.
  • Copyright infringement http://www.dwheeler.com/essays/floss-license-slide.html

Statistic about src.zip from Oracle JDK 1.6.0_30 :

 SonarQube2.13PMD CPD 4.2.5
files7212 
tokens per file (max)48 78445 243
tokens per file (avg)856.3731.0
tokens (total)6 175 7905 272 220
statements per file (max)4 864 
statements per file (avg)83.3 
statements (total)601 031 
blocks per file (max)4 855 
blocks per file (avg)75.8 
blocks (total)546 850 

 

Intersection of parts

Currently we allow intersection of parts of duplication, i.e. for file "1 2 3 4 1 2 3 4 1 2" we will find two parts of "1 2 3 4 1 2".The same result without intersection : three parts of "1 2", two parts of "3 4". Also this leads to detection of two parts of "1 1" for file "1 1 1" :

Whereas in fact user may expect that in such case : part of duplication represents code fragment, which can be refactored into separate method, i.e.:

All this makes me think that crossing should be prohibited. Also should be noted that this is exactly the same corner case, which is a cause of long work of original algorithm ( SONAR-3060 ).

More complex example may look like :

Resource 1Resource 2

fragment 1

fragment 2

fragment 1

fragment 2

fragment 1

fragment 2

fragment 1

fragment 2

For resource 1 simple refactoring can be applied - extract duplication (fragment 1, fragment 2) into new method without leaving boundaries of resource. For duplication in both resources (fragment 1, fragment 2, fragment 1, fragment 2) more complex refactoring might be required depending on their relative position :

  • Same module : extract into helper class without leaving boundaries of module.

  • Same project : introduction of a shared module.
  • Same organization : introduction of shared project.

 

Literature

YearTitle / LinkPagesCommentsAvailability
2011DebCheck: Efficient Checking for Open Source Code Clones in Software Systems detection 
 Towards Collection of Refactoring Patterns Based on Code Clone Classification   
 Finding Code Clones for Refactoring with Clone Metrics: A Case Study of Open Source Software   
 On the Effectiveness of Simhashing in Clone Detection on Large Scale Software System One of authors : Sharif Uddinyes, 05/01/12
 File Cloning in Open Source Java Projects: The Good, The Bad, and The Ugly https://github.com/sourcerer/Sourcereryes, 05/01/12
2010Instant Code Clone Search detection 
 From Whence It Came: Detecting Source Code Clones by Analyzing Assembler5detectionyes, 03/01/12
 Index-Based Code Clone Detection: Incremental, Distributed, Scalable9(detection) During GSoC'11 we took it as a basis for our algorithms.yes, 03/01/12
 A Technique for Just-In-Time Clone Detection in Large Scale Systems detection 
 Model Clone Detection in Practice   
 Code Clones in Feature-Oriented Software Product Lines   
 Studying the Impact of Clones on Software Defects   
 Measuring the Efficacy of Code Clone Information: An Empirical Study   
 Evaluating Code Clone Genealogies at Release Level: An Empirical Study   
 Can Clone Detection Support Quality Assessments of Requirements Specifications?   
 Tracking Clones' Imprint   
 Code Similarities Beyond Copy & Paste   
 Representation, Analysis, and Refactoring Techniques to Support Code Clone Maintenance Ph.D. Thesis 
2009A Mutation / Injection-based Automatic Framework for Evaluating Code Clone Detection Tools10evaluationyes, 03/01/12
 Comparison and Evaluation of Clone Detection Techniques and Tools: A Qualitative Approach43(evaluation) Good comparison of tools.yes, 03/01/12
2008Scenario-based Comparison of Clone Detection Techniques10evaluationyes, 03/01/12
 Towards a Mutation-Based Automatic Framework for Evaluating Code Clone Detection Tools4evaluationyes, 03/01/12
2007A Survey on Software Clone Detection Research115Good overview of general questions.yes, 03/01/12
2006Survey of Research on Software Clones24Good overview of general questions.yes, 03/01/12
2005    
2004    
2003Evaluating Clone Detection Techniques evaluationno, 03/01/12
 Detection of Software Clones Tool Comparison Experiment   
 Problems Creating Task-relevant Clone Detection Reference Data10 yes, 03/01/12

 

Detection between different projects

Cases to cover

  1. Index must be in a consistent state at any moment of time - any unsuccessful analysis should not influence on any other analysis.
  2. During analysis of project with "sonar.branch" we must prevent detection of duplicates between it and "master project" (without "sonar.branch"). Thus:
    • current project must not be saved in index
    • "master project" must not be taken into account
  3. Future changes of mechanism to build hashes. This includes any changes in a mechanism to build blocks :
    • block size
    • hash function
    • changes in a mechanism to build statements - statement chunker, which is not necessarily based on token chunker, but can be
  4. Optimization - do not build blocks for a source file that hasn't been changed.

Strategy

Current strategy covers only first case and can cover third, if we will add dedicated column to store meta-info.

  1. Insert new blocks :

  2. Query blocks :

    This will return all blocks with a hash value, which exists in blocks for requested resource (resource_snapshot_id) from current project snapshot, excluding last snapshot for current project (last_project_snapshot_id). Thus to analyse project with N resources, we will perform N such queries.
    Possible optimizations :

    • currently hash is a hex value stored in string, but maybe we can use byte array instead
  3. Purge (DBCleaner) :

Column

Description

snapshot_id

Id of snapshot of resource, which was used to build this block

project_snapshot_id

Id of snapshot of project

hash

Hash code for this block

index_in_file

Index of this block in file

start_line

First line of this block in file

end_line

Last line of this block in file

Limitation

Because we analyse projects sequentially (and also modules inside of one project), we have a following limitation : during search of duplicates for some project we will take into account only current snapshot of this project and snapshots of projects, which was fully analysed before it. This can be illustrated as following : for sequence of analysises - A 0, B 0, A 1, B 1, where A 0 - is first analysis of module A, during search of duplicates for A 1 we will take into account A 1 and B 0, and during search for B 1 - B 1 and A 0.

Labels
  • None