Performance issue: https://jira.codehaus.org/browse/SONARPLUGINS-1696
API and implementation
Currently sonar-duplications library (version 2.13) is mature enough to be reused for example to cover new languages or embed into IDE, but generally speaking does not provide stable API, because it's relatively new library and almost all public classes or methods are subject to incompatible changes - you can treat this as if all public members were marked with @Beta annotation. Thus API must be reviewed and separated on stable and unstable.
Required enhancements:
- BlockChunker (algorithm to build blocks) should not depend on StatementChunker and should work for any unit of source code for which we can determine hash and position, e.g. Token.
This will allow to simply reuse existing lexers/parsers, thus quickly cover languages other than Java - see SONAR-3139 and SONAR-3181. - Moreover, even more simpler pre-processing might be efficient enough, e.g. split by lines with removal of white-spaces and comments.
- Stabilize API to create statements from tokens - SONAR-2843.
- Class Statement (and related classes/methods/fields) must be renamed, because
- it generates false positive issues of rule pmd:CloseResource,
- doesn't really represent statement in terms of language - this is just some part of source code, thus possible new name can be something like a "Unit".
- We prefer term "duplicate" ("duplication") over "clone", because at least it doesn't clash with a name of clone method in Java. Thus some classes must be renamed.
Possible enhancements:
- Suffix tree relies on ByteArray#equals , which might be optimized by using hash-code (which is cached) as a primary comparison attribute.
- No need to store nor list of tokens, nor text representation of Statement, because any next processing relies only on value of hash. Most probably this will decrease memory consumption.
- No need to store text representation of Token, because in any case sonar-channel keeps whole file in memory, thus Token can simply point on position in this file. Maybe this will decrease memory consumption.
- Investigate possibilities to build and store suffix-tree more efficiently in terms of memory, e.g. without suffix-links or by constructing on-disk, but not in-memory...
- BlockChunker uses Rabin-Karp rolling hash, so in fact instead of processing of list, can process stream incrementally. It might decrease memory consumption in case if lexer/parser also works in stream mode.
- API: Allow to use detection algorithm without index?
- API: Hide concrete implementation from consumer? (e.g. original vs suffix-tree)
- Currently for each Block we store index in file, number of first line and last line. During detection only first one is used as an absolute position of block in file, and two last are used only to represent results for the user. Whereas in fact we can try to use only two 32-bit integers (or even one) to represent both index and line number, if line+column would be packed into single 32-bit integer ( as mentioned by Dinesh on page About PEG ).
- Just a quick calculation without real measurements :
- current memory consumption on source code from Oracle JDK 1.6.0_29 when position in file not packed = 100 Mb to store index
- with two integers = 80 Mb , so -20%
- with one integer (aggressive packaging) = 60 Mb , so -40%
- Just a quick calculation without real measurements :
- Generalize concept of "code fragment" :
- Examples :
- Token / Statement ( constructed from sequence of characters ) : file - list of tokens / statements, which are fragments without intersection, but possibly with gaps
- Block ( constructed from sequence of tokens / statements ) : file - list of blocks, which are fragments with intersections and without gaps (each next fragment contains part of previous)
- Duplication ( constructed from sequence of blocks ) : duplication group - set of fragments from different files
- Should be possible to get start / end point, measured in different units, e.g. :
- Block : index in file - to sort
- Duplication :
- start / end line - to represent in UI
- start / end index of token / statement in file - to filter
- Considerations about consumption of memory for implementation - seems that no need to store endpoint , because can be taken from next :
- For sequential fragments without gaps and intersections - special fragment "end-of-file" will be required.
- For fragments with gaps - special fragment "gap".
- For fragments with intersections - from next fragment without overlap.
- Examples :
Removal of deprecated code:
Remove forked part of PMD CPD - SONAR-3182.- Remove original algorithm, which was implemented in 2.11 and used in 2.12, but not used since 2.13, and which is kept for historical reasons and to be able to quickly switch back on it.
Possible new features:
- Ability to take advantage of multiple processors out of the box - files can be processed using several threads
- Off-heap memory for index ( see Apache Direct Memory )
- Implementation of CloneIndex, which is efficient in terms of memory and supports incremental updating. This will allow usage of library for online detection of duplications, e.g. inside of IDE. Note about current implementations :
- MemoryCloneIndex can be updated incrementally, but not efficient in terms of memory,
- PackedMemoryCloneIndex can't be updated incrementally, but efficient in terms of memory.
- Implementation of CloneIndex , which is based on NoSQL database (key-value). Possible candidates:
- Detection of duplications of higher types than 1 ( see A Survey on Software Clone Detection Research (2007 year)) :
- Type 1 - identical code fragments except for variations in whitespace (may be also variations in layout) and comments. Type 1 is widely know as Exact clones.
- Type 2 - structurally/syntactically identical fragments except for variations in identifiers, literals, types, layout and comments. Note that currently literals difference partially supported.
- Type 3 - copied fragments with further modifications (statements can be changed, added and/or deleted in addition to variations in identifiers, literals, types, layout and comments). Note that currently this is partially supported - detection of different separated fragments.
- Detection based on Java bytecode instead of source code. Thus for libraries with closed-sources and even better for libraries, which were obfuscated.
Research:
- Difference from functional point of view (results) between detection based on tokens and based on statements. Note that PMD CPD is based on tokens and SonarQubeCPD for Java is based on statements.
- Copyright infringement http://www.dwheeler.com/essays/floss-license-slide.html
Statistic about src.zip from Oracle JDK 1.6.0_30 :
| SonarQube2.13 | PMD CPD 4.2.5 | |
|---|---|---|
| files | 7212 | |
| tokens per file (max) | 48 784 | 45 243 |
| tokens per file (avg) | 856.3 | 731.0 |
| tokens (total) | 6 175 790 | 5 272 220 |
| statements per file (max) | 4 864 | |
| statements per file (avg) | 83.3 | |
| statements (total) | 601 031 | |
| blocks per file (max) | 4 855 | |
| blocks per file (avg) | 75.8 | |
| blocks (total) | 546 850 |
Intersection of parts
Currently we allow intersection of parts of duplication, i.e. for file "1 2 3 4 1 2 3 4 1 2" we will find two parts of "1 2 3 4 1 2".The same result without intersection : three parts of "1 2", two parts of "3 4". Also this leads to detection of two parts of "1 1" for file "1 1 1" :

Whereas in fact user may expect that in such case : part of duplication represents code fragment, which can be refactored into separate method, i.e.:
All this makes me think that crossing should be prohibited. Also should be noted that this is exactly the same corner case, which is a cause of long work of original algorithm ( SONAR-3060 ).
More complex example may look like :
| Resource 1 | Resource 2 |
|---|---|
fragment 1 fragment 2 fragment 1 fragment 2 | fragment 1 fragment 2 fragment 1 fragment 2 |
For resource 1 simple refactoring can be applied - extract duplication (fragment 1, fragment 2) into new method without leaving boundaries of resource. For duplication in both resources (fragment 1, fragment 2, fragment 1, fragment 2) more complex refactoring might be required depending on their relative position :
Same module : extract into helper class without leaving boundaries of module.
- Same project : introduction of a shared module.
- Same organization : introduction of shared project.
Literature
Detection between different projects
Cases to cover
- Index must be in a consistent state at any moment of time - any unsuccessful analysis should not influence on any other analysis.
- During analysis of project with "sonar.branch" we must prevent detection of duplicates between it and "master project" (without "sonar.branch"). Thus:
- current project must not be saved in index
- "master project" must not be taken into account
- Future changes of mechanism to build hashes. This includes any changes in a mechanism to build blocks :
- block size
- hash function
- changes in a mechanism to build statements - statement chunker, which is not necessarily based on token chunker, but can be
- Optimization - do not build blocks for a source file that hasn't been changed.
Strategy
Current strategy covers only first case and can cover third, if we will add dedicated column to store meta-info.
Insert new blocks :
Query blocks :
This will return all blocks with a hash value, which exists in blocks for requested resource (
resource_snapshot_id) from current project snapshot, excluding last snapshot for current project (last_project_snapshot_id). Thus to analyse project with N resources, we will perform N such queries.
Possible optimizations :- currently
hashis a hex value stored in string, but maybe we can use byte array instead
- currently
Purge (DBCleaner) :
Column | Description |
|---|---|
snapshot_id | Id of snapshot of resource, which was used to build this block |
project_snapshot_id | Id of snapshot of project |
hash | Hash code for this block |
index_in_file | Index of this block in file |
start_line | First line of this block in file |
end_line | Last line of this block in file |
Limitation
Because we analyse projects sequentially (and also modules inside of one project), we have a following limitation : during search of duplicates for some project we will take into account only current snapshot of this project and snapshots of projects, which was fully analysed before it. This can be illustrated as following : for sequence of analysises - A 0, B 0, A 1, B 1, where A 0 - is first analysis of module A, during search of duplicates for A 1 we will take into account A 1 and B 0, and during search for B 1 - B 1 and A 0.

