Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Performance issue: https://jira.codehaus.org/browse/SONARPLUGINS-1696

API and implementation

Currently sonar-duplications library (version 2.13) is mature enough to be reused for example to cover new languages or embed into IDE, but generally speaking does not provide stable API, because it's relatively new library and almost all public classes or methods are subject to incompatible changes - you can treat this as if all public members were marked with @Beta annotation. Thus API must be reviewed and separated on stable and unstable.

...

  • BlockChunker (algorithm to build blocks) should not depend on StatementChunker and should work for any unit of source code for which we can determine hash and position, e.g. Token. This will allow to simply reuse existing lexers/parsers, thus quickly cover languages other than Java - see SONAR-3139 and SONAR-3181.
  • Moreover, even more simpler pre-processing might be efficient enough, e.g. split by lines with removal of white-spaces and comments.
  • Stabilize API to create statements from tokens - SONAR-2843.
  • Class Statement (and related classes/methods/fields) must be renamed, because
    • it generates false - positive violations issues of rule pmd:CloseResource,
    • doesn't really represent statement in terms of language - this is just some part of source code, thus possible new name can be something like a "Unit".
  • We prefer term "duplicate" ("duplication") over "clone", because at least it doesn't clash with a name of clone method in Java. Thus some classes must be renamed.

...

  • Suffix tree relies on ByteArray#equals , which might be optimized by using hash-code (which is cached) as a primary comparison attribute.
  • No need to store nor list of tokens, nor text representation of Statement, because any next processing relies only on value of hash. Most probably this will decrease memory consumption.
  • No need to store text representation of Token, because in any case sonar-channel keeps whole file in memory, thus Token can simply point on position in this file. Maybe this will decrease memory consumption.
  • Investigate possibilities to build and store suffix-tree more efficiently in terms of memory, e.g. without suffix-links or by constructing on-disk, but not in-memory...
  • BlockChunker uses Rabin-Karp rolling hash, so in fact instead of processing of list, can process stream incrementally. It might decrease memory consumption in case if lexer/parser also works in stream mode.
  • API: Allow to use detection algorithm without index?
  • API: Hide concrete implementation from consumer? (e.g. original vs suffix-tree)
  • Currently for each Block we store index in file, number of first line and last line. During detection only first one is used as an absolute position of block in file, and two last are used only to represent results for the user. Whereas in fact we can try to use only two 32-bit integers (or even one) to represent both index and line number, if line+column would be packed into single 32-bit integer ( as mentioned by Dinesh on page About PEG ).
    • Just a quick calculation without real measurements :
      • current memory consumption on source code from Oracle JDK 1.6.0_29 when position in file not packed = 100 Mb to store index
      • with two integers = 80 Mb , so -20%
      • with one integer (aggressive packaging) = 60 Mb , so -40%
  • Generalize concept of "code fragment" :
    • Examples :
      • Token / Statement ( constructed from sequence of characters ) : file - list of tokens / statements, which are fragments without intersection, but possibly with gaps
      • Block ( constructed from sequence of tokens / statements ) : file - list of blocks, which are fragments with intersections and without gaps (each next fragment contains part of previous)
      • Duplication ( constructed from sequence of blocks ) : duplication group - set of fragments from different files
    • Should be possible to get start / end point, measured in different units, e.g. :
      • Block : index in file - to sort
      • Duplication :
        • start / end line - to represent in UI
        • start / end index of token / statement in file - to filter
    • Considerations about consumption of memory for implementation - seems that no need to store endpoint , because can be taken from next :
      • For sequential fragments without gaps and intersections - special fragment "end-of-file" will be required.
      • For fragments with gaps - special fragment "gap".
      • For fragments with intersections - from next fragment without overlap.

...

  • Difference from functional point of view (results) between detection based on tokens and based on statements. Note that PMD CPD is based on tokens and Sonar CPD SonarQubeCPD for Java is based on statements.
  • Copyright infringement http://www.dwheeler.com/essays/floss-license-slide.html

Statistic about src.zip from Oracle JDK 1.6.0_30 :

 Sonar 2SonarQube2.13PMD CPD 4.2.5
files7212 
tokens per file (max)48 78445 243
tokens per file (avg)856.3731.0
tokens (total)6 175 7905 272 220
statements per file (max)4 864 
statements per file (avg)83.3 
statements (total)601 031 
blocks per file (max)4 855 
blocks per file (avg)75.8 
blocks (total)546 850 

 

...

  1. Index must be in a consistent state at any moment of time - any unsuccessful analysis should not influence on any other analysis.
  2. During analysis of project with "sonar.branch" we must prevent detection of duplicates between it and "master project" (without "sonar.branch"). Thus:
    • current project must not be saved in index
    • "master project" must not be taken into account
  3. Future changes of mechanism to build hashes. This includes any changes in a mechanism to build blocks :
    • block size
    • hash function
    • changes in a mechanism to build statements - statement chunker, which is not necessarily based on token chunker, but can be
  4. Optimization - do not build blocks for a source file that hasn't been changed.

...