public class SimilarityIndex
extends java.lang.Object
This structure can be used to compute an approximation of the similarity
between two files. The index is used by
SimilarityRenameDetector
to compute scores
between files.
To save space in memory, this index uses a space efficient encoding which will not exceed 1 MiB per instance. The index starts out at a smaller size (closer to 2 KiB), but may grow as more distinct blocks within the scanned file are discovered.
Modifier and Type | Class and Description |
---|---|
static class |
SimilarityIndex.TableFullException
Thrown by
create() when file is too large. |
Modifier and Type | Field and Description |
---|---|
private long |
hashedCnt
Total amount of bytes hashed into the structure, including \n.
|
private int |
idGrowAt
|
private long[] |
idHash
Pairings of content keys and counters.
|
private int |
idHashBits
idHash.length == 1 << idHashBits . |
private int |
idSize
Number of non-zero entries in
idHash . |
private static int |
KEY_SHIFT
Shift to apply before storing a key.
|
private static long |
MAX_COUNT
Maximum value of the count field, also mask to extract the count.
|
static SimilarityIndex.TableFullException |
TABLE_FULL_OUT_OF_MEMORY
A special
SimilarityIndex.TableFullException used in place of OutOfMemoryError. |
Constructor and Description |
---|
SimilarityIndex() |
Modifier and Type | Method and Description |
---|---|
(package private) void |
add(int key,
int cnt) |
private static long |
common(long[] srcHash,
int srcIdx,
long[] dstHash,
int dstIdx) |
(package private) long |
common(SimilarityIndex dst) |
private static long |
common(SimilarityIndex src,
SimilarityIndex dst) |
(package private) long |
count(int idx) |
private static long |
countOf(long v) |
static SimilarityIndex |
create(ObjectLoader obj)
Create a new similarity index for the given object
|
(package private) int |
findIndex(int key) |
private void |
grow() |
private static int |
growAt(int idHashBits) |
(package private) void |
hash(byte[] raw,
int ptr,
int end) |
(package private) void |
hash(java.io.InputStream in,
long remaining,
boolean text) |
(package private) void |
hash(ObjectLoader obj) |
private void |
hashLargeObject(ObjectLoader obj) |
(package private) int |
key(int idx) |
private static int |
keyOf(long v) |
private int |
packedIndex(int idx) |
private static long |
pair(int key,
long cnt) |
int |
score(SimilarityIndex dst,
int maxScore)
Compute the similarity score between this index and another.
|
(package private) int |
size() |
private int |
slot(int key) |
(package private) void |
sort()
Sort the internal table so it can be used for efficient scoring.
|
public static final SimilarityIndex.TableFullException TABLE_FULL_OUT_OF_MEMORY
SimilarityIndex.TableFullException
used in place of OutOfMemoryError.private static final int KEY_SHIFT
Within the 64 bit table record space, we leave the highest bit unset so all values are positive. The lower 32 bits to count bytes.
private static final long MAX_COUNT
private long hashedCnt
private int idSize
idHash
.private int idGrowAt
private long[] idHash
Slots in the table are actually two ints wedged into a single long. The upper 32 bits stores the content key, and the remaining lower bits stores the number of bytes associated with that key. Empty slots are denoted by 0, which cannot occur because the count cannot be 0. Values can only be positive, which we enforce during key addition.
private int idHashBits
idHash.length == 1 << idHashBits
.public static SimilarityIndex create(ObjectLoader obj) throws java.io.IOException, SimilarityIndex.TableFullException
obj
- the object to hashjava.io.IOException
- file contents cannot be read from the repository.SimilarityIndex.TableFullException
- object hashing overflowed the storage capacity of the
SimilarityIndex.void hash(ObjectLoader obj) throws MissingObjectException, java.io.IOException, SimilarityIndex.TableFullException
MissingObjectException
java.io.IOException
SimilarityIndex.TableFullException
private void hashLargeObject(ObjectLoader obj) throws java.io.IOException, SimilarityIndex.TableFullException
java.io.IOException
SimilarityIndex.TableFullException
void hash(byte[] raw, int ptr, int end) throws SimilarityIndex.TableFullException
void hash(java.io.InputStream in, long remaining, boolean text) throws java.io.IOException, SimilarityIndex.TableFullException
java.io.IOException
SimilarityIndex.TableFullException
void sort()
Once sorted, additional lines/blocks cannot be added to the index.
public int score(SimilarityIndex dst, int maxScore)
A region of a file is defined as a line in a text file or a fixed-size block in a binary file. To prepare an index, each region in the file is hashed; the values and counts of hashes are retained in a sorted table. Define the similarity fraction F as the count of matching regions between the two files divided between the maximum count of regions in either file. The similarity score is F multiplied by the maxScore constant, yielding a range [0, maxScore]. It is defined as maxScore for the degenerate case of two empty files.
The similarity score is symmetrical; i.e. a.score(b) == b.score(a).
dst
- the other indexmaxScore
- the score representing a 100% matchlong common(SimilarityIndex dst)
private static long common(SimilarityIndex src, SimilarityIndex dst)
private static long common(long[] srcHash, int srcIdx, long[] dstHash, int dstIdx)
int size()
int key(int idx)
long count(int idx)
int findIndex(int key)
private int packedIndex(int idx)
void add(int key, int cnt) throws SimilarityIndex.TableFullException
private static long pair(int key, long cnt) throws SimilarityIndex.TableFullException
private int slot(int key)
private static int growAt(int idHashBits)
private void grow() throws SimilarityIndex.TableFullException
private static int keyOf(long v)
private static long countOf(long v)