crandas.unsafe

The operations in this package are considered unsafe because they step around the security model of Crandas with respect to information leakage. They can improve performance where such leakage is deemed acceptable in the given context.

crandas.unsafe.merge_m2m(left, right, how='inner', on=None, left_on=None, right_on=None, validate='m:m-unsafe', suffixes=('_x', '_y'), session=None, **query_args)

Unsafe many-to-many merge.

For the meaning of the arguments, see crandas.crandas.merge().

This function implements a many-to-many merge that leaks some information about the underlying data to the servers that perform the merge. This function should only be used if this leakage is acceptable. This needs to be carefully analysed on a case-by-case basis.

Leakage

This merge works by mapping the join columns of the left and right tables to hash values that are computed randomly but deterministically from the combination of values in the join columns. The servers learn these hash values (in a permuted order) and use them to determine which rows from the left and right tables (in the permuted order) match.

As a consequence of this, the servers learn the number of different combinations of key values of the left and right tables; how often each combination occurs; and which of these combinations match between the left and right tables.
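The hashing and permutation described above can be imitated in plain Python. This is a minimal sketch under stated assumptions, not the mechanism Crandas actually uses: HMAC-SHA256 with a fresh random key stands in for the "random but deterministic" hash, and the names `hashed_join_keys` and `multiplicities` as well as the key tuples are hypothetical.

```python
import hashlib
import hmac
import random
import secrets
from collections import Counter


def hashed_join_keys(rows, key):
    """Map each join-key tuple to a keyed hash and return the hashes shuffled.

    Assumption: HMAC-SHA256 is only a stand-in for the real hashing step.
    """
    hashes = [
        # Truncated to 16 hex characters for readability only.
        hmac.new(key, repr(row).encode(), hashlib.sha256).hexdigest()[:16]
        for row in rows
    ]
    random.shuffle(hashes)  # the servers see the hashes in a permuted order
    return hashes


def multiplicities(hashes):
    """Sorted occurrence counts: the equality pattern that the hashes reveal."""
    return sorted(Counter(hashes).values())


# Hypothetical join-key tuples of a left and a right table.
left_keys = [(1, "a"), (2, "b"), (2, "b"), (3, "b")]
right_keys = [(1, "a"), (2, "b"), (2, "a")]

# Two separate merges, each with its own random key.
run1_key = secrets.token_bytes(32)
run2_key = secrets.token_bytes(32)
run1_left = hashed_join_keys(left_keys, run1_key)
run1_right = hashed_join_keys(right_keys, run1_key)
run2_left = hashed_join_keys(left_keys, run2_key)

print(run1_left)                                  # opaque values, shuffled
print(multiplicities(run1_left))                  # how often each combination occurs
print(len(set(run1_left) & set(run1_right)))      # combinations present in both tables
print(multiplicities(run2_left))                  # same pattern under a different key
```

The hash values themselves change with every key, but the number of distinct values, their multiplicities, and the overlap between the two tables stay the same. That equality pattern is the information that leaks.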

For example, consider the two following tables:

Column A    Column B    Column C
1           a           x
2           b           y
2           b           z
3           b           z

Column A    Column B    Column D
1           a           u
2           b           v
2           a           w

In the case of a join of these two tables on the columns A and B, conceptually, the servers might for example learn the following two tables of permuted hash values:

Hashes for left table

590890364

490892411

239890890

590890364

(Where the first row corresponds to the second record, the second row to the first record, the third row to the fourth record, and the fourth row to the third record.)

Hashes for right table

908902902

590890364

490892411

(Where the first row corresponds to the third record, the second row to the second record, and the third row to the first record.)

Accordingly, the servers can deduce that, in the left table, there are two rows that have the same combination of values for columns A and B. In the right table, there is one row that also has that same combination of values. Moreover, the right table has one row whose values for columns A and B match a combination that occurs only once in the left table.
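These deductions need nothing beyond equality comparisons between hash values. The following sketch shows what a server could compute from two hash lists. The function name `server_view` and the placeholder labels `h1` to `h4` are hypothetical; the labels stand in for hash values and reproduce the equality pattern of the example tables joined on A and B.

```python
from collections import Counter


def server_view(left_hashes, right_hashes):
    """Summarise what can be learned from the two (permuted) hash lists."""
    left_counts = Counter(left_hashes)
    right_counts = Counter(right_hashes)
    shared = left_counts.keys() & right_counts.keys()
    return {
        # Number of different key combinations per table.
        "distinct_left": len(left_counts),
        "distinct_right": len(right_counts),
        # How often each combination occurs.
        "left_multiplicities": sorted(left_counts.values(), reverse=True),
        "right_multiplicities": sorted(right_counts.values(), reverse=True),
        # (occurrences in left, occurrences in right) per matching combination.
        "matches": sorted(
            ((left_counts[h], right_counts[h]) for h in shared), reverse=True
        ),
        # Each matching combination contributes left * right rows to an inner join.
        "inner_join_rows": sum(left_counts[h] * right_counts[h] for h in shared),
    }


# Placeholder labels; the servers do not know which key values they represent.
# Here h1 = (2, b), h2 = (1, a), h3 = (3, b), h4 = (2, a).
left_hashes = ["h1", "h2", "h3", "h1"]
right_hashes = ["h4", "h1", "h2"]

view = server_view(left_hashes, right_hashes)
for name, value in view.items():
    print(f"{name}: {value}")
```

The `matches` entry lists one combination occurring twice on the left and once on the right, and one occurring once on each side, which corresponds to the two deductions described above.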

Whether or not the leakage of this information to the servers is acceptable needs to be analysed carefully on a case-by-case basis and will depend on the extent to which such statistical information is considered sensitive.

Note: when the merge is performed multiple times, the hash values and their order differ each time. The examples shown above are purely illustrative.