Guide for approvers#

Note

This guide is a work in progress, and does not cover all possible scenarios. Other attack scenarios might exist that are not mentioned here.

This page guides you through the process of writing and approving a crandas script, and explains what to look for when inspecting such a script. In most applications, an analysis should protect individual records by disclosing only aggregate statistics; it should not be possible to derive properties of individuals. Still, the more freedom an analyst receives in designing an analysis, the higher the chance that the analysis reveals information beyond its purpose. Moreover, aggregations can in some instances leak information about individuals.

The aim of this guide is to help you design and approve suitable analyses that protect individual records in the source data. The first section of this guide assumes that the analyst is trusted not to actively or maliciously try to derive additional information about individual records. If this is not the case, additional mechanisms can be used, as described in section 3.

Contents:

  1. Purpose alignment: do the steps in the script support the desired results?

  2. Disclosure checklist

  3. Advanced controls for disclosure risks

  4. Appendix


In case you need help assessing disclosure risks, or want to know more about how to recognize manipulated scripts or data, please reach out to us. We are happy to help!

Purpose alignment#

First, you should verify that the script aligns with the purpose of the analysis.

Go through the steps in the script, and inspect the outputs that are generated based on the dummy data. What information can you derive? Does it support the intended outcome of the analysis, within the scope that was agreed upon? For example, when the purpose is to compare individuals in a dataset based on a specific attribute, be cautious about statements involving unrelated attributes (for an example, see Appendix 1).

Disclosure checklist#

To prevent accidental disclosure, please make sure to at least take the following steps:

In your script:#

  1. Review, or discuss with the analyst, whether the script aligns with the purpose of the analysis, as described in the first step.

  2. Avoid open() statements in the code, to make sure that you do not reveal information about individual records.

  3. Always refer to tables by their unique table handle.

  4. Always use a threshold (see also Appendix 2, and the sketch after this list) on a CDataFrame.filter() or an aggregation (e.g. CSeriesColRef.count(), CSeriesColRef.sum()). To prevent disclosure risks, the threshold value should be at least 3; a value of 10 is recommended, but the right choice depends on the context [1].

  5. Be very careful when using unique ID attributes in a script for anything other than merging tables (see also Disclosure by numerical operations on IDs). In general, make sure not to use CSeriesColRef.min() or CSeriesColRef.max() on sensitive attributes that might be unique, since it is sometimes possible to link the resulting values to their corresponding records.
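
A minimal sketch of items 3–5, using only operations shown elsewhere in this guide; the table handle is a placeholder, and the column names (Age, patient_id) are hypothetical:

import crandas as cd

## Item 3: refer to the table by its unique handle ("<table handle>" is a placeholder)
patients = cd.get_table("<table handle>")

## Item 4: use a threshold on filters and on aggregations
adults = patients.filter(patients["Age"] >= 18, threshold=10)
mean_age = adults["Age"].mean(threshold=10)

## Item 5: avoid numerical operations on (potentially unique) ID attributes,
## e.g. patients["patient_id"].min() or patients["patient_id"].max()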

In your input data:#

  1. When a dataset needs to be uploaded again (for example because a mistake was made or records need to be changed or added), make sure that it is different enough from the previous dataset to prevent disclosure by differencing.

  2. Assess how much of the data being processed might already be publicly available, to prevent group disclosure. Prior knowledge about the input might also reduce the effect of a threshold, since some records can then be recognized in the output.

  3. Ensure that duplicate records in your dataset do not undermine the chosen threshold. If the input data contains duplicates, choose a correspondingly higher threshold (a local pre-upload check is sketched after this list).

  4. Make sure that your input dataset(s) contain a sufficient number of (overlapping) records, so that enough records end up in the merge. This prevents disclosure by intersection.
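
Because input data is still in plaintext at the data provider, duplicates (item 3) can be checked locally with pandas before uploading. A minimal sketch, assuming a hypothetical input file and key column person_ID:

import pandas as pd

df = pd.read_csv("input.csv")  ## hypothetical input file

## Count duplicated keys before uploading; if any exist, deduplicate
## or choose a correspondingly higher threshold
num_duplicates = int(df.duplicated(subset="person_ID").sum())
print(str(num_duplicates) + " duplicated keys in the input")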

Advanced controls for disclosure risks#

In some circumstances, disclosure risks are more likely or have more impact, for example when you need to prevent active attempts at disclosure through manipulation of the script or the input data, or when the output is shared publicly. In such cases, additional controls are advised.


For additional measures on filtering, please see Appendix 5.

Note

This document is meant to assist in writing and approving scripts that minimize the risk of disclosure of individuals: analysts, and people to whom the output is disclosed, should not be able to find out which individuals occur in the source data, or derive information about them. These “individuals” can be anything the data records represent, e.g. people or organizations.

Warning

As part of a collaboration, parties may agree at their own risk to deviate from the default risk scenario and associated measures. For example, you may choose to use a smaller threshold than what is recommended, upload tables by means of the table naming feature or re-upload a dataset to be used in the same analysis. This will be at the cost of a higher disclosure risk.

Appendix#

Appendix 1: Example non-compliant script#

Suppose two parties agree to analyse the relation between alcohol consumption and the occurrence of certain diseases, such as heart disease and diabetes. To do this, one of them provides a dataset containing medical information about individuals and the other a survey on lifestyle information. As we see in the script, at some point the “SleepTime” attribute is inspected. Since this is irrelevant for the purpose of the analysis that was agreed on, the approver should be cautious about approving this analysis.

import crandas as cd
import pandas as pd

survey_data = cd.get_table("980C4EF3DC374B262C8175F6963EAA602C8C36566314AC49251FB054981B8EAE")
medical_data = cd.get_table("3F16BD4F1212284D3CD61E46441FA50C2E38CA629DA47BD2615CEB045CB28BB1")

merged = cd.merge(survey_data, medical_data, how='inner', left_on='patient_id', right_on='patientnr')

disease_list = ['Stroke',
                'KidneyDisease',
                'Asthma',
                'SkinCancer',
                'HeartDisease',
                'Diabetic'
                ]

drinker = {0: "No",
           1: "Yes",
           }

## Sleeptime is irrelevant for the purpose of the analysis, so the approver should be careful approving this analysis
merged["SleepTime"].mean()

num_drinkers = [merged[merged["AlcoholDrinking"] == 0]["AlcoholDrinking"].count(),
                merged[merged["AlcoholDrinking"] == 1]["AlcoholDrinking"].count()]

table = []
for key,val in drinker.items():
    for disease in disease_list:
        sub = []
        sub.append(key)
        sub.append(val)
        sub.append(disease)
        sub.append(merged[merged["AlcoholDrinking"]==key][disease].sum()/num_drinkers[key])
        table.append(sub)

df = pd.DataFrame(table, columns=["key","Drinker","Disease", "Total"])

Appendix 2: Thresholds#

Disclosure risks at runtime can be limited using the threshold functionality. The threshold can be added as a parameter to a CDataFrame.filter() or to an aggregation operation (e.g. CSeriesColRef.mean(), CSeriesColRef.max(), CSeriesColRef.sum()).

The function is only executed when the number of rows on which it is applied meets the specified threshold. Organizations can choose the threshold themselves. A value of 10 is recommended, but it depends on the context [1]. To prevent disclosure of individuals, the threshold should not be lower than 3.

Suppose we have two tables occurring in the Virtual Data Lake: one containing medical data about patients and another containing survey data about patients. These datasets have a few records in common, which can be accessed by performing a join on patient_id.

survey_data_handle = "253A02627B55BBAC4C5B1E3370D82E3F9123E23E4021442A385AB5B83025D7F5"
medical_data_handle = "CAEFF085F0323E9137776E16898806DFE684B7E4C1BDC4963D3A4824A5A70FB9"

We can now perform a join on both datasets:

joined_table = cd.merge(cd.get_table(survey_data_handle), cd.get_table(medical_data_handle), how="inner",
        left_on="patient_id", right_on="patientnr")
Handle: "76183E482231AC32B90F592D1069E4AF9D31AE31B130B543F4A80585736F0068"
Size: 507 rows x 17 columns
CIndex([Col("patient_id", "i", 1), Col("Sex", "s", 7), Col("Age", "i", 1),
Col("SleepTime", "i", 1), Col("Smoking", "s", 4), Col("AlcoholDrinking", "i", 1),
Col("Country", "s", 15), Col("HeartDisease", "s", 4), Col("BMI", "i", 1),
Col("Stroke", "i", 1), Col("PhysicalActivity", "i", 1), Col("GenHealth", "s", 10),
Col("DiffWalking", "s", 4), Col("Diabetic", "s", 23), Col("Asthma", "i", 1),
Col("KidneyDisease", "i", 1), Col("SkinCancer", "i", 1)])

Suppose we want to analyse the average BMI of individuals with diabetes. However, there might be only a few individuals with diabetes, in which case the results of the analysis would reveal sensitive properties of those individuals. To prevent this, we only allow the table of diabetic people to be generated when enough people occur in it.

diabetic_filter = joined_table["Diabetic"] == "Yes"
diabetic_filtered = joined_table[diabetic_filter.with_threshold(10)]

Or, alternatively:

diabetic_filtered = joined_table.filter(joined_table["Diabetic"] == "Yes", threshold = 10)

No error occurred, so the resulting table contains at least 10 records. We are now allowed to proceed with our analysis.

>>> diabetic_filtered["BMI"].mean()
31.13953488372093

Now say that we want to know the average BMI of people having both diabetes and kidney disease. We use a threshold of 3, meaning that the operation is not allowed if it applies to only one or two records. This time, we apply the threshold not to the filter, but directly to the aggregation (CSeriesColRef.mean()).

>>> diabetic_filtered[diabetic_filtered["KidneyDisease"]==1]["BMI"].mean(threshold=3)
ServerError:  Table does not meet rows threshold for function application
    (error code: ServerErrorCode.CLIENT_ERROR_THRESHOLD_NOT_MET)

This throws a threshold-not-met error: the result of this computation had a chance of revealing the BMI of only one or two individuals having both kidney disease and diabetes. Note that the threshold can also be applied to other aggregation functions, e.g.:

>>> diabetic_filtered["Age"].sum(threshold=3)
168
>>> diabetic_filtered["Age"].min(threshold=3)
34
>>> diabetic_filtered["Age"].count(threshold=3)
3

Or, alternatively:

ages = diabetic_filtered["Age"].with_threshold(3)
>>> ages.sum()
168
>>> ages.min()
34
>>> ages.count()
3

Note

In case a set of concatenated datasets is filtered on, the threshold should be evaluated on the individual datasets it consists of before the filter is applied to the concatenated dataset. This prevents one data provider from using its own input to derive the input of the other.

concat_regions = cd.concat([regionA, regionB], join="inner")

## Verify that the number of records conforms to the threshold for regions A and B individually
try:
    lower_salary_A = regionA.filter(regionA["salary"] < 1500, threshold=3)
    lower_salary_B = regionB.filter(regionB["salary"] < 1500, threshold=3)
except Exception:
    lower_salary_A = False
    lower_salary_B = False
else:
    lower_salary_concat = concat_regions.filter(concat_regions["salary"] < 1500, threshold=3)

Appendix 3: Disclosure scenarios#

When insufficient mechanisms for disclosure control are in place, the following scenarios could lead to disclosure of information of individual records:

Disclosure by intersection#

When two datasets have only a single individual in common, any outcome resulting from a computation on their intersection will reveal information about that individual:

# join tables of different data providers
merged = cd.merge(tableA, tableB, how="inner", left_on="patientnr", right_on="patientnr")

>>> print("patientnr: " + str(merged["patientnr"].sum()) + ", age: " + str(merged["age"].sum()))
patientnr: 10112, age: 33

Warning

The Virtual Data Lake cannot hide the shape (number of rows and columns) of a CDataFrame, even when this information is not explicitly revealed in the script.

Group disclosure#

Properties of an individual record may be disclosed by combining information that is revealed during execution of the script with information the analyst already has. This is particularly relevant when the script discloses information about a group of records that all share a certain property:

smokers = healthdata.filter(healthdata["Smoking"] == True)
>>> smokers
Handle: "51386E338D2A10E62B71A32AD87AC2F3A5A507F324052A402BAF7C3C1ED8E9E8"
Size: 24 rows x 13 columns

smokers_asthma = smokers.filter(smokers["Asthma"] == True)
>>> smokers_asthma
Handle: "193744E1B3812C1E1939353712081ADFF77075BD1F9FE40FD8F54827127672E4"
Size: 24 rows x 13 columns

The second filter did not influence the number of records. From this we can conclude that any smoking individual we know to occur in the dataset indeed has asthma. Similarly, if the second filter were to result in an empty dataframe, this would also disclose information about all smokers.
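
A guard against this pattern, sketched here for the smokers table from above, is to also require the complement of the filter to keep enough records; the filter_diff function in Appendix 5 generalizes this idea:

## Sketch: require the complement of the filter (smokers without asthma) to
## also keep at least 3 records, so the filter cannot apply to (nearly) all
## smokers; the threshold on the filter itself also prevents an empty result
asthma_filter = smokers["Asthma"] == True
try:
    smokers.filter(1 - asthma_filter, threshold=3)
except Exception:
    raise ValueError("Unsafe: the filter applies to (nearly) all smokers")
smokers_asthma = smokers.filter(asthma_filter, threshold=3)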

Disclosure by duplication#

It might occur that records are duplicated in either of the source datasets.

partyA_data = [{'person_ID':101,'name': 'Jackson'},
               {'person_ID':102,'name': 'Wesley'},
               {'person_ID':103,'name': 'Jameson'},
               {'person_ID':999,'name': 'Greyson'},
               {'person_ID':101,'name': 'Jackson'},
               {'person_ID':102,'name': 'Wesley'},
               {'person_ID':103,'name': 'Jameson'},
               {'person_ID':106,'name': 'Milo'},
               {'person_ID':101,'name': 'Jackson'},
               {'person_ID':102,'name': 'Wesley'},
               {'person_ID':103,'name': 'Jameson'},
               {'person_ID':107,'name': 'Maarten'}]
c_partyA_data = cd.upload_pandas_dataframe(pd.DataFrame.from_records(partyA_data), auto_bounds=True)
partyA_batch1 = c_partyA_data.slice(slice(0,4))
partyA_batch2 = c_partyA_data.slice(slice(4,8))
partyA_batch3 = c_partyA_data.slice(slice(8,12))

partyB_data = [{'person_ID':101,'salary': 85225},
               {'person_ID':102,'salary': 37160},
               {'person_ID':103,'salary': 75706},
               {'person_ID':104,'salary': 25893},
               {'person_ID':105,'salary': 96129},
               {'person_ID':106,'salary': 66381},
               {'person_ID':107,'salary': 11253},
               {'person_ID':108,'salary': 6417}]
c_partyB_data = cd.upload_pandas_dataframe(pd.DataFrame.from_records(partyB_data), auto_bounds=True)


merge_batch1 = cd.merge(partyA_batch1, c_partyB_data, how="inner", left_on="person_ID", right_on="person_ID")
merge_batch2 = cd.merge(partyA_batch2, c_partyB_data, how="inner", left_on="person_ID", right_on="person_ID")
merge_batch3 = cd.merge(partyA_batch3, c_partyB_data, how="inner", left_on="person_ID", right_on="person_ID")

>>> print("Salary of Milo: " + str(merge_batch2["salary"].sum() - merge_batch1["salary"].sum()))
Salary of Milo: 66381

Therefore, a merge should not be done on specific slices of a CDataFrame. Also, once a merge has been computed, no additional records should be added to the resulting CDataFrame.
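
A safer pattern, sketched with the tables from above, is to perform the merge once on the complete tables rather than per batch:

## Sketch: merge the full tables in a single step; do not repeat the
## merge on slices or on later re-uploads of the same data
merged_all = cd.merge(c_partyA_data, c_partyB_data, how="inner",
                      left_on="person_ID", right_on="person_ID")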

Disclosure by numerical operations on IDs#

Generally, the ID attribute can be used to uniquely identify an individual. Numerical operations (e.g. CSeriesColRef.min(), CSeriesColRef.max()) should therefore not be applied to the ID attribute.

asthma = healthdata.filter(healthdata["Asthma"] == True)

>>> print("Patients: ")
>>> print(str(asthma["patientnr"].min()))
>>> print("and")
>>> print(str(asthma["patientnr"].max()))
>>> print("have asthma")

Patients:
10128249
and
13758308
have asthma

Disclosure by differencing#

Information about individuals can be revealed by performing the same computation on different datasets and inspecting the result:

## Suppose we have dataset X and dataset Y, whose only difference
## is that record r occurs in X and not in Y. Then:

sum(X["value"]) - sum(Y["value"]) = r["value"]

This can be done either by performing the same analysis on datasets that are only slightly different, or within the script itself:

## When there is a single person from Ireland, this will reveal the age of that person:
>>> patient_info["age"].sum() -  patient_info.filter(patient_info["Country"] != "Ireland")["age"].sum()
33.0

Similarly, when the same analysis is approved multiple times on different datasets, these datasets might also differ in a single record:

## First analysis:
patient_info = cd.get_table("9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E")
>>> patient_info
Handle: "9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E"
Size: 1067 rows x 7 columns

>>> patient_info["age"].sum()
10394.0
## Second analysis:
patient_info = cd.get_table("3532105850E7E0C9C5314818571A8BC658F329EBDFF28A820CB0FC34468583B2")
>>> patient_info
Handle: "3532105850E7E0C9C5314818571A8BC658F329EBDFF28A820CB0FC34468583B2"
Size: 1066 rows x 7 columns

>>> patient_info["age"].sum()
10361.0
## a single person was left out, having an age of 10394 - 10361 = 33

Disclosure by imbalance of contributions#

One individual might have an attribute value that is significantly larger than those of others. The sum for that attribute might then be an accurate estimate of the largest individual’s contribution:

data = [{'company': 'S1', 'revenue': 1},
        {'company': 'S2', 'revenue': 1},
        {'company': 'S3', 'revenue': 2},
        {'company': 'S4', 'revenue': 120},
        {'company': 'S5', 'revenue': 30},
        {'company': 'S6', 'revenue': 2},]
revenues = pd.DataFrame.from_records(data)
revenues = cd.upload_pandas_dataframe(revenues,auto_bounds=True)

>>> print("S5 estimation of S4: " + str(revenues['revenue'].sum() - 30))
S5 estimation of S4: 126 ## within 5% of S4's actual revenue of 120

Appendix 4: p%-rule#

A (sensitive) attribute conforms to the p%-rule when the second-largest contributor cannot estimate the contribution of the largest contributor to within p%. Formally, let X be the sum of all values of the attribute, and x1 and x2 the largest and second-largest values of that attribute, respectively. According to the p%-rule, the attribute is unsafe to be used in an analysis when:

\[X - x_1 - x_2 < \frac{p \cdot x_1}{100}\]

The p%-rule is commonly used as a standard by national statistical institutes for publishing data. The choice of p depends on the preferences of the data provider and the sensitivity of the attribute, but we recommend a value of at least 5 for p.
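
As a worked example, take the revenue column from the imbalance scenario in Appendix 3: X = 156, x1 = 120 and x2 = 30, so that

\[156 - 120 - 30 = 6 \not< \frac{5 \cdot 120}{100} = 6.\]

The attribute thus just conforms to the rule for p = 5, but is unsafe for any p > 5, which matches the estimate in that scenario being accurate to 5%.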

In crandas, the p%-rule can be evaluated as follows:

def ppercent(cdf, column_name, p):
    """Verify that column `column_name` of CDataFrame `cdf` conforms to the p%-rule for value `p`.

    Parameters
    ----------
    cdf: CDataFrame
        input table being assessed
    column_name: str
        name of the column of `cdf` to be assessed
    p: int
        percentage value (between 0 and 100) for the p%-rule

    Returns
    -------
    None if `column_name` conforms to the p%-rule; raises ValueError otherwise
    """
    max_value = cdf[column_name].max(mode="regular")
    sum_value = cdf[column_name].sum(mode="regular")
    table_filtered = cdf[cdf[column_name] != max_value]
    size_cdf = cdf[column_name].count(mode="regular")
    size_filtered = table_filtered[column_name].count(mode="regular")
    ## If exactly one record attains the maximum, the second-largest value is the
    ## maximum of the remaining records; otherwise the maximum occurs multiple
    ## times and is itself the second-largest value
    diff = (size_cdf - size_filtered) == 1
    second_max_value = ((1 - diff) * max_value) + (diff * table_filtered[column_name].max(mode="regular"))

    breached = 100 * (sum_value - max_value - second_max_value) < p * max_value
    is_breached = breached.open().iloc[0]

    if is_breached:
        raise ValueError('Result unsafe to disclose: p%-rule breached for p=' + str(p))

table = incomes.filter(incomes["Profession"] == "Dentist", threshold=10)
>>> ppercent(table, "Income", 15)
ValueError: Result unsafe to disclose: p%-rule breached for p=15

Note

The approach described here to evaluate the p%-rule is a temporary workaround for the current version of crandas. A more suitable mechanism will become available soon.

Appendix 5: Filtering#

When a dataset is filtered, it is important to assess the difference in the number of records between the original and the resulting dataset. This difference can reveal information about an individual, or about a group of records in the dataset being filtered on. Examples of this can be seen in the disclosure scenarios Group disclosure and Disclosure by differencing (Appendix 3).

As such, you should make sure that the filtered dataset differs by at least 3 records from the original dataset. In crandas, this can be done as follows:

def filter_diff(cdf, filtering, threshold=None):
    """Perform filtering on `cdf`, unless the difference in the number of records between the
        original and the result is smaller than 3.

    Parameters
    ----------
    cdf: CDataFrame
        input table being filtered on
    filtering: key or function
        filter being applied to `cdf` (see docstring for `filter` for additional info)
    threshold: int, optional, default: None
        If given, sets a minimum number of rows that the resulting table needs to have; if the
        result has fewer rows, the server returns an error. Equivalent to calling filter with
        key.with_threshold(threshold)

    Returns
    -------
    CDataFrame
        Result of applying filter `filtering` on `cdf`; raises ValueError in case the result
        differs by fewer than 3 records from `cdf`
    """
    try:
        ## the complement of the filter must also keep at least 3 records
        cdf.filter(1 - filtering, threshold=3)
    except Exception:
        raise ValueError('Result unsafe to disclose: difference between original and filtered table too small')

    return cdf.filter(filtering, threshold=threshold)

filtering = table["Income"] < 100
>>> filter_diff(table, filtering, 10)
ValueError: Result unsafe to disclose: difference between original and filtered table too small

Note

The approach described here to evaluate the filter is a temporary workaround for the current version of crandas. A more suitable mechanism will become available soon.