Authorization¶

Most of the material in this guide is applicable for when using the engine servers run in design mode. In this mode, any query [1] is allowed and can be immediately executed by the engine, hence this mode is very useful for interactive exploration. However, queries that reveal all source data can also be executed, and therefore this mode is mostly intended for environments that only contain dummy data.

The engine servers can also be used in authorized mode. In this mode, all queries require prior approval by a fixed set of approvers before it can be executed.

Signatures¶

In authorized mode, the engine servers only accept the query if it is accompanied by list of digital signatures; one signature for each pre-configured approver. Approvers hold a secret signing key, that they use to generate the digital signatures. For each signing key there is a unique associated verification key, that may be distributed publicly. The engine servers are configured with the verification keys of each approver, and they use the verification keys to check whether the signatures on the queries that they receive are valid.

Besides the signatures of the approvers, there is a second level of signatures present, because the analyst that uses crandas also signs the query. The reason behind this is to ensure that only the intended analyst can execute the approved query and obtain its results. To accomplish this, the user of crandas will generate an analyst key, which like the above also consists of two parts: a secret signing key, and a public verification key. The approvers include the verification key of the analyst in their approval. When crandas wants to execute a query, it first signs the query itself using the secret analyst key. It then sends the query, together with the signatures from the approvers, to the engine server.

Generating keys¶

All keys can be generated using the platform. See this page for an elaborate explanation.

In case the engine nodes are hosted on-premise, the keys can also be generated locally. To execute queries in authorized mode, queries are signed twice: by the approvers who approve the query, and by the analyst who executes them. Both approvers and analysts sign the query using their secret signing key. To each signing key there is a corresponding verification key, that may be distributred publicly, and that can be used to verify the digital signature.

To generate a signing key and corresponding verification key, the user [2] may use the generate_key.py script that comes with crandas.

To use it from the command-line, use:

generate_key.py analyst.sk analyst.pk

This will generate two files, analyst.sk (signing key) and analyst.pk (public verification key) in the current working directory.

Uploading data from crandas within authorized mode¶

To upload data in authorized mode, we also need approval. Typically, uploading data in authorized mode can be done via the platform, and this step can be skipped.

If the user wants to be able to upload data from crandas, they need to have authorization for it, just like any other operation.

Roughly, there are two possibilities:

Option 1: Recording an upload script¶

The user records a script for a single upload. They can use the methods described above to record a script that uploads a table. Of course, they may upload several tables in one script as well. They submit the script for approval, and then load the approved script for use in the authorized mode, where they upload the real data.

>>> cd.script.record()
>>> table = cd.read_csv("dummy_data.csv")
>>> print(table.handle)
>>> cd.script.save("upload-dummy-data.recording")

Note, we do not use dummy_for here, because there is no handle available yet in the authorized mode (for the production data).

And then:

>>> cd.script.load("upload-dummy-data.approved")
>>> table = cd.read_csv("real_data.csv")
>>> print(table.handle)
>>> cd.script.reset()

We print the handle and should save it somewhere because we probably need it to refer to the table later in the analysis.

The downside of this approach is that the query contains structural information about the uploaded table. This means once approval is obtained, the user may only upload a table of identical structure as the one that was authorized. In practice, this means that once a user wants to upload a different table, they have to obtain approval again.

One advantage of this approach is that named tables may be used, e.g.

>>> cd.read_csv("dummy_data.csv", name="survey")

This is useful when designing scripts for which the production data are not yet available in the authorized mode. Also, the name is easier to memorize than a randomly generated handle.

Warning

Approving a named upload may lead to a potential attack that inadvertently leaks data, where the data provider uploads several different data sets with the same name and compares the result of an approved analysis that uses the table. It is safer to use uploads without a name and reference the resulting handles directly.

Option 2: Authorizing queries outside of scripts¶

Approvers can also authorize queries that are not exact scripts that need to be executed in order. Instead, they may approve certain queries on their own, possibly with certain constraints. An analyst may freely use these authorizations within their analysis.

Note

To create such scripts requires manually editing the .recording file and replacing concrete values with ##variable## tokens, and this is outside of the scope of this guide.

As an example of this, the approvers may authorize arbitrary uploads. The analyst obtains a signed query file with the approval. They can use it as follows.:

cd.base.session.load_authorization("allow-any-upload.approved")
cd.read_csv("real_data.csv")

Several authorizations may be loaded into a single session, and they are all searched to match the query that the user wants to execute.