Importing/exporting data
This section of the user guide covers importing/exporting data to the engine using crandas. Information on how to upload data via the platform can be found in the knowledge base.
Users can create new crandas DataFrames, upload existing pandas DataFrames or upload CSV files.
They can access tables via handle or name and use
.open()
to reveal the information inside.
Creating new crandas DataFrame
You can create a new crandas DataFrame from scratch using its constructor. Internally, the
constructor calls the pandas DataFrame constructor and uploads the
resulting table using upload_pandas_dataframe(). If you specify a name for the table, it will be passed on
to upload_pandas_dataframe().
Note
When uploading data with missing values, it is important to specify certain additional data. For more information look here.
For example, let's say you want to create a brand new
DataFrame called my_table with
columns A, B, and C.
This will create a new DataFrame
called my_table with the columns that we specified and upload it to
the engine. You can now use the my_table object to interact with the
uploaded data.
Uploading a pandas DataFrame
If you have an existing data in a pandas DataFrame that you want to
upload to the engine, you can use the
upload_pandas_dataframe() function. This
function takes a pandas DataFrame as its parameter and uploads it to
the engine.
For example, let's say you have a pandas DataFrame called my_data
that you want to upload to the engine:
import crandas as cd
import pandas as pd
my_data = pd.DataFrame({"fruit": ["orange", "apple", "raspberry"]})
uploaded_data = cd.upload_pandas_dataframe(my_data)
This will upload my_data to the engine and return a
DataFrame object that you can
use to interact with the uploaded data. A
DataFrame behaves similarly
to a pandas DataFrame, however it is stored in secret-shared form in
the engine.
The advantage of this approach is that it enables users to read any file type that is accepted by pandas by first utilizing pandas, followed by crandas.
Handles and names
State objects, such as tables, now have two distinct identifiers: handles and names.
- Handles are randomly generated 32-byte strings, usually encoded as
64 hexadecimal digits. Each table has a unique handle that remains
fixed (e.g.
6E14C5275C5E90E31D84FCE0CE5F6D3D1BFE587C21D2278C52D4A092C4AB19F7). They can be accessed bytable.handle. - Names are a user-friendly way to refer to handles and therefore tables. Handles are generated by the engine and will always be unique but names are assigned by a user. If a table is given a name already assigned to a different table, the old table will lose its name but will still be accessible by its handle.
When uploading a new table with an existing name, the server reassigns the name to the new table, and the old name-handle mapping is lost. This is in contrast to handles, which always remain unique for each table.
To upload a table with a name, use the following syntax:
Warning
Extra care should be taken when working with table names to not overwrite an already existing table. Due to this reason, we always recommend the use of handles instead.
Uploading CSV files to the engine
To upload a CSV file to the engine, you can utilize the
read_csv() function. This
function accepts the name of the CSV file as its parameter and
facilitates its upload to the server. In addition, users may opt to
specify a name for the resulting table.
To upload a file called my_data.csv to the engine, we simply need to
do this:
This will upload my_data.csv to the engine and return a
DataFrame object that you can use to
interact with the uploaded data. Note that for this to work, the file
must be in the current directory, otherwise we must specify the path to
it.
Uploading Parquet files to the Engine
It is also possible to upload
Parquet files to the Engine using
the read_parquet() function.
Uploading Parquet files is recommended over CSV for efficiency reasons
as well as null value handling. Parquet uploads take considerably less
memory than CSV uploads, so it is the preferred option for large tables.
The API is equivalent to read_csv().
Accessing an uploaded table
Any table that has been uploaded to the engine can be referenced by its
handle: a hexadecimal string of characters. To access a table, we use
the get_table() function. This
function takes the table's handle/name as its parameter and returns a
DataFrame.
# Using the handle
my_table = cd.get_table("63FE905BB6DF9AD2E7D32DD092C75B1FC2CEB52BDBC4AEAB7AAEF14DBFCB6224")
This will return the DataFrame object
for my_table, which you can then perform operations on.
Instead of referencing a table by its handle, it can also be accessed by
its name. If you (or someone you are collaborating with) have previously
uploaded a table to the engine with the name my_table, you can access
it by that name:
However, this is not without risk: If a script is allowed to process any dataset having the given name, the analyst could bypass the approved script by renaming and inserting their own input tables. Therefore, we recommend to use table handles instead.
How to access data in a DataFrame
After performing a a computation, you will want to access the resulting
data. The .open() method allows you to retrieve a
DataFrame and open it. This
downloads the open data, exposing it and not making it private. In
general, opening DataFrames is not allowed. In non-demo environments, there will be
strict controls over which DataFrames can be opened.
Attention
An attempt to use .open() in authorized mode
without proper authorization will be met with an error.
Given a DataFrame we can retrieve it
using .open() which will output a pandas DataFrame.
# create the DataFrame
df = cd.DataFrame({'A': [1, 2], 'B': [3, 4]})
# open the DataFrame
opened_df = df.open()
The opened table opened_df is now a normal pandas DataFrame with
data in the clear.
These are the ways we can import and export data in the engine.
By using these functions, you can easily upload data to engine for further privacy-preserving analysis and processing.
Listing uploaded data
It is possible to obtain a list of all tables that have been uploaded to
the engine using the function
list_uploads(). The
result of this function is a pandas DataFrame with handles and metadata:
>>> cd.stateobject.list_uploads()
handle created type
0 91E4033337F1ED4D13ED23CA4DCBFB279FBA4C58C7E249... 2023-12-04 12:49:00+00:00 DataFrame (3 rows x 2 columns)
1 DF20C85FB51E7F114822B6FCF865A260D42872214795F7... 2023-12-04 12:48:16+00:00 DataFrame (4 rows x 2 columns)
Note that this function only list uploads: this does not include
computation results (e.g., the result of joining two tables) or demo
tables created using demo_table(). See the documentation of
list_uploads() for
more details.
Removing objects from the engine
After working with data, you might want to delete it from the engine.
This is as simple as calling the
.remove() method.
This will not only get rid of the python DataFrame
used to interact with the table, but also the table in the
server.
Now that we know how to add data to the engine, we can learn how that data is structured so we can start working with it.
Note
If an object is in an error state (for example, an upload that was
aborted), retrieving it using cd.get_table() will give an error. To still
remove the object, you can use the mode="defer" flag. For example, to
remove all uploads, use: