Importing/exporting data
#########################

This section of the user guide covers importing/exporting data to the Virtual Data Lake (VDL) using crandas. Users can upload existing pandas ``DataFrames``, create new crandas :class:`CDataFrames<.CDataFrame>`, or upload CSV files. Uploaded tables can be accessed by handle or name, and :meth:`.CDataFrame.open()` retrieves aggregated information.

Upload pandas ``DataFrame``
----------------------------

If you have existing data in a pandas ``DataFrame`` that you want to upload to the VDL, you can use the :func:`.crandas.upload_pandas_dataframe` function. This function takes a pandas ``DataFrame`` as its parameter and uploads it to the VDL. You can also optionally specify a name for the table.

For example, let's say you have a pandas ``DataFrame`` called ``my_data`` that you want to upload to the VDL:

.. code:: python

    import crandas as cd
    import pandas as pd

    my_data = pd.DataFrame({"fruit": ["orange", "apple", "raspberry"]})
    uploaded_data = cd.upload_pandas_dataframe(my_data)

This will upload ``my_data`` to the VDL and return a :class:`.CDataFrame` object that you can use to interact with the uploaded data. A :class:`.CDataFrame` behaves similarly to a pandas ``DataFrame``, but it is stored in *secret-shared* form in the VDL.

The advantage of this approach is that any file type accepted by pandas can be uploaded: first read the file with pandas, then upload the resulting ``DataFrame`` with crandas.

Create new crandas :class:`.CDataFrame`
---------------------------------------

Alternatively, if you want to create a new crandas :class:`.CDataFrame` from scratch, you can use the :func:`.crandas.DataFrame` function. This function calls the pandas ``DataFrame`` constructor and uploads the resulting table using ``upload_pandas_dataframe()``. If you specify a name for the table, it will be passed on to ``upload_pandas_dataframe()``.

.. note::

    When uploading data with missing values, it is important to specify certain additional data.
    For more information look :ref:`here`.

For example, let's say you want to create a brand new :class:`.CDataFrame` called ``my_table`` with columns ``A``, ``B``, and ``C``.

.. code:: python

    my_table = cd.DataFrame({
        "A": [1, 2, 3],
        "B": [4, 5, 6],
        "C": [7, 8, 9]
    }, name="my_table")

This will create a new :class:`.CDataFrame` called ``my_table`` with the columns that we specified and upload it to the Virtual Data Lake. You can now use the ``my_table`` object to interact with the uploaded data.

Handles and names
------------------

State objects, such as tables, now have two distinct identifiers: handles and names.

- **Handles** are randomly generated 32-byte strings, usually encoded as 64 hexadecimal digits. Each table has a unique handle that remains fixed (e.g. ``6E14C5275C5E90E31D84FCE0CE5F6D3D1BFE587C21D2278C52D4A092C4AB19F7``).
- **Names** are a user-friendly way to refer to handles and therefore tables.

Handles are generated by the VDL and will always be unique, whereas names are assigned by a user. When a new table is uploaded with a name that is already assigned to a different table, the server reassigns the name to the new table and the old name-handle mapping is lost. The old table loses its name but remains accessible by its handle.

To upload a table with a name, use the following syntax:

.. code:: python

    cdf = cd.upload_pandas_dataframe(df, name="input")

The name will then be assigned.

Upload CSV file to the VDL
--------------------------

To upload a CSV file to the VDL, you can use the :func:`.crandas.read_csv` function. This function takes the name of the CSV file as its parameter and uploads its contents to the server. You can also optionally specify a name for the resulting table.

To upload a file called ``my_data.csv`` to the VDL, we simply need to do this:

.. code:: python

    uploaded_data = cd.read_csv("my_data.csv")

This will upload ``my_data.csv`` to the VDL and return a :class:`.CDataFrame` object that you can use to interact with the uploaded data. Note that for this to work, the file must be in the current directory; otherwise, we must specify the path to it.

.. _Access:

Access an uploaded table
-----------------------------

Any table that has been uploaded to the VDL can be referenced by its handle, a string of hexadecimal characters. Optionally, tables can also have a name attached to them. For practical reasons, it is often better to assign a name than to use the handle.

To access a table, we use the :func:`.crandas.get_table` function. This function takes the table's handle or name as its parameter and returns a :class:`.CDataFrame`.

If you (or someone you are collaborating with) have previously uploaded a table with the name ``my_table``, you can access it by name or handle:

.. code:: python

    # Using the name
    my_table = cd.get_table("my_table")

    # Using the handle
    my_table = cd.get_table("63FE905BB6DF9AD2E7D32DD092C75B1FC2CEB52BDBC4AEAB7AAEF14DBFCB6224")

This will return the :class:`.CDataFrame` object for ``my_table``, which you can then perform operations on.

How to open :class:`CDataFrames<.CDataFrame>`
---------------------------------------------

After performing a computation, you will want to access the resulting data. The :meth:`.CDataFrame.open` method allows you to retrieve a :class:`.CDataFrame` and **open** it. This downloads the data in the clear, so the opened data is no longer private. In general, opening :class:`CDataFrames<.CDataFrame>` is not allowed. In non-demo environments, there will be strict controls over which :class:`CDataFrames<.CDataFrame>` can be opened.

.. warning::

    An attempt to use ``.open()`` in a production environment will be met with an error.

Given a :class:`.CDataFrame`, we can retrieve it using :meth:`.CDataFrame.open`, which returns a pandas ``DataFrame``.

.. code:: python

    # create the CDataFrame
    df = cd.DataFrame({'A': [1, 2], 'B': [3, 4]})

    # open the CDataFrame
    opened_df = df.open()

The opened table ``opened_df`` is now a normal pandas ``DataFrame`` with data in the clear.

>>> print(opened_df)
   A  B
0  1  3
1  2  4

These are the ways we can import and export data in the VDL. By using these functions, you can easily upload data to the VDL for further **privacy-preserving** analysis and processing.

Listing uploaded data
---------------------

It is possible to obtain a list of all tables that have been uploaded to the VDL using the function :meth:`crandas.stateobject.list_uploads`. The result of this function is a pandas ``DataFrame`` with handles and metadata:

>>> cd.stateobject.list_uploads()
                                              handle                    created                             type
0  91E4033337F1ED4D13ED23CA4DCBFB279FBA4C58C7E249...  2023-12-04 12:49:00+00:00  CDataFrame (3 rows x 2 columns)
1  DF20C85FB51E7F114822B6FCF865A260D42872214795F7...  2023-12-04 12:48:16+00:00  CDataFrame (4 rows x 2 columns)

Note that this function only lists uploads: it does not include computation results (e.g., the result of joining two tables) or demo tables created using :meth:`crandas.crandas.demo_table`. See the documentation of :meth:`crandas.stateobject.list_uploads` for more details.

Removing objects from the VDL
--------------------------------

After working with data, you might want to delete it from the VDL. This is as simple as calling the :meth:`.StateObject.remove()` method.

.. code:: python

    # You simply remove a CDataFrame using the following command
    df.remove()

This will not only get rid of the Python :class:`.CDataFrame` used to interact with the table, but also the table on the server.

Now that we know how to add data to the VDL, we can learn how that data is structured so we can start working with it.
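
As a closing illustration, the handle and name rules from the *Handles and names* section can be sketched with a toy, in-memory model. This sketch does not use crandas and does not contact a VDL: ``ToyRegistry`` and its methods are hypothetical names invented for this example, and the real server-side bookkeeping may differ. It only mirrors the behaviour described above, namely that handles are 32 random bytes shown as 64 hexadecimal digits, and that reusing a name repoints it to the newest table while the old table stays reachable by handle.

```python
import secrets


class ToyRegistry:
    """Toy in-memory model of handle/name bookkeeping (NOT the crandas API)."""

    def __init__(self):
        self.tables = {}  # handle -> table contents
        self.names = {}   # name -> handle

    def upload(self, table, name=None):
        # A handle is 32 random bytes, rendered as 64 uppercase hex digits.
        handle = secrets.token_hex(32).upper()
        self.tables[handle] = table
        if name is not None:
            # Reusing a name repoints it at the newest table.
            self.names[name] = handle
        return handle

    def get_table(self, handle_or_name):
        # Resolve a name to its handle; otherwise treat the input as a handle.
        handle = self.names.get(handle_or_name, handle_or_name)
        return self.tables[handle]


registry = ToyRegistry()
old_handle = registry.upload({"A": [1, 2]}, name="input")
new_handle = registry.upload({"A": [3, 4]}, name="input")

print(registry.get_table("input"))     # {'A': [3, 4]}: the name now points to the newest table
print(registry.get_table(old_handle))  # {'A': [1, 2]}: the old table is still reachable by handle
```

In this model, keeping the handle returned at upload time is the only way to reach a table once its name has been reassigned, which is why it can be worth storing handles for tables you intend to reuse.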