.. _basics: First steps ########### Once you have installed the crandas environment as explained in :ref:`installing` you can start using it. To start, we import the ``pandas`` and ``crandas`` modules: .. code:: python import crandas as cd import pandas as pd If you're experiencing issues with crandas or just want to confirm that you're using the latest version, it's useful to know how to check which version of crandas you're running. This is done with the following: .. code:: python from crandas.base import session session.version Uploading data ======================== Uploading data can be done either through the web portal or directly through crandas. With crandas, one easy way is to import a CSV file directly: .. code:: python demo_table = cd.read_csv("/gs_data/demo_table.csv") We use the :func:`.read_csv` function in order to read data that is stored as a CSV file. Alternatively, you can import a pandas ``DataFrame`` with the :func:`.upload_pandas_dataframe` method. .. code:: python df = pd.DataFrame({ "Name": ["Alice", "Bob", "Eve"], "Trusted": ["Yes", "Yes", "No"], "Age": [52, 24, 39], }) table = cd.upload_pandas_dataframe(df, auto_bounds=True) Using this method, we can upload a wide variety of files, as pandas supports many different file formats (csv, excel, sql, json...) and each can be read in using `read_*` (where * is the file type). Therefore, as long as the data can be read into a pandas ``DataFrame``, it can be uploaded to the Virtual Data Lake using :func:`.upload_pandas_dataframe`. Every table uploaded to the Virtual Data Lake has a unique handle. For example, to see the handle of the table uploaded just now, we can do: >>> table Name: 9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E Size: 3 rows x 3 columns CIndex([Col("Name", "s", 6), Col("Trusted", "s", 4), Col("Age", "i", 1)]) The ``Name`` field is the handle of the table. To access a table that was uploaded through the web portal, copy its handle from the portal, and then use the :func:`.get_table` function: >>> table = cd.get_table("9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E") >>> table Name: 9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E Size: 3 rows x 3 columns CIndex([Col("Name", "s", 6), Col("Trusted", "s", 4), Col("Age", "i", 1)]) When uploading data through crandas, we can attach a *name* instead of a handle as well: .. code:: python named_table = cd.upload_pandas_dataframe(df, name="demo") The name "demo" now refers to this new table, even when a table with the name "demo" previously existed. To access a named table, we also use the :func:`.get_table` function: >>> cd.get_table("demo") Name: 9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E Size: 3 rows x 3 columns CIndex([Col("Name", "s", 6), Col("Trusted", "s", 4), Col("Age", "i", 1)]) After uploading data, you can start your analysis. Trying to access the data --------------------------- crandas is a system where data can be shared confidentially, so what happens when someone has access to a table that you uploaded? Can they simply print it to see all the values? The answer is no. Trying to simply print the table only outputs a prefix of the unique ID of the table (a hexadecimal number) and the dimensions (number of rows and columns) of the table: >>> print(table) 4BD87DAB1330328783D4AB9C59D8A4F6...[3x3] This is as much information as we are willing to *leak*. .. _selecting-data-label: Selecting and manipulating data ==================================== Once you have your table ``table``, you can select and manipulate its rows and columns. You can access a column by its name: .. code:: python ages = table["Age"] As mentioned, you cannot access the values in ``ages`` directly, but you can perform computation on them. If you want to know the average age or the sum of the ages of the patients in the health datasets, you may calculate the sum and mean as follows: >>> ages.sum() 115 >>> ages.mean() 38.333333333333336 These calculations were produced using the table we defined in the previous section. There are various statistics available that can be applied to any columns containing numerical data. While you cannot simply add a new column to a :class:`.CDataFrame`, you may use the ``assign`` method to calculate new columns from existing table values. Say you wanted to have a column where you took all the names and made them lowercase: .. code:: python table = table.assign(lower_name=table["Name"].lower()) In order to confirm that the function has worked as you expected, we can use the ``open()`` function belonging to :class:`.CDataFrame`. As the name suggests, this function allows the user to open the values of the table, so this function should be used with caution as it will reveal secret data. Here we use it simply as an example to illustrate what you have done. .. note:: In production environments, this can only be done upon approval by all data owners. >>> table.open() Name Trusted Age lower_name 0 Alice Yes 52 alice 1 Bob Yes 24 bob 2 Eve No 39 eve You can see that your table now has a column with the names in lower case. More interestingly, you can make computations with the values in your tables. To show this, we'll create a new table with more numerical values. .. code:: python shapes = cd.DataFrame({ "id": [1, 2, 3], "length": [32, 86, 21], "height": [41, 66, 11], }, auto_bounds=True) Now, we can find the area of each shape by multiplying the ``length`` and ``height`` columns into a new column ``area``. .. code:: python shapes = shapes.assign(area=shapes["length"]*shapes["height"]) >>> shapes.open() id length height area 0 1 32 41 1312 1 2 86 66 5676 2 3 21 11 231 We can also perform comparisons. Suppose that there is a minimum amount of space that we can assign a shape. Let's say that is *1000*. If a shape is smaller than that, we still need to assign *1000*, if it is bigger we assign its area. We can do this by using the :meth:`.CSeries.if_else` function: >>> shapes.assign(space=(shapes["area"] > 1000).if_else(shapes["area"], 1000)).open() id length height area space 0 1 32 41 1312 1312 1 2 86 66 5676 5676 2 3 21 11 231 1000 To select numbered rows of the table, we can use python's slice_ notation. For example, to select the first two rows: >>> shapes[:2].open() id length height area 0 1 32 41 1312 1 2 86 66 5676 The syntax is ``[start:stop]``, where ``start`` and ``stop`` are indices of the rows. This selects the rows with indices from ``start`` to ``stop``, excluding ``stop``, so ``start, start + 1, start + 2, ..., stop - 1``. If ``start`` is omitted, it is 0, and if ``stop`` is omitted, it is the length of the table. We can also supply a step size, with the ``[start:stop:step]`` syntax. For example, to get every other element: >>> shapes[::2].open() id length height area 0 1 32 41 1312 1 3 21 11 231 Alternatively, you might want to select rows based on the value of certain columns. You can do this by creating a filter. Let's say we want to select only "tall" shapes (shapes that are taller than they are long). We can do the following: .. code:: python tall_shape_filter = shapes["height"] > shapes["length"] tall_shapes = shapes[tall_shape_filter] >>> tall_shapes.open() id length height area 0 1 32 41 1312 You can now take a table, perform computations over it and then select certain records based on the data on it. This is already enough to do many computations on the data. However, crandas is a tool made for data sharing. In the next section you will learn how to combine different data sources to do exactly that. Working with thresholds ----------------------- The tools that you have learned to use provide us with many ways to interact with the data. As we have seen, operations like ``sum`` and ``mean`` reveal data about a column. To avoid revealing data on individual records, many functions like :meth:`.CDataFrame.filter` have a :code:`threshold` parameter that prevents certain information being output if it does not have enough rows of data. Organizations may choose what threshold is acceptable, and queries of their data will have to respect it, as seen here: >>> tall_shapes = shapes[tall_shape_filter.with_threshold(2)] ServerError: Table cannot be filtered: remaining number of items would be below specified threshold (error code: ServerErrorCode.CLIENT_ERROR_THRESHOLD_NOT_MET) .. _slice: https://docs.python.org/3/library/functions.html?highlight=slice#slice