.. _basics:

First steps
###########

Once you have installed the crandas environment as explained in :ref:`installing` you can start using it. To start, we import the ``pandas`` and ``crandas`` modules:

.. code:: python

    import crandas as cd
    import pandas as pd


If you're experiencing issues with crandas or just want to confirm that you're using the latest version, it's useful to know how to check which version of crandas you're running. This is done with the following:

.. code:: python
    
    from crandas.base import session

    session.version

Uploading data
========================

Uploading data can be done either through the web portal or directly through crandas.

With crandas, one easy way is to import a CSV file directly:

.. code:: python

    demo_table = cd.read_csv("/gs_data/demo_table.csv")  

We use the :func:`.read_csv` function in order to read data that is stored as a CSV file.

Alternatively, you can import a pandas ``DataFrame`` with the :func:`.upload_pandas_dataframe` method.

.. code:: python

    df = pd.DataFrame({ 
        "Name": ["Alice", "Bob", "Eve"], 
        "Trusted": ["Yes", "Yes", "No"], 
        "Age": [52, 24, 39], 
    })

    table = cd.upload_pandas_dataframe(df, auto_bounds=True) 

Using this method, we can upload a wide variety of files, as pandas supports many different file formats (csv, excel, sql, json...) and each can be read in using `read_*` (where * is the file type). Therefore, as long as the data can be read into a pandas ``DataFrame``, it can be uploaded to the Virtual Data Lake using :func:`.upload_pandas_dataframe`.  

Every table uploaded to the Virtual Data Lake has a unique handle.
For example, to see the handle of the table uploaded just now, we can do:

>>> table
Name: 9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E
Size: 3 rows x 3 columns
CIndex([Col("Name", "s", 6), Col("Trusted", "s", 4), Col("Age", "i", 1)])

The ``Name`` field is the handle of the table.

To access a table that was uploaded through the web portal, copy its handle from the portal, and then use the :func:`.get_table` function:

>>> table = cd.get_table("9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E")
>>> table
Name: 9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E
Size: 3 rows x 3 columns
CIndex([Col("Name", "s", 6), Col("Trusted", "s", 4), Col("Age", "i", 1)])

When uploading data through crandas, we can attach a *name* instead of a handle as well:

.. code:: python

    named_table = cd.upload_pandas_dataframe(df, name="demo")

The name "demo" now refers to this new table, even when a table with the name "demo" previously existed.

To access a named table, we also use the :func:`.get_table` function:

>>> cd.get_table("demo")
Name: 9AABE73F98B33C3A54AF2B147A0219A809D7EC2FC1A5ED261FA8F4929FFDD15E
Size: 3 rows x 3 columns
CIndex([Col("Name", "s", 6), Col("Trusted", "s", 4), Col("Age", "i", 1)])

After uploading data, you can start your analysis.

Trying to access the data
---------------------------

crandas is a system where data can be shared confidentially, so what happens when someone has access to a table that you uploaded? Can they simply print it to see all the values? The answer is no. Trying to simply print the table only outputs a prefix of the unique ID of the table (a hexadecimal number) and the dimensions (number of rows and columns) of the table:

>>> print(table)
4BD87DAB1330328783D4AB9C59D8A4F6...[3x3]

This is as much information as we are willing to *leak*.

.. _selecting-data-label:

Selecting and manipulating data
====================================
Once you have your table ``table``, you can select and manipulate its rows and columns. You can access a column by its name:

.. code:: python

    ages = table["Age"]

As mentioned, you cannot access the values in ``ages`` directly, but you can perform computation on them.

If you want to know the average age or the sum of the ages of the patients in the health datasets, you may calculate the sum and mean as follows: 

>>> ages.sum()
115

>>> ages.mean()
38.333333333333336

These calculations were produced using the table we defined in the previous section. There are various statistics available that can be applied to any columns containing numerical data.  

While you cannot simply add a new column to a :class:`.CDataFrame`, you may use the ``assign`` method to calculate new columns from existing table values.
Say you wanted to have a column where you took all the names and made them lowercase:

.. code:: python

    table = table.assign(lower_name=table["Name"].lower()) 

In order to confirm that the function has worked as you expected, we can use the ``open()`` function belonging to :class:`.CDataFrame`.
As the name suggests, this function allows the user to open the values of the table, so this function should be used with caution as it will reveal secret data.
Here we use it simply as an example to illustrate what you have done.

.. note::

In production environments, this can only be done upon approval by all data owners.

>>> table.open()
    Name Trusted  Age lower_name
0  Alice     Yes   52      alice
1    Bob     Yes   24        bob
2    Eve      No   39        eve

You can see that your table now has a column with the names in lower case.

More interestingly, you can make computations with the values in your tables.
To show this, we'll create a new table with more numerical values.

.. code:: python

    shapes = cd.DataFrame({
        "id": [1, 2, 3],
        "length": [32, 86, 21],
        "height": [41, 66, 11],
    }, auto_bounds=True)

Now, we can find the area of each shape by multiplying the ``length`` and ``height`` columns into a new column ``area``.

.. code:: python

    shapes = shapes.assign(area=shapes["length"]*shapes["height"]) 

>>> shapes.open()
   id  length  height  area
0   1      32     41  1312
1   2      86     66  5676
2   3      21     11   231

We can also perform comparisons.
Suppose that there is a minimum amount of space that we can assign a shape. Let's say that is *1000*.
If a shape is smaller than that, we still need to assign *1000*, if it is bigger we assign its area.
We can do this by using the :meth:`.CSeries.if_else` function: 

>>> shapes.assign(space=(shapes["area"] > 1000).if_else(shapes["area"], 1000)).open()
   id  length  height  area  space
0   1      32     41  1312   1312
1   2      86     66  5676   5676
2   3      21     11   231   1000

To select numbered rows of the table, we can use python's slice_ notation. For example, to select the first two rows:

>>> shapes[:2].open()
   id  length  height  area
0   1      32     41  1312
1   2      86     66  5676

The syntax is ``[start:stop]``, where ``start`` and ``stop`` are indices of the rows. This selects the rows with indices from ``start`` to ``stop``, excluding ``stop``, so ``start, start + 1, start + 2, ..., stop - 1``. If ``start`` is omitted, it is 0, and if ``stop`` is omitted, it is the length of the table.

We can also supply a step size, with the ``[start:stop:step]`` syntax. For example, to get every other element:

>>> shapes[::2].open()
   id  length  height  area
0   1      32     41  1312
1   3      21     11   231

Alternatively, you might want to select rows based on the value of certain columns. You can do this by creating a filter.
Let's say we want to select only "tall" shapes (shapes that are taller than they are long).
We can do the following:

.. code:: python

    tall_shape_filter = shapes["height"] > shapes["length"]
    tall_shapes = shapes[tall_shape_filter]

>>> tall_shapes.open()
   id  length  height  area
0   1      32     41  1312


You can now take a table, perform computations over it and then select certain records based on the data on it. This is already enough to do many computations on the data. However, crandas is a tool made for data sharing. In the next section you will learn how to combine different data sources to do exactly that.

Working with thresholds
-----------------------

The tools that you have learned to use provide us with many ways to interact with the data.
As we have seen, operations like ``sum`` and ``mean`` reveal data about a column.
To avoid revealing data on individual records, many functions like :meth:`.CDataFrame.filter` have a :code:`threshold` parameter that prevents certain information being output if it does not have enough rows of data.
Organizations may choose what threshold is acceptable, and queries of their data will have to respect it, as seen here:

>>> tall_shapes = shapes[tall_shape_filter.with_threshold(2)]
ServerError: Table cannot be filtered: remaining number of items would be below specified threshold (error code: ServerErrorCode.CLIENT_ERROR_THRESHOLD_NOT_MET)


.. _slice: https://docs.python.org/3/library/functions.html?highlight=slice#slice