.. _recording: Recording a Script ==================== Processing data in an authorized environment requires recording the script that needs to be executed. The recording is done in a design environment that contains *dummy data*, tables that have the same [1]_ structure as the real data but with fake entries. The script is executed in the design environment, creating a ``recording`` file that must be uploaded to the platform for `approval `_. After a script is recorded and an approval is created, there are a couple of simple changes that need to be done in the authorized environment. .. [1] Except for number of rows. The first step when recording a script is to make sure that we are connected to a design environment. In such environment there are no restrictions on the computations that can be executed. In general, this requires connecting to the engine with the connection file that ends in ``-design``. .. code:: python import crandas as cd cd.connect("environment-design") The purpose of this environment is to design the script that will be run over the real data. This script will be recorded into a ``.recording`` file that must then be cryptographically signed by the approvers, typically through the platform. .. tip:: You should only attempt to record the script **after** you have fully run and designed the analysis to be run. To start recording a script, you must do this: .. code:: python # Adding a name for the analysis is not obligatory, but it is recommended script = cd.script.record("Analysis name") After running the script that needs to be recorded a file is created by calling the :meth:`save <.script.Script.save>` method while specifying the filename for the recorded script. .. code:: python # After the script is finished, create the recording file script.record("filename.recording") Note that if there is an issue in the recording of the script, you should restart the script recording from scratch instead of trying halfway. In that case certain steps might be recorded multiple times or not at all, meaning that the authorized script will not work. Call :func:`.script.record` to start the recording again. .. admonition:: Using Jupyter :class: jupyter When you are using a Jupyter notebook, can use ``Cells -> Run all`` (or ``Kernel -> Restart & Run all``). This will start recording a script, run the commands in order against in the design mode, and save the script to the specified filename. Part of the goal of recording a script is to see that you effectively get the data that you want to get out. For this to happen, there must be tables that look like the real data over which the script is run. Of course, these tables cannot contain real data, yet they must be connected to the tables in the authorized environment that do. .. _dummy_data: Uploading dummy data ---------------------- In order to record a script, we need to have a table with the same structure as the table of real data over which the analysis will ultimately be performed. By *the same structure*, we mean that it has the same columns in the same order and each of these columns is of the appropriate type, including whether it contains missing values. In order to ensure this structure is correct, you can use :ref:`table schemas `. When uploading a table, use the ``schema`` argument to ensure that the table has the right properties. For example, if a column could have missing values, the schema can make this distinction even if the dummy data does not contain those values. .. code:: python SCHEMA = {"string": "varchar", "nullable_string": "varchar?", "ranking": "int8"} tab = cd.DataFrame( { "string": ["these", "are", "strings"], "nullable_string": ["so", "are", "these"], "ranking": [1, 2, 3], }, schema=SCHEMA, ) When checking the structure of ``tab`` we see that the ``nullable_string`` column is of type ``varchar?`` even if it has no missing values. >>> tab Handle: 52773671E3EE4B863460E6E991084D743B0CBDEC30B45C6D9489102EF1959931 (design mode) string nullable_string ctype varchar varchar? Size: 3 rows x 2 columns (contents secret-shared; data size: 360 bytes) Having uploaded the table with dummy data, we need to associate it with the table with the real data that is found in the authorized environment. For that, we use the :meth:`.crandas.DataFrame.save` method with the ``dummy_for`` keyword. Assuming that we saved the handle for the table with the real data in the ``PROD_HANDLE`` variable, we do the following: >>> tab.save(dummy_for=PROD_HANDLE) This will associate the table that we uploaded with the table with the real data found in the authorized environment. Note that the process of uploading dummy data must be done **outside of the recording**. Once we are inside the recording, we must access the table with :func:`.crandas.get_table`. .. code:: python # Given a pandas DataFrame `df` with the data, a schema `SCHEMA` # and the real handle `PROD_HANDLE` # The process would look like this dummy_tab = cd.upload_pandas_dataframe(df, schema=SCHEMA) dummy_tab.save(dummy_for=PROD_HANDLE) cd.script.record("Analysis name") # Note that we use the authorized handle and not the dummy one tab = cd.get_table(PROD_HANDLE, schema=SCHEMA) # We get with `schema` to ensure that the table fulfills the schema we want After uploading all the tables that will be part of the analysis, we can continue to design the script. .. note:: Typically, table uploads are done *outside* of script recordings with the sensitive data uploaded through `the platform `_. .. _recording_placeholders: Values and placeholders ------------------------- Beyond having dummy data uploaded to the platform, a script might also use certain data that is either computed or comes from external sources. For example, given a column with a ranking from 1 to 10 we might want to filter out the responses with a value that is bigger than 5. In that case we would write the following: .. code:: python new_tab = tab[tab["ranking"]>5] This will work without an issue, but what if we want to compare it not to a number that is already known but to something that is computed from the data, like a mean? .. code:: python # Here we compute the mean, which *opens* it mean = tab["ranking"].mean() # Now we filter based on the mean new_tab = tab[tab["ranking"]>mean] # This will not work, because the mean of the dummy data # will be different than the real data This example, where we open the mean of a column and then try to filter the table will not work when executing a recorded script. This is because when ``mean`` is opened it becomes a value that becomes hardcoded in the recording. For this particular case, the solution is to simply *not open it*. Every value that is unopened (as well as any table, grouping or model) that is created during a recording will correspond to the analogous object in the authorized environment. .. code:: python # mode="regular" ensures the mean is not opened mean = tab["ranking"].mean(mode="regular") # This will now work new_tab = tab[tab["ranking"]>mean] # If we want to know the mean, we must open it mean.open() While this is the preferred way to deal with values, there could be a situation where the value is not known in advanced and not computed from the data. In this situation you can use placeholders. The :class:`Any<.placeholders.Any>` placeholder does what its names suggests, it allows us to use any value in that place. .. code:: python val = cd.placeholders.Any(5) # In the authorized environment, `val` can be substituted by any value values_above = tab[tab["ranking"]>val] values_below = tab[tab["ranking"]4.2] new_tab = tab[tab["ranking"]<4.2] But we could not execute this, as the placeholder will have two different values: .. code:: python new_tab = tab[tab["ranking"]>4.2] new_tab = tab[tab["ranking"]<5] .. warning:: Given an approved script with a placeholder, every distinct run of the script will allow a different value for a placeholder. You must try to minimize the use of placeholders in order to prevent this possibility. Conditionals and loops ----------------------- ``if`` statements and ``for`` loops are the usual tools of a programmer, however, you must be careful when using them in a recorded script. A script recording only records the calls that are made to the engine. Therefore, in cases where a conditional determines which operation or values will be used, the recording and the actual execution will not match. Here is a simple example: .. code:: python if tab["ranking"].max(mode="open") > 6: tab["succeeded"] = 0 else: tab["succeeded"] = 1 Only one of the two branches will be executed when recording this script and this is not necessarily what will be executed when working with real data. In the given example table at the beginning of this article, this conditional will go into the ``else`` branch, but the real data could realistically go into the first branch. The execution would then fail because the recording and the execution will never coincide. When recording, this will output a :exc:`ConditionalCallDetected <.check_recording.ConditionalCallDetected>` warning. Depending of what is needed, a conditional call can be avoided by using :meth:`.CSeries.if_else`, :func:`when().then()<.crandas.when>`, :ref:`thresholds ` or :ref:`placeholders`. For example, the previous example could be converted to: .. code:: python tab["succeeded"] = (tab["ranking"].max(mode="regular")>6).if_else(1,0) .. note:: Unlike the previous section, we cannot simply not open the value to use it in a conditional. This is because something like ``tab["ranking"].max(mode="regular") > 6`` cannot be evaluated in an ``if`` block. A similar issue can happen when we use a ``for`` loop. For example, see the following example: .. code:: python # We group the table to get all distinct values of `ranking` grouping = tab.groupby("ranking") values = grouping.as_table()["ranking"] for i in range(0,len(values)): # This iterates through rows and prints them print(values[i:i+1].open()) This code will make as many calls to the slicing and opening functions as there are values in the dummy data table. If this differs from the production table, the scripts will no longer coincide. This behavior also leads to a :exc:`ConditionalCallDetected <.check_recording.ConditionalCallDetected>` warning. In general, iteration over rows is not recommended because of this issue. Most operations are rowwise and do not require iterating. There are sometimes where for loops are not problematic, like when we are iterating over *columns* instead of *rows*. While a :exc:`ConditionalCallDetected <.check_recording.ConditionalCallDetected>` warning will still be issued, because we know the column structure of the table and we know the dummy data table has the same schema, we can do the following: .. code:: python # We iterate over columns for col in tab.columns.cols: # Check if the column is an integer if (col.type=="i"): # We print the mean print(tab[col.name].mean(mode="open")) This code prints the means of integer columns. While this is a ``for`` statement, it will always work because the structure of the dummy and real tables are the same. Note that we even use a conditional here, which will work for the same reason: it refers to the schema of the table (which is the same in design and in authorized mode).