Data and Trace
bmmltools is organized around two key objects:
Data: used to load external data;
Trace: used to store all the partial results and, more generally, to execute various tasks during the application of a series of operations to some input data.
Data
Data is the Python class used to load data in bmmltools. Input data are loaded and
stored in an hdf5 file, saved in a folder selected by the user. Data can load many formats in different ways,
which are listed below. Each of these input types has its own input method in Data, which is indicated
in parentheses.
stacks saved in a multitiff (see load_stack);
stacks saved as tiff files slice by slice in a folder (see load_stack_from_folder);
numpy arrays which are homogeneous in data type and contain only numeric or boolean data (see from_array);
.npy files containing a numpy array which is homogeneous in data type and contains only numeric or boolean data (see load_npy);
pandas dataframes (see from_pandas_df);
json files containing a pandas dataframe (see load_pandas_df_from_json);
csv files containing a pandas dataframe (see load_pandas_df_from_csv).
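As a rough illustration of how the list above could be used to pick an input method, the sketch below maps a file path to the name of the corresponding Data method. The method names are the ones listed above; the file suffixes, the rule that a path without a suffix is a folder of tiff slices, and the helper loader_name_for are assumptions made for this example and are not part of bmmltools.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the Data method named in the list
# above. The suffixes are an assumption; only the method names come from the text.
LOADER_BY_SUFFIX = {
    ".tif": "load_stack",
    ".tiff": "load_stack",
    ".npy": "load_npy",
    ".json": "load_pandas_df_from_json",
    ".csv": "load_pandas_df_from_csv",
}


def loader_name_for(path):
    """Return the name of the Data method that would handle `path`."""
    p = Path(path)
    if p.suffix == "":
        # Assumption: a path without a suffix is a folder of tiff slices.
        return "load_stack_from_folder"
    try:
        return LOADER_BY_SUFFIX[p.suffix.lower()]
    except KeyError:
        raise ValueError(f"no loader listed for files ending in {p.suffix!r}")
```

In-memory objects (numpy arrays and pandas dataframes) are not covered by this helper, since they are passed directly to from_array and from_pandas_df.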
In addition to these, there are two main methods:
new, to create a new hdf5 file to store the loaded data. The hdf5 file will be created in a folder, specified in the working_folder field of this function. The code below shows how to create a new Data object from a numpy array.
import numpy as np
from bmmltools.core.data import Data

# initialize a Data object
d = Data()
d.create(working_folder = r'[PATH TO SOME FOLDER]')

# load data from a numpy array
arr = np.random.uniform(0,1,size=10)
d.from_array(arr,'x')
With the code above, the content of the numpy array arr is stored in the hdf5 file in a dataset called x.
It is also possible to specify the working folder directly when the Data object is initialized. With this in mind, the initialization lines in the code above can be reduced to
d = Data(working_folder = r'[PATH TO SOME FOLDER]')
Note
The name of the hdf5 file created by Data has a standard structure, which is data_XXXX.hdf5. XXXX is a 4-digit code which is randomly generated when the file is created in order to uniquely identify this file: these 4 digits are called data code.
link, to link the initialized Data object to an already existing hdf5 file (typically already containing some datasets previously loaded). To link an initialized Data object to an existing hdf5 file, one needs to specify the folder where the file is located in the working_folder field, and the data code (as a string) in the data_code field. The code lines below show how this can be done.
from bmmltools.core.data import Data

# initialize a Data object
d = Data()

# link an existing hdf5 file
d.link(working_folder=r'[PATH TO FOLDER WITH data_XXXX.hdf5 FILE]',data_code='XXXX')
Once a Data object has been created and filled with some input method, or linked to some hdf5 file,
a dataset can be used by specifying its name within square brackets, as shown in the example below.
After the square brackets one can use the slicing notation of h5py, which mimics the numpy slicing notation.
print(d['x'][0])
The line above prints the 0-th element of the dataset called x present in the hdf5 file linked to the
Data object. Alternatively one can use use_dataset, which can be particularly useful if the dataset has to be
used many times. Consider the example below.
# ...
# [creation and filling of a Data object or linking to an hdf5 file]
# ...
# select a dataset
d.use_dataset('x')
print(d[0])
print(d[1])
# unselect a dataset
d.use_dataset(None)
print(d['x'][0]) # <- This should work.
print(d[1]) # <- This should give rise to error.
In the code above, the dataset x is selected first, and from then on every time the data object is indexed the use of
this particular dataset is assumed: the first two prints show elements 0 and 1 of the dataset without the need to
specify the dataset name each time. To “unselect” a dataset, None should be given as the argument of use_dataset.
In this case one has to proceed in the standard way, as the last two lines of the code above show.
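The selection behaviour described above can be imitated with a few lines of plain Python. The class below is a toy stand-in written for this explanation, not the bmmltools implementation: DatasetSelector is a hypothetical name, and ordinary lists play the role of hdf5 datasets.

```python
class DatasetSelector:
    """Toy stand-in for the selection logic; not the bmmltools implementation."""

    def __init__(self, datasets):
        self._datasets = datasets   # name -> sequence, standing in for hdf5 datasets
        self._selected = None       # name of the currently selected dataset, if any

    def use_dataset(self, name):
        # Passing None clears the selection.
        if name is not None and name not in self._datasets:
            raise KeyError(name)
        self._selected = name

    def __getitem__(self, key):
        if self._selected is not None:
            # A dataset is selected: the key indexes into it directly.
            return self._datasets[self._selected][key]
        # No selection: the key must be a dataset name.
        return self._datasets[key]


d = DatasetSelector({"x": [10, 20, 30]})
d.use_dataset("x")
first = d[0]          # element 0 of 'x', no dataset name needed
d.use_dataset(None)
same = d["x"][0]      # explicit form again
```

With no dataset selected, an integer index such as d[1] is treated as a dataset name and fails with a KeyError in this sketch, which mirrors the error mentioned for the last line of the example above.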
Trace
Trace is the core class of bmmltools. It is used to track all the intermediate
results during the application of a series of operations, automatically and without keeping these results in the
computer RAM. Trace produces a series of files in a folder called trace folder, which is a folder created at a
path specified by the user (see below). The trace folder has a standard name: trace_XXXX, where XXXX is a random
4-digit number (called trace code) uniquely identifying the trace.
The files generated by Trace are listed and explained below.
trace hdf5: here the intermediate results are stored. This file is produced once the trace is created (see below) and is the file which can be linked to a Trace object.
trace json: here the trace graph, i.e. all the information needed to reconstruct the sequence of operations applied on a given trace, and the parameters of the various operations applied on the trace are stored in a dictionary-like format.
trace dill: here the initialized operations applied on a trace are saved as dill objects once their application terminates (i.e. they are saved in the state they have at the end of the application on a dataset contained in the trace). This file is produced only if enable_trace_graph = True when the Trace object is initialized (this is the default setting).
When operations act on a trace they can produce a series of folders where various files are saved during the application
on the trace. The Trace object is also responsible for creating these folders and organizing them in a standard way.
These folders are organized as follows.
trace file folder, to save the intermediate quantities produced during the application of an operation. The path to this folder is standard (it is a folder called trace_files inside the trace folder) and can be obtained by calling the method trace_file_path.
trace readings folder, to save the final result available at the end of the application of an operation (i.e. possibly an intermediate result of the application of a series of operations). The path to this folder is standard (it is a folder called trace_readings inside the trace folder) and can be obtained by calling the method trace_readings_path.
trace outputs folder, where the output operations store the files they produce when they are applied. The path to this folder is standard (it is a folder called trace_outputs inside the trace folder) and can be obtained by calling the method trace_outputs_path.
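The folder layout described above can be reproduced with the standard library alone. The sketch below only illustrates the naming convention (trace_XXXX with the three subfolders); how bmmltools actually generates the 4-digit code, including zero padding and the handling of name collisions, is an assumption here, and make_trace_layout is a hypothetical helper.

```python
import random
import tempfile
from pathlib import Path


def make_trace_layout(working_folder, rng=random):
    """Create a trace_XXXX folder with the three standard subfolders."""
    code = f"{rng.randrange(10_000):04d}"   # 4-digit code, zero padded (assumption)
    trace_folder = Path(working_folder) / f"trace_{code}"
    subfolders = {
        name: trace_folder / name
        for name in ("trace_files", "trace_readings", "trace_outputs")
    }
    for path in subfolders.values():
        # parents=True also creates the trace folder itself.
        path.mkdir(parents=True, exist_ok=False)
    return trace_folder, subfolders


# Build the layout inside a temporary working folder and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    trace_folder, subfolders = make_trace_layout(tmp)
    created = sorted(p.name for p in trace_folder.iterdir())
    folder_name = trace_folder.name
```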
From a practical point of view, Trace works similarly to Data.
More precisely, once initialized, a Trace object needs to create or to be linked
to an hdf5 file. Two methods are used for that:
create is used to create an hdf5 file (and a trace json too). To create an hdf5 file one needs to specify the folder where the trace folder is created. This is done by specifying the path in the working_folder field. It is also possible to specify the group where all the intermediate results are stored in the group_name field. By default the group is the root of the hdf5 file, i.e. the intermediate results are stored in the dataset /[variable_name]. When the group is specified, the intermediate results are saved at /[group_name]/[variable_name]. The code below shows how to initialize a new trace.
from bmmltools.core.trace import Trace

# initialize a trace, creating the necessary trace files
t = Trace()
t.create(working_folder=r'[SOME FOLDER PATH]', group_name='[GROUP NAME]')
It is not mandatory to use groups inside a trace, but they can be useful: groups can be used to give some internal organization to the hdf5 trace file, for example by keeping separate the intermediate results coming from different pipelines of operations.
link is used to link an initialized Trace object to an already existing hdf5 file (and json file) containing the trace. To do that one needs to specify the path to the trace folder in the trace_folder field, and the name of the group (if any) in the group_name field.
from bmmltools.core.trace import Trace

# initialize a trace object and link it to an existing trace folder
t = Trace()
t.link(trace_folder=r'[TRACE FOLDER PATH]', group_name='[GROUP NAME]')
Since a trace can be organized in groups, one can create a new group or change the group used to store the data. This
can be done using the methods change_group and create_group, whose meaning is self-explanatory.
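To make the effect of groups concrete, the helper below builds the hdf5 path at which a traced variable would be stored, following the rule given in the description of create (root by default, /[group_name]/[variable_name] otherwise). The function dataset_path is hypothetical and written only for this explanation; it is not part of bmmltools.

```python
def dataset_path(variable_name, group_name=None):
    """Build the hdf5 path of a traced variable from the rule described above."""
    if not group_name:
        # Default: variables live directly under the file root.
        return f"/{variable_name}"
    # Strip stray slashes so the group name can be given with or without them.
    return f"/{group_name.strip('/')}/{variable_name}"
```

For instance, a variable x stored while the group pipeline_a is in use would end up at /pipeline_a/x, while the same variable stored without any group would be at /x.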
Attention
It is possible to specify in the trace the seed used for all the random steps of the various operations applied on the trace. This can be done right after the creation/linking of an hdf5 file, simply as shown below.
#...
t.seed = 5
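The reason for fixing a seed is reproducibility: two runs of the same random step with the same seed give the same result. The sketch below shows this with the standard library only; noisy_step is a hypothetical operation invented for the example, and how bmmltools operations actually consume the seed stored in the trace is not shown here.

```python
import random


def noisy_step(values, seed):
    """Hypothetical operation with a random step driven by an explicit seed."""
    rng = random.Random(seed)   # private generator, no global state touched
    return [v + rng.uniform(-1, 1) for v in values]


run_a = noisy_step([1.0, 2.0, 3.0], seed=5)
run_b = noisy_step([1.0, 2.0, 3.0], seed=5)   # same seed: identical result
run_c = noisy_step([1.0, 2.0, 3.0], seed=6)   # different seed: different noise
```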
Given a Trace object linked to some hdf5 file, one can initialize a new variable
in the trace, recover the content of a variable tracked on the trace, or delete a variable using standard Python
syntax. The example below shows the basic usage of a Trace object.
from bmmltools.core.trace import Trace
# initialize a trace creating all necessary trace files
t = Trace()
t.create(working_folder=r'[SOME FOLDER PATH]')
# add an initialized variable to the trace
t.x = 4
# recover a variable from the trace
print(t.x)
# change value to a variable on the trace
t.x = 5
print(t.x)
# remove a variable from the trace
del t.x
print(t.x) # <- this should give rise to error.
It is important to keep in mind that the variable x is in RAM only for the time necessary to print it: for the rest of
the time the variable is stored in an hdf5 file. This is particularly useful when one has to use many different
variables containing data occupying a lot of RAM. Note that in the example above the whole content of x is loaded
in RAM.
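The attribute-style interface shown above (assignment, reading and del on t.x) can be obtained in Python by overriding the attribute hooks of a class. The sketch below is a toy analogue that keeps each variable in a pickle file instead of an hdf5 dataset; DiskBackedNamespace is a hypothetical class, and nothing here is claimed about how Trace is implemented internally.

```python
import pickle
import tempfile
from pathlib import Path


class DiskBackedNamespace:
    """Toy analogue of attribute-style tracking; values live in files, not in RAM."""

    def __init__(self, folder):
        # Bypass our own __setattr__ for the one real instance attribute.
        object.__setattr__(self, "_folder", Path(folder))

    def _file(self, name):
        return self._folder / f"{name}.pkl"

    def __setattr__(self, name, value):
        # `t.x = value` writes the value to disk.
        with open(self._file(name), "wb") as fh:
            pickle.dump(value, fh)

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for tracked variables.
        try:
            with open(self._file(name), "rb") as fh:
                return pickle.load(fh)
        except FileNotFoundError:
            raise AttributeError(name) from None

    def __delattr__(self, name):
        # `del t.x` removes the file.
        try:
            self._file(name).unlink()
        except FileNotFoundError:
            raise AttributeError(name) from None


with tempfile.TemporaryDirectory() as tmp:
    t = DiskBackedNamespace(tmp)
    t.x = 4
    first = t.x          # read back from disk
    t.x = 5
    second = t.x
    del t.x
    still_there = hasattr(t, "x")   # False: reading now raises AttributeError
```

As in the example above, each read loads the whole value into memory only for as long as the caller holds on to it.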
Finally, Trace also has a method to get information about the trace content,
which is infotrace. This method can be used to get the names of the
variables that are currently tracked on the hard disk, the variable types, the groups available on the trace,
and the group currently used to store the variables.
t.infotrace()
Supported variable types
Trace is able to automatically store, read and delete variables on the hard disk (i.e. inside the trace hdf5 file) only if they are of specific formats. These formats are listed below.
Homogeneous numpy array: namely numpy arrays of any shape and dimension whose elements are numbers, all of the same type, i.e. only boolean, integer, float or complex.
import numpy as np
...
# ...
# [initialization and linking to an hdf5 file of a trace object]
# ...

# create an nd array
arr = np.random.uniform(0,1,size=(10,10,10))

# store the value of arr in x, then erase it from the RAM
trace.x = arr
del arr

# read the whole x and print the content
print(trace.x)
Homogeneous numeric dataframe: namely pandas dataframes whose elements are all of the same numeric type. The numeric types supported are the same as for the previous data format.
import pandas as pd
...
# ...
# [initialization and linking to an hdf5 file of a trace object]
# ...

# create a pandas dataframe
df = pd.DataFrame({'X':[1,2,3,4],'Y':[5,6,7,8],'Z':[9,10,11,12]})

# store the value of df in y, then erase it from the RAM
trace.y = df
del df

# read the whole y and print the content
print(trace.y)
Dictionary of the two variable types listed above: namely a dictionary whose values are homogeneous numpy arrays and/or homogeneous numeric dataframes. One can read and write individual keys of the dictionary by using the methods read_dictionary_key and write_dictionary_key.
import numpy as np
import pandas as pd
...
# ...
# [initialization and linking to an hdf5 file of a trace object]
# ...

# create a dictionary to save
dictionary_to_trace = {'x': np.random.uniform(0,1,size=(10,10,10)),
                       'y': pd.DataFrame({'X':[1,2,3,4],'Y':[5,6,7,8],'Z':[9,10,11,12]})}

# store the dictionary in 'dictionary', then erase it from the RAM
trace.dictionary = dictionary_to_trace
del dictionary_to_trace

# read the whole 'dictionary' and print the content
print(trace.dictionary)

# read just one key of 'dictionary'
trace.read_dictionary_key('dictionary','x')

# write just one key of 'dictionary'
trace.write_dictionary_key('dictionary','x',np.array([1,2,3]))
External link to a dataset in another hdf5 file: it is used to avoid copying the content of an input dataset which is present in a Data object, saving space on the hard disk.
Note
This external link depends on the path to the hdf5 file of the Data object. Therefore, if the content of the folder created by Data, where its hdf5 file is created, is changed, the external link will not work (see external links in h5py).
Anything that does not fall into these categories can be added to a trace, but its content remains in RAM.
Note
The decision on where the variables are stored (in the hdf5 file or in RAM) is made automatically by Trace
and cannot be selected by the user.
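The kind of automatic decision described in this section can be sketched as a simple type check. The code below is an illustration under stated assumptions, not the rule implemented by Trace: it covers only numpy arrays and dictionaries of numpy arrays (dataframes and external links are left out to keep the example short), and storage_for and is_homogeneous_array are hypothetical helpers.

```python
import numpy as np

# dtype kinds for boolean, signed/unsigned integer, float and complex data
_NUMERIC_KINDS = set("biufc")


def is_homogeneous_array(value):
    """True for numpy arrays whose dtype is boolean, integer, float or complex."""
    return isinstance(value, np.ndarray) and value.dtype.kind in _NUMERIC_KINDS


def storage_for(value):
    """Return 'hdf5' for the formats handled in this sketch, otherwise 'RAM'."""
    if is_homogeneous_array(value):
        return "hdf5"
    # A non-empty dictionary qualifies only if every value is a supported array.
    if isinstance(value, dict) and value and all(
        is_homogeneous_array(v) for v in value.values()
    ):
        return "hdf5"
    return "RAM"
```

Under these assumptions, an array of strings or an object array would be classified as RAM, since its dtype is not one of the numeric kinds listed above.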