from_table
docs/api/data_layer/data.py::DateTimeAccessorBase.from_table()
Creates a multi-dimensional xarray DataArray from a Pandas DataFrame, organizing data into fixed-size time dimensions.
Parameters:
data (pd.DataFrame): The input data table containing time-series data.
time_column (str, optional): The name of the column in data that contains datetime information. Default is 'time'.
asset_column (str, optional): The name of the column representing different assets or entities. Default is 'asset'.
feature_columns (list of str, optional): A list of columns in data that contain the features or measurements to include in the DataArray. If None, all columns except the time and asset columns are used.
frequency (FrequencyType, optional): The frequency of the data, e.g., 'D' for daily. Default is 'D'.
Returns:
xr.Dataset: A multi-dimensional Dataset with the following dimensions:
year: Unique years present in the data.
month: Fixed-size months (1 to 12).
day: Fixed-size days (1 to 31).
asset: Unique assets present in the data.
feature: Data features if multiple value columns are provided.
Description:
This method transforms a Pandas DataFrame into an xarray DataArray with multi-dimensional coordinates based on time components (year, month, day) and assets. It handles missing dates by creating fixed-size dimensions for months (1-12) and days (1-31), ensuring consistent shapes across different years and months.
How the Data Structures are Imposed:
Fixed-Size Time Dimensions: The method imposes fixed-size dimensions for months (1-12) and days (1-31), regardless of whether all of these months or days are present in the data. For example, the day dimension still has 31 entries for February, even though February never has more than 29 days. This gives the DataArray the same shape across all time periods.
Asset Dimension: The asset dimension is derived from the unique values in the asset_column of the DataFrame.
Feature Dimension: If multiple feature columns are provided, an additional 'feature' dimension is added to the DataArray.
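The fixed-size layout can be illustrated with plain NumPy. The following is a minimal sketch, not the method's actual implementation: the variable names and the sample observation are made up for illustration, and a single value column is assumed (so there is no feature dimension).

```python
import numpy as np
import pandas as pd

# Illustrative axes: only the years and assets depend on the data.
years = np.array([2020, 2021])
months = np.arange(1, 13)   # always 1..12
days = np.arange(1, 32)     # always 1..31
assets = np.array(["AAPL", "GOOG"])

# One slot per (year, month, day, asset) combination, NaN until filled.
values = np.full((len(years), len(months), len(days), len(assets)), np.nan)

# Place one made-up observation: AAPL on 2020-02-28.
ts = pd.Timestamp("2020-02-28")
year_pos = int(np.where(years == ts.year)[0][0])
values[year_pos, ts.month - 1, ts.day - 1, 0] = 300.0

# The slot for "February 30th" exists but can never hold data.
feb_30 = values[0, 1, 29, :]
```

The shape depends only on the number of years and assets, never on which calendar days actually occur.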
Role of TimeSeriesIndex:
Purpose: The TimeSeriesIndex class creates a mapping from datetime labels to multi-dimensional indices in the DataArray. It handles the conversion from datetime objects to indices in the (year, month, day) dimensions.
Link to DataArray: The TimeSeriesIndex is attached to the 'time' coordinate of the DataArray as an attribute. This allows users to select data based on datetime labels using the sel method provided by the TimeSeriesIndex.
Abstracting Away Indexing: Users can use datetime labels to select data without needing to know the underlying multi-dimensional indices. The TimeSeriesIndex handles the mapping internally, making time-based selection intuitive.
Implementation Details:
Data Preprocessing:
The DataFrame is copied to avoid modifying the original data.
The time_column is converted to datetime using pd.to_datetime.
Time components (year, month, day) are extracted and added as new columns in the DataFrame.
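The preprocessing steps above might look roughly like the following sketch. It reuses the sample data from the examples; the name prepared and the names of the added columns are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03"],
    "asset": ["AAPL", "AAPL", "GOOG", "AAPL"],
    "price": [300, 305, 1350, 310],
})

# Work on a copy so the caller's DataFrame is left untouched.
prepared = df.copy()
prepared["time"] = pd.to_datetime(prepared["time"])

# Split each timestamp into the components that become dimensions.
prepared["year"] = prepared["time"].dt.year
prepared["month"] = prepared["time"].dt.month
prepared["day"] = prepared["time"].dt.day
```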
Creating Fixed-Size Time Dimensions:
Unique years are obtained from the data.
Fixed-size months (1-12) and days (1-31) are created using np.arange.
A full index is created using pd.MultiIndex.from_product with all combinations of years, months, days, and assets.
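As a rough sketch of this step, the full index for one year and two assets could be built as follows. The level names are assumed to match the dimension names; the actual code may differ.

```python
import numpy as np
import pandas as pd

years = np.array([2020])            # in practice: unique years in the data
months = np.arange(1, 13)           # 1..12
days = np.arange(1, 32)             # 1..31
assets = np.array(["AAPL", "GOOG"])

# Cartesian product of all four axes, including impossible dates.
full_index = pd.MultiIndex.from_product(
    [years, months, days, assets],
    names=["year", "month", "day", "asset"],
)
```

The index has one entry per slot in the final array, so its length is the product of the four axis lengths.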
Reindexing Data:
The DataFrame is reindexed to include all possible combinations from the full index, filling missing entries with NaNs.
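A minimal sketch of the reindexing step, assuming the time components have already been extracted into year, month, and day columns (the column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "year": [2020, 2020, 2020, 2020],
    "month": [1, 1, 1, 1],
    "day": [1, 2, 2, 3],
    "asset": ["AAPL", "AAPL", "GOOG", "AAPL"],
    "price": [300.0, 305.0, 1350.0, 310.0],
})

full_index = pd.MultiIndex.from_product(
    [[2020], np.arange(1, 13), np.arange(1, 32), ["AAPL", "GOOG"]],
    names=["year", "month", "day", "asset"],
)

# Index by the four dimension columns, then expand to every combination.
# Combinations absent from the input come back as NaN rows.
indexed = df.set_index(["year", "month", "day", "asset"])
reindexed = indexed.reindex(full_index)
```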
Creating Time Coordinate:
A time coordinate is created by combining the 'year', 'month', and 'day' levels of the MultiIndex using pd.to_datetime. Invalid dates (e.g., February 30th) will result in NaT (Not a Time).
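One way to obtain NaT for impossible dates is to pass errors="coerce" to pd.to_datetime when assembling dates from component columns. Whether the method uses exactly this call is an assumption; the sketch below only demonstrates the behavior described above.

```python
import pandas as pd

# Component columns as they would come from the MultiIndex levels.
parts = pd.DataFrame({
    "year": [2020, 2020, 2021],
    "month": [2, 2, 2],
    "day": [29, 30, 29],
})

# errors="coerce" (assumed here) turns impossible dates into NaT
# instead of raising an exception.
time = pd.to_datetime(parts, errors="coerce")
```

Here 2020-02-29 is valid because 2020 is a leap year, while 2020-02-30 and 2021-02-29 both become NaT.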
Reshaping Data:
The data values are reshaped to match the shape of the time dimensions.
Creating DataArray:
A DataArray is created with dimensions (year, month, day, asset), and an optional 'feature' dimension if multiple feature columns are provided.
The 'time' coordinate is included, which contains the datetime values corresponding to each combination of (year, month, day, asset).
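The reshaping and construction steps can be sketched with a single value column as follows. This is an illustration under stated assumptions, not the library's code: for brevity the 'time' coordinate here spans only (year, month, day), and errors="coerce" is assumed for building it.

```python
import numpy as np
import pandas as pd
import xarray as xr

years = np.array([2020])
months = np.arange(1, 13)
days = np.arange(1, 32)
assets = np.array(["AAPL", "GOOG"])

# Values aligned to the full index, with one made-up observation filled in.
full_index = pd.MultiIndex.from_product(
    [years, months, days, assets], names=["year", "month", "day", "asset"]
)
prices = pd.Series(np.nan, index=full_index)
prices.loc[(2020, 1, 2, "GOOG")] = 1350.0

# from_product orders rows with the last level varying fastest, which
# matches NumPy's default row-major reshape.
shape = (len(years), len(months), len(days), len(assets))
values = prices.to_numpy().reshape(shape)

# Datetime for every (year, month, day) slot; impossible dates become NaT.
grid = pd.MultiIndex.from_product(
    [years, months, days], names=["year", "month", "day"]
).to_frame(index=False)
time = pd.to_datetime(grid, errors="coerce").to_numpy().reshape(shape[:3])

da = xr.DataArray(
    values,
    dims=("year", "month", "day", "asset"),
    coords={
        "year": years,
        "month": months,
        "day": days,
        "asset": assets,
        "time": (("year", "month", "day"), time),
    },
)
```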
Attaching TimeSeriesIndex:
A TimeSeriesIndex instance is created using the 'time' coordinate DataArray.
The TimeSeriesIndex is attached to the 'time' coordinate's attributes under the key 'indexes', allowing for time-based selection.
Examples:
Example 1: Basic Usage with Single Feature Column
Suppose you have a DataFrame df containing daily closing prices for multiple stocks:
import pandas as pd
data = {
    'time': ['2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03'],
    'asset': ['AAPL', 'AAPL', 'GOOG', 'AAPL'],
    'price': [300, 305, 1350, 310]
}
df = pd.DataFrame(data)
You can create a Dataset using:
da = DateTimeAccessorBase.from_table(
    data=df,
    time_column='time',
    asset_column='asset',
    feature_columns=['price']
)
This will produce a Dataset with dimensions (year, month, day, asset) containing the price data.
Example 2: Multiple Feature Columns
If your DataFrame has multiple features, like 'open', 'close', 'volume', you can specify multiple feature columns:
da = DateTimeAccessorBase.from_table(
    data=df,
    time_column='time',
    asset_column='asset',
    feature_columns=['open', 'close', 'volume']
)
Example 3: Accessing Data with TimeSeriesIndex
After creating the Dataset, you can use the attached TimeSeriesIndex to select data based on datetime labels:
# Select data for January 2, 2020
selected_data = da.dt.sel('2020-01-02')
This will return data corresponding to the specified date across all assets and features.
Note
Fixed-Size Dimensions: The fixed-size months and days may include dates that are not valid (e.g., February 30th). Such dates will have NaN values in the DataArray. The TimeSeriesIndex handles these cases by mapping valid datetime labels to the correct indices and ignoring invalid dates.
See Also
TimeSeriesIndex: For more information on how time indexing works.
xarray.DataArray: The underlying data structure used.
Supporting Details
TimeSeriesIndex Class
The TimeSeriesIndex class is designed to facilitate time-based indexing and selection on the multi-dimensional DataArray created by from_table.
Mapping Timestamps to Indices: It creates a mapping from datetime labels to the flattened indices of the time coordinate in the DataArray. This is necessary because the time coordinate is multi-dimensional (year, month, day), but users typically think in terms of flat datetime labels.
Handling Missing Dates: The TimeSeriesIndex only includes valid dates in its mapping. Invalid dates (e.g., February 30th) are represented as NaT and are excluded from the mapping.
Selection with sel Method: The sel method allows users to select data using datetime labels. It converts the datetime labels into flat indices, then into multi-dimensional indices corresponding to the DataArray's dimensions, and returns a dictionary that can be used with isel to select data.
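The mapping described above can be sketched with a small stand-in class. SketchTimeIndex and its attributes are hypothetical names used only for illustration; they are not the library's TimeSeriesIndex, and the real class may differ in signature and behavior.

```python
import numpy as np
import pandas as pd


class SketchTimeIndex:
    """Illustrative stand-in, NOT the library's TimeSeriesIndex."""

    def __init__(self, time, dims=("year", "month", "day")):
        self.shape = time.shape
        self.dims = dims
        # Map each valid timestamp to its position in the flattened
        # coordinate; NaT slots (impossible dates) are left out.
        flat = pd.DatetimeIndex(time.ravel())
        self.lookup = {ts: pos for pos, ts in enumerate(flat) if not pd.isna(ts)}

    def sel(self, label):
        # Datetime label -> flat position -> per-dimension positions.
        flat_pos = self.lookup[pd.Timestamp(label)]
        multi = np.unravel_index(flat_pos, self.shape)
        return {dim: int(i) for dim, i in zip(self.dims, multi)}


# Build a (year, month, day) grid of datetimes for one year.
grid = pd.MultiIndex.from_product(
    [[2020], np.arange(1, 13), np.arange(1, 32)],
    names=["year", "month", "day"],
).to_frame(index=False)
time = pd.to_datetime(grid, errors="coerce").to_numpy().reshape(1, 12, 31)

index = SketchTimeIndex(time)
indexer = index.sel("2020-01-02")
```

The returned dictionary holds integer positions per dimension, which is the form that xarray's isel accepts.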
Abstracting Indexing with Datetime
By using TimeSeriesIndex, users can work with datetime labels directly without needing to know the internal structure of the DataArray's time dimensions. This abstraction makes it easier to perform time-based operations and selections.
Imposed Data Structures
The DataArray created has dimensions (year, month, day, asset, feature), where feature is optional.
The time dimensions (year, month, day) are fixed-size, meaning they have the same size regardless of the data.
This structure ensures that the DataArray has a consistent shape, which can be important for certain types of analysis or machine learning models that expect fixed input shapes.