from_table

docs/api/data_layer/data.py::DateTimeAccessorBase.from_table()

Creates a multi-dimensional xarray DataArray from a Pandas DataFrame, organizing data into fixed-size time dimensions.

Parameters:

  • data (pd.DataFrame): The input data table containing time-series data.

  • time_column (str, optional): The name of the column in data that contains datetime information. Default is 'time'.

  • asset_column (str, optional): The name of the column representing different assets or entities. Default is 'asset'.

  • feature_columns (list of str, optional): A list of columns in data that contain the features or measurements to include in the DataArray. If None, all columns except time and asset columns are used.

  • frequency (FrequencyType, optional): The frequency of the data, e.g., 'D' for daily. Default is 'D'.

Returns:

  • xr.Dataset: A multi-dimensional Dataset with the following dimensions:

    • year: Unique years present in the data.

    • month: Fixed-size months (1 to 12).

    • day: Fixed-size days (1 to 31).

    • asset: Unique assets present in the data.

    • feature: Data features if multiple value columns are provided.

Description:

This method transforms a Pandas DataFrame into an xarray DataArray with multi-dimensional coordinates based on time components (year, month, day) and assets. It handles missing dates by creating fixed-size dimensions for months (1-12) and days (1-31), ensuring consistent shapes across different years and months.

How the Data Structures are Imposed:

  • Fixed-Size Time Dimensions: The method imposes fixed-size dimensions for months (1-12) and days (1-31), regardless of which months or days actually appear in the data. For example, although February never has 31 days, the day dimension still has 31 entries for it; the slots that do not correspond to real dates hold NaN. This gives the DataArray the same shape for every year and month.

  • Asset Dimension: The asset dimension is derived from the unique values in the asset_column of the DataFrame.

  • Feature Dimension: If multiple feature columns are provided, an additional 'feature' dimension is added to the DataArray.
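The fixed-size layout described above can be illustrated with a plain NumPy array. This is only a sketch of the shape, not the method's own code; the years, assets, and the single observation are made-up sample values.

```python
import numpy as np

# Sample coordinates (illustrative values, not taken from the library).
years = [2020, 2021]
months = np.arange(1, 13)   # always 1..12
days = np.arange(1, 32)     # always 1..31
assets = ["AAPL", "GOOG"]

# Every (year, month, day, asset) slot exists up front and starts as NaN.
values = np.full((len(years), len(months), len(days), len(assets)), np.nan)

# Place one observation: AAPL on 2020-01-02. Month and day are 1-based
# labels, so their positions are label - 1.
values[years.index(2020), 1 - 1, 2 - 1, assets.index("AAPL")] = 305.0

print(values.shape)
```

The shape depends only on the number of years and assets; the month and day axes always have 12 and 31 entries.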

Role of TimeSeriesIndex:

  • Purpose: The TimeSeriesIndex class creates a mapping from datetime labels to multi-dimensional indices in the DataArray. It handles the conversion from datetime objects to indices in the (year, month, day) dimensions.

  • Link to DataArray: The TimeSeriesIndex is attached to the 'time' coordinate of the DataArray as an attribute. This allows users to select data based on datetime labels using the sel method provided by the TimeSeriesIndex.

  • Abstracting Away Indexing: Users can use datetime labels to select data without needing to know the underlying multi-dimensional indices. The TimeSeriesIndex handles the mapping internally, making time-based selection intuitive.
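To see what this abstraction saves, here is a sketch of the positional lookup a user would otherwise do by hand. The helper `label_to_position` is hypothetical and is not part of TimeSeriesIndex; it only shows the arithmetic being hidden.

```python
from datetime import date

def label_to_position(label, years):
    """Hypothetical helper: map an ISO date string to (year, month, day) positions."""
    d = date.fromisoformat(label)
    # The year position depends on which years are present; month and day
    # positions are the 1-based labels shifted to 0-based.
    return years.index(d.year), d.month - 1, d.day - 1

years = [2019, 2020, 2021]
print(label_to_position("2020-01-02", years))
```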

Implementation Details:

  1. Data Preprocessing:

    • The DataFrame is copied to avoid modifying the original data.

    • The time_column is converted to datetime using pd.to_datetime.

    • Time components (year, month, day) are extracted and added as new columns in the DataFrame.

  2. Creating Fixed-Size Time Dimensions:

    • Unique years are obtained from the data.

    • Fixed-size months (1-12) and days (1-31) are created using np.arange.

    • A full index is created using pd.MultiIndex.from_product with all combinations of years, months, days, and assets.

  3. Reindexing Data:

    • The DataFrame is reindexed to include all possible combinations from the full index, filling missing entries with NaNs.

  4. Creating Time Coordinate:

    • A time coordinate is created by combining the 'year', 'month', and 'day' levels of the MultiIndex using pd.to_datetime. Invalid dates (e.g., February 30th) will result in NaT (Not a Time).

  5. Reshaping Data:

    • The data values are reshaped to match the shape of the time dimensions.

  6. Creating DataArray:

    • A DataArray is created with dimensions (year, month, day, asset), and an optional 'feature' dimension if multiple feature columns are provided.

    • The 'time' coordinate is included, which contains the datetime values corresponding to each combination of (year, month, day, asset).

  7. Attaching TimeSeriesIndex:

    • A TimeSeriesIndex instance is created using the 'time' coordinate DataArray.

    • The TimeSeriesIndex is attached to the 'time' coordinate's attributes under the key 'indexes', allowing for time-based selection.
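The preprocessing, reindexing, and reshaping steps above can be sketched with pandas and NumPy alone. This is a minimal reconstruction under stated assumptions, not the actual body of from_table: it handles a single 'price' feature, assumes each (date, asset) pair appears at most once, and stops before the DataArray and TimeSeriesIndex are built. All variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "time": ["2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03"],
    "asset": ["AAPL", "AAPL", "GOOG", "AAPL"],
    "price": [300, 305, 1350, 310],
})

# Preprocessing: copy, parse datetimes, extract year/month/day columns.
work = df.copy()
work["time"] = pd.to_datetime(work["time"])
work["year"] = work["time"].dt.year.astype("int64")
work["month"] = work["time"].dt.month.astype("int64")
work["day"] = work["time"].dt.day.astype("int64")

# Fixed-size dimensions and the full product index.
years = sorted(int(y) for y in work["year"].unique())
months = np.arange(1, 13)
days = np.arange(1, 32)
assets = sorted(work["asset"].unique())
dims = ["year", "month", "day", "asset"]
full_index = pd.MultiIndex.from_product([years, months, days, assets], names=dims)

# Reindexing: every missing (year, month, day, asset) combination becomes NaN.
full = work.set_index(dims)[["price"]].reindex(full_index)

# Time coordinate: invalid dates such as February 30th are coerced to NaT.
time_frame = pd.MultiIndex.from_product(
    [years, months, days], names=["year", "month", "day"]
).to_frame(index=False)
time_values = pd.to_datetime(time_frame, errors="coerce")
time_coord = time_values.to_numpy().reshape(len(years), 12, 31)

# Reshaping: flat values become a (year, month, day, asset) array.
values = full["price"].to_numpy().reshape(len(years), 12, 31, len(assets))

print(values.shape, time_coord.shape)
```

Because pd.MultiIndex.from_product varies the last level fastest, the flat order of the reindexed values matches a row-major reshape to (year, month, day, asset).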

Examples:

Example 1: Basic Usage with Single Feature Column

Suppose you have a DataFrame df containing daily closing prices for multiple stocks:

import pandas as pd

data = {
    'time': ['2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03'],
    'asset': ['AAPL', 'AAPL', 'GOOG', 'AAPL'],
    'price': [300, 305, 1350, 310]
}

df = pd.DataFrame(data)

You can create a Dataset using:

da = DateTimeAccessorBase.from_table(
    data=df,
    time_column='time',
    asset_column='asset',
    feature_columns=['price']
)

This will produce a Dataset with dimensions (year, month, day, asset) containing the price data.

Example 2: Multiple Feature Columns

If your DataFrame has multiple features, like 'open', 'close', 'volume', you can specify multiple feature columns:

da = DateTimeAccessorBase.from_table(
    data=df,
    time_column='time',
    asset_column='asset',
    feature_columns=['open', 'close', 'volume']
)

Example 3: Accessing Data with TimeSeriesIndex

After creating the Dataset, you can use the attached TimeSeriesIndex to select data based on datetime labels:

# Select data for January 2, 2020
selected_data = da.dt.sel('2020-01-02')

This will return data corresponding to the specified date across all assets and features.

Note

  • Fixed-Size Dimensions: The fixed-size months and days may include dates that are not valid (e.g., February 30th). Such dates will have NaN values in the DataArray. The TimeSeriesIndex handles these cases by mapping valid datetime labels to the correct indices and ignoring invalid dates.
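To get a sense of how many slots are affected, the following standard-library sketch counts the real calendar dates inside the 12 x 31 grid for a given year. The function name `valid_slots` is made up for this illustration.

```python
import calendar

def valid_slots(year):
    """Count the (month, day) slots in a 12 x 31 grid that are real dates."""
    # monthrange returns (weekday of first day, number of days in month).
    return sum(calendar.monthrange(year, month)[1] for month in range(1, 13))

total_slots = 12 * 31  # 372 slots per year in the fixed-size layout

for year in (2020, 2021):
    n_valid = valid_slots(year)
    print(year, n_valid, total_slots - n_valid)
```

Each year therefore carries six or seven slots (leap or non-leap year) that can never hold data.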

See Also

Supporting Details

TimeSeriesIndex Class

The TimeSeriesIndex class is designed to facilitate time-based indexing and selection on the multi-dimensional DataArray created by from_table.

  • Mapping Timestamps to Indices: It creates a mapping from datetime labels to the flattened indices of the time coordinate in the DataArray. This is necessary because the time coordinate is multi-dimensional (year, month, day), but users typically think in terms of flat datetime labels.

  • Handling Missing Dates: The TimeSeriesIndex only includes valid dates in its mapping. Invalid dates (e.g., February 30th) are represented as NaT and are excluded from the mapping.

  • Selection with sel Method: The sel method allows users to select data using datetime labels. It converts the datetime labels into flat indices, then into multi-dimensional indices corresponding to the DataArray's dimensions, and returns a dictionary that can be used with isel to select data.
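The flat-index mapping and the conversion back to multi-dimensional indices might look roughly like the sketch below. It is an assumption-laden illustration, not the TimeSeriesIndex source: the names `label_to_flat` and `sel` are stand-ins, the mapping is keyed by datetime.date for simplicity, and the years are sample values.

```python
import numpy as np
import pandas as pd

years = [2020, 2021]
shape = (len(years), 12, 31)

# Flattened time coordinate over the full (year, month, day) grid;
# invalid dates are coerced to NaT.
grid = pd.MultiIndex.from_product(
    [years, range(1, 13), range(1, 32)], names=["year", "month", "day"]
).to_frame(index=False)
time_flat = pd.to_datetime(grid, errors="coerce")

# Map each valid date to its flat position; NaT entries are skipped.
label_to_flat = {
    ts.date(): pos for pos, ts in enumerate(time_flat) if not pd.isna(ts)
}

def sel(labels):
    """Stand-in for the sel method: datetime labels -> dict of positional indices."""
    flat = np.array([label_to_flat[pd.Timestamp(label).date()] for label in labels])
    year_idx, month_idx, day_idx = np.unravel_index(flat, shape)
    return {"year": year_idx, "month": month_idx, "day": day_idx}

result = sel(["2020-01-02", "2021-03-15"])
print(result)
```

np.unravel_index does the conversion from a flat position to (year, month, day) positions, which is why the mapping only needs to store one integer per valid date.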

Abstracting Indexing with Datetime

By using TimeSeriesIndex, users can work with datetime labels directly without needing to know the internal structure of the DataArray's time dimensions. This abstraction makes it easier to perform time-based operations and selections.

Imposed Data Structures

  • The DataArray created has dimensions (year, month, day, asset, feature), where feature is optional.

  • The time dimensions (year, month, day) are fixed-size, meaning they have the same size regardless of the data.

This structure ensures that the DataArray has a consistent shape, which can be important for certain types of analysis or machine learning models that expect fixed input shapes.
