Data Distributor#
The `DataDistributor` class is a utility for managing and distributing datasets among clients in a federated learning simulation. It supports various data distribution strategies to simulate realistic federated learning scenarios, including IID (independent and identically distributed) and non-IID distributions. This flexibility allows researchers to evaluate the performance and robustness of models under diverse data heterogeneity conditions.
Key Features#
- Data Distribution Strategies:
  - `iid`: Splits data equally and randomly among clients.
  - `gamma_similarity_niid`: Creates non-IID distributions with a specified degree of similarity using a gamma parameter.
  - `dirichlet_niid`: Implements non-IID distributions based on a Dirichlet distribution with a configurable concentration parameter (alpha).
  - `extreme_niid`: Generates highly non-IID distributions by assigning sorted data partitions to clients.
- Flexible Input Support: Accepts datasets as PyTorch `DataLoader` objects and processes them seamlessly.
- Flexible Dataset Management: Returns data loaders for each client after applying the specified data distribution.
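The simplest of these strategies, `iid`, amounts to shuffling all sample indices and handing out equal chunks, one per client. A minimal NumPy sketch of the idea (an illustration of the strategy, not byzfl's actual implementation):

```python
import numpy as np

def iid_split(num_samples, nb_honest, seed=0):
    # Shuffle all sample indices, then split them into
    # (roughly) equal chunks, one chunk per honest client.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_samples)
    return np.array_split(idx, nb_honest)

splits = iid_split(100, 5)
print([len(s) for s in splits])  # each of the 5 clients receives 20 samples
```

Because the indices are shuffled before splitting, each client's chunk is an unbiased random sample of the whole dataset, which is what makes the split IID.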
- class byzfl.DataDistributor(params)[source]#
Bases: object
- Initialization Parameters:
params (dict) – A dictionary containing the configuration for the data distributor. Must include:
- "data_distribution_name" (str) – Name of the data distribution strategy ("iid", "gamma_similarity_niid", etc.).
- "distribution_parameter" (float) – Parameter for the data distribution strategy (e.g., gamma or alpha).
- "nb_honest" (int) – Number of honest clients to split the dataset among.
- "data_loader" (DataLoader) – The data loader of the dataset to be distributed.
- "batch_size" (int) – Batch size for the generated dataloaders.
- split_data()[source]#
Splits the dataset into dataloaders based on the specified distribution strategy.
Example
>>> from torchvision import datasets, transforms
>>> from torch.utils.data import DataLoader
>>> from byzfl import DataDistributor
>>> transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
>>> dataset = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
>>> data_loader = DataLoader(dataset, batch_size=64, shuffle=True)
>>> params = {
...     "data_distribution_name": "dirichlet_niid",
...     "distribution_parameter": 0.5,
...     "nb_honest": 5,
...     "data_loader": data_loader,
...     "batch_size": 64,
... }
>>> distributor = DataDistributor(params)
>>> dataloaders = distributor.split_data()
- dirichlet_niid_idx(targets, idx)[source]#
Creates a Dirichlet non-IID partition of the dataset.
- Parameters:
targets (numpy.ndarray) – Array of dataset targets (labels).
idx (numpy.ndarray) – Array of dataset indices corresponding to the targets.
- Returns:
list[numpy.ndarray] – A list of arrays where each array contains indices for one client.
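Dirichlet partitioning is commonly implemented per class: for each label, client proportions are drawn from a Dirichlet(alpha) distribution and that label's indices are divided accordingly, so smaller alpha values produce more skewed client label distributions. A sketch of this standard scheme (not necessarily byzfl's exact code):

```python
import numpy as np

def dirichlet_partition(targets, idx, nb_honest, alpha, seed=0):
    # For each class, split its indices among clients with proportions
    # drawn from a symmetric Dirichlet(alpha) distribution.
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(nb_honest)]
    for label in np.unique(targets):
        label_idx = idx[targets == label]
        rng.shuffle(label_idx)
        props = rng.dirichlet(np.full(nb_honest, alpha))
        # Turn cumulative proportions into split points within this class.
        cuts = (np.cumsum(props)[:-1] * len(label_idx)).astype(int)
        for client, part in enumerate(np.split(label_idx, cuts)):
            client_idx[client].extend(part.tolist())
    return [np.array(c, dtype=int) for c in client_idx]

targets = np.repeat(np.arange(10), 100)  # 10 classes, 100 samples each
idx = np.arange(len(targets))
parts = dirichlet_partition(targets, idx, nb_honest=5, alpha=0.5)
print(sum(len(p) for p in parts))  # 1000: every sample assigned exactly once
```

With a large alpha the per-class proportions approach uniform (near-IID); as alpha approaches zero, each class concentrates on a single client.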
- extreme_niid_idx(targets, idx)[source]#
Creates an extremely non-IID partition of the dataset.
- Parameters:
targets (numpy.ndarray) – Array of dataset targets (labels).
idx (numpy.ndarray) – Array of dataset indices corresponding to the targets.
- Returns:
list[numpy.ndarray] – A list of arrays where each array contains indices for one client.
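The extreme non-IID strategy described above sorts the data by label and assigns contiguous slices to clients, so each client ends up with very few classes. A minimal sketch of that idea (an illustration, not the library's implementation):

```python
import numpy as np

def extreme_niid_partition(targets, idx, nb_honest):
    # Sort indices by label, then give each client one contiguous
    # slice: most clients end up with only one or two classes.
    order = np.argsort(targets, kind="stable")
    sorted_idx = idx[order]
    return np.array_split(sorted_idx, nb_honest)

targets = np.repeat(np.arange(4), 25)  # 4 balanced classes, 25 samples each
idx = np.arange(100)
parts = extreme_niid_partition(targets, idx, nb_honest=4)
# With balanced classes and as many clients as classes,
# each client receives exactly one class.
print([np.unique(targets[p]).tolist() for p in parts])  # [[0], [1], [2], [3]]
```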
- gamma_niid_idx(targets, idx)[source]#
Creates a gamma-similarity non-IID partition of the dataset.
- Parameters:
targets (numpy.ndarray) – Array of dataset targets (labels).
idx (numpy.ndarray) – Array of dataset indices corresponding to the targets.
- Returns:
list[numpy.ndarray] – A list of arrays where each array contains indices for one client.
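One common definition of gamma-similarity partitioning, sketched below under that assumption (byzfl's exact scheme may differ): a gamma fraction of the data is distributed IID across clients, while the remaining (1 - gamma) fraction is sorted by label and split contiguously, so gamma = 1 recovers the IID split and gamma = 0 the extreme non-IID split.

```python
import numpy as np

def gamma_partition(targets, idx, nb_honest, gamma, seed=0):
    # A gamma fraction of the data is distributed IID; the remaining
    # (1 - gamma) fraction is sorted by label and split contiguously.
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(idx)
    n_iid = int(gamma * len(idx))
    iid_part, niid_part = shuffled[:n_iid], shuffled[n_iid:]
    niid_part = niid_part[np.argsort(targets[niid_part], kind="stable")]
    # Each client gets one chunk of the IID pool plus one sorted chunk.
    return [np.concatenate([a, b])
            for a, b in zip(np.array_split(iid_part, nb_honest),
                            np.array_split(niid_part, nb_honest))]

targets = np.repeat(np.arange(5), 40)  # 5 classes, 40 samples each
idx = np.arange(200)
parts = gamma_partition(targets, idx, nb_honest=4, gamma=0.5)
print(sum(len(p) for p in parts))  # 200: every sample assigned exactly once
```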
- idx_to_dataloaders(split_idx)[source]#
Converts index splits into DataLoader objects.
- Parameters:
split_idx (list[numpy.ndarray]) – A list of arrays where each array contains indices for one client.
- Returns:
list[DataLoader] – A list of DataLoader objects for each client.
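The conversion from index splits to per-client loaders can be done with `torch.utils.data.Subset`. A self-contained sketch on a toy tensor dataset (illustrating the pattern, not byzfl's internals; the `idx_to_dataloaders` helper below is a local re-implementation for the example):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

def idx_to_dataloaders(dataset, split_idx, batch_size):
    # Wrap each client's index array in a Subset view of the full
    # dataset, then build one DataLoader per client.
    return [DataLoader(Subset(dataset, idx.tolist()),
                       batch_size=batch_size, shuffle=True)
            for idx in split_idx]

# Toy dataset: 100 samples of 8 features with integer labels.
data = TensorDataset(torch.randn(100, 8), torch.randint(0, 10, (100,)))
splits = np.array_split(np.arange(100), 4)
loaders = idx_to_dataloaders(data, splits, batch_size=5)
print(len(loaders), len(loaders[0].dataset))  # 4 clients, 25 samples each
```

Using `Subset` avoids copying the underlying data: each client's loader indexes into the same dataset object.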