Data Distributor

The DataDistributor class is a utility for managing and distributing datasets among clients in a federated learning simulation. It supports various data distribution strategies to simulate realistic federated learning scenarios, including IID (independent and identically distributed) and non-IID distributions. This flexibility allows researchers to evaluate the performance and robustness of models under diverse data heterogeneity conditions.

Key Features

  • Data Distribution Strategies:
    • iid: Splits data equally and randomly among clients.

    • gamma_similarity_niid: Creates non-IID distributions with a specified degree of similarity using a gamma parameter.

    • dirichlet_niid: Implements non-IID distributions based on a Dirichlet distribution with a configurable concentration parameter (alpha).

    • extreme_niid: Generates highly non-IID distributions by assigning sorted data partitions to clients.

  • Flexible Input Support: Accepts the dataset as a PyTorch DataLoader object.

  • Flexible Dataset Management: Returns one DataLoader per client after applying the specified data distribution.

class byzfl.DataDistributor(params)

Bases: object

Initialization Parameters:

params (dict) – A dictionary containing the configuration for the data distributor. Must include:

  • “data_distribution_name” (str)

    Name of the data distribution strategy: “iid”, “gamma_similarity_niid”, “dirichlet_niid”, or “extreme_niid”.

  • “distribution_parameter” (float)

    Parameter for the data distribution strategy (e.g., gamma for “gamma_similarity_niid” or alpha for “dirichlet_niid”).

  • “nb_honest” (int)

    Number of honest clients to split the dataset among.

  • “data_loader” (DataLoader)

    The data loader of the dataset to be distributed.

  • “batch_size” (int)

    Batch size for the generated dataloaders.
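Since all five keys are required, it can help to check a configuration dictionary before constructing the distributor. The following is a minimal sketch of such a check; `REQUIRED_KEYS` and `missing_keys` are hypothetical names introduced here for illustration and are not part of byzfl.

```python
# Keys the params dictionary must include, as listed above.
REQUIRED_KEYS = (
    "data_distribution_name",
    "distribution_parameter",
    "nb_honest",
    "data_loader",
    "batch_size",
)


def missing_keys(params):
    """Hypothetical helper (not part of byzfl): list the required keys absent from params."""
    return [key for key in REQUIRED_KEYS if key not in params]


# An incomplete configuration: two required keys are not set.
params = {"data_distribution_name": "iid", "nb_honest": 5, "batch_size": 64}
print(missing_keys(params))
```

An empty list means every documented key is present; the check says nothing about whether the values themselves are valid.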

Methods:

  • split_data(): Splits the dataset into dataloaders based on the specified distribution strategy.

Example

>>> from torchvision import datasets, transforms
>>> from torch.utils.data import DataLoader
>>> from byzfl import DataDistributor
>>> # Load MNIST as tensors normalized with mean 0.5 and std 0.5
>>> transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
>>> dataset = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
>>> data_loader = DataLoader(dataset, batch_size=64, shuffle=True)
>>> # Dirichlet non-IID split with alpha = 0.5 across 5 honest clients
>>> params = {
...     "data_distribution_name": "dirichlet_niid",
...     "distribution_parameter": 0.5,
...     "nb_honest": 5,
...     "data_loader": data_loader,
...     "batch_size": 64,
... }
>>> distributor = DataDistributor(params)
>>> dataloaders = distributor.split_data()

dirichlet_niid_idx(targets, idx)

Creates a Dirichlet non-IID partition of the dataset.

Parameters:
  • targets (numpy.ndarray) – Array of dataset targets (labels).

  • idx (numpy.ndarray) – Array of dataset indices corresponding to the targets.

Returns:

list[numpy.ndarray] – A list of arrays where each array contains indices for one client.
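To make the idea concrete, here is a minimal NumPy sketch of one common way to build a Dirichlet partition: for each class, draw client proportions from a symmetric Dirichlet distribution and cut that class's indices accordingly. This is an illustration under stated assumptions, not necessarily how byzfl implements the method; `dirichlet_split`, `nb_clients`, `alpha`, and `seed` are names chosen for this example.

```python
import numpy as np


def dirichlet_split(targets, idx, nb_clients, alpha, seed=0):
    """Illustrative Dirichlet partition; not the byzfl implementation."""
    rng = np.random.default_rng(seed)
    client_chunks = [[] for _ in range(nb_clients)]
    for label in np.unique(targets[idx]):
        # Indices of this class, in random order
        label_idx = rng.permutation(idx[targets[idx] == label])
        # Share of this class assigned to each client
        proportions = rng.dirichlet(np.full(nb_clients, alpha))
        # Cumulative shares become cut points inside label_idx
        cuts = (np.cumsum(proportions)[:-1] * len(label_idx)).astype(int)
        for client, chunk in enumerate(np.split(label_idx, cuts)):
            client_chunks[client].append(chunk)
    return [np.concatenate(chunks) for chunks in client_chunks]


# Toy dataset: 3 classes with 20 samples each
targets = np.repeat(np.arange(3), 20)
idx = np.arange(len(targets))
parts = dirichlet_split(targets, idx, nb_clients=4, alpha=0.5)
for client, part in enumerate(parts):
    print(client, np.bincount(targets[part], minlength=3))
```

Smaller values of alpha concentrate each class on fewer clients, while larger values push the per-client class mix toward uniform.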

extreme_niid_idx(targets, idx)

Creates an extremely non-IID partition of the dataset.

Parameters:
  • targets (numpy.ndarray) – Array of dataset targets (labels).

  • idx (numpy.ndarray) – Array of dataset indices corresponding to the targets.

Returns:

list[numpy.ndarray] – A list of arrays where each array contains indices for one client.
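The "sorted data partitions" idea can be sketched as sorting the indices by label and cutting the sorted sequence into contiguous blocks, so each client sees only a narrow range of classes. The snippet below is a sketch under that assumption; `extreme_niid_split` and `nb_clients` are illustrative names, and byzfl's own implementation may differ in detail.

```python
import numpy as np


def extreme_niid_split(targets, idx, nb_clients):
    """Illustrative label-sorted partition; not the byzfl implementation."""
    # Stable sort keeps samples of the same label in their original order
    order = np.argsort(targets[idx], kind="stable")
    # Contiguous blocks of the label-sorted indices, one per client
    return np.array_split(idx[order], nb_clients)


targets = np.array([2, 0, 1, 0, 2, 1])
idx = np.arange(len(targets))
parts = extreme_niid_split(targets, idx, nb_clients=3)
for client, part in enumerate(parts):
    print(client, targets[part])
```

With as many clients as classes and balanced classes, as in this toy input, every client ends up holding a single label.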

gamma_niid_idx(targets, idx)

Creates a gamma-similarity non-IID partition of the dataset.

Parameters:
  • targets (numpy.ndarray) – Array of dataset targets (labels).

  • idx (numpy.ndarray) – Array of dataset indices corresponding to the targets.

Returns:

list[numpy.ndarray] – A list of arrays where each array contains indices for one client.
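A gamma-similarity split is often built by mixing the two extremes: a fraction gamma of the data is spread IID, and the remainder is sorted by label before being divided. The sketch below assumes that interpretation (gamma = 1 gives a fully IID split, gamma = 0 a fully label-sorted one); `gamma_similarity_split` and its parameters are illustrative names, not the byzfl API.

```python
import numpy as np


def gamma_similarity_split(targets, idx, nb_clients, gamma, seed=0):
    """Illustrative gamma-similarity partition; not the byzfl implementation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(idx)
    nb_iid = int(gamma * len(idx))
    iid_part, niid_part = idx[:nb_iid], idx[nb_iid:]
    # IID share: near-equal cuts of the shuffled indices
    iid_chunks = np.array_split(iid_part, nb_clients)
    # Non-IID share: sort by label, then cut into contiguous blocks
    order = np.argsort(targets[niid_part], kind="stable")
    niid_chunks = np.array_split(niid_part[order], nb_clients)
    return [np.concatenate((a, b)) for a, b in zip(iid_chunks, niid_chunks)]


# Toy dataset: 4 classes with 25 samples each
targets = np.repeat(np.arange(4), 25)
idx = np.arange(len(targets))
for gamma in (0.0, 0.5, 1.0):
    parts = gamma_similarity_split(targets, idx, nb_clients=4, gamma=gamma)
    print(gamma, [np.bincount(targets[p], minlength=4).tolist() for p in parts])
```

Printing the per-client class counts for several values of gamma shows how the label mix moves from one class per client toward a balanced mix.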

idx_to_dataloaders(split_idx)

Converts index splits into DataLoader objects.

Parameters:

split_idx (list[numpy.ndarray]) – A list of arrays where each array contains indices for one client.

Returns:

list[DataLoader] – A list of DataLoader objects, one per client.

iid_idx(idx)

Splits indices into IID (independent and identically distributed) partitions.

Parameters:

idx (numpy.ndarray) – Array of dataset indices.

Returns:

list[numpy.ndarray] – A list of arrays where each array contains indices for one client.
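An IID split of this kind is typically a shuffle followed by near-equal cuts. The following NumPy sketch shows that pattern; `iid_split`, `nb_clients`, and `seed` are names chosen for this example, and the actual byzfl implementation may differ.

```python
import numpy as np


def iid_split(idx, nb_clients, seed=0):
    """Illustrative IID partition; not the byzfl implementation."""
    rng = np.random.default_rng(seed)
    # Shuffle, then cut into nb_clients parts whose sizes differ by at most one
    shuffled = rng.permutation(idx)
    return np.array_split(shuffled, nb_clients)


parts = iid_split(np.arange(10), nb_clients=3)
print([len(p) for p in parts])  # [4, 3, 3]
```

`np.array_split` is used rather than `np.split` because it accepts a dataset size that is not a multiple of the number of clients.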

split_data()

Splits the dataset according to the specified distribution strategy.

Returns:

list[DataLoader] – A list of DataLoader objects, one per client.

Raises:

ValueError – If the specified data distribution name is invalid.
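Because split_data() is documented to raise ValueError for an unrecognized strategy name, a misspelled name can be caught earlier with a simple membership check against the four names listed on this page. The helper below is a hypothetical sketch: `KNOWN_STRATEGIES` and `resolve_strategy` are not part of byzfl, and the error message is illustrative only.

```python
# The four strategy names listed on this page.
KNOWN_STRATEGIES = ("iid", "gamma_similarity_niid", "dirichlet_niid", "extreme_niid")


def resolve_strategy(name):
    """Hypothetical helper (not part of byzfl): validate a strategy name."""
    if name not in KNOWN_STRATEGIES:
        raise ValueError(
            f"Unknown data distribution {name!r}; expected one of {KNOWN_STRATEGIES}"
        )
    return name


try:
    resolve_strategy("dirichlet")  # misspelled: the "_niid" suffix is missing
except ValueError as err:
    message = str(err)
    print(message)
```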