trailofbits.python.numpy-in-pytorch-datasets.numpy-in-pytorch-datasets

Author: trailofbits

Using the NumPy RNG inside a PyTorch dataset can cause data-loading problems, most notably identical augmentations: DataLoader worker processes can start from the same NumPy RNG state and then draw the same "random" values. Use the random number generators built into Python and PyTorch instead.
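
The mechanism can be illustrated without PyTorch. The sketch below is a simplified simulation using only Python's standard library, not PyTorch's actual worker code. It assumes each worker starts from a copy of the parent's RNG state, as happens to a process-global RNG when worker processes are forked, and compares that with reseeding each worker. `simulate_workers`, `base_seed`, and the seed-plus-worker-id scheme are illustrative choices, not names taken from the rule.

```python
import random


def simulate_workers(num_workers, reseed_per_worker, base_seed=1234):
    """Return the first draw each simulated worker would produce."""
    parent = random.Random(base_seed)
    inherited_state = parent.getstate()  # what a forked worker would copy

    draws = []
    for worker_id in range(num_workers):
        rng = random.Random()
        rng.setstate(inherited_state)  # every worker starts from the same state
        if reseed_per_worker:
            # Assumed scheme: give each worker its own stream.
            rng.seed(base_seed + worker_id)
        draws.append(rng.randint(0, 999))
    return draws


shared = simulate_workers(4, reseed_per_worker=False)
separate = simulate_workers(4, reseed_per_worker=True)
print(shared)    # four identical values
print(separate)  # values drawn from four differently seeded streams
```

With the inherited state left untouched, every simulated worker returns the same value, which is the "identical augmentations" symptom the rule message describes.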

Definition

rules:
  - id: numpy-in-pytorch-datasets
    message: Using the NumPy RNG inside of a PyTorch dataset can lead to a number of
      issues with loading data, including identical augmentations. Instead, use
      the random number generators built into Python and PyTorch
    languages:
      - python
    severity: WARNING
    metadata:
      category: security
      cwe: "CWE-330: Use of Insufficiently Random Values"
      subcategory:
        - audit
      confidence: HIGH
      likelihood: MEDIUM
      impact: LOW
      technology:
        - pytorch
        - numpy
      description: Calls to the `NumPy` RNG inside of a `Torch` dataset
      references:
        - https://tanelp.github.io/posts/a-bug-that-plagues-thousands-of-open-source-ml-projects
      license: AGPL-3.0 license
      vulnerability_class:
        - Cryptographic Issues
    patterns:
      - pattern: |
          class $X(torch.utils.data.Dataset):
            ...
            def __getitem__(...):
              ...
              numpy.random.randint(...)
              ...

Examples

numpy-in-pytorch-datasets.py

import numpy  # some cases below call numpy.random.randint by its full name
import numpy as np
from torch.utils.data import Dataset
from tob.strangelib import Dataset as DatasetStrange

# ruleid: numpy-in-pytorch-datasets
class RandomDataset(Dataset):
    def __getitem__(self, index):
        return np.random.randint(0, 1000, 3)

    def __len__(self):
        return 1000
      
      
# ruleid: numpy-in-pytorch-datasets
class AnotherRandomDataset(Dataset):
    def __len__(self):
        return 1000
     
    def __getitem__(self, index):
        print("Hello World")
        x = np.random.randint(0, 1000, 3)
        return x 

# ruleid: numpy-in-pytorch-datasets
class AnotherRandomDatasetOther(Dataset):
    def __len__(self):
        return 1000
     
    def __getitem__(self, index):
        print("Hello World")
        x = numpy.random.randint(0, 1000, 3)
        return x

# ok: numpy-in-pytorch-datasets
class NotTorchDataset(DatasetStrange):
    def __len__(self):
        return 1000
     
    def __getitem__(self, index):
        print("Hello World")
        x = numpy.random.randint(0, 1000, 3)
        return x 

# ok: numpy-in-pytorch-datasets
class YetAnotherRandomDataset(Dataset):
    def __len__(self):
        return 1000
     
    def __getitem__(self, index):
        return index
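
For comparison, a dataset like `RandomDataset` could be written against PyTorch's own RNG, as the rule message recommends. This is a minimal sketch and not part of the rule's test file; `TorchRandomDataset` is a hypothetical name, and the code assumes `torch` is installed.

```python
import torch
from torch.utils.data import Dataset


class TorchRandomDataset(Dataset):
    """Hypothetical counterpart of RandomDataset that avoids the NumPy RNG."""

    def __len__(self):
        return 1000

    def __getitem__(self, index):
        # torch.randint draws from PyTorch's global generator, which the
        # DataLoader seeds separately in each worker process.
        return torch.randint(0, 1000, (3,))


sample = TorchRandomDataset()[0]
print(sample.shape)  # a 1-D tensor with three integer entries
```

Because the rule's pattern only looks for `numpy.random.randint(...)` inside `__getitem__`, this class would not be flagged.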