mlspm.data_generation

class mlspm.data_generation.TarDataGenerator(samples: list[TarSampleList], base_path: PathLike = './', n_proc: int = 1, scale_pot: float = -1, scale_rho: float = 1)

Bases: object

Iterable that loads data from tar archives with data saved in npz format for generating samples with GeneratorAFMTrainer in ppafm.

The npz files should contain the following entries:

  • 'data': An array containing the potential/density on a 3D grid.

  • 'origin': Lattice origin in 3D space as an array of shape (3,).

  • 'lattice': Lattice vectors as an array of shape (3, 3), where the rows are the vectors.

  • 'xyz': Atom xyz coordinates as an array of shape (n_atoms, 3).

  • 'Z': Atom atomic numbers as an array of shape (n_atoms,).
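The entry layout above can be illustrated with a minimal sketch that packs one npz file with those five keys into an in-memory tar archive. The helper names `npz_bytes` and `add_member`, the member name `sample_0.npz`, and the placeholder grid values are hypothetical and chosen only for illustration; the library may expect a different naming scheme.

```python
import io
import tarfile

import numpy as np


def npz_bytes(data, origin, lattice, xyz, Z) -> bytes:
    """Serialize one sample into npz format using the documented entry names."""
    buf = io.BytesIO()
    np.savez(buf, data=data, origin=origin, lattice=lattice, xyz=xyz, Z=Z)
    return buf.getvalue()


def add_member(tar: tarfile.TarFile, name: str, payload: bytes) -> None:
    """Add a bytes payload to an open tar archive under the given member name."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


# Placeholder sample: a single carbon atom in an 8 Å cubic cell (illustrative values).
payload = npz_bytes(
    data=np.zeros((4, 4, 4)),          # potential/density on a 3D grid
    origin=np.zeros(3),                # lattice origin, shape (3,)
    lattice=np.diag([8.0, 8.0, 8.0]),  # rows are the lattice vectors, shape (3, 3)
    xyz=np.array([[4.0, 4.0, 4.0]]),   # atom coordinates, shape (n_atoms, 3)
    Z=np.array([6]),                   # atomic numbers, shape (n_atoms,)
)

archive = io.BytesIO()  # an in-memory stand-in for a tar file on disk
with tarfile.open(fileobj=archive, mode="w") as tar:
    add_member(tar, "sample_0.npz", payload)
```

Writing to a real file instead only requires replacing the `fileobj=archive` argument with a file path.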

Yields dicts that contain the following:

  • 'xyzs': Atom xyz coordinates.

  • 'Zs': Atomic numbers.

  • 'qs': Sample Hartree potential.

  • 'rho_sample': Sample electron density if the sample dict contains 'rho', otherwise None.

  • 'rot': Rotation matrix.

Note

It is recommended to use multiprocessing.set_start_method('spawn') when using the TarDataGenerator. Otherwise a lot of warnings about leaked memory objects may be thrown on exit.
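A minimal sketch of following this recommendation is shown here. The helper name `use_spawn` is hypothetical; it only wraps the standard-library calls `multiprocessing.set_start_method` and `multiprocessing.get_start_method`.

```python
import multiprocessing as mp


def use_spawn() -> str:
    """Select the 'spawn' start method unless a method has already been fixed."""
    try:
        mp.set_start_method("spawn")
    except RuntimeError:
        # The start method can only be set once per program; keep the existing one.
        pass
    return mp.get_start_method()


if __name__ == "__main__":
    # Call this before creating a TarDataGenerator or any other worker processes.
    print(use_spawn())
```

Keeping the call under the `if __name__ == "__main__":` guard matters with 'spawn', because child processes re-import the main module.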

Parameters:
  • samples – List of sample dicts as TarSampleList. File paths should be relative to base_path.

  • base_path – Path to the directory with the tar files.

  • n_proc – Number of parallel processes for loading data. The sample lists are divided evenly among the processes. For memory usage, note that up to twice as many samples as there are processes can be loaded into memory at the same time.

  • scale_pot – The loaded Hartree potentials are scaled by this factor in order to correct the units. The yielded potential should be in units of V. The default value of -1 works for potentials in units of eV.

  • scale_rho – The loaded electron densities are scaled by this factor in order to correct the units. The yielded density should be in units of e/Å^3 with positive sign for the electron density.
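The memory note for n_proc can be turned into a rough upper bound. The sketch below is only an estimate under stated assumptions: the function names are hypothetical, grids are assumed to be stored as 8-byte floats, and only the 3D grids are counted (atom coordinates and other overhead are ignored).

```python
import math


def max_loaded_samples(n_proc: int) -> int:
    """Upper bound on samples held in memory at once: twice the process count."""
    return 2 * n_proc


def peak_grid_bytes(
    n_proc: int,
    grid_shape: tuple[int, int, int],
    bytes_per_value: int = 8,   # assumption: float64 grids
    grids_per_sample: int = 1,  # use 2 if both potential and density are loaded
) -> int:
    """Rough peak memory taken by the loaded 3D grids, in bytes."""
    per_grid = math.prod(grid_shape) * bytes_per_value
    return max_loaded_samples(n_proc) * grids_per_sample * per_grid


# Four loader processes with 200 x 200 x 200 grids: 8 samples of 64 MB each.
print(peak_grid_bytes(4, (200, 200, 200)) / 1e6, "MB")
```

An estimate like this can help in choosing n_proc so that the loaded grids fit in the available memory.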

class mlspm.data_generation.TarSampleList

Bases: TypedDict

  • 'hartree': Paths to the Hartree potentials. First item in the tuple is the path to the tar file, and second entry is a list of tar file member names.

  • 'rho': (Optional) Paths to the electron densities. First item in the tuple is the path to the tar file, and second entry is a list of tar file member names.

  • 'rots': List of rotations for each sample.

hartree: tuple[PathLike, list[str]]
rho: tuple[PathLike, list[str]]
rots: list[ndarray]
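
As an illustration of the fields above, a sample list can be written as a plain dict. This is a sketch under assumptions: the tar file names, member names, and the helper `rotation_z` are hypothetical, and each 'rots' entry is assumed to hold the rotations for one sample as an array of shape (n_rot, 3, 3).

```python
import numpy as np


def rotation_z(angle_deg: float) -> np.ndarray:
    """Rotation matrix for a rotation about the z axis by the given angle."""
    a = np.deg2rad(angle_deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Hypothetical member names; they must match the names stored inside the tar files.
members = [f"sample_{i}.npz" for i in range(3)]

sample_list = {
    # (path to tar file relative to base_path, member names inside it)
    "hartree": ("hartree.tar", members),
    # Optional: electron densities for the same samples
    "rho": ("rho.tar", members),
    # One entry per sample; here two rotations for every sample (assumed layout)
    "rots": [np.stack([rotation_z(0.0), rotation_z(90.0)]) for _ in members],
}

samples = [sample_list]
```

A list built this way would then be passed as the `samples` argument, for example `TarDataGenerator(samples, base_path='./', n_proc=1)`.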
class mlspm.data_generation.TarWriter(base_path: PathLike = './', base_name: str = '', max_count: int = 100, async_write=True)

Bases: object

Write samples of AFM images, molecules and descriptors to tar files. Use as a context manager and add samples with add_sample().

Each tar file has a maximum number of samples, and whenever that maximum is reached, a new tar file is created. The generated tar files are named {base_name}_{n}.tar and saved into the specified folder.
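
The rollover rule can be sketched as a mapping from a running sample index to an output file name. The function `tar_path_for` is hypothetical, and the zero-based numbering of n is an assumption; the source only states the {base_name}_{n}.tar pattern.

```python
from pathlib import Path


def tar_path_for(
    index: int,
    base_path: str = "./",
    base_name: str = "",
    max_count: int = 100,
) -> Path:
    """Return the tar file that the sample with the given running index lands in."""
    n = index // max_count  # assumption: tar files are numbered from 0
    return Path(base_path) / f"{base_name}_{n}.tar"


# With max_count=100, samples 0..99 share one file and sample 100 starts the next.
for i in (0, 99, 100, 250):
    print(i, tar_path_for(i, base_path="out", base_name="afm"))
```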

Parameters:
  • base_path – Path to directory where tar files are saved.

  • base_name – Base name for output tar files. The number of the tar file is appended to the name.

  • max_count – Maximum number of samples per tar file.

  • async_write – Write tar files asynchronously in a parallel process.

add_sample(X: list[ndarray], xyzs: ndarray, Y: ndarray | None = None, comment_str: str = '')

Add a sample to the current tar file.

Parameters:
  • X – AFM images. Each list item corresponds to an AFM tip and is an array of shape (nx, ny, nz).

  • xyzs – Atom coordinates and elements. Each row is one atom and is of the form [x, y, z, element].

  • Y – Image descriptors. Each list item is one descriptor and is an array of shape (nx, ny).

  • comment_str – Comment line (second line) to add to the xyz file.
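
The argument shapes described above can be sketched with placeholder arrays. Everything here is illustrative: the values are random, the helper `write_example` is hypothetical, Y is assumed to be an array of shape (n_descriptors, nx, ny), and the `with ... as writer` form assumes that the context manager returns the writer itself.

```python
import numpy as np

nx, ny, nz = 32, 32, 10
rng = np.random.default_rng(0)

# Two AFM tips, each giving an image stack of shape (nx, ny, nz).
X = [rng.random((nx, ny, nz)) for _ in range(2)]

# One row per atom: [x, y, z, element], with the element as an atomic number.
xyzs = np.array(
    [
        [0.00, 0.00, 0.00, 8],
        [0.96, 0.00, 0.00, 1],
        [-0.24, 0.93, 0.00, 1],
    ]
)

# One descriptor of shape (nx, ny), stacked along the first axis (assumed layout).
Y = rng.random((1, nx, ny))


def write_example(out_dir: str = "./") -> None:
    """Write the placeholder sample; requires mlspm to be installed."""
    from mlspm.data_generation import TarWriter

    with TarWriter(base_path=out_dir, base_name="example", max_count=100) as writer:
        writer.add_sample(X, xyzs, Y=Y, comment_str="placeholder water molecule")
```

Calling `write_example()` is left to the reader, since it writes to disk and, with the default async_write=True, starts a separate writer process.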

mlspm.data_generation.get_tarinfo(fname: str, file_bytes: BytesIO)