mlspm.data_generation

class mlspm.data_generation.TarDataGenerator(samples: list[TarSampleList], base_path: PathLike = './', n_proc: int = 1, scale_pot: float = -1, scale_rho: float = 1)

Bases: object

Iterable that loads data saved in npz format from tar archives, for generating samples with GeneratorAFMTrainer in ppafm.

The npz files should contain the following entries:

'data': An array containing the potential/density on a 3D grid.

'origin': Lattice origin in 3D space as an array of shape (3,).

'lattice': Lattice vectors as an array of shape (3, 3), where the rows are the vectors.

'xyz': Atom xyz coordinates as an array of shape (n_atoms, 3).

'Z': Atom atomic numbers as an array of shape (n_atoms,).
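
As a minimal sketch of the layout described above, the following writes one such npz file into a tar archive with NumPy and the standard library, then reads it back. The grid size, the water-like geometry, the member name `sample_0.npz`, and the helper name `npz_bytes` are illustrative assumptions and not part of mlspm.

```python
import io
import tarfile

import numpy as np


def npz_bytes(data, origin, lattice, xyz, Z):
    """Serialize one sample to npz format in memory (hypothetical helper)."""
    buffer = io.BytesIO()
    np.savez(buffer, data=data, origin=origin, lattice=lattice, xyz=xyz, Z=Z)
    return buffer.getvalue()


# Placeholder values; a real sample would come from an electronic-structure calculation.
data = np.zeros((8, 8, 8), dtype=np.float32)  # potential/density on a 3D grid
origin = np.zeros(3)                           # shape (3,)
lattice = np.diag([10.0, 10.0, 10.0])          # shape (3, 3), rows are the lattice vectors
xyz = np.array([[5.0, 5.0, 5.0], [5.8, 5.6, 5.0], [4.2, 5.6, 5.0]])  # shape (n_atoms, 3)
Z = np.array([8, 1, 1])                        # shape (n_atoms,)

payload = npz_bytes(data, origin, lattice, xyz, Z)

# Store the npz bytes as a member of an in-memory tar archive.
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tar:
    info = tarfile.TarInfo(name="sample_0.npz")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Read the member back and collect the entry names and shapes.
archive.seek(0)
with tarfile.open(fileobj=archive, mode="r") as tar:
    member = tar.extractfile("sample_0.npz")
    loaded = np.load(io.BytesIO(member.read()))
    entries = sorted(loaded.files)
    shapes = {key: loaded[key].shape for key in entries}
```

To produce an archive on disk instead, the same `tarfile.open` call can be given a file name in place of `fileobj`.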

Yields dicts that contain the following:

'xyzs': Atom xyz coordinates.

'Zs': Atomic numbers.

'qs': Sample Hartree potential.

'rho_sample': Sample electron density if the sample dict contained rho, or None otherwise.

'rot': Rotation matrix.

Note

It is recommended to call multiprocessing.set_start_method('spawn') when using the TarDataGenerator. Otherwise, many warnings about leaked memory objects may be emitted on exit.
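
A minimal sketch of selecting the start method as recommended in the note. The wrapper name `use_spawn` is hypothetical, and passing `force=True` is an added assumption that lets the call succeed even if a start method was already selected earlier in the program.

```python
import multiprocessing


def use_spawn():
    """Select the 'spawn' start method before any worker processes are created."""
    # force=True overrides a previously selected start method instead of raising RuntimeError.
    multiprocessing.set_start_method("spawn", force=True)
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    use_spawn()
    # Construct and iterate over the TarDataGenerator after this point.
```

With the spawn method, worker processes re-import the main module, so the call belongs under the `if __name__ == "__main__":` guard.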

Parameters:

samples – List of sample dicts as TarSampleList. File paths should be relative to base_path.
base_path – Path to the directory with the tar files.
n_proc – Number of parallel processes for loading data. The sample lists are divided evenly over the processes. For memory usage, note that up to twice as many samples as there are processes can be held in memory at the same time.
scale_pot – The loaded Hartree potentials are scaled by this factor in order to correct the units. The yielded potential should be in units of V. The default value of -1 works for potentials in units of eV.
scale_rho – The loaded electron densities are scaled by this factor in order to correct the units. The yielded density should be in units of e/Å^3 with positive sign for the electron density.
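
The memory note for n_proc can be turned into a rough estimate. The sketch below assumes that the grids dominate memory use; the grid shape, the 4-byte value size, and the helper name `peak_sample_memory_bytes` are illustrative assumptions, and only the factor of two per process comes from the description above.

```python
def peak_sample_memory_bytes(n_proc, grid_shape, bytes_per_value=4, grids_per_sample=1):
    """Rough upper bound on memory held by loaded grids (hypothetical helper)."""
    nx, ny, nz = grid_shape
    bytes_per_grid = nx * ny * nz * bytes_per_value
    max_samples_in_memory = 2 * n_proc  # documented bound: twice the number of processes
    return max_samples_in_memory * grids_per_sample * bytes_per_grid


# Example: 4 processes, 200x200x200 grids of 4-byte values,
# with both a Hartree potential and an electron density per sample.
estimate = peak_sample_memory_bytes(4, (200, 200, 200), bytes_per_value=4, grids_per_sample=2)
print(f"{estimate / 1e6:.0f} MB")  # prints "512 MB"
```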

class mlspm.data_generation.TarSampleList

Bases: TypedDict

'hartree': Paths to the Hartree potentials. First item in the tuple is the path to the tar file, and second entry is a list of tar file member names.
'rho': (Optional) Paths to the electron densities. First item in the tuple is the path to the tar file, and second entry is a list of tar file member names.
'rots': List of rotations for each sample.

hartree: tuple[PathLike, list[str]]

rho: tuple[PathLike, list[str]]

rots: list[ndarray]
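
A sketch of what one sample list might look like, built from the field descriptions above. `SampleListSketch` is a local stand-in that mirrors the documented fields rather than the mlspm class itself, and the archive names, member names, and rotation matrices are hypothetical.

```python
from os import PathLike
from typing import TypedDict, Union

import numpy as np


class SampleListSketch(TypedDict, total=False):
    """Local stand-in mirroring the documented TarSampleList fields (not the mlspm class)."""

    hartree: tuple[Union[str, PathLike], list[str]]
    rho: tuple[Union[str, PathLike], list[str]]
    rots: list[np.ndarray]


# Hypothetical tar file member names; the tar paths are relative to base_path.
member_names = ["mol_0.npz", "mol_1.npz"]

# Two example 3x3 rotation matrices: the identity and a 90-degree rotation about z.
identity = np.eye(3)
rot_z90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

sample_list: SampleListSketch = {
    "hartree": ("hartree.tar", member_names),  # (tar file path, member names)
    "rho": ("rho.tar", member_names),          # optional entry
    "rots": [identity, rot_z90],
}
```

A list of such dicts is what the `samples` argument of TarDataGenerator expects according to its signature.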

class mlspm.data_generation.TarWriter(base_path: PathLike = './', base_name: str = '', max_count: int = 100, async_write=True)

Bases: object

Write samples of AFM images, molecules and descriptors to tar files. Use as a context manager and add samples with add_sample().

Each tar file has a maximum number of samples, and whenever that maximum is reached, a new tar file is created. The generated tar files are named as {base_name}_{n}.tar and saved into the specified folder.
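
The rollover rule can be illustrated with a small calculation of which tar file a given sample would land in. This is only a sketch of the naming scheme described above: the helper name `tar_name_for_sample` is hypothetical, and numbering the files from 0 is an assumption, since the starting value of n is not stated here.

```python
def tar_name_for_sample(sample_index, base_name, max_count=100, first_number=0):
    """Name of the tar file that would hold the sample with this 0-based index."""
    # Every max_count samples, the file number advances by one.
    file_number = first_number + sample_index // max_count
    return f"{base_name}_{file_number}.tar"


# Samples 0..99 share the first file, sample 100 starts the second one.
names = [tar_name_for_sample(i, "afm", max_count=100) for i in (0, 99, 100, 250)]
# names == ["afm_0.tar", "afm_0.tar", "afm_1.tar", "afm_2.tar"] under the zero-based assumption
```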

Parameters:

base_path – Path to directory where tar files are saved.
base_name – Base name for output tar files. The number of the tar file is appended to the name.
max_count – Maximum number of samples per tar file.
async_write – Write tar files asynchronously in a parallel process.

add_sample(X: list[ndarray], xyzs: ndarray, Y: ndarray | None = None, comment_str: str = '')

Add a sample to the current tar file.

Parameters:

X – AFM images. Each list item corresponds to an AFM tip and is an array of shape (nx, ny, nz).
xyzs – Atom coordinates and elements. Each row is one atom and is of the form [x, y, z, element].
Y – Image descriptors. Each list item is one descriptor and is an array of shape (nx, ny).
comment_str – Comment line (second line) to add to the xyz file.
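
A sketch of preparing the `X` and `xyzs` arguments in the shapes described above. The helper name `pack_xyzs` is hypothetical, encoding the element as its atomic number is an assumption, and the image size and coordinates are placeholders.

```python
import numpy as np


def pack_xyzs(xyz, Z):
    """Join (n_atoms, 3) coordinates and (n_atoms,) elements into rows [x, y, z, element]."""
    xyz = np.asarray(xyz, dtype=float)
    Z = np.asarray(Z, dtype=float)
    if xyz.ndim != 2 or xyz.shape[1] != 3 or Z.shape != (xyz.shape[0],):
        raise ValueError("expected xyz of shape (n_atoms, 3) and Z of shape (n_atoms,)")
    # Append the element as a fourth column.
    return np.concatenate([xyz, Z[:, None]], axis=1)


# Two atoms with the element given as atomic number (assumption).
xyzs = pack_xyzs([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0]], [8, 1])

# One AFM tip, so X has a single array of shape (nx, ny, nz).
X = [np.zeros((16, 16, 4), dtype=np.float32)]
```

Arrays prepared this way could then be passed as `add_sample(X, xyzs)` inside a `with TarWriter(...)` block, following the signature above.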

mlspm.data_generation.get_tarinfo(fname: str, file_bytes: BytesIO)