mlspm.data_generation
- class mlspm.data_generation.TarDataGenerator(samples: list[TarSampleList], base_path: PathLike = './', n_proc: int = 1, scale_pot: float = -1, scale_rho: float = 1)
Bases: object
Iterable that loads data from tar archives with data saved in npz format for generating samples with GeneratorAFMTrainer in ppafm.
The npz files should contain the following entries:
'data': An array containing the potential/density on a 3D grid.
'origin': Lattice origin in 3D space as an array of shape (3,).
'lattice': Lattice vectors as an array of shape (3, 3), where the rows are the vectors.
'xyz': Atom xyz coordinates as an array of shape (n_atoms, 3).
'Z': Atomic numbers of the atoms as an array of shape (n_atoms,).
Yields dicts that contain the following:
'xyzs': Atom xyz coordinates.
'Zs': Atomic numbers.
'qs': Sample Hartree potential.
'rho_sample': Sample electron density if the sample dict contained rho, or None otherwise.
'rot': Rotation matrix.
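The npz layout listed above can be illustrated with a small round trip through an in-memory tar archive. This is only a sketch using NumPy and the standard library: the member name sample_0.npz, the grid size, and the toy CO geometry are hypothetical placeholders, and the helper functions are not part of mlspm.

```python
import io
import tarfile

import numpy as np

# Hypothetical toy sample: an empty 4x4x4 grid in a 4 Å cubic cell with two atoms (C and O).
sample = {
    "data": np.zeros((4, 4, 4), dtype=np.float32),         # potential/density on a 3D grid
    "origin": np.zeros(3),                                 # shape (3,)
    "lattice": 4.0 * np.eye(3),                            # shape (3, 3), rows are the lattice vectors
    "xyz": np.array([[2.0, 2.0, 1.5], [2.0, 2.0, 2.63]]),  # shape (n_atoms, 3)
    "Z": np.array([6, 8]),                                 # shape (n_atoms,)
}


def npz_bytes(arrays):
    """Serialize a dict of arrays into npz-formatted bytes."""
    buffer = io.BytesIO()
    np.savez(buffer, **arrays)
    return buffer.getvalue()


def add_member(tar, name, payload):
    """Add a bytes payload to an open tar file under the given member name."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


# Write the sample into a tar archive (kept in memory here instead of on disk).
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tar:
    add_member(tar, "sample_0.npz", npz_bytes(sample))

# Read the member back and check that all expected entries are present.
archive.seek(0)
with tarfile.open(fileobj=archive, mode="r") as tar:
    raw = tar.extractfile("sample_0.npz").read()
with np.load(io.BytesIO(raw)) as npz:
    loaded = {key: npz[key] for key in npz.files}

print({key: value.shape for key, value in loaded.items()})
```

Writing to a real file only requires replacing the in-memory archive with a path passed to tarfile.open.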
Note
It is recommended to use multiprocessing.set_start_method('spawn') when using the TarDataGenerator. Otherwise a lot of warnings about leaked memory objects may be thrown on exit.
- Parameters:
samples – List of sample dicts as TarSampleList. File paths should be relative to base_path.
base_path – Path to the directory with the tar files.
n_proc – Number of parallel processes for loading data. The sample lists get divided evenly over the processes. For memory usage, note that up to twice as many samples as there are processes can be loaded into memory at the same time.
scale_pot – The loaded Hartree potentials are scaled by this factor in order to correct the units. The yielded potential should be in units of V. The default value of -1 works for potentials in units of eV.
scale_rho – The loaded electron densities are scaled by this factor in order to correct the units. The yielded density should be in units of e/Å^3 with positive sign for the electron density.
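As a rough illustration of how the two scale factors might be chosen, the sketch below works out values for data stored in atomic units. The storage conventions here are assumptions, not something the library prescribes: the potential is taken to be stored as an electron potential energy in Hartree, and the density in e/Bohr^3 with a positive sign for electrons. The conversion constants are rounded.

```python
# Rounded conversion constants.
HARTREE_TO_EV = 27.211386    # 1 Hartree in eV
BOHR_TO_ANGSTROM = 0.529177  # 1 Bohr radius in Å

# Assumption: the potential is stored as an electron potential energy in Hartree.
# Converting to eV and then applying the documented eV factor of -1 gives V.
scale_pot = -1.0 * HARTREE_TO_EV

# Assumption: the density is stored in e/Bohr^3, positive for electrons.
# Dividing by the volume of one cubic Bohr in Å^3 gives e/Å^3.
scale_rho = 1.0 / BOHR_TO_ANGSTROM**3

# Upper bound on samples held in memory at once, following the n_proc description.
n_proc = 4
max_samples_in_memory = 2 * n_proc

print(f"scale_pot = {scale_pot:.4f}, scale_rho = {scale_rho:.4f}")
print(f"at most {max_samples_in_memory} samples in memory with n_proc = {n_proc}")
```

If the data is already stored in eV and e/Å^3 with these sign conventions, the defaults of -1 and 1 apply unchanged.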
- class mlspm.data_generation.TarSampleList
Bases: TypedDict
'hartree': Paths to the Hartree potentials. The first item in the tuple is the path to the tar file, and the second item is a list of tar file member names.
'rho': (Optional) Paths to the electron densities. The first item in the tuple is the path to the tar file, and the second item is a list of tar file member names.
'rots': List of rotations for each sample.
- hartree: tuple[PathLike, list[str]]
- rho: tuple[PathLike, list[str]]
- rots: list[ndarray]
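Since a TypedDict is a plain dict at runtime, a sample list with the fields above can be put together without any special constructor. The sketch below is illustrative only: the archive names, member names, and the rotation_about_x helper are hypothetical, and it assumes that each entry of 'rots' is one array of shape (n_rotations, 3, 3) holding the rotation matrices for the corresponding sample.

```python
import numpy as np


def rotation_about_x(angle):
    """Return the 3x3 matrix for a rotation by `angle` radians around the x axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])


# Hypothetical member names, assumed here to be the same in both archives.
members = ["sample_0.npz", "sample_1.npz"]

# Assumption: one array of shape (n_rotations, 3, 3) per sample.
rotations = np.stack([np.eye(3), rotation_about_x(np.pi / 2)])

sample_list = {
    "hartree": ("hartree_0.tar", members),  # hypothetical path, relative to base_path
    "rho": ("rho_0.tar", members),          # optional entry
    "rots": [rotations.copy() for _ in members],
}

# With mlspm installed, the list could then be handed to the generator roughly as:
#   TarDataGenerator([sample_list], base_path="./", n_proc=1)
print(len(sample_list["rots"]), sample_list["rots"][0].shape)
```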
- class mlspm.data_generation.TarWriter(base_path: PathLike = './', base_name: str = '', max_count: int = 100, async_write=True)
Bases: object
Write samples of AFM images, molecules and descriptors to tar files. Use as a context manager and add samples with add_sample().
Each tar file has a maximum number of samples, and whenever that maximum is reached, a new tar file is created. The generated tar files are named as {base_name}_{n}.tar and saved into the specified folder.
- Parameters:
base_path – Path to directory where tar files are saved.
base_name – Base name for output tar files. The number of the tar file is appended to the name.
max_count – Maximum number of samples per tar file.
async_write – Write tar files asynchronously in a parallel process.
- add_sample(X: list[ndarray], xyzs: ndarray, Y: ndarray | None = None, comment_str: str = '')
Add a sample to the current tar file.
- Parameters:
X – AFM images. Each list item corresponds to an AFM tip and is an array of shape (nx, ny, nz).
xyzs – Atom coordinates and elements. Each row is one atom and is of the form [x, y, z, element].
Y – Image descriptors. Each list item is one descriptor and is an array of shape (nx, ny).
comment_str – Comment line (second line) to add to the xyz file.
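The argument shapes described above can be sketched with placeholder arrays. Everything below is illustrative: the image size, the number of tips, and the all-zero arrays are dummy values, and stacking the descriptors along the first axis of Y is an assumption based on the ndarray type in the signature. The commented call at the end follows the documented signature and assumes that the context manager returns the writer itself.

```python
import numpy as np

# Hypothetical image dimensions and number of AFM tips.
nx, ny, nz = 8, 8, 4
n_tips = 2

# X: one array of shape (nx, ny, nz) per tip (zeros as placeholder images).
X = [np.zeros((nx, ny, nz), dtype=np.float32) for _ in range(n_tips)]

# xyzs: one row per atom in the form [x, y, z, element].
coords = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.13]])  # toy CO geometry in Å
elements = np.array([6, 8])                              # atomic numbers for C and O
xyzs = np.column_stack([coords, elements])               # shape (n_atoms, 4)

# Y: assumption that descriptors of shape (nx, ny) are stacked along the first axis.
Y = np.zeros((1, nx, ny), dtype=np.float32)

# With mlspm installed, the sample could then be written roughly as:
#   with TarWriter(base_path="./", base_name="afm", max_count=100) as tar_writer:
#       tar_writer.add_sample(X, xyzs, Y, comment_str="CO molecule")
print(len(X), X[0].shape, xyzs.shape, Y.shape)
```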