versioned_uploadable
VersionedUploadable helps store and use a “dataset” in an exclusive directory in consistent and convenient ways. Specifically,
The dataset is identified by a version string that is generated and sortable (datetime-based), so that the “newest” is always the “latest” version, and code can infer the latest version. The full path of the storage location is managed for the user, who only needs the version.
The storage can be either local (on disk) or remote (in a cloud blob store). There are methods to download/upload between local and remote storages.
However, “local” and “remote” are just labels for the two storage locations. Both can be on local disk, or both in the same cloud storage (in different “locations”), or in two different cloud storages, or one on local disk and the other in a cloud blob store. This is controlled by the classmethods VersionedUploadable.local_cls_upath() and VersionedUploadable.remote_cls_upath(), which subclasses should implement.
Within the dataset, one can conveniently specify sub-directories and files relative to the “root”, and read/write them. This navigation is the same regardless of whether the storage is local or remote.
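The sortable, datetime-based version string can be illustrated with a minimal sketch. It assumes the 'YYYYMMDD-HHMMSS' layout suggested by the example version '20210322-120529' in this page; `make_version` is a hypothetical helper, and the hyphen used to attach a tag is an assumption, not the library's documented format.

```python
from __future__ import annotations

from datetime import datetime


def make_version(when: datetime, tag: str | None = None) -> str:
    """Hypothetical helper: format a timestamp as 'YYYYMMDD-HHMMSS'."""
    version = when.strftime("%Y%m%d-%H%M%S")
    # Assumption: a tag is appended after a hyphen; the real separator may differ.
    return f"{version}-{tag}" if tag else version


older = make_version(datetime(2021, 3, 22, 12, 5, 29))
newer = make_version(datetime(2021, 11, 2, 8, 0, 0))

# Fixed-width, zero-padded fields make string order match time order,
# so the last element of a sorted list is the latest version.
versions = sorted([newer, older])
latest = versions[-1]
```

Because every field has a fixed width, plain string comparison is enough to find the newest version; no date parsing is needed.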
- class cloudly.upathlib.versioned_uploadable.VersionedUploadable
Bases: ABC
A subclass will customize remote_cls_upath() and local_cls_upath().
- classmethod resolve_version(version: str, remote: bool | None = None) → tuple[str, bool]
Given version as one of the special values (‘latest-local’, ‘latest-remote’, and ‘latest’) or an actual version string, and remote as None or an explicit True/False, figure out the actual version and its remote-ness.
This is called by __init__().
- Parameters:
- version
Either one of the special values ‘latest’, ‘latest-local’, and ‘latest-remote’, or an actual version string like ‘20210322-120529’.
If version is an actual version string, then version and remote are returned as is, even if remote is None. It is checked that version is a valid version string, but existence of the version is not checked.
- remote
If True, look in remote (cloud) storage only. If False, look in local storage only. If None, look in both remote and local.
If version is ‘latest-local’, then remote must be False or None.
If version is ‘latest-remote’, then remote must be True or None.
If version is ‘latest’, then find the latest between local and remote storages if remote is None; otherwise version becomes ‘latest-remote’ or ‘latest-local’ according to the value of remote.
- Returns:
- tuple
A tuple of two elements: the actual version string, and remote-ness.
Raises ValueError if the parameters are incompatible.
Raises VersionNotFoundError if no version is found that satisfies the request.
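The resolution rules above can be sketched as a pure function over two lists of known versions. This is not the library's code: `resolve`, the stand-in `VersionNotFoundError`, and the 'YYYYMMDD-HHMMSS' prefix check are assumptions made for illustration, and the tie-break (prefer local when both sides share the same latest version) is taken from the description of `__init__` on this page.

```python
from __future__ import annotations

import re


class VersionNotFoundError(LookupError):
    """Stand-in for the library's exception of the same name."""


SPECIAL = {"latest", "latest-local", "latest-remote"}


def resolve(version, remote, local_versions, remote_versions):
    """Return (actual_version, remoteness) following the documented rules."""
    if version not in SPECIAL:
        # Assumption: real version strings start with 'YYYYMMDD-HHMMSS'.
        if not re.match(r"\d{8}-\d{6}", version):
            raise ValueError(f"invalid version string: {version!r}")
        return version, remote  # returned as is; existence is not checked

    if version == "latest-local":
        if remote is True:
            raise ValueError("'latest-local' is incompatible with remote=True")
        remote = False
    elif version == "latest-remote":
        if remote is False:
            raise ValueError("'latest-remote' is incompatible with remote=False")
        remote = True

    # Collect the newest version from each side that is allowed by `remote`.
    candidates = []
    if remote is not True and local_versions:
        candidates.append((max(local_versions), False))
    if remote is not False and remote_versions:
        candidates.append((max(remote_versions), True))
    if not candidates:
        raise VersionNotFoundError(version)

    # Newest version wins; on a tie, the local candidate (remote=False) wins.
    return max(candidates, key=lambda c: (c[0], not c[1]))


local = ["20210322-120529", "20210501-090000"]
cloud = ["20210322-120529", "20210601-000000"]
latest_any = resolve("latest", None, local, cloud)
latest_local = resolve("latest", False, local, cloud)
exact = resolve("20210322-120529", None, local, cloud)
```

Note that an exact version string passes through with `remote` still `None`; deciding where that version actually lives is left to the caller.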
- abstract classmethod local_cls_upath() → LocalUpath
A subclass implements this method to determine the full path on the local disk for the entity represented by the particular subclass, i.e. a particular type of “dataset”.
The file-system structure directly under this path is determined by this class. Currently it contains a subdirectory called ‘versions’, which holds one subdirectory per version, named after the version string.
In the directory of one particular version, the content is determined by the user, who can create whatever subdirectories and files they want. This class uses one meta file in the root of the version’s directory, named “info.json”.
- abstract classmethod remote_cls_upath() → BlobUpath
Analogous to local_cls_upath() but on the remote side.
- classmethod local_version_upath(version: str) → LocalUpath
Root directory of the specified version in the local storage.
- classmethod remote_version_upath(version: str) → BlobUpath
Root directory of the specified version in the remote storage.
- classmethod local_versions() → list[str]
Get a (potentially empty) list of the versions that exist on the local disk.
The elements in the list are sorted from small (old) to large (new).
Because remote_versions and local_versions get “directories” without checking their content, they might return invalid (corrupt or empty) versions. Users should delete such bad versions as they are discovered.
- classmethod remote_versions() → list[str]
Analogous to local_versions() but on the remote side.
- classmethod has_local_version(version: str) → bool
A version is considered existent if and only if the file “info.json” exists in its root directory.
- classmethod has_remote_version(version: str) → bool
Analogous to has_local_version() but on the remote side.
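The directory layout and the two kinds of checks described above (listing version directories versus testing for “info.json”) can be mimicked on a temporary local directory. This is a sketch with plain `pathlib`, not the library's `Upath` classes; `version_dir`, `list_versions`, and `has_version` are hypothetical names.

```python
import json
import tempfile
from pathlib import Path


def version_dir(cls_root: Path, version: str) -> Path:
    # Documented layout: <class root>/versions/<version string>/
    return cls_root / "versions" / version


def list_versions(cls_root: Path) -> list:
    """List version directories, old to new, without inspecting their content."""
    root = cls_root / "versions"
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def has_version(cls_root: Path, version: str) -> bool:
    """A version exists if and only if 'info.json' is in its root directory."""
    return (version_dir(cls_root, version) / "info.json").is_file()


with tempfile.TemporaryDirectory() as tmp:
    cls_root = Path(tmp) / "my_dataset"

    good = version_dir(cls_root, "20210322-120529")
    good.mkdir(parents=True)
    (good / "info.json").write_text(json.dumps({}))

    # An empty directory: it is listed, but it is not a valid version.
    version_dir(cls_root, "20210401-000000").mkdir(parents=True)

    listed = list_versions(cls_root)
    good_exists = has_version(cls_root, "20210322-120529")
    empty_exists = has_version(cls_root, "20210401-000000")
```

The empty directory shows why a listed version may still be invalid: listing only looks at directory names, while the existence check requires the meta file.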
- classmethod remove_local_version(version: str, **kwargs) → None
Delete the entire directory of the specified version on the local disk.
By default, there is neither a warning before the deletion nor any progress printout.
- Parameters:
- version
The exact version string. If the version does not exist, it’s a no-op.
- **kwargs
Passed on to remove_dir().
- classmethod remove_remote_version(version: str, **kwargs) → None
Analogous to remove_local_version() but on the remote side.
- classmethod new(*, tag: str | None = None, remote: bool = False, **kwargs) → VersionedUploadable
If a subclass needs additional setup on a newly created object, it may override this classmethod new.
The optional tag appends (human-readable) info to the auto-created version string, which is based on the current date and time.
The returned object has an attribute info, which is an empty dict. Nothing has been written to storage.
- __init__(version: str, *, remote: bool | None = None, require_exists: bool = True)
This loads an existing version for reading and writing. To create a new version, use the classmethod new().
- Parameters:
- version
Either the actual version string, or one of ‘latest’, ‘latest-local’, and ‘latest-remote’.
- remote
Look for the version in local or remote storage?
If an explicit bool, it must be compatible with version. For example, version='latest-remote' and remote=False are not compatible.
If None, and version='latest', then the latest version between local and remote is found and used. If local and remote have the same latest version, then the local one is used.
If None, and version is an exact version, then find it either locally or remotely, wherever it exists. If the version exists in both storages, then the local one is used.
- require_exists
Default is True. If version is an exact version string but the version does not exist, VersionNotFoundError is raised. Usually you should leave this at the default. This is mainly for the call of __init__ in new(), where it needs to use require_exists=False.
- version: str
version of the object
- remote: bool
remote-ness of the object
- property upath: Upath
Return the root directory of self.
This is consistent with the remote-ness of self.
- path(*args: str) → Upath
Return a path relative to the root directory of self.
Examples:
    self.path()  # the root path
    self.path('info.json')  # file in root directory
    self.path('abc', 'de', 'data.parquet')  # file 'abc/de/data.parquet' under root directory
    self.path('abc/de/data.parquet')  # same as above
Typically, you proceed to read or write with the returned path (object), e.g.,
    self.path('info.json').write_json(self.info, overwrite=True)
    info = self.path('info.json').read_json()
This is consistent with the remote-ness of self. In other words, if self is local, then the returned path is local (under the local root directory); otherwise, the returned path is remote.
self.path('abc.txt') is equivalent to (self.upath / 'abc.txt').
Note
Don’t start args with '/'.
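As an analogy only, the standard library's `pathlib.PurePosixPath` shows the same joining behavior and one plausible reason behind the note. How `Upath` itself treats a leading '/' is not stated here; the root path and the module-level `path` function below are hypothetical stand-ins for `self.upath` and `self.path`.

```python
from pathlib import PurePosixPath

# Hypothetical root directory of one version.
root = PurePosixPath("/data/my_dataset/versions/20210322-120529")


def path(*args: str) -> PurePosixPath:
    """Join path segments under the version root, like self.path(...)."""
    return root.joinpath(*args)


a = path("abc", "de", "data.parquet")
b = path("abc/de/data.parquet")
c = root / "abc" / "de" / "data.parquet"

# In pathlib, an absolute segment discards everything before it,
# so the result is no longer under the version root.
escaped = path("/abc.txt")
```

With `pathlib`, the three spellings `a`, `b`, and `c` name the same file, while `escaped` points at '/abc.txt' outside the root.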
- save() → None
A subclass should re-implement this method to save its own content (data, summaries, and so on), and call super().save() at the end.
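The override pattern might look like the following sketch. `Base` is a stand-in written for this example, not the real `VersionedUploadable`; in particular, the assumption that the base `save()` writes `info` to “info.json” is inferred from the `write_json` example for `path()` and is not confirmed by this page. `Scores` and its attributes are hypothetical.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


class Base:
    """Stand-in for VersionedUploadable; not the library's implementation."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.info: dict = {}

    def save(self) -> None:
        # Assumption: the base class persists `info` as 'info.json'.
        self.root.mkdir(parents=True, exist_ok=True)
        (self.root / "info.json").write_text(json.dumps(self.info))


class Scores(Base):
    """Hypothetical dataset type that stores a list of numbers."""

    def __init__(self, root: Path, rows: list) -> None:
        super().__init__(root)
        self.rows = rows

    def save(self) -> None:
        # Write the subclass's own files first ...
        self.root.mkdir(parents=True, exist_ok=True)
        (self.root / "scores.json").write_text(json.dumps(self.rows))
        self.info["n_rows"] = len(self.rows)
        # ... and finish with the base class's save().
        super().save()


with tempfile.TemporaryDirectory() as tmp:
    ds = Scores(Path(tmp) / "20210322-120529", [1, 2, 3])
    ds.save()
    saved_files = sorted(p.name for p in ds.root.iterdir())
    saved_info = json.loads((ds.root / "info.json").read_text())
```

Under that assumption, calling `super().save()` last means the meta file appears only after the subclass's data has been written.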
- download(path: str | None = None, *, overwrite: bool = False, **kwargs) → int
Download the entire dataset or specified parts of it.
If the current object already points to a local version, then UnsupportedOperation is raised.
If you know certain files have changed, you can bring remote/local into sync by downloading/uploading those particular files.
- Parameters:
- path
Specific subdirectory or file to download. If None, the entire version is downloaded.
If None, and the local version already exists, and overwrite is False, then download will not happen. However, if the local version is incomplete or corrupt compared to the remote counterpart (the same version), the code wouldn’t know.
If not None, then the specified sub-directory or file will be downloaded (into the expected locations for the version). This is meant for “repair work” if you know certain parts of the local version are corrupt or missing. If the version does not exist locally, there is hardly a scenario for downloading only parts of it (and that may cause issues later, as you are creating an incomplete local version).
- overwrite
If True, overwrite any file that exists locally.
- **kwargs
If path is None, this is passed on to upathlib.Upath.download_dir. If path is not None, this is ignored.
- Returns:
- int
The number of files downloaded.
Warning
You should not use overwrite=True lightly just to ensure it proceeds. The default overwrite=False prevents re-downloading when the local version already exists. Try to benefit from such savings as far as you can.
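The skip-unless-overwrite behavior for a whole-version download can be simulated with two local directories standing in for remote and local storage. `download_version` is a hypothetical function, not the library's method; returning 0 when the download is skipped is an assumption, and this simplified sketch copies every file whenever it does proceed.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path


def download_version(remote_dir: Path, local_dir: Path, *, overwrite: bool = False) -> int:
    """Hypothetical stand-in for a whole-version download; returns files copied."""
    # An existing local version (marked by info.json) is trusted as is
    # unless the caller explicitly asks to overwrite.
    if (local_dir / "info.json").is_file() and not overwrite:
        return 0
    count = 0
    for src in sorted(remote_dir.rglob("*")):
        if src.is_file():
            dst = local_dir / src.relative_to(remote_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(src, dst)
            count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    remote_dir = Path(tmp) / "remote" / "20210322-120529"
    local_dir = Path(tmp) / "local" / "20210322-120529"
    (remote_dir / "data").mkdir(parents=True)
    (remote_dir / "info.json").write_text("{}")
    (remote_dir / "data" / "a.txt").write_text("good")

    first = download_version(remote_dir, local_dir)   # copies everything
    second = download_version(remote_dir, local_dir)  # skipped: already local

    (local_dir / "data" / "a.txt").write_text("corrupt")  # simulate local damage
    third = download_version(remote_dir, local_dir)       # still skipped
    still_corrupt = (local_dir / "data" / "a.txt").read_text() == "corrupt"

    forced = download_version(remote_dir, local_dir, overwrite=True)
    repaired = (local_dir / "data" / "a.txt").read_text() == "good"
```

The third call illustrates the caveat in the parameter description: with the default `overwrite=False`, a damaged local copy goes unnoticed because only the presence of the version is checked.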
- upload(path: str | None = None, *, overwrite: bool = False, **kwargs) → int
Analogous to download.
Return the number of files uploaded.
- ensure_local(*, init_kwargs: dict[str, Any] | None = None, **kwargs) → VersionedUploadable
Return a local object of this version that exists.
If self is local, then self is returned.
Otherwise, if the local version does not exist, it will be downloaded. If the local version exists, downloading will not happen. (This code has to assume the local version is sound, though.) To force downloading regardless, pass in overwrite=True, but don’t do that lightly!
- Parameters:
- init_kwargs
For special needs of a subclass that defines additional arguments for its __init__.
- **kwargs
Passed on to download.
Note
Calling ensure_local does not make the current object local; you need to receive and use the returned object, which is local.
- ensure_remote(*, init_kwargs: dict[str, Any] | None = None, **kwargs) → VersionedUploadable