versioned_uploadable

VersionedUploadable helps store and use a “dataset” in an exclusive directory in consistent and convenient ways. Specifically,

  1. The dataset is identified by a version string that is generated and sortable (datetime-based), so that the “newest” is always the “latest” version, and code can infer the latest version. The full path of the storage location is managed for the user, who only needs the version.

  2. The storage can be either local (on disk) or remote (in a cloud blob store). There are methods to download/upload between local and remote storages.

    However, “local” and “remote” are just labels for the two storage locations. Both can be on local disk, or both in the same cloud storage (in different “locations”), or in two different cloud storages, or one on local disk and the other in a cloud blob store. This is controlled by the classmethods VersionedUploadable.local_cls_upath() and VersionedUploadable.remote_cls_upath(), which subclasses must implement.

  3. Within the dataset, one can conveniently specify sub-directories and files relative to the “root”, and read/write. This navigation is the same regardless of whether the storage is local or remote.
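The sortable, datetime-based version string in item 1 can be illustrated with a minimal sketch. The format is inferred from the example version string '20210322-120529' that appears on this page; the helper name make_version and the way a tag is appended are assumptions for illustration, not the library's code.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def make_version(now: datetime, tag: str | None = None) -> str:
    """Hypothetical helper: format a timestamp as 'YYYYMMDD-HHMMSS'."""
    version = now.strftime("%Y%m%d-%H%M%S")
    # Assumption: the tag follows a hyphen; the real separator may differ.
    return f"{version}-{tag}" if tag else version


t0 = datetime(2021, 3, 22, 12, 5, 29)
# Created out of order on purpose.
versions = [make_version(t0 + timedelta(days=d)) for d in (2, 0, 1)]

# Fixed-width fields, most significant first, make string order match time order,
# so the "latest" version is simply the largest string.
latest = max(versions)
```

Because plain string comparison suffices, code can infer the newest version from directory names alone.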

exception cloudly.upathlib.versioned_uploadable.VersionExistsError

Bases: Exception

exception cloudly.upathlib.versioned_uploadable.VersionNotFoundError

Bases: Exception

class cloudly.upathlib.versioned_uploadable.VersionedUploadable

Bases: ABC

A subclass will customize remote_cls_upath() and local_cls_upath().

classmethod resolve_version(version: str, remote: bool | None = None) -> tuple[str, bool]

Given version as one of the special values—‘latest-local’, ‘latest-remote’, and ‘latest’—or an actual version string, and remote as None or explicit True/False, figure out the actual version and its remote-ness.

This is called by __init__().

Parameters:
version

Either one of the special values ‘latest’, ‘latest-local’, and ‘latest-remote’, or an actual version string like ‘20210322-120529’.

If version is an actual version string, then version and remote are returned as is, even if remote is None. It is checked that version is a valid version string, but existence of the version is not checked.

remote

If True, look in remote (cloud) storage only. If False, look in local storage only. If None, look in both remote and local.

If version is ‘latest-local’, then remote must be False or None.

If version is ‘latest-remote’, then remote must be True or None.

If version is ‘latest’, then find the latest between local and remote storages if remote is None, otherwise version becomes ‘latest-remote’ or ‘latest-local’ according to the value of remote.

Returns:
tuple

A tuple of two elements: the actual version string, and remote-ness.

Raises ValueError if the parameters are incompatible.

Raises VersionNotFoundError if no version is found that satisfies the request.
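The resolution rules above can be sketched as a standalone function. This is an illustration under stated assumptions, not the library's implementation: the function name resolve and its two list parameters are hypothetical, the exception class is a stand-in, validation of the version-string format is omitted, and the tie-breaking rule (local wins) is taken from the description of __init__() on this page.

```python
from __future__ import annotations


class VersionNotFoundError(Exception):
    """Stand-in for the exception documented in this module."""


def resolve(version, remote, local_versions, remote_versions):
    """Sketch of the documented rules; both lists are sorted old to new."""
    if version == "latest-local":
        if remote is True:
            raise ValueError("'latest-local' is incompatible with remote=True")
        remote = False
    elif version == "latest-remote":
        if remote is False:
            raise ValueError("'latest-remote' is incompatible with remote=False")
        remote = True
    elif version != "latest":
        # An actual version string is returned as is; existence is not checked.
        return version, remote

    candidates = []
    if remote is not True and local_versions:
        candidates.append((local_versions[-1], False))
    if remote is not False and remote_versions:
        candidates.append((remote_versions[-1], True))
    if not candidates:
        raise VersionNotFoundError(version)
    # The largest version wins; on a tie the local candidate is preferred.
    return max(candidates, key=lambda c: (c[0], not c[1]))
```

For example, with one local and one newer remote version, `resolve("latest", None, ...)` returns the remote version with remote-ness True, while `resolve("latest", False, ...)` returns the local one.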

classmethod parse_version(version: str) -> dict[str, str]

abstract classmethod local_cls_upath() -> LocalUpath

A subclass implements this method to determine the full path on the local disk for the entity represented by the particular subclass, i.e. a particular type of “dataset”.

The file-system structure directly under this path is determined by this class. Currently it contains a subdirectory called ‘versions’, which holds one subdirectory per version, named after the version string.

In the directory of one particular version, the content is determined by the user, who can create whatever subdirectories and files they want. This class uses one meta file, named “info.json”, in the root of the version’s directory.

abstract classmethod remote_cls_upath() -> BlobUpath

Analogous to local_cls_upath() but on the remote side.

classmethod local_version_upath(version: str) -> LocalUpath

Root directory of the specified version in the local storage.

classmethod remote_version_upath(version: str) -> BlobUpath

Root directory of the specified version in the remote storage.

classmethod local_versions() -> list[str]

Get a (potentially empty) list of the versions that exist on the local disk.

The elements in the list are sorted from small (old) to large (new).

Because remote_versions and local_versions list “directories” without checking their content, they might return invalid (corrupt or empty) versions. Users should delete such bad versions as they are discovered.

classmethod remote_versions() -> list[str]

Analogous to local_versions() but on the remote side.

classmethod has_local_version(version: str) -> bool

A version is considered existent if and only if the file “info.json” exists in its root directory.
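The difference between listing versions and checking that one exists can be sketched with the standard library. This is an illustration only: it mimics the ‘versions’ directory layout and the “info.json” rule described on this page using pathlib on a temporary directory, and the helper names list_versions and has_version are hypothetical.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def list_versions(cls_root: Path) -> list[str]:
    """List version directory names, old to new, without inspecting content."""
    versions_dir = cls_root / "versions"
    if not versions_dir.is_dir():
        return []
    return sorted(p.name for p in versions_dir.iterdir() if p.is_dir())


def has_version(cls_root: Path, version: str) -> bool:
    """A version counts as existing only if its 'info.json' is present."""
    return (cls_root / "versions" / version / "info.json").is_file()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    good = root / "versions" / "20210322-120529"
    good.mkdir(parents=True)
    (good / "info.json").write_text(json.dumps({}))
    # An empty directory, e.g. left behind by an interrupted write.
    (root / "versions" / "20210323-090000").mkdir()

    listed = list_versions(root)
    valid = [v for v in listed if has_version(root, v)]
```

Here `listed` contains both directory names while `valid` contains only the one holding “info.json”, which is why a listed version may still turn out to be unusable.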

classmethod has_remote_version(version: str) -> bool

Analogous to has_local_version() but on the remote side.

classmethod remove_local_version(version: str, **kwargs) -> None

Delete the entire directory of the specified version on the local disk.

By default, there is neither warning before the deletion nor progress printouts.

Parameters:
version

The exact version string. If the version does not exist, it’s a no-op.

**kwargs

Passed on to remove_dir().

classmethod remove_remote_version(version: str, **kwargs) -> None

Analogous to remove_local_version() but on the remote side.

classmethod new(*, tag: str | None = None, remote: bool = False, **kwargs) -> VersionedUploadable

If a subclass needs additional setup on a newly created object, it may override this classmethod.

The optional tag appends (human readable) info to the auto created version string, which is based on current date and time.

The returned object has attribute info, which is an empty dict. Nothing has been written to storage.

__init__(version: str, *, remote: bool | None = None, require_exists: bool = True)

This loads up an existing version for reading and writing. To create a new version, use the classmethod new().

Parameters:
version

Either the actual version string, or one of ‘latest’, ‘latest-local’, and ‘latest-remote’.

remote

Look for the version in local or remote storage?

If an explicit bool, it must be compatible with version. For example, version='latest-remote' and remote=False are not compatible.

If None, and version='latest', then the latest version between local and remote is found and used. If local and remote have the same latest version, then the local one is used.

If None, and version is an exact version, then find it either locally or remotely wherever it exists. If the version exists in both storages, then the local one is used.

require_exists

Default is True. If version is an exact version string but the version does not exist, VersionNotFoundError is raised. Usually you should leave this at the default. This is mainly for the call of __init__ in new(), where it needs to use require_exists=False.

version: str

version of the object

remote: bool

remote-ness of the object

property upath: Upath

Return the root directory of self.

This is consistent with the remote-ness of self.

path(*args: str) -> Upath

Return a path relative to the root directory of self.

Examples:

self.path()  # the root path
self.path('info.json')  # file in root directory
self.path('abc', 'de', 'data.parquet')  # file 'abc/de/data.parquet' under root directory
self.path('abc/de/data.parquet')  # same as above

Typically, you proceed to read or write with the returned path (object), e.g.,

self.path('info.json').write_json(self.info, overwrite=True)
info = self.path('info.json').read_json()

This is consistent with the remote-ness of self. In other words, if self is local, then the returned path is local (under the local root directory); otherwise, the returned path is remote.

self.path('abc.txt') is equivalent to (self.upath / 'abc.txt').

Note

Don’t start args with '/'.
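The path-joining behavior described above can be mimicked with the standard library's pathlib. This is only an analogy: the root location is a made-up example, the module-level path function is a hypothetical stand-in for the method, and upathlib may treat a leading '/' differently from pathlib.

```python
from pathlib import PurePosixPath

# Hypothetical root directory of one version.
root = PurePosixPath("/data/my-dataset/versions/20210322-120529")


def path(*args: str) -> PurePosixPath:
    """Join segments under the version root, like the documented path(*args)."""
    return root.joinpath(*args)


a = path("abc", "de", "data.parquet")
b = path("abc/de/data.parquet")
c = root / "abc" / "de" / "data.parquet"

# In pathlib, an absolute segment discards everything before it, so the
# result is no longer under the version root.
escaped = path("/abc.txt")
```

In this analogy `a`, `b`, and `c` are equal, while `escaped` points outside the version directory, which shows one way a leading '/' can go wrong.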

save() -> None

A subclass should re-implement this method to save its own stuff like data, summary, and whatever, and in the end call super().save().
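The override pattern for save() might look like the following sketch. It does not use the real base class: BaseDataset is a stand-in that is assumed to write self.info to “info.json”, and WordList with its “words.txt” file is a hypothetical subclass invented for illustration.

```python
import json
import tempfile
from pathlib import Path


class BaseDataset:
    """Stand-in base class; assumed to persist self.info to 'info.json'."""

    def __init__(self, root: Path):
        self.root = root
        self.info: dict = {}

    def save(self) -> None:
        self.root.mkdir(parents=True, exist_ok=True)
        (self.root / "info.json").write_text(json.dumps(self.info))


class WordList(BaseDataset):
    """Hypothetical subclass that stores a list of words."""

    def __init__(self, root: Path, words):
        super().__init__(root)
        self.words = list(words)

    def save(self) -> None:
        # Write the subclass's own data first ...
        self.root.mkdir(parents=True, exist_ok=True)
        (self.root / "words.txt").write_text("\n".join(self.words))
        self.info["n_words"] = len(self.words)
        # ... and call the parent's save() at the end, as the docs advise.
        super().save()
```

One plausible benefit of calling super().save() last, under the assumption above, is that the meta file is written only after the data files are in place.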

download(path: str | None = None, *, overwrite: bool = False, **kwargs) -> int

Download the entire dataset or specified parts of it.

If the current object already points to a local version, then UnsupportedOperation is raised.

If you know certain files have changed, you can bring remote/local into sync by downloading/uploading those particular files.

Parameters:
path

Specific subdirectory or file to download. If None, the entire version is downloaded.

If None, and the local version already exists, and overwrite is False, then download will not happen. However, if the local version is incomplete or corrupt compared to the remote counterpart (the same version), the code wouldn’t know.

If not None, then the specified sub-directory or file will be downloaded (into the expected locations for the version). This is meant for “repair work” if you know certain parts of the local version are corrupt or missing. If the version does not exist locally, there is hardly a scenario for downloading only parts of it (and that may cause issues later, as you are creating an incomplete local version).

overwrite

If True, overwrite any file that exists locally.

**kwargs

If path is None, this is passed on to upathlib.Upath.download_dir. If path is not None, this is ignored.

Returns:
int

The number of files downloaded.

Warning

You should not use overwrite=True lightly just to ensure it proceeds. The default overwrite=False prevents re-downloading when the local version already exists. Try to benefit from such savings as far as you can.
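The effect of overwrite on a whole-version download can be sketched with plain directories standing in for the remote and local storages. The function below is a hypothetical illustration, not the library's code; in particular, returning 0 when the download is skipped is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def download(remote_dir: Path, local_dir: Path, *, overwrite: bool = False) -> int:
    """Copy a whole version directory; return the number of files copied."""
    # Existence is judged by the meta file only, so an incomplete local
    # copy that still has 'info.json' would go unnoticed.
    if (local_dir / "info.json").is_file() and not overwrite:
        return 0  # assumption: a skipped download reports zero files
    n = 0
    for src in remote_dir.rglob("*"):
        if src.is_file():
            dst = local_dir / src.relative_to(remote_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(src, dst)
            n += 1
    return n
```

Calling this twice in a row copies files the first time and nothing the second time, which is the saving that the default overwrite=False is meant to provide.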

upload(path: str | None = None, *, overwrite: bool = False, **kwargs) -> int

Analogous to download.

Return the number of files uploaded.

ensure_local(*, init_kwargs: dict[str, Any] | None = None, **kwargs) -> VersionedUploadable

Return a local object of this version that exists.

If self is local, then self is returned.

Otherwise, if the local version does not exist, it will be downloaded. If the local version exists, downloading will not happen. (This code has to assume the local version is sound, though.) To force downloading regardless, pass in overwrite=True, but don’t do that lightly!

Parameters:
init_kwargs

For special needs of a subclass that defines additional arguments for its __init__.

**kwargs

Passed on to download.

Note

Calling ensure_local does not make the current object local; you need to receive and use the returned object, which is local.

ensure_remote(*, init_kwargs: dict[str, Any] | None = None, **kwargs) -> VersionedUploadable