upathlib

upathlib defines a unified API for cloud blob stores (aka “object stores”) as well as local file systems.

Attention is focused on identifying the most essential functionalities while working with a blob store for data processing. Functionalities in a traditional local file system that are secondary in these tasks—such as symbolic links, fine-grained permissions, and various access modes—are ignored.

End users should look to the class Upath for documentation of the API. The local file system is implemented by LocalUpath, which subclasses Upath. The client for Google Cloud Storage (i.e. the blob store on GCP) is implemented by another Upath subclass, namely GcsBlobUpath.

One use case is the module cloudly.biglist, where the class Biglist takes a Upath object to indicate its storage location. It does not care whether the storage is local or in a cloud blob store; it simply uses the common API to operate on the storage.

Quickstart

Let’s carve out a space in the local file system and poke around.

>>> from cloudly.upathlib import LocalUpath
>>> p = LocalUpath('/tmp/abc')

This creates a LocalUpath object p that points to the location '/tmp/abc'. This may be an existing file, or directory, or may be nonexistent. We know this is a temporary location; to be sure we have a clear playground, let’s wipe out anything and everything:

>>> p.rmrf()
0

Think rm -rf /tmp/abc. It does just that. The returned 0 means zero files were deleted.

Now let’s create a file and write something to it:

>>> (p / 'x.txt').write_text('first')

This creates file /tmp/abc/x.txt with the content 'first'. Note the directory '/tmp/abc' did not exist before the call. We did not need to “create the parent directory”. In fact, upathlib does not provide a way to do that. In upathlib, “directory” is a “virtual” thing that is embodied by a group of files. For example, if there exist

/tmp/abc/x.txt
/tmp/abc/d/y.data

we say there are directories '/tmp/abc' and '/tmp/abc/d', but we don’t create these “directories” by themselves. These directories come into being when such files exist.

Let’s actually create these files:

>>> (p / 'x.txt').write_text('second', overwrite=True)
>>> (p / 'd' / 'y.data').write_bytes(b'0101')

Now let’s look into this directory:

>>> p.is_dir()
True
>>> (p / 'd').is_dir()
True
>>> (p / 'x.txt').is_dir()
False
>>> (p / 'x.txt').is_file()
True

We can navigate in the directory. For example,

>>> for v in sorted(p.iterdir()):  # the sort merely makes the result stable
...     print(v)
/tmp/abc/d
/tmp/abc/x.txt

This is only the first level, or “direct children”. We can also use “recursive iterdir” to get all files under the directory, descending into subdirectories recursively:

>>> for v in sorted(p.riterdir()):  # the sort merely makes the result stable
...     print(v)
/tmp/abc/d/y.data
/tmp/abc/x.txt

This time only files are listed. Subdirectories do not show up because, after all, they are not real in the upathlib concept.
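The difference between the two listings can be sketched over a flat collection of file names, which is how a blob store sees things. This is only an illustration of the idea, not upathlib's implementation; the helper names `iter_children` and `riter_files` are hypothetical.

```python
def riter_files(names, directory):
    """Yield every file name located anywhere under `directory`."""
    prefix = directory.rstrip("/") + "/"
    for name in names:
        if name.startswith(prefix):
            yield name


def iter_children(names, directory):
    """Yield immediate children of `directory`: files and "virtual" subdirectories."""
    prefix = directory.rstrip("/") + "/"
    seen = set()
    for name in riter_files(names, directory):
        # The first component after the prefix identifies the direct child.
        first = name[len(prefix):].split("/", 1)[0]
        child = prefix + first
        if child not in seen:
            seen.add(child)
            yield child


# The two files from the walkthrough above.
files = ["/tmp/abc/x.txt", "/tmp/abc/d/y.data"]
children = sorted(iter_children(files, "/tmp/abc"))
all_files = sorted(riter_files(files, "/tmp/abc"))
print(children)   # direct children, including the virtual directory 'd'
print(all_files)  # files only, at any depth
```

The subdirectory appears in the first listing only because some file name passes through it; nothing is stored for the directory itself.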

We can as easily read a file, like

>>> (p / 'x.txt').read_text()
'second'

Several common file formats are provided out of the box, including text, bytes, json, and pickle, as well as compressed versions by zlib and Zstandard.
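To make "compressed versions" concrete, here is a rough sketch of a zlib-compressed pickle round trip using only the standard library. The function names are hypothetical and this is not how upathlib necessarily implements its formats; it merely shows the kind of serialization such a format implies.

```python
import pickle
import zlib


def dump_pickle_zlib(obj) -> bytes:
    """Serialize `obj` with pickle, then compress the bytes with zlib."""
    return zlib.compress(pickle.dumps(obj))


def load_pickle_zlib(data: bytes):
    """Reverse of dump_pickle_zlib: decompress, then unpickle."""
    return pickle.loads(zlib.decompress(data))


record = {"name": "John", "age": 38}
blob = dump_pickle_zlib(record)    # bytes suitable for a bytes-writing call
restored = load_pickle_zlib(blob)  # the original object
```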

Let’s do some JSON:

>>> pp = p / 'e/f/g/data.json'
>>> pp.write_json({'name': 'John', 'age': 38})

We know the JSON file is also a text file, so we can treat it as such:

>>> pp.read_text()
'{"name": "John", "age": 38}'

But usually we prefer to get back the Python object directly:

>>> v = pp.read_json()
>>> v
{'name': 'John', 'age': 38}
>>> type(v)
<class 'dict'>

We can go “down” the directory tree using /. Conversely, we can go “up” using the parent property:

>>> pp.path
PosixPath('/tmp/abc/e/f/g/data.json')
>>> pp.parent
LocalUpath('/tmp/abc/e/f/g')
>>> pp.parent.parent
LocalUpath('/tmp/abc/e/f')
>>> pp.parent.parent.is_dir()
True
>>> pp.parent.parent.is_file()
False

or the terminal-lovers’ ..:

>>> pp
LocalUpath('/tmp/abc/e/f/g/data.json')
>>> pp / '..'
LocalUpath('/tmp/abc/e/f/g')
>>> pp / '..' / '..'
LocalUpath('/tmp/abc/e/f')

Under the hood, / delegates to a call to joinpath():

>>> pp.joinpath('../../o/p/q')
LocalUpath('/tmp/abc/e/f/o/p/q')
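The way '..' segments collapse in the output above can be mimicked with the standard library's posixpath module. This is a sketch of the observable behavior only, not a statement about how joinpath() is implemented; `join_and_normalize` is a hypothetical helper. Note that pathlib's pure paths keep '..' components as they are, so posixpath.normpath is used here instead.

```python
import posixpath


def join_and_normalize(base: str, *segments: str) -> str:
    """Join segments onto `base`, then collapse '.' and '..' components."""
    return posixpath.normpath(posixpath.join(base, *segments))


result = join_and_normalize("/tmp/abc/e/f/g/data.json", "../../o/p/q")
print(result)
```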

Let’s see again what we have:

>>> sorted(p.riterdir())
[LocalUpath('/tmp/abc/d/y.data'), LocalUpath('/tmp/abc/e/f/g/data.json'), LocalUpath('/tmp/abc/x.txt')]

and to get rid of them all:

>>> p.rmrf()
3

A nice thing about upathlib is its “unified” nature across local and cloud storage. Had we set up the environment to use Google Cloud Storage, we could have started this exercise with

>>> from cloudly.gcp.storage import GcsBlobUpath
>>> p = GcsBlobUpath('gs://my-bucket/tmp/abc')

Everything after this would work unchanged. (The printouts would be different at some places, e.g. LocalUpath would be replaced by GcsBlobUpath.)

Upath

Upath is an abstract base class that defines the APIs and some of the implementation. Subclasses tailor to particular storage systems. Currently there are two production-ready subclasses; they implement Upath for local file systems and Google Cloud Storage, respectively.

The APIs follow the style of the standard library pathlib where appropriate.

class cloudly.upathlib.Upath[source]#

Bases: ABC

__init__(*pathsegments: str)[source]#

Create a Upath instance. Because Upath is an abstract class, this is always called on a subclass to instantiate a path on the specific storage system.

Subclasses for cloud blob stores may need additional parameters representing, e.g., the container or bucket name.

Parameters:

*pathsegments

Analogous to the input to pathlib.Path. The first segment may or may not start with '/'. The path constructed with *pathsegments is always “absolute” under a known “root”.

For a local POSIX file system, the root is the usual '/'.

For a local Windows file system, the root is resolved to a particular “drive”.

For Azure blob store, the root is that in a “container”.

For AWS and GCP blob stores, the root is that in a “bucket”.

If missing, the path constructed is the “root”. However, the subclass LocalUpath plugs in the current working directory for a missing *pathsegments.

Note

If one segment starts with '/', it will reset to the “root” and discard all the segments that have come before it. For example, Upath('work', 'projects', '/', 'projects') is the same as Upath('/', 'projects').
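The same reset rule exists in the standard library's pathlib, which the API is modeled on, so it can be observed without any storage backend. The snippet below uses PurePosixPath as a stand-in; it does not show the other Upath behavior of always anchoring the path under a root.

```python
from pathlib import PurePosixPath

# A later segment starting with '/' discards everything before it.
a = PurePosixPath("work", "projects", "/", "projects")
b = PurePosixPath("/", "projects")
print(a, b)
```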

Note

The first element of *pathsegments may start with some platform-specific strings. For example, '/' on Linux, 'c://' on Windows, 'gs://' on Google Cloud Storage. Please see subclasses for specifics.

property path: PurePath#

The pathlib.PurePath version of the internal path string.

In the subclass LocalUpath, this property is overridden to return a pathlib.Path, which is a subclass of pathlib.PurePath.

In subclasses for cloud blob stores, this implementation stays in effect.

abstract as_uri() → str[source]#: Represent the path as a file URI. See subclasses for platform-dependent specifics.

property name: str#

A string representing the final path component, excluding the drive and root, if any.

This is the name component of self.path. If self.path is '/' (the root), then an empty string is returned. (The name of the root path is empty.)

Examples

>>> from cloudly.upathlib import LocalUpath
>>> p = LocalUpath('/tmp/test/upathlib/data/sales.txt.gz')
>>> p.path
PosixPath('/tmp/test/upathlib/data/sales.txt.gz')
>>> p.name
'sales.txt.gz'
>>> p.parent.parent.parent.parent
LocalUpath('/tmp')
>>> p.parent.parent.parent.parent.name
'tmp'
>>> p.parent.parent.parent.parent.parent
LocalUpath('/')
>>> p.parent.parent.parent.parent.parent.name
''
>>> # the parent of root is still root:
>>> p.parent.parent.parent.parent.parent.parent
LocalUpath('/')

property stem: str#

The final path component, without its suffix.

This is the “stem” part of self.name.

Examples

>>> from cloudly.upathlib import LocalUpath
>>> p = LocalUpath('/tmp/test/upathlib/data/sales.txt')
>>> p
LocalUpath('/tmp/test/upathlib/data/sales.txt')
>>> p.path
PosixPath('/tmp/test/upathlib/data/sales.txt')
>>> p.name
'sales.txt'
>>> p.stem
'sales'
>>> p = LocalUpath('/tmp/test/upathlib/data/sales.txt.gz')
>>> p.stem
'sales.txt'

property suffix: str#: The file extension of the final component, if any

property suffixes: list[str]#

A list of the path’s file extensions.

Examples

>>> p = LocalUpath('/tmp/test/upathlib/data/sales.txt')
>>> p.suffix
'.txt'
>>> p.suffixes
['.txt']
>>> p = LocalUpath('/tmp/test/upathlib/data/sales.txt.gz')
>>> p.suffix
'.gz'
>>> p.suffixes
['.txt', '.gz']

exists() → bool[source]#

Return True if the path is an existing file or dir; False otherwise.

Examples

In a blob store with blobs

/a/b/cd
/a/b/cd/e.txt

'/a/b/cd' exists, and is both a file and a dir; '/a/b/cd/e.txt' exists, and is a file; '/a/b' exists, and is a dir; '/a/b/c' does not exist.
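The statements above follow from treating the store as a flat set of names. A minimal sketch, with hypothetical module-level helpers standing in for the methods (not the library's code):

```python
# The two blobs from the example above.
blobs = {"/a/b/cd", "/a/b/cd/e.txt"}


def is_file(path: str) -> bool:
    """A path is a file if a blob has exactly this name."""
    return path in blobs


def is_dir(path: str) -> bool:
    """A path is a "directory" if some blob name continues past it with '/'."""
    prefix = path.rstrip("/") + "/"
    return any(name.startswith(prefix) for name in blobs)


def exists(path: str) -> bool:
    """A path exists if it is a file, a directory, or both."""
    return is_file(path) or is_dir(path)
```

Note that '/a/b/c' does not exist even though '/a/b/cd' starts with the same characters: the match must end at a '/' boundary.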

abstract is_dir() → bool[source]#

Return True if the path is an existing directory; False otherwise.

If there exists a file named like

/a/b/c/d.txt

we say '/a', '/a/b', '/a/b/c' are existing directories.

In a cloud blob store, there’s no such thing as an “empty directory”, because there is no concept of “directory”. A blob store just consists of files (aka blobs) with names, which could contain the character ‘/’, with no special meaning attached to it. We interpret the name '/a/b' as a directory to emulate the familiar concept in a local file system when there exist files named '/a/b/*'.

In a local file system, there can be empty directories. However, it is recommended to not have empty directories.

There is no method for “creating an empty dir” (like the Linux command mkdir). Simply create a file under the dir, and the dir will come into being. This is analogous to how we create files: we don’t “create” an empty file in advance; we simply write to the would-be path of the file to be created.

abstract is_file() → bool[source]#

Return True if the path is an existing file; False otherwise.

In a cloud blob store, a path can be both a file and a dir. For example, if these blobs exist:

/a/b/c/d.txt
/a/b/c

we say /a/b/c is a “file”, and also a “dir”. Users are advised to avoid such naming.

This situation does not happen in a local file system.

abstract file_info() → FileInfo | None[source]#: If is_file() is False, return None; otherwise, return file info.

property parent: Self#

The parent of the path.

If the path is the root, then the parent is still the root.

abstract property root: Self#: Return a new path representing the root.

joinpath(*other: str) → Self[source]#

Join this path with more segments, return the new path object.

Calling this method is equivalent to combining the path with each of the other arguments in turn.

If self was created by Upath(*segs), then this method essentially returns Upath(*segs, *other).

If *other is a single string, there is a shortcut by the operator /, implemented by __truediv__().

with_name(name: str) → Self[source]#

Return a new path with the “name” part replaced by the new value. If the original path doesn’t have a name (i.e. the original path is the root), ValueError is raised.

Examples

>>> p = LocalUpath('/tmp/test/upathlib/data/sales.txt.gz')
>>> p.with_name('sales.data')
LocalUpath('/tmp/test/upathlib/data/sales.data')

with_stem(stem: str) → Self[source]#
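No description is given for with_stem() here. Assuming it mirrors the standard library method of the same name (an assumption based on the API following pathlib's style, not something confirmed on this page), the pathlib behavior looks like this:

```python
from pathlib import PurePosixPath

p = PurePosixPath("/tmp/test/upathlib/data/sales.txt.gz")
# In pathlib, the stem is the name minus the LAST suffix only ('sales.txt'),
# so replacing it keeps just the final suffix '.gz'.
q = p.with_stem("orders")
print(q)
```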

with_suffix(suffix: str) → Self[source]#

Return a new path with the suffix replaced by the specified value. If the original path doesn’t have a suffix, the new suffix is appended instead. If suffix is an empty string, the original suffix is removed.

suffix should include a dot, like '.txt'.

Examples

>>> p = LocalUpath('/tmp/test/upathlib/data/sales.txt.gz')
>>>
>>> # replace the last suffix:
>>> p.with_suffix('.data')
LocalUpath('/tmp/test/upathlib/data/sales.txt.data')
>>>
>>> # remove the last suffix:
>>> p.with_suffix('')
LocalUpath('/tmp/test/upathlib/data/sales.txt')
>>>
>>> p.with_suffix('').with_suffix('.bin')
LocalUpath('/tmp/test/upathlib/data/sales.bin')
>>>
>>> pp = p.with_suffix('').with_suffix('')
>>> pp
LocalUpath('/tmp/test/upathlib/data/sales')
>>>
>>> # no suffix to remove:
>>> pp.with_suffix('')
LocalUpath('/tmp/test/upathlib/data/sales')
>>>
>>> # add a suffix:
>>> pp.with_suffix('.pickle')
LocalUpath('/tmp/test/upathlib/data/sales.pickle')

abstract write_bytes(data: bytes | BufferedReader, *, overwrite: bool = False) → None[source]#

Write bytes data to the current file.

Parent “directories” are created as needed, if applicable.

If overwrite is False and the current file exists, FileExistsError is raised.

data is either “byte-like” (such as bytes, bytearray, memoryview) or “file-like” opened in “binary” mode. In the second case, the file should be positioned at the beginning (such as by calling .seek(0)).
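The contract described above can be sketched for a local file system with pathlib. This is an illustrative stand-in written as a free function, not the library's implementation; it assumes a "file-like" input is anything with a read() method.

```python
import io
import tempfile
from pathlib import Path


def write_bytes(path: Path, data, *, overwrite: bool = False) -> None:
    """Write bytes-like or binary file-like `data` to `path`."""
    if path.is_file() and not overwrite:
        raise FileExistsError(path)
    # Parent "directories" are created as needed.
    path.parent.mkdir(parents=True, exist_ok=True)
    if hasattr(data, "read"):
        # File-like input: read from its current position.
        data = data.read()
    path.write_bytes(data)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "d" / "y.data"
    write_bytes(target, b"0101")          # parent 'd' did not exist beforehand
    try:
        write_bytes(target, b"xxxx")      # overwrite defaults to False
        refused = False
    except FileExistsError:
        refused = True
    write_bytes(target, io.BytesIO(b"1010"), overwrite=True)
    final = target.read_bytes()
```

If the file-like object had already been read, its position would be at the end and nothing would be written, which is why the source should be rewound first.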

abstract read_bytes() → bytes[source]#

Return the binary contents of the pointed-to file as a bytes object.

If self is not a file or does not exist, FileNotFoundError is raised.

write_text(data: str, *, overwrite: bool = False, encoding: str | None = None, errors: str | None = None) → None[source]#

Write text data to the current file.

Parent “directories” are created as needed, if applicable.

If overwrite is False and the current file exists, FileExistsError is raised.

encoding and errors are passed to encode(). Usually you should leave them at the default values.

read_text(*, encoding: str | None = None, errors: str | None = None) → str[source]#

Return the decoded contents of the pointed-to file as a string.

If self is not a file or does not exist, FileNotFoundError is raised.

encoding and errors are passed to decode(). Usually you should leave them at the default values.

write_json(data: Any, *, overwrite=False, **kwargs) → None[source]#

read_json(**kwargs) → Any[source]#

write_pickle(data: Any, *, overwrite=False, **kwargs) → None[source]#

read_pickle(**kwargs) → Any[source]#

write_pickle_zstd(data: Any, *, overwrite=False, **kwargs) → None[source]#

read_pickle_zstd(**kwargs) → Any[source]#

write_parquet(data: list[dict], *, overwrite=False, **kwargs) → None[source]#

read_parquet(**kwargs) → list[dict][source]#

write_csv(data, *, overwrite=False, use_pandas: bool = False, **kwargs) → None[source]#: If use_pandas is True, then data is a pandas DataFrame. Otherwise, data is iterable[tuple] | iterable[dict[str, Any]].

read_csv(*, use_pandas: bool = False, **kwargs)[source]#: If use_pandas is True, then return a pandas DataFrame. Otherwise, return a list.

copy_dir(source: str | Upath, *, overwrite: bool = False, quiet: bool = False, concurrent: bool = True) → int[source]#

Copy the content of the source directory recursively to self.

If source is a string, then it is in the same store as the current path, and it is either absolute, or relative to self.parent.

If source is not a string, then it must be an instance of a Upath() subclass, and it may be in any store system.

Immediate children of source will be copied as immediate children of self.

There is no such error as “target directory exists” as the copy-operation only concerns individual files. If the target directory (i.e. self) contains files that do not have counterparts in the source directory, they will stay untouched.

overwrite is file-wise. If False, any existing target file will raise FileExistsError and halt the operation. If True, any existing target file will be overwritten by the source file.

quiet controls whether to print out progress info.

Returns:

int: The number of files copied.
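The copy semantics described above (children map to children, overwrite applies per file, unrelated target files are left alone, the file count is returned) can be sketched for local directories as follows. This is a simplified, sequential stand-in with a hypothetical free-function signature; it does not reproduce the quiet or concurrent options.

```python
import shutil
import tempfile
from pathlib import Path


def copy_dir(source: Path, target: Path, *, overwrite: bool = False) -> int:
    """Copy every file under `source` to the same relative place under `target`."""
    count = 0
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        dst = target / src.relative_to(source)
        if dst.is_file() and not overwrite:
            # File-wise check: the first conflict halts the operation.
            raise FileExistsError(dst)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
        count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "src", Path(tmp) / "dst"
    (src / "d").mkdir(parents=True)
    (src / "x.txt").write_text("first")
    (src / "d" / "y.data").write_bytes(b"0101")
    dst.mkdir()
    (dst / "keep.txt").write_text("untouched")  # no counterpart in src

    copied = copy_dir(src, dst)
    result = sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*") if p.is_file())
```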

copy_file(source: str | Upath, *, overwrite: bool = False) → None[source]#

Copy the source file to the current file (i.e. self).

If source is str, then it is in the same store as the current path, and it is either absolute, or relative to self.parent. For example, if self is '/a/b/c/d.txt', then source='e.txt' means '/a/b/c/e.txt'.
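The rule for a string source can be illustrated with posixpath. The helper name `resolve_source` is hypothetical and this is only a sketch of the stated rule, not the library's code.

```python
import posixpath


def resolve_source(self_path: str, source: str) -> str:
    """Resolve a string `source` against the parent of `self_path`.

    posixpath.join keeps an absolute `source` unchanged and appends a
    relative one to the parent directory.
    """
    parent = posixpath.dirname(self_path)
    return posixpath.join(parent, source)


relative = resolve_source("/a/b/c/d.txt", "e.txt")
absolute = resolve_source("/a/b/c/d.txt", "/x/y.txt")
print(relative, absolute)
```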

If source is not a string, then it must be an instance of a Upath() subclass, and it may be in any storage system.

If source is not an existing file, FileNotFoundError is raised.

If self is an existing file and overwrite is False, FileExistsError is raised. If overwrite is True, then self will be overwritten.

If type(source) is LocalUpath and self is an existing directory, then IsADirectoryError is raised. In a cloud blob store, there is no concrete “directory”. For example, suppose self is ‘gs://mybucket/backup/record’ on Google Cloud Storage and source is the string ‘/experiment/data’; then source refers to ‘gs://mybucket/experiment/data’. If there exists a blob ‘gs://mybucket/backup/record/y’, then we say ‘gs://mybucket/backup/record’ is a “directory”. However, this is merely a “virtual” concept, an emulation of the “directory” concept on local disk. As long as the self path is not an existing blob, the copy will proceed with no problem. Nevertheless, such naming is confusing and better avoided.

Return the number of files copied (either 0 or 1).

remove_dir(*, quiet: bool = True, concurrent: bool = True) → int[source]#

Remove the current directory (i.e. self) and all its contents recursively.

Essentially, this removes each file that is yielded by riterdir(). Subclasses should take care to remove “empty directories”, if applicable, that are left behind.

quiet controls whether to print progress info.

Returns:

int: The number of files removed.
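For a local file system, "remove each file, then clean up the empty directories left behind" might look like the sketch below. It is a sequential illustration under simplifying assumptions (no symlinks, no quiet or concurrent options), written as a hypothetical free function rather than the library's method.

```python
import tempfile
from pathlib import Path


def remove_dir(directory: Path) -> int:
    """Delete all files under `directory`, then the emptied directories."""
    if not directory.is_dir():
        return 0
    count = 0
    for f in [p for p in directory.rglob("*") if p.is_file()]:
        f.unlink()
        count += 1
    # Remove the now-empty subdirectories, deepest first.
    subdirs = [p for p in directory.rglob("*") if p.is_dir()]
    for d in sorted(subdirs, key=lambda p: len(p.parts), reverse=True):
        d.rmdir()
    directory.rmdir()
    return count


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "abc"
    for rel in ("x.txt", "d/y.data", "e/f/g/data.json"):
        f = root / rel
        f.parent.mkdir(parents=True, exist_ok=True)
        f.write_text("data")
    removed = remove_dir(root)   # counts files only, not directories
    still_there = root.exists()
```

In a blob store the second phase is unnecessary, since directories vanish as soon as no blob name passes through them.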

abstract remove_file(missing_ok: bool = False) → None[source]#

Remove the current file (i.e. self).

If self is not an existing file, FileNotFoundError is raised, unless missing_ok is True. If the file exists but can’t be removed, the platform-dependent exception is propagated.

abstract iterdir() → Iterator[Self][source]#

Yield the immediate (i.e. non-recursive) children of the current dir (i.e. self).

The yielded elements are instances of the same class. Each yielded element is either a file or a dir. There is no guarantee on the order of the returned elements.

If self is not a dir (e.g. maybe it’s a file), or does not exist at all, nothing is yielded (resulting in an empty iterable); no exception is raised.

LocalUpath

class cloudly.upathlib.LocalUpath[source]#

Bases: Upath, PathLike

__init__(*pathsegments: str)[source]#

Create a path on the local file system. Both POSIX and Windows platforms are supported.

*pathsegments specify the path, either absolute or relative to the current working directory. If missing, the constructed path is the current working directory. This is passed to pathlib.Path.

property path: Path#: Return the pathlib.Path object of the path.

as_uri() → str[source]#: Represent the path as a file URI. On Linux, this is like ‘file:///home/username/path/to/file’. On Windows, this is like ‘file:///C:/Users/username/path/to/file’.

is_dir() → bool[source]#: Return whether the current path is a dir.

is_file() → bool[source]#: Return whether the current path is a file.

file_info() → FileInfo | None[source]#: Return file info if the current path is a file; otherwise return None.

property root: LocalUpath#

Return a new path representing the root.

On Windows, this is the root on the same drive, like LocalUpath('C:'). On Linux and Mac, this is LocalUpath('/').

read_bytes() → bytes[source]#: Read the content of the current file as bytes.

write_bytes(data: bytes | BufferedReader, *, overwrite: bool = False)[source]#: Write the bytes data to the current file.

copy_file(source: str | Upath, *, overwrite: bool = False) → None[source]#

Copy the source file to the current file (i.e. self).

If source is not a string, then it must be an instance of a Upath() subclass, and it may be in any storage system.

If source is not an existing file, FileNotFoundError is raised.

If self is an existing file and overwrite is False, FileExistsError is raised. If overwrite is True, then self will be overwritten.

Return the number of files copied (either 0 or 1).

copy_dir(source: str | Upath, *, overwrite: bool = False, quiet: bool = False, concurrent: bool = True) → int[source]#

Copy the content of the source directory recursively to self.

If source is a string, then it is in the same store as the current path, and it is either absolute, or relative to self.parent.

If source is not a string, then it must be an instance of a Upath() subclass, and it may be in any store system.

Immediate children of source will be copied as immediate children of self.

overwrite is file-wise. If False, any existing target file will raise FileExistsError and halt the operation. If True, any existing target file will be overwritten by the source file.

quiet controls whether to print out progress info.

Returns:

int: The number of files copied.

remove_dir(**kwargs) → int[source]#: Remove the current dir along with all its contents recursively.

remove_file(missing_ok: bool = False) → None[source]#: Remove the current file.

rename_dir(target: str | LocalUpath, *, overwrite: bool = False, quiet: bool = False, concurrent: bool = True) → LocalUpath[source]#

Rename the current dir (i.e. self) to target.

overwrite is applied file-wise. If there are files under target that do not have counterparts under self, they are left untouched.

quiet controls whether to print progress info.

Return the new path.

rename_file(target: str | LocalUpath, *, overwrite: bool = False) → LocalUpath[source]#

Rename the current file (i.e. self) to target in the same store.

target is either absolute or relative to self.parent. For example, if self is ‘/a/b/c/d.txt’, then target='e.txt' means ‘/a/b/c/e.txt’.

If overwrite is False (the default) and the target file exists, FileExistsError is raised.

Return the new path.

iterdir() → Iterator[LocalUpath][source]#: Yield the immediate children under the current dir.

riterdir() → Iterator[LocalUpath][source]#: Yield all files under the current dir recursively.

lock(*, timeout=None)[source]#: This uses the package filelock to implement a file lock for inter-process communication.

Note

At the end, this file is not deleted. If it is purely a dummy file to implement locking for other things, the user may want to delete this file after use.

BlobUpath

class cloudly.upathlib._blob.BlobUpath[source]#

Bases: Upath

BlobUpath is a base class for paths in a cloud storage, aka “blob store”. This is in contrast to a local disk storage, which is implemented by LocalUpath.

property blob_name: str#: Return the “name” of the blob. This is the “path” without a leading '/'. In cloud blob stores, this is exactly the name of the blob. The name often contains '/', which has no special role in the name per se but is interpreted by users to be a directory separator.
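The relation between the path and the blob name can be sketched in a couple of lines. The helper functions are hypothetical, and the bucket name 'my-bucket' is just the illustrative one from the Quickstart.

```python
def blob_name(path: str) -> str:
    """The blob's name is the absolute path without its leading '/'."""
    return path.lstrip("/")


def gcs_url(bucket: str, path: str) -> str:
    """Compose a 'gs://' style URL from a bucket name and a path under its root."""
    return f"gs://{bucket}/{blob_name(path)}"


name = blob_name("/tmp/abc/x.txt")
url = gcs_url("my-bucket", "/tmp/abc")
print(name, url)
```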

is_dir() → bool[source]#

In a typical blob store, there is no such concept as a “directory”. Here we emulate the concept in a local file system. If there is a blob named like

/ab/cd/ef/g.txt

we say there exists directory “/ab/cd/ef”. We should never have a trailing / in a blob’s name, like

/ab/cd/ef/

(I don’t know whether the blob stores allow such blob names.)

Consequently, is_dir is equivalent to “having stuff in the dir”. There is no such thing as an “empty directory” in blob stores.

iterdir() → Iterator[Self][source]#

Yield immediate children under the current dir.

This is a naive, inefficient implementation. Expected to be refined by subclasses.

download_dir(target: str | Path | LocalUpath, **kwargs) → int[source]#

download_file(target: str | Path | LocalUpath, **kwargs) → None[source]#

upload_dir(source: str | Path | LocalUpath, **kwargs) → int[source]#

upload_file(source: str | Path | LocalUpath, **kwargs) → None[source]#

Two particular applications of upathlib are multiplexer and VersionedUploadable.