Serializer#

class cloudly.util.serializer.Serializer[source]#

Bases: Protocol

classmethod serialize(x: T, **kwargs) → bytes[source]#

classmethod deserialize(y: bytes, **kwargs) → T[source]#

__init__(*args, **kwargs)#

class cloudly.util.serializer.JsonSerializer[source]#

Bases: Serializer

classmethod serialize(x, *, encoding=None, errors=None, **kwargs) → bytes[source]#

classmethod deserialize(y, *, encoding=None, errors=None, **kwargs)[source]#
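
As a rough illustration of the protocol shape, the following sketch defines a stand-in protocol and a JSON implementation using only the standard library. The names SerializerSketch and JsonSketch are hypothetical, and the handling of encoding and errors (falling back to 'utf-8' and 'strict') is an assumption, not necessarily what JsonSerializer does.

```python
import json
from typing import Protocol


class SerializerSketch(Protocol):
    # Hypothetical stand-in for the Serializer protocol: two classmethods,
    # one producing bytes and one consuming them.
    @classmethod
    def serialize(cls, x, **kwargs) -> bytes: ...

    @classmethod
    def deserialize(cls, y: bytes, **kwargs): ...


class JsonSketch(SerializerSketch):
    @classmethod
    def serialize(cls, x, *, encoding=None, errors=None, **kwargs) -> bytes:
        # Assumed defaults: UTF-8 with strict error handling.
        text = json.dumps(x, **kwargs)
        return text.encode(encoding=encoding or "utf-8", errors=errors or "strict")

    @classmethod
    def deserialize(cls, y, *, encoding=None, errors=None, **kwargs):
        text = y.decode(encoding=encoding or "utf-8", errors=errors or "strict")
        return json.loads(text, **kwargs)


payload = JsonSketch.serialize({"a": [1, 2]})
restored = JsonSketch.deserialize(payload)
```

Because both methods are classmethods, no instance is needed; callers can pass the class itself wherever a serializer is expected.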

class cloudly.util.serializer.PickleSerializer[source]#

Bases: Serializer

classmethod serialize(x, *, protocol=None, **kwargs) → bytes[source]#

classmethod deserialize(y, **kwargs)[source]#

class cloudly.util.serializer.ZPickleSerializer[source]#

Bases: PickleSerializer

classmethod serialize(x, *, level=3, **kwargs) → bytes[source]#

classmethod deserialize(y, **kwargs)[source]#
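
The level parameter suggests that the pickled bytes are compressed before being returned. A minimal sketch of that idea follows, assuming the "Z" refers to zlib (this page does not say so); z_serialize and z_deserialize are hypothetical helper names, not part of the library.

```python
import pickle
import zlib


def z_serialize(x, *, level=3, protocol=None) -> bytes:
    # Pickle first, then compress the pickled bytes (zlib is an assumption).
    return zlib.compress(pickle.dumps(x, protocol=protocol), level)


def z_deserialize(y: bytes):
    # Reverse the two steps: decompress, then unpickle.
    return pickle.loads(zlib.decompress(y))


data = {"k": list(range(1000))}
roundtrip = z_deserialize(z_serialize(data))
```

Higher level values trade CPU time for smaller output; highly repetitive data benefits the most.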

class cloudly.util.serializer.ZstdCompressor[source]#

Bases: _local

__init__()[source]#

compress(x, *, level=3, threads=0)[source]#

Parameters:

threads: Number of threads to use to compress data concurrently. When set, compression operations are performed on multiple threads. The default value (0) disables multi-threaded compression. A value of -1 means to set the number of threads to the number of detected logical CPUs.

decompress(y)[source]#

class cloudly.util.serializer.ZstdPickleSerializer[source]#

Bases: PickleSerializer

classmethod serialize(x, *, level=3, threads=0, **kwargs) → bytes[source]#

classmethod deserialize(y, **kwargs)[source]#

class OrjsonSerializer[source]#

Bases: Serializer

classmethod serialize(x, **kwargs) → bytes[source]#

classmethod deserialize(y: bytes, **kwargs)[source]#

class ZOrjsonSerializer[source]#

Bases: OrjsonSerializer

classmethod serialize(x, *, level=3, **kwargs) → bytes[source]#

classmethod deserialize(y, **kwargs)[source]#

class ZstdOrjsonSerializer[source]#

Bases: OrjsonSerializer

classmethod serialize(x, *, level=3, threads=0, **kwargs) → bytes[source]#

classmethod deserialize(y, **kwargs)[source]#

class cloudly.util.serializer.NewlineDelimitedOrjsonSeriealizer[source]#

Bases: Serializer

classmethod serialize(x: Iterable[T], **kwargs)[source]#

classmethod deserialize(y, **kwargs) → list[T][source]#
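
The newline-delimited layout can be sketched with the standard json module in place of orjson (a substitution made only to keep the example dependency-free). The helper names ndjson_dumps and ndjson_loads are hypothetical.

```python
import json


def ndjson_dumps(items) -> bytes:
    # One JSON document per line. json.dumps escapes newlines that occur
    # inside strings, so each record stays on a single line.
    return "\n".join(json.dumps(item) for item in items).encode("utf-8")


def ndjson_loads(data: bytes) -> list:
    # Parse each non-empty line independently.
    lines = data.decode("utf-8").split("\n")
    return [json.loads(line) for line in lines if line.strip()]


records = [{"id": 1, "text": "line\nbreak"}, {"id": 2}]
payload = ndjson_dumps(records)
restored = ndjson_loads(payload)
```

Since every line is a complete JSON document, a consumer can process such a payload one record at a time.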

class cloudly.util.serializer.CsvSerializer[source]#

Bases: Serializer

classmethod serialize(x: Iterable[Sequence] | Iterable[dict[str, Any]], **kwargs) → bytes[source]#

classmethod deserialize(y: bytes, *, as_dict: bool = False, **kwargs) → list[tuple] | list[dict[str, Any]][source]#
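
The two accepted input shapes (rows as sequences or rows as dicts) and the as_dict switch can be sketched with the standard csv module. This is an illustration under assumptions: writing a header row for dict input and using UTF-8 are guesses, and csv_dumps and csv_loads are hypothetical names.

```python
import csv
import io


def csv_dumps(rows) -> bytes:
    rows = list(rows)
    buf = io.StringIO(newline="")
    if rows and isinstance(rows[0], dict):
        # Dict rows: take the column names from the first row (assumption).
        writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    else:
        csv.writer(buf).writerows(rows)
    return buf.getvalue().encode("utf-8")


def csv_loads(data: bytes, *, as_dict: bool = False):
    buf = io.StringIO(data.decode("utf-8"), newline="")
    if as_dict:
        # The first line is treated as the header.
        return [dict(row) for row in csv.DictReader(buf)]
    return [tuple(row) for row in csv.reader(buf)]


dict_rows = csv_loads(csv_dumps([{"name": "a", "n": 1}, {"name": "b", "n": 2}]), as_dict=True)
tuple_rows = csv_loads(csv_dumps([("x", 1), ("y", 2)]))
```

Note that in this sketch every value comes back as a string, because CSV carries no type information.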

cloudly.util.serializer.make_avro_schema(value: dict, name: str, namespace: str) → dict[source]#

value is a dict whose members are either ‘simple types’ or ‘compound types’.

‘simple types’ include:

int, float, str

‘compound types’ include:

dict: whose elements are simple or compound types
list: whose elements are all the same simple or compound type
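
To show how such a value could map onto an Avro schema, here is a hedged sketch of a recursive inference. It is not the library's implementation: the choice of "long" and "double" for int and float, the naming of nested records, and the helper names infer_avro_type and sketch_avro_schema are all assumptions.

```python
def infer_avro_type(value, name: str):
    # Simple types map to Avro primitives (the specific choices are assumptions).
    if type(value) is int:
        return "long"
    if type(value) is float:
        return "double"
    if type(value) is str:
        return "string"
    # A dict becomes a record with one field per key.
    if isinstance(value, dict):
        return {
            "type": "record",
            "name": name,
            "fields": [
                {"name": key, "type": infer_avro_type(val, f"{name}_{key}")}
                for key, val in value.items()
            ],
        }
    # A list becomes an array; the item type is taken from the first element.
    if isinstance(value, list):
        if not value:
            raise ValueError("cannot infer the item type of an empty list")
        return {"type": "array", "items": infer_avro_type(value[0], f"{name}_item")}
    raise TypeError(f"unsupported type: {type(value).__name__}")


def sketch_avro_schema(value: dict, name: str, namespace: str) -> dict:
    schema = infer_avro_type(value, name)
    schema["namespace"] = namespace
    return schema


schema = sketch_avro_schema({"id": 1, "tags": ["a"], "pos": {"x": 1.5}}, "Rec", "demo")
```

The requirement that list elements share one type is what allows the first element to stand in for the whole list.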

class cloudly.util.serializer.AvroSerializer[source]#

Bases: Serializer

classmethod serialize(x: Iterable[dict], *, schema: dict) → BytesIO[source]#

classmethod deserialize(y) → list[dict][source]#

cloudly.util.serializer.make_parquet_type(type_spec: str | Sequence)[source]#

type_spec is a spec of arguments to one of pyarrow’s data type factory functions.

For simple types, this may be just the type name (or function name), e.g. 'bool_', 'string', 'float64'.

For type functions expecting arguments, this is a list or tuple with the type name followed by other arguments, for example,

('time32', 's')
('decimal128', 5, -3)

For compound types (types constructed by other types), this is a “recursive” structure, such as

('list_', 'int64')
('list_', ('time32', 's'), 5)

where the second element is the spec for the member type, or

('map_', 'string', ('list_', 'int64'), True)

where the second and third elements are specs for the key type and value type, respectively, and the fourth element is the optional argument keys_sorted to pyarrow.map_(). Below is an example of a struct type:

('struct', [('name', 'string', False), ('age', 'uint8', True), ('income', ('struct', (('currency', 'string'), ('amount', 'uint64'))), False)])

Here, the second element is the list of fields in the struct. Each field is expressed by a spec that is taken by make_parquet_field().
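
The recursive shape of a type spec can be made concrete without importing pyarrow. The sketch below walks a spec and renders the factory call it appears to describe, as a string. The dispatch on 'list_', 'map_', and 'struct' is inferred from the examples above, and render_type and render_field are hypothetical helpers, not library functions.

```python
def render_field(field_spec) -> str:
    # A field spec is (name, type_spec, *optional_args).
    name, type_spec, *rest = field_spec
    parts = [repr(name), render_type(type_spec), *map(repr, rest)]
    return f"field({', '.join(parts)})"


def render_type(type_spec) -> str:
    # A bare string names a factory function that takes no arguments.
    if isinstance(type_spec, str):
        return f"{type_spec}()"
    name, *args = type_spec
    if name == "list_":
        # First argument is the member type spec; the rest are plain arguments.
        member, *rest = args
        parts = [render_type(member), *map(repr, rest)]
    elif name == "map_":
        # First two arguments are the key and value type specs.
        key, value, *rest = args
        parts = [render_type(key), render_type(value), *map(repr, rest)]
    elif name == "struct":
        # The single argument is a sequence of field specs.
        (fields,) = args
        parts = ["[" + ", ".join(render_field(f) for f in fields) + "]"]
    else:
        # Any other type takes plain (non-spec) arguments.
        parts = [repr(a) for a in args]
    return f"{name}({', '.join(parts)})"


rendered = render_type(("map_", "string", ("list_", "int64"), True))
```

Here rendered is the string "map_(string(), list_(int64()), True)", which mirrors the nesting of the spec.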

cloudly.util.serializer.make_parquet_field(field_spec: Sequence)[source]#

field_spec is a list or tuple with 2, 3, or 4 elements. The first element is the name of the field. The second element is the spec of the type, to be passed to make_parquet_type(). Any additional elements are the optional nullable and metadata arguments to pyarrow.field().

cloudly.util.serializer.make_parquet_schema(fields_spec: Iterable[Sequence])[source]#

This function constructs a pyarrow schema from a spec that is expressed in simple Python types and can therefore be JSON-serialized.

fields_spec is a list or tuple, each of its elements accepted by make_parquet_field().

This function is motivated by the needs of ParquetSerializer. When biglist.Biglist uses a “storage-format” that takes options (such as ‘parquet’), these options can be passed into biglist.Biglist.new() (via serialize_kwargs and deserialize_kwargs) and saved in “info.json”. However, this requires the options to be JSON-serializable, so the argument schema to ParquetSerializer.serialize() cannot be used with this mechanism. As an alternative, users can pass the argument schema_spec; this argument can be saved in “info.json”, and it is handled by this function.
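
The point about JSON-serializability can be checked with a quick round trip. In this sketch the field names and the layout of the saved dict are made up for illustration; the actual structure of “info.json” is not described on this page.

```python
import json

# A hypothetical fields spec built only from str, bool, list, and tuple.
schema_spec = [
    ("name", "string", False),
    ("scores", ("list_", "float64")),
    ("created", ("time32", "s"), True),
]

# Assumed layout, for illustration only.
saved = json.dumps({"serialize_kwargs": {"schema_spec": schema_spec}})
restored = json.loads(saved)["serialize_kwargs"]["schema_spec"]
```

After the round trip every tuple has become a list, which is harmless here because the specs are documented as accepting either a list or a tuple.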

class cloudly.util.serializer.ParquetSerializer[source]#

Bases: Serializer

classmethod serialize(x: list[dict], schema: Schema | None = None, schema_spec: Sequence | None = None, metadata=None, **kwargs)[source]#

x is a list of data items. Each item is a dict. In the output Parquet file, each item is a “row”.

The content of each item dict should follow a regular pattern; not every structure is supported. The data x must be acceptable to pyarrow.Table.from_pylist. If unsure, build a list with a couple of data elements and experiment with pyarrow.Table.from_pylist directly.

When using storage_format='parquet' for Biglist, each data element is a dict with a consistent structure that is acceptable to pyarrow.Table.from_pylist. When reading the Biglist, the original Python data elements are returned. (A record read out may not be exactly equal to the original that was written, in that elements that were missing in a record when written may have been filled in with None when read back out.) In other words, the reading is not like that of ExternalBiglist. You can always create a separate ExternalBiglist for the data files of the Biglist in order to use Parquet-style data reading. The data files are valid Parquet files.

If neither schema nor schema_spec is specified, the data schema is inferred automatically from the first element of x. If this does not work, you can specify either schema or schema_spec. The advantage of schema_spec is that it consists of JSON-serializable Python types, and hence can be passed into Biglist.new() via serialize_kwargs and saved in “info.json” of the biglist.

If schema_spec is not flexible or powerful enough for your use case, then you may have to use schema.

classmethod deserialize(y: bytes, **kwargs)[source]#