pyarrow

PyArrow contrib functionality.

Class ParquetFormatter

class ParquetFormatter

Bases: Formatter

Class ParquetTableFormatter

class ParquetTableFormatter

Bases: Formatter

Functions

merge_parquet_files(src_paths: Sequence[str | Path | FileSystemFileTarget], dst_path: str | Path | FileSystemFileTarget, force: bool = True, callback: Callable[[int], Any] | None = None, writer_opts: dict[str, Any] | None = None, copy_single: bool = False, skip_empty: bool = True, target_row_group_size: int = 0) -> str

Merges the parquet files in src_paths into a new file at dst_path. Intermediate directories are created automatically. When dst_path exists and force is True, the existing file is removed first. When it exists and force is False, an exception is raised.

callback can be a callable that accepts a single integer argument, the index of the file that was just merged. writer_opts can be a dictionary of keyword arguments that are passed to the ParquetWriter instance. When src_paths contains only a single file and copy_single is True, the file is copied to dst_path and no merging takes place. Files containing empty tables are skipped unless skip_empty is False.
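As a minimal sketch of a suitable callback, the helper below builds a callable that records a progress message per merged file. The name make_progress_callback is hypothetical, and the assumption that the index is zero-based is not confirmed by the description above. The loop only stands in for the merge, which would invoke the callback once per file.

```python
from typing import Callable, List


def make_progress_callback(n_files: int, log: List[str]) -> Callable[[int], None]:
    """Return a callable that accepts the index of the file that was just merged."""

    def callback(index: int) -> None:
        # assumption: the index passed by the merge routine is zero-based
        log.append(f"merged file {index + 1}/{n_files}")

    return callback


messages: List[str] = []
callback = make_progress_callback(3, messages)

# stand-in for the merge loop, which would call the callback after each file
for i in range(3):
    callback(i)
```

Such a callable would then be passed as merge_parquet_files(src_paths, dst_path, callback=callback).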

When target_row_group_size is a positive number, the merging is done at the level of individual row groups. These groups are merged in memory such that each resulting group stored on disk, except possibly the last one, will have target_row_group_size rows.
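The regrouping behaviour can be illustrated with a pure-Python sketch that works on lists of row indices instead of real parquet row groups. This is not the implementation used by merge_parquet_files; the function regroup_rows is hypothetical and only shows how incoming groups of arbitrary size could be buffered and emitted in fixed-size groups.

```python
from typing import Iterable, Iterator, List


def regroup_rows(row_groups: Iterable[List[int]], target_size: int) -> Iterator[List[int]]:
    """Yield groups of exactly target_size rows, except possibly the last one."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")

    buffer: List[int] = []
    for group in row_groups:
        buffer.extend(group)
        # flush as many full groups as the buffer currently holds
        while len(buffer) >= target_size:
            yield buffer[:target_size]
            buffer = buffer[target_size:]

    # the remainder forms the last, possibly smaller group
    if buffer:
        yield buffer


# three input groups with 3, 5 and 2 rows, regrouped to a target of 4 rows
groups = [[0, 1, 2], [3, 4, 5, 6, 7], [8, 9]]
sizes = [len(g) for g in regroup_rows(groups, 4)]
```

Here the ten input rows end up in groups of 4, 4 and 2 rows, with row order preserved across group boundaries.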

The absolute, expanded dst_path is returned.
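The destination handling described above (intermediate directories, the force flag, and the returned absolute path) can be sketched with the standard library alone. The helper prepare_destination is hypothetical, it handles plain string paths only, and the use of FileExistsError is an assumption, since the exception type is not stated here.

```python
import os
import tempfile


def prepare_destination(dst_path: str, force: bool = True) -> str:
    """Resolve dst_path, clear an existing file if allowed, create parent directories."""
    # expand "~" and environment variables, then make the path absolute
    dst = os.path.abspath(os.path.expandvars(os.path.expanduser(dst_path)))

    if os.path.exists(dst):
        if not force:
            # assumption: the actual exception type is not documented here
            raise FileExistsError(f"destination already exists: {dst}")
        os.remove(dst)

    # create intermediate directories if they are missing
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    return dst


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "a", "b", "merged.parquet")
    resolved = prepare_destination(target)
    created_dir = os.path.isdir(os.path.dirname(resolved))
```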

merge_parquet_task(task: Task, inputs: Sequence[str | Path | FileSystemFileTarget], output: str | Path | FileSystemFileTarget, local: bool = False, cwd: str | Path | LocalDirectoryTarget | None = None, force: bool = True, **kwargs: Any) -> None

This function is intended to be used by tasks that merge parquet files, e.g. when inheriting from law.contrib.tasks.MergeCascade. inputs should be a sequence of targets that represent the files to merge into output.

When local is False and files need to be copied from remote first, cwd can be set as the download directory. When empty, a temporary directory is used. The task itself is used to print and publish messages via its law.Task.publish_message() and law.Task.publish_step() methods. When force is True, any existing output file is overwritten.

All additional kwargs are forwarded to merge_parquet_files() which is used internally for the actual merging.