pyarrow
PyArrow contrib functionality.
Class ParquetFormatter
Class ParquetTableFormatter
Functions
- merge_parquet_files(src_paths, dst_path, force=True, callback=None, writer_opts=None, copy_single=False)
Merges the parquet files in src_paths into a new file at dst_path. Intermediate directories are created automatically. If dst_path already exists, it is removed first when force is True; otherwise, an exception is raised.
callback can refer to a callable accepting a single integer argument representing the index of the file after it was merged. writer_opts can be a dictionary of keyword arguments that are passed to the ParquetWriter instance. When src_paths contains only a single file and copy_single is True, the file is copied to dst_path and no merging takes place.
The absolute, expanded dst_path is returned.
- merge_parquet_task(task, inputs, output, local=False, cwd=None, force=True, writer_opts=None, copy_single=False)
This function is intended for tasks that merge parquet files, e.g. when inheriting from law.contrib.tasks.MergeCascade. inputs should be a sequence of targets that represent the files to merge into output. When local is False and files need to be copied from remote first, cwd can be set as the download directory. When empty, a temporary directory is used. The task itself is used to print and publish messages via its law.Task.publish_message() and law.Task.publish_step() methods. When force is True, any existing output file is overwritten. writer_opts and copy_single are forwarded to merge_parquet_files(), which is used internally for the actual merging.