Thursday, April 15, 2021

Avoid small file issue in Hive

One way to control the size of the files written when inserting into a table with Hive is to set the parameters below:

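-- Enable the merge step for Tez, map-only, and map-reduce jobs.
set hive.merge.tezfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
-- Target size, in bytes, of the merged files.
set hive.merge.size.per.task=128000000;
-- Run the merge step when the average output file size is below this many bytes.
set hive.merge.smallfiles.avgsize=128000000;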

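These settings work for both the M/R and Tez engines. When the average size of a job's output files falls below hive.merge.smallfiles.avgsize, Hive runs an extra merge step that combines them into files of roughly hive.merge.size.per.task bytes. The result is files of around 128 MB instead of many tiny ones. Note that this is a target for the merge, not a hard upper limit on file size. You can alter that size according to your use case. Additional reading here: https://community.cloudera.com/t5/Community-Articles/ORC-Creation-Best-Practices/ta-p/248963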
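As a rough sketch of adjusting the sizes, the two byte values do not have to be equal. The numbers below are illustrative assumptions, not recommendations: they trigger the merge step only when the average output file is under about 64 MB, and aim for merged files of about 256 MB.

```sql
-- Illustrative values only; tune them to your block size and workload.
-- Merge when the average output file is smaller than ~64 MB.
set hive.merge.smallfiles.avgsize=64000000;
-- Aim for merged files of ~256 MB.
set hive.merge.size.per.task=256000000;
```

The hive.merge.*files switches from the first snippet still need to be enabled for the engine you are using.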

The easiest way to merge the files of an existing table is to rebuild it, with the above settings applied in the same Hive session:

CREATE TABLE new_table LIKE old_table;
INSERT INTO new_table SELECT * FROM old_table;
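To check that the rebuild reduced the file count, and to put the new table in place of the old one, something like the following could work. This is a sketch that assumes an unpartitioned table; old_table_backup is a hypothetical name chosen for illustration.

```sql
-- Refresh the table statistics, then inspect them. numFiles and totalSize
-- are listed under "Table Parameters" in the DESCRIBE FORMATTED output.
ANALYZE TABLE new_table COMPUTE STATISTICS;
DESCRIBE FORMATTED new_table;

-- Optional swap once the copy looks right.
-- old_table_backup is a hypothetical name, not part of the original post.
ALTER TABLE old_table RENAME TO old_table_backup;
ALTER TABLE new_table RENAME TO old_table;
```

Keeping the old table under a backup name until the new one has been verified makes it easy to roll back.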
