How to Improve HiveQL Efficiency

The efficiency of data queries influences the user experience extremely. When you’re creating a report by loading the data from the database, you must feel frustrated if it is very slow. I know that feeling and it is the daily life of analytics.

In this post, I’ll summarize four simple methods to improve HiveQL efficiency.

  • Change execution engine
    • set hive.execution.engine=tez;
  • Set right storage formats
    • Create TABLE A STORED AS ORC tblproperties("orc.compress"="SNAPPY") AS SELET * FROM A_Table
    • Create TABLE A STORED AS Parquet AS SELET * FROM A_Table
  • Use vectorized query exectuion
    • set hive.vectorized.execution.enabled = true;
  • Have a query execution plan
    • set hive.cbo.enable=true;

Reference:

Shaw, Scott et al. Practical Hive : a Guide to Hadoop’s Data Warehouse System / by Scott Shaw, Andreas Francois Vermeulen, Ankur Gupta, David Kjerrumgaard. Berkeley, CA: Apress, 2016. Web.