Wednesday, March 17, 2021

Sqoop performance improvement

In General, performance tuning in Sqoop can be achieved by:

  • Controlling Parallelism
  • Controlling Data Transfer Process

Controlling Parallelism:

Sqoop works on the MapReduce programming model implemented in Hadoop. Sqoop imports exports data from most relational databases in parallel. The number of map tasks per job determines it’s parallelism. By controlling the parallelism, we can handle the load on our databases and hence its performance. Here are a couple of ways in Sqoop jobs to exploit parallelism:

Changing the number of mappers

Typical Sqoop jobs launch four mappers by default. To optimise performance, increasing the map tasks (Parallel processes) to an integer value of 8 or 16 can show an increase in performance in some databases.

By using the -m or --num-mappers parameter we can set the degree of parallelism in Sqoop. Changing the number of mappers to 10 for example:

sqoop import  
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--num-mappers 10

A few things to keep in mind is that the number of map tasks should be less than the maximum number of parallel database connections possible. The increase in the degree of parallelism should be lesser than that which is available within your MapReduce cluster.

Split By Query

When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range.

The --split-by parameter splits the column data uniformly on the basis of the number of mappers specified. The syntax for --split-by is given by:

sqoop import  
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \

--split-by city_id 

Custom Boundary Queries

As seen before split-by uniformly distributes the data for import. If the column has non-uniform values, boundary-query can be used if we do not get the desired results while using the split-by argument alone.

Ideally, we configure the boundary-query parameter with the min(id) and max(id) along with the table name.

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--query 'SELECT normcities.id, \
countries.country, \
normcities.city \
FROM normcities \
JOIN countries USING(country_id) \
WHERE $CONDITIONS' \
--split-by id \
--target-dir cities \
--boundary-query "select min(id), max(id) from normcities"

Here $CONDITIONS is used internally by Sqoop and will be expanded to pick the min and max id of the cities table to split the data. By using custom values, efficient partitions are derived splitting the data resulting in performance improvement in general.