Friday, December 2, 2022

Spark cluster mode parameters

When submitting a Spark application in cluster mode, spark-submit provides several options for shipping files to the containers. We can divide these options into two categories.

The first category is data files. For data files, Spark only copies the specified files into the containers; no further commands are executed. There are two options in this category (a usage sketch follows the list):

  • --archives: with this option, you can submit archives, and Spark will extract the files in them for you; Spark supports zip, tar.gz, and similar archive formats.
  • --files: with this option, you can submit plain files; Spark will put them in the container and do nothing else. sc.addFile is the programmatic API for this one.
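
For example, here is a minimal sketch of shipping a config file with --files and reading it back on an executor. The file name app.conf and the script name main.py are hypothetical:

    spark-submit --files /local/path/app.conf main.py

    # inside main.py
    from pyspark import SparkFiles
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # SparkFiles.get resolves the local path of a file shipped with --files
    def read_conf(_):
        with open(SparkFiles.get("app.conf")) as f:
            return f.read()

    # run the read on one executor to show the file was shipped there too
    print(sc.parallelize([0], 1).map(read_conf).collect())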

The second category is code dependencies. In a Spark application, a code dependency can be a JVM dependency, or a Python dependency in the case of a PySpark application.

  • --jars: this option is used to submit JVM dependencies as Jar files. Spark adds these Jars to the CLASSPATH automatically, so your JVM can load them.

  • --py-files: this option is used to submit Python dependencies; they can be .py, .egg, or .zip files. Spark adds these files to the PYTHONPATH, so your Python interpreter can find them.

    sc.addPyFile is the programmatic API for this one (see the sketch after this list).

    PS: a single .py file is placed in a __pyfiles__ folder; other file types are placed in the current working directory.
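
As a minimal sketch, assuming a dependency archive deps.zip containing a module mylib.py (both names are hypothetical), the two ways of shipping a Python dependency look like this:

    spark-submit --py-files deps.zip main.py

    # or programmatically, inside main.py
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # ships deps.zip to every executor and adds it to the PYTHONPATH
    sc.addPyFile("deps.zip")

    import mylib  # importable now that the archive is on the PYTHONPATH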


All four of these options accept multiple files, separated by ",", and for each file you can specify an alias using the {URL}#{ALIAS} format. Don't specify an alias with the --py-files option, because Spark won't add the alias to the PYTHONPATH.


Example:

--archives abc.zip#new_abc,cde.zip#new_cde

Spark will extract abc.zip and cde.zip, and create new_abc and new_cde folders in the container.
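
Since archives are extracted into the executor's working directory, code running on an executor can reference an alias by relative path. A minimal sketch, assuming the job was submitted with the --archives option shown above:

    import os
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # list the contents of the extracted archive from an executor;
    # the alias folder sits in the executor's working directory
    def list_alias(_):
        return os.listdir("new_abc")

    print(sc.parallelize([0], 1).map(list_alias).collect())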

