Problem: Your Spark application needs packages that the cluster does not provide by default. You could ask the sysadmin team to install them, but first consider whether those packages are commonly used by other cluster users. If not, a cluster-wide installation becomes a burden, and cumulatively such requests can amount to a significant resource hog. We can avoid that by shipping our environment on demand with the application: it is kept only for that single application and discarded once the Spark application terminates, whether successfully or not. Jar-based applications never hit this problem, since all their dependencies are already packaged; it only arises in PySpark.
- Create a venv that contains the packages that both the driver and the workers need.
- Activate the environment and run your task locally. If it succeeds:
- `zip -r /path/to/your_venv.zip /path/to/your_venv`
- `spark-submit --master yarn --archives /path/to/your_venv.zip your_script.py` (options must come before the script; anything after it is treated as an application argument)
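The steps above can be sketched end to end as below. This is a minimal sketch, not a definitive recipe: the package names, paths, and `your_script.py` are placeholders, and it assumes a YARN cluster in client deploy mode. One caveat worth hedging: a plain `zip` of a venv embeds absolute paths, so the relocatable archive produced by the `venv-pack` tool is often more reliable; either way, you must also point `PYSPARK_PYTHON` at the unpacked archive so the workers actually use it.

```shell
# Create the environment and confirm the job runs locally first.
python -m venv /path/to/your_venv
source /path/to/your_venv/bin/activate
pip install numpy pandas        # placeholder: whatever your job needs
python your_script.py           # local sanity check

# Build a relocatable archive of the venv (assumption: using venv-pack
# instead of plain zip, since zipped venvs contain absolute paths).
pip install venv-pack
venv-pack -o your_venv.tar.gz

# Ship the archive with the job. The `#environment` suffix is the
# directory name the archive is unpacked under on each node, and
# PYSPARK_PYTHON tells the workers to use the shipped interpreter.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit \
  --master yarn \
  --archives your_venv.tar.gz#environment \
  your_script.py
```

In cluster deploy mode the exported variable does not reach the driver, so you would instead pass it as `--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python`.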