See the following guide to set up Dataproc on GCP and to create a Dataproc cluster. We will be using the Google Cloud SDK with gcloud CLI commands.
For this example, use the preview image version so that the cluster runs Spark 3. For instance, the following gcloud command creates a small Dataproc cluster called geni-cluster:
```
gcloud dataproc clusters create geni-cluster \
    --region=asia-southeast1 \
    --master-machine-type n1-standard-1 \
    --master-boot-disk-size 30 \
    --num-workers 2 \
    --worker-machine-type n1-standard-1 \
    --worker-boot-disk-size 30 \
    --image-version=preview
```

This could take a few minutes to run. Then access the primary node using:
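Before connecting, you can optionally check that the cluster has finished provisioning by listing the clusters in the region (the `--region` flag mirrors the one used at creation):

```shell
# List Dataproc clusters in the region; geni-cluster should appear once it is running
gcloud dataproc clusters list --region=asia-southeast1
```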
```
gcloud compute ssh ubuntu@geni-cluster-m
```

Java should already be installed on the primary node. Install Leiningen using:
wget https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein && \
sudo mv lein /usr/bin/ && \
chmod a+x /usr/bin/lein && \
leinThen, create a templated Geni app and step into the app directory::
```
lein new geni app +dataproc && cd app
```

To spawn the Spark REPL, run:
```
lein spark-submit
```

This is a shortcut to creating an uberjar and running it using spark-submit. By default, the templated main function:
- prints the Spark configuration;
- runs a Spark ML example;
- starts an nREPL server on port 65204; and
- steps into a REPL(-y).
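Since the nREPL server runs on the primary node, connecting to it from a local editor requires an SSH tunnel. As a sketch (assuming port 65204 is free on the local machine), gcloud can forward the port by passing flags through to ssh after `--`:

```shell
# Forward local port 65204 to the nREPL server on the primary node;
# an nREPL client can then connect to localhost:65204
gcloud compute ssh ubuntu@geni-cluster-m -- -L 65204:localhost:65204
```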
Verify that spark.master is set to "yarn". To submit a standalone application, simply edit the -main function in core.clj. Remove the launch-repl function to prevent stepping into the REPL.
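For reference, the `lein spark-submit` shortcut is roughly equivalent to building the uberjar and submitting it by hand. The jar path below is an assumption based on Leiningen's default uberjar naming for an app named `app`; check the actual filename under `target/uberjar/`:

```shell
# Build the standalone jar, then submit it to YARN explicitly
lein uberjar
spark-submit --master yarn target/uberjar/app-0.1.0-SNAPSHOT-standalone.jar
```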
Once finished with the exercise, the easiest way to clean up is to simply delete the GCP project.
Alternatively, delete the cluster using:
```
gcloud dataproc clusters delete geni-cluster --region=asia-southeast1
```

There may be dangling storage buckets that have to be deleted separately.
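To find and remove those buckets, something like the following works; the bucket name here is a placeholder (Dataproc's auto-created staging and temp buckets typically have `dataproc`-prefixed names):

```shell
# List all buckets in the current project, then delete a dangling one
# (rm -r removes the bucket together with its contents)
gsutil ls
gsutil -m rm -r gs://dataproc-staging-bucket-name
```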