Amazon EMR and YARN deployment mode

2020-01-27T03:12:59

I am learning Spark fundamentals and, in order to test my PySpark application, created an EMR cluster on AWS with Spark, YARN, Hadoop, and Oozie. I can successfully execute a simple PySpark application from the driver node using spark-submit. I have the default /etc/spark/conf/spark-defaults.conf file created by AWS, which uses the YARN resource manager. Everything runs fine and I can monitor the Tracking URL as well. But I cannot tell whether the Spark job is running in 'client' mode or 'cluster' mode. How do I determine that?
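My current understanding (an assumption from reading the Spark docs, not verified on EMR) is that the mode is exposed through the `spark.submit.deployMode` property, and that Spark falls back to 'client' when neither the property nor `--deploy-mode` is set. A minimal, cluster-free sketch of that defaulting rule:

```python
def effective_deploy_mode(conf):
    """Resolve the effective deploy mode from a dict of Spark properties.

    Spark treats an unset spark.submit.deployMode as 'client' when no
    --deploy-mode flag is passed to spark-submit.
    """
    return conf.get("spark.submit.deployMode", "client")


# On a live session the same property could be read directly
# (hypothetical `spark` session variable):
#   spark.sparkContext.getConf().get("spark.submit.deployMode")

print(effective_deploy_mode({}))                                      # client
print(effective_deploy_mode({"spark.submit.deployMode": "cluster"}))  # cluster
```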

Excerpts from /etc/spark/conf/spark-defaults.conf:

spark.master                     yarn
spark.driver.extraLibraryPath    /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.executor.extraClassPath    :/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar
spark.executor.extraLibraryPath  /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///var/log/spark/apps
spark.history.fs.logDirectory    hdfs:///var/log/spark/apps
spark.sql.warehouse.dir          hdfs:///user/spark/warehouse
spark.sql.hive.metastore.sharedPrefixes com.amazonaws.services.dynamodbv2
spark.yarn.historyServer.address ip-xx-xx-xx-xx.ec2.internal:18080
spark.history.ui.port            18080
spark.shuffle.service.enabled    true
spark.driver.extraJavaOptions    -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.sql.parquet.fs.optimized.committer.optimization-enabled true
spark.sql.emr.internal.extensions com.amazonaws.emr.spark.EmrSparkSessionExtensions
spark.executor.memory            4743M
spark.executor.cores             2
spark.yarn.executor.memoryOverheadFactor 0.1875
spark.driver.memory              2048M

Excerpts from my PySpark job:

import os.path
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from boto3.session import Session

conf = SparkConf().setAppName('MyFirstPySparkApp')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext
# ACCESS_KEY / SECRET_KEY are obtained elsewhere (e.g. via boto3)
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", ACCESS_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", SECRET_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
# ... access S3 bucket
# ...

Is there a deployment mode called 'yarn-client', or is it just 'client' and 'cluster'? Also, why is "num-executors" not specified in the config file by AWS? Is that something I need to add?
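For context, my understanding (an assumption, not something I have confirmed against the EMR defaults) is that both settings can be passed explicitly at submit time, and that 'yarn-client'/'yarn-cluster' were pre-Spark-2.0 master URLs. The app name below is a placeholder:

```shell
# With Spark 2.x the master is just 'yarn'; the mode goes in --deploy-mode.
# --num-executors sets a fixed executor count (placeholder script name).
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  my_first_pyspark_app.py
```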

Thanks

Copyright License:
Author: NetRocks. Reproduced under the CC 4.0 BY-SA copyright license with link to original source & disclaimer.
Link: https://stackoverflow.com/questions/59921797/amazon-emr-and-yarn-deployment-mode
