Is it possible to use a custom Hadoop version with EMR?

2022-06-29T09:28:39

As of today (2022-06-28), the latest AWS EMR release is 6.6.0, which ships with Hadoop 3.2.1.

I need to use a different Hadoop version (3.2.2). I tried the following approach, but it doesn't work: you can set either ReleaseLabel or the Hadoop version, but not both.

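import boto3

client = boto3.client("emr", region_name="us-west-1")

# Rejected: ReleaseLabel and an explicit Hadoop version cannot both be set.
# (Other required arguments, such as Name and Instances, are omitted here.)
response = client.run_job_flow(
    ReleaseLabel="emr-6.6.0",
    Applications=[{"Name": "Hadoop", "Version": "3.2.2"}]
)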

Another approach, which also does not seem to be an option, is loading a specific Hadoop JAR when building the session with SparkSession.builder.getOrCreate(), like so:

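from pyspark.sql import SparkSession

# Request hadoop-aws 3.2.2 as a package and use S3A as the S3 filesystem
spark = SparkSession \
        .builder \
        .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.2') \
        .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem') \
        .getOrCreate()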

Is it even possible to run an EMR cluster with a different Hadoop version? If so, how does one go about doing that?
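
For context, here is a minimal sketch of how one could check which Hadoop version each EMR release bundles, since the version appears to be tied to the release label. It assumes the boto3 EMR client methods list_release_labels and describe_release_label; the function names hadoop_version_for and hadoop_version_by_release and the "emr-6" prefix are hypothetical, chosen only for illustration. It only reports what the API returns and does not change the Hadoop version of a cluster.

```python
def hadoop_version_for(emr_client, release_label):
    """Return the Hadoop version string reported for one release label, or None."""
    kwargs = {"ReleaseLabel": release_label}
    while True:
        page = emr_client.describe_release_label(**kwargs)
        for app in page.get("Applications", []):
            # Compare case-insensitively in case the name casing differs.
            if app.get("Name", "").lower() == "hadoop":
                return app.get("Version")
        token = page.get("NextToken")
        if not token:
            return None
        kwargs["NextToken"] = token


def hadoop_version_by_release(emr_client, prefix="emr-6"):
    """Map every release label starting with `prefix` to its Hadoop version."""
    versions = {}
    kwargs = {"Filters": {"Prefix": prefix}}
    while True:
        page = emr_client.list_release_labels(**kwargs)
        for label in page.get("ReleaseLabels", []):
            versions[label] = hadoop_version_for(emr_client, label)
        token = page.get("NextToken")
        if not token:
            return versions
        kwargs["NextToken"] = token


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   client = boto3.client("emr", region_name="us-west-1")
#   for label, version in hadoop_version_by_release(client).items():
#       print(label, version)
```

The client is passed in as an argument so the lookup can be tried against a stub before running it against a real account.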

Copyright License:
Author: Victor Valente. Reproduced under the CC BY-SA 4.0 license, with a link to the original source and disclaimer.
Link to: https://stackoverflow.com/questions/72794805/is-it-possible-to-use-a-custom-hadoop-version-with-emr

