AWS Glue RDD.saveAsTextFile() raises Class org.apache.hadoop.mapred.DirectOutputCommitter not found

2020-12-22T21:26:39

I'm creating a simple ETL job that reads about a billion files and repartitions them (in other words, compacts them into a smaller number of files for further processing).

A simple AWS Glue application:

import org.apache.spark.SparkContext

object Hello {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val input_path =  "s3a://my-bucket-name/input/*"
    val output_path = "s3a://my-bucket-name/output/*"
    val num_partitions = 5
    val ingestRDD = spark.textFile(input_path)
    ingestRDD.repartition(num_partitions).saveAsTextFile(output_path)
  }
}

raises the following stack trace:

ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Exception in User Class: java.lang.RuntimeException : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.mapred.DirectOutputCommitter not found
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2401)
org.apache.hadoop.mapred.JobConf.getOutputCommitter(JobConf.java:725)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1048)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1499)
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1478)
Hello$.main(hello_world_parallel_rdd_scala:18)
Hello.main(hello_world_parallel_rdd_scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
com.amazonaws.services.glue.SparkProcessLauncherPlugin$class.invoke(ProcessLauncher.scala:38)
com.amazonaws.services.glue.ProcessLauncher$$anon$1.invoke(ProcessLauncher.scala:67)
com.amazonaws.services.glue.ProcessLauncher.launch(ProcessLauncher.scala:108)
com.amazonaws.services.glue.ProcessLauncher$.main(ProcessLauncher.scala:21)
com.amazonaws.services.glue.ProcessLauncher.main(ProcessLauncher.scala)

At the same time, this same code works in a local environment, on a standalone Spark cluster, and on an EMR cluster.
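For reference, the workaround I'm currently experimenting with (untested as a confirmed fix — the idea that Glue pre-sets `mapred.output.committer.class` to `DirectOutputCommitter`, and that falling back to the stock `FileOutputCommitter` is safe here, are my assumptions based on the `JobConf.getOutputCommitter` frame in the trace) is to override the committer on the Hadoop configuration before writing:

```scala
import org.apache.spark.SparkContext

object Hello {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()

    // saveAsTextFile() goes through the old mapred API, which reads the
    // committer class from the Hadoop configuration. Pointing it at the
    // stock FileOutputCommitter (shipped with Hadoop) is a sketch of a
    // workaround for the ClassNotFoundException, not a confirmed fix.
    spark.hadoopConfiguration.set(
      "mapred.output.committer.class",
      "org.apache.hadoop.mapred.FileOutputCommitter")

    val input_path  = "s3a://my-bucket-name/input/*"
    val output_path = "s3a://my-bucket-name/output"
    val num_partitions = 5
    val ingestRDD = spark.textFile(input_path)
    ingestRDD.repartition(num_partitions).saveAsTextFile(output_path)
  }
}
```

This only changes job configuration, so it shouldn't affect behavior in the local/EMR environments where the code already works.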

Copyright License:
Author: 「user913624」, reproduced under the CC BY-SA 4.0 license with a link to the original source & disclaimer.
Link: https://stackoverflow.com/questions/65409534/aws-glue-rdd-saveastextfile-raises-class-org-apache-hadoop-mapred-directoutput
