Connect Redshift Spectrum / AWS EMR with Hudi directly or via AWS Glue Data Catalog

2021-09-12T21:48:23

I'm trying to understand how to properly connect Redshift Spectrum with Hudi data.

It looks like I can create a Redshift external table directly over data managed by Apache Hudi, as described in the documentation: https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html The other way is to integrate Hudi with the AWS Glue Data Catalog, as mentioned here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-how-it-works.html and then access the Hudi tables from Redshift Spectrum via the AWS Glue Data Catalog.
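To make the first option concrete, here is a minimal sketch of the DDL I understand the Redshift documentation to describe for a Hudi Copy On Write table, built as a string in Python. The schema name `spectrum_schema`, the table `orders`, its columns, and the S3 location are hypothetical placeholders, not values from my setup; the SerDe and input/output format class names are the ones I believe the documentation lists, so please check them against the linked page.

```python
def hudi_external_table_ddl(schema, table, columns, location):
    """Build a CREATE EXTERNAL TABLE statement for a Hudi Copy On Write table.

    `columns` is a list of (name, sql_type) pairs. All names passed in are
    illustrative placeholders chosen for this sketch.
    """
    column_sql = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n  {column_sql}\n)\n"
        # Parquet SerDe plus the Hudi input format, as I read the Redshift docs.
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'\n"
        "STORED AS\n"
        "INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'\n"
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'\n"
        f"LOCATION '{location}';"
    )


# Hypothetical example values (assumptions, not from a real environment).
ddl = hudi_external_table_ddl(
    schema="spectrum_schema",
    table="orders",
    columns=[("order_id", "varchar(64)"), ("amount", "double precision")],
    location="s3://my-bucket/hudi/orders/",
)
print(ddl)
```

The resulting statement would be run in Redshift against an external schema that already exists.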

I have the same need for Apache Spark on AWS EMR. It looks like I can use Hudi directly from EMR or via the AWS Glue Data Catalog.
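For the EMR side, my understanding is that the catalog route means enabling Hudi's Hive sync on write, so that the table is registered in the metastore (the Glue Data Catalog, if the cluster is configured to use it as its Hive metastore). Below is a rough sketch of the write options involved. The option keys are Hudi datasource settings as I understand them; the table, database, field names, and S3 path are hypothetical.

```python
# Hypothetical table, database, and field names used only for illustration.
TABLE_NAME = "orders"
DATABASE_NAME = "analytics"
BASE_PATH = "s3://my-bucket/hudi/orders/"

hudi_options = {
    # Core Hudi write settings.
    "hoodie.table.name": TABLE_NAME,
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.precombine.field": "updated_at",
    # Hive sync registers the table in the metastore. On an EMR cluster set up
    # to use the Glue Data Catalog as its metastore, that would be Glue.
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": DATABASE_NAME,
    "hoodie.datasource.hive_sync.table": TABLE_NAME,
    "hoodie.datasource.hive_sync.partition_fields": "order_date",
}

# On a cluster with Spark and Hudi available, the write would look roughly like
# this (left as a comment because it needs a live Spark session and DataFrame):
#
#   df.write.format("hudi").options(**hudi_options).mode("append").save(BASE_PATH)

for key, value in sorted(hudi_options.items()):
    print(f"{key}={value}")
```

Without the `hive_sync` entries, I assume the data would still be written to S3 as a Hudi table but would not appear in any catalog, which is what I mean by using Hudi "directly".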

Right now I don't understand which way to choose. Could you please advise what the benefit is of using Hudi via the AWS Glue Data Catalog, or whether I should use it directly from Redshift Spectrum and AWS EMR?

Copyright License:
Author: alexanoid. Reproduced under the CC BY-SA 4.0 copyright license with a link to the original source and disclaimer.
Link to: https://stackoverflow.com/questions/69152017/connect-redshift-spectrum-aws-emr-with-hudi-directly-or-via-aws-glue-data-catal

