Spark: how is parquet read

2018-07-10T21:24:42

A question to understand how Spark does its job. When you use:

spark.read.parquet(sourcePath).{transformation}.{action}

How are the Parquet files read? Are they read on the driver and then dispatched to each executor? Or is each file sent to an executor, which is responsible for reading it?
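For concreteness, a call of that shape might look like the minimal Scala sketch below. The path and column names are placeholders I made up for illustration, not something from the original question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("parquet-read-example")
  .getOrCreate()

// Reading is lazy: this records the source and reads schema/metadata
// from the Parquet footers, but does not materialize the data yet.
val df = spark.read.parquet("/data/events")   // hypothetical sourcePath

// {transformation}: still lazy, only extends the logical plan.
val filtered = df.filter(col("status") === "OK")

// {action}: triggers the job; this is the point where the files are
// actually scanned.
val rowCount = filtered.count()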

If the Parquet files are Snappy-compressed, where and how is the decompression done?

I wonder whether the Parquet data may stay in the driver's memory after it has been read.
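As a rough way to probe this, one can look at how the resulting DataFrame is partitioned: each partition of the scan corresponds to a task scheduled on an executor, while the driver only plans the job from the file listing and footer metadata. A hedged sketch, reusing the `spark` session and placeholder path from above:

// The DataFrame is split into partitions; each partition is read by a
// task on an executor, not by the driver.
val df = spark.read.parquet("/data/events")
println(s"partitions: ${df.rdd.getNumPartitions}")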

Copyright License:
Author: Rolintocour. Reproduced under the CC BY-SA 4.0 copyright license, with link to the original source & disclaimer.
Link: https://stackoverflow.com/questions/51266495/spark-how-is-parquet-read
