Spark: how is parquet read

2018-07-10T21:24:42

A question to understand how Spark does its job. When you use:

spark.read.parquet(sourcePath).{transformation}.{action}

How are the Parquet files read? Are they read on the driver and then dispatched to each executor? Or is each file sent to an executor, which is responsible for reading it?
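For concreteness, a call of that shape might look like the minimal Scala sketch below. The path and column names are placeholders I made up for illustration, not something from the original question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("parquet-read-example")
  .getOrCreate()

// Reading is lazy: this records the source and reads schema/metadata
// from the Parquet footers, but does not materialize the data yet.
val df = spark.read.parquet("/data/events")   // hypothetical sourcePath

// {transformation}: still lazy, only extends the logical plan.
val filtered = df.filter(col("status") === "OK")

// {action}: triggers the job; this is the point where the files are
// actually scanned.
val rowCount = filtered.count()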

If the Parquet files are Snappy-compressed, where and how is the decompression done?

I wonder whether the Parquet data may stay in the driver's memory after it has been read.
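As a rough way to probe this, one can look at how the resulting DataFrame is partitioned: each partition of the scan corresponds to a task scheduled on an executor, while the driver only plans the job from the file listing and footer metadata. A hedged sketch, reusing the `spark` session and placeholder path from above:

// The DataFrame is split into partitions; each partition is read by a
// task on an executor, not by the driver.
val df = spark.read.parquet("/data/events")
println(s"partitions: ${df.rdd.getNumPartitions}")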

Copyright License:
Author: Rolintocour. Reproduced under the CC BY-SA 4.0 copyright license, with link to the original source & disclaimer.
Link: https://stackoverflow.com/questions/51266495/spark-how-is-parquet-read
