Apache Spark Interview Q&A
Mar 9, 2023
(in-progress)
Spark Core
- What is Spark Core?
- What are the features of Spark Core?
- What are the different types of cluster managers in Spark?
- What is RDD (Resilient Distributed Datasets)?
- How does Spark handle data partitioning?
- What is the difference between map() and flatMap() in Spark?
- What is lazy evaluation in Spark?
- What is the purpose of a Spark Driver?
- What is the difference between Spark SQL and Spark Core?
- How does Spark handle failures in a cluster?
- What is a Spark executor?
- What is the difference between cache() and persist() in Spark?
- What are the different types of transformations in Spark?
- How does Spark handle memory management?
- How can you optimize the performance of a Spark application?
Spark DataFrame
- What is a DataFrame in Spark? How is it different from an RDD?
- What are the different ways of creating a DataFrame in Spark?
- What is schema in Spark DataFrame? How is it useful?
- How do you select columns from a DataFrame in Spark?
- What are the different types of joins in Spark DataFrame?
- How do you handle missing or null values in Spark DataFrame?
- What are the different aggregate functions available in Spark DataFrame?
- Explain the concept of window functions in Spark DataFrame.
- What is the difference between cache() and persist() methods in Spark DataFrame?
- How do you perform a groupBy operation in Spark DataFrame?
- What is the role of Catalyst optimizer in Spark DataFrame?
- Explain the difference between filter() and where() methods in Spark DataFrame.
- What is the role of broadcast join in Spark DataFrame?
- How do you handle duplicate rows in Spark DataFrame?
- What are the best practices for optimizing Spark DataFrame performance?
- What is the difference between a DataFrame and a Dataset in Spark? When would you choose one over the other?
- Explain the concept of partitioning in Spark DataFrame. How does it affect performance?
- What is the role of a DataFrameWriter in Spark? How is it used for writing data to various file formats?
- How do you handle data skewness in Spark DataFrame? What are the techniques available for handling it?
- Explain the concept of schema inference in Spark DataFrame. What are the limitations of this approach?
- What is the difference between coalesce() and repartition() methods in Spark DataFrame?
- What is the role of the DataFrame API in machine learning applications?
- Explain the concept of serialization and deserialization in Spark. How does it affect DataFrame performance?
- How do you handle nested data structures in Spark DataFrame?
- What is the difference between checkpointing and caching in Spark DataFrame? When would you use one over the other?
- What is the role of the SparkSession in Spark DataFrame? How does it differ from SparkContext?
- Explain the concept of a broadcast variable in Spark DataFrame. How is it useful for optimizing performance?
- What are the best practices for writing efficient Spark SQL queries on Spark DataFrame?
- Explain the difference between select() and selectExpr() methods in Spark DataFrame.
- How do you handle schema evolution in Spark DataFrame when the schema changes over time?
Spark SQL
- What is Spark SQL? How does it differ from traditional SQL?
- How do you create a table in Spark SQL? What are the different data sources supported?
- What is the role of Catalyst optimizer in Spark SQL? How does it improve query performance?
- Explain the difference between a temporary table and a global temporary table in Spark SQL.
- How do you load data from a CSV file into Spark SQL? What are the options available for handling header and delimiter?
- What are the different types of joins supported in Spark SQL?
- Explain the concept of window functions in Spark SQL. How are they used for data analysis?
- How do you handle null values in Spark SQL? What are the different functions available for null handling?
- What are the different types of aggregation functions available in Spark SQL?
- Explain the difference between collect() and take() methods in Spark SQL.
- How do you handle schema evolution in Spark SQL when the schema changes over time?
- What is the role of the SparkSession in Spark SQL? How is it different from SparkContext?
- What is the role of a DataFrame API in Spark SQL? How is it used for data manipulation?
- How do you create a view in Spark SQL? What are the different types of views supported?
- What are the best practices for optimizing query performance in Spark SQL?