Apache Spark Interview Q&A

Lipsa Biswas
3 min read · Mar 9, 2023


(in progress)

Spark Core

  1. What is Spark Core?
  2. What are the features of Spark Core?
  3. What are the different types of cluster managers in Spark?
  4. What is RDD (Resilient Distributed Datasets)?
  5. How does Spark handle data partitioning?
  6. What is the difference between map() and flatMap() in Spark? (See the sketch after this list.)
  7. What is lazy evaluation in Spark?
  8. What is the purpose of a Spark Driver?
  9. What is the difference between Spark SQL and Spark Core?
  10. How does Spark handle failures in a cluster?
  11. What is a Spark executor?
  12. What is the difference between cache() and persist() in Spark?
  13. What are the different types of transformations in Spark?
  14. How does Spark handle memory management?
  15. How can you optimize the performance of a Spark application?
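For question 6 above, here is a minimal PySpark sketch contrasting map() and flatMap(); the sample data, app name, and split logic are illustrative assumptions, not part of the original post.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("map-vs-flatmap").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["hello world", "apache spark"])

# map(): exactly one output element per input element -> an RDD of lists
mapped = lines.map(lambda line: line.split(" "))
print(mapped.collect())       # [['hello', 'world'], ['apache', 'spark']]

# flatMap(): each input element may yield zero or more output elements,
# and the results are flattened into a single RDD of words
flat_mapped = lines.flatMap(lambda line: line.split(" "))
print(flat_mapped.collect())  # ['hello', 'world', 'apache', 'spark']

spark.stop()
```

In short, map() preserves a one-to-one structure (and can produce nested results), while flatMap() flattens each element's output into one RDD.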

Spark DataFrame

  1. What is a DataFrame in Spark? How is it different from an RDD?
  2. What are the different ways of creating a DataFrame in Spark?
  3. What is schema in Spark DataFrame? How is it useful?
  4. How do you select columns from a DataFrame in Spark?
  5. What are the different types of joins in Spark DataFrame?
  6. How do you handle missing or null values in Spark DataFrame?
  7. What are the different aggregate functions available in Spark DataFrame?
  8. Explain the concept of window functions in Spark DataFrame.
  9. What is the difference between cache() and persist() methods in Spark DataFrame?
  10. How do you perform a groupBy operation in Spark DataFrame?
  11. What is the role of Catalyst optimizer in Spark DataFrame?
  12. Explain the difference between filter() and where() methods in Spark DataFrame.
  13. What is the role of broadcast join in Spark DataFrame?
  14. How do you handle duplicate rows in Spark DataFrame?
  15. What are the best practices for optimizing Spark DataFrame performance?
  16. What is the difference between a DataFrame and a Dataset in Spark? When would you choose one over the other?
  17. Explain the concept of partitioning in Spark DataFrame. How does it affect performance?
  18. What is the role of a DataFrameWriter in Spark? How is it used for writing data to various file formats?
  19. How do you handle data skewness in Spark DataFrame? What are the techniques available for handling it?
  20. Explain the concept of schema inference in Spark DataFrame. What are the limitations of this approach?
  21. What is the difference between coalesce() and repartition() methods in Spark DataFrame? (See the sketch after this list.)
  22. What is the role of the DataFrame API in machine learning applications?
  23. Explain the concept of serialization and deserialization in Spark. How does it affect DataFrame performance?
  24. How do you handle nested data structures in Spark DataFrame?
  25. What is the difference between checkpointing and caching in Spark DataFrame? When would you use one over the other?
  26. What is the role of the SparkSession in Spark DataFrame? How does it differ from SparkContext?
  27. Explain the concept of a broadcast variable in Spark DataFrame. How is it useful for optimizing performance?
  28. What are the best practices for writing efficient Spark SQL queries on Spark DataFrame?
  29. Explain the difference between select() and selectExpr() methods in Spark DataFrame.
  30. How do you handle schema evolution in Spark DataFrame when the schema changes over time?
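For question 21 above, here is a minimal PySpark sketch contrasting coalesce() and repartition(); the DataFrame contents and partition counts are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-vs-repartition").getOrCreate()

df = spark.range(0, 1000)          # simple DataFrame with a single `id` column
print(df.rdd.getNumPartitions())   # default partition count for this environment

# repartition(): performs a full shuffle; can increase or decrease the number
# of partitions and rebalances rows evenly across them
repartitioned = df.repartition(8)
print(repartitioned.rdd.getNumPartitions())  # 8

# coalesce(): narrow transformation that merges existing partitions without a
# full shuffle; only suitable for *decreasing* the partition count
coalesced = df.coalesce(2)
print(coalesced.rdd.getNumPartitions())      # 2

spark.stop()
```

Because coalesce() avoids a shuffle, it is the usual choice when shrinking the partition count before a write, whereas repartition() is preferred when the data needs to be rebalanced or the count increased.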

Spark SQL

  1. What is Spark SQL? How does it differ from traditional SQL?
  2. How do you create a table in Spark SQL? What are the different data sources supported?
  3. What is the role of Catalyst optimizer in Spark SQL? How does it improve query performance?
  4. Explain the difference between a temporary table and a global temporary table in Spark SQL. (See the sketch after this list.)
  5. How do you load data from a CSV file into Spark SQL? What are the options available for handling header and delimiter?
  6. What are the different types of joins supported in Spark SQL?
  7. Explain the concept of window functions in Spark SQL. How are they used for data analysis?
  8. How do you handle null values in Spark SQL? What are the different functions available for null handling?
  9. What are the different types of aggregation functions available in Spark SQL?
  10. Explain the difference between collect() and take() methods in Spark SQL.
  11. How do you handle schema evolution in Spark SQL when the schema changes over time?
  12. What is the role of the SparkSession in Spark SQL? How is it different from SparkContext?
  13. What is the role of a DataFrame API in Spark SQL? How is it used for data manipulation?
  14. How do you create a view in Spark SQL? What are the different types of views supported?
  15. What are the best practices for optimizing query performance in Spark SQL?
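For question 4 above, here is a minimal PySpark sketch contrasting a temporary view with a global temporary view; the view names and sample rows are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("temp-views").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# A temporary view is scoped to the current SparkSession only
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE id = 1").show()

# A global temporary view is shared across SparkSessions in the same
# application and lives in the reserved `global_temp` database
df.createOrReplaceGlobalTempView("people_global")
spark.sql("SELECT name FROM global_temp.people_global WHERE id = 2").show()

spark.stop()
```

The global view remains available for the lifetime of the application and must be qualified with the global_temp database, while the plain temporary view disappears when its SparkSession stops.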
