> For the complete documentation index, see [llms.txt](https://jheck.gitbook.io/hadoop/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://jheck.gitbook.io/hadoop/apache-spark-dataframes.md).

# Apache Spark - DataFrames

### Introducing DataFrames <a href="#introducing-dataframes" id="introducing-dataframes"></a>

Under the covers, DataFrames are derived from data structures known as Resilient Distributed Datasets (RDDs). RDDs and DataFrames are immutable distributed collections of data. Let's take a closer look at what some of these terms mean before we understand how they relate to DataFrames:

* **Resilient**: They are fault tolerant, so if part of your operation fails, Spark quickly recovers the lost computation.
* **Distributed**: RDDs are distributed across networked machines known as a cluster.
* **DataFrame**: A data structure where data is organized into named columns, like a table in a relational database, but with richer optimizations under the hood.

Without the named columns and declared types provided by a schema, Spark wouldn't know how to optimize the executation of any computation. Since DataFrames have a schema, they use the Catalyst Optimizer to determine the optimal way to execute your code.

DataFrames were invented because the business community uses tables in a relational database, Pandas or R DataFrames, or Excel worksheets. A Spark DataFrame is conceptually equivalent to these, with richer optimizations under the hood and the benefit of being distributed across a cluster.

**Interacting with DataFrames**

Once created (instantiated), a DataFrame object has methods attached to it. Methods are operations one can perform on DataFrames such as filtering, counting, aggregating and many others.

**Example**: In Scala, to create (instantiate) a DataFrame, use this syntax:`val df = ...`. In Python it is simply`df = ...`

To display the contents of the DataFrame, apply a`show`operation (method) on it using the syntax`df.show()`.

The`.`indicates you are applying a method on the object. These concepts will be relevant whether you use Scala or Python.

In working with DataFrames, it is common to chain operations together, such as: `df.select().filter().groupBy()`.

By chaining operations together, you don't need to save intermediate DataFrames into local variables (thereby avoiding the creation of extra objects).

Also note that you do not have to worry about how to order operations because the optimizier determines the optimal order of execution of the operations for you.

`df.select(...).groupBy(...).filter(...)`

versus

`df.filter(...).select(...).groupBy(...)`

**DataFrames and SQL**

DataFrame syntax is more flexible than SQL syntax. Here we illustrate general usage patterns of SQL and DataFrames.

Suppose we have a data set we loaded as a table called`myTable`and an equivalent DataFrame, called`df`. We have three fields/columns called`col_1`(`numeric`type),`col_2`(`string`type) and`col_3`(`timestamp`type) Here are basic SQL operations and their DataFrame equivalents.

Notice that columns in DataFrames are referenced by`$""` in Scala or using`col("")`in Python.

| SQL                                         | DataFrame (Scala)                          | DataFrame (Python)                                 |
| ------------------------------------------- | ------------------------------------------ | -------------------------------------------------- |
| `SELECT col_1 FROM myTable`                 | `df.select($"col_1")`                      | `df.select(col("col_1"))`                          |
| `DESCRIBE myTable`                          | `df.printSchema`                           | `df.printSchema()`                                 |
| `SELECT col_1 from myTable WHERE col_1 > 0` | `df.select($"col_1").filter($"col_1" > 0)` | `df.select(col("col_1")).filter(col("col_1") > 0)` |
| `..GROUP BY col_2 ORDER BY col_2`           | `..groupBy($"col_2").orderBy($"col_2")`    | `..groupBy(col("col_2")).orderBy(col("col_2"))`    |
| `..WHERE year(col_3) > 1990`                | `..filter(year($"col_3") > 1990)`          | `..filter(year(col("col_3")) > 1990)`              |
| `SELECT * from myTable limit 10`            | `df.select("*").limit(10)`                 | `df.select("*").limit(10)`                         |
| `display(myTable)`(text format)             | `df.show()`                                | `df.show()`                                        |
| `display(myTable)`(html format)             | `display(df)`                              | `display(df)`                                      |

**Hint:** You can also run SQL queries with the special syntax`spark.sql("SELECT * FROM myTable")`


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://jheck.gitbook.io/hadoop/apache-spark-dataframes.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
