SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR also supports distributed machine learning using MLlib.
SparkR in notebooks
- For Spark 2.0 and above, you do not need to explicitly pass a `sqlContext` object to every function call. This article uses the new syntax. For old syntax examples, see the SparkR 1.6 overview.
- For Spark 2.2 and above, notebooks no longer import SparkR by default because SparkR functions were conflicting with similarly named functions from other popular packages. To use SparkR, call `library(SparkR)` in your notebooks. The SparkR session is already configured, and all SparkR functions will talk to your attached cluster using the existing session.
SparkR in spark-submit jobs
You can run scripts that use SparkR on Databricks as spark-submit jobs, with minor code modifications. For an example, refer to Create and run a spark-submit job for R scripts.
Create SparkR DataFrames
You can create a DataFrame from a local R `data.frame`, from a data source, or using a Spark SQL query.
From a local R data.frame
The simplest way to create a DataFrame is to convert a local R `data.frame` into a `SparkDataFrame`. Specifically, we can use `createDataFrame` and pass in the local R `data.frame` to create a `SparkDataFrame`. Like most other SparkR functions, the `createDataFrame` syntax changed in Spark 2.0. Refer to createDataFrame for more examples.
Using the data source API
The general method for creating a DataFrame from a data source is `read.df`. This method takes the path of the file to load and the type of data source. SparkR supports reading CSV, JSON, text, and Parquet files natively.
SparkR automatically infers the schema from the CSV file.
Adding a data source connector with Spark Packages
Through Spark Packages you can find data source connectors for popular file formats such as Avro. As an example, use the spark-avro package to load an Avro file. The availability of the spark-avro package depends on your cluster's image version. See Avro file.
First, take an existing `data.frame`, convert it to a Spark DataFrame, and save it as an Avro file.
To verify that an Avro file was saved:
Now use the spark-avro package again to read back the data.
The data source API can also be used to save DataFrames into multiple file formats. For example, you can save the DataFrame from the previous example to a Parquet file using `write.df`.
From a Spark SQL query
You can also create SparkR DataFrames using Spark SQL queries.
The result of such a query (for example, a DataFrame named `age`) is a SparkDataFrame.
DataFrame operations
Spark DataFrames support a number of functions to do structured data processing. Here are some basic examples. A complete list can be found in the API docs.
Select rows and columns
Grouping and aggregation
SparkDataFrames support a number of commonly used functions to aggregate data after grouping. For example, you can count the number of times each waiting time appears in the faithful dataset.
Column operations
SparkR provides a number of functions that can be directly applied to columns for data processing and aggregation. The following example shows the use of basic arithmetic functions.
Machine learning
SparkR exposes most of the MLlib algorithms. Under the hood, SparkR uses MLlib to train the model.
The following example shows how to build a Gaussian GLM model using SparkR. To run linear regression, set family to `'gaussian'`. To run logistic regression, set family to `'binomial'`. When using SparkML GLM, SparkR automatically performs one-hot encoding of categorical features, so it does not need to be done manually. Beyond String and Double type features, it is also possible to fit over MLlib Vector features, for compatibility with other MLlib components.
For tutorials, see SparkR ML tutorials.
This page contains a collection of Spark pipeline transformation methods that we can use for different problems. Use it as a quick cheat sheet on how to do a particular operation on a Spark DataFrame or with PySpark.
These code snippets are tested on spark-2.4.x and mostly work on spark-2.3.x as well, but I'm not sure about older versions.
Read the partitioned json files from disk
The same approach applies to all supported file types.
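A minimal PySpark sketch, assuming the JSON part files live under a hypothetical partitioned directory such as `/tmp/data/events/`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cheatsheet").getOrCreate()

# Read every JSON part file under the partitioned directory; partition
# directories such as date=2020-01-01 show up as a regular column.
events = spark.read.json("/tmp/data/events/")
events.printSchema()

# The same pattern works for the other supported formats, e.g.:
# spark.read.parquet("/tmp/data/events_parquet/")
# spark.read.csv("/tmp/data/events_csv/", header=True, inferSchema=True)
```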
Save partitioned files into a single file.
Here we merge all the partitions into one file and write it to disk. All of the data ends up in a single write task on one node, so be careful with the size of the data set you are dealing with; otherwise, that node may run out of memory.
Use the `coalesce` method to adjust the number of partitions of a DataFrame or RDD to suit your needs.
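A hedged sketch of both ideas, reusing the hypothetical `events` DataFrame from the read example above (output paths are placeholders):

```python
# Merge everything into one partition and write a single output file; all the
# data flows through one task, so make sure it fits on a single node.
events.coalesce(1).write.mode("overwrite").json("/tmp/data/events_single/")

# coalesce can also just reduce the partition count without a full shuffle,
# e.g. down to 8 partitions before an expensive downstream step.
events_8 = events.coalesce(8)
print(events_8.rdd.getNumPartitions())
```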
Filter rows which meet particular criteria
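For example, a sketch that keeps only the rows matching some assumed criteria (the `age` and `country` columns are made up for illustration):

```python
from pyspark.sql import functions as F

# Keep only the rows that meet the criteria.
adults_in_us = events.filter((F.col("age") >= 18) & (F.col("country") == "US"))

# The same filter expressed as a SQL string.
adults_in_us_sql = events.filter("age >= 18 AND country = 'US'")
```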
Map with case class
Use a case class if you want to map over multiple columns with a complex data structure.
Or use the `Row` class.
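The original tip is about the Scala case class / `Row` APIs; here is a hedged PySpark sketch where `Row` plays the role of the case class (column names are assumptions):

```python
from pyspark.sql import Row

# Map each record into a richer structure; Row is the PySpark counterpart of a
# Scala case class here. 'first_name', 'last_name' and 'age' are assumed columns.
people = (
    events.rdd
          .map(lambda r: Row(full_name=r["first_name"] + " " + r["last_name"],
                             is_adult=r["age"] >= 18))
          .toDF()
)
people.show(5)
```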
Use selectExpr to access inner attributes
This gives easy access to nested data structures such as JSON, letting you filter them with any existing UDFs, or with your own UDF for more flexibility.
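A sketch, assuming a nested struct column named `payload`:

```python
# Reach into nested struct fields with SQL expressions; 'payload' and its
# sub-fields are hypothetical.
flat = events.selectExpr(
    "payload.user.id AS user_id",
    "payload.event AS event_name",
    "upper(payload.event) AS event_upper",
)
```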
How to access RDD methods from pyspark side
Using standard `RDD` operations via the PySpark API isn't straightforward; we need to invoke `.rdd` to convert the DataFrame into an RDD that supports these operations.
For example, here we convert sparse vectors to dense arrays and sum them column-wise.
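A sketch of that idea, assuming a `features` column holding `pyspark.ml` sparse vectors:

```python
# Drop down to the RDD API, densify each sparse vector, and add them up
# element-wise (i.e. a column-wise sum across all rows).
column_sums = (
    events.select("features").rdd
          .map(lambda row: row["features"].toArray())  # SparseVector -> numpy array
          .reduce(lambda a, b: a + b)
)
print(column_sums)
```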
Pyspark Map on multiple columns
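For instance, a sketch that derives one value from several columns at once (the column names and scoring rule are made up):

```python
# Combine several columns in a single map over the underlying RDD.
scored = (
    events.rdd
          .map(lambda r: (r["user_id"], r["clicks"] * 2 + r["purchases"] * 10))
          .toDF(["user_id", "score"])
)
```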
Filtering a DataFrame column of type Seq[String]
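A Scala `Seq[String]` column corresponds to `array<string>` in PySpark; a sketch assuming a `tags` column:

```python
from pyspark.sql import functions as F

# Keep rows whose array column contains a given value.
spark_tagged = events.filter(F.array_contains(F.col("tags"), "spark"))

# Or filter on the array length.
non_empty = events.filter(F.size(F.col("tags")) > 0)
```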
Filter a column with custom regex and udf
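A sketch with a hypothetical pattern that keeps only tokens made of three or more lowercase letters:

```python
import re

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

pattern = re.compile(r"^[a-z]{3,}$")

# Wrap the regex check in a UDF and use it as a filter; 'token' is an assumed column.
matches_pattern = F.udf(
    lambda s: bool(pattern.match(s)) if s is not None else False,
    BooleanType(),
)
clean_tokens = events.filter(matches_pattern(F.col("token")))
```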
Sum a column elements
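A sketch, assuming a numeric `amount` column:

```python
from pyspark.sql import functions as F

# Aggregate the column down to a single value on the driver.
total = events.agg(F.sum("amount").alias("total")).collect()[0]["total"]
print(total)
```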
Remove Unicode characters from tokens
Sometimes we only need to work with ASCII text, so it's better to clean out the other characters.
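A sketch using a small UDF to strip non-ASCII characters from an assumed `token` column:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Drop every non-ASCII character from a string column.
to_ascii = F.udf(
    lambda s: s.encode("ascii", "ignore").decode("ascii") if s is not None else None,
    StringType(),
)
ascii_events = events.withColumn("token", to_ascii(F.col("token")))
```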
Connecting to jdbc with partition by integer column
When using Spark to read data from a SQL database and then running the rest of the pipeline on it, it is recommended to partition the data according to natural segments in the data, or at least on an integer column. That way Spark can fire multiple SQL queries to read the data from the SQL server and operate on the chunks separately, with each result landing in its own Spark partition.
The commands below are in PySpark, but the APIs are the same for the Scala version as well.
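A hedged sketch of such a partitioned JDBC read; the connection details, table name, and bounds below are all placeholders:

```python
# Read from a SQL database in parallel by range-partitioning on an integer
# column: Spark issues one query per partition range.
orders = (
    spark.read.format("jdbc")
         .option("url", "jdbc:postgresql://db-host:5432/shop")
         .option("dbtable", "public.orders")
         .option("user", "reporting")
         .option("password", "secret")
         .option("partitionColumn", "order_id")  # an integer column in the table
         .option("lowerBound", "1")
         .option("upperBound", "1000000")
         .option("numPartitions", "10")          # 10 concurrent queries / partitions
         .load()
)
```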
Parse nested json data
This will be very helpful when working with PySpark and you want to pass very nested JSON data between JVM and Python processes. Lately the Spark community has relied on the Apache Arrow project to avoid the cost of repeated serialization/deserialization when sending data from Java memory to Python memory or vice versa.
So to process the inner objects you can make use of the `getItem` method to filter out the required parts of the object and pass them over to Python memory via Arrow. In the future Arrow might support arbitrarily nested data, but right now it doesn't support complex nested formats. The general recommendation is to avoid nesting.
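A sketch of that flow, with an assumed JSON layout and column names:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# The JSON layout below is an assumption for illustration.
payload_schema = StructType([
    StructField("user", StructType([
        StructField("id", StringType()),
        StructField("name", StringType()),
    ])),
    StructField("events", ArrayType(StringType())),
])

# Parse the raw JSON string column, then use getItem to pull out inner pieces.
parsed = events.withColumn("payload", F.from_json(F.col("json_str"), payload_schema))
flat = parsed.select(
    F.col("payload").getItem("user").getItem("id").alias("user_id"),
    F.col("payload").getItem("events").getItem(0).alias("first_event"),
)
```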
'string ⇒ array<string>' conversion
The type annotation `.as[String]` makes the conversion explicit instead of relying on an assumed implicit conversion.
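That note is about the Scala Dataset API; as a PySpark-side sketch of the same string-to-`array<string>` conversion (the `text` column is an assumption):

```python
from pyspark.sql import functions as F

# Split a free-text string column into array<string> on whitespace.
tokenised = events.withColumn("tokens", F.split(F.col("text"), r"\s+"))
tokenised.select("tokens").printSchema()  # tokens: array<string>
```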
A crazy string collection and groupby
This is a chain of operations on a column of type `Array[String]`: collect the tokens and count the n-gram distribution over all of them.
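A sketch of that chain, assuming the `tokens` array column produced by the split example above:

```python
from pyspark.sql import functions as F
from pyspark.ml.feature import NGram

# Build bigrams per row, then explode and count to get the n-gram distribution.
bigrams = NGram(n=2, inputCol="tokens", outputCol="bigrams").transform(tokenised)
ngram_counts = (
    bigrams.select(F.explode("bigrams").alias("bigram"))
           .groupBy("bigram")
           .count()
           .orderBy(F.desc("count"))
)
ngram_counts.show(20, truncate=False)
```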
How to access AWS s3 on spark-shell or pyspark
Most of the time we need a cloud storage provider like S3 or GCS to read and write the data for processing. Very few keep an in-house HDFS to handle the data themselves; for the majority, I think cloud storage is easier to start with, and you don't need to worry about size limitations.
Supply the aws credentials via environment variable
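One common approach, sketched here, is to read the standard AWS environment variables in the driver and hand them to the s3a connector; this goes through the internal `_jsc` handle, and the bucket path is a placeholder:

```python
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-read").getOrCreate()

# Pass AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment to s3a.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
hadoop_conf.set("fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])

df = spark.read.parquet("s3a://my-bucket/some/prefix/")
```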
Supply the credentials via default aws ~/.aws/config file
Recent versions of the awscli expect their configuration to be kept in the `~/.aws/credentials` file, but older versions look at the `~/.aws/config` path. Spark 2.4.x looks at the `~/.aws/config` location, since Spark 2.4.x ships with default Hadoop jars of version 2.7.x.
Set spark scratch space or tmp directory correctly
This might be required when working with a huge dataset that your machines can't hold entirely in memory for the given pipeline steps; in those cases the data is spilled over to disk and saved in the tmp directory. Set the properties below to ensure you have enough space in the tmp location.
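A sketch of the relevant setting; the path is a placeholder, and `spark.local.dir` must be in place before the JVM/session starts (e.g. via `spark-defaults.conf` or `spark-submit --conf`), so setting it on an already-running session has no effect:

```python
from pyspark.sql import SparkSession

# Point Spark's scratch space at a disk with plenty of room. In some cluster
# deployments SPARK_LOCAL_DIRS or the cluster manager's setting takes precedence.
spark = (
    SparkSession.builder
        .appName("big-pipeline")
        .config("spark.local.dir", "/mnt/large-disk/spark-tmp")
        .getOrCreate()
)
```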
Pyspark doesn’t support all the data types.
When using Arrow to transport data between JVM and Python memory, Arrow may throw an error if the column types aren't compatible with its existing converters. Fixes may come in future Arrow releases. I'm keeping this here to show how PySpark gets data from the JVM and what can go wrong in that process.
Work with spark standalone cluster manager
Start the spark cluster in standalone mode
Once you have downloaded the same version of the Spark binary across the machines, you can start the Spark master and slave (worker) processes to form the standalone Spark cluster. You could also run both of these services on the same machine.
In standalone mode:
- A worker can have multiple executors.
- A worker is like a node manager in YARN.
- You can set a worker's max core and memory usage settings.
- When defining the Spark application via spark-shell or similar, define the executor memory and cores.
- When submitting a job to get 10 executors with 1 CPU and 2 GB of RAM each, use the configuration sketched below.
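As a sketch in the builder API (the master URL is a placeholder), on a standalone cluster those numbers come from capping the total cores and fixing the per-executor size:

```python
from pyspark.sql import SparkSession

# 10 executors of 1 core / 2 GB each: cores.max / executor.cores = 10 executors.
spark = (
    SparkSession.builder
        .master("spark://master-host:7077")
        .appName("standalone-job")
        .config("spark.cores.max", "10")
        .config("spark.executor.cores", "1")
        .config("spark.executor.memory", "2g")
        .getOrCreate()
)
```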
This page will be updated as and when I come across reusable snippets of code for Spark operations.