
Apache Spark is generally known as a fast, general and open-source engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. It allows you to speed up analytic applications by a factor of up to 100 compared to other technologies on the market today. You can interface Spark with Python through "PySpark", the Spark Python API that exposes the Spark programming model to Python.

Even though working with Spark will remind you in many ways of working with Pandas DataFrames, you'll also see that it can be tough getting familiar with all the functions that you can use to query, transform and inspect your data. What's more, if you've never worked with any other programming language or if you're new to the field, it might be hard to distinguish between RDD operations. Let's face it: map() and flatMap() are different enough, but it might still come as a challenge to decide which one you really need when you're faced with them in your analysis. Or what about other functions, like reduce() and reduceByKey()?
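To make that distinction concrete, here is a minimal sketch run against a local SparkContext; the tiny "hello" data set, the local[2] master and the app name are made-up assumptions for illustration, not part of the cheat sheet itself.

> from pyspark import SparkContext
> sc = SparkContext(master="local[2]", appName="cheatsheet_sketch")  #Hypothetical local context for this sketch
> lines = sc.parallelize(["hello spark", "hello python"])  #Tiny made-up data set
> lines.map(lambda line: line.split(" ")).collect()  #[['hello', 'spark'], ['hello', 'python']] - map() keeps one output per input
> lines.flatMap(lambda line: line.split(" ")).collect()  #['hello', 'spark', 'hello', 'python'] - flatMap() flattens the results
> words = lines.flatMap(lambda line: line.split(" "))
> words.map(len).reduce(lambda a, b: a + b)  #21 - reduce() is an action that collapses all values into a single result
> words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).collect()  #Merges values per key, e.g. [('hello', 2), ('spark', 1), ('python', 1)]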

Even though the documentation is very elaborate, it never hurts to have a cheat sheet by your side, especially when you're just getting into it. This PySpark cheat sheet covers the basics, from initializing Spark and loading your data to retrieving RDD information, sorting, filtering and sampling your data. You'll also see that topics such as repartitioning, iterating, merging, saving your data and stopping the SparkContext are included. Note that the examples in the document use small data sets to illustrate the effect of specific functions on your data; in real-life data analysis, you'll be using Spark to analyze big data. Are you hungry for more? Don't miss our other Python cheat sheets for data science, which cover topics such as Python basics, NumPy, Pandas, Pandas data wrangling and much more!
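Before moving on to the cheat sheet itself, here is a small sketch of a few of the topics just listed (loading data, retrieving RDD information, filtering, sampling, saving and stopping), reusing the sc created in the sketch above; the numeric data, the sample size and the "even_numbers_output" directory are made-up placeholders.

> rdd = sc.parallelize(range(100))  #Small in-memory data set, like the cheat sheet's examples
> rdd.getNumPartitions()  #Retrieve number of partitions
> rdd.count()  #100 - number of elements
> evens = rdd.filter(lambda x: x % 2 == 0)  #Keep even numbers only
> evens.takeSample(False, 5, seed=42)  #Sample 5 elements without replacement
> rdd.map(lambda x: (x % 10, x)).sortByKey().first()  #Sort (key, value) pairs by key; smallest key comes first
> evens.repartition(4).saveAsTextFile("even_numbers_output")  #Hypothetical output directory
> sc.stop()  #Stop the underlying SparkContext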

Initializing Spark: SparkContext

> from pyspark import SparkContext

In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc; the shell itself can be launched with, for example:

bin/pyspark --master local --py-files code.py

Inspect SparkContext

> sc.version  #Retrieve SparkContext version
> str(sc.sparkHome)  #Path where Spark is installed on worker nodes
> str(sc.sparkUser())  #Retrieve name of the Spark User running SparkContext
> sc.applicationId  #Retrieve application ID
> sc.defaultParallelism  #Return default level of parallelism
> sc.defaultMinPartitions  #Default minimum number of partitions for RDDs

Configuration

> from pyspark import SparkConf, SparkContext
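The Configuration import above is typically combined with a SparkConf object when a new SparkContext is created (only one SparkContext can be active at a time). The master URL, app name and memory setting below are illustrative values rather than recommendations.

> from pyspark import SparkConf, SparkContext
> conf = SparkConf().setMaster("local").setAppName("My app").set("spark.executor.memory", "1g")  #Example settings only
> sc = SparkContext(conf=conf)  #Create a SparkContext from the configuration
> sc.appName  #'My app'
> sc.master  #'local'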
