Keita, Moussa (2022): Guide Pratique de PySpark pour Data Engineer: Fonctions Usuelles et Exemples d’Applications.
Abstract
Big Data is commonly characterized by data volumes so large that they cannot be stored and processed on a single machine. Instead, the data are stored across a group of machines called a "cluster". Processing and exploiting data distributed across a cluster required IT engineers to devise new technological solutions, and Apache Spark is one of them. Spark is a framework for applying parallel computations to data stored on several cluster nodes. PySpark is the Python API for the Spark framework. This document reviews the common parallel processing functions that Big Data engineers use with PySpark.
Item Type: | MPRA Paper |
---|---|
Original Title: | Guide Pratique de PySpark pour Data Engineer: Fonctions Usuelles et Exemples d’Applications |
English Title: | Practical Guide of PySpark for Data Engineer: Common Functions and Application Examples |
Language: | French |
Keywords: | RDD, DataFrame, Big Data, PySpark, Hive, HDFS, CSV, Kafka |
Subjects: | C - Mathematical and Quantitative Methods > C8 - Data Collection and Data Estimation Methodology ; Computer Programs |
Item ID: | 113562 |
Depositing User: | Moussa Keita |
Date Deposited: | 28 Jun 2022 12:15 |
Last Modified: | 28 Jun 2022 15:19 |
URI: | https://mpra.ub.uni-muenchen.de/id/eprint/113562 |