Guide Pratique de PySpark pour Data Engineer: Fonctions Usuelles et Exemples d’Applications

Keita, Moussa (2022): Guide Pratique de PySpark pour Data Engineer: Fonctions Usuelles et Exemples d’Applications.


Abstract

Big Data is commonly characterized by data volumes too large to store and process on a single machine. The data are instead stored across a group of machines called a "cluster". Processing and exploiting data distributed across a cluster required IT engineers to devise new technological solutions, and Apache Spark is one of them. Spark is a framework for applying parallel computations to data stored on several cluster nodes. PySpark is the Python interface to the Spark framework. The purpose of this document is to review the common parallel processing functions that Big Data engineers use in PySpark.

Item Type:	MPRA Paper
Original Title:	Guide Pratique de PySpark pour Data Engineer: Fonctions Usuelles et Exemples d’Applications
English Title:	Practical Guide of PySpark for Data Engineer: Common Functions and Application Examples
Language:	French
Keywords:	RDD, DataFrame, Big Data, PySpark, Hive, HDFS, CSV, Kafka
Subjects:	C - Mathematical and Quantitative Methods > C8 - Data Collection and Data Estimation Methodology ; Computer Programs
Item ID:	113562
Depositing User:	Moussa Keita
Date Deposited:	28 Jun 2022 12:15
Last Modified:	28 Jun 2022 15:19
References:	Chambers B. and Zaharia M. (2018), Spark: The Definitive Guide: Big Data Processing Made Simple, O’Reilly Media, Inc.
	Frampton Mike (2015), Mastering Apache Spark: Gain Expertise in Processing and Storing Data by Using Advanced Techniques with Apache Spark, Packt Publishing.
	Guller Mohammed (2015), Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large Scale Data Analysis.
	Karau H. and Warren R. (2017), High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark, O’Reilly Media, Inc.
	Karau H., Konwinski A., Wendell P. and Zaharia M. (2015), Learning Spark: Lightning-Fast Big Data Analysis, O’Reilly Media, Inc.
	Yadav Rishi (2015), Spark Cookbook, Packt Publishing.
URI:	https://mpra.ub.uni-muenchen.de/id/eprint/113562

All papers reproduced by permission. Reproduction and distribution subject to the approval of the copyright owners.
