Munich Personal RePEc Archive

Guide Pratique de PySpark pour Data Engineer: Fonctions Usuelles et Exemples d’Applications [Practical Guide to PySpark for Data Engineers: Common Functions and Application Examples]

Keita, Moussa (2022): Guide Pratique de PySpark pour Data Engineer: Fonctions Usuelles et Exemples d’Applications.


Abstract

Big Data is commonly characterized by data volumes too large to store and process on a single machine. The data are instead stored across a group of machines called a cluster. Processing and exploiting data distributed in this way required IT engineers to devise new technological solutions, and Apache Spark is one of them. Spark is a framework for applying parallel computations to data stored on several cluster nodes. PySpark is the Python interface to the Spark framework. The purpose of this document is to review the common parallel processing functions that Big Data engineers use in PySpark.
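
The idea of applying parallel computations to data spread across several nodes can be illustrated with a minimal sketch. This is not Spark code and is not taken from the guide: it uses only the Python standard library, with worker threads standing in for cluster nodes. The function names (`split_into_partitions`, `partial_sum_of_squares`, `parallel_sum_of_squares`) and the choice of a sum of squares as the workload are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal the records round-robin into num_partitions lists.

    Each list plays the role of the data held by one cluster node.
    """
    return [records[i::num_partitions] for i in range(num_partitions)]


def partial_sum_of_squares(partition):
    """Work performed independently on a single partition."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(records, num_partitions=4):
    """Compute one partial result per partition, then combine them."""
    partitions = split_into_partitions(records, num_partitions)
    # Threads stand in for cluster nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_sum_of_squares, partitions))
    # Final combine step: merge the per-partition results.
    return sum(partials)


if __name__ == "__main__":
    data = list(range(1, 11))
    print(parallel_sum_of_squares(data))
```

In PySpark the same partition, compute, and combine pattern is typically expressed by distributing a collection as an RDD and chaining operations such as `map` and `reduce`, with Spark handling the distribution across the cluster.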


MPRA is a RePEc service hosted by the University Library LMU Munich.