Munich Personal RePEc Archive

Data Science sous Python: Algorithme, Statistique, DataViz, DataMining et Machine-Learning

Keita, Moussa (2017): Data Science sous Python: Algorithme, Statistique, DataViz, DataMining et Machine-Learning.

MPRA_paper_76653.pdf

Abstract

Data Science is a technical discipline that combines statistical concepts with computer algorithms and computation to process and model large volumes of data derived from observed phenomena (economic, industrial, commercial, financial, managerial, social, etc.). In the area of Business Intelligence, Data Science has become an indispensable decision-support tool for company managers, because it allows them to exploit and derive value from the company's internal and external information assets. In recent years, Python has rapidly become one of the programming languages most used by Data Scientists to exploit the growing potential of Big Data. The language's current popularity is largely explained by the many possibilities offered by its powerful libraries, including those for numerical analysis and scientific computing (numpy, scipy, pandas), data visualization (matplotlib), and Machine Learning (scikit-learn). Presented with a pedagogical approach, this manuscript revisits the concepts essential for mastering Data Science with Python. The work is organized into seven chapters. The first chapter presents the basics of programming in Python. The second chapter covers strings and regular expressions; its aim is to familiarize the reader with processing and using string values, which are the variable values commonly found in unstructured databases. The third chapter presents methods for file management and text processing; it builds on the previous chapter by presenting the methods commonly used to process unstructured data, which generally take the form of text files. The fourth chapter presents methods for processing and organizing data originally stored as data tables.
The fifth chapter presents classical statistical analysis methods (descriptive analyses, statistical tests, linear and logistic regression, ...). The sixth chapter presents data visualization methods (histograms, bar graphs, pie plots, box plots, scatter plots, trend curves, 3D graphs, ...). Finally, the seventh chapter presents data mining and machine-learning methods, including dimensionality reduction methods (Principal Component Analysis, Factor Analysis, Multiple Correspondence Analysis) as well as classification methods (Hierarchical Classification, K-Means Clustering, Support Vector Machine, Random Forest).

MPRA is a RePEc service hosted by
the Munich University Library in Germany.