The data munging process in Python: An overview

Photo by AltumCode on Unsplash

In the previous articles of this series, we downloaded and set up a Python installation and were introduced to several useful libraries and data structures. We now want to begin exploratory analysis in Python. If you have not read the previous article in the series, please do so before proceeding.

How do you transform a vast, inconsistent spreadsheet of transactions, riddled with typos and bad delimiters, into structured, reliable input for advanced analysis? What if it is not even a spreadsheet but a webpage, a thousand emails, a JSON file, a text file containing a billion error logs, or a collection of unstructured documents scattered across cloud storage?

The answer is data munging. Data munging is a concept and a methodology for taking data from unusable and erroneous forms to the level of structure and quality required by advanced analytical processes and consumers.

The term 'mung' originated in the late 1960s as a somewhat derogatory term for actions and transformations that progressively degrade a dataset. It quickly became tied to the backronym "Mash Until No Good" (or, recursively, "Mung Until No Good").

Data munging, also called data wrangling, is the process of transforming and mapping data from its raw form into another format to make it more appropriate and valuable for downstream purposes such as analytics. Data analysts typically spend the majority of their time (a commonly cited figure is about 80 per cent) on data wrangling rather than on the actual analysis of the data.

With the wide variety of verticals, use cases, types of users, and systems that use enterprise data today, the specifics of munging can take many forms. Once you have read data into a pandas object (usually a DataFrame), you will perform various operations, including:

Step1: Inspect data –

· Checking attributes – index, values, row labels, column labels, data types, shape, info, etc.

· Checking –

o Whether a value exists

o Whether the data contains missing values

· Descriptive statistics on data – mean, median, mode, skew, kurtosis, max, min, sum, std, var, mad (mean absolute deviation; the pandas mad method was removed in version 2.0), percentiles, count, etc.
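
The inspection step above can be sketched roughly as follows. This is a minimal illustration rather than a fixed recipe: the DataFrame `df` and its columns (`customer`, `amount`, `city`) are hypothetical and invented purely for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame(
    {
        "customer": ["Ann", "Bob", "Cara", "Dev"],
        "amount": [120.0, np.nan, 75.5, 240.0],
        "city": ["Pune", "Delhi", None, "Pune"],
    }
)

# Checking attributes: index, column labels, data types, shape, info
print(df.index)
print(df.columns.tolist())
print(df.dtypes)
print(df.shape)
df.info()

# Checking whether a value exists in a column
has_pune = df["city"].isin(["Pune"]).any()

# Checking for missing values: count of missing entries per column
missing_per_column = df.isna().sum()

# Descriptive statistics (missing values are skipped by default)
amount_mean = df["amount"].mean()
amount_median = df["amount"].median()
summary = df["amount"].describe()  # count, mean, std, min, quartiles, max
print(summary)
```

Most of these calls skip missing values by default, which is why it helps to count the missing entries before reading too much into the summary statistics.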

Step2: Clean data / Manipulate:

· Mutation of table (Adding/deleting columns)

· Renaming columns or rows

· Binning data

· Creating dummies from categorical data

· Type conversions

· Handling missing values – detect, filter, replace

· Handling duplicates

· Slicing of data – subsetting

· Handling outliers

· Sorting – by data, index

· Table manipulation –

o Aggregation – group-by processing

o Merge, join, concatenate

o Reshaping and pivoting data – stack/unstack, pivot table, summarizations

o Standardize the variables
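
Several of the cleaning and manipulation operations listed above might look like this in pandas. It is a sketch under stated assumptions: the tables `raw` and `cities`, their column names, and the bin edges are all hypothetical, and filling missing values with the median is just one of several reasonable choices.

```python
import pandas as pd

# Hypothetical toy transactions, invented for illustration only.
raw = pd.DataFrame(
    {
        "Cust": ["Ann", "Bob", "Bob", "Cara", "Dev"],
        "amt": ["120", "80", "80", None, "240"],
        "segment": ["retail", "online", "online", "retail", "online"],
    }
)

clean = (
    raw.rename(columns={"Cust": "customer", "amt": "amount"})  # renaming columns
    .drop_duplicates()                                          # handling duplicates
    .assign(amount=lambda d: pd.to_numeric(d["amount"]))        # type conversion
    .reset_index(drop=True)
)

# Handling missing values: replace with the median (an assumed strategy)
clean["amount"] = clean["amount"].fillna(clean["amount"].median())

# Binning data into labelled ranges (bin edges are arbitrary here)
clean["band"] = pd.cut(
    clean["amount"], bins=[0, 100, 200, 300], labels=["low", "mid", "high"]
)

# Creating dummies from categorical data
dummies = pd.get_dummies(clean, columns=["segment"])

# Slicing / subsetting, and sorting by data
large = clean.loc[clean["amount"] > 100, ["customer", "amount"]]
ranked = clean.sort_values("amount", ascending=False)

# Aggregation with group by
totals = clean.groupby("segment")["amount"].sum()

# Merge with a second (hypothetical) lookup table
cities = pd.DataFrame(
    {"customer": ["Ann", "Bob", "Dev"], "city": ["Pune", "Delhi", "Pune"]}
)
merged = clean.merge(cities, on="customer", how="left")

# Pivot table summarization
pivot = clean.pivot_table(index="segment", values="amount", aggfunc="mean")

# Standardizing a variable (z-score)
clean["amount_z"] = (clean["amount"] - clean["amount"].mean()) / clean["amount"].std()
```

The order matters in practice: converting types and removing duplicates before filling missing values keeps the median from being distorted by repeated rows.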

Step3: Data Analysis:

· Univariate Analysis (Distribution of data, Data Audit)

· Bi-Variate Analysis (Statistical methods, Identifying relationships)

· Simple & Multivariate Analysis
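
As a rough illustration of the analysis step, the sketch below runs a few univariate and bivariate summaries. The DataFrame and its columns (`hours`, `score`, `group`) are hypothetical, and the calls shown are only a small sample of what each kind of analysis can involve.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame(
    {
        "hours": [1, 2, 3, 4, 5, 6],
        "score": [52, 55, 61, 64, 70, 74],
        "group": ["a", "b", "a", "b", "a", "b"],
    }
)

# Univariate analysis: distribution of a single variable
hours_summary = df["hours"].describe()
group_counts = df["group"].value_counts()

# Bivariate analysis: relationship between two variables
hours_score_corr = df["hours"].corr(df["score"])       # Pearson correlation
mean_score_by_group = df.groupby("group")["score"].mean()

# Multivariate view: pairwise correlations of the numeric columns
corr_matrix = df[["hours", "score"]].corr()

print(hours_summary)
print(group_counts)
print(hours_score_corr)
print(mean_score_by_group)
print(corr_matrix)
```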

[Figure: raw data]

[Figure: the same data after munging]

Step1: Inspect data –

· Checking attributes in Series:

Attributes are the properties of an object. A Series attribute is any piece of information about the Series object, such as its array, size, data type, values, shape, or index. Below are some of the attributes you can use to get information about a Series object.

Attribute           Description
Series.index        The index (axis labels) of the Series.
Series.shape        Returns a tuple of the shape of the underlying data.
Series.dtype        Returns the data type object of the underlying data.
Series.values       Returns the Series as an ndarray or ndarray-like, depending on the dtype.
Series.nbytes       Returns the number of bytes in the underlying data.
Series.ndim         Returns the number of dimensions of the underlying data (always 1 for a Series).
Series.size         Returns the number of elements in the underlying data.
Series.hasnans      Returns True if the Series contains any NaN values; otherwise False.
Series.empty        Returns True if the Series object is empty; otherwise False.
at, iat             Indexers for accessing a single value in the Series (by label and by integer position, respectively).
loc, iloc           Indexers for accessing slices of the Series (by label and by integer position, respectively).
Series.pop(item)    A method rather than an attribute: returns the item and drops it from the Series.
Series.T            Returns the transpose, which is by definition the Series itself.
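
To make these concrete, here is a small sketch that reads each of the attributes above from one Series. The Series `s`, its values, and its index labels are hypothetical, chosen only so that every attribute has something to report.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value, invented for illustration.
s = pd.Series([10.0, 20.0, np.nan, 40.0], index=["a", "b", "c", "d"])

index_labels = list(s.index)   # axis labels
shape = s.shape                # tuple describing the shape
dtype = s.dtype                # data type of the underlying data
values = s.values              # underlying data as a NumPy array
n_bytes = s.nbytes             # bytes used by the underlying data
n_dims = s.ndim                # number of dimensions
n_items = s.size               # number of elements
any_nan = s.hasnans            # True if any value is NaN
is_empty = s.empty             # True if the Series has no elements

by_label = s.at["b"]           # single value by label
by_position = s.iat[0]         # single value by integer position
label_slice = s.loc["a":"b"]   # label slice (both endpoints included)
position_slice = s.iloc[0:2]   # positional slice (end excluded)

print(index_labels, shape, dtype, n_bytes, n_dims, n_items, any_nan, is_empty)
print(by_label, by_position)
print(label_slice)
print(position_slice)

# pop returns the item and removes it from the Series in place
popped = s.pop("d")
print(popped, s.size)
```

Note that `pop` changes the Series itself, so it is called last here to keep the earlier values unaffected.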

example

Output: