Data Science

The data munging process in Python: An overview

Satyabrata Sahoo

Jan 24, 2022 — 3 min read

In the previous articles we downloaded and set up a Python installation and were introduced to several useful libraries and data structures. We now want to begin exploratory analysis in Python. If you have not read the previous article in the series, please do so before proceeding.

How do you transform a vast, inconsistent spreadsheet of transactions riddled with typos and bad delimiters into structured, reliable input for advanced analysis? What if it is not even a spreadsheet but a webpage, a thousand emails, a JSON file, a text file containing a billion error logs, or a collection of unstructured documents scattered across cloud storage?

The answer is data munging. Data munging is a concept and a methodology for taking data from unusable and erroneous forms to the level of structure and quality required by advanced analytical processes and consumers.

The term 'mung' originated in the late 1960s as a mildly derogatory term for actions and transformations that progressively degrade a dataset. It quickly became tied to the backronym "Mash Until No Good" (or, recursively, "Mung Until No Good").

Data wrangling, sometimes called data munging, is the process of transforming and mapping raw data into another format to make it more appropriate and valuable for downstream purposes such as analytics. Data analysts are often said to spend the majority of their time (a commonly quoted figure is about 80 per cent) on data wrangling rather than on the actual analysis of the data.

With the wide variety of verticals, use cases, types of users, and systems that rely on enterprise data today, the specifics of munging can take many forms. Once you read data into a pandas object (usually a DataFrame), you will typically perform operations that include:

Step1: Inspect data –

· Checking attributes – index, values, row labels, column labels, data types, shape, info etc.

· Checking –

o If a value exists

o If the data contains missing values

· Descriptive statistics on data – mean, median, mode, skew, kurtosis, max, min, sum, std, var, mad, percentiles, count etc.
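
The inspection steps above can be sketched roughly as follows. This is a minimal illustration, not a prescribed workflow: the DataFrame `df` and its columns (`customer`, `amount`, `city`) are hypothetical toy data invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame(
    {
        "customer": ["Ann", "Bob", "Cara", "Dev"],
        "amount": [120.0, np.nan, 75.5, 300.0],
        "city": ["Pune", "Delhi", None, "Pune"],
    }
)

# Checking attributes: index, column labels, data types, shape, info
print(df.index)
print(df.columns.tolist())
print(df.dtypes)
print(df.shape)
df.info()

# Check whether a value exists in a column
has_pune = df["city"].isin(["Pune"]).any()

# Count missing values per column
missing_per_column = df.isna().sum()

# Descriptive statistics (missing values are skipped by default)
summary = df["amount"].describe()
mean_amount = df["amount"].mean()
median_amount = df["amount"].median()
print(summary)
```

`describe()` gives count, mean, std, min, max and quartiles in one call; individual methods such as `skew()`, `kurt()`, `var()` and `quantile()` cover the other statistics in the list.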

Step2: Clean data / Manipulate:

· Mutation of table (Adding/deleting columns)

· Renaming columns or rows

· Binning data

· Creating dummies from categorical data

· Type conversions

· Handling missing values – detect, filter, replace

· Handling duplicates

· Slicing of data – subsetting

· Handling outliers

· Sorting – by data, index

· Table manipulation-

o Aggregation – Group by processing

o Merge, Join, Concatenate

o Reshaping & pivoting data – stack/unstack, pivot table, summarizations

o Standardize the variables
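
A few of the cleaning and manipulation steps listed above might look like this in pandas. It is a sketch under stated assumptions: the `sales` and `customers` tables, their column names, the bin edges and the median-fill strategy are all hypothetical choices made for this example, and real data may call for different ones.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
sales = pd.DataFrame(
    {
        "cust_id": [1, 2, 2, 3, 4],
        "amt": ["120", "80", "80", None, "300"],
        "segment": ["retail", "retail", "retail", "corporate", "corporate"],
    }
)

# Renaming columns
sales = sales.rename(columns={"amt": "amount"})

# Type conversion: strings to numbers (missing entries become NaN)
sales["amount"] = pd.to_numeric(sales["amount"], errors="coerce")

# Handling duplicates, then missing values (median fill is one option of many)
sales = sales.drop_duplicates()
sales["amount"] = sales["amount"].fillna(sales["amount"].median())

# Adding a column by binning (bin edges are arbitrary here)
sales["amount_band"] = pd.cut(
    sales["amount"], bins=[0, 100, 200, np.inf], labels=["low", "mid", "high"]
)

# Creating dummies from categorical data
dummies = pd.get_dummies(sales["segment"], prefix="segment")

# Slicing / subsetting and sorting
large = sales.loc[sales["amount"] > 100, ["cust_id", "amount"]]
ranked = sales.sort_values("amount", ascending=False)

# Aggregation with group by
totals = sales.groupby("segment")["amount"].sum()

# Merge with a second (hypothetical) table, then pivot
customers = pd.DataFrame(
    {"cust_id": [1, 2, 3, 4], "city": ["Pune", "Delhi", "Pune", "Delhi"]}
)
merged = sales.merge(customers, on="cust_id", how="left")
pivot = merged.pivot_table(
    index="city", columns="segment", values="amount", aggfunc="sum"
)

# Standardizing a variable (z-score)
sales["amount_z"] = (sales["amount"] - sales["amount"].mean()) / sales["amount"].std()
print(pivot)
```

The order of operations matters here: converting types before filling missing values lets the median be computed on numbers rather than strings.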

Step3: Data Analysis:

· Univariate Analysis (Distribution of data, Data Audit)

· Bi-Variate Analysis (Statistical methods, Identifying relationships)

· Simple & Multivariate Analysis
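
As a rough illustration of the analysis step, the snippet below runs a univariate summary, a bivariate correlation and a small correlation matrix. The data and the column names (`ad_spend`, `revenue`, `region`) are hypothetical, and a correlation matrix is only one simple stand-in for multivariate analysis.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame(
    {
        "ad_spend": [10, 20, 30, 40, 50],
        "revenue": [25, 45, 70, 85, 105],
        "region": ["north", "south", "north", "south", "north"],
    }
)

# Univariate analysis: distribution of a single variable
revenue_summary = df["revenue"].describe()
region_counts = df["region"].value_counts()
revenue_skew = df["revenue"].skew()

# Bivariate analysis: relationship between two variables
corr = df["ad_spend"].corr(df["revenue"])  # Pearson correlation by default
mean_by_region = df.groupby("region")["revenue"].mean()

# Multivariate view: pairwise correlations of the numeric columns
corr_matrix = df[["ad_spend", "revenue"]].corr()

print(revenue_summary)
print(region_counts)
print(corr_matrix)
```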

[Figure: raw data]

[Figure: data after munging]

Step1: Inspect data –

· Checking attributes in Series:

Attributes are the properties of an object. A Series attribute is any piece of information related to the Series object, such as its array, size, data type, values, shape or index. Below are some of the attributes you can use to get information about a Series object.

Attribute	Description
Series.index	The index (axis labels) of the Series.
Series.shape	Returns a tuple of the shape of the underlying data.
Series.dtype	Returns the dtype object of the underlying data.
Series.values	Returns the Series as an ndarray or ndarray-like, depending on the dtype.
Series.nbytes	Returns the number of bytes in the underlying data.
Series.ndim	Returns the number of dimensions of the underlying data (always 1 for a Series).
Series.size	Returns the number of elements in the underlying data.
Series.hasnans	Returns True if there are any NaNs, otherwise False.
Series.empty	Returns True if the Series object is empty, otherwise False.
Series.at, Series.iat	Access a single value by label (at) or by integer position (iat).
Series.loc, Series.iloc	Access slices by label (loc) or by integer position (iloc).
Series.pop(item)	A method rather than an attribute: returns the item and drops it from the Series.
Series.T	Returns the transpose, which is by definition the Series itself.
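
The attributes in the table can be tried out on a small Series. The Series `s` below, with its values and labels, is a hypothetical example made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical Series, invented purely for illustration.
s = pd.Series([10.0, 20.0, np.nan, 40.0], index=["a", "b", "c", "d"])

print(s.index)    # the axis labels a, b, c, d
print(s.shape)    # (4,)
print(s.dtype)    # float64
print(s.values)   # the underlying NumPy array
print(s.nbytes)   # 32 (four 8-byte floats)
print(s.ndim)     # 1
print(s.size)     # 4
print(s.hasnans)  # True
print(s.empty)    # False

# Single values: by label (at) and by integer position (iat)
first = s.at["a"]
second = s.iat[1]

# Slices: loc is label-based and includes the end label, iloc is position-based
by_label = s.loc["a":"b"]
by_position = s.iloc[0:2]

# The transpose of a Series is the Series itself
same = s.T.equals(s)

# pop returns the value and removes it from the Series in place
popped = s.pop("d")
print(popped, s.size)
```

Note that `pop` modifies the Series, so it is shown last; after the call `s` has three elements.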

