The data munging process in Python: An overview
In the previous articles we downloaded and set up a Python installation and were introduced to several useful libraries and data structures. We now want to begin an exploratory analysis in Python. If you have not gone through the previous article in the series, kindly do so before proceeding further.
How do you transform a vast, inconsistent spreadsheet of transactions riddled with typos and bad delimiters into structured and reliable input for advanced analysis? What if it's not even a spreadsheet but a webpage, a thousand emails, a JSON file, a text file containing a billion error logs, or a collection of unstructured documents stored haphazardly in the cloud?
The answer is data munging. Data munging is a concept and a methodology for taking data from unusable and erroneous forms to the level of structure and quality required by advanced analytical processes and consumers.
The term 'mung' was coined in the late 1960s as a mildly derogatory term for actions and transformations which progressively degrade a dataset. It quickly became tied to the backronym "Mash Until No Good" (or, recursively, "Mung Until No Good").
Data wrangling, sometimes called data munging, is the process of transforming and mapping raw data into another format to make it more appropriate and valuable for downstream purposes such as analytics. By a commonly cited estimate, data analysts spend the majority of their time (close to 80 per cent) on data wrangling rather than on the actual analysis of the data.
With the wide variety of verticals, use-cases, types of users, and systems utilizing enterprise data today, the specifics of munging can take on multiple forms. Once you read data into a pandas object (mostly a DataFrame), you will perform various operations that include:
Step1: Inspect data –
· Checking attributes – index, values, row labels, column labels, data types, shape, info etc.
· Checks –
o Whether a value exists
o Whether the data contains missing values
· Descriptive statistics on data – mean, median, mode, skew, kurtosis, max, min, sum, std, var, mad, percentiles, count etc.
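The inspection steps above can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the DataFrame and its column names (customer, amount, city) are invented for the example, and only standard pandas calls are used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame(
    {
        "customer": ["Ann", "Bob", "Cid", "Dee"],
        "amount": [120.0, np.nan, 75.5, 300.0],
        "city": ["Pune", "Delhi", None, "Pune"],
    }
)

# Attributes: index, column labels, data types, shape, info.
print(df.index)
print(df.columns.tolist())
print(df.dtypes)
print(df.shape)
df.info()

# Check whether a value exists in a column.
has_pune = df["city"].isin(["Pune"]).any()

# Count missing values per column.
missing_per_column = df.isna().sum()

# Descriptive statistics on a numeric column (NaN is skipped by default).
amount_mean = df["amount"].mean()
amount_median = df["amount"].median()
summary = df["amount"].describe()  # count, mean, std, min, percentiles, max

print(has_pune)
print(missing_per_column)
print(summary)
```

Calling describe() on the whole DataFrame instead of one column gives the same summary for every numeric column at once.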
Step2: Clean data / Manipulate:
· Mutation of table (Adding/deleting columns)
· Renaming columns or rows
· Binning data
· Creating dummies from categorical data
· Type conversions
· Handling missing values – detect, filter, replace
· Handling duplicates
· Slicing of data – subsetting
· Handling outliers
· Sorting – by data, index
· Table manipulation-
o Aggregation – Group by processing
o Merge, Join, Concatenate
o Reshaping & pivoting data – stack/unstack, pivot table, summarizations
o Standardize the variables
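As a rough sketch of how several of these cleaning steps look in pandas, consider the example below. The data, the column names and the choice of filling missing values with the median are assumptions made for illustration; real datasets usually call for a more deliberate choice at each step.

```python
import pandas as pd

# Hypothetical raw data, invented for illustration.
raw = pd.DataFrame(
    {
        "Cust Name": ["Ann", "Bob", "Bob", "Cid", "Dee"],
        "amt": ["120", "80", "80", None, "300"],
        "region": ["north", "south", "south", "north", "south"],
    }
)

clean = (
    raw.rename(columns={"Cust Name": "customer", "amt": "amount"})  # rename columns
    .drop_duplicates()                                              # handle duplicates
    .reset_index(drop=True)
)

# Type conversion: strings to numbers; missing or unparseable values become NaN.
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")

# Missing values: replace with the column median (one option among several).
clean["amount"] = clean["amount"].fillna(clean["amount"].median())

# Binning: put amounts into labelled ranges.
clean["band"] = pd.cut(
    clean["amount"], bins=[0, 100, 200, 1000], labels=["small", "medium", "large"]
)

# Dummies from a categorical column, added as new columns.
clean = pd.concat([clean, pd.get_dummies(clean["region"], prefix="region")], axis=1)

# Standardise a numeric variable (z-score).
clean["amount_z"] = (clean["amount"] - clean["amount"].mean()) / clean["amount"].std()

# Sorting and slicing: the two largest amounts.
top = clean.sort_values("amount", ascending=False).head(2)

# Aggregation with group by.
totals = clean.groupby("region")["amount"].sum()

# Merge with a lookup table, then summarise with a pivot table.
managers = pd.DataFrame({"region": ["north", "south"], "manager": ["Raj", "Mia"]})
merged = clean.merge(managers, on="region", how="left")
pivot = merged.pivot_table(index="manager", values="amount", aggfunc="mean")

print(clean)
print(totals)
print(pivot)
```

The order matters here: duplicates are dropped before the median is computed, so the repeated row does not skew the value used to fill the gap.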
Step3: Data Analysis:
· Univariate Analysis (Distribution of data, Data Audit)
· Bi-Variate Analysis (Statistical methods, Identifying relationships)
· Simple & Multivariate Analysis
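The three kinds of analysis listed above might look like this in pandas. The figures are made up (sales is deliberately an exact linear function of ad_spend so that the correlation is easy to read), and the sketch covers only summary statistics, not plotting or formal statistical tests.

```python
import pandas as pd

# Hypothetical data, invented for illustration: sales = 2 * ad_spend + 5.
df = pd.DataFrame(
    {
        "ad_spend": [10, 20, 30, 40, 50],
        "sales": [25, 45, 65, 85, 105],
        "channel": ["web", "web", "store", "web", "store"],
    }
)

# Univariate: distribution of a single variable.
sales_summary = df["sales"].describe()
sales_skew = df["sales"].skew()
channel_counts = df["channel"].value_counts()

# Bivariate: relationship between two variables.
corr = df["ad_spend"].corr(df["sales"])             # Pearson correlation
by_channel = df.groupby("channel")["sales"].mean()  # numeric vs categorical

# Multivariate: pairwise correlations across the numeric columns.
corr_matrix = df[["ad_spend", "sales"]].corr()

print(sales_summary)
print(channel_counts)
print(corr)
print(by_channel)
print(corr_matrix)
```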
[Figure: raw data]
[Figure: data after munging]
Step1: Inspect data –
· Checking attributes in Series:
Attributes are the properties of an object. A Series attribute is any information related to the Series object, such as its array, size, data type, values, shape or index. Below are some of the attributes you can use to get information about a Series object.
Attribute | Description |
Series.index | The index (axis labels) of the Series. |
Series.shape | Returns a tuple of the shape of the underlying data. |
Series.dtype | Returns the data type object of the underlying data. |
Series.values | Returns the Series as an ndarray or ndarray-like, depending on the dtype. |
Series.nbytes | Returns the number of bytes in the underlying data. |
Series.ndim | Returns the number of dimensions of the underlying data. |
Series.size | Returns the size (number of elements) of the underlying data. |
Series.hasnans | Returns True if there are any NaNs, otherwise False. |
Series.empty | Returns True if the Series object is empty, otherwise False. |
at, iat | Access a single value from the Series. |
loc, iloc | Access slices of the Series. |
Series.pop(item) | Returns the item and drops it from the Series. |
Series.T | Returns the transpose, which is by definition the Series itself. |
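A short sketch of these attributes on a small Series follows. The values and labels are invented for illustration. The commented results assume the default float64 dtype, which uses 8 bytes per element.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
s = pd.Series([10.0, 20.0, np.nan, 40.0], index=["a", "b", "c", "d"])

print(s.index)    # axis labels: a, b, c, d
print(s.shape)    # (4,)
print(s.dtype)    # float64
print(s.values)   # underlying NumPy array
print(s.nbytes)   # 32 (4 elements * 8 bytes)
print(s.ndim)     # 1
print(s.size)     # 4
print(s.hasnans)  # True
print(s.empty)    # False

first = s.at["a"]          # single value by label
second = s.iat[1]          # single value by position
by_label = s.loc["a":"b"]  # label slice, end label included
by_pos = s.iloc[0:2]       # positional slice, end position excluded

popped = s.pop("d")        # returns 40.0 and removes "d" from s
same = s.T                 # the transpose of a Series is the Series itself

print(first, second, popped)
print(s.size)     # 3 after the pop
```

Note that pop is a method rather than an attribute, and it changes the Series in place, which is why the size drops from 4 to 3.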
example
Output: