Pandas, whose name derives from "panel data", is a data science library of rich data structures and tools for working with the structured datasets that are common in statistics, healthcare, finance, retail, social science and many other fields. The library provides intuitive built-in routines for common data manipulation and analysis tasks. It is one of the most commonly used open-source Python packages for data science and machine learning. It is popular mainly for data wrangling, exploratory analysis, and visualizing cross-sectional and time series data, and for being powerful, flexible, fast and productive to work with. Its core high-level data structures are the 'DataFrame' and the 'Series'.

Getting Started:

Installation and Importing:

To install Pandas, open a command-line prompt and type:

pip install pandas

Alternatively, install Anaconda.

Anaconda is a free Python distribution that ships with over 250 packages installed by default and gives access to over 7500 additional open-source packages from PyPI for scientific computing: machine learning applications, large-scale data processing, predictive analytics, data science, and so on. Pandas is installed by default with Anaconda, but you can also install it manually by running:

                        conda install pandas

Importing the Pandas package:

                        import pandas as pd          # conventional alias

                        from pandas import *         # also works, but crowds the namespace
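
A quick way to confirm that the installation and import worked is to print the installed version. This is only a minimal check; the version string shown will depend on your environment.

```python
import pandas as pd

# Print the installed version to confirm that the import works.
print(pd.__version__)
```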

Pandas Data Structure:

The two primary data structures in Pandas are the Series for one-dimensional data and the DataFrame for two-dimensional data. Data in higher dimensions can be represented within a DataFrame through hierarchical indexing.

1.     Series: A Series is a one-dimensional array-like data structure capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.) together with an associated array of data labels, called its index. By default, the index consists of the integers 0 to N-1 (where N is the size of the Series), but we can also specify our own index. When a dictionary is passed, its keys become the index of the resulting Series (in insertion order in current pandas versions; older versions sorted the keys). A Series may contain duplicate index labels. We can create a Series from a list, tuple, dictionary, scalar or NumPy array using the Series() constructor; an unordered set must first be converted to a list.

Creating a Series:

pd.Series(data, index=labels)   # data: a NumPy array (or list, tuple, dict); labels: usually a list

If we don't provide an index explicitly, a default one is created consisting of the integers 0 through N - 1. Unlike a NumPy array, a pandas Series can be indexed by strings or other non-integer labels.
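
The following sketch illustrates the construction options described above. The variable names and sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Default index: integers 0 through N-1.
s_default = pd.Series(np.array([10, 20, 30]))

# Custom string labels supplied through the index argument.
s_labeled = pd.Series(np.array([10, 20, 30]), index=["a", "b", "c"])

# From a dictionary: the keys become the index.
s_dict = pd.Series({"x": 1.5, "y": 2.5})

# Duplicate index labels are allowed; selecting "a" returns both rows.
s_dup = pd.Series([1, 2, 3], index=["a", "a", "b"])

print(s_default.index.tolist())  # [0, 1, 2]
print(s_labeled["b"])            # 20
print(s_dict.index.tolist())     # ['x', 'y']
print(s_dup["a"].tolist())       # [1, 2]
```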

A series can be converted into a list or dictionary using the method tolist() or to_dict() respectively.
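
As a small illustration of these two conversions (the sample Series is hypothetical):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

as_list = s.tolist()    # values only; the index is dropped
as_dict = s.to_dict()   # index labels become the dictionary keys

print(as_list)  # [1, 2, 3]
print(as_dict)  # {'a': 1, 'b': 2, 'c': 3}
```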

Series Attributes:

Attributes are the properties of an object. Just as built-in Python data structures such as lists and dictionaries expose information about themselves, a Series provides attributes that give useful metadata about its contents: values, index, dtype, shape, nbytes, ndim and so on. The loc and iloc attributes are indexers that select data by label and by integer position, respectively.
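
A short sketch of the attributes mentioned above, using a made-up Series of three floating-point values:

```python
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

print(s.values)   # underlying NumPy array
print(s.index)    # index labels
print(s.dtype)    # float64
print(s.shape)    # (3,)
print(s.ndim)     # 1
print(s.nbytes)   # 24 (three 8-byte floats)

# loc selects by label, iloc by integer position.
print(s.loc["b"])   # 2.5
print(s.iloc[0])    # 1.5
```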

Important Methods:

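A variety of attributes and methods are useful across the entire spectrum of data wrangling tasks. The list below follows the pandas API reference; a few entries (for example append, iteritems, mad, tshift, slice_shift and is_monotonic) were removed in pandas 2.0.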

pandas.Series.T

pandas.Series.align

pandas.Series.cumprod

pandas.Series.head

pandas.Series.multiply

pandas.Series.rfloordiv

pandas.Series.where

pandas.Series.iat

pandas.Series.lt

pandas.Series.array

pandas.Series.all

pandas.Series.cumsum

pandas.Series.hist

pandas.Series.ne

pandas.Series.rmod

pandas.Series.xs

pandas.Series.loc

pandas.Series.gt

pandas.Series.at

pandas.Series.any

pandas.Series.describe

pandas.Series.idxmax

pandas.Series.nlargest

pandas.Series.rmul

pandas.Series.index

pandas.Series.iloc

pandas.Series.le

pandas.Series.attrs

pandas.Series.append

pandas.Series.diff

pandas.Series.idxmin

pandas.Series.notna

pandas.Series.rolling

pandas.Series.array

pandas.Series.__iter__

pandas.Series.ge

pandas.Series.axes

pandas.Series.apply

pandas.Series.div

pandas.Series.infer_objects

pandas.Series.notnull

pandas.Series.round

pandas.Series.values

pandas.Series.items

pandas.Series.ne

pandas.Series.dtype

pandas.Series.argmax

pandas.Series.divide

pandas.Series.interpolate

pandas.Series.nsmallest

pandas.Series.rpow

pandas.Series.dtype

pandas.Series.iteritems

pandas.Series.eq

pandas.Series.dtypes

pandas.Series.argmin

pandas.Series.divmod

pandas.Series.isin

pandas.Series.nunique

pandas.Series.rsub

pandas.Series.shape

pandas.Series.keys

pandas.Series.product

pandas.Series.flags

pandas.Series.argsort

pandas.Series.dot

pandas.Series.isna

pandas.Series.pad

pandas.Series.rtruediv

pandas.Series.nbytes

pandas.Series.pop

pandas.Series.dot

pandas.Series.hasnans

pandas.Series.asfreq

pandas.Series.drop

pandas.Series.isnull

pandas.Series.pct_change

pandas.Series.sample

pandas.Series.ndim

pandas.Series.item

pandas.Series.apply

pandas.Series.iat

pandas.Series.asof

pandas.Series.drop_duplicates

pandas.Series.item

pandas.Series.pipe

pandas.Series.searchsorted

pandas.Series.size

pandas.Series.xs

pandas.Series.agg

pandas.Series.iloc

pandas.Series.astype

pandas.Series.droplevel

pandas.Series.items

pandas.Series.plot

pandas.Series.sem

pandas.Series.T

pandas.Series.add

pandas.Series.aggregate

pandas.Series.index

pandas.Series.at_time

pandas.Series.dropna

pandas.Series.iteritems

pandas.Series.pop

pandas.Series.set_axis

pandas.Series.memory_usage

pandas.Series.sub

pandas.Series.transform

pandas.Series.is_monotonic

pandas.Series.autocorr

pandas.Series.dt

pandas.Series.keys

pandas.Series.pow

pandas.Series.set_flags

pandas.Series.hasnans

pandas.Series.mul

pandas.Series.map

pandas.Series.is_monotonic_decreasing

pandas.Series.backfill

pandas.Series.duplicated

pandas.Series.kurt

pandas.Series.prod

pandas.Series.shift

pandas.Series.empty

pandas.Series.div

pandas.Series.groupby

pandas.Series.is_monotonic_increasing

pandas.Series.between

pandas.Series.eq

pandas.Series.kurtosis

pandas.Series.product

pandas.Series.skew

pandas.Series.dtypes

pandas.Series.truediv

pandas.Series.rolling

pandas.Series.is_unique

pandas.Series.between_time

pandas.Series.equals

pandas.Series.last

pandas.Series.quantile

pandas.Series.slice_shift

pandas.Series.name

pandas.Series.floordiv

pandas.Series.expanding

pandas.Series.loc

pandas.Series.bfill

pandas.Series.ewm

pandas.Series.last_valid_index

pandas.Series.radd

pandas.Series.sort_index

pandas.Series.flags

pandas.Series.mod

pandas.Series.ewm

pandas.Series.name

pandas.Series.bool

pandas.Series.expanding

pandas.Series.le

pandas.Series.rank

pandas.Series.sort_values

pandas.Series.set_flags

pandas.Series.pow

pandas.Series.pipe

pandas.Series.nbytes

pandas.Series.cat

pandas.Series.explode

pandas.Series.lt

pandas.Series.ravel

pandas.Series.sparse

pandas.Series.astype

pandas.Series.radd

pandas.Series.abs

pandas.Series.ndim

pandas.Series.clip

pandas.Series.factorize

pandas.Series.mad

pandas.Series.rdiv

pandas.Series.squeeze

pandas.Series.convert_dtypes

pandas.Series.rsub

pandas.Series.all

pandas.Series.shape

pandas.Series.combine

pandas.Series.ffill

pandas.Series.map

pandas.Series.rdivmod

pandas.Series.std

pandas.Series.infer_objects

pandas.Series.rmul

pandas.Series.any

pandas.Series.size

pandas.Series.combine_first

pandas.Series.fillna

pandas.Series.mask

pandas.Series.reindex

pandas.Series.str

pandas.Series.copy

pandas.Series.rdiv

pandas.Series.autocorr

pandas.Series.values

pandas.Series.compare

pandas.Series.filter

pandas.Series.max

pandas.Series.reindex_like

pandas.Series.sub

pandas.Series.bool

pandas.Series.rtruediv

pandas.Series.between

pandas.Series.abs

pandas.Series.convert_dtypes

pandas.Series.first

pandas.Series.mean

pandas.Series.rename

pandas.Series.subtract

pandas.Series.to_numpy

pandas.Series.rfloordiv

pandas.Series.clip

pandas.Series.add

pandas.Series.copy

pandas.Series.first_valid_index

pandas.Series.median

pandas.Series.rename_axis

pandas.Series.sum

pandas.Series.to_period

pandas.Series.rmod

pandas.Series.corr

pandas.Series.add_prefix

pandas.Series.corr

pandas.Series.floordiv

pandas.Series.memory_usage

pandas.Series.reorder_levels

pandas.Series.swapaxes

pandas.Series.to_timestamp

pandas.Series.rpow

pandas.Series.count

pandas.Series.add_suffix

pandas.Series.count

pandas.Series.ge

pandas.Series.min

pandas.Series.repeat

pandas.Series.swaplevel

pandas.Series.to_list

pandas.Series.combine

pandas.Series.cov

pandas.Series.agg

pandas.Series.cov

pandas.Series.get

pandas.Series.mod

pandas.Series.replace

pandas.Series.tail

pandas.Series.__array__

pandas.Series.combine_first

pandas.Series.cummax

pandas.Series.aggregate

pandas.Series.cummax

pandas.Series.groupby

pandas.Series.mode

pandas.Series.resample

pandas.Series.take

pandas.Series.get

pandas.Series.round

pandas.Series.cummin

pandas.Series.last_valid_index

pandas.Series.cummin

pandas.Series.gt

pandas.Series.mul

pandas.Series.reset_index

pandas.Series.to_clipboard

pandas.Series.at

pandas.Series.align

pandas.Series.cumprod

pandas.Series.resample

pandas.Series.dt.is_quarter_end

pandas.Series.str.capitalize

pandas.Series.str.rstrip

pandas.Series.cat.set_categories

pandas.Series.to_csv

pandas.Series.ffill

pandas.Series.drop

pandas.Series.cumsum

pandas.Series.tz_convert

pandas.Series.dt.is_year_start

pandas.Series.str.casefold

pandas.Series.str.slice

pandas.Series.cat.as_ordered

pandas.Series.to_dict

pandas.Series.fillna

pandas.Series.droplevel

pandas.Series.describe

pandas.Series.tz_localize

pandas.Series.dt.is_year_end

pandas.Series.str.cat

pandas.Series.str.slice_replace

pandas.Series.cat.as_unordered

pandas.Series.to_excel

pandas.Series.interpolate

pandas.Series.drop_duplicates

pandas.Series.diff

pandas.Series.at_time

pandas.Series.dt.is_leap_year

pandas.Series.str.center

pandas.Series.str.split

pandas.Series.sparse.npoints

pandas.Series.to_frame

pandas.Series.isna

pandas.Series.duplicated

pandas.Series.factorize

pandas.Series.between_time

pandas.Series.dt.daysinmonth

pandas.Series.str.contains

pandas.Series.str.rsplit

pandas.Series.sparse.density

pandas.Series.to_hdf

pandas.Series.isnull

pandas.Series.equals

pandas.Series.kurt

pandas.Series.tshift

pandas.Series.dt.days_in_month

pandas.Series.str.count

pandas.Series.str.startswith

pandas.Series.sparse.fill_value

pandas.Series.to_json

pandas.Series.notna

pandas.Series.first

pandas.Series.mad

pandas.Series.slice_shift

pandas.Series.dt.tz

pandas.Series.str.decode

pandas.Series.str.strip

pandas.Series.sparse.sp_values

pandas.Series.to_latex

pandas.Series.notnull

pandas.Series.head

pandas.Series.max

pandas.Series.dt.date

pandas.Series.dt.freq

pandas.Series.str.encode

pandas.Series.str.swapcase

pandas.Series.sparse.from_coo

pandas.Series.to_list

pandas.Series.pad

pandas.Series.idxmax

pandas.Series.mean

pandas.Series.dt.time

pandas.Series.dt.to_period

pandas.Series.str.endswith

pandas.Series.str.title

pandas.Series.sparse.to_coo

pandas.Series.to_markdown

pandas.Series.replace

pandas.Series.idxmin

pandas.Series.median

pandas.Series.dt.timetz

pandas.Series.dt.to_pydatetime

pandas.Series.str.extract

pandas.Series.str.translate

pandas.Flags

pandas.Series.to_numpy

pandas.Series.argsort

pandas.Series.isin

pandas.Series.min

pandas.Series.dt.year

pandas.Series.dt.tz_localize

pandas.Series.str.extractall

pandas.Series.str.upper

pandas.Series.attrs

pandas.Series.to_period

pandas.Series.argmin

pandas.Series.last

pandas.Series.mode

pandas.Series.dt.month

pandas.Series.dt.tz_convert

pandas.Series.str.find

pandas.Series.str.wrap

pandas.Series.plot

pandas.Series.to_pickle

pandas.Series.argmax

pandas.Series.reindex

pandas.Series.nlargest

pandas.Series.dt.day

pandas.Series.dt.normalize

pandas.Series.str.findall

pandas.Series.str.zfill

pandas.Series.plot.area

pandas.Series.to_sql

pandas.Series.reorder_levels

pandas.Series.reindex_like

pandas.Series.nsmallest

pandas.Series.dt.hour

pandas.Series.dt.strftime

pandas.Series.str.fullmatch

pandas.Series.str.isalnum

pandas.Series.plot.bar

pandas.Series.to_string

pandas.Series.sort_values

pandas.Series.rename

pandas.Series.pct_change

pandas.Series.dt.minute

pandas.Series.dt.round

pandas.Series.str.get

pandas.Series.str.isalpha

pandas.Series.plot.barh

pandas.Series.to_timestamp

pandas.Series.sort_index

pandas.Series.rename_axis

pandas.Series.prod

pandas.Series.dt.second

pandas.Series.dt.floor

pandas.Series.str.index

pandas.Series.str.isdigit

pandas.Series.plot.box

pandas.Series.to_xarray

pandas.Series.swaplevel

pandas.Series.reset_index

pandas.Series.quantile

pandas.Series.dt.microsecond

pandas.Series.dt.ceil

pandas.Series.str.join

pandas.Series.str.isspace

pandas.Series.plot.density

pandas.Series.tolist

pandas.Series.unstack

pandas.Series.sample

pandas.Series.rank

pandas.Series.dt.nanosecond

pandas.Series.dt.month_name

pandas.Series.str.len

pandas.Series.str.islower

pandas.Series.plot.hist

pandas.Series.transform

pandas.Series.explode

pandas.Series.set_axis

pandas.Series.sem

pandas.Series.dt.week

pandas.Series.dt.day_name

pandas.Series.str.ljust

pandas.Series.str.isupper

pandas.Series.plot.kde

pandas.Series.transpose

pandas.Series.searchsorted

pandas.Series.take

pandas.Series.skew

pandas.Series.dt.weekofyear

pandas.Series.dt.qyear

pandas.Series.str.lower

pandas.Series.str.istitle

pandas.Series.plot.line

pandas.Series.truediv

pandas.Series.ravel

pandas.Series.tail

pandas.Series.std

pandas.Series.dt.dayofweek

pandas.Series.dt.start_time

pandas.Series.str.lstrip

pandas.Series.str.isnumeric

pandas.Series.plot.pie

pandas.Series.truncate

pandas.Series.repeat

pandas.Series.truncate

pandas.Series.sum

pandas.Series.dt.day_of_week

pandas.Series.dt.end_time

pandas.Series.str.match

pandas.Series.str.isdecimal

pandas.Series.hist

pandas.Series.tshift

pandas.Series.squeeze

pandas.Series.where

pandas.Series.var

pandas.Series.dt.weekday

pandas.Series.dt.days

pandas.Series.str.normalize

pandas.Series.str.get_dummies

pandas.Series.to_pickle

pandas.Series.tz_convert

pandas.Series.view

pandas.Series.mask

pandas.Series.kurtosis

pandas.Series.dt.dayofyear

pandas.Series.dt.seconds

pandas.Series.str.pad

pandas.Series.cat.categories

pandas.Series.to_csv

pandas.Series.tz_localize

pandas.Series.append

pandas.Series.add_prefix

pandas.Series.unique

pandas.Series.dt.day_of_year

pandas.Series.dt.microseconds

pandas.Series.str.partition

pandas.Series.cat.ordered

pandas.Series.to_dict

pandas.Series.unique

pandas.Series.compare

pandas.Series.add_suffix

pandas.Series.nunique

pandas.Series.dt.quarter

pandas.Series.dt.nanoseconds

pandas.Series.str.repeat

pandas.Series.cat.codes

pandas.Series.to_excel

pandas.Series.unstack

pandas.Series.update

pandas.Series.filter

pandas.Series.is_unique

pandas.Series.dt.is_month_start

pandas.Series.dt.components

pandas.Series.str.replace

pandas.Series.cat.rename_categories

pandas.Series.to_frame

pandas.Series.update

pandas.Series.asfreq

pandas.Series.backfill

pandas.Series.is_monotonic

pandas.Series.dt.is_month_end

pandas.Series.dt.to_pytimedelta

pandas.Series.str.rfind

pandas.Series.cat.reorder_categories

pandas.Series.to_xarray

pandas.Series.value_counts

pandas.Series.asof

pandas.Series.bfill

pandas.Series.is_monotonic_increasing

pandas.Series.dt.is_quarter_start

pandas.Series.dt.total_seconds

pandas.Series.str.rindex

pandas.Series.cat.add_categories

pandas.Series.to_hdf

pandas.Series.var

pandas.Series.shift

pandas.Series.dropna

pandas.Series.is_monotonic_decreasing

pandas.Series.to_string

pandas.Series.to_latex

pandas.Series.str.rjust

pandas.Series.cat.remove_categories

pandas.Series.to_sql

pandas.Series.view

pandas.Series.first_valid_index

pandas.Series.to_json

pandas.Series.value_counts

pandas.Series.to_clipboard

pandas.Series.to_markdown

pandas.Series.str.rpartition

pandas.Series.cat.remove_unused_categories

Data Wrangling Tasks:

  1. Peeking at the data: the head and tail methods show a small sample from the start or end of a Series or DataFrame object. Five records are displayed by default, but we can request a custom number.
  2. Type conversion: the astype method explicitly converts the data from one type to another.
  3. Treating outliers: the clip method caps outliers at threshold values. Values below the lower parameter are set to lower, and values above the upper parameter are set to upper.
  4. Replacing values: the replace method substitutes source values with target values, for example by supplying a dictionary that maps each value to its replacement.
  5. Handling duplicate values: duplicated returns a boolean Series that marks every occurrence of a value after its first as True. The drop_duplicates method returns the Series with the duplicates removed. To drop the duplicates from the original Series itself, pass inplace=True to drop_duplicates.
  6. Dealing with missing data: the isnull and notnull methods produce a boolean Series that identifies the missing or non-missing values, respectively. Both NumPy's np.nan and Python's None are treated as missing values.
  7. Missing value imputation: fillna, ffill and bfill deal with missing data by imputing a specific value or by copying the last (or next) known value over the missing one, which is typical in time series analysis. When we want to drop the missing data altogether, we use the dropna method. A common practice is to impute a numerical variable with its mean (or its median for skewed data) and a categorical variable with its mode.
  8. Unique values and their frequency: unique, nunique and value_counts return the array of distinct values in a Series, the number of distinct values, and a frequency table, respectively.
  9. Largest/smallest values: idxmax and idxmin return the index label of the largest and smallest value, while nlargest and nsmallest return the n largest and n smallest values together with their index labels. Having the labels at hand is helpful in many cases.
  10. Checking values against a list: isin compares each element of the Series against the provided list and returns a boolean Series that is True where the element belongs to the list. This boolean Series can then be used to subset the original.
  11. Sorting: sort_values and sort_index sort a Series by its values or by its index. Both return a new Series; to sort the original in place, pass inplace=True.
  12. Basic statistics: mean and median measure the central tendency of the data and std measures its spread. quantile returns the requested quantiles (percentiles), whereas describe produces summary statistics for the data.
  13. Applying a function to each element: map is one of the most useful Series methods. It takes a built-in or user-defined function (or a dictionary of replacements) and applies it to each value in the Series. Combined with Python's lambda functions, it is a powerful tool for transforming a Series.
  14. Data visualization: the plot method is the gateway to a treasure trove of visualizations such as histograms, bar charts, line plots, box plots, etc.
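
The tasks above can be sketched on a small, made-up Series containing one missing value, one duplicate and one outlier. The sample data, thresholds and variable names are assumptions chosen for illustration; plotting (item 14) is left out because it additionally requires matplotlib.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one missing value, one duplicate, one outlier.
s = pd.Series([3.0, 7.0, np.nan, 7.0, 250.0, 1.0])

first_two = s.head(2)                    # 1. peek at the first records
as_text = s.astype(str)                  # 2. type conversion
clipped = s.clip(lower=2, upper=100)     # 3. cap outliers at the thresholds
replaced = s.replace({250.0: 25.0})      # 4. swap values via a dictionary
dup_mask = s.duplicated()                # 5. True for repeats after the first
deduped = s.drop_duplicates()
n_missing = s.isnull().sum()             # 6. count the missing values
filled = s.fillna(s.median())            # 7. impute with the median
dropped = s.dropna()                     #    or drop missing values entirely

distinct = filled.nunique()              # 8. number of distinct values
freq = filled.value_counts()             #    frequency table
top_label = filled.idxmax()              # 9. index label of the largest value
top_two = filled.nlargest(2)             #    two largest values with labels
in_list = filled.isin([1.0, 3.0])        # 10. boolean membership mask
subset = filled[in_list]                 #     use the mask to subset
by_value = filled.sort_values()          # 11. sort by values (returns a copy)
avg = filled.mean()                      # 12. basic statistics
summary = filled.describe()
doubled = filled.map(lambda x: x * 2)    # 13. apply a function element-wise

print(clipped.tolist())   # [3.0, 7.0, nan, 7.0, 100.0, 2.0]
print(dup_mask.tolist())  # [False, False, False, True, False, False]
print(filled.tolist())    # [3.0, 7.0, 7.0, 7.0, 250.0, 1.0]
print(top_label)          # 4
print(subset.tolist())    # [3.0, 1.0]
```

Each method here returns a new object and leaves s unchanged, which makes it easy to compare the intermediate results side by side.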

To be continued…
