Data Science

Pandas Data Structures: DataFrame

Satyabrata Sahoo

Nov 20, 2021 — 3 min read

A DataFrame is a two-dimensional, potentially heterogeneous, table-like data structure composed of rows and columns. Each column (or variable) of a DataFrame is a pandas Series. It differs fundamentally from a NumPy 2D array in that each column can have a different dtype. The columns in a DataFrame must all be the same length but can hold different data types such as object, float, int, bool, etc. A DataFrame has both a row index and a column index, which support fast lookups, data alignment, and joins. It is operationally similar to the R data.frame and comes with many methods that make common tasks such as merging and plotting concise.
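As a small illustration of these points, the sketch below builds a DataFrame and checks that each column is a Series with its own dtype. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical sample data: four columns of equal length but different types
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen"],    # strings
    "age": [31, 25, 42],                # integers
    "score": [88.5, 92.0, 79.25],       # floats
    "passed": [True, True, False],      # booleans
})

# Selecting a single column returns a pandas Series
age_column = df["age"]
print(type(age_column))

# Each column keeps its own dtype
print(df.dtypes)

# A DataFrame carries both a row index and a column index
print(df.index)
print(df.columns)
```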

Creating a DataFrame:

A DataFrame can be created by loading a dataset from storage such as a SQL database, a CSV file, an Excel file, or an HTML file. It can also be built from in-memory objects such as lists, a dictionary, or a list of dictionaries. Below are some of the ways to create a DataFrame:

1. Creating a DataFrame from a NumPy 2D array:

Syntax: DataFrame(data=, index=, columns=)

As discussed previously for Series, if the index and columns parameters are not specified, default integer sequences running from 0 to N-1 are used.
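A minimal sketch of this constructor follows, first with the default labels and then with explicit ones. The array values and the labels "r1", "a", and so on are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 array of values
data = np.array([[1, 2], [3, 4], [5, 6]])

# No index or columns given: both default to 0..N-1
df_default = pd.DataFrame(data=data)

# Explicit row labels and column names (made up for this example)
df_labeled = pd.DataFrame(data=data, index=["r1", "r2", "r3"], columns=["a", "b"])

print(df_default)
print(df_labeled)
```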

2. Creating a pandas DataFrame from a dictionary of arrays or lists:

The simplest way to create a DataFrame is from a Python dictionary of arrays or lists of the same length. The dictionary's keys become the column names, and a list of labels can be passed to be used as the index. As with Series, if you request a column (through the columns parameter) that isn't contained in the data, it appears in the result filled with NaN values.
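The following sketch shows both behaviors: keys becoming column names, and a requested column that is missing from the data being filled with NaN. The dictionary contents and the extra column name "stock" are hypothetical.

```python
import pandas as pd

# Hypothetical dictionary of equal-length lists
data = {
    "product": ["pen", "book", "bag"],
    "price": [10.0, 150.0, 499.0],
}

# Keys become column names; a list of labels is used as the row index
df = pd.DataFrame(data, index=["a", "b", "c"])

# "stock" is not a key in data, so that column is filled with NaN
df_extra = pd.DataFrame(data, columns=["product", "price", "stock"], index=["a", "b", "c"])

print(df)
print(df_extra)
```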

3. Reading External Data into Pandas:

As we have seen, data can be classified as structured, semi-structured, or unstructured.

Structured data is data that complies with a pre-defined data model and is therefore straightforward to analyze. It has a tabular structure, with relationships between the rows and columns. Common examples of structured data are CSV and flat files, Excel files, databases (SQL Server, Oracle, PostgreSQL, Teradata, etc.), and HDFS (Hadoop Distributed File System). Each of these sources has structured rows and columns that can be sorted, processed, and accessed.

Semi-structured data is a form of structured data that doesn't conform to the formal structure of the data models associated with relational databases or other kinds of data tables, but still contains tags or other markers that separate semantic elements and enforce hierarchies of records and fields within the data. For this reason, it is also referred to as a self-describing structure. Typical examples of semi-structured data include JSON and XML files.

Unstructured data is data that is not organized in a pre-defined manner. It is typically text-heavy. Some examples are text data, images, and audio and video files.

We can also import data from APIs such as those of Twitter and Facebook, or by scraping data from website URLs.

a) Read a CSV or Flat file:

CSV files use a delimiter and, optionally, an enclosing character. The delimiter separates the data fields. It is usually a comma, but it can also be a tab, a pipe, or any other single character. An enclosing character appears at the beginning and the end of a value. It is sometimes called a quote character and is usually a double quote, although another character can be used instead. A flat file is a type of database that stores data in a plain text format. We read a CSV file using the read_csv() function.

Syntax: read_csv(filepath_or_buffer, <arguments>)
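To make the roles of the delimiter and the enclosing character concrete, here is a minimal sketch that reads pipe-delimited text in which single quotes enclose a value containing the delimiter itself. The text is an in-memory string invented for the example (read through io.StringIO rather than a real file), and quotechar is the read_csv parameter that sets the enclosing character.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited text; single quotes enclose values,
# so the "|" inside the first comment is not treated as a separator
raw = "id|comment\n1|'fast | reliable'\n2|'ok'\n"

df = pd.read_csv(io.StringIO(raw), sep="|", quotechar="'")
print(df)
```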

Important arguments available:

o sep - Separator used in the file. By default it is a comma ( , ).

o delimiter - Alternative name for sep.

o header - Row number(s) to use as the column names and the start of the data. If header=None, the file is treated as having no header row, and the columns get default integer names unless names is given.

o skiprows - Number of lines to skip at the start of the file.

o names - List of column names to use.

o nrows - int, default is None. Number of rows of the file to read. Useful for reading pieces of large files.

o parse_dates - False by default. Used to parse date columns, for example by passing a list of the column names to parse as dates.
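The sketch below combines several of these arguments in one call. The semicolon-separated text, the junk first line, and the column names are all made up for illustration, and io.StringIO stands in for a file path.

```python
import io
import pandas as pd

# Hypothetical file contents: one junk line, then three data rows without a header
raw = (
    "exported by some tool\n"
    "2021-11-01;10;1.5\n"
    "2021-11-02;20;2.5\n"
    "2021-11-03;30;3.5\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                            # fields are separated by semicolons
    skiprows=1,                         # skip the junk line at the top
    header=None,                        # the data has no header row
    names=["date", "units", "price"],   # supply column names ourselves
    nrows=2,                            # read only the first two data rows
    parse_dates=["date"],               # parse the "date" column as datetimes
)

print(df)
print(df.dtypes)
```

Reading only a few rows with nrows is a convenient way to inspect the layout of a large file before loading it in full.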