Manipulating Data with Pandas — part 1
Pandas is an open-source library that provides data structures and data analysis tools for Python.
What problem does pandas solve?
It allows users to carry out their entire data analysis workflow in Python without having to switch to a more domain-specific language like R.
Some library highlights:
- A fast and efficient DataFrame object for data manipulation with integrated indexing;
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
- Flexible reshaping and pivoting of data sets;
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
- Columns can be inserted and deleted from data structures for size mutability;
- Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
- High performance merging and joining of data sets;
- Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- Time-series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. You can even create domain-specific time offsets and join time series without losing data;
- Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
First things first, import the pandas library.
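The import conventionally uses the alias pd, which the rest of this post relies on. A minimal sketch (printing the version is just an optional sanity check, not a required step):

```python
import pandas as pd

# Optional sanity check: confirm which pandas version is installed.
print(pd.__version__)
```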
Concatenation with pd.concat
By default, the concatenation takes place row-wise, i.e. axis=0. We can specify axis=1 to concatenate column-wise instead.
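To make the two axes concrete, here is a small sketch. The frames df1, df2, and df3 and their contents are hypothetical, made up purely for illustration:

```python
import pandas as pd

# Small illustrative frames (hypothetical data).
df1 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=[1, 2])
df2 = pd.DataFrame({"A": ["A3", "A4"], "B": ["B3", "B4"]}, index=[3, 4])
df3 = pd.DataFrame({"C": ["C1", "C2"], "D": ["D1", "D2"]}, index=[1, 2])

# Default axis=0: the rows of df2 are stacked below the rows of df1.
rows = pd.concat([df1, df2])

# axis=1: the frames are placed side by side, aligned on the index.
cols = pd.concat([df1, df3], axis=1)

print(rows)
print(cols)
```

Note that pd.concat takes a list of objects, so more than two frames can be combined in a single call.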
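The append() method
The append() method is a shorter version of pd.concat([df1, df2]): you can simply call df1.append(df2). Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions you should use pd.concat([df1, df2]) instead.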
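Since append() is not available in recent pandas releases, the sketch below shows the pd.concat form that gives the same result. The frames are hypothetical, and ignore_index=True is an optional extra for when you do not want to keep the original row labels:

```python
import pandas as pd

# Hypothetical frames for illustration.
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3, 4]})

# Equivalent of the old df1.append(df2): the original indices are kept,
# so the row labels repeat.
appended = pd.concat([df1, df2])

# ignore_index=True builds a fresh 0..n-1 index instead.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(appended)
print(renumbered)
```

Like append(), pd.concat returns a new object and leaves df1 and df2 unchanged.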
Merge and Join
Pandas offers high-performance, in-memory join and merge operations. The behavior of pd.merge() implements a subset of relational algebra, a formal set of rules for manipulating relational data.
Merge and join operations are often the most used when dealing with multiple datasets.
A quick look at the types of joins available (thanks to Stack Overflow): inner, left, right, and outer, selected through the how parameter of pd.merge().
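The join types can be compared side by side on two tiny frames. This is only a sketch: the frames left and right and the column names are hypothetical, chosen so that the keys overlap only partially:

```python
import pandas as pd

# Hypothetical frames whose keys overlap only on "b" and "c".
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# inner: only keys present in both frames.
inner = pd.merge(left, right, on="key", how="inner")

# left: every key from the left frame; missing matches become NaN.
left_join = pd.merge(left, right, on="key", how="left")

# right: every key from the right frame.
right_join = pd.merge(left, right, on="key", how="right")

# outer: the union of keys from both frames.
outer = pd.merge(left, right, on="key", how="outer")

for name, frame in [("inner", inner), ("left", left_join),
                    ("right", right_join), ("outer", outer)]:
    print(name)
    print(frame)
```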
Let’s say we want to rank US states and territories by their 2008 population density. Once we know we have the information to perform the merge, we’ll start with a many-to-one merge.
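A many-to-one merge could look roughly like the sketch below. The frames pop and abbrevs are tiny hypothetical stand-ins for the real datasets, the column names are assumptions, and the population numbers are made up rather than real figures:

```python
import pandas as pd

# Hypothetical stand-ins for the real datasets; the numbers are made up.
pop = pd.DataFrame({
    "state/region": ["AL", "AL", "AK", "PR"],
    "year": [2008, 2010, 2008, 2008],
    "population": [100, 110, 50, 80],
})
abbrevs = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "abbreviation": ["AL", "AK"],
})

# Many-to-one: many population rows map to a single abbreviation row.
# how="outer" keeps unmatched rows, so mismatches show up as NaN.
merged = pd.merge(pop, abbrevs, how="outer",
                  left_on="state/region", right_on="abbreviation")

# The abbreviation column duplicates state/region after the merge.
merged = merged.drop(columns="abbreviation")
print(merged)
```

Using an outer join here is a deliberate choice: rows that fail to match are kept rather than silently dropped, which makes mismatched labels easy to spot.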
There are no more null values in the state column. We’ve filled our null values appropriately.
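One way to find and fill such nulls is sketched here. The merged frame, its column names, and its values are hypothetical and only illustrate the pattern of locating unmatched regions and assigning their names by hand:

```python
import pandas as pd

# Hypothetical merged frame with unmatched regions (values are made up).
merged = pd.DataFrame({
    "state/region": ["AL", "PR", "USA"],
    "population": [100, 80, 5000],
    "state": ["Alabama", None, None],
})

# Which columns contain nulls, and which regions are affected?
print(merged.isnull().any())
print(merged.loc[merged["state"].isnull(), "state/region"].unique())

# Fill in the missing names by hand, then confirm no nulls remain.
merged.loc[merged["state/region"] == "PR", "state"] = "Puerto Rico"
merged.loc[merged["state/region"] == "USA", "state"] = "United States"
print(merged["state"].isnull().any())
```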
We notice that our data does not contain the area of the United States as a whole. We can insert the value by using the sum of all state areas, but in this case we’ll just drop the null values.
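Dropping the rows without an area can be sketched as follows. The frame final, the column name "area (sq. mi)", and all values are assumptions made up for illustration:

```python
import pandas as pd

# Hypothetical result after merging in the areas (values are made up).
final = pd.DataFrame({
    "state": ["Alabama", "Alaska", "United States"],
    "population": [100, 50, 5000],
    "area (sq. mi)": [10.0, 200.0, None],
})

# Show which entries lack an area before removing them.
print(final.loc[final["area (sq. mi)"].isnull(), "state"].unique())

# dropna() removes every row that contains at least one null value.
final = final.dropna()
print(final)
```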
We now select the data we need to answer our question: ranking US states and territories by their 2008 population density. Then we’ll compute the population density and sort it in order. We’ll start by indexing our data on the state column.
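The select, index, compute, and sort steps can be sketched like this. The frame final, its column names (including ages), and its numbers are hypothetical, and sorting in descending order is an assumption about the intended ranking:

```python
import pandas as pd

# Hypothetical cleaned frame; all values are made up for illustration.
final = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "ages": ["total", "under18", "total", "total"],
    "year": [2008, 2008, 2008, 2010],
    "population": [100, 30, 50, 60],
    "area (sq. mi)": [10.0, 10.0, 200.0, 200.0],
})

# Keep only the total population for 2008.
data2008 = final.query("year == 2008 and ages == 'total'")

# Index by state so the result is labeled by state name.
data2008 = data2008.set_index("state")

# Density is population divided by area; sort from densest to sparsest.
density = data2008["population"] / data2008["area (sq. mi)"]
density = density.sort_values(ascending=False)
print(density)
```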
The result shows the ranking of US states, plus D.C. and Puerto Rico, in order of their 2008 population density. Let’s check out the end of the list.
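Looking at both ends of a sorted Series is done with head() and tail(). The density values below are hypothetical placeholders, not real figures:

```python
import pandas as pd

# Hypothetical density values, sorted from highest to lowest.
density = pd.Series(
    {"A": 1.0, "B": 4.0, "C": 2.0, "D": 3.0}, name="density"
).sort_values(ascending=False)

# head(n) returns the first n entries, tail(n) the last n.
print(density.head(2))
print(density.tail(2))
```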
Merging data is a common task when answering questions posed by real datasets. This simple example provides an idea of how we can use pandas to gain insight from our data. I hope this helps! Shout out to Jake VanderPlas for writing the Python Data Science Handbook.