Manipulating Data with Pandas — Part 2
If you haven’t read Manipulating Data with Pandas — part 1, I highly recommend it prior to reading this.
Here, we’ll continue with aggregating and grouping with Pandas. We’ll use the Planets dataset available in seaborn as an example. The dataset consists of extrasolar planets, planets that are discovered around other stars. We’ll start by loading the dataset and for now, we’ll drop the null values and use .describe()
to obtain our descriptive statistics.
This is a useful way to start looking at your dataset. You can see the first exoplanet (short for extrasolar planet) was discovered in 1989 and the majority was discovered after 2010, about 50%.
We can check below using pandas value_counts(normalize=True)
method, normalize=True
uses percentage vs. count. This is a simple aggregation technique that generally happens towards the very beginning of the process. More often than not, we'll need to dive deeper into segmentation and data aggregation, and this is where Pandas groupby
comes in handy.
GroupBy
Generally, we prefer to aggregate conditionally with specific label or index. GroupBy consists of a split, apply, and combine step.
- Split: breaking up and splitting a DataFrame based on a specified value
- Apply: computes a function or some sort of aggregation within a specified column or group
- Combine: merges the result into an array
We’ll try a simple groupby aggregation using .sum()
. Most Pandas and NumPy’s functions can be applied to groupby aggregation.
Some Functionalities of GroupBy
Column indexing
We’ll use groupby on column method
with the object orbital_period
and call the .mean()
on the object. It’ll give and idea of orbital periods in days that each method
is sensitive to.
Iteration over groups
This can be useful for doing things manually. But the apply
function can speed things up much faster. Which we’ll see later below.
Dispatch methods
You can use methods like describe()
and value_counts()
on groupby objects to perform a set of aggregations.
A quick observation from above, we can see that the majority of planets were discovered by using the Radial Velocity and Transit methods. Also, we realize the newest method Transit Timing Variations and Orbital Brightness Modulation which was introduced in 2011, is not as effective as the two major methods.
The utility of dispatch method is applied to each individual group, and any valid DataFrame method can be used in tandem with groupby.
GroupBy: aggregate, apply, filter, transform
GroupBy objects have methods that efficiently implement operations prior to combining the grouped data.
Aggregation
The aggregate()
method can take a string, function, or list and compute all the aggregates at once. We'll use planets dataset and randomly generated tech stock price (not current price) as an example.
Apply
The apply()
method lets you apply functions to the group results. Below is an example where our function normalizes the first column by the sum of the second.
Another method is to use a dictionary that maps index values to the desired group.
You can also pass Python functions that will input the index value and output the group.
Filtering
Filtering allows select data based on group properties. For example, we want to select groups with standard deviation over a certain amount.
Because AMZN
and NFLX
has a standard deviation greater than 2, is was not included in the group selection.
Transformation
This method transform the data into a transformed version of the original data.. The output is the same shape as the input. A common example is to center the data by subtracting the group-wise mean.
Bringing it together
We can use a couple line of code to compile and count discovered planets by methods and decade.
This display the power of combining operations that we’ve talked about, used in a realistic dataset.