Manipulating Data with Pandas — Part 2

Bram Tunggala
4 min readOct 16, 2019

--

If you haven’t read Manipulating Data with Pandas — part 1, I highly recommend it prior to reading this.

Here, we’ll continue with aggregating and grouping with Pandas. We’ll use the Planets dataset available in seaborn as an example. The dataset consists of extrasolar planets, planets that are discovered around other stars. We’ll start by loading the dataset and for now, we’ll drop the null values and use .describe() to obtain our descriptive statistics.

This is a useful way to start looking at your dataset. You can see the first exoplanet (short for extrasolar planet) was discovered in 1989 and the majority was discovered after 2010, about 50%.

We can check below using pandas value_counts(normalize=True) method, normalize=True uses percentage vs. count. This is a simple aggregation technique that generally happens towards the very beginning of the process. More often than not, we'll need to dive deeper into segmentation and data aggregation, and this is where Pandas groupby comes in handy.

GroupBy

Generally, we prefer to aggregate conditionally with specific label or index. GroupBy consists of a split, apply, and combine step.

  • Split: breaking up and splitting a DataFrame based on a specified value
  • Apply: computes a function or some sort of aggregation within a specified column or group
  • Combine: merges the result into an array

We’ll try a simple groupby aggregation using .sum(). Most Pandas and NumPy’s functions can be applied to groupby aggregation.

Some Functionalities of GroupBy

Column indexing

We’ll use groupby on column method with the object orbital_period and call the .mean() on the object. It’ll give and idea of orbital periods in days that each method is sensitive to.

Iteration over groups

This can be useful for doing things manually. But the apply function can speed things up much faster. Which we’ll see later below.

Dispatch methods

You can use methods like describe() and value_counts() on groupby objects to perform a set of aggregations.

A quick observation from above, we can see that the majority of planets were discovered by using the Radial Velocity and Transit methods. Also, we realize the newest method Transit Timing Variations and Orbital Brightness Modulation which was introduced in 2011, is not as effective as the two major methods.

The utility of dispatch method is applied to each individual group, and any valid DataFrame method can be used in tandem with groupby.

GroupBy: aggregate, apply, filter, transform

GroupBy objects have methods that efficiently implement operations prior to combining the grouped data.

Aggregation

The aggregate() method can take a string, function, or list and compute all the aggregates at once. We'll use planets dataset and randomly generated tech stock price (not current price) as an example.

Planets dataset

Apply

The apply() method lets you apply functions to the group results. Below is an example where our function normalizes the first column by the sum of the second.

using sum method to aggregate the sum of each Ticker per column

Another method is to use a dictionary that maps index values to the desired group.

You can also pass Python functions that will input the index value and output the group.

Filtering

Filtering allows select data based on group properties. For example, we want to select groups with standard deviation over a certain amount.

Because AMZN and NFLX has a standard deviation greater than 2, is was not included in the group selection.

Transformation

This method transform the data into a transformed version of the original data.. The output is the same shape as the input. A common example is to center the data by subtracting the group-wise mean.

Bringing it together

We can use a couple line of code to compile and count discovered planets by methods and decade.

This display the power of combining operations that we’ve talked about, used in a realistic dataset.

--

--

No responses yet