Good practices in Pandas dataframes

A collection of examples of good practices in Pandas DataFrame analysis. The examples are reconstructed from the Pandas online documentation and from web searches.

Create new columns in a dataframe

There are two common ways to create a new column in a Pandas dataframe: direct assignment and the assign function.

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': np.arange(0, 10)})

# - create a new column with direct assignment. The bad practice: it modifies df in place
df['B'] = df['A']**2

With the assign function of Pandas, we can create a single column or several columns at once, and even use the new columns directly in later definitions within the same call.

# each lambda receives the DataFrame built so far, not a single row
df.assign(
    alfa_squared=lambda d: d['A']**2,
    alfa_halfed=lambda d: d['A']/2,
    alfa_sq_halfed=lambda d: d['alfa_squared']/2
    )
   A   B  alfa_squared  alfa_halfed  alfa_sq_halfed
0  0   0             0          0.0             0.0
1  1   1             1          0.5             0.5
2  2   4             4          1.0             2.0
3  3   9             9          1.5             4.5
4  4  16            16          2.0             8.0
5  5  25            25          2.5            12.5
6  6  36            36          3.0            18.0
7  7  49            49          3.5            24.5
8  8  64            64          4.0            32.0
9  9  81            81          4.5            40.5
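
One reason assign is often preferred over direct assignment is that it returns a new dataframe and leaves the original untouched. The following is a minimal sketch of that behaviour, using a fresh dataframe with only column A; the name df_new is introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(0, 10)})

# assign returns a new DataFrame; df itself is not modified
df_new = df.assign(alfa_squared=lambda d: d['A'] ** 2)

print(list(df.columns))      # columns of the original dataframe
print(list(df_new.columns))  # columns of the dataframe returned by assign
```

To keep the new column, the result has to be bound to a name (for example df = df.assign(...)).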

The creation of the new column can be combined with a selection on the original df.

df.assign(
    beta_squared=lambda d: d['B']**2
    ).loc[lambda d: d['A'] > 3]
   A   B  beta_squared
4  4  16           256
5  5  25           625
6  6  36          1296
7  7  49          2401
8  8  64          4096
9  9  81          6561
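
Because the callable passed to .loc receives the dataframe produced by the preceding assign, the selection can also use a column created earlier in the same chain. This is a small sketch under that assumption; the column name A_squared and the threshold 20 are hypothetical choices for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(0, 10)})

# The .loc callable sees the output of assign, so it can filter
# on the column that was just created.
result = (
    df.assign(A_squared=lambda d: d['A'] ** 2)
      .loc[lambda d: d['A_squared'] > 20]
)
print(result)
```

Writing the chain inside parentheses keeps each step on its own line without intermediate variables.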

Using internal Pandas functions in aggregation

import numpy as np

data = np.random.randint(0, 1000, size=300)
df = pd.DataFrame(data, columns=['data'])
df = df.assign(
    decade=lambda d: (d['data']/10).astype(int)
)
df.groupby('decade').mean()
              data
decade
0         7.600000
1        14.750000
2        22.200000
3        36.500000
4        49.000000
...            ...
95      954.333333
96      964.166667
97      974.500000
98      980.333333
99      995.750000

98 rows × 1 columns

df = df.groupby('decade').agg(
    decade_mean=pd.NamedAgg(column='data', aggfunc='mean'),
    decade_std=pd.NamedAgg(column='data', aggfunc='std'),
    decade_npstd=pd.NamedAgg(column='data', aggfunc=np.std)
)
df
        decade_mean  decade_std  decade_npstd
decade
0          7.600000    1.140175      1.140175
1         14.750000    3.774917      3.774917
2         22.200000    2.167948      2.167948
3         36.500000    4.358899      4.358899
4         49.000000         NaN           NaN
...             ...         ...           ...
95       954.333333    2.081666      2.081666
96       964.166667    3.125167      3.125167
97       974.500000    0.707107      0.707107
98       980.333333    0.577350      0.577350
99       995.750000    3.201562      3.201562

98 rows × 3 columns
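
Named aggregation also accepts plain (column, aggfunc) tuples in place of pd.NamedAgg, which is shorter to write. The sketch below uses a small fixed dataframe instead of random data so the result is reproducible; the values and the output name decade_count are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({'data': [3, 7, 12, 18, 25]})
df = df.assign(decade=lambda d: d['data'] // 10)

# keyword = output column name, value = (input column, aggregation)
summary = df.groupby('decade').agg(
    decade_mean=('data', 'mean'),
    decade_count=('data', 'count'),
)
print(summary)
```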

xpd = df['decade_std'].values
xnp = df['decade_npstd'].values
xnp-xpd
array([ 0.,  0.,  0.,  0., nan,  0.,  0., nan,  0.,  0.,  0.,  0.,  0.,
        0., nan,  0.,  0., nan,  0., nan,  0.,  0.,  0.,  0.,  0.,  0.,
        0., nan,  0.,  0.,  0.,  0., nan,  0.,  0.,  0., nan,  0.,  0.,
        0.,  0.,  0.,  0., nan,  0.,  0.,  0.,  0.,  0.,  0., nan,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., nan,  0.,
        0.,  0., nan,  0.,  0.,  0.,  0.,  0., nan,  0., nan,  0.,  0.,
        0.,  0.,  0., nan,  0.,  0., nan,  0.,  0.,  0.,  0.,  0.,  0.,
       nan,  0.,  0.,  0.,  0.,  0.,  0.])
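
The nan entries most likely come from decades that contain a single value: the Pandas std uses ddof=1 by default, and the sample standard deviation is undefined for one observation. The matching decade_std and decade_npstd columns also suggest that, in the Pandas version used above, np.std was mapped to the built-in std; that behaviour may differ between versions. The following sketch therefore states ddof explicitly instead of relying on it. The data and the column names std_sample and std_population are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({'data': [3, 7, 12, 18, 25]})
df = df.assign(decade=lambda d: d['data'] // 10)

stats = df.groupby('decade').agg(
    # built-in Pandas std: sample standard deviation (ddof=1)
    std_sample=('data', 'std'),
    # population standard deviation, with ddof=0 stated explicitly
    std_population=('data', lambda s: s.std(ddof=0)),
)
print(stats)
```

Decade 2 holds a single value here, so std_sample is NaN for it while std_population is 0.0.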