Good practices in Pandas dataframes
A collection of examples of good practices in Pandas DataFrame analysis. The examples are reconstructed from the Pandas online documentation and from web searches.
Create new columns in a dataframe
There are two ways to create a new column in a Pandas dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.arange(10)})
# - create a new column with direct assignment (the practice to avoid: it modifies df in place)
df['B'] = df['A']**2
With the assign function of Pandas, we can create one or several columns in a single call, and later columns can refer to ones created earlier in the same call.
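df.assign(
    # each lambda receives the whole DataFrame, not a single row
    alfa_squared = lambda d: d['A']**2,
    alfa_halfed = lambda d: d['A']/2,
    # a column created earlier in the same call can be reused
    alfa_sq_halfed = lambda d: d['alfa_squared']/2
)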
| | A | B | alfa_squared | alfa_halfed | alfa_sq_halfed |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0.0 | 0.0 |
1 | 1 | 1 | 1 | 0.5 | 0.5 |
2 | 2 | 4 | 4 | 1.0 | 2.0 |
3 | 3 | 9 | 9 | 1.5 | 4.5 |
4 | 4 | 16 | 16 | 2.0 | 8.0 |
5 | 5 | 25 | 25 | 2.5 | 12.5 |
6 | 6 | 36 | 36 | 3.0 | 18.0 |
7 | 7 | 49 | 49 | 3.5 | 24.5 |
8 | 8 | 64 | 64 | 4.0 | 32.0 |
9 | 9 | 81 | 81 | 4.5 | 40.5 |
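One difference between the two approaches is worth checking by hand: assign returns a new DataFrame and leaves the original untouched, whereas direct assignment changes it in place. The following is a minimal sketch of that behaviour; the names `small` and `extended` are illustrative and not part of the examples above.

```python
import pandas as pd

# small illustrative frame (hypothetical data, not the df used above)
small = pd.DataFrame({'A': [0, 1, 2, 3]})

# assign returns a new DataFrame; `small` itself is not modified
extended = small.assign(A_squared=lambda d: d['A']**2)

print(list(small.columns))      # columns of the original frame
print(list(extended.columns))   # columns of the frame returned by assign
```

To keep the new column, the result has to be bound to a name, as done here with `extended`.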
The creation of the new column can be chained with a row selection:
df.assign(
    beta_squared = lambda d: d['B']**2
).loc[lambda d: d['A'] > 3]
| | A | B | beta_squared |
---|---|---|---|
4 | 4 | 16 | 256 |
5 | 5 | 25 | 625 |
6 | 6 | 36 | 1296 |
7 | 7 | 49 | 2401 |
8 | 8 | 64 | 4096 |
9 | 9 | 81 | 6561 |
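The callable passed to .loc is evaluated on the frame returned by assign, so the selection can also use the column that was just created. A minimal sketch, assuming a small illustrative frame named `frame` (hypothetical, not the df used above):

```python
import pandas as pd

frame = pd.DataFrame({'A': range(10)})

result = (
    frame
    .assign(A_squared=lambda d: d['A']**2)
    # the callable sees the frame produced by assign,
    # so the new column can be used in the condition
    .loc[lambda d: d['A_squared'] > 20]
)

print(result)
```

Writing `frame['A_squared'] > 20` instead of the callable would fail here, because `frame` itself never receives the new column.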
Using built-in Pandas functions in aggregation
import numpy as np
data = np.random.randint(0, 1000, size=300)
df = pd.DataFrame(data, columns=['data'])
df = df.assign(
    # bin each value into its group of ten: 0-9 -> 0, 10-19 -> 1, ...
    decade = lambda d: (d['data']/10).astype(int)
)
df.groupby('decade').mean()
decade | data |
---|---|
0 | 7.600000 |
1 | 14.750000 |
2 | 22.200000 |
3 | 36.500000 |
4 | 49.000000 |
... | ... |
95 | 954.333333 |
96 | 964.166667 |
97 | 974.500000 |
98 | 980.333333 |
99 | 995.750000 |
98 rows × 1 columns
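The decade column is obtained by dividing by 10 and truncating to an integer. For non-negative integers, as in this example, integer floor division should give the same bins. A minimal sketch on a few hand-picked values (the frame `values` is an assumption made for illustration):

```python
import pandas as pd

values = pd.DataFrame({'data': [3, 17, 42, 999]})

# divide, then truncate toward zero (the form used above)
by_cast = (values['data'] / 10).astype(int)
# integer floor division, equivalent for non-negative integers
by_floor = values['data'] // 10

print(by_cast.tolist())
print(by_floor.tolist())
```

The two forms differ for negative values, where truncation rounds toward zero and floor division rounds down.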
df = df.groupby('decade').agg(
    decade_mean = pd.NamedAgg(column='data', aggfunc='mean'),
    # standard deviation requested by name ...
    decade_std = pd.NamedAgg(column='data', aggfunc='std'),
    # ... and by passing the NumPy function, for comparison
    decade_npstd = pd.NamedAgg(column='data', aggfunc=np.std)
)
df
decade | decade_mean | decade_std | decade_npstd |
---|---|---|---|
0 | 7.600000 | 1.140175 | 1.140175 |
1 | 14.750000 | 3.774917 | 3.774917 |
2 | 22.200000 | 2.167948 | 2.167948 |
3 | 36.500000 | 4.358899 | 4.358899 |
4 | 49.000000 | NaN | NaN |
... | ... | ... | ... |
95 | 954.333333 | 2.081666 | 2.081666 |
96 | 964.166667 | 3.125167 | 3.125167 |
97 | 974.500000 | 0.707107 | 0.707107 |
98 | 980.333333 | 0.577350 | 0.577350 |
99 | 995.750000 | 3.201562 | 3.201562 |
98 rows × 3 columns
xpd = df['decade_std'].values
xnp = df['decade_npstd'].values
xnp-xpd
array([ 0., 0., 0., 0., nan, 0., 0., nan, 0., 0., 0., 0., 0.,
0., nan, 0., 0., nan, 0., nan, 0., 0., 0., 0., 0., 0.,
0., nan, 0., 0., 0., 0., nan, 0., 0., 0., nan, 0., 0.,
0., 0., 0., 0., nan, 0., 0., 0., 0., 0., 0., nan, 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., nan, 0.,
0., 0., nan, 0., 0., 0., 0., 0., nan, 0., nan, 0., 0.,
0., 0., 0., nan, 0., 0., nan, 0., 0., 0., 0., 0., 0.,
nan, 0., 0., 0., 0., 0., 0.])
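The differences are all zero or NaN, which suggests that both columns were computed with the same convention. The Pandas std uses the sample standard deviation (ddof=1), which is undefined for a decade containing a single value, hence the NaN entries. Calling np.std on a raw NumPy array defaults to ddof=0 and gives different numbers. The following sketch illustrates this on a small hand-made frame; the names `sample`, `pandas_std` and `numpy_population` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# group 'a' has three values, group 'b' has a single value
sample = pd.DataFrame({'group': ['a', 'a', 'a', 'b'],
                       'data': [1.0, 2.0, 4.0, 5.0]})

grouped = sample.groupby('group')['data']

# Pandas default: sample standard deviation (ddof=1)
pandas_std = grouped.std()

# np.std applied to the underlying NumPy array: population std (ddof=0)
numpy_population = grouped.agg(lambda s: np.std(s.to_numpy()))

print(pandas_std)
print(numpy_population)
```

For the single-value group, the ddof=1 result is NaN while the ddof=0 result is 0.0, so the choice of convention matters when groups are small.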