Skip to content

Panda Dataframes - Tips

A collection of tips and code examples for analysis with panda Dataframes. I did not invent them; typically I faced the problem in my work, searched the web and found hints or similar examples.

General functions

In the following df is a pandas DataFrame.

Apply multiple functions to a vector column data

Column A in the dataframe is a vector quantity, and we want to calculate its mean, std, etc... per row.

```
def vecProc(df, c):
    return df[c].apply(lambda r: pd.Series({f'{c}-mean':r.mean(),
                                            f'{c}-min':r.min(),
                                            f'{c}-max':r.max(),
                                            f'{c}-std':r.std(),
                                            f'{c}-sum':r.sum()}))

df = pd.DataFrame({'A':[np.random.rand(100) for x in np.arange(0,10)],'B':np.random.rand(10)})

dfnew = vecProc(df, 'A')
dfnew
df.join(dfnew)

```

Working with timestamp Dataframes

In the following we assume df is a panda DataFrame that has timestamps in a colum or as index, and it is ordered.

Get the raw data to a given timestamp value

Return the raw corresponding ot the input timestamp value. The function uses the pd.get_loc command of Pandas.

``` 
def assignBMode(df, tsel):
    ''' Assign the BMODE for the time tsel - get the PREVIOUS nearest record '''
    df = df.set_index('timestamp')
    s_index = df.index.unique()[df.index.unique().get_loc(tsel, method='pad')]
    s = df.loc[s_index]
    # print (s, '\nFillNo = ', s.FILLN)
    # print (f' >>> tsel={tsel}, {s_index}, {tsel-s_index}')
    return s
```

Good practices in pandas dataframes

Tips on how to best profit of the pandas library for data analysis. good practices

Filter Dataframe with multiple string conditions in column

Select dataframe rows that do not contain the strings in filter_values list.

filter_values = ['drift', 'marker']
_aux[_aux.keyword.str.contains('|'.join(filter_values)) == False]

Pickle files and Panda versions

In our analysis we structure the data in Panda DataFrames that we store in pickle files. However between lxplus, SWAN and desktop computer we often have different versions of python and pandas that creates problem to read back the files.

One solution of course is to upgrade pandas! Alternatively the code below from [here] (https://stackoverflow.com/questions/54665527/importerror-no-module-named-pandas-core-internals-managers-pandas-core-inte) gives a solution to read pickle files created with Pandas 0.24.x in Pandas 0.23.x (the one presently of NXCALS SWAN stack):

```
from pandas.compat.pickle_compat import _class_locations_map

_class_locations_map.update({
    ('pandas.core.internals.managers', 'BlockManager'): ('pandas.core.internals', 'BlockManager')
})
```

Powerful inliners!!

Handy commands to do data analysis with Panda DataFrames. Try them!

```
DFy = pd.DataFrame()   # a dataframe
DFy.nlargest(n, "Column A", keep="all")   # keep in 'first', 'last', 'all'
DFy.sort_values("Column A", ascending=False)
DFy.nsmallest(n, "Column A", keep="all")

_aux = DFy[DFy.duplicated(subset=["Col A", "Col B"], keep=False)]
```

and combined with some graphics

_aux = pd.crosstab(DFY["A"], DFy]"B"])
sns.heatmap(_aux, annot=True)
sns.heatmap(_aux, annot=True, fmt='g')

_aux = pd.pivot_table(DFy, index=["Column A"],
                    columns=["Column B"],
                    aggfunc='size',
                    fill_value=0)
sns.heatmap(_aux, annot=True, fmt='g')

Pivot, melt, stack, unstack in Pandas

Set of useful methods to manipulate datasets like excel worksheets.

  • pivot : reshape a dataframe based on index, columns, and values from the DF columns
  • melt : the invert of pivot !
  • stack : basically the same as pivot but works with multi-index case
  • unstack : same as melt but again works with multi-index case

Look at the pandas documentation for examples

Dataframe manipulation

Some useful methods to consider

  • assign : assign new columns to the DF
  • transform : apply function to the whole DF, and
  • groupby.transform : apply function to all groupped elements
  • styler : use the Styler class combined with the apply and applymap to display coloured tables of the DF