Panda Dataframes - Tips
A collection of tips and code examples for analysis with panda Dataframes. I did not invent them; typically I faced the problem in my work, searched the web and found hints or similar examples.
General functions
In the following df
is a pandas DataFrame.
Apply multiple functions to a vector column data
Column A in the dataframe is a vector quantity, and we want to calculate its mean, std, etc... per row.
```
def vecProc(df, c):
return df[c].apply(lambda r: pd.Series({f'{c}-mean':r.mean(),
f'{c}-min':r.min(),
f'{c}-max':r.max(),
f'{c}-std':r.std(),
f'{c}-sum':r.sum()}))
df = pd.DataFrame({'A':[np.random.rand(100) for x in np.arange(0,10)],'B':np.random.rand(10)})
dfnew = vecProc(df, 'A')
dfnew
df.join(dfnew)
```
Working with timestamp Dataframes
In the following we assume df
is a panda DataFrame that has timestamps in a colum or as index, and it is ordered.
Get the raw data to a given timestamp value
Return the raw corresponding ot the input timestamp value. The function uses the pd.get_loc command of Pandas.
```
def assignBMode(df, tsel):
''' Assign the BMODE for the time tsel - get the PREVIOUS nearest record '''
df = df.set_index('timestamp')
s_index = df.index.unique()[df.index.unique().get_loc(tsel, method='pad')]
s = df.loc[s_index]
# print (s, '\nFillNo = ', s.FILLN)
# print (f' >>> tsel={tsel}, {s_index}, {tsel-s_index}')
return s
```
Good practices in pandas dataframes
Tips on how to best profit of the pandas library for data analysis. good practices
Filter Dataframe with multiple string conditions in column
Select dataframe rows that do not contain the strings in filter_values list.
filter_values = ['drift', 'marker']
_aux[_aux.keyword.str.contains('|'.join(filter_values)) == False]
Pickle files and Panda versions
In our analysis we structure the data in Panda DataFrames that we store in pickle files. However between lxplus, SWAN and desktop computer we often have different versions of python and pandas that creates problem to read back the files.
One solution of course is to upgrade pandas! Alternatively the code below from [here] (https://stackoverflow.com/questions/54665527/importerror-no-module-named-pandas-core-internals-managers-pandas-core-inte) gives a solution to read pickle files created with Pandas 0.24.x in Pandas 0.23.x (the one presently of NXCALS SWAN stack):
```
from pandas.compat.pickle_compat import _class_locations_map
_class_locations_map.update({
('pandas.core.internals.managers', 'BlockManager'): ('pandas.core.internals', 'BlockManager')
})
```
Powerful inliners!!
Handy commands to do data analysis with Panda DataFrames. Try them!
```
DFy = pd.DataFrame() # a dataframe
DFy.nlargest(n, "Column A", keep="all") # keep in 'first', 'last', 'all'
DFy.sort_values("Column A", ascending=False)
DFy.nsmallest(n, "Column A", keep="all")
_aux = DFy[DFy.duplicated(subset=["Col A", "Col B"], keep=False)]
```
and combined with some graphics
_aux = pd.crosstab(DFY["A"], DFy]"B"])
sns.heatmap(_aux, annot=True)
sns.heatmap(_aux, annot=True, fmt='g')
_aux = pd.pivot_table(DFy, index=["Column A"],
columns=["Column B"],
aggfunc='size',
fill_value=0)
sns.heatmap(_aux, annot=True, fmt='g')
Pivot, melt, stack, unstack in Pandas
Set of useful methods to manipulate datasets like excel worksheets.
- pivot : reshape a dataframe based on index, columns, and values from the DF columns
- melt : the invert of pivot !
- stack : basically the same as pivot but works with multi-index case
- unstack : same as melt but again works with multi-index case
Look at the pandas documentation for examples
Dataframe manipulation
Some useful methods to consider
- assign : assign new columns to the DF
- transform : apply function to the whole DF, and
- groupby.transform : apply function to all groupped elements
- styler : use the Styler class combined with the apply and applymap to display coloured tables of the DF