My Pandas Cheat Sheet

Links on parallelizing work on a pandas DataFrame with multiprocessing:

https://stackoverflow.com/questions/36794433/python-using-multiprocessing-on-a-pandas-dataframe
http://www.racketracer.com/2016/07/06/pandas-in-parallel/
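
Those links cover splitting a dataframe into row chunks and processing them in a process pool; a minimal sketch of that pattern (the worker function and chunk count here are illustrative assumptions, not from the links):

  import numpy as np
  import pandas as pd
  from multiprocessing import Pool

  def process_chunk(chunk):
      # hypothetical per-chunk worker: here just a per-row count of non-null values
      return chunk.count(axis=1)

  def parallelize(df, func, n_chunks=4):
      # split the dataframe into row chunks and map them over a process pool
      chunks = np.array_split(df, n_chunks)
      with Pool(n_chunks) as pool:
          return pd.concat(pool.map(func, chunks))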

  import pandas as pd

  pd.__version__

Creating Dataframes

  df = pd.DataFrame()
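
A small sketch of building a dataframe with data, e.g. from a dict of columns (the column names here are made up):

  df = pd.DataFrame({'name': ['a', 'b', 'c'],
                     'weights': [5000, 20000, 9000]})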

select multiple columns as a dataframe from a bigger dataframe:

  df2 = df[['Id', 'team', 'winPlacePerc']]

select a single column as a dataframe:

  df2 = df[['name']]
  # double square brackets return a dataframe,
  # a single pair returns a series

pandas axis:

  axis 1 = columns, axis 0 = rows
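
A quick sketch of what that means in practice (col_a and col_b are hypothetical numeric columns):

  df[['col_a', 'col_b']].sum(axis=0)   # one sum per column (collapses the rows)
  df[['col_a', 'col_b']].sum(axis=1)   # one sum per row (collapses the columns)
  df.drop(['col_a'], axis=1)           # axis=1: drops a column, not a row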

get a series from a dataframe column filtered by another column:

  zero_names = df[df["weights"] < 10000]["names"]

turn it into a list:

  zn_list = zero_names.tolist()

N-way split:

  from sklearn import model_selection

  X_train, X_test, y_train, y_test, r_train, r_test = \
      model_selection.train_test_split(X, y, r,
                                       test_size=0.25, random_state=99)

load only part of the data: pass nrows to read_csv
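
A minimal sketch, assuming the data file is train.csv:

  df = pd.read_csv("train.csv", nrows=10000)  # read only the first 10000 rows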

Accessing Values

find the row at positional index 1111 in the dataframe:

  df.iloc[1111]

find the row whose index label is 1111 in the dataframe:

  df.loc[1111]

set value in cell (stackoverflow):

  children_df.at[555, "visited"] = True  # 555 here refers to the row index label

modify rows while iterating over them:

  for index, row in df.iterrows():
    df.at[index,'open_change_p1'] = day_change_op1

drop last 5 columns

  df_train.drop(df_train.columns[-5:], axis=1)

print the data types in a multidimensional numpy array (in Jupyter the return value is printed automatically):

  [type(row) for row in values[0]]

Aggregates

calculate mean for ‘team’ and ‘winPlacePerc’ columns, after grouping them by match id and group id:

  agg_cols = ['groupId', 'matchId', 'team', 'winPlacePerc']
  df_mean = df_test[agg_cols].groupby(["groupId", "matchId"],
    as_index=False).mean().add_suffix("_mean")

run multiple such aggregations at once:

  agg = df_train.groupby(['matchId']).agg(
      {'players_in_team': ['min', 'max', 'mean', 'median']})

specify the name suffixes of the generated aggregation column names:

  agg_train = (df_train[agg_cols]
      .groupby(["groupId", "matchId"], as_index=False)
      .agg([('_min', 'min'), ('_max', 'max'), ('_mean', 'mean')]))

multiple aggregations will create a 2-level column header (see df.head()). to change it into one level (for merge etc):

  mi = agg_train.columns #mi for multi-index
  ind = pd.Index([e[0] + e[1] for e in mi.tolist()])
  agg_train.columns = ind

custom aggregation function:

  def q90(x):
      return x.quantile(0.9)

  agg = df_train.groupby(['matchId']).agg(
      {'players_in_team': ['min', 'max', 'mean', 'median', q90]})

create new column as number of rows in a group:

  agg = df_train.groupby(['matchId']).size().to_frame('players_in_match')

group and sum column for groups:

  revenues = (df_train.groupby("geoNetwork_subContinent")
      ['totals_transactionRevenue'].sum()
      .sort_values(ascending=False))

Merge

merge two dataframes (here df_train and agg) by a single column:

  df_train = df_train.merge(agg, how='left', on=['groupId'])

merge on multiple columns:

  dfg_train = dfg_train.merge(agg_train, how='left',
     on=["groupId", "matchId"])

set merge suffixes for overlapping columns: suffixes=["", "_rank"] applies to the left and right side respectively:

  dfg_test = dfg_test.merge(agg_test_rank, suffixes=["", "_rank"],
        how='left', on=["groupId", "matchId"])

above sets columns from dfg_test to have no suffix ("") and columns from agg_test_rank to have suffix "_rank"

merge columns by value:

  df_train['rankPoints'] = np.where(df_train['rankPoints'] < 0,
       df_train['winPoints'], df_train['rankPoints'])

the above sets the value to winPoints when rankPoints is below zero, otherwise keeps rankPoints

merge by defining the column names to match on left and right:

  pd.merge(left, right, left_on='key1', right_on='key2')

merge types (the how= argument in pd.merge); a small sketch follows the list:

  • inner: keep rows that match in both left and right
  • outer: keep all rows in both left and right
  • left: keep all rows from left and matching ones from right
  • right: keep all rows from right and matching ones from left
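
A minimal sketch of the difference, using two tiny throwaway dataframes (all names made up for illustration):

  left = pd.DataFrame({'key': [1, 2, 3], 'a': ['x', 'y', 'z']})
  right = pd.DataFrame({'key': [2, 3, 4], 'b': ['p', 'q', 'r']})

  left.merge(right, how='inner', on='key')   # keys 2, 3
  left.merge(right, how='outer', on='key')   # keys 1, 2, 3, 4
  left.merge(right, how='left', on='key')    # keys 1, 2, 3
  left.merge(right, how='right', on='key')   # keys 2, 3, 4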

Rename, Delete, Compare Dataframes

rename columns to remove a suffix (here remove _mean):

  df_mean = df_mean.rename(columns={'groupId_mean': 'groupId',
               'matchId_mean': 'matchId'}) 

delete dataframes to save memory:

  del df_agg

find all columns in one dataframe but not in another

  diff_set = set(train_df.columns).difference(set(test_df.columns))
  print(diff_set)

find all columns present in both dataframes and drop them:

  common_set = set(train_df.columns).intersection(set(FEATS_EXCLUDED))
  train_df = train_df.drop(common_set, axis=1)

drop rows with index in list (stackoverflow):

  df.drop(df.index[[1,3]])

replace nulls/nans with 0:

  X = X.fillna(0)
  X_test = X_test.fillna(0)

only for specific columns:

  df[['a', 'b']] = df[['a','b']].fillna(value=0)

drop rows with null values for specific column:

  df_train.dropna(subset=['winPlacePerc'], inplace=True)

drop columns B and C:

  df.drop(['B', 'C'], axis=1)
  df.drop(['B', 'C'], axis=1, inplace = True)

Dataframe Statistics

this df has 800k rows (values) and 999 columns (features):

  df_train_subset.shape
  (800000, 999)

data types:

  df_train.dtypes

number of rows, columns, memory size (light, fast):

  df.info()

statistics per feature/column (cpu/mem heavy):

  df.describe()

number of unique values:

  df.nunique()

bounds:

  df.min()
  df.max()

replace infinity with 0; scalers especially can have issues with inf:

  X[col] = X[col].replace(np.inf, 0)

replace positive and negative inf with nan:

  df_pct.replace([np.inf, -np.inf], np.nan)

number of non-nulls per row in dataframe:

  df_non_null = df.notnull().sum(axis=1)

find the actual rows with null values:

  df_train[df_train.isnull().any(axis=1)]

number of times different values appear in column:

  df_train["device_operatingSystem"].value_counts()

sorted by index (e.g., 0,1,2,…)

  phase_counts = df_tgt0["phase"].value_counts().sort_index()

number of nans per column in dataframe:

  df.isnull().sum()

find columns with only one unique value

  const_cols = [c for c in train_df.columns 
                if train_df[c].nunique(dropna=False)==1 ]

drop them

  train_df = train_df.drop(const_cols, axis=1)

replace outlier values more than 3 std from the mean with the 3-std upper bound:

  upper = (df_train['totals_transactionRevenue'].mean()
           + 3 * df_train['totals_transactionRevenue'].std())
  mask = (np.abs(df_train['totals_transactionRevenue']
                 - df_train['totals_transactionRevenue'].mean())
          > (3 * df_train['totals_transactionRevenue'].std()))
  df_train.loc[mask, 'totals_transactionRevenue'] = upper

or use zscore:

  df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)

from scipy (stackoverflow)

  from scipy import stats
  import numpy as np
  z = np.abs(stats.zscore(boston_df))
  print(z)

show groupby object data statistics for each column by grouped element:

  grouped.describe()

create a dataframe from the feature names and the classifier's feature importances (where supported), sorted by weight:

  df_feats = pd.DataFrame()
  df_feats["names"] = X.columns
  df_feats["weights"] = clf.feature_importances_
  df_feats.sort_values(by="weights")

lag columns to show absolute value they changed over rows:

  for col in input_col_names:
    df_diff[col] = df_sig[col].diff().abs()

unique datatypes in dataframe:

  train_df.dtypes.unique()

show columns of selected type:

  train_df.select_dtypes(include=['float64'])

Numpy

Get numpy matrix from dataframe:

  data_diffs = df_diff.values

get a column from a numpy matrix as a 1-D array:

  for sig in range(0, 3):
      data_sig = data_measure[:,sig]

slice numpy array:

  bin_data_raw = data_sig[i:i + bin_size]

calculate statistics over numpy array:

  bin_avg_raw = bin_data_raw.mean()
  bin_sum_raw = bin_data_raw.sum()
  bin_std_raw = bin_data_raw.std()
  bin_percentiles = np.percentile(bin_data_raw, [0, 1, 25, 50, 75, 99, 100])
  bin_range = bin_percentiles[-1] - bin_percentiles[0]
  bin_rel_perc = bin_percentiles - bin_avg_raw

count outliers at different scales:

  outliers = []
  for r in range(1, 7):
      t = bin_std_diff * r
      o_count = np.sum(np.abs(bin_data_diff-bin_avg_diff) >= t)
      outliers.append(o_count)

concatenate multiple numpy arrays into one:

  bin_row = np.concatenate([raw_features, diff_features, bin_percentiles, bin_rel_perc, outliers])

limit values between min/max:

  df_sig.clip(upper=127, lower=-127)

value range in column (ptp = peak to peak):

  df.groupby('GROUP')['VALUE'].agg(np.ptp)

replace nan:

  my_arr[np.isnan(my_arr)] = 0

Time-Series

create values for percentage changes over time (row to row):

  df_pct = pd.DataFrame()
  for col in df_train_subset.columns[:30]:
      df_pct['pct_chg_' + col] = df_train_subset[col].pct_change().abs()

pct_change() gives the percentage change from row to row; abs() makes it absolute, which is useful when only the size of the change matters, not whether it was negative or positive.

also possible to set number of rows to count pct_change over:

  df_pct['pct_chg_' + col] = df_train_subset[col].pct_change(periods=10)

pct_change on a set of items with specific values:

  df['pct_chg_open1'] = df.groupby('assetCode')['open'].apply(
      lambda x: x.pct_change())

TODO: try different ways to calculate ewm to see how this all works (stackoverflow: https://stackoverflow.com/questions/37924377/does-pandas-calculate-ewm-wrong):

  df["close_ewma_10"] = df.groupby('assetName')['pct_chg_close1']
            .apply(lambda x: x.ewm(span=10).mean())

shift by 10 backwards

  y = mt_df["mket_close_10"].shift(-10)

Date Types

convert seconds into datetime

  start_col = pd.to_datetime(df_train.visitStartTime, unit='s')

parse specific columns as specific date types:

  df_train = pd.read_csv("../input/train-flat.csv.gz",
                         parse_dates=["date"], 
                         dtype={'fullVisitorId': 'str',
                              'date': 'str'
                             },
                        )

or if need to parse specific format:

  pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')

access elements of datetime objects:

  data_df["weekday"] = data_df['date'].dt.weekday
  data_df["hour"] = data_df['date'].dt.hour

Data Type Manipulations

one-hot encode / convert categorical:

  dfg_train = pd.get_dummies( dfg_train, columns = ['matchType'] )
  dfg_test = pd.get_dummies( dfg_test, columns = ['matchType'] )

set category value by range (here 1 = solo game, 2 = duo game, 3 = squad, 4 = custom):

  df_train['team'] = [1 if i == 1 else 2 if i == 2 else 4 if i > 4 else 3
                      for i in df_train["players_in_team_q90"]]

calculate value over two columns and make it a new column:

  dfg_train['killsNorm'] = (dfg_train['kills_mean'] *
      ((100 - dfg_train['players_in_match']) / 100 + 1))

  data_df['hit_view_ratio'] = (data_df['totals_pageviews'] /
      data_df['totals_hits'])

mark a set of columns as category type:

  for col in X_cat_cols:
      df_train[col] = df_train[col].astype('category')

set category value based on set of values in column:

  X_test['matchType'] = X_test['matchType'].astype('str')
  X_test.loc[X_test['matchType'] == "squad-fpp", 'matchType'] = "squad"
  X_test['matchType'] = X_test['matchType'].astype('category')

how to get the indices from a dataframe as a list (e.g., collect a subset):

  list(outlier_collection.index.values)

to see it all on one line in Jupyter (easier to copy, e.g. to drop everything in the list):

  print(list(outlier_collection.index.values))

Jupyter

show dataframe as visual table in notebook (stackoverflow):

  from IPython.display import display, HTML

  display(df1)
  display(HTML(df2.to_html()))

PyCharm also supports notebooks; see the JetBrains help pages.

correlations, unstacking a correlation matrix (see the sketch below):
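
A minimal sketch of that, assuming train_df from above and keeping only numeric columns:

  corr = train_df.select_dtypes(include='number').corr()
  pairs = corr.unstack().sort_values(ascending=False)
  pairs = pairs[pairs < 1.0]   # drop the trivial self-correlations
  print(pairs.head(10))        # strongest positively correlated column pairs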

Memory Reducer (from a Kaggler):

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object and col_type.name != 'category':
            #print(col_type)
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df