import pandas as pd
pd.__version__
Creating Dataframes
df = pd.DataFrame()
select multiple columns as a dataframe from a bigger dataframe:
df2 = df[['Id', 'team', 'winPlacePerc']]
select a single column as a dataframe:
df2 = df[['name']]
#double square brackets make the results dataframe,
#single makes it series
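a quick check on toy data (made-up names) shows the type difference:
import pandas as pd
toy = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]})
type(toy[["name"]])   # pandas.core.frame.DataFrame
type(toy["name"])     # pandas.core.series.Series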
pandas axis:
axis 1 = columns, axis 0 = rows
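a small sketch (toy data, made-up column names) of what the axis argument does:
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
toy.sum(axis=0)        # sums down the rows -> one value per column
toy.sum(axis=1)        # sums across the columns -> one value per row
toy.drop("b", axis=1)  # axis=1 means "b" is a column label, not a row label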
get a series from a dataframe column filtered by another column:
zero_names = df[df["weights"] < 10000]["names"]
turn it into a list:
zn_list = zero_names.tolist()
N-way split (several arrays split with the same shuffle):
from sklearn import model_selection
X_train, X_test, y_train, y_test, r_train, r_test = \
    model_selection.train_test_split(X, y, r, test_size=0.25, random_state=99)
load only part of the data: pass nrows=N to pd.read_csv
Accessing Values
find the row at position 1111 (positional indexing):
df.iloc[1111]
find the row whose index label is 1111:
df.loc[1111]
set value in cell (stackoverflow):
children_df.at[555, "visited"] = True
#555 here refers to row index
modify rows while iterating over them:
for index, row in df.iterrows():
    df.at[index, 'open_change_p1'] = day_change_op1
drop last 5 columns
df_train.drop(df_train.columns[-5:], axis=1)
print numpy data types in multidimensional array (in jupyter return value is printed):
[type(row) for row in values[0]]
Aggregates
calculate mean for ‘team’ and ‘winPlacePerc’ columns, after grouping them by match id and group id:
agg_cols = ['groupId', 'matchId', 'team', 'winPlacePerc']
df_mean = df_test[agg_cols].groupby(["groupId", "matchId"],
as_index=False).mean().add_suffix("_mean")
run multiple such aggregations at once:
agg = df_train.groupby(['matchId']).agg({'players_in_team': ['min', 'max', 'mean', 'median']})
specify the name suffixes of the generated aggregation column names:
agg_train = df_train[agg_cols].groupby(["groupId", "matchId"], as_index=False).agg(
    [('_min', 'min'), ('_max', 'max'), ('_mean', 'mean')])
multiple aggregations create a 2-level column header (check with df.head()); to flatten it into one level (e.g. for a later merge):
mi = agg_train.columns #mi for multi-index
ind = pd.Index([e[0] + e[1] for e in mi.tolist()])
agg_train.columns = ind
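a self-contained sketch of the flattening trick on made-up data:
import pandas as pd
toy = pd.DataFrame({"matchId": [1, 1, 2], "kills": [0, 3, 5]})
agg = toy.groupby("matchId").agg({"kills": [("_min", "min"), ("_max", "max")]})
print(agg.columns.tolist())   # [('kills', '_min'), ('kills', '_max')]
agg.columns = pd.Index([e[0] + e[1] for e in agg.columns.tolist()])
print(agg.columns.tolist())   # ['kills_min', 'kills_max']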
custom aggregation function:
def q90(x):
    return x.quantile(0.9)

agg = df_train.groupby(['matchId']).agg(
    {'players_in_team': ['min', 'max', 'mean', 'median', q90]})
create new column as number of rows in a group:
agg = df_train.groupby(['matchId']).size().to_frame('players_in_match')
group and sum column for groups:
revenues = df_train.groupby("geoNetwork_subContinent")['totals_transactionRevenue'].sum().sort_values(ascending=False)
Merge
merge two dataframes (here df_train and agg) by a single column:
df_train = df_train.merge(agg, how='left', on=['groupId'])
merge on multiple columns:
dfg_train = dfg_train.merge(agg_train, how='left',
on=["groupId", "matchId"])
set suffixes for overlapping columns; the first entry applies to the left frame, the second to the right:
dfg_test = dfg_test.merge(agg_test_rank, suffixes=["", "_rank"],
how='left', on=["groupId", "matchId"])
the above keeps the overlapping columns from dfg_test unsuffixed ("") and appends "_rank" to the ones coming from agg_test_rank
combine two columns conditionally (np.where):
df_train['rankPoints'] = np.where(df_train['rankPoints'] < 0,
df_train['winPoints'], df_train['rankPoints'])
the above takes winPoints where rankPoints is below zero, otherwise keeps rankPoints
merge by defining the column names to match on left and right:
pd.merge(left, right, left_on='key1', right_on='key2')
merge types (the how= argument in pd.merge); a toy example follows the list:
- inner: keep rows that match in both left and right
- outer: keep all rows in both left and right
- left: keep all rows from left and matching ones from right
- right: keep all rows from right and matching ones from left
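a minimal sketch with made-up frames to see what each merge type keeps:
left = pd.DataFrame({"key": [1, 2, 3], "l_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "r_val": ["x", "y", "z"]})
pd.merge(left, right, on="key", how="inner")   # keys 2, 3
pd.merge(left, right, on="key", how="outer")   # keys 1, 2, 3, 4 (NaN where missing)
pd.merge(left, right, on="key", how="left")    # keys 1, 2, 3
pd.merge(left, right, on="key", how="right")   # keys 2, 3, 4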
Rename, Delete, Compare Dataframes
rename columns to remove a suffix (here remove _mean):
df_mean = df_mean.rename(columns={'groupId_mean': 'groupId',
'matchId_mean': 'matchId'})
delete dataframes to save memory:
del df_agg
find all columns in one dataframe but not in another
diff_set = set(train_df.columns).difference(set(test_df.columns))
print(diff_set)
find all columns that also appear in a given list (here FEATS_EXCLUDED) and drop them:
common_set = set(train_df.columns).intersection(set(FEATS_EXCLUDED))
train_df = train_df.drop(common_set, axis=1)
drop rows with index in list (stackoverflow):
df.drop(df.index[[1,3]])
replace nulls/NaNs with 0 (fillna returns a copy, so assign the result back):
X = X.fillna(0)
X_test = X_test.fillna(0)
only for specific columns:
df[['a', 'b']] = df[['a','b']].fillna(value=0)
drop rows with null values for specific column:
df_train.dropna(subset=['winPlacePerc'], inplace=True)
drop columns B and C:
df.drop(['B', 'C'], axis=1)
df.drop(['B', 'C'], axis=1, inplace=True)
Dataframe Statistics
this df has 800k rows (values) and 999 columns (features):
df_train_subset.shape
(800000, 999)
data types:
df_train.dtypes
number of rows, columns, memory size (light, fast):
df.info()
statistics per feature/column (cpu/mem heavy):
df.describe()
number of unique values:
df.nunique()
bounds:
df.min()
df.max()
replace infinity with 0 (scalers in particular can have issues with inf):
X[col] = X[col].replace(np.inf, 0)
replace positive and negative inf with nan:
df_pct = df_pct.replace([np.inf, -np.inf], np.nan)
number of non-nulls per row in dataframe:
df_non_null = df.notnull().sum(axis=1)
find the actual rows with null values:
df_train[df_train.isnull().any(axis=1)]
number of times different values appear in column:
df_train["device_operatingSystem"].value_counts()
sorted by index (e.g., 0, 1, 2, …):
phase_counts = df_tgt0["phase"].value_counts().sort_index()
number of nans per column in dataframe:
df.isnull().sum()
find columns with only one unique value:
const_cols = [c for c in train_df.columns
              if train_df[c].nunique(dropna=False) == 1]
drop them:
train_df = train_df.drop(const_cols, axis=1)
replace outlier values more than 3 standard deviations from the mean with the mean + 3*std upper bound:
upper = df_train['totals_transactionRevenue'].mean() + 3*df_train['totals_transactionRevenue'].std()
mask = np.abs(df_train['totals_transactionRevenue'] - df_train['totals_transactionRevenue'].mean()) > (3*df_train['totals_transactionRevenue'].std())
df_train.loc[mask, 'totals_transactionRevenue'] = upper
or use zscore:
df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)
from scipy (stackoverflow)
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(boston_df))
print(z)
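a common follow-up (sketch, assuming boston_df has only numeric columns): keep the rows where every column's |z| is below 3:
boston_df_clean = boston_df[(z < 3).all(axis=1)]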
show groupby object data statistics for each column by grouped element:
grouped.describe()
create dataframe from classifier column names and importances (where supported), sort by weight:
df_feats = pd.DataFrame()
df_feats["names"] = X.columns
df_feats["weights"] = clf.feature_importances_
df_feats.sort_values(by="weights")
diff columns to get the absolute change from the previous row:
for col in input_col_names:
    df_diff[col] = df_sig[col].diff().abs()
unique datatypes in dataframe:
train_df.dtypes.unique()
show columns of selected type:
train_df.select_dtypes(include=['float64'])
Numpy
Get numpy matrix from dataframe:
data_diffs = df_diff.values
get a column from a numpy matrix as a 1-D array:
for sig in range(0, 3):
    data_sig = data_measure[:, sig]
slice numpy array:
bin_data_raw = data_sig[i:i + bin_size]
calculate statistics over numpy array:
bin_avg_raw = bin_data_raw.mean()
bin_sum_raw = bin_data_raw.sum()
bin_std_raw = bin_data_raw.std()
bin_percentiles = np.percentile(bin_data_raw, [0, 1, 25, 50, 75, 99, 100])
bin_range = bin_percentiles[-1] - bin_percentiles[0]
bin_rel_perc = bin_percentiles - bin_avg_raw
count outliers at different scales:
outliers = []
for r in range(1, 7):
    t = bin_std_diff * r
    o_count = np.sum(np.abs(bin_data_diff - bin_avg_diff) >= t)
    outliers.append(o_count)
concatenate multiple numpy arrays into one:
bin_row = np.concatenate([raw_features, diff_features, bin_percentiles, bin_rel_perc, outliers])
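an end-to-end sketch of the binning pattern on a synthetic signal (all names, sizes, and the feature choice here are made up):
import numpy as np
data_sig = np.random.randn(1000)              # stand-in for one measurement column
bin_size = 100
rows = []
for i in range(0, len(data_sig), bin_size):
    bin_data_raw = data_sig[i:i + bin_size]
    bin_percentiles = np.percentile(bin_data_raw, [0, 1, 25, 50, 75, 99, 100])
    # per-bin summary row: mean, std, then the percentiles
    rows.append(np.concatenate([[bin_data_raw.mean(), bin_data_raw.std()], bin_percentiles]))
features = np.array(rows)                     # shape: (n_bins, 9)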
limit values between min/max:
df_sig.clip(upper=127, lower=-127)
value range in column (ptp = peak to peak):
df.groupby('GROUP')['VALUE'].agg(np.ptp)
replace NaNs in a numpy array with 0:
my_arr[np.isnan(my_arr)] = 0
Time-Series
create values for percentage changes over time (row to row):
df_pct = pd.DataFrame()
for col in df_train_subset.columns[:30]:
    df_pct['pct_chg_' + col] = df_train_subset[col].pct_change().abs()
pct_change() gives the row-to-row change as a percentage; abs() drops the sign when only the magnitude (not the direction) of the change matters.
also possible to set number of rows to count pct_change over:
df_pct['pct_chg_' + col] = df_train_subset[col].pct_change(periods=10)
pct_change within groups (here separately for each assetCode):
df['pct_chg_open1'] = df.groupby('assetCode')['open'].apply(lambda x: x.pct_change())
exponentially weighted moving average (ewm) within groups; TODO: try different ways to calculate ewm to see how this all works (stackoverflow: https://stackoverflow.com/questions/37924377/does-pandas-calculate-ewm-wrong):
df["close_ewma_10"] = df.groupby('assetName')['pct_chg_close1'].apply(lambda x: x.ewm(span=10).mean())
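a quick way to explore that TODO on a toy series; the difference that usually causes confusion is the adjust flag:
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
s.ewm(span=10).mean()                 # adjust=True (default): weights renormalized over the points seen so far
s.ewm(span=10, adjust=False).mean()   # recursive form: y_t = (1 - alpha)*y_{t-1} + alpha*x_t, alpha = 2/(span + 1)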
shift a column 10 rows backwards (each row gets the value from 10 rows ahead):
y = mt_df["mket_close_10"].shift(-10)
Date Types
convert seconds since the epoch into datetime:
start_col = pd.to_datetime(df_train.visitStartTime, unit='s')
parse specific columns as specific date types:
df_train = pd.read_csv("../input/train-flat.csv.gz",
parse_dates=["date"],
dtype={'fullVisitorId': 'str',
'date': 'str'
},
)
or if need to parse specific format:
pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
access elements of datetime objects:
data_df["weekday"] = data_df['date'].dt.weekday
data_df["hour"] = data_df['date'].dt.hour
Data Type Manipulations
one-hot encode / convert categorical:
dfg_train = pd.get_dummies(dfg_train, columns=['matchType'])
dfg_test = pd.get_dummies(dfg_test, columns=['matchType'])
set category value by range:
here 1 = solo game, 2 = duo game, 3 = squad, 4 = custom
df_train['team'] = [1 if i == 1 else 2 if i == 2 else 4 if i > 4 else 3
                    for i in df_train["players_in_team_q90"]]
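a possible alternative (a sketch, not from the original notes): pd.cut with explicit bin edges gives the same mapping, assuming players_in_team_q90 holds positive whole numbers:
import numpy as np
df_train["team"] = pd.cut(df_train["players_in_team_q90"],
                          bins=[0, 1, 2, 4, np.inf],     # (0,1]=solo, (1,2]=duo, (2,4]=squad, (4,inf)=custom
                          labels=[1, 2, 3, 4]).astype(int)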
calculate value over two columns and make it a new column:
dfg_train['killsNorm'] = dfg_train['kills_mean'] * ((100 - dfg_train['players_in_match'])/100 + 1)
data_df['hit_view_ratio'] = data_df['totals_pageviews'] / data_df['totals_hits']
mark a set of columns as category type:
for col in X_cat_cols:
    df_train[col] = df_train[col].astype('category')
set category value based on set of values in column:
X_test['matchType'] = X_test['matchType'].astype('str')
X_test.loc[X_test['matchType'] == "squad-fpp", 'matchType'] = "squad"
X_test['matchType'] = X_test['matchType'].astype('category')
how to get the indices from a dataframe as a list (e.g., collect a subset):
list(outlier_collection.index.values)
to see it all on one line in Jupyter (easier copying, e.g. to drop all in list):
print(list(outlier_collection.index.values))
Jupyter
show dataframe as visual table in notebook (stackoverflow):
from IPython.display import display, HTML
display(df1)
display(HTML(df2.to_html()))
pycharm notebook support: see the JetBrains help pages
correlations, unstacking a correlation matrix: see the linked reference
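a minimal sketch of that pattern (assuming it refers to the usual corr + unstack trick for ranking column pairs):
corr_pairs = df.corr().unstack().sort_values(ascending=False)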
Memory Reducer (from a Kaggle kernel):
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object and col_type.name != 'category':
            #print(col_type)
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
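typical usage (a sketch; note the float16 downcast can lose precision, so it can be worth spot-checking important columns afterwards):
df_train = reduce_mem_usage(df_train)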