Tutorial

Decorators

@timeit_arg_info_dec

timeit_arg_info_dec is a decorator that, each time the decorated function is called, prints the function's parameters, the arguments passed to them, the execution time and the return value. This can help with debugging, for example.

[2]:

from typing import List
from time import sleep

import pandas as pd

from extra_ds_tools.decorators.func_decorators import timeit_arg_info_dec

@timeit_arg_info_dec(round_seconds=1)
def illustrate_decorater(a_number: int,
                         text: str,
                         lst: List[int],
                         df: pd.DataFrame,
                         either: bool = True,
                         *args,
                         **kwargs):
    sleep(1)
    return "Look how informative!"

illustrate_decorater(42, 'Bob', list(range(100)), pd.DataFrame([list(range(1,10))]), either=False, **{'Even': 'this works!'})


illustrate_decorater()
---------------------------------------------------------------------------------------------------------------------------------
    param          type_hint                    default_value    arg_type                     arg_value                 arg_len
--  -------------  ---------------------------  ---------------  ---------------------------  ------------------------  ---------
 0  a_number       int                                           int                          42
 1  text           str                                           str                          Bob                       3
 2  lst            List[int]                                     list                         [0, 1, 2,  .. 7, 98, 99]  100
 3  df             pandas.core.frame.DataFrame                   pandas.core.frame.DataFrame                            (1, 9)
 4  either         bool                         True             bool                         False
 5  kwarg['Even']                                                str                          this works!               11

illustrate_decorater()took 1.0 seconds to run.

Returned:
Look how informative!
---------------------------------------------------------------------------------------------------------------------------------

[2]:

'Look how informative!'

For the full documentation of timeit_arg_info_dec click here.
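
To make the idea concrete, here is a minimal sketch of how a decorator of this kind could be written with the standard library alone. It is not the implementation used by timeit_arg_info_dec: the name timing_arg_report and the plain-text output are hypothetical, and only the round_seconds parameter is borrowed from the example above.

```python
import functools
import inspect
import time


def timing_arg_report(round_seconds=1):
    """Hypothetical decorator factory: report arguments, runtime and return value."""

    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Map the passed arguments onto parameter names and fill in defaults.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            for name, value in bound.arguments.items():
                print(f"{name}: {type(value).__name__} = {value!r}")

            # Time only the wrapped call itself.
            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed = round(time.perf_counter() - start, round_seconds)

            print(f"{func.__name__}() took {elapsed} seconds to run.")
            print(f"Returned: {result!r}")
            return result

        return wrapper

    return decorator


@timing_arg_report(round_seconds=2)
def add(a, b=10, **kwargs):
    return a + b


total = add(1, b=2, note="extra")
```

The sketch relies on inspect.signature to pair each argument with its parameter name, and on functools.wraps so that the decorated function keeps its original name and docstring.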

Plots

stripboxplot

stripboxplot combines seaborn’s boxplot and stripplot into a single plot and adds count information to the tick labels using extra-datascience-tools’ add_counts_to_yticks and add_counts_to_xticks.

[4]:

# import libraries
import pandas as pd
import numpy as np

from extra_ds_tools.plots.eda import stripboxplot
from numpy.random import default_rng

# generate data
rng = default_rng(42)
cats = ['Cheetah', 'Leopard', 'Puma']
cats = rng.choice(cats, size=1000)
cats = np.append(cats, [None]*102)
weights = rng.integers(25, 100, size=1000)
weights = np.append(weights, [np.nan]*100)
weights = np.append(weights, np.array([125,135]))
rng.shuffle(cats)
rng.shuffle(weights)
df = pd.DataFrame({'cats': cats, 'weights': weights})
df.head()

[4]:

	cats	weights
0	Cheetah	86.0
1	Puma	38.0
2	Puma	68.0
3	None	NaN
4	Puma	36.0

[5]:

# run stripboxplot
fig, ax = stripboxplot(df, 'cats', 'weights')

For the full documentation of stripboxplot click here.

add_counts_to_xticks

add_counts_to_xticks is a function that adds count statistics of a categorical variable to the x-axis of a plot. If the categorical variable is on the y-axis, you can use add_counts_to_yticks instead.

[1]:

# import libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

from extra_ds_tools.plots.format import add_counts_to_xticks
from numpy.random import default_rng

# generate data
rng = default_rng(42)
cats = ['Cheetah', 'Leopard', 'Puma']
cats = rng.choice(cats, size=1000)
cats = np.append(cats, [None]*102)
weights = rng.integers(25, 100, size=1000)
weights = np.append(weights, [np.nan]*100)
weights = np.append(weights, np.array([125,135]))
rng.shuffle(cats)
rng.shuffle(weights)
df = pd.DataFrame({'cats': cats, 'weights': weights})
df.head()

[1]:

	cats	weights
0	Cheetah	86.0
1	Puma	38.0
2	Puma	68.0
3	None	NaN
4	Puma	36.0

Create a plot, for example a violin plot

[2]:

fig, ax = plt.subplots()
sns.violinplot(data=df, x='cats', y='weights', ax=ax)  # pass data by keyword so older seaborn versions accept it too

[2]:

<AxesSubplot: xlabel='cats', ylabel='weights'>

Add counts to the x-ticks

[3]:

fig, ax = add_counts_to_xticks(fig, ax, df, x_col='cats', y_col='weights')
fig

For the full documentation of add_counts_to_xticks click here.
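
As a rough illustration of what "count statistics per category" can look like, the sketch below builds one label per category from plain Python lists. The function name count_tick_labels and the label format are assumptions made for this example; the statistics and formatting that add_counts_to_xticks actually produces may differ.

```python
from collections import Counter


def count_tick_labels(categories, values):
    """Build one label per category: name, row count and non-missing value count."""
    rows = Counter()
    non_missing = Counter()
    for category, value in zip(categories, values):
        rows[category] += 1
        # Treat None and NaN as missing (NaN is the only value not equal to itself).
        if value is not None and value == value:
            non_missing[category] += 1
    return {
        category: f"{category}\nn={rows[category]}, non-missing={non_missing[category]}"
        for category in sorted(rows, key=str)
    }


labels = count_tick_labels(
    ["Puma", "Cheetah", "Puma", None],
    [38.0, 86.0, float("nan"), 50.0],
)
print(labels["Puma"])
```

Labels built this way could then be applied to an existing plot with matplotlib's ax.set_xticklabels, in the same order as the categories on the axis.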

try_diff_distribution_plots

try_diff_distribution_plots is a function that applies different transformations to a list of numerical values and plots a histogram, a probability plot and a boxplot for each transformation.

[1]:

from numpy.random import default_rng
from extra_ds_tools.plots.eda import try_diff_distribution_plots

rng = default_rng(42)
values = rng.pareto(a=100, size=1000)

fig, axes, transformed_values = try_diff_distribution_plots(values, hist_bins=40)

For the full documentation of try_diff_distribution_plots click here.
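
The transformations themselves are not listed in this tutorial. As an illustration of the general idea, the sketch below applies two common choices for right-skewed, non-negative data (log1p and square root) and collects the results by name. Both the function name candidate_transformations and the choice of transformations are assumptions, not a description of what try_diff_distribution_plots does internally.

```python
import math


def candidate_transformations(values):
    """Return a dict mapping a transformation name to the transformed values.

    Assumes all inputs are non-negative, as required by sqrt and log1p here.
    """
    return {
        "original": list(values),
        "log1p": [math.log1p(v) for v in values],
        "sqrt": [math.sqrt(v) for v in values],
    }


values = [0, 1, 4, 9, 100]
transformed = candidate_transformations(values)
for name, result in transformed.items():
    print(name, [round(v, 2) for v in result])
```

Each entry of the returned dict could then be passed to a histogram, probability plot or boxplot to compare how well the transformation reduces skew.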

ML

sklearn

filter_tried_params

filter_tried_params is a function which filters out previously tried parameters in a GridSearchCV if the model is otherwise identical. This can save a lot of time because you won’t be rerunning already tried settings.

[1]:

# import libraries
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import make_pipeline

from extra_ds_tools.ml.sklearn.model_selection import filter_tried_params

model = make_pipeline(DecisionTreeRegressor()) # use make_pipeline to make filter_tried_params work

new_param_grid = {
    "decisiontreeregressor__max_depth": [1, 2],
    "decisiontreeregressor__splitter": ["best", "random"],
}
new_gridsearch = GridSearchCV(model, new_param_grid)

# instantiate two other GridSearchCVs; we assume they have already been run
tried_param_grid1 = {
    "decisiontreeregressor__max_depth": [2, 3],
    "decisiontreeregressor__splitter": ["best", "random"],
}
tried_param_grid2 = {
    "decisiontreeregressor__max_depth": [3, 4],
    "decisiontreeregressor__splitter": ["best", "random"],
}
tried_gridsearches = [
    GridSearchCV(model, tried_param_grid1),
    GridSearchCV(model, tried_param_grid2)
]

# change the param grid of the new GridSearchCV to the filtered param grid
untried_param_grid = filter_tried_params(gridsearchcv=new_gridsearch, tried_gridsearches=tried_gridsearches)
new_gridsearch.param_grid = untried_param_grid
new_gridsearch.param_grid

[1]:

[{'decisiontreeregressor__max_depth': [1],
  'decisiontreeregressor__splitter': ['best']},
 {'decisiontreeregressor__max_depth': [1],
  'decisiontreeregressor__splitter': ['random']}]

As you can see above, the new GridSearchCV will only run the two new options it hasn’t tried before.

For the full documentation of filter_tried_params click here.
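
The filtering step can be pictured as set subtraction over parameter combinations. The sketch below shows that idea with the standard library only. The helper names expand_grid and untried_combinations are hypothetical, the parameter names are shortened for readability, and the sketch ignores the check that the model is otherwise identical, so it should not be read as the implementation of filter_tried_params.

```python
from itertools import product


def expand_grid(param_grid):
    """Expand a dict of parameter lists into a list of single-combination dicts."""
    names = sorted(param_grid)
    value_lists = (param_grid[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def untried_combinations(new_grid, tried_grids):
    """Keep the combinations of new_grid that appear in none of tried_grids."""
    tried = [combo for grid in tried_grids for combo in expand_grid(grid)]
    return [combo for combo in expand_grid(new_grid) if combo not in tried]


new_grid = {"max_depth": [1, 2], "splitter": ["best", "random"]}
tried_grids = [
    {"max_depth": [2, 3], "splitter": ["best", "random"]},
    {"max_depth": [3, 4], "splitter": ["best", "random"]},
]
remaining = untried_combinations(new_grid, tried_grids)
print(remaining)
```

Only the two max_depth=1 combinations remain, which mirrors the filtered param grid shown above. scikit-learn's ParameterGrid performs a similar expansion of a parameter grid into individual combinations.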