Tutorial

Decorators

@timeit_arg_info_dec

timeit_arg_info_dec is a decorator that, each time the decorated function is called, prints the function's parameters, the arguments passed to them, the execution time and the return value. This is useful for debugging, for example.

[2]:
from typing import List
from time import sleep

import pandas as pd

from extra_ds_tools.decorators.func_decorators import timeit_arg_info_dec

@timeit_arg_info_dec(round_seconds=1)
def illustrate_decorater(a_number: int,
                         text: str,
                         lst: List[int],
                         df: pd.DataFrame,
                         either: bool = True,
                         *args,
                         **kwargs):
    sleep(1)
    return "Look how informative!"

illustrate_decorater(42, 'Bob', list(range(100)), pd.DataFrame([list(range(1,10))]), either=False, **{'Even': 'this works!'})

illustrate_decorater()
---------------------------------------------------------------------------------------------------------------------------------
    param          type_hint                    default_value    arg_type                     arg_value                 arg_len
--  -------------  ---------------------------  ---------------  ---------------------------  ------------------------  ---------
 0  a_number       int                                           int                          42
 1  text           str                                           str                          Bob                       3
 2  lst            List[int]                                     list                         [0, 1, 2,  .. 7, 98, 99]  100
 3  df             pandas.core.frame.DataFrame                   pandas.core.frame.DataFrame                            (1, 9)
 4  either         bool                         True             bool                         False
 5  kwarg['Even']                                                str                          this works!               11

illustrate_decorater()took 1.0 seconds to run.

Returned:
Look how informative!
---------------------------------------------------------------------------------------------------------------------------------
[2]:
'Look how informative!'

See the full documentation of timeit_arg_info_dec for more details.
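To make the mechanism concrete, here is a minimal sketch of how a decorator of this kind could be written with the standard library only. It is not the implementation used by extra-datascience-tools: the name timing_and_args and the printed layout are assumptions made for illustration, and the sketch omits the table formatting, type hints and length information shown above.

```python
import functools
import inspect
import time


def timing_and_args(func):
    """Hypothetical stand-in: print bound arguments, run time and return value."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Map the passed values onto parameter names and fill in defaults.
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        for name, value in bound.arguments.items():
            print(f"{name}: {type(value).__name__} = {value!r}")

        # Time only the call to the wrapped function itself.
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start

        print(f"{func.__name__}() took {elapsed:.1f} seconds to run.")
        print(f"Returned: {result!r}")
        return result

    return wrapper


@timing_and_args
def add(a: int, b: int = 2) -> int:
    return a + b


total = add(1)
```

Using functools.wraps keeps the name and docstring of the decorated function intact, which matters when several decorated functions print their own timing lines.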

Plots

stripboxplot

stripboxplot combines seaborn’s boxplot and stripplot in a single plot and adds count information to the tick labels using extra-datascience-tools’ add_counts_to_yticks and add_counts_to_xticks.

[4]:
# import libraries
import pandas as pd
import numpy as np

from extra_ds_tools.plots.eda import stripboxplot
from numpy.random import default_rng

# generate data
rng = default_rng(42)
cats = ['Cheetah', 'Leopard', 'Puma']
cats = rng.choice(cats, size=1000)
cats = np.append(cats, [None]*102)
weights = rng.integers(25, 100, size=1000)
weights = np.append(weights, [np.nan]*100)
weights = np.append(weights, np.array([125,135]))
rng.shuffle(cats)
rng.shuffle(weights)
df = pd.DataFrame({'cats': cats, 'weights': weights})
df.head()
[4]:
cats weights
0 Cheetah 86.0
1 Puma 38.0
2 Puma 68.0
3 None NaN
4 Puma 36.0
[5]:
# run stripboxplot
fig, ax = stripboxplot(df, 'cats', 'weights')
../_images/notebooks_tutorial_8_0.png

See the full documentation of stripboxplot for more details.

add_counts_to_xticks

add_counts_to_xticks is a function that adds count statistics of a categorical variable to the x-axis tick labels of a plot. If the categorical variable is on the y-axis, use add_counts_to_yticks instead.

[1]:
# import libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

from extra_ds_tools.plots.format import add_counts_to_xticks
from numpy.random import default_rng

# generate data
rng = default_rng(42)
cats = ['Cheetah', 'Leopard', 'Puma']
cats = rng.choice(cats, size=1000)
cats = np.append(cats, [None]*102)
weights = rng.integers(25, 100, size=1000)
weights = np.append(weights, [np.nan]*100)
weights = np.append(weights, np.array([125,135]))
rng.shuffle(cats)
rng.shuffle(weights)
df = pd.DataFrame({'cats': cats, 'weights': weights})
df.head()
[1]:
cats weights
0 Cheetah 86.0
1 Puma 38.0
2 Puma 68.0
3 None NaN
4 Puma 36.0

Create a plot, e.g. a violin plot

[2]:
fig, ax = plt.subplots()
sns.violinplot(df, x='cats', y='weights', ax=ax)
[2]:
<AxesSubplot: xlabel='cats', ylabel='weights'>
../_images/notebooks_tutorial_13_1.png

Add counts to the x-ticks

[3]:
fig, ax = add_counts_to_xticks(fig, ax, df, x_col='cats', y_col='weights')
fig
[3]:
../_images/notebooks_tutorial_15_0.png

See the full documentation of add_counts_to_xticks for more details.
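The idea of putting count statistics into tick labels can be sketched without any plotting library. The helper below is hypothetical: count_labels, the label layout and the choice of statistics (total rows and missing values per category) are assumptions for illustration, not the format that add_counts_to_xticks produces.

```python
from collections import Counter
import math


def count_labels(categories, values):
    """Hypothetical helper: build one tick label per category."""
    totals = Counter()
    missing = Counter()
    for category, value in zip(categories, values):
        totals[category] += 1
        # Treat None and NaN as missing measurements.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            missing[category] += 1
    return {
        category: f"{category}\nn={totals[category]}, missing={missing[category]}"
        for category in totals
    }


cats = ["Cheetah", "Puma", "Puma", "Cheetah", "Leopard"]
weights = [86.0, 38.0, float("nan"), 36.0, 55.0]
labels = count_labels(cats, weights)
```

Labels built this way could then be applied to an existing plot, for example with matplotlib's Axes.set_xticklabels, in the order in which the categories appear on the axis.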

try_diff_distribution_plots

try_diff_distribution_plots is a function that applies different transformations to a list of numerical values and plots a histogram, a probability plot and a boxplot for each transformation.

[1]:
from numpy.random import default_rng
from extra_ds_tools.plots.eda import try_diff_distribution_plots

rng = default_rng(42)
values = rng.pareto(a=100, size=1000)

fig, axes, transformed_values = try_diff_distribution_plots(values, hist_bins=40)
../_images/notebooks_tutorial_19_0.png

See the full documentation of try_diff_distribution_plots for more details.
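Comparing transformations can also be done numerically. The sketch below measures how skewed a small, deliberately right-skewed dataset is before and after two common transformations. The transformations chosen here (square root and natural logarithm) are assumptions for illustration; which ones try_diff_distribution_plots applies is not stated in this tutorial.

```python
import math


def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Deterministic right-skewed data: 1, 2, 4, ..., 512.
values = [2.0 ** i for i in range(10)]

# Candidate transformations (assumed for illustration).
transformations = {
    "raw": lambda v: v,
    "sqrt": math.sqrt,
    "log": math.log,
}

# Skewness of the data under each transformation.
skew_by_name = {
    name: skewness([transform(v) for v in values])
    for name, transform in transformations.items()
}
```

Because the raw values double at every step, their logarithms are evenly spaced, so the log-transformed data is symmetric and its skewness is essentially zero, while the raw data is strongly right-skewed.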

ML

sklearn

filter_tried_params

filter_tried_params is a function that filters out parameter combinations already tried in earlier GridSearchCV runs, provided the model is otherwise identical. This can save a lot of time because you won’t rerun settings you have already tried.

[1]:
# import libraries
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import make_pipeline

from extra_ds_tools.ml.sklearn.model_selection import filter_tried_params

model = make_pipeline(DecisionTreeRegressor()) # use make_pipeline to make filter_tried_params work

new_param_grid = {
    "decisiontreeregressor__max_depth": [1, 2],
    "decisiontreeregressor__splitter": ["best", "random"],
}
new_gridsearch = GridSearchCV(model, new_param_grid)

# instantiate two other GridSearchCVs; we assume they have already been run
tried_param_grid1 = {
    "decisiontreeregressor__max_depth": [2, 3],
    "decisiontreeregressor__splitter": ["best", "random"],
}
tried_param_grid2 = {
    "decisiontreeregressor__max_depth": [3, 4],
    "decisiontreeregressor__splitter": ["best", "random"],
}
tried_gridsearches = [
    GridSearchCV(model, tried_param_grid1),
    GridSearchCV(model, tried_param_grid2)
]

# change the param grid of the new GridSearchCV to the filtered param grid
untried_param_grid = filter_tried_params(gridsearchcv=new_gridsearch, tried_gridsearches=tried_gridsearches)
new_gridsearch.param_grid = untried_param_grid
new_gridsearch.param_grid

[1]:
[{'decisiontreeregressor__max_depth': [1],
  'decisiontreeregressor__splitter': ['best']},
 {'decisiontreeregressor__max_depth': [1],
  'decisiontreeregressor__splitter': ['random']}]

As you can see above, the new GridSearchCV will only run the two parameter combinations that have not been tried before.

See the full documentation of filter_tried_params for more details.
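The filtering step itself can be sketched with plain dictionaries. This is not the code of filter_tried_params: the helpers expand and untried_combinations are hypothetical, they assume hashable parameter values, and they skip the check that the model is otherwise identical.

```python
from itertools import product


def expand(grid):
    """Expand a param grid dict into a list of single-combination dicts."""
    keys = sorted(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


def untried_combinations(new_grid, tried_grids):
    """Return the combinations in new_grid that appear in none of tried_grids."""
    tried = {
        tuple(sorted(combo.items()))
        for grid in tried_grids
        for combo in expand(grid)
    }
    # Wrap each value in a list so every result is itself a valid param grid.
    return [
        {key: [value] for key, value in combo.items()}
        for combo in expand(new_grid)
        if tuple(sorted(combo.items())) not in tried
    ]


new_param_grid = {
    "decisiontreeregressor__max_depth": [1, 2],
    "decisiontreeregressor__splitter": ["best", "random"],
}
tried_param_grids = [
    {
        "decisiontreeregressor__max_depth": [2, 3],
        "decisiontreeregressor__splitter": ["best", "random"],
    },
    {
        "decisiontreeregressor__max_depth": [3, 4],
        "decisiontreeregressor__splitter": ["best", "random"],
    },
]

untried = untried_combinations(new_param_grid, tried_param_grids)
```

For the grids used in this section, the sketch leaves the same two combinations as the output above: max_depth 1 with each of the two splitters.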