ModelTrackeR

7 minute read | Updated:

Description

This tool is designed to keep track of all changes made to your models, and to track any metrics you want to run on your models. For example if you make many changes to your model's formula, just run

track(model)

after every iteration to store all the models you tested and it will track them, store them in a dataframe, and sort them by time. It can also keep track of metrics, functions you want to run on your model, and custom metrics.

Usage

In [ ]:
track(metrics, metrics=NULL, customMetrics=NULL, ...)

Arguments

metrics      A list of strings that exist as attributes on the model.

customFunctions  A list of strings that exists as functions you have defined.

...        Additional arguments to be entered as new columns in the MODELS dataframe.

Source code

In [2]:
library(plyr)
library(dplyr)

track <- function(model, metrics=NULL, customFunctions=NULL, ...){
    
    kwargs <- list(...);
    
    environmentVars <- list(
        datetime = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
        username = Sys.info()[["user"]],
        className = class(model)[[1]]
    )
    
    form <- data.frame(formula=paste(deparse(formula(model)), collapse=''));
    
    for (env in names(environmentVars)){
        form[[env]] <-  environmentVars[[env]];
    }

    if (!is.null(metrics)){
        for (metric in metrics){
            form[[metric]]  <- summary(model)[[metric]];
        }
    }

    if (!is.null(customFunctions)){
        for (cmetric in customFunctions){
            form[[cmetric]] <- get(cmetric)(model);
        }
    }

    for (arg in names(kwargs)){
        form[[arg]] <- kwargs[[arg]];
    }
    
    DIRTY_METRICS <- FALSE;
    
    if (exists('METRICS_LIST')){
        
        DIRTY_METRICS <- FALSE %in% 
            unique(
                append(
                    names(kwargs), unlist(
                        append(
                            metrics, customFunctions
                        )
                    )
                ) == METRICS_LIST
            )
        
        if (DIRTY_METRICS){
            MMODELS <<- form;
        }
    }
    
    METRICS_LIST <<- append(
        names(kwargs), unlist(
            append(
                metrics, customFunctions
            )
        )
    )

    if (exists('MODELS')){
        COUNTER <<- COUNTER + 1;
    } else {
        MODELS <<- form;
        COUNTER <<- 1;
    }
    
    if (exists('METRICS_LIST') & DIRTY_METRICS){

        for (env in names(environmentVars)){
            MMODELS[[env]] <<-  environmentVars[[env]];
        }

        if (!is.null(metrics)){
            for (metric in metrics){
                MMODELS[[metric]]  <<- summary(model)[[metric]];
            }
        }

        if (!is.null(customFunctions)){
            for (cmetric in customFunctions){
                MMODELS[[cmetric]] <<- get(cmetric)(model);
            }
        }

        for (arg in names(kwargs)){
            MMODELS[[arg]] <<- kwargs[[arg]];
        }
    }
    
    if (COUNTER > 1){
        if (DIRTY_METRICS){
            combinedDf <- rbind.fill(MODELS, MMODELS);
            MODELS <<- combinedDf[with(combinedDf, order(datetime, decreasing=TRUE)),];
        } else {
            combinedDf <- rbind.fill(MODELS, form);
            MODELS <<- combinedDf[with(combinedDf, order(datetime, decreasing=TRUE)),];
        }
    }
    return(form);
}

Read in data

In [3]:
df <- read.csv('middle_tn_schools.csv')
In [4]:
head(df)
nameschool_ratingsizereduced_lunchstate_percentile_16state_percentile_15stu_teach_ratioschool_typeavg_score_15avg_score_16full_time_teacherspercent_blackpercent_whitepercent_asianpercent_hispanic
Allendale Elementary School 5 851 10 90.2 95.8 15.7 Public 89.4 85.2 54 2.9 85.5 1.6 5.6
Anderson Elementary 2 412 71 32.8 37.3 12.8 Public 43.0 38.3 32 3.9 86.7 1.0 4.9
Avoca Elementary 4 482 43 78.4 83.6 16.6 Public 75.7 73.0 29 1.0 91.5 1.2 4.4
Bailey Middle 0 394 91 1.6 1.0 13.1 Public Magnet 2.1 4.4 30 80.7 11.7 2.3 4.3
Barfield Elementary 4 948 26 85.3 89.2 14.8 Public 81.3 79.6 64 11.8 71.2 7.1 6.0
Barkers Mill Elementary School4 893 48 78.1 76.4 13.9 Public 69.4 72.3 64 28.6 39.9 2.2 17.8

ModelTracker default options

By default, ModelTracker records your model's formula, the execution timestamp, the username, and the model's class name.

In [5]:
model.lm <- lm(avg_score_16 ~ stu_teach_ratio + school_type, data=df)

summary(model.lm)

track(model.lm)
Call:
lm(formula = avg_score_16 ~ stu_teach_ratio + school_type, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-64.817 -18.583   1.753  21.543  59.687 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  5.3802     8.6094   0.625 0.532438    
stu_teach_ratio              3.5258     0.5578   6.321 8.10e-10 ***
school_typePublic Charter    5.7192     9.3476   0.612 0.541051    
school_typePublic Magnet   -15.5177     4.1378  -3.750 0.000207 ***
school_typePublic Virtual -318.9387    59.4852  -5.362 1.52e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 26.08 on 342 degrees of freedom
Multiple R-squared:  0.1403,	Adjusted R-squared:  0.1303 
F-statistic: 13.96 on 4 and 342 DF,  p-value: 1.47e-10
formuladatetimeusernameclassName
avg_score_16 ~ stu_teach_ratio + school_type2017-06-29 21:34:18 tdobbins lm

Details about the models you have tested are stored in MODELS.

In [6]:
MODELS
formuladatetimeusernameclassName
avg_score_16 ~ stu_teach_ratio + school_type2017-06-29 21:34:18 tdobbins lm

Metrics

track accepts an argument called metrics. This has to be an attribute that exists on your model's summary object. To extract it, put the attribute name in a list of strings like

list('r.squared', 'sigma')

We'll also add another variable (state_percentile_15) to our formula to display how ModelTracker keeps track of the formulas.

In [7]:
model.lm <- lm(avg_score_16 ~ stu_teach_ratio + school_type + state_percentile_15, data=df)

metrics <- list('r.squared', 'sigma')

summary(model.lm)
track(model.lm, metrics=metrics)
Call:
lm(formula = avg_score_16 ~ stu_teach_ratio + school_type + state_percentile_15, 
    data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-33.660  -5.444   0.069   5.711  41.670 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 8.5299     3.1113   2.742  0.00644 ** 
stu_teach_ratio             0.1617     0.2132   0.759  0.44864    
school_typePublic Charter  -3.1549     3.7291  -0.846  0.39814    
school_typePublic Magnet   -2.0058     1.4565  -1.377  0.16939    
school_typePublic Virtual   1.3703    22.3127   0.061  0.95107    
state_percentile_15         0.7992     0.0161  49.651  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.005 on 335 degrees of freedom
  (6 observations deleted due to missingness)
Multiple R-squared:  0.8972,	Adjusted R-squared:  0.8956 
F-statistic: 584.5 on 5 and 335 DF,  p-value: < 2.2e-16
formuladatetimeusernameclassNamer.squaredsigma
avg_score_16 ~ stu_teach_ratio + school_type + state_percentile_152017-06-29 21:34:23 tdobbins lm 0.8971659 9.00481

Notice our MODELS object now contains two additional metrics, r.squared and sigma. The first model, since it wasn't tracking those metrics, shows NA for those values.

In [8]:
MODELS
formuladatetimeusernameclassNamer.squaredsigma
2avg_score_16 ~ stu_teach_ratio + school_type + state_percentile_152017-06-29 21:34:23 tdobbins lm 0.8971659 9.00481
1avg_score_16 ~ stu_teach_ratio + school_type 2017-06-29 21:34:18 tdobbins lm NA NA

Custom functions

ModelTracker also accepts an argument called customFunctions. This is for defining custom functions that accept the model as input and return a single value. Be sure that the functions only accept one argument, your model.

For example, I define two functions, getr2 and getFstat.

In [9]:
getr2 <- function(model){
    return(summary(model)$r.squared);
}

getFstat <- function(model){
    return(summary(model)$fstatistic[['value']])
}

Now I will create a list and put my custom functions in them then pass them to track. Note, you must specify them as strings.

In [10]:
model.lm <- lm(avg_score_16 ~ reduced_lunch + poly(size, 2), data=df)

summary(model.lm)

customFunctions <- list('getr2', 'getFstat')
track(model.lm, customFunctions=customFunctions)
Call:
lm(formula = avg_score_16 ~ reduced_lunch + poly(size, 2), data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-45.168 -10.779  -1.083  10.450  58.887 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    103.22970    1.97690  52.218  < 2e-16 ***
reduced_lunch   -0.91846    0.03541 -25.935  < 2e-16 ***
poly(size, 2)1 -46.38692   16.66492  -2.784  0.00568 ** 
poly(size, 2)2 -19.69939   16.12389  -1.222  0.22264    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16 on 343 degrees of freedom
Multiple R-squared:  0.6756,	Adjusted R-squared:  0.6728 
F-statistic: 238.1 on 3 and 343 DF,  p-value: < 2.2e-16
formuladatetimeusernameclassNamegetr2getFstat
avg_score_16 ~ reduced_lunch + poly(size, 2)2017-06-29 21:38:37 tdobbins lm 0.6756111 238.1242

Notice that the MODELS object sorts your models by time in descending order.

In [11]:
MODELS
formuladatetimeusernameclassNamer.squaredsigmagetr2getFstat
3avg_score_16 ~ reduced_lunch + poly(size, 2) 2017-06-29 21:38:37 tdobbins lm NA NA 0.6756111 238.1242
1avg_score_16 ~ stu_teach_ratio + school_type + state_percentile_152017-06-29 21:34:23 tdobbins lm 0.8971659 9.00481 NA NA
2avg_score_16 ~ stu_teach_ratio + school_type 2017-06-29 21:34:18 tdobbins lm NA NA NA NA

Custom metrics

Finally, my favorite. This feature lets you track any metric by calculating it before running track and then just inputting the value into our track call.

For instance, here I calculate pseudoR2 and retrieve the coefficient for reduced_lunch. Then I add them to track using whatever names I want; here I use pseudoR2 and reducedLunchCoef.

In [12]:
model.glm <- glm(avg_score_16 ~ I(reduced_lunch^2) + sqrt(size) + factor(school_type),
                 data=df)

summary(model.glm)

pseudoR2 <- 1-(model.glm$deviance/model.glm$null.deviance)
reducedLunchCoef <- summary(model.glm)$coefficients[2,1]

track(model.glm, pseudoR2=pseudoR2, reducedLunchCoef=reducedLunchCoef)
Call:
glm(formula = avg_score_16 ~ I(reduced_lunch^2) + sqrt(size) + 
    factor(school_type), data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-45.657   -9.101    0.439    9.698   45.115  

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       92.5334745  3.6667864  25.236  < 2e-16 ***
I(reduced_lunch^2)                -0.0093298  0.0003373 -27.662  < 2e-16 ***
sqrt(size)                        -0.2344695  0.1221849  -1.919   0.0558 .  
factor(school_type)Public Charter 33.3718677  5.5360610   6.028 4.31e-09 ***
factor(school_type)Public Magnet  -4.7844382  2.4292237  -1.970   0.0497 *  
factor(school_type)Public Virtual -4.4385146 15.2299952  -0.291   0.7709    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 228.2756)

    Null deviance: 270663  on 346  degrees of freedom
Residual deviance:  77842  on 341  degrees of freedom
AIC: 2877.1

Number of Fisher Scoring iterations: 2
formuladatetimeusernameclassNamepseudoR2reducedLunchCoef
avg_score_16 ~ I(reduced_lunch^2) + sqrt(size) + factor(school_type)2017-06-29 21:43:12 tdobbins glm 0.7124028 -0.009329825

Whatever name you pass to track will show up as the column name.

In [13]:
MODELS
formuladatetimeusernameclassNamer.squaredsigmagetr2getFstatpseudoR2reducedLunchCoef
4avg_score_16 ~ I(reduced_lunch^2) + sqrt(size) + factor(school_type)2017-06-29 21:43:12 tdobbins glm NA NA NA NA 0.7124028 -0.009329825
1avg_score_16 ~ reduced_lunch + poly(size, 2) 2017-06-29 21:38:37 tdobbins lm NA NA 0.6756111 238.1242 NA NA
2avg_score_16 ~ stu_teach_ratio + school_type + state_percentile_15 2017-06-29 21:34:23 tdobbins lm 0.8971659 9.00481 NA NA NA NA
3avg_score_16 ~ stu_teach_ratio + school_type 2017-06-29 21:34:18 tdobbins lm NA NA NA NA NA NA
In [ ]: