Patrick’s LEGO Model

Welcome! Thanks for checking out my LEGO Model.

In this project, I use all the LEGO data I could find to create a statistical model. My goal is to better understand what affects a LEGO set’s price.

This is written for a general audience. I like explaining things to everyone.

Join me as I learn about LEGO! Or, skip ahead to the following findings:

- Patrick

Skip to the imaginator, a demo that predicts the price of a set you imagine, using the model explained in this notebook!

Technical description

In this notebook, I fit a log-transformed linear model to a dataset of LEGO products to understand what contributes to their price.

Skip to the Model Diagnostics to see the results directly.

Data Sources

I got data on all LEGO sets and themes from Rebrickable, and prices for many LEGO sets from Brickset.

Skip to this section to see the data I have: Data Availability

LEGO Data Primer

LEGO sells products called Sets, and they are made of a number of LEGO Pieces. Sets all have categorical Themes, like Castle and Star Wars. Some, like Star Wars, are licensed from other IPs. Others, like Castle, are not. Sets also include a number of Minifigures, and were released in a certain Year.

Code
#data_raw <- read_csv("C:/Users/dange/Data/Legostracker/datasets/data_raw.csv")
data("legosets", package = "brickset")

data_raw <- legosets %>%
  mutate(set_id = paste0(number, "-", numberVariant)) #%>%
#filter(year <= 2022)

themes <- read_csv("C:/Users/dange/Data/Legostracker/datasets/themes_2.csv")
sets <- read_csv("C:/Users/dange/Data/Legostracker/datasets/sets.csv")
mysets <- read_csv("C:/Users/dange/Code/LegosTracker/data/my_sets.csv")

# Generate flag License = true if theme, subtheme, or themegroup contains Licensed
data_raw <- data_raw %>%
  mutate(
    theme = ifelse(is.na(theme), "Unknown", theme),
    minifigs = ifelse(is.na(minifigs), 0, minifigs),
    License = ifelse(
      grepl("Licensed", theme, ignore.case = TRUE) |
        grepl("Licensed", subtheme, ignore.case = TRUE) |
        grepl("Licensed", themeGroup, ignore.case = TRUE),
      TRUE,
      FALSE
    ),
    
    pp = US_retailPrice / pieces
  )

# Drop rows with a missing price, piece count, or category, and sets with fewer than 3 pieces
# (the category/themeGroup exclusions below are currently disabled)
ldata <- data_raw %>%
  filter(
    !is.na(US_retailPrice),
    !is.na(pieces),
    !is.na(category),
    pieces >= 3
    #        category == "Normal",
    #        themeGroup != "Educational",
    #        themeGroup != "Technical",
  )

# Clean and filter numeric columns
ldata <- ldata %>%
  mutate(
    pieces = to_num(pieces),
    US_retailPrice = to_num(US_retailPrice),
    minifigs = to_num(minifigs),
    year = as.integer(to_num(year)),
    theme = as.factor(theme)
  ) %>%
  filter(!is.na(pieces),
         pieces > 0,
         !is.na(US_retailPrice),
         US_retailPrice > 0)

# Row counts for charts
ldata <- ldata %>%
  group_by(pieces, US_retailPrice) %>%
  mutate(point_count = n()) %>%
  ungroup()

# Merge theme into ldata via set
ldata <- ldata %>%
  left_join(sets %>% select(set_num, theme_id), by = join_by(set_id == set_num)) %>%
  left_join(themes %>% select(id, themes_root), by = join_by(theme_id == id)) %>%
  # Remove theme and replace with themes_root
  select(-theme) %>%
  rename(theme = themes_root)

# Save ldata
save(ldata, file = "analysis/data/ldata.RData")

Taking a Look

Let’s investigate the Price of LEGO Sets. My first guess is that the Price of each Set is related to how many Pieces it has, so I make a plot with Pieces on the X axis and Price on the Y axis.

Code
ggplot(ldata, aes(x = pieces, y = US_retailPrice, size = point_count)) +
  geom_point(alpha = 0.5,
             shape = 15,
             color = lego_palette[["Lblue"]]) +
  scale_x_continuous(limits = c(1, max(ldata$pieces, na.rm = TRUE)), labels = comma) +
  scale_y_continuous(limits = c(1, max(ldata$US_retailPrice, na.rm = TRUE)), labels = dollar) +
  labs(
    title = "Sets' Prices and Pieces",
    subtitle = "",
    x = "Pieces",
    y = "Price"
  ) +
  theme(legend.position = "none")

Each square is a set. When several sets share the same spot, I make the square bigger.

As expected, Sets’ Prices and Pieces seem to be related: The points are grouped together.

On the Line

To investigate, I’m going to use math to draw a straight line on the graph. The straight line is called a line of best fit. It’s drawn to be the “best line possible”: the one that passes as close as it can to all the points at once. It’s like taking an average across a range of data.
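Here is a minimal sketch of what “best” means, using made-up numbers rather than the real LEGO data. The `toy` data frame and the nudged slope are assumptions for illustration only. The usual definition (least squares) picks the line with the smallest sum of squared vertical distances to the points, so any other line should score worse.

```r
# Hypothetical toy data, not taken from the LEGO dataset
toy <- data.frame(
  pieces = c(100, 200, 300, 400, 500),
  price  = c(12, 19, 32, 38, 52)
)

# lm() fits the least-squares line: price = intercept + slope * pieces
fit <- lm(price ~ pieces, data = toy)

# Sum of squared vertical distances for the fitted line
sse_best <- sum(residuals(fit)^2)

# The same measure for a slightly different line (slope nudged by 0.01)
nudged_line <- coef(fit)[["(Intercept)"]] + (coef(fit)[["pieces"]] + 0.01) * toy$pieces
sse_nudged <- sum((toy$price - nudged_line)^2)

# The fitted line wins
sse_best < sse_nudged
```

The `geom_smooth(method = "lm")` layer in the plot code draws this same kind of line.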

I’m also going to highlight a few LEGO sets I’ve owned!

Code
ggplot(ldata, aes(x = pieces, y = US_retailPrice, size = point_count)) +
  geom_point(alpha = 0.5,
             shape = 15,
             color = lego_palette[["Lblue"]]) +
  geom_smooth(
    method = "lm",
    se = FALSE,
    color = lego_palette[["Llblue"]],
    linewidth = 1
  ) +
  geom_point(
    data = subset(ldata, set_id %in% mysets$set_id),
    aes(x = pieces, y = US_retailPrice, size = point_count),
    shape = 21,
    color = lego_palette[["Ldred"]],
    fill = lego_palette[["Lred"]],
    size = 5
  ) +
  scale_x_continuous(limits = c(1, max(ldata$pieces, na.rm = TRUE)), labels = scales::comma) +
  scale_y_continuous(limits = c(1, max(ldata$US_retailPrice, na.rm = TRUE)), labels = scales::dollar) +
  labs(
    title = "Sets' Prices and Pieces",
    subtitle = "Fitting one straight line",
    x = "Pieces",
    y = "Price"
  ) +
  theme(legend.position = "none")

My Sets

Set Theme Price, Pieces
Y-Wing Fighter Star Wars $39.99, 454 pc
Bonsai Tree Botanicals $49.99, 878 pc
Hokusai - The Great Wave LEGO Art $99.99, 1,810 pc
Millennium Falcon Star Wars $84.99, 921 pc

The distribution of Sets’ Prices and Pieces seems skewed - Most sets are below $100 and 1,000 pieces, but because there are a few big sets, the graph is very zoomed out.

To get a closer look at most of the data, I’ll use a mathematical trick to “zoom in” on the data we care about.

Changing Perspective: “Log” Scale

Normally, the halfway point between 1 and 100 is about 50. We can change this with the Logarithmic Scale (or Log scale).

In Log scale, the halfway point between 1 and 100 is shown as 10. This is really useful when there’s lots of data between 1 and 10 and not much data between 10 and 100.
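A quick sketch of that halfway-point idea, taking 1 and 100 as the endpoints (a log scale has no place for 0). The variable names are mine, just for illustration.

```r
endpoints <- c(1, 100)

# Halfway point on an ordinary (linear) scale
linear_mid <- mean(endpoints)

# Halfway point on a log scale: average the logs, then convert back
log_mid <- 10^mean(log10(endpoints))

linear_mid  # 50.5
log_mid     # 10
```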

Below, I plot the same Sets as above, but on the Log Scale. I’ll put my same four sets on the graph, so we can see how they move.

Code
ggplot(ldata, aes(x = pieces, y = US_retailPrice, size = point_count)) +
  geom_point(alpha = 0.3,
             shape = 15,
             color = lego_palette[["Lblue"]]) +
  geom_smooth(
    method = "lm",
    formula = y ~ x,
    se = FALSE,
    color = lego_palette[["Llblue"]],
    linewidth = 1
  ) +
  geom_point(
    data = subset(ldata, set_id %in% mysets$set_id),
    aes(x = pieces, y = US_retailPrice, size = point_count),
    shape = 21,
    color = lego_palette[["Ldred"]],
    fill = lego_palette[["Lred"]],
    size = 5
  ) +
  scale_x_log10(limits = c(1, max(ldata$pieces, na.rm = TRUE)), labels = scales::comma) +
  scale_y_log10(limits = c(1, max(ldata$US_retailPrice, na.rm = TRUE)), labels = scales::dollar) +
  scale_size_continuous(range = c(1, 4), guide = "none") +
  labs(
    title = "Sets' Prices and Pieces",
    subtitle = "On a log scale, and fitting one straight line",
    x = "Pieces",
    y = "Price"
  )

The data shown here is the exact same as the graph above. The only difference is the scale. See the labels on the axes.

This is a much better way to see the data. We’ll use this view from now on.

Asking some Questions

Now that we have a good way to view the data, we can start asking some questions.

First, I’m curious if Licensed Sets are more expensive than non-licensed sets. Let’s color them differently and draw the graph.

Code
ldata <- ldata %>%
  group_by(pieces, US_retailPrice, License) %>%
  mutate(point_count = n()) %>%
  ungroup()

license_colors <- c(
  "FALSE" = lego_palette[["Lblue"]],
  "TRUE" = lego_palette[["Lorange"]]
)

ggplot(ldata, aes(x = pieces, y = US_retailPrice, color = License)) +
  geom_point(aes(size = point_count), alpha = 0.3, shape = 15) +
  scale_x_log10(limits = c(1, max(ldata$pieces, na.rm = TRUE)), labels = comma) +
  scale_y_log10(limits = c(1, max(ldata$US_retailPrice, na.rm = TRUE)), labels = dollar) +
  scale_color_manual(values = license_colors,
                     labels = c("FALSE" = "Non-Licensed", "TRUE" = "Licensed")) +
  scale_size_continuous(range = c(1, 3), guide = "none") +
  labs(
    title = "Licensed vs Non-Licensed Sets",
    subtitle = "",
    x = "Pieces",
    y = "Price",
    color = ""
  ) +
  guides(
    colour = guide_legend(override.aes = list(alpha = 1)),
    fill   = guide_legend(override.aes = list(alpha = 1)),
    shape  = guide_legend(override.aes = list(alpha = 1)),
  )

OK, I generally see the licensed color a little higher than the non-licensed color, but we’ll have to use modelling to reach a firmer conclusion.

Next, let’s investigate themes. I noticed a group of points above and to the left of the main, larger group. With some analysis, I found that those are sets of the Duplo theme.

Below, I draw another plot and color only the Duplo sets. I also create Lines of Best Fit separately for Duplo and for all other Sets, so we can see if their average prices are different.

Code
lego <- ldata %>%
  mutate(theme_chr = tolower(trimws(as.character(theme))), Duplo = theme_chr == "duplo") %>%
  filter(!is.na(Duplo),
         !is.na(pieces),
         pieces > 0,
         !is.na(US_retailPrice),
         US_retailPrice > 0)

duplo_colors <- c(`FALSE` = lego_palette[["Lblue"]], `TRUE` = lego_palette[["Lred"]])

# 2) Plot with regular per-group LMs
ggplot(lego, aes(x = pieces, y = US_retailPrice, color = Duplo)) +
  geom_point(aes(size = point_count), alpha = 0.3, shape = 15) +
  geom_smooth(
    method = "lm",
    formula = y ~ x,
    se = FALSE,
    linewidth = 1
  ) +
  scale_x_log10(limits = c(1, max(lego$pieces, na.rm = TRUE)), labels = scales::comma) +
  scale_y_log10(limits = c(1, max(lego$US_retailPrice, na.rm = TRUE)), labels = scales::dollar) +
  scale_color_manual(
    values = duplo_colors,
    labels = c(`FALSE` = "Non-Duplo", `TRUE` = "Duplo"),
    breaks = c(FALSE, TRUE)
  ) +
  scale_size_continuous(range = c(1, 3), guide = "none") +
  labs(
    title = "Duplo vs Non-Duplo Sets",
    subtitle = "",
    x = "Pieces",
    y = "Price",
    color = ""
  ) +
  guides(colour = guide_legend(override.aes = list(alpha = 1)))

Duplo bricks, which are intended for toddlers, are twice as big as standard LEGO bricks in every dimension. Because Pieces is a main driver of Price, it makes sense that twice-as-large Pieces might make a set more expensive due to the cost of material.

These are solid results: Duplo sets clearly tend to have a higher Price than non-Duplo sets, even when they have similar Piece counts.

Model

To see the relationship between Price and Pieces while taking into account more than just whether the theme is Duplo, we need to reach for one of the most tried-and-true tools in statistics: Linear Regression.1

1 Linear Regression is a statistical model that estimates the relationship between a dependent variable and one or more independent variables. Wikipedia

We’ve actually already used it - those straight lines “of best fit” above are simple linear regressions. But multiple linear regression can handle much more than that. In multiple linear regression, we can ask a model to estimate the relationship between a dependent variable and multiple explanatory variables.

In our case, we’re going to ask a model to fit a straight line for each Theme that explains Price using these factors: Pieces, release Year, number of Minifigs.

Our Model

\[ \begin{aligned} \log(\text{Price}) ={}& \beta_0 + \beta_1 \log(\text{Pieces}) + \text{Theme} \\ &+ \big(\log(\text{Pieces}) \times \text{Theme}\big) \\ &+ \beta_2 \,\text{Year} + \beta_3 \,\text{Minifigs} + \varepsilon \end{aligned} \]

Model Term | Data Type | Explanation
Price (Dependent Variable) | Numerical, log-transformed | Retail price of the LEGO set (dependent variable to be explained).
Pieces | Numerical, log-transformed | Number of pieces in the set; larger sets generally cost more.
Theme | Categorical | Captures theme-specific shifts in price levels.
Pieces × Theme | Interaction (numeric × categorical) | Allows each theme to have its own slope: shows whether the relationship between pieces and price differs across themes.
Year | Numerical | Release year of the set; controls for time trends in pricing (e.g., inflation, product strategy changes).
Minifigs | Numerical | Number of minifigures in the set, captures additional value.

Theme is in this model twice. I add an interaction term that lets different Themes have different relationships between Pieces and Price. Without the interaction term, the model would only consider the average difference in Price for each theme as a whole.

In R, we run this model with the command:

lm(log(US_retailPrice) ~ log(pieces) * theme + year + minifigs, data = ldata)
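To see what the interaction term buys us, here is a small sketch on made-up data. The two themes "A" and "B" and their slopes are assumptions I chose so the answer is known in advance. Without the interaction, both themes are forced to share one slope. With it, each theme gets its own.

```r
# Hypothetical data: theme A has a slope of 0.5, theme B a slope of 1.0 (on the log scale)
toy <- data.frame(
  theme  = rep(c("A", "B"), each = 4),
  pieces = rep(c(10, 100, 1000, 10000), times = 2)
)
toy$price <- ifelse(
  toy$theme == "A",
  exp(1 + 0.5 * log(toy$pieces)),
  exp(0 + 1.0 * log(toy$pieces))
)

# Theme only shifts the line up or down: one shared slope
fit_main <- lm(log(price) ~ log(pieces) + theme, data = toy)

# Theme also changes the slope: log(pieces) * theme adds the interaction
fit_int <- lm(log(price) ~ log(pieces) * theme, data = toy)

shared_slope <- coef(fit_main)[["log(pieces)"]]
slope_A <- coef(fit_int)[["log(pieces)"]]
slope_B <- slope_A + coef(fit_int)[["log(pieces):themeB"]]

c(shared = shared_slope, A = slope_A, B = slope_B)
```

The shared slope lands between the two true slopes, while the interaction model recovers each one separately. This sketch uses R's default coding for `theme`; the real model uses sum coding instead.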

Code
# Define Levels
lev <- levels(factor(ldata$theme))
C <- contr.sum(length(lev))
rownames(C) <- lev
colnames(C) <- lev[-length(lev)]

#Lm Just Theme
lmt <- lm(
  log(US_retailPrice) ~ log(pieces) + theme + year + minifigs,
  data = transform(ldata, theme = factor(theme, levels = lev)),
  contrasts = list(theme = C)
)

# LM Interaction
model_lmi <- lm(
  log(US_retailPrice) ~ log(pieces) * theme + year + minifigs,
  data = transform(ldata, theme = factor(theme, levels = lev)),
  contrasts = list(theme = C)
)
save(model_lmi, file = "analysis/model/model_lmi.RData")

This cute animation represents the model running…

Skip to the Model Diagnostics to see the full results of the model.

The model performs well!

With the variables given, it’s able to explain 91.5% of the variation in Price.2

2 This is Adjusted R-Squared of the model. This is the most common measure of how much of the variance in the dependent variable the model is able to explain using the given explanatory variables. Wikipedia
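For the curious, here is a sketch of how Adjusted R-Squared relates to plain R-Squared, on simulated data (the variables `x1`, `x2`, and `y` are made up). The adjustment penalizes the model a little for every explanatory variable it uses.

```r
set.seed(1)

# Simulated data: y depends on x1, while x2 is pure noise
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 2 * toy$x1 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = toy)
s <- summary(fit)

n <- nrow(toy)  # number of observations
p <- 2          # number of explanatory variables (not counting the intercept)

# Adjusted R-squared computed by hand
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

c(r2 = s$r.squared, adjusted = s$adj.r.squared, by_hand = adj_manual)
```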

Code
# 1) Grab coefficients; treat any aliased (NA) coeffs as 0 so they don't poison sums
co <- coef(lmt)
co[is.na(co)] <- 0

# 2) Build the model matrix for the fitted model
mm <- model.matrix(lmt)
stopifnot("log(pieces)" %in% colnames(mm))
col_lp   <- which(colnames(mm) == "log(pieces)")
col_int  <- grep("^log\\(pieces\\):theme", colnames(mm))

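# 3) Row-specific slope wrt log(pieces): beta_lp + (row's interaction combo)
#    Note: lmt has no log(pieces):theme terms, so col_int is empty here and every
#    row gets the same slope; use model_lmi instead for theme-specific slopes.
slope_i <- as.numeric(co[col_lp]) + as.numeric(mm[, col_int, drop = FALSE] %*% co[col_int])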

# 4) Convert "10% more pieces" into % price change, then average (AME)
ame_pieces10 <- mean((1.10^slope_i) - 1, na.rm = TRUE)

ame_minifig1 <- exp(unname(co["minifigs"])) - 1
ame_year1 <- exp(unname(co["year"])) - 1

# Optional sanity checks
#any(is.na(slope_i))           # should be FALSE
#range(slope_i)                # slopes across themes/rows
#scales::percent(ame_pieces10, accuracy = 0.1)

Let’s interpret what the model says about changes in sets’ characteristics:

  • A 10% increase in the number of pieces is associated with a 7.2% higher price on average.

  • Each additional minifigure is associated with a 3.6% increase in price, holding other variables constant.

  • A one-year newer release is associated with a 0.6% increase in price, holding other variables constant. Accounting for everything else we see, LEGO sets haven’t changed price much over the years.
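The percentages above come from converting log-scale coefficients back into percent changes. Here is a sketch of the two conversions. The helper names and the slope of 0.73 are hypothetical, chosen only to illustrate the arithmetic; they are not the fitted values.

```r
# Percent price change for a given percent increase in pieces,
# when both price and pieces are logged (slope is the log-log coefficient)
pct_change_pieces <- function(slope, increase = 0.10) {
  (1 + increase)^slope - 1
}

# Percent price change for one extra unit of an unlogged variable
# (for example one more minifigure, or one year later)
pct_change_unit <- function(coef) {
  exp(coef) - 1
}

# Hypothetical slope: 10% more pieces gives roughly 7.2% higher price
pct_change_pieces(0.73)

# A coefficient of log(1.036) corresponds to a 3.6% increase per unit
pct_change_unit(log(1.036))
```

A slope of exactly 1 would mean price rises in step with pieces: 10% more pieces, 10% higher price.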

Code
blocks <- c("Pieces", "Theme", "Pieces×Theme", "Year", "Minifigs")
block_terms <- list(
  "Pieces"        = "log(pieces)",
  "Theme"         = "theme",
  "Pieces×Theme"  = "log(pieces):theme",
  "Year"          = "year",
  "Minifigs"      = "minifigs"
)

# Build a clean modeling data frame from the fitted model
mf <- model.frame(model_lmi)
mf <- droplevels(mf)
mf$y <- model.response(mf)

# Recreate raw vars if only logged versions exist
if (!"pieces" %in% names(mf) && "log(pieces)" %in% names(mf)) {
  mf$pieces <- exp(mf[["log(pieces)"]])
}
if (!"minifigs" %in% names(mf) && "log(minifigs)" %in% names(mf)) {
  mf$minifigs <- exp(mf[["log(minifigs)"]])
}
if (!"year" %in% names(mf) && "log(year)" %in% names(mf)) {
  mf$year <- exp(mf[["log(year)"]])
}

# Helpers
subset_terms <- function(S) {
  T <- unlist(block_terms[S], use.names = FALSE)
  if ("Pieces×Theme" %in% S) {
    T <- unique(c(T, block_terms$Pieces, block_terms$Theme))
  }
  T
}

rhs_formula <- function(S) {
  ts <- subset_terms(S)
  if (length(ts) == 0) {
    "1"
  } else {
    paste(ts, collapse = " + ")
  }
}

key <- function(S) {
  if (length(S) == 0) {
    "<BASE>"
  } else {
    paste(sort(S), collapse = "|")
  }
}

r2_cache <- new.env(parent = emptyenv())
get_r2 <- function(S) {
  k <- key(S)
  if (exists(k, envir = r2_cache, inherits = FALSE)) {
    return(get(k, envir = r2_cache))
  }
  fml <- as.formula(paste("y ~", rhs_formula(S)))
  fit <- lm(fml, data = mf, na.action = na.omit)
  r2  <- summary(fit)$r.squared
  assign(k, r2, envir = r2_cache)
  r2
}

fact <- function(m) {
  if (m <= 1) 1 else prod(2:m)
}

wgt <- function(s, n) {
  fact(s) * fact(n - s - 1) / fact(n)
}

# Exact Shapley (LMG) over the 5 blocks
n <- length(blocks)
phi <- setNames(numeric(n), blocks)

for (bi in blocks) {
  others <- setdiff(blocks, bi)
  for (s in 0:length(others)) {
    Ssets <- if (s == 0) {
      list(character(0))
    } else {
      combn(others, s, simplify = FALSE)
    }
    for (S in Ssets) {
      r2_S  <- get_r2(S)
      r2_Si <- get_r2(c(S, bi))
      phi[bi] <- phi[bi] + wgt(length(S), n) * (r2_Si - r2_S)
    }
  }
}

total_r2 <- get_r2(blocks)

variance_shares <- tibble(
  feature = c(names(phi), "Residual"),
  sumsq   = c(unname(phi), 1 - total_r2)
) |> 
  mutate(share = 100 * sumsq / sum(sumsq))

Next, let’s see what the model says about the factors that contribute to the price of a LEGO set.

The variables in the model each affect the predicted price.

We’re able to interpret how much each of the variables affects the price, to see how influential each variable is.

Sets’ pieces in general, and differences by theme.

  • Pieces alone account for 29.1% of variation in price.

  • The Theme×Pieces interaction (themes having different piece–price relationships) accounts for 39.7% of the variation in price.

Theme-level differences.
Belonging to a certain theme, by itself, accounts for 13.8% of the variation in price.

Other controls.

  • Minifigs explains 7.6% of the variation in price. That’s quite substantial.

  • Year explains 1.5% of the variation in price. LEGO sets haven’t changed much in price, considering these other variables.

Residual share. About 8% of the variance in Price (what is left after the shares above) isn’t explained by this model. That leftover part is called the residual.

I plot this below, to show what makes up a price!

Shares come from a Shapley/LMG decomposition over the five blocks {Pieces, Theme, Pieces×Theme, Year, Minifigs}. “Unexplained” equals \(1 - R^2\); small differences from adjusted \(R^2\) reflect the adjustment and rounding.
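Here is a stripped-down sketch of the Shapley/LMG idea with just two made-up predictors instead of five blocks. Everything in it (`x1`, `x2`, the simulated data) is an assumption for illustration. Each predictor's share is its average gain in R-squared over the two possible orders of adding it: first, or after the other one.

```r
set.seed(42)
n <- 200

# Simulated, correlated predictors and an outcome
d <- data.frame(x1 = rnorm(n))
d$x2 <- 0.5 * d$x1 + rnorm(n)
d$y  <- d$x1 + d$x2 + rnorm(n)

# R-squared for a given model formula
r2 <- function(f) summary(lm(f, data = d))$r.squared

r2_none <- 0              # intercept-only model explains nothing
r2_1    <- r2(y ~ x1)
r2_2    <- r2(y ~ x2)
r2_12   <- r2(y ~ x1 + x2)

# With two predictors, both orderings get weight 1/2
phi_1 <- 0.5 * (r2_1 - r2_none) + 0.5 * (r2_12 - r2_2)
phi_2 <- 0.5 * (r2_2 - r2_none) + 0.5 * (r2_12 - r2_1)

# The shares add up to the full model's R-squared
c(phi_1 = phi_1, phi_2 = phi_2, total = r2_12)
```

That adding-up property is why the shares plus the residual make a full 100% bar.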

Code
pal <- c(
  "Pieces"        = "#8ECDA3",
  "Theme"         = "#00AE4D",
  "Pieces×Theme"  = "#66BE7D",
  "Year"          = "#009246",
  "Minifigs"      = "#006834",
  "Residual"       = "#4A2D91"
)

order_levels <- c("Residual",
                  "Pieces",
                  "Pieces×Theme",
                  "Theme",
                  "Year",
                  "Minifigs")

plot_df <- variance_shares %>%
  mutate(Feature = factor(feature, levels = order_levels),
         pct     = share / sum(share)) %>%
  arrange(Feature)

bar_df <- plot_df %>%
  mutate(ymin = lag(cumsum(pct), default = 0), ymax = cumsum(pct))

lab_df <- bar_df %>%
  mutate(ymid  = (ymin + ymax) / 2,
         label = paste0(Feature, ": ", round(share, 0), "%"))

bar_width  <- 0.2
bar_center <- 1
xmin_bar   <- bar_center - bar_width / 2
xmax_bar   <- bar_center + bar_width / 2

top_feature <- bar_df$Feature[which.max(bar_df$share)]

pad    <- 0.08 * bar_width
gap    <- 0.10 * bar_width
stud_w <- (bar_width - 2 * pad - gap) / 2
stud_h <- 0.02

studs <- tibble(
  xmin = c(xmin_bar + pad, xmin_bar + pad + stud_w + gap),
  xmax = c(xmin_bar + pad + stud_w, xmax_bar - pad),
  ymin = 1,
  ymax = 1 + stud_h
)

ggplot(bar_df, aes(
  x = 1,
  ymin = ymin,
  ymax = ymax,
  fill = Feature
)) +
  geom_rect(xmin = xmin_bar, xmax = xmax_bar) +
  geom_rect(
    data = studs,
    aes(
      xmin = xmin,
      xmax = xmax,
      ymin = ymin,
      ymax = ymax
    ),
    inherit.aes = FALSE,
    fill = "#00492C",
    colour = NA
  ) +
  ggrepel::geom_text_repel(
    data = lab_df,
    aes(
      x = xmax_bar,
      y = ymid,
      label = label,
      colour = Feature
    ),
    inherit.aes = FALSE,
    nudge_x = 0.4,
    hjust = 0,
    direction = "y",
    segment.size = 0.4,
    box.padding = 0.2,
    point.padding = 0.1,
    min.segment.length = 0,
    size = 12
  ) +
  scale_fill_manual(values = pal, drop = FALSE) +
  scale_colour_manual(values = pal, guide = "none") +
  coord_cartesian(
    xlim = c(0.5, 2.4),
    ylim = c(0, 1 + stud_h + 0.03),
    clip = "off"
  ) +
  theme_void() +
  theme(legend.position = "none",
        plot.margin = margin(10, 10, 10, 10))

Theme Effects

Different themes have different prices. Below, I plot how each theme’s prices are different than the average of all themes.

Because we used our model, this accounts for the fact that some sets have more pieces than others. The effect of a set belonging to a certain theme is isolated from the other variables we told the model to take into account.

The small lines show the confidence interval for the estimated effect on price. The model’s estimates are uncertain, so it reports a 95% confidence interval: a range of values that are plausible given the data. When that interval doesn’t cross zero, the effect is called statistically significant, and I darken the bar color for that theme.
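Comparing each theme to the average of all themes relies on a setting called sum coding. Here is a sketch on a tiny made-up dataset (themes "A", "B", "C" and their values are hypothetical). With sum coding, the intercept is the average across themes, each coefficient is a theme's difference from that average, and the last theme is left out because it must equal minus the sum of the others.

```r
# Hypothetical data: group means are 2, 5 and 8
toy <- data.frame(
  theme = factor(rep(c("A", "B", "C"), each = 2)),
  y     = c(1, 3, 4, 6, 7, 9)
)

# Fit with sum coding for theme
fit <- lm(y ~ theme, data = toy, contrasts = list(theme = contr.sum(3)))
co <- coef(fit)

overall_mean <- co[["(Intercept)"]]   # average of the three group means
effect_A <- co[["theme1"]]            # A minus the average
effect_B <- co[["theme2"]]            # B minus the average
effect_C <- -(effect_A + effect_B)    # the omitted last level, recovered

c(mean = overall_mean, A = effect_A, B = effect_B, C = effect_C)
```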

Code
theme_counts <- dplyr::count(ldata, theme, name = "n")


pal <- setNames(lego_colors$hex, lego_colors$color)

# 1) Grab the k-1 named theme coefficients (sum coding omits the last level)
tid <- tidy(lmt, conf.int = TRUE)

theme_terms <- tid %>%
  filter(str_starts(term, "theme")) %>%
  transmute(theme = str_remove(term, "^theme"), estimate, conf.low, conf.high)

# 2) Recover the omitted last level with proper SE and CI from vcov
V <- vcov(lmt)
present <- paste0("theme", theme_terms$theme)
a <- rep(-1, length(present))                      # coefficients for -sum(present)
est_last <- -sum(theme_terms$estimate)
var_last <- as.numeric(t(a) %*% V[present, present, drop = FALSE] %*% a)
se_last  <- sqrt(var_last)
alpha <- 0.05
crit <- qt(1 - alpha / 2, df = df.residual(lmt))
ci_last <- est_last + c(-1, 1) * crit * se_last

last_theme <- setdiff(lev, theme_terms$theme)

theme_full <- bind_rows(
  theme_terms,
  tibble(
    theme = last_theme,
    estimate = est_last,
    conf.low = ci_last[1],
    conf.high = ci_last[2]
  )
)

# 3) Back-transform to % diffs vs overall mean (exact, not linearized)
label_percent_from_log <- function(x)
  percent(exp(x) - 1, accuracy = 1)

theme_tbl <- theme_full %>%
  mutate(
    p   = exp(estimate) - 1,
    plo = exp(conf.low) - 1,
    phi = exp(conf.high) - 1
  ) %>%
  mutate(
    y   = sign(p)   * log1p(abs(p)),
    ymin = sign(plo) * log1p(abs(plo)),
    ymax = sign(phi) * log1p(abs(phi)),
    fill_name = case_when(
      phi < 0 ~ "Lblue",
      plo > 0 ~ "Ldred",
      p   < 0 ~ "Llblue",
      p   > 0 ~ "Llred",
      TRUE    ~ "Lyellow"
    )
  ) %>%
  arrange(desc(y)) %>%
  left_join(theme_counts, by = "theme") %>%
  filter(n > 5) %>%
  mutate(theme_lab = paste0(theme, " (", n, ")"))

# symmetric percent ticks
pct_ticks <- c(-0.75, -0.50, -0.25, 0, 0.25, 0.50, 1.00, 2.5, 6.00)
breaks    <- sign(pct_ticks) * log1p(abs(pct_ticks))

ggplot(theme_tbl, aes(reorder(theme_lab, y), y, fill = fill_name)) +
  geom_col() +
  geom_errorbar(aes(ymin = ymin, ymax = ymax),
                width = 0.2,
                color = "#272727") +
  coord_flip() +
  scale_fill_manual(values = pal) +
  guides(fill = "none") +
  scale_y_continuous(breaks = breaks,
                     labels = scales::percent(pct_ticks, accuracy = 1)) +
  labs(
    x = NULL,
    y = "% higher/lower $ price vs overall average (log scale)",
    title = "Theme Price Comparison",
    subtitle = "Comparing LEGO themes by their price per piece",
  ) + theme(
    #aspect.ratio = 1,
    axis.text.y = element_text(size = rel(0.95)),
    axis.text.x = element_text(size = rel(0.85))
  )

The x-axis is also set to log scale, since some themes have such high variation in prices.

We can see Duplo high up, like we saw in the beginning! The model estimates Duplo sets are 200% more expensive - or three times as expensive as the average.

Theme Price Trends

Each theme also has a different trend between Price and Pieces.

LEGO Art is not only cheaper than expected overall - its price also doesn’t increase as much as others do as the sets get more pieces. It has a flatter piece-price trend.

Some themes have sets of all the same price no matter how many pieces they have. For Mixels, all sets from 45 to 75 pieces are $4.99. Its piece-price trend is a flat line.

Train actually has a negative relationship with price, despite being the most expensive theme overall!

To investigate, I plot all sets and let you overlay each theme’s own trend as well as its own sets.

Code
# Tidy theme names, then store them as a factor (levels sorted alphabetically)
ldata <- ldata %>%
  mutate(theme = factor(str_squish(as.character(theme))))

star_theme <- "Architecture"

themes_lvl <- levels(ldata$theme)

theme_lims <- ldata |>
  group_by(theme) |>
  summarise(
    xmin = min(pieces, na.rm = TRUE),
    xmax = max(pieces, na.rm = TRUE),
    .groups = "drop"
  )

year0 <- median(ldata$year, na.rm = TRUE)
minifigs0 <- median(ldata$minifigs, na.rm = TRUE)

pred_df <- map_dfr(seq_len(nrow(theme_lims)), function(i) {
  th <- theme_lims$theme[i]
  xs <- seq(theme_lims$xmin[i], theme_lims$xmax[i], length.out = 80)
  tibble(
    theme = th,
    pieces = xs,
    year = year0,
    minifigs = minifigs0
  )
})
pred_df$theme <- factor(pred_df$theme, levels = themes_lvl)
pred_df$price <- exp(predict(model_lmi, newdata = pred_df))

bg_color <- lego_palette[["Llblue"]]
hl_color <- lego_palette[["Lred"]]

x_ticks <- 10^(floor(range(log10(ldata$pieces), finite = TRUE)[1]):ceiling(range(log10(ldata$pieces), finite = TRUE)[2]))
y_ticks <- 10^(floor(range(log10(
  ldata$US_retailPrice
), finite = TRUE)[1]):ceiling(range(log10(
  ldata$US_retailPrice
), finite = TRUE)[2]))

plt <- plot_ly()

plt <- add_markers(
  plt,
  data = ldata,
  x = ~ pieces,
  y = ~ US_retailPrice,
  type = "scattergl",
  marker = list(
    color = bg_color,
    size = 5,
    symbol = "square"
  ),
  name = "",
  showlegend = FALSE
)

pieces_all <- range(ldata$pieces, na.rm = TRUE)
pred_all <- tibble(
  pieces = seq(pieces_all[1], pieces_all[2], length.out = 200),
  year = year0,
  minifigs = minifigs0,
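  # predict() needs some theme value; the first level is used here, so this line
  # is that theme's fitted trend rather than an average over all themes
  theme = factor(themes_lvl[1], levels = themes_lvl)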
)
pred_all$price <- exp(predict(model_lmi, newdata = pred_all))

plt <- add_lines(
  plt,
  data = pred_all,
  x = ~ pieces,
  y = ~ price,
  line = list(
    color = lego_palette[["Lblue"]],
    width = 1,
    dash = "dash"
  ),
  name = "Overall model",
  showlegend = FALSE
)

for (th in themes_lvl) {
  df_th   <- dplyr::filter(ldata, theme == th)
  df_line <- dplyr::filter(pred_df, theme == th)
  
  plt <- add_markers(
    plt,
    data = df_th,
    x = ~ pieces,
    y = ~ US_retailPrice,
    type = "scattergl",
    marker = list(
      color = hl_color,
      size = 7,
      symbol = "circle"
    ),
    name = th,
    legendgroup = th,
    showlegend = TRUE,
    visible = ifelse(th == star_theme, TRUE, "legendonly")
  )
  
  plt <- add_lines(
    plt,
    data = df_line,
    x = ~ pieces,
    y = ~ price,
    line = list(color = lego_palette[["Ldred"]], width = 3),
    name = paste0(th, " model"),
    legendgroup = th,
    showlegend = FALSE,
    visible = ifelse(th == star_theme, TRUE, "legendonly")
  )
}

plt <- layout(
  plt,
  title = "Theme Price Trends",
  xaxis = list(
    type = "log",
    title = "Pieces",
    tickmode = "array",
    tickvals = x_ticks,
    dtick = 1,
    minor = list(ticks = "")
  ),
  yaxis = list(
    type = "log",
    title = "Price",
    tickprefix = "$",
    tickmode = "array",
    tickvals = y_ticks,
    dtick = 1,
    minor = list(ticks = "")
  ),
  legend = list(
    orientation = "v",
    x = 1.05,
    xanchor = "left",
    y = 1,
    yanchor = "top",
    groupclick = "togglegroup"
  ),
  margin = list(r = 220)
)

plt

You’re ready to try the imaginator! Use this model to predict the price of a set you imagine.

Final Thoughts

In the world of AI and neural networks, simple models like linear regressions can look unglamorous. But, sometimes the simplest tools work surprisingly well.

Like Emmet shows us in The LEGO Movie - sometimes, a simple model, thoughtfully made and used, can be the right tool for the job - even if it doesn’t look flashy.

A clip of Emmet from the LEGO movie, showing off his double-decker couch

Appendix

Data Availability

Before beginning any data analysis, it’s important to inspect what data you have. 25,389 sets are available in the dataset, and 6,057 of them have a US retail price.
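As a quick sketch of the arithmetic, those two counts mean roughly 24% of sets have a price. The second half shows the same calculation on a made-up price column (`toy_prices` is hypothetical), where `NA` marks a missing price.

```r
n_sets   <- 25389  # sets in the dataset (from the text above)
n_priced <- 6057   # sets with a US retail price (from the text above)

share_priced <- n_priced / n_sets
round(100 * share_priced, 1)

# Same idea on a hypothetical price column: the share of non-missing values
toy_prices <- c(9.99, NA, 49.99, NA, NA, 19.99)
toy_share <- mean(!is.na(toy_prices))
toy_share
```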

Below, I show a few different ways of assessing this missingness.

First, I plot how many sets we have prices for, compared to all sets, in two types of bar chart: counts per year and percentages per year.

Code
# Merge data_raw and sets data and create availability flag
sets <- sets %>%
  left_join(themes %>% rename("theme" = "name"), by = c("theme_id" = "id"))

sets_m <- sets %>%
  left_join(data_raw,
            by = c("set_num" = "set_id"),
            suffix = c("", ".m")) %>%
          mutate(price_availability = ifelse(is.na(US_retailPrice), "Missing", "Available"))



status_colors <- c("Available" = lego_palette[["Lblue"]], "Missing" = lego_palette[["Llblue"]])

main_plot <- sets_m %>%
  mutate(
    year = as.numeric(year),
    price_availability = factor(price_availability, levels = c("Available", "Missing"))
  ) %>%
  count(year, price_availability) %>%
  ggplot(aes(x = year, y = n, fill = price_availability)) +
  geom_col(position = position_stack(reverse = TRUE), width = 1) +
  scale_fill_manual(values = status_colors) +
  scale_x_continuous(breaks = seq(
    from = floor(min(sets_m$year, na.rm = TRUE) / 5) * 5,
    to = ceiling(max(sets_m$year, na.rm = TRUE) / 5) * 5,
    by = 5
  ),
  expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), labels = comma) +
  labs(
    title = "US Retail Price Availability Counts Over Years",
    x = NULL,
    # Remove x-axis label for cleaner combination
    y = "Count",
    fill = "Price Status"
  ) +
  theme(
    panel.grid.minor.x = element_blank(),
    legend.position = "none",
    axis.text.y = element_text(size = rel(0.85)),
    axis.text.x = element_text(size = rel(0.85))
  )

percent_plot <- sets_m %>%
  mutate(
    year = as.numeric(year),
    price_availability = factor(price_availability, levels = c("Available", "Missing"))
  ) %>%
  count(year, price_availability) %>%
  group_by(year) %>%
  mutate(pct = n / sum(n)) %>%
  ggplot(aes(x = year, y = pct, fill = price_availability)) +
  geom_col(position = position_fill(reverse = TRUE), width = 1) +
  scale_fill_manual(values = status_colors) +
  scale_x_continuous(breaks = seq(
    from = floor(min(sets_m$year, na.rm = TRUE) / 5) * 5,
    to = ceiling(max(sets_m$year, na.rm = TRUE) / 5) * 5,
    by = 5
  ),
  expand = c(0, 0)) +
  scale_y_continuous(labels = scales::percent, expand = c(0, 0)) +
  labs(y = "Percentage", fill = "Price Status") +
  theme(
    panel.grid.minor.x = element_blank(),
    legend.position = "bottom",
    axis.text.y = element_text(size = rel(0.65)),
    axis.text.x = element_text(size = rel(0.85))
  )

# Combine plots vertically
combined_plot <- main_plot / percent_plot +
  plot_layout(heights = c(6, 2))

# Display the combined plot
combined_plot

To see differences by theme, I create a table showing how many sets have prices available, and what percentage that is, by theme.

Code
pretty_table <- sets_m %>%
  count(themes_root, price_availability) %>%
  pivot_wider(names_from = price_availability,
              values_from = n,
              values_fill = 0) %>%
  mutate(`% Available` = Available / (Available + Missing),
         .after = Available) %>%
  arrange(desc(Available)) %>%
  rename(
    "Theme" = themes_root,
    "Available Prices" = Available,
    "Missing Prices" = Missing
  )

datatable(
  pretty_table,
  rownames = FALSE,
  filter = "top",
  extensions = c("Buttons", "Scroller"),
  options = list(
    dom = "Bfrtip",
    buttons = c("copy", "csv", "excel"),
    scrollX = TRUE,
    scrollY = "400px",
    scroller = TRUE,
    pageLength = 10,
    columnDefs = list(list(
      targets = which(names(pretty_table) == "% Available") - 1,
      # 0-based index
      render = JS(
        "function(ldata, type, row, meta) {",
        "return type === 'display' ?",
        "(ldata * 100).toFixed(1) + '%' : ldata;",
        "}"
      )
    )),
    initComplete = JS(
      "function(settings, json) {",
      "$(this.api().table().header()).css({'background-color': '#4A2D91', 'color': '#fff'});",
      "$('body').css({'font-family': 'opensans'});",
      "}"
    )
  ),
  class = "stripe hover"
) %>%
  formatStyle("Theme",
              backgroundColor = "#ecf0f1",
              fontFamily = "Open Sans") %>%
  formatStyle(
    c("Available Prices", "Missing Prices"),
    color = JS("'black'"),
    fontWeight = JS("'normal'"),
    fontFamily = "Open Sans"
  ) %>%
  formatStyle(
    "% Available",
    background = styleColorBar(c(0, 1), lego_palette[["Llblue"]]),
    backgroundSize = '98% 88%',
    backgroundRepeat = 'no-repeat',
    backgroundPosition = 'center',
    fontFamily = "Open Sans"
  )

Finally, to see what we should expect about price availability with respect to piece count, I fit a logistic regression model predicting whether a set has a price available or not, using piece count and year as predictors.

This model shows that prices are more likely to be missing for small sets than for large sets.

Code
sets_mp <- sets_m %>%
  mutate(pieces = as.numeric(pieces),
         have_price = as.numeric(price_availability != "Missing")) %>%
  filter(pieces > 0, set_num != "BIGBOX-1") # Drop one outlier: a very large set with a missing price

price <- glm(have_price ~ pieces + year,
             data = sets_mp,
             family = binomial(link = "logit"))

#summary(price)
#plot(price)

# Generate predictions at mean year
pred_pieces <- ggpredict(price, terms = "pieces [all]", condition = c(year = mean(sets_mp$year)))

# Custom breaks for piece counts
breaks <- c(1, 10, 25, 50, 100, 250, 500, 1000, 5000)

# Plot with LEGO colors
ggplot(pred_pieces, aes(x, predicted)) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
              fill = lego_palette["Llblue"],
              alpha = 0.3) +
  geom_line(color = lego_palette["Lblue"], linewidth = 1.5) +
  labs(
    x = "Number of Pieces (semi-log scale)",
    y = "Probability of Having Price",
    title = "Effect of Piece Count on Price Availability",
    subtitle = "Holding year constant at average value"
  ) +
  scale_y_continuous(
    labels = scales::percent,
    limits = c(0, 1),
    # Forces y-axis from 0% to 100%
    expand = c(0, 0)   # Removes padding at axis ends
  ) +
  scale_x_continuous(
    trans = trans_new(
      name = "softlog",
      transform = function(x)
        log(x + 10),
      inverse = function(x)
        exp(x) - 10
    ),
    breaks = breaks,
    labels = breaks
  )
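The custom x-axis transformation in the plot above is worth a second look. Here is a minimal base-R sketch of the same `log(x + 10)` transform and its inverse, applied to the break values used in the plot. The function names `softlog` and `softlog_inv` are mine, for illustration only. It checks that the round trip recovers the original piece counts, and compares the axis spacing to a plain log scale.

```r
# Same transform and inverse as passed to trans_new() above
# (softlog / softlog_inv are illustrative names, not from the notebook)
softlog     <- function(x) log(x + 10)
softlog_inv <- function(x) exp(x) - 10

breaks <- c(1, 10, 25, 50, 100, 250, 500, 1000, 5000)

# Round trip: transforming and then inverting recovers the original breaks
round_trip <- softlog_inv(softlog(breaks))

# Axis distance between consecutive breaks under each transform
spacing <- data.frame(
  from    = head(breaks, -1),
  to      = tail(breaks, -1),
  softlog = diff(softlog(breaks)),
  log     = diff(log(breaks))
)
print(spacing)
```

The +10 offset squeezes the low end of the axis compared with a plain log scale, while the two are nearly identical for large sets. It also keeps the transform finite at zero pieces, where `log(0)` would not be.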

Model Diagnostics

The ANOVA table below shows that all terms are highly statistically significant.

Code
car::Anova(model_lmi, type = 2)
Anova Table (Type II tests)

Response: log(US_retailPrice)
                   Sum Sq   Df   F value    Pr(>F)    
log(pieces)       2337.95    1 26498.483 < 2.2e-16 ***
theme             1051.93  102   116.889 < 2.2e-16 ***
year                 5.52    1    62.575 3.046e-15 ***
minifigs            29.25    1   331.529 < 2.2e-16 ***
log(pieces):theme   95.42   96    11.266 < 2.2e-16 ***
Residuals          515.00 5837                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
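
Each F test in this table is a joint test of a whole term, which matters most for theme, since it has around a hundred coefficients. The sketch below illustrates the difference between a joint test and individual coefficient tests. It does not use the LEGO model; it uses R’s built-in `chickwts` data as a stand-in, and the object names are mine. With a single term, the sequential table from `anova()` and a Type II table give the same test.

```r
# Stand-in example on built-in data (NOT the LEGO model)
fit_feed <- lm(weight ~ feed, data = chickwts)

# Joint F test for the whole 'feed' term (5 coefficients tested at once)
joint_p <- anova(fit_feed)["feed", "Pr(>F)"]

# Individual t tests: each feed level compared with the baseline level
coef_p <- summary(fit_feed)$coefficients[-1, "Pr(>|t|)"]

joint_p
round(coef_p, 4)
```

A term can be clearly significant as a whole even when some of its individual level-versus-baseline comparisons are not.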

I show the model summary output below. Few of the individual coefficients are significant on their own, but the overall model is highly significant, and the ANOVA table above shows that each term matters as a whole.

Code
options(max.print = 1200)
options(width = 1000)
summary(model_lmi)

Call:
lm(formula = log(US_retailPrice) ~ log(pieces) * theme + year + 
    minifigs, data = transform(ldata, theme = factor(theme, levels = lev)), 
    contrasts = list(theme = C))

Residuals:
     Min       1Q   Median       3Q      Max 
-1.83088 -0.16980 -0.00883  0.14746  2.93855 

Coefficients: (6 not defined because of singularities)
                                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                       -1.754e+01  2.301e+00  -7.622 2.90e-14 ***
log(pieces)                                        9.898e-01  3.511e-01   2.819 0.004826 ** 
themeAgents                                        2.455e-01  1.248e+00   0.197 0.844076    
themeAngry Birds                                   1.032e+00  1.444e+00   0.715 0.474937    
themeAnimal Crossing                              -1.343e+00  1.800e+00  -0.746 0.455782    
themeAquazone                                     -8.575e-01  1.423e+00  -0.603 0.546709    
themeArchitecture                                  2.085e+00  1.199e+00   1.740 0.081984 .  
themeAtlantis                                      6.889e-01  1.210e+00   0.569 0.569244    
themeAvatar                                        3.880e-01  1.615e+00   0.240 0.810197    
themeBelville                                      1.536e+00  1.283e+00   1.197 0.231187    
themeBen 10                                        4.585e+00  2.439e+00   1.880 0.060108 .  
themeBionicle                                      2.058e+00  1.173e+00   1.755 0.079299 .  
themeBooks                                         4.401e+00  1.471e+00   2.992 0.002787 ** 
themeBotanicals                                    1.986e-01  1.587e+00   0.125 0.900412    
themeBrick Sketches                                4.706e+00  2.749e+00   1.712 0.086970 .  
themeBrickheadz                                    7.606e-01  1.195e+00   0.637 0.524339    
themeBrickLink Designer Program                    2.135e-02  1.893e+00   0.011 0.991000    
themeBulk Bricks                                   3.722e+00  1.238e+00   3.006 0.002659 ** 
themeCars                                          5.954e-01  1.220e+00   0.488 0.625650    
themeCastle                                        1.553e+00  1.178e+00   1.318 0.187499    
themeChinese Traditional Festivals                 5.949e-01  1.522e+00   0.391 0.695916    
themeCity                                          1.157e+00  1.168e+00   0.990 0.322090    
themeClassic                                       2.018e-02  1.191e+00   0.017 0.986482    
themeCollectible Minifigures                       1.902e+00  1.171e+00   1.624 0.104465    
themeCreator                                      -4.300e-01  1.171e+00  -0.367 0.713520    
themeDC Super Hero Girls                           6.876e-02  1.418e+00   0.049 0.961316    
themeDespicable Me 4                               1.889e+00  1.674e+00   1.129 0.259130    
themeDimensions                                    7.895e-01  1.200e+00   0.658 0.510672    
themeDino                                          4.170e-01  1.466e+00   0.285 0.776032    
themeDisney                                        1.246e+00  1.174e+00   1.061 0.288565    
themeDOTS                                          1.050e+00  1.176e+00   0.893 0.371969    
themeDreamzzz                                     -1.214e-01  1.276e+00  -0.095 0.924193    
themeDuplo                                         2.723e+00  1.168e+00   2.332 0.019733 *  
themeEducational and Dacta                         4.207e+00  1.196e+00   3.518 0.000438 ***
themeElves                                         2.830e-01  1.246e+00   0.227 0.820302    
themeExo-Force                                    -3.116e-01  1.275e+00  -0.244 0.806931    
themeFactory                                      -1.968e+01  3.679e+01  -0.535 0.592694    
themeFortnite                                     -6.046e-01  1.740e+00  -0.348 0.728171    
themeFriends                                       7.462e-01  1.169e+00   0.638 0.523274    
themeFusion                                        6.440e+00  9.100e+00   0.708 0.479139    
themeGabby's Dollhouse                             9.483e-01  1.394e+00   0.680 0.496467    
themeGames                                         1.958e+00  1.230e+00   1.592 0.111433    
themeGear                                          4.419e+00  9.168e+00   0.482 0.629837    
themeGhostbusters                                  8.780e-01  1.875e+00   0.468 0.639687    
themeHarry Potter                                  1.006e+00  1.183e+00   0.850 0.395294    
themeHero Factory                                  2.058e+00  1.183e+00   1.740 0.081904 .  
themeHidden Side                                   2.466e-01  1.305e+00   0.189 0.850127    
themeIcons                                         7.170e-01  1.254e+00   0.572 0.567610    
themeIndiana Jones                                 4.512e-01  1.273e+00   0.355 0.722958    
themeJuniors                                       1.336e+00  1.221e+00   1.094 0.273936    
themeJurassic World                                1.178e+00  1.211e+00   0.973 0.330637    
themeLegends of Chima                              1.487e+00  1.183e+00   1.258 0.208571    
themeLEGO Art                                      2.977e+00  1.459e+00   2.041 0.041333 *  
themeLEGO Ideas and CUUSOO                         5.835e-01  1.218e+00   0.479 0.631930    
themeMake & Create                                 7.590e-01  1.218e+00   0.623 0.533126    
themeMaster Builder Academy                        1.305e+00  2.002e+00   0.652 0.514607    
themeMindstorms                                    3.450e+00  1.190e+00   2.900 0.003749 ** 
themeMinecraft                                     7.207e-01  1.187e+00   0.607 0.543720    
themeMinions                                       1.797e+00  1.330e+00   1.351 0.176812    
themeMixels                                        3.486e+00  1.642e+00   2.123 0.033816 *  
themeModular Buildings                             6.039e-01  2.204e+00   0.274 0.784129    
themeMonkie Kid                                    1.032e+00  1.226e+00   0.842 0.400033    
themeMonster Fighters                              6.206e-02  1.303e+00   0.048 0.962005    
themeNexo Knights                                  1.530e+00  1.182e+00   1.295 0.195511    
themeNinjago                                       1.911e+00  1.168e+00   1.635 0.102029    
themeOther                                         7.886e-01  1.186e+00   0.665 0.506172    
themeOverwatch                                    -2.509e-01  1.608e+00  -0.156 0.875990    
themePharaoh's Quest                               1.420e+00  1.266e+00   1.121 0.262268    
themePirates                                       1.298e+00  1.199e+00   1.083 0.278903    
themePower Functions                              -3.990e-01  2.291e+00  -0.174 0.861734    
themePower Miners                                  1.295e-01  1.262e+00   0.103 0.918275    
themePromotional                                  -3.946e+00  7.690e+00  -0.513 0.607904    
themeQuatro                                       -4.970e+01  1.047e+02  -0.475 0.635035    
themeRacers                                        5.314e-01  1.177e+00   0.452 0.651589    
themeScooby-Doo                                    9.567e-01  1.521e+00   0.629 0.529496    
themeSculptures                                    2.412e-01  1.545e+00   0.156 0.875957    
themeSeasonal                                      9.893e-01  1.171e+00   0.845 0.398175    
themeService Packs                                -2.763e-01  4.850e-01  -0.570 0.568916    
themeSonic The Hedgehog                           -2.741e-01  1.732e+00  -0.158 0.874243    
themeSpace                                         5.275e-01  1.185e+00   0.445 0.656300    
themeSpeed Champions                               5.516e-02  1.235e+00   0.045 0.964378    
themeSpeed Racer                                  -8.618e-01  2.573e+00  -0.335 0.737702    
themeSpongeBob SquarePants                         1.276e+00  1.518e+00   0.841 0.400481    
themeStar Wars                                     6.043e-01  1.169e+00   0.517 0.605186    
themeStranger Things                              -8.339e-01  1.646e+00  -0.507 0.612389    
themeSuper Heroes DC                               1.009e+00  1.175e+00   0.858 0.390662    
themeSuper Heroes Marvel                           7.836e-01  1.173e+00   0.668 0.504007    
themeSuper Mario                                   2.154e+00  1.175e+00   1.833 0.066873 .  
themeTechnic                                      -2.163e-01  1.173e+00  -0.184 0.853771    
themeTeenage Mutant Ninja Turtles                  1.025e+00  1.325e+00   0.774 0.438874    
themeThe Hobbit and Lord of the Rings              2.712e-01  1.244e+00   0.218 0.827417    
themeThe Legend of Zelda                          -4.136e-01  1.676e+00  -0.247 0.805075    
themeThe LEGO Movie                                1.015e+00  1.187e+00   0.855 0.392444    
themeThe Lone Ranger                               9.005e-01  1.429e+00   0.630 0.528506    
themeThe Powerpuff Girls                           7.012e-01  4.880e+00   0.144 0.885737    
themeTown                                         -8.937e-01  1.597e+00  -0.560 0.575779    
themeToy Story                                     9.092e-01  1.314e+00   0.692 0.489009    
themeTrain                                         5.282e+00  1.188e+00   4.448 8.84e-06 ***
themeTrolls: World Tour                            3.166e-01  1.390e+00   0.228 0.819810    
themeUnikitty!                                     1.382e+00  1.248e+00   1.107 0.268458    
themeVIDIYO                                        2.670e+00  1.248e+00   2.140 0.032416 *  
themeVikings                                      -6.698e-01  1.142e+00  -0.587 0.557431    
themeWednesday                                    -3.855e+01  4.164e+01  -0.926 0.354579    
themeWicked                                       -5.643e-01  1.163e+00  -0.485 0.627621    
year                                               7.790e-03  9.848e-04   7.910 3.05e-15 ***
minifigs                                           3.547e-02  1.948e-03  18.208  < 2e-16 ***
log(pieces):themeAgents                           -1.238e-01  3.591e-01  -0.345 0.730303    
log(pieces):themeAngry Birds                      -2.490e-01  3.812e-01  -0.653 0.513603    
log(pieces):themeAnimal Crossing                   1.738e-01  4.257e-01   0.408 0.683133    
log(pieces):themeAquazone                          4.455e-02  3.800e-01   0.117 0.906678    
log(pieces):themeArchitecture                     -3.917e-01  3.538e-01  -1.107 0.268251    
log(pieces):themeAtlantis                         -1.913e-01  3.565e-01  -0.537 0.591550    
log(pieces):themeAvatar                           -1.246e-01  3.937e-01  -0.317 0.751557    
log(pieces):themeBelville                         -1.889e-01  3.693e-01  -0.512 0.608995    
log(pieces):themeBen 10                           -9.898e-01  8.094e-01  -1.223 0.221390    
log(pieces):themeBionicle                         -4.194e-01  3.521e-01  -1.191 0.233713    
log(pieces):themeBooks                            -8.150e-01  4.736e-01  -1.721 0.085287 .  
log(pieces):themeBotanicals                       -1.489e-01  3.890e-01  -0.383 0.701806    
log(pieces):themeBrick Sketches                   -1.006e+00  6.063e-01  -1.659 0.097258 .  
log(pieces):themeBrickheadz                       -2.850e-01  3.545e-01  -0.804 0.421441    
log(pieces):themeBrickLink Designer Program       -1.012e-01  4.022e-01  -0.252 0.801420    
log(pieces):themeBulk Bricks                      -9.594e-01  3.669e-01  -2.615 0.008943 ** 
log(pieces):themeCars                             -1.694e-01  3.578e-01  -0.473 0.635975    
log(pieces):themeCastle                           -3.360e-01  3.526e-01  -0.953 0.340579    
log(pieces):themeChinese Traditional Festivals    -2.313e-01  3.788e-01  -0.611 0.541395    
log(pieces):themeCity                             -2.508e-01  3.513e-01  -0.714 0.475237    
log(pieces):themeClassic                          -1.548e-01  3.533e-01  -0.438 0.661363    
log(pieces):themeCollectible Minifigures          -4.812e-01  3.542e-01  -1.359 0.174256    
log(pieces):themeCreator                          -3.312e-02  3.515e-01  -0.094 0.924947    
log(pieces):themeDC Super Hero Girls              -8.358e-02  3.788e-01  -0.221 0.825360    
log(pieces):themeDespicable Me 4                  -3.970e-01  4.024e-01  -0.986 0.323937    
log(pieces):themeDimensions                       -4.426e-02  3.571e-01  -0.124 0.901357    
log(pieces):themeDino                             -9.017e-02  3.858e-01  -0.234 0.815191    
log(pieces):themeDisney                           -2.605e-01  3.519e-01  -0.740 0.459174    
log(pieces):themeDOTS                             -3.632e-01  3.523e-01  -1.031 0.302628    
log(pieces):themeDreamzzz                         -8.207e-02  3.608e-01  -0.227 0.820066    
log(pieces):themeDuplo                            -3.222e-01  3.514e-01  -0.917 0.359237    
log(pieces):themeEducational and Dacta            -5.603e-01  3.540e-01  -1.583 0.113516    
log(pieces):themeElves                            -1.395e-01  3.590e-01  -0.389 0.697540    
log(pieces):themeExo-Force                        -5.471e-02  3.637e-01  -0.150 0.880440    
log(pieces):themeFactory                           2.805e+00  5.435e+00   0.516 0.605771    
log(pieces):themeFortnite                         -3.360e-02  4.035e-01  -0.083 0.933641    
log(pieces):themeFriends                          -2.097e-01  3.514e-01  -0.597 0.550623    
log(pieces):themeFusion                           -1.183e+00  1.694e+00  -0.698 0.484998    
log(pieces):themeGabby's Dollhouse                -1.572e-01  3.836e-01  -0.410 0.681941    
log(pieces):themeGames                            -4.812e-01  3.594e-01  -1.339 0.180594    
log(pieces):themeGear                             -9.898e-01  2.664e+00  -0.372 0.710249    
log(pieces):themeGhostbusters                     -2.250e-01  4.031e-01  -0.558 0.576832    
log(pieces):themeHarry Potter                     -2.504e-01  3.525e-01  -0.710 0.477541    
log(pieces):themeHero Factory                     -4.116e-01  3.542e-01  -1.162 0.245242    
log(pieces):themeHidden Side                      -1.449e-01  3.647e-01  -0.397 0.691227    
log(pieces):themeIcons                            -1.805e-01  3.562e-01  -0.507 0.612458    
log(pieces):themeIndiana Jones                    -1.661e-01  3.617e-01  -0.459 0.646008    
log(pieces):themeJuniors                          -2.722e-01  3.591e-01  -0.758 0.448483    
log(pieces):themeJurassic World                   -2.321e-01  3.555e-01  -0.653 0.513896    
log(pieces):themeLegends of Chima                 -3.307e-01  3.530e-01  -0.937 0.348981    
log(pieces):themeLEGO Art                         -5.363e-01  3.683e-01  -1.456 0.145461    
log(pieces):themeLEGO Ideas and CUUSOO            -1.790e-01  3.546e-01  -0.505 0.613692    
log(pieces):themeMake & Create                    -2.697e-01  3.564e-01  -0.757 0.449245    
log(pieces):themeMaster Builder Academy           -2.321e-01  4.414e-01  -0.526 0.599090    
log(pieces):themeMindstorms                       -3.551e-01  3.541e-01  -1.003 0.316000    
log(pieces):themeMinecraft                        -2.052e-01  3.531e-01  -0.581 0.561081    
log(pieces):themeMinions                          -3.993e-01  3.731e-01  -1.070 0.284580    
log(pieces):themeMixels                           -9.994e-01  4.500e-01  -2.221 0.026392 *  
log(pieces):themeModular Buildings                -1.963e-01  4.261e-01  -0.461 0.645096    
log(pieces):themeMonkie Kid                       -2.485e-01  3.556e-01  -0.699 0.484744    
log(pieces):themeMonster Fighters                 -9.371e-02  3.642e-01  -0.257 0.796981    
log(pieces):themeNexo Knights                     -3.601e-01  3.529e-01  -1.020 0.307641    
log(pieces):themeNinjago                          -4.092e-01  3.513e-01  -1.165 0.244150    
log(pieces):themeOther                            -2.282e-01  3.532e-01  -0.646 0.518197    
log(pieces):themeOverwatch                        -4.730e-02  3.990e-01  -0.119 0.905651    
log(pieces):themePharaoh's Quest                  -2.986e-01  3.644e-01  -0.820 0.412490    
log(pieces):themePirates                          -2.877e-01  3.552e-01  -0.810 0.417875    
log(pieces):themePower Functions                   1.483e+00  1.024e+00   1.448 0.147576    
log(pieces):themePower Miners                     -5.971e-02  3.624e-01  -0.165 0.869141    
log(pieces):themePromotional                       6.727e-01  1.538e+00   0.437 0.661774    
log(pieces):themeQuatro                            1.710e+01  3.495e+01   0.489 0.624729    
log(pieces):themeRacers                           -1.630e-01  3.524e-01  -0.463 0.643717    
log(pieces):themeScooby-Doo                       -2.309e-01  3.914e-01  -0.590 0.555283    
log(pieces):themeSculptures                       -2.768e-01  3.807e-01  -0.727 0.467142    
log(pieces):themeSeasonal                         -2.920e-01  3.516e-01  -0.830 0.406318    
log(pieces):themeService Packs                            NA         NA      NA       NA    
log(pieces):themeSonic The Hedgehog               -2.424e-02  4.106e-01  -0.059 0.952934    
log(pieces):themeSpace                            -1.608e-01  3.532e-01  -0.455 0.648949    
log(pieces):themeSpeed Champions                  -1.342e-01  3.580e-01  -0.375 0.707843    
log(pieces):themeSpeed Racer                       5.370e-02  5.290e-01   0.102 0.919145    
log(pieces):themeSpongeBob SquarePants            -3.035e-01  3.910e-01  -0.776 0.437570    
log(pieces):themeStar Wars                        -1.608e-01  3.513e-01  -0.458 0.647249    
log(pieces):themeStranger Things                          NA         NA      NA       NA    
log(pieces):themeSuper Heroes DC                  -2.393e-01  3.519e-01  -0.680 0.496578    
log(pieces):themeSuper Heroes Marvel              -2.085e-01  3.517e-01  -0.593 0.553390    
log(pieces):themeSuper Mario                      -4.133e-01  3.519e-01  -1.174 0.240318    
log(pieces):themeTechnic                          -2.617e-02  3.516e-01  -0.074 0.940686    
log(pieces):themeTeenage Mutant Ninja Turtles     -2.514e-01  3.673e-01  -0.685 0.493685    
log(pieces):themeThe Hobbit and Lord of the Rings -1.228e-01  3.584e-01  -0.343 0.731876    
log(pieces):themeThe Legend of Zelda                      NA         NA      NA       NA    
log(pieces):themeThe LEGO Movie                   -2.542e-01  3.531e-01  -0.720 0.471629    
log(pieces):themeThe Lone Ranger                  -2.204e-01  3.792e-01  -0.581 0.561083    
log(pieces):themeThe Powerpuff Girls              -1.843e-01  9.792e-01  -0.188 0.850712    
log(pieces):themeTown                                     NA         NA      NA       NA    
log(pieces):themeToy Story                        -1.817e-01  3.696e-01  -0.491 0.623136    
log(pieces):themeTrain                            -1.032e+00  3.588e-01  -2.876 0.004047 ** 
log(pieces):themeTrolls: World Tour               -1.160e-01  3.779e-01  -0.307 0.758837    
log(pieces):themeUnikitty!                        -3.850e-01  3.623e-01  -1.063 0.287960    
log(pieces):themeVIDIYO                           -5.194e-01  3.614e-01  -1.437 0.150728    
log(pieces):themeVikings                                  NA         NA      NA       NA    
log(pieces):themeWednesday                         5.753e+00  6.362e+00   0.904 0.365861    
log(pieces):themeWicked                                   NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.297 on 5837 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.9175,    Adjusted R-squared:  0.9147 
F-statistic:   323 on 201 and 5837 DF,  p-value: < 2.2e-16
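
To make the log-scale coefficients above easier to read, here is a small sketch that converts a few of them into percentage effects. The numbers are copied from the summary output. Two things are my assumptions: that the theme terms behave like offsets against a baseline theme (the model uses a custom contrast matrix `C` that is not reproduced here, so this reading may not hold exactly), and the helper name `pct_change`, which is hypothetical.

```r
# Coefficients copied from the summary output above
b_log_pieces <- 0.9898    # log(pieces)
b_year       <- 0.00779   # year
b_minifigs   <- 0.03547   # minifigs
b_sw_slope   <- -0.1608   # log(pieces):themeStar Wars

# Hypothetical helper: percent change in price for a one-unit increase
# in a predictor that enters the model untransformed
pct_change <- function(beta) (exp(beta) - 1) * 100

pct_per_year    <- pct_change(b_year)      # each later release year
pct_per_minifig <- pct_change(b_minifigs)  # each additional minifigure

# log(pieces) acts as an elasticity: doubling pieces multiplies price by 2^beta
double_baseline <- 2^b_log_pieces

# ASSUMPTION: treatment-style coding, so the Star Wars slope is the
# baseline slope plus its interaction term
double_star_wars <- 2^(b_log_pieces + b_sw_slope)

round(c(pct_per_year     = pct_per_year,
        pct_per_minifig  = pct_per_minifig,
        double_baseline  = double_baseline,
        double_star_wars = double_star_wars), 3)
```

All of these are "holding everything else fixed" statements, and the theme-specific value is only as good as the contrast assumption above.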

Next, the diagnostic plots tell an optimistic story. For such a simple model, it performs well.

The Residuals vs Fitted plot shows residuals without any dominant shape, but there are some patterns among the outliers. In particular, the residuals curve clearly upward at the smallest fitted values: for small sets, the model is under-predicting the price.

The QQ plot shows meaningful deviations at the tails.

The Scale-Location plot has intricate curving patterns. I interpret these as a result of the Discrete Price Levels I discuss below.

The Residuals vs Leverage plot shows many strong outliers. This makes sense, as I’m sure there’s much that isn’t captured by the model, like promotional sets, one-off special sets, and other kinds of material outliers.

I’m always interested in heteroskedasticity of the residuals. While more robust tests and procedures exist to detect and treat heteroskedasticity, I’m satisfied with these visual checks for now.

Code
par(
  cex.axis = 2,
  cex.lab  = 3,
  cex.main = 4
)

plot(model_lmi, pch = 16, cex = 0.5)

Code
hist(residuals(model_lmi), breaks = 50, main = "Histogram of Residuals", col = "gray", border = "white")
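
For the heteroskedasticity question raised above, one of the more formal checks is a Breusch-Pagan style test. The sketch below is not run on `model_lmi`; it uses the built-in `mtcars` data so it is self-contained, and the function name `bp_test` is hypothetical. It implements the studentized (Koenker) variant by hand: regress the squared residuals on the model’s predictors, then compare n times that regression’s R-squared against a chi-squared distribution. The `lmtest` package offers a packaged version as `bptest()`.

```r
# Hypothetical helper: studentized Breusch-Pagan test for an lm fit
bp_test <- function(fit) {
  X   <- model.matrix(fit)         # predictors, including the intercept column
  e2  <- residuals(fit)^2          # squared residuals
  aux <- lm.fit(X, e2)             # auxiliary regression of e^2 on X
  r2  <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2          # LM statistic: n * R^2
  df   <- aux$rank - 1             # estimable predictors, excluding intercept
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Stand-in model on built-in data (NOT the LEGO model)
fit_demo <- lm(log(mpg) ~ log(wt) + hp, data = mtcars)
bp_demo  <- bp_test(fit_demo)
bp_demo
```

A small p-value would suggest the residual variance changes with the predictors. In that case, heteroskedasticity-robust standard errors are one common remedy.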

Model Considerations

Shaping the Data to the Model - or Vice Versa

As is often the case in data work, the dataset called Sets includes categories of things that I wouldn’t really call sets:

  • A line of collectible Minifigures of popular licensed characters, which have 0 pieces and 1 minifigure and are more expensive than expected

  • Small bundles of 1-5 special pieces sold not as a designed set but as bespoke pieces by themselves

  • Products that aren’t sets at all: Keychains, books, and backpacks

  • Electrical motors that are counted as the one piece of a one-piece set, and are expensive

Two main options are available to deal with this:

  1. Removing categories of things that don’t get at what you really care about
  2. Changing the model to incorporate the things you don’t care about

There are advantages and disadvantages to both. Here, I removed sets that had fewer than 3 pieces.
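
The two options can be sketched on a toy table. The data frame below is made up for illustration (the names, piece counts, and prices are not from the dataset), and the column name `is_tiny` is hypothetical. Option 1 drops the rows; option 2 keeps them and flags them so that a model could estimate a separate effect.

```r
# Made-up example rows; NOT taken from the real data
toy_sets <- data.frame(
  name   = c("Keychain", "Motor", "Small car", "Castle"),
  pieces = c(0, 1, 45, 620),
  price  = c(5.99, 14.99, 4.99, 59.99)
)

# Option 1: remove products with fewer than 3 pieces
kept <- subset(toy_sets, pieces >= 3)

# Option 2: keep every row and add an indicator the model can use,
# for example as an extra term in the model formula
flagged <- transform(toy_sets, is_tiny = pieces < 3)

kept
flagged
```

With option 2, `log(pieces)` is undefined for zero-piece products, so the piece term would need extra care as well. That is part of the trade-off.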

Discrete Price Levels

Like most retail products, LEGO sets are not priced at continuous decimals like $12.43. They usually sit at levels like $12.99, and at higher prices they bunch further into levels like $79.99 rather than $77.99. I didn’t account for this in the model, but I would like to.
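
One simple way to account for this after the fact would be to snap a continuous prediction to the nearest retail-style price point. The sketch below is only an illustration: the ladder of price points is an assumption I made up (steps of $1 up to $20, $5 up to $100, and $10 beyond that, each ending in .99), not LEGO’s actual pricing grid, and `snap_price` is a hypothetical helper.

```r
# ASSUMED ladder of price points; not LEGO's actual grid
price_points <- c(seq(1, 20, by = 1),
                  seq(25, 100, by = 5),
                  seq(110, 500, by = 10)) - 0.01

# Hypothetical helper: move each predicted price to the closest ladder value
snap_price <- function(pred, ladder = price_points) {
  idx <- vapply(pred, function(p) which.min(abs(ladder - p)), integer(1))
  ladder[idx]
}

snapped <- snap_price(c(12.43, 77.99, 233.10))
snapped
```

Snapping by absolute distance is the simplest choice. Snapping on the log scale, or modeling the price level directly as an ordered category, would be alternatives to consider.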

Nonlinear Parameters

For low-piece sets, I theorize a nonlinear shape of the price-piece relationship, because sets have a fixed cost to sell. A 1-piece special minifigure may cost $4, and a small 20-piece set may also cost $4, implying a flat or even negative price-piece relationship at the low end. A nonlinear term can capture this, where fixed costs produce a negative relationship at low piece counts and a more linear, positive relationship after that “dip.” I didn’t consider that in this model, instead opting for linear estimators.
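
Here is a rough sketch of the idea, on simulated data only. If price is a fixed cost plus a per-piece cost, then log price is nearly flat in log pieces for tiny sets and only approaches a straight line for larger ones. The $4 fixed cost, the $0.10 per piece, the noise level, and all object names are assumptions chosen for illustration. This produces a flat start rather than a true dip, but it shows how a curved term can pick up what a straight line misses.

```r
set.seed(1)

# Simulated sets: price = fixed cost + per-piece cost, with multiplicative noise
n      <- 500
pieces <- round(exp(runif(n, log(3), log(3000))))
price  <- (4 + 0.10 * pieces) * exp(rnorm(n, sd = 0.1))
sim    <- data.frame(pieces, price)

# Straight line on the log-log scale
fit_linear <- lm(log(price) ~ log(pieces), data = sim)

# Same model with a quadratic term, allowing curvature at low piece counts
fit_curved <- lm(log(price) ~ poly(log(pieces), 2), data = sim)

# Lower AIC indicates a better trade-off between fit and complexity
aic <- c(linear = AIC(fit_linear), curved = AIC(fit_curved))
aic
```

On the real data, the same comparison could be run by adding a curved term in `log(pieces)` to the model formula and checking whether the upward curvature in the Residuals vs Fitted plot shrinks.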