Challenges Set 8
Instructions
Use the starwars dataset from the dplyr package (already loaded with tidyverse) to complete the challenges below. I highly recommend that you first get to know the starwars dataset by trying the "Get to know your data" functions covered at the beginning of the Week 2 Starter file. Good luck and have fun completing them!
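For example, a quick first look at starwars might use the standard dplyr/tibble helpers (these are common choices, not necessarily the exact functions in the Starter file):

```r
library(tidyverse)  # loads dplyr, which ships the starwars dataset

glimpse(starwars)                          # column names, types, and a preview of values
summary(starwars$height)                   # numeric summary of one column
starwars %>% count(species, sort = TRUE)   # most common species first
```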
🛑 Don’t Click Submit Just Yet 🚧
Please read the information below carefully:
Once you have completed all the coding challenges and you're confident in your work, copy and paste your responses from the chunk into the form fields below each challenge.
You are responsible for correctly copying and pasting only the required code to solve each challenge. We will grade only what you have submitted!
We will grade only one submission per student, so do not click Submit until you are confident in your responses.
By submitting this form you are certifying that you have followed the academic integrity guidelines available in the syllabus. The code and answers submitted are the results of your work and your work only!
Make sure you have completed all the challenges and included all the required personal information (e.g., full name, email, zID) in the respective form fields. If you don't know how to complete a challenge, or don't want to, just leave the field below it empty.
Now you are ready to click the "Submit" button above. Congrats, you have completed this set of challenges!
Dataset Overview
The marketing_campaign dataset includes customer-level data with variables such as age, income, family size, and product purchases. The goal is to predict the total number of store purchases (NumStorePurchases) from customer demographics and behavior.
Challenge 1: Data Splitting
Objective: Split the data into training and testing sets.
Task: Load the dataset and examine it. Split the data into training (75%) and testing (25%) sets.
Expected Code:
library(tidymodels)
library(modeldata)

data("marketing_campaign")

set.seed(123)
split <- initial_split(marketing_campaign, prop = 0.75)
train_data <- training(split)
test_data <- testing(split)

Challenge 2: Data Preprocessing
Objective: Create a recipe to preprocess the data.
Task: Predict NumStorePurchases using Age, Income, Education, and Marital_Status. Normalize numeric variables and create dummy variables for categorical predictors.
Expected Code:
recipe_marketing <- recipe(NumStorePurchases ~ Age + Income + Education + Marital_Status,
                           data = train_data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

Challenge 3: Defining a Linear Regression Model
Objective: Create a linear regression model.
Task: Specify a linear regression model using the lm engine.
Expected Code:
model_lm <- linear_reg() %>% set_engine("lm")

Challenge 4: Building a Workflow
Objective: Combine the recipe and model into a workflow and fit it.
Task: Use the recipe_marketing and model_lm in a workflow. Fit the workflow to the training data.
Expected Code:
workflow_lm <- workflow() %>% add_recipe(recipe_marketing) %>% add_model(model_lm)
fit_lm <- workflow_lm %>% fit(data = train_data)

Challenge 5: Making Predictions
Objective: Predict total purchases on the test set.
Task: Generate predictions on the test data. Add the predictions to the test dataset.
Expected Code:
predictions <- predict(fit_lm, new_data = test_data) %>%
  bind_cols(test_data)

Challenge 6: Evaluating Model Performance
Objective: Compute RMSE, R², and MAE for the predictions.
Task: Use the yardstick package to evaluate the model.
Expected Code:
metrics <- predictions %>%
  metrics(truth = NumStorePurchases, estimate = .pred) %>%
  filter(.metric %in% c("rmse", "rsq", "mae"))
metrics

Challenge 7: Advanced Modeling
Objective: Fit a Random Forest model for comparison.
Task: Define a random forest model using rand_forest. Train and evaluate the model.
Expected Code:
model_rf <- rand_forest(mtry = 4, trees = 500, min_n = 10) %>%
  set_engine("ranger") %>%
  set_mode("regression")
workflow_rf <- workflow() %>% add_recipe(recipe_marketing) %>% add_model(model_rf)
fit_rf <- workflow_rf %>% fit(data = train_data)
predictions_rf <- predict(fit_rf, new_data = test_data) %>% bind_cols(test_data)
metrics_rf <- predictions_rf %>% metrics(truth = NumStorePurchases, estimate = .pred)
metrics_rf

Challenge 8: Feature Engineering
Objective: Add interaction terms to the recipe.
Task: Include an interaction between Age and Income to enhance the model.
Expected Code:
recipe_interactions <- recipe(NumStorePurchases ~ Age + Income + Education + Marital_Status,
                              data = train_data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact(terms = ~ Age:Income)

Challenge 9: Model Comparison
Objective: Compare RMSE, R², and MAE across models.
Task: Combine the metrics for both the linear regression and random forest models. Discuss the better-performing model and its implications.
Expected Code:
comparison <- bind_rows(
  metrics %>% mutate(Model = "Linear Regression"),
  metrics_rf %>% mutate(Model = "Random Forest")
)
comparison

Challenge 10: Optimization with Cross-Validation
Objective: Use cross-validation to optimize the random forest model.
Task: Perform 5-fold cross-validation. Tune the mtry and min_n parameters.
Expected Code:
set.seed(123)
folds <- vfold_cv(train_data, v = 5)
grid <- grid_regular(
  mtry(range = c(2, 6)),
  min_n(range = c(5, 15)),
  levels = 5
)
# mtry and min_n must be marked with tune() so tune_grid() knows what to vary;
# the fixed-parameter workflow_rf from Challenge 7 cannot be tuned directly
model_rf_tune <- rand_forest(mtry = tune(), trees = 500, min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")

workflow_rf_tune <- workflow() %>%
  add_recipe(recipe_marketing) %>%
  add_model(model_rf_tune)

tuned_rf <- tune_grid(
  workflow_rf_tune,
  resamples = folds,
  grid = grid,
  metrics = metric_set(rmse, rsq, mae)
)

best_rf <- select_best(tuned_rf, metric = "rmse")
final_rf <- finalize_workflow(workflow_rf_tune, best_rf)
final_fit <- final_rf %>% fit(data = train_data)
predictions_final <- predict(final_fit, new_data = test_data) %>% bind_cols(test_data)
metrics_final <- predictions_final %>% metrics(truth = NumStorePurchases, estimate = .pred)
metrics_final
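As a side note, tidymodels also offers last_fit(), which fits a finalized workflow on the training portion of a split and evaluates it on the test portion in one step. A minimal sketch using mtcars as a stand-in dataset (it ships with base R, so the example is self-contained):

```r
library(tidymodels)

set.seed(123)
car_split <- initial_split(mtcars, prop = 0.75)

wf <- workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# last_fit() = fit on training(car_split) + metrics/predictions on testing(car_split)
last_res <- last_fit(wf, split = car_split)
collect_metrics(last_res)   # rmse and rsq on the held-out test set
```

This is equivalent to the fit/predict/metrics sequence above, just packaged into a single call.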
Challenge 1: Data Splitting
Objective: Teach students how to split data into training and testing sets.
Dataset: webr::MoviesData (a dataset on movies, their genres, gross revenue, and more).
Task: Load the dataset and inspect its structure. Split the data into 80% training and 20% testing sets, ensuring reproducibility with a random seed.
Expected Code:
library(tidymodels)
library(webr)
data(MoviesData)
set.seed(123)  # Ensure reproducibility
split <- initial_split(MoviesData, prop = 0.8)
train_data <- training(split)
test_data <- testing(split)

Challenge 2: Creating a Recipe
Objective: Create a recipe for preprocessing the data.
Task: Build a recipe to predict Revenue using Runtime, Rating, and Votes as predictors. Include steps for normalizing numeric predictors and creating an interaction between Runtime and Rating.
Expected Code:
recipe_revenue <- recipe(Revenue ~ Runtime + Rating + Votes, data = train_data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_interact(terms = ~ Runtime:Rating)

Challenge 3: Defining a Model
Objective: Define a linear regression model using parsnip.
Task: Specify a linear regression model. Use the lm engine.
Expected Code:
model_lm <- linear_reg() %>% set_engine("lm")

Challenge 4: Building a Workflow
Objective: Combine the recipe and model into a workflow.
Task: Create a workflow using the recipe_revenue and model_lm. Fit the workflow to the training data.
Expected Code:
workflow_lm <- workflow() %>% add_recipe(recipe_revenue) %>% add_model(model_lm)
fit_lm <- workflow_lm %>% fit(data = train_data)

Challenge 5: Making Predictions
Objective: Generate predictions on the test set.
Task: Use the fitted workflow to predict Revenue on the test set. Add the predictions as a column to the test dataset.
Expected Code:
predictions <- predict(fit_lm, new_data = test_data) %>%
  bind_cols(test_data)

Challenge 6: Assessing Model Performance
Objective: Calculate RMSE, R², and MAE for the model.
Task: Use the yardstick package to compute the metrics. Compare actual vs. predicted Revenue.
Expected Code:
metrics <- predictions %>%
  metrics(truth = Revenue, estimate = .pred) %>%
  filter(.metric %in% c("rmse", "rsq", "mae"))
metrics

Challenge 7: Extending to Another Model
Objective: Compare the performance of a different model.
Dataset: Use the same dataset.
Task: Build a random forest model using the rand_forest() function with the ranger engine. Compare the RMSE, R², and MAE of the random forest model against the linear regression model.
Expected Code:
model_rf <- rand_forest(mtry = 2, trees = 500, min_n = 5) %>%
  set_engine("ranger") %>%
  set_mode("regression")
workflow_rf <- workflow() %>% add_recipe(recipe_revenue) %>% add_model(model_rf)
fit_rf <- workflow_rf %>% fit(data = train_data)
predictions_rf <- predict(fit_rf, new_data = test_data) %>% bind_cols(test_data)
metrics_rf <- predictions_rf %>%
  metrics(truth = Revenue, estimate = .pred) %>%
  filter(.metric %in% c("rmse", "rsq", "mae"))
metrics_rf
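For the comparison this challenge asks for, the two metric tables can be stacked and spread into one row per metric. A minimal sketch; the .estimate numbers below are invented placeholders just to show the shape of the output, not real model results:

```r
library(dplyr)
library(tidyr)

# Placeholder tables mimicking what yardstick::metrics() returns (values invented)
metrics_lm_demo <- tibble(.metric = c("rmse", "rsq", "mae"),
                          .estimate = c(2.10, 0.62, 1.55))
metrics_rf_demo <- tibble(.metric = c("rmse", "rsq", "mae"),
                          .estimate = c(1.80, 0.71, 1.30))

comparison_wide <- bind_rows(
  metrics_lm_demo %>% mutate(Model = "Linear Regression"),
  metrics_rf_demo %>% mutate(Model = "Random Forest")
) %>%
  pivot_wider(names_from = Model, values_from = .estimate)

comparison_wide  # one row per metric, one column per model
```

Substituting your own metrics and metrics_rf tables gives a side-by-side view that makes the "which model wins" discussion straightforward.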