Challenges Set 8
Instructions
Use the starwars dataset from the dplyr package (already loaded with tidyverse) to complete the challenges below. I highly recommend that you first get to know the starwars dataset by trying the "Get to know your data" functions covered at the beginning of the Week 2 Starter file. Good luck and have fun completing them!
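For example, a quick first look at starwars might use the standard dplyr/tibble helpers (these are common choices, not necessarily the exact functions in the Starter file):

```r
library(tidyverse)  # loads dplyr, which ships the starwars dataset

glimpse(starwars)                          # column names, types, and a preview of values
summary(starwars$height)                   # numeric summary of one column
starwars %>% count(species, sort = TRUE)   # most common species first
```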
🛑 Don’t Click Submit Just Yet 🚧
Please read the information below carefully:
Once you have completed all the coding challenges and you're confident in your work, copy and paste your responses from the chunk into the form fields below each challenge.
You are responsible for correctly copying and pasting only the required code to solve each challenge. We will grade only what you have submitted!
We will grade only one submission per student, so do not click Submit until you are confident in your responses.
By submitting this form you are certifying that you have followed the academic integrity guidelines available in the syllabus. The code and answers submitted are the results of your work and your work only!
Make sure you have completed all the challenges and included all the required personal information (e.g., full name, email, zID) in the respective form fields. If you don't know how to complete a challenge, or don't want to, just leave the field below it empty.
Now you are ready to click the "Submit" button above. Congrats, you have completed this set of challenges!
Dataset Overview
The marketing_campaign dataset includes customer-level data with variables such as age, income, family size, and product purchases. The goal is to predict the total number of store purchases (NumStorePurchases) from customer demographics and behavior.
Challenge 1: Data Splitting
Objective: Split the data into training and testing sets.
Task: Load the dataset and examine it. Split the data into training (75%) and testing (25%) sets.
Expected Code:
library(tidymodels)
library(modeldata)

data("marketing_campaign")

set.seed(123)
split <- initial_split(marketing_campaign, prop = 0.75)
train_data <- training(split)
test_data <- testing(split)

Challenge 2: Data Preprocessing
Objective: Create a recipe to preprocess the data.
Task: Predict NumStorePurchases using Age, Income, Education, and Marital_Status. Normalize numeric variables and create dummy variables for categorical predictors.
Expected Code:
recipe_marketing <- recipe(NumStorePurchases ~ Age + Income + Education + Marital_Status,
                           data = train_data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

Challenge 3: Defining a Linear Regression Model
Objective: Create a linear regression model.
Task: Specify a linear regression model using the lm engine.
Expected Code:
model_lm <- linear_reg() %>% set_engine("lm")

Challenge 4: Building a Workflow
Objective: Combine the recipe and model into a workflow and fit it.
Task: Use the recipe_marketing and model_lm in a workflow. Fit the workflow to the training data.
Expected Code:
workflow_lm <- workflow() %>% add_recipe(recipe_marketing) %>% add_model(model_lm)
fit_lm <- workflow_lm %>% fit(data = train_data)

Challenge 5: Making Predictions
Objective: Predict total purchases on the test set.
Task: Generate predictions on the test data. Add the predictions to the test dataset.
Expected Code:
predictions <- predict(fit_lm, new_data = test_data) %>%
  bind_cols(test_data)

Challenge 6: Evaluating Model Performance
Objective: Compute RMSE, R², and MAE for the predictions.
Task: Use the yardstick package to evaluate the model.
Expected Code:
metrics <- predictions %>%
  metrics(truth = NumStorePurchases, estimate = .pred) %>%
  filter(.metric %in% c("rmse", "rsq", "mae"))
metrics

Challenge 7: Advanced Modeling
Objective: Fit a Random Forest model for comparison.
Task: Define a random forest model using rand_forest. Train and evaluate the model.
Expected Code:
model_rf <- rand_forest(mtry = 4, trees = 500, min_n = 10) %>%
  set_engine("ranger") %>%
  set_mode("regression")
workflow_rf <- workflow() %>% add_recipe(recipe_marketing) %>% add_model(model_rf)
fit_rf <- workflow_rf %>% fit(data = train_data)
predictions_rf <- predict(fit_rf, new_data = test_data) %>% bind_cols(test_data)
metrics_rf <- predictions_rf %>% metrics(truth = NumStorePurchases, estimate = .pred)
metrics_rf

Challenge 8: Feature Engineering
Objective: Add interaction terms to the recipe.
Task: Include an interaction between Age and Income to enhance the model.
Expected Code:
recipe_interactions <- recipe(NumStorePurchases ~ Age + Income + Education + Marital_Status,
                              data = train_data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact(terms = ~ Age:Income)

Challenge 9: Model Comparison
Objective: Compare RMSE, R², and MAE across models.
Task: Combine the metrics for both the linear regression and random forest models. Discuss the better-performing model and its implications.
Expected Code:
comparison <- bind_rows(
  metrics %>% mutate(Model = "Linear Regression"),
  metrics_rf %>% mutate(Model = "Random Forest")
)
comparison

Challenge 10: Optimization with Cross-Validation
Objective: Use cross-validation to optimize the random forest model.
Task: Perform 5-fold cross-validation. Tune the mtry and min_n parameters.
Expected Code:
set.seed(123)
folds <- vfold_cv(train_data, v = 5)
grid <- grid_regular(
  mtry(range = c(2, 6)),
  min_n(range = c(5, 15)),
  levels = 5
)
# mtry and min_n must be marked with tune() so tune_grid() knows what to vary;
# the fixed-parameter workflow_rf from Challenge 7 cannot be tuned directly
model_rf_tune <- rand_forest(mtry = tune(), trees = 500, min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")

workflow_rf_tune <- workflow() %>%
  add_recipe(recipe_marketing) %>%
  add_model(model_rf_tune)

tuned_rf <- tune_grid(
  workflow_rf_tune,
  resamples = folds,
  grid = grid,
  metrics = metric_set(rmse, rsq, mae)
)

best_rf <- select_best(tuned_rf, metric = "rmse")
final_rf <- finalize_workflow(workflow_rf_tune, best_rf)
final_fit <- final_rf %>% fit(data = train_data)
predictions_final <- predict(final_fit, new_data = test_data) %>% bind_cols(test_data)
metrics_final <- predictions_final %>% metrics(truth = NumStorePurchases, estimate = .pred)
metrics_final
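As a side note, tidymodels also offers last_fit(), which fits a finalized workflow on the training portion of a split and evaluates it on the test portion in one step. A minimal sketch using mtcars as a stand-in dataset (it ships with base R, so the example is self-contained):

```r
library(tidymodels)

set.seed(123)
car_split <- initial_split(mtcars, prop = 0.75)

wf <- workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# last_fit() = fit on training(car_split) + metrics/predictions on testing(car_split)
last_res <- last_fit(wf, split = car_split)
collect_metrics(last_res)   # rmse and rsq on the held-out test set
```

This is equivalent to the fit/predict/metrics sequence above, just packaged into a single call.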
Challenge 1: Data Splitting
Objective: Teach students how to split data into training and testing sets.
Dataset: webr::MoviesData (a dataset on movies, their genres, gross revenue, and more).
Task: Load the dataset and inspect its structure. Split the data into 80% training and 20% testing sets, ensuring reproducibility with a random seed.
Expected Code:
library(tidymodels)
library(webr)
data(MoviesData)
set.seed(123)  # Ensure reproducibility
split <- initial_split(MoviesData, prop = 0.8)
train_data <- training(split)
test_data <- testing(split)

Challenge 2: Creating a Recipe
Objective: Create a recipe for preprocessing the data.
Task: Build a recipe to predict Revenue using Runtime, Rating, and Votes as predictors. Include steps for normalizing numeric predictors and creating an interaction between Runtime and Rating.
Expected Code:
recipe_revenue <- recipe(Revenue ~ Runtime + Rating + Votes, data = train_data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_interact(terms = ~ Runtime:Rating)

Challenge 3: Defining a Model
Objective: Define a linear regression model using parsnip.
Task: Specify a linear regression model. Use the lm engine.
Expected Code:
model_lm <- linear_reg() %>% set_engine("lm")

Challenge 4: Building a Workflow
Objective: Combine the recipe and model into a workflow.
Task: Create a workflow using the recipe_revenue and model_lm. Fit the workflow to the training data.
Expected Code:
workflow_lm <- workflow() %>% add_recipe(recipe_revenue) %>% add_model(model_lm)
fit_lm <- workflow_lm %>% fit(data = train_data)

Challenge 5: Making Predictions
Objective: Generate predictions on the test set.
Task: Use the fitted workflow to predict Revenue on the test set. Add the predictions as a column to the test dataset.
Expected Code:
predictions <- predict(fit_lm, new_data = test_data) %>%
  bind_cols(test_data)

Challenge 6: Assessing Model Performance
Objective: Calculate RMSE, R², and MAE for the model.
Task: Use the yardstick package to compute the metrics. Compare actual vs. predicted Revenue.
Expected Code:
metrics <- predictions %>%
  metrics(truth = Revenue, estimate = .pred) %>%
  filter(.metric %in% c("rmse", "rsq", "mae"))
metrics

Challenge 7: Extending to Another Model
Objective: Compare the performance of a different model.
Dataset: Use the same dataset.
Task: Build a random forest model using the rand_forest() function with the ranger engine. Compare the RMSE, R², and MAE of the random forest model against the linear regression model.
Expected Code:
model_rf <- rand_forest(mtry = 2, trees = 500, min_n = 5) %>%
  set_engine("ranger") %>%
  set_mode("regression")
workflow_rf <- workflow() %>% add_recipe(recipe_revenue) %>% add_model(model_rf)
fit_rf <- workflow_rf %>% fit(data = train_data)
predictions_rf <- predict(fit_rf, new_data = test_data) %>% bind_cols(test_data)
metrics_rf <- predictions_rf %>%
  metrics(truth = Revenue, estimate = .pred) %>%
  filter(.metric %in% c("rmse", "rsq", "mae"))
metrics_rf
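For the comparison this challenge asks for, the two metric tables can be stacked and spread into one row per metric. A minimal sketch; the .estimate numbers below are invented placeholders just to show the shape of the output, not real model results:

```r
library(dplyr)
library(tidyr)

# Placeholder tables mimicking what yardstick::metrics() returns (values invented)
metrics_lm_demo <- tibble(.metric = c("rmse", "rsq", "mae"),
                          .estimate = c(2.10, 0.62, 1.55))
metrics_rf_demo <- tibble(.metric = c("rmse", "rsq", "mae"),
                          .estimate = c(1.80, 0.71, 1.30))

comparison_wide <- bind_rows(
  metrics_lm_demo %>% mutate(Model = "Linear Regression"),
  metrics_rf_demo %>% mutate(Model = "Random Forest")
) %>%
  pivot_wider(names_from = Model, values_from = .estimate)

comparison_wide  # one row per metric, one column per model
```

Substituting your own metrics and metrics_rf tables gives a side-by-side view that makes the "which model wins" discussion straightforward.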