
MODEL IMPLEMENTATION

We chose to address one particular question: "How do age, gender, and education affect the median/average value of dietary factors over the years?" We focused on regression-based models, since we are studying how the median value of different dietary factors changes over time. We implemented five different models on our data, all addressing this same question.

Before implementing the models, we transformed the dataset: we removed outliers and duplicate rows, applied undersampling to reduce the number of rows, and converted several object variables to categorical form. Below are the snippets:
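The cleaning steps above can be sketched as follows. This is a minimal illustration, not our actual preprocessing code: the column names are hypothetical stand-ins for the dataset's schema, and the IQR rule is one common way to define outliers.

```python
import pandas as pd

# Hypothetical toy frame standing in for the dietary-factors dataset;
# column names here are illustrative, not the project's actual schema.
df = pd.DataFrame({
    "age_group": ["25-29", "25-29", "30-34", "30-34", "70+", "25-29"],
    "gender": ["Female", "Male", "Female", "Male", "Female", "Female"],
    "median_value": [42.0, 40.5, 43.1, 500.0, 41.8, 42.0],
})

# 1) Drop exact duplicate rows.
df = df.drop_duplicates()

# 2) Remove outliers with a simple IQR rule on the target column.
q1, q3 = df["median_value"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["median_value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 3) Convert object columns to pandas categoricals.
for col in ["age_group", "gender"]:
    df[col] = df[col].astype("category")

# 4) Undersample: cap each age_group at n rows to shrink the dataset.
n = 1
df = df.groupby("age_group", observed=True).head(n)
```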

Before transformation:

[Image: dataset snippets and model results before transformation]

EXPLANATION:

The performance metrics for the Ridge and Lasso regression models are nearly identical. The Mean Absolute Error (MAE) is around 42.74, while the Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) are approximately 4480 and 66.93, respectively, indicating large prediction errors. The R² value for both models is about 0.0034, showing minimal power to explain the variability in the data. Overall, these metrics indicate that both models perform poorly on this dataset.

After transformation:

[Image: dataset snippet after transformation]
The models implemented are as follows. NOTE: we fine-tuned each model by adjusting its hyperparameters to achieve the best performance we could.

1) Ridge and Lasso regression model

Ridge regression adds an L2 penalty (the sum of squared coefficients) to the loss function, which shrinks the coefficients but retains all the features. Lasso regression works similarly but adds an L1 penalty (the sum of absolute coefficient values), which can shrink some coefficients exactly to zero, thereby performing feature selection. Both models produce predictions and provide insight into feature importance; Lasso can remove irrelevant features outright. Both help the model generalize by preventing overfitting.
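The contrast between the two penalties can be sketched on synthetic data. This is an illustrative example, not our project code: only the first two features drive the target, so Lasso should zero out the rest while Ridge merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
# Synthetic stand-in features (e.g., encoded age, gender, education, year);
# only the first two columns actually drive the target.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Ridge: L2 penalty shrinks coefficients but keeps every feature.
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso: L1 penalty can drive irrelevant coefficients exactly to zero.
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.round(ridge.coef_, 2))  # all five coefficients nonzero, slightly shrunk
print(np.round(lasso.coef_, 2))  # the three noise features land at (or near) zero
```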

 

THE RESULT:

[Image: Ridge and Lasso regression results]

2) KNN model 

K-Nearest Neighbors (KNN) is a simple, non-parametric algorithm. For classification, it assigns a data point the majority label of its K nearest neighbors in feature space; for regression, it predicts the average target value of those K neighbors. KNN also offers insight into the local structure of the data, since each prediction comes from nearby, similar points.
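The averaging behavior in the regression case can be made concrete with a tiny illustrative example (a single hypothetical feature, not our actual data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# One illustrative feature (say, age) against a numeric target.
X = np.array([[20], [25], [30], [35], [40], [45]])
y = np.array([40.0, 42.0, 44.0, 46.0, 48.0, 50.0])

# For regression, KNN predicts the average target of the k nearest points.
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# The 3 nearest neighbors of 28 are 30, 25, and 35,
# so the prediction is the mean of 44, 42, and 46.
pred = knn.predict([[28]])
print(pred)  # [44.]
```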

 

THE RESULT:

[Image: KNN model results]

EXPLANATION:

The K-Nearest Neighbors (KNN) model shows poor performance in predicting how age, gender, and education influence dietary factors over the years, indicated by an R² score of -0.037, suggesting it performs worse than a simple mean prediction. The Mean Absolute Error (MAE) of 44.39 and Root Mean Squared Error (RMSE) of 68.29 highlight significant prediction errors. This indicates that KNN does not effectively capture the relationship between demographic variables and dietary outcomes, possibly due to inappropriate parameter settings or insufficient feature informativeness.

3) Random Forest Model 

Random Forest is an ensemble learning method that builds a forest of decision trees, each trained on a random subset of the data. It aggregates the trees' predictions, by majority vote for classification or by averaging for regression. The ensemble improves accuracy and reduces overfitting, and it also reports feature importance, showing which features contribute most to the model's decisions.
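Both the averaging and the feature-importance output can be sketched on synthetic data (illustrative only; in this toy setup, only the first feature matters, so its importance should dominate):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Three synthetic features; only the first one drives the target.
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

# Each tree is trained on a bootstrap sample of the rows;
# the forest's prediction is the average over all trees.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ shows which features drive the model's decisions.
print(rf.feature_importances_)  # the first value dominates
```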

THE RESULT:

[Image: Random Forest model results]

EXPLANATION:

The Random Forest model's performance in predicting how age, gender, and education affect dietary factors is inadequate, indicated by an R² score of approximately zero, which shows it fails to explain any of the variability in the data. Despite a low Mean Absolute Error (MAE), the high Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) suggest that when the model errs, it errs substantially. This implies the model may be overfitting or failing to capture the complex relationships between the demographic variables and dietary outcomes.

4) XGBOOST model

XGBoost is an optimized, distributed gradient-boosting library for classification and regression. It builds an ensemble of decision trees sequentially, where each new tree is trained to correct the errors of its predecessors, and the final prediction is the sum of the trees' predictions. The model also reports feature importance, indicating which features most influence the predictions.
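The sequential "each tree corrects its predecessor" idea can be sketched without the xgboost dependency using scikit-learn's `GradientBoostingRegressor`, which implements the same underlying boosting scheme (xgboost's own `XGBRegressor` has a very similar interface). The data here is synthetic and illustrative only:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 4.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)

# Sequential boosting: each new tree is fit to the residual errors of the
# ensemble so far, and the final prediction is the sum over all trees.
gbr = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
).fit(X, y)

print(gbr.score(X, y))  # training R^2, close to 1 on this easy synthetic data
```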

THE RESULT:

[Image: XGBoost model results]

EXPLANATION:

The XGBoost model showed a Mean Absolute Error (MAE) of 42.67 and an R² value of 0.0045. Despite the MAE being moderate, the R² value is very low, suggesting that the model is not effectively capturing the relationship between the input features (age, gender, and education) and dietary factors. The model struggles to explain the variability in dietary habits, indicating that XGBoost might require parameter tuning or additional features to improve its predictive accuracy.

5) Bayesian model

A Bayesian model uses Bayes' theorem to update the probability of a hypothesis as more evidence or data is acquired. New data is combined with prior beliefs to produce probabilistic predictions or inferences about uncertain events. From a Bayesian model we obtain posterior distributions, which give the likelihood of different hypotheses given prior knowledge and the observed data, along with uncertainty estimates, i.e., how confident the model is in its predictions.
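For regression, this approach can be sketched with scikit-learn's `BayesianRidge`, which places Gaussian priors on the coefficients and returns both a posterior predictive mean and a per-prediction uncertainty. The data below is synthetic and illustrative only:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# BayesianRidge places Gaussian priors on the coefficients and updates them
# from the data; lambda_ is the learned precision of the coefficient prior.
model = BayesianRidge().fit(X, y)

mean, std = model.predict(X[:5], return_std=True)
print(np.round(mean, 2))   # posterior predictive mean
print(np.round(std, 3))    # per-prediction uncertainty estimate
```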

THE RESULT:

[Image: Bayesian model results]

EXPLANATION:

The Bayesian regression model performs extremely well at low values of the regularization parameter lambda, achieving an R² score of 1.0000 and indicating a strong fit when predicting dietary factors from age, gender, and education. However, as lambda increases to 10, performance worsens drastically, with a sharp rise in MSE and a negative R² score, suggesting significant underfitting. This highlights the importance of carefully selecting lambda to balance model complexity and fit when modeling demographic impacts on dietary trends.

INFERENCE FROM THE MODELS:

The best of the models above is the Bayesian model; the rest showed very low predictive accuracy.
