Data Exploration

DATA COLLECTION

The data for this study were collected from the Global Dietary Database (GDD), which offers comprehensive data on dietary intake across regions and populations. The dataset was downloaded from the official GDD website, https://globaldietarydatabase.org/data-download, which provides detailed records on the daily consumption of key vitamins and nutrients. It draws on hundreds of dietary surveys and studies carried out in almost every part of the world, representing diverse age, gender, education, and income groups. The GDD offers valuable insight into global dietary patterns, helping to track how nutrition and food consumption vary across countries and change over time.

DATA UNDERSTANDING

The dataset we obtained is large, on the order of a few million records. It covers 57 dietary factors at three levels of aggregation: superregion, country, and global. We used only the country-level files and selected the 26 dietary factors we considered most important. We started by checking the shape of the dataset, which gives the number of rows (observations) and columns (variables) and therefore an idea of its size. We then used the head() function to look at the first few rows and get a sense of the data format. Reviewing the column names shows which features are available for analysis, and checking each column's data type shows whether it is numerical, categorical, or textual, which determines the kinds of analysis that can be run on it. In addition, describe() provides summary statistics for numerical columns (mean, standard deviation, minimum, maximum, and quartiles, including the median), which help in identifying trends, detecting outliers, and spotting data problems. Together, these steps give a working understanding of the data before any cleaning, transformation, or analysis.
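
The inspection steps described above can be sketched with pandas as follows. This is a minimal illustration, not the code used in the study: the small table and its column names (country, age, gender, year, median_intake) are assumptions standing in for one country-level GDD file.

```python
import pandas as pd

# Hypothetical stand-in for one country-level GDD file.
# Column names and values are assumptions for illustration only.
df = pd.DataFrame({
    "country": ["USA", "CAN", "IND", "BRA"],
    "age": [12.5, 22.5, 32.5, 42.5],
    "gender": [0, 1, 0, 1],
    "year": [2010, 2015, 2018, 2020],
    "median_intake": [48.2, 51.0, 39.7, 55.3],
})

n_rows, n_cols = df.shape      # (observations, variables)
print(f"{n_rows} rows x {n_cols} columns")

print(df.head())               # first rows, to see the record format
print(df.columns.tolist())     # features available for analysis
print(df.dtypes)               # numeric vs. text columns

# count, mean, std, min, quartiles (50% is the median), max per numeric column
summary = df.describe()
print(summary)
```

On a real file, the DataFrame would come from reading the downloaded CSV instead of being built by hand.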

DATA PREPARATION/DATA CLEANING

To prepare the data for analysis, the cleaning process begins by loading and inspecting all datasets to ensure compatibility and quality. Each dataset undergoes an initial review, which includes summarizing its structure, checking the number of rows and columns, and displaying sample records. Missing values are counted per column to identify columns with null entries, and they are then handled either by imputing a statistical measure (mean, median, or mode) or by dropping rows or columns, depending on the extent of the missing data. Categorical variables are validated to ensure they contain only predefined or expected categories (e.g., gender encoded as 0 and 1, or region encoded as 0 and 1), and any anomalies are corrected or removed. Numerical columns are examined for outliers and errors, such as negative or unusually high values outside reasonable bounds, which are either corrected or excluded to maintain data integrity. Filtering is applied to isolate relevant records, such as restricting the data to specific timeframes (e.g., year ≥ 2010) or valid age ranges (e.g., 12.5 to 42.5).
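
A minimal sketch of these cleaning steps is shown below. The column names, the sample values, and the choice of median imputation are assumptions made for illustration; the thresholds (year ≥ 2010, ages 12.5 to 42.5, gender in {0, 1}) are the examples given above.

```python
import pandas as pd

# Hypothetical raw records with typical problems; names and values are assumptions.
raw = pd.DataFrame({
    "country": ["USA", "USA", "CAN", "IND", "BRA", "BRA"],
    "age": [12.5, 22.5, 32.5, 97.5, 42.5, 22.5],
    "gender": [0, 1, 1, 0, 2, 1],             # 2 is not an expected category
    "year": [2010, 2005, 2015, 2018, 2020, 2018],
    "median_intake": [48.2, 51.0, None, 39.7, 55.3, -4.0],
})

# 1. Count missing values per column.
missing = raw.isna().sum()
print(missing)

# 2. Build one boolean mask per rule.
valid_gender = raw["gender"].isin([0, 1])      # expected categories only
non_negative = ~(raw["median_intake"] < 0)     # drops negatives, keeps NaN for imputation
recent = raw["year"] >= 2010                   # timeframe filter
age_ok = raw["age"].between(12.5, 42.5)        # valid age range (inclusive)

# 3. Apply all rules, then impute the remaining gaps with the column median.
clean = raw[valid_gender & non_negative & recent & age_ok].copy()
clean["median_intake"] = clean["median_intake"].fillna(clean["median_intake"].median())
clean = clean.reset_index(drop=True)
print(clean)
```

Invalid rows are removed before imputing so that erroneous values (such as the negative intake) do not influence the imputed median.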

Additionally, duplicate rows are identified and removed to ensure the dataset does not contain redundant information that could skew results. Data consistency is a critical focus; all datasets are checked for uniform column names, consistent data types, and alignment of values across related files to facilitate merging or aggregation. Once the cleaning is completed, the refined datasets are validated to ensure they meet the required standards for further processing, including accurate column data types, expected record counts, and readiness for visualization or advanced analysis. This meticulous process ensures that the data is reliable, relevant, and ready for high-quality insights and decision-making.
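
The duplicate and consistency checks might look like the sketch below. The two small tables, their column names, and the text-typed year column are hypothetical examples, not actual GDD contents.

```python
import pandas as pd

# Two hypothetical per-factor tables; names and values are assumptions.
fruits = pd.DataFrame({
    "country": ["USA", "USA", "CAN"],
    "year": [2010, 2010, 2015],
    "median_intake": [120.0, 120.0, 98.5],    # first two rows are identical
})
calcium = pd.DataFrame({
    "country": ["USA", "CAN"],
    "year": ["2010", "2015"],                 # year stored as text
    "median_intake": [850.0, 910.0],
})

# Count and remove exact duplicate rows.
n_dupes = int(fruits.duplicated().sum())
fruits_clean = fruits.drop_duplicates().reset_index(drop=True)

# Harmonise the data type of the year column.
calcium_clean = calcium.assign(year=pd.to_numeric(calcium["year"]))

# Verify that both tables now share column names and data types.
same_columns = list(fruits_clean.columns) == list(calcium_clean.columns)
same_dtypes = fruits_clean.dtypes.equals(calcium_clean.dtypes)
print(n_dupes, same_columns, same_dtypes)
```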

DATASET BEFORE CLEANING

These are screenshots of 2 of the 26 datasets before they were combined; displaying all of them here would take too much space. All the datasets have the same columns. The only difference is that each dataset represents a single dietary factor.

DATASET AFTER CLEANING

This is a screenshot of the data after the 26 datasets were combined into a single dataset.
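
Because every per-factor dataset shares the same columns, combining them can be sketched as a row-wise concatenation with an added label column. The dietary_factor column name and the sample values below are assumptions; the study's actual combination code is not shown in the source.

```python
import pandas as pd

# Hypothetical per-factor tables keyed by factor name (2 shown instead of 26).
per_factor = {
    "Fruits": pd.DataFrame({
        "country": ["USA", "CAN"],
        "year": [2010, 2010],
        "median_intake": [120.0, 98.5],
    }),
    "Calcium": pd.DataFrame({
        "country": ["USA", "CAN"],
        "year": [2010, 2010],
        "median_intake": [850.0, 910.0],
    }),
}

# Tag each table with its factor name, then stack all tables into one.
combined = pd.concat(
    [frame.assign(dietary_factor=name) for name, frame in per_factor.items()],
    ignore_index=True,
)
print(combined)
```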

DATA VISUALISATION

1. Bar Chart of the First Ten Dietary Factors Across Countries:

This visualization analyzes dietary factor consumption patterns across different countries, represented by stacked bars for each nation. The height of each bar reflects the total dietary contribution, allowing us to compare overall consumption levels among countries. The varying proportions of dietary factors (e.g., added sugars, fruits, calcium) highlight differences in nutritional focus or availability in each country. Notably, some countries exhibit higher proportions of specific nutrients or foods, which could reflect cultural, economic, or agricultural influences. The balanced yet distinct patterns suggest variability in dietary diversity and nutritional quality across regions.

2. Distribution of Fruit Intake Across Countries:

This strip plot shows the distribution of fruit intake across different countries. Each point represents one observation, with the median nutrition intake on the y-axis and individual countries on the x-axis. A wide vertical spread of points for a country reflects high variability in intake, while tightly clustered points suggest more consistent consumption. The gradual color gradient highlights regional trends, revealing disparities in fruit consumption across nations and populations.

3. Distribution of Nutrition Intake by Gender:

This violin plot compares the distribution of median nutrition intake between genders, where 0 represents females and 1 represents males. Both distributions show a similar pattern, with the majority of data concentrated at lower intake levels and a long tail indicating a few observations with substantially higher intake. The widths of the violins suggest comparable variability in nutrition intake across genders. Overall, there is no visually apparent difference between males and females in the distribution of median nutrition intake.

4. Analysis of Average Diet Intake Trends (2010-2020):

This bar plot shows the average diet intake from 2010 to 2020. The intake remains stable from 2010 to 2018, averaging around 50 units, and then drops sharply in 2020 to well below the level of previous years. This could reflect a real shift in dietary patterns caused by external factors, such as economic, health, or environmental changes, or it could reflect a change in reporting. Further investigation is needed to identify the cause of the drop.
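
One way to start that investigation is to compute the yearly average alongside the number of records behind it, since a sudden drop can come from sparse data as well as from a real change. The sketch below uses made-up values and assumed column names (year, median_intake).

```python
import pandas as pd

# Hypothetical records; values are invented to mimic a stable series with a late drop.
records = pd.DataFrame({
    "year": [2010, 2010, 2015, 2015, 2018, 2018, 2020],
    "median_intake": [50.0, 52.0, 49.0, 51.0, 50.0, 50.0, 25.0],
})

# Mean intake and record count per year.
by_year = records.groupby("year")["median_intake"].agg(["mean", "count"])

# Relative change of the yearly mean against the previous available year.
by_year["change_vs_prev"] = by_year["mean"].pct_change()
print(by_year)
```

In this toy example the last year has both a much lower mean and fewer records, which is the kind of pattern that would point toward a reporting effect.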

5. Correlation Between Dietary Factors and Average Education Levels:

This bar plot shows the relationship between dietary factors and average education levels. Unprocessed red meats and dietary cholesterol have high education-level associations, while added sugars also rank prominently. Whole grains show the highest association with education, significantly surpassing other factors. Conversely, plant omega-3 fat and beans and legumes have the lowest education-level associations, suggesting minimal emphasis or understanding within these groups.

6. Correlation Matrix of Age, Year, Education, and Nutritional Intake:

This correlation matrix heat map visualizes the relationships between age, year, education level, and median nutritional intake. Age and year show no significant correlation with other variables, as most values are near zero. Median nutritional intake has a slight negative correlation with education (-0.06), indicating a weak inverse relationship. The overall weak correlations suggest limited direct interaction between these factors.
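
A correlation matrix of this kind can be computed as sketched below. The sample values and column names are assumptions, and Pearson correlation is assumed because it is the pandas default; the source does not state which coefficient was used.

```python
import pandas as pd

# Hypothetical sample; column names and values are assumptions.
sample = pd.DataFrame({
    "age": [12.5, 22.5, 32.5, 42.5, 22.5, 32.5],
    "year": [2010, 2015, 2018, 2020, 2010, 2015],
    "education": [1, 2, 3, 1, 2, 3],
    "median_intake": [48.2, 51.0, 39.7, 55.3, 47.1, 44.0],
})

# Pairwise Pearson correlations between all numeric columns.
corr = sample.corr(method="pearson")
print(corr.round(2))
```

The resulting square matrix is symmetric with ones on the diagonal, and it is the table that a heat map then colors.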

7. Average Intake of Dietary Factors Across Regions:

This histogram compares the average intake of various dietary factors across different regions. Region 0 has the highest overall dietary intake, with notable contributions from factors like fruits, dietary fiber, and added sugars. Other regions show much lower or specific intakes; for instance, Region 400 primarily highlights vitamin-related intake. The dietary diversity and intensity drastically decline from Region 0 to the others.

8. Pairwise Relationships Among Age, Income, Education, and Year:

This pairwise plot shows the relationships among variables: age, median income, education, and year. Age displays a periodic distribution, while median income clusters around specific levels, with little spread. Education exhibits a peaked distribution with some clustering, possibly indicating common educational attainment levels. Year has discrete values, with data concentrated between 2010 and 2020, reflecting a time-bound dataset.

9. Nutritional Intake Patterns of Older Adults in the USA and Canada:

This radar plot visualizes the nutritional intake profiles of older populations in the USA and Canada. The data reveals relatively higher intakes for dietary sodium, refined grains, and unprocessed red meats, while nutrients like plant omega-3 fats and Vitamin D are consumed at lower levels. Dietary cholesterol and added sugars also show moderate levels, indicating balanced but potentially health-critical dietary patterns. Overall, the plot highlights areas for nutritional improvement among older demographics in these regions.

10. Hierarchical Breakdown of Dietary Factors Across Macronutrients and Micronutrients:

This sunburst chart visualizes the distribution of dietary factors categorized into macronutrients and micronutrients. Macronutrients, such as whole grains, nuts, and fruits, are represented in orange, highlighting a wide variety. Micronutrients, shown in green, include iron, vitamin D, and vitamin B12, showcasing key nutrient sources like eggs and cheese. The hierarchical breakdown effectively organizes food types and nutrients, emphasizing diversity in dietary composition.

11. Consistent Nutrition Intake Patterns Across Age Groups:

This box plot visualizes the distribution of capped median nutrition intake across various age groups. The median intake remains consistent across all age groups, indicating no significant variation. The interquartile range (IQR) and whiskers suggest a similar spread of data for all age groups, with no outliers evident. This consistency highlights that nutrition intake patterns are steady across the observed age range.
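
The source does not say how the intake values were capped. A common approach, assumed here purely for illustration, is to clip values to the 1.5 × IQR fences used by box plots, which would also explain why no outliers appear.

```python
import pandas as pd

# Hypothetical intake values with one extreme observation.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])

# Quartiles and interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Cap values outside the fences instead of dropping them.
capped = values.clip(lower=lower, upper=upper)
print(lower, upper)
print(capped.tolist())
```

Capping keeps every record in the dataset while limiting the influence of extreme values on the plot.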