This repository is published as a GitHub Pages project site at:
https://usern102.github.io/IJC437-Introduction-to-Data-Science-Coursework-Project/
GitHub Repository:
https://github.com/userN102/IJC437-Introduction-to-Data-Science-Coursework-Project
This project analyses UK industry-level business demography data (ONS) to understand how enterprise births differ across industries and time. It combines cleaned ONS tables into a single master panel, uses exploratory data analysis (EDA) to describe patterns and relationships, and then evaluates forecast-style prediction models using lagged indicators and time-aware cross-validation.
RQ1: How do enterprise births vary across UK industries between 2019 and 2023, and how are these differences associated with industry size, enterprise survival, and high-growth activity?
RQ2: How accurately can enterprise births be predicted using industry-level business demography indicators, and how does the predictive performance of multiple linear regression compare with random forest regression when evaluated using cross-validation on a limited dataset?
All R scripts are located in the Codes/ folder:
https://github.com/userN102/IJC437-Introduction-to-Data-Science-Coursework-Project/tree/main/Codes
Codes/
├── active_enterprises_10_employer_industry.R
├── active_enterprises_industry.R
├── enterprise_births_industry.R
├── enterprise_deaths_industry.R
├── enterprise_survival_industry.R
├── high_growth_enterprises_industry.R
├── merge_industry_characteristics.R
├── RQ1_EDA.R
└── RQ2_regression_model.R
Dataset/
├── ONS_Business_Demography/
│   ├── ons_original.xlsx
│   ├── Births_Of_New_Enterprises_2019_2024_by_Industry.xlsx
│   ├── Deaths_Of_New_Enterprises_2019_2024_by_Industry.xlsx
│   ├── Active_Enterprises_2019_2024_by_Industry.xlsx
│   ├── Active_Enterprises_10+_Emp_2019_2024_by_Industry.xlsx
│   ├── High_Growth_Enterprises_2019_2024_by_Industry.xlsx
│   └── Births_and_Survival_of_Enterprises_2019_2023_by_Industry.xlsx
└── Final_Master_Datasets/
    └── master_panel_industry_characteristics_2019_2023.xlsx
Outputs/
├── Tables/
└── Visuals/
Workflow_Diagram/
└── workflow.png
A workflow diagram summarising the full data pipeline is provided in the Workflow_Diagram/ folder.
These scripts read ONS tables and save tidy CSV outputs (long format unless noted):
enterprise_births_industry.R → births of new enterprises by industry/year (Table 1.2)
enterprise_deaths_industry.R → deaths of new enterprises by industry/year (Table 2.2)
active_enterprises_industry.R → active enterprises by industry/year (Table 3.2)
high_growth_enterprises_industry.R → high-growth enterprises by industry/year (Table 7.2)
active_enterprises_10_employer_industry.R → active enterprises (10+ employees) by industry/year (Table 7.4)
enterprise_survival_industry.R → births + survival (1–5 years) by industry/year (Table 5.2a–5.2e)
merge_industry_characteristics.R
Reads the cleaned CSV files and merges them into:
Dataset/Final_Master_Datasets/master_panel_industry_characteristics_2019_2023.csv
RQ1_EDA.R
Produces distributions, time patterns, industry comparisons, association plots, and correlation matrices to justify feature choices for modelling.
RQ2_regression_model.R
Builds lagged predictors (t-1), runs time-aware cross-validation, and compares multiple linear regression with random forest regression.
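As a rough illustration of the lag-then-validate idea, the sketch below builds t-1 predictors within each industry and scores a linear model against a random forest on an expanding window of years. It is not the code in RQ2_regression_model.R: the panel is synthetic with made-up values, the column names (industry, year, active, births) and the manual expanding-window loop are assumptions, and the actual script may use a different resampling scheme.

```r
library(dplyr)
library(ranger)

set.seed(123)

# Synthetic panel (made-up values): 6 industries x 5 years.
panel <- expand.grid(
  industry = paste0("ind_", 1:6),
  year = 2019:2023,
  stringsAsFactors = FALSE
)
# Each industry gets a stable size so that last year's values are informative.
industry_size <- rep(runif(6, 1000, 5000), times = 5)
panel$active <- round(industry_size + rnorm(nrow(panel), sd = 50))
panel$births <- round(0.1 * panel$active + rnorm(nrow(panel), sd = 20))

# Lag predictors within each industry so year t only sees year t-1.
lagged <- panel %>%
  arrange(industry, year) %>%
  group_by(industry) %>%
  mutate(
    births_lag1 = lag(births),
    active_lag1 = lag(active)
  ) %>%
  ungroup() %>%
  filter(!is.na(births_lag1))  # the first year has no lag and is dropped

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Expanding window: train on every year before the test year.
test_years <- 2021:2023
results <- lapply(test_years, function(test_year) {
  train <- filter(lagged, year < test_year)
  test <- filter(lagged, year == test_year)

  lm_fit <- lm(births ~ births_lag1 + active_lag1, data = train)
  rf_fit <- ranger(births ~ births_lag1 + active_lag1,
                   data = train, num.trees = 200)

  data.frame(
    test_year = test_year,
    rmse_lm = rmse(test$births, predict(lm_fit, newdata = test)),
    rmse_rf = rmse(test$births, predict(rf_fit, data = test)$predictions)
  )
})

cv_results <- bind_rows(results)
print(cv_results)
```

Splitting by year rather than at random keeps future observations out of the training folds, which is the point of time-aware validation on a short panel like this one.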
Install R and RStudio (recommended). This project was run using R (version may vary).
Clone or download this repository to your computer.
Ensure the original ONS Excel workbook is located at:
Dataset/ONS_Business_Demography/ons_original.xlsx
(If you rename the file or move it, update the file_path inside the scripts.)
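For orientation, here is a minimal sketch of how an extraction script might hold the workbook path in one place and reshape a wide industry-by-year table into the long format mentioned earlier. The helper to_long(), the column name industry, and the numbers are hypothetical; the real scripts also need sheet-specific cleaning that is not shown here.

```r
library(tidyr)

# Path from this README; change it here if the workbook is moved or renamed.
file_path <- "Dataset/ONS_Business_Demography/ons_original.xlsx"

# Hypothetical helper: reshape a wide industry-by-year table to long format.
to_long <- function(wide_tbl, value_name) {
  long_tbl <- pivot_longer(
    wide_tbl,
    cols = -industry,
    names_to = "year",
    values_to = value_name
  )
  long_tbl$year <- as.integer(long_tbl$year)
  long_tbl
}

# Made-up numbers standing in for one sheet of the workbook.
# With the real file, the wide table would come from something like
# readxl::read_excel(file_path, sheet = ...), followed by cleaning.
births_wide <- data.frame(
  industry = c("Construction", "Retail"),
  `2019` = c(100, 80),
  `2020` = c(90, 70),
  check.names = FALSE
)

births_long <- to_long(births_wide, "births")
print(births_long)
```

Keeping file_path as a single variable near the top of each script means a moved or renamed workbook only needs one change per script.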
Run the following in R once:
install.packages(c(
"tidyverse", "readxl", "readr", "dplyr", "tidyr", "stringr",
"scales", "corrplot", "tidymodels", "ranger", "rsample", "purrr"
))
Run scripts in this order:
Data extraction scripts (create cleaned CSV files from the Excel workbook)
enterprise_births_industry.R
enterprise_deaths_industry.R
active_enterprises_industry.R
active_enterprises_10_employer_industry.R
high_growth_enterprises_industry.R
enterprise_survival_industry.R
Build the master panel
merge_industry_characteristics.R
EDA for RQ1
RQ1_EDA.R
Outputs saved to: Outputs/Visuals/
Modelling for RQ2
RQ2_regression_model.R
Outputs saved to:
Outputs/Tables/ (metrics + predictions) and Outputs/Visuals/ (plots)
After running all scripts, you should see:
Dataset/Final_Master_Datasets/master_panel_industry_characteristics_2019_2023.csv
Outputs/Visuals/ (EDA + modelling diagnostics)
Outputs/Tables/
A fixed random seed is used (set.seed(123)).
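The expected outputs listed above can be checked from the repository root with a few lines of base R. This is only a convenience sketch: check_outputs() is a hypothetical helper and is not part of the repository's scripts.

```r
# Paths taken from this README; check_outputs() is a hypothetical helper.
expected_paths <- c(
  "Dataset/Final_Master_Datasets/master_panel_industry_characteristics_2019_2023.csv",
  "Outputs/Visuals",
  "Outputs/Tables"
)

# Return a named logical vector: TRUE where the file or folder exists.
check_outputs <- function(paths, root = ".") {
  found <- file.exists(file.path(root, paths))
  names(found) <- paths
  found
}

status <- check_outputs(expected_paths)
print(status)

if (!all(status)) {
  message("Missing: ", paste(names(status)[!status], collapse = ", "))
}
```

Run it with the working directory set to the repository root; anything reported as missing usually points to a script that was skipped or run out of order.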