IJC437-Introduction-to-Data-Science-Coursework-Project

Explaining Patterns and Short-Term Prediction in Enterprise Births Across UK Industries Using Business Demographics

This repository is published as a GitHub Pages project site at:
https://usern102.github.io/IJC437-Introduction-to-Data-Science-Coursework-Project/

GitHub Repository:
https://github.com/userN102/IJC437-Introduction-to-Data-Science-Coursework-Project

This project analyses UK industry-level business demography data (ONS) to understand how enterprise births differ across industries and time. It combines cleaned ONS tables into a single master panel, uses exploratory data analysis (EDA) to describe patterns and relationships, and then evaluates forecast-style prediction models using lagged indicators and time-aware cross-validation.

Research Questions

RQ1: How do enterprise births vary across UK industries between 2019 and 2023, and how are these differences associated with industry size, enterprise survival, and high-growth activity?

RQ2: How accurately can enterprise births be predicted using industry-level business demography indicators, and how does the predictive performance of multiple linear regression compare with random forest regression when evaluated using cross-validation on a limited dataset?

Key Findings (Summary)


Repository Structure

All R scripts are located in the Codes/ folder:
https://github.com/userN102/IJC437-Introduction-to-Data-Science-Coursework-Project/tree/main/Codes

Codes/
├── active_enterprises_10_employer_industry.R
├── active_enterprises_industry.R
├── enterprise_births_industry.R
├── enterprise_deaths_industry.R
├── enterprise_survival_industry.R
├── high_growth_enterprises_industry.R
├── merge_industry_characteristics.R
├── RQ1_EDA.R
└── RQ2_regression_model.R

Dataset/
├── ONS_Business_Demography/
│   ├── ons_original.xlsx
│   ├── Births_Of_New_Enterprises_2019_2024_by_Industry.xlsx
│   ├── Deaths_Of_New_Enterprises_2019_2024_by_Industry.xlsx
│   ├── Active_Enterprises_2019_2024_by_Industry.xlsx
│   ├── Active_Enterprises_10+_Emp_2019_2024_by_Industry.xlsx
│   ├── High_Growth_Enterprises_2019_2024_by_Industry.xlsx
│   └── Births_and_Survival_of_Enterprises_2019_2023_by_Industry.xlsx
└── Final_Master_Datasets/
    └── master_panel_industry_characteristics_2019_2023.xlsx

Outputs/
├── Tables/
└── Visuals/

Workflow_Diagram/
└── workflow.png


A workflow diagram summarising the full data pipeline (workflow.png) is provided in the Workflow_Diagram/ folder.

R Code

1) Data extraction (from the ONS Excel workbook)

These scripts read tables from the ONS Excel workbook and save tidy CSV outputs (long format unless noted). The six extraction scripts are listed under Step 4 below.
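
As a rough illustration of what "long format" means here, the sketch below reshapes a small wide table (one column per year) into one row per industry and year. The industry labels, counts, and column names are invented for the example, and the commented read_excel() call uses placeholder arguments; the repository's scripts may be organised differently.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for one wide ONS sheet: one row per industry, one column per year.
# Labels and counts are invented for illustration only.
births_wide <- data.frame(
  industry = c("Construction", "Retail"),
  `2019` = c(100, 200),
  `2020` = c(110, 190),
  check.names = FALSE
)

# With the real workbook, the wide table would instead come from something like:
# births_wide <- readxl::read_excel(file_path, sheet = "<sheet name>", skip = 0)

# Reshape to long format: one row per industry-year combination.
births_long <- births_wide %>%
  pivot_longer(cols = -industry, names_to = "year", values_to = "births") %>%
  mutate(year = as.integer(year))

print(births_long)
```

Keeping every indicator in the same industry/year layout is what makes the later merge step a simple join on two keys.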

2) Build master panel (merge all indicators)
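
The merge step can be pictured with the following minimal sketch, which assumes each indicator has already been saved in long format with industry and year columns. The three toy tables and their column names are hypothetical, and full_join is only one possible choice of join; merge_industry_characteristics.R may handle keys and missing years differently.

```r
library(dplyr)
library(purrr)

# Hypothetical long-format indicator tables sharing the keys industry and year.
births <- data.frame(industry = c("A", "A", "B", "B"),
                     year = c(2019L, 2020L, 2019L, 2020L),
                     births = c(10, 12, 20, 18))
active <- data.frame(industry = c("A", "A", "B", "B"),
                     year = c(2019L, 2020L, 2019L, 2020L),
                     active = c(100, 105, 200, 210))
deaths <- data.frame(industry = c("A", "A", "B", "B"),
                     year = c(2019L, 2020L, 2019L, 2020L),
                     deaths = c(8, 9, 15, 17))

# Join all tables pairwise on the shared keys to build one panel.
master_panel <- list(births, active, deaths) %>%
  reduce(function(x, y) full_join(x, y, by = c("industry", "year"))) %>%
  arrange(industry, year)

print(master_panel)
```

A full join keeps industry-year rows that are missing from some tables, which makes gaps visible as NA values rather than silently dropping them.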

3) EDA (RQ1)
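
For orientation, an EDA step of this kind might start with per-industry summaries and a rank correlation between indicators. The sketch below uses an invented six-row panel with hypothetical column names (births, active, high_growth); it is not taken from RQ1_EDA.R.

```r
library(dplyr)

# Invented mini panel; the real master panel has more industries, years and columns.
panel <- data.frame(
  industry = rep(c("A", "B", "C"), each = 2),
  year = rep(c(2019L, 2020L), times = 3),
  births = c(10, 12, 20, 18, 35, 40),
  active = c(100, 105, 210, 200, 330, 350),
  high_growth = c(2, 3, 5, 4, 9, 8)
)

# Average births per industry, largest first.
births_by_industry <- panel %>%
  group_by(industry) %>%
  summarise(mean_births = mean(births), .groups = "drop") %>%
  arrange(desc(mean_births))

# Spearman correlation between indicators (rank-based, so less sensitive to scale).
cor_matrix <- cor(panel[, c("births", "active", "high_growth")], method = "spearman")

print(births_by_industry)
print(round(cor_matrix, 2))
```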

4) Modelling (RQ2)
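
The idea of combining lagged indicators with time-aware cross-validation can be sketched as follows. Everything specific here is an assumption made for illustration: the panel is simulated, the predictors (births_lag1, active_lag1) are hypothetical, and the expanding-window folds are built by hand rather than with rsample. RQ2_regression_model.R may use different predictors, folds, and tuning.

```r
library(dplyr)
library(ranger)

set.seed(437)

# Simulated panel: 12 hypothetical industries x 5 years (2019-2023).
panel <- expand.grid(industry = paste0("ind_", 1:12), year = 2019:2023,
                     stringsAsFactors = FALSE) %>%
  mutate(active = round(runif(n(), 1000, 50000)),
         births = pmax(0, round(0.1 * active + rnorm(n(), sd = 200))))

# Lag predictors within each industry so that year t is predicted
# only from information available in year t - 1.
lagged <- panel %>%
  group_by(industry) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(births_lag1 = dplyr::lag(births),
         active_lag1 = dplyr::lag(active)) %>%
  ungroup() %>%
  filter(!is.na(births_lag1)) %>%
  as.data.frame()

root_mse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Expanding-window folds: train on all years before the test year,
# so no future observations leak into training.
test_years <- 2021:2023
cv_results <- bind_rows(lapply(test_years, function(y) {
  train <- filter(lagged, year < y)
  test <- filter(lagged, year == y)

  lm_fit <- lm(births ~ births_lag1 + active_lag1, data = train)
  rf_fit <- ranger(births ~ births_lag1 + active_lag1, data = train,
                   num.trees = 500)

  data.frame(
    test_year = y,
    rmse_lm = root_mse(test$births, predict(lm_fit, newdata = test)),
    rmse_rf = root_mse(test$births, predict(rf_fit, data = test)$predictions)
  )
}))

print(cv_results)
```

With only a few years of data, each fold's training set is small, so fold-level errors can vary a lot; comparing the two models on the average across folds is usually more informative than on any single test year.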

How to Download and Run

Step 0 — Requirements

Install R and, optionally, RStudio (recommended). No specific R version is pinned for this project, so results may differ slightly between versions.

Step 1 — Get the repository

Clone or download this repository to your computer.

Step 2 — Ensure the dataset is in the correct place

Ensure the original ONS Excel workbook is located at:

Dataset/ONS_Business_Demography/ons_original.xlsx

(If you rename the file or move it, update the file_path inside the scripts.)
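
A quick way to confirm the path before running anything is a check like the one below. It assumes the R working directory is the repository root, which is an assumption about how the scripts resolve file_path rather than something stated in them.

```r
# Build the expected workbook path relative to the repository root.
file_path <- file.path("Dataset", "ONS_Business_Demography", "ons_original.xlsx")

# Report whether the workbook is where the scripts are expected to look for it.
if (file.exists(file_path)) {
  message("Found workbook: ", file_path)
} else {
  message("Workbook not found at ", file_path,
          " (current working directory: ", getwd(), ")")
}
```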

Step 3 — Install required R packages

Run the following in R once:

install.packages(c(
  "tidyverse", "readxl", "readr", "dplyr", "tidyr", "stringr",
  "scales", "corrplot", "tidymodels", "ranger", "rsample", "purrr"
))
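
If some of these packages may already be installed, a small helper can list only the missing ones first. The helper name missing_packages is made up for this example, and the install call is left commented out so that nothing is installed without review.

```r
# Packages named in the install step above.
required <- c(
  "tidyverse", "readxl", "readr", "dplyr", "tidyr", "stringr",
  "scales", "corrplot", "tidymodels", "ranger", "rsample", "purrr"
)

# Hypothetical helper: return the packages that cannot be loaded yet.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

to_install <- missing_packages(required)

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
} else {
  message("All required packages are already installed.")
}
```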

Step 4 — Run scripts in order

Run scripts in this order:

  1. Data extraction scripts (create cleaned CSV files from the Excel workbook)

    • enterprise_births_industry.R
    • enterprise_deaths_industry.R
    • active_enterprises_industry.R
    • active_enterprises_10_employer_industry.R
    • high_growth_enterprises_industry.R
    • enterprise_survival_industry.R
  2. Build the master panel

    • merge_industry_characteristics.R
  3. EDA for RQ1

    • RQ1_EDA.R (outputs saved to Outputs/Visuals/)
  4. Modelling for RQ2

    • RQ2_regression_model.R (outputs saved to Outputs/Tables/ for metrics and predictions, and to Outputs/Visuals/ for plots)
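
The run order above could also be scripted. The sketch below is one possible way to do so, assuming the working directory is the repository root and that each script can be run with source() without further arguments; the function name run_pipeline is hypothetical and no such runner is described in the repository.

```r
# Scripts in the order given above, relative to the Codes/ folder.
scripts <- c(
  "enterprise_births_industry.R",
  "enterprise_deaths_industry.R",
  "active_enterprises_industry.R",
  "active_enterprises_10_employer_industry.R",
  "high_growth_enterprises_industry.R",
  "enterprise_survival_industry.R",
  "merge_industry_characteristics.R",
  "RQ1_EDA.R",
  "RQ2_regression_model.R"
)
script_paths <- file.path("Codes", scripts)

# Hypothetical helper: run each script in turn, stopping at the first error.
run_pipeline <- function(paths) {
  for (path in paths) {
    message("Running ", path)
    source(path)
  }
  invisible(paths)
}

# Only run when all scripts are found, i.e. when started from the repository root.
if (all(file.exists(script_paths))) {
  run_pipeline(script_paths)
} else {
  message("Scripts not found; set the working directory to the repository root first.")
}
```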

Outputs

After running all scripts, you should see:


Notes on Reproducibility