An exploratory data analysis on an Ookla speedtest dataset with applications of descriptive analytics, data transformations, hypothesis testing, and predictive modelling.
AI / Machine Learning
Author
Brandon Toews
Published
January 9, 2024
1 Description
In this section, we elaborate on the steps taken to clean the dataset obtained from ‘ookla_speed_q4_2022.csv’. The dataset, consisting of 20,000 entries and 7 features related to network performance, underwent a comprehensive cleaning process.
1.1 Initial Assessment
Upon loading the dataset using the pandas library in Python, we observed that it contained 20,000 entries with 7 columns. An initial assessment revealed the presence of missing values in the avg_lat_down_ms and avg_lat_up_ms columns. (Fig. 1)
Code
import pandas as pd
import matplotlib.pyplot as plt

# Load the raw dataset
df = pd.read_csv('ookla_speed_q4_2022.csv')

# Create a deep copy of the dataframe for cleaning
cleaning_df = df.copy()

# Not many rows with missing values, as the plot shows
cleaning_df.isna().sum().plot(kind='bar', ylim=(0, cleaning_df.shape[0]))

# Drop rows with missing values
cleaning_df.dropna(inplace=True)
Figure 1. Bar plot of missing values in each column
1.2 Cleaning Steps
Dropping Rows with Missing Values: Rows containing missing values were dropped to ensure the reliability of our subsequent analyses. (Fig. 1)
Column Removal: We removed the unnecessary Unnamed: 0 column, as it served as an unnamed index and did not contribute to the analysis.
Spelling Corrections and Categorization: We addressed spelling errors in the net_type column, changing ‘moblie’ to ‘Mobile’ and capitalizing ‘fixed’. The net_type column was then converted to a categorical data type.
Duplicate Entry Removal: Duplicate entries were identified and subsequently dropped to ensure the uniqueness of our data.
Conversion of Float Columns to Int: We checked the avg_lat_down_ms and avg_lat_up_ms columns for fractional values and converted them to integers; they had been stored as floats only because missing values force a float dtype in pandas. A consolidated sketch of these cleaning steps appears below.
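The remaining steps can be expressed compactly in pandas; the following is a minimal sketch, assuming the cleaning_df copy created above and the column names described in this section.

# Drop the redundant unnamed index column
cleaning_df = cleaning_df.drop(columns=['Unnamed: 0'])

# Fix the spelling errors in 'net_type' and store it as a categorical dtype
cleaning_df['net_type'] = (cleaning_df['net_type']
                           .replace({'moblie': 'Mobile', 'fixed': 'Fixed'})
                           .astype('category'))

# Drop duplicate rows
cleaning_df = cleaning_df.drop_duplicates()

# With the missing values gone, the latency columns can be stored as integers
lat_cols = ['avg_lat_down_ms', 'avg_lat_up_ms']
cleaning_df[lat_cols] = cleaning_df[lat_cols].astype('int64')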
1.3 Column Renaming and Unit Conversion
To enhance clarity, we renamed the columns for average download and upload speeds and converted their values from kilobits per second to megabits per second (dividing by 1,000).
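A sketch of this step; the pre-rename column names avg_d_kbps and avg_u_kbps are assumptions based on the units described above.

# Convert from kilobits to megabits per second (1 Mbps = 1,000 kbps)
cleaning_df['avg_d_kbps'] /= 1000
cleaning_df['avg_u_kbps'] /= 1000

# Rename the speed columns to match the new units
cleaning_df = cleaning_df.rename(columns={'avg_d_kbps': 'avg_d_mbps',
                                          'avg_u_kbps': 'avg_u_mbps'})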
1.4 Resulting Dataset
The resulting cleaned dataset, now saved as ‘cleaned_dataset.parquet’, comprises 19,030 entries and 6 columns, each with non-null values. The net_type column is categorized into ‘Mobile’ and ‘Fixed’. The dataset is now ready for further analysis and modeling.
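Persisting the cleaned frame for the later sections is a single call, assuming a parquet engine such as pyarrow is installed:

# Save the cleaned dataset for the analysis and modelling sections
cleaning_df.to_parquet('cleaned_dataset.parquet')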
2 Comprehensive Data Analysis
In this section, we delve into the exploratory data analysis (EDA) process, aiming to comprehend the underlying distributions, compare fixed and mobile network data, and identify any notable correlations. Recognizing that the initial data exhibited heavily positively skewed distributions, we undertook a series of data transformations to bring the distributions closer to normality. The primary objective was to enhance the suitability of the data for subsequent hypothesis testing.
2.1 Distribution Analysis
The initial step involved examining the distributions of both fixed and mobile network data. Histograms, box plots (Fig. 2), and summary statistics (Figs. 3 & 4) were used to gauge the central tendencies, dispersion, and skewness of the datasets. The distributions were heavily positively skewed, prompting the need for transformation to meet the assumptions of parametric statistical tests.
(a) Box/Hist plots of avg_d_mbps
(b) Box/Hist plots of avg_u_mbps
(c) Box/Hist plots of avg_lat_ms
(d) Box/Hist plots of avg_lat_down_ms
(e) Box/Hist plots of avg_lat_up_ms
Figure 2. Box and histogram plots
Figure 3. Summary statistics of each network type
(a) avg_d_mbps column
(b) avg_u_mbps column
(c) avg_lat_ms column
(d) avg_lat_up_ms column
(e) avg_lat_down_ms column
Figure 4. Skew and kurtosis values for the full dataset and for each network type
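The values reported in Figures 3 and 4 can be reproduced along these lines; a minimal sketch, assuming df is the cleaned dataset loaded from 'cleaned_dataset.parquet'.

import pandas as pd
from scipy import stats

df = pd.read_parquet('cleaned_dataset.parquet')
num_cols = df.select_dtypes('number').columns

# Skew and excess kurtosis for the full dataset and for each network type
for label, frame in [('Everything', df),
                     ('Fixed', df[df['net_type'] == 'Fixed']),
                     ('Mobile', df[df['net_type'] == 'Mobile'])]:
    for col in num_cols:
        print(f'{label} {col}: skew={stats.skew(frame[col]):.5f}, '
              f'kurtosis={stats.kurtosis(frame[col]):.5f}')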
2.2 Comparative Analysis
To assess the disparities between the fixed and mobile networks, we conducted comparative analyses. Kernel density plots (Fig. 5) and summary statistics (Fig. 3) were used to highlight differences in central tendency and spread. These comparisons laid the groundwork for the subsequent transformations and pinpointed where the two networks diverge.
(a) avg_d_mbps for fixed network
(b) avg_d_mbps for mobile network
(c) avg_u_mbps for fixed network
(d) avg_u_mbps for mobile network
(e) avg_lat_ms for fixed network
(f) avg_lat_ms for mobile network
(g) avg_lat_down_ms for fixed network
(h) avg_lat_down_ms for mobile network
(i) avg_lat_up_ms for fixed network
(j) avg_lat_up_ms for mobile network
Figure 5. KDE plots
2.3 Data Transformations
Four data transformations were applied: square-root, logarithmic, Box-Cox, and Yeo-Johnson. Each transformation was chosen based on its appropriateness for the given context and the nature of the initial distributions. Log transformations, for instance, are effective at compressing heavily right-skewed data, while the Box-Cox family is versatile in handling a wide range of skewed distributions. (Lee, S. X. and McLachlan, G. J., 2022) (West, R. M., 2022)
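For reference, the two likelihood-based transformations are defined as follows, with the power parameter $\lambda$ fitted by maximum likelihood; the piecewise definition of Yeo-Johnson is what allows it to handle zero and negative values, unlike Box-Cox:

$$
y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda}-1}{\lambda}, & \lambda \neq 0 \\ \ln y, & \lambda = 0 \end{cases} \qquad (\text{Box-Cox},\ y > 0)
$$

$$
\psi(y;\lambda) = \begin{cases} \dfrac{(y+1)^{\lambda}-1}{\lambda}, & \lambda \neq 0,\ y \ge 0 \\ \ln(y+1), & \lambda = 0,\ y \ge 0 \\ -\dfrac{(1-y)^{2-\lambda}-1}{2-\lambda}, & \lambda \neq 2,\ y < 0 \\ -\ln(1-y), & \lambda = 2,\ y < 0 \end{cases} \qquad (\text{Yeo-Johnson})
$$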
2.4 Comparative Assessment of Transformations
The transformed datasets were then compared against the original data. Visualizations (Fig. 6) and statistical measures, including skewness and kurtosis (Fig. 8), were employed to quantify the improvement brought about by each transformation. The Yeo-Johnson transformation consistently produced distributions closest to normal. (Fig. 7)
Code
# Define a list of colors for the transformations
trans_colors = ['y', 'g', 'c', 'm']

# Define a dictionary that maps transformation names to transformation functions
trans_funcs = {
    'Sqrt Skew': np.sqrt,           # Square root transformation
    'Log Skew': np.log1p,           # Logarithmic transformation
    'Box-Cox Skew': stats.boxcox,   # Box-Cox transformation
    'Yeo-J Skew': stats.yeojohnson  # Yeo-Johnson transformation
}

# Loop over each transformation in the 'trans_funcs' dictionary
for transform in trans_funcs:
    # Loop over each key in the 'dfs' dictionary
    for key in dfs:
        # Get the number of columns in the dataframe corresponding to the current key
        num_cols = len(dfs[key][0].columns)
        # Create a subplot with 'num_cols' rows and 2 columns, and set the figure size
        fig, ax = plt.subplots(num_cols, 2, figsize=(13, 5 * num_cols))
        # Loop over each column in the dataframe corresponding to the current key
        for i, col in enumerate(dfs[key][0]):
            # Apply the transformation to the column
            if transform in ['Box-Cox Skew', 'Yeo-J Skew']:
                # Box-Cox and Yeo-Johnson return the transformed data and the fitted lambda
                target, _ = trans_funcs[transform](dfs[key][0][col])
            else:
                # The other transformations return one value
                target = trans_funcs[transform](dfs[key][0][col])
            # Calculate the skewness of the transformed data
            transformed_skew = np.round(stats.skew(target), 5)
            # Store the skewness in the 'trans_skews' dictionary
            # ('trans_skews' already holds the original skew under 'Org Skew')
            trans_skews[col][key][transform] = transformed_skew
            # Plot a histogram of the original data
            sns.histplot(dfs[key][0][col],
                         label='Original Skew: {0}'.format(trans_skews[col][key]['Org Skew']),
                         color='r', ax=ax[i][0], kde=True, edgecolor=None)
            ax[i][0].legend()
            ax[i][0].set_xlabel('ORIGINAL')
            ax[i][0].set_title(key + ' - ' + col)
            # Plot a histogram of the transformed data
            sns.histplot(target,
                         label='Transformed Skew: {0}'.format(transformed_skew),
                         color=trans_colors[list(trans_funcs).index(transform)],
                         ax=ax[i][1], kde=True, edgecolor=None)
            ax[i][1].legend()
            ax[i][1].set_xlabel(transform + ' TRANSFORMED')
            ax[i][1].set_title(key + ' - ' + col)
        # Adjust the padding between and around the subplots
        fig.tight_layout()
        # Display the figure
        plt.show()
(a) Original vs Sqrt transformed on fixed network
(b) Original vs Sqrt transformed on mobile network
(c) Original vs Log transformed on fixed network
(d) Original vs Log transformed on mobile network
(e) Original vs Box-Cox transformed on fixed network
(f) Original vs Box-Cox transformed on mobile network
(g) Original vs Yeo-Johnson transformed on fixed network
(h) Original vs Yeo-Johnson transformed on mobile network
Figure 6. Comparisons of data transformations on distributions
Figure 8. Comparison of original skew with data transformation skews on both networks
2.5 Correlation Analysis
In addition to distribution improvements, we investigated the impact of transformations on correlation structures within the data. Scatter plots (Figs. 9 & 10) and correlation matrices (Figs. 11 & 12) were employed to evaluate changes in relationships between variables. This step aimed to ensure that the transformations not only enhanced distributions but also preserved or revealed meaningful associations.
Code
# Create a pairplot of the dataframe 'df' coloured by 'net_type'
g = sns.pairplot(df, hue='net_type', corner=True)
# Set the title of the plot and adjust its position
g.fig.suptitle('Everything', y=1.02)
# Display the plot
plt.show()

# Loop over each key in the dictionary of dataframes 'dfs'
for key in dfs:
    # Create a pairplot for each dataframe in the dictionary
    g = sns.pairplot(dfs[key][0], corner=True)
    # Set the title of the plot as the key and adjust its position
    g.fig.suptitle(key, y=1.02)
    # Display the plot
    plt.show()

Figure 9. Pairplots of original data
# Loop over each key in the 'yeoj_dfs' dictionary
for key in yeoj_dfs:
    # Create a pairplot of the dataframe corresponding to the current key
    # 'corner=True' means that only the lower triangle of the plot will be shown
    g = sns.pairplot(yeoj_dfs[key], corner=True)
    # Set the title of the figure, adding a small space above the title
    g.fig.suptitle(key + ': Yeo-J Transformed', y=1.02)
    # Display the figure
    plt.show()
(a) Fixed network
(b) Mobile network
Figure 10. Pairplots on Yeo-Johnson transformed data
# Create a subplot with 1 row and 3 columns, sharing the y-axis, and set the figure size
fig, axes_mat = plt.subplots(1, 3, sharey=True, figsize=(10, 5))

# Create a correlation matrix of all the numerical columns
corr = df.drop(columns='net_type').corr()
# Visualize the correlation matrix with a heatmap
sns.heatmap(corr, cmap='RdBu', vmin=-1, vmax=1, annot=True, square=True,
            ax=axes_mat[0], cbar=False)
# Rotate and align labels on the x-axis
axes_mat[0].set_xticklabels(axes_mat[0].get_xticklabels(), rotation=45,
                            horizontalalignment='right')
axes_mat[0].set_title('Correlation Matrix for Everything')

# Loop over each key in the dictionary 'dfs' and
# the axes in 'axes_mat' starting from the second one
for key, ax in zip(dfs, axes_mat[1:]):
    # Create a correlation matrix of all the numerical columns
    corr = dfs[key][0].corr()
    # Visualize the correlation matrix with a heatmap
    sns.heatmap(corr, cmap='RdBu', vmin=-1, vmax=1, annot=True, square=True,
                ax=ax, cbar=False)
    # Rotate and align labels on the x-axis
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
    ax.set_title('Correlation Matrix for ' + key)

# Create a colorbar for the whole figure
norm = Normalize(vmin=-1, vmax=1)
# Create a ScalarMappable object with the 'RdBu' colormap and the normalization
sm = plt.cm.ScalarMappable(cmap='RdBu', norm=norm)
# Set the array for the ScalarMappable to an empty array
sm.set_array([])
# Add an axes to the figure for the colorbar at [left, bottom, width, height]
cbar_ax = fig.add_axes([0.15, 0.95, 0.7, 0.05])
# Add the colorbar to the figure with the ScalarMappable,
# with horizontal orientation, in the colorbar axes
fig.colorbar(sm, orientation='horizontal', cax=cbar_ax)
# Adjust the padding between and around the subplots
plt.tight_layout()
# Display the figure
plt.show()
Figure 11. Correlation heatmap matrices for both networks together and separately
# Create a subplot with 1 row and 2 columns,
# sharing the y-axis, and set the figure size
fig, axes_mat = plt.subplots(1, 2, sharey=True, figsize=(10, 7))

# Loop over each key in the 'yeoj_dfs' dictionary and the axes in 'axes_mat'
for key, ax in zip(yeoj_dfs, axes_mat):
    # Create a correlation matrix of all the numerical columns
    corr = yeoj_dfs[key].corr()
    # Visualize the correlation matrix with a heatmap
    sns.heatmap(corr, cmap='RdBu', vmin=-1, vmax=1, annot=True, square=True,
                ax=ax, cbar=False)
    # Rotate and align labels on the x-axis
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
    ax.set_title('Correlation Matrix for ' + key)

# Create a colorbar for the whole figure
norm = Normalize(vmin=-1, vmax=1)
sm = plt.cm.ScalarMappable(cmap='RdBu', norm=norm)
sm.set_array([])
# Add an axes to the figure for the colorbar at [left, bottom, width, height]
cbar_ax = fig.add_axes([0.15, 0.95, 0.7, 0.05])
fig.colorbar(sm, orientation='horizontal', cax=cbar_ax)
# Adjust the padding between and around the subplots
plt.tight_layout()
# Display the figure
plt.show()
Figure 12. Correlation heatmap matrices for both networks after Yeo-Johnson transformation
The described EDA and distribution transformations constitute a critical phase in preparing the data for hypothesis testing. The chosen transformations were justified through a systematic exploration of initial distributions, comparative analyses, and a thorough assessment of the impact on correlations. The Yeo-Johnson transformation demonstrated a remarkable ability to normalize skewed data, effectively mitigating the positive skewness observed in the initial distributions. This methodical approach ensures that subsequent analyses are conducted on data that aligns more closely with parametric assumptions, enhancing the robustness and reliability of the findings.
3 Hypothesis Definition and Testing
This section explores the variability and average download speed differences between fixed and mobile networks. Our goal is to determine if the standard deviation of avg_d_mbps varies significantly between the networks, providing insights into their consistency, and to establish whether one network has significantly higher average download speeds.
Hypothesis Definition and Testing Section Source Code
We employed a comprehensive set of statistical tests, considering the positively skewed nature of the original avg_d_mbps dataset.
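Stated formally (the notation is introduced here for clarity), the two questions correspond to the hypothesis pairs:

$$
H_0\!: \sigma^2_{\text{fixed}} = \sigma^2_{\text{mobile}} \quad \text{vs.} \quad H_1\!: \sigma^2_{\text{fixed}} \neq \sigma^2_{\text{mobile}}
$$

$$
H_0\!: \mu_{\text{fixed}} = \mu_{\text{mobile}} \quad \text{vs.} \quad H_1\!: \mu_{\text{fixed}} > \mu_{\text{mobile}}
$$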
3.1 Test Selection
3.1.1 Levene’s Test: Untransformed Data
Levene’s test was conducted on the untransformed avg_d_mbps data to assess whether the standard deviation of download speeds differs significantly between fixed and mobile networks.
Decision Justification: Levene’s test is robust for assessing equality of variances, and its non-parametric nature aligns well with the skewed distribution of the original data. (Yuhang Zhou, Yiyang Zhu and Weng Kee Wong, 2023) (Hosken, D. J., Buss, D. L. and Hodgson, D. J., 2018)
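A minimal sketch of this test, assuming df is the cleaned dataframe; the names fixed_d and mobile_d are introduced here for illustration.

from scipy import stats

fixed_d = df.loc[df['net_type'] == 'Fixed', 'avg_d_mbps']
mobile_d = df.loc[df['net_type'] == 'Mobile', 'avg_d_mbps']

# Levene's test with the median center (the Brown-Forsythe variant),
# the robust choice for skewed data
stat, p = stats.levene(fixed_d, mobile_d, center='median')
print(f'Levene statistic: {stat:.2f}, p-value: {p}')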
3.1.2 F-Test: Yeo-Johnson Transformed Data
An F-test was performed on Yeo-Johnson transformed data to compare variances between fixed and mobile networks after addressing the skewness.
Decision Justification: The F-test is suitable for comparing variances but assumes normality; applying it to the transformed data accounts for the skewness of the original distributions and allows a robust comparison.
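SciPy has no single-call variance F-test, so one way to sketch it is directly from the F distribution; fixed_t and mobile_t are assumed here to hold the Yeo-Johnson transformed download speeds for each network.

import numpy as np
from scipy import stats

# F statistic: ratio of sample variances
f_stat = np.var(fixed_t, ddof=1) / np.var(mobile_t, ddof=1)
dfn, dfd = len(fixed_t) - 1, len(mobile_t) - 1

# Two-sided p-value from the F distribution
p = 2 * min(stats.f.sf(f_stat, dfn, dfd), stats.f.cdf(f_stat, dfn, dfd))
print(f'F statistic: {f_stat:.2f}, p-value: {p}')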
3.1.3 T-Tests: Untransformed and Yeo-Johnson Transformed Data
Independent sample t-tests were conducted on both untransformed and transformed avg_d_mbps data to assess whether one network has significantly higher average download speeds than the other.
Decision Justification: T-tests are appropriate for comparing means, and conducting them on both datasets ensures a comprehensive evaluation of average download speeds.
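A sketch of the test on the untransformed data, reusing fixed_d and mobile_d from the Levene sketch; Welch's variant (equal_var=False) is the safer choice given the unequal variances found above, and the alternative parameter requires SciPy 1.6 or later.

from scipy import stats

# One-sided Welch's t-test: is the fixed-network mean higher?
t_stat, p = stats.ttest_ind(fixed_d, mobile_d, equal_var=False,
                            alternative='greater')
print(f't statistic: {t_stat:.2f}, p-value: {p}')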
3.1.4 Mann-Whitney U Test: Untransformed Data
A non-parametric Mann-Whitney U test was performed on the untransformed data to corroborate findings from the t-tests and provide additional robustness.
Decision Justification: The non-parametric nature of the Mann-Whitney U test suits skewed data, offering an alternative perspective on average download speed differences. (Mori, M. et al., 2024) (María Teresa Politi, Juliana Carvalho Ferreira and Cecilia María Patino, 2021)
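A sketch of the test, again reusing fixed_d and mobile_d:

from scipy import stats

# One-sided Mann-Whitney U: do fixed-network speeds tend to exceed mobile speeds?
u_stat, p = stats.mannwhitneyu(fixed_d, mobile_d, alternative='greater')
print(f'U statistic: {u_stat}, p-value: {p}')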
3.2 Results and Interpretation
3.2.1 Levene’s Test: Untransformed Data
F statistic: 1046.03, p-value: 0.0 (here and below, a p-value of 0.0 indicates a value below floating-point precision)
Conclusion: The standard deviation of avg_d_mbps significantly differs between fixed and mobile networks.
3.2.2 F-Test: Yeo-Johnson Transformed Data
F statistic: 6.07, p-value: 0.0
Conclusion: The F-test on transformed data reinforces the conclusion that the standard deviation of avg_d_mbps varies significantly between networks, with the fixed network exhibiting the greater variance.
3.2.3 T-Tests: Untransformed and Transformed Data
3.2.3.1 Untransformed Data:
t statistic: 40.16, p-value: 0.0
Conclusion: The fixed network has significantly higher average download speeds than the mobile network.
3.2.3.2 Yeo-Johnson Transformed Data:
t statistic: 120.57, p-value: 0.0
Conclusion: The t-test on transformed data supports the initial conclusion that the fixed network has higher average download speeds than the mobile network.
3.2.4 Mann-Whitney U Test: Untransformed Data
U statistic: 63199341.5, p-value: 0.0
Conclusion: The Mann-Whitney U test aligns with the t-test results, indicating that download speeds on the fixed network tend to be significantly higher than on the mobile network.
3.3 Summary
Our multifaceted analysis, incorporating Levene’s test, an F-test, t-tests on both original and transformed data, and the Mann-Whitney U test, consistently indicates that the fixed network delivers significantly higher average download speeds than the mobile network. That superior average comes with a higher standard deviation, however: download speeds on the fixed network vary more widely, so while it is faster on average, it is also less consistent than the mobile network. Together, these tests give a nuanced picture of network performance, capturing both its strengths and its variability.
4 Predictive Modelling
4.1 Regression Models for Download Speed
4.1.1 Linear Regression
Univariate and multivariate linear regression models were employed to predict average download speed (avg_d_mbps). One set of models was trained on the original data and another on Yeo-Johnson transformed data. The transformed data yielded a marginal improvement in performance, suggesting that addressing skewness contributed to better predictions (Pan, P., Li, R. and Zhang, Y., 2023). Mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and R-squared (R2) were used to evaluate model performance. (Figs. 13 & 14) (Subasi, A. et al., 2020)
Figure 13. Comparison of uni-variate linear regression models trained on original and transformed data
Figure 14. Comparison of multivariate linear regression models trained on original and transformed data
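The workflow behind these comparisons can be sketched as follows; the feature selection and split parameters here are illustrative rather than the report's exact configuration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Predict avg_d_mbps from the remaining numeric columns (multivariate case)
X = df.drop(columns=['net_type', 'avg_d_mbps'])
y = df['avg_d_mbps']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# The four evaluation metrics used throughout this section
mse = mean_squared_error(y_test, y_pred)
print('MAE :', mean_absolute_error(y_test, y_pred))
print('MSE :', mse)
print('RMSE:', np.sqrt(mse))
print('R2  :', r2_score(y_test, y_pred))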
4.1.2 Gradient Boosting Regression
A multivariate Gradient Boosting Regressor was employed as a more sophisticated regression model (Subasi, A. et al., 2020). Trained on the original data and evaluated with the same metrics (Fig. 15), it outperformed both linear regression models, achieving an R2 of 0.54.
Figure 15. Gradient boosting results
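A minimal sketch of this model, reusing the train/test split from the regression sketch above and default hyperparameters (the report's exact settings are not shown):

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Gradient Boosting on the same train/test split as the linear models
gbr = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
print('R2:', r2_score(y_test, gbr.predict(X_test)))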
4.2 Classification Models for Network Type
4.2.1 Support Vector Machine (SVM)
An SVM classification model was trained on both the original and the Yeo-Johnson transformed data to predict the network type (Fixed or Mobile). Again, the model trained on transformed data performed better, achieving an accuracy of approximately 87%. The confusion matrices (Figs. 16 & 17) and classification reports provided precision, recall, and F1-score for each class.
Code
# Support Vector Machine model
model = svm.SVC()
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)

# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
plt.show()

# Evaluate the model
print(classification_report(y_test, y_pred))
# Support Vector Machine model with Yeo-Johnson transform
# Create a pipeline for the transformed data
pipe_trans = Pipeline([
    # Apply Yeo-Johnson transformation
    ('power_transform', PowerTransformer(method='yeo-johnson')),
    # Create a SVM model
    ('model', svm.SVC())
])
# Train the model
pipe_trans.fit(X_train, y_train)
# Make predictions
y_pred = pipe_trans.predict(X_test)

# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=pipe_trans.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=pipe_trans.classes_)
disp.plot()
plt.show()

# Evaluate the model
print(classification_report(y_test, y_pred))
4.2.2 Random Forest Classifier
A Random Forest Classifier was also employed for classification, achieving an accuracy of approximately 87%. (Figs. 18 & 19) A grid search was conducted to fine-tune its hyperparameters, yielding optimal values for max_depth, max_leaf_nodes, min_samples_leaf, and min_samples_split. (Behera, G. and Nain, N., 2022)
Code
# Random Forest Classifier model
# Create a pipeline for the model
pipe_trans = Pipeline([
    # Apply Random Forest
    ('model', RandomForestClassifier(n_estimators=300))
])
# Train the model
pipe_trans.fit(X_train, y_train)
# Make predictions
y_pred = pipe_trans.predict(X_test)

# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=pipe_trans.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=pipe_trans.classes_)
disp.plot()
plt.show()

# Evaluate the model
print(classification_report(y_test, y_pred))
# Perform grid search to find more optimal hyperparameters for
# the Random Forest Classifier model
# Define the parameter grid
param_grid = {
    'model__max_depth': [None, 5, 10, 15],
    'model__max_leaf_nodes': [None, 5, 10, 15],
    'model__min_samples_leaf': [1, 2, 4],
    'model__min_samples_split': [2, 5, 10]
}
# Create a GridSearchCV object
grid_search = GridSearchCV(pipe_trans, param_grid, cv=2, scoring='accuracy',
                           verbose=1, n_jobs=-1)
# Fit the GridSearchCV object to the data
grid_search.fit(X_train, y_train)
# Print the best parameters and the corresponding cross-validated accuracy
print(grid_search.best_params_)
print(grid_search.best_score_)
Fitting 2 folds for each of 144 candidates, totalling 288 fits
{'model__max_depth': None, 'model__max_leaf_nodes': None, 'model__min_samples_leaf': 1, 'model__min_samples_split': 5}
0.8707304256437205
4.3 Summary
The choice of model depended on the nature of the prediction task. Gradient Boosting Regression demonstrated the strongest performance in predicting average download speed, while the Random Forest Classifier excelled at predicting network type. Employing the Yeo-Johnson transformation in regression was justified by the slight improvement in predictive accuracy it produced (Pan, P., Li, R. and Zhang, Y., 2023). Both the SVM and the Random Forest Classifier provided competitive classification results, with the latter edging out the SVM.
References
Behera, G. and Nain, N. (2022) ‘GSO-CRS: Grid Search Optimization for Collaborative Recommendation System’, Sādhanā, 47(3). doi: 10.1007/s12046-022-01924-0.
Hosken, D. J., Buss, D. L. and Hodgson, D. J. (2018) ‘Beware the F Test (or, How to Compare Variances)’, Animal Behaviour, 136, pp. 119–126.
Lee, S. X. and McLachlan, G. J. (2022) ‘An Overview of Skew Distributions in Model-Based Clustering’, Journal of Multivariate Analysis, 188. doi: 10.1016/j.jmva.2021.104853.
María Teresa Politi, Juliana Carvalho Ferreira and Cecilia María Patino (2021) ‘Nonparametric Statistical Tests: Friend or Foe?’, Jornal Brasileiro de Pneumologia, 47(4). doi: 10.36416/1806-3756/e20210292.
Mori, M. et al. (2024) ‘An Analytical Investigation of Body Parts More Susceptible to Aging and Composition Changes Using Statistical Hypothesis Testing’, Healthcare Analytics, 5. doi: 10.1016/j.health.2023.100284.
Pan, P., Li, R. and Zhang, Y. (2023) ‘Predicting Punching Shear in RC Interior Flat Slabs with Steel and FRP Reinforcements Using Box-Cox and Yeo-Johnson Transformations’, Case Studies in Construction Materials, 19. doi: 10.1016/j.cscm.2023.e02409.
Subasi, A. et al. (2020) ‘Permeability Prediction of Petroleum Reservoirs Using Stochastic Gradient Boosting Regression’, Journal of Ambient Intelligence and Humanized Computing, 13(7), pp. 3555–3564. doi: 10.1007/s12652-020-01986-0.
West, R. M. (2022) ‘Best Practice in Statistics: The Use of Log Transformation’, Annals of Clinical Biochemistry, 59(3), pp. 162–165. doi: 10.1177/00045632211050531.
Yuhang Zhou, Yiyang Zhu and Weng Kee Wong (2023) ‘Statistical Tests for Homogeneity of Variance for Clinical Trials and Recommendations’, Contemporary Clinical Trials Communications, 33, p. 101119. doi: 10.1016/j.conctc.2023.101119.