Description.ipynb

Description

Exploratory analysis of the dataset

In [55]:
# import libraries 
import pandas as pd

# import dataset into a dataframe
df = pd.read_csv('Dataset.csv')
print(df.info())
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       20000 non-null  int64  
 1   avg_d_kbps       20000 non-null  int64  
 2   avg_u_kbps       20000 non-null  int64  
 3   avg_lat_ms       20000 non-null  int64  
 4   avg_lat_down_ms  19097 non-null  float64
 5   avg_lat_up_ms    19127 non-null  float64
 6   net_type         20000 non-null  object 
dtypes: float64(2), int64(4), object(1)
memory usage: 1.1+ MB
None
Unnamed: 0 avg_d_kbps avg_u_kbps avg_lat_ms avg_lat_down_ms avg_lat_up_ms net_type
0 1 9711 1489 21 208.0 NaN moblie
1 2 50073 18199 40 475.0 1954.0 moblie
2 3 21784 745 47 1493.0 2252.0 moblie
3 4 18159 1662 21 244.0 2067.0 moblie
4 5 1439 659 749 2357.0 5083.0 moblie

There are many rows of data, so we can drop every row that contains NaN entries and still have enough data to train our prediction models and test our hypotheses.

In [56]:
# Create a deep copy of the dataframe for cleaning
cleaning_df = df.copy()

# Not many rows with missing values as shown in plot
cleaning_df.isna().sum().plot(kind='bar', ylim=(0, cleaning_df.shape[0]))

# Drop rows with missing values
cleaning_df.dropna(inplace=True)

Figure 1: Bar plot of missing values in each column
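The same information can be checked numerically: counting rows with at least one missing value shows exactly how many rows `dropna()` will remove. A minimal sketch on toy data (the values are illustrative, not from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the dataset (illustrative values only)
df = pd.DataFrame({
    'avg_lat_down_ms': [208.0, np.nan, 1493.0, 244.0],
    'avg_lat_up_ms': [np.nan, 1954.0, 2252.0, 2067.0],
})

# Rows with at least one NaN -- exactly the rows dropna() removes
n_incomplete = df.isna().any(axis=1).sum()
print(n_incomplete)        # 2
print(len(df.dropna()))    # 2 rows survive
```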

Column ‘Unnamed: 0’ is not needed, so we will drop it from the dataset. We will also correct the spelling of ‘moblie’ to ‘Mobile’, capitalize ‘fixed’ as ‘Fixed’ in the ‘net_type’ column, and convert that column to the category dtype.

In [57]:
# drop unnecessary index column
cleaning_df.drop(columns='Unnamed: 0', inplace=True)

# Correct spelling mistake and capitalize
cleaning_df.replace('moblie', 'Mobile', inplace=True)
cleaning_df.replace('fixed', 'Fixed', inplace=True)

# change column to category type
cleaning_df['net_type'] = cleaning_df['net_type'].astype('category')
cleaning_df['net_type'].unique()
['Mobile', 'Fixed']
Categories (2, object): ['Fixed', 'Mobile']
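A note on the replace step: calling `replace` on the whole dataframe would also rewrite matching strings in any other object column, so scoping it to the ‘net_type’ column with a single mapping is slightly safer. A sketch on a stand-in Series:

```python
import pandas as pd

s = pd.Series(['moblie', 'fixed', 'moblie'])  # stand-in for the net_type column

# One replace call with a mapping handles both corrections at once
s = s.replace({'moblie': 'Mobile', 'fixed': 'Fixed'}).astype('category')

print(s.cat.categories.tolist())  # ['Fixed', 'Mobile']
```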

Drop any duplicate entries in the dataframe

In [58]:
# Drop any duplicate entries in the dataframe
cleaning_df.drop_duplicates(inplace=True)
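It can be worth counting duplicates before dropping them, to see how much the step actually removes. A sketch on toy data (illustrative values only):

```python
import pandas as pd

# Toy frame with one exact duplicate row (illustrative values only)
df = pd.DataFrame({'avg_d_kbps': [9711, 9711, 50073],
                   'avg_u_kbps': [1489, 1489, 18199]})

print(df.duplicated().sum())   # 1 duplicate row
deduped = df.drop_duplicates()
print(len(deduped))            # 2 rows remain
```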

Check the two float64 columns to see if they can be changed to int64. If they contain only whole-number values, convert them to int64.

In [59]:
# For each float64 latency column, cast to int64 only if every value
# is a whole number (e.g. 475.0), so no information is lost
for col in ['avg_lat_down_ms', 'avg_lat_up_ms']:
    # float.is_integer() is True for values like 475.0
    if all(value.is_integer() for value in cleaning_df[col]):
        cleaning_df[col] = cleaning_df[col].astype('int64')

cleaning_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 19030 entries, 1 to 19999
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   avg_d_kbps       19030 non-null  int64   
 1   avg_u_kbps       19030 non-null  int64   
 2   avg_lat_ms       19030 non-null  int64   
 3   avg_lat_down_ms  19030 non-null  int64   
 4   avg_lat_up_ms    19030 non-null  int64   
 5   net_type         19030 non-null  category
dtypes: category(1), int64(5)
memory usage: 910.7 KB
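The per-value loop above can also be expressed as a single vectorized check. A sketch on a stand-in Series (note the modulo test only works after NaNs have been dropped, since `NaN % 1` is NaN):

```python
import pandas as pd

s = pd.Series([475.0, 1493.0, 244.0])  # float64 stand-in for a latency column

# Every value is a whole number exactly when value % 1 == 0 holds for all
if (s % 1 == 0).all():
    s = s.astype('int64')

print(s.dtype)  # int64
```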
In [60]:
cleaning_df
avg_d_kbps avg_u_kbps avg_lat_ms avg_lat_down_ms avg_lat_up_ms net_type
1 50073 18199 40 475 1954 Mobile
2 21784 745 47 1493 2252 Mobile
3 18159 1662 21 244 2067 Mobile
4 1439 659 749 2357 5083 Mobile
5 13498 3525 37 598 1023 Mobile
... ... ... ... ... ... ...
19995 215644 114035 14 384 606 Fixed
19996 48533 17553 34 172 43 Fixed
19997 5732 473 52 8039 304 Fixed
19998 116025 129465 8 91 219 Fixed
19999 145911 42130 15 139 555 Fixed

19030 rows × 6 columns

All columns now use consistent units of measurement and the correct data types, spelling errors have been corrected, and all rows containing missing values have been dropped. The unnamed index column has also been dropped as it is not needed. Finally, we convert the download and upload speeds from kbps to Mbps for readability and save the cleaned dataset for use in Analysis.ipynb.

In [62]:
# rename columns to the unit they will be converted into
cleaning_df.rename(columns={'avg_d_kbps' : 'avg_d_mbps', 'avg_u_kbps' : 'avg_u_mbps'}, inplace=True)

# convert kbps to Mbps for readability
cleaning_df['avg_d_mbps'] = cleaning_df['avg_d_mbps']/1000
cleaning_df['avg_u_mbps'] = cleaning_df['avg_u_mbps']/1000

cleaning_df
avg_d_mbps avg_u_mbps avg_lat_ms avg_lat_down_ms avg_lat_up_ms net_type
1 50.073 18.199 40 475 1954 Mobile
2 21.784 0.745 47 1493 2252 Mobile
3 18.159 1.662 21 244 2067 Mobile
4 1.439 0.659 749 2357 5083 Mobile
5 13.498 3.525 37 598 1023 Mobile
... ... ... ... ... ... ...
19995 215.644 114.035 14 384 606 Fixed
19996 48.533 17.553 34 172 43 Fixed
19997 5.732 0.473 52 8039 304 Fixed
19998 116.025 129.465 8 91 219 Fixed
19999 145.911 42.130 15 139 555 Fixed

19030 rows × 6 columns

In [63]:
#save cleaned dataset to a parquet file to preserve the dtype info in the dataframe
cleaning_df.to_parquet('cleaned_dataset.parquet', index=False)