SMS Spam Classifcation - Machine Learning

SMS Spam Classifcation - Machine Learning

The SMS Spam Collection is a set of SMS tagged messages that have been collected for SMS Spam research. It contains one set of SMS messages in English of 5,574 messages, tagged acording being ham (legitimate) or spam.

Data

The files contain one message per line. Each line is composed by two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

Library Used

  • pandas
  • numpy
  • nltk
  • sklearn

Approach

  • Loading Data
  • Input and Output Data
  • Applying Regular Expression
  • Each word to lower case
  • Splitting words to Tokenize
  • Stemming with PorterStemmer handling Stop Words
  • Preparing Messages with Remaining Tokens
  • Preparing WordVector Corpus
  • Applying Classification

Importing Libraries

import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re

Loading Data

df = pd.read_csv('spam.csv', encoding='latin-1')
df = df.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1)
df.head(3)
v1 v2
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...

Understanding Data

# Replace ham with 0 and spam with 1

df = df.replace(['ham','spam'],[0, 1]) 
df.head(3)
v1 v2
0 0 Go until jurong point, crazy.. Available only ...
1 0 Ok lar... Joking wif u oni...
2 1 Free entry in 2 a wkly comp to win FA Cup fina...
# Total ham(0) and spam(1) messages
df['v1'].value_counts()
0    4825
1     747
Name: v1, dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
v1    5572 non-null int64
v2    5572 non-null object
dtypes: int64(1), object(1)
memory usage: 87.1+ KB

Count the number of words in each Text

df['Count']=0
for i in np.arange(0,len(df.v2)):
    df.loc[i,'Count'] = len(df.loc[i,'v2'])
    
df.head(3)
v1 v2 Count
0 0 Go until jurong point, crazy.. Available only ... 111
1 0 Ok lar... Joking wif u oni... 29
2 1 Free entry in 2 a wkly comp to win FA Cup fina... 155
corpus = []
ps = PorterStemmer()
# Original Messages

print (df['v2'][0])
print (df['v2'][1])
Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
Ok lar... Joking wif u oni...

Processing Messages

for i in range(0, 5572):

    # Applying Regular Expression
    
    '''
    Replace email addresses with 'emailaddr'
    Replace URLs with 'httpaddr'
    Replace money symbols with 'moneysymb'
    Replace phone numbers with 'phonenumbr'
    Replace numbers with 'numbr'
    '''
    msg = df['v2'][i]
    msg = re.sub('\b[\w\-.]+?@\w+?\.\w{2,4}\b', 'emailaddr', df['v2'][i])
    msg = re.sub('(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)', 'httpaddr', df['v2'][i])
    msg = re.sub('£|\$', 'moneysymb', df['v2'][i])
    msg = re.sub('\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b', 'phonenumbr', df['v2'][i])
    msg = re.sub('\d+(\.\d+)?', 'numbr', df['v2'][i])
    
    ''' Remove all punctuations '''
    msg = re.sub('[^\w\d\s]', ' ', df['v2'][i])
    
    if i<2:
        print("\t\t\t\t MESSAGE ", i)
    
    if i<2:
        print("\n After Regular Expression - Message ", i, " : ", msg)
    
    # Each word to lower case
    msg = msg.lower()    
    if i<2:
        print("\n Lower case Message ", i, " : ", msg)
    
    # Splitting words to Tokenize
    msg = msg.split()    
    if i<2:
        print("\n After Splitting - Message ", i, " : ", msg)
    
    # Stemming with PorterStemmer handling Stop Words
    msg = [ps.stem(word) for word in msg if not word in set(stopwords.words('english'))]
    if i<2:
        print("\n After Stemming - Message ", i, " : ", msg)
    
    # preparing Messages with Remaining Tokens
    msg = ' '.join(msg)
    if i<2:
        print("\n Final Prepared - Message ", i, " : ", msg, "\n\n")
    
    # Preparing WordVector Corpus
    corpus.append(msg)
				 MESSAGE  0

 After Regular Expression - Message  0  :  Go until jurong point  crazy   Available only in bugis n great world la e buffet    Cine there got amore wat   

 Lower case Message  0  :  go until jurong point  crazy   available only in bugis n great world la e buffet    cine there got amore wat   

 After Splitting - Message  0  :  ['go', 'until', 'jurong', 'point', 'crazy', 'available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'there', 'got', 'amore', 'wat']

 After Stemming - Message  0  :  ['go', 'jurong', 'point', 'crazi', 'avail', 'bugi', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'got', 'amor', 'wat']

 Final Prepared - Message  0  :  go jurong point crazi avail bugi n great world la e buffet cine got amor wat 


				 MESSAGE  1

 After Regular Expression - Message  1  :  Ok lar    Joking wif u oni   

 Lower case Message  1  :  ok lar    joking wif u oni   

 After Splitting - Message  1  :  ['ok', 'lar', 'joking', 'wif', 'u', 'oni']

 After Stemming - Message  1  :  ['ok', 'lar', 'joke', 'wif', 'u', 'oni']

 Final Prepared - Message  1  :  ok lar joke wif u oni 
cv = CountVectorizer()
x = cv.fit_transform(corpus).toarray()

Applying Classification

  • Input : Prepared Sparse Matrix
  • Ouput : Labels (Spam or Ham)
y = df['v1']
print (y.value_counts())
0    4825
1     747
Name: v1, dtype: int64

Encoding Labels

le = LabelEncoder()
y = le.fit_transform(y)

Splitting to Training and Testing DATA

xtrain, xtest, ytrain, ytest = train_test_split(x, y,test_size= 0.20, random_state = 0)

Applying Guassian Naive Bayes

bayes_classifier = GaussianNB()
bayes_classifier.fit(xtrain, ytrain)
GaussianNB(priors=None, var_smoothing=1e-09)
# Predicting
y_pred = bayes_classifier.predict(xtest)

Results

# Evaluating
cm = confusion_matrix(ytest, y_pred)
cm
array([[824, 125],
       [ 19, 147]], dtype=int64)
print ("Accuracy : %0.5f \n\n" % accuracy_score(ytest, bayes_classifier.predict(xtest)))
print (classification_report(ytest, bayes_classifier.predict(xtest)))
Accuracy : 0.87085 


              precision    recall  f1-score   support

           0       0.98      0.87      0.92       949
           1       0.54      0.89      0.67       166

    accuracy                           0.87      1115
   macro avg       0.76      0.88      0.80      1115
weighted avg       0.91      0.87      0.88      1115

Applying Decision Tree

dt = DecisionTreeClassifier(random_state=50)
dt.fit(xtrain, ytrain)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=50, splitter='best')
# Predicting
y_pred_dt = dt.predict(xtest)

Results

# Evaluating
cm = confusion_matrix(ytest, y_pred_dt)

print(cm)
[[943   6]
 [ 29 137]]
print ("Accuracy : %0.5f \n\n" % accuracy_score(ytest, dt.predict(xtest)))
print (classification_report(ytest, dt.predict(xtest)))
Accuracy : 0.96861 


              precision    recall  f1-score   support

           0       0.97      0.99      0.98       949
           1       0.96      0.83      0.89       166

    accuracy                           0.97      1115
   macro avg       0.96      0.91      0.93      1115
weighted avg       0.97      0.97      0.97      1115

Final Accuracy

  • Decision Tree : 96.861%
  • Guassian NB : 87.085%