Fake News detection using machine learning with flask web application

The advent of the World Wide Web and the rapid adoption of social media platforms (such as Facebook and Twitter) paved the way for information dissemination on a scale never before witnessed in human history. Thanks to social media platforms, consumers are creating and sharing more information than ever before, and some of it is inaccurate and has no bearing on reality. Classifying a written article as misleading or as disinformation with an algorithm is difficult.

Project Folder

Note: You can download the complete project by pressing the download now button at the end of this blog.

Model

In this blog, we learn how to make a Flask web application that classifies text using machine learning approaches. The model decides whether a text is real or fake, in other words a binary decision much like spam or not spam.

Screenshots: Interface 1, Interface 2, Interface 3

The project is divided into two steps:

  1. Train machine learning model
  2. Deploy the model using Flask APP

1. Train machine learning model

To train a machine learning model, we first need to understand the problem we are solving: is it classification or regression? Then we pick a suitable model and train it. In this blog, we train several ML models (SVM, Naive Bayes, and logistic regression) for fake news text classification.

We build a text classifier that decides whether a given article is fake news or real news. Using Natural Language Processing techniques in Python and classification theory, we reached an accuracy of 0.945455 for classifying news as fake.

Prerequisite: make sure you have already installed Python, Flask, nltk, sklearn and all other necessary libraries.

Train Classification problem with Fake and Real news

## This file has all imports and helper functions used throughout the notebook
%run python_helper.py
%matplotlib inline 
python_helper.py
######################################################
#################### IMPORTS
######################################################
import os
import re
import string
import warnings

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from matplotlib.pyplot import *  # provides the bare show() used in the notebook
import seaborn as sns
from termcolor import colored

import nltk
from nltk import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer

# NLTK resources used by preprocess() and pos_tag_words()
# (resource names can differ between NLTK releases)
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor

warnings.filterwarnings('ignore')
######################################################
#################### Globals
######################################################

seed = 12345
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=seed)
encoder = preprocessing.LabelEncoder()

######################################################
#################### Helper Functions
######################################################
def get_wordnet_pos(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def preprocess(text):

    # lowercase the text
    text = text.lower()
    # remove the words counting just one letter
    text = [t for t in text.split(" ") if len(t) > 1]

    # remove the words that contain numbers
    text = [word for word in text if not any(c.isdigit() for c in word)]
    # tokenize the text and remove punctuation
    text = [word.strip(string.punctuation) for word in text]
    # remove all stop words
    stop = stopwords.words('english')
    text = [x for x in text if x not in stop]
    # remove tokens that are empty
    text = [t for t in text if len(t) > 0]
    # pos tag the text
    pos_tags = pos_tag(text)
    # lemmatize the text
    text = [WordNetLemmatizer().lemmatize(t[0], get_wordnet_pos(t[1])) for t in pos_tags]

    # join all
    text = " ".join(text)
    return (text)

def split_train_holdout_test(encoder, df, verbose=True):
  # Resplit original train and test (copy so we do not write to a slice of df)
  train = df[df["label"] != "None"].copy()
  test = df[df["label"] == "None"].copy()

  # Encode Target
  train["encoded_label"] = encoder.fit_transform(train.label.values)

  # Take holdout from train
  train_cv, train_holdout, train_cv_label, train_holdout_label = train_test_split(train, train.encoded_label, test_size=0.33, random_state=seed)

  if(verbose):
    print("\nTrain dataset (Full)")
    print(train.shape)
    print("Train dataset cols")
    print(list(train.columns))

    print("\nTrain CV dataset (subset)")
    print(train_cv.shape)
    print("Train Holdout dataset (subset)")
    print(train_holdout.shape)

    print("\nTest dataset")
    print(test.shape)
    print("Test dataset cols")
    print(list(test.columns))

  return encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label

def runModel(encoder, train_vector, train_label, holdout_vector, holdout_label, type, name):
  global cv
  global seed

  ## Classifier types
  if (type == "svc"):
    classifier = SVC()
    grid = [
      {'C': [1, 10, 50, 100], 'kernel': ['linear']},
      {'C': [10, 100, 500, 1000], 'gamma': [0.0001], 'kernel': ['rbf']},
    ]
  if (type == "nb"):
    classifier = MultinomialNB()
    grid = {}
  if (type == "maxEnt"):
    # liblinear supports both the l1 and l2 penalties searched below
    classifier = LogisticRegression(solver='liblinear')
    grid = {'penalty': ['l1','l2'], 'C': [0.001,0.01,0.1,1,10,100,1000]}

  # Model
  print(colored(name, 'red'))

  model = GridSearchCV(estimator=classifier, cv=cv,  param_grid=grid)
  print(colored(model.fit(train_vector, train_label), "yellow"))

  # Score
  print(colored("\nCV-scores", 'blue'))
  means = model.cv_results_['mean_test_score']
  stds = model.cv_results_['std_test_score']
  for mean, std, params in sorted(zip(means, stds, model.cv_results_['params']), key=lambda x: -x[0]):
      print("Accuracy: %0.3f (+/-%0.03f) for params: %r" % (mean, std * 2, params))
  print()


  print(colored("\nBest Estimator Params", 'blue'))
  print(colored(model.best_estimator_, "yellow"))

  # Predictions
  print(colored("\nPredictions:", 'blue'))
  model_train_pred = encoder.inverse_transform( model.predict(holdout_vector) )
  print(model_train_pred)

  # Confusion Matrix
  cm = confusion_matrix(holdout_label, model_train_pred)

  # Transform to df for easier plotting
  cm_df = pd.DataFrame(cm,
                      index = ['FAKE','REAL'],
                      columns = ['FAKE','REAL'])


  plt.figure(figsize=(5.5,4))
  sns.heatmap(cm_df, annot=True, fmt='g')
  plt.ylabel('True label')
  plt.xlabel('Predicted label')
  plt.show()

  # Accuracy
  acc = accuracy_score(holdout_label, model_train_pred)
  print(colored("\nAccuracy:", 'blue'))
  print(colored(acc, 'green'))
  return [name, model, acc]

def pos_tag_words(text):
    pos_text = nltk.pos_tag(nltk.word_tokenize(text))
    return " ".join([pos + "-" + word for word, pos in pos_text])

Clean & Save Data

Inspecting the data files, we noticed several issues that prevent the training dataset from being parsed correctly. Using regular expressions, we convert all commas between quotation marks to a pipe, so that CSV parsing puts every value in its correct column.

# Read the raw file and close it again once the text is in memory
with open("fake_or_real_news_training.csv", encoding='utf-8') as input_file:
    raw_text = input_file.read()

# Remove all new lines
noNewLines = re.sub("\n", "", raw_text)
# re-add new line at end of each row
noNewLines = re.sub("X1,X2", "X1,X2\n", noNewLines)
noNewLines = re.sub(",FAKE[,]+", ",FAKE,,\n", noNewLines)
# noNewLines = re.sub(",FAKE,(?!,)",",FAKE,,\n",noNewLines)
# noNewLines = re.sub(",FAKE,,(?!,)",",FAKE,,\n",noNewLines)
  
noNewLines = re.sub(",REAL[,]+", ",REAL,,\n", noNewLines)
# noNewLines = re.sub(",REAL,(?!,)",",REAL,,\n",noNewLines)
# noNewLines = re.sub(",REAL,,(?!,)",",REAL,,\n",noNewLines)
# Replace any commas between two quotes with |
lines = noNewLines.split('\n')
def removeComma(g):
      t = g.groups()
      t = [t[0], t[1].replace(',', ' |'), t[2], t[3]]
      return "".join(t)
betweenQuotes = lambda line: re.sub(r'(.*,")(.*)(",)(.*)', lambda x: removeComma(x), line)
secondCol = lambda line: re.sub(r'^([0-9]+,)(.*,.*)(,\")(.*)$', lambda x: removeComma(x), line, 1)
lines = [betweenQuotes(l) for l in lines]
lines = [secondCol(l) for l in lines]
finalString = '\n'.join(lines)
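
To see what the comma-to-pipe step does to a single row, here is a minimal sketch. It uses a simplified regular expression rather than the exact patterns above, and both the helper name pipe_commas_in_quotes and the sample row are made up for illustration.

```python
import re

def pipe_commas_in_quotes(line):
    """Replace every comma inside a double-quoted field with ' |'."""
    return re.sub(r'"[^"]*"', lambda m: m.group(0).replace(",", " |"), line)

# Hypothetical sample row: id, quoted title, text, label
row = '12,"Hello, world",body text,FAKE'
cleaned = pipe_commas_in_quotes(row)

print(cleaned)
# A naive split on commas now yields one piece per column
print(len(row.split(",")), "->", len(cleaned.split(",")))
```

Only commas inside quoted fields are touched; the commas that separate columns stay as they are.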

Save cleaned file:

with open('fake_or_real_news_training_CLEANED.csv', 'w', encoding='utf-8') as file:
    file.write(finalString)

Data Preparation:

train = pd.read_csv("fake_or_real_news_training_CLEANED.csv")
test = pd.read_csv("fake_or_real_news_test.csv")
train = train.drop(['X1', 'X2'], axis=1)

We check whether the dataset is unbalanced. The plot shows that it is not: there is a similar number of Fake and Real news articles, so no further action is needed.

from collections import Counter
ax = sns.countplot(x=train.label, order=[x for x, count in sorted(Counter(train.label).items(), key=lambda x: -x[1])])

for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/len(train)*100),
            ha="center") 
ax.set_title("Train dataset target")
show()
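
The same balance check can be done numerically instead of reading it off a plot. This is a small sketch with made-up labels; label_shares is a hypothetical helper, not part of python_helper.py.

```python
from collections import Counter

def label_shares(labels):
    """Return the fraction of rows carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels standing in for train.label
labels = ["FAKE", "REAL", "REAL", "FAKE", "REAL"]
shares = label_shares(labels)

for label, share in shares.items():
    print(f"{label}: {share:.2%}")
```

With the real data you would pass train.label instead of the toy list; shares close to 50% each indicate a balanced dataset.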

To avoid applying the same operations twice (once on the train set and once on the test set) and to analyze the general distributions of our data, we stack train and test into a single dataframe, df.

test['label'] = None  # empty label for test
df = pd.concat([train, test])

Data Preprocessing:

In this part, we clean the articles with the help of different NLP techniques. For each technique we first explain the concept and why it matters.

To take the title into account in our prediction, we create an extra column that combines title and text. We do not make separate predictions on the title alone, since a title might be classified as, e.g., Fake news while the full text, with more explanation, tells a Real story.

df['title_and_text'] = df['title'] +' '+ df['text']
df.tail()

preprocess() can be found in python_helper.py. The preprocessing steps it performs are explained below.

  1. lowercase the text

This preprocessing step is done so words can later be cross-checked against the stopword and pos_tag dictionaries. For future analysis, it could be beneficial to flag texts that contain many words in capital letters by adding a flag variable.

  2. remove the words counting just one letter

Same rationale as step 1.

  3. remove the words that contain numbers

Same rationale as step 1.

  4. tokenize the text and remove punctuation

We tokenize with Python's built-in string methods: split() breaks the text into words (tokens) and strip(string.punctuation) removes the punctuation around each token.

  5. remove all stop words

A relevant analysis of the text depends on the most recurring words. Stopwords such as “the”, “as” and “and” appear often in a text but carry little meaning. For this reason, they are removed.

  6. remove tokens that are empty

After tokenization, we have to make sure all tokens taken into account contribute to the label prediction, so empty tokens are dropped.

  7. pos tag the text

We use the pos_tag function included in the nltk library. It tags each token with its part of speech (e.g. noun, verb, adjective or adverb), which adds to the understanding of the articles.

  8. lemmatize the text

To normalize the text, we apply lemmatization. This way, words with the same root are processed equally: e.g. when took or taken appear in the text, they are lemmatized to take, the infinitive of both forms.
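
Steps 1 to 6 can be sketched without any external library. The example below is a simplified stand-in for preprocess(): it skips POS tagging and lemmatization, and STOP_WORDS is a tiny made-up set rather than NLTK's English stop word list.

```python
import string

# Tiny hypothetical stop word set; the article uses NLTK's English list
STOP_WORDS = {"the", "as", "and", "to", "of", "in"}

def simple_preprocess(text):
    """Apply steps 1-6 (no POS tagging, no lemmatization)."""
    tokens = text.lower().split(" ")                                  # 1. lowercase, split
    tokens = [t for t in tokens if len(t) > 1]                        # 2. drop one-letter words
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]   # 3. drop words with digits
    tokens = [t.strip(string.punctuation) for t in tokens]            # 4. strip punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]               # 5. drop stop words
    tokens = [t for t in tokens if t]                                 # 6. drop empty tokens
    return " ".join(tokens)

result = simple_preprocess("The 2 senators voted, and a bill passed in 2016!")
print(result)  # senators voted bill passed
```

The order of the steps matters: a token such as "2016!" is removed in step 3 because it still contains digits, before punctuation is stripped in step 4.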

df['preprocessed_text'] = df['title_and_text'].apply(lambda x: preprocess(x))
## Save preprocessed df
df.to_csv("fake_or_real_news_train_PREPROCESSED.csv", index=False)
df = pd.read_csv("fake_or_real_news_train_PREPROCESSED.csv")
df = df.astype(object).replace(np.nan, 'None')
df.tail()

Split Train and Test again after pre-processing is done:

encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label = split_train_holdout_test(encoder, df)

Baseline Modelling

First, we create a dataframe called models to keep track of different models and their scores.

models = pd.DataFrame(columns=['model_name', 'model_object', 'score'])

Vectorizing dataset:

For any text to be fed to a model, the text has to be transformed into numerical values. This process is called vectorizing and is redone every time a new feature is added.
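
As a rough illustration of what a count vectorizer does, the sketch below builds a vocabulary and turns each text into a vector of word counts in plain Python. It is not scikit-learn's implementation; fit_vocabulary, transform and the two toy documents are assumptions made for this example.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map every distinct word to a column index (alphabetical order)."""
    words = sorted({word for doc in docs for word in doc.split()})
    return {word: index for index, word in enumerate(words)}

def transform(doc, vocab):
    """Count how often each vocabulary word occurs in doc."""
    vector = [0] * len(vocab)
    for word, count in Counter(doc.split()).items():
        if word in vocab:  # words unseen during fitting are ignored
            vector[vocab[word]] = count
    return vector

docs = ["senator vote bill", "bill pass bill"]  # hypothetical documents
vocab = fit_vocabulary(docs)
vectors = [transform(doc, vocab) for doc in docs]

print(vocab)    # {'bill': 0, 'pass': 1, 'senator': 2, 'vote': 3}
print(vectors)  # [[1, 0, 1, 1], [2, 1, 0, 0]]
```

Each document becomes one row and each vocabulary word one column, which is the shape the classifiers below expect.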

count_vect = CountVectorizer(analyzer = "word")
count_vectorizer = count_vect.fit(df.preprocessed_text)
train_cv_vector = count_vectorizer.transform(train_cv.preprocessed_text)
train_holdout_vector = count_vectorizer.transform(train_holdout.preprocessed_text)
test_vector = count_vectorizer.transform(test.preprocessed_text)
count_vect.get_feature_names_out()[:10]  # get_feature_names() was removed in scikit-learn 1.2

Baseline Model 1: SVC

We create a baseline classification model with a support vector machine, a good model to handle complex classifications.

SVC_classifier = runModel(encoder,
               train_cv_vector,
               train_cv_label,
               train_holdout_vector,
               train_holdout.label,
               "svc",
               "Baseline Model 1: SVC")
models.loc[len(models)] = SVC_classifier

Baseline Model 2: Naïve Bayes

The Naïve Bayes model suits this classification for the following reason. The labels Real and Fake are hidden, but every word has, based on our training data, a certain probability of belonging to each of the two categories. The final score for each label is calculated by multiplying the probabilities of all words in the text (in the example, 0.006 for real and 0.288 for fake). The algorithm therefore does not take the order of the words into account, and the example text "rude hell worth" is classified as fake.
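
The multiplication can be sketched in a few lines. The per-word probabilities below are invented for illustration and are not taken from the trained model; the sketch also assumes equal class priors and leaves out smoothing, both of which a real Naive Bayes classifier handles.

```python
import math

# Hypothetical P(word | label) values, invented for this example
WORD_PROBS = {
    "REAL": {"rude": 0.1, "hell": 0.1, "worth": 0.5},
    "FAKE": {"rude": 0.6, "hell": 0.7, "worth": 0.5},
}

def label_scores(words):
    """Multiply the per-word probabilities for each label."""
    return {
        label: math.prod(probs[word] for word in words)
        for label, probs in WORD_PROBS.items()
    }

def classify(words):
    """Pick the label with the highest score."""
    scores = label_scores(words)
    return max(scores, key=scores.get)

print(label_scores(["rude", "hell", "worth"]))
print(classify(["rude", "hell", "worth"]))  # FAKE
print(classify(["worth", "hell", "rude"]))  # FAKE: word order does not matter
```

Because the score is a plain product, shuffling the words leaves the predicted label unchanged.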

NB = runModel(encoder,
              train_cv_vector,
              train_cv_label,
              train_holdout_vector,
              train_holdout.label,
              "nb",
              "Baseline Model 2: Naive Bayes")
models.loc[len(models)] = NB

Baseline Model 3: MaxEnt Classifier

maxEnt = runModel(encoder,
              train_cv_vector,
              train_cv_label,
              train_holdout_vector,
              train_holdout.label,
              "maxEnt",
              "Baseline Model 3: MaxEnt Classifier")
models.loc[len(models)] = maxEnt

Feature Engineering

  • Explicit POS tagging
  • TF-IDF weighting
  • Bigram Count Vectorizer

==> Select Final Model and predict on test

1. POS Tagging

Adding a prefix to each word with its type (Noun, Verb, Adjective,…). e.g: I went to school => PRP-I VBD-went TO-to NN-school

Also, after lemmatization, it will be ‘VB-go NN-school’, which indicates the semantics and distinguishes the purpose of the sentence.

This will help the classifier differentiate between different types of sentences.
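
The prefixing itself is a simple join over (word, tag) pairs. The sketch below uses hand-written tags that match the example sentence above, so it runs without NLTK; in the article the pairs come from nltk.pos_tag, and prefix_with_tags is a hypothetical name.

```python
def prefix_with_tags(tagged_words):
    """Turn (word, tag) pairs into 'TAG-word' tokens joined by spaces."""
    return " ".join(tag + "-" + word for word, tag in tagged_words)

# Hand-written tags for the example sentence "I went to school"
tagged = [("I", "PRP"), ("went", "VBD"), ("to", "TO"), ("school", "NN")]
tagged_text = prefix_with_tags(tagged)

print(tagged_text)  # PRP-I VBD-went TO-to NN-school
```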

df['pos_tagged_text'] = df['preprocessed_text'].apply(lambda x: pos_tag_words(x))
df.head()

Rerun Models on pos-tagged text (FE1)

encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label = split_train_holdout_test(encoder, df, False)
count_vect = CountVectorizer(analyzer = "word")
# fit on the same column that is transformed below, so the tag prefixes enter the vocabulary
count_vectorizer = count_vect.fit(df.pos_tagged_text)
train_cv_vector = count_vectorizer.transform(train_cv.pos_tagged_text)
train_holdout_vector = count_vectorizer.transform(train_holdout.pos_tagged_text)
test_vector = count_vectorizer.transform(test.pos_tagged_text)

a. SVC with FE1

SVC_pos_tag = runModel(encoder,
               train_cv_vector,
               train_cv_label,
               train_holdout_vector,
               train_holdout.label,
               "svc",
               "SVC on pos-tagged text")
models.loc[len(models)] = SVC_pos_tag

b. NB_pos_tag with FE1

NB_pos_tag = runModel(encoder,
              train_cv_vector,
              train_cv_label,
              train_holdout_vector,
              train_holdout.label,
              "nb",
              "Naive Bayes on pos-tagged text")
models.loc[len(models)] = NB_pos_tag

c. maxEnt with FE1

maxEnt_pos_tag = runModel(encoder,
              train_cv_vector,
              train_cv_label,
              train_holdout_vector,
              train_holdout.label,
              "maxEnt",
              "MaxEnt Classifier on pos-tagged text")
models.loc[len(models)] = maxEnt_pos_tag

There seems to be a slight increase in Accuracy after pos-tagging.

2. TF-IDF weighting

We add a weight to each word using TF-IDF.

We calculate the TF-IDF score of each term, treating each article as one document: a term scores high when it is frequent within an article but rare across the other articles.

We apply this to the concatenation of the cleaned text and the POS-tagged text.
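
A plain-Python sketch of the weighting is shown below. It follows the smoothed IDF formula documented for scikit-learn's TfidfTransformer, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, but it is an illustration with made-up documents and hypothetical helper names, not the library code.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn one IDF weight per word from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter(word for doc in train_docs for word in set(doc.split()))
    return {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}

def tfidf_vector(doc, idf):
    """Weight the word counts of doc by IDF and L2-normalize the result."""
    counts = Counter(word for word in doc.split() if word in idf)
    weights = {word: count * idf[word] for word, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in weights.values()))
    return {word: value / norm for word, value in weights.items()} if norm else {}

train_docs = ["bill vote senator", "bill pass", "bill shock hoax"]  # hypothetical
idf = fit_idf(train_docs)
vec = tfidf_vector("bill hoax", idf)

print(idf["bill"], idf["hoax"])  # "bill" is in every document, so its IDF is lowest
print(vec)
```

Note that the IDF weights are learned once from the training documents and then reused for any new text, which is why holdout and test data should only be transformed, never refitted.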

df["clean_and_pos_tagged_text"] = df['preprocessed_text'] + ' ' + df['pos_tagged_text']
df.head(1)
encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label = split_train_holdout_test(encoder, df, False)
count_vect = CountVectorizer(analyzer = "word")
count_vectorizer = count_vect.fit(df.clean_and_pos_tagged_text)
train_cv_vector = count_vectorizer.transform(train_cv.clean_and_pos_tagged_text)
train_holdout_vector = count_vectorizer.transform(train_holdout.clean_and_pos_tagged_text)
test_vector = count_vectorizer.transform(test.clean_and_pos_tagged_text)
tf_idf = TfidfTransformer(norm="l2")
# learn the IDF weights on the training split only, then reuse them
train_cv_tf_idf = tf_idf.fit_transform(train_cv_vector)
train_holdout_tf_idf = tf_idf.transform(train_holdout_vector)
test_tf_idf = tf_idf.transform(test_vector)

Rerun Models on preprocessed + pos-tagged (FE1) + TF-IDF weighted text (FE2)

a. SVC with FE1 and FE2

SVC_tf_idf = runModel(encoder,
               train_cv_tf_idf,
               train_cv_label,
               train_holdout_tf_idf,
               train_holdout.label,
               "svc",
               "SVC on preprocessed+pos-tagged TF-IDF weighted text")
models.loc[len(models)] = SVC_tf_idf

b. NB with FE1 and FE2

NB_tf_idf = runModel(encoder,
               train_cv_tf_idf,
               train_cv_label,
               train_holdout_tf_idf,
               train_holdout.label,
              "nb",
              "Naive Bayes on preprocessed+pos-tagged TF-IDF weighted text")
models.loc[len(models)] = NB_tf_idf

c. maxEnt with FE1 and FE2

maxEnt_tf_idf = runModel(encoder,
               train_cv_tf_idf,
               train_cv_label,
               train_holdout_tf_idf,
               train_holdout.label,
              "maxEnt",
              "MaxEnt on preprocessed+pos-tagged TF-IDF weighted text")
models.loc[len(models)] = maxEnt_tf_idf
Using TF-IDF increased the score to ~94.5% with the SVC and MaxEnt models.
Naive Bayes, by contrast, scored lower, so we drop it from the pipeline.

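3. Use N-gram Vectorizer instead of regular vectorizer

For FE3, we use an n-gram vectorizer, which also counts sequences of adjacent words rather than only each word separately. With ngram_range=(1,2) it counts single words and bigrams (used for SVC below), and with ngram_range=(1,3) it also counts trigrams (used for MaxEnt below). In this short example sentence, the trigrams are “In this short”, “this short example” and “short example sentence”.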
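
Extracting word n-grams amounts to sliding a window over the tokens. A minimal sketch, where ngrams is a hypothetical helper and not the vectorizer's own code:

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "In this short example sentence".split()
trigrams = ngrams(tokens, 3)
bigrams = ngrams(tokens, 2)

print(trigrams)  # ['In this short', 'this short example', 'short example sentence']
print(bigrams)
```

A sentence of five words yields three trigrams and four bigrams, so the vocabulary grows quickly as n increases.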

encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label = split_train_holdout_test(encoder, df, False)
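# ngram_range=(1,2): single words and bigrams (the variable keeps its original name)
trigram_vect = CountVectorizer(analyzer = "word", ngram_range=(1,2))
# fit the n-gram vectorizer itself, not the earlier unigram count_vect
trigram_vect = trigram_vect.fit(df.clean_and_pos_tagged_text)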
train_cv_vector = trigram_vect.transform(train_cv.clean_and_pos_tagged_text)
train_holdout_vector = trigram_vect.transform(train_holdout.clean_and_pos_tagged_text)
test_vector = trigram_vect.transform(test.clean_and_pos_tagged_text)

tf_idf = TfidfTransformer(norm="l2")
# learn the IDF weights on the training split only, then reuse them
train_cv_bigram_tf_idf = tf_idf.fit_transform(train_cv_vector)
train_holdout_bigram_tf_idf = tf_idf.transform(train_holdout_vector)
test_bigram_tf_idf = tf_idf.transform(test_vector)

Rerun Models on preprocessed + pos-tagged (FE1) + TF-IDF weighted (FE2) + Trigram vectorized text (FE3)

a. SVC with FE1, FE2 and FE3

SVC_trigram_tf_idf = runModel(encoder,
               train_cv_bigram_tf_idf,
               train_cv_label,
               train_holdout_bigram_tf_idf,
               train_holdout.label,
               "svc",
               "SVC on bigram vect.+ TF-IDF")
models.loc[len(models)] = SVC_trigram_tf_idf

b. maxEnt with FE1, FE2 and FE3

encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label = split_train_holdout_test(encoder, df, False)
trigram_vect = CountVectorizer(analyzer = "word", ngram_range=(1,3))
# fit the trigram vectorizer itself, not the earlier unigram count_vect
trigram_vect = trigram_vect.fit(df.clean_and_pos_tagged_text)
train_cv_vector = trigram_vect.transform(train_cv.clean_and_pos_tagged_text)
train_holdout_vector = trigram_vect.transform(train_holdout.clean_and_pos_tagged_text)

tf_idf = TfidfTransformer(norm="l2")
# learn the IDF weights on the training split only, then reuse them
train_cv_trigram_tf_idf = tf_idf.fit_transform(train_cv_vector)
train_holdout_trigram_tf_idf = tf_idf.transform(train_holdout_vector)
maxEnt_tf_idf = runModel(encoder,
               train_cv_trigram_tf_idf,
               train_cv_label,
               train_holdout_trigram_tf_idf,
               train_holdout.label,
              "maxEnt",
              "MaxEnt on trigram vect.+ TF-IDF")
models.loc[len(models)] = maxEnt_tf_idf

It looks like the “MaxEnt on trigram vect.+ TF-IDF” is the best model with the highest score. We will use it to predict and classify the test set.

Predicting on test dataset

1. Train on whole data and predict on test

PREPROCESSED data

test = pd.read_csv("fake_or_real_news_test.csv")
train = pd.read_csv("fake_or_real_news_training_CLEANED.csv")

train['title_and_text'] = train['title'] +' '+ train['text']
train['preprocessed_text'] = train['title_and_text'].apply(lambda x: preprocess(x))

test['title_and_text'] = test['title'] +' '+ test['text']
test['preprocessed_text'] = test['title_and_text'].apply(lambda x: preprocess(x))

## Save preprocessed df
train.to_csv("fake_or_real_news_train_PREPROCESSED.csv", index=False)

# Save preprocessed df
test.to_csv("fake_or_real_news_test_PREPROCESSED.csv", index=False)
train = pd.read_csv("fake_or_real_news_train_PREPROCESSED.csv")
train = train.astype(object).replace(np.nan, 'None')

test = pd.read_csv("fake_or_real_news_test_PREPROCESSED.csv")
test = test.astype(object).replace(np.nan, 'None')

POS Tagging

train['pos_tagged_text'] = train['preprocessed_text'].apply(lambda x: pos_tag_words(x))
test['pos_tagged_text'] = test['preprocessed_text'].apply(lambda x: pos_tag_words(x))

Merge clean and pos tagged

train["clean_and_pos_tagged_text"] = train['preprocessed_text'] + ' ' + train['pos_tagged_text']
test["clean_and_pos_tagged_text"] = test['preprocessed_text'] + ' ' + test['pos_tagged_text']

Modelling using MaxEnt on trigram vect.+ TF-IDF Grid Search Best params

Trigram + TF-IDF + classifier pipeline

from sklearn.pipeline import Pipeline
trigram_vectorizer = CountVectorizer(analyzer = "word", ngram_range=(1,3))
tf_idf = TfidfTransformer(norm="l2")
classifier = LogisticRegression(C=1000, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=None, penalty='l2', random_state=None, solver='saga',
          tol=0.0001, verbose=0, warm_start=False)

pipeline = Pipeline([
     ('trigram_vectorizer', trigram_vectorizer),
     ('tfidf', tf_idf),
     ('clf', classifier),
 ])
pipeline.fit(train.clean_and_pos_tagged_text, encoder.fit_transform(train.label.values))
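import pickle
# Save the fitted pipeline. The Flask app below loads it from pickle/pipeline.pkl,
# so copy the file into that folder before starting the app.
with open("pipeline.pkl", "wb") as f:
    pickle.dump(pipeline, f)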
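
The save and load round trip can be tried in isolation. In the sketch below a plain dictionary stands in for the fitted pipeline and a temporary directory stands in for the project folder; both are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# A plain dict standing in for the fitted pipeline (hypothetical)
model_stand_in = {"name": "demo", "weights": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.pkl")

    # Write the object in binary mode and close the file right away
    with open(path, "wb") as f:
        pickle.dump(model_stand_in, f)

    # Read it back the same way the Flask app would
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stand_in)  # True
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.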

2. Predicting on test

print(colored("Predicting on test", 'blue'))
test_predictions = pipeline.predict(test.clean_and_pos_tagged_text)
test_predictions_decoded = encoder.inverse_transform( test_predictions )

predictions = test
predictions["label"] = test_predictions_decoded
import collections
ax = sns.countplot(x=predictions.label,
                order=[x for x, count in sorted(collections.Counter(predictions.label).items(),
                key=lambda x: -x[1])])
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/len(predictions)*100),
            ha="center") 
ax.set_title("Test dataset target")
show()
predictions.drop(columns=["title","text","title_and_text","preprocessed_text","pos_tagged_text","clean_and_pos_tagged_text"]).head()
predictions.to_csv("TEST_PREDICTIONS.csv", index=False)

2. Deploy the model using Flask APP (app.py & predictionModel.py)

  1. predictionModel.py
  2. app.py

predictionModel.py is shown below

#This is predictionModel.py File
# preprocessing
import timeit
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet
import pickle
import string
import nltk
nltk.data.path.append('./nltk_data')

start = timeit.default_timer()


with open("pickle/pipeline.pkl", 'rb') as f:
    pipeline = pickle.load(f)

stop = timeit.default_timer()
print('=> Pickle Loaded in: ', stop - start)

       
class PredictionModel:
    # constructor
    def __init__(self, text):
        # per-instance dict, so results are not shared between requests
        self.output = {'original': text}

    def predict(self):

        self.preprocess()
        self.pos_tag_words()

        # Merge text
        clean_and_pos_tagged_text = self.output['preprocessed'] + \
            ' ' + self.output['pos_tagged']

        self.output['prediction'] = 'FAKE' if pipeline.predict(
            [clean_and_pos_tagged_text])[0] == 0 else 'REAL'

        return self.output

    # Helper methods
    def preprocess(self):
        # lowercase the text
        text = self.output['original'].lower()

        # remove the words counting just one letter
        text = [t for t in text.split(" ") if len(t) > 1]

        # remove the words that contain numbers
        text = [word for word in text if not any(c.isdigit() for c in word)]

        # tokenize the text and remove punctuation
        text = [word.strip(string.punctuation) for word in text]

        # remove all stop words
        stop = stopwords.words('english')
        text = [x for x in text if x not in stop]

        # remove tokens that are empty
        text = [t for t in text if len(t) > 0]

        # pos tag the text
        pos_tags = pos_tag(text)

        # lemmatize the text
        text = [WordNetLemmatizer().lemmatize(t[0], self.get_wordnet_pos(t[1]))
                for t in pos_tags]

        # join all
        self.output['preprocessed'] = " ".join(text)

    def get_wordnet_pos(self, pos_tag):
        if pos_tag.startswith('J'):
            return wordnet.ADJ
        elif pos_tag.startswith('V'):
            return wordnet.VERB
        elif pos_tag.startswith('N'):
            return wordnet.NOUN
        elif pos_tag.startswith('R'):
            return wordnet.ADV
        else:
            return wordnet.NOUN

    def pos_tag_words(self):
        pos_text = nltk.pos_tag(
            nltk.word_tokenize(self.output['preprocessed']))
        self.output['pos_tagged'] = " ".join(
            [pos + "-" + word for word, pos in pos_text])
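
A general Python point that matters for request-handling classes such as PredictionModel: a mutable attribute defined in the class body is shared by all instances, while one assigned in __init__ belongs to a single instance. The two classes below are hypothetical and exist only to show the difference.

```python
class SharedState:
    output = {}  # defined on the class: one dict shared by every instance

    def __init__(self, text):
        self.output["original"] = text  # mutates the shared dict


class PerInstanceState:
    def __init__(self, text):
        self.output = {"original": text}  # a fresh dict for each instance


a = SharedState("first")
b = SharedState("second")
print(a.output["original"])  # second: b overwrote the entry a relies on

c = PerInstanceState("first")
d = PerInstanceState("second")
print(c.output["original"])  # first
```

In a web application that serves several requests, per-instance state keeps one request's text and prediction from leaking into another's.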

app.py is shown below

from flask import Flask, jsonify, request, render_template
from predictionModel import PredictionModel
import pandas as pd
from random import randrange

app = Flask(__name__, static_folder="./public/static",
            template_folder="./public")
@app.route("/")
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    model = PredictionModel(request.json)
    return jsonify(model.predict())

@app.route('/random', methods=['GET'])
def random():
    data = pd.read_csv("data/fake_or_real_news_test.csv")
    index = randrange(len(data))  # any row, including the last one
    return jsonify({'title': data.loc[index, 'title'], 'text': data.loc[index, 'text']})

# Only for local running
if __name__ == '__main__':
    app.run()
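
Once the app is running, the /predict route can be called from any HTTP client. The sketch below only builds the request with the standard library. It assumes the app listens on Flask's default local address http://127.0.0.1:5000 and that the request body is a JSON-encoded string, since predict() passes request.json straight to PredictionModel, which calls .lower() on it. build_predict_request is a hypothetical helper.

```python
import json
import urllib.request

def build_predict_request(text, base_url="http://127.0.0.1:5000"):
    """Build a POST request whose body is the article text as a JSON string."""
    body = json.dumps(text).encode("utf-8")
    return urllib.request.Request(
        base_url + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_predict_request("Some article text to classify")
print(req.get_method(), req.full_url)

# With the Flask app running locally, send the request like this:
# with urllib.request.urlopen(req) as response:
#     print(json.loads(response.read()))
```

The response is the output dictionary returned by PredictionModel.predict(), which includes the prediction key.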

Download complete Project:

