The advent of the World Wide Web and the rapid adoption of social media platforms such as Facebook and Twitter paved the way for information dissemination on a scale never before witnessed in human history. Thanks to these platforms, consumers create and share more information than ever, and some of it is inaccurate and has no bearing on reality. Classifying a written article as misleading or disinformation with an algorithm is a difficult task.
Note: You can download the complete project by pressing the download button at the end of this blog.
In this blog, we learn how to build a Flask web application that classifies text using machine learning approaches. The model classifies a news article as either real or fake, much like classifying email as ham or spam.
The project is divided into two steps:
- Train machine learning model
- Deploy the model using Flask APP
1. Train the machine learning model
To train a machine learning model, we first need to understand the problem we are solving, i.e. whether it is classification or regression, and then pick a suitable model and train it. In this blog, we train several ML models, such as SVM, Naive Bayes, and logistic regression, for fake news text classification.
We build a text classifier to decide whether a given article is fake or real news. Using Natural Language Processing techniques in Python and classification theory, we reached an accuracy of 0.945455 for classifying news as fake.
Prerequisite: make sure you have already installed Flask, nltk, Python, scikit-learn (sklearn), and the other necessary libraries.
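If you are starting from a fresh environment, the sketch below covers the installs and NLTK resources this tutorial relies on. The exact package list is an assumption based on the imports used later; adjust it to your setup.

# Install sketch (run in a terminal); the package names below are the usual PyPI ones:
# pip install flask nltk scikit-learn pandas numpy matplotlib seaborn termcolor

import nltk

# Resources needed by the preprocessing steps further down
nltk.download('punkt')                        # tokenizer used by word_tokenize
nltk.download('stopwords')                    # English stop word list
nltk.download('wordnet')                      # dictionary used by the lemmatizer
nltk.download('averaged_perceptron_tagger')   # POS tagger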
Training a classification model on fake and real news
## This file has all imports and helper functions used throughout the notebook
%run python_helper.py
%matplotlib inline
python_helper.py
######################################################
#################### IMPORTS
######################################################
import os
import re
import string
import warnings

import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from matplotlib.pyplot import *
import seaborn as sns
from termcolor import colored

import nltk
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')

from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, ShuffleSplit, cross_val_score, train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

warnings.filterwarnings('ignore')

######################################################
#################### Globals
######################################################
seed = 12345
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=seed)
encoder = preprocessing.LabelEncoder()

######################################################
#################### Helper Functions
######################################################
def get_wordnet_pos(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN


def preprocess(text):
    # lowercase the text
    text = text.lower()
    # remove the words counting just one letter
    text = [t for t in text.split(" ") if len(t) > 1]
    # remove the words that contain numbers
    text = [word for word in text if not any(c.isdigit() for c in word)]
    # tokenize the text and remove punctuation
    text = [word.strip(string.punctuation) for word in text]
    # remove all stop words
    stop = stopwords.words('english')
    text = [x for x in text if x not in stop]
    # remove tokens that are empty
    text = [t for t in text if len(t) > 0]
    # pos tag the text
    pos_tags = pos_tag(text)
    # lemmatize the text
    text = [WordNetLemmatizer().lemmatize(t[0], get_wordnet_pos(t[1])) for t in pos_tags]
    # join all
    text = " ".join(text)
    return text


def split_train_holdout_test(encoder, df, verbose=True):
    # Resplit original train and test
    train = df[df["label"] != "None"]
    test = df[df["label"] == "None"]

    # Encode Target
    train["encoded_label"] = encoder.fit_transform(train.label.values)

    # Take holdout from train
    train_cv, train_holdout, train_cv_label, train_holdout_label = train_test_split(
        train, train.encoded_label, test_size=0.33, random_state=seed)

    if verbose:
        print("\nTrain dataset (Full)")
        print(train.shape)
        print("Train dataset cols")
        print(list(train.columns))
        print("\nTrain CV dataset (subset)")
        print(train_cv.shape)
        print("Train Holdout dataset (subset)")
        print(train_holdout.shape)
        print("\nTest dataset")
        print(test.shape)
        print("Test dataset cols")
        print(list(test.columns))

    return encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label


def runModel(encoder, train_vector, train_label, holdout_vector, holdout_label, type, name):
    global cv
    global seed

    ## Classifier types
    if (type == "svc"):
        classifier = SVC()
        grid = [
            {'C': [1, 10, 50, 100], 'kernel': ['linear']},
            {'C': [10, 100, 500, 1000], 'gamma': [0.0001], 'kernel': ['rbf']},
        ]
    if (type == "nb"):
        classifier = MultinomialNB()
        grid = {}
    if (type == "maxEnt"):
        classifier = LogisticRegression()
        grid = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

    # Model
    print(colored(name, 'red'))
    model = GridSearchCV(estimator=classifier, cv=cv, param_grid=grid)
    print(colored(model.fit(train_vector, train_label), "yellow"))

    # Score
    print(colored("\nCV-scores", 'blue'))
    means = model.cv_results_['mean_test_score']
    stds = model.cv_results_['std_test_score']
    for mean, std, params in sorted(zip(means, stds, model.cv_results_['params']), key=lambda x: -x[0]):
        print("Accuracy: %0.3f (+/-%0.03f) for params: %r" % (mean, std * 2, params))
    print()

    print(colored("\nBest Estimator Params", 'blue'))
    print(colored(model.best_estimator_, "yellow"))

    # Predictions
    print(colored("\nPredictions:", 'blue'))
    model_train_pred = encoder.inverse_transform(model.predict(holdout_vector))
    print(model_train_pred)

    # Confusion Matrix
    cm = confusion_matrix(holdout_label, model_train_pred)
    # Transform to df for easier plotting
    cm_df = pd.DataFrame(cm, index=['FAKE', 'REAL'], columns=['FAKE', 'REAL'])
    plt.figure(figsize=(5.5, 4))
    sns.heatmap(cm_df, annot=True, fmt='g')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()

    # Accuracy
    acc = accuracy_score(holdout_label, model_train_pred)
    print(colored("\nAccuracy:", 'blue'))
    print(colored(acc, 'green'))

    return [name, model, acc]


def pos_tag_words(text):
    pos_text = nltk.pos_tag(nltk.word_tokenize(text))
    return " ".join([pos + "-" + word for word, pos in pos_text])
Clean & Save Data
Inspecting the data files, we noticed several issues that prevent the training dataset from being parsed correctly. Using regular expressions, we convert all commas between quotation marks to a pipe character, so the CSV parser places every value in its correct column.
input_str = open("fake_or_real_news_training.csv", encoding='utf-8')

# Remove all new lines
noNewLines = re.sub("\n", "", input_str.read())

# Re-add a new line at the end of each row
noNewLines = re.sub("X1,X2", "X1,X2\n", noNewLines)
noNewLines = re.sub(",FAKE[,]+", ",FAKE,,\n", noNewLines)
# noNewLines = re.sub(",FAKE,(?!,)", ",FAKE,,\n", noNewLines)
# noNewLines = re.sub(",FAKE,,(?!,)", ",FAKE,,\n", noNewLines)
noNewLines = re.sub(",REAL[,]+", ",REAL,,\n", noNewLines)
# noNewLines = re.sub(",REAL,(?!,)", ",REAL,,\n", noNewLines)
# noNewLines = re.sub(",REAL,,(?!,)", ",REAL,,\n", noNewLines)

# Replace any commas between two quotes with |
lines = noNewLines.split('\n')

def removeComma(g):
    t = g.groups()
    t = [t[0], t[1].replace(',', ' |'), t[2], t[3]]
    return "".join(t)

betweenQuotes = lambda line: re.sub(r'(.*,")(.*)(",)(.*)', lambda x: removeComma(x), line)
secondCol = lambda line: re.sub(r'^([0-9]+,)(.*,.*)(,\")(.*)$', lambda x: removeComma(x), line, 1)

lines = [betweenQuotes(l) for l in lines]
lines = [secondCol(l) for l in lines]

finalString = '\n'.join(lines)
Save cleaned file:
file = open('fake_or_real_news_training_CLEANED.csv', 'w', encoding='utf-8')
file.write(finalString)
file.close()
Data Preparation:
train = pd.read_csv("fake_or_real_news_training_CLEANED.csv")
test = pd.read_csv("fake_or_real_news_test.csv")
train = train.drop(['X1', 'X2'], axis=1)
We check whether the dataset is imbalanced. The plot shows this is not the case, as there are similar numbers of fake and real news articles, so no further action has to be taken.
from collections import Counter

ax = sns.countplot(train.label,
                   order=[x for x, count in sorted(Counter(train.label).items(), key=lambda x: -x[1])])
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2., height + 3,
            '{:1.2f}%'.format(height / len(train) * 100), ha="center")
ax.set_title("Train dataset target")
show()
To avoid doing double work when applying operations to the train and test sets, and to analyse the general distributions of our data, we stack train and test into a single dataframe df.
test['label'] = None  # empty label for test
df = pd.concat([train, test])
Data Preprocessing:
In this part, we will be cleaning the articles with the help of different NLP techniques, of which we will first explain the concept and its importance.
To take the title into account in our prediction, we created an extra column that combines title and text. We do not make separate predictions on the title alone, since a title might read as fake news even though the full text tells a real story.
df['title_and_text'] = df['title'] + ' ' + df['text']
df.tail()
The preprocess() function can be found in python_helper.py. Below are explanations of the preprocessing steps we took; a small usage sketch follows the list.
- lowercase the text
This preprocessing step is done so words can later be cross-checked against the stopword and pos_tag dictionaries. For future analysis it could have been beneficial to flag texts containing many words in capital letters.
- remove the words counting just one letter
Idem step one.
- remove the words that contain numbers
Idem step one.
- tokenize the text and remove punctuation
We performed tokenization with basic Python string functions to split sentences into words (tokens).
- remove all stop words
A relevant analysis of the text depends on its most recurrent meaningful words. Stopwords such as “the”, “as” and “and” appear frequently in a text but do not add much meaning, so they are removed.
- remove tokens that are empty
After tokenization, we have to make sure all tokens taken into account contribute to the label prediction.
- pos tag the text
We use the pos_tag function included in the nltk library. It classifies our tokenized words as nouns, verbs, adjectives or adverbs, which adds to the understanding of the articles.
- lemmatize the text
In order to normalize the text, we apply lemmatization. This way, words with the same root are processed equally; e.g. when “took” or “taken” appear in the text, they are lemmatized to “take”, the infinitive of both verb forms.
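To make the steps above concrete, here is a small usage sketch of preprocess() on a made-up sentence; the exact output depends on NLTK's tagger and word lists, so treat it as approximate.

# Illustrative only: a made-up sentence run through preprocess()
sample = "The president took 3 big decisions, and they were taken quickly!"
print(preprocess(sample))
# => roughly: "president take big decision take quickly"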
df['preprocessed_text'] = df['title_and_text'].apply(lambda x: preprocess(x))

## Save preprocessed df
df.to_csv("fake_or_real_news_train_PREPROCESSED.csv", index=False)
df = pd.read_csv("fake_or_real_news_train_PREPROCESSED.csv")
df = df.astype(object).replace(np.nan, 'None')
df.tail()
Split Train and Test again after pre-processing is done:
encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label = split_train_holdout_test(encoder, df)
Baseline Modelling
First, we create a dataframe called models to keep track of different models and their scores.
models = pd.DataFrame(columns=['model_name', 'model_object', 'score'])
Vectorizing dataset:
For any text to be fed to a model, the text has to be transformed into numerical values. This process is called vectorizing and will be redone every time a new feature is added.
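As a toy illustration (made-up sentences, not from the dataset), this is what a count vectorizer produces: one row per document, one column per vocabulary word.

from sklearn.feature_extraction.text import CountVectorizer

toy = ["fake news spreads fast", "real news spreads slowly"]
toy_vect = CountVectorizer(analyzer="word").fit(toy)
print(toy_vect.get_feature_names())       # ['fake', 'fast', 'news', 'real', 'slowly', 'spreads']
print(toy_vect.transform(toy).toarray())  # [[1 1 1 0 0 1]
                                          #  [0 0 1 1 1 1]]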
count_vect = CountVectorizer(analyzer="word")
count_vectorizer = count_vect.fit(df.preprocessed_text)

train_cv_vector = count_vectorizer.transform(train_cv.preprocessed_text)
train_holdout_vector = count_vectorizer.transform(train_holdout.preprocessed_text)
test_vector = count_vectorizer.transform(test.preprocessed_text)

count_vect.get_feature_names()[:10]
Baseline Model 1: SVC
We create a baseline classification model with a support vector machine, a good model to handle complex classifications.
SVC_classifier = runModel(encoder, train_cv_vector, train_cv_label,
                          train_holdout_vector, train_holdout.label,
                          "svc", "Baseline Model 1: SVC")
models.loc[len(models)] = SVC_classifier
Baseline Model 2: Naïve Bayes
The Naïve Bayes model is helpful for our classification because, although the Real and Fake labels are hidden, every word has (based on our training data) a certain probability of belonging to each of the two categories. The final score for a text is obtained by multiplying the probabilities of its words, e.g. 0.006 for REAL versus 0.288 for FAKE; the algorithm therefore ignores word order in the multiplication. In that example, the text “rude hell worth” is classified as fake.
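As a toy sketch of that calculation, the per-word probabilities below are illustrative values chosen only so that they multiply out to the 0.006 and 0.288 figures mentioned above.

# Illustrative per-word probabilities (not learned from data)
word_probs = {
    "rude":  {"REAL": 0.10, "FAKE": 0.60},
    "hell":  {"REAL": 0.20, "FAKE": 0.60},
    "worth": {"REAL": 0.30, "FAKE": 0.80},
}

def class_score(words, label):
    # multiply the per-word probabilities; word order plays no role
    score = 1.0
    for w in words:
        score *= word_probs[w][label]
    return score

words = "rude hell worth".split()
print(class_score(words, "REAL"))  # 0.10 * 0.20 * 0.30 = 0.006
print(class_score(words, "FAKE"))  # 0.60 * 0.60 * 0.80 = 0.288
# The FAKE score is higher, so "rude hell worth" is classified as FAKE.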
NB = runModel(encoder, train_cv_vector, train_cv_label,
              train_holdout_vector, train_holdout.label,
              "nb", "Baseline Model 2: Naive Bayes")
models.loc[len(models)] = NB
Baseline Model 3: MaxEnt Classifier
maxEnt = runModel(encoder, train_cv_vector, train_cv_label,
                  train_holdout_vector, train_holdout.label,
                  "maxEnt", "Baseline Model 3: MaxEnt Classifier")
models.loc[len(models)] = maxEnt
Feature Engineering
- Explicit POS tagging
- TF-IDF weighting
- Bigram Count Vectorizer
==> Select Final Model and predict on test
1. POS Tagging
Adding a prefix to each word with its type (Noun, Verb, Adjective,…). e.g: I went to school => PRP-I VBD-went TO-to NN-school
Also, after lemmatization, it will be ‘VB-go NN-school’, which indicates the semantics and distinguishes the purpose of the sentence.
This will help the classifier differentiate between different types of sentences.
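For a quick illustration of the pos_tag_words() helper on the preprocessed example above (the exact tags depend on NLTK's tagger, so treat the output as approximate):

print(pos_tag_words("go school"))
# => roughly: "VB-go NN-school"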
df['pos_tagged_text'] = df['preprocessed_text'].apply(lambda x: pos_tag_words(x))
df.head()
Rerun Models on pos-tagged text (FE1)
encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label = split_train_holdout_test(encoder, df, False)

# fit on the pos-tagged text so the POS-prefixed tokens end up in the vocabulary
count_vect = CountVectorizer(analyzer="word")
count_vectorizer = count_vect.fit(df.pos_tagged_text)

train_cv_vector = count_vectorizer.transform(train_cv.pos_tagged_text)
train_holdout_vector = count_vectorizer.transform(train_holdout.pos_tagged_text)
test_vector = count_vectorizer.transform(test.pos_tagged_text)
a. SVC with FE1
SVC_pos_tag = runModel(encoder, train_cv_vector, train_cv_label,
                       train_holdout_vector, train_holdout.label,
                       "svc", "SVC on pos-tagged text")
models.loc[len(models)] = SVC_pos_tag
b. Naive Bayes with FE1
NB_pos_tag = runModel(encoder, train_cv_vector, train_cv_label,
                      train_holdout_vector, train_holdout.label,
                      "nb", "Naive Bayes on pos-tagged text")
models.loc[len(models)] = NB_pos_tag
c. maxEnt with FE1
maxEnt_pos_tag = runModel(encoder, train_cv_vector, train_cv_label,
                          train_holdout_vector, train_holdout.label,
                          "maxEnt", "MaxEnt Classifier on pos-tagged text")
models.loc[len(models)] = maxEnt_pos_tag
There seems to be a slight increase in Accuracy after pos-tagging.
2. TF-IDF weighting
Next, we add a weight to each word using TF-IDF.
We calculate the TF-IDF score of each term: how often it occurs in an article (term frequency), scaled down by how common it is across all articles (inverse document frequency).
We apply this weighting to the cleaned text concatenated with the POS-tagged text.
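As a toy illustration of the weighting (made-up sentences, not dataset values): a term such as "news" that occurs in every document gets a lower weight than rarer terms, and each row is l2-normalised.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy = ["fake news spreads fast", "real news spreads slowly", "news travels"]
counts = CountVectorizer(analyzer="word").fit_transform(toy)
weights = TfidfTransformer(norm="l2").fit_transform(counts)
print(weights.toarray().round(2))  # the "news" column carries the smallest weight in each row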
df["clean_and_pos_tagged_text"] = df['preprocessed_text'] + ' ' + df['pos_tagged_text'] df.head(1)
encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label = split_train_holdout_test(encoder, df, False)

count_vect = CountVectorizer(analyzer="word")
count_vectorizer = count_vect.fit(df.clean_and_pos_tagged_text)

train_cv_vector = count_vectorizer.transform(train_cv.clean_and_pos_tagged_text)
train_holdout_vector = count_vectorizer.transform(train_holdout.clean_and_pos_tagged_text)
test_vector = count_vectorizer.transform(test.clean_and_pos_tagged_text)

tf_idf = TfidfTransformer(norm="l2")
train_cv_tf_idf = tf_idf.fit_transform(train_cv_vector)
train_holdout_tf_idf = tf_idf.fit_transform(train_holdout_vector)
test_tf_idf = tf_idf.fit_transform(test_vector)
Rerun Models on preprocessed + pos-tagged (FE1) + TF-IDF weighted text (FE2)
a. SVC with FE1 and FE2
SVC_tf_idf = runModel(encoder, train_cv_tf_idf, train_cv_label,
                      train_holdout_tf_idf, train_holdout.label,
                      "svc", "SVC on preprocessed+pos-tagged TF-IDF weighted text")
models.loc[len(models)] = SVC_tf_idf
b. NB with FE1 and FE2
NB_tf_idf = runModel(encoder, train_cv_tf_idf, train_cv_label,
                     train_holdout_tf_idf, train_holdout.label,
                     "nb", "Naive Bayes on preprocessed+pos-tagged TF-IDF weighted text")
models.loc[len(models)] = NB_tf_idf
c. maxEnt with FE1 and FE2
maxEnt_tf_idf = runModel(encoder, train_cv_tf_idf, train_cv_label,
                         train_holdout_tf_idf, train_holdout.label,
                         "maxEnt", "MaxEnt on preprocessed+pos-tagged TF-IDF weighted text")
models.loc[len(models)] = maxEnt_tf_idf
Using TF-IDF increased the score to ~94.5% with SVC and Max-Ent models.
Naive Bayes, on the other hand, decreased the score, so we drop it from the pipeline.
3. Use an n-gram vectorizer instead of the regular vectorizer
For FE3, we replace the plain count vectorizer with an n-gram vectorizer, which counts sequences of consecutive words rather than each word separately: bigrams (pairs) for the SVC run below and trigrams (triplets) for the MaxEnt run. In the short example sentence “In this short example sentence”, the trigrams are “In this short”, “this short example” and “short example sentence”.
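For a quick look at what such a vectorizer produces, here is a small sketch on the toy sentence above (it assumes the same scikit-learn version used elsewhere in this notebook, which still exposes get_feature_names()):

from sklearn.feature_extraction.text import CountVectorizer

ngram_demo = CountVectorizer(analyzer="word", ngram_range=(1, 3)).fit(["in this short example sentence"])
print(ngram_demo.get_feature_names())
# ['example', 'example sentence', 'in', 'in this', 'in this short', 'sentence', 'short',
#  'short example', 'short example sentence', 'this', 'this short', 'this short example']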
encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label = split_train_holdout_test(encoder, df, False)

bigram_vect = CountVectorizer(analyzer="word", ngram_range=(1, 2))
bigram_vect = bigram_vect.fit(df.clean_and_pos_tagged_text)

train_cv_vector = bigram_vect.transform(train_cv.clean_and_pos_tagged_text)
train_holdout_vector = bigram_vect.transform(train_holdout.clean_and_pos_tagged_text)
test_vector = bigram_vect.transform(test.clean_and_pos_tagged_text)

tf_idf = TfidfTransformer(norm="l2")
train_cv_bigram_tf_idf = tf_idf.fit_transform(train_cv_vector)
train_holdout_bigram_tf_idf = tf_idf.fit_transform(train_holdout_vector)
test_bigram_tf_idf = tf_idf.fit_transform(test_vector)
Rerun Models on preprocessed + pos-tagged (FE1) + TF-IDF weighted (FE2) + n-gram vectorized text (FE3)
a. SVC with FE1, FE2 and FE3
SVC_bigram_tf_idf = runModel(encoder, train_cv_bigram_tf_idf, train_cv_label,
                             train_holdout_bigram_tf_idf, train_holdout.label,
                             "svc", "SVC on bigram vect.+ TF-IDF")
models.loc[len(models)] = SVC_bigram_tf_idf
b. maxEnt with FE1, FE2 and FE3
encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label = split_train_holdout_test(encoder, df, False)

trigram_vect = CountVectorizer(analyzer="word", ngram_range=(1, 3))
trigram_vect = trigram_vect.fit(df.clean_and_pos_tagged_text)

train_cv_vector = trigram_vect.transform(train_cv.clean_and_pos_tagged_text)
train_holdout_vector = trigram_vect.transform(train_holdout.clean_and_pos_tagged_text)

tf_idf = TfidfTransformer(norm="l2")
train_cv_trigram_tf_idf = tf_idf.fit_transform(train_cv_vector)
train_holdout_trigram_tf_idf = tf_idf.fit_transform(train_holdout_vector)
maxEnt_trigram_tf_idf = runModel(encoder, train_cv_trigram_tf_idf, train_cv_label,
                                 train_holdout_trigram_tf_idf, train_holdout.label,
                                 "maxEnt", "MaxEnt on trigram vect.+ TF-IDF")
models.loc[len(models)] = maxEnt_trigram_tf_idf
The “MaxEnt on trigram vect.+ TF-IDF” model achieves the highest score, so we will use it to classify the test set.
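Before committing to it, one convenient way to compare everything we logged is to sort the models dataframe by its score column (column names as defined earlier in this notebook):

models.sort_values(by="score", ascending=False)[["model_name", "score"]]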
Predicting on test dataset
1. Train on whole data and predict on test
PREPROCESSED data
test = pd.read_csv("fake_or_real_news_test.csv") train = pd.read_csv("fake_or_real_news_training_CLEANED.csv") train['title_and_text'] = train['title'] +' '+ train['text'] train['preprocessed_text'] = train['title_and_text'].apply(lambda x: preprocess(x)) test['title_and_text'] = test['title'] +' '+ test['text'] test['preprocessed_text'] = test['title_and_text'].apply(lambda x: preprocess(x)) ## Save preprocessed df train.to_csv("fake_or_real_news_train_PREPROCESSED.csv", index=False) # Save preprocessed df test.to_csv("fake_or_real_news_test_PREPROCESSED.csv", index=False)
train = pd.read_csv("fake_or_real_news_train_PREPROCESSED.csv") train = train.astype(object).replace(np.nan, 'None') test = pd.read_csv("fake_or_real_news_test_PREPROCESSED.csv") test = test.astype(object).replace(np.nan, 'None')
POS Tagging
train['pos_tagged_text'] = train['preprocessed_text'].apply(lambda x: pos_tag_words(x))
test['pos_tagged_text'] = test['preprocessed_text'].apply(lambda x: pos_tag_words(x))
Merge clean and pos tagged
train["clean_and_pos_tagged_text"] = train['preprocessed_text'] + ' ' + train['pos_tagged_text'] test["clean_and_pos_tagged_text"] = test['preprocessed_text'] + ' ' + train['pos_tagged_text']
Modelling using MaxEnt on trigram vect.+ TF-IDF with the grid-search best params
Trigram + Tfdif + classifier pipeline
from sklearn.pipeline import Pipeline

trigram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(1, 3))
tf_idf = TfidfTransformer(norm="l2")
classifier = LogisticRegression(C=1000, class_weight=None, dual=False, fit_intercept=True,
                                intercept_scaling=1, max_iter=100, multi_class='multinomial',
                                n_jobs=None, penalty='l2', random_state=None, solver='saga',
                                tol=0.0001, verbose=0, warm_start=False)

pipeline = Pipeline([
    ('trigram_vectorizer', trigram_vectorizer),
    ('tfidf', tf_idf),
    ('clf', classifier),
])
pipeline.fit(train.clean_and_pos_tagged_text, encoder.fit_transform(train.label.values))
import pickle
pickle.dump(pipeline, open("pipeline.pkl", "wb"))
2. Predicting on test
print(colored("Predicting on test", 'blue')) test_predictions = test_predictions = pipeline.predict(test.clean_and_pos_tagged_text) test_predictions_decoded = encoder.inverse_transform( test_predictions ) predictions = test predictions["label"] = test_predictions_decoded
import collections

ax = sns.countplot(predictions.label,
                   order=[x for x, count in sorted(collections.Counter(predictions.label).items(),
                                                   key=lambda x: -x[1])])
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2., height + 3,
            '{:1.2f}%'.format(height / len(predictions) * 100), ha="center")
ax.set_title("Test dataset target")
show()
predictions.drop(columns=["title","text","title_and_text","preprocessed_text","pos_tagged_text","clean_and_pos_tagged_text"]).head()
predictions.to_csv("TEST_PREDICTIONS.csv", index=False)
2. Deploy the model using a Flask app
The deployment code consists of two files:
- predictionModel.py
- app.py
predictionModel.py is shown below
# This is the predictionModel.py file
import timeit
import string
import pickle

import nltk
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet

nltk.data.path.append('./nltk_data')

start = timeit.default_timer()
with open("pickle/pipeline.pkl", 'rb') as f:
    pipeline = pickle.load(f)
stop = timeit.default_timer()
print('=> Pickle Loaded in: ', stop - start)


class PredictionModel:
    output = {}

    # constructor
    def __init__(self, text):
        self.output['original'] = text

    def predict(self):
        self.preprocess()
        self.pos_tag_words()

        # Merge text
        clean_and_pos_tagged_text = self.output['preprocessed'] + ' ' + self.output['pos_tagged']

        self.output['prediction'] = 'FAKE' if pipeline.predict(
            [clean_and_pos_tagged_text])[0] == 0 else 'REAL'
        return self.output

    # Helper methods
    def preprocess(self):
        # lowercase the text
        text = self.output['original'].lower()
        # remove the words counting just one letter
        text = [t for t in text.split(" ") if len(t) > 1]
        # remove the words that contain numbers
        text = [word for word in text if not any(c.isdigit() for c in word)]
        # tokenize the text and remove punctuation
        text = [word.strip(string.punctuation) for word in text]
        # remove all stop words
        stop = stopwords.words('english')
        text = [x for x in text if x not in stop]
        # remove tokens that are empty
        text = [t for t in text if len(t) > 0]
        # pos tag the text
        pos_tags = pos_tag(text)
        # lemmatize the text
        text = [WordNetLemmatizer().lemmatize(t[0], self.get_wordnet_pos(t[1])) for t in pos_tags]
        # join all
        self.output['preprocessed'] = " ".join(text)

    def get_wordnet_pos(self, pos_tag):
        if pos_tag.startswith('J'):
            return wordnet.ADJ
        elif pos_tag.startswith('V'):
            return wordnet.VERB
        elif pos_tag.startswith('N'):
            return wordnet.NOUN
        elif pos_tag.startswith('R'):
            return wordnet.ADV
        else:
            return wordnet.NOUN

    def pos_tag_words(self):
        pos_text = nltk.pos_tag(nltk.word_tokenize(self.output['preprocessed']))
        self.output['pos_tagged'] = " ".join([pos + "-" + word for word, pos in pos_text])
app.py is shown below
from flask import Flask, jsonify, request, render_template
from predictionModel import PredictionModel
import pandas as pd
from random import randrange

app = Flask(__name__, static_folder="./public/static", template_folder="./public")


@app.route("/")
def home():
    return render_template('index.html')


@app.route('/predict', methods=['POST'])
def predict():
    model = PredictionModel(request.json)
    return jsonify(model.predict())


@app.route('/random', methods=['GET'])
def random():
    data = pd.read_csv("data/fake_or_real_news_test.csv")
    index = randrange(0, len(data) - 1, 1)
    return jsonify({'title': data.loc[index].title, 'text': data.loc[index].text})


# Only for local running
if __name__ == '__main__':
    app.run()
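Once the app is running locally (for example with python app.py), you can exercise the /predict endpoint. The snippet below is a minimal sketch: it assumes the requests package is installed and the default Flask address, and the article text is made up.

import requests

article = "Breaking: scientists reveal a shocking discovery ..."
# /predict hands request.json straight to PredictionModel, so we send the raw text as a JSON string
response = requests.post("http://127.0.0.1:5000/predict", json=article)
print(response.json())  # e.g. {'original': ..., 'preprocessed': ..., 'pos_tagged': ..., 'prediction': 'FAKE'}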
Download the complete project: