Text extraction, often known as keyword extraction, is a method of automatically scanning text and extracting relevant or core words and phrases from unstructured data such as news articles, surveys, and customer service issues using machine learning.
In this blog, we will see how to extract text from an image using the pytesseract library, a Python wrapper for the Tesseract OCR engine.
First, we need to install the necessary libraries.
!apt install tesseract-ocr
!pip install Pillow
!pip install pytesseract
These commands install tesseract-ocr and pytesseract for image-to-text extraction, and Pillow for reading the image from the directory.
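Before moving on, it can help to confirm that all three pieces are actually visible to Python. The snippet below is a minimal sketch using only the standard library. The helper name `check_ocr_setup` is made up for this post, and it assumes the Tesseract engine is expected on the system PATH under the name `tesseract`, which is where pytesseract looks by default.

```python
import importlib.util
import shutil


def check_ocr_setup():
    """Report whether each piece of the OCR stack can be found."""
    return {
        # The Tesseract engine is a separate executable that pytesseract calls
        "tesseract binary": shutil.which("tesseract") is not None,
        # Pillow is imported under the name PIL
        "Pillow": importlib.util.find_spec("PIL") is not None,
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
    }


for name, found in check_ocr_setup().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If any entry is reported as MISSING, rerun the matching install command before continuing.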
We’ll use Pillow’s Image class to open the image and pytesseract to detect the string in the image.
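from PIL import Image
import pytesseract


def ocr_core(filename):
    """
    This function will handle the core OCR processing of images.
    """
    # Open the image with Pillow and let Tesseract return the text it recognizes.
    # The original os.path.join(filename) call had a single argument and no
    # matching "import os", so the filename is passed straight to Image.open.
    text = pytesseract.image_to_string(Image.open(filename))
    return text


print(ocr_core('img1.PNG'))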
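The string returned by OCR is usually not perfectly tidy: it can contain blank lines, runs of spaces, and a trailing form feed character marking the end of a page. The helper below is one possible way to clean that up, not part of pytesseract itself. The name `clean_ocr_text` and the sample string are invented for illustration.

```python
def clean_ocr_text(raw_text):
    """Tidy raw OCR output: drop form feeds, extra spaces, and empty lines."""
    cleaned_lines = []
    # Treat the form feed (page separator) as just another line break
    for line in raw_text.replace("\x0c", "\n").splitlines():
        line = " ".join(line.split())  # collapse runs of spaces and tabs
        if line:  # skip lines that are empty after trimming
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


# Made-up sample that imitates messy OCR output
sample = "  Invoice   No. 42 \n\n\nTotal:\t19.99  \n\x0c"
print(clean_ocr_text(sample))
```

With the function above in place, the two can be combined as clean_ocr_text(ocr_core('img1.PNG')).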