OCR - pytesseract & jieba

LAVI

Created 2023-04-26, Updated 2024-09-11

pytesseract & jieba

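Optical Character Recognition (OCR)
- Put simply, OCR extracts the text that appears in an image and returns it as machine-readable text.
pytesseract, a Python package
- Recognizes the text in an image with only a few lines of code.
pytesseract is a wrapper around the Tesseract OCR engine, whose development was sponsored by Google for many years.
jieba is a Python package for Chinese word segmentation.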
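
As a rough illustration of what word segmentation produces, the following minimal sketch splits one sentence into tokens. The sample sentence and the `segment` helper are assumptions made for this example, not part of the original script, and the character-level fallback exists only so the snippet still runs when jieba is not installed.

```python
# Minimal sketch of Chinese word segmentation with jieba.
from typing import List

try:
    import jieba

    def segment(sentence: str) -> List[str]:
        # jieba.lcut returns a list; jieba.cut returns a generator
        return jieba.lcut(sentence)
except ImportError:
    def segment(sentence: str) -> List[str]:
        # Fallback for illustration only: one token per character
        return list(sentence)


sample = "我來到北京清華大學"  # hypothetical sample sentence
tokens = segment(sample)
print(" ".join(tokens))
```

Joining the tokens with spaces mirrors how the script in this post prints its segmented lines.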

import os
import jieba
import pytesseract
import pandas as pd

from PIL import Image
from pathlib import Path

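# Location of the Tesseract executable (Windows default install path).
# In a raw string the backslashes must not be doubled.
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'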

pathlist = Path("<User pytesseract Path>").glob('**/*.jpg')
# config = r'-c tessedit_char_blacklist= --psm 6'

prevPrompt = ''
prevAns = ''

df = pd.DataFrame()  

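for fontPath in pathlist:
  # print(fontPath)

  # File name without its extension, used here only for logging
  basename = os.path.splitext(os.path.basename(fontPath))[0]
  print('basename: ' + basename)

  # if not os.path.isdir(basename):
  #   os.mkdir(basename)

  # Open the image by its full path so files in subfolders are found too
  img = Image.open(fontPath)
  # Run OCR with the Traditional Chinese language data (chi_tra)
  text = pytesseract.image_to_string(img, lang='chi_tra')
  # Remove all spaces from the OCR output, then split it into lines
  text = text.replace(' ', '')
  text = text.split('\n')
  # print(text)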

  # Only lines 5 to 14 of the OCR output are processed
  for item in text[5:15]:
    if item == '':
      continue
    print(item)

    # jieba.cut returns a generator of tokens; join them with spaces
    sentence = ' '.join(jieba.cut(item))
    print(sentence)
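
The text clean-up inside the loop above (remove spaces, split into lines, take a slice, skip empty lines) can be pulled into a small helper and tried without Tesseract installed. This is only a sketch: `clean_ocr_lines` and the sample string are hypothetical names and data introduced for illustration, not part of the original script.

```python
from typing import List, Optional


def clean_ocr_lines(raw_text: str, start: int = 0,
                    stop: Optional[int] = None) -> List[str]:
    """Strip spaces, split into lines, slice, and drop empty lines."""
    lines = raw_text.replace(" ", "").split("\n")
    return [line for line in lines[start:stop] if line]


# Hypothetical OCR output, used only to demonstrate the helper
sample = "標 題\n\n第 一 行\n第 二 行\n"
all_lines = clean_ocr_lines(sample)
sliced = clean_ocr_lines(sample, 2, 4)
print(all_lines)
print(sliced)
```

Calling `clean_ocr_lines(text, 5, 15)` on the raw string returned by `pytesseract.image_to_string` would select the same lines as the `text[5:15]` slice in the script.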
