OCR - pytesseract & jieba

LAVI

pytesseract & jieba

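  • Optical Character Recognition (OCR)
    • Put simply, OCR converts the text shown in an image into machine-readable text
  • Uses the Python package pytesseract
    • A few lines of code are enough to recognize the text in an image
  • pytesseract is a Python wrapper around the Tesseract OCR engine, whose development was long sponsored by Google
  • jieba is a Python package for Chinese word segmentation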
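
Because pytesseract only wraps the Tesseract executable, it helps to confirm the binary can be found before running any OCR. The sketch below is one possible check using only the standard library. The helper name `find_tesseract` is hypothetical, and the fallback path is the Windows install location used in this page's script, which may differ on your machine.

```python
import shutil
from pathlib import Path
from typing import Optional

# Install location taken from the script on this page; adjust for your machine.
WINDOWS_DEFAULT = Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe")


def find_tesseract(fallback: Path = WINDOWS_DEFAULT) -> Optional[str]:
    """Return a path to the tesseract executable, or None if it is not found.

    Hypothetical helper: checks PATH first, then a single fallback location.
    """
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    if fallback.is_file():
        return str(fallback)
    return None


if __name__ == "__main__":
    location = find_tesseract()
    print(location or "tesseract not found; install it or set tesseract_cmd manually")
```

If a path is returned, it can be assigned to `pytesseract.pytesseract.tesseract_cmd` in place of a hard-coded string.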
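import os
import jieba
import pytesseract
import pandas as pd

from PIL import Image
from pathlib import Path

# Path to the Tesseract executable (a raw string needs only single backslashes)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Recursively collect every .jpg file under the image folder
pathlist = Path("<User pytesseract Path>").glob('**/*.jpg')
# config = r'-c tessedit_char_blacklist= --psm 6'

prevPrompt = ''
prevAns = ''

df = pd.DataFrame()

for fontPath in pathlist:
    # print(fontPath)

    basename = os.path.splitext(os.path.basename(fontPath))[0]
    print('basename: ' + basename)

    # if not os.path.isdir(basename):
    #     os.mkdir(basename)

    # Open the file found by glob; a bare file name would only resolve
    # when the script runs from the image's own directory
    img = Image.open(fontPath)
    text = pytesseract.image_to_string(img, lang='chi_tra')
    text = text.replace(' ', '')  # drop the spaces OCR inserts between characters
    text = text.split('\n')
    # print(text)

    # Only entries 5 through 14 of the OCR output are processed here
    for item in text[5:15]:
        if item == '':
            continue
        print(item)

        # jieba.cut yields tokens; join them with spaces for display
        sentence = jieba.cut(item)
        sentence = ' '.join(sentence)
        print(sentence)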
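
The text clean-up inside the loop (removing spaces, splitting on newlines, slicing a range of lines, skipping empty ones) can be pulled into a small function so it can be checked without Tesseract installed. This is only a sketch of that idea. The name `clean_ocr_lines` and its `start`/`stop` parameters are hypothetical, with defaults that mirror the `text[5:15]` slice used above.

```python
from typing import List, Optional


def clean_ocr_lines(raw_text: str, start: int = 5, stop: Optional[int] = 15) -> List[str]:
    """Mirror the clean-up steps of the script above (hypothetical helper).

    Removes spaces, splits on newlines, keeps lines[start:stop],
    and drops empty lines.
    """
    lines = raw_text.replace(" ", "").split("\n")
    return [line for line in lines[start:stop] if line != ""]


if __name__ == "__main__":
    sample = "a b\n\nc d\n"
    # Process every line of the sample instead of the default 5..14 window.
    print(clean_ocr_lines(sample, start=0, stop=None))
```

Each returned line could then be passed to `jieba.cut` exactly as in the loop above.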
