CAPTCHA recognition with Python

For CAPTCHAs like this one (codeimage.jpg), recognition accuracy is as high as 99.99%, hahahahaha
Life is short, I use Python!

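At first, with no preprocessing, 5 and 6 or 1 and 7 were often confused with each other. After adding denoising and grayscale conversion, recognition accuracy improved considerably.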

# Python 2 script: urllib2, StringIO and the print statement do not exist in Python 3.
import urllib2
import StringIO
from PIL import Image
from pytesser import *  # provides image_to_string

# Download the CAPTCHA image into memory and open it with PIL.
img_url='http://'
request = urllib2.Request(img_url)
img_data = urllib2.urlopen(request).read()
img_buffer = StringIO.StringIO(img_data)
img = Image.open(img_buffer)

# Crop the image down to the region that contains the characters.
box = (10,5,50,20) # (left, upper, right, lower)
region = img.crop(box)

# Convert to grayscale, then binarize with a lookup table:
# gray levels below the threshold map to 0 (black), all others to 1 (white).
imgry = region.convert('L')
threshold = 140
table = [0 if i < threshold else 1 for i in range(256)]
out = imgry.point(table,'1')

# Keep a copy of the processed image for inspection.
out.save("coded.jpg")

print image_to_string(out)
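
The post credits denoising for part of the accuracy gain, but the listing above only crops, converts to grayscale and thresholds. As a rough illustration of what a denoising step could look like, here is a small 3x3 median filter written in plain Python 3. Choosing a median filter is an assumption, not necessarily what the author used, and `median_filter` and the sample grid are made-up names for this sketch.

```python
from statistics import median


def median_filter(pixels):
    """Return a copy of a 2D grayscale grid with a 3x3 median filter applied.

    `pixels` is a list of rows, each a list of ints in 0..255. Border pixels
    use only the neighbours that exist, so the output keeps the same size.
    For windows with an even number of pixels, the median is the mean of the
    two middle values, truncated to an int.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                pixels[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(int(median(window)))
        result.append(row)
    return result


# A white 5x5 patch with a single black speck in the middle.
noisy = [[255] * 5 for _ in range(5)]
noisy[2][2] = 0
cleaned = median_filter(noisy)
```

With PIL itself, the comparable call would be `region.filter(ImageFilter.MedianFilter(size=3))`, applied before the threshold step so that isolated specks do not survive binarization.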

Screenshot of it hard at work:
orcscr.png


updated at 2017-1-19 4:06
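Today I used the CAPTCHA recognition again and found that pytesser is only a wrapper around tesseract and cannot be used on Linux. The improved code below calls tesseract directly (through pytesseract) for recognition: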

# Python 2 script, like the first version.
import urllib2
import StringIO
from PIL import Image
import pytesseract

def ocr(img):
    # -psm 7 tells tesseract to treat the image as a single line of text.
    # Tesseract 4 and later spell this option --psm.
    return pytesseract.image_to_string(img,lang="eng",config="-psm 7")

# Download the CAPTCHA image into memory and open it with PIL.
img_url='http://third.lanqiao.cn/api/action/directmail/vcode2'
#img_url = 'https://static.droomo.top/2015/10/31/370942641162179.jpg'
request = urllib2.Request(img_url)
img_data = urllib2.urlopen(request).read()
img_buffer = StringIO.StringIO(img_data)
img = Image.open(img_buffer)

# Convert to grayscale, then binarize: levels below the threshold become 0 (black), the rest 1 (white).
imgry = img.convert('L')
threshold = 90
table = [0 if i < threshold else 1 for i in range(256)]
out = imgry.point(table,'1')

#out.save("coded.jpg")

print ocr(out)
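
OCR output often carries stray whitespace or punctuation, so it can be worth tidying the string before submitting it. The sketch below is written for Python 3 and assumes the CAPTCHA contains only ASCII letters and digits, which the post does not state. `clean_ocr_text` and its `expected_length` parameter are hypothetical names for this example.

```python
import re


def clean_ocr_text(raw, expected_length=None):
    """Strip everything except ASCII letters and digits from OCR output.

    Returns None when `expected_length` is given and the cleaned text has a
    different length, so the caller can fetch a new CAPTCHA and try again.
    """
    cleaned = re.sub(r"[^0-9A-Za-z]", "", raw)
    if expected_length is not None and len(cleaned) != expected_length:
        return None
    return cleaned
```

In the script above this would wrap the final call, for example `clean_ocr_text(ocr(out), expected_length=4)`, where the length of 4 is only a placeholder for whatever the target CAPTCHA uses.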

tesseract-ocr must be installed.
Ubuntu:

apt-get install tesseract-ocr
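
Since pytesseract only shells out to the tesseract executable, a quick check that the binary is on PATH can save a confusing error later. This is a minimal Python 3 sketch (`shutil.which` needs Python 3.3 or newer); `tesseract_available` is a hypothetical helper name, and `tesseract` is assumed to be the name of the executable the package installs.

```python
import shutil


def tesseract_available(binary="tesseract"):
    """Return True if an executable named `binary` is found on PATH."""
    return shutil.which(binary) is not None


if not tesseract_available():
    print("tesseract not found on PATH; install tesseract-ocr first")
```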

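With the improved code, the recognition rate for this type of CAPTCHA (vcode2.jpg) is 70%+. Errors mainly happen when thin-stroked characters are lost during filtering, or when the CAPTCHA itself is not fully displayed.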
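
One possible reason thin strokes disappear is that a fixed threshold does not suit every image. As an experiment, the threshold could be derived from each image's histogram instead. The sketch below uses Otsu's method, a swapped-in technique that the original post does not use, written in plain Python 3. `otsu_threshold` and the synthetic histogram are made up for illustration, and there is no claim here that it would actually raise the 70% figure.

```python
def otsu_threshold(histogram):
    """Pick a threshold from a 256-bin grayscale histogram with Otsu's method.

    Returns t such that pixels with value < t form the dark class, matching
    the `i < threshold` test used when building the lookup table.
    """
    total = sum(histogram)
    weighted_total = sum(value * count for value, count in enumerate(histogram))
    best_threshold, best_variance = 0, -1.0
    dark_count = 0
    dark_sum = 0
    for t in range(1, 256):
        # Move gray level t-1 into the dark class.
        dark_count += histogram[t - 1]
        dark_sum += (t - 1) * histogram[t - 1]
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue
        dark_mean = dark_sum / dark_count
        light_mean = (weighted_total - dark_sum) / light_count
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_threshold, best_variance = t, variance
    return best_threshold


# Synthetic histogram: dark text at level 40, light background at level 200.
histogram = [0] * 256
histogram[40] = 300
histogram[200] = 700
threshold = otsu_threshold(histogram)
```

With PIL, `imgry.histogram()` on an 'L' mode image returns the 256 counts this function expects, so the result could replace the hard-coded `threshold = 90`.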


5 comments

  1. _moon _moon

    Python is a fairly inefficient language that eats up a lot of hardware resources

    1. CIN CIN

      !!! Just a few days ago you were telling me Python is the best language in the world !!!

      1. _moon _moon

        Nonsense, PHP is the best language in the world

        1. moon moon

          Python is the best language in the world, I'm really not kidding

  2. CIN CIN

    Clearly someone with too much free time =。=
