This differs from other Python wrappers like pytesseract [1] and pyocr [2] in that tesserocr binds the tesseract C API. The others call out to the tesseract executable via `subprocess`.
Hey ddmf, you might want to use the latest beta version directly from the master branch of the repo as it has an important issue fixed (the way the PIL image is converted to pix). Otherwise it's safer to use the SetImageFile method instead of SetImage.
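For anyone following along, here is a minimal sketch of the file-based route, assuming tesserocr and its language data are installed. `text_from_file`, `demo`, and `page.png` are names made up for illustration; `PyTessBaseAPI`, `SetImageFile`, and `GetUTF8Text` are the tesserocr calls.

```python
def text_from_file(api, path):
    """Run OCR on an image file through an already-initialised API object."""
    # tesseract reads the file itself, so no PIL-to-pix conversion is involved
    api.SetImageFile(path)
    return api.GetUTF8Text()


def demo(path="page.png"):  # hypothetical file name
    # Imported lazily so the helper above can be loaded without tesserocr present
    from tesserocr import PyTessBaseAPI

    with PyTessBaseAPI() as api:
        return text_from_file(api, path)
```

Calling `demo("some_scan.png")` should return the recognised text as a string; swapping `SetImageFile(path)` for `SetImage(pil_image)` gives the PIL-based route mentioned above.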
Even more strangely, I also installed pytesseract yesterday. I was having trouble getting the visible text from a project's web pages, and I thought it'd be easy to screenshot the pages and OCR them. It didn't work; at least, my first experiment produced just gibberish. Then I just used BeautifulSoup and got it done. I don't know why I didn't think of the easier solution first.
tesserocr releases the Global Interpreter Lock (GIL) while performing some tasks, which allows threads to run those tasks in parallel without waiting for the GIL to become available.
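A rough sketch of what that could look like with a thread pool, assuming tesserocr is installed. `ocr_file` and `ocr_many` are hypothetical helper names; `tesserocr.file_to_text` is the library's convenience function. How much actually overlaps depends on which calls release the GIL, so treat this as an illustration rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_file(path):
    """OCR a single image file with tesserocr."""
    # Imported lazily so this module can be loaded without tesserocr present
    import tesserocr

    return tesserocr.file_to_text(path)


def ocr_many(paths, ocr_fn=ocr_file, max_workers=4):
    """Apply ocr_fn to every path on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        return list(pool.map(ocr_fn, paths))
```

`ocr_many(["a.png", "b.png"])` would return a list of recognised strings, one per file. The `ocr_fn` parameter is only there so a different worker can be swapped in.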
The aim of the project was to bring most of tesseract's API functionality to a Pythonic interface without the need for any subprocess calls. No model tuning involved, just a wrapper :)
Feature request: It would be great if you could also hook into Tesseract's model training API, if it has one. That way, you could drive the training from Python rather than using Tesseract's accompanying executables.
It was never tested on Windows. Perhaps you can try the Linux steps on Cygwin? I'm interested in seeing how this can work on Windows, as I developed it with Linux in mind, which suited my use case.
[1] https://github.com/madmaze/pytesseract
[2] https://github.com/jflesch/pyocr