MATLAB Answers


High speed OCR and parallel processing

Asked by Azura Hashim on 29 Jan 2017
Latest activity Commented on by Azura Hashim on 2 Feb 2017
Hello,
I've written a number of programs to read tabular data in images using MATLAB's ocr function. I clean up the image files before running OCR (binarization, etc.), but it is still taking roughly 4 seconds per 100 rows of single-column data. Unfortunately I have hundreds of thousands of rows to work with, so I need a way to speed this up. Using an ROI, or cropping the image into individual table cells, didn't make much difference. Can someone help by pointing out some options?
  • Is there a way to make OCR run faster?
  • I have seen some documentation on parallel processing and was wondering if that could help. My computer has 4 cores. Should I explore the following?
  1. hyperthreading
  2. increasing the number of workers beyond the number of cores
  3. increasing the number of threads per worker
In essence, I'm looking to split the hundreds of image files so they can be processed separately, and I want to maximise the speed.
Thank you.


1 Answer

Answer by Walter Roberson
on 29 Jan 2017
 Accepted Answer

To get ocr() to maybe run faster you would need to train a custom OCR model. This assumes that fonts and handwriting sloppiness are quite restricted in your situation (e.g., one font at one size); if you have a general handwritten-OCR problem, then training your own model is not likely to speed anything up.
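A cheaper first step than retraining, since the data is numeric tabular text, is to restrict ocr() to the characters you actually expect and skip automatic layout analysis. A minimal sketch — the file name and character set are placeholders for your own data:

```matlab
% Restrict ocr() to digits and common table punctuation so the engine
% does not have to consider the full alphabet. 'TextLayout','Block'
% skips automatic page-layout analysis, which can also save time.
I = imread('table_page.png');              % placeholder file name
if size(I, 3) == 3
    I = rgb2gray(I);                       % ocr/imbinarize want grayscale
end
bw = imbinarize(I);                        % clean up, as in your pipeline
results = ocr(bw, ...
    'CharacterSet', '0123456789.-', ...    % adjust to your actual data
    'TextLayout', 'Block');
disp(results.Text)
```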
The general task of OCR could, I suspect, be done faster with different algorithms; I say that thinking of the speed of automatic mail sorters. On the other hand, those do not have to deal with hundreds of rows at a time.
You need to profile your code. Hyperthreading is an advantage if you are waiting on I/O; if you are busy with computation, hyperthreading can slow things down. Assigning more threads than cores, or more workers than cores, leads to contention for resources unless the workers typically spend a lot of their time waiting on I/O or interrupts.
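Profiling one representative image is enough to see where the time actually goes before deciding on parallelism; a minimal sketch (myOcrPipeline is a placeholder for your own function):

```matlab
% Profile one representative run of the pipeline, then inspect the
% report to see whether ocr() itself or the preprocessing dominates.
profile on
myOcrPipeline('sample_page.png');   % placeholder: your own function
profile off
profile viewer                      % opens the interactive report
```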
parfor and spmd are not always more productive. They are most effective for low-I/O, high-computation work where the matrices involved are small or moderate and you do not perform expensive operations such as eigenvalue decompositions or the \ operator. With larger matrices and vectorized code, especially code that does linear algebra, you would typically get better performance by leaving the code not explicitly parallel, so that it can use the multithreaded high-performance libraries (which have much lower overhead than creating workers).
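If profiling shows the time really is in independent per-image computation, splitting the image files across workers with parfor is straightforward. A sketch assuming one recognized-text string per file (the folder name is a placeholder):

```matlab
% Process independent image files on a pool of workers, one file per
% iteration. The default pool size is one worker per physical core,
% consistent with the advice above about not oversubscribing.
if isempty(gcp('nocreate'))
    parpool;                               % start a pool once
end
files = dir(fullfile('pages', '*.png'));   % placeholder folder
texts = cell(numel(files), 1);
parfor k = 1:numel(files)
    I = imread(fullfile(files(k).folder, files(k).name));
    res = ocr(I, 'TextLayout', 'Block');
    texts{k} = res.Text;
end
```

Since each iteration touches a different file and writes a different cell, the iterations are independent and parfor can distribute them freely.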

  1 Comment

Thanks for your help!
