Improving text extraction accuracy with OCR (computer vision)

I am very new to Computer Vision. I have lots of images like this:
Sample image
I want to extract the entire table as text. I tried pytesseract with the sample code below:
try:
    import Image
except ImportError:
    from PIL import Image
from pytesseract import image_to_string

im = Image.open('/home/Downloads/b.png')
text = image_to_string(im, lang='eng')
print(text)
But results are really bad. Some sample:
II) Han H31 Precvsva 111
II) Pegalran Corn m
11) Quama camume. m
15) Sansmlg Eledra. KR
II) snaru Corn/Japan 11>
II) 15 msnlay Co 1111 KR
13)]ah1lC1rcuvl Inc us
II) Iaman Semioan... 1w
I1)Japan msulay Inc 11>
I1) Schneider Fleck... 511
II) campal Elec|ram 111
II) 5111-9110 onlme 5. JP
I1) C1500 syaens Inc us
Is) Warned Semic. 111
II) Mvcran Techmla. us
I1) Camnuler Sclenc
I1) Flex Lid us
I111me1 Corn 115
How can I improve the accuracy? Can I achieve 80-90%? All my images are in the same format, so can the recognition be tuned for my use case? Any suggestions will help.
Update: I tried using OCR.space, but it didn't work on the following image at all:
Test

The main problem with your image is that it is only 96 dpi (OCR engines often expect 300 dpi). I changed your image to 300 dpi and resampled it to 200% with IrfanView using the Lanczos algorithm; an equivalent ImageMagick convert command should give the same result.
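The same preprocessing can be done programmatically; a minimal sketch with Pillow, assuming Lanczos resampling at 200% (file names and the helper name are placeholders, not part of the original answer):

```python
from PIL import Image

def upscale_for_ocr(im, scale=2):
    """Resize with Lanczos resampling; the caller saves with a 300 dpi tag."""
    return im.resize((im.width * scale, im.height * scale), Image.LANCZOS)

# usage (placeholder paths):
# im = Image.open("b.png")
# upscale_for_ocr(im).save("b_300dpi.png", dpi=(300, 300))
```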
Using this new image as input for Tesseract, the output looks better:
2?) Hon Hai Precisio...
23) Pegatron Corp
2) Quanta Compute...
25) Samsung Electro...
26) Sharp Corp/Japan
27) LG Display Co Ltd
8) Jabil Circuit Inc
M) Taiwan Semicon...
3) Japan Display Inc
31) Schneider Electr...
3) Compal Electroni...
33) GungHo Online E...
X) Cisco Systems Inc
33) Advanced Semic...
%) Micron Technolo...
3) Computer Scienc...
3) Flex Ltd
3) Intel Corp
TW
TW
TW
LG
JP
LG
US
TW
JP
FR
TW
JP
US
TW
US
US
US
US
1.80%
10.40%
-9.50%
2.72%
-0.57%
5.03%
3.90%
1.38%
1.30%
1.33%
-0.13%
-6.21%
0.31%
-3.63%
-0.20%
1.33%
-1.56%
3.91%
53.67%
60.08%
64.85%
5.97%
27.10%
30.28%
24.00%
16.26%
53.70%
0.92%
14.28%
51.70%
0.73%
39.13%
11.00%
6. 7335
2.61%
5.65B
5.078
1 58B
1.808
1.108
1.028
1.278
70.89M
785.20M
177. 44M
90.56M
925.18M
436.70M
89.24M
411.54M
411.06M
cogs
COGS
COGS
[eels
COGS
[eelcis
COGS
CAPEX
COGS
SG&A
CAPEX
cogs
COGS
SG&A
COGS
[eoles
54.66%
16.33%
14.84%
4. 05%
3.65%
3.30%
3.26%
3.23%
3.00%
2.90%
2.85%
2.28%
1. 503;
1.47%
1.42%
142%
#2015A CF
£2015A CF
#2015A CF
Estimate
#2016A CF
Estimate
#2016A CF
Estimate
#2016A CF
Estimate
Estimate
#2015A CF
Estimate
Estimate
#201701 CF
Estimate
Estimate
Estimate
03/30/2016
03/17/2016
03/31/2016
06/10/2016
06/23/2016
02/24/2017
10/20/2016
05/09/2016
06/21/2016
05/27/2016
10/19/2016
03/22/2016
01/03/2017
02/22/2017
01/09/2017
01/03/2017
01/30/2017
01/03/2017
However, the third column is ignored completely, and some other values are missing as well. The layout recognition seems to have problems with this table.

Related

regular expression to return the text which has 1 or more periods between parentheses

I have a text which has 1 or 2 periods between parentheses.
K= 'Product will be hot(These cooking instructions were developed using an 100 watt microwave oven. For lower wattage ovens, up to an additional 2 minutes cooking time may be required).'
I'd like to extract or eliminate that entire parenthesized text. I have tried
re.search(r'\((.*?)+\)',K).group(1)
and
K[K.find("(")+1:K.find(")")]
but none of them returns the text
IIUC, the following regex will remove any text between parentheses that contains one or more periods, together with the parentheses themselves (note the raw-string prefix, so the backslashes reach the regex engine intact):
re.sub(r'\(.*?\.+.*\)', '', K)
Example:
>>> re.sub(r'\(.*?\.+.*\)', '', K)
'Product will be hot.'
To extract the text instead of removing it, use re.findall with the same regex:
>>> re.findall(r'\(.*?\.+.*\)', K)
['(These cooking instructions were developed using an 100 watt microwave oven. For lower wattage ovens, up to an additional 2 minutes cooking time may be required)']
[Edit]: To handle more than one parenthesized section, the following works:
K='Product will be hot (These cooking instructions were. developed using an 100 watt microwave oven). For lower wattage ovens (up to an additional 2 minutes. cooking time may be required).'
>>> re.findall(r'\(.*?\.+.*?\)', K)
['(These cooking instructions were. developed using an 100 watt microwave oven)', '(up to an additional 2 minutes. cooking time may be required)']
>>> re.sub(r'\(.*?\.+.*?\)', '', K)
'Product will be hot . For lower wattage ovens .'
You can use the expression:
(?<=\()[^()]*(?=\))
Try the expression live here.
Use re.findall to find the text you are interested in.
import re
K = 'Product will be hot(These cooking instructions were developed using an 100 watt microwave oven. For lower wattage ovens, up to an additional 2 minutes cooking time may be required).'
print(re.findall(r'(?<=\()[^()]*(?=\))',K))
Prints:
['These cooking instructions were developed using an 100 watt microwave oven. For lower wattage ovens, up to an additional 2 minutes cooking time may be required']
Alternatively, wrap the character class in a capturing group:
import re
K = 'Product will be hot(These cooking instructions were developed using an 100 watt microwave oven. For lower wattage ovens, up to an additional 2 minutes cooking time may be required).'
print(re.search(r'(?<=\()([^()]*)(?=\))',K).group(1))
Prints:
These cooking instructions were developed using an 100 watt microwave oven. For lower wattage ovens, up to an additional 2 minutes cooking time may be required
This ensures that no substitution is done if more than two periods occur inside the parentheses, and also that two parenthesized sections do not get merged (which would eliminate the text between them):
>>> re.sub(r'\(([^.(]*\.){1,2}[^.()]*\)',"",K)
'Product will be hot.'
If you also want to remove parenthesized sections with more than two periods, you may simply replace {1,2} by a +:
>>> re.sub(r'\(([^.(]*\.)+[^.()]*\)',"",K)

Regex code for product models and codes

I found a very useful regex for extracting product codes here; this is the expression:
\b((?:[a-z]+\S*\d+|\d\S*[a-z]+)[a-z\d_-]*)\b
It works almost perfectly, but I need to detect and extract only the product codes that are at least 5 characters long.
For example, for the following strings:
5T COFFEE BREW FOR BLACK & DECKER DCM-601B
10T COFFEE BREW FOR BLACK & DECKER DCM-1100B
10T COFFEE BREW FOR BLACK & DECKER DCM-1100W
8T COFFEE BREW FOR BLACK & DECKER CM-1509
Rice Cookers 15T DOMESTIC USE RC5428, ELECTRIC BLACK & DECKER
Rice Cookers 15T RC/5723 DOMESTIC USE, ELECTRIC BLACK & DECKER
Rice Cookers B D REF.RC3203
Hand mixer, S / M, PS62509R
SLOW COOKING POTS, HAMILTON BEACH, HB33136T
OVEN 110V TOSTA SANKEY REF.TO-9
24 PZA METAL TEAPOT S / M CHINA REF: 92479
ELECTRIC RICE COOKER, 1.5 L ROYAL ROA-15SV
ELECTRIC RICE COOKER, 1.8 L ROYAL ROA-18SV
ELECTRIC RICE COOKER, 2.2 L ROYAL ROA-22SV
ELECTRIC RICE COOKER, 2.8 L ROYAL ROA-28SV
Waffle Makers DOMESTIC USE, ELECTRIC BLACK & DECKER G-49TD
2.00 PZA TOAST OVEN, METAL / GLASS ROYAL, CHINA, REF: RTH-28A
20.00 PZA RICE, METAL, BLACK & DECKER, CHINA, REF: RCB550S
I get:
5TDCM-601B
10TDCM-1100B
10TDCM-1100W
8TCM-1509
15TRC5428
15TRC/5723
REF.RC3203
PS62509R
HB33136T
REF.TO-9
92479
ROA-15SV
ROA-18SV
ROA-22SV
ROA-28SV
G-49TD
2.00RTH-28A
20.00RCB550S
Desired outcome:
DCM-601B
DCM-1100B
DCM-1100W
CM-1509
RC5428
RC/5723
REF.RC3203
PS62509R
HB33136T
REF.TO-9
92479
ROA-15SV
ROA-18SV
ROA-22SV
ROA-28SV
G-49TD
RTH-28A
RCB550S
How can I do this?
If we assume that your codes contain 5 or more non-whitespace characters and at least one digit, the regex for the codes will be:
\b(?!\d+\.\d+)(?=\S*\d)\S{5,}\b
See Demo 1
The (?!\d+\.\d+) disallows float/decimal numbers like 1.2345 or 12.44.
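A quick check of this pattern against a few of the sample lines (a sketch; only the pattern itself comes from the answer above):

```python
import re

pattern = re.compile(r'\b(?!\d+\.\d+)(?=\S*\d)\S{5,}\b')

samples = [
    "5T COFFEE BREW FOR BLACK & DECKER DCM-601B",
    "ELECTRIC RICE COOKER, 1.5 L ROYAL ROA-15SV",
    "2.00 PZA TOAST OVEN, METAL / GLASS ROYAL, CHINA, REF: RTH-28A",
]
for line in samples:
    # short tokens like 5T and decimals like 2.00 are skipped
    print(pattern.findall(line))
```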
I'm not quite sure if I understood your question, but you can use a regex like this to get the product codes you want:
((?:\w{2,}\.)?\w{1,}[.\/-]?\d+\w+)(?=\b)
Working demo

Regex in Apache Spark

I have a text file that reads like this:
This recipe can be made either with a stand mixer, or by hand with a bowl, a wooden spoon, and strong arms. If you use salted butter, please omit the added salt in this recipe.
Yum
Ingredients
1 1/4 cups all-purpose flour (160 g)
1/4 teaspoon salt
1/2 teaspoon baking powder
1/2 cup unsalted butter (1 stick, or 8 Tbsp, or 112g) at room temperature
1/2 cup white sugar (90 g)
1/2 cup dark brown sugar, packed (85 g)
1 large egg
1 teaspoon vanilla extract
1/2 teaspoon instant coffee granules or instant espresso powder
1/2 cup chopped macadamia nuts (3 1/2 ounces, or 100 g)
1/2 cup white chocolate chips
Method
1 Preheat the oven to 350°F (175°C). Vigorously whisk together the flour, and baking powder in a bowl and set aside.
I want to extract the data between words Ingredients and Method.
I have written a regex (?s)(?<=\bIngredients\b).*?(?=\bMethod\b)
to extract the data and it's working fine.
But when I try to do that in spark-shell as follows, it doesn't give me anything.
val b = sc.textFile("/home/akshat/file.txt")
val regex = "(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)".r
regex.findAllIn(b).foreach(println)
Please tell me where I am going wrong and what steps I should take to correct this.
Thanks in advance!
What you need to do is:
Read the file using wholeTextFiles (so Spark does not split it into lines and you read the entire file as one record)
Write a function which takes a string and returns the portion matched by that regex
So it may look like this (in Python; note that wholeTextFiles yields (filename, content) pairs):
import re

def getWhatINeed(pair):
    filename, text = pair
    return re.search(r'(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)', text).group()

b = sc.wholeTextFiles(...)
c = b.map(getWhatINeed)
Now, c is also an RDD. You need to collect it before you print it; the output of collect is a normal list:
print(c.collect())
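Independent of Spark, the regex itself can be verified locally first (a sketch using a shortened version of the sample text; note the raw-string prefix, which the Scala version also needs, e.g. via triple quotes or doubled backslashes, since "\b" in a plain string literal is a backspace character):

```python
import re

text = """Yum
Ingredients
1 1/4 cups all-purpose flour (160 g)
1/4 teaspoon salt
Method
1 Preheat the oven to 350F."""

# (?s) lets . match newlines; the lookarounds keep the delimiter words out
match = re.search(r'(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)', text)
print(match.group())
```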

Help: Extracting data tuples from text... Regex or Machine learning?

I would really appreciate your thoughts on the best approach to the following problem. To give an idea, I am using a car-classifieds listing example, which is similar in nature.
Problem: Extract a data tuple from the given text.
Here are some characteristics of the data.
The vocabulary (words) in the text is limited to a specific domain. Let's assume 100-200 words at most.
The text that needs to be parsed is a headline, like the car-ad data shown below. Each record corresponds to one tuple (row).
In some cases some of the attributes may be missing. So for example, in raw data row #5 below the year is missing.
Some words go together (bigrams). Like "Low miles".
Historical data available = 10,000 records
Incoming New Data volume = 1000-1500 records / week
The expected output should be in the form (Year, Make, Model, Feature), so it should look like:
1 -> (2009, Ford, Fusion, SE)
2 -> (1997, Ford, Taurus, Wagon)
3 -> (2000, Mitsubishi, Mirage, DE)
4 -> (2007, Ford, Expedition, EL Limited)
5 -> ( , Honda, Accord, EX)
....
....
Raw Headline Data:
1 -> 2009 Ford Fusion SE - $7000
2 -> 1997 Ford Taurus Wagon - $800 (san jose east)
3 -> '00 Mitsubishi Mirage DE - $2499 (saratoga) pic
4 -> 2007 Ford Expedition EL Limited - $7800 (x)
5 -> Honda Accord ex low miles - $2800 (dublin / pleasanton / livermore) pic
6 -> 2004 HONDA ODASSEY LX 68K MILES - $10800 (danville / san ramon)
7 -> 93 LINCOLN MARK - $2000 (oakland east) pic
8 -> #######2006 LEXUS GS 430 BLACK ON BLACK 114KMI ####### - $19700 (san rafael) pic
9 -> 2004 Audi A4 1.8T FWD - $8900 (Sacramento) pic
10 -> #######2003 GMC C2500 HD EX-CAB 6.0 V8 EFI WHITE 4X4 ####### - $10575 (san rafael) pic
11 -> 1990 Toyota Corolla RUNS GOOD! GAS SAVER! 5SPEED CLEAN! REG 2011 O.B.O - $1600 (hayward / castro valley) pic img
12 -> HONDA ACCORD EX 2000 - $4900 (dublin / pleasanton / livermore) pic
13 -> 2009 Chevy Silverado LT Crew Cab - $23900 (dublin / pleasanton / livermore) pic
14 -> 2010 Acura TSX - V6 - TECH - $29900 (dublin / pleasanton / livermore) pic
15 -> 2003 Nissan Altima - $1830 (SF) pic
Possible choices:
A machine learning Text Classifier (Naive Bayes etc)
Regex
What I am trying to figure out is whether regex is too complicated for the job and a text classifier is overkill.
If the choice is to go with a text classifier, which would you consider the easiest to implement?
Thanks in advance for your kind help.
This is a well-studied problem called information extraction. It is not straightforward to do what you want, and it is not as simple as you make it sound (i.e., machine learning is not overkill). There are several techniques; you should read an overview of the research area.
Check this IE library for writing extraction rules; I think it will work best for your problem.
There is also an example of how to create fast dictionary matching.
I think that the ARX or Phoebus systems may suit your needs if you already have annotated data and a list of words associated with each field. Their approach is a mix of information extraction and information integration.
There are a few good entity-recognition libraries. Have you taken a look at Apache OpenNLP?
As a user looking for a specific model of car, the task is easier. I'm pretty sure I could classify, say, most Ford Rangers, since I know what to look for with a regexp.
I think your best bet is to write a function for each car model with type String -> Maybe Tuple. Then run all of these on each input and throw away the inputs that result in zero or too many tuples.
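To make the regex route concrete, here is a rough Python sketch that pulls (year, make, model) from the simpler headlines above. The make list is illustrative, not a complete dictionary, and headlines like #6-#12 (all caps, two-digit years, '#' padding) would need extra handling:

```python
import re

# Illustrative make dictionary; a real system needs a fuller list.
MAKES = {"ford", "honda", "toyota", "mitsubishi", "nissan", "audi",
         "lexus", "acura", "chevy"}

def parse_headline(line):
    """Return a rough (year, make, model) tuple; fields are None when not found."""
    m = re.search(r"\b(19\d{2}|20\d{2})\b", line)
    year = m.group(1) if m else None
    make = model = None
    tokens = line.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in MAKES:
            make = tok
            rest = []
            for t in tokens[i + 1:]:
                if t.startswith("-") or t.startswith("$"):
                    break  # stop at the price part of the headline
                rest.append(t)
            model = " ".join(rest) or None
            break
    return (year, make, model)
```

Note how the missing year in headline #5 simply comes back as None, matching the blank field in the expected output.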
You could use a tool like Amazon Mechanical Turk for this: human microtasking. Another alternative is to hire a data-entry freelancer; Upwork is a great place to look. You can get excellent-quality results, and the cost per item is very reasonable.

Working out the bounding box of a country

I'm wondering if there is a service to get a set of lat/long points that, when connected into a polygon, show the outline of a country.
Ideally I would like to search by country and get back an array of lat/long coordinates. Is there such a service?
If you are happy to consider not using a web service, this data is available at varying resolutions from Natural Earth. The data is in the public domain.
Have a closer look here:
http://en.wikipedia.org/wiki/User:The_Anome/country_bounding_boxes
User:The Anome/country bounding boxes: a first hack, based on all places in the NGA GNS dataset, not (yet) properly handling longitude wrap-round at ±180°. Country names are mapped from FIPS country codes. This works pretty well for all countries that do not cross the 180° meridian; Russia is a notable exception. This dataset does not include the United States.
country longmin latmin longmax latmax
AA -70.983 12.400 -69.850 12.617
Antigua_and_ -62.417 16.817 -61.650 17.750
United_Arab_ 45.000 22.167 59.250 26.133
Afghanistan 60.433 29.150 75.033 38.484
Algeria -8.700 18.027 70.554 37.203
Azerbaijan 44.783 38.417 50.858 41.911
Albania 19.000 39.583 21.050 42.659
Armenia 43.443 38.857 46.589 41.300
Andorra 1.417 42.433 1.783 42.650
Angola 10.000 -33.806 24.350 -3.033
Argentina -73.533 -58.583 -53.367 -21.783
Australia 112.467 -55.050 168.000 -9.133
AT 122.983 -12.667 124.050 -12.000
Austria 1.200 46.373 19.000 49.017
AV -63.667 18.150 -62.917 18.600
Bahrain 45.000 25.000 50.954 26.566
Barbados -59.667 12.967 -59.383 13.333
Botswana 20.000 -28.517 29.350 24.583
BD -64.908 32.233 -64.617 32.417
Belgium 2.367 49.500 6.400 51.683
Bahamas -86.000 20.000 -70.000 29.547
Bangladesh 84.000 20.600 92.683 26.817
Belize -89.950 15.000 -75.000 18.483
Bosnia_and_H 15.746 42.558 19.671 45.268
Bolivia -69.650 -26.867 -57.550 9.678
Burma 91.833 6.000 102.000 28.350
Benin -4.000 5.000 92.219 21.322
Belarus 22.550 50.717 32.850 56.133
Solomon_Isla -130.000 -45.000 170.200 3.751 WRAPPED
Brazil -73.817 -33.733 -28.850 16.800
BS 39.700 -21.417 39.700 -21.417
Bhutan 80.000 26.217 92.717 30.000
Bulgaria 22.371 41.000 28.600 44.215
BV 3.278 -54.467 3.483 -54.386
Brunei 110.000 -2.000 120.000 15.000
Yahoo! GeoPlanet, the service Stack Overflow uses for its Careers site, seems to do bounding boxes.
Here is a blog post with detailed query examples.
This repo contains a set of rectangular bounding boxes. Example below:
{
"AF": ["Afghanistan", [60.5284298033, 29.318572496, 75.1580277851, 38.4862816432]],
"AO": ["Angola", [11.6400960629, -17.9306364885, 24.0799052263, -4.43802336998]],
"AL": ["Albania", [19.3044861183, 39.624997667, 21.0200403175, 42.6882473822]],
"AE": ["United Arab Emirates", [51.5795186705, 22.4969475367, 56.3968473651, 26.055464179]],
"AR": ["Argentina", [-73.4154357571, -55.25, -53.628348965, -21.8323104794]],
"AM": ["Armenia", [43.5827458026, 38.7412014837, 46.5057198423, 41.2481285671]],
"AQ": ["Antarctica", [-180.0, -90.0, 180.0, -63.2706604895]],
"…"
}
Full set of boxes:
https://github.com/sandstrom/country-bounding-boxes
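For reference, a sketch of reading that JSON shape (ISO code mapped to [name, [lon_min, lat_min, lon_max, lat_max]]); the two entries below are copied from the example above, and the helper name is illustrative:

```python
import json

data = json.loads("""{
  "AF": ["Afghanistan", [60.5284298033, 29.318572496, 75.1580277851, 38.4862816432]],
  "AL": ["Albania", [19.3044861183, 39.624997667, 21.0200403175, 42.6882473822]]
}""")

def bounding_box(code):
    """Return (name, (lon_min, lat_min, lon_max, lat_max)) for an ISO code."""
    name, box = data[code]
    return name, tuple(box)

print(bounding_box("AL"))
```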