Extracting text from OCR image file - regex

I am trying to extract a few fields from an OCR image. I am using pytesseract to read the image file, and this is working as expected.
Code :
import pytesseract
from PIL import Image
import re
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
value = Image.open("ocr.JPG")
text = pytesseract.image_to_string(value)
print(text)
Output :
ALS 1 Emergency Base Rate
Y A0427 RE ABC
Anbulance Mileage Charge
Y A0425 RE ABC
Disposable Supplies
Y A0398 RH ABC
184800230, x
Next, I have to extract A0427 and A0425 from the text, but the problem is that my loop does not go through whole lines. It takes one character at a time, and that's why my regular expression isn't working.
Code:
for line in text:
    print(line)
    x = re.findall(r'^A[0-9][0-9][0-9][0-9]', text)
    print(x)

Get rid of that for loop and remove the ^ anchor too. Use only
x = re.findall(r'A[0-9][0-9][0-9][0-9]', text)
without any loop.

text is a string, and Python's default behavior when looping over a string with a for loop is to go through its characters (a string is a sequence of characters).
To loop through the lines, first split the text into lines using text.splitlines(), then search each line (without the ^ anchor, because the code is not at the start of the line):
for line in text.splitlines():
    print(line)
    x = re.findall(r'A[0-9][0-9][0-9][0-9]', line)
    print(x)
EDIT: Or use Patel's answer to skip the loop altogether :)
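A minimal sketch of the difference between the two loops. The variable `sample` is a hypothetical, shortened copy of the OCR output from the question, so it runs without pytesseract:

```python
import re

# Hypothetical sample standing in for the OCR output shown in the question.
sample = "ALS 1 Emergency Base Rate\nY A0427 RE ABC\nAnbulance Mileage Charge\nY A0425 RE ABC"

# Iterating over a string yields single characters, not lines.
first_chars = [ch for ch in sample][:3]

# splitlines() yields whole lines, so a per-line search can work.
codes = []
for line in sample.splitlines():
    codes.extend(re.findall(r'A[0-9]{4}', line))

print(first_chars)
print(codes)
```

Collecting the matches in a list, rather than overwriting `x` on every pass, keeps the results from all lines.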

The problem in your regex is the start anchor ^, which requires the match (for example A0425) to begin at the very start of the string (or of a line, if re.MULTILINE is set). That is not the case here, because there is a Y and a space before it. So just remove ^ from your regex and you should get all the expected strings. You can also write the four repetitions of [0-9] as [0-9]{4}, and your shortened regex becomes:
A[0-9]{4}
You need to modify your current code like this,
import pytesseract
from PIL import Image
import re
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
value = Image.open("ocr.JPG")
text = pytesseract.image_to_string(value)
print(re.findall(r'A[0-9]{4}', text))
This should print all your matches without needing to loop over the lines individually:
['A0427', 'A0425', 'A0398']
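As a hedged variation on the same idea, the sketch below adds word boundaries (an addition not present in the answers) and shows why the anchored pattern returned nothing. The `text` literal is copied from the OCR output in the question so the example runs on its own:

```python
import re

text = """ALS 1 Emergency Base Rate
Y A0427 RE ABC
Anbulance Mileage Charge
Y A0425 RE ABC
Disposable Supplies
Y A0398 RH ABC
184800230, x"""

# \b keeps the pattern from matching inside a longer token such as "XA04271".
all_codes = re.findall(r'\bA[0-9]{4}\b', text)

# With ^ and no re.MULTILINE, only the very start of the text is tested.
anchored = re.findall(r'^A[0-9]{4}', text)

print(all_codes)
print(anchored)
```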

Related

Making a text file which will contain my list items and applying regular expression to it

I am supposed to write code that reads a text file containing words with some common linguistic features, applies some regular expressions to all of the words, and writes a file containing the changed words.
For now let's say my text file named abcd.txt has these words
king
sing
ping
cling
booked
looked
cooked
packed
My first question starts here. How should I write these words in my simple text file to get the results mentioned above? Should I put one word per line or separate them with commas?
This is the code provided by user palvarez.
import re
with open("new_abcd", "w+") as new, open("abcd") as original:
    for word in original:
        new_word = re.sub("ing$", "xyz", word)
        new.write(new_word)
Can I add something like -
with open("new_abcd", "w+") as new, open("abcd") as original:
    for word in original:
        new_aword = re.sub("ed$", "abcd", word)
        new.write(new_aword)
in the same code file? I want something like -
kabc
sabc
pabc
clabc
bookxyz
lookxyz
cookxyz
packxyz
PS - I don't know whether mentioning this is necessary, but I am supposed to do this for Devanagari, a script supported by Unicode. I didn't use it in my examples because many of us here can't read the script. Additionally, that script uses some diacritics, e.g. 'का' has one consonant character 'क' and one vowel symbol 'ा', which together make 'का'. In my regular expression I need to condition on the diacritics.
I think your approach of one word per line is better, since you don't have to trouble yourself with delimiters and stripping.
With a file like this:
king
sing
ping
cling
booked
looked
cooked
packed
And code like this, using re.sub to replace a pattern:
import re
with open("new_abcd.txt", "w") as new, open("abcd.txt") as original:
    for word in original:
        new_word = re.sub("ing$", "xyz", word)
        new_word = re.sub("ed$", "abcd", new_word)
        new.write(new_word)
It creates a resulting file:
kxyz
sxyz
pxyz
clxyz
bookabcd
lookabcd
cookabcd
packabcd
I tried out with the diacritic you gave us and it seems to work fine:
print(re.sub("ा$", "ing", "का"))
>>> कing
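To see why the pattern can target the vowel sign on its own, here is a small sketch that lists the code points of the word. The NFC normalization step is an assumption about the input, not something the question requires:

```python
import re
import unicodedata

# Consonant KA (U+0915) followed by vowel sign AA (U+093E).
word = "\u0915\u093e"

# The word is two code points, so a pattern can match the vowel sign alone.
code_points = [f"U+{ord(ch):04X}" for ch in word]

# Assumption: normalizing to NFC keeps differently encoded spellings
# of the same text comparable before the regex is applied.
normalized = unicodedata.normalize("NFC", word)
result = re.sub("\u093e$", "ing", normalized)

print(code_points)
```

Here `result` holds the consonant followed by "ing", the same outcome as the print call above.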
EDIT: added multiple replacements. You can put your replacements into a list and iterate over it, calling re.sub for each one, as follows.
import re
# List of tuples where the first item is the pattern and the second is the replacement string
replacements = [("ing$", "xyz"), ("ed$", "abcd")]
with open("new_abcd.txt", "w") as new, open("abcd.txt") as original:
    for word in original:
        new_word = word
        for pattern, replacement in replacements:
            new_word = re.sub(pattern, replacement, word)
            if new_word != word:
                break
        new.write(new_word)
This limits the code to one modification per word: only the first replacement that changes the word is applied.
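The same "first match wins" rule can be sketched as a small function. The name `apply_first_match` is hypothetical, and `re.subn` is used here instead of comparing strings because it reports how many substitutions were made:

```python
import re

# Each pair is (pattern, replacement); order decides which rule wins.
REPLACEMENTS = [(r"ing$", "xyz"), (r"ed$", "abcd")]

def apply_first_match(word, replacements=REPLACEMENTS):
    """Return word with the first matching replacement applied, if any."""
    for pattern, replacement in replacements:
        new_word, count = re.subn(pattern, replacement, word)
        if count:
            return new_word
    return word

words = ["king", "booked", "cat"]
converted = [apply_first_match(w) for w in words]
print(converted)
```

Lines read from a file still carry their trailing newline. That is fine here, because `$` also matches just before a newline at the end of the string.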
For starters, it is recommended to open your file with the with context manager, so that you do not need to close the file explicitly once you are done with it.
Another advantage is that you can process the file line by line, which is very useful when working with larger data sets. Whether you write the words on separate lines or in CSV format depends on what your output requires and how you want to process it further.
As an example, to read from a file and say substitute a substring, you can use re.sub.
import re
with open('abcd.txt', 'r') as f:
    for line in f:
        # do something here
        print(re.sub("ing$", 'ring', line.strip()))
>>
kring
sring
pring
clring
Another nifty trick is to manage both the input and output utilizing the same context manager like:
import re
with open('abcd.txt', 'r') as f, open('out_abcd.txt', 'w') as o:
    for line in f:
        # notice that we add '\n' to write each output on a new line
        o.write(re.sub("ing$", 'ring', line.strip()) + '\n')
This creates an output file with your new contents in a very memory-efficient way.
If you'd like to write to a CSV file or any other specific format, I highly suggest you spend some time on Python's input and output functions. If linguistic processing of text is what you are after, then learn how the text of different languages is encoded and study Python's regex operations further.
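On the encoding point, a hedged sketch: passing `encoding="utf-8"` to `open` is one way to make non-ASCII words round-trip regardless of the platform default. The sample words and the temporary directory are assumptions made so the example can run anywhere:

```python
import os
import re
import tempfile

# Hypothetical sample words; the second ends in the vowel sign U+093E.
words = ["king", "\u0915\u093e"]

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "abcd.txt")
    dst = os.path.join(tmp, "out_abcd.txt")

    # Passing encoding explicitly avoids depending on the platform default.
    with open(src, "w", encoding="utf-8") as f:
        f.write("\n".join(words) + "\n")

    with open(src, encoding="utf-8") as f, open(dst, "w", encoding="utf-8") as o:
        for line in f:
            o.write(re.sub("\u093e$", "ing", line.strip()) + "\n")

    with open(dst, encoding="utf-8") as f:
        output = f.read().splitlines()
```

After this runs, `output` holds the unchanged "king" and the Devanagari consonant followed by "ing".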

Multi-line text onto clipboard, adding bullets to wiki markup

I would like the text to be printed as shown in the exercise, where each "Lists of" entry has a * at the start and is on its own line. I am still new to Python, and the Automate the Boring Stuff with Python book is sometimes confusing.
I started by typing the text into the Python editor and having Pyperclip copy it onto the clipboard. The problem is that Pyperclip only accepts a single string, and that is the form in which the text is copied to the clipboard.
#! python3
#bulletPointerAdder.py - Adds Wikipedia bullet points to the start
#of each line of text on the clipboard.
In the Python shell:
>>> import pyperclip
>>> text = 'Lists of monkeys Lists of donkeys Lists of pankeys'
>>> pyperclip.copy(text)
>>>
RESTART: C:\Users\User\AppData\Local\Programs\Python\Python37-32\bulletpointadder.py
>>> text
'* Lists of monkeys Lists of donkeys Lists of pankeys'
>>>
import os
import pyperclip
text = pyperclip.paste()
# Separate lines and add stars.
lines = text.split(os.linesep)
for i in range(len(lines)):  # loop through all indexes in the "lines" list
    lines[i] = '* ' + lines[i]  # add star to each string in "lines" list
text = os.linesep.join(lines)
pyperclip.copy(text)
I actually want the text to be printed like the sample below, but the problem is that it comes out as a single string.
Lists of animals
Lists of aquarium life
Lists of biologists by author abbreviation
Lists of cultivars
Understand this first and move to step 3:
We split the text along its newlines to get a list in which each item is one line of the text. We store the list in lines and then loop through the items in lines.
For each line, we add a star and a space to the start of the line. Now each string in lines begins with a star.
import pyperclip
text = pyperclip.paste()
# Manipulate the text from the clipboard
lines = text.split('\n')  # split the text into a list of lines
for i in range(len(lines)):
    lines[i] = '* ' + lines[i]  # each line gets a "* " prefix
text = '\n'.join(lines)  # the lines are joined back with newlines
pyperclip.copy(text)  # the whole content is then copied to the clipboard
print(text)
With this code, if you copy a multi-line list, it stays a multi-line list, as intended, with a star at the start of each line.
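The single-star result in the question follows from the input: the string typed in the shell contains no newline, so there is only one line to prefix. A minimal sketch of this, with the clipboard calls left out so it runs without pyperclip (the function name `add_bullets` is hypothetical):

```python
def add_bullets(text):
    """Prefix every line of text with '* ' and keep one line per item."""
    # splitlines() handles both '\n' and '\r\n' line endings.
    return "\n".join("* " + line for line in text.splitlines())

single_line = "Lists of monkeys Lists of donkeys Lists of pankeys"
multi_line = "Lists of monkeys\nLists of donkeys\nLists of pankeys"

print(add_bullets(single_line))  # one line in, one bullet out
print(add_bullets(multi_line))   # three lines in, three bullets out
```

In the real script, the argument would come from `pyperclip.paste()` and the result would go to `pyperclip.copy()`.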

Split regex matches into multiple lines

I'm using regex to read a line, gather all the matches and print each match as a new line.
So far I have read the line and extracted the data I need, but the code prints it all on a single line.
Is there a way to print each match separately?
Here is the code I have been using:
import os
import re
msg = "0,0.000000E+000,NCAP,64Q34,39,39,1028,NCAP,1,1,NCAP"
text = [msg.split(',')]
which gives me [['0', '0.000000E+000', 'NCAP', '64Q34', '39', '39', '1028', 'NCAP', '1', '1', 'NCAP']].
Searching for data between ' ' will get me the individual results.
Using the code below will find all matches but it keeps it all as one line, giving me the same as the input.
text = str(text)
line = text.strip()
m = re.findall("'(.+?)'", line)
found = str(m)
print(found+ '\n')
I am unsure what you are trying to capture with regexes, but from what I understand you want to split msg at the commas ',' and print each element on a new line.
msg = "0,0.000000E+000,NCAP,64Q34,39,39,1028,NCAP,1,1,NCAP"
msg = msg.split(',')
for m in msg:
    print(m)
>>> 0
0.000000E+000
NCAP
...
This will print each element of msg on a new line; the elements are the pieces of the original string between the commas.
I would also use an online interactive regex tester to try out your regexes in real time and learn which expressions to use (make sure to select the Python flavor).
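The one-line output in the question comes from calling `str()` on a list, which renders the whole list as a single line. A short sketch of the alternative, joining the items with newlines (the variable names are illustrative):

```python
msg = "0,0.000000E+000,NCAP,64Q34,39,39,1028,NCAP,1,1,NCAP"
fields = msg.split(',')

# str(list) renders the whole list on one line; joining the items does not.
as_one_line = str(fields)
one_per_line = "\n".join(fields)

print(one_per_line)
```

`print(*fields, sep="\n")` would give the same output without building the intermediate string.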

Python script to extract data from text file

I have a text file which contains a list of website links like
test.txt:
http://www.site1.com/
http://site232546ee.com/
https://www.site3eiue213.org/
http://site4.biz/
I want to make a simple Python script which extracts only the site names, cut to a length of 8 characters, with no name longer than 8 characters. The output should be like:
output.txt:
site1
site2325
site3eiu
site4
I have written some code:
txt1 = open("test.txt").read()
txt2 = txt1.split("http://www.")
f = open('output.txt', 'w')
for us in txt2:
    f.write(us)
print './done'
But I don't know how to split() on more than one pattern in one line. I also tried the re module, but I don't know how to write the code for it.
Can someone please help me write this script? :(
You can achieve this using a regular expression as below.
import re
no = 8  # maximum number of characters to keep
# Prefix to skip: http:// or https://, with or without "www."
regesx = r"\bhttps?://(?:www\.)?"
text = "http://site232546ee.com/"
match = re.search(regesx, text)
start = match.end(0)
end = start + no
string1 = text[start:end]
end = string1.find('.')
if end > 0:
    final = string1[0:end]
else:
    final = string1
print(final)
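To apply the same idea to every URL from the question, the logic can be wrapped in a function. This is only a sketch: the names `short_name`, `PREFIX` and `MAX_LEN` are hypothetical, and it takes the first dot-separated label of the host as the "site name":

```python
import re

MAX_LEN = 8  # assumed limit, taken from the question
PREFIX = re.compile(r"^https?://(?:www\.)?")

def short_name(url, max_len=MAX_LEN):
    """Return the first label of the host name, cut to max_len characters."""
    rest = PREFIX.sub("", url.strip())
    label = rest.split(".", 1)[0]
    return label[:max_len]

urls = ["http://www.site1.com/", "http://site232546ee.com/",
        "https://www.site3eiue213.org/", "http://site4.biz/"]
names = [short_name(u) for u in urls]
print(names)
```

Writing `"\n".join(names)` to output.txt would then produce the file shown in the question.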
You said you want to extract site names with 8 characters, but the output.txt example shows truncated parts of domain names. If you want to keep only the domain names that have eight or fewer characters, here is a solution.
Step 1: Get all the domain names.
import tldextract
import pandas as pd
text_s = ''
list_u = ('http://www.site1.com/', 'http://site232546ee.com/', 'https://www.site3eiue213.org/', 'http://site4.biz/')
#http:\//www.(\w+).*\/?
for l in list_u:
    extracted = tldextract.extract(l)
    text_s += extracted.domain + ' '
print(text_s)  # gives a string of domain names delimited by whitespace
Step 2: Filter domain names with 8 or fewer characters.
word = text_s.split()
lent = [len(x) for x in text_s.split()]
word_len_list = pd.DataFrame(
    {'words': word,
     'char_length': lent,
    })
word_len_list[(word_len_list.char_length <= 8)]
Output looks like this:
   words  char_length
0  site1            5
3  site4            5
Disclaimer: I am new to Python. Please ignore any unnecessary and/or stupid steps I may have written
Have you tried printing txt2 before doing anything with it? You will see that it did not do what (I expect) you wanted it to do, since "http://www." occurs only once in the text. Try splitting at the newline \n instead. That way you get a list of all the URLs.
Then, for each URL, you'll want to strip the front and the back. You can do that with a regular expression, but it can be quite hard, depending on what you want to be able to strip off.
When you have found a regular expression that works for you, simply check the length of the domain and, using an if statement, write only the domains that satisfy your condition to a file (if len(domain) <= 8: f.write(domain)).
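The steps in this answer can be sketched as follows. Note the swap: instead of a hand-written regular expression, this uses `urllib.parse.urlparse` from the standard library, and the way the domain is picked (first label after an optional "www.") is a naive assumption that will not suit every URL:

```python
from urllib.parse import urlparse

text = """http://www.site1.com/
http://site232546ee.com/
https://www.site3eiue213.org/
http://site4.biz/"""

short_domains = []
for url in text.split("\n"):
    host = urlparse(url).netloc          # e.g. "www.site1.com"
    if host.startswith("www."):
        host = host[len("www."):]
    domain = host.split(".")[0]          # naive: first label only
    if len(domain) <= 8:
        short_domains.append(domain)

print(short_domains)
```

In a real script, `text` would come from reading test.txt, and each kept domain would be written to output.txt followed by a newline.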

Regular expression syntax in python

I am trying to write a Python script to analyse a data text file. I want the script to find all the time values in one line and compare them.
This is my first time writing regex syntax, so I wrote a small script first.
My script is:
import sys
txt = open('1.txt','r')
a = []
for eachLine in txt:
    a.append(eachLine)
import re
pattern = re.compile('\d{2}:\d{2}:\d{2}')
for i in xrange(len(a)):
    print pattern.match(a[i])
#print a
The output is always None.
My text file looks like the one in the picture:
What's the problem? Please help me. Thanks a lot.
My Python is 2.7.2 and my OS is Windows XP SP3.
Didn't you miss one of the ":" in your regex? I think you meant
re.compile('\d{2}:\d{2}:\d{2}')
The other problems are:
First, if you want to search in the whole text, use search instead of match, because match only matches at the beginning of the string. Second, to access your result you need to call group() on the match object returned by your search.
Try it:
import sys
txt = open('1.txt','r')
a = []
for eachLine in txt:
    a.append(eachLine)
import re
pattern = re.compile('\d{2}:\d{2}:\d{2}')
for i in xrange(len(a)):
    match = pattern.search(a[i])
    if match:  # search returns None when the line contains no time
        print match.group()
#print a
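The difference between match and search can be seen on a single line. This sketch is written for Python 3, and the sample `line` is hypothetical because the file contents are not shown in the question:

```python
import re

pattern = re.compile(r'\d{2}:\d{2}:\d{2}')

# Hypothetical line; the real file contents are not shown in the question.
line = "start 12:34:56 end 12:35:10"

at_start = pattern.match(line)      # None: the line does not begin with a time
anywhere = pattern.search(line)     # first time found anywhere in the line

print(at_start)
print(anywhere.group() if anywhere else "no match")
```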
I think you're missing the colons and dots in your regex. Also try using re.search or re.findall instead on the entire text. Like this:
import re, sys
text = open("./1.txt", "r").read() # or readlines() to make a list of lines
pattern = re.compile('\d{2}:\d{2}:\d{2}')
matches = pattern.findall(text)
for i in matches:
    print(i)
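For the stated goal of finding all the times in one line and comparing them, one possible sketch in Python 3 follows. The sample line and the choice to compare by elapsed seconds are assumptions, since the question does not show the data or say which comparison is needed:

```python
import re
from datetime import datetime

pattern = re.compile(r'\d{2}:\d{2}:\d{2}')

# Hypothetical line with two timestamps; the real data format is an assumption.
line = "start 12:34:56 end 12:35:10"

# findall returns every time in the line; strptime makes them comparable.
times = [datetime.strptime(t, "%H:%M:%S") for t in pattern.findall(line)]
if len(times) >= 2:
    elapsed = (times[1] - times[0]).total_seconds()
    print(elapsed)
```

Parsing the strings into datetime objects allows arithmetic on them, which plain string comparison does not.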