Regular expression syntax in Python - regex

I am trying to write a Python script to analyze a text data file. I want the script to find all the time values in each line and compare them. This is my first time writing regular expressions, so I started with a small script.
My script is:
import sys
txt = open('1.txt','r')
a = []
for eachLine in txt:
    a.append(eachLine)
import re
pattern = re.compile('\d{2}:\d{2}:\d{2}')
for i in xrange(len(a)):
    print pattern.match(a[i])
#print a
The output is always None.
My text file looks like the picture:
What is the problem? Please help me, thanks a lot.
I am using Python 2.7.2 on Windows XP SP3.

Didn't you miss one of the ":" in your regex? I think you meant
re.compile('\d{2}:\d{2}:\d{2}')
There are two other problems.
First, if you want to search the whole text, use search instead of match, because match only tries to match at the beginning of the string. Second, to access the result you need to call group() on the match object that search returns.
Try it:
import sys
txt = open('1.txt','r')
a = []
for eachLine in txt:
    a.append(eachLine)
import re
pattern = re.compile(r'\d{2}:\d{2}:\d{2}')
for i in xrange(len(a)):
    match = pattern.search(a[i])
    if match:  # search() returns None for a line that contains no time
        print match.group()
#print a
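The difference between match, search and findall can be checked on a single line. The sketch below uses a made-up sample line (an assumption, since the real 1.txt was only shown as a picture) and also shows one possible way to compare the times found, by parsing them with datetime.strptime:

```python
import re
from datetime import datetime

# Hypothetical sample line; the real file contents are not known.
line = "job start 08:15:30 end 09:02:05"

pattern = re.compile(r"\d{2}:\d{2}:\d{2}")

# match() only tries at position 0, so it fails when other text comes first.
at_start = pattern.match(line)

# search() scans the whole line and returns the first hit.
first = pattern.search(line)

# findall() returns every non-overlapping hit as a list of strings.
times = pattern.findall(line)

# Parse the strings so they compare as times rather than as plain text.
parsed = [datetime.strptime(t, "%H:%M:%S") for t in times]
elapsed = parsed[1] - parsed[0]

print(at_start)
print(first.group())
print(times)
print(elapsed)
```

Here match returns None while search and findall both find the times, which is the behavior described in the answer above.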

I think you're missing the colons and dots in your regex. Also try using re.search or re.findall on the entire text instead, like this:
import re, sys
text = open("./1.txt", "r").read() # or readlines() to make a list of lines
pattern = re.compile(r'\d{2}:\d{2}:\d{2}')
matches = pattern.findall(text)
for i in matches:
    print(i)

Related

Regex to extract text from request id

I have a log in which part of each entry is a request id, and I need to extract some text from that id.
Example: RES_1621480647_49610052479341623017223137119508459972977816017376903362_Book
Can anyone please help me extract Book from it?
Consider string splitting instead:
>>> s = "RES_1621480647_49610052479341623017223137119508459972977816017376903362_Book"
>>> s.split("_")[-1]
'Book'
String splitting is likely to be more efficient. If you must use regular expressions, here is an example:
#!/usr/bin/env python3
import re
print(
    re.findall(r"^\w+_\d+_(\w+)$", 'RES_1621480647_49610052479341623017223137119508459972977816017376903362_Book')
)
# output: ['Book']
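If only the last underscore-separated field is needed, two further variants may be worth a look. This is a minimal sketch using the request id from the question: rsplit with maxsplit=1 splits only once from the right, and the pattern [^_]+$ matches the run of non-underscore characters at the end of the string.

```python
import re

s = "RES_1621480647_49610052479341623017223137119508459972977816017376903362_Book"

# rsplit with maxsplit=1 splits once, starting from the right-hand end.
prefix, last_field = s.rsplit("_", 1)

# [^_]+$ matches the characters that follow the final underscore.
m = re.search(r"[^_]+$", s)
regex_field = m.group() if m else None

print(last_field)
print(regex_field)
```

Both variants assume that the wanted text never contains an underscore itself.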

Extracting text from OCR image file

I am trying to extract a few fields from an OCR image. I am using pytesseract to read the image file, and this is working as expected.
Code:
import pytesseract
from PIL import Image
import re
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
value = Image.open("ocr.JPG")
text = pytesseract.image_to_string(value)
print(text)
Output :
ALS 1 Emergency Base Rate
Y A0427 RE ABC
Anbulance Mileage Charge
Y A0425 RE ABC
Disposable Supplies
Y A0398 RH ABC
184800230, x
Next, I have to extract A0427 and A0425 from the text. The problem is that I am not looping through whole lines: the loop takes one character at a time, and that is why my regular expression isn't working.
Code:
for line in text:
    print(line)
    x = re.findall(r'^A[0-9][0-9][0-9][0-9]', text)
    print(x)
Get rid of that for loop as well, and use only
x = re.findall(r'A[0-9][0-9][0-9][0-9]', text)
without any loop. Remove the ^ too.
text is a string, and the default behavior in Python when looping over a string with a for loop is to iterate over its characters (a string is essentially a sequence of characters).
To loop through the lines, first split the text into lines using text.splitlines():
for line in text.splitlines():
    print(line)
    # search the current line, not the whole text; no ^ because each code follows "Y "
    x = re.findall(r'A[0-9][0-9][0-9][0-9]', line)
    print(x)
EDIT: Or use Patel's answer to skip the loop altogether :)
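The difference between iterating over a string and iterating over its lines can be seen in a short sketch. The text below is a shortened stand-in for the OCR output in the question, not the real pytesseract result:

```python
import re

# Shortened stand-in for the OCR output shown in the question.
text = "ALS 1 Emergency Base Rate\nY A0427 RE ABC\nDisposable Supplies\nY A0398 RH ABC\n"

# Iterating over a str yields one-character strings.
first_items = [ch for ch in text][:3]

# splitlines() yields whole lines, so a per-line search becomes possible.
codes = []
for line in text.splitlines():
    codes.extend(re.findall(r"A[0-9]{4}", line))

print(first_items)
print(codes)
```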
The problem in your regex is the start anchor ^, which requires the matching text (for example A0425) to begin at the very start of the string (or of a line, when re.MULTILINE is set). That is not the case here, because each code is preceded by Y and a space. So just remove ^ from your regex and you should get all the expected strings. You can also collapse the four [0-9] into [0-9]{4}, so the shortened regex becomes:
A[0-9]{4}
You need to modify your current code like this:
import pytesseract
from PIL import Image
import re
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
value = Image.open("ocr.JPG")
text = pytesseract.image_to_string(value)
print(re.findall(r'A[0-9]{4}', text))
This should print all your matches without needing to loop over the lines individually:
['A0427', 'A0425', 'A0398']
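The effect of the ^ anchor can be checked directly. The sketch below uses a shortened copy of the OCR output from the question and compares an anchored pattern, with and without re.MULTILINE, against the unanchored one. The last pattern, which anchors on the "Y " prefix and captures the code, is an illustrative variant that is not part of the answer above.

```python
import re

# Shortened copy of the OCR output from the question.
text = "ALS 1 Emergency Base Rate\nY A0427 RE ABC\nAnbulance Mileage Charge\nY A0425 RE ABC\n"

# Without flags, ^ matches only at the very start of the whole string.
anchored = re.findall(r"^A[0-9]{4}", text)

# With re.MULTILINE, ^ also matches after each newline, but no line
# starts with a code here, so there are still no matches.
anchored_multiline = re.findall(r"^A[0-9]{4}", text, flags=re.MULTILINE)

# Without the anchor, the codes are found wherever they occur.
unanchored = re.findall(r"A[0-9]{4}", text)

# Illustrative variant: anchor on the "Y " prefix and capture only the code.
prefixed = re.findall(r"^Y (A[0-9]{4})", text, flags=re.MULTILINE)

print(anchored)
print(anchored_multiline)
print(unanchored)
print(prefixed)
```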

Correction in Regex for unicode

I need help with a regex. My regex is not producing the desired results. Below is my code:
import re
text='<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'
regex=re.compile(r'[<u+\w+]+>')
txt=regex.findall(text)
print(txt)
Output
['<u+0001f48e>', '<u+0001f6e0>', '<u+fe0f>', 'loved<u+2764>', '<u+fe0f>', 'spot<u+26a1>']
I know the regex is not correct. I want the output to be:
'<u+0001f48e>', '<u+0001f6e0><u+fe0f>', '<u+2764><u+fe0f>', '<u+26a1>'
import re
regex = re.compile(r'<u\+[0-9a-f]+>')
text = '<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'
print(regex.findall(text))
# output:
['<u+0001f48e>', '<u+0001f6e0>', '<u+fe0f>', '<u+2764>', '<u+fe0f>', '<u+26a1>']
That is not exactly what you want, but it's almost there.
Now, to get what you are looking for, we make the regex match a run of consecutive tags as a single result:
import re
regex = re.compile(r'((?:<u\+[0-9a-f]+>)+)')
text = '<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'
print(regex.findall(text))
# output:
['<u+0001f48e>', '<u+0001f6e0><u+fe0f>', '<u+2764><u+fe0f>', '<u+26a1>']
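Why not add an optional search for a second tag?
regex = re.compile(r'(<u\+\w+>(?:<u\+fe0f>)?)')
The + has to be escaped so that it matches a literal plus sign, and the optional second tag sits in a non-capturing group so that findall returns whole strings rather than tuples. This works with your example.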
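One detail that often causes surprises with these patterns is how findall treats capturing groups: with no groups it returns whole matches, and with two or more groups it returns tuples. A small sketch on the text from the question:

```python
import re

text = "<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>"

# No capturing groups: findall returns the whole of each match.
whole = re.findall(r"(?:<u\+[0-9a-f]+>)+", text)

# Two capturing groups: findall returns one tuple per match, with an
# empty string for an optional group that did not take part.
pairs = re.findall(r"(<u\+[0-9a-f]+>)(<u\+fe0f>)?", text)

# Joining each tuple gives back the combined tags.
joined = ["".join(p) for p in pairs]

print(whole)
print(pairs)
print(joined)
```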

Splitting a string in Python based on a regex pattern

I have a bytes object that contains urls:
> body.decode("utf-8")
> 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
I need to split it into a list with each url as a separate element:
import re
pattern = '^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$'
urls = re.compile(pattern).split(body.decode("utf-8"))
What I get is a list of one element with all urls pasted together:
['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n']
How do I split each url into a separate element?
Try splitting on \s+.
Here is some sample Python code:
import re
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
urls = re.compile(r'\s+').split(s)
print(urls)
This outputs:
['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/', '']
Does this result look OK? If not, we can work on it to get what you want.
If you don't want the empty string ('') in your result list (it comes from the \r\n at the end), you can use re.findall to find all the URLs in your string instead. Sample Python code:
import re
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
urls = re.findall(r'http.*?(?=\s+)', s)
print(urls)
This gives the following output:
['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/']
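For input like this, where the URLs are separated only by whitespace, a regex may not be needed at all. As a sketch under that assumption, str.split() with no arguments splits on any run of whitespace and drops the empty strings at the ends:

```python
# The bytes object from the question.
body = (
    b"https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\n"
    b"https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-"
    b"on-a-tesla-model-3-maybe-its-complicated/\r\n"
)

# split() without arguments splits on runs of whitespace (including \r\n)
# and does not produce empty strings for leading or trailing whitespace.
urls = body.decode("utf-8").split()

print(urls)
```

This would not cope with URLs embedded in other text, where a pattern-based approach like the ones above is still the better fit.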

How to find and replace the pattern ");" in a text file?

I have a text file that contains special characters. I want to replace ");" with "firstdata);seconddata".
The catch is that ")" and ";" must appear together to be replaced with "firstdata);seconddata".
I have the following code:
import re
string = open('trial.txt').read()
new_str = re.sub('[);]', 'firstdata);seconddata', string)
open('b.txt', 'w').write(new_str)
Please suggest how to change my code to get the right output.
This should do:
import re
with open("file.txt", "r") as rfile:
    s = rfile.read()
# \) escapes the parenthesis, so the pattern matches the literal sequence ");"
rplce = re.sub(r'\);', "REPLACED", s)
with open("file.txt", "w") as wfile:
    wfile.write(rplce)
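The reason the original pattern '[);]' misbehaves is that square brackets define a character class, which matches a single ")" or a single ";" wherever either occurs. The sketch below compares it with the escaped two-character pattern on a made-up string (the sample string and the "#" replacement are assumptions chosen for illustration):

```python
import re

# Made-up sample containing ");" together as well as ")" and ";" on their own.
s = "call(a); x = (b) ; done);"

# [);] is a character class: every single ")" or ";" is replaced.
char_class = re.sub(r"[);]", "#", s)

# \); matches only the two-character sequence ");".
literal = re.sub(r"\);", "#", s)

# re.escape builds the same literal pattern without hand-written backslashes.
escaped = re.sub(re.escape(");"), "#", s)

print(char_class)
print(literal)
print(escaped)
```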
You can use the built-in str.replace() method in Python:
string = "foobar);"
string.replace(");", 'firstdata);seconddata') # -> 'foobarfirstdata);seconddata'
Here are the docs for common string operations like this in Python:
https://docs.python.org/3/library/string.html
You can use a simpler approach:
with open('input_file.txt', 'r') as input_file:
    with open('output_file.txt', 'w') as output_file:
        for line in input_file:
            x = line.replace('findtext','replacetext')
            output_file.write(x)