Replace Words proceeding it with Regex - regex

I have two strings like this:
word=list()
word.append('The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3')
word.append('Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG')
I want to remove everything from VHSDVDRiP or DVDRip onward, so that The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3 becomes The.Eternal.Evil.of.Asia.1995. and Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG becomes Guzoo.1986.
I tried the following but it doesn't work:
re.findall(r"\b\." + 'DVDRIP' + r"\b\.", word)

You could use re.split for that (regex101):
import re
s = 'The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3'
print(re.split(r'(\.[^.]*dvdrip\.)', s, maxsplit=1, flags=re.I)[0])
Prints:
The.Eternal.Evil.of.Asia.1995
Some test cases:
import re
lst = ['The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3',
       'Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG']
for item in lst:
    print(re.split(r'(\.[^.]*dvdrip\.)', item, maxsplit=1, flags=re.I)[0])
Prints:
The.Eternal.Evil.of.Asia.1995
Guzoo.1986
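The re.split approach above drops the period before the matched token, while the question shows it being kept. A minimal sketch of one way to keep it, assuming a re.sub with a lookbehind is acceptable (the helper name strip_from_dvdrip and the pattern are illustrative, not taken from the answer):

```python
import re

# Hypothetical helper; the pattern is one possible choice, not the answer's own.
def strip_from_dvdrip(name):
    # (?<=\.) keeps the period in front of the token,
    # [^.]* allows a prefix such as "VHS" inside the same token,
    # .* removes everything after the token.
    return re.sub(r'(?<=\.)[^.]*dvdrip\b.*', '', name, flags=re.I)

for item in ['The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3',
             'Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG']:
    print(strip_from_dvdrip(item))
```

Names without a dvdrip token are returned unchanged, since re.sub leaves the string alone when nothing matches.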

If you want to replace those matches with an empty string, this case-insensitive expression (note the (?i) flag) might work. As the output shows, it removes only the DVDRip token itself, not the text that follows it:
import re
regex = r"(?i)(.*)(?:\w+)?dvdrip\W(.*)"
test_str = """
The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3
Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG
"""
subst = "\\1\\2"
print(re.sub(regex, subst, test_str))
Output
The.Eternal.Evil.of.Asia.1995.x264.AC3
Guzoo.1986.VHSx264.AC3.HS.ES-SHAG
The expression is explained in the top-right panel of regex101.com, where you can explore, simplify, or modify it and watch how it matches against sample inputs.

Consider re.sub:
import re
films = ["The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3", "Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG"]
for film in films:
    print(re.sub(r'(.*)VHSDVDRiP.*|DVDRip.*', r'\1', film))
Output:
The.Eternal.Evil.of.Asia.1995.
Guzoo.1986.
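Note: this leaves the trailing period, as requested. The pattern is case-sensitive, and the unmatched \1 in the second alternative is replaced by an empty string only on Python 3.5 and later.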

Related

Expression to extract country name?

I have a dataframe of coefficients for countries, where each coefficient looks like:
s = "C(Country)[T.China]"
s2 = "C(Country)[T.Italy]"
s3 = "C(Country)[T.United States]"
How would I go about extracting just the country name (e.g. "China" or "Italy")?
And can this be done with a "strip" command instead of regex?
This expression will do the job (the dot is escaped, and the space in the character class allows names such as "United States"):
re.findall(r'T\.([a-zA-Z ]+)', s)
My guess is that maybe this simple expression would work:
T\.\s*([^]]+)
Test
import re
regex = r"T\.\s*([^]]+)"
test_str = ("C(Country)[T.China]\n"
"C(Country)[T.Italy]\n"
"C(Country)[T.United States]")
print(re.findall(regex, test_str))
Output
['China', 'Italy', 'United States']
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
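The question also asks whether this can be done without regex. str.strip removes a set of characters from both ends rather than a fixed prefix, so it can eat letters of the country name itself. A minimal sketch using str.partition instead, assuming every label has the "[T.<name>]" shape shown in the question (the function name country_name is hypothetical):

```python
def country_name(label):
    # Split once on the "[T." marker; `found` is empty if the marker is absent.
    _, found, tail = label.partition("[T.")
    if not found:
        return None
    # Drop the closing bracket at the end of the label.
    return tail.rstrip("]")

for s in ["C(Country)[T.China]", "C(Country)[T.Italy]", "C(Country)[T.United States]"]:
    print(country_name(s))
```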

Why regex match is not what I expect?

I am trying to implement the following code, but it does not print what I expect:
import re
def regex_search(txt):
    lst = re.findall(r'(\d{1,3}\.){3}', txt)
    return lst
print(regex_search("123.45.67.89"))
It prints ['67.'] when I expect ['123.', '45.', '67.']. Where am I wrong? Please help.
Thanks in advance.
There is no need to even use regex here:
text = "123.45.67.89"
parts = text.split(".")
parts = [s + "." for s in parts]
parts = parts[:-1]
print(parts)
['123.', '45.', '67.']
You should remove the quantifier {3} and the extra grouping, and write this code:
import re
def regex_search(txt):
    lst = re.findall(r'\d{1,3}\.', txt)
    return lst
print(regex_search("123.45.67.89"))
This prints your expected output:
['123.', '45.', '67.']
As for why (\d{1,3}\.){3} fails: it matches \d{1,3}\. exactly three times in a row, so the whole match is 123.45.67., but group 1 keeps only its last repetition, which is 67. in your case. When the pattern contains a capturing group, re.findall returns the group's content rather than the whole match, hence the single value in the list.
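The difference between the whole match and the repeated group can be checked directly. A small sketch using re.search on the question's input, plus the same pattern with a non-capturing group for comparison:

```python
import re

txt = "123.45.67.89"

m = re.search(r'(\d{1,3}\.){3}', txt)
print(m.group(0))  # whole match: all three repetitions
print(m.group(1))  # group 1: only the last repetition

# With a non-capturing group, findall returns the whole match instead.
whole = re.findall(r'(?:\d{1,3}\.){3}', txt)
print(whole)
```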
I changed your regex slightly and now it should work.
import re
def regex_search(txt):
    lst = re.findall(r'(\d{1,3}\.)', txt)
    return lst
print(regex_search("123.45.67.89"))
The output will now be:
['123.', '45.', '67.']
You can also do it without regex, by splitting on . and ignoring the last element:
def regex_search(txt):
    items = txt.split('.')[:-1]
    return [item + '.' for item in items]
print(regex_search("123.45.67.89"))
The output will be:
['123.', '45.', '67.']

Correction in Regex for unicode

I need help for regex. My regex is not producing the desired results. Below is my code:
import re
text='<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'
regex=re.compile(r'[<u+\w+]+>')
txt=regex.findall(text)
print(txt)
Output
['<u+0001f48e>', '<u+0001f6e0>', '<u+fe0f>', 'loved<u+2764>', '<u+fe0f>', 'spot<u+26a1>']
I know the regex is not correct. I want the output to be:
'<u+0001f48e>', '<u+0001f6e0><u+fe0f>', '<u+2764><u+fe0f>', '<u+26a1>'
import re
regex = re.compile(r'<u\+[0-9a-f]+>')
text = '<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'
print(regex.findall(text))
# output:
['<u+0001f48e>', '<u+0001f6e0>', '<u+fe0f>', '<u+2764>', '<u+fe0f>', '<u+26a1>']
That is not exactly what you want, but it's almost there.
Now, to achieve what you are looking for, we wrap the tag in a repeated non-capturing group so that consecutive tags are matched as one:
import re
regex = re.compile(r'((?:<u\+[0-9a-f]+>)+)')
text = '<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'
print(regex.findall(text))
# output:
['<u+0001f48e>', '<u+0001f6e0><u+fe0f>', '<u+2764><u+fe0f>', '<u+26a1>']
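Why not add an optional second tag to the search? Note that the + must be escaped, otherwise it acts as a quantifier on u, and the group should be non-capturing so that findall returns the whole match:
regex=re.compile(r'<u\+\w+>(?:<u\+fe0f>)?')
This one works fine with your example.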

python regex search for elimination

I have the following case:
import re
target_regex = '^(?!P\-[5678]).*'
pattern = re.compile(target_regex, re.IGNORECASE)
mylists=['p-1.1', 'P-5']
target_object_is_found = pattern.findall(''.join(mylists))
print "target_object_is_found:", target_object_is_found
This will give
target_object_is_found: ['p-1.1P-5']
but what I need from my regex is p-1.1 alone, eliminating P-5.
You joined the items in mylist and P-5 is no longer at the start of the string.
You may use
import re
target_regex = 'P-[5-8]'
pattern = re.compile(target_regex, re.IGNORECASE)
mylists=['p-1.1', 'P-5']
target_object_is_found = [x for x in mylists if not pattern.match(x)]
print("target_object_is_found: {}".format(target_object_is_found))
# => target_object_is_found: ['p-1.1']
See the Python demo.
Here, the P-[5-8] pattern is compiled with the re.IGNORECASE flag and is used to check each item inside mylists (see the [...] list comprehension) with the regex_object.match method, which looks for a match at the start of the string only. The match result is negated by the not after if.
So, all items are returned that do not start with (?i)P-[5-8] pattern.
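The negative lookahead from the question also works once it is applied to each item instead of the joined string. A minimal sketch under that assumption (the variable name kept is illustrative):

```python
import re

# Same idea as the question's pattern: succeed only if the string
# does not start with P-5 .. P-8 (case-insensitive).
pattern = re.compile(r'^(?!P-[5-8]).*', re.IGNORECASE)

mylists = ['p-1.1', 'P-5']
# match() returns None for items rejected by the lookahead.
kept = [x for x in mylists if pattern.match(x)]
print(kept)
```

Here the condition is positive (no not), because the lookahead already expresses the exclusion.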

Regular expression syntax in python

I am trying to write a Python script to analyse a data txt file. I want the script to find all the time values in one line and compare them. This is my first time writing regex syntax, so I wrote a small script first.
My script is:
import sys
txt = open('1.txt','r')
a = []
for eachLine in txt:
    a.append(eachLine)
import re
pattern = re.compile('\d{2}:\d{2}:\d{2}')
for i in xrange(len(a)):
    print pattern.match(a[i])
#print a
and the output is always None.
My txt file looks like the picture:
What's the problem? Please help me, thanks a lot.
My Python is 2.7.2 and my OS is Windows XP SP3.
Didn't you miss one of the ":" in your regex? I think you meant
re.compile('\d{2}:\d{2}:\d{2}')
The other problems are:
First, if you want to search in the whole text, use search instead of match, because match only looks at the start of the string. Second, to access your result you need to call group() on the match object returned by search, after checking that it is not None.
Try it:
import re
pattern = re.compile(r'\d{2}:\d{2}:\d{2}')
with open('1.txt', 'r') as txt:
    for line in txt:
        match = pattern.search(line)
        if match:  # search() returns None when the line contains no time
            print(match.group())
I think you're missing the colons and dots in your regex. Also try using re.search or re.findall on the entire text instead. Like this:
import re
with open("./1.txt", "r") as f:
    text = f.read()  # or readlines() to make a list of lines
pattern = re.compile(r'\d{2}:\d{2}:\d{2}')
matches = pattern.findall(text)
for i in matches:
    print(i)
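The question's stated goal is to find all the times in one line and compare them. A minimal sketch of that step, assuming the times are written as HH:MM:SS; the sample line and the helper name times_in_line are hypothetical, since the real contents of 1.txt are not shown:

```python
import re
from datetime import datetime

TIME_RE = re.compile(r'\d{2}:\d{2}:\d{2}')

def times_in_line(line):
    """Return every HH:MM:SS value found in the line as a datetime.time."""
    # strptime raises ValueError for impossible values such as 99:99:99.
    return [datetime.strptime(t, '%H:%M:%S').time() for t in TIME_RE.findall(line)]

# Hypothetical sample line standing in for one line of 1.txt.
line = 'start 09:15:30 end 09:47:05'
times = times_in_line(line)
print(times)
print(times[0] < times[1])  # time objects compare chronologically
```

Converting to time objects is one design choice; zero-padded HH:MM:SS strings also sort correctly as plain text, but parsing catches malformed values early.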