Correction in Regex for unicode - regex

I need help with a regex. My regex is not producing the desired results. Below is my code:
import re
text='<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'
regex=re.compile(r'[<u+\w+]+>')
txt=regex.findall(text)
print(txt)
Output
['<u+0001f48e>', '<u+0001f6e0>', '<u+fe0f>', 'loved<u+2764>', '<u+fe0f>', 'spot<u+26a1>']
I know the regex is not correct. I want the output to be:
'<u+0001f48e>', '<u+0001f6e0><u+fe0f>', '<u+2764><u+fe0f>', '<u+26a1>'

import re
regex = re.compile(r'<u\+[0-9a-f]+>')
text = '<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'
print(regex.findall(text))
# output:
['<u+0001f48e>', '<u+0001f6e0>', '<u+fe0f>', '<u+2764>', '<u+fe0f>', '<u+26a1>']
That is not exactly what you want, but it's almost there.
Now, to achieve what you are looking for, we make the regex greedier, so that a run of adjacent tags is returned as a single match:
import re
regex = re.compile(r'((?:<u\+[0-9a-f]+>)+)')
text = '<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'
print(regex.findall(text))
# output:
['<u+0001f48e>', '<u+0001f6e0><u+fe0f>', '<u+2764><u+fe0f>', '<u+26a1>']
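If the goal after grouping is to turn each run back into real characters, something like the following might work. This is only a sketch: the names TAG_RUN, CODE_POINT and decode_run are made up for illustration, and it assumes every tag holds a valid hexadecimal code point.

```python
import re

# One or more adjacent <u+XXXX> tags; no capturing group, so findall
# returns each whole run as a plain string.
TAG_RUN = re.compile(r'(?:<u\+[0-9a-f]+>)+')
# The hex digits of a single tag.
CODE_POINT = re.compile(r'<u\+([0-9a-f]+)>')

def decode_run(run):
    """Turn '<u+2764><u+fe0f>' into the characters those code points name."""
    return ''.join(chr(int(h, 16)) for h in CODE_POINT.findall(run))

text = '<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'
runs = TAG_RUN.findall(text)
decoded = [decode_run(r) for r in runs]
print(runs)
print(decoded)
```

Keeping the variation selector (fe0f) in the same run as the preceding code point matters here, since the two together form one displayed emoji.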

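Why not add an optional search for the second tag:
regex=re.compile(r'<u\+\w+>(?:<u\+fe0f>)?')
The + has to be escaped to match a literal plus sign, and the optional group is non-capturing so that findall returns the whole match. This one works fine with your example.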

Related

Regex to extract text from request id

I have a log in which one field is a request id, and I need to extract some text from it.
Ex: RES_1621480647_49610052479341623017223137119508459972977816017376903362_Book,
Can anyone please help in extracting Book out of it?
Consider string splitting instead:
>>> s = "RES_1621480647_49610052479341623017223137119508459972977816017376903362_Book"
>>> s.split("_")[-1]
'Book'
String splitting will likely be more efficient. If you must use regular expressions, here is an example.
#!/usr/bin/env python3
import re
print(
    re.findall(r"^\w+_\d+_(\w+)$", 'RES_1621480647_49610052479341623017223137119508459972977816017376903362_Book')
)
# output: ['Book']
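A slightly more defensive variant, as a sketch only: it assumes the wanted text is always whatever follows the final underscore, and the helper name last_field is made up for illustration.

```python
import re

def last_field(request_id):
    """Return the text after the final underscore, or None if there is none."""
    m = re.search(r'_([^_]+)$', request_id)
    return m.group(1) if m else None

rid = 'RES_1621480647_49610052479341623017223137119508459972977816017376903362_Book'
print(last_field(rid))
# The non-regex equivalent: split once, from the right.
print(rid.rsplit('_', 1)[-1])
```

Unlike indexing the result of findall, this returns None instead of raising when a log line has no underscore at all.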

how to trim the specific lines starting and ending with character in a string in java

I have stored a multiline string in Java as shown in the code below. It prints the output as:
aa
bb
hhh me $ hdddhd hhhdhhdhh
hrx
$
dddsss
I don't need the line starting with hhh me $, or anything from there up to and including the closing $.
I need to get the output as
aa
bb
hrx
dddsss
I have tried this in Eclipse:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class dummyFile {
    public static void main(String[] args) throws FileNotFoundException {
        String line = new StringBuilder()
                .append("aa\n\n")
                .append("bb\n\n")
                .append("hhh me $ hdddhd hhhdhhdhh\n\n")
                .append("hrx\n\n")
                .append("$\n\n")
                .append("dddsss")
                .toString();
        System.out.println(line);
        String pattern = "hhh me (.)";
        Pattern r = Pattern.compile(pattern);
        Matcher m = r.matcher(line);
        if (m.find())
        {
            System.out.println(m.group(1));
        }
        if (line.contains("hhh me "+ m.group(1)))
        {
            line.replace(
                line.substring(
                    line.indexOf("banner mod " +m.group(1)),
                    line.lastIndexOf(m.group(1))+1
                ),
                ""
            )
            .replace("\n\n", "\n");
        }
        System.out.println(line);
    }
}
Could someone please help?
Phew, that was a fun one (if you're insane like me!)
(?!.*?\$.*?)^.+?(?:\n\n|$).*?
You'll need the regex options global and multiline. For most regex instances that's just a matter of formatting it like:
/(?!.*?\$.*?)^.+?(?:\n\n|$).*?/gm
Java has no /gm suffix, though: pass Pattern.MULTILINE to Pattern.compile and call Matcher.find() in a loop to get every match.
That pattern will give you multiple matches, which you can glue back together with StringBuilder, for example.
If you want, I'll edit my answer and break down exactly what it's doing.
This sounds a lot like homework that I don't want to do for you. But I'll throw some stuff up here that will hopefully help you figure it out.
Your regex isn't going to match what you want. (.) captures a single character, and it won't capture newline characters, so you'll have to fix that. + matches one or more of the preceding element and * matches zero or more. It seems you also want to make sure you're matching from one $ to the next $. Since $ is a regex metacharacter you have to escape it, and because you're working inside a Java string literal the backslash itself has to be doubled (\\$).
Try something like this for your regex:
final String pattern = "hhh me \\$([a-zA-Z\\s\n\r]*)\\$";
Then, in Eclipse or in the Java docs, look through the Matcher class for helpful methods to find or replace what you've matched (the parts inside () in a regular expression are capture groups).
Maybe something like Matcher.replaceFirst() will help.

Regular expression: Matching multiple occurrences in the same line

I have a string that I need to match using regex. It works perfectly fine when I have a single occurrence in a single line, however, when there are multiple occurrences of the same string in a single line I'm not getting any matches. Can you please help?
Sample strings:
MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T
Regex that I tried:
(([A-Z]{2}[0-9]{8,9}[A-Z]{1})|([A-Z]{2}[0-9]{8,9}))
This seems to work fine:
a = '''MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T'''
import re
patterns = ['[A-Z]{2}[0-9]{8,9}[A-Z]{1}','[A-Z]{2}[0-9]{8,9}']
pattern = '({})'.format(')|('.join(patterns))
matches = re.findall(pattern, a)
print([match for sub in matches for match in sub if match])
#['MS17010314', 'MS00030208', 'IL00171198', 'IH09850115', 'IH99400409',
# 'IH99410409', 'IL01771010', 'IL01791002', 'IL01930907', 'IL02360907',
# 'CM00010904', 'IH09520115', 'MS00201285', 'MS19050708', 'MS00370489',
# 'MS19011285T']
I've added a way to combine all patterns.
I tried using Python, and the following code worked:
import re
s='''MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T'''
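# a and b were left undefined in the original; assumed to be the two alternatives from the question
lst_of_regex = ['[A-Z]{2}[0-9]{8,9}[A-Z]{1}', '[A-Z]{2}[0-9]{8,9}']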
pattern = '|'.join(lst_of_regex)
print(re.findall(pattern,s))
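The two alternatives differ only in the trailing letter, so they can probably be folded into one pattern with an optional [A-Z]. A minimal sketch, assuming the codes are always separated by whitespace (the \b word boundaries are an addition, not part of the original regex):

```python
import re

s = '''MS17010314 MS00030208 IL00171198 IH09850115 IH99400409 IH99410409
IL01771010 IL01791002 IL01930907 IL02360907 CM00010904 IH09520115
MS00201285 MS19050708 MS00370489 MS19011285T'''

# Two letters, 8 or 9 digits, then an optional trailing letter.
# With no capturing groups, findall returns the full matches directly.
pattern = re.compile(r'\b[A-Z]{2}[0-9]{8,9}[A-Z]?\b')
codes = pattern.findall(s)
print(len(codes))
print(codes[-1])
```

Because there are no groups, there is no list of tuples to flatten afterwards.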

Regular expression syntax in python

I am trying to write a Python script to analyse a data txt file. I want the script to do the following:
find all the time values in one line and compare them. This is my first time writing regex syntax, so I wrote a small script first.
My script is:
import sys
txt = open('1.txt','r')
a = []
for eachLine in txt:
    a.append(eachLine)
import re
pattern = re.compile('\d{2}:\d{2}:\d{2}')
for i in xrange(len(a)):
    print pattern.match(a[i])
#print a
The output is always None.
My txt looks like the picture:
What's the problem? Please help me, thanks a lot.
My Python is 2.7.2 and my OS is Windows XP SP3.
Didn't you miss one of the ":" in your regex? I think you meant
re.compile('\d{2}:\d{2}:\d{2}')
The other problems are:
First, if you want to search the whole text, use search instead of match, because match only tries to match at the beginning of the string. Second, to access your result you need to call group() on the match object returned by your search.
Try it:
import sys
txt = open('1.txt','r')
a = []
for eachLine in txt:
    a.append(eachLine)
import re
pattern = re.compile(r'\d{2}:\d{2}:\d{2}')
for i in xrange(len(a)):
    match = pattern.search(a[i])
    if match:  # search returns None when the line contains no time
        print match.group()
#print a
I think you're missing the colons and dots in your regex. Also try using re.search or re.findall on the entire text instead. Like this:
import re, sys
text = open("./1.txt", "r").read() # or readlines() to make a list of lines
pattern = re.compile(r'\d{2}:\d{2}:\d{2}')
matches = pattern.findall(text)
for i in matches:
    print(i)
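To make the match/search difference concrete, and to cover the "compare them" part of the question, here is a small Python 3 sketch. The sample line is made up, since the real file layout is not shown, and it assumes each line holds at least two times in HH:MM:SS form.

```python
import re
from datetime import datetime

TIME = re.compile(r'\d{2}:\d{2}:\d{2}')

# Hypothetical log line; the actual contents of 1.txt are unknown.
line = 'job 17 started 09:15:02 finished 09:47:30'

print(TIME.match(line))           # None: match() only tries position 0
print(TIME.search(line).group())  # the first time found anywhere in the line

# Parse every time on the line so the values can be compared or subtracted.
times = [datetime.strptime(t, '%H:%M:%S') for t in TIME.findall(line)]
elapsed = times[1] - times[0]
print(elapsed)
```

Parsing into datetime objects is a design choice: plain string comparison works for same-day HH:MM:SS values, but subtraction needs real time objects.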

regex to strip out image urls?

I need to separate out a bunch of image urls from a document in which the images are associated with names like this:
bellpepper = "http://images.com/bellpepper.jpg"
cabbage = "http://images.com/cabbage.jpg"
lettuce = "http://images.com/lettuce.jpg"
pumpkin = "http://images.com/pumpkin.jpg"
I assume I can detect the start of a link with:
/http:[^ ,]+/i
But how can I get all of the links separated from the document?
EDIT: To clarify the question: I just want to strip out the URLs from the file minus the variable name, equals sign and double quotes so I have a new file that is just a list of URLs, one per line.
Try this...
(http://)([a-zA-Z0-9\/\\.])*
If the format is constant, then this should work (python):
import re
s = """bellpepper = "http://images.com/bellpepper.jpg" (...) """
re.findall("\"(http://.+?)\"", s)
Note: this is not "find an image in a file" regexp, just an answer to the question :)
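For the "one URL per line" part of the question, a minimal sketch building on the same idea. It assumes every URL is wrapped in double quotes exactly as in the sample, and the output file name in the comment is hypothetical.

```python
import re

doc = '''bellpepper = "http://images.com/bellpepper.jpg"
cabbage = "http://images.com/cabbage.jpg"
lettuce = "http://images.com/lettuce.jpg"
pumpkin = "http://images.com/pumpkin.jpg"
'''

# Capture everything between the quotes, as long as it starts with http://
urls = re.findall(r'"(http://[^"]+)"', doc)
output = '\n'.join(urls) + '\n'
print(output, end='')

# To save the list instead of printing it (file name is hypothetical):
# with open('urls.txt', 'w') as f:
#     f.write(output)
```

Using [^"]+ rather than .+? keeps a match from ever running past the closing quote.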
Do you mean that your document has that format and you just want the http part? You can simply split on the "=" delimiter without a regex:
$f = fopen("file","r");
if ($f){
    while( !feof($f) ){
        $line = fgets($f,4096);
        $s = explode(" = ",$line);
        $s = preg_replace("/\"/","",$s);
        print $s[1];
    }
    fclose($f);
}
on the command line :
#php5 myscript.php > newfile.ext
If you are using a language other than PHP, there are similar string-splitting methods you can use, e.g. Python's or Perl's split(). Please read your docs to find out.
You may try this, if your tool supports positive lookbehind:
/(?<=")[^"\n]+/