RegEx CRLF but preserve CRLF with preceding character - regex

I have a text file with the following content:
aaaaaaCRLF
bbbbbb'CRLF
ccccccCRLF
I want to remove the CRLF from the lines where there is no ' before the CRLF. The resulting text should be:
aaaaaa
bbbbbb'CRLF
cccccc
Any idea on how to do this with RegEx?

I'm using Python, but this should work with any decent Regex engine. Use a negative lookbehind!
>>> import re
>>> s = "aaaaaaa\r\nasdasdasd'\r\nasdasdas\r\n"
>>> p = r"(?<!')\r\n"
>>> re.sub(p, '', s)
"aaaaaaaasdasdasd'\r\nasdasdas"
edit: Oh, you mean \r\n. Pattern adapted.
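If the text comes from a file, the pattern can only match when the \r\n pairs survive reading. Here is a minimal sketch of the same lookbehind wrapped in a function, using the sample from the question. The name strip_unquoted_crlf is made up for illustration.

```python
import re

# Matches a CRLF that is NOT immediately preceded by a single quote.
CRLF_WITHOUT_QUOTE = re.compile(r"(?<!')\r\n")

def strip_unquoted_crlf(text):
    """Remove every CRLF except those that directly follow a quote."""
    return CRLF_WITHOUT_QUOTE.sub("", text)

sample = "aaaaaa\r\nbbbbbb'\r\ncccccc\r\n"
cleaned = strip_unquoted_crlf(sample)
print(repr(cleaned))
```

In Python 3, opening the file with open(path, newline='') leaves line endings untranslated, so the CRLF pairs reach the regex intact. Note that removing a CRLF joins the two neighbouring lines.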

Related

Regex to not match a specific string, but with additional check

So for example I have this string
var = 'column1;column2;column3\r\nval1;val2;val3\r\n;val4;val5;val6\r\n'
I want to find every \r\n and replace it with temp\r\n, except the \r\n that follows column3.
I tried ^(?!.*column3).*$\r\n, but the \r\n part does not work.
You want to use a negative lookbehind, that is, make the substitution only when \r\n is not preceded by column3:
re.sub(r'(?<!column3)\r\n', r'temp\r\n', var)
For example:
>>> import re
>>>
>>> var = 'column1;column2;column3\r\nval1;val2;val3\r\n;val4;val5;val6\r\n'
>>> new_text = re.sub(r'(?<!column3)\r\n', r'temp\r\n', var)
>>> new_text
'column1;column2;column3\r\nval1;val2;val3temp\r\n;val4;val5;val6temp\r\n'
>>>
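The lookbehind only inspects the characters immediately before the \r\n, and Python's re module requires it to have a fixed width. Here is a sketch of a reusable helper under those assumptions. The name append_before_crlf and its parameters are hypothetical. It escapes the excluded text with re.escape and uses a function as the replacement so the suffix is inserted literally.

```python
import re

def append_before_crlf(text, suffix, skip_after):
    """Insert `suffix` before each CRLF unless the CRLF follows `skip_after`."""
    # re.escape keeps characters such as '.' in skip_after literal.
    pattern = re.compile(r"(?<!" + re.escape(skip_after) + r")\r\n")
    # A function replacement avoids backslash processing of the suffix.
    return pattern.sub(lambda m: suffix + "\r\n", text)

var = 'column1;column2;column3\r\nval1;val2;val3\r\n;val4;val5;val6\r\n'
new_text = append_before_crlf(var, 'temp', 'column3')
print(repr(new_text))
```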

python3: regex need to character to match but dont want in output

I have the following string:
Set-Cookie: BIGipServerApp_Pool_SSL=839518730.47873.0000; path=/
I am trying to extract 839518730.47873.0000 from it. My regex works for this exact string, but if there is any digit before the first =, it all goes wrong.
No Digit
>>> m=re.search('[0-9.]+','Set-Cookie: BIGipServerApp_Pool_SSL=839518730.47873.0000; path=/')
>>> m.group()
'839518730.47873.0000'
With Digit
>>> m=re.search('[0-9.]+','Set-Cookie: BIGipServerApp_Pool_SSL2=839518730.47873.0000; path=/')
>>> m.group()
'2'
Is there any way to extract only '839518730.47873.0000', no matter what else is in the string?
I tried
>>> m=re.search('=[0-9.]+','Set-Cookie: BIGipServerApp_Pool_SSL=839518730.47873.0000; path=/')
>>> m.group()
'=839518730.47873.0000'
as well, but then the output starts with '=', which I don't want.
Any ideas?
Thank you.
If your substring always comes after the first =, you can just use a capture group with the =([\d.]+) pattern:
import re
result = ""
m = re.search(r'=([0-9.]+)', 'Set-Cookie: BIGipServerApp_Pool_SSL2=839518730.47873.0000; path=/')
if m:
    result = m.group(1)  # Get Group 1 value only
print(result)
The main point is that you match the part you do not need (the =) and match and capture, with unescaped parentheses, the part of the pattern you do need. The value you want is in Group 1.
You can use word boundaries:
\b[\d.]+
Or, to make the match more targeted, use a lookahead for the next semicolon after the matched text:
\b[\d.]+(?=\s*;)
Update:
>>> m=re.search(r'\b[\d.]+','Set-Cookie: BIGipServerApp_Pool_SSL2=839518730.47873.0000; path=/')
>>> m.group(0)
'839518730.47873.0000'
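To compare the suggestions above on both header variants from the question, here is a small sketch. The function names are made up for illustration; the patterns are the ones from the answers.

```python
import re

headers = [
    'Set-Cookie: BIGipServerApp_Pool_SSL=839518730.47873.0000; path=/',
    'Set-Cookie: BIGipServerApp_Pool_SSL2=839518730.47873.0000; path=/',
]

def capture_after_equals(header):
    # Match the '=' but return only the captured group.
    m = re.search(r'=([0-9.]+)', header)
    return m.group(1) if m else None

def word_boundary(header):
    # \b cannot match between 'L' and '2', so the digit in 'SSL2' is skipped.
    m = re.search(r'\b[\d.]+', header)
    return m.group(0) if m else None

def boundary_with_lookahead(header):
    # Additionally require a semicolon (after optional whitespace).
    m = re.search(r'\b[\d.]+(?=\s*;)', header)
    return m.group(0) if m else None

for header in headers:
    print(capture_after_equals(header),
          word_boundary(header),
          boundary_with_lookahead(header))
```

All three should give the same value for both headers. The capture-group version depends on the = sign, while the other two depend on where word boundaries fall.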

Finding out unknown matched words

I have a regex pattern:
import regex as re
re.sub(r'(.*)\bHello (.*) BGC$\b', "OTR", 'Hello People BGC')
This replaces the match and returns OTR, but how do I find out which characters were matched by each (.*)?
Using regex==2016.1.10, Python 3.5.1
Compile the pattern and then call match() and sub() separately:
>>> pattern = re.compile(r'^Hello (.*?) BGC$')
>>> s = 'Hello People BGC'
>>> pattern.match(s).group(1)
'People'
>>> pattern.sub("OTR", s)
'OTR'
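If you want the captured text and the replacement in a single pass, another option is to pass a function as the replacement. This is only a sketch: it uses the standard re module rather than regex, and record_and_replace and captured are hypothetical names.

```python
import re

pattern = re.compile(r'^Hello (.*?) BGC$')
captured = []

def record_and_replace(match):
    # Remember what group 1 matched, then return the replacement text.
    captured.append(match.group(1))
    return "OTR"

result = pattern.sub(record_and_replace, 'Hello People BGC')
print(result, captured)
```

The function is called once per match, so captured collects group 1 for every substitution that is made.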

Regular expression to match a word while preserving end of line

I have a string as follows:
str = 'chem biochem chem chemi hem achem abcchemde chem\n asd chem\n'
I want to replace the word "chem" with "chemistry" while preserving the end-of-line character ('\n'). I also want the regex not to match words like 'biochem', 'chemi', 'hem', 'achem' and 'abcchemde'. How can I do this?
Here's what I'm using but it doesn't work:
import re
re.sub(r'[ ^c|c]hem[$ ]', r' chemistry ', str)
Thank you
Use word boundaries:
>>> s = 'chem biochem chem chemi hem achem abcchemde chem\n asd chem\n'
>>> import re
>>> re.sub(r'\bchem\b','chemistry',s)
'chemistry biochem chemistry chemi hem achem abcchemde chemistry\n asd chemistry\n'
Just a note: don't use str as a variable name, since that shadows the built-in str type.
You need to use \b to match a word boundary:
import re
re.sub(r'\bchem\b', r'chemistry', mystring)
(And as R Nar pointed out, you should avoid using str as a variable name.)
I just found the answer. Thanks to @Jota.
The super-simple regex is as follows:
re.sub(r'\bchem\b', r' chemistry ', str)
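The word-boundary approach works for any word, not just "chem". Below is a rough sketch with a hypothetical helper, replace_whole_word, that escapes the word with re.escape. Padding the replacement with spaces, as in r' chemistry ', would add extra spaces around every replaced word, so the sketch uses the bare word.

```python
import re

def replace_whole_word(text, word, replacement):
    """Replace `word` only where it stands alone between word boundaries."""
    pattern = re.compile(r'\b' + re.escape(word) + r'\b')
    # A function replacement keeps any backslashes in `replacement` literal.
    return pattern.sub(lambda m: replacement, text)

s = 'chem biochem chem chemi hem achem abcchemde chem\n asd chem\n'
result = replace_whole_word(s, 'chem', 'chemistry')
print(repr(result))
```

The '\n' characters are left alone because \b is zero-width and consumes nothing.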

Regex to catch a string without () in 3 patterns like abc(ef) ,(ef)abc and (ef)abc(gh)

I have tested this Regex
(?<=\))(.+?)(?=\()|(?<=\))(.+?)\b|(.+?)(?=\()
but it doesn't work for strings like this pattern (ef)abc(gh).
I got a result like this "(ef)abc".
But these 3 regexes (?<=\))(.+?)(?=\() , (?<=\))(.+?)\b, (.+?)(?=\()
do work separately for "(ef)abc(gh)", "(ef)abc" ,"abc(ef)" .
Can anyone tell me where the problem is, or how I can get the expected result?
Assuming you are looking to match the text that lies outside the elements in parentheses, try this:
^(?:\(\w*\))?([\w]*)(?:\(\w*\))?$
^ - beginning of string
(?:\(\w*\))? - optional non-capturing group: 0 or more word characters (letters, digits, underscore) within parens
([\w]*) - capturing group: 0 or more word characters
(?:\(\w*\))? - optional non-capturing group: 0 or more word characters within parens
$ - end of string
You haven't specified what language you might be using, but here is an example in Python:
>>> import re
>>> string = "(ef)abc(gh)"
>>> string2 = "(ef)abc"
>>> string3 = "abc(gh)"
>>> p = re.compile(r'^(?:\(\w*\))?([\w]*)(?:\(\w*\))?$')
>>> m = re.search(p, string)
>>> m2 = re.search(p, string2)
>>> m3 = re.search(p, string3)
>>> print(m.groups()[0])
abc
>>> print(m2.groups()[0])
abc
>>> print(m3.groups()[0])
abc
\([^)]+\)|([^()\n]+)
Try this. Just grab the capture group. See the demo:
https://regex101.com/r/tX2bH4/6
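Here is a sketch in Python of how "just grab the capture group" could look, assuming you want a list of the pieces outside the parentheses. The name text_outside_parens is made up for illustration. The first alternative consumes each parenthesized part without capturing it, so group 1 is only set for the text you want.

```python
import re

PATTERN = re.compile(r'\([^)]+\)|([^()\n]+)')

def text_outside_parens(s):
    # Group 1 is None when the match was a parenthesized part; skip those.
    return [m.group(1) for m in PATTERN.finditer(s) if m.group(1)]

for sample in ("(ef)abc(gh)", "(ef)abc", "abc(ef)"):
    print(sample, text_outside_parens(sample))
```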
Your problem is that (.+?)(?=\() matches "(ef)abc" in "(ef)abc(gh)".
The easiest solution to this problem is to be more explicit about what you are looking for, in this case by replacing "any character" (.) with "any character that is not a parenthesis" ([^\(\)]).
(?<=\))([^\(\)]+?)(?=\()|(?<=\))([^\(\)]+?)\b|([^\(\)]+?)(?=\()
A cleaner regexp would be
(?:(?<=^)|(?<=\)))([^\(\)]+)(?:(?=\()|(?=$))
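As a quick check of the three-alternative pattern with the negated character class, here is a sketch in Python. The question did not name a language, so Python is an assumption here, and outside_parens is a hypothetical helper. Only one of the three groups takes part in any given match, so the helper picks whichever one is set.

```python
import re

PATTERN = re.compile(
    r'(?<=\))([^\(\)]+?)(?=\()'   # text between ")" and "("
    r'|(?<=\))([^\(\)]+?)\b'      # text after ")" up to a word boundary
    r'|([^\(\)]+?)(?=\()'         # text before "("
)

def outside_parens(text):
    results = []
    for m in PATTERN.finditer(text):
        # Exactly one group is not None for each match.
        results.append(next(g for g in m.groups() if g is not None))
    return results

for sample in ("(ef)abc(gh)", "(ef)abc", "abc(ef)"):
    print(sample, outside_parens(sample))
```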