Regex - Replace everything attached to an expression, but not the expression itself - regex

If it matters, I'm working with Python/R for this particular script, but I think this should be a general regex question.
I have something along the format of
"_id" : ObjectID("34z83b3853e820x583203"),
This happens millions of times in a particular file. I want to convert all of these to
"_id" : "34z83b3853e820x583203",
The catch is, I can't just replace any "), with ", as there may be other instances in the file.
Replacing ObjectID(" with " should be trivial.
So essentially, I have to find a string of 15+ characters that mixes letters AND numbers, immediately followed by "),
Once found, I need to preserve that string, and just delete the ).
Is there a good way to go about this that I'm missing? Finding an expression and preserving pieces of it?
My initial impression was to use a lookbehind
(?<=[a-zA-Z0-9]{15,}")\)
In hopes that this would look for a ) that is preceded by a string of 15+ alphanumeric characters, however
1) I do not believe this means it has to be alpha AND numeric, just alpha or numeric or both.
2) It's not catching the desired parenthesis regardless.

You can do both steps together (replacing opening ( and closing parentheses ))
Regex: ObjectID\((\"[a-zA-Z0-9]{15,}\")\)
(\"[a-zA-Z0-9]{15,}\") is the first capturing group and includes the quotes and the alphanumeric characters between which have a rule of 15 or above like you've mentioned. Since this is the first capturing group it is represented by $1
ObjectID\( is the literal ObjectID followed by the opening parentheses \(
\) is the closing parentheses at the end
Replace with: $1
Hope this helps!
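Since the question mentions Python, here is a minimal sketch of the same substitution with the re module. The names OBJECT_ID and strip_object_id are made up for illustration. Note that Python's re refers to the first capture group as \1 rather than $1, and that it rejects the lookbehind from the question because lookbehind patterns must have a fixed width there, which {15,} does not.

```python
import re

# Pattern from the answer above: the quoted ID is capture group 1.
# (The quotes do not need escaping inside a raw Python string.)
OBJECT_ID = re.compile(r'ObjectID\(("[a-zA-Z0-9]{15,}")\)')


def strip_object_id(text):
    """Replace every ObjectID("...") with just the quoted ID."""
    return OBJECT_ID.sub(r'\1', text)


line = '"_id" : ObjectID("34z83b3853e820x583203"),'
print(strip_object_id(line))
```

Other occurrences of "), are left alone because the pattern only matches when the literal ObjectID( comes first.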

regex code: does or does not contain a character

I can't figure this out. I want to capture the string inside the square brackets, with or without characters in it.
[5123512], [412351, 1235123, 5125123], [12312-AA] and []
I want to convert the square brackets into double quotes:
[5123512] ==> "5123512"
[412351, 1235123, 5125123] ==> "412351, 1235123, 5125123"
[12312-AA] ==> "12312-AA"
[] ==> ""
I tried \[\d+\] and it is not working.
This is my sample data, its a json format.
Square brackets inside the description must not change, only the ones around attribute values.
{"results":
[{"listing": 4613456431,"sku": [5123512],"category":[412351, 1235123,
5125123],"subcategory": "12312-AA", "description":"This is [123sample]"}
{"listing": 121251,"sku":[],"category": [412351],"subcategory": "12312-AA",
"description": "product sample"}]}
TIA
Your regex doesn't work for three reasons :
[ is a meta-character that opens a character class. To match a literal [, you need to escape it with a backslash. ] also is a meta-character when it follows the [ meta-character, but if you escape the [ you shouldn't need to escape the ] (not that it hurts to do so).
\d only matches decimal digits, however your sample contains the letter A. If that's a hexadecimal digit, you will probably want to use [\dA-F] instead of \d, or [\dA-Fa-f] if the digits can appear in lowercase. If it can be any letter, you could use [\dA-Z] or [\dA-Za-z], depending on whether you need to match lowercase letters.
+ means "one or more occurrences", so it wouldn't match an empty []. Use the * "0 or more occurrences" quantifier instead.
Additionally, you probably need to capture the sequence of digits in a (capturing group) in order to be able to reference it in your replacement pattern.
However, as Andrew Morton suggests, it looks like you should be able to use a plain text search/replace.
First off, regex is a horrible tool for parsing JSON formatted data. I'm sure you'll find plenty of tools to simply read your JSON in vb.net and mangle it in simpler ways than taking it in as text... For example: How to parse json and read in vb.net
Original answer (edited slightly):
You're almost there, but here's a few things you need to change:
in your regex pattern, escape the square brackets: \[ and \]
if you only want to capture all characters in the brackets, then . is a good way to go
the plus sign + means "at least one" — if you want to match empty brackets too, use *? instead
the question mark means "lazy" — it explicitly tells the regex to match the shortest sequence of characters possible (instead of going over to the next square bracket...)
wrap the .*? in parentheses so that you can refer to that part later when substituting
finally, the output value / pattern to substitute with is \1 or $1, depending on the context
or "\1" or "$1" if you really need the double quotes in the output — maybe you just need a string variable?
All in all this becomes:
Find this: \[(.*?)\]
Replace with: \1
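As a rough Python sketch of the find/replace above (function names are hypothetical): the first function applies \[(.*?)\] with "\1" to an isolated value. The second is an assumed stricter variant, not taken from either answer, that only touches a [...] sitting directly after a colon, so that brackets inside a description string or the outer results array are left alone. A real JSON parser is still the safer route, as noted above.

```python
import re


def brackets_to_quotes(value):
    # \[(.*?)\] from the answer; "\1" re-emits the captured text in quotes.
    return re.sub(r'\[(.*?)\]', r'"\1"', value)


def attribute_brackets_to_quotes(record):
    # Assumption: an attribute value is a [...] right after a colon, with no
    # nested brackets, braces or quotes inside it.
    return re.sub(r'(:\s*)\[([^\[\]{}"]*)\]', r'\1"\2"', record)


for sample in ['[5123512]', '[412351, 1235123, 5125123]', '[12312-AA]', '[]']:
    print(sample, '==>', brackets_to_quotes(sample))
```

The stricter variant is only a heuristic: a description that itself contains a colon followed by a bracket would still be rewritten.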

Regular expression to find and replace ")" in anything matching "pc( * )"?

I'm trying to learn regular expressions to speed up editing my program.
My program has hundreds of references to the 3-dimensional array pc. For example, the array elements might be referred to as pc(i+1,j+1,k), pc(i,j+1,k-1) or pc(i,j,k). I need a regular expression to search for the ending parenthesis so that I can replace it with ",1)". For example, the end goal is to convert pc(i,j,k) to pc(i,j,k,1).
I don't need the regular expression to do the actual replacing -- I don't even know if that's possible -- I just need it to find the ending parenthesis so I can replace it.
Any help or hints would be much appreciated!
Here's an excerpt of the code I would be searching through:
PpPx_ey = 0.5*( FNy(i,j+1,k) *((pc(i,j+1,k)-pc(i-1,j+1,k))/xdiff(i,j,k)+(pc(i+1,j+1,k)-pc(i,j+1,k))/xdiff(i+1,j,k) )+(1.-FNy(i,j+1,k))*((pc(i,j, k)-pc(i-1,j, k))/xdiff(i,j,k)+(pc(i+1,j ,k)-pc(i,j ,k))/xdiff(i+1,j,k)) ).
To further clarify: I'm using the Atom notepad, which allows for regular expressions in the CTRL-F command. I want to use the 'replace' option for things that I CTRL-F, but I need to use a literal string for that part. Thus if I can find the ending ")" in anything that looks like pc( ) using a regular expression, I can replace it with ",1)".
Pretty simple, actually.
This should do it for you:
pc\(.*\)
pc = literally pc
\( = escaped (
.* = anything
\) = escaped )
(pc\(.*?)\)
( - Begins a capture group.
pc - This will match the literal pc
\( - Matches the opening parenthesis. The backslash escapes the parenthesis, so that it isn't interpreted as the beginning of a capture group.
.*? - Will lazily match anything. . will match any single character. * is a quantifier that matches any number (including zero) of the preceding element, the . in this case. ? causes the preceding quantifier to be lazy, meaning that it will match the minimum number of characters possible. This is what prevents matching pc(i,j+1,k)-pc(i-1,j+1,k) in the string (pc(i,j+1,k)-pc(i-1,j+1,k))/xdiff(i,j,k) as one match, rather than two different matches.
) - Ends the capture group.
\) - Same as \(, but matches a closing parenthesis.
The closing parenthesis can be replaced with ,1) as you mentioned. Everything besides the closing parenthesis is captured. The first capture group is usually referenced in a replace string using $1 or \1. So something like $1,1) should replace the closing parenthesis.
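For anyone who prefers to script the change rather than use the editor's replace box, the capture-group approach can be sketched in Python like this. The function name add_fourth_index is hypothetical, and the sketch assumes that no pc(...) reference contains nested parentheses, since the lazy match stops at the first closing parenthesis.

```python
import re


def add_fourth_index(code):
    # (pc\(.*?) lazily captures "pc(" up to (not including) the first ")".
    # \1 puts the captured text back and ",1)" replaces the closing parenthesis.
    return re.sub(r'(pc\(.*?)\)', r'\1,1)', code)


excerpt = 'FNy(i,j+1,k) *((pc(i,j+1,k)-pc(i-1,j+1,k))/xdiff(i,j,k)'
print(add_fourth_index(excerpt))
```

Calls to other arrays such as FNy(...) and xdiff(...) are untouched because the pattern is anchored on the literal pc(.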
Hope this will help you a bit!
From your question, it seems that you want to find every pattern of the form , k plus or minus a number, followed by ), so that ,k+1), ,k-1) and ,k) are all found and replaced.
I wrote a regular expression that should do this, although it is not perfect.
It looks like this:
import re

s = 'PpPx_ey = 0.5*( FNy(i,j+1,k) *((pc(i,j+1,k)-pc(i-1,j+1,k))/xdiff(i,j,k)+(pc(i+1,j+1,k)-pc(i,j+1,k))/xdiff(i+1, j,k) )+(1.-FNy(i,j+1,k))*((pc(i,j, k)-pc(i-1,j, k))/xdiff(i,j,k)+(pc(i+1,j ,k)-pc(i,j ,k))/xdiff(i+1,j,k)) )'
# Raw string so that the backslashes reach the regex engine unchanged.
com = re.compile(r',\s*k\s*[\+\-]*\s*\d*\s*\)')

# List every match, then print each match with its start offset.
print(com.findall(s))
for m in com.finditer(s):
    print(m.start(), m.group())

# Replace each matched ", k...)" tail with ", 1)".
str_replaced = com.sub(', 1)', s)
print(str_replaced)
The key regular expression is ,\s*k\s*[\+\-]*\s*\d*\s*\). It is not perfect because it also matches a string like ,k+), which probably should not be matched and may not even occur in your code.
The expression ,\s*k\s*[\+\-]*\s*\d*\s*\) means: match a , then optional spaces or tabs, then the letter k, then optional whitespace, then zero or more + or - signs, then optional whitespace, then zero or more digits, then optional whitespace, then the closing parenthesis ).
Check if this will help you.

Regular expression to find number in parentheses, but only at beginning of string

Disclaimer: I'm new to writing regular expressions, so the only problem may be my lack of experience.
I'm trying to write a regular expression that will find numbers inside of parentheses, and I want both the numbers and the parentheses to be included in the selection. However, I only want it to match if it's at the beginning of a string. So in the text below, I would want it to get (10), but not (2) or (Figure 50).
(10) Joystick Switch - Contains control switches (Figure 50)
Two (2) heavy lifting straps
So far, I have (\(\d+\)) which gets (10) but also (2). I know ^ is supposed to match the beginning of a string (or line), but I haven't been able to get it to work. I've looked at a lot of similar questions, both here and on other sites, but have only found parts of solutions (finding things inside of parentheses, finding just numbers at the beginning for a string, etc.) and haven't quite been able to put them together to work.
I'm using this to create a filter in a CAT tool (for those of you in translation) which means that there's no other coding languages involved; essentially, I've been using RegExr to test all of the other expressions I've written, and that's worked fine.
The regex should be
^\(\d+\)
^ Anchors the regex at the start of the string.
\( Matches (. Should be escaped as it has got special meaning in regex
\d+ Matches one or more digits
\) Matches the )
Capturing parentheses as in (\(\d+\)) are not necessary here, because the group would contain exactly the whole match. They are required only when you need to extract part of a matched pattern.
For example, if you want to match (50) but extract only the digits, 50, you can use
\((\d+)\)
Here the \d+ part falls within capture group 1. That is, capture group 1 will be 50, whereas the entire matched string is (50).
Like so:
^\(\d+\)
^ anchor
( and ) are both regex metacharacters, so they need to be escaped with \
So \( and \) match literal parentheses.
Unescaped ( and ) capture.
\d+ match 1 or more digits
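The question is about a CAT tool filter, but the behaviour of the anchor is easy to check in Python. This is only a sketch: the names are illustrative, and re.MULTILINE is an assumption that makes ^ match at the start of every line rather than only at the start of the whole text.

```python
import re

text = ("(10) Joystick Switch - Contains control switches (Figure 50)\n"
        "Two (2) heavy lifting straps")

# ^ anchors the match, so only a parenthesised number that opens a line counts.
LEADING_NUMBER = re.compile(r'^\(\d+\)', re.MULTILINE)


def leading_numbers(s):
    """Return every "(digits)" found at the start of a line."""
    return LEADING_NUMBER.findall(s)


def inner_digits(s):
    """Return just the digits of the first "(digits)" via capture group 1."""
    match = re.search(r'\((\d+)\)', s)
    return match.group(1) if match else None


print(leading_numbers(text))
print(inner_digits('(50)'))
```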

Regex to match one or two quotes but not three in a row

For the life of me I can't figure this one out.
I need to search the following text, matching only the quotes in bold:
Don't match: """This is a python docstring"""
Match: " This is a regular string "
Match: "" ← That is an empty string
How can I do this with a regular expression?
Here's what I've tried:
Doesn't work:
(?!"")"(?<!"")
Close, but doesn't match double quotes.
Doesn't work:
"(?<!""")|(?!"")"(?<!"")|(?!""")"
I naively thought that I could add the alternates that I don't want but the logic ends up reversed. This one matches everything because all quotes match at least one of the alternates.
(Please note: I'm not running the code, so solutions around using __doc__ won't help, I'm just trying to find and replace in my code editor.)
You can use /(?<!")"{1,2}(?!")/
Autopsy:
(?<!") a negative look-behind for the literal ". The match cannot have this character in front
"{1,2} the literal " matched once or twice
(?!") a negative look-ahead for the literal ". The match cannot have this character after
Your first try might've failed because (?!") is a negative look-ahead, and (?<!") is a negative look-behind. It makes no sense to have look-aheads before your match, or look-behinds after your match.
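A quick way to check the pattern against the three sample lines is a small Python sketch like the one below (the slashes in /.../ are just delimiters and are dropped; the names are made up for illustration).

```python
import re

# One or two quotes, with no quote immediately before or after the run.
ONE_OR_TWO_QUOTES = re.compile(r'(?<!")"{1,2}(?!")')


def quote_runs(line):
    """Return every run of one or two quotes that the pattern matches."""
    return ONE_OR_TWO_QUOTES.findall(line)


for sample in ['"""This is a python docstring"""',
               '" This is a regular string "',
               '"" is an empty string']:
    print(sample, '->', quote_runs(sample))
```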
I realized that my original problem description was actually slightly wrong. That is, I need to match only a single quote character at a time, unless it's part of a group of 3 quote characters.
The difference is that this is desirable for editing so that I can find and replace with '. If I match "one or two quotes" then I can't automatically replace with a single character.
I came up with this modification to h20000000's answer that satisfies that case:
(?<!"")(?<=(?!""").)"(?!"")
In the demo, you can see that the "" are matched individually, instead of as a group.
This works very similarly to the other answer, except that it only matches a single ".
That leaves us matching everything we want, except that it still matches the middle quote of a """.
Finally, adding (?<=(?!""").) excludes that case specifically, by saying: look back one character, then fail the match if the next three characters are """.
I decided not to change the question because I don't want to hijack the answer, but I think this can be a useful addition.
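As a sketch of the find-and-replace this enables, here is the same pattern driven from Python (names are hypothetical). One caveat, assuming Python's re semantics: the (?<=...) part needs one character before the quote, so a quote that is the very first character of the text (or of a line, since . does not match a newline by default) is not matched.

```python
import re

# Matches one quote character at a time, skipping all three quotes of a """.
SINGLE_QUOTE_CHAR = re.compile(r'(?<!"")(?<=(?!""").)"(?!"")')


def to_single_quotes(line):
    """Replace each matched double quote with a single quote."""
    return SINGLE_QUOTE_CHAR.sub("'", line)


print(to_single_quotes('x = "" + " a "'))
print(to_single_quotes('s = """doc"""'))
```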

regex - Removing text from around numbers in Notepad++

I have a large subset of data that looks like this:
MyApp.Whatever\app.config(115): More stuff here, but possibly with numbers or parenthesis...
I'd like to create a replace filter in Notepad++ that finds the line number "(115):" and replaces it with a tab character followed by the same number.
I've been trying filters such as (\(\d+\):) and (\(\[0-9]+\):), but they keep returning the entire value in the \1 output.
How would I create a filter using Notepad++ that would successfully replace (115): with tab character + 115?
Use a lazy quantifier: (\(\d+?\):), where the ? prevents \d+ from being greedy. Also, since everything is inside ( ), it is all grouped together and treated as \1.
If it was in perl I'd say \((\d+?)\): which should match only the inner part.
Edit:
Just talked with my colleague - he said s/\((\d+)\)/\t\1/ and if you needed app config in front you could just put that in the front.
this should work for your needs
replace
\((\d+)\):
with
\t$1
Replacing (\(\d+\):) with \t\1 will keep the parentheses and the colon since you've included them in the group (the outer parentheses), and I think that's what you mean by "they keep returning the entire value."
Instead of escaping those inner parentheses, escape the outer ones like the other answers have suggested: \((\d+)\): - this says to match a left paren, then match and capture a group of digits, then match a right paren and a colon. Replacing that with \t\1 will get rid of the parens and colon that were not in the captured group.
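The effect of moving the capture group can be checked outside Notepad++ with a short Python sketch (the function name is hypothetical). Only the digits are inside the group, so the parentheses and the colon are dropped by the replacement.

```python
import re


def line_number_to_tab(line):
    # \( and \) match the literal parentheses; (\d+) captures only the digits.
    # The replacement is a real tab character followed by capture group 1.
    return re.sub(r'\((\d+)\):', '\t' + r'\1', line)


sample = r'MyApp.Whatever\app.config(115): More stuff (here) 12'
print(repr(line_number_to_tab(sample)))
```

Later parentheses or numbers on the same line are left alone because they are not followed by a colon.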