I'm trying to learn regular expressions to speed up editing my program.
My program has hundreds of references to the 3-dimensional array pc. For example, the array elements might be referred to as pc(i+1,j+1,k), pc(i,j+1,k-1) or pc(i,j,k). I need a regular expression to search for the ending parenthesis so that I can replace it with ",1)". For example, the end goal is to convert pc(i,j,k) to pc(i,j,k,1).
I don't need the regular expression to do the actual replacing -- I don't even know if that's possible -- I just need it to find the ending parenthesis so I can replace it.
Any help or hints would be much appreciated!
Here's an excerpt of the code I would be searching through:
PpPx_ey = 0.5*( FNy(i,j+1,k) *((pc(i,j+1,k)-pc(i-1,j+1,k))/xdiff(i,j,k)+(pc(i+1,j+1,k)-pc(i,j+1,k))/xdiff(i+1,j,k) )+(1.-FNy(i,j+1,k))*((pc(i,j, k)-pc(i-1,j, k))/xdiff(i,j,k)+(pc(i+1,j ,k)-pc(i,j ,k))/xdiff(i+1,j,k)) ).
To further clarify: I'm using the Atom notepad, which allows for regular expressions in the CTRL-F command. I want to use the 'replace' option for things that I CTRL-F, but I need to use a literal string for that part. Thus if I can find the ending ")" in anything that looks like pc( ) using a regular expression, I can replace it with ",1)".
Pretty simple, actually.
This should do it for you:
pc\(.*\)
pc = literally pc
\( = escaped (
.* = anything
\) = escaped )
(pc\(.*?)\)
( - Begins a capture group.
pc - This will match the literal pc
\( - Matches the opening parenthesis. The backslash escapes the
parenthesis, so that it isn't interpreted as the beginning of a
capture group.
.*? - Will lazily match anything. . will match any single
character. * is a quantifier that matches any number (including
zero) of the preceding element, the . in this case. ? causes the
preceding quantifier to be lazy, meaning that it will match the
minimum number of characters possible. This is what prevents matching
pc(i,j+1,k)-pc(i-1,j+1,k) in the string
(pc(i,j+1,k)-pc(i-1,j+1,k))/xdiff(i,j,k) as one match, rather than
two different matches.
) - Ends the capture group.
\) - Same as \(, but matches a closing parenthesis.
The closing parenthesis can be replaced with ,1) as you mentioned. Everything before the closing parenthesis is captured. The first capture group is usually referenced in a replace string using $1 or \1. So a replacement like $1,1) should do what you want.
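If you want to try this outside the editor, here is a minimal sketch in Python (my choice of language, the question itself only uses Atom). It assumes that no pc(...) reference contains nested parentheses, because the lazy .*? stops at the first closing parenthesis it finds. The variable names are only for illustration.

```python
import re

line = "((pc(i,j+1,k)-pc(i-1,j+1,k))/xdiff(i,j,k)"

# Greedy: .* runs to the last ")" on the line, so there is one oversized match.
greedy_matches = re.findall(r"pc\(.*\)", line)

# Lazy: .*? stops at the first ")" after each "pc(", one match per reference.
lazy_matches = re.findall(r"pc\(.*?\)", line)

# Group 1 holds everything before the closing parenthesis; \1 puts it back.
converted = re.sub(r"(pc\(.*?)\)", r"\1,1)", line)

print(greedy_matches)
print(lazy_matches)
print(converted)
```

Note that pc\( would also match the tail of a longer name that ends in pc; if your code has such names, starting the pattern with \b (a word boundary) is one way to exclude them.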
Hope this will help you a bit!
According to your question, it seems that you want to find every pattern that looks like ", k + or - a number)", so ",k+1)", ",k-1)" and ",k)" should all be found and replaced.
I wrote a regular expression that should do what you need, although it's not perfect.
It looks like this:
import re

s = 'PpPx_ey = 0.5*( FNy(i,j+1,k) *((pc(i,j+1,k)-pc(i-1,j+1,k))/xdiff(i,j,k)+(pc(i+1,j+1,k)-pc(i,j+1,k))/xdiff(i+1, j,k) )+(1.-FNy(i,j+1,k))*((pc(i,j, k)-pc(i-1,j, k))/xdiff(i,j,k)+(pc(i+1,j ,k)-pc(i,j ,k))/xdiff(i+1,j,k)) )'

# Comma, k, optional +/- signs and digits, closing parenthesis (whitespace allowed between).
com = re.compile(r',\s*k\s*[\+\-]*\s*\d*\s*\)')
print(com.findall(s))
# Show where each match starts and what it matched.
for m in com.finditer(s):
    print(m.start(), m.group())
# Replace every match with ", 1)".
str_replaced = com.sub(', 1)', s)
print(str_replaced)
The key regular expression is ,\s*k\s*[\+\-]*\s*\d*\s*\). It is not perfect because it will also match a string like ,k+), which probably doesn't need to be found and may not even exist in your code.
The expression ,\s*k\s*[\+\-]*\s*\d*\s*\) means: match a comma, then optional blanks or tabs, then the letter k, then optional whitespace, then zero or more + or - signs, then optional whitespace, then optional digits, then optional whitespace, then the closing parenthesis ).
Check whether this helps you.
If it matters, I'm working with Python/R for this particular script, but I think this should be a general regex question.
I have something along the format of
"_id" : ObjectID("34z83b3853e820x583203"),
This happens millions of times in a particular file. I want to convert all of these to
"_id" : "34z83b3853e820x583203",
The catch is, I can't just replace any "), with ", as there may be other instances in the file.
Replacing ObjectID(" with " should be trivial.
So essentially, I have to find where there is 15+ character AND numbers mixed, immediately followed by "),
Once found, I need to preserve that string, and just delete the ).
Is there a good way to go about this that I'm missing? Finding an expression and preserving pieces of it?
My initial impression was to use a lookbehind
(?<=[a-zA-Z0-9]{15,}")\)
In hopes that this would look for a ) that is preceded by a string of 15+ alphanumeric characters, however
1) I do not believe this means it has to be alpha AND numeric, just alpha or numeric or both.
2) It's not catching the desired parenthesis regardless.
You can do both steps together (replacing opening ( and closing parentheses ))
Regex: ObjectID\((\"[a-zA-Z0-9]{15,}\")\)
(\"[a-zA-Z0-9]{15,}\") is the first capturing group; it includes the quotes and the alphanumeric characters between them, which must number 15 or more, as you mentioned. Since this is the first capturing group, it is referenced as $1
ObjectID\( is the literal ObjectID followed by the opening parenthesis \(
\) is the closing parenthesis at the end
Replace with: $1
Hope this helps!
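Since you mentioned Python, here is a small sketch of the same replacement with the re module. Python writes the backreference as \1 rather than $1; the sample line is taken from the question, and the variable names are made up for the example.

```python
import re

line = '"_id" : ObjectID("34z83b3853e820x583203"),'

# Group 1 keeps the quoted ID; the ObjectID( wrapper and closing ) are dropped.
pattern = re.compile(r'ObjectID\(("[a-zA-Z0-9]{15,}")\)')
cleaned = pattern.sub(r"\1", line)
print(cleaned)

# An ID shorter than 15 characters does not match, so the text is left alone.
untouched = pattern.sub(r"\1", 'ObjectID("abc123")')
print(untouched)
```

As noted in the question, [a-zA-Z0-9]{15,} accepts letters, digits, or any mix of the two; it does not require both.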
Let's say I have a string in which I wanted to parse from an opening double-quote to a closing double-quote:
asdf"pass\"word"asdf
I was lucky enough to discover that the following PCRE would match from the opening double-quote to the closing double-quote while ignoring the escaped double-quote in the middle (to properly parse the logical unit):
".*?(?:(?!\\").)"
Match:
"pass\"word"
However, I have no idea why this PCRE matches the opening and closing double-quote properly.
I know the following:
" = literal double-quote
.*? = lazy matching of zero or more of any character
(?: = opening of non-capturing group
(?!\\") = asserts it's impossible to match literal \" at this position
. = single character
) = closing of non-capturing group
" = literal double-quote
It appears that a single character and a negative lookahead are a part of the same logical group. To me, this means the PCRE is saying "Match from a double-quote to zero or more of any character as long as there is no \" right after the character, then match one more character and one single double quote."
However, according to that logic the PCRE would not match the string at all.
Could someone help me wrap my head around this?
It's easier to understand if you change the non-capture group to be a capture group.
Lazy matching generally moves forward one character at a time (vs. greedy matching everything it can and then giving up what it must). But it "moves forward" as far as satisfying the required parts of the pattern after it, which is accomplished by letting the .*? match everything up to r, then letting the negative lookahead + . match the d.
Update: you asked in comment:
how come it matches up to the r at all? shouldn't the negative
lookahead prevent it from getting passed the \" in the string? thanks
for helpin me understand, by the way
No, because it is not the negative lookahead stuff that is matching it. That is why I suggested you change the non-captured group into a captured group, so that you can see it is .*? that matches the \", not (?:(?!\\").)
.*? has the potential to match the entire string, and the regex engine uses that to satisfy the requirement to match the rest of the pattern.
Update 2:
It is effectively the same as doing this: ".*?[^\\]" which is probably a lot easier to wrap your head around.
A (slightly) better pattern would be to use a negative lookbehind like so: ".*?(?<!\\)" because it will allow for an empty string "" to be matched (a valid match in many contexts), but negative lookbehinds aren't supported in all engines/languages (from your tags, pcre supports it, but I don't think you can really do this in bash except e.g. grep -P '[pattern]' .. which basically runs it through perl).
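To see this for yourself, here is a minimal sketch using Python's re module (an assumption on my part, since the question is about PCRE; Python supports the same lookahead and fixed-width lookbehind syntax). The non-capturing group is turned into a capture group and .*? is wrapped in one as well, so each part can be printed.

```python
import re

text = r'asdf"pass\"word"asdf'

# Group 1 is what .*? matched; group 2 is what the lookahead + . matched.
grouped = re.search(r'"(.*?)((?:(?!\\").))"', text)
print(grouped.group(0))
print(grouped.group(1))
print(grouped.group(2))

# The two alternatives discussed above, tried on an empty quoted string.
char_class = re.compile(r'".*?[^\\]"')
lookbehind = re.compile(r'".*?(?<!\\)"')
print(char_class.search('""'))
print(lookbehind.search('""').group(0))
```

Group 1 ends up holding everything up to the r, including the \", and group 2 holds only the d. The character-class version finds nothing in "" because it needs at least one character between the quotes, while the lookbehind version matches it.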
Nothing to add to Crayon Violent's explanation, only a little disambiguation and some ways to match substrings enclosed in double quotes (possibly containing quotes escaped with a backslash).
First, in your question you seem to use the acronym "PCRE" (Perl Compatible Regular Expressions), which is the name of a particular regex engine (and, by extension and somewhat imprecisely, of its syntax), in place of the word "pattern", which is the regular expression that describes a set of strings (whatever regex engine is used).
With Bash:
A='asdf"pass\"word"asdf'
pattern='"(([^"\\]|\\.)*)"'
[[ $A =~ $pattern ]]
echo ${BASH_REMATCH[1]}
You can use this pattern too: pattern='"(([^"\\]+|\\.)*)"'
With a PCRE regex engine, you can use the first pattern, but it's better to rewrite it in a more efficient way:
"([^"\\]*+(?:\\.[^"\\]*)*+)"
Note that these three patterns don't need any lookaround. They are able to deal with any number of consecutive backslashes: "abc\\\"def" (a literal backslash and an escaped quote), "abcdef\\\\" (two literal backslashes, the quote is not escaped).
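The same idea can be sketched in Python, which is my own choice of language for the illustration. It uses the first (alternation) pattern with a non-capturing inner group, because possessive quantifiers such as *+ are only accepted by Python's re module from version 3.11 on. The sample strings around the quoted parts are invented for the test.

```python
import re

# Inside the quotes: either a character that is neither a quote nor a
# backslash, or a backslash followed by any single character.
quoted = re.compile(r'"((?:[^"\\]|\\.)*)"')

samples = [
    r'asdf"pass\"word"asdf',
    r'x "abc\\\"def" y',
    r'x "abcdef\\\\" "next"',
]
contents = [quoted.search(s).group(1) for s in samples]
for content in contents:
    print(content)
```

Each backslash is consumed together with the character after it, so an even run of backslashes before a quote leaves that quote free to close the string.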
Disclaimer: I'm new to writing regular expressions, so the only problem may be my lack of experience.
I'm trying to write a regular expression that will find numbers inside of parentheses, and I want both the numbers and the parentheses to be included in the selection. However, I only want it to match if it's at the beginning of a string. So in the text below, I would want it to get (10), but not (2) or (Figure 50).
(10) Joystick Switch - Contains control switches (Figure 50)
Two (2) heavy lifting straps
So far, I have (\(\d+\)) which gets (10) but also (2). I know ^ is supposed to match the beginning of a string (or line), but I haven't been able to get it to work. I've looked at a lot of similar questions, both here and on other sites, but have only found parts of solutions (finding things inside of parentheses, finding just numbers at the beginning for a string, etc.) and haven't quite been able to put them together to work.
I'm using this to create a filter in a CAT tool (for those of you in translation) which means that there's no other coding languages involved; essentially, I've been using RegExr to test all of the other expressions I've written, and that's worked fine.
The regex should be
^\(\d+\)
^ Anchors the regex at the start of the string.
\( Matches (. Should be escaped as it has got special meaning in regex
\d+ Matches one or more digits
\) Matches the )
Capturing brackets like (\(\d+\)) are not necessary, as the pattern matches no other characters. They are required only when you need to extract parts of a matched pattern.
For example, if you want to match (50) but extract only the digits, 50, then you can use
\((\d+)\)
Here the \d+ part falls within capture group 1. That is, capture group 1 will be 50, whereas the entire matched string is (50).
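As a quick check outside the CAT tool, here is a small Python sketch (Python is only an assumption for the demonstration) run on the two sample lines from the question. The MULTILINE flag makes ^ match at the start of every line rather than only at the start of the whole text.

```python
import re

text = (
    "(10) Joystick Switch - Contains control switches (Figure 50)\n"
    "Two (2) heavy lifting straps"
)

# Anchored: only a parenthesized number at the start of a line matches.
at_line_start = re.findall(r"^\(\d+\)", text, flags=re.MULTILINE)

# Unanchored with a capture group: returns just the digits of every match.
digits_only = re.findall(r"\((\d+)\)", text)

print(at_line_start)
print(digits_only)
```

Whether your filter treats each segment as its own string or needs a multiline option depends on the tool, so that part may need experimenting.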
Like so:
^\(\d+\)
^ anchor
( and ) are both regex metacharacters, so they need to be escaped with \
So \( and \) match literal parentheses.
Unescaped ( and ) capture.
\d+ match 1 or more digits
I have a regular expression looking for width=["|\']([^"]*)["|\']
It works great for width="750" and width='750'; however, it does not match width=750.
So I got as far as width=["|\']?([^"]*)["|\'] to make the first quote optional, but the match just continues on and does not return just 750.
If you are using a tool or language that supports backreferences, you should be able to use the following:
width=("|'|)(\S*)\1
This will try to match a single quote, double quote, or empty string with the first capture group, and then the \1 at the end will be whatever the first group captured. The value will always be the contents from the second capture group.
I also changed the [^"]* to \S* so this will match any number of non-whitespace characters. This is necessary to make sure that your match doesn't just go to the end of the string when there are no quotes around the value.
Example: http://rubular.com/r/Xg8ageZmgy
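Here is the same pattern tried in Python, as a rough sketch (the question doesn't say which language is in use, so Python is an assumption). The three inputs are the cases from the question.

```python
import re

# Group 1 is the opening quote (or nothing); \1 requires the same thing again
# after the value, which is captured in group 2.
pattern = re.compile(r"""width=("|'|)(\S*)\1""")

inputs = ['width="750"', "width='750'", "width=750"]
values = [pattern.search(text).group(2) for text in inputs]
print(values)
```

Because \S* stops at whitespace, this assumes the value itself contains no spaces.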
Character classes ([]) do not make use of | to mean or; they automatically or everything. You also don't have to escape the single quote (unless of course you're enclosing this whole expression in single quotes). You want:
["' ]?([^"' ]*)["' ]
Try this one:
width\s*=\s*(?:["\']([^"\']*)["\']|\S+)
I just added \S+ as an alternative to handle an unquoted value such as 750 after the equals sign. Also, you do not need to place | inside the character class [].
\s* means optional whitespace (zero or more times).
Which regular expression language are you using? Different languages have different details of syntax, so someone might give you an answer that works in their environment but not in yours.
For example, I copied your expression and tried it on some text in Emacs. It found a match in this text:
width=|750|
That's because Emacs regex doesn't use the '|' character to signify "either or" within the '[' and ']' brackets; it interprets it as just one more example of a character that the expression might match.
Also, it looks like your expression doesn't always stop after the 750 in this example:
width='750'
Instead, if there is a '"' character later in the input, it matches everything from the 750 up to that character. (It did the same thing with my earlier example in Emacs if there was a '"' later in the input.)
You will also match the 750 in this (note the mismatched quotation marks):
width='750"
Is that a problem, or is that an acceptable outcome?
While searching for something, I found an answered question on this site. Two of the answers contain
/([^.]*)\.(.*)/
in their text.
The question is located at Find & replace jquery. I'm a newbie in JavaScript, so I wonder, what does it mean? Thanks.
/([^.]*)\.(.*)/
Let us deconstruct it. The beginning and trailing slash are delimiters, and mark the start and end of the regular expression.
Then there is a parenthesized group: ([^.]*). The parentheses group part of the pattern together (and capture what it matches). The square brackets denote a character class, meaning that any one character listed inside is accepted in its place. However, this class is negated by its first character being ^, which reverses its meaning. Since the only character besides the negation is a period, this matches a single character that is not a period. After the square brackets is a * (asterisk), which means that the bracketed class can be matched zero or more times.
Then we get to the \.. This is an escaped period. Periods in regular expressions have special meaning (except when escaped or in a character group). This matches a literal period in the text.
(.*) is a new parenthesized sub-group. This time, the period matches any character, and the asterisk says it can be repeated as many times as needed.
In summary, the expression finds any sequence of characters that aren't periods, followed by a single period, followed by any sequence of characters.
Edit: Removed part about shortening, as it defeats the assumed purpose of the regular expression.
It's a regular expression (it matches non-periods, followed by a period followed by anything (think "file.ext")). And you should run, not walk, to learn about them. Explaining how this particular regular expression works isn't going to help you as you need to start simpler. So start with a regex tutorial and pick up Mastering Regular Expressions.
Original: /([^.]*)\.(.*)/
Split this as:
[1] ([^.]*) : It says match all characters except . [ period ]
[2] \. : match a period
[3] (.*) : matches any character
so it becomes
[1]Match all characters which are not . [ period ] [2] till you find a .[ period ] then [3] match all characters.
Anything except a dot, followed by a dot, followed by anything.
You can test regex'es on regexpal
It's a regular expression that roughly searches for a string that doesn't contain a period, followed by a period, and then a string containing any characters.
That is a regular expression. Regular expressions are powerful tools if you use them right.
That particular regex extracts filename and extension from a string that looks like "file.ext".
It's a regular expression that splits a string into two parts: everything before the first period, and then the remainder. Most regex engines (including the Javascript one) allow you to then access those parts of the string separately (using $1 to refer to the first part, and $2 for the second part).
This is a regular expression with some advanced use.
Consider a simpler version: /[^.]*\..*/ which is the same as above without parentheses. This will match just any string with at least one dot. When the parentheses are added, and a match happens, the variables \1 and \2 will contain the matched parts from the parentheses. The first one will have anything before the first dot. The second part will have everything after the first dot.
Examples:
input: foo...bar
\1: foo
\2: ..bar
input: .foobar
\1:
\2: foobar
This regular expression produces two captured groups that can be retrieved.
The two parts are the string before the first dot (which may be empty), and the string after the first dot (which may contain other dots).
The only restriction on the input is that it contains at least one dot. It will match "." contrary to some of the other answers, but the retrieved groups will be empty.
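The examples above can be checked with a short sketch. It is written in Python rather than JavaScript, purely as an illustration; the pattern itself is the same in both languages, and the sample inputs are the ones used in this answer plus "file.ext".

```python
import re

pattern = re.compile(r"([^.]*)\.(.*)")

samples = ["file.ext", "foo...bar", ".foobar", "."]
# groups() returns (text before the first dot, text after the first dot).
results = {s: pattern.match(s).groups() for s in samples}
for sample, groups in results.items():
    print(repr(sample), "->", groups)

# With no dot anywhere in the input, there is no match at all.
no_dot = pattern.match("nodot")
print(no_dot)
```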
IMO /.*\..*/g Would do the same thing.
const senExample = 'I am test. Food is good.';
const result1 = senExample.match(/([^.]*)\.(.*)/g);
console.log(result1); // ["I am test. Food is good."]
const result2 = senExample.match(/^.*\..*/g);
console.log(result2); // ["I am test. Food is good."]
outside square brackets, the . character matches any character except the line break characters \r and \n
inside square brackets, a leading ^ negates the character class, and the dot inside a class is just a literal period
the * means "zero or more times"
the parentheses group and capture,
the \ allows you to match a special character (like the dot or the star)
so this ([^.]*) means any character other than a period, repeated zero or more times
this (.*) part means any string of characters zero or more times (except the line breaks)
and the \. means a real dot
so the whole thing would match zero or more non-period characters, followed by a dot, followed by any number of characters (other than line breaks).
For more information and a really great reference on Regular Expressions check out: http://www.regular-expressions.info/reference.html
It's a regular expression, which basically is a pattern of characters that is used to describe another pattern of characters. I once used regexps to find an email address inside a text file, and they can be used to find pretty much any pattern of text within a larger body of text provided you write the regexp properly.