Currently I use this reg ex:
"\bI([ ]{1,2})([a-zA-Z]|\d){2,13}\b"
It was just brought to my attention that the text that I use this against could contain a "\" (backslash). How do I add this to the expression?
Add |\\ inside the group, after the \d for instance.
This expression could be simplified if you're also allowing the underscore character in the second capture register, and you are willing to use metacharacters. That changes this:
([a-zA-Z]|\d){2,13}
into this ...
([\w]{2,13})
and you can also add a test for the backslash character with this ...
([\w\x5c]{2,13})
which makes the regex just a tad easier to eyeball, depending on your personal preference.
"\bI([\x20]{1,2})([\w\x5c]{2,13})\b"
See also:
WP Metacharacter
Metacharacters
Shorthand character class
Both @slavy13 and @dreftymac give you the basic solution with pointers, but...
You can use \d inside a character class to mean a digit.
You don't need to put blank into a character class to match it (except, perhaps, for clarity, though that is debatable).
You can use [:alpha:] inside a character class to mean an alpha character, [:digit:] to mean a digit, and [:alnum:] to mean an alphanumeric (specifically not including underscore, unlike \w). Note that these character classes might mean more characters than you expect; think of accented characters and non-arabic digits, especially in Unicode.
If you want to capture the whole of the information after the space, you need the repetition inside the capturing parentheses.
Contrast the behaviour of these two one-liners:
perl -n -e 'print "$2\n" if m/\bI( {1,2})([a-zA-Z\d\\]){2,13}\b/'
perl -n -e 'print "$2\n" if m/\bI( {1,2})([a-zA-Z\d\\]{2,13})\b/'
Given the input line "I a123", the first prints "3" and the second prints "a123". Obviously, if all you wanted was the last character of the second part of the string, then the original expression is fine. However, that is unlikely to be the requirement. (Obviously, if you're only interested in the whole lot, then using '$&' gives you the matched text, but it has negative efficiency implications.)
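For readers without Perl to hand, the same contrast can be sketched in Python, whose re module also keeps only the last iteration of a repeated capturing group. The variable names here are illustrative only.

```python
import re

line = "I a123"

# Repetition OUTSIDE the parentheses: group 2 is overwritten on every
# iteration, so only the last matched character survives.
outside = re.search(r"\bI( {1,2})([a-zA-Z\d\\]){2,13}\b", line)

# Repetition INSIDE the parentheses: group 2 holds the whole run.
inside = re.search(r"\bI( {1,2})([a-zA-Z\d\\]{2,13})\b", line)

print(outside.group(2))  # 3
print(inside.group(2))   # a123
```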
I'd probably use this regex as it seems clearest to me:
m/\bI( {1,2})([[:alnum:]\\]{2,13})\b/
Time for the obligatory plug: read Jeff Friedl's "Mastering Regular Expressions".
As I pointed out in my comment to slavy's post, \\ -> \b as a backslash is not a word character. So my suggestion is
/\bI([ ]{1,2})([\p{IsAlnum}\\]{2,13})(?:[^\w\\]|$)/
I assumed that you wanted to capture the whole 2-13 characters, not just the first one that applies, so I adjusted my RE.
You can make the last capture a lookahead if the engine supports it and you don't want to consume it. That would look like:
/\bI([ ]{1,2})([\p{IsAlnum}\\]{2,13})(?=[^\w\\]|$)/
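A rough Python sketch of why the trailing \b matters when the value ends in a backslash. Python's re module has no \p{IsAlnum}, so this assumes the ASCII class [a-zA-Z\d\\] as a stand-in; the sample text is made up for illustration.

```python
import re

# The characters are: I, space, a, b, backslash, space, n, e, x, t
text = "I ab\\ next"

# With \b at the end, the engine has to give the backslash back, because
# there is no word boundary between a backslash and a space.
with_boundary = re.search(r"\bI([ ]{1,2})([a-zA-Z\d\\]{2,13})\b", text)

# With the lookahead, the trailing backslash stays in the capture.
with_lookahead = re.search(
    r"\bI([ ]{1,2})([a-zA-Z\d\\]{2,13})(?=[^\w\\]|$)", text
)

print(repr(with_boundary.group(2)))   # 'ab'
print(repr(with_lookahead.group(2)))  # 'ab\\'
```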
I am trying to replace all the words except the first 3 words from the String (using textpad).
Ex value: This is the string for testing.
I want to extract just 3 words: This is the from above string and remove all other words.
I figured out the regex to match the 3 words (\w+\s+){3} but I need to match all other words except the first 3 words and remove other words. Can someone help me with it?
Exactly how depends on the flavor, but to eliminate everything except the first three words, you can use:
^((?:\S+\s+){2}\S+).*
which captures the first three words into capturing group 1, as well as the rest of the string. For your replace string, you use a reference to capturing group 1. In C# it might look like:
resultString = Regex.Replace(subjectString, @"^((?:\S+\s+){2}\S+).*", "${1}", RegexOptions.Multiline);
EDIT: Added the start-of-line anchor to each regex, and added TextPad specific flags.
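For comparison, a minimal Python sketch of the same replace, assuming a single line of input; the function name first_three_words is made up for this example.

```python
import re

def first_three_words(line):
    # Group 1 captures the first three whitespace-separated words; the
    # trailing .* consumes the rest of the line so that it is dropped.
    return re.sub(r"^((?:\S+\s+){2}\S+).*", r"\1", line)

print(first_three_words("This is the string for testing."))  # This is the
```

A line with fewer than three words does not match and is returned unchanged.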
If you want to eliminate the first three words, and capture the rest,
^(?:\w+\s+){3}([^\n\r]+)$
?: changes the first three words to a non-capturing group, and captures everything after it.
Is this what you're looking for? I'm not totally clear on your question, or your goal.
As suggested, here's the opposite. Capture the first three words only, and discard the rest:
^(\w+\s+){3}(?:[^\n\r]+)$
Just move the ?: from the first to the second grouping.
As far as replacing that captured group, what do you want it replaced with? To replace each word individually, you'd have to capture each word individually:
^(\w+)\s+(\w+)\s+(\w+)\s+(?:[^\n\r]+)$
And then, for instance, you could replace each with its first letter capitalized:
Replace with: \u$1 \u$2 \u$3
Result is This Is The
In TextPad, lowercase \u in the replacement means change only the next letter. Uppercase \U changes everything after it (until the next capitalization flag).
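Python's re.sub replacement templates have no \u flag, so a rough equivalent needs a replacement function instead. The helper names below are hypothetical; the pattern is the one shown above.

```python
import re

PATTERN = re.compile(r"^(\w+)\s+(\w+)\s+(\w+)\s+(?:[^\n\r]+)$")

def upper_first(word):
    # Like \u: change only the first letter, leave the rest untouched.
    return word[:1].upper() + word[1:]

def capitalize_first_three(line):
    # The whole line is matched, then replaced by the three captured words.
    return PATTERN.sub(
        lambda m: " ".join(upper_first(w) for w in m.groups()), line
    )

print(capitalize_first_three("this is the string for testing."))  # This Is The
```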
Try it:
http://fiddle.re/f3hgv
(press on [Java] or whatever language is most relevant. Note that \u is not supported by RegexPlanet.)
Coming from a duplicate question, I'll post a solution which works for "traditional" regex implementations which do not support the Perl extensions \s, \W, etc. Newcomers who are not familiar even with the fact that there are different dialects (aka flavors) of regular expressions are advised to read e.g. Why are there so many different regular expression dialects?
If you have POSIX class support, you can use [[:alpha:]] for \w, [^[:alpha:]] for \W, [[:space:]] for \s, etc. But if we suppose that whitespace will always be a space and you want to extract the first three tokens between spaces, you don't really need even that.
[^ ]+[ ]+[^ ]+[ ]+[^ ]+
matches three tokens separated by runs of spaces. (I put the spaces in brackets to make them stand out, and easy to extend if you want to include other characters than just a single regular ASCII space in the token separator set. For example, if your regex dialect accepts \t for tab, or you are able to paste a regular tab in its place, you could extend this to
[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+
In most shells, you can type a literal tab with ctrl+v tab, i.e. prefix it with an escape code, which is often typed by holding down the ctrl key and typing v.)
To actually use this, you might want to do
grep -Eo '[^ ]+[ ]+[^ ]+[ ]+[^ ]+' file
where the single quotes are necessary to protect the regex from the shell (double quotes would work here, too, but are weaker, or backslashing every character in the regex which has a significance to the shell as a metacharacter) or perhaps
sed -r 's/([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/' file
to replace every line with just the captured expression (the parentheses make a capturing group, which you can refer back to with \1 in the replacement part in the s command in sed). The -r option selects a slightly more featureful regex dialect than the bare-bones traditional sed; if your sed doesn't have it, try -E, or put a backslash before each parenthesis and plus sign.
Because of the way regular expressions work, the first three is easy because a regular expression engine will always return the first possible match on a line. If you want three tokens starting from the second, you have to put in a skip expression. Adapting the sed script above, that would be
sed -r 's/[^ ]+[ ]+([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/'
where you'll notice how I put in a token+non-token group before the capture. (This is not really possible with grep -o unless you have grep -P in which case the full gamut of Perl extensions is available to you anyway.)
If your regex dialect supports {m,n} repetition, you can of course refactor the regex to use that. If you need a large number of repetitions, it's certainly both more readable and more maintainable. Just make sure you don't add parentheses where you break up the backreference order (the first left parenthesis creates the first group \1, the second \2, etc.)
sed -r 's/([^ ]+([ ]+[^ ]+){2}).*/\1/' file
Notice how the second parenthesized group is necessary to specify the scope of the {2} repetition (we want to repeat more than just the single character immediately before the left curly brace). The OP's attempt had an error where the repetition was specified outside of the last parenthesis; then, the back reference \1 (or whatever it's called in your dialect -- TextMate seems to use $1, just like Perl) will refer to the last single match of the capturing parentheses, because the repetition is not part of the capture, being outside the capturing parentheses.
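The sed commands above translate almost directly to Python, which makes the capture-placement pitfall easy to see. This is only a sketch; the sample line and variable names are made up.

```python
import re

line = "one two three four five"

# Repetition inside the outer capture: \1 is the first three tokens.
first_three = re.sub(r"([^ ]+([ ]+[^ ]+){2}).*", r"\1", line)

# Skip one token, then capture the next three.
second_to_fourth = re.sub(
    r"[^ ]+[ ]+([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*", r"\1", line
)

# Repetition outside the capture: group 1 holds only the LAST iteration.
last_only = re.match(r"([^ ]+[ ]+){3}", line).group(1)

print(first_three)       # one two three
print(second_to_fourth)  # two three four
print(repr(last_only))   # 'three '
```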
I have a file which have the data something like this
34sdf, 434ssdf, 43fef,
34sdf, 434ssdf, 43fef, sdfsfs,
I have to identify the sdfsfs, and replace it and/or print the line.
The exact condition is the tokens are comma separated. target expression starts with a non numeric character, and till a comma is met.
Now i start with [^0-9] for starting with a non numeric character, but the next character is really unknown to me, it can be a number, a special char, an alphabet or even a space. So I wanted a (anything)*. But the previous [] comes into play and spoils it. [^0-9]* or [^0-9].*, or [^0-9]\+.*, or [^0-9]{1}*, or [^0-9][^,]* or [^0-9]{1}[^\,]*, nothing worked till now. So my question is how to write a regex for this (starting character a non numeric, then any character except a comma or any number of character till comma) I am using grep and sed (gnu). Another question is for posix or non-posix, any difference comes there?
Something like that maybe?
(?:(?:^(\D.*?))|(?:,\s(\D.*?))),
This captures the string that starts with a non-numeric character. Tested here.
I'm not sure if sed supports \D, but you can easily replace it with [^0-9] if not, which you already know.
EDIT: Can be trimmed to:
(?:\s|^)(\D.*?),
With sed, and slight modifications to your last regex:
sed -n 's/.*,[ ]*\([^ 0-9][^\,]*\),/\1/p' input
I think pattern (\s|^)(\D[^,]+), will catch it.
It matches white-space or start of string and group of a non-digit followed by anything but comma, which is followed by comma.
You can use [^0-9] if \D is not supported.
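A quick way to check this kind of pattern against the sample data is a small Python sketch. It uses [^0-9] in place of \D and [^,]* for the rest of the token; the function name is hypothetical.

```python
import re

# A token starts at the beginning of the line or after whitespace, begins
# with a non-digit, and runs up to (not including) the next comma.
TOKEN = re.compile(r"(?:\s|^)([^0-9][^,]*),")

def non_numeric_tokens(line):
    return TOKEN.findall(line)

print(non_numeric_tokens("34sdf, 434ssdf, 43fef,"))          # []
print(non_numeric_tokens("34sdf, 434ssdf, 43fef, sdfsfs,"))  # ['sdfsfs']
```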
This might work for you (GNU sed):
sed '/\b[^0-9,][^,]*/!d' file # only print lines that match
or:
sed -n 's/\b[^0-9,][^,]*/XXX/gp' file # substitute `XXX` for match
How do I say "is not" a certain character in sed?
[^x]
This is a character class that accepts any character except x.
For those not satisfied with the selected answer as per johnny's comment.
'su[^x]' will match 'sum' and 'sun' but not 'su'.
You can tell sed to not match lines with x using the syntax below:
sed '/x/! s/su//' file
See kkeller's answer for another example.
There are two possible interpretations of your question. Like others have already pointed out, [^x] matches a single character which is not x. But an empty string also isn't x, so perhaps you are looking for [^x]\|^$.
Neither of these answers extend to multi-character sequences, which is usually what people are looking for. You could painstakingly build something like
[^s]\|s\($\|[^t]\|t\($\|[^r]\)\)\)
to compose a regular expression which doesn't match str, but a much more straightforward solution in sed is to delete any line which does match str, then keep the rest;
sed '/str/d' file
Perl 5 introduced a much richer regex engine, which is hence standard in Java, PHP, Python, etc. Because Perl helpfully supports a subset of sed syntax, you could probably convert a simple sed script to Perl to get to use a useful feature from this extended regex dialect, such as negative assertions:
perl -pe 's/(?:(?!str).)+/not/' file
will replace a string which is not str with not. The (?:...) is a non-capturing group (unlike in many sed dialects, an unescaped parenthesis is a metacharacter in Perl) and (?!str) is a negative assertion; the text immediately after this position in the string mustn't be str in order for the regex to match. The + repeats this pattern until it fails to match. Notice how the assertion needs to be true at every position in the match, so we match one character at a time with . (newbies often get this wrong, and erroneously only assert at e.g. the beginning of a longer pattern, which could however match str somewhere within, leading to a "leak").
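The same negative assertion works in Python's re module. The sketch below shows the per-character version next to the "leaky" version that only asserts at the start, plus a line filter in the spirit of sed '/str/d'. The sample data is made up.

```python
import re

text = "abcstrdef"

# Assert at every position: the match stops right before "str".
# count=1 mirrors Perl's s/// without the /g flag.
tempered = re.sub(r"(?:(?!str).)+", "not", text, count=1)

# Assert only once at the start: the match runs straight through "str".
leaky = re.match(r"(?!str).+", text).group(0)

# Equivalent of deleting every line that contains "str".
lines = ["keep me", "a str here", "also fine"]
kept = [l for l in lines if not re.search(r"str", l)]

print(tempered)  # notstrdef
print(leaky)     # abcstrdef
print(kept)      # ['keep me', 'also fine']
```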
From my own experience, and the below post supports this, sed doesn't support normal regex negation using "^". I don't think sed has a direct negation method...but if you check the below post, you'll see some workarounds.
Sed regex and substring negation
In addition to all the provided answers, you can negate a character class in sed using the notation [^[:C_CLASS:]]; for example, [^[:blank:]] will match any character that is not a blank (space or tab).
Thanks for the previous assistance everyone!. I have a query regarding RegExp in Perl
My issue is..
I know, when matching you can write m// or // or ## (must include m or s if you use this). What is causing me the confusion is a book example on escaping characters I have. I believe most people escape lots of characters, as a sure fire way of the program working without missing a metacharacter something ie: \# when looking to match # say in an email address.
Here's my issue and I know what this script does:
$date= "15/12/99";
$date=~ s#(\d+)/(\d+)/(\d+)#$1/$2/$3#; << why are no forward slashes escaped??
print($date);
Yet the later example I have, shows it rewritten, as (which i also understand and they're escaped)
$date =~ s/(\d+)\/(\d+)\/(\d+)/$2\/$1\/$3/; <<<<which is escaping the forward slashes.
I know the slashes or hashes are programmer preference and their use. What I don't understand is why the second example, escapes the slashes, yet the first doesn't - I have tried and they work both ways. No escaping slashes with hashes? What's even MORE confusing is, looking at yet another book example I also have earlier to this one, using hashes again, they too escape the # symbol.
if ($address =~ m#\##) { print("That's an email address"); } or something similar
So what do you escape from what you don't using hashes or slashes? I know you have to escape metacharacters to match them but I'm confused.
When you build a regexp, you define a character as a delimiter for your regexp i.e. doing // or ##.
If you need to use that character inside your regexp, you will need to escape it so that the regexp engine does not see it as the end of the regexp.
If you build your regexp between forward slashes /, you will need to escape the forward slashes contained in your regexp, hence the escaping in your second example.
Of course, the same rule applies to any character you use as a regexp delimiter, not just forward slashes.
The forward slashes are not meta characters in themselves - only the use of them in the second example as expression separators makes them "special".
The format of a substitute expression is:
s<expression separator char><expression to look for><expression separator char><expression to replace with><expression separator char>
In the first example, using a hash as the first character after the =~ s, makes that character the expression separator, so forward slash is not special and does not require any escaping.
In the second example, the expression separator is indeed the forward slash, so it must be escaped within the expressions themselves.
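One way to convince yourself that the forward slash is not a metacharacter in itself is to try the same substitution in a language with no regex delimiters at all. In Python the pattern is an ordinary string, so this sketch needs no escaping of the slash; escaping it anyway is harmless.

```python
import re

date = "15/12/99"

# No delimiter, so "/" is just a literal character in the pattern.
swapped = re.sub(r"(\d+)/(\d+)/(\d+)", r"\2/\1/\3", date)

# Escaping the slash is allowed but changes nothing.
escaped = re.sub(r"(\d+)\/(\d+)\/(\d+)", r"\2/\1/\3", date)

print(swapped)  # 12/15/99
print(escaped)  # 12/15/99
```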
The regex match operator allows you to define a custom non-whitespace character as the separator.
In your first example the '#' is used as the separator, so in this regex you don't need to escape the '/' because it has no special meaning. In the second regex, the separator char isn't changed, so the default '/' is used. Now you have to escape every '/' in your pattern; otherwise the parser is confused. :)
If you are not using slashes, the recommended practice is to use curly braces and the /x modifier.
$date=~ s{ (\d+) \/ (\d+) \/ (\d+) }{$1/$2/$3}x;
Escaping the non-alphanumerics is also a standard even if they are not meta-characters. See perldoc -f quotemeta.
There is another depth to this question about escaping forward slashes with the s operator.
With my example the capturing becomes the problem.
$image_name =~ s/((http:\/\/.+\/)\/)/$2/g;
For this to work the typo with the addition of a second forward slash, had to be captured.
Also, trying to work with just the two slashes did not work. The first slash has to be led by more than one character.
Changing "http://world.com/Photos//space_shots/out_of_this_world.jpg"
To: "http://world.com/Photos/space_shots/out_of_this_world.jpg"
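The same substitution can be sketched in Python for comparison. The second variant, which collapses any run of slashes not preceded by a colon, is an alternative assumed here for illustration and is not part of the answer above.

```python
import re

url = "http://world.com/Photos//space_shots/out_of_this_world.jpg"

# Same idea as the Perl substitution: capture everything up to and
# including the first of the two slashes, and drop the extra one.
fixed = re.sub(r"(http://.+/)/", r"\1", url)

# Alternative: collapse runs of slashes, leaving the "//" after "http:" alone.
collapsed = re.sub(r"(?<!:)/{2,}", "/", url)

print(fixed)      # http://world.com/Photos/space_shots/out_of_this_world.jpg
print(collapsed)  # http://world.com/Photos/space_shots/out_of_this_world.jpg
```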
I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?
In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print(re.findall(r'"(.*?)"', string))
['Foo Bar', 'Another Value']
I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
(["']) match a quote; ((?=(\\?))\2.) if backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, so as not to eat the closing quote); \1 match the same quote that was used for opening.
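To see the pattern at work, here is a small Python sketch; the sample text is made up. The full match (group 0) includes the surrounding quotes, and a backslash-escaped quote does not end the string.

```python
import re

QUOTED = re.compile(r"""(["'])(?:(?=(\\?))\2.)*?\1""")

# Sample input: say "Foo \"Bar\"" or 'it\'s' here
text = r'''say "Foo \"Bar\"" or 'it\'s' here'''

matches = [m.group(0) for m in QUOTED.finditer(text)]
for m in matches:
    print(m)
```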
I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.
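On the sample value the two patterns give the same result. One difference worth knowing, sketched below in Python: without a dot-all flag, . does not match a newline, whereas [^"] does, so only the negated class finds a quoted value that spans lines.

```python
import re

text = '"Foo Bar" "Another Value" something else'
lazy = re.findall(r'"(.*?)"', text)
negated = re.findall(r'"([^"]*)"', text)

multiline = '"spans\ntwo lines"'
lazy_multiline = re.findall(r'"(.*?)"', multiline)
negated_multiline = re.findall(r'"([^"]*)"', multiline)

print(lazy)               # ['Foo Bar', 'Another Value']
print(negated)            # ['Foo Bar', 'Another Value']
print(lazy_multiline)     # []
print(negated_multiline)  # ['spans\ntwo lines']
```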
Let's see two efficient ways that deal with escaped quotes. These patterns are not designed to be concise or pretty, but to be efficient.
These ways use first-character discrimination to quickly find quotes in the string without the cost of an alternation. (The idea is to quickly discard characters that are not quotes without testing the two branches of the alternation.)
Content between quotes is described with an unrolled loop (instead of a repeated alternation) to be more efficient too: [^"\\]*(?:\\.[^"\\]*)*
Obviously, to deal with strings that don't have balanced quotes, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround to emulate them, to prevent too much backtracking. You can also decide that a quoted part runs from an opening quote to the next (non-escaped) quote or the end of the string. In this case there is no need to use possessive quantifiers; you only need to make the last quote optional.
Notice: sometimes quotes are not escaped with a backslash but by repeating the quote. In this case the content subpattern looks like this: [^"]*(?:""[^"]*)*
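As a rough illustration of the two content subpatterns, here is a Python sketch that wraps each in its quotes. It uses plain (non-possessive) quantifiers and re.DOTALL so that an escaped newline is also accepted; the sample strings and names are made up.

```python
import re

# Unrolled loop: a run of ordinary characters, then any number of
# (escaped character + run of ordinary characters).
DOUBLE = r'"[^"\\]*(?:\\.[^"\\]*)*"'
SINGLE = r"'[^'\\]*(?:\\.[^'\\]*)*'"
QUOTED = re.compile(DOUBLE + "|" + SINGLE, re.DOTALL)

# Variant for quotes escaped by doubling them.
DOUBLED = re.compile(r'"[^"]*(?:""[^"]*)*"')

backslash_text = r"""x "a \" b" y 'c \' d' z"""
doubled_text = 'say "a ""b"" c" end'

backslash_matches = QUOTED.findall(backslash_text)
doubled_matches = DOUBLED.findall(doubled_text)

print(backslash_matches)
print(doubled_matches)
```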
The patterns avoid the use of a capture group and a backreference (I mean something like (["']).....\1) and use a simple alternation, but with ["'] factored out at the beginning.
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(note that (?s:...) is a syntactic sugar to switch on the dotall/singleline mode inside the non-capturing group. If this syntax is not supported you can easily switch this mode on for all the pattern or replace the dot with [\s\S])
(The way this pattern is written is totally "hand-driven" and doesn't take account of eventual engine internal optimizations)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'
Peculiarly, none of these answers produce a regex where the returned match is the text inside the quotes, which is what is asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be :
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the positive lookbehind at the start (the ?<= ) and the positive lookahead at the end (the ?=). The lookbehind looks behind the current character to check for a quote; if one is found, the match starts from there. The lookahead then checks the character ahead for a quote and, if one is found, stops at that character. The lookbehind group (the ["']) is wrapped in parentheses to create a group for whichever quote was found at the start; this is then used in the lookahead at the end (?=\1) to make sure it only stops when it finds the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.
The RegEx of the accepted answer returns the values including their surrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are RegEx which return only the values between quotation marks (as the questioner was asking for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
All support escaped and nested quotes.
I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string from $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
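A short Python sketch of the group-based variant, reading the content from group 2. The sample text is made up; note that the other kind of quote may appear unescaped inside a string.

```python
import re

QUOTED = re.compile(r"""(['"])((?:(?!\1|\\).|\\.)*)\1""")

# Sample input: x "a \" b" y 'c "d" e' z
text = r"""x "a \" b" y 'c "d" e' z"""

# group(1) is the quote character, group(2) the content between the quotes.
contents = [m.group(2) for m in QUOTED.finditer(text)]
for c in contents:
    print(c)
```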
A very late answer, but I'd like to add one:
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1
The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job, but I am concerned about its performance (it's not bad but could be better). Mine below is ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!
For instance it cannot capture many strings (if needed I can provide an exhaustive test-case) like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.
This version
accounts for escaped quotes
controls backtracking
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/
MORE ANSWERS! Here is the solution i used
\"([^\"]*?icon[^\"]*?)\"
TLDR;
replace the word icon with what you're looking for in said quotes and voila!
The way this works is it looks for the keyword and doesn't care what else in between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
the regex looks for a quote mark "
then it looks for any possible group of letters that's not "
until it finds icon
and any possible group of letters that is not "
it then looks for a closing "
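The steps above can be tried out with a small Python sketch. The function name is hypothetical, and re.escape is added on the assumption that the keyword should be matched literally.

```python
import re

def quoted_values_containing(keyword, text):
    # Opening quote, any non-quote characters, the keyword, any non-quote
    # characters, closing quote. The group excludes the quotes themselves.
    pattern = r'"([^"]*?' + re.escape(keyword) + r'[^"]*?)"'
    return re.findall(pattern, text)

sample = 'id="fb-icon" class="big" id="icon-close" id="large-icon-close"'
print(quoted_values_containing("icon", sample))
# ['fb-icon', 'icon-close', 'large-icon-close']
```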
I liked Axeman's more expansive version, but had some trouble with it (it didn't match for example
foo "string \\ string" bar
or
foo "string1" bar "string2"
correctly), so I tried to fix it:
# opening quote
(["'])
(
# repeat (non-greedy, so we don't span multiple strings)
(?:
# anything, except not the opening quote, and not
# a backslash, which are handled separately.
(?!\1)[^\\]
|
# consume any double backslash (unnecessary?)
(?:\\\\)*
|
# Allow backslash to escape characters
\\.
)*?
)
# same character as opening quote
\1
import re
string = "\" foo bar\" \"loloo\""
print(re.findall(r'"(.*?)"', string))
Just try this out; it works like a charm!
In the string literal, \ escapes the character that follows it.
My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["'])-> Matches to either ' or " and store it in the backreference \1 once the match found
.* -> Greedily matches every character, zero or more times, to the end of the line; the regex engine then backtracks one character at a time until the rest of the pattern (the closing ' or ") can match.
\1 -> Matches the same character that was matched earlier by the first capture group.
(?![^\s]) -> Negative lookahead to ensure that no non-space character follows the previous match.
Unlike Adam's answer, I have a simple one that works:
(["'])(?:\\\1|.)*?\1
And just add parenthesis if you want to get content in quotes like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.
All the answers above are good... except that they DO NOT support all Unicode characters in ECMAScript (JavaScript)!
If you are a Node user, you might want this modified version of the accepted answer, which supports all Unicode characters:
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.
echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here I showed the result string between ><'s for clarity, also using the non-greedy version with this sed command we first throw out the junk before and after that ""'s and then replace this with the part between the ""'s and surround this by ><'s.
From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match; no partial match should trigger a hit,
e.g. "test" could not match for "test2".
reg = r"""(['"])(%s)\1"""
if re.search(reg % needle, haystack, re.IGNORECASE):
    print("winning...")
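If the needle can contain regex metacharacters (an assumption about the use case, not something stated above), it is safer to pass it through re.escape before interpolating it. A self-contained sketch, with a hypothetical function name:

```python
import re

def has_quoted_value(needle, haystack):
    # re.escape makes characters such as "." in the needle match literally.
    pattern = r"""(['"])(%s)\1""" % re.escape(needle)
    return re.search(pattern, haystack, re.IGNORECASE) is not None

print(has_quoted_value("test", 'x "test2" y'))  # False
print(has_quoted_value("test", "x 'TEST' y"))   # True
print(has_quoted_value("a.b", '"axb"'))         # False
```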
If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?[^\"]*?)\".localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".
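A Python sketch of the same idea, run against the example line. It simplifies the pattern slightly: a single [^"]* for the content, and the dot before the suffix is escaped so that it only matches a literal ".".

```python
import re

LOCALIZED = re.compile(r'"([^"]*)"\.localized')

source = (
    'print("this is something I need to return".localized + '
    '"so is this".localized + "but this is not")'
)

# Group 1 holds the text between the quotes of each localized string.
found = LOCALIZED.findall(source)
print(found)
# ['this is something I need to return', 'so is this']
```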
A supplementary answer for the subset of Microsoft VBA coders only: one uses the library Microsoft VBScript Regular Expressions 5.5, which gives the following code:
Sub TestRegularExpression()
Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
Set oRE = New VBScript_RegExp_55.RegExp
oRE.Pattern = """([^""]*)"""
oRE.Global = True
Dim sTest As String
sTest = """Foo Bar"" ""Another Value"" something else"
Debug.Assert oRE.test(sTest)
Dim oMatchCol As VBScript_RegExp_55.MatchCollection
Set oMatchCol = oRE.Execute(sTest)
Debug.Assert oMatchCol.Count = 2
Dim oMatch As Match
For Each oMatch In oMatchCol
Debug.Print oMatch.SubMatches(0)
Next oMatch
End Sub