Let's say I have a long string of text, like a paragraph or more, and there is a specific word that appears frequently, let's call it KEY.
I want to insert a "\n" after the word that comes after each KEY.
So if I have the string KEY Hello this is KEY an example. I want it to look like KEY Hello \nthis is KEY an \nexample
If the task were slightly simpler, and I just wanted to add \n after KEY then I could easily do that with, gsub("KEY","KEY \n",string), but I don't think regex has an elegant way of selecting the word after a match, and even if it did I'm not sure I could use it in a gsub.
What would be a good way to add the \n's where I want them?
You can use a capture group and refer back to it. You have to decide how to handle certain scenarios and the specifics of your case, as Wiktor Stribiżew pointed out.
For the example case presented, look for KEY followed by a space, followed by one or more non-whitespace characters (\\S+), followed by a space:
gsub("(KEY \\S+ )", "\\1\n", string, perl = TRUE)
If you want to be more general in what can follow "KEY", then you can add a character class including what you'll allow (or \s for any whitespace character or \W for any non-alphanumeric/underscore characters, as Wiktor points out). Something like this:
gsub("(KEY[., ;!?]\\S+ )", "\\1\n", string, perl = TRUE)
gsub("(KEY\\s\\S+ )", "\\1\n", string, perl = TRUE)
gsub("(KEY\\W+\\S+ )", "\\1\n", string, perl = TRUE)
Put whatever punctuation you want to allow in the character class part [., ;!?].
Wiktor's variation(s) may be a bit more robust:
gsub("(KEY\\s+\\S+\\s*)", "\\1\n", string) # \s = white-space character
# \S = non-white-space character
gsub("(KEY\\W+\\w+\\s*)", "\\1\n", string) # \w for alphanumeric/underscore
# \W for the opposite of \w.
These variants don't require a space after the next word (\\s* for 0 or more white-space characters) and they can match one or more whitespace characters after KEY or one or more non-alphanumerics/underscores after KEY.
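For comparison, here is a minimal sketch of the same capture-group idea in Python rather than R. The function name newline_after_next_word is made up for illustration, and the pattern is the \s+\S+\s* variant from above; it assumes "word" means any run of non-whitespace characters.

```python
import re

def newline_after_next_word(text, key="KEY"):
    """Insert a newline after the word that follows each occurrence of key."""
    # re.escape keeps the key literal even if it contains regex metacharacters.
    pattern = "(" + re.escape(key) + r"\s+\S+\s*)"
    # Re-emit the captured text and append the newline.
    return re.sub(pattern, lambda m: m.group(1) + "\n", text)

result = newline_after_next_word("KEY Hello this is KEY an example.")
print(repr(result))
```

Because the trailing \s* is inside the capture group, the newline lands after the space that follows the word, which is what the example in the question asks for.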
Related
So I have the following:
^[a-zA-Z]+\b(myword+-)\b*
which I thought would match
^ start of string
[a-zA-Z] any alpha character
+ of one or more characters
\b followed by a word break
(myword+-) followed by myword which could include one or more special characters
\b followed by a word break
\* followed by anything at all
One: it does not work - it does not match anything
Two: any special characters included in (myword+-) throw an error
I could escape the special characters, but I don't know in advance what they might be, so I would have to escape all the possibilities, or perhaps I could just escape every character in (\m\y\w\o\r\d\\+\\-)
Edited to add:
Sorry, I knew I should have given more information
I have a series of strings to search through in the form:
extra android-sdk and more that is of no interest
extra android-ndk and more that is of no interest
extra anjuta-extra and more that is of no interest
community c++-gtk-utils and more that is of no interest
and I have a list of items to search for in the strings:
android-sdk
android-ndk
extra
c++-gtk-utils
The item should only match if the second word in the string is an exact match to the item, so:
android-sdk will match the first string
android-ndk will match the second string
extra will NOT match the third string
c++-gtk-utils will match the fourth string
So (myword+-) is the item I am searching for "which could include one or more special characters"
Thanks for the help
Andrew
OK, with the help from above I worked it out.
This regex does exactly what I wanted, bear in mind that I am working in tcl (note the spaces to delimit the search word):
^[a-zA-Z]+\y extra \y *
where the search word is "extra".
It is necessary to escape any characters in the search string which may be interpreted by regex as quantifiers etc., e.g. +
So this will also work:
^[a-zA-Z]+\y dbus-c\+\+ \y *
Andrew
Strong recommendation: if you want to match literal strings, don't use regular expressions.
If we have this sample data:
set strings {
{extra android-sdk and more that is of no interest}
{extra android-ndk and more that is of no interest}
{extra anjuta-extra and more that is of no interest}
{community c++-gtk-utils and more that is of no interest}
}
set search_strings {
android-sdk
android-ndk
extra
c++-gtk-utils
}
Then, to find matches in the 2nd word of each string, we'll just use the eq string equality operator
foreach string $strings {
foreach search $search_strings {
if {[lindex [split $string] 1] eq $search} {
puts "$search matches $string"
}
}
}
outputs
android-sdk matches extra android-sdk and more that is of no interest
android-ndk matches extra android-ndk and more that is of no interest
c++-gtk-utils matches community c++-gtk-utils and more that is of no interest
If you insist on regular expression matching, you can escape any special characters to take away their usual regex meaning. Here, we'll take the brute force approach: any non-word chars will get escaped, so that the pattern may look like ^\S+\s+c\+\+\-gtk\-utils
foreach string $strings {
foreach search $search_strings {
set pattern "^\\S+\\s+[regsub -all {\W} $search {\\&}]"
if {[regexp $pattern $string]} {
puts "$search matches $string"
}
}
}
I was hoping to be able to make a portion of a regular expression to be a literal string, like
set pattern "^\\S+\\s+(***=$string)"
set pattern "^\\S+\\s+((?q)$string)"
but both failed.
Tcl regular expressions are documented at
https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.htm
Also note your pattern ^[a-zA-Z]+\b(myword+-)\b* does not provide for any whitespace between the first and second words.
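As a rough cross-check of both approaches, here is a sketch in Python rather than Tcl. It is a translation for illustration, not the code above: the function names are invented, re.escape stands in for the regsub-based escaping, and the (?!\S) lookahead is an added assumption so that the regex version requires the second word to end where the search item ends.

```python
import re

strings = [
    "extra android-sdk and more that is of no interest",
    "extra android-ndk and more that is of no interest",
    "extra anjuta-extra and more that is of no interest",
    "community c++-gtk-utils and more that is of no interest",
]
search_strings = ["android-sdk", "android-ndk", "extra", "c++-gtk-utils"]

def second_word_equals(line, item):
    """Plain string comparison: no regex, so nothing needs escaping."""
    words = line.split()
    return len(words) > 1 and words[1] == item

def second_word_equals_re(line, item):
    """Regex version: escape the item so + and - are taken literally."""
    pattern = r"^\S+\s+" + re.escape(item) + r"(?!\S)"
    return re.search(pattern, line) is not None

plain = [(item, line) for line in strings for item in search_strings
         if second_word_equals(line, item)]
with_regex = [(item, line) for line in strings for item in search_strings
              if second_word_equals_re(line, item)]

for item, line in plain:
    print(f"{item} matches {line}")
```

Both lists should contain the same three pairs as the Tcl output, with "extra" matching nothing because it only ever appears as the first word or as part of a longer second word.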
Disclaimer: Since your question does not say what input and output are expected, I will only try to explain why your regex isn't working at all. Since this is not a full answer, you might not want to mark it as accepted, and may prefer to wait for someone to give you a working solution once you provide the necessary information.
Notes:
quantifier characters (*, +, ? etc.) are applied to literal character or character class (a.k.a character group, namely characters/ranges inside [ ]) - when in your regex you write (myword+-) the only thing the + sign is applied to is letter 'd', nothing else.
what is myword in your regex? If you want a set of characters use [ ] combined with character ranges and/or character tokens such as \w (all word characters: letters, digits and the underscore) or \d (all digit characters)
you also seem to misunderstand and misuse groups ("( )"), character classes ("[ ]") and quantifier notation ("{ }")
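The notes above can be illustrated with a small Python sketch. It uses Python's re module rather than Tcl, on the assumption that quantifier scope, groups and character classes behave the same way for these simple patterns; the test strings are made up.

```python
import re

# In (myword+-) the + applies only to the preceding 'd'.
repeats_d = re.fullmatch(r"(myword+-)", "myworddd-") is not None
repeats_word = re.fullmatch(r"(myword+-)", "mywordmyword-") is not None

# To repeat the whole word, the quantifier has to follow a group.
group_repeat = re.fullmatch(r"(?:myword)+-", "mywordmyword-") is not None

# A character class matches any ONE of the listed characters;
# inside [ ] the + is literal, and a trailing - is literal too.
class_match = re.fullmatch(r"[myword+-]+", "w+o-r") is not None

print(repeats_d, repeats_word, group_repeat, class_match)
```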
c = re.split(r'\w+', message)
print(c)
message contains '!nano speak', but the regex is giving me this in return:
>>> ['!', ' ', '\r\n']
I'm very new to regex, but this seems like something I should get, and I can't seem to find this problem in search. It seems like it's doing exactly the opposite, and I'm sure it's a lower-case w.
re.split is using the regex as a delimiter to split the string. You set the delimiter to be one or more word characters (\w+). This means that it will return everything between the words.
In order to get the tokens defined by the regex you can use re.findall:
>>> re.findall(r'\w+', '!nano speak')
['nano', 'speak']
\w matches a word character (alphanumeric or underscore), so in the string "!nano speak", it matches everything except "!" and the space, and the string is then split on "nano" and "speak". So you get "!", " " and "\r\n".
To remove all non-letter characters, you can use
re.sub("[^a-zA-Z]+", "", "!nano speak")
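Putting the three calls side by side may make the difference clearer. This is only a sketch; it assumes the message really ends in "\r\n", which the output in the question suggests but the question does not state.

```python
import re

message = "!nano speak\r\n"  # assumed: the trailing \r\n is inferred from the output

# split: \w+ is the delimiter, so the pieces BETWEEN words come back.
between = re.split(r"\w+", message)

# findall: the pieces that MATCH \w+ come back.
words = re.findall(r"\w+", message)

# sub: delete every run of characters that are not ASCII letters.
letters_only = re.sub(r"[^a-zA-Z]+", "", message)

print(between, words, letters_only)
```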
The title explains the question itself.
more specifically i need to write a regex in order to accept a "question", something like: "how are you today?". So the last character must be a "?".
I tried something like this:
m/[^a-zA-Z0-9\W{1}]/
but it accepts any input with 1 or more \W characters
The regex you gave in your question does not do what you think it does.
m/[^a-zA-Z0-9\W{1}]/
This will match any character that is not a-z, A-Z, 0-9, a non-word character (\W), {, 1, or }. The ^ inside the square brackets negates the content of the character class. It's not the beginning of the line if it's in there!
If you need to validate any input that has a question mark at the end, all you need is the question mark and the end-of-line metacharacter.
/\?$/
The ? is a metacharacter itself, so you need to escape it with a backslash (\).
If you want to match a whole sentence with the questionmark at the end, think of what kinds of characters could be in the sentence. It will not only be \w probably.
Play around with your input and your regex on http://regex101.com/, that will make it easier because it explains what's going on.
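The two points above can be checked with a short sketch. It is written in Python rather than Perl, on the assumption that these simple patterns mean the same thing in both; looks_like_question is a made-up helper name.

```python
import re

def looks_like_question(text):
    """True if the text ends with a literal question mark."""
    # \? is a literal question mark, $ anchors at the end of the string.
    return re.search(r"\?$", text) is not None

# The negated class from the question excludes letters, digits and every
# non-word character, so of ASCII input only the underscore can match.
negated_class_hits = re.findall(r"[^a-zA-Z0-9\W{1}]", "how_are you?")

print(looks_like_question("how are you today?"))
print(looks_like_question("how are you today"))
print(negated_class_hits)
```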
accept a "question", something like:"how are you today?"
How about:
$string =~ /^(?:[a-z0-9]+\s*)+\?$/i;
This may work:
if( $question =~ m!([\w\s]+)\?$! ) {
print "question text: $1\n";
}
The regex looks for \w and \s characters (spaces, tabs, ...), which you often have in a text, before the question mark at the last position.
Try this. I assume you want to match any characters preceding the ?:
m/.+\?$/
. matches any character of the string, and + repeats it one or more times.
\ removes the special meaning of the ? (match the preceding character 0 or 1 times), and $ anchors the match at the end of the string.
I have a string, for example
string SampleString = "F456-G12345-9090-GHI"
I need to allow optional white space between all characters in the above string.
The above string needs to match the same string, which may or may not have white space between each character. The other string will look like
string samplestring1 = "F456-G12345- 9090 -GHI"
Thanks
Padma
I'm not positive that I understand what you'll be matching. If you're looking for a specific string, then the easiest way is probably to replace all white space with '' across the string and then do the match.
In perl I'd do:
$string =~ s/\s//g;
while ($string =~ m/F456-G12345-9090-GHI/g) {
# Do something
}
If you're looking for multiple strings, and not just a specific one, you might just want to add \s as a potential match [\w\s-]+
However, if you're going to be matching against a specific string, I'd just toss the whitespace whole cloth first rather than performing an expensive regex checking for (and discarding) any whitespace found before checking the string.
you will probably have to add \s* between each character. (or other control characters for whitespace)
\s*F\s*4\s*5\s*6\s*-\s*G\s*1\s*2\s*3\s*4\s*5\s*-\s*9\s*0\s*9\s*0\s*-\s*G\s*H\s*I\s*
Or, depending on your regex dialect, you might be able to pass an option to ignore whitespace in the source text, but it would depend on which regex library you're using.
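Both suggestions can be sketched in a few lines. The example is in Python rather than Perl or C#, and the helper names are invented for illustration; the second helper builds the long \s*-separated pattern programmatically instead of typing it out by hand.

```python
import re

TARGET = "F456-G12345-9090-GHI"

def matches_ignoring_whitespace(target, candidate):
    """Strip all whitespace from the candidate, then compare literally."""
    return re.sub(r"\s+", "", candidate) == target

def spaced_pattern(target):
    """Build a regex that allows optional whitespace around every character."""
    # re.escape keeps characters such as - literal.
    body = r"\s*".join(re.escape(ch) for ch in target)
    return re.compile(r"\s*" + body + r"\s*")

candidate = "F456-G12345- 9090 -GHI"
stripped_ok = matches_ignoring_whitespace(TARGET, candidate)
pattern_ok = spaced_pattern(TARGET).fullmatch(candidate) is not None
print(stripped_ok, pattern_ok)
```

Stripping first is usually the simpler choice when only one fixed string is being compared; the generated pattern is more useful when the match has to be found inside a larger text.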
In perl I want to substitute any character not [A-Z]i or [0-9] and replace it with "_" but only if this non alphanumerical character occurs between two alphanumerical characters. I do not want to touch non-alphanumericals at the beginning or end of the string.
I know enough regex to replace them, just not to only replace ones in the middle of the string.
s/(\p{Alnum})\P{Alnum}(\p{Alnum})/${1}_${2}/g;
Of course that would hurt your chances with "#A#B%C", because each match consumes the alphanumeric characters on both sides, so you might use look-arounds:
s/(?<=\p{Alnum})\P{Alnum}(?=\p{Alnum})/_/g;
That way you isolate it to just the non "alnum" character.
Or you could use the "keep flag", as well and get the same thing done.
s/\p{Alnum}\K\P{Alnum}(?=\p{Alnum})/_/g;
EDIT based on input:
To not eat a newline, you could do the following:
s/\p{Alnum}\K[^\p{Alnum}\n](?=\p{Alnum})/_/g;
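To see why the capture-group version misses overlapping cases, here is a rough Python equivalent. It uses ASCII character classes instead of \p{Alnum}, which is an assumption that narrows the match to A-Z, a-z and 0-9, and it leaves out \K because Python's re module does not support it.

```python
import re

ALNUM = "[A-Za-z0-9]"
NON_ALNUM = "[^A-Za-z0-9]"

def replace_consuming(text):
    """Capture-group version: each match consumes both neighbours."""
    return re.sub(f"({ALNUM}){NON_ALNUM}({ALNUM})", r"\1_\2", text)

def replace_lookaround(text):
    """Look-around version: only the middle character is consumed."""
    return re.sub(f"(?<={ALNUM}){NON_ALNUM}(?={ALNUM})", "_", text)

sample = "#A#B%C"
consumed = replace_consuming(sample)
looked = replace_lookaround(sample)
print(consumed, looked)
```

In the first version the B is used up by the match "A#B", so the % that follows no longer has an alphanumeric character available on its left.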
Try this:
my $str = 'a-2=c+a()_';
$str =~ s/(?<=[A-Z0-9])[^A-Z0-9](?=[A-Z0-9])/_/gi;   # look-arounds capture nothing, so the replacement is just "_"