trim spaces on captured group regex

After searching everywhere, "How can whitespace be trimmed from a Regex capture group?" seems to be the closest thing, and I can almost taste it, but still no cigar.
So here goes, for my very first post to StackOverflow.
I am trying to capture exactly 27 characters of a string (for reasons that really don't matter).
So I used regex: .{27}
on this string "Joey Went to the Store and found some great steaks!"
the result was "Joey Went to the Store and "
Bingo exactly 27 characters.
But that result is causing errors in my program because of the trailing space. So now I also need to trim the space after "and " so that the final result is "Joey Went to the Store and".
Here's the kicker: it all needs to work from a single regex, because the application can only apply one regex (really dumb program, but I'm stuck with it).

Take a look at this regex:
^.{26}[^\s]?
It matches 26 characters starting from the beginning of the line, and then matches the 27th only if it is not a whitespace character ([^\s] is equivalent to \S). See the demo below for more details.
Regex Demo
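As a quick check of the pattern above, here is a minimal Python sketch. Python is only an assumption for illustration (the asker's application is not named), and the second test string is made up to show the case where the 27th character is kept.

```python
import re

# The answer's pattern: 26 characters, then an optional non-whitespace 27th.
pattern = re.compile(r"^.{26}[^\s]?")

s = "Joey Went to the Store and found some great steaks!"
result = pattern.match(s).group(0)
print(repr(result))  # 'Joey Went to the Store and'

# Hypothetical second input whose 27th character is not whitespace,
# so all 27 characters are returned.
other = pattern.match("abcdefghijklmnopqrstuvwxyz0123").group(0)
print(repr(other))   # 'abcdefghijklmnopqrstuvwxyz0'
```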

Related

Regex to remove unwanted text in gene sequences

I have gene sequences that can contain ordinary text, which I want to remove in a generic way with regex. I'd like to remove all characters between any two invalid characters that are up to 10 chars apart. I am assuming that anything between invalid chars up to 10 chars apart is part of the invalid text.
example :
BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLEGCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED
Valid sequence characters are ATCG. Can we create a regex to reduce the above string to
GATCATCGGCCCATGCATGCGGGGATCGCCCCTTTAAAAT?
I understand that the G at the beginning of this final sequence is the last character of the word BEGINNING, which is the "bad" text at the beginning of the string. I realize that with regex it is impossible to identify words, so I am willing to live with this limitation. Same with the T at the end, which is the first letter of "THIS".
I've tried to do something with repeated capture groups that allow for a certain number of chars between bad characters, but I can't seem to make it work right. Maybe someone can help me...
This regex does not quite work to capture everything.
([^ACTG].{1,10}[^ACTG])+
Initial string:
BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLEGCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED
String after replacing non-ACGT:
-A-T--TATT----G-----GATCATCGGCCCATGCAT-----A-T--T--T--------GCGGGGATCGCCCCTTTAAAAT---------T--TATT-------A-T-------
For this sample, a run of up to four ACGT characters can appear in the unwanted text. Examining other samples may give a sensible upper bound.
Perhaps "starts and ends with invalid character and contains no long runs of valid characters" is a better measure to use than "1 to 10 characters, starting and ending with invalid character"?
A regex for this is:
[^ACGT]((?![ACGT]{5,}).)*[^ACGT]
and matches:
BADTEXTATTHEBEGINNIN
MOREBADTEXTINTHEMIDDLE
HISISSOMETEXTATTHEENDIWANTREMOVED
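The regex above can be tried with a short Python sketch. Python is an assumption here (the question does not name a language); the threshold of 5 comes from the answer's observation about this one sample and may need adjusting for other data.

```python
import re

raw = ("BADTEXTATTHEBEGINNINGATCATCGGCCCATGCATMOREBADTEXTINTHEMIDDLE"
       "GCGGGGATCGCCCCTTTAAAATHISISSOMETEXTATTHEENDIWANTREMOVED")

# Starts and ends with an invalid character, and never steps onto the
# beginning of a run of five or more valid (ACGT) characters.
junk = re.compile(r"[^ACGT]((?![ACGT]{5,}).)*[^ACGT]")

# finditer is used instead of findall because the pattern has a capture group.
matches = [m.group(0) for m in junk.finditer(raw)]
print(matches)

cleaned = junk.sub("", raw)
print(cleaned)  # GATCATCGGCCCATGCATGCGGGGATCGCCCCTTTAAAAT
```

The leading G and trailing T survive because they are valid characters adjacent to a valid run, which is the limitation the question already accepts.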

Regex how to get a full match of nth word (without using non-capturing groups)

I am trying to use Regex to return the nth word in a string. This would be simple enough using other answers to similar questions; however, I do not have access to any of the code. I can only access a regex input field and the server only returns the 'full match' and cannot be made to return any captured groups such as 'group 1'
EDIT:
From the developers explaining the version of regex used:
"...its javascript regex so should mostly be compatible with perl i
believe but not as advanced, its fairly low level so wasn't really
intended for use by end users when originally implemented - i added
the dropdown with the intention of having some presets going
forwards."
/EDIT
Sample String:
One Two Three Four Five
Attempted solution (which is meant to get just the 2nd word):
^(?:\w+ ){1}(\S+)$
The result is:
One Two
I have also tried other variations of the regex:
(?:\w+ ){1}(\S+)$
^(?:\w+ ){1}(\S+)
But these just return the entire string.
I have tried replicating the behaviour that I see using regex101 but the results seem to be different, particularly when changing around the ^ and $.
For example, I get the same output on regex101 if I use the altered regex:
^(?:\w+ ){1}(\S+)
In any case, none of the comparing has helped me actually achieve my stated aim.
I am hoping that I have just missed something basic!
===EDIT===
Thanks to all of you who have contributed thus far, however, I am still running into issues. I am afraid that I do not know the language or restrictions on the regex other than what I can ascertain through trial and error, therefore here is a list of attempts and results all of which are trying to return "Two" from a sample of:
One Two Three Four Five
\w+(?=( \w+){1}$)
returns all words
^(\w+ ){1}\K(\w+)
returns no words at all (so I assume that \K does not work)
(\w+? ){1}\K(\w+?)(?= )
returns no words at all
\w+(?=\s\w+\s\w+\s\w+$)
returns all words
^(?:\w+\s){1}\K\w+
returns all words
====
With all of the above not working, I thought I would test out some others to see the limitations of the system
Attempting to return the last word:
\w+$
returns all words
This leads me to believe that something strange is going on with the start ^ and end $ characters, perhaps the server puts these in automatically if they are omitted? Any more ideas greatly appreciated.
I don't know if your language supports positive lookbehind, so using your example,
One Two Three Four Five
here is a solution that should work in any engine that supports lookahead:
\w+ match the first word
\w+$ match the last word
\w+(?=\s\w+$) match the 4th word
\w+(?=\s\w+\s\w+$) match the 3rd word
\w+(?=\s\w+\s\w+\s\w+$) match the 2nd word
So, for a string that contains 10 words:
The first and the last word are easy to find. To find the word at any other position, you simply have to use this rule:
\w+(?= followed by \s\w+ (10 - position) times followed by $)
Example
In this string :
One Two Three Four Five Six Seven Eight Nine Ten
I want to find the 6th word.
10 - 6 = 4
\w+(?= followed by \s\w+ 4 times followed by $)
Our final regex is
\w+(?=\s\w+\s\w+\s\w+\s\w+$)
Demo
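The rule above can be sketched as a small pattern builder. This is only an illustration in Python; the helper name nth_word_pattern is hypothetical, and it assumes the total number of words is known in advance and that words are separated by single whitespace characters, as in the answer.

```python
import re

def nth_word_pattern(position: int, total_words: int) -> str:
    """Hypothetical helper: build the lookahead pattern for the word at
    `position` (1-based) in a string of `total_words` words."""
    if not 1 <= position <= total_words:
        raise ValueError("position must be between 1 and total_words")
    following = total_words - position  # words that must follow the match
    return r"\w+(?=" + r"\s\w+" * following + r"$)"

ten = "One Two Three Four Five Six Seven Eight Nine Ten"
pattern = nth_word_pattern(6, 10)
print(pattern)                         # \w+(?=\s\w+\s\w+\s\w+\s\w+$)
print(re.search(pattern, ten).group(0))  # Six

five = "One Two Three Four Five"
print(re.search(nth_word_pattern(2, 5), five).group(0))  # Two
```

Only the full match is read here (group 0), which mirrors the asker's constraint that capture groups are not returned.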
It's possible to use reset match (\K) to reset the position of the match and obtain the third word of a string as follows:
(\w+? ){2}\K(\w+?)(?= )
I'm not sure what language you're working in, so you may or may not have access to this feature.
I'm not sure if your language supports \K, but I'm sharing this anyway in case it does:
^(?:\w+\s){3}\K\w+
to get the 4th word.
^ represents starting anchor
(?:\w+\s){3} is a non-capturing group that matches three words (ending with spaces)
\K is a match reset, so it resets the match and the previously matched characters aren't included
\w+ helps consume the nth word
Regex101 Demo
And similarly,
^(?:\w+\s){1}\K\w+ for the 2nd word
^(?:\w+\s){2}\K\w+ for the 3rd word
^(?:\w+\s){3}\K\w+ for the 4th word
and so on...
So, on the downside, you can't use lookbehind, because that has to be a fixed-width pattern; but the "full match" is just the last thing that "full matches", so you just need something whose last match is your word.
With positive lookahead, you can count from the right. This matches the word that has exactly n words after it:
\w+(?=( \w+){n}$)
If your server uses an engine such as PCRE or Perl, \K can "clear matched items" (it resets the start of the reported match), but many regex engines, including JavaScript's, don't support it.
^(\w+ ){n}\K(\w+)
Unfortunately, regex doesn't have a standard "match only the nth occurrence", so counting from the right is the best you can do. (Also, Regex101 has a searchable quick reference in the bottom right corner for looking up special characters; just remember that most of those characters are not supported by all regex engines.)

REGEX in R: extracting words from a string

I guess this is a common problem, and I found quite a lot of webpages, including some from SO, but I failed to understand how to implement it.
I am new to regex, and I'd like to use it in R to extract the first few words from a sentence.
For example, if my sentence is
z = "I love stack overflow it is such a cool site"
I'd like my output to be (if I need the first four words)
[1] "I love stack overflow"
or (if i need the last four words)
[1] "such a cool site"
of course, the following works
paste(strsplit(z," ")[[1]][1:4],collapse=" ")
paste(strsplit(z," ")[[1]][7:10],collapse=" ")
but i'd like to try a regex solution for performance issues as i need to deal with very huge files (and also for the sake of knowing about it)
I looked at several links, including
Regex to extract first 3 words from a string and
http://osherove.com/blog/2005/1/7/using-regex-to-return-the-first-n-words-in-a-string.html
so i tried things like
gsub("^((?:\S+\s+){2}\S+).*",z,perl=TRUE)
Error: '\S' is an unrecognized escape in character string starting ""^((?:\S"
I tried other stuff, but it usually returned either the whole string or the empty string.
Another problem with strsplit is that it returns a list. Maybe the [[]] operator is slowing things down a bit (??) when dealing with large files and doing apply stuff.
It looks like the syntax used in R is somewhat different?
Thanks!
You've already accepted an answer, but I'm going to share this as a means of helping you understand a little more about regex in R, since you were actually very close to getting the answer on your own.
There are two problems with your gsub approach:
You used single backslashes (\). R requires you to escape those since they are special characters in string literals. You escape them by adding another backslash (\\). If you do nchar("\\"), you'll see that it returns 1.
You didn't specify what the replacement should be. Here, we don't want to replace anything, but we want to capture a specific part of the string. You capture groups in parentheses (...), and then you can refer to them by the number of the group. Here, we have just one group, so we refer to it as "\\1".
You should have tried something like:
sub("^((?:\\S+\\s+){2}\\S+).*", "\\1", z, perl = TRUE)
# [1] "I love stack"
This is essentially saying:
Work from the start of the contents of "z".
Start creating group 1.
Find non-whitespace (like a word) followed by whitespace (\S+\s+) two times {2} and then the next set of non-whitespaces (\S+). This will get us 3 words, without also getting the whitespace after the third word. Thus, if you wanted a different number of words, change the {2} to be one less than the number you are actually after.
End group 1 there.
Then, just return the contents of group 1 (\1) from "z".
To get the last three words, just switch the position of the capturing group and put it at the end of the pattern to match.
sub("^.*\\s+((?:\\S+\\s+){2}\\S+)$", "\\1", z, perl = TRUE)
# [1] "a cool site"
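For comparison only, the same two patterns can be sketched in Python, where raw strings (r"...") avoid the double-backslash issue described above. The helper names first_words and last_words are hypothetical, and the sketch assumes single-line input.

```python
import re

z = "I love stack overflow it is such a cool site"

def first_words(text: str, n: int) -> str:
    # (n - 1) "word + whitespace" pairs, then one more word, as group 1.
    pattern = r"^((?:\S+\s+){%d}\S+).*" % (n - 1)
    return re.sub(pattern, r"\1", text)

def last_words(text: str, n: int) -> str:
    # Same group, moved to the end of the pattern.
    pattern = r"^.*\s+((?:\S+\s+){%d}\S+)$" % (n - 1)
    return re.sub(pattern, r"\1", text)

print(first_words(z, 3))  # I love stack
print(last_words(z, 3))   # a cool site
print(first_words(z, 4))  # I love stack overflow
print(last_words(z, 4))   # such a cool site
```

If the text has fewer words than requested, neither pattern matches and the text comes back unchanged.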
To get the first four words:
library(stringr)
z <- "I love stack overflow it is such a cool site"
str_extract(z, "^\\s*(?:\\S+\\s+){3}\\S+")
# [1] "I love stack overflow"
To get the last four:
str_extract(z, "(?:\\S+\\s+){3}\\S+(?=\\s*$)")
# [1] "such a cool site"

regular expression to remove the first word of each line

I am trying to make a regular expression that grabs the first word (including possible leading white space) of each line. Here it is:
/^([\s]+[\S]*).*$/\1//
This code does not seem to be working (see http://regexr.com?34o6m). The code is supposed to
Begin at the start of the line
Create a capturing group where it places the first word (with possible leading white space)
Grab the rest of the line
Substitute the entire line with just the inside of the first capturing group
I tried another version also:
/\S(?<=\s).*^//
It looks like this one fails too (http://regexr.com?34o6s). The goal here was to
Find the first non-whitespace character.
Look behind to make sure it has a whitespace character behind it (i.e. not the first letter of the line).
Grab the rest of the line.
Erase everything the expression just grabbed.
Any insight to what is going wrong would be greatly appreciated. Thanks!
Try this regular expression
^(\s*.*?\s).*
Demo: gskinner
You mixed up your + and *.
/^([\s]*[\S]+).*$/\1/
This means zero or more spaces followed by one or more non-spaces.
You might also want to use $1 instead of \1:
/^([\s]*[\S]+).*$/$1/
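A minimal sketch of the corrected substitution, written in Python as an assumption (the question uses a /pattern/replacement/ tool). The MULTILINE flag makes ^ and $ apply to each line, and the sample text is made up.

```python
import re

# Hypothetical input: three lines, the first with leading whitespace.
text = "  hello world\nfoo bar baz\nsingle"

# Zero or more whitespace characters, then one or more non-whitespace
# characters, captured as group 1; the rest of the line is discarded.
first_words = re.sub(r"^([\s]*[\S]+).*$", r"\1", text, flags=re.MULTILINE)

print(repr(first_words))  # '  hello\nfoo\nsingle'
```

Python writes the backreference as \1; tools with JavaScript-style replacement syntax use $1 instead, as noted above.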
Okay, well this seems to work using replace() in Javascript:
/^([\s]*[\S]+).*$/
I tested it on www.altastic.com/regexinator, which as far as I know is accurate [I made it though, so it may not be ;-) ]
To remove the first two words:
#"^.*? .*? "
This works for me. (When I first posted this, the asterisks did not show up.)
If you want to remove only the first word, simply build the regex as follows:
a dot
an asterisk
a question mark
a space
and replace the match with "".

Matching line without and with lower-case letters

I want to match two consecutive lines, with the first line having no lower-case letter and the second having lower-case letter(s), e.g.
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
Why would the Regex ^(?!.*[:lower:]).*$\n^(.*[:lower:]).*$ match each of the following two-line examples?
("1.3.3 Disks 24" "#52")
("1.3.4 Tapes 25" "#53")
("1.5.4 Input/Output 41" "#69")
("1.5.5 Protection 42" "#70")
("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")
Thanks and regards!
ADDED:
For an example such as:
("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")
How shall I match only the middle two lines not the first three lines or all the four lines?
To use a POSIX "character class" like [:lower:], you have to enclose it in another set of square brackets, like this: [[:lower:]]. (According to POSIX, the outer set of brackets form a bracket expression and [:lower:] is a character class, but to everyone else the outer brackets define a character class and the inner [:lower:] is obsolete.)
Another problem with your regex is that the first part is not required to consume any characters; everything is optional. That means your match can start on the blank line, and I don't think you want that. Changing the second .* to .+ fixes that, but it's just a quick patch.
This regex seems to match your specification:
^(?!.*[[:lower:]]).+\n(?=.*[[:lower:]]).*$
But I'm a little puzzled, because there's nothing in your sample data that matches. Is there supposed to be?
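Here is a sketch of that regex run against the four-line example from the question, using Python as an assumed engine. Python's built-in re module does not support POSIX classes such as [[:lower:]], so [a-z] is substituted here, which covers ASCII lowercase letters only.

```python
import re

toc = "\n".join([
    '("3.1 NO MEMORY ABSTRACTION 174" "#202")',
    '("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")',
    '("3.3.1 Paging 187" "#215")',
    '("3.3.2 Page Tables 191" "#219")',
])

# Line with no lowercase letter, a newline, then a line with at least one.
# [a-z] stands in for [[:lower:]] (an assumption for Python's re module).
pair = re.compile(r"^(?!.*[a-z]).+\n(?=.*[a-z]).*$", re.MULTILINE)

matches = [m.group(0) for m in pair.finditer(toc)]
for m in matches:
    print(m)
    print("---")
```

With this input the only match is the pair formed by the 3.2 and 3.3.1 lines: the 3.1 line is rejected because the line after it has no lowercase letter.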
Using Rubular, we can see what's matched by your initial expression, and then, by adding a few excess capturing groups, see why it matches.
Essentially, the negative look-ahead followed by .* will match anything. If you merely want to check that the first line has no lower-case letters, check that explicitly, e.g.
^(?:[^a-z]+)$
Finally, assuming you want the entire second line, you can do this for the second part:
^(.*?(?=[:lower:]).*?)$
Or to match your initial version:
^(.*?(?=[:lower:])).*?$
The reluctant quantifiers (*?) seemed to be necessary to avoid matching across lines.
The final version I ended up with, thus, is:
^(?:[^a-z]+)$\n^(.*?(?=[:lower:]).*?)$
This can be seen in action with your test data here. It only captures the line ("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205").
Obviously, the regex I've used might be quite specific to Ruby, so results with your regex engine may be somewhat different. There are many easy-to-find online regex testers; I just picked Rubular since it does a wonderful job of highlighting what is being matched.
Incidentally, if you're using Python, the Python Regex Tool is very helpful for online testing of Python regexes (and it works with the final version I gave above), though I find the output visually less helpful in trouble-shooting.
After thinking about it a little more, Alan Moore's point about [[:lower:]] is spot on, as is his point about how the data would match. Looking back at what I wrote, I got a little too involved in breaking-down the regex and missed something about the problem as described. If you modify the regex I gave above to:
^(?:[^[:lower:]]+)$\n^(.*?(?=[[:lower:]]).*?)$
It matches only the line ("3.3.1 Paging 187" "#215"), which is the only line with lowercase letters following a line with no lowercase letters, as can be seen here. Placing a capturing group in Alan's expression, yielding ^(?!.*[[:lower:]]).+\n((?=.*[[:lower:]]).*)$ likewise captures the same text, though what, exactly, is matched is different.
I still don't have a good solution for matching multiple lines.