In what ways can I improve this regular expression? - regex

I have written this regex that works, but honestly, it’s like 75% guesswork.
The goal is this: I have lots of imports in Xcode, like so:
#import <UIKit/UIKit.h>
#import "NSString+MultilineFontSize.h"
and I only want to return the categories that contain +. There are also lots of lines of code throughout the source which include + in other contexts.
Right now, this returns all of the proper lines throughout the Xcode project. But if there is one thing I’ve learned from googling and searching Stack Overflow for regex tutorials, it is that there are LOTS of different ways to do things. I’d love to see all of the different ways you guys can come up with that make it either more efficient or more bulletproof regarding potential spoofs or misses.
^\#import+.[\"]*+.(?:(?!\+).)*+.*[\"]
Thanks in advance for all of your help.
Update
Also I suppose I’ll accept the answer of whoever does this with the shortest string, without missing any possible spoofs. But again, thanks to everyone who participates in this learning experience.
Resources from answers
This is an awesome resource for practicing regex from Dan Rasmussen: RegExr

The first thing I notice is that your + characters are misplaced: t+. matches t one or more times, followed by any single character (the dot). I'm assuming you wanted to match the end of import, followed by one or more of any character: import.+
Secondly, # doesn't need to be escaped.
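To make the two points above concrete, here is a small sketch using Python's re module (an assumption on my part; the question doesn't say which regex engine is in use, but these constructs behave the same way in the common engines). It shows that a quantifier only applies to the token right before it, and that an unescaped # matches a literal #.

```python
import re

# A quantifier binds only to the token immediately before it.
# In "import+." the + repeats just the letter t, and . is one arbitrary character.
misplaced = re.compile(r"import+.")

# What the answer assumes was intended: "import" followed by one or more characters.
intended = re.compile(r"import.+")

# Repeated t's followed by exactly one character: accepted by the misplaced form.
print(bool(misplaced.fullmatch("importttt!")))

# A real import tail is longer than one character, so the misplaced form rejects it.
print(bool(misplaced.fullmatch("import <a.h>")))
print(bool(intended.fullmatch("import <a.h>")))

# "#" is an ordinary character here (outside Python's verbose mode), so no backslash is needed.
print(bool(re.match(r"#import", '#import "Foo+Bar.h"')))
```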
Here's what I came up with: ^#import\s+(.*\+.*)$
\s+ matches one or more whitespace characters, and because it must come directly after #import, you're guaranteed that the line actually starts with #import and not #importbutnotreally or anything else.
I'm not familiar with Xcode syntax, but the following part of the expression, (.*\+.*), simply matches any string with a + character somewhere in it. This means invalid imports may be matched, but I'm working under the assumption you're trying to match valid code. If not, this will need to be modified to validate the import syntax as well.
P.S. To test your expression, try RegExr. You can hover over characters to check what they do.
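As a rough check of the pattern from this answer, the sketch below runs it over a few sample lines with Python's re module. The sample text is made up for illustration (the two import lines come from the question; the other two are hypothetical), and re.MULTILINE is needed so that ^ and $ apply per line.

```python
import re

# Pattern from the answer; MULTILINE makes ^ and $ anchor at each line.
IMPORT_WITH_PLUS = re.compile(r"^#import\s+(.*\+.*)$", re.MULTILINE)

source = '''#import <UIKit/UIKit.h>
#import "NSString+MultilineFontSize.h"
#importbutnotreally "Fake+Header.h"
int total = a + b;
'''

# findall returns the contents of the single capture group for each match.
matches = IMPORT_WITH_PLUS.findall(source)
print(matches)
```

Only the category import should come back: the UIKit line has no +, the third line fails the \s+ check, and the last line doesn't start with #import.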

sed -n 's:^#import \(.*[+].*\):\1:p' FILE
will display
"NSString+MultilineFontSize.h"
for your sample.

Related

Matching within matches by extending an existing Regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to parse the whole regex (which could be quite a long operation) and then rewrite its internals to produce the desired output, as in:
^ca[tr]
or to run the original regex and then a second one over the results. I'm taking the original regex as an argument in Python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches against 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See solution including @Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried, only you don't need a lookbehind, but rather a lookahead assertion, and a parenthesis was misplaced. The right thing is: Put the original pattern in parentheses, and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
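Since the question mentions taking the original regex as an argument in Python, here is a minimal sketch of that wrapping step. The helper name prefilter is hypothetical, and it uses a non-capturing group instead of the plain parentheses shown above so that group numbers inside the original pattern are not shifted; that choice is mine, not the answer's.

```python
import re

def prefilter(original: str, prefix: str) -> re.Pattern:
    """Wrap an arbitrary pattern so it only matches where `prefix` starts (hypothetical helper)."""
    # The lookahead is zero-width: it checks for the prefix without consuming it,
    # so the original pattern still has to match from the same position.
    # The non-capturing group keeps a top-level | in `original` contained.
    return re.compile(f"(?={re.escape(prefix)})(?:{original})")

words = ["cat", "car", "bat"]
ca_only = prefilter("cat|car|bat", "ca")
print([w for w in words if ca_only.fullmatch(w)])

# The lookahead adds no length: a 4-letter pattern still matches exactly 4 letters.
four = prefilter("[a-z]{4}", "ca")
m = four.fullmatch("cart")
print(m.group(0), len(m.group(0)))
print(four.fullmatch("cabins"))
```

Because the lookahead consumes nothing, a 6-letter string such as cabins is not a full match for the wrapped [a-z]{4}, which is consistent with the 6-character results being a quirk of the generator tool rather than of the regex.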
Ok, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.

Regex character required between 1st and 8th character

I am currently using this regex to limit the characters that can be used "([A-Za-z0-9_-]+)". I now have an additional requirement to require a hyphen between the 1st and 8th character. I am not sure where to begin for this and my search results have not been fruitful. Could anyone point me in a direction or give me pointers of where to get started with this request? I can usually cobble together some regex on my own through examples here and elsewhere on the web, but I can't find anything similar to these requirements.
Here are some good examples of what I mean:
this-isvalid
so-isthis
Thank you in advance!
Yeah, typically once you know the requirements, you can try things out in an online regex checker.
http://www.regexplanet.com/advanced/java/index.html
There are a number of them; you can google them.
You can require between 1 and 7 of the allowed characters and then a hyphen, so something like:
(^[A-Za-z0-9_]{1,7}-[A-Za-z0-9_]+)
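A quick way to see what this pattern accepts is to run it against a few inputs. The sketch below uses Python's re module rather than the Java tester linked above (my choice for illustration), and every input other than the two examples from the question is made up.

```python
import re

# Pattern from the answer, unchanged.
HYPHEN_EARLY = re.compile(r"(^[A-Za-z0-9_]{1,7}-[A-Za-z0-9_]+)")

def is_valid(value: str) -> bool:
    # fullmatch requires the whole string to fit the pattern, not just a prefix.
    return HYPHEN_EARLY.fullmatch(value) is not None

candidates = ["this-isvalid", "so-isthis", "nohyphenhere",
              "waytoolong-prefix", "-leading", "two-hyphens-here"]
for candidate in candidates:
    print(candidate, is_valid(candidate))
```

Note that, as written, the part after the required hyphen no longer allows further hyphens, although the original character class did. If later hyphens should stay legal (an assumption about the requirement), the trailing class would need the hyphen added back, as in [A-Za-z0-9_-]+.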

select area within characters using regex (spaces are an issue)

Some other guy asked a similar question earlier which got a lot of down votes, and I was interested in solving it. I came to a similar issue and would like some help with it.
Take into consideration this wall of text:
__don't__ and __do it__
__yellow__
__green__ and __purple__
I would like to select all the area within the underscores __'s
I attempted the following regex:
/__[!-~]+__/g
This worked great on most things. I would like to add the ability to have spaces within the underscores: __do it__ is not matched because it includes a space, which the character class rules out. I attempted the following:
/__[ -~]+__/g
It didn't work as planned, and selected everything from the very first __ to the very last. I was wondering how to tell the regex it has reached the end of a search once it sees a space after a __.
Here is the regex you could play around with below:
http://regexr.com/39br7
I tried using __[^ ]/g at the end but it didn't seem to help.
You could simply use the below regex,
__[^_]*__
__(.*?)__
This seems to work. Look at the demo.
http://regex101.com/r/lJ1jB1/1
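To compare the greedy attempt from the question with the two suggestions above, here is a small sketch in Python (the question uses JavaScript-style /.../g literals; re.findall plays the role of the g flag here). The sample text is the one from the question.

```python
import re

text = """__don't__ and __do it__
__yellow__
__green__ and __purple__"""

# Greedy class from the question: [ -~] also includes "_", so the match runs
# to the last "__" on each line.
greedy = re.findall(r"__[ -~]+__", text)

# First suggestion: forbid underscores inside, so the match stops at the next "__".
negated = re.findall(r"__[^_]*__", text)

# Second suggestion: lazy quantifier; findall returns only the captured inner text.
lazy = re.findall(r"__(.*?)__", text)

print(greedy)
print(negated)
print(lazy)
```

One difference worth knowing: the negated class cannot match text that itself contains a single underscore, while the lazy version can.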

Monitoring bad links with RegEx in Google Analytics

How do I optimize this to find all links ending in weird typos, yet still exclude correct links (ending with .html) from the results?
htmll$|hhtml$|httml$|htmml$|htmll$|btml$|hml$|htl$
Thanks in advance!
Wow, that's some pretty restrictive regex rules but that kinda makes it interesting.
Since we have no character negation but we do have character classes, we could do:
[a-gi-z]tml$|h[a-su-z]ml|ht[a-ln-z]l|htm[a-km-z]
for my second suggestion and:
h.+tml|ht.+ml|htm.+l|html.+
to replace the first option leading to a total of:
[a-gi-z]tml$|h[a-su-z]ml|ht[a-ln-z]l|htm[a-km-z]|h.+tml|ht.+ml|htm.+l|html.+
EDIT: Having noticed that the .+'s can catch things we don't want, this should be changed slightly:
(.*[a-gi-z]tml|h.*[a-su-z]ml|ht.*[a-ln-z]l|htm.*[a-km-z])$
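Before putting a pattern like this into a filter, it may be worth running it against the known typos. The sketch below does that with Python's re module, which is an assumption: it is not the Google Analytics engine, though the constructs used here (classes, .*, alternation, $) are basic ones. The /page.* paths are hypothetical; the endings come from the question.

```python
import re

# Final pattern from the answer, unchanged.
TYPO_PATTERN = re.compile(
    r"(.*[a-gi-z]tml|h.*[a-su-z]ml|ht.*[a-ln-z]l|htm.*[a-km-z])$"
)

# Hypothetical page paths: one correct link plus the typo endings from the question.
paths = ["/page.html", "/page.htmll", "/page.hhtml", "/page.httml",
         "/page.htmml", "/page.btml", "/page.hml", "/page.htl"]

flagged = [p for p in paths if TYPO_PATTERN.search(p)]
not_flagged = [p for p in paths if p not in flagged and not p.endswith(".html")]

print("flagged:", flagged)
print("typos not flagged:", not_flagged)
```

In this check the correct .html link is left alone and htmll, httml, htmml and btml are flagged, but hhtml, hml and htl are not: every alternative requires the ending to differ from html in one position, so anything ending in html can never match, and the two shortened endings leave no character for the classes to consume. For those cases the explicit alternatives from the question (hhtml$|hml$|htl$) still seem to be needed.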

Regular Expression matching anything after a word

I am looking to find anything that matches this pattern, the beginning word will be:
organism aogikgoi egopetkgeopt foprkgeroptk 13
So anything that starts with organism needs to be found using regex.
^organism will match anything starting with "organism".
^organism(.*) will also capture everything that follows, into the variable that contains the first match (which varies according to language -- in Perl it's $1).
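As an illustration in Python (the answer mentions Perl's $1; the equivalent here is group(1) on the match object), a minimal sketch might look like this. The helper name rest_after_organism is made up for the example.

```python
import re
from typing import Optional

ORGANISM = re.compile(r"^organism(.*)")

def rest_after_organism(line: str) -> Optional[str]:
    """Return everything after a leading "organism", or None if the line doesn't start with it."""
    m = ORGANISM.match(line)
    return m.group(1) if m else None

line = "organism aogikgoi egopetkgeopt foprkgeroptk 13"
print(repr(rest_after_organism(line)))
print(rest_after_organism("an organism elsewhere in the line"))
```

The captured text keeps its leading space, and a line that merely contains the word later on is not matched. To apply this per line over a multi-line string, re.MULTILINE would be needed so that ^ anchors at each line start.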
Also, just to add for other newbies like me: you can do it in various ways depending on your text and what you're trying to do.
For example, if I wanted to delete everything after ?spam, I could use .?spm.+ or .?spm.+ or any other way, as long as you're creative about it lol.