Given a file name of 22-PLUMB-CLR-RECTANGULAR.0001.rfa I need a RegEx to match it. Basically it's any possible characters, then . and 4 digits and one of four possible file extensions.
I tried ^.?\.\d{4}\.(rvt|rfa|rte|rft)$, which I would have thought would be correct, but I guess my RegEx understanding has not progressed as much as I thought/hoped. Now, .?\.\d{4}\.(rvt|rfa|rte|rft)$ does work, and the ONLY difference is that I am not specifying the start of the string with ^. In a different situation, where the file name is always in the form journal.####.txt, I used ^journal\.\d{4}\.txt$ and it matched journal.0001.txt like a champ. Obviously the difference is that there I am specifying a literal string rather than any characters with .?, but I don't understand the WHY.
That never matches the mentioned string, because ^.? means: match the beginning of the input string, then at most one character. The regex then expects a literal dot followed by four digits, but after skipping at most one character of this file name the engine is still looking at 22-PLUMB..., so the match fails.
Why does it work without ^? Because without ^ the engine is free to try every starting position. It finds a match starting right before the final R of RECTANGULAR: .? consumes that R, and the rest of the pattern matches up to the end of the string.
That works, but if you want to keep the anchor, the first approach should use ^.* instead. The Kleene star matches as many characters as it can and then backtracks until the rest of the pattern fits, whereas ? only makes the preceding token optional. That means zero or one character, not many characters.
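The difference between the three variants can be checked with a short script. This is only a sketch in Python (the question does not say which regex engine is in use, so Python's re module is an assumption here); the variable names are made up for the illustration.

```python
import re

name = "22-PLUMB-CLR-RECTANGULAR.0001.rfa"
tail = r"\.\d{4}\.(rvt|rfa|rte|rft)$"

# ^.? allows at most one character before the dot, so this cannot match.
anchored_optional = re.search(r"^.?" + tail, name)

# Without ^ the engine may start anywhere; it starts at the final R.
unanchored_optional = re.search(r".?" + tail, name)

# ^.* consumes the whole prefix (backtracking as needed), so this matches.
anchored_star = re.search(r"^.*" + tail, name)

print(anchored_optional)             # None
print(unanchored_optional.group(0))  # R.0001.rfa
print(anchored_star.group(0))        # 22-PLUMB-CLR-RECTANGULAR.0001.rfa
```

Note that the unanchored version "works" only in the sense that it finds a match somewhere in the string; the match itself covers just the last ten characters.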
I need to search through a larger text file.
This is an example of what I'm searching through.
https://pastebin.com/JFVy2TEt
recipes.addShaped("basemetals:adamantine_arrow", <basemetals:adamantine_arrow> * 4, [[<ore:nuggetAdamantine>], [<basemetals:adamantine_rod>], [<minecraft:feather>]]);
I need to look for lines that match a specific part in the first argument.
For example the "_arrow" part in the above line.
And erase everything that doesn't match on the "_arrow" in the first argument.
And the arguments differ across all of them.
And also with different names in the place where "basemetals:adamantine" is in the above line.
And since the further arguments are all different, I can't wrap my head around how to include the end of the line only when the first part matches.
Edit: The end goal is to make it easier to sort my 3k+ line text file.
basic, blacksmith, carpenter, chef, chemist, engineer, farmer, jeweler, mage, mason, scribe, tailor
I think what you're trying to do is filter your text file by removing lines that don't fit a set criteria. I've chosen the Atom text editor for this solution (because I'm running Windows OS and can't install gedit, and I want to ensure you have a working example).
To remove only lines that don't have a first argument ending in _arrow, one could do (?!recipes\.addShaped\("[^"]+_arrow")recipes.+\r?\n? and replace with nothing.
As a note: this task is made more difficult by Atom's limited regex support. In a better-supported environment, my answer would probably be ^recipes\.addShaped\("[^"]+(?<!_arrow)".+\r?\n? (with multiline mode).
Also, please read "What should I do when someone answers my question?".
Regex explained:
(?! ) is a negative lookahead, which peeks at the succeeding text to ensure it doesn't contain "_arrow" at end of the first argument.
\. is an escaped literal period
[^"] is a character class that signifies a character that is not a ".
+ is a quantifier which tells the regex to match the preceding character or subexpression as many times as possible, with a minimum of one time.
. is a wildcard, representing any character except a newline (by default)
\r?\n? is used to match any kind of newline, with the ? quantifier making each character optional.
Everything else is literal characters; each one matches exactly itself.
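The replace-with-nothing step can also be tried outside an editor. Below is a minimal sketch in Python, assuming its re module as a stand-in for Atom's engine. The first input line is the sample from the question; the other two lines are hypothetical and were invented only for this illustration.

```python
import re

# The first line is the sample from the question; the other two are
# hypothetical lines invented for this illustration.
lines = [
    'recipes.addShaped("basemetals:adamantine_arrow", <basemetals:adamantine_arrow> * 4, [[<ore:nuggetAdamantine>], [<basemetals:adamantine_rod>], [<minecraft:feather>]]);',
    'recipes.addShaped("basemetals:adamantine_axe", <basemetals:adamantine_axe>, [[<ore:ingotAdamantine>]]);',
    'recipes.addShaped("basemetals:brass_arrow", <basemetals:brass_arrow> * 4, [[<ore:nuggetBrass>]]);',
]
text = "\n".join(lines) + "\n"

# Negative lookahead: only match a "recipes..." line whose first
# argument does NOT end in _arrow, then consume the line and its newline.
not_arrow = re.compile(r'(?!recipes\.addShaped\("[^"]+_arrow")recipes.+\r?\n?')

# Replacing every such match with nothing leaves only the _arrow lines.
filtered = not_arrow.sub("", text)
print(filtered)
```

With this input, only the two lines whose first argument ends in _arrow should remain.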
I need some help with building up my regex.
What I am trying to do is match a specific part of text with unpredictable parts in between the fixed words. An example is the sentence one gets when replying to an email:
On date at time person name has written:
The italicized parts (date, time, and name) are variable; they might contain spaces, or a new line might start at that point.
To get this, I built up my regex as such: On[\s\S]+?at[\s\S]+?person[\s\S]+?has written:
Basically, the [\s\S]+? is supposed to fill in any letter, number, space or break/new line, as I am unable to predict what could be between the fixed words that I am sure will always be there.
Now comes the hard part, when I would add the word "On" somewhere in the text above the sentence that I want to match, the regex now matches a much bigger text than I want. This is due to the use of [\s\S]+.
How can I make my regex match as few characters as possible? Using "?" after the "+" to make it lazy does not help.
Example is here with words "From - This - Point - Everything:". Cases are ignored.
Correct: https://regexr.com/3jdek.
Wrong because of added "From": https://regexr.com/3jdfc
The regex is to be used in VB.NET
A more realistic example, with HTML tags, can be found here. Here, I avoided using [\s\S]+? or (.+)?(\r)?(\n)?(.+?)
Correct: https://regexr.com/3jdd1
Wrong: https://regexr.com/3jdfu after adding certain parts of the regex to the text above. Although this is barely possible in HTML, as the user would never write the matching tag himself, I do want to make sure my regex is correct, just in case.
These things are certain: I know what the part of text starts with, no matter where it sits in the entire text; I know what the part of text ends with; and there are specific fixed words that might make the regex more reliable, but they can be omitted. Any text below the searched part is also allowed to be matched, but no text above may be matched at all.
Another example where it goes wrong: https://regexr.com/3jdli. Basically, I have less to go with in this text, so the regex has less tokens to work with. Adding just the first < already makes the regex take too much.
From my own experience, most problems are avoided when making sure I do not use any [\s\S]+? before I did a (\r)?(\n)? first
[\s\S] matches every character because it is the union of two complementary sets; it behaves like . with the /s option (dot matches newlines). Quantifiers are greedy by default, so the longest possible match is returned; and even a lazy quantifier only shortens the match at its right end, while the match still begins at the leftmost position where the pattern can succeed.
Following the correct link, the token just after the shortest match must be geschreven. So another way to write this without lazy quantifiers, and a more flexible one, is to put a negative lookahead in front of the repeated character inside the loop,
so
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft(.+?(?=geschreven))geschreven:
becomes
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft((?:(?!geschreven).)+)geschreven:
(?: ) is a non-capturing group, which just wraps the negative lookahead and the . (which can be replaced by [\s\S])
(?! ) inside it is the negative lookahead, which ensures that the current position, before the next character is consumed, is not the beginning of the end token.
Following the comments: you can state explicitly what must not appear in each repeated sequence:
From(?:(?!this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!this|point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
These variants show what the technique (?:(?!tokens)[\s\S])+ does:
in the first, this can't appear between From and this
in the second, neither From nor this can appear between From and this
in the third, additionally, neither this nor point can appear between this and point
etc.
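To see the effect, here is a small sketch comparing the lazy version with the second variant above. The question targets VB.NET; Python's re module is used here only as an assumed stand-in, and the sample text is invented for the illustration.

```python
import re

# Invented sample: an extra "From" appears above the sentence we want.
text = "From here on. From - This - Point - Everything: rest"

lazy = re.compile(
    r"From[\s\S]+?this[\s\S]+?point[\s\S]+?everything:",
    re.IGNORECASE,
)
tempered = re.compile(
    r"From(?:(?!From|this)[\s\S])+this"
    r"(?:(?!point)[\s\S])+point"
    r"(?:(?!everything)[\s\S])+everything:",
    re.IGNORECASE,
)

# The lazy pattern still starts at the first "From" in the text.
lazy_match = lazy.search(text).group(0)

# The lookahead forbids a second "From" inside the gap, so the attempt at
# the first "From" fails and the match starts at the second one.
tempered_match = tempered.search(text).group(0)

print(lazy_match)      # From here on. From - This - Point - Everything:
print(tempered_match)  # From - This - Point - Everything:
```

The lookahead is checked before every single character, so this form does more work per character than a plain lazy quantifier; that is the price for keeping the start token out of the gap.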
Quite a simple one in theory but can't quite get it!
I want a regex in ant which matches anything as long as it has a slash on the end.
Below is what I expect to work
<regexp id="slash.end.pattern" pattern="*/"/>
However this throws back
java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
*/
^
I have also tried escaping this to \*, but that matches a literal *.
Any help appreciated!
Your original regex pattern didn't work because * is a special character in regex that is only used to quantify other characters.
The pattern (.)*/$, which you mentioned in your comment, will match any string that ends in a slash and contains no newlines; however, it uses a possibly unnecessary capturing group. .*/$ should work just as well.
If you need to match newline characters, the dot . won't be enough. You could try something like [\s\S]*/$
On that note, it should be mentioned that you might not want to use $ in this pattern. Suppose you have the following string:
abc/def/
Should this be evaluated as two matches, abc/ and def/? Or is it a single match containing the whole thing? Your current approach creates a single match. If instead you would like to search for strings of characters and then stop the match as soon as a / is found, you could use something like this: [\s\S]*?/.
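The two behaviours can be compared directly. The sketch below uses Python's re module rather than Ant's Java engine (an assumption made for the sake of a runnable example); for these particular patterns the two engines should behave the same way.

```python
import re

s = "abc/def/"

# A bare quantifier has nothing to repeat, so compilation fails
# (Python raises re.error; Java raises PatternSyntaxException).
try:
    re.compile(r"*/")
    bare_star_compiles = True
except re.error:
    bare_star_compiles = False

# Greedy and anchored with $: one match covering the whole string.
greedy = re.findall(r"[\s\S]*/$", s)

# Lazy and unanchored: each match stops at the first slash it reaches.
lazy = re.findall(r"[\s\S]*?/", s)

print(bare_star_compiles)  # False
print(greedy)              # ['abc/def/']
print(lazy)                # ['abc/', 'def/']
```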
I am trying to get one regular expression that does the following:
makes sure there are no white-space characters
minimum length of 8
makes sure there is at least:
one non-alpha character
one upper case character
one lower case character
I found this regular expression:
((?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z])(?!\s).{8,})
which takes care of points 2 and 3 above, but how do I add the first requirement to the above regex expression?
I know I can do two expressions the one above and then
\s
but I'd like to have it all in one, I tried doing something like ?!\s but I couldn't get it to work. Any ideas?
^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z])\S{8,}$
should do. Be aware, though, that you're only validating ASCII letters. Is Ä not a letter for your requirements?
\S means "any character except whitespace", so by using this instead of the dot, and by anchoring the regex at the start and end of the string, we make sure that the string doesn't contain any whitespace.
I also removed the unnecessary parentheses around the entire expression.
Tim's answer works well, and is a good reminder that there are many ways to solve the same problem with regexes, but you were on the right track to finding a solution yourself. If you had changed (?!\s) to (?!.*\s) and added the ^ and $ anchors at the start and end, it would work.
^((?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z])(?!.*\s).{8,})$
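Both patterns can be checked against a few inputs. This is a minimal sketch using Python's re module (the question does not name an engine, so that is an assumption); the sample passwords and their expected results are invented for the illustration.

```python
import re

# Pattern from the first answer: \S replaces the dot.
with_nonspace = re.compile(r"^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z])\S{8,}$")

# Pattern from the second answer: extra lookahead (?!.*\s).
with_lookahead = re.compile(
    r"^((?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z])(?!.*\s).{8,})$"
)

# Invented samples mapped to whether they should be accepted.
samples = {
    "Passw0rd!": True,    # meets every requirement
    "Pass w0rd!": False,  # contains a space
    "password1": False,   # no upper case character
    "PASSWORD1": False,   # no lower case character
    "Pw0rd!": False,      # shorter than 8 characters
    "Passwordx": False,   # no non-alpha character
}

results = {
    s: (bool(with_nonspace.search(s)), bool(with_lookahead.search(s)))
    for s in samples
}
for s, (a, b) in results.items():
    print(f"{s!r}: {a} {b}")
```

For these samples the two patterns agree with each other.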
I'm using vim to do some pattern matching on a text file. I've enabled search highlighting so that I know exactly what is getting matched on each search and am getting confused.
Consider searching for [a-z]* on the following text: 123456789abcdefghijklmnopqrstuvwxyxz987654321ABCDEFGHIJKLMNOPQRSTUVWQXZ
I expected this search to match zero or more consecutive characters that are in the range [a-z]. Instead, I get a match on the entire line.
Should this be the expected behaviour?
Thanks,
Andrew
It's matching the empty strings that occur after every character. It has no way of highlighting empty ranges, so it looks like everything is highlighted.
Try searching for [a-z]\+ instead.
The empty string matches [a-z]*, so this pattern matches everywhere. Perhaps you want to cut down some of the cases by requiring [a-z]+ (1 or more) or [a-z]{4,} (4 or more); in vim's default syntax these are written [a-z]\+ and [a-z]\{4,}.
You're not getting a match on the entire line; you're getting a match at every character. Your pattern also matches the empty string, and an empty match is possible at every position.
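The empty matches become visible when the matches are listed instead of highlighted. The sketch below uses Python's re module as an assumed stand-in for vim's engine (the syntax differs, e.g. + instead of \+, but the zero-or-more behaviour is the same), with a shortened sample line.

```python
import re

line = "123abc456"

# * allows zero repetitions, so an empty match succeeds at every
# position where no lower case letter starts.
star = re.findall(r"[a-z]*", line)

# + requires at least one letter, so only the real run is found.
plus = re.findall(r"[a-z]+", line)

print(star)  # mostly empty strings, plus 'abc'
print(plus)  # ['abc']
```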