Regex to match mixed lower alphanumeric strings - regex

I'm working with a large document in Sublime Text 3, whose Find and Replace feature takes regex. Each string in the document is separated by a line break. I need a regex that will match strings made up of lowercase alphanumeric characters mixed in any order, such as the following:
aa0555aaaaf
593dm03ks03
19204f02040
After looking into regex, the best I've been able to come up with so far is the below:
^[a-z][0-9]{11,}$\n
...although this only seems to match strings that start with letters and end in numbers, and for some reason doesn't seem to be case-sensitive either:
aa09304030
AA00450354

Try this one.
^[a-z0-9]{11,}$\n
Updated:
Remember to enable "case sensitive"
Updated:
Thanks to @Wiktor Stribiżew for pointing out the inline modifier that switches off case-insensitive mode:
(?-i)^[a-z0-9]{11,}$\R?
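To check the pattern outside the editor, here is a minimal Python sketch. It assumes Python's built-in re module, which does not support \R, so the trailing line-break part is left out and re.MULTILINE is used to make ^ and $ anchor at every line. The sample text and the name find_ids are illustrative only.

```python
import re

# Same character class and minimum length as the answer above.
# Matching is case-sensitive because re.IGNORECASE is not set.
LINE_RE = re.compile(r"^[a-z0-9]{11,}$", re.MULTILINE)

def find_ids(text):
    """Return every line made up only of 11+ lowercase letters and digits."""
    return LINE_RE.findall(text)

# Hypothetical sample: one uppercase line and one short line should be skipped.
sample = "aa0555aaaaf\n593dm03ks03\nAA004503540\nshort1\n19204f02040"
print(find_ids(sample))
```

Note that the pattern accepts all-letter or all-digit lines as well; it only restricts the allowed characters and the minimum length.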

Related

Regex to match \d\d_\d\d\d only

Could you please help me define a regex that would:
match the word r'(\d+_\d\d\d(?:_back)?)'
"word" means that it shouldn't be preceded or followed by anything except for the proper punctuation signs or beginning/end of string/line
work in multiline strings, anywhere in the strings, and in strings consisting only of this pattern and nothing else
not match in "%96_175" and "44_5555" (because neither the % nor the 4th "5" are punctuation characters).
Examples:
Pass (12_345, 012_345, or 012_345_back is the found group):
['12_345',
'bla-bla 012_345',
'bla-bla 12_345 bla-bla',
'34\n012_345',
'012_345\n34',
'text—012_345—text',
'text--12_345, text',
'text. 012_345_back.']
Fail (no match here):
[
'text12_345',
'12_345text',
'12_3456',
'%12_345',
'!12_345',
'.12-345',
'12_345_front'
]
What I am trying to distinguish is the proper identifier of the form \d+_\d\d\d(?:_back), inserted by a user in a comment on my website, from the same string being part of another string. The simple regex worked until someone inserted a link to a Wikipedia article ending with "№_175", which was URL-encoded to %E2%84%96_175, with "96_175" matching my pattern.
I got stuck trying to match the "proper punctuation signs" or the beginning or end of a string or line. By then the regex was already so complex (I was listing all the reasonable Unicode punctuation characters I could think of) that I thought I was doing something wrong. I also have difficulty excluding extra digits while still allowing a possible end of line or string.
Depending on how you need to handle (or not handle) symbols that are neither letters nor proper punctuation, you can either rely on the Python re word boundary \b (as suggested in one of the answers) or enumerate the 'proper' punctuation marks in opening and closing non-capturing groups.
If your regex engine provides a punctuation class (written \p here; Python's built-in re module does not have one), you could use
(?:\p*|^|\s)(\d+_\d\d\d)(_back)?(?:\n|\p|$|\s)
With Python's re module,
just replace \p with a character class built from string.punctuation, along the lines of
https://stackoverflow.com/a/37708340/5874981
For a start, assuming that only the full stop, comma and hyphen count as 'proper' punctuation, try
(?:^|\s|\.|,|-)(\d+_\d\d\d)(_back)?(?:$|\s|\.|,|-)
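A minimal Python sketch of the enumerated-punctuation pattern, assuming re.MULTILINE so that ^ and $ also anchor at line breaks. The helper name find_id is hypothetical. Because only the full stop, comma and hyphen are listed, an identifier wrapped in em dashes would not be found by this version.

```python
import re

# The pattern from the answer, unchanged; group 1 is the identifier and
# group 2 the optional "_back" suffix.
ID_RE = re.compile(
    r"(?:^|\s|\.|,|-)(\d+_\d\d\d)(_back)?(?:$|\s|\.|,|-)",
    re.MULTILINE,
)

def find_id(text):
    """Return the first identifier in text, or None if there is none."""
    m = ID_RE.search(text)
    if m is None:
        return None
    return m.group(1) + (m.group(2) or "")

for s in ["12_345", "text. 012_345_back.", "%12_345", "12_3456", "12_345_front"]:
    print(repr(s), "->", find_id(s))
```

The surrounding groups consume the separator characters, so two identifiers separated by a single space would need lookarounds instead if both must be found in one pass.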
I'm not sure whether I'm misunderstanding the question, but if the only problem is matching a whole word and ignoring any characters other than the ones you want, I'd suggest you try the regex word boundary \b.
So your regular expression would be \b\d+_\d\d\d(?:_back)?\b
Give it a try and tell me if that's what you need.
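For comparison, a short Python sketch of how the word-boundary version behaves on a few of the question's inputs. The helper name find_all is hypothetical. Since \b only separates word characters from non-word characters, symbols such as % and ! also count as boundaries, so the URL-encoded case from the question is still matched.

```python
import re

# The word-boundary pattern from the answer, unchanged.
WORD_RE = re.compile(r"\b\d+_\d\d\d(?:_back)?\b")

def find_all(text):
    """Return every identifier that sits between two word boundaries."""
    return WORD_RE.findall(text)

for s in ["bla-bla 012_345", "text12_345", "12_345_front", "%E2%84%96_175"]:
    print(repr(s), "->", find_all(s))
```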

Skip Second String Between Characters with Regex

I've been working on a regex issue. I have a lot of lines formatted like this:
3240985|#Apple.-+240538|34346|346356356|36433565|6agf8s89auf
The end goal should look like this:
#Apple.-+240538|6agf8s89auf
#Apple.-+240538 is random characters, and 6agf8s89auf is random alphanumeric characters.
I've been using (.*?)[\|] and replacing the parts I need with blank characters in Notepad++ but it's impossible to complete it this way with the number of lines I have.
The regex for this kind of string is (?:(?<=^)|(?<=\|))(\d+(?:$|\|))
Demo: https://regex101.com/r/sO0fZ2/2
However, Find and Replace in Notepad++ may leave some matches behind, because Notepad++ finds and replaces strings in a single pass. Some other text editors, such as Sublime Text, find and replace the contents recursively. You can simply work around this by clicking the Replace All button multiple times.
[Screenshots: the input, and the result after clicking "Replace All in All Opened Documents" twice]
In Sublime Text, you can achieve this in a single click:
[Screenshots: the input and the result]
P.S.: I'm not aware if there's any feature in Notepad++ that finds and replaces the content recursively. You can google for that. If there's any feature like that, then you can use it. However, I think that this shouldn't be a problem because it will only require a couple of more clicks.
There is a simple approach with an alternation:
^\d+\||\|\d+(?=\||$)
Details:
^\d+\| - Branch 1 matching a chunk of 1+ digits (\d+) at the beginning of the string (^) and a | after them
| - alternation operator meaning OR
\|\d+(?=\||$) - Branch 2 matching a literal pipe (\|, must be escaped) with 1+ digits after it (\d+), followed by a literal pipe or the end of the string. (?=...) is a positive lookahead that does not consume any text, so adjacent fields can still be matched with the same pattern.
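The alternation can also be tried outside the editor. Below is a minimal Python sketch, assuming re.MULTILINE so that ^ and $ apply to each line of a multi-line input; the function name strip_numeric_fields is hypothetical.

```python
import re

# Branch 1 removes a leading all-digit field together with its pipe.
# Branch 2 removes a pipe plus an all-digit field, as long as a pipe or
# the end of the line follows (checked with a lookahead, not consumed).
STRIP_RE = re.compile(r"^\d+\||\|\d+(?=\||$)", re.MULTILINE)

def strip_numeric_fields(text):
    """Delete purely numeric pipe-separated fields from every line."""
    return STRIP_RE.sub("", text)

line = "3240985|#Apple.-+240538|34346|346356356|36433565|6agf8s89auf"
print(strip_numeric_fields(line))
```

Because the lookahead leaves the following pipe in place, consecutive numeric fields are all removed in one pass. A last field consisting only of digits would be removed too, which may or may not be wanted.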

How to exclude a certain word in regex

I have a text document that I need to modify. Most of the words are separated by a "-" (minus) character.
So in sublime text, I tried this pattern:
(\w+)\-(\w+)
This pattern works perfectly fine but there is one word that has "-" (minus) character naturally in the document. (Eg: foo-bar)
So I need a pattern that finds all minus-separated words but excludes "foo-bar".
Sorry if this question has been asked before, but I couldn't find the answer I needed.
You can use a negative look-ahead (with optional i switch to match words in a case-insensitive way):
(?i)(?!\bfoo\-bar\b)\b(\w+)-(\w+)\b
Mind that this will only work with non-overlapping matches.
See example: [screenshot]
If you want to replace the hyphen with a space in the cases from the example, you can search for (?!\bfoo\-bar\b)\b(\w+)\-(?=\w) and replace with $1 followed by a space (result: go there now).
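Both patterns can be tried in Python as a rough check. This sketch assumes Python's re module, passes re.IGNORECASE instead of the inline (?i) switch (for both patterns, which is a choice made here), and writes the replacement as \1 plus a space because Python uses \1 rather than $1. The function names and sample strings are hypothetical.

```python
import re

# Negative lookahead: refuse to start a match at the word "foo-bar".
PAIR_RE = re.compile(r"(?!\bfoo-bar\b)\b(\w+)-(\w+)\b", re.IGNORECASE)
HYPHEN_RE = re.compile(r"(?!\bfoo-bar\b)\b(\w+)-(?=\w)", re.IGNORECASE)

def find_pairs(text):
    """Return (left, right) tuples for hyphenated pairs other than foo-bar."""
    return PAIR_RE.findall(text)

def hyphens_to_spaces(text):
    """Replace word-joining hyphens with spaces, leaving foo-bar intact."""
    return HYPHEN_RE.sub(r"\1 ", text)

print(find_pairs("go-there foo-bar now-then"))
print(hyphens_to_spaces("go-there-now and foo-bar"))
```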

Regular expression - finding specific string with at least one capital letter

I am looking for a regular expression which matches a specific string which:
always starts with "fu:
always ends with "
and contains at least one capital letter between those start and end points
Point 3 is the part I really can't solve.
The regex "fu:(.*)?" satisfies everything except point 3.
[edit]
It's pretty close now; the only problem is that it doesn't stop after the second ".
Basically this string:
"fu:no capital letter:,some other random text WITH CAPITAL LETTERS"
is a match but shouldn't be.
The regex that will work for you is this:
/^"fu:.*?[A-Z].*?"$/
^"fu:.*[A-Z].*"$
Don't forget about multiline mode if you wish to search in several lines of text.
^"fu: - starts with "fu:
.* - any other characters
[A-Z] - capital letter
.* - other characters
"$ - " at the end
Good tool to test it: http://www.regexplanet.com/advanced/java/index.html
Something like
^"fu:([^"]*?[A-Z][^"]*?)"$
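To see why excluding the quote character matters, here is a small Python sketch comparing the .* version with the [^"] version. The sample strings are made up for illustration, in particular the third one, which places a closing quote before some unrelated capitalised text, and the helper name is_match is hypothetical.

```python
import re

# .* may run across an inner quote; [^"] may not.
GREEDY_RE = re.compile(r'^"fu:.*[A-Z].*"$')
NO_QUOTE_RE = re.compile(r'^"fu:([^"]*?[A-Z][^"]*?)"$')

def is_match(rx, s):
    """True if the whole string s satisfies the pattern rx."""
    return rx.search(s) is not None

samples = [
    '"fu:Hello world"',
    '"fu:no capital letter"',
    '"fu:no capital letter", some other text WITH CAPITALS"',
]
for s in samples:
    print(s, is_match(GREEDY_RE, s), is_match(NO_QUOTE_RE, s))
```

With [^"] the capital letter has to appear before the first closing quote, so trailing text with capitals can no longer cause a match.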
I commented on a problem with anubhava's solution (that it only matches upper case letters in the range A through Z), but then found the solution myself. Note that this requires a POSIX-compliant regular expression engine with support for Unicode.
My solution is
/^"fu:.*[[:upper:]].*"$/
It solves the problem of finding upper case letters in other languages than English (with partially or completely different alphabets).
An example in Ruby:
rx = /^"fu:.*[[:upper:]].*"$/
arr = ['"fu:Berlin"', '"fu:İstanbul"', '"fu:Washington"', '"fu:Örebro"', '"fu:Москва"']
arr.map {|s| s.scan rx}
In this case, all of the strings are matched.

Regex to match whole word with a particular definition of a word

I am doing a file search and replace for occurrences of specific words in perl. I'm not usually much of a perl or regex user. I have searched for other regex questions here but I couldn't find one which was quite right so I'm asking for help. My search and replace currently looks like this:
s/originalword/originalword_suffix/g
This matches cases of originalword that appear in the middle of another word, which I don't want. In my application of search and replace, a whole word can be defined as having the letters of the latin alphabet in lowercase or capital letters and the digits 0-9 and the symbol _ in any uninterrupted sequence. Anything else besides these characters, including any other symbols or any form of whitespace including line breaks or tabs, indicate operations or separators of some kind so they are outside the word boundaries. How do I modify my search and replace to only match whole words as I've defined them, without matching substrings?
Examples:
in the case that originalword = cat and originalword_suffix = cat_tastic
:cat { --> :cat_tastic {
:catalog { --> no change
Use the \b anchor to match only on a word boundary:
s/\bcat\b/cat_tastic/g
Note, though, that Perl's own definition of a "word" may differ slightly from yours. Reading the perlre reference guide a couple of times might help you understand regexps a bit better.
Running perl -pi -e "YOUR_REGEXP" in a terminal and entering in lines of text can help you understand and debug what a particular regexp is doing.
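The same word-boundary idea, sketched in Python rather than Perl so it can be tested quickly. The function name add_suffix and its defaults are hypothetical. re.ASCII is an assumption made here so that \b treats only A-Z, a-z, 0-9 and _ as word characters, which is the definition given in the question.

```python
import re

def add_suffix(text, word="cat", suffix="_tastic"):
    """Append suffix to word wherever word stands alone between boundaries."""
    pattern = r"\b" + re.escape(word) + r"\b"
    # A function replacement avoids any escape handling in the new text.
    return re.sub(pattern, lambda m: word + suffix, text, flags=re.ASCII)

print(add_suffix(":cat {"))
print(add_suffix(":catalog {"))
```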
You could try:
s/([^0-9a-z_])([0-9a-z_]+)([^0-9a-z_])/$1$2_tastic$3/gi
Basically, a non-word character, then a set of word characters, followed by a non-word character. The $1,$2,$3 represent the captured groups, and you replace $2 with $2_suffix.
Hope that helps; I'm not a Perl guy but pretty regex-savvy. Note that the above will fail if the word is the very first or very last thing in a string. I'm not sure whether Perl regexen allow the syntax, but if so, the first/last issue could be fixed with:
s/(^|[^0-9a-z_])([0-9a-z_]+)([^0-9a-z_]|$)/$1$2_tastic$3/gi
Using ^ and $ to match beginning/end of string.
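As an alternative sketch (a different technique from the captured groups above, and written in Python rather than Perl), negative lookarounds can spell out the question's word characters without consuming the neighbouring separators. This handles the start and end of the string as well as adjacent occurrences. The names WORD_CHARS and suffix_whole_word are hypothetical.

```python
import re

# The question's definition of a word character.
WORD_CHARS = "0-9A-Za-z_"

def suffix_whole_word(text, word, suffix):
    """Append suffix to word only where no word character touches it."""
    pattern = re.compile(
        r"(?<![" + WORD_CHARS + r"])"   # no word character directly before
        + re.escape(word)
        + r"(?![" + WORD_CHARS + r"])"  # no word character directly after
    )
    return pattern.sub(lambda m: m.group(0) + suffix, text)

print(suffix_whole_word(":cat {", "cat", "_tastic"))
print(suffix_whole_word("cat cat", "cat", "_tastic"))
```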
See the example on this page which explains boundary matchers
Enter your regex: \bdog\b
Enter input string to search: The dog plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.
Enter your regex: \bdog\b
Enter input string to search: The doggie plays in the yard.
No match found.