How to exclude a certain word in a regex - regex

I have a text document that I need to modify. Most of the words are separated by a "-" (minus) character.
So in sublime text, I tried this pattern:
(\w+)\-(\w+)
This pattern works perfectly fine, but there is one word in the document that naturally contains a "-" (minus) character (e.g. foo-bar).
So I need a pattern that finds all minus-separated words but excludes "foo-bar".
Sorry if this question was asked before, but I couldn't find the answer I needed.

You can use a negative look-ahead (with optional i switch to match words in a case-insensitive way):
(?i)(?!\bfoo\-bar\b)\b(\w+)-(\w+)\b
Mind that this will only work with non-overlapping matches.
If you want to replace the hyphen with a space, use (?!\bfoo\-bar\b)\b(\w+)\-(?=\w) as the search regex and replace with $1 followed by a space (for example, go-there-now becomes go there now).
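As a rough check outside the editor, the same lookahead idea can be sketched with Python's re module. This is only an illustration: Sublime Text uses a different regex engine, and the sample sentence and the helper name replace_hyphens are made up here.

```python
import re

# The negative lookahead rejects any match that would start at the whole
# word "foo-bar"; every other word-word pair loses its hyphen.
# The sample text and the function name are assumptions for illustration.
PATTERN = re.compile(r"(?!\bfoo-bar\b)\b(\w+)-(?=\w)", re.IGNORECASE)


def replace_hyphens(text: str) -> str:
    # \1 is the word before the hyphen; the trailing space replaces the hyphen.
    return PATTERN.sub(r"\1 ", text)


print(replace_hyphens("go-there-now but keep foo-bar and Foo-Bar"))
```

Because the hyphen is followed by a lookahead rather than a second consumed word, chains such as go-there-now are handled in a single pass.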

Related

RegEx help for NotePad++

I need help with a regex that I just can't figure out. I need to search for broken hashtags that contain a space.
So the strings are, for example:
#ThisIsaHashtagWith Space
But the words "With Space" can also appear on their own, and I don't want to replace those.
What matters is that the string starts with "#", followed by any characters, and then the words "With Space", which I want to replace with "WithSpace" to repair the hashtag.
I have a document with 10k of these broken hashtags and I've been trying the whole day without success.
I have tried on regex101.com
with following RegEx:
^#+(?:.*?)+(With Space)
Even though it seems to work on regex101.com, it doesn't work in Notepad++.
Any help is appreciated.
Thanks a lot.
BR
In your current regex you match a # and then any character and in a capturing group match (With Space).
You could change the capturing group to capture the first part of the match.
(#+.*?)With Space
Then you could use that group in the replacement:
$1WithSpace
As an alternative you could first match a single # followed by zero or more times any character non greedy .*? and then use \K to reset the starting point of the reported match.
Then match With Space.
#+(?:.*?)\KWith Space
In the replacement use WithSpace
If you want to match one or more times # you could use a quantifier +. If the match should start at the beginning of string you could use an anchor ^ at the start of the regex.
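To try the capture-group variant outside Notepad++, here is a minimal Python sketch. Python's re module does not support \K, so only the $1 approach is shown (written as \1 in Python). The sample lines and the name repair_hashtag are assumptions for illustration.

```python
import re

# Capture everything from the "#" up to "With Space", then put it back
# in front of the repaired "WithSpace".
PATTERN = re.compile(r"(#+.*?)With Space")


def repair_hashtag(line: str) -> str:
    return PATTERN.sub(r"\1WithSpace", line)


print(repair_hashtag("#ThisIsaHashtagWith Space"))
print(repair_hashtag("Plain words With Space stay as they are"))
```

The second line is left alone because it contains no "#" for the pattern to start from.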
Try using ^(#.+?)(With\s+Space) for your regex as it also matches multiple spaces and tab characters - if you have multiple rows that you want to affect do gmi for the flags. I just tried it with the following two strings, each on a separate line in Notepad++
#blablaWith Space
#hello###$aWith Space
The replace with value is set to $1WithSpace and I've tried both replaceAll and replace one by one - seems to result in the following.
#blablaWithSpace
#hello###$aWithSpace
Feel free to comment with other strings you want replaced. Also be sure that you have selected the Regular expression search mode in NPP.
Try this? (#.*)( ).
I tried this in Notepad++ and you should be able to just replace all with $1. Make sure you set the find mode to regular expressions first.
const str = "#ThisIsAHashtagWith Space";
console.log(str.replace(/(#.*)( )/g, "$1"));

Regex to match \d\d_\d\d\d only

Could you please help me define a regex that would:
match the word r'(\d+_\d\d\d(?:_back)?)'
"word" means that it shouldn't be preceded or followed by anything except for the proper punctuation signs or beginning/end of string/line
work in multiline strings, anywhere in the strings, and in strings consisting only of this pattern and nothing else
not match in %96_175" and 44_5555 (because neither the % nor the 4th "5" are punctuation characters).
Examples:
Pass (12_345, 012_345, or 012_345_back is the found group):
['12_345',
'bla-bla 012_345',
'bla-bla 12_345 bla-bla',
'34\n012_345',
'012_345\n34',
'text—012_345—text',
'text--12_345, text',
'text. 012_345_back.']
Fail (no match here):
[
'text12_345',
'12_345text',
'12_3456',
'%12_345',
'!12_345',
'.12-345',
'12_345_front'
]
What I am trying to distinguish is the proper identifier of the form \d+_\d\d\d(?:_back), inserted by a user in a comment on my website, from the same string being part of another string. The simple regex worked until someone inserted a link to a Wikipedia article ending with "№_175", which was URL-encoded to %E2%84%96_175, with "96_175" matching my pattern.
I got stuck trying to match the "proper punctuation signs" or the beginning or end of a string or line. By then the regex was already so complex (I was listing all the reasonable Unicode punctuation characters I could think of) that I thought I was doing something wrong. I also have difficulty excluding extra digits while still allowing a possible end of line or string.
Depending on how you need to handle (or not handle) non-letter, non-proper-punctuation symbols, you can either rely on Python re word-boundary detection \b (as suggested by one of the answers) or enumerate the 'proper' punctuation marks in opening and closing non-capturing groups.
With old regex (Python 2.5) you could use a punctuation wildcard \p
(?:\p*|^|\s)(\d+_\d\d\d)(_back)?(?:\n|\p|$|\s)
With modern re (Python 2.6 and higher)
just replace \p with string.punctuation along the lines of
https://stackoverflow.com/a/37708340/5874981
For starter, assuming that sufficiently 'proper' are only full stop, comma and hyphen try
(?:^|\s|\.|,|-)(\d+_\d\d\d)(_back)?(?:$|\s|\.|,|-)
I'm not sure if I'm misunderstanding the question but if the only problem you're having is to match a whole word and ignore any other characters than the ones you want, I'd suggest you to try regex word boundary
So your regular expression would be \b\d+_\d\d\d(?:_back)?\b
Give it a try and tell me if that's what you need.
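The difference between the two suggestions shows up on the question's own examples. The sketch below is a quick comparison, not a complete solution: the variable names are made up, and the delimiter set is only the full stop, comma and hyphen assumed above, so the em-dash example from the question would need that character added.

```python
import re

# Variant 1: explicit delimiters (start/end, whitespace, ".", ",", "-").
EXPLICIT = re.compile(r"(?:^|\s|\.|,|-)(\d+_\d\d\d)(_back)?(?:$|\s|\.|,|-)")
# Variant 2: plain word boundaries.
BOUNDARY = re.compile(r"\b\d+_\d\d\d(?:_back)?\b")

samples = [
    "bla-bla 012_345",
    "text. 012_345_back.",
    "12_3456",
    "%12_345",
    "12_345_front",
]

# Print whether each variant finds a match in each sample.
for s in samples:
    print(repr(s), bool(EXPLICIT.search(s)), bool(BOUNDARY.search(s)))
```

On these samples the two variants disagree only on "%12_345", which the word-boundary version accepts because % is not a word character.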

trying to find the correct regular expression

I have the following cases that should match with a regular expression, I've tried several combinations and have read a lot of answers but still no clue on how to solve it.
the rule is, find any combination of . inside a quoted string, atm I have the following regexp
\"\w*((..)|(.))\w*\"
that covers most of the cases:
mmmas"A.F"asdaAA
196.34.45.."asd."#
".add"
sss"a.aa"sss
".."
"a.."
"a..a"
"..A"
but still having problems with this one:
"WERA.HJJ..J"
I've been testing the regexp on the http://regexr.com/ site
I will really appreciate any help on this
Change your regex to
\"\w*(\.+\w*)+\"
Update: escape . to match the dot and not any character
From the question, it seems that you need to find every occurrence of one or more dot (along with optional word characters) inside a pair of quotes. The following regex would do this:
\"\w*(\.+\w*)+\"
In "WERA.HJJ..J", you have some word characters followed by a dot, which is followed by a sequence of word characters, again followed by dots and word characters. Your regex would only match one or two dots with an optional block of word characters on either side.
The dots in the regex are escaped to avoid them being matched against any character, since it is a metacharacter.
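A small Python sketch can confirm the behaviour described above on a few of the question's strings. The names OLD and NEW are made up for this comparison, and regexr.com may use a different regex flavour than Python's re module.

```python
import re

# Pattern from the question: the dots are unescaped and only one or two
# "any" characters are allowed between the word-character runs.
OLD = re.compile(r'"\w*((..)|(.))\w*"')
# Suggested pattern: one or more groups of literal dots, each optionally
# followed by word characters.
NEW = re.compile(r'"\w*(\.+\w*)+"')

samples = ['mmmas"A.F"asdaAA', '196.34.45.."asd."#', '".."', '"WERA.HJJ..J"']

# Show what each pattern matches (or None) for every sample.
for s in samples:
    old, new = OLD.search(s), NEW.search(s)
    print(s, "| old:", old.group(0) if old else None,
          "| new:", new.group(0) if new else None)
```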

Skip Second String Between Characters with Regex

I've been working on a regex issue. I have a lot of lines formatted like this:
3240985|#Apple.-+240538|34346|346356356|36433565|6agf8s89auf
The end goal should look like this:
#Apple.-+240538|6agf8s89auf
#Apple.-+240538 is random characters, and 6agf8s89auf is random alphanumeric characters.
I've been using (.*?)[\|] and replacing the parts I need with blank characters in Notepad++ but it's impossible to complete it this way with the number of lines I have.
The regex for this kind of string is (?:(?<=^)|(?<=\|))(\d+(?:$|\|))
Demo: https://regex101.com/r/sO0fZ2/2
However, Find and Replace in Notepad++ may have some issues, because Notepad++ finds and replaces strings only once. Some other text editors, like Sublime Text, find and replace the contents recursively. You can simply work around this by clicking the Replace All button multiple times.
[Screenshots: input, and result after clicking "Replace All in All Opened Documents" twice]
In Sublime Text, you can achieve this in a single click:
[Screenshots: input and result]
P.S.: I'm not aware if there's any feature in Notepad++ that finds and replaces the content recursively. You can google for that. If there's any feature like that, then you can use it. However, I think that this shouldn't be a problem because it will only require a couple of more clicks.
There is a simple approach with an alternation:
^\d+\||\|\d+(?=\||$)
Details:
^\d+\| - Branch 1 matching a chunk of 1+ digits (\d+) at the beginning of the string (^) and a | after them
| - alternation operator meaning OR
\|\d+(?=\||$) - a literal pipe (\|, must be escaped) with 1+ digits after it (\d+) that are followed with a literal pipe or end of string ((?=...) is a positive lookahead that does not advance the regex index, thus, you can still match adjacent matches with the same pattern.)
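The alternation can be tried on the sample line with a short Python sketch. The second input line is invented for illustration, and re.MULTILINE is an assumption about how the file would be processed (as one string with many lines).

```python
import re

# Branch 1 removes the leading numeric field together with its pipe;
# branch 2 removes any later all-digit field together with the pipe before it.
# re.MULTILINE makes ^ and $ apply to every line, not just the whole text.
PATTERN = re.compile(r"^\d+\||\|\d+(?=\||$)", re.MULTILINE)

text = (
    "3240985|#Apple.-+240538|34346|346356356|36433565|6agf8s89auf\n"
    "111|#Pear|222|333|x9y8z7"
)

result = PATTERN.sub("", text)
print(result)
```

Since the lookahead does not consume the following pipe, adjacent numeric fields are all removed in one pass.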

Regex to match whole word with a particular definition of a word

I am doing a file search and replace for occurrences of specific words in perl. I'm not usually much of a perl or regex user. I have searched for other regex questions here but I couldn't find one which was quite right so I'm asking for help. My search and replace currently looks like this:
s/originalword/originalword_suffix/g
This matches cases of originalword that appear in the middle of another word, which I don't want. In my application of search and replace, a whole word can be defined as having the letters of the latin alphabet in lowercase or capital letters and the digits 0-9 and the symbol _ in any uninterrupted sequence. Anything else besides these characters, including any other symbols or any form of whitespace including line breaks or tabs, indicate operations or separators of some kind so they are outside the word boundaries. How do I modify my search and replace to only match whole words as I've defined them, without matching substrings?
Examples:
in the case that originalword = cat and originalword_suffix = cat_tastic
:cat { --> :cat_tastic {
:catalog { --> no change
Use the \b anchor to match only on a word boundary:
s/\bcat\b/cat_tastic/g
Although Perl has a slightly different definition of what a "word" is. Reading the perlre reference guide a couple of times might help you understand regexps a bit better.
Running perl -pi -e "YOUR_REGEXP" in a terminal and entering in lines of text can help you understand and debug what a particular regexp is doing.
You could try:
s/([^0-9a-z_])(cat)([^0-9a-z_])/$1$2_tastic$3/gi
Basically, a non-word character, then the target word, followed by a non-word character. The $1, $2, $3 represent the captured groups, and you replace $2 with $2_tastic.
Hope that helps; I'm not a Perl guy but pretty regex-savvy. Note that the above will fail if the word is the very first or very last thing in a string. I'm not sure if Perl regexes allow the syntax, but if so, the first/last issue could be fixed with:
s/(^|[^0-9a-z_])(cat)([^0-9a-z_]|$)/$1$2_tastic$3/gi
Using ^ and $ to match beginning/end of string.
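Another way to express the question's own word definition is with lookarounds, which do not consume the neighbouring characters and also work at the start and end of the string. The sketch below uses Python rather than Perl, purely as an illustration; the helper name add_suffix is hypothetical.

```python
import re

# A "word" here is an uninterrupted run of [0-9A-Za-z_], as defined in the
# question. The lookarounds require that no such character touches the match.
WORD_CHARS = "0-9A-Za-z_"


def add_suffix(text: str, word: str, replacement: str) -> str:
    pattern = re.compile(
        rf"(?<![{WORD_CHARS}]){re.escape(word)}(?![{WORD_CHARS}])"
    )
    # A function replacement avoids backslash processing in `replacement`.
    return pattern.sub(lambda m: replacement, text)


print(add_suffix(":cat {", "cat", "cat_tastic"))
print(add_suffix(":catalog {", "cat", "cat_tastic"))
```

Because nothing around the word is consumed, two occurrences separated by a single space are both replaced.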
See the example on this page which explains boundary matchers
Enter your regex: \bdog\b
Enter input string to search: The dog plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.
Enter your regex: \bdog\b
Enter input string to search: The doggie plays in the yard.
No match found.