I'm trying to add some syntax colouring in vim for constants written in the standard uppercase form:
HELLO_WORLD
_GOOD_BYE_WORLD
when I go to http://regex101.com/ I am able to match these with the following:
/(_*[A-Z]+_*)+
but with vim it doesn't match anything.
/_ will match a single underscore, but /_* will not match multiple underscores; it matches every character. After reading some of the Vim regex documentation (http://vimdoc.sourceforge.net/htmldoc/pattern.html), it seems as though the underscore is used for extending matches across lines. However, all of the patterns listed in the documentation use \_ (an escaped underscore) as opposed to just the character.
How can I match words of this form?
And why does _* match every character?
I think \<[_A-Z]\+\> will do what you want.
Accepted answer is matching underscores and capital letters contained in lowercase words.
Vim uses a slightly different regex syntax: by default some key characters, such as + and (), need escaping. Here is the same regex formatted for Vim:
\(_*[A-Z]\+_*\)\+
For more info you can visit http://vimregex.com/
You can also use Vim's "very magic" mode, \v, so that these characters do not need escaping:
/\v(_*[A-Z]+_*)+
http://vim.wikia.com/wiki/Simplifying_regular_expressions_using_magic_and_no-magic
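On the question of why _* seems to match every character: * allows zero repetitions, so _* succeeds with an empty match at any position, which is presumably what Vim is highlighting. The sketch below illustrates this with Python's re module rather than Vim's own engine (an assumption: the quantifier behaves the same way in both), with \b standing in for Vim's \< and \>. The sample line is made up for illustration.

```python
import re

sample = "HELLO_WORLD _GOOD_BYE_WORLD fooBar"  # made-up test line

# `_*` allows zero underscores, so it succeeds with an empty match at position 0.
empty = re.search(r"_*", sample)
print(empty.span())

# `_+` needs at least one underscore, so it finds the first real one.
first = re.search(r"_+", sample)
print(first.span())

# Unanchored pattern: also picks up the capital letter inside a lowercase word.
loose = [m.group(0) for m in re.finditer(r"(?:_*[A-Z]+_*)+", sample)]
print(loose)

# Word-boundary version, the re counterpart of Vim's \<[_A-Z]\+\>.
strict = re.findall(r"\b[_A-Z]+\b", sample)
print(strict)
```

The unanchored pattern reports the B in fooBar, which is the behaviour the comment above complains about; the word-boundary version does not.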
I am trying to extract words that have at least one character from a special character set. It picks up some words and not others. Here is a link to regex101 to test it. This is the regex: \b(\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*)\b, and this is the sample sentence I am using:
His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn Mūsa ibn
Al-Daḥāk Al-Sulamī Al-Tirmidhī.
It should match the following words:
ʿĪsa Muḥammad ʿĪsa Mūsa Al-Daḥāk Al-Sulamī Al-Tirmidhī
I am not too experienced with regex, so I have no idea what I am doing wrong. If someone knows any tool to find out why a specific word doesn't match a regex pattern, please let me know as well.
You can use
[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ][\wāīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]*
After matching the one required special character, use another character set to match more occurrences of those characters or normal word characters.
https://regex101.com/r/ovJoLt/2
You can make this work by enabling the Unicode flag /u (so that the word boundary \b assertions support Unicode characters) and adding hyphens to the surrounding character groups:
/\b[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+[\w-]*\b/gu
Plus, you don't need the capturing group, since the only characters being matched form the desired output anyway (\b is a zero-width assertion).
Demo
You are not doing anything wrong, except that to match Unicode word boundaries you have to enable the u modifier or use (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*(?!\S)
If you want to match hyphen add it to your character class (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]+\w*(?!\S)
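For anyone trying this outside regex101, here is a minimal sketch of the character-class approach in Python 3, where \w is already Unicode-aware for str patterns, so no extra flag is needed. The names nfc, SPECIAL and WORD_RE are made up for this example, and normalizing to NFC is an added assumption so that letters typed as base letter plus combining mark still compare equal to the precomposed ones.

```python
import re
import unicodedata


def nfc(text):
    # Compose "base letter + combining mark" sequences into single code points.
    return unicodedata.normalize("NFC", text)


SPECIAL = nfc("āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ")

# Optional word characters or hyphens, one required special character,
# then any further word characters, special characters or hyphens.
WORD_RE = re.compile(r"[\w-]*[" + SPECIAL + r"][\w" + SPECIAL + r"-]*")

sentence = nfc(
    "His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn Mūsa ibn "
    "Al-Daḥāk Al-Sulamī Al-Tirmidhī."
)

words = WORD_RE.findall(sentence)
print(words)
```

The trailing full stop is left out of the last word because it is neither a word character nor a hyphen.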
I'm trying to match the three first text lines in regex, i.e. the ones ending with form.
value="something form"
value="Second cool form"
value="another silly old form"
value="blabla"
How can I do that?
I don't know what tool you are using, but the following pattern should match the first three lines:
.*form"$
Demo
You could simply use:
.*form"$
In order to work, you would have to turn on multiline mode.
The dot (.) matches any character except a newline, and the asterisk (*) repeats it zero or more times, so .* matches whatever precedes the literal text form". The dollar sign ($) anchors the match to the end of the string or, in multiline mode, to the end of each line.
Take a look at demo. You should learn more about regular expressions here, this is basic regex matching.
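Since the question does not say which tool is in use, here is a small sketch of the same pattern in Python, purely as an illustration of what multiline mode changes. The variable names are made up for the example.

```python
import re

text = '''value="something form"
value="Second cool form"
value="another silly old form"
value="blabla"'''

# re.MULTILINE makes $ match at the end of every line, not only at the end of the string.
form_lines = re.findall(r'.*form"$', text, flags=re.MULTILINE)
print(form_lines)

# Without the flag, $ only matches at the very end, where the last line is "blabla".
no_flag = re.findall(r'.*form"$', text)
print(no_flag)
```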
You can try using this:
\w*form\b
\w*: Allows characters in front of form
\b: A word boundary; makes sure that form is at the end of a word, not necessarily at the end of the string.
Regex 101 demo
Actually if you want to match the 'form' as a separate word, you need something like this:
\Wform\W
\W (capital W) is any character which does not represent a word character, at least in perl-like regex.
Could you please help me define a regex that would:
match the word r'(\d+_\d\d\d(?:_back)?)'
"word" means that it shouldn't be preceded or followed by anything except for the proper punctuation signs or beginning/end of string/line
work in multiline strings, anywhere in the strings, and in strings consisting only of this pattern and nothing else
not match in %96_175 and 44_5555 (because neither the % nor the 4th "5" are punctuation characters).
Examples:
Pass (12_345, 012_345, or 012_345_back is the found group):
['12_345',
'bla-bla 012_345',
'bla-bla 12_345 bla-bla',
'34\n012_345',
'012_345\n34',
'text—012_345—text',
'text--12_345, text',
'text. 012_345_back.']
Fail (no match here):
[
'text12_345',
'12_345text',
'12_3456',
'%12_345',
'!12_345',
'.12-345',
'12_345_front'
]
What I am trying to distinguish is the proper identifier of the form \d+_\d\d\d(?:_back), inserted by a user in a comment on my web site, from the same string being part of another string. The simple regex worked until someone inserted a link to a Wikipedia article ending with "№_175", which was URL-encoded to %E2%84%96_175, with "96_175" matching my pattern.
I've got stuck at trying to match the "proper punctuation signs" or the beginning or end of string or line in a string. And by then the regex was already so complex (I was listing all reasonable unicode punctuation characters I could think of) that I thought I was doing something wrong. I also have difficulties excluding extra digits but including possible end of line or string.
Depending on how you need to handle (or not handle) symbols that are neither letters nor proper punctuation, you can either rely on Python's re word boundary \b (as suggested in one of the answers) or enumerate the 'proper' punctuation marks in opening and closing non-capturing groups.
Python's built-in re module has no \p punctuation class; the third-party regex module does, written \p{P}:
(?:\p{P}*|^|\s)(\d+_\d\d\d)(_back)?(?:\n|\p{P}|$|\s)
With the built-in re module,
just replace \p{P} with a character class built from string.punctuation, along the lines of
https://stackoverflow.com/a/37708340/5874981
For a start, assuming that only the full stop, comma and hyphen count as sufficiently 'proper', try
(?:^|\s|\.|,|-)(\d+_\d\d\d)(_back)?(?:$|\s|\.|,|-)
I'm not sure whether I'm misunderstanding the question, but if the only problem is matching a whole word and ignoring any surrounding characters other than the ones you want, I'd suggest trying the regex word boundary, \b.
So your regular expression would be \b\d+_\d\d\d(?:_back)?\b
Give it a try and tell me if that's what you need.
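One thing to watch with \b is that it also sees a boundary between % or ! and a digit, so those cases from the question still match. A possible alternative, sketched below, uses lookarounds that accept only the start or end of the string or a character from an allowed set. The set chosen here (whitespace, full stop, comma, hyphen, em dash) is an assumption read off the examples in the question, and ALLOWED, ID_RE and find_ids are names made up for this sketch.

```python
import re

# Characters allowed to touch the identifier: whitespace, full stop, comma,
# hyphen and em dash (\u2014). This set is an assumption; extend it as needed.
ALLOWED = r"\s.,\-\u2014"

ID_RE = re.compile(
    r"(?<![^" + ALLOWED + r"])"   # start of string, or an allowed character before
    r"(\d+_\d\d\d(?:_back)?)"     # the identifier itself
    r"(?![^" + ALLOWED + r"])"    # end of string, or an allowed character after
)


def find_ids(text):
    # Return every identifier that stands on its own in the text.
    return ID_RE.findall(text)


should_pass = ['12_345', 'bla-bla 012_345', 'bla-bla 12_345 bla-bla',
               '34\n012_345', '012_345\n34', 'text\u2014012_345\u2014text',
               'text--12_345, text', 'text. 012_345_back.']
should_fail = ['text12_345', '12_345text', '12_3456', '%12_345',
               '!12_345', '.12-345', '12_345_front']

passed = [find_ids(s) for s in should_pass]
failed = [find_ids(s) for s in should_fail]
print(passed)
print(failed)

# For comparison: the plain word-boundary pattern still matches after "%".
boundary_hit = re.findall(r"\b\d+_\d\d\d(?:_back)?\b", "%12_345")
print(boundary_hit)
```

The double negative (a negative lookaround on a negated class) is what lets the start and end of the string count as valid neighbours without listing them separately.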
I have the following cases that should match with a regular expression, I've tried several combinations and have read a lot of answers but still no clue on how to solve it.
The rule is: find any combination of . inside a quoted string. At the moment I have the following regexp:
\"\w*((..)|(.))\w*\"
that covers most of the cases:
mmmas"A.F"asdaAA
196.34.45.."asd."#
".add"
sss"a.aa"sss
".."
"a.."
"a..a"
"..A"
but still having problems with this one:
"WERA.HJJ..J"
I've been testing the regexp on the http://regexr.com/ site.
I will really appreciate any help on this
Change your regex to
\"\w*(\.+\w*)+\"
Update: escape . to match the dot and not any character
demo
From the question, it seems that you need to find every occurrence of one or more dot (along with optional word characters) inside a pair of quotes. The following regex would do this:
\"\w*(\.+\w*)+\"
In "WERA.HJJ..J", you have some word characters followed by a dot, which is followed by a sequence of word characters, again followed by dots and word characters. Your regex would only match one or two dots with an optional block of word characters on either side.
The dots in the regex are escaped to avoid them being matched against any character, since it is a metacharacter.
Check here.
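As a quick check of the suggested pattern against every sample in the question, here is a small Python sketch. The group is written as non-capturing, which does not change what matches, and QUOTED_DOTS, samples and found are names made up for the example.

```python
import re

# Same pattern as suggested above; (?: ) only avoids an unused capture group.
QUOTED_DOTS = re.compile(r'"\w*(?:\.+\w*)+"')

samples = [
    'mmmas"A.F"asdaAA',
    '196.34.45.."asd."#',
    '".add"',
    'sss"a.aa"sss',
    '".."',
    '"a.."',
    '"a..a"',
    '"..A"',
    '"WERA.HJJ..J"',
]

# search() returns the first quoted piece that contains at least one dot.
found = [QUOTED_DOTS.search(s).group(0) for s in samples]
print(found)

# A quoted string without any dot is not matched.
no_dot = QUOTED_DOTS.search('"abc"')
print(no_dot)
```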
As I am having a hard time creating a regex that would match letters only, including accented characters (i.e. Czech characters), I would like to go the other way around for my name validation: detect special characters and numbers.
What would be regex that matches special characters and numbers?
To expand on what @anubhava wrote: \w stands for [a-zA-Z0-9_], and capitalizing it (\W) negates the character class. If you want to match _ too, you'll have to make your own character class like [^a-zA-Z0-9] (everything but alphanumeric). This can also be shortened to [^a-z\d] if you use the i modifier. Note that this would also match accented characters, since they are not in a-zA-Z0-9.
Example
However, I always advise against trying to use a "regular" expression to match a name (since names are not regular). See this blog post.
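With that caveat in mind, here is a rough sketch of the difference between the ASCII-only class and a Unicode-aware one, using Python 3 as an assumed environment (the question does not name a language). In Python 3, \w and \W are Unicode-aware for str patterns, so [\W\d_] flags digits, underscores and symbols while leaving accented letters alone. NOT_LETTER and first_bad_char are hypothetical names for this example.

```python
import re

# Anything that is not a word character, plus digits and the underscore.
# In Python 3, accented letters count as word characters for str patterns.
NOT_LETTER = re.compile(r"[\W\d_]")


def first_bad_char(name):
    # Return the first character that is not a letter, or None if all are letters.
    m = NOT_LETTER.search(name)
    return m.group(0) if m else None


# "Dvořák", written with escapes to pin the precomposed accented letters.
czech = "Dvo\u0159\u00e1k"

print(first_bad_char(czech))
print(first_bad_char("J4ne"))
print(first_bad_char("a_b"))
print(first_bad_char("Anne-Marie"))

# The ASCII-only class treats the accented letter itself as "special".
ascii_only = re.search(r"[^a-zA-Z0-9]", czech).group(0)
print(ascii_only == "\u0159")
```

Note that hyphens, spaces and apostrophes are flagged too, so a real name check would have to decide which of those to allow.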