RegEx in VSCode: capture every character/letter - not just ASCII

I am working with historical text and I want to reformat it with RegEx. Problem is: There are lots of special characters (that is: letters) in the text that are not matched by RegEx character classes like [a-z] / [A-Z] or \w .
For example I want to match the dot (and only the dot) in the following line:
<tag1>Quomodo restituendus locus Demosth. Olÿnth</tag1>
Without the ÿ I could easily work with the mentioned character classes, like:
(?<=(<tag1>(\w|\s)*))\.(?=((\w|\s)*</tag1>))
But it does not work with special characters that are not covered by ASCII. I tried lots of things but I can't make it work so the RegEx really only captures the dot in this very line. If I use more general Expressions like (.)* (instead of (\w|\s)* ) I get many more of the dots in the document (for example dots that are not between an opening and a closing tag but in between two such tagsets), which is not what I want. Any ideas for an expression that covers like all unicode letters?

You may match any run of text that contains no < or > with [^<>]*:
(?<=(<tag1>[^<>]*))\.(?=([^<>]*</tag1>))
See the regex demo. Not sure you need all those capturing groups, you might get what you need without them:
(?<=<tag1>[^<>]*)\.(?=[^<>]*</tag1>)
See this regex demo. Details:
(?<=<tag1>[^<>]*) - a location immediately preceded by <tag1> and then any zero or more chars other than < and >
\. - a dot
(?=[^<>]*</tag1>) - a location immediately followed by any zero or more chars other than < and > and then </tag1>.
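The lookbehind above is variable-length, which VS Code's search accepts but some other engines (Python's re module, for one) reject. As a rough sketch of the same idea outside VS Code, one option is to match the whole tag with [^<>]* and then pick out the dots inside it. This is a different technique from the lookaround pattern above, and the helper name and sample string are made up for illustration:

```python
import re

# Match one <tag1>...</tag1> element whose content has no angle brackets.
# [^<>]* accepts any character, so non-ASCII letters such as ÿ are covered.
TAG1_RE = re.compile(r"<tag1>([^<>]*)</tag1>")

def find_dots_in_tag1(text):
    """Return the absolute offsets of every dot inside <tag1>...</tag1>."""
    offsets = []
    for match in TAG1_RE.finditer(text):
        content_start = match.start(1)
        for i, ch in enumerate(match.group(1)):
            if ch == ".":
                offsets.append(content_start + i)
    return offsets

# Hypothetical sample: one dot inside tag1, one between tags, one in tag2.
sample = (
    "<tag1>Quomodo restituendus locus Demosth. Olÿnth</tag1> "
    "outside. <tag2>x.</tag2>"
)
dot_offsets = find_dots_in_tag1(sample)
```

Only the dot after "Demosth" is reported; dots between tagsets or inside other tags are ignored.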

Use a negated character class that excludes the dot and the opening angle bracket:
(?<=<tag1>[^.<]*(?:<(?!/tag1>)[^.<]*)*)\.
With this kind of pattern, it isn't even necessary to check the closing tag. But if you absolutely want to check it, end the pattern with:
(?=[^<]*(?:<(?!/tag1>)[^<]*)*</tag1>)

Related

Regex: delete everything between String and replace with other

So I've been scratching my head over this one, I have over a thousand files that have different values between the strings
<lodDistances content="float_array">
15.000000
25.000000
70.000000
140.000000
500.000000
500.000000
</lodDistances>
I need to replace those values with these
<lodDistances content="float_array">
120.000000
200.000000
300.000000
400.000000
500.000000
550.000000
</lodDistances>
I tried the following without any success
\ (?<=\<lodDistances content\=\"float_array\"\>)(.*)(?=\<\/lodDistances\>)
It seems to find it in regexr but not in a sublime text when I try to find it in files, I constantly get 0 results. Any idea why this is happening?
There are a couple of things that are wrong in your pattern:
\< matches a leading word boundary position (as \b(?=\w)) and \> matches the trailing word boundary position (same as \b(?<=\w)). You wanted to match literal < and > chars, thus, you must NOT escape them
There is no need to match a space before the first <
Since your text is multiline, use either the (?s) inline modifier or a (?s:...) modifier group to make . match across line breaks, or use a [\s\S] / [\w\W] / [\d\D] workaround
Use a lazy dot pattern to stop matching at first occurrence of the trailing delimiter.
You may use
(?s)(<lodDistances content="float_array">\s*).*?(?=\s*</lodDistances>)
And replace with ${1}<new values>. The curly braces are necessary because the new values most likely start with a digit, and without the braces, $1n (n stands for a digit here) will be parsed as a reference to a different group (see this YT video for a demo of what can go wrong).
Regex details:
(?s) - now, . matches line break chars, too
(<lodDistances content="float_array">\s*) - Group 1 capturing <lodDistances content="float_array"> text and then zero or more whitespaces
.*? - any zero or more chars, but as few as possible
(?=\s*</lodDistances>) - a positive lookahead that matches the location that is immediately followed with zero or more whitespaces and </lodDistances> text.
Note that / is not a special regex metacharacter, and since regex delimiter notation is not supported in Sublime Text, you never have to escape it here.
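For anyone who would rather script the change across the thousand files than use Find in Files, here is a minimal sketch of the same pattern in Python. Python is an assumption here, not what the question uses; its \g<1> syntax plays the role of ${1} and avoids the same digit ambiguity. The function name and sample are illustrative only:

```python
import re

# Same pattern as above: group 1 keeps the opening tag plus trailing
# whitespace, .*? lazily consumes the old values, and the lookahead stops
# before the whitespace that precedes the closing tag.
LOD_RE = re.compile(
    r'(?s)(<lodDistances content="float_array">\s*).*?(?=\s*</lodDistances>)'
)

NEW_VALUES = "\n".join(
    ["120.000000", "200.000000", "300.000000",
     "400.000000", "500.000000", "550.000000"]
)

def replace_lod_values(text):
    # \g<1> is unambiguous even though the replacement starts with a digit;
    # a plain \1 followed by "120..." would be read as another group number.
    return LOD_RE.sub(r"\g<1>" + NEW_VALUES, text)

old = (
    '<lodDistances content="float_array">\n'
    "15.000000\n25.000000\n70.000000\n140.000000\n500.000000\n500.000000\n"
    "</lodDistances>"
)
updated = replace_lod_values(old)
```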

How to allow spaces in between words?

EDIT: I've been experimenting, and it seems like putting this:
\(\w{1,12}\s*\)$
works, however, it only allows space at the end of the word.
example,
Matches
(stuff )
(stuff )
Does not
(st uff)
Regexp:
\(\w{1,12}\)
This matches the following:
(stuff)
But not:
(stu ff)
I want to be able to match spaces too.
I've tried putting \s but it just broke the whole thing, nothing would match. I saw one post on here that said to enclose the whole thing in a ^[]*$ with space in there. That only made the regex match everything.
This is for Google Forms validation if that helps. I'm completely new to regex, so go easy on me. I looked up my problem but could not find anything that worked with my regex. (Is it because of the parenthesis?)
For matching text like (st uff) or (st uff some more) you will need to write your regex like this,
\(\w{1,12}(?:\s+\w{1,12})*\)
Regex explanation:
\( - Literal start parenthesis
\w{1,12} - Match a word of length 1 to 12 like you wanted
(?:\s+\w{1,12})* - You need this group so it can match one or more spaces followed by a word of length 1 to 12, with the whole group repeating zero or more times
\) - Literal closing parenthesis
Demo
Now if you want to optionally also allow spaces just after starting parenthesis and ending parenthesis, you can just place \s* in the regex like this,
\(\s*\w{1,12}(?:\s+\w{1,12})*\s*\)
  ^^^                        ^^^
Demo with optional spaces
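A quick way to sanity-check both patterns before pasting them into a form is to run them against a few inputs. The sketch below uses Python and requires the whole input to match (fullmatch); that is an assumption about how the validation is applied, and the helper name is made up:

```python
import re

# Words of 1-12 characters separated by whitespace, inside parentheses.
WORDS_IN_PARENS = re.compile(r"\(\w{1,12}(?:\s+\w{1,12})*\)")
# Same, but also tolerating spaces right after "(" and right before ")".
PADDED_WORDS_IN_PARENS = re.compile(r"\(\s*\w{1,12}(?:\s+\w{1,12})*\s*\)")

def is_valid(pattern, text):
    """True when the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None

strict_results = {
    text: is_valid(WORDS_IN_PARENS, text)
    for text in ["(stuff)", "(st uff)", "(st uff some more)", "(stuff )", "()"]
}
padded_results = {
    text: is_valid(PADDED_WORDS_IN_PARENS, text)
    for text in ["(stuff )", "( st uff )", "()"]
}
```

The first pattern rejects "(stuff )" because of the trailing space, while the padded variant accepts it; both reject empty parentheses.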
If you are trying to get up to 12 characters between parentheses:
\([^\)]{1,12}\)
The [^\)] segment is a character class that represents all characters that aren't closing parentheses (^ inverts the class).
If you want some specific characters, like alphanumeric and spaces, group that into the character class instead:
\([\w ]{1,12}\)
Or
\([\w\s]{1,12}\)
If you want up to 12 word characters with an arbitrary number of spaces anywhere in between:
\(\s*(?:\w\s*){1,12}\)
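The difference between the character-class version and the last pattern is what counts toward the limit of 12: every character, or word characters only. A small Python sketch, assuming whole-input matching and with illustrative names:

```python
import re

# Spaces count toward the 12-character limit here.
BY_TOTAL_LENGTH = re.compile(r"\([\w ]{1,12}\)")
# Only word characters count; spaces may appear freely between them.
BY_WORD_CHARS = re.compile(r"\(\s*(?:\w\s*){1,12}\)")

def check(text):
    """Return (matches BY_TOTAL_LENGTH, matches BY_WORD_CHARS)."""
    return (
        BY_TOTAL_LENGTH.fullmatch(text) is not None,
        BY_WORD_CHARS.fullmatch(text) is not None,
    )

short_input = check("(stu ff)")          # 6 characters inside
spaced_input = check("(a b c d e f g)")  # 7 letters + 6 spaces = 13 characters
long_word = check("(abcdefghijklm)")     # 13 letters
```

"(a b c d e f g)" is rejected by the first pattern (13 characters inside the parentheses) but accepted by the second (only 7 word characters).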

How to find a particular string

I'm using Visual Studio 2017 and, in a very long text file, I'm searching for a particular function call but am unable to find it.
Here's the regex I'm using:
c\.CreateMap\<(\w)+\,\s+Address\>
and I want to match lines like these:
c.CreateMap<ClientAddress, Address>()
c.CreateMap<Responses.SiteAddress, Data.Address>()
and so on.
As soon as I add "Address" in the regex it stops matching anything.
What am I doing wrong?
You can try this
c\.CreateMap\<\w+\.?\w+?\,\s*\w*?\.?Address\>
Explanation
c\.CreateMap\< - Matches the literal text c.CreateMap<.
\w+ - Matches a word character one or more times.
\.? - Matches '.' zero or one time.
\w+? - Matches a word character one or more times, as few as possible.
\, - Matches ','.
\s* - Matches whitespace zero or more times.
\w*? - Matches a word character zero or more times, as few as possible.
\.? - Matches '.' zero or one time.
Address\> - Matches the literal text Address>.
Demo
P.S- In case you also want to match something like this.
c.CreateMap<Responses.SiteAddress.abc, Data.Address.xyz>()
You can use this.
c\.CreateMap\<(\w+\.?\w+?)*\,\s*(?:\w*?\.?)*Address(\.\w*)?\>
Demo
Here is general regex I can suggest:
c\.CreateMap\<[\w.]+,\s+(?:[\w.]+\.)?Address\>\s*\(\s*\)
This will match any term made of dots or word characters in the first position in the diamond. In the second position, it will match Address on its own, or one or more parent class names followed by a dot separator and then Address.
Demo
Note that I also include the empty function call parentheses in the regex. I also allow for flexibility in the whitespace that may appear after the diamond or between the parentheses.
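To see what this general pattern accepts and rejects, here is a small check in Python (an assumption; the question uses Visual Studio's search box, whose .NET-style engine treats this particular pattern the same way as far as I know). The < and > are left unescaped, and the last two candidate lines are hypothetical:

```python
import re

CREATE_MAP_RE = re.compile(
    r"c\.CreateMap<[\w.]+,\s+(?:[\w.]+\.)?Address>\s*\(\s*\)"
)

candidates = [
    "c.CreateMap<ClientAddress, Address>()",
    "c.CreateMap<Responses.SiteAddress, Data.Address>()",
    "c.CreateMap<Client, MyAddress>()",  # hypothetical: type merely ends in Address
    "c.CreateMap<Client, Order>()",      # hypothetical: unrelated type
]

# Keep only the lines where the second type argument is Address,
# optionally qualified by dotted parent names.
matching = [line for line in candidates if CREATE_MAP_RE.search(line)]
```

Because the optional prefix must end in a dot, a type such as MyAddress is not mistaken for Address.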
In your second example, you have an extra dot which is not handled. Your regex needs a little modification. Also, you don't need to escape <, > or , here. Use this:
c\.CreateMap<([\w.])+,\s+[\w.]*Address>
Demo
To match any of the function calls in your question, you can use:
c\.CreateMap[^)]+\)

Vim Regex: Capture *first* instance of word between characters

My organization has an in-house language, with syntax like:
cmo/create/mo1///tri
createpts/brick/xyz/2,2,2/0.,0.,0./1.,1.,1./1,1,1
I am writing a Vim syntax file, and would like to capture the first instance of a word enclosed by two characters (in this case, /), without capturing the characters themselves.
I.e., the regex would capture, from the lines above,
create
brick
My solution so far is to use this pattern:
[,/=" "].\{-}[,/=" "]
But from /this/and/this/and/this, it will capture /this/and/this/and/this/.
As you can see, the issue is two-fold: (i) my current solution is greedy, and (ii) captures the / characters as well, when I just want the words enclosed by these.
Thanks!
One possible solution:
^[^\/]\+\/\zs[^\/]\+\ze\/
^ anchor the search to the BOL,
[^\/]\+ one or more non-slash characters, as many as possible,
\/ a slash,
\zs start the match here,
[^\/]\+ one or more non-slash characters, as many as possible,
\ze end the match here,
\/ a slash.
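\zs and \ze are Vim-specific. To try the same logic outside Vim, a capture group can stand in for them. The sketch below uses Python, which is only an assumption for illustration; a syntax file would still need the Vim pattern above. It also excludes newlines from the character classes so that a match cannot spill across lines:

```python
import re

# ^            start of a line (MULTILINE)
# [^/\n]+/     first field and its trailing slash (not captured)
# ([^/\n]+)    second field: the part Vim would delimit with \zs ... \ze
# /            the slash that must follow it
SECOND_FIELD = re.compile(r"^[^/\n]+/([^/\n]+)/", re.MULTILINE)

source = (
    "cmo/create/mo1///tri\n"
    "createpts/brick/xyz/2,2,2/0.,0.,0./1.,1.,1./1,1,1"
)
keywords = SECOND_FIELD.findall(source)
```

Here keywords holds create and brick, one entry per line, without the surrounding slashes.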

Regex to match a word or a dot

This should be a fairly trivial question but I have spent quite some time on it and I'm unable to do it -
If this is my string -
"this/DT word/NN is/VBZ a/DT dot/NN ./."
I want to extract the immediate neighbors of /, be it a word, a comma or a full stop.
(\\w+)/(\\w+) gives the words before and after / but not the full stops etc.
I tried this - "\\.\\/\\.|(\\w+)/(\\w+)" for grabbing the full stops but it doesn't seem to work.
Can someone help, please? (I am trying this in R.)
Thanks!
Note that \w only matches letters, digits and an underscore. A dot/period belongs to punctuation and can be captured with Perl-like \p{P} or POSIX class [:punct:]. Thus, theoretically, you could use something like ([\\w[:punct:]]+)/([\\w[:punct:]]+) (or even a more POSIXish ([[:alpha:][:punct:]]+)/([[:alpha:][:punct:]]+)), but I guess matching non-whitespace characters on both sides of / suits your purpose best.
Here is an alternative to the (\\S+)/(\\S+) regex:
([^\\s]+)/([^\\s]+)
See regex demo
The [^\s] means any symbol other than a whitespace. Note that \S is the equivalent shorthand and also means any non-whitespace character.
If you can have no non-whitespace characters on either side of /, I believe
([^\\s]*)/([^\\s]*)
or
(\\S*)/(\\S*)
will work better for you since * will match 0 or more characters.
See another demo
You can use this regex
"(\\S+)/(\\S+)"
i.e. grab each non-space text before and after /.
RegEx Demo
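To see the difference between \w and \S on the sample string, here is a short check. It is written in Python rather than R (an assumption made for illustration), so the backslashes are not doubled as they would be in an R string literal:

```python
import re

tagged = "this/DT word/NN is/VBZ a/DT dot/NN ./."

# \w+ needs letters, digits or underscore on both sides, so "./." is skipped.
word_pairs = re.findall(r"(\w+)/(\w+)", tagged)

# \S+ accepts any non-whitespace run, so punctuation tokens are kept too.
token_pairs = re.findall(r"(\S+)/(\S+)", tagged)
```

word_pairs stops at ('dot', 'NN'), while token_pairs also includes ('.', '.').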