Regex for SublimeText Snippet - regex

I've been stuck for a while on this Sublime Snippet now.
I would like to display the correct package name when creating a new class, using TM_FILEPATH and TM_FILENAME.
When printing TM_FILEPATH variable, I get something like this:
/Users/caubry/d/[...]/src/com/[...]/folder/MyClass.as
I would like to transform this output, so I could get something like:
com.[...].folder
This includes:
Removing anything before /com/[...]/folder/MyClass.as;
Removing the TM_FILENAME, with its extension; in this example MyClass.as;
And finally finding all the slashes and replacing them by dots.
So far, this is what I've got:
${1:${TM_FILEPATH/.+(?:src\/)(.+)\.\w+/\l$1/}}
and this displays:
com/[...]/folder/MyClass
I do understand how to replace slashes with dots, such as:
${1:${TM_FILEPATH/\//./g/}}
However, I'm having difficulty combining this logic with the previous one, as well as removing the TM_FILENAME at the end.
I'm really inexperienced with Regex, thanks in advance.
:]
EDIT: [...] indicates variable number of folders.

We can do this in a single replacement with some trickery. What we'll do is, we put a few different cases into our pattern and do a different replacement for each of them. The trick to accomplish this is that the replacement string must contain no literal characters, but consist entirely of "backreferences". In that case, those groups that didn't participate in the match (because they were part of a different case) will simply be written back as an empty string and not contribute to the replacement. Let's get started.
First, we want to remove everything up until the last src/ (to mimic the behaviour of your snippet - use an ungreedy quantifier if you want to remove everything until the first src/):
^.+/src/
We just want to drop this, so there's no need to capture anything - nor to write anything back.
Now we want to match subsequent folders until the last one. We'll capture the folder name and also match the trailing /, but write back the folder name followed by a period. But I said no literal text in the replacement string! So the period has to come from a capture as well. This is where the assumption comes into play that your file always has an extension. We can grab the period from the file name with a lookahead. We'll also use that lookahead to make sure that there's at least one more folder ahead:
^.+/src/|\G([^/]+)/(?=[^/]+/.*([.]))
And we'll replace this with $1$2. Now if the first alternative catches, groups $1 and $2 will be empty, and the leading bit is still removed. If the second alternative catches, $1 will be the folder name, and $2 will have captured a period. Sweet. The \G is an anchor that ensures that all matches are adjacent to one another.
Finally, we'll match the last folder and everything that follows it, and only write back the folder name:
^.+/src/|\G([^/]+)/(?=[^/]+/.*([.]))|\G([^/]+)/[^/]+$
And now we'll replace this with $1$2$3 for the final solution. Demo.
A conceptually similar variant would be:
^.+/src/|\G([^/]+)/(?:(?=[^/]+/.*([.]))|[^/]+$)
replaced with $1$2. I've really only factored out the beginning of the second and third alternative. Demo.
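To check the behaviour outside Sublime, here is a minimal Python sketch of the three-alternative pattern. Two things are assumptions rather than part of the answer: the example path is made up, and the \G anchors are dropped because Python's re module does not support them (for a well-formed path the matches are adjacent anyway). A callback stands in for the $1$2$3 replacement and writes groups that did not participate back as empty strings.

```python
import re

# Same three alternatives as above, minus \G (not supported by Python's re).
PACKAGE_RE = re.compile(
    r"^.+/src/"                      # everything up to the last src/: dropped
    r"|([^/]+)/(?=[^/]+/.*([.]))"    # a folder with another folder ahead of it
    r"|([^/]+)/[^/]+$"               # the last folder plus the file name
)


def package_name(path: str) -> str:
    # Equivalent of "$1$2$3": unmatched groups contribute nothing.
    return PACKAGE_RE.sub(lambda m: "".join(g or "" for g in m.groups()), path)


# Hypothetical path, for illustration only.
example = "/Users/caubry/d/project/src/com/example/folder/MyClass.as"
print(package_name(example))  # com.example.folder
```

This only exercises the regex logic; how Sublime's snippet engine treats the same pattern still needs to be checked in the editor.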
Finally, if Sublime is using Boost's extended format string syntax, it is actually possible to get characters into the replacement conditionally (without magically conjuring them from the file extension):
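^.+/src/|\G/[^/]+$|\G(/)?([^/]+)
Now we have the first alternative for everything up to src (which is to be removed), the second alternative for the last slash and file name (which is to be removed), and the last alternative for all folders you want to keep. The file-name alternative has to come before the folder alternative: the engine takes the first alternative that matches, so otherwise the folder alternative would also match /MyClass.as and the file name would never be removed. This time I put the slash to be replaced optionally at the beginning. With a conditional replacement we can write a . there if and only if that slash was matched: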
(?1.:)$2
Unfortunately, I can't test this right now and I don't know an online tester that uses Boost's regex engine. But this should do the trick just fine.
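Since the Boost variant could not be tested, here is a rough Python emulation of the same idea, as a sketch only. Python's re has neither \G nor conditional format strings, so the anchors are dropped and a callback plays the role of (?1.:)$2. The file-name alternative is listed before the folder alternative here, because a backtracking engine takes the first alternative that matches at a position. The example path is hypothetical.

```python
import re

CONDITIONAL_RE = re.compile(
    r"^.+/src/"       # prefix up to the last src/: removed
    r"|/[^/]+$"       # final slash and file name: removed
    r"|(/)?([^/]+)"   # a folder, optionally preceded by a slash
)


def conditional_replacement(m: re.Match) -> str:
    # Emulates "(?1.:)$2": a dot only if the slash (group 1) was matched.
    dot = "." if m.group(1) else ""
    return dot + (m.group(2) or "")


def package_name_conditional(path: str) -> str:
    return CONDITIONAL_RE.sub(conditional_replacement, path)


# Hypothetical path, for illustration only.
print(package_name_conditional(
    "/Users/caubry/d/project/src/com/example/folder/MyClass.as"
))  # com.example.folder
```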

Related

How to exclude list of folders from Mercurial/TortoiseHG's .hgignore file?

Ok. I need to ignore a list of files from the version control, except for files in three certain folders (let's call them Folder1, Folder2 and Folder3). I could list all the folders I need to ignore as a plain list, but I don't consider that an elegant way, so I wrote the following regex:
.*/(Bin|bin)/(?!Folder1/|Folder2/|Folder3/).*
My thoughts were as follows, from left to right:
.* - Any number of any characters.
/ - Slash symbol, which separates folders from one another.
(Bin|bin) - Folder with "Bin" or "bin" name.
/ - Slash symbol, which separates folders from one another.
(?!Folder1/|Folder2/|Folder3/) - Folder name is not "Folder1/" and is not "Folder2/" and is not "Folder3/". This part was the most complicated, but I googled it somehow. I don't understand why it should work, but it works during the tests.
.* - Any number of any characters.
This expression works perfectly when I test it at regex101.com with a couple of text strings, representing paths to files, but nothing works when I put it in my .hgignore file, as follows:
syntax: regexp
.*/(Bin|bin)/(?!Folder1/|Folder2/|Folder3/).*
For some reason it ignores all files and sub-folders in all "Bin" and "bin" folders. How can I accomplish my task?
P.S. As far as I know, Mercurial/TortoiseHG uses Python/Perl regular expressions.
Many thanks in advance.
To adjust the question a bit to make it clearer (at least to me): we have any number of /bin/somename/... and .../bin/anothername/... names that should be ignored, along with three sets of names, .../bin/folder1/..., .../bin/2folder/..., and .../Bin/third/..., that should not be ignored.
Hence, we want a regular expression that (without anchoring) will match the names-to-be-ignored but not the ones-to-be-kept. (Furthermore, glob matching won't work, since it's not as powerful: we'll either match too little or too much, and Mercurial lacks the "override with later un-ignore" feature of Git.)
The shortest regular expression for this should be:
/[Bb]in/(?!(folder1|2folder|third)/)
(The part of this regex that actually matches a string like /bin/somename/... is only the /bin/ part, but Mercurial does not look at what matched, only whether something matched.)
The thing is, your example regular expression should work, it's just a longer variant of this same thing with not-required but harmless (except for performance) .* added at the front and back. So if yours isn't working, the above probably won't work either. A sample repository, with some dummy files, that one could clone and experiment with, would help diagnose the issue.
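The regex itself can be checked in isolation with Python's re module. The sketch below only tests the pattern with an unanchored search; it does not reproduce how Mercurial walks the working directory or applies .hgignore, and the paths are made up for illustration.

```python
import re

IGNORE_RE = re.compile(r"/[Bb]in/(?!(folder1|2folder|third)/)")


def is_ignored(path: str) -> bool:
    # Unanchored search: we only care whether something matched.
    return IGNORE_RE.search(path) is not None


# Hypothetical paths, for illustration only.
for path in [
    "proj/bin/somename/a.dll",   # ignored
    "proj/bin/folder1/a.dll",    # kept
    "proj/Bin/third/a.dll",      # kept
    "proj/bin/folder1",          # ignored: no trailing slash after folder1
]:
    print(path, is_ignored(path))
```

Note the last case: a name that ends in /bin/folder1, with no slash after it, still matches, because the lookahead requires the trailing component separator.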
Original (wrong) answer (to something that's not the question)
The shortest regular expression for the desired case is:
/[Bb]in/Folder[123]/
However, if the directory / folder names do not actually meet this kind of pattern, we need:
/[Bb]in/(somedir|another|third)/
Explanation
First, a side note: the default syntax is regexp, so the initial syntax: regexp line is unnecessary. It's also possible that your .hgignore file is not in proper UTF-8 format: see Mercurial gives "invalid pattern" error for simple GLOB syntax. (But that would produce different behavior, so that's probably not the problem. It's just worth mentioning in any answer about .hgignore files malfunctioning.)
Next, it's worth noting a few items:
Mercurial tracks only files, not directories / folders. So the real question is whether any given file name matches the pattern(s) listed in .hgignore. If it does match, and the file is currently untracked, the file will not be automatically added with a sweeping "add everything" operation, and Mercurial will not gripe that the file is untracked.
If some file is already tracked, the fact that its name matches an ignore pattern is irrelevant. If the file a/b/c.ext is not tracked and does match a pattern, hg add a/b/c.ext will add it anyway, while hg add a/b will en-masse add everything in a/b but won't add c.ext because it matches the pattern. So it's important to know whether the file is already tracked, and consider what you explicitly list to hg add. See also How to check which files are being ignored because of .hgignore?, for instance.
Glob patterns are much easier to write correctly than regular expressions. Unless you're doing this for learning or teaching purposes, or glob is just not powerful enough, stick with the glob patterns. (In very old versions of Mercurial, glob matching was noticeably slower than regexp matching, but that's been fixed for a long time.)
Mercurial's regexp ignore entries are not automatically anchored: if you want anchored behavior, use ^ at the front, and $ at the end, as desired. Here, you don't want anchored behavior, so you can eliminate the leading and trailing .*. (Mercurial refers to this as rooted rather than anchored, and it's important to note that some patterns are anchored, but .hgignore ones are not.)
Python/Perl regexp (?!...) syntax is the negative lookahead syntax: (?!...) succeeds if the parenthesized expression doesn't match at the current position in the string. This is part of the problem.
We need not worry about capturing groups (see capturing group in regex) as Mercurial does nothing with the groups that come out of the regular expression. It only cares if we match.
Path names are really slash-separated components. The leading components are the various directories (folders) above the file name, and the final component is the file name. (That is, try not to think of the first parts as folders: it's not that it's wrong, it's that it's less general than "components", since the last part is also a component.)
What we want, in this case, is to match, and therefore "ignore", names that have one component that matches either bin or Bin followed immediately by another component that matches Folder1, Folder2, or Folder3 that is followed by a component-separator (so that we haven't stopped at /bin/Folder1, for instance, which is a file named Folder1 in directory /bin).
The strings bin and Bin both end with a common trailing part of in, so this is recognizable as (B|b)in, but single-character alternation is more easily expressed as a character class: [Bb], which eliminates the need for parentheses and vertical-bars.
The same holds for the names Folder1, Folder2, and Folder3, except that their common string leads rather than trails, so we can use Folder[123].
Suppose we had anchored matches. That is, suppose Mercurial demanded that we match the whole file name, which might be, say, /foo/hello/bin/Folder2/bar/world.ext. Then we'd need .*/[Bb]in/Folder[123]/.*, because we'd need to match any number of characters to skip over /foo/hello before matching /bin/Folder2/, and again skip over any number of characters to match bar/world.ext, in order to match the whole string. But since we don't have anchored matches, we'll find the pattern /bin/Folder2/ within the whole string, and hence ignore this file, using the simpler pattern without the leading and trailing .*.
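The difference between the two styles can be seen with Python's re module, where search is unanchored and fullmatch demands that the whole string match. This is a small illustration of the regex behaviour only, using the file name from the paragraph above; it is not a model of Mercurial's internals.

```python
import re

UNANCHORED = re.compile(r"/[Bb]in/Folder[123]/")
ANCHORED = re.compile(r".*/[Bb]in/Folder[123]/.*")

name = "/foo/hello/bin/Folder2/bar/world.ext"

# Unanchored: the pattern is found somewhere inside the name.
print(bool(UNANCHORED.search(name)))     # True
# Whole-string matching fails without the .* padding ...
print(bool(UNANCHORED.fullmatch(name)))  # False
# ... and succeeds with it.
print(bool(ANCHORED.fullmatch(name)))    # True
```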

How to tell Regex that the searched match must be at the beginning or just after tabs?

I'm fairly new to regex, and I can't seem to get this working:
I did some changes in one of my projects and thus I had to change variable names in multiple files. The variables were named as their class (variable lower case, Class uppercase; I know this was not good practice :D) and this confused me, so I replaced them with getters.
I want the regex to find every variable that:
was at the beginning of the line or just had whitespaces before it
was lowercase at the beginning
had a dot after it (because a property of it was used)
For example: song.data should turn to getSong().data while Song.data or this.song or even this.song.data should have stayed the same.
So far I got this regex to work: /^(song)/mg.
My problem now is that most of my lines begin with whitespace (tabs) because they are in function bodies, and I can't find a regex which accepts tabs at the beginning but doesn't delete them while replacing. I hope this makes sense to some of you ^^
PS: I already replaced all the names by hand, but now I'm curious to find out how it WOULD HAVE worked with regex
You can use a positive lookbehind or lookahead to require that a pattern precedes or follows a match without including it in the match:
(?<=PATTERN)MATCH
Example
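As a concrete sketch of the song case in Python: Python's re only accepts fixed-width lookbehinds, so a lookbehind for "any amount of leading whitespace" is not available there. The sketch below therefore swaps in a different technique: it captures the leading tabs or spaces and writes them back, and uses a lookahead for the dot. The function name use_getter is made up for illustration.

```python
import re

# Leading tabs/spaces are captured (group 1) and written back unchanged;
# the lookahead (?=\.) requires a dot after "song" without consuming it.
SONG_RE = re.compile(r"^([ \t]*)song(?=\.)", re.MULTILINE)


def use_getter(source: str) -> str:
    return SONG_RE.sub(r"\1getSong()", source)


print(use_getter("\t\tsong.data"))    # tabs kept, song becomes getSong()
print(use_getter("this.song.data"))   # unchanged
print(use_getter("Song.data"))        # unchanged
```

In engines that allow variable-width lookbehind, the capture group could be replaced by a lookbehind as described above.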

Regex taking too many characters

I need some help with building up my regex.
What I am trying to do is match a specific part of text with unpredictable parts in between the fixed words. An example is the sentence one gets when replying to an email:
On date at time person name has written:
The italicized parts are variable: they might contain spaces, or a new line might start at that point.
To get this, I built up my regex as such: On[\s\S]+?at[\s\S]+?person[\s\S]+?has written:
Basically, the [\s\S]+? is supposed to fill in any letter, number, space or break/new line, as I am unable to predict what could be between the fixed words that I am sure will always be there.
Now comes the hard part, when I would add the word "On" somewhere in the text above the sentence that I want to match, the regex now matches a much bigger text than I want. This is due to the use of [\s\S]+.
How can I make my regex match as few characters as possible? Using "?" after the "+" to make it lazy does not help.
Example is here with words "From - This - Point - Everything:". Cases are ignored.
Correct: https://regexr.com/3jdek.
Wrong because of added "From": https://regexr.com/3jdfc
The regex is to be used in VB.NET
A more real-life example, with HTML tags, can be found here. Here, I avoided using [\s\S]+? or (.+)?(\r)?(\n)?(.+?)
Correct: https://regexr.com/3jdd1
Wrong: https://regexr.com/3jdfu after adding certain parts of the regex in the text above. Although this is barely possible in HTML, as the user would never write the matching tag himself, I do want to make sure my regex is correct just in case.
These things are certain: I know what the part of text starts with, no matter where it sits in the entire text; I know what the part of text ends with; and there are specific fixed words that might make the regex more reliable, but they can be omitted. Any text below the searched part is also allowed to be matched, but no text above may be matched at all.
Another example where it goes wrong: https://regexr.com/3jdli. Basically, I have less to go with in this text, so the regex has less tokens to work with. Adding just the first < already makes the regex take too much.
From my own experience, most problems are avoided when making sure I do not use any [\s\S]+? before I did a (\r)?(\n)? first
[\s\S] matches any character, because it is the union of two complementary sets; it behaves like . with the /s option (dot matches newlines). Quantifiers are greedy by default, so the largest possible match is returned.
Following the "correct" link, the token just after the shortest match must be geschreven. Another way to write this without relying on lazy quantifiers, and one that is more flexible, is to put a negative lookahead in front of the repeated character set, inside the loop,
so
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft(.+?(?=geschreven))geschreven:
becomes
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft((?:(?!geschreven).)+)geschreven:
(?: ) is a non-capturing group which just encapsulates the negative lookahead and the . (which can be replaced by [\s\S])
(?! ) inside is the negative lookahead, which ensures that the current position, before the next character, is not the beginning of the end token.
Following the comments, what should not appear in the repeated sequence can be stated explicitly:
From(?:(?!this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!this|point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
To understand what the technique (?:(?!tokens)[\s\S])+ does:
in the first, this can't appear between From and this
in the second, From or this can't appear between From and this
in the third, additionally this or point can't appear between this and point
etc.
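A small Python sketch makes the difference visible. The sample text is invented for illustration, and matching is case-sensitive here (the question ignores case, which would be re.IGNORECASE in Python). The lazy pattern starts at the first From it can find, while the third variant above refuses to step over a second From and so matches only the intended sentence.

```python
import re

LAZY = re.compile(r"From[\s\S]+?this[\s\S]+?point[\s\S]+?everything:")
TEMPERED = re.compile(
    r"From(?:(?!From|this)[\s\S])+this"
    r"(?:(?!this|point)[\s\S])+point"
    r"(?:(?!everything)[\s\S])+everything:"
)

# Hypothetical text: an extra "From" appears above the target sentence.
text = "From the archive:\nFrom Monday this week point taken everything:"

print(repr(LAZY.search(text).group(0)))      # starts at the first From
print(repr(TEMPERED.search(text).group(0)))  # starts at the second From
```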

Vim S&R to remove number from end of InstallShield file

I've got a practical application for a vim regex where I'd like to remove numbers from the end of file location links. For example, if the developer is sloppy and just adds files and doesn't reuse file locations, you'll end up with something awful like this:
PATH_TO_MY_FILES&gt
PATH_TO_MY_FILES1&gt
...
PATH_TO_MY_FILES22&gt
PATH_TO_MY_FILES_ELSEWHERE&gt
PATH_TO_MY_FILES_ELSEWHERE1&gt
...
So all I want to do is to S&R and replace PATH_TO_MY_FILES*\d+ with PATH_TO_MY_FILES* using regex. Obviously I am not doing it quite right, so I was hoping someone here could not spoon feed the answer necessarily, but throw a regex buzzword my way to get me on track.
Here's what I have tried:
:%s\(PATH_TO_MY_FILES\w*\)\(\d+\)&gt:gc
But this doesn't work, i.e. if I just do a vim search on that, it doesn't find anything. However, if I use this:
:%s\(PATH_TO_MY_FILES\w*\)\(\d\)&gt:gc
It will match the string, but the grouping is off, as expected. For example, the string PATH_TO_MY_FILES22 will be grouped as (PATH_TO_MY_FILES2)(2), presumably because the \d only matches the 2, and the \w match includes the first 2.
Question 1: Why doesn't \d+ work?
If I go ahead and use the second string (which is wrong), Vim appears to find a match (even though the grouping is wrong), but then does the replacement incorrectly.
For example, given that we know the \d will only match the last number in the string, I would expect PATH_TO_MY_FILES22&gt to get replaced with PATH_TO_MY_FILES2&gt. However, instead it replaces it with this:
PATH_TO_MY_FILES2PATH_TO_MY_FILES22&gtgt
So basically, it looks like it finds PATH_TO_MY_FILES22&gt, but then replaces only the & with group 1, which is PATH_TO_MY_FILES2.
I tried another regex at Regexr.com to see how it would interpret my grouping, and it looked correct, but maybe a hack around my lack of regex understanding:
(PATH_TO_\D*)(\d*)&gt
This correctly broke my target string into the PATH part and the entire number, so I was happy. But then when I used this in Vim, it found the match, but still replaced only the &.
Question 2: Why is Vim only replacing the &?
Answer 1:
You need to escape the + or it will be taken literally. For example \d\+ works correctly.
Answer 2:
An unescaped & in the replacement portion of a substitution means "the entire matched text". You need to escape it if you want a literal ampersand.
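The grouping issue from the question (the greedy \w* swallowing all but the last digit) is independent of Vim and can be reproduced in Python. This is only an illustration of the regex logic, not Vim syntax: in Python the + needs no escaping and & is literal in the replacement, which are exactly the two Vim-specific points above. Making \w* lazy is one possible way to let \d+ take all trailing digits; the function name strip_numbers is made up.

```python
import re

GREEDY = re.compile(r"(PATH_TO_MY_FILES\w*)(\d+)&gt")
LAZY = re.compile(r"(PATH_TO_MY_FILES\w*?)\d+&gt")

# Greedy \w* keeps the first 2, leaving only the last digit for \d+.
print(GREEDY.match("PATH_TO_MY_FILES22&gt").groups())


def strip_numbers(text: str) -> str:
    # Lazy \w*? stops as early as possible, so \d+ gets every trailing digit.
    return LAZY.sub(r"\1&gt", text)


sample = "\n".join([
    "PATH_TO_MY_FILES&gt",
    "PATH_TO_MY_FILES1&gt",
    "PATH_TO_MY_FILES22&gt",
    "PATH_TO_MY_FILES_ELSEWHERE1&gt",
])
print(strip_numbers(sample))
```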

replacing _ to - with sed , but only within href-attribute

I would like to replace in text-fragments like:
<strong>Media Event "New Treatment Options on November 4–5, 2010, in Paris, France<br /></strong>>> more
all underscores with dashes, but only in the href attribute. As there are hundreds of files, the best approach is to work on these files with sed or a small shell script.
I started with
\shref=\"([^_].+?)([_].+?)\"
but this matches only one _, and I don't know how many _ there will be. I'm stuck on how to dynamically replace the underscores across an unknown number of back-references.
A tool that's specifically geared toward working with HTML is by far preferable since trying to work with it using regexes can lead to madness.
However, assuming that there's only one href per line, you might be able to use this divide-and-conquer technique:
sed 's/\(.*href="\)\([^"]*\)\(".*\)/\1\n\2\n\3/;:a;s/\(\n.*\)_\(.*\n\)/\1-\2/;ta;s/\n//g' inputfile
Explanation:
s/\(.*href="\)\([^"]*\)\(".*\)/\1\n\2\n\3/ - put newlines around the contents of the href
:a;s/\(\n[^\n]*\)_\([^\n]*\n\)/\1-\2/;ta - replace the underscores one-by-one in the text between the newlines, t branches to label :a if a substitution was made
s/\n//g - remove the newlines added in the first step
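The same divide-and-conquer idea can be sketched in Python, where a replacement callback removes the need for the loop: the regex isolates the href value and plain string replacement handles the underscores inside it. This is a sketch under the assumption that hrefs are double-quoted; the function name and the sample markup are made up for illustration.

```python
import re

# Group 1: the attribute opener, group 2: the value, group 3: the closing quote.
HREF_RE = re.compile(r'(href=")([^"]*)(")')


def dash_hrefs(html: str) -> str:
    return HREF_RE.sub(
        lambda m: m.group(1) + m.group(2).replace("_", "-") + m.group(3),
        html,
    )


# Hypothetical fragment, for illustration only.
print(dash_hrefs('<a href="media_event_2010.html">more_info</a>'))
```

Unlike the sed version, this does not assume a single href per line, but it is still pattern matching on HTML rather than real parsing.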
Regular expressions are simply fundamentally the wrong tool for this job. There is too much context that must be matched.
Instead, you'll need to write something that goes character-by-character, with two modes: one in which it just copies all input, and one in which it replaces underscore with dash. On finding the start of an href it enters the second mode, on leaving an href it returns to the first. This is essentially a limited form of a tokenizer.
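A minimal sketch of such a two-mode scanner in Python might look like the following. It assumes hrefs are written exactly as href="..." with double quotes, which real-world HTML does not guarantee; the function name and the sample markup are made up for illustration.

```python
def dash_hrefs_scanner(html: str) -> str:
    marker = 'href="'
    out = []
    in_href = False  # False: copy mode, True: replace-underscore mode
    i = 0
    while i < len(html):
        if not in_href and html.startswith(marker, i):
            # Start of an href value: copy the marker and switch modes.
            out.append(marker)
            i += len(marker)
            in_href = True
            continue
        ch = html[i]
        if in_href:
            if ch == '"':
                in_href = False  # closing quote: back to copy mode
            elif ch == "_":
                ch = "-"
        out.append(ch)
        i += 1
    return "".join(out)


# Hypothetical fragment, for illustration only.
print(dash_hrefs_scanner('<a href="my_page_one.html" title="keep_me">x_y</a>'))
```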