regex to ignore lines with specific symbol - regex

I'm trying to write a regex that changes relative paths to absolute ones using an alias, while ignoring any lines where the alias is already indicated with the # symbol, to prevent folder/file name matches. I've got as far as the match and replace, but I can't work out how to ignore the lines containing #. I would also like it to match the forward slashes on either side of /foldername/ when selecting.
https://regex101.com/r/vRUegE/1/
I would have expected the lines with # to be ignored
Here is the correct response thanks to Wiktor:
Working example
Using a combination of WinGrep and these regexes, it was easy to refactor hundreds of paths in hundreds of files in minutes!

You may add a negative lookahead (?!#) after the positive lookbehind:
(?<=from..)(?!#)(.*)(?=module)(.*)(module)
           ^^^^^
See this regex demo. The (?!#) will fail the match once # is found right after from and any 2 chars immediately to the right of it.
Note that the regex might need further adjusting as (?=module) does not make much sense here. You might as well use (?<=from..)(?!#)(.*)(module).
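As a rough check of the simplified pattern, here is a minimal Python sketch. The two sample lines and the "#module" alias are assumptions made up for illustration; they are not taken from the regex101 demo.

```python
import re

# Simplified pattern from the answer: "from" plus any two characters,
# then anything up to "module", unless the path already starts with "#".
pattern = re.compile(r"(?<=from..)(?!#)(.*)(module)")

# Hypothetical sample lines; the "#module" alias is assumed for illustration.
lines = [
    "import a from '../../module/a'",
    "import b from '#module/b'",
]

# Replace the relative prefix plus "module" with the alias form "#module".
rewritten = [pattern.sub(r"#\2", line) for line in lines]
for line in rewritten:
    print(line)
```

The second line is left untouched because (?!#) fails right after from and the two characters that follow it.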

It is difficult to say which language you are using (it looks like Java), and regex syntax differs slightly between languages.
(?<=from..)(.*)(?=module)(.*)(module)/g
Here (.*) matches any run of characters. You need to change it to
([^\#]*) (or ([^#]*))
i.e. any run of characters other than #.

Related

Can I improve simplicity using negative lookahead to find the last folder in a file path?

I’m trying to find a simpler solution to locating the last folder path in a file list that does not contain a file of type, but must use lookarounds. Can anyone explain some improvements in my regex code that follows?
Search text:
c:\this\folder\goes\findme.txt
c:\this\folder\cant\findme.doc
c:\this\folder\surecanfind.txt
c:\\anothertest.rtf
c:\t.txt
RegEx:
(?<=\\)[^\\\n\r]+?(?=\\[^\\]*\.)(?!.*\.doc)
Expected result:
‘goes’
‘folder’
Can the RegEx lookahead be improved and simplified? Thanks for the help.
In your original regex:
(?<=\\)[^\\\n\r]+?(?=\\[^\\]*\.)(?!.*\.doc)
there isn't really much to improve in terms of the use of lookarounds.
The positive lookbehind is necessary to tell the regex where it is allowed to begin a match.
The positive lookahead is necessary to terminate the expansion of the +? quantifier.
And the negative lookahead is needed to reject invalid matches.
You might be able to condense both lookaheads into one. But keeping them separate is more efficient, since if the evaluation of one fails, the engine can skip the evaluation of the second.
However, if you're looking for a more efficient/"normal" regex, I would typically use something like:
^.*\\(.+?)\\[^\\]+\.(?!doc).+$
Instead of using lookarounds to exclude everything except my desired output from a match, I'd include my desired output in a capture group.
This allows me to tell the regex to check for a match only once per line, instead of after every \ character.
Then, to get my desired output, all I have to do is grab the content of capture group 1 from each match.
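For illustration, a minimal Python sketch of the capture-group approach, run against the search text from the question. Python is only an assumption here, since the question does not say which engine is in use.

```python
import re

# Search text from the question, one path per entry.
paths = [
    r"c:\this\folder\goes\findme.txt",
    r"c:\this\folder\cant\findme.doc",
    r"c:\this\folder\surecanfind.txt",
    r"c:\\anothertest.rtf",
    r"c:\t.txt",
]

# Capture-group version: the last folder name ends up in group 1.
capture = re.compile(r"^.*\\(.+?)\\[^\\]+\.(?!doc).+$")

folders = []
for path in paths:
    m = capture.match(path)
    if m:
        folders.append(m.group(1))

print(folders)
```

Each path is tested once, and only group 1 is read from the matches.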
working example
original (98,150 steps)
Capture Groups (66,586 steps)
Hopefully that'll help you out

Regex to find two words on the page

I'm trying to find all pages which contain words "text1" and "text2".
My regex:
text1(.|\n)*text2
It doesn't work.
If your IDE supports the s (single-line) flag (so the . character can match newlines), you can search for your items with:
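(text1).*(text2)|(text2).*(text1)
Example with s flag
If the IDE does not support the s flag, you will need to use [\s\S] in place of .:
(text1)[\s\S]*(text2)|(text2)[\s\S]*(text1)
Example with [\s\S]
Note that the reversed order has to be spelled out. A backreference such as \2.*\1 in the second alternative would refer to groups that took no part in the match, so it cannot find text2 followed by text1. In replacement strings, some tools use $1 and $2 in place of \1 and \2, so you may need to change that.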
EDIT:
Alternately, if you want to simply match that a file contains both strings (but not actually select anything), you can utilize look-aheads:
(?s)^(?=.*?text1)(?=.*?text2)
This doesn't care about the order (or number) of the arguments, and for each additional text that you want to search for, you simply append another (?=.*?text_here). This approach is nice, since you can even include regex instead of just plain strings.
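A small Python sketch of the lookahead approach. The three sample "pages" are made up for illustration, and re.DOTALL plays the role of the s flag.

```python
import re

# Both lookaheads start from the beginning of the text, so order is irrelevant.
both = re.compile(r"^(?=.*?text1)(?=.*?text2)", re.DOTALL)

# Hypothetical page contents.
pages = {
    "in_order": "intro text1\nmiddle\ntext2 end",
    "reversed": "text2 first\nthen text1",
    "only_one": "text1 appears\nbut nothing else",
}

results = {name: bool(both.search(body)) for name, body in pages.items()}
print(results)
```

The match itself is zero-length; it only tells you whether the page contains both strings.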
text0[\s\S]*text1
Try this. It should do it for you.
It matches everything between the two words, including across multiple lines, similar to using .* with the s flag.
\s takes care of spaces, newlines and tabs.
\S takes care of any non-space character.
If you want the regex to match over several lines I would try:
text1[\w\W]*text2
Using . is not a good choice, because it usually doesn't match over multiple lines. Also, for matching single characters I think using square brackets is more idiomatic than using ( ... | ... )
If you want the match to be order-independent then use this:
(?:text1[\w\W]*text2)|(?:text2[\w\W]*text1)
Adding a response for IntelliJ
Building on @OnlineCop's answer, to swap the order of two expressions in IntelliJ, you would style the search as in the accepted response, but since IntelliJ doesn't allow a one-line version, you have to put the replace statement in a separate field. Also, IntelliJ uses $ to identify expressions instead of \.
For example, I tend to put my nulls at the end of my comparisons, but some people prefer it otherwise. So, to keep things consistent at work, I used this regex pattern to swap the order of my comparisons:
Notice that IntelliJ shows in a tooltip what the result of the replacement will be.
For me works text1*{0,}(text2){0,}.
With {0,} you can decide to get your keyword zero or more times OR you set {1,x} to get your keyword 1 or x-times (how often you want).

What mistake did I do for this unexpected negative lookahead subpattern?

I am working with a .tsv database whose headers are full of information that is meaningful to me.
I wanted to extract that information from each header into something that I and other users (who are not proficient with relational databases, so we mostly end up using Excel to organize and process data) could handle more easily in Excel, by splitting the header into tab-separated fields.
Example header:
>(name1)database-ID:database2-ID:value1:value2
(I know it seems strange to put values in a header, but they describe parameters of the third value associated with the header, which doesn't matter here.)
output as:
name1\tdatabase-ID\tdatabase2-ID\tvalue1\tvalue2\n
I then pasted my data (headers, one per line) into EmEditor (Boost syntax) and came up with this regex:
>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n
with the capturing groups then written back separated by tabs. It works, with perfect matches, no problem.
But I became aware that there were malformed lines that didn't respect the logic of the whole database, and I wanted to write an expression to pick them all out at once.
With some wrong lines included, the data would look like:
>(name1)database-ID:database2-ID:value1-1:value1-2\n
>(name2)database-ID:database2-ID:value2-1:value2-2\n
>(name3)database-ID:database2-ID:value3-1value3-2\n
The last line is ill-formed because it lacks the : between the last two values.
I want to match it by building on the original expression that recognizes well-formed lines.
I know perfectly well that I could come up with different solutions by slightly tweaking my first expression to eliminate the good lines and retrieve the malformed ones afterwards, but
I don't want a solution to my process. I just want to understand what I did wrong here, so that I become more educated (and not just better at working around mistakes that I can't resolve):
I tried a negation of the above-mentioned expression:
([^(>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n)])
That doesn't match anything.
I tried a negative lookahead, but it is extremely, painfully slow, and then it matches every possible zero-length position in the document:
(?!(^>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))
I then added a capturing group for a string of characters after it,
but that doesn't work either:
(?!(^>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))(^.*?)
So please explain where I went wrong with the negating group ([^whatever]) and with the use of the negative lookahead.
Let's address the question first: What does [^(pattern)] do?
You seem to have a misunderstanding and expect it to:
Match everything except the subpattern pattern. (Negation)
What it actually does is:
Match any single character that isn't one of (, p, a, t, ... n, ).
Therefore, the pattern
([^(>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n)])
... matches a single character that isn't one of (, >, (, ... \n, ).
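The character-class behaviour is easy to see on a tiny input. This is a minimal Python sketch with a made-up class and sample strings, not the pattern from the question.

```python
import re

# [^(abc)] is a negated character class: it matches ONE character that is
# not "(", "a", "b", "c" or ")". It does not mean "anything except abc".
negated_class = re.compile(r"[^(abc)]")

first = negated_class.findall("a(b)cd")   # every character but "d" is excluded
second = negated_class.findall("xabcx")   # the two "x" match one at a time

print(first, second)
```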
As for the negative lookahead, you're simply doing it wrong. The anchor ^ is in the wrong position, therefore your assertion will fail to provide any useful help. It's also not what negative lookaheads are for altogether.
(?!(^>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))
I'll explain what this does:
(?! Opens the negative lookahead group: asserts that the text at this position does not match the pattern, without moving the pointer position.
( Capturing group. Capturing groups inside a negative lookahead are useless: the assertion only succeeds when the subpattern fails to match, so nothing is ever captured.
^ Asserts position at the start of the string (or of a line, in multi-line mode).
>\( Literal character sequence ">(".
(.*) Capturing group which matches as many characters as possible except newlines, then backtracks as needed.
\) Literal character ")".
(.*?) Capturing group with a reluctant zero-or-more match of any characters except newlines.
\: Literal character ":".
(.*?)\:(.*?)\:(.*?) The same reluctant groups and literal colons, repeated.
\n A new line.
) Closes the capturing group.
) Closes the negative lookahead group. When this assertion is finished, the pointer position is the same as at the beginning, and thus the resulting match is zero-length.
Note that the anchor is nested within the negative lookahead group. It should be at the start:
^(?!(>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))
While this doesn't return anything useful, it explains what is wrong, since you don't need a solution. ;)
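To see the effect of the anchor position, here is a rough Python sketch. It assumes Python's re module with MULTILINE rather than EmEditor's Boost engine, and the two shortened sample headers are made up.

```python
import re

# Pattern for a well-formed header, as written in the question.
WELL_FORMED = r">\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n"

# Anchor inside the lookahead (the question's version) vs. outside (the fix).
inside = re.compile(r"(?!(^" + WELL_FORMED + r"))", re.MULTILINE)
outside = re.compile(r"^(?!(" + WELL_FORMED + r"))", re.MULTILINE)

# Hypothetical data: line 1 is well formed, line 2 lacks one colon.
text = ">(name1)a:b:c:d\n>(name3)a:b:cd\n"
second_line_start = text.index("\n") + 1

inside_hits = [m.start() for m in inside.finditer(text)]
outside_hits = [m.start() for m in outside.finditer(text)]

# With the anchor inside, almost every offset yields a zero-length match,
# because "^" fails in the middle of a line and the negation then succeeds.
print(len(inside_hits), "zero-length matches with the anchor inside")
# With the anchor outside, only line starts can be reported.
print(outside_hits)
```

With the anchor inside the group, only the start of the well-formed line is rejected, which is why the original pattern is slow and matches nearly everywhere.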
In case you are in need of a solution suddenly, please refer to this relevant answer of mine (I'm not adding anything else into the post):
Rails 3 - Precompiling all css, sass and scss files in a folder
You could do this simply with the PCRE verbs (*SKIP)(*F). The regex below matches all the bad lines.
(?:^>\([^()]*\):[^:]*:[^:]*:[^:]*:[^:\n]*$)(*SKIP)(*F)|^.+
DEMO
Based on what I have read from Unihedron, this is what I came up with in EmEditor:
^(?!>\(([A-Za-z0-9_\'\-]*?)\)(([A-Za-z0-9_\'\-]*?)\:){3}([A-Za-z0-9_\'\-]*?)\n).*\n
>(name1)database-ID:database2-ID:value1-1:value1-2
(NOT MATCH)
>(name2)database-ID:database2-ID:value2-1:value2-2
(NOT MATCH)
>(name3)database-ID:database2-ID:value3-1value3-2
(MATCH)
>(name3)database-ID::database2-ID:value3-1:value3-2
(MATCH)
(The character class keeps names that include special characters from being discarded, without allowing two consecutive ":".)
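The same check can be reproduced outside EmEditor. This is a minimal sketch assuming Python's re module with MULTILINE, using the four sample headers listed above.

```python
import re

# Pattern from the post: match a whole line only when it is NOT a
# well-formed header (name in parentheses, then four colon-separated fields).
malformed = re.compile(
    r"^(?!>\(([A-Za-z0-9_\'\-]*?)\)(([A-Za-z0-9_\'\-]*?)\:){3}([A-Za-z0-9_\'\-]*?)\n).*\n",
    re.MULTILINE,
)

text = (
    ">(name1)database-ID:database2-ID:value1-1:value1-2\n"
    ">(name2)database-ID:database2-ID:value2-1:value2-2\n"
    ">(name3)database-ID:database2-ID:value3-1value3-2\n"
    ">(name3)database-ID::database2-ID:value3-1:value3-2\n"
)

# group(0) is the whole offending line; strip the trailing newline for display.
bad_lines = [m.group(0).rstrip("\n") for m in malformed.finditer(text)]
for line in bad_lines:
    print(line)
```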
I also could achieve the same results with:
(?!^>\(([A-Za-z0-9_\'\-]*?)\)(([A-Za-z0-9_\'\-]*?)\:){3}([A-Za-z0-9_\'\-]*?)\n)^.*\n
So I guess that all along capturing groups were what was messing with my lookahead.
Now I acknowledge that Avinash Raj's (*SKIP)(*F)|^.+ pattern is more efficient; I just didn't know about those verbs, and I also wanted to understand my logic/syntax mistake. (Thanks to Unihedron for that.)

.hgignore a folder except some subfolders

I want to ignore a folder but preserve some of its subfolders.
I tried regexp matching like this
syntax: regexp
^site/customer/\b(?!.*/data/.*).*
Unfortunately this doesn't work.
I read in this answer that python only does fixed-width negative lookups.
Is my desired ignoring impossible?
Python regex is cool
Python does support negative lookaheads of arbitrary length, e.g. (?!.*foo). But it doesn't support variable-length lookbehinds such as (?<!foo.*); a lookbehind has to be fixed-width, e.g. (?<!foo..).
Which means it's definitely possible to solve your problem.
The problem
You've got the following regex: /customer/(?!.*/data/.*).*.
Let's take an input example /customer/data/name. It matches for a reason.
/customer/data/name
^^^^^^^^^^ -> /customer/ match !
          ^ (?!.*/data/.*) Let's check if there is no /data/ ahead
            The problem is here, we've already matched "/"
            so the regex only finds "data/name" instead of "/data/name"
          ^^^^^^^^^ .* match !
Fixing your regex
Basically, we just need to remove that one forward slash. We also add an anchor ^ to make sure the match starts at the beginning of the string, and use \b to make sure we match exactly customer: ^/customer\b(?!.*/data/).*.
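A quick way to sanity-check the fixed regex is to run it in Python. This is a sketch only: the sample paths are made up, and it tests the pattern on its own rather than through Mercurial.

```python
import re

# Fixed pattern from above: ignore everything under /customer unless the
# rest of the path contains /data/.
ignore = re.compile(r"^/customer\b(?!.*/data/).*")

# Hypothetical paths for illustration.
paths = [
    "/customer/data/name",   # kept: /data/ directly under customer
    "/customer/cache/tmp",   # ignored
    "/customer/a/data/b",    # kept: /data/ further down
    "/customers/x",          # not matched: \b rejects "customers"
]

ignored = [p for p in paths if ignore.search(p)]
print(ignored)
```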
Online demo

Adding "/index.html" to paths in Vim

I'm trying to append "/index.html" to some folder paths in a list like this:
path/one/
/another/index.html
other/file/index.html
path/number/two
this/is/the/third/path/
path/five
sixth/path/goes/here/
Obviously the text only needs to be added where it does not exist yet. I could achieve some good results with (vim command):
:%s/^\([^.]*\)$/\1\/index.html/
The only problem is that after running this command, some lines like the 1st, 5th and 7th in the previous example end up with duplicated slashes. That's easy to solve too: all I have to do is search for duplicates and replace them with a single slash.
But the question is:
Isn't there a better way to achieve the correct result at once?
I'm a Vim beginner, and not a regex master either. Any tips are really appreciated!
Thanks!
So very close :)
Just add an optional slash to the end of the regex:
\/\?
Then you need to change the rest of the pattern to a non-greedy match so that it ignores a trailing slash. The syntax for a non-greedy match in vim (replacing the *) is:
\{-}
So we end up with:
:%s/^\([^\.]\{-}\)\/\?$/\1\/index.html/
(Doesn't hurt to be safe and escape the period.)
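For anyone who wants to try the same idea outside Vim, here is a rough Python analogue. It is an assumption-laden sketch: Python's *? stands in for Vim's \{-}, and re.MULTILINE makes ^ and $ work per line.

```python
import re

listing = "\n".join([
    "path/one/",
    "/another/index.html",
    "other/file/index.html",
    "path/number/two",
    "this/is/the/third/path/",
    "path/five",
    "sixth/path/goes/here/",
])

# Non-greedy group of non-dot characters, then an optional trailing slash.
# Lines containing a dot (already ending in index.html) cannot match.
fixed = re.sub(r"^([^.]*?)/?$", r"\1/index.html", listing, flags=re.MULTILINE)
print(fixed)
```

Because the group is non-greedy, the optional slash is consumed outside the group and is not copied into the replacement.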
Vim's regex supports the ability to match a bit of text foo only if it does or doesn't precede or follow some other text bar, without matching bar, and this is exactly the sort of thing you're looking for. Here you want to match the end of line with an optional /, but only if it isn't preceded by index.html, and then replace it with /index.html. A quick look at Vim's help tells me \@<! is exactly what to use. It tells Vim that the preceding atom must not match just before what follows, without becoming part of the match. With a little experimentation, I get
:%s;/\?\(index\.html\)\@<!$;/index.html;
I use ; to delimit the parts of the :s command so that I don't have to escape any / in the regex or replacement expression. In this particular situation, it's not a big deal though.
The / is optional, and we say so with \?.
We need to group index.html together because otherwise our special \@<! would only affect the l.