RegEx for matching everything between two special characters [duplicate] - regex

I want to find all characters between two special characters. I can't find the solution, though, because new lines are not included in the match. It's probably easy, but I can't seem to find the right regex for it.
How do I solve this problem?
I tried this pattern:
\#(.*)\;
but it doesn't include new lines, and
(?!\#)([\S\s])(?!=\;)
doesn't work either.
It selects everything, but doesn't capture the groups I want.
Source looks like this:
#first line of text;
#second line of text;
#third line could easy
be on a new line;
#forth etc;
#this could (#hi,#hi,#hi) also
happen though:));
#so.... any idea;
Every entry starts with # and ends with ;

I see two problems in your regex:
You are missing a quantifier on [\S\s], so it matches only one character.
You also need a non-greedy quantifier so the match doesn't run across all the lines.
Also, where you wrote (?!#), I guess you meant to match any one of those characters, in which case you should put them in a character set like this: [?!#]
You need this regex, which captures your text in group 1:
#([\w\W]*?);
And like you attempted, if you want your full match to only select the intended text, you can use lookaround.
Also, [^;]* (which matches newlines too) is usually faster than a lazy quantifier such as .*? because it does not have to check for the closing ; after every single character, so you should prefer this regex:
(?<=[?!#])[^;]*(?=;)
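As a rough check, the capturing pattern and the lookaround pattern can be compared in Python on a shortened copy of the sample from the question. This is only a sketch: the variable names are illustrative, and the lookbehind is reduced to a plain # on the assumption that # is the only opening delimiter in the data.

```python
import re

# Shortened copy of the sample data from the question.
sample = (
    "#first line of text;\n"
    "#third line could easy\n"
    "be on a new line;\n"
    "#this could (#hi,#hi,#hi) also\n"
    "happen though:));\n"
)

# Capturing group: the text between # and the next ; lands in group 1.
captured = re.findall(r"#([\w\W]*?);", sample)

# Lookarounds: the delimiters are asserted but not consumed, so the
# full match is already the wanted text.
looked = [m.group(0) for m in re.finditer(r"(?<=#)[^;]*(?=;)", sample)]

for entry in captured:
    print(repr(entry))
```

Both lists hold the same three entries here; the inner #hi markers are not treated as new entries because they sit inside a match that has already been consumed.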

You just need to modify your first regex a little bit so that it looks like this:
#([\s\S]*?);
. matches only non-newline characters, so I replaced it with [\s\S]: the set of whitespace characters united with the set of non-whitespace characters, that is, the set of all characters. If your regex engine has the "single line" option, you can turn that on, and . will match new lines as well.
I also made * lazy. Otherwise there will be just one match that runs all the way to the last ;. For more info, see this question.
You don't need to escape the ;.
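The difference between the greedy and the lazy quantifier is easy to see on a tiny input. A minimal Python sketch, with a made-up two-entry string:

```python
import re

# Two entries; the second one spans a line break.
text = "#a;\n#b\nc;"

# Greedy: [\s\S]* runs to the end and backtracks only to the LAST ;
greedy = re.findall(r"#([\s\S]*);", text)

# Lazy: [\s\S]*? stops at the FIRST ; after each #
lazy = re.findall(r"#([\s\S]*?);", text)

print(greedy)  # ['a;\n#b\nc']
print(lazy)    # ['a', 'b\nc']
```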

You have to use either the single-line flag /s or add whitespace characters \s as a second alternative next to any character (.). Also, your * quantifier must be lazy/non-greedy, so the whole regex stops at the first ; it finds.
#((?:.|\s)*?); or #(.*?);/s
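In Python the /s flag is spelled re.DOTALL. The sketch below, using a made-up two-entry string, compares the plain pattern, the flagged pattern, and the (?:.|\s) alternation:

```python
import re

# Second entry contains a line break.
text = "#one;\n#two\nlines;"

# Without the flag, . stops at the newline, so the second entry is lost.
plain = re.findall(r"#(.*?);", text)

# With DOTALL (the /s flag), . matches the newline as well.
dotall = re.findall(r"#(.*?);", text, flags=re.DOTALL)

# Alternation: \s covers the newline that . refuses to match.
alternation = re.findall(r"#((?:.|\s)*?);", text)

print(plain)        # ['one']
print(dotall)       # ['one', 'two\nlines']
print(alternation)  # ['one', 'two\nlines']
```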


Notepad++: How to remove all string except containing period [duplicate]

I have numerous SELECT statements joined by the UNION keyword in a single file. I want to extract only the db.table strings. How can I delete all words that do not contain a period (.) using regex in the Notepad++ editor? Database and table names are the only words with a period.
It's okay with me even if new lines are not removed. Though, as a learning bonus for everyone seeing this post, you could also show a regex that trims the new lines and produces this output:
db.table1
db.table2
...
db.tablen
You may try the following find and replace, in regex mode:
Find: (?<=^|\s)[^.]+(?=$|\s)
Replace: <empty string>
Note that my replacement only removes the undesired terms in the query; it does not make an effort to remove stray or leftover whitespace. To do that, you can easily do a quick second replacement to remove whitespace you don't want.
Edit:
It appears that Notepad++ doesn't like the variable width lookbehinds I used in the pattern. Here is a refactored, and more verbose version, which uses strictly fixed width lookbehinds:
(^[^.]+$)|(^[^.]+(?=\s))|((?<=\s)[^.]+$)|((?<=\s)[^.]+(?=\s))
The logic in both of the above patterns is to match a word consisting entirely of non dot characters, which are surrounded on either side by one or more of the following:
start of the string (^)
end of the string ($)
any type of whitespace (\s)
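The fixed-width version of the pattern can also be tried outside Notepad++. The following Python sketch runs it on a made-up one-line query (the query text and variable names are assumptions for illustration) and then splits on the leftover whitespace, which stands in for the second clean-up replacement mentioned above:

```python
import re

# The refactored pattern from the answer, with fixed-width lookbehinds only.
FIXED_WIDTH = re.compile(
    r"(^[^.]+$)|(^[^.]+(?=\s))|((?<=\s)[^.]+$)|((?<=\s)[^.]+(?=\s))"
)

# Hypothetical input: only the db.table names contain a period.
query = "SELECT a FROM db.table1 UNION SELECT b FROM db.table2"

# Remove every dot-free run of words, leaving stray whitespace behind.
stripped = FIXED_WIDTH.sub("", query)

# Splitting on whitespace discards the leftover blanks.
tables = stripped.split()

print(tables)  # ['db.table1', 'db.table2']
```

Note that [^.]+ also matches spaces, so one match can swallow several dot-free words at once.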
My guess is that maybe this expression:
([\s\S]*?)(\S*(\.)\S*)
being replaced with $2\n or:
(\S*(\.)\S*)|(.+?)
with $1 might work.

Regex for single line comments

I'm trying to make a regex to identify a comment. It has to start with // and end with a new line or a *) pattern.
So far I have (\/\/)([^\n\r]+), but I haven't managed to add the *) pattern.
Any tips?
Try it like this:
^\/\/[^\n\r]+(?:[\n\r]|\*\))$
Matches
^ Beginning of the string
\/\/ Match two forward slashes
[^\n\r]+ Match not a newline or a carriage return 1 or more times
(?: Non capturing group
[\n\r]|\*\) Match a newline or a carriage return or *)
) Close non capturing group
$ The end of the string
Edit:
Updated according to the comments, this is the final regex:
\/\/[^\n\r]+?(?:\*\)|[\n\r])
You can use (\/\/)(.+?)(?=[\n\r]|\*\)).
?= means the last group is a positive lookahead. It only asserts that the following characters match the new-line-or-*) pattern. If you want to match the new-line-or-*) pattern as well, just remove ?=.
.+? means lazy matching, i.e. matching as few characters as possible. So for a string such as // something *) something *), it will stop matching before the first *).
Note this pattern does not match //\n (your previous regex doesn't either) because + means at least one character. If you want to match such a string, use * instead of + in the regex.
Finally, although you can use regex to parse such single-line comments, as Jerry Coffin said in a comment, don't try to parse programming source code with regexes, because the language made up of all legal source code is commonly not a regular language.
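To see how the lookahead variant differs from the variant that consumes the terminator, here is a small Python sketch on a made-up two-line input (the input and the variable names are assumptions):

```python
import re

# Hypothetical source: one comment ended by a newline, one ended by *)
source = "// first comment\nx = 1 // trailing *) not a comment\n"

# Lookahead variant: the terminator is asserted, not consumed,
# so group 2 holds only the comment body.
bodies = [
    m.group(2)
    for m in re.finditer(r"(//)(.+?)(?=[\n\r]|\*\))", source)
]

# Consuming variant: the newline or *) is part of the match.
full = re.findall(r"//[^\n\r]+?(?:\*\)|[\n\r])", source)

print(bodies)  # [' first comment', ' trailing ']
print(full)    # ['// first comment\n', '// trailing *)']
```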
Extending the answer of @the-fourth-bird: if you need to find a block of consecutive single-line comments, something like this should help (change 3 to the number of lines you want):
^(\/\/.*[\r\n]){3}$
And if you are trying to find a block comment written with /** */, a few ways are explained here.

Regex Select groups not found in a pattern

I have been looking at the various topics on regex on SO, and they all say that to invert a match (select everything that doesn't fit the criteria) you simply use the [^] syntax or a negative lookahead.
I have tried both of these methods on my regex, but the results are not adequate; the [^] especially seems to take all its contents literally (even when escaped).
What I need this for:
I have a massive single-line SQL dump, and I'm trying to remove all characters that are not the line id or the numerical value of one column.
My regex works in matching exactly what I'm looking for; what I need to do is to invert this match so I can remove all non-matching parts in my IDE.
My regex:
/(\),\(\d{1,4},)|(,\d{10},)/
This matches a "),(<number up to 4 digits>," or ",<number of ten digits>,".
The subject
My subject is a 500Kb line of an SQL dump looking something like this (I have already removed a-z and other unwanted characters in previous simple find/replaces):
),(39,' ',1,'01761472100','#','9 ','20',1237213277,0,1237215419,''),(40,' ',3,'01445731203','#',' ','-','22 2','210410//816',1237225423,0,1484651768,''),(4270,' /
My aim is to use a regex to achieve the following output:
),(39,,1237213277,,1237215419,),(40,,1237225423,,1484651768,),(4270,
Which I can then go over again and easily remove repetitions such as commas.
I have read that Negation in Regex is tricky, So, what is the syntax to get the regex I've made to work inverted? To remove all non-matching groups? What can you recommend as a way of solving this without spending hours manually reading the lines?
In PCRE you can use the really helpful (*SKIP)(?!) construct (equivalent to (*SKIP)(*F) or (*SKIP)(*FAIL)) to match the texts you want to keep, skip them, and then match all other text for removal:
/(?:\),\(\d{1,4},|,\d{10},)(*SKIP)(?!)|./s
Details:
(?:\),\(\d{1,4},|,\d{10},) - match 1 of the 2 alternatives:
\),\(\d{1,4}, - ),(, then 1 to 4 digits and then ,
| - or
,\d{10}, - a comma, 10 digits, a comma
(*SKIP)(?!) - omit the matched text and proceed to the next match
| - or
. - any char (since /s DOTALL modifier is passed to the regex)
The same can be done with
/(\),\(\d{1,4},|,\d{10},)?./s
and replacing with the $1 backreference (since we need to put back the text captured with the patterns we need to keep).
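Python's built-in re module has no (*SKIP)(*FAIL), but the capture-and-put-back variant carries over directly. A minimal sketch on a shortened, made-up subject (the subject string and the KEEP name are assumptions); it relies on Python 3.5+ substituting an empty string for an unmatched group:

```python
import re

# Hypothetical, shortened subject in the same shape as the SQL dump.
subject = "),(39,' ',1,'017',1237213277,0,1237215419,''),(40,'x'"

# The two kinds of chunk we want to keep.
KEEP = r"\),\(\d{1,4},|,\d{10},"

# Optionally capture a wanted chunk, then consume one more character.
# Replacing with group 1 puts the chunk back and drops the character.
cleaned = re.sub(rf"({KEEP})?.", r"\1", subject, flags=re.DOTALL)

# Cross-check: collect the wanted chunks directly and join them.
joined = "".join(re.findall(KEEP, subject))

print(cleaned)  # ),(39,,1237213277,,1237215419,),(40,
```

The trailing . always eats the character right after a kept chunk, so this approach assumes two kept chunks never touch each other.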

Skip Second String Between Characters with Regex

I've been working on a regex issue. I have a lot of lines formatted like this:
3240985|#Apple.-+240538|34346|346356356|36433565|6agf8s89auf
The end goal should look like this:
#Apple.-+240538|6agf8s89auf
#Apple.-+240538 is random characters, and 6agf8s89auf is random alphanumeric characters.
I've been using (.*?)[\|] and replacing the parts I need with blank characters in Notepad++ but it's impossible to complete it this way with the number of lines I have.
The regex for this kind of string is (?:(?<=^)|(?<=\|))(\d+(?:$|\|))
Demo: https://regex101.com/r/sO0fZ2/2
However, Find and Replace in Notepad++ may have some issues, because Notepad++ finds and replaces strings only once. Some other text editors, like Sublime Text, find and replace the contents recursively. However, you can simply overcome this by clicking the Replace All button multiple times.
In Notepad++, clicking "Replace All in All Opened Documents" twice produces the final result.
In Sublime Text, you can achieve this in a single click.
P.S.: I'm not aware of any feature in Notepad++ that finds and replaces content recursively. You can google for that. If there is such a feature, you can use it. However, I think this shouldn't be a problem, because it only requires a couple more clicks.
There is a simple approach with an alternation:
^\d+\||\|\d+(?=\||$)
Details:
^\d+\| - Branch 1 matching a chunk of 1+ digits (\d+) at the beginning of the string (^) and a | after them
| - alternation operator meaning OR
\|\d+(?=\||$) - a literal pipe (\|, must be escaped) with 1+ digits after it (\d+) that are followed with a literal pipe or end of string ((?=...) is a positive lookahead that does not advance the regex index, thus, you can still match adjacent matches with the same pattern.)
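The alternation can be checked quickly in Python. In this sketch the first line is the one from the question, the second line is made up, and re.MULTILINE is added (an assumption, not part of the original answer) so that ^ and $ work per line:

```python
import re

lines = (
    "3240985|#Apple.-+240538|34346|346356356|36433565|6agf8s89auf\n"
    "111|abc|222|x9"  # hypothetical second line
)

# Branch 1 drops a leading all-digit field together with its pipe.
# Branch 2 drops a pipe plus an all-digit field; the lookahead leaves the
# following pipe in place, so adjacent numeric fields all get matched.
cleaned = re.sub(r"^\d+\||\|\d+(?=\||$)", "", lines, flags=re.MULTILINE)

print(cleaned)
```

Because the lookahead does not consume the next pipe, a single pass is enough here and no repeated Replace All is needed.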

How to match until get specific pattern in Regex

I have a scenario where I want to match a specific word and then match everything until I reach another pattern. For example:
ABC=145865865
Then anything comes in ways
and then
Date=11/11/2001
I have tried (.*?), but it only matches within a single line; in my scenario I have multiple lines of data in between.
How can i do this?
Closest guess to what I think you're looking for:
ABC=(\d+)[\s\S]*?Date=(\d\d/\d\d/\d{4})
This uses [\s\S] which means "either a whitespace character or not a whitespace character", which is equivalent to "any character". The . can also be set to match any character, but I tend to prefer [\s\S] because it does just that without having to set flags. You haven't specified the language you are using so I can't tell you how to set such a flag anyway (it's re.DOTALL in Python).
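A short Python sketch of both options, using the example from the question as input (the variable names are illustrative):

```python
import re

text = (
    "ABC=145865865\n"
    "Then anything comes in ways\n"
    "and then\n"
    "Date=11/11/2001\n"
)

# [\s\S]*? crosses line breaks without any flag.
match = re.search(r"ABC=(\d+)[\s\S]*?Date=(\d\d/\d\d/\d{4})", text)

# Same idea with . and the DOTALL flag.
flagged = re.search(
    r"ABC=(\d+).*?Date=(\d\d/\d\d/\d{4})", text, flags=re.DOTALL
)

print(match.groups())  # ('145865865', '11/11/2001')
```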
Multiple lines? If you mean you have newline characters (\n) in between, then you need to set the DOTALL flag, as follows (in Java):
Pattern p = Pattern.compile(<your-regex-here>, Pattern.DOTALL);
With this flag, . also matches the newline characters between the two strings.