How to remove a line break between two strings (RegEx) - regex

I am trying to develop a quick hack in SublimeText2 (not ideal, I know):
I have this (frequent) code in my markup:
{% url '
someURL ' %}
How can I use regex to remove the line breaks such that I have {% url 'someURL '%}?
I have succeeded in selecting everything between the brackets:
\{\%[\s\S]*?\%\}
However, I can't figure out how to select only the linebreaks \n and double spaces within it.

Use the below regex and then replace the match with a single space.
(?s)\s+(?=(?:(?!%}|\{%).)*%\})
Explanation:
(?s)                     set flags for this block (with . matching
                         \n) (case-sensitive) (with ^ and $
                         matching normally) (matching whitespace
                         and # normally)
\s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                         more times)
(?=                      look ahead to see if there is:
  (?:                      group, but do not capture (0 or more
                           times):
    (?!                      look ahead to see if there is not:
      %}                       '%}'
     |                        OR
      \{                       '{'
      %                        '%'
    )                        end of look-ahead
    .                        any character
  )*                       end of grouping
  %                        '%'
  \}                       '}'
)                        end of look-ahead
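As a rough illustration of what that replacement does, here is a minimal Python sketch (Python's re module accepts this pattern unchanged). The sample markup and the function name collapse_tag_whitespace are made up for the example. Note that every whitespace run inside a tag becomes one space, so the line break turns into a space rather than disappearing.

```python
import re

# A whitespace run that is followed by a closing %} with no other
# tag delimiter ({% or %}) in between, i.e. whitespace inside a tag.
TAG_WHITESPACE = re.compile(r"(?s)\s+(?=(?:(?!%}|\{%).)*%\})")


def collapse_tag_whitespace(markup: str) -> str:
    """Replace each whitespace run inside {% ... %} with a single space."""
    return TAG_WHITESPACE.sub(" ", markup)


# Hypothetical sample: one broken tag plus whitespace outside the tag.
sample = "<a href=\"{% url '\nsomeURL ' %}\">x  y</a>"
print(collapse_tag_whitespace(sample))
```

Whitespace outside the tags (the space in "a href" and the double space in "x  y") is left alone, because the look-ahead cannot reach a %} without crossing a {% first.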

You can use this pattern:
(?:\G(?!\A)|\{%)[^%\r\n]*\K(?:\r?\n)+(?=[^%]*%\})
The replacement is an empty string.
This pattern ensures that you are always between the tags {% and %} by using the \G anchor, which matches at the position where the previous match ended.
The \K discards everything matched to its left from the match result, so only the CRLF or LF sequences are removed.
This pattern can be improved if you want to allow % characters between tags:
(?:\G(?!\A)|\{%)(?:[^%\r\n]|%(?!\}))*\K(?:\r?\n)+(?=(?:[^%]|%(?!\}))*%\})
or, more efficiently (if Sublime Text supports atomic groups):
(?:\G(?!\A)|\{%)(?>[^%\r\n]+|%+(?!\}))*\K(?:\r?\n)+(?=(?>[^%]+|%+(?!\}))*%\})
a little shorter (if sublimetext regex engine is smart enough):
(?:\G(?!\A)|{%)(?>[^%\r\n]+|%+(?!}))*\K\R+(?=(?>[^%]+|%+(?!}))*%})
Note: if you are sure that tags are always balanced, you can remove the last lookahead (but this way is less safe):
(?:\G(?!\A)|{%)(?>[^%\r\n]+|%+(?!}))*\K\R+
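Some engines lack \G and \K (Python's re module, for example, supports neither). There the same result can be approximated in two steps: match each whole tag, then strip the line breaks inside the matched text. The sketch below is that substitute technique, not the \G/\K pattern itself; the function name and sample string are assumptions for illustration.

```python
import re

# Step 1: a whole tag, from {% to the nearest %} (may span lines).
TAG = re.compile(r"\{%.*?%\}", re.DOTALL)
# Step 2: runs of LF or CRLF line breaks.
LINE_BREAKS = re.compile(r"(?:\r?\n)+")


def remove_line_breaks_in_tags(markup: str) -> str:
    """Delete line breaks inside {% ... %} tags, leaving other text alone."""
    return TAG.sub(lambda m: LINE_BREAKS.sub("", m.group(0)), markup)


# Hypothetical sample with a CRLF inside the tag and LFs outside it.
sample = "{% url '\r\nsomeURL ' %}\nplain\ntext"
print(repr(remove_line_breaks_in_tags(sample)))
```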

(\{%.*)\n\s*(.*%\})
With the replacement string \1\2, this changes
{% url '
someURL ' %}
to {% url 'someURL ' %}
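To try this substitution outside the editor, a small Python sketch follows (the function name join_tag_lines and the test strings are hypothetical). It also shows a limit of the pattern: because . does not cross line breaks, a tag is only repaired when it is split across exactly two lines.

```python
import re

# Group 1: from {% to the end of its line; group 2: the next line up to %}.
JOIN = re.compile(r"(\{%.*)\n\s*(.*%\})")


def join_tag_lines(markup: str) -> str:
    """Join a tag that is split across two lines, as in the answer above."""
    return JOIN.sub(r"\1\2", markup)


two_lines = "{% url '\nsomeURL ' %}"
three_lines = "{% url\n'some\nURL' %}"

print(join_tag_lines(two_lines))           # the tag is joined
print(repr(join_tag_lines(three_lines)))   # left unchanged
```

One more caveat: since group 1 is greedy up to the end of the line, the pattern will also join two consecutive lines that each contain a complete tag, so it is best applied to a selection rather than to a whole file.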

Related

How to match fuzzy empty div with a regular expression?

I have the following HTML code:
<div id="page126-div" style="position:relative;width:918px;height:1188px;">
</div>
<div id="page127-div" style="position:relative;width:918px;height:1188px;">
sometext for example
</div>
<div id="page128-div" style="position:relative;width:918px;height:1188px;">
</div>
My task is to match empty divs. Empty means in this context that they have no content at all (no characters between the opening > and the closing <), or contain just a newline, just a space, or fewer than 5 characters. So emptiness is pretty fuzzy.
If I wanted to match all divs, not only the empty ones, I would use the following regex:
\<div id="page.*?"\>.*?\<\/div\>
Naturally, I should use it with the dotall modifier.
To match only the empty divs, I tried this expression:
\<div id="page.*?"\>.{0,5}?\<\/div\>
I expect to get the first and the last (third) divs, because each consists of an opening div tag with attributes, then content of 0 to 5 characters, then a closing div tag.
The first match is right, but the second match is the second and third divs stacked together instead of the third div only.
I do not understand why.
This regex is pretty straightforward:
<div id=\"[^"]+?\" style=[^>]+?>(\s|\n|[^\n]{0,5})<\/div>
Note that it doesn't require the divs to have the exact same id and style values.
You can give this a try.
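As for why the original pattern stacks the second and third divs: the lazy .*? in id="page.*?" may still run past the end of the opening tag when nothing closer works. For the second div it keeps expanding through the text content and the next opening tag until it finds a "> that is followed by at most five characters and </div>. Below is a minimal Python sketch of that behaviour, with a tighter variant that confines each part to its own tag using negated character classes. The names LOOSE and STRICT and the tighter pattern are illustrative assumptions, not taken from the answers.

```python
import re

HTML = '''<div id="page126-div" style="position:relative;width:918px;height:1188px;">
</div>
<div id="page127-div" style="position:relative;width:918px;height:1188px;">
sometext for example
</div>
<div id="page128-div" style="position:relative;width:918px;height:1188px;">
</div>'''

# The pattern from the question, with the dotall flag.
LOOSE = re.compile(r'<div id="page.*?">.{0,5}?</div>', re.DOTALL)

# [^"]* cannot leave the id value, [^>]* cannot leave the opening tag,
# and [^<]{0,5} cannot run into the next tag.
STRICT = re.compile(r'<div id="(page[^"]*)"[^>]*>[^<]{0,5}</div>')

loose_matches = [m.group(0) for m in LOOSE.finditer(HTML)]
strict_ids = STRICT.findall(HTML)

print(len(loose_matches))   # two matches, the second spans two divs
print(strict_ids)           # ids of the divs considered empty
```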
Scraper Series
/(?><div(?=(?:[^>"']|"[^"]*"|'[^']*')*?\sid\s*=\s*(?:(['"])\s*page(?:(?!\1)[\S\s])*\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+>)\s*[\S\s]{0,5}\s*<\/div\s*>/
https://regex101.com/r/x8jf8D/1
Formatted
(?>
< div # div tag
(?= # Assertion (a pseudo atomic group)
(?: [^>"'] | " [^"]* " | ' [^']* ' )*?
\s id \s* = \s*
(?:
( ['"] ) # (1), Quote
\s* page # id value must start with "page"
(?:
(?! \1 )
[\S\s]
)*
\1
)
)
\s+
(?:
" [\S\s]*? "
| ' [\S\s]*? '
| (?:
(?! /> )
[^>]
)?
)+
>
)
\s* # Optional whitespaces (remove if necessary)
[\S\s]{0,5} # Optional 0-5 of anything (including whitespace)
\s* # Optional whitespaces (remove if necessary)
</div \s* >

How I can find all of newline and characters between comments tags?

How can I match the code below? For example, I have:
<!--[if !mso]>
<style>
v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style>
<![endif]-->
And this is the pattern I have so far:
<!--.*>(.|\n)*<!.*-->
I just need to match this block with a regular expression and then replace it. I don't need to keep any of the tags, but the match has to start at <!--[if !mso]> and end at <![endif]-->.
Use [\s\S]*? to do a non-greedy match of any character zero or more times.
<!--.*?>([\s\S]*?)<!.*?-->
OR
(?s)<!--.*?>(.*?)<!.*?-->
(?s) is the DOTALL modifier, which makes the dot in your regex also match line breaks (\n, \r).
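The difference between the greedy and the non-greedy version only shows up when the input contains more than one conditional comment. A small Python sketch, using a made-up input with two comment blocks around the letter B:

```python
import re

# The pattern from the question (greedy) and the non-greedy rewrite.
GREEDY = re.compile(r"<!--.*>(.|\n)*<!.*-->")
LAZY = re.compile(r"<!--.*?>([\s\S]*?)<!.*?-->")

# Hypothetical input: text A, a comment, text B, another comment, text C.
text = "A<!--[if !mso]>\nx\n<![endif]-->B<!--[if !mso]>\ny\n<![endif]-->C"

greedy_result = GREEDY.sub("", text)  # one match from the first <!-- to the last -->
lazy_result = LAZY.sub("", text)      # one match per comment block

print(greedy_result)  # AC
print(lazy_result)    # ABC
```

The greedy pattern swallows the B between the two blocks, while the non-greedy one stops at the first closing marker each time.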
Try this:
<!--.*>([\s\S]*)<!\[endif\]-->
Demo: https://regex101.com/r/aL9bW9/1
<!--.*?>((?:.|\n)*?)<!.*?-->
Your regex is fine. Just make all the greedy * quantifiers non-greedy. See the demo.
https://regex101.com/r/tX2bH4/38
Try this:
(?s)<!--((?!-->).)*-->
Explanation of each item in the regular expression:
NODE                     EXPLANATION
----------------------------------------------------------
(?s)                     set flags for this block (with . matching
                         \n) (case-sensitive) (with ^ and $
                         matching normally) (matching whitespace
                         and # normally)
----------------------------------------------------------
<!--                     '<!--'
----------------------------------------------------------
(                        group and capture to \1 (0 or more times
                         (matching the most amount possible)):
----------------------------------------------------------
  (?!                      look ahead to see if there is not:
----------------------------------------------------------
    -->                      '-->'
----------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------
  .                        any character
----------------------------------------------------------
)*                       end of \1 (NOTE: because you are using a
                         quantifier on this capture, only the LAST
                         repetition of the captured pattern will be
                         stored in \1)
----------------------------------------------------------
-->                      '-->'
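A minimal Python sketch of this pattern applied to the snippet from the question. Since the captured group only keeps the last repetition anyway, the sketch uses a non-capturing group instead; that change, the function name and the surrounding paragraphs in the sample are assumptions made for the example.

```python
import re

# <!-- followed by any characters, one at a time, as long as the
# closing --> does not start at the current position, then -->.
COMMENT = re.compile(r"(?s)<!--(?:(?!-->).)*-->")


def strip_comments(html: str) -> str:
    """Remove every <!-- ... --> block, including multi-line ones."""
    return COMMENT.sub("", html)


sample = (
    "<p>before</p>\n"
    "<!--[if !mso]>\n"
    "<style>\n"
    "v\\:* {behavior:url(#default#VML);}\n"
    "</style>\n"
    "<![endif]-->\n"
    "<p>after</p>"
)
print(strip_comments(sample))
```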

Sublime SQL REGEX highlighting

I'm trying to modify an existing language definition in Sublime Text.
It's for SQL.
Currently (\#|##)(\w)* is being used to match against local declared parameters (e.g. #MyParam) and also system parameters (e.g. ##Identity or ##FETCH_STATUS)
I'm trying to split these into 2 groups
System parameters I can get like (\#\#)(\w)* but I'm having problems with the local parameters.
(\#)(\w)* matches both #MyParam and ##Identity
(^\#)(\w)* only highlights the first match (i.e. #MyParam but not #MyParam2)
Can someone help me with the correct regex?
Try the below regex to capture local parameters and system parameters into two separate groups.
(?<=^| )(#\w*)(?= |$)|(?<=^| )(##\w*)(?= |$)
Update:
Sublime Text 2 supports \K (which discards everything matched so far from the reported match), so you can also use:
(?m)(?:^| )\K(#\w*)(?= |$)|(?:^| )\K(##\w*)(?= |$)
Explanation:
(?m)                     set flags for this block (with ^ and $
                         matching start and end of line) (case-
                         sensitive) (with . not matching \n)
                         (matching whitespace and # normally)
(?:                      group, but do not capture:
  ^                        the beginning of a "line"
 |                        OR
                           ' '
)                        end of grouping
\K                       '\K' (resets the starting point of the
                         reported match)
(                        group and capture to \1:
  #                        '#'
  \w*                      word characters (a-z, A-Z, 0-9, _) (0 or
                           more times)
)                        end of \1
(?=                      look ahead to see if there is:
                           ' '
 |                        OR
  $                        before an optional \n, and the end of a
                           "line"
)                        end of look-ahead
|                        OR
(?:                      group, but do not capture:
  ^                        the beginning of a "line"
 |                        OR
                           ' '
)                        end of grouping
\K                       '\K' (resets the starting point of the
                         reported match)
(                        group and capture to \2:
  ##                       '##'
  \w*                      word characters (a-z, A-Z, 0-9, _) (0 or
                           more times)
)                        end of \2
(?=                      look ahead to see if there is:
                           ' '
 |                        OR
  $                        before an optional \n, and the end of a
                           "line"
)                        end of look-ahead

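The split into local and system parameters can also be checked outside Sublime Text. Python's re module rejects the variable-width look-behind (?<=^| ) and has no \K, so the sketch below swaps in (?<!\S) and (?!\S), meaning "not preceded or followed by a non-whitespace character". That is slightly broader than the answer's patterns, since tabs count as separators too. The group names, function name and sample statement are assumptions for illustration.

```python
import re

# The ## alternative is tried first; a single # directly after another #
# is rejected by the (?<!\S) look-behind, so ##Identity is never split.
PARAM = re.compile(r"(?<!\S)(?:(?P<system>##\w*)|(?P<local>#\w*))(?!\S)")


def classify_parameters(sql: str):
    """Return (local, system) parameter lists found in a SQL string."""
    local, system = [], []
    for match in PARAM.finditer(sql):
        if match.group("system") is not None:
            system.append(match.group("system"))
        else:
            local.append(match.group("local"))
    return local, system


# Hypothetical statement; parameters are separated by spaces, as the
# patterns in the answer require.
local_params, system_params = classify_parameters(
    "SET #MyParam = ##Identity + #MyParam2"
)
print(local_params, system_params)
```
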
I can't find proper regexp

I have the following file (it follows this scheme, but is much longer):
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
Every line contains two words (e.g. LSE ZTX), with optional spaces and/or tabs at the beginning and at the end, and always some in between.
Could someone help me to match these 2 words with regexp? Following the example I wish to have LSE in $1 and ZTX in $2 for the first line, SWX in $1 and ZURN in $2 for the second etc.
I have tried something like:
$line =~ /(\t|\s)*?(.*?)(\t|\s)*?(.*?)/msgi;
$line =~ /[\t*\s*]?(.*?)[\t*\s*]?(.*?)/msgi;
I don't know how to express that there could be either spaces or tabs (or a mix of both, e.g. \t\s\t).
If you want to just match the two first words, the most basic thing is to just match any sequence of characters that are not whitespace:
my ($word1, $word2) = $line =~ /\S+/g;
This will capture the first two words in $line into the variables, if they exist. Note that parentheses are not required when using the /g modifier. Use an array instead if you want to capture all existing matches.
Since there are always two words, you don't need to match the entire line, so the simplest regex would be:
/(\w+)\s+(\w+)/
I think this is what you want
^\s*([A-Z]+)\s+([A-Z]+)
The first code of a row ends up in group 1 and the second in group 2. \s is a whitespace character; it includes e.g. spaces, tabs and newline characters.
In Perl it is something like this:
($code1, $code2) = $line =~ /^\s*([A-Z]+)\s+([A-Z]+)/i;
I think you are reading the text file row by row, so you don't need the modifiers s and m, and g is also not needed.
In case the codes are not only ASCII letters, then replace [A-Z] with \p{L}. \p{L} is a Unicode property that will match every letter in every language.
\s also matches tabs, so your regex could look like this:
$line =~ /^\s*([A-Z]+)\s+([A-Z]+)/;
The first word is in the first group ($1) and the second in $2.
You can change [A-Z] to whatever suits your needs.
Here is the explanation from YAPE::Regex::Explain
The regular expression:
(?-imsx:^\s*([A-Z]+)\s+([A-Z]+))
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
With option "Multiline" this Regex:
^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$
Will give you N matches each containing 2 groups named:
- word1
- word2
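For comparison, a minimal Python sketch of the same idea. Python spells named groups (?P<name>...) and enables multiline mode with re.MULTILINE. As a deliberate variation, the sketch uses [ \t] instead of \s so that a match can never spill over onto the following line; the sample data with mixed tabs and spaces is made up.

```python
import re

# One match per line: optional leading blanks, word1, blanks, word2,
# optional trailing blanks, end of line.
PAIR = re.compile(
    r"^[ \t]*(?P<word1>\S+)[ \t]+(?P<word2>\S+)[ \t]*$",
    re.MULTILINE,
)

# Hypothetical file contents with mixed spaces and tabs.
DATA = "  LSE \tZTX\nSWX ZURN  \n\t LSE ZYT\nNYSE\t\tCGI\n"

pairs = [(m["word1"], m["word2"]) for m in PAIR.finditer(DATA)]
for word1, word2 in pairs:
    print(word1, word2)
```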
^\s*([A-Z]{3,4})\s+([A-Z]{3,4})$
What this does
^ // Matches the beginning of a string
\s* // Matches a space/tab character zero or more times
([A-Z]{3,4}) // Matches any letter A-Z either 3 or 4 times and captures to $1
\s+ // Then matches at least one tab or space
([A-Z]{3,4}) // Matches any letter A-Z either 3 or 4 times and captures to $2
$ // Matches the end of a string
You can use split here:
use strict;
use warnings;
while (<DATA>) {
my ( $word1, $word2 ) = split;
print "($word1, $word2)\n";
}
__DATA__
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
Output:
(LSE, ZTX)
(SWX, ZURN)
(LSE, ZYT)
(NYSE, CGI)
Assuming the spaces at the start of the line are what you use to identify your codes you want, try this:
Split your string up at newlines, then try this regex:
^\s+(\w+\s+){2}$
This will only match lines that start with some space, followed by a (word - some space - word), then end with some space.
# ^ --> String start
# \s+ --> One or more whitespace characters
# (\w+\s+){2} --> A (word followed by some space)x2
# $ --> String end.
However, if you want to capture the codes alone, try this:
$line =~ /^\s*(\w+)\s+(\w+)/;
# \s* --> Zero or more whitespace,
# (\w+) --> Followed by a word (group #1),
# \s+ --> Followed by some whitespace,
# (\w+) --> Followed by a word (group #2),
This will match all your codes
/[A-Z]+/

Non-Greedy Single Character Match Regex

I'm doing a non-greedy match like this
'(?<C2>.+?)'
to find a group inside a quotes. This works well, until I want to do something like this
'(?<C2>.+?)' as
to match something in quotes followed by a space, followed by the word as.
But now, the following will not match as desired
'hello'123'hello2' as
I want this to not match at all...but it ends up matching the whole chunk
'hello'123'hello2'
as C2
What's the best way to force the non-greedy .+? to include up to the first occurrence of a ', not the first occurrence of ' as
This seems to work
(?<C2>'[^']+')(?= as)
Explanation
NODE                     EXPLANATION
--------------------------------------------------------------------------------
(?<C2>                   group and capture to C2:
--------------------------------------------------------------------------------
  '                        '\''
--------------------------------------------------------------------------------
  [^']+                    any character except: ''' (1 or more
                           times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  '                        '\''
--------------------------------------------------------------------------------
)                        end of C2
--------------------------------------------------------------------------------
(?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
   as                      ' as'
--------------------------------------------------------------------------------
)                        end of look-ahead
Even without the lookahead (?= as), (?<C2>'[^']+') will match quoted strings in a non-greedy way as expected.
You can try:
'(?<C2>[^']+?)' as
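To see how the negated character class differs from the lazy dot, here is a small Python sketch run on the string from the question (Python writes named groups as (?P<C2>...); the constant names are made up). The lazy dot is free to step over quotes, whereas [^'] is not.

```python
import re

LAZY_DOT = re.compile(r"'(?P<C2>.+?)' as")
NEGATED = re.compile(r"'(?P<C2>[^']+)' as")

text = "'hello'123'hello2' as"

lazy_capture = LAZY_DOT.search(text).group("C2")
negated_capture = NEGATED.search(text).group("C2")

print(lazy_capture)     # hello'123'hello2
print(negated_capture)  # hello2
```

Note that the negated class still finds 'hello2' as in this input. If the string should not match at all, the pattern would also have to constrain what may precede the opening quote.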
I think I understood your question differently than those who have replied so far. By
What's the best way to force the non-greedy .+? to include up to the first occurrence of a ', not the first occurrence of ' as
did you mean to say you wanted to match the word between the first two ', i.e. hello, not hello2? In that case, this is my suggestion:
'(?<C2>.+?)'(?! as)
The negative lookahead will ensure that you will not match the word which comes before as.
In case I misunderstood your request: sorry.