Regex allow only one dash or only one space - regex

I want an expression that allows digits with one dash OR digits with one space. The space or dash is optional.
I tried this
/^([0-9]+(-[0-9]+)?)|([0-9]+(\s[0-9]+)?)$/
Strings that should be accepted:
11-222
444 99

You can put the OR in the middle of your expression: ^([0-9]+)(\s|-)([0-9]+)$ works with your examples in Notepad++, although it makes the separator mandatory rather than optional.

Let's explain your regex.
^ # beginning of line
( # start group 1
[0-9]+ # 1 or more digits
( # start group 2
- # a hyphen
[0-9]+ # 1 or more digits
)? # end group 2, optional
) # end group 1
| # OR
( # start group 3
[0-9]+ # 1 or more digits
( # start group 4
\s # a space
[0-9]+ # 1 or more digits
)? # end group 4, optional
) # end group 3
$ # end of line
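The OR applies to everything on each side of it: ^ belongs only to the first alternative (group 1) and $ belongs only to the second (group 3). So group 1 is anchored only at the beginning of the line and group 3 only at the end, but you want both alternatives anchored at the beginning and at the end.
Add a group around groups 1 and 3: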
^(([0-9]+(-[0-9]+)?)|([0-9]+(\s[0-9]+)?))$
You can use non-capturing groups (more efficient) instead of capturing groups:
^(?:(?:[0-9]+(?:-[0-9]+)?)|(?:[0-9]+(?:\s[0-9]+)?))$
Combine the hyphen and the space in a character class and remove the superfluous groups:
^[0-9]+(?:[-\s][0-9]+)?$
If your regex flavour supports it, change the [0-9] into \d. Finally your regex becomes:
^\d+(?:[-\s]\d+)?$
Much simpler, no?
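To see the difference in practice, here is a minimal JavaScript sketch that runs both the question's original pattern and the final pattern against a few strings. The extra sample strings beyond 11-222 and 444 99 are made up for illustration.

```javascript
// Final pattern from the answer: digits, optionally followed by ONE
// hyphen or whitespace character and more digits.
const pattern = /^\d+(?:[-\s]\d+)?$/;

// The question's original pattern, kept to show the alternation problem:
// ^ only applies to the left alternative and $ only to the right one.
const original = /^([0-9]+(-[0-9]+)?)|([0-9]+(\s[0-9]+)?)$/;

// The last three samples are illustrative and not from the question.
const samples = ["11-222", "444 99", "12345", "11-22-33", "11 - 22", "abc 99"];

for (const s of samples) {
  console.log(JSON.stringify(s), "final:", pattern.test(s), "original:", original.test(s));
}
```

The final pattern accepts only the first three samples, while the original accepts all six because each alternative is anchored on one side only.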

Related

Replace multiple occurrences in the same line

I'm using Notepad++ to replace some lines. Basically what I want to do is:
line 1 -
STR::P=FOOXPATTERN=5 AND MORETHINGS YPATTERN=9 BUT XPATTERN=3 AND YPATTERN=20
line 2 -
MOR::P=BAR XPATTERN=1 STRSTR MORETHINGS YPATTERN=1BUT XPATTERN=10 AND YPATTERN=40
...
So this must be transformed in:
line 1 -
XPATTERN=5|YPATTERN=9|XPATTERN=3|YPATTERN=20
line 2 -
XPATTERN=1|YPATTERN=1|XPATTERN=10|YPATTERN=40
My point is that I can have many XPATTERN and many YPATTERN in the same line. I would like to replace the whole line with only the patterns found.
I tried to use negation in the regex, but with no success.
Ctrl+H
Find what: (?:^|\G(?!^)).*?((?:XPATTERN|YPATTERN)=\d+)(?:(?!(?:XPATTERN|YPATTERN)=).)*($)?
Replace with: $1(?2:|)
CHECK Match case
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
(?: # non capture group
^ # beginning of line
| # OR
\G(?!^) # restart from last match position, not at the beginning of line
) # end group
.*? # 0 or more any character but newline
( # group 1
(?: # non capture group
XPATTERN # XPATTERN
| # OR
YPATTERN # YPATTERN
) # end group
=\d+ # equal sign followed by 1 or more digits
) # end group 1
(?: # non capture group
(?! # negative lookahead, make sure we don't have after:
(?: # non capture group
XPATTERN # XPATTERN
| # OR
YPATTERN # YPATTERN
) # end group
= # equal sign
) # end lookahead
. # any character but newline
)* # end group, may appear 0 or more times
($)? # group 2, end of line, optional
Replacement:
$1 # content of group 1 (i.e. X or Y PATTERN = digits)
(?2 # IF group 2 exists (end of line), do nothing
: # ELSE
| # add a pipe character
) # ENDIF
Use a regexp that matches the pattern and anything before it, and replaces it with just the pattern.
Replace: .*?((XPATTERN|YPATTERN|ZPATTERN|...)=\d+)
With: |\1
If there's something after all the patterns, you can remove the rest after the above replacements with:
Replace: ^((\|(XPATTERN|YPATTERN|ZPATTERN|...)=\d+)*).*
With: \1
This will leave a | at the beginning of each line. You can remove that as a third step:
Replace: ^\|
With: empty string
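Outside Notepad++, the same result can be reached without a replace at all: collect every NAME=digits token and join them with a pipe. This is only a sketch of that alternative in JavaScript; keepPatterns is a hypothetical helper name, and the alternation lists just the two names from the question.

```javascript
// Keep only the NAME=digits tokens of a line and join them with a pipe.
// XPATTERN and YPATTERN come from the question; extend the alternation as needed.
function keepPatterns(line) {
  // match() with the g flag returns all matches, or null when there are none.
  const tokens = line.match(/(?:XPATTERN|YPATTERN)=\d+/g) || [];
  return tokens.join("|");
}

const line1 = "STR::P=FOOXPATTERN=5 AND MORETHINGS YPATTERN=9 BUT XPATTERN=3 AND YPATTERN=20";
const line2 = "MOR::P=BAR XPATTERN=1 STRSTR MORETHINGS YPATTERN=1BUT XPATTERN=10 AND YPATTERN=40";

console.log(keepPatterns(line1)); // XPATTERN=5|YPATTERN=9|XPATTERN=3|YPATTERN=20
console.log(keepPatterns(line2)); // XPATTERN=1|YPATTERN=1|XPATTERN=10|YPATTERN=40
```

Joining the matches avoids the leading pipe that the three-step replace has to clean up afterwards.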

How can I move a column with variable length in between one vertical bar "|" and "["?

My file has 4000k lines. I need to reformat it, so I am trying Notepad++ (or awk). The structure of every line is
acc|GENBANK|ABJ91977.1|GENBANK|DQ876324|pol protein Tabulator[Human immunodeficiency virus 1]TabulatorTLWQRPFVTIKVGGQLKEALLDTGADDTVLEEIELPGRWKPKMIGGIGGFIKVRQYDQIXVEICGHKAIGTVLVGPTPVNVIGRNLMTQIGCTLN
The characters between the 4th vertical bar | and the first [ have variable length. I am only looking for tips on where to focus so I can do it myself. I tried to print with awk, but because one part is variable in length, I obtained different results. I cannot select by columns either.
I would like to obtain a file with this structure
acc|GENBANK|ABJ91977.1|GENBANK|DQ876324,acc|GENBANK|ABJ91977.1|GENBANK|DQ876324,pol protein
and other file with this structure
acc|GENBANK|ABJ91977.1|GENBANK|DQ876324TabulatorTLWQRPFVTIKVGGQLKEALLDTGADDTVLEEIELPGRWKPKMIGGIGGFIKVRQYDQIXVEICGHKAIGTVLVGPTPVNVIGRNLMTQIGCTLN
Tabs are shown as the word Tabulator.
Here is a way to do it for the first file.
Ctrl+H
Find what: (^[^|]+(?:\|[^|]+){4})\|(.+?)\h+\[.+$
Replace with: $1,$1,$2
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
( # group 1
^ # beginning of line
[^|]+ # 1 or more non pipe
(?: # start non capture group
\| # a pipe
[^|]+ # 1 or more non pipe
){4} # end group, must appear 4 times
) # end group 1
\| # a pipe
(.+?) # group 2, 1 or more any character but newline, not greedy
\h+ # 1 or more horizontal spaces (space or tabulation)
\[ # 1 opening square bracket
.+ # 1 or more any character but newline
$ # end of line
Replacement:
$1 # content of group 1
, # a comma
$1 # content of group 1
, # a comma
$2 # content of group 2
Result for given example:
acc|GENBANK|ABJ91977.1|GENBANK|DQ876324,acc|GENBANK|ABJ91977.1|GENBANK|DQ876324,pol protein
For the second file:
Ctrl+H
Find what: (^[^|]+(?:\|[^|]+){4})\|.+?\h+\[.+?\](.+)$
Replace with: $1$2
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
( # group 1
^ # beginning of line
[^|]+ # 1 or more non pipe
(?: # start non capture group
\| # a pipe
[^|]+ # 1 or more non pipe
){4} # end group, must appear 4 times
) # end group 1
\| # a pipe
.+? # 1 or more any character but newline, not greedy
\h+ # 1 or more horizontal spaces (space or tabulation)
\[ # 1 opening square bracket
.+? # 1 or more any character but newline, not greedy
\] # a closing square bracket
(.+) # group 2, 1 or more any character but newline
$ # end of line
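For processing outside Notepad++, both replacements can be sketched in JavaScript. JavaScript regular expressions do not support \h, so [ \t] is used in its place here; the sample line uses a shortened sequence, and the variable names are illustrative.

```javascript
// Sample line with real tab characters where the question shows "Tabulator".
// The protein sequence is shortened for readability.
const line =
  "acc|GENBANK|ABJ91977.1|GENBANK|DQ876324|pol protein \t[Human immunodeficiency virus 1]\tTLWQRPFVTIKV";

// First file: five pipe-separated fields twice, then the description.
// [ \t] stands in for \h, which JavaScript does not support.
const firstFile = line.replace(/^([^|]+(?:\|[^|]+){4})\|(.+?)[ \t]+\[.+$/, "$1,$1,$2");

// Second file: the five fields followed by everything after the closing bracket.
const secondFile = line.replace(/^([^|]+(?:\|[^|]+){4})\|.+?[ \t]+\[.+?\](.+)$/, "$1$2");

console.log(firstFile);
console.log(secondFile);
```

For a file of this size the same two replace calls could be applied line by line while streaming, rather than loading everything into memory.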

How to capture recursive groups in a regex?

I am trying to capture a pattern which can appear multiple times in a regex in different groups. The pattern which can appear multiple times is :
(\b\\d{4}\\s*\\d{4}\\s*\\d{4}\\s*\\d{4}\b\\s*)
Please see the complete test here!
The expected output should be :
Full Match:
Group1:1111 1111 1111 1111
Group2:2222 2222 2222 2222
... GroupN...
How can this be achieved?
If I understand the problem correctly, we want to match a four-digits-and-space pattern repeated three times, followed by another four digits, and we can start with a simple expression such as:
(\d{4}\s)\1\1(\d{4}\s?)
Demo 1
Or, if we want to match a four-digit pattern four times and whitespace three times, we can start with this expression:
(\d{4})(\s+)\1\2\1\2\1
Demo 2
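One thing worth checking with these expressions: a backreference such as \1 matches the same text that the group captured, not the same sub-pattern. The sketch below, with a made-up second input, shows that the backreference version only matches when all four-digit blocks are identical, while a quantified group accepts different digits.

```javascript
// Backreference version from above: \1 must repeat the exact text of group 1.
const withBackrefs = /(\d{4}\s)\1\1(\d{4}\s?)/;

// Quantified version: repeats the sub-pattern, so the digits may differ.
const withRepeat = /\b\d{4}(?:\s*\d{4}){3}\b/;

const sameBlocks = "1111 1111 1111 1111";      // from the expected output in the question
const differentBlocks = "1234 5678 9012 3456"; // illustrative input, not from the question

console.log(withBackrefs.test(sameBlocks));      // true
console.log(withBackrefs.test(differentBlocks)); // false
console.log(withRepeat.test(differentBlocks));   // true
```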
Use:
(?:<select\b|\G).*?(\b\d{4}(?:\s*\d{4}){3}\b)(?=.*?</select>)
Demo
Explanation:
(?: # non capture group
<select\b # literally
| # OR
\G # restart from previous match position
) # end group
.*? # 0 or more any character, you may use [\s\S]*?
( # start group 1
\b # word boundary
\d{4} # 4 digits
(?: # non capture group
\s* # 0 or more spaces
\d{4} # 4 digits
){3} # end group, must appear 3 times
\b # word boundary
) # end group 1
(?= # lookahead, make sure we have after:
.*? # 0 or more any character
</select> # end tag
) # end lookahead
Sample code (php):
preg_match_all('~(?:<select\b|\G).*?(\b\d{4}(?:\s*\d{4}){3}\b)(?=.*?</select>)~', $html, $matches);
print_r($matches[1]);
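JavaScript has no \G anchor, so the pattern above cannot be ported directly. A rough equivalent, swapping in a two-step approach, is to isolate the select block first and then collect every 16-digit group inside it. The HTML below is a hypothetical input, since the question's real markup is not shown, and numbersInSelect is an illustrative name.

```javascript
// Hypothetical input; the number in the <p> must NOT be returned.
const html = `<p>9999 9999 9999 9999</p>
<select>
  <option>1111 1111 1111 1111</option>
  <option>2222 2222 2222 2222</option>
</select>`;

function numbersInSelect(text) {
  // Step 1: isolate the first <select>...</select> block ([\s\S] also spans newlines).
  const block = text.match(/<select\b[\s\S]*?<\/select>/);
  if (!block) return [];
  // Step 2: collect every group of four 4-digit blocks inside it.
  return Array.from(block[0].matchAll(/\b\d{4}(?:\s*\d{4}){3}\b/g), (m) => m[0]);
}

console.log(numbersInSelect(html)); // [ '1111 1111 1111 1111', '2222 2222 2222 2222' ]
```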

Excluding hyphens after first instance

I'm trying to develop a regex expression which pulls the first few characters before the first instance of a hyphen, and then saves the second group of elements after the first hyphen.
Here's the regex:
^([^-]*)(?(?=-)(\S.*)|())
And here are few test cases:
SSB x Dj Chad - Crazy Beat - Tarraxo
Dj [R]afaa [F]ox -Tarraxo Do Inicio Das Aulas ( Nova Escola Producões )
Dj Snakes Share - MaloncyBeatz - Perfecto
Tarraxo Das Brasileiras [2014] [TxiGa Pro]
The IF statement handles the last condition perfectly, but my issue is for the first few items, it returns the second group 'with' the hyphen instead of excluding it.
In other words:
Dj Snakes Share - MaloncyBeatz - Perfecto should return:
Group 1: Dj Snakes Share
Group 2: MaloncyBeatz - Perfecto
Instead, Group 2 is: - MaloncyBeatz - Perfecto
Update
https://regex101.com/r/2BQPNg/12
Using ^([^-]*)[^-]\W*(.*), it works, but it raises a problem for the last case (where there is no hyphen). It excludes the ].
My solution:
^([^-]+?)\s*(?:-\s*(.*))?$
^ // start of line
([^-]+?) // 1+ not '-' chars, lazily matched (first captured group)
\s* // 0+ white-space chars
(?: // grouped, not captured
- // dash
\s*(.*) // 0+ white-space chars then anything (second captured group)
)? // 0 or 1 time
$ // end of line
Flags: global, multi-line
Demo
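As a quick check, here is a small JavaScript sketch that applies this pattern line by line to the test cases from the question. Each title is matched on its own, so no flags are needed; splitTitle is a hypothetical helper name.

```javascript
// Pattern from the answer: lazy group 1 up to the first hyphen,
// optional group 2 with everything after it.
const pattern = /^([^-]+?)\s*(?:-\s*(.*))?$/;

function splitTitle(title) {
  const m = title.match(pattern);
  if (!m) return null;
  // m[2] is undefined when the title contains no hyphen.
  return { first: m[1], rest: m[2] === undefined ? "" : m[2] };
}

const titles = [
  "SSB x Dj Chad - Crazy Beat - Tarraxo",
  "Dj [R]afaa [F]ox -Tarraxo Do Inicio Das Aulas ( Nova Escola Producões )",
  "Dj Snakes Share - MaloncyBeatz - Perfecto",
  "Tarraxo Das Brasileiras [2014] [TxiGa Pro]",
];

for (const t of titles) {
  console.log(splitTitle(t));
}
```

For the third title this gives "Dj Snakes Share" and "MaloncyBeatz - Perfecto", and for the last title the whole string lands in the first part with an empty rest.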
501 steps reduced to 164 steps:
^[^-]+$|^((?:\w[^-]*)?\w)\W+(\w.*)
^ # start of line
[^-]+ # 1 or more not '-'
$ # end of line
| # OR
^ # start of line
( # start of group (captured)
(?: # start of group (not captured)
\w[^-]* # a word char then 0 or more not '-'
)? # 0 or 1 times
\w) # a word char, then end of group
\W+ # 1 or more non-word chars
(\w.*) # a word char then 0 or more anything (captured)
Demo
You are using this regex:
^([^-]*)[^-]\W*(.*)
Here, the extra [^-] consumes one character after the first group, so group 1 ends up one character shorter than intended.
You can use this regex:
^([^-]*)(?:\s+-\s*(.*))?$
RegEx Demo

Regex between a string

Example:
I have the following string
a125A##THISSTRING##.test123
I need to find THISSTRING. There are many strings which are nearly the same so I'd like to check if there is a digit or letter before the ## and also if there is a dot (.) after the ##.
I have tried something like:
([a-zA-Z0-9]+##?)(.+?)(.##)
But I am unable to get it working
You can use look behind and look ahead:
(?<=[a-zA-Z0-9]##).*?(?=##\.)
https://regex101.com/r/i3RzFJ/2
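A minimal JavaScript sketch of the lookaround version, assuming an engine with lookbehind support (recent Node.js and current browsers). The two failing inputs are made up to show what each assertion rejects, and extract is an illustrative name.

```javascript
// Lookbehind requires a letter or digit plus ## before the match;
// lookahead requires ## and a literal dot after it.
const pattern = /(?<=[a-zA-Z0-9]##).*?(?=##\.)/;

function extract(input) {
  const m = input.match(pattern);
  return m ? m[0] : null;
}

console.log(extract("a125A##THISSTRING##.test123")); // THISSTRING
console.log(extract("a125A##THISSTRING##test123"));  // null: no dot after the second ##
console.log(extract("-##THISSTRING##.test123"));     // null: no letter or digit before the first ##
```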
But I am unable to get it working.
Let's deconstruct what your regex ([a-zA-Z0-9]+##?)(.+?)(.##) says.
([a-zA-Z0-9]+##?) matches one or more [a-zA-Z0-9] characters followed by a # and an optional second #.
(.+?) matches one or more of any character, as few as possible (lazy).
(.##) matches any character followed by two #. Here the . consumes the final G, and then ## matches. Hence THISSTRING is not completely captured in the group.
Lookaround assertions are great but are a little expensive.
You can easily search for such patterns by matching wanted and unwanted and capturing wanted stuff in a capturing group.
Regex: (?:[a-zA-Z0-9]##)([^#]+)(?:##\.)
Explanation:
(?:[a-zA-Z0-9]##) Non-capturing group matching ## preceded by a letter or digit.
([^#]+) Capturing as many characters other than #. Stops before a # is met.
(?:##\.) Non-capturing group matching ##. literally.
Regex101 Demo
Javascript Example
var myString = "a125A##THISSTRING##.test123";
var myRegexp = /(?:[a-zA-Z0-9]##)([^#]+)(?:##\.)/g;
var match = myRegexp.exec(myString);
console.log(match[1]);
You wrote:
check if there is a digit or letter before the ##
I assume you mean a digit / letter before the first ## and
check for a dot after the second ## (as in your example).
You can use the following regex:
[a-z0-9]+ # Chars before "##", except the last
(?: # Last char before "##"
(\d) # either a digit - group 1
| # or
([a-z]) # a letter - group 2
)
\#\#? # 1 or 2 hash chars (escaped, because a bare # starts a comment in x mode)
([^#]+) # "Central" part - group 3
\#\#? # 1 or 2 hash chars
(?: # Check for a dot
(\.) # Captured - group 4
| # or nothing captured
)
[a-z0-9]+ # The last part
# Flags:
# i - case insensitive
# x - ignore blanks and comments
How it works:
Group 1 or 2 captures the last char before the first ##
(either group 1 captures a digit or group 2 captures a letter).
Group 3 catches the "central" part (THISSTRING,
a sequence of chars other than #).
Group 4 catches a dot, if any.
You can test it at https://regex101.com/r/ATjprp/1
Your regex contains an error: an unescaped dot matches any character.
If you want to check for a literal dot, you must escape it
with a backslash (compare with group 4 in my solution).