Excluding hyphens after first instance - regex

I'm trying to develop a regex that captures the text before the first hyphen as one group, and then captures everything after that first hyphen as a second group.
Here's the regex:
^([^-]*)(?(?=-)(\S.*)|())
And here are a few test cases:
SSB x Dj Chad - Crazy Beat - Tarraxo
Dj [R]afaa [F]ox -Tarraxo Do Inicio Das Aulas ( Nova Escola Producões )
Dj Snakes Share - MaloncyBeatz - Perfecto
Tarraxo Das Brasileiras [2014] [TxiGa Pro]
The conditional handles the last case (no hyphen) correctly, but for the first few items the second group is returned with the leading hyphen instead of excluding it.
In other words:
Dj Snakes Share - MaloncyBeatz - Perfecto should return:
Group 1: Dj Snakes Share
Group 2: MaloncyBeatz - Perfecto
Instead, Group 2 is: - MaloncyBeatz - Perfecto
Update
https://regex101.com/r/2BQPNg/12
Using ^([^-]*)[^-]\W*(.*), it works, but it raises a problem for the last case (where there is no hyphen). It excludes the ].

My solution:
^([^-]+?)\s*(?:-\s*(.*))?$
^ // start of line
([^-]+?) // 1+ not '-' chars, lazily matched (first captured group)
\s* // 0+ white-space chars
(?: // grouped, not captured
- // dash
\s*(.*) // 0+ white-space chars then anything (second captured group)
)? // 0 or 1 time
$ // end of line
Flags: global, multi-line
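As a quick check of the pattern above, here is a minimal Python sketch that runs it over test cases from the question. The helper name split_title is hypothetical, and Python's re module is used only for illustration; it reports None for a group that did not take part in the match.

```python
import re

# Pattern from the answer above: lazy group 1, then an optional
# "- rest" part whose text is captured as group 2.
PATTERN = re.compile(r"^([^-]+?)\s*(?:-\s*(.*))?$")

def split_title(line):
    """Return (text before first hyphen, text after it or None)."""
    m = PATTERN.match(line)
    return (m.group(1), m.group(2)) if m else (None, None)

titles = [
    "SSB x Dj Chad - Crazy Beat - Tarraxo",
    "Dj Snakes Share - MaloncyBeatz - Perfecto",
    "Tarraxo Das Brasileiras [2014] [TxiGa Pro]",
]
for title in titles:
    print(split_title(title))
```

For a single string per call, no global or multi-line flag is needed; those flags matter only when the whole list is matched as one multi-line text.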
501 steps reduced to 164 steps:
^[^-]+$|^((?:\w[^-]*)?\w)\W+(\w.*)
^ # start of line
[^-]+ # 1 or more not '-'
$ # end of line
| # OR
^ # start of line
( # start of group (captured)
(?: # start of group (not captured)
\w[^-]* # a word char then 0 or more not '-'
)? # 0 or 1 times
\w) # a word char, then end of group
\W+ # 1 or more non-word chars
(\w.*) # a word char then 0 or more anything (captured)
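One thing to keep in mind with this alternation: when the hyphen-free branch matches, neither capture group is set. A small Python sketch of how a caller might handle that (the function name parts is hypothetical, and falling back to the whole match is my assumption about the desired behaviour):

```python
import re

# Alternation pattern from the answer above. The first branch matches
# hyphen-free lines and leaves both capture groups unset (None in Python).
PATTERN = re.compile(r"^[^-]+$|^((?:\w[^-]*)?\w)\W+(\w.*)")

def parts(line):
    m = PATTERN.match(line)
    if m is None:
        return None
    if m.group(1) is None:
        # Hyphen-free branch: use the whole match as the first part.
        return (m.group(0), None)
    return (m.group(1), m.group(2))

print(parts("SSB x Dj Chad - Crazy Beat - Tarraxo"))
print(parts("Tarraxo Das Brasileiras [2014] [TxiGa Pro]"))
```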

You are using this regex:
^([^-]*)[^-]\W*(.*)
Here, the extra [^-] in your regex consumes one character, so the first group captures one character fewer than intended.
You can use this regex:
^([^-]*)(?:\s+-\s*(.*))?$
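To make the difference concrete, the following Python sketch compares the pattern from the question's update with the suggested one on the hyphen-free test case. The variable names are hypothetical and Python is used only as a convenient test harness.

```python
import re

# Pattern from the question's update: the stray [^-] eats one character.
ORIGINAL = re.compile(r"^([^-]*)[^-]\W*(.*)")
# Pattern suggested above: the hyphen part is optional as a whole.
CORRECTED = re.compile(r"^([^-]*)(?:\s+-\s*(.*))?$")

line = "Tarraxo Das Brasileiras [2014] [TxiGa Pro]"
original_g1 = ORIGINAL.match(line).group(1)
corrected_g1 = CORRECTED.match(line).group(1)

print(original_g1)   # the closing ] is missing from group 1
print(corrected_g1)  # the full line

hyphen_groups = CORRECTED.match("Dj Snakes Share - MaloncyBeatz - Perfecto").groups()
print(hyphen_groups)
```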

Related

How to regex string that can end with a number and group each part

I have the following test strings:
Battery Bank 1
Dummy 32 Segment 12
System
Modbus 192.168.0.1 Group
I need a regex that can match and group these as follows:
Group 1: Battery Bank
Group 2: 1
Group 1: Dummy 32 Segment
Group 2: 12
Group 1: System
Group 2: null
Group 1: Modbus 192.168.0.1 Group
Group 2: null
Basically, capture everything (including numbers) into group 1 unless the string ends with a whitespace followed by 1 or more digits. If it does, capture this number into group 2.
This regex is not doing what I need as everything is captured into the first group.
([\w ]+)( \d+)?
https://regex101.com/r/GEtb5G/1/
Basically, capture everything (including numbers) into group 1 unless the string ends with a whitespace followed by 1 or more digits. If it does, capture this number into group 2.
You may use this group that allows an empty match in 2nd capture group:
^(.+?) *(\d+|)$
RegEx Details:
^: Start
(.+?): Match 1+ of any character (lazy) in capture group #1
' *': Match 0 or more spaces
(\d+|): Match 1+ digits or nothing in 2nd capture group
$: End
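A short Python sketch of this pattern against the four test strings. Note that with the empty alternative in group 2, a string without a trailing number yields an empty string rather than null (None in Python); treating the empty string as "no number" is left to the caller.

```python
import re

# Pattern from the answer above: lazy group 1, optional spaces,
# then either digits or nothing in group 2.
PATTERN = re.compile(r"^(.+?) *(\d+|)$")

samples = [
    "Battery Bank 1",
    "Dummy 32 Segment 12",
    "System",
    "Modbus 192.168.0.1 Group",
]
results = {s: PATTERN.match(s).groups() for s in samples}
for text, groups in results.items():
    print(text, "->", groups)
```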
You can use
^\s*(.*[^\d\s])(?:\s*(\d+))?\s*$
See the regex demo (note \s are replaced with spaces since the test string in the demo is a single multiline string).
If the regex is to be used with a multiline flag to match lines in a longer multiline text, you can use
^[^\S\r\n]*(.*[^\d\s])(?:[^\S\r\n]*(\d+))?[^\S\r\n]*$
See the regex demo.
Details:
^ - start of a string
\s* - zero or more whitespaces
(.*[^\d\s]) - Group 1: any zero or more chars other than line break chars as many as possible and then a char other than a digit and whitespace
(?:\s*(\d+))? - an optional sequence of
\s* - zero or more whitespaces
(\d+) - Group 2: one or more digits
\s* - zero or more whitespaces
$ - end of string.
In the second regex, [^\S\r\n]* matches any zero or more whitespaces other than LF and CR chars.
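The multiline variant can be tried in Python as follows. This is only a sketch: the sample text is the four test strings joined with newlines, and re.MULTILINE stands in for the multiline flag mentioned above. Here an absent number shows up as None in group 2.

```python
import re

# Multiline variant from the answer above; [^\S\r\n] is
# "whitespace except CR and LF", so a match never crosses a line break.
LINE_RE = re.compile(
    r"^[^\S\r\n]*(.*[^\d\s])(?:[^\S\r\n]*(\d+))?[^\S\r\n]*$",
    re.MULTILINE,
)

text = "Battery Bank 1\nDummy 32 Segment 12\nSystem\nModbus 192.168.0.1 Group"
pairs = [m.groups() for m in LINE_RE.finditer(text)]
for pair in pairs:
    print(pair)
```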

REGEX to Capture only address blocks and exclude text separated by blank lines

I have a large document of address blocks that contain a myriad of different address formats.
The document has sections of paragraphs, pictures, and random text and throughout these different sections are large groups of address blocks. The address blocks will always have a blank line before and after the address and they will always end with a ZIP (+4 is optional).
Unfortunately, the Addresses vary so much that I can't come up with a way to capture the specific components (sometimes there is only recipient and others there is recipient and ATTN line. sometimes there is a secondary unit address, etc..).
I did come up with a regex pattern to match the address blocks within the document; however, it is not completely accurate. I would like to capture only the address blocks but my pattern is also capturing random lines of text in between the address blocks.
My pattern is:
[regex]$pattern = "(?xm)\n(
^[\w\d\-\.\s]+(\d{5})(?:\-\d{4})?
)";
An example of what it is capturing is:
DUSHANBE PLACEISTAN
DASHB FARMINTON
PSC 123 BOX 1
APO AP 12345
DETACHMENT ATTACHMENT
SECURITY GUARD OFFICE
AMERICAN EMB E01
UNIT 1712
APO AE 54321-7798
TASHKENT UZBEKISTAN
TONE TENTKASH DOS
75485 TORSHEN PL
WASHINGTON DC 12345-1234
In the example above it should not be capturing DUSHANBE PLACEISTAN or TASHKENT UZBEKISTAN (only the blocks of addresses).
Any insight into how to properly parse the text would be greatly appreciated.
I believe you could use the regular expression
(?:^\w+(?: +\w+)* *\r?\n)+\w+(?: +\w+)* +\d{5}(?:\-\d{4})? *$
The regex engine performs the following operations (I've escaped the space character to make them more visible).
(?: # begin non-cap grp
^ # match beginning of line
\w+ # match 1+ word chars
(?:\ +\w+) # match 1+ spaces, 1+ word chars in non-cap grp
* # execute non-cap grp 0+ times
\ *\r?\n # match 0+ spaces, return char(s)
) # end non-cap grp
+ # execute non-cap grp 1+ times
\w+ # match 1+ word chars
(?:\ +\w+) # match 1+ spaces, 1+ word chars in non-cap grp
* # execute non-cap grp 0+ times
\ + # match 1+ spaces
\d{5} # match 5 digits
(?:\-\d{4}) # match '-' then 4 digits in non-cap grp
? # optionally match non-cap grp
\ * # match 0+ spaces
$ # match end of line
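The walkthrough above can be exercised with a small Python sketch. The sample text is the example from the question with blank lines re-inserted between blocks, which is my assumption based on the question's statement that every address block has a blank line before and after it.

```python
import re

# Regex from the answer above, compiled with MULTILINE so that
# ^ and $ match at line boundaries.
ADDRESS_RE = re.compile(
    r"(?:^\w+(?: +\w+)* *\r?\n)+\w+(?: +\w+)* +\d{5}(?:\-\d{4})? *$",
    re.MULTILINE,
)

sample = "\n".join([
    "DUSHANBE PLACEISTAN",
    "",
    "DASHB FARMINTON",
    "PSC 123 BOX 1",
    "APO AP 12345",
    "",
    "DETACHMENT ATTACHMENT",
    "SECURITY GUARD OFFICE",
    "AMERICAN EMB E01",
    "UNIT 1712",
    "APO AE 54321-7798",
    "",
    "TASHKENT UZBEKISTAN",
    "",
    "TONE TENTKASH DOS",
    "75485 TORSHEN PL",
    "WASHINGTON DC 12345-1234",
])

blocks = [m.group(0) for m in ADDRESS_RE.finditer(sample)]
for block in blocks:
    print(block)
    print("---")
```

Because every line must consist of word characters separated by spaces, address lines containing punctuation such as commas or periods would not match without widening the character classes.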
If there has to be a newline in front of the first line, you might also use a lookbehind assertion for a newline and match as few lines as possible until you can match the zip code format.
(?<=\r?\n)(?:\S.*\r?\n)+?.*\d{5}(?:\-\d{4})?$
Explanation
(?<=\r?\n) Positive lookbehind, assert what is on the left is a newline
(?: Non capture group
\S.*\r?\n Match the whole line, starting with a non whitespace char preventing empty lines
)+? Close group and repeat 1+ times non greedy
.*\d{5}(?:\-\d{4})? Match the whole line ending on the zip code pattern
$ End of line
An alternative pattern could be matching all lines that don't end with the zip code pattern until you encounter the line that does.
(?<=\r?\n)(?:(?!.*\d{5}(?:\-\d{4})?$)\S.*\r?\n)+.*\d{5}(?:\-\d{4})?$

Regex allow only one dash or only one space

I want an expression that allows number and one dash OR number and one space. Space or dash are optional.
I tried this
/^([0-9]+(-[0-9]+)?)|([0-9]+(\s[0-9]+)?)$/
Accepted inputs:
11-222
444 99
You can put the OR in the middle of your expression: ^([0-9]+)(\s|-)([0-9]+)$ works with your examples in Notepad++.
Let's explain your regex.
^ # beginning of line
( # start group 1
[0-9]+ # 1 or more digits
( # start group 2
- # a hyphen
[0-9]+ # 1 or more digits
)? # end group 2, optional
) # end group 1
| # OR
( # start group 3
[0-9]+ # 1 or more digits
( # start group 4
\s # a space
[0-9]+ # 1 or more digits
)? # end group 4, optional
) # end group 3
$ # end of line
The OR acts between the group 1 at the beginning of the line and the group 3 at the end of the line. But you want group 1 and group 3 anchored at the beginning and at the end.
Add a group over group 1 and 3:
^(([0-9]+(-[0-9]+)?)|([0-9]+(\s[0-9]+)?))$
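The effect of the missing outer group is easy to see with an input that should be rejected. A minimal Python sketch (the names LOOSE and GROUPED are hypothetical):

```python
import re

# Original pattern: ^ binds only to the first branch, $ only to the second.
LOOSE = re.compile(r"^([0-9]+(-[0-9]+)?)|([0-9]+(\s[0-9]+)?)$")
# Both branches wrapped in one group, so ^ and $ apply to each of them.
GROUPED = re.compile(r"^(([0-9]+(-[0-9]+)?)|([0-9]+(\s[0-9]+)?))$")

bad_inputs = ["11-222-333", "abc 444 99"]
loose_hits = [LOOSE.search(s) is not None for s in bad_inputs]
grouped_hits = [GROUPED.search(s) is not None for s in bad_inputs]

print(loose_hits)    # both invalid inputs slip through
print(grouped_hits)  # both are rejected
```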
You can use non-capturing groups (more efficient) instead of capturing groups:
^(?:(?:[0-9]+(?:-[0-9]+)?)|(?:[0-9]+(?:\s[0-9]+)?))$
Combine the hyphen and the space in a character class and remove the superfluous groups:
^[0-9]+(?:[-\s][0-9]+)?$
If your regex flavour supports it, change the [0-9] into \d. Finally your regex becomes:
^\d+(?:[-\s]\d+)?$
Much simpler, no?
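For completeness, a small Python sketch that runs the final pattern over a few inputs. The extra test strings beyond the two from the question are my own. Note that \s also accepts tabs and other whitespace; if only a literal space should be allowed, the class could be written as [- ] instead.

```python
import re

# Final pattern from the answer above.
FINAL = re.compile(r"^\d+(?:[-\s]\d+)?$")

cases = ["11-222", "444 99", "12345", "11-222-333", "11 - 22", "-5"]
accepted = [c for c in cases if FINAL.match(c)]
print(accepted)
```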

Using regex on a file to pull data out. Having issues with multi-line

I am looking to get to the next line of data within a text file. Here is an example of data from the file I am working with.
0519 ABF 244 AN A1 ADV STUFF 1.0 2.0 Somestuff 018 0155 MTWTh 10:30A 11:30A 20 20 0 6.7
Somestuff 011 0145 MTWTh 12:30P 1:30P
I have been trying to move to the next line in several ways: matching the newline with \n, using \s+ to span the large gap after 6.7, and adding the m modifier (//m). None of these has produced a result yet.
Here is some example code
while !regex_file.eof?
  line = regex_file.gets.chomp
  # A regex literal needs the / delimiters
  if line =~ /^.*?\d{4}\s+[A-Z]+\s+\d{3}.+$/
    puts line
  end
end
Using https://rubular.com/ this particular set of code matches my desired output for the first line
0519 ABF 244 AN A1 ADV STUFF 1.0 2.0 Somestuff 018 0155 MTWTh 10:30A 11:30A 20 20 0 6.7
but it does not match the next line, and I haven't figured out how to match it:
Somestuff 011 0145 MTWTh 12:30P 1:30P
Try something like this: the \n matches the newline, and you can add your own rules after it to capture whatever follows on the next line:
^.*\d{4}\s+[A-Z]+\s+\d{3}.+\n.*$
I've made an arbitrary assumption about the requirements for matching the second line. It is more demanding than the requirements for matching the first that are reflected in your regex, but I thought the additional complexity would have some educational value for you.
Here is a regular expression (untested) for matching both lines. Note you don't need ^.*? at the beginning of the regex and for the part of the regex that matches the first line .+$ adds nothing, so I removed it. After all you are just matching each line separately (line), and will display the entire line if there's a match. As well, the end-of-string anchor \z is more appropriate than the end-of-line anchor ($), though either can be used.
r = /
(?: # begin non-capture group
\d{4} # match 4 digits
\s+ # match > 0 whitespaces
[A-Z]+ # match > 0 uppercase letters
\s+ # match > 0 whitespaces
\d{3} # match 3 digits
| # or
\b # match a (zero-width) word break
[A-Z] # match 1 uppercase letter
[a-z]* # match >= 0 lowercase letter
\s+ # match > 0 whitespaces
\d{3} # match 3 digits
\s+ # match > 0 whitespaces
\d{4} # match 4 digits
\s+ # match > 0 whitespaces
[A-Za-z]+ # match > 0 letters
(?: # begin non-capture group
\s+ # match > 0 whitespaces
(?: # begin a non-capture group
0?\d # match an optional 0 followed by any digit
| # or
1[012] # match 1 followed by 0, 1 or 2
) # end non-capture group
: # match a colon
[0-5][0-9] # match 0-5 followed by 0-9
[AP] # match the A or P suffix
){2} # end non-capture group and execute twice
) # end non-capture group
/x # free-spacing regex definition mode
This regular expression is conventionally written as follows.
r = /(?:\d{4}\s+[A-Z]+\s+\d{3}|\b[A-Z][a-z]*\s+\d{3}\s+\d{4}\s+[A-Za-z]+(?:\s+(?:0?\d|1[012]):[0-5][0-9][AP]){2})/
You might go through the file putsing matching lines as follows:
File.foreach(fname) { |line| puts line if line.match? r }
See IO::foreach, which is a very convenient method for reading files line-by-line. Note IO class methods (such as foreach) are commonly invoked with File as their receiver. That's OK, as File.superclass #=> IO, so File inherits those methods from IO.
When used without a block foreach returns an enumerator, which is often convenient as well. If, for example, you wished to return an array of matching lines (rather than puts them), you could write:
File.foreach(fname).with_object([]) do |line, arr|
arr << line.chomp if line.match? r
end
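The same two-branch idea can be sketched in Python. This is an adaptation rather than a translation: the time part is written to accept single-digit hours and a trailing A or P, which is my assumption based on the sample values 12:30P and 1:30P, and the function name matching_lines is hypothetical. A list of strings stands in for the file.

```python
import re

LINE_RE = re.compile(r"""
    (?:
        \d{4}\s+[A-Z]+\s+\d{3}                      # first-line shape: 4 digits, capitals, 3 digits
      |
        \b[A-Z][a-z]*\s+\d{3}\s+\d{4}\s+[A-Za-z]+   # continuation shape: word, 3 digits, 4 digits, days
        (?:\s+(?:1[012]|0?\d):[0-5]\d[AP]){2}       # two clock times with an A or P suffix
    )
""", re.VERBOSE)

def matching_lines(lines):
    """Return the lines (without trailing newline) that fit either shape."""
    return [ln.rstrip("\n") for ln in lines if LINE_RE.search(ln)]

lines = [
    "0519 ABF 244 AN A1 ADV STUFF 1.0 2.0 Somestuff 018 0155 MTWTh 10:30A 11:30A 20 20 0 6.7\n",
    "Somestuff 011 0145 MTWTh 12:30P 1:30P\n",
    "Page header\n",
]
kept = matching_lines(lines)
for line in kept:
    print(line)
```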
Your current regex:
^.*?\d{4}\s+[A-Z]+\s+\d{3}.+$
matches in this order:
the beginning of the line (^)
zero or more characters non-greedy .*?
four digits (\d{4})
one or more spaces (\s+)
one or more capital letters ([A-Z]+)
one or more spaces
three digits (\d{3})
one or more characters (.+)
the end of the line ($)
The second line of your file is:
Somestuff 011 0145 MTWTh 12:30P 1:30P
The regex starts matching at 0145 MTWT but then fails: the lowercase h after MTWT cannot match \s+, so \d{3} is never reached.
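This behaviour can be confirmed with a short Python check of the question's regex against both sample lines (Python is used here only as a test harness; the pattern itself is unchanged):

```python
import re

# The regex from the question, unchanged.
ORIGINAL = re.compile(r"^.*?\d{4}\s+[A-Z]+\s+\d{3}.+$")

first = "0519 ABF 244 AN A1 ADV STUFF 1.0 2.0 Somestuff 018 0155 MTWTh 10:30A 11:30A 20 20 0 6.7"
second = "Somestuff 011 0145 MTWTh 12:30P 1:30P"

first_ok = ORIGINAL.search(first) is not None
second_ok = ORIGINAL.search(second) is not None
print(first_ok, second_ok)
```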

Regex between a string

Example:
I have the following string
a125A##THISSTRING##.test123
I need to find THISSTRING. There are many strings which are nearly the same so I'd like to check if there is a digit or letter before the ## and also if there is a dot (.) after the ##.
I have tried something like:
([a-zA-Z0-9]+##?)(.+?)(.##)
But I am unable to get it working
You can use look behind and look ahead:
(?<=[a-zA-Z0-9]##).*?(?=##\.)
https://regex101.com/r/i3RzFJ/2
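A minimal Python sketch of the lookaround pattern. The lookbehind has a fixed width of three characters, so Python's re module accepts it; the second input is a made-up counterexample without a dot after the closing ##.

```python
import re

# Lookbehind: a letter or digit followed by ##; lookahead: ## followed by a dot.
PATTERN = re.compile(r"(?<=[a-zA-Z0-9]##).*?(?=##\.)")

def inner(text):
    """Return the text between the ## markers, or None if the shape does not fit."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

print(inner("a125A##THISSTRING##.test123"))
print(inner("a125A##OTHER##-x"))
```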
But I am unable to get it working.
Let's deconstruct what your regex ([a-zA-Z0-9]+##?)(.+?)(.##) says.
([a-zA-Z0-9]+##?) match as many [a-zA-Z0-9] followed by a # followed by optional #.
(.+?) matches one or more of any character, as few as possible (lazy).
(.##) matches any character followed by two #. Here the . consumes the final G and is then followed by ##, so THISSTRING is not completely captured in the second group.
Lookaround assertions are great but can be a little expensive.
You can easily search for such patterns by matching wanted and unwanted and capturing wanted stuff in a capturing group.
Regex: (?:[a-zA-Z0-9]##)([^#]+)(?:##\.)
Explanation:
(?:[a-zA-Z0-9]##) Non-capturing group matching ## preceded by a letter or digit.
([^#]+) Capturing as many characters other than #. Stops before a # is met.
(?:##\.) Non-capturing group matching ##. literally.
Javascript Example
var myString = "a125A##THISSTRING##.test123";
var myRegexp = /(?:[a-zA-Z0-9]##)([^#]+)(?:##\.)/g;
var match = myRegexp.exec(myString);
console.log(match[1]);
You wrote:
check if there is a digit or letter before the ##
I assume you mean a digit / letter before the first ## and
check for a dot after the second ## (as in your example).
You can use the following regex:
[a-z0-9]+ # Chars before "##", except the last
(?: # Last char before "##"
(\d) # either a digit - group 1
| # or
([a-z]) # a letter - group 2
)
\#\#? # 1 or 2 hash chars (escaped, since a bare # starts a comment in x mode)
([^#]+) # "Central" part - group 3
\#\#? # 1 or 2 hash chars
(?: # Check for a dot
(\.) # Captured - group 4
| # or nothing captured
)
[a-z0-9]+ # The last part
# Flags:
# i - case insensitive
# x - ignore blanks and comments
How it works:
Group 1 or 2 captures the last char before the first ##
(either group 1 captures a digit or group 2 captures a letter).
Group 3 catches the "central" part (THISSTRING,
a sequence of chars other than #).
Group 4 catches a dot, if any.
You can test it at https://regex101.com/r/ATjprp/1
Your regex has an error: an unescaped dot matches any character.
If you want to check for a literal dot, you must escape it
with a backslash (compare with group 4 in my solution).
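The free-spacing regex above can be tried in Python with re.VERBOSE and re.IGNORECASE standing in for the x and i flags. In Python's verbose mode an unescaped # outside a character class starts a comment, so the hash characters are written as \# in this sketch.

```python
import re

PATTERN = re.compile(r"""
    [a-z0-9]+            # chars before the first ##, except the last
    (?: (\d) | ([a-z]) ) # last char: a digit (group 1) or a letter (group 2)
    \#\#?                # 1 or 2 hash chars
    ([^#]+)              # central part (group 3)
    \#\#?                # 1 or 2 hash chars
    (?: (\.) | )         # a dot (group 4), or nothing
    [a-z0-9]+            # the last part
""", re.VERBOSE | re.IGNORECASE)

m = PATTERN.search("a125A##THISSTRING##.test123")
groups = m.groups() if m else None
print(groups)
```

Group 1 stays unset here because the character before the first ## is a letter, so only group 2 is filled.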