Music Chord part splitting Regex - regex

This is a follow-up question to this one: Regex for matching a music Chord, asked by me.
Now that I have a regex to know whether a String representation of a chord is valid or not (previous question), how can I effectively get the three different parts of the chord (the root note, accidentals and chord type) into seperate variables?
I could do simple string manipulation, but I guess that it would be easier to build on the previous code and use regex for that, or am I am wrong?
Here is the updated code from the aforementioned question:
public static void regex(String chord) {
String notes = "^[CDEFGAB]";
String accidentals = "(#|##|b|bb)?";
String chords = "(maj7|maj|min7|min|sus2)";
String regex = notes + accidentals + chords;
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(chord);
System.out.println("regex is " + regex);
if (matcher.find()) {
int i = matcher.start();
int j = matcher.end();
System.out.println("i:" + i + " j:" + j);
}
else {
System.out.println("no match!");
}
}
Thanks.

Enclosing something with parentheses (except in cases with special meaning) creates a capturing group, or subpattern.
You already have accidentals and chords grouped as subpatterns like that, but you need to add parentheses to notes to capture that as a subpattern too.
String notes = "^([CDEFGAB])";
String accidentals = "(#|##|b|bb)?";
String chords = "(maj7|maj|min7|min|sus2)";
By convention, the string that is matched by the entire pattern is group 0, then every subpattern is captured as group 1, group 2, and so on.
I'm not a Java guy, but after reading the docs it looks like you would access your subpattern matches using .group():
String note = matcher.group(1);
String acci = matcher.group(2);
String chor = matcher.group(3);
Edit:
Originally, I suggested String accidentals = "((?:#|##|b|bb)?)";, because I was worried that the second subpattern being optional would have caused a group numbering problem if no match existed for it. However, a little testing suggests that even without wrapping it in a non-capturing grouping (?: ) like that, group 2 is always present but empty if there was no match. (Empty string in group 2 was the desired effect anyway.) So, it seems that
... = "(#|##|b|bb)?"; probably would suffice after all.

You've already done the work. Just add one more capture group so that your final regex becomes:
^([CDEFGAB])(#|##|b|bb)?(maj7|maj|min7|min|sus2)?$
And your note, accidental, and chord will be in the first, second, and third captures, respectively.

I like the accepted answer, but as a guitar player I do encounter chords with an extra bass note added, such as G/D, or A/D, or D/F#. Of course, there are a number of other chord names you might encounter such as : 5 , 6, min6, 9, min9, sus4 ... etc. You might consider adding to the number of possible string chords, and then adding something for the bass accidentals if you have any:
String chords =
"(maj|maj7|maj9|maj11|maj13|maj9#11|maj13#11|6|add9|maj7b5|maj7#5||min|m7|m9|m11|m13|
m6|madd9|m6add9|mmaj7|mmaj9|m7b5|m7#5|7|9|11|13|7sus4|7b5|7#5|7b9|7#9|7b5b9|7b5#9|
7#5b9|9#5|13#11|13b9|11b9|aug|dim|dim7|sus4|sus2|sus2sus4|-5|)";
String bass = "/([CDEFGAB])";
In order to complete the "String chords" definition, you might want to consult a chord dictionary. CHEERS!

Related

Match same number of repetitions as previous group

I'm trying to match strings that are repeated the same number of times, like
abc123
abcabc123123
abcabcabc123123123
etc.
That is, I want the second group (123) to be matched the same number of times as the first group (abc). Something like
(abc)+(123){COUNT THE PREVIOUS GROUP MATCHED}
This is using the Rust regex crate https://docs.rs/regex/1.4.2/regex/
Edit As I feared, and pointed out by answers and comments, this is not possible to represent in regex, at least not without some sort of recursion which the Rust regex crate doesn't for the time being support. In this case, as I know the input length is limited, I just generated a rule like
(abc123)|(abcabc123123)|(abcabcabc123123123)
Horribly ugly, but got the job done, as this wasn't "serious" code, just a fun exercise.
As others have commented, I don't think it's possible to accomplish this in a single regex. If you can't guarantee the strings are well-formed then you'd have to validate them with the regex, capture each group, and then compare the group lengths to verify they are of equal repetitions. However, if it's guaranteed all strings will be well-formed then you don't even need to use regex to implement this check:
fn matching_reps(string: &str, group1: &str, group2: &str) -> bool {
let group2_start = string.find(group2).unwrap();
let group1_reps = (string.len() - group2_start) / group1.len();
let group2_reps = group2_start / group2.len();
group1_reps == group2_reps
}
fn main() {
assert_eq!(matching_reps("abc123", "abc", "123"), true);
assert_eq!(matching_reps("abcabc123", "abc", "123"), false);
assert_eq!(matching_reps("abcabc123123", "abc", "123"), true);
assert_eq!(matching_reps("abcabc123123123", "abc", "123"), false);
}
playground
Pure regular expressions are not able to represent that.
There may be some way to define back references, but I am not familiar with regexp syntax in Rust, and this would technically be a way to represent something more than a pure regular expression.
There is however a simple way to compute it :
use a regexp to make sure your string is a ^((abc)*)((123)*)$
if your string matches, take the two captured substrings, and compare their lengths
Building a pattern dynamically is also an option. Matching one, two or three nested abc and 123 is possible with
abc(?:abc(?:abc(?:)?123)?123)?123
See proof. (?:)? is redundant, it matches no text, (?:...)? matches an optional pattern.
Rust snippet:
let a = "abc"; // Prefix
let b = "123"; // Suffix
let level = 3; // Recursion (repetition) level
let mut result = "".to_string();
for _n in 0..level {
result = format!("{}(?:{})?{}", a, result, b);
}
println!("{}", result);
// abc(?:abc(?:abc(?:)?123)?123)?123
There's an extension to the regexp libraries, that is implemented from the old times unix and that allows to match (literally) an already scanned group literally after the group has been matched.
For example... let's say you have a number, and that number must be equal to another (e.g. the score of a soccer game, and you are interested only in draws between the two teams) You can use the following regexp:
([0-9][0-9]*) - \1
and suppose we feed it with "123-123" (it will match) but if we use "123-12" that will not match, as the \1 is not the same string as what was matched in the first group. When the first group is matched, the actual regular expression converts the \1 into the literal sequence of characters that was matched in the first group.
But there's a problem with your sample... is that there's no way to end the first group if you try:
([0-9][0-9]*)\1
to match 123123, because the automaton cannot close the first group (you need at least a nondigit character to make the first group to finalize)
But for example, this means that you can use:
\+(\([0-9][0-9]*\))\1(-\1)*
and this will match phone numbers in the form
+(358)358-358-358
or
+(1)1-1-1-1-1-1-1
(the number in between the parenthesys is catched as a sample, and then you use the group to build a sequence of that number separated by dashes. You can se the expression working in this demo.)

How to handle redundant cases in regex?

I have to parse a file data into good and bad records the data should be of format
Patient_id::Patient_name (year of birth)::disease
The diseases are pipe separated and are selected from the following:
1.HIV
2.Cancer
3.Flu
4.Arthritis
5.OCD
Example: 23::Alex.jr (1969)::HIV|Cancer|flu
The regex expression I have written is
\d*::[a-zA-Z]+[^\(]*\(\d{4}\)::(HIV|Cancer|flu|Arthritis|OCD)
(\|(HIV|Cancer|flu|Arthritis|OCD))*
But it's also considering the records with redundant entries
24::Robin (1980)::HIV|Cancer|Cancer|HIV
How to handle these kind of records and how to write a better expression if the list of diseases is very large.
Note: I am using hadoop maponly job for parsing so give answer in context with java.
What you might do is capture the last part with al the diseases in one group (named capturing group disease) and then use split to get the individual ones and then make the list unique.
^\d*::[a-zA-Z]+[^\(]*\(\d{4}\)::(?<disease>(?:HIV|Cancer|flu|Arthritis|OCD)(?:\|(?:HIV|Cancer|flu|Arthritis|OCD))*)$
For example:
String regex = "^\\d*::[a-zA-Z]+[^\\(]*\\(\\d{4}\\)::(?<disease>(?:HIV|Cancer|flu|Arthritis|OCD)(?:\\|(?:HIV|Cancer|flu|Arthritis|OCD))*)$";
String string = "24::Robin (1980)::HIV|Cancer|Cancer|HIV";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
String[] parts = matcher.group("disease").split("\\|");
Set<String> uniqueDiseases = new HashSet<String>(Arrays.asList(parts));
System.out.println(uniqueDiseases);
}
Result:
[HIV, Cancer]
Regex demo | Java demo
You need the negative lookahead.
Try using this regex: ^\d*::[^(]+?\s*\(\d{4}\)::(?!.*(HIV|Cancer|flu|Arthritis|OCD).*\|\1)((HIV|Cancer|flu|Arthritis|OCD)(\||$))+$.
Explanation:
The initial string ^\d*::[^(]+?\s*\(\d{4}\):: is just an optimized one to match Alex.jr example (your version did not respect any non-alphabetic symbols in names)
The negative lookahead block (?!.*(HIV|Cancer|flu|Arthritis|OCD).*\|\1) stands for "look forth for any disease name, encountered twice, and reject the string, if found any. Its distinctive feature is the (?! ... ) signature.
Finally, ((HIV|Cancer|flu|Arthritis|OCD)(\||$))+$ is also an optimized version of your block (HIV|Cancer|flu|Arthritis|OCD)(\|(HIV|Cancer|flu|Arthritis|OCD))*, oriented to avoid redundant listing.
Probably the easier to maintain method is that you use a bit changed regex,
like below:
^\d*::[a-zA-Z.]+\s\(\d{4}\)::((?:HIV|Cancer|flu|Arthritis|OCD|\|(?!\|))+)$
It contains:
^ and $ anchors (you want that the entire string is matched,
not its part).
A capturing group, including a repeated non-capturing group (a container
for alternatives). One of these alternatives is |, but with a negative
lookahead for immediately following | (this way you disallow 2 or
more consecutive |).
Then, if this regex matched for a particular row, you should:
Split group No 1 by |.
Check resulting string array for uniqueness (it should not contain
repeating entries).
Only if this check succeeds, you should accept the row in question.

Regular Expressions: about Greediness, Laziness and Substrings

I have the following string:
123322
In theory, the regex 1.*2 should match the following:
12 (because * can be zero characters)
12332
123322
If I use the regex 1.*2 it matches 123322.
Using 1.*?2, it will match 12.
Is there a way to match 12332 too?
The perfect thing would be to get all possible matchess in the string (no matter if one match is substring of another)
No, unless there is something else added to the regex to clarify what it should do it will either be greedy or non-greedy. There is no in-betweeny ;)
1(.*?2)*$
you will have multiple captures which you can concatenate to form all possible matches
see here:regex tester
click on 'table' and expand the captures tree
You would need a separate expression for each case, depending on the number of twos you want to match:
1(.*?2){1} #same as 1.*?2
1(.*?2){2}
1(.*?2){3}
...
Generally, this isn't possible. A regex matching engine isn't really designed to find overlapping matches. A quick solution is simply to check the pattern on all substrings manually:
string text = "1123322";
for (int start = 0; start < text.Length - 1; start++)
{
for (int length = 0; length <= text.Length - start; length++)
{
string subString = text.Substring(start, length);
if (Regex.IsMatch(subString, "^1.*2$"))
Console.WriteLine("{0}-{1}: {2}", start, start + length, subString);
}
}
Working example: http://ideone.com/aNKnJ
Now, is it possible to get a whole-regex solution? Mostly, the answer is no. However, .Net does has a few tricks in its sleeve to help us: it allows variable length lookbehind, and allows each capturing group to remember all captures (most engines only return the last match of each group). Abusing these, we can simulate the same for loop inside the regex engine:
string text = "1123322!";
string allMatchesPattern = #"
(?<=^ # Starting at the local end position, look all the way to the back
(
(?=(?<Here>1.*2\G))? # on each position from the start until here (\G),
. # *try* to match our pattern and capture it,
)* # but advance even if you fail to match it.
)
";
MatchCollection matches = Regex.Matches(text, allMatchesPattern,
RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace);
foreach (Match endPosition in matches)
{
foreach (Capture startPosition in endPosition.Groups["Here"].Captures)
{
Console.WriteLine("{0}-{1}: {2}", startPosition.Index,
endPosition.Index - 1, startPosition.Value);
}
}
Note that currently there's a small bug there - the engine doesn't try to match the last ending position ($), so you loose a few matches. For now, adding a ! at the end of the string solves that issue.
working example: http://ideone.com/eB8Hb

Get second part of a string using RegEx

I have string like this "first#second", and I wonder how to get "second" part without "#" symbol as result of RegEx, not as match capture using brackets
upd: I forgot to add one more special char at the end of string, real string is "first#second*"
Simple regex:
/#(.*)$/
If you really don't want it to be a match capture, and you know there's a # in the string but none in the part you want, you can do
/[^#]*$/
and the whole regex is what you want.
If you must use regex, and you insist on not using capturing groups, you can use lookbehind in flavors that support them like this:
(?<=#).*
Or you can also capture just anything but #, to the end of the string, so something like this:
[^#]*$
The capturing group option, of course, is:
#(.*)
\__/
1
This matches the # too, but group 1 captures the part that you want.
Lastly, a non-regex alternative may look something like this:
secondPart = wholeString.substring( wholeString.indexOf("#") + 1 )
There may be issues with some of these solutions if # can also appear (perhaps escaped) anywhere else in the string.
References
regular-expressions.info
Lookarounds, Brackets for Capturing, Anchors
/[a-z]+#([a-z]+)/
You can use lookaround to exclude parts of an expression.
http://www.regular-expressions.info/lookaround.html
if your using java then
you can consider using Pattern & Matcher class. Pattern gives you a compiled, optimizer version of Regular expression. Matcher gives a complete internals of RE Matches.
Both Pattern.match & String.spilt gives same result where in first is compartively faster.
for e.g)
String s = "first#second#third";
String re = "#";
Pattern p = Pattern.compile(re);
Matcher m = p.matcher();
int ms = 0;
int me = 0;
while( m.find() ) {
System.out.println("start "+m.start()+" end "+ m.end()+" group "+m.group());
me = m.start();
System.out.println(s.substring(ms,me));
ms = m.end();
}
if other language u can consider using back-reference & groups also. if you find any repetitions.

Match at every second occurrence

Is there a way to specify a regular expression to match every 2nd occurrence of a pattern in a string?
Examples
searching for a against string abcdabcd should find one occurrence at position 5
searching for ab against string abcdabcd should find one occurrence at position 5
searching for dab against string abcdabcd should find no occurrences
searching for a against string aaaa should find two occurrences at positions 2 and 4
Use capturing groups.
foo.*?(foo)
Use a regex like this to match all occurrences in a string. Every returned match will contain a second occurrence as its first captured group.
Here's an example that matches every second occurrence of \d+ in Python using findall:
import re
input = '10 is less than 20, 5 is less than 10'
second_occurrences = re.findall(r'\d+.*?(\d+)', input)
print(second_occurrences)
Output:
['20', '10']
Suppose the pattern you want is abc+d. You want to match the second occurrence of this pattern in a string.
You would construct the following regex:
abc+d.*?(abc+d)
This would match strings of the form: <your-pattern>...<your-pattern>. Since we're using the reluctant qualifier *? we're safe that there cannot be another match of between the two. Using matcher groups which pretty much all regex implementations provide you would then retrieve the string in the bracketed group which is what you want.
If you're using C#, you can either get all the matches at once (i.e. use Regex.Matches(), which returns a MatchCollection, and check the index of the item: index % 2 != 0).
If you want to find the occurrence to replace it, use one of the overloads of Regex.Replace() that uses a MatchEvaluator (e.g. Regex.Replace(String, String, MatchEvaluator). Here's the code:
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string input = "abcdabcd";
// Replace *second* a with m
string replacedString = Regex.Replace(
input,
"a",
new SecondOccuranceFinder("m").MatchEvaluator);
Console.WriteLine(replacedString);
Console.Read();
}
class SecondOccuranceFinder
{
public SecondOccuranceFinder(string replaceWith)
{
_replaceWith = replaceWith;
_matchEvaluator = new MatchEvaluator(IsSecondOccurance);
}
private string _replaceWith;
private MatchEvaluator _matchEvaluator;
public MatchEvaluator MatchEvaluator
{
get
{
return _matchEvaluator;
}
}
private int _matchIndex;
public string IsSecondOccurance(Match m)
{
_matchIndex++;
if (_matchIndex % 2 == 0)
return _replaceWith;
else
return m.Value;
}
}
}
}
Would something like
(pattern.*?(pattern))*
work for you?
Edit:
The problem with this is that it uses the non-greedy operator *?, which can require an awful lot of backtracking along the string instead of just looking at each letter once. What this means for you is that this could be slow for large gaps.
Back references can find interesting solutions here. This regex:
([a-z]+).*(\1)
will find the longest repeated sequence.
This one will find a sequence of 3 letters that is repeated:
([a-z]{3}).*(\1)
There's no "direct" way of doing so but you can specify the pattern twice as in: a[^a]*a that match up to the second "a".
The alternative is to use your programming language (perl? C#? ...) to match the first occurence and then the second one.
EDIT: I've seen other responded using the "non-greedy" operators which might be a good way to go, assuming you have them in your regex library!