I can't find out with regex - regex

I'm trying to grab all the chords from a song and place them into HTML elements. I'm using this regex:
/(\W)(\s)( Chord )\s/
This regex doesn't find all the chords I need. What am I doing wrong?
For example song:
Intro: D G D G D G A G
D G
Klausei ko taip žiūriu
D G
Tavęs ausys neapgavo
D G
Aš kilnų tikslą turiu
D G
Sukišt liežuvį į burną tavo
A G D E G
Tai jokia nepagarba
A D E G
Ir ne kančia
A D E G
Ir ne bėda
A D E G
Greičiau likimo dovana
D D G
Mes pasirengę tegu
Mums gimsta dukros ir arba sūnūs
Arba visi kartu
Ir kuo daugiau, lai mums linksma būna
Tai beveik prabanga
Jokia bėda
Ir ne kančia
Tiesiog likimo dovana

A regular expression works with patterns, not with knowledge: the regex doesn't really know what a chorus or a URL or anything else is.
But you can use your own knowledge of the pattern to define a regex that captures the things you believe belong to it.
In this case you want to capture the chords, which appear to be single capital letters in the range A-G, optionally followed by the letter m,
with whitespace on both sides.
So, you can define this regex:
/(?<=\s)([ABCDEFG]|Am|Bm|Cm|Dm|Em|Fm|Gm)(?=\s)/gm
Which means (?<=\s) : require whitespace (\s) immediately before the match, without capturing it.
Then ([ABCDEFG]|Am|Bm|Cm|Dm|Em|Fm|Gm) : match one letter from the set [ABC...G], or one of the combinations Am, Bm, and so on.
Then (?=\s) : require whitespace immediately after the match (again without capturing it).
https://regex101.com/r/iE1xN3/1
You can also shorten the regex to this:
/(?<=\s)([A-G]m?)(?=\s)/gm
This is the same pattern expressed another way, where ([A-G]m?) means: match a letter in the range A...G, optionally followed by the letter m.
https://regex101.com/r/iE1xN3/2
For JavaScript engines that don't support look-behind (it was only added in ES2018), you can do this:
/(\b)([A-G]m?)(?=\s)/gm
https://regex101.com/r/iE1xN3/3
(Thanks #stribizhev for the feedback.)
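As a quick check of the shortened pattern, here is a minimal Python sketch. The sample text is made up in the shape of the song from the question, and the names CHORD_RE, sample and chords are only illustrative. Note that the look-arounds need whitespace on both sides, so a chord at the very start or very end of the string would be missed; the sample therefore ends with a newline.

```python
import re

# Chord pattern from the answer above: a letter A-G, optionally followed by "m",
# with whitespace required on both sides (not part of the match).
CHORD_RE = re.compile(r"(?<=\s)([A-G]m?)(?=\s)")

# Made-up sample (assumption), shaped like the song in the question.
sample = "Intro: D G Am\nD G\nKlausei ko taip\n"

chords = CHORD_RE.findall(sample)
print(chords)
```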

I don't know what exactly you want to replace with.
I am using:
\b([A-G])\b
Which means: a word boundary, one of the letters A-G, a word boundary.
https://regex101.com/r/kG0kE5/1
One problem with this method and mayo's answer is if you have lyrics with the word "A" in them.
I would probably write a program that went line by line to determine if the entire line was only chords, and only process those lines.
For example:
A long, long time ago I can still remember how that music used to make me smile.
Most solutions will end up picking "A" in "A long" and recognizing it as a chord.
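The line-by-line idea could be sketched roughly like this in Python. The function names is_chord_line and extract_chords are hypothetical, and stripping a leading label such as "Intro:" is an assumption based on the sample song, not something the question specifies.

```python
import re

# A single chord token: a letter A-G, optionally followed by "m".
CHORD = re.compile(r"[A-G]m?")

def is_chord_line(line):
    """True if the line is non-empty and every whitespace-separated token is a chord."""
    tokens = line.split()
    return bool(tokens) and all(CHORD.fullmatch(t) for t in tokens)

def extract_chords(song):
    """Collect chords only from lines that consist entirely of chords."""
    chords = []
    for line in song.splitlines():
        # Assumption: a prefix such as "Intro:" is a label, not part of the chords.
        body = line.split(":", 1)[1] if ":" in line else line
        if is_chord_line(body):
            chords.extend(body.split())
    return chords

print(extract_chords("Intro: D G\nA long, long time ago\nA D E G\n"))
```

With this approach the "A" in "A long, long time ago" is ignored because the rest of that line is not made of chords.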

You can use a regex like this to match the relevant part of a "line" with chords (with the multiline m and global g modifiers).
It works by selecting runs of one or more "chords" at the end of a line, which reduces the false positives:
\b(?:([A-G]m?) *)+$
Try the regex online here
NB: the online demo correctly skips a line such as "A first letter of the alphabet", but as a caveat it does match a trailing A-G at the end of a lyric line (an improbable event in lyrics).
In php code:
$re = "/\\b(?:([A-G]m?) *)+$/m";
$str = "Intro: D G D G D G A G\n\nD G\nKlausei ko taip žiūriu\nD G\nTavęs ausys neapgavo\nD G\nAš kilnų tikslą turiu\nD G\nSukišt liežuvį į burną tavo\nA G D E G\nTai jokia nepagarba\nA D E G\nIr ne kančia\nA D E G\nIr ne bėda\nA D E G\nGreičiau likimo dovana\n\nD D G \n\nMes pasirengę tegu\nMums gimsta dukros ir arba sūnūs\nArba visi kartu\nIr kuo daugiau, lai mums linksma būna\n\nTai beveik prabanga\nJokia bėda\nIr ne kančia\nTiesiog likimo dovana";
preg_match_all($re, $str, $matches);
print_r($matches[0]);
Output:
Array
(
[0] => D G D G D G A G
[1] => D G
[2] => D G
[3] => D G
[4] => D G
[5] => A G D E G
[6] => A D E G
[7] => A D E G
[8] => A D E G
[9] => D D G
)
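Since the question asks for the chords to end up in HTML elements, the same end-of-line pattern could be used for a replacement instead of a plain match. This is only a sketch in Python: the function name wrap_chords and the span element with class "chord" are assumptions, as the question does not say which markup is wanted.

```python
import re

# Same idea as the pattern above: a run of chords that reaches the end of a line.
CHORD_RUN = re.compile(r"\b(?:[A-G]m? *)+$", re.MULTILINE)

def wrap_chords(song):
    """Wrap every chord inside a chord run in a <span class="chord"> element."""
    def wrap_run(match):
        # Inside a matched run, wrap each individual chord and keep the spacing.
        return re.sub(
            r"[A-G]m?",
            lambda m: '<span class="chord">' + m.group(0) + "</span>",
            match.group(0),
        )
    return CHORD_RUN.sub(wrap_run, song)

print(wrap_chords("Intro: D G\nKlausei ko taip"))
```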

Related

Find a regular expression that describes those words that don't contain two consecutive a's over the alphabet {a,b}

I've tried to write a grammar for the language. Here is my grammar:
S -> aS | bS | λ
I also wanted to generate the word "bbababb" which does not have two consecutive a's.
I started with,
bS => bbS => bbaS => bbabS => bbabaS => bbababS => bbababbS => bbababbλ => bbababb.
And finally I tried the following regular expression,
(a+b*)a*(a+b*)
I really appreciate your help.
Let's try to write some rules that describe all strings that don't have two consecutive a's:
the empty string is in the language
if x is a string in the language ending in a, you can add b to the end to get another string in the language
if x is a string in the language ending in b, you can add an a or a b to it to get another string in the language
This lets us write down a grammar:
S -> e | aB | bS
B -> e | bS
That grammar should work for us. Consider your string bbababb:
S -> bS -> bbS -> bbaB -> bbabS
-> bbabaB -> bbababS -> bbababbS
-> bbababb
To turn a regular grammar such as this into a regular expression, we can write equations and solve for S:
S = e + aB + bS
B = e + bS
Replace for B:
S = e + a(e + bS) + bS
= e + a + abS + bS
= e + a + (ab + b)S
Now we can eliminate recursion to solve for S:
S = (ab + b)*(e + a)
This gives us a regular expression: (ab + b)*(e + a)
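One way to gain confidence in the derived expression is to compare it against the definition by brute force. The sketch below does that in Python; the translation into Python syntax ("+" becomes "|", and "e + a" becomes "a?") and the helper name check are mine.

```python
import re
from itertools import product

# (ab + b)*(e + a) written in Python regex syntax.
NO_DOUBLE_A = re.compile(r"(?:ab|b)*a?")

def check(max_len):
    """Compare the regex with the definition for every word over {a, b} up to max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            word = "".join(letters)
            expected = "aa" not in word
            if (NO_DOUBLE_A.fullmatch(word) is not None) != expected:
                return False
    return True

print(check(10))
```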
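Alternatively: a must always be followed by b, except when it is the last character, so you can express it as "b or ab, repeated, with an optional trailing a":
\b(b|ab)+a?\b
See live demo.
Note that the + requires at least one b, so this form does not match the empty string or the single word a; use (b|ab)*a? if those must be accepted. The \b (word boundaries) can possibly be removed, depending on your usage and regex engine.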

Regex for replacing multiple spaces and dashes with or without spaces

I can do this with two separate regex passes, but this is already slow and doing two doesn't help, so I want to be able to do it in one pass.
I want to:
replace multiple spaces with one space
replace a dash (hyphen) with a space
However, if the dash has a space on either side of it, then the dash and any spaces on either side should be replaced with just one space.
As an example:
a - b c-d e -f g- h i - j k - l m - n
must end up like
a b c d e f g h i j k l m n
I have tried things like this:
\s+| - | -|- |-
but that doesn't work:
a b c d e f g h i j k l m n
Use the following regexp to match multiple spaces or dashes:
[\s-]+
Replace with a single space.
Use [\s-]+ with the global 'g' modifier and replace each match with one single space.
See here
Regex:
(?:\s*-\s*)+|\s{2,}
Replacement string:
<space>
DEMO
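To compare the two suggested patterns side by side, here is a small Python sketch. The input string is made up in the spirit of the question's example (the original spacing was lost), and the variable names are illustrative.

```python
import re

# Made-up input (assumption): single and multiple spaces, dashes with and without spaces.
text = "a - b c-d e -f g- h  i   -   j"

# Pattern 1: any run of whitespace and/or dashes becomes one space.
simple = re.sub(r"[\s-]+", " ", text)

# Pattern 2: runs of dashes with optional surrounding whitespace, or 2+ whitespace chars.
strict = re.sub(r"(?:\s*-\s*)+|\s{2,}", " ", text)

print(simple)
print(strict)
```

One difference worth knowing: the first pattern also turns a single tab or newline into a space, while the second leaves a lone whitespace character untouched.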

Regular Expression: search multiple string with linefeed delimited by ";"

I have a string like this that describes a structured data source:
Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;
Every field...
... is starting with "FieldName"
... is ending with ";"
... may contain linefeed
I need a regular expression that finds the values of the SampleTestPlan field, which occurs twice. So...
1st value is:
2
a b
c d
2nd value is
3
e f
g h
i l
I've performed several attempts with such search string:
/SampleTestPlan(.\s)/gm
/SampleTestPlan(.\s);/gm
/SampleTestPlan(.*);/gm
but I need to understand much better how regular expressions work, as I'm definitely a newbie with them and have a lot to learn.
Thanks in advance to anyone that may help me!
Stefano, Milan, ITALY
You could use the following regex:
(?<=\w\b)[^;]+(?=;)
See it working live here on regex101!
How it works:
It matches everything that is:
preceded by the end of a word (the field name): \w\b
followed by a ;
contains anything (at least one character) except a ; (including newlines).
For example, for that input:
Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;
It matches 5 times:
whocares
then:
2
a b
c d
then:
abc
then:
3
e f
g h
i l
then:
01
Assuming your input is always well formatted like the sample, try this:
/SampleTestPlan(\s+\d+.*?);/sg
Here, the /s modifier means that the dot also matches newline characters.
You can try this online.
That would be /SampleTestPlan([^;]+)/g. [^abc] means any character which is not a, b or c.
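As an illustration of the negated character class approach, here is a short Python sketch run against the sample from the question. The strip() call, which removes the whitespace between the field name and its value, is my addition.

```python
import re

data = """Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;"""

# [^;]+ matches everything up to the next ";", including newlines.
values = [v.strip() for v in re.findall(r"SampleTestPlan([^;]+)", data)]

for value in values:
    print(repr(value))
```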

Lua text parsing, space handling

I'm a newbie to Lua. And I want to parse the text like
Phase1:A B Phase2:A B Phase3:W O R D Phase4:WORD
to
Phase1  Phase2  Phase3   Phase4
A B     A B     W O R D  WORD
I used string.gmatch(s, "(%w+):(%w+)"), I can only get
Phase1  Phase2  Phase3  Phase4
A       A       W       WORD
How can I get missing B, O, R, D back?
Or do I need to write pattern for every phases? How to do that?
The input text in your example doesn't have any clear delimiter between the phrases so parsing it accurately with regex is tricky.
This would be much easier to parse if you add a delimiter symbol like a , to separate the phrases.
Phrase1:A B, Phrase2:A B, Phrase3:W O R D,Phrase4:WORD
You can then parse it with this pattern:
s = "Phrase1:A B, Phrase2:A B, Phrase3:W O R D,Phrase4:WORD"
for k, v in s:gmatch "(Phrase%d+):([^,]+)" do
  print(k, v)
end
outputs:
Phrase1 A B
Phrase2 A B
Phrase3 W O R D
Phrase4 WORD
If it's not possible to relax the above constraint, you can try this pattern:
s:gmatch "Phrase%d+:%w[%w ]* "
Note there's a caveat with this pattern, the string you're parsing needs to have an extra space at the end or the last phrase won't get parsed.
for k, v in s:gsub('%s*(%w+:)','\0%1'):gmatch'%z(%w+):(%Z*)'
– #Egor Skriptunoff
This pattern works better.
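For comparison, in a regex flavor that has lookahead (which Lua patterns do not), the original undelimited input can be split without changing it. The sketch below uses Python rather than Lua; the name FIELD_RE is illustrative, and it assumes that every field name is a run of word characters followed directly by a colon.

```python
import re

s = "Phase1:A B Phase2:A B Phase3:W O R D Phase4:WORD"

# A name, a colon, then as little as possible up to the next " name:" or the end.
FIELD_RE = re.compile(r"(\w+):(.*?)(?=\s+\w+:|$)")

pairs = FIELD_RE.findall(s)
for name, value in pairs:
    print(name, value)
```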

Regex to calculate straight poker hand - Using ASCII CODE

In another question I learned how to detect a straight poker hand using regex (here).
Now, out of curiosity, the question is: can I use regex to do the same thing using ASCII codes?
Something like:
regex: [C][C+1][C+2][C+3][C+4], with C being the ASCII code (or something like this)
Matches: 45678, 23456
Doesn't match: 45679 or 23459 (not in sequence)
Your main problem is really going to be that you're not using ASCII-consecutive encodings for your hands, you're using numerics for non-face cards, and non-consecutive, non-ordered characters for face cards.
You need to detect, at the start of the strings, 2345A, 23456, 34567, ..., 6789T, 789TJ, 89TJQ, 9TJQK and TJQKA.
These are not consecutive ASCII codes and, even if they were, you would run into problems since both A2345 and TJQKA are valid and you won't get A being both less than and greater than the other characters in the same character set.
If it has to be done by a regex, then the following regex segment:
(2345A|23456|34567|45678|56789|6789T|789TJ|89TJQ|9TJQK|TJQKA)
is probably the easiest and most readable one you'll get.
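If typing the alternation by hand feels error-prone, it can be generated. The following Python sketch rebuilds exactly the alternation shown above; the names RANKS and straight_pattern are hypothetical, and the ace-low hand is written as 2345A only because that is how this answer spells it.

```python
import re

# Rank order with the ace high; the ace-low straight is added separately.
RANKS = "23456789TJQKA"

def straight_pattern(ranks=RANKS, wheel="2345A"):
    """Build an alternation of every run of 5 consecutive ranks, plus the ace-low hand."""
    runs = [ranks[i:i + 5] for i in range(len(ranks) - 4)]
    return "(" + "|".join([wheel] + runs) + ")"

pattern = straight_pattern()
print(pattern)
print(bool(re.fullmatch(pattern, "45678")))
```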
There is no regex that will do what you want as the other answers have pointed out, but you did say that you want to learn regex, so here's another meta-regex approach that may be instructional.
Here's a Java snippet that, given a string, programmatically generates the pattern that will match any substring of that string of length 5.
String seq = "ABCDEFGHIJKLMNOP";
System.out.printf("^(%s)$",
    seq.replaceAll(
        "(?=(.{5}).).",
        "$1|"
    )
);
The output is (as seen on ideone.com):
^(ABCDE|BCDEF|CDEFG|DEFGH|EFGHI|FGHIJ|GHIJK|HIJKL|IJKLM|JKLMN|KLMNO|LMNOP)$
You can use this to conveniently generate the regex pattern to match straight poker hands, by initializing seq as appropriate.
How it works
. metacharacter matches "any" character (line separators may be an exception depending on the mode we're in).
The {5} is an exact repetition specifier, so .{5} matches exactly 5 characters.
(?=…) is positive lookahead; it asserts that a given pattern can be matched, but since it's only an assertion, it doesn't actually make (i.e. consume) the match from the input string.
Simply (…) is a capturing group. It creates a backreference that you can use perhaps later in the pattern, or in substitutions, or however you see fit.
The pattern is repeated here for convenience:
       match one char
          at a time
           |
(?=(.{5}).).
\_________/
must be able to see 6 chars ahead
    (capture the first 5)
The pattern works by matching one character . at a time. Before that character is matched, however, we assert (?=…) that we can see a total of 6 characters ahead (.{5})., capturing (…) into group 1 the first .{5}. For every such match, we replace with $1|, that is, whatever was captured by group 1, followed by the alternation metacharacter.
Let's consider what happens when we apply this to a shorter String seq = "ABCDEFG";. The ↑ denotes our current position.
=== INPUT ===                    === OUTPUT ===
 A B C D E F G                   ABCDE|BCDEFG
↑
    We can assert (?=(.{5}).), matching ABCDEF
    in the lookahead. ABCDE is captured.
    We now match A, and replace with ABCDE|
 A B C D E F G                   ABCDE|BCDEF|CDEFG
  ↑
    We can assert (?=(.{5}).), matching BCDEFG
    in the lookahead. BCDEF is captured.
    We now match B, and replace with BCDEF|
 A B C D E F G                   ABCDE|BCDEF|CDEFG
    ↑
    Can't assert (?=(.{5}).), skip forward
 A B C D E F G                   ABCDE|BCDEF|CDEFG
      ↑
    Can't assert (?=(.{5}).), skip forward
 A B C D E F G                   ABCDE|BCDEF|CDEFG
        ↑
    Can't assert (?=(.{5}).), skip forward
       :
       :
 A B C D E F G                   ABCDE|BCDEF|CDEFG
              ↑
    Can't assert (?=(.{5}).), and we are at
    the end of the string, so we're done.
So we get ABCDE|BCDEF|CDEFG, which are all the substrings of length 5 of seq.
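The same substitution trick carries over to other flavors with lookahead. Here is a rough Python equivalent of the Java snippet, using re.sub; the function name windows_pattern and its size parameter are my own additions for illustration.

```python
import re

def windows_pattern(seq, size=5):
    """Turn seq into an anchored alternation of all its substrings of length `size`."""
    # Consume one character at a time; before consuming it, look ahead for
    # size + 1 characters and capture the first `size` of them into group 1.
    body = re.sub(r"(?=(.{%d}).)." % size, r"\1|", seq)
    return "^(" + body + ")$"

print(windows_pattern("ABCDEFG"))
```

The last window is never replaced: it is simply the tail of the string that is left over once the lookahead can no longer see size + 1 characters.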
References
regular-expressions.info/Dot, Repetition, Grouping, Lookaround
Something like regex: [C][C+1][C+2][C+3][C+4], being C the ASCII CODE (or like this)
You cannot do anything remotely close to this in most regex flavors. This is simply not the kind of pattern that regex is designed for.
There is no mainstream regex pattern that will succinctly match any two consecutive characters that differ by x in their ASCII encoding.
For instructional purposes...
Here you go (see also on ideone.com):
String alpha = "ABCDEFGHIJKLMN";
String p = alpha.replaceAll(".(?=(.))", "$0(?=$1|\\$)|") + "$";
System.out.println(p);
// A(?=B|$)|B(?=C|$)|C(?=D|$)|D(?=E|$)|E(?=F|$)|F(?=G|$)|G(?=H|$)|
// H(?=I|$)|I(?=J|$)|J(?=K|$)|K(?=L|$)|L(?=M|$)|M(?=N|$)|N$
String p5 = String.format("(?:%s){5}", p);
String[] tests = {
    "ABCDE",    // true
    "JKLMN",    // true
    "AAAAA",    // false
    "ABCDEFGH", // false
    "ABCD",     // false
    "ACEGI",    // false
    "FGHIJ",    // true
};
for (String test : tests) {
    System.out.printf("[%s] : %s%n",
        test,
        test.matches(p5)
    );
}
This uses a meta-regexing technique to generate a pattern. That pattern ensures that each character is followed by the right character (or the end of the string), using lookahead. That pattern is then meta-regexed to be matched repeatedly 5 times.
You can substitute alpha with your poker sequence as necessary.
Note that this is an ABSOLUTELY IMPRACTICAL solution. It's much more readable to e.g. just check if alpha.contains(test) && (test.length() == 5).
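For poker hands, that plain substring check might look like the Python sketch below. The ORDER string, with the ace at both ends so that both A2345 and TJQKA count, is an assumption about how the hands are written; is_straight is a hypothetical name.

```python
# Assumption: hands are written in rank order, and the ace may be low or high.
ORDER = "A23456789TJQKA"

def is_straight(hand, order=ORDER):
    """A hand is a straight if it has 5 cards and appears as a run inside `order`."""
    return len(hand) == 5 and hand in order

for hand in ["789TJ", "TJQKA", "A2345", "45679"]:
    print(hand, is_straight(hand))
```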
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
SOLVED!
See http://jsfiddle.net/g48K9/3
I solved it using a closure, in JS.
String.prototype.isSequence = function () {
    if (this == "A2345") return true; // special case: the ace-low straight
    // replace() converts the callback's boolean result into the string
    // "true" or "false", so compare against "true" to return a real boolean.
    return this.replace(/(\w)(\w)(\w)(\w)(\w)/, function (a, g1, g2, g3, g4, g5) {
        return code(g1) == code(g2) - 1 &&
               code(g2) == code(g3) - 1 &&
               code(g3) == code(g4) - 1 &&
               code(g4) == code(g5) - 1;
    }) === "true";
};
// Map each card to a code so that consecutive ranks differ by exactly 1.
function code(card) {
    switch (card) {
        case "T": return 58;
        case "J": return 59;
        case "Q": return 60;
        case "K": return 61;
        case "A": return 62;
        default: return card.charCodeAt();
    }
}
test("23456");
test("23444");
test("789TJ");
test("TJQKA");
test("8JQKA");
function test(cards) {
    alert("cards " + cards + ": " + cards.isSequence());
}
Just to clarify, the ASCII codes (and the values the face cards are remapped to):
2 = 50
3 = 51
4 = 52
5 = 53
6 = 54
7 = 55
8 = 56
9 = 57
T = 84 -> 58
J = 74 -> 59
Q = 81 -> 60
K = 75 -> 61
A = 65 -> 62