Lua text parsing, space handling - regex

I'm new to Lua, and I want to parse text like
Phase1:A B Phase2:A B Phase3:W O R D Phase4:WORD
to
Phase1 Phase2 Phase3 Phase4
A A B W O R D WORD
I used string.gmatch(s, "(%w+):(%w+)"), but I only get
Phase1 Phase2 Phase3 Phase4
A A W WORD
How can I get the missing B, O, R, D back?
Or do I need to write a pattern for every phase? If so, how?

The input text in your example doesn't have any clear delimiter between the phrases so parsing it accurately with regex is tricky.
This would be much easier to parse if you add a delimiter symbol like a , to separate the phrases.
Phrase1:A B, Phrase2:A B, Phrase3:W O R D,Phrase4:WORD
You can then parse it with this pattern:
s = "Phrase1:A B, Phrase2:A B, Phrase3:W O R D,Phrase4:WORD"
for k, v in s:gmatch "(Phrase%d+):([^,]+)" do
print(k, v)
end
outputs:
Phrase1 A B
Phrase2 A B
Phrase3 W O R D
Phrase4 WORD
If it's not possible to relax the above constraint, you can try this pattern:
s:gmatch "Phrase%d+:%w[%w ]* "
Note the caveat with this pattern: the string you're parsing needs an extra space at the end, or the last phrase won't be matched.

for k, v in s:gsub('%s*(%w+:)','\0%1'):gmatch'%z(%w+):(%Z*)'
– Egor Skriptunoff
This pattern works better.
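For comparison outside Lua, the delimiter-free split can be sketched with a lookahead, which Lua patterns do not have. This is only an illustration in Python, not something from the answers above: the name parse_phases is hypothetical, and it assumes every key is a run of word characters followed by a colon.

```python
import re

# Assumption: a key is a run of word characters followed by ":". The value
# is matched lazily and stops at whitespace before the next "key:" or at
# the end of the string.
PAIR = re.compile(r"(\w+):(.*?)(?=\s+\w+:|$)")

def parse_phases(text):
    """Return (key, value) pairs from text like 'Phase1:A B Phase2:A B'."""
    return PAIR.findall(text)

pairs = parse_phases("Phase1:A B Phase2:A B Phase3:W O R D Phase4:WORD")
for key, value in pairs:
    print(key, value)
```

Unlike the pattern that needs a trailing space, the `$` alternative in the lookahead lets the last pair match without one.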

Related

A little misunderstanding about this regex pattern

Let H be column 1, E be column 2, L column 3, P 4
I understand where the H comes from.
I also see how the L works.
But I am a bit confused on E and P.
If we look horizontally, the regex HE|LL|O+ only matches {HE, LL, O (1 or more times)}
The regex EP|IP|EF matches {EP, IP, EF}
How is it that the string E matches both of these conditions?
Similarly, [PLEASE]+ matches any combination of the letters {P, L, E, A, S}, and it only lines up with EP from the vertical regex, so why is there just a P?
Am I reading this incorrectly? This was taken from regexcrossword
I think you misunderstand the nature of the crossword.
The string HE matches HE|LL|O+
The string LP matches [PLEASE]+
The string HL matches [^SPEAK]+
The string EP matches EP|IP|EF
Each row and column matches its regex, so the solution is valid.
For instance, the following statement doesn't make sense:
How is it that the string E matches both of these conditions?
There is no string E. There are two strings, HE and EP.
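To make the "each row and each column is its own string" point concrete, here is a small Python sketch that checks the 2x2 grid against its clues. The function name is_solved and the grid layout are my own assumptions for illustration; the second column clue is taken as EP|IP|EF, as quoted in the question.

```python
import re

# Clues for the 2x2 puzzle discussed above.
row_clues = ["HE|LL|O+", "[PLEASE]+"]
col_clues = ["[^SPEAK]+", "EP|IP|EF"]
grid = ["HE", "LP"]  # each string is one row of the grid

def is_solved(grid, row_clues, col_clues):
    """True if every row and every column fully matches its clue."""
    # Read the columns top to bottom: "HL" and "EP" for the grid above.
    columns = ["".join(chars) for chars in zip(*grid)]
    rows_ok = all(re.fullmatch(c, r) for c, r in zip(row_clues, grid))
    cols_ok = all(re.fullmatch(c, col) for c, col in zip(col_clues, columns))
    return rows_ok and cols_ok

print(is_solved(grid, row_clues, col_clues))
```

The single cell E is never tested on its own; it only takes part in the row string HE and the column string EP.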

I can't find out with regex

I'm trying to grab all the chords from a song and place them into HTML elements. I'm using this regex
/(\W)(\s)( Chord )\s/
This regex doesn't find all the chords I need. What am I doing wrong?
For example song:
Intro: D G D G D G A G
D G
Klausei ko taip žiūriu
D G
Tavęs ausys neapgavo
D G
Aš kilnų tikslą turiu
D G
Sukišt liežuvį į burną tavo
A G D E G
Tai jokia nepagarba
A D E G
Ir ne kančia
A D E G
Ir ne bėda
A D E G
Greičiau likimo dovana
D D G
Mes pasirengę tegu
Mums gimsta dukros ir arba sūnūs
Arba visi kartu
Ir kuo daugiau, lai mums linksma būna
Tai beveik prabanga
Jokia bėda
Ir ne kančia
Tiesiog likimo dovana
A regular expression works with patterns, not with knowledge: the regex doesn't really know what a chord, a URL, or anything else is.
But you can use your own knowledge of the pattern to define a regex that captures the things you believe belong to it.
In this case you want to capture the chords, which appear to be single capital letters in the range A-G, optionally followed by the lowercase letter m,
with whitespace on both sides.
So, you can define this regex:
/(?<=\s)([ABCDEFG]|Am|Bm|Cm|Dm|Em|Gm)(?=\s)/gm
Here (?<=\s) is a look-behind: it requires whitespace just before the match but doesn't capture it.
Then ([ABCDEFG]|Am|Bm|Cm|Dm|Em|Gm) matches one letter from the set [ABC...G], or one of the combinations Am, Bm, and so on.
Finally, (?=\s) is a look-ahead: it requires whitespace just after the match, again without capturing it.
https://regex101.com/r/iE1xN3/1
You can also shorten the regex to this:
/(?<=\s)([A-G]m?)(?=\s)/gm
It is nearly the same, expressed another way (it also allows Fm): ([A-G]m?) matches a letter in the range A-G that may be followed by the letter m.
https://regex101.com/r/iE1xN3/2
For JavaScript engines that don't support look-behind, you can do this:
/(\b)([A-G]m?)(?=\s)/gm
https://regex101.com/r/iE1xN3/3
(thanks @stribizhev for the feedback)
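As a rough check of the look-around version, here is a Python sketch (Python's re module supports both look-behind and look-ahead). The sample line is made up for the demonstration and is not from the song.

```python
import re

# Same idea as the pattern above: a chord must have whitespace on both sides.
CHORD = re.compile(r"(?<=\s)[A-G]m?(?=\s)")

line = "Intro: D G Am D \n"  # hypothetical test line
found = CHORD.findall(line)
print(found)

# Caveat: a chord at the very start or very end of the text has no
# whitespace on one side, so the look-arounds reject it.
edge_case = CHORD.findall("D G")
print(edge_case)
```

If chords at the edges of the text matter, padding the text with a space on each side (or switching to \b) is one way around that caveat.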
I don't know what exactly you want to replace with.
I am using:
\b([A-G])\b
This means: word boundary, one of the letters A-G, word boundary.
https://regex101.com/r/kG0kE5/1
One problem with this method and mayo's answer is if you have lyrics with the word "A" in them.
I would probably write a program that went line by line to determine if the entire line was only chords, and only process those lines.
For example:
A long, long time ago I can still remember how that music used to make me smile .
Most solutions will end up picking "A" in "A long" and recognizing it as a chord.
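The line-by-line idea could look roughly like this in Python. It is a sketch under assumptions: chords are limited to [A-G] with an optional m, a leading label such as "Intro:" is allowed, and the helper name chords_in_line is hypothetical.

```python
import re

# Assumption: only plain major chords (A-G) and minor chords (Am-Gm) occur.
CHORD = re.compile(r"[A-G]m?")

def chords_in_line(line):
    """Return the chords on a line, or [] if the line holds anything else."""
    tokens = line.split()
    # Assumption: a leading label such as "Intro:" may precede the chords.
    if tokens and tokens[0].endswith(":"):
        tokens = tokens[1:]
    # Accept the line only if every remaining token is a chord.
    if tokens and all(CHORD.fullmatch(t) for t in tokens):
        return tokens
    return []

print(chords_in_line("Intro: D G D G D G A G"))
print(chords_in_line("A long, long time ago"))
```

A lyric line that consists only of chord-like words (for example a line containing just "A") would still be treated as chords, so this reduces false positives without removing them.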
You can use a regex like this to match the relevant part of a "line" with chords (with the multiline m and global g modifiers).
It selects runs of one or more chords at the end of a line, which reduces the false positives:
\b(?:([A-G]m?) *)+$
Try the regex online here
NB: the regex correctly skips a line such as "A first letter of the alphabet", but as a caveat it matches a trailing A-G at the end of a lyric line (an improbable event in lyrics).
In php code:
$re = "/\\b(?:([A-G]m?) *)+$/m";
$str = "Intro: D G D G D G A G\n\nD G\nKlausei ko taip žiūriu\nD G\nTavęs ausys neapgavo\nD G\nAš kilnų tikslą turiu\nD G\nSukišt liežuvį į burną tavo\nA G D E G\nTai jokia nepagarba\nA D E G\nIr ne kančia\nA D E G\nIr ne bėda\nA D E G\nGreičiau likimo dovana\n\nD D G \n\nMes pasirengę tegu\nMums gimsta dukros ir arba sūnūs\nArba visi kartu\nIr kuo daugiau, lai mums linksma būna\n\nTai beveik prabanga\nJokia bėda\nIr ne kančia\nTiesiog likimo dovana";
preg_match_all($re, $str, $matches);
print_r($matches[0]);
Output:
Array
(
[0] => D G D G D G A G
[1] => D G
[2] => D G
[3] => D G
[4] => D G
[5] => A G D E G
[6] => A D E G
[7] => A D E G
[8] => A D E G
[9] => D D G
)

Replace nth occurrence of a character by another

I hope this isn't a duplicate; I couldn't find an answer and I need help from the regexp wizards.
I have a string and I would like to replace the second space found in it with a \n, but I don't know how to use indices (in this way) in a regular expression:
For example :
# I have :
"a b c d e f"
# I want :
> "a b\nc d e f"
I would also like to know how to repeat this replacement, so that every second space is replaced by \n.
For example :
"a b c d e f"
> "a b\nc d\ne f"
(\\S+\\s+\\S+)\\s+
You can use this and replace with \1\n or $1\n. See the demo:
https://regex101.com/r/yG7zB9/29
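The question's snippets look like R, but the same pattern can be tried in Python as a quick sketch. The variable names are mine; count=1 limits the substitution to the first pair of words, which gives the "second space only" variant.

```python
import re

# Two whitespace-separated words, then the whitespace to be replaced.
PAIR = re.compile(r"(\S+\s+\S+)\s+")

text = "a b c d e f"
first_only = PAIR.sub(r"\1\n", text, count=1)  # replace only the 2nd space
every_second = PAIR.sub(r"\1\n", text)         # replace every 2nd space

print(repr(first_only))
print(repr(every_second))
```

Each match consumes two words plus the space after them, so the next match starts at the third word; that is what makes the replacement repeat on every second space.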

Regex for replacing multiple spaces and dashes with or without spaces

I can do this with two separate regex passes, but this is already slow and doing two doesn't help, so I want to be able to do it in one pass.
I want to:
replace multiple spaces with one space
replace a dash (hyphen) with a space
However, if the dash has a space on either side of it, then the dash and any spaces on either side should be replaced with just one space.
As an example:
a - b c-d e -f g- h i - j k - l m - n
must end up like
a b c d e f g h i j k l m n
I have tried things like this:
\s+| - | -|- |-
but that doesn't work:
a b c d e f g h i j k l m n
Use the following regexp to match multiple spaces or dashes:
[\s-]+
Replace with a single space.
Use [\s-]+ with the global 'g' modifier and replace with a single space.
See here
Regex:
(?:\s*-\s*)+|\s{2,}
Replacement string:
<space>
DEMO
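Both suggested patterns can be compared side by side in a single pass each. This is a Python sketch; the sample string is my own shortened version of the example, with the multiple spaces written out explicitly.

```python
import re

# Hypothetical sample with runs of spaces and dashes in every combination.
text = "a - b   c-d e -f g- h i  -  j"

# Any run of whitespace and/or dashes becomes one space.
simple = re.sub(r"[\s-]+", " ", text)

# Dashes with optional surrounding whitespace, or runs of 2+ whitespace.
explicit = re.sub(r"(?:\s*-\s*)+|\s{2,}", " ", text)

print(simple)
print(explicit)
```

On this input the two give the same result. The second pattern leaves single spaces untouched instead of replacing them with themselves, which may matter when the replacement is slow.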

Regular Expression: search multiple string with linefeed delimited by ";"

I have a string like this that describes a structured data source:
Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;
Every field...
... is starting with "FieldName"
... is ending with ";"
... may contain linefeed
I need a regular expression that finds the values of SampleTestPlan, which appears twice. So...
1st value is:
2
a b
c d
2nd value is
3
e f
g h
i l
I've made several attempts with search strings such as:
/SampleTestPlan(.\s)/gm
/SampleTestPlan(.\s);/gm
/SampleTestPlan(.*);/gm
but I need to understand much better how regular expressions work, as I'm definitely a newbie with them and have a lot to learn.
Thanks in advance to anyone that may help me!
Stefano, Milan, ITALY
You could use the following regex:
(?<=\w\b)[^;]+(?=;)
See it working live here on regex101!
How it works:
It matches everything that is:
preceded by a word character at a word boundary (\w\b), that is, the end of a field name
followed by a ;
contains anything (at least one character) except a ; (including newlines).
For example, for that input:
Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;
It matches 5 times:
whocares
then:
2
a b
c d
then:
abc
then:
3
e f
g h
i l
then:
01
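Here is a quick Python sketch of that pattern on the sample input. One detail worth noting: the raw matches start right after the field name, so they include the leading space, which the sketch strips. The names data, FIELD_VALUE and values are mine.

```python
import re

data = """Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;"""

# Look-behind: end of a field name. Body: anything but ";". Look-ahead: ";".
FIELD_VALUE = re.compile(r"(?<=\w\b)[^;]+(?=;)")

# Strip the whitespace that separates the field name from its value.
values = [m.strip() for m in FIELD_VALUE.findall(data)]
for v in values:
    print(repr(v))
```

EndOfFile produces no match because nothing sits between the field name and its semicolon.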
Assuming your input is always well formatted like the sample, try this:
/SampleTestPlan(\s+\d+.*?);/sg
Here, the /s modifier makes the dot match newline characters as well.
You can try this online.
That would be /SampleTestPlan([^;]+)/g. [^abc] means any character which is not a, b or c.
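The two SampleTestPlan-specific patterns can be sketched in Python as follows; re.S plays the role of the /s modifier. The variable names are assumptions of mine, and the values are stripped of the whitespace that follows the field name.

```python
import re

data = """Header whocares;
SampleTestPlan 2
a b
c d;
Test abc;
SampleTestPlan 3
e f
g h
i l;
Wafer 01;
EndOfFile;"""

# Negated character class: everything after the field name up to the ";".
by_class = [v.strip() for v in re.findall(r"SampleTestPlan([^;]+)", data)]

# Lazy dot with re.S (dot matches newlines), stopping at the first ";".
by_lazy_dot = [
    v.strip()
    for v in re.findall(r"SampleTestPlan(\s+\d+.*?);", data, flags=re.S)
]

print(by_class)
print(by_lazy_dot)
```

The negated class needs no flag, because [^;] already matches newlines; the lazy dot version depends on re.S to cross line breaks.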