Substitute the markdown italic to html using regex in Perl

To convert the markdown italic text $script into html, I've written this:
my $script = "*so what*";
my $res =~ s/\*(.)\*/$1/g;
print "<em>$1</em>\n";
The expected result is:
<em>so what</em>
but it gives:
<em></em>
How to make it give the expected result?

Problems:
You print the wrong variable.
You switch variable names halfway through.
. won't match more than one character.
You always add one EM element, even if no stars are found.
You always add one EM element, even if multiple pairs of stars are found.
You add the EM element around the entire output, not just the portion in stars.
Fix:
$script =~ s{\*([^*]+)\*}{<em>$1</em>}g;
print "$script\n";
or
my $res = $script =~ s{\*([^*]+)\*}{<em>$1</em>}gr;
print "$res\n";
But that's not all. Even with all the aforementioned problems fixed, your parser still has numerous other bugs. For example, it misapplies italics for all of the following inputs:
**Important**
    Correct:   <strong>Important</strong>
    Your code: *<em>Important</em>*
4 * 5 * 6 = 120
    Correct:   4 * 5 * 6 = 120
    Your code: 4 <em> 5 </em> 6 = 120
4 * 6 = 20 is *wrong*
    Correct:   4 * 6 = 20 is <em>wrong</em>
    Your code: 4 <em> 6 = 20 is </em>wrong*
`foo *bar* baz`
    Correct:   <code>foo *bar* baz</code>
    Your code: `foo <em>bar</em> baz`
\*I like stars\*
    Correct:   *I like stars*
    Your code: \<em>I like stars\</em>
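For comparison, here is a minimal Python sketch of the same substitution. The function name italicize is made up for illustration. It uses the same pattern as the Perl fix, so it should share the same limitations; it is not a real Markdown parser.

```python
import re

def italicize(text):
    # Same idea as the Perl fix: a star, one or more non-star characters, a star.
    # Hypothetical helper; it knows nothing about bold, code spans or escapes.
    return re.sub(r'\*([^*]+)\*', r'<em>\1</em>', text)

print(italicize("*so what*"))        # the case from the question
print(italicize("**Important**"))    # bold is mishandled
print(italicize("4 * 5 * 6 = 120"))  # plain arithmetic is mishandled
```

The last two calls reproduce the first two failure cases listed above, which is why a dedicated Markdown library is usually the safer choice.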

Related

How to optimize the python code written using regex and for loops?

I have two lists and I need to perform a string match. I have used three for loops and a regex pattern to solve it. I am getting the expected output with the existing code (part 1), but I need to optimize it (part 2), as it takes a long time when I apply it to lengthy data.
part1
import re

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45','fooz','bazzar','foo baz']
terms = ['foo','baz','apple']
output_list = []
for term in terms:
    pattern_term = r'\b(?:{})\b'.format(term)
    try:
        for i in range(len(texts)):
            line_text = texts[i]
            for match in re.finditer(pattern_term, line_text):
                start_index = match.start()
                output_list.append([i, start_index, line_text[start_index:], term])
    except:
        pass
output:
Explanation of column names:
Index = index of texts when pattern matches
Start_index = start index where pattern matches inside text
Match_text = complete text of that matching
Match_term = term with it matches
pd.DataFrame(output_list, columns = ['Index', 'Start_index', 'Match_text', 'Match_term'])
   Index  Start_index Match_text Match_term
0      0            0    foo abc        foo
1      6            0    foo baz        foo
2      3            0     baz 45        baz
3      6            4        baz        baz
I have tried the following code (part2), but its output is partial:
part 2
df = pd.DataFrame({'Match_text': texts})
pat = r'\b(?:{})\b'.format('|'.join(terms))
df[df['Match_text'].str.contains(pat)]
output
  Match_text
0    foo abc
3     baz 45
6    foo baz
Your code is already reasonable: you need to find whole-word occurrences inside longer strings, and you build the regex pattern before the loop that processes the texts.
The regex itself is fine. The non-capturing group is redundant, though: since you check term by term, there is no alternation inside the group, so you can drop it. You might also compile the regex:
pattern_term = re.compile(r'\b{}\b'.format(term))
Then, you may get rid of temporary variables in the for loop:
for i in range(len(texts)):
    for match in pattern_term.finditer(texts[i]):
        output_list.append([i, match.start(), texts[i][match.start():], term])
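If the per-term loop is still too slow, one option is a single pass per text with all terms joined into one alternation, reading the matched term from the match object. This is only a sketch: the function name find_terms is invented here, re.escape is added on the assumption that terms are literal words, and rows come out ordered by text rather than by term.

```python
import re

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45', 'fooz', 'bazzar', 'foo baz']
terms = ['foo', 'baz', 'apple']

def find_terms(texts, terms):
    # One compiled pattern for all terms; here the non-capturing group IS needed,
    # because it scopes the alternation between the two word boundaries.
    pattern = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, terms))))
    rows = []
    for i, text in enumerate(texts):
        for match in pattern.finditer(text):
            # match.group(0) is the term that actually matched
            rows.append([i, match.start(), text[match.start():], match.group(0)])
    return rows

output_list = find_terms(texts, terms)
```

Note that finditer returns non-overlapping matches, so this assumes the terms are distinct whole words.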

generate all combinations of strings based on template

How to generate all combinations of strings based on template?
For example:
- Template string of
"{I|We} want {|2|3|4} {apples|pears}"
The curly braces "{...}" identify a group of words, with each word separated by "|".
The class should generate strings with every combination of words within each word group.
I know it relates to finite automata and regex. How can I generate the combinations efficiently?
For example
G[0][j] [want] G[1][j] G[2][j]
G[0] = {I, We}
G[1] = {2, 3, 4}
G[2] = {apples, pears}
firstly, generate all possible combination c = [0..1][0..2][0..1]:
000
001
010
011
020
021
100
101
110
111
120
121
and then for each c replace G[i][j] by G[i][c[i]]
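The index-enumeration idea above is a Cartesian product over the word groups. A minimal Python sketch, assuming groups are never nested and using a made-up function name expand, could look like this:

```python
import itertools
import re

def expand(template):
    # re.split with a capturing group keeps the group bodies in the result:
    # even indices are literal text, odd indices are the "a|b|c" bodies.
    parts = re.split(r'\{([^{}]*)\}', template)
    choices = [part.split('|') if i % 2 else [part] for i, part in enumerate(parts)]
    # itertools.product walks every index combination, like the c = [..][..][..] counter
    return [''.join(combo) for combo in itertools.product(*choices)]

for line in expand("{I|We} want {|2|3|4} {apples|pears}"):
    print(line)
```

With the empty alternative in {|2|3|4} this yields 16 strings, and the ones using the empty alternative keep both surrounding spaces; collapsing them is left out of the sketch.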
Shell brace expansion
$ for q in {I,We}\ want\ {2,3,4}\ {apples,pears}; do echo "$q" ; done
I want 2 apples
I want 2 pears
I want 3 apples
I want 3 pears
I want 4 apples
I want 4 pears
We want 2 apples
We want 2 pears
We want 3 apples
We want 3 pears
We want 4 apples
We want 4 pears
The most functional solution to this problem I found so far is the Python module sre_yield.
The goal of sre_yield is to efficiently generate all values that can
match a given regular expression, or count possible matches
efficiently.
Emphasis added by me.
To apply it to your stated problem: Formulate your template as regex pattern and use it in sre_yield to get all possible combinations or count possible matches like this:
import sre_yield
result = set(sre_yield.AllStrings("(I|We) want (|2|3|4) (apples|pears)"))
len(result)
result
Output:
16
{'I want apples',
'I want pears',
'I want 2 apples',
'I want 2 pears',
'I want 3 apples',
'I want 3 pears',
'I want 4 apples',
'I want 4 pears',
'We want apples',
'We want pears',
'We want 2 apples',
'We want 2 pears',
'We want 3 apples',
'We want 3 pears',
'We want 4 apples',
'We want 4 pears'}
PS: Instead of a list as shown on the project page I use a set to avoid duplicates. If this is not what you want go with a list.
The principle is:
Regex -> NFA
NFA -> minimal DFA
DFS-walk through the DFA (collecting all characters)
This principle is implemented, e.g. in RexLex:
DeterministicAutomaton dfa = Pattern.compileGenericAutomaton("(I|We) want (2|3|4)? (apples|pears)")
    .toAutomaton(new FromGenericAutomaton.ToMinimalDeterministicAutomaton());
if (dfa.getProperty().isAcyclic()) {
    for (String s : dfa.getSamples(1000)) {
        System.out.println(s);
    }
}
Convert each set of strings {...} into a string array so you have n arrays.
So for "{I|We} want {|2|3|4} {apples|pears}" we would have 4 arrays.
Place each of those arrays into another array. In my example I will call the collection
This is Java code, but it's simple enough that you should be able to convert it to any language. I didn't test it, but it should work.
void makeStrings(String[][] wordSet, ArrayList<String> collection) {
    makeStrings(wordSet, collection, "", 0, 0);
}
void makeStrings(String[][] wordSet, ArrayList<String> collection, String currString, int x_pos, int y_pos) {
    // If there are no more word sets in the whole set, add the string (one combination is complete)
    if (x_pos >= wordSet.length) {
        collection.add(currString);
        return;
    }
    // Else if y_pos is out of bounds (no more words within the smaller set {...}), return
    else if (y_pos >= wordSet[x_pos].length) {
        return;
    }
    else {
        // Generate 2 new strings, one to send "vertically" and one "horizontally"
        // This string accepts the current word at x.y and then moves to the next word subset
        String combo_x = currString + " " + wordSet[x_pos][y_pos];
        makeStrings(wordSet, collection, combo_x, x_pos + 1, 0);
        // Create a copy of the string and move to the next string within the same subset
        String combo_y = currString;
        makeStrings(wordSet, collection, combo_y, x_pos, y_pos + 1);
    }
}
*Edit for corrections

Assistance with building regex

I hate doing this but I've banged my head for hours just trying to figure out regexes, so I am finally resorting to asking the experts.
-1,AAABO,ABOAO
-2,ABBBO,BABBO
-3,AAACO,ACAAO
-4,ABDDO,BADDO
-5,AAABF,ABFAA
-6,BBBGO,BGBBO
I am looking to match multiple substrings but only between the commas.
For example:
AA and B would return rows 1,5
BB and O would return 2 and 6
BBB and G would return row 6
AA C and O would return row 3
I would build this dynamically as needed.
The 2nd step would be filtering on the beginning or end of the string after the 2nd comma
For example (start):
AB would return row 1 and 5
For example (end):
BO would return row 2 and 6
and then I need to combine all 3 filters.
For example
AAA O (contains from 2nd column)
AB (begins with)
O (ends with)
returns row 1
I could do multiple passes if required.
I would be delighted with any guidance.
You want the regex
/^.*?,(?=[^,]*AAA)(?=[^,]*O).*?,AB.*O$/
with commentary
/
    ^.*?,           # consume the first field
    (?=[^,]*AAA)    # look ahead in the 2nd field for AAA
    (?=[^,]*O)      # look ahead in the 2nd field for O
    .*?,            # consume the 2nd field
    AB.*O$          # the 3rd field starts with AB and ends with O
/x
which you can generate like this
sub gen_regex {
    my ($begins, $ends, @contains) = @_;
    my $regex = "^.*?,"
              . join("", map {"(?=[^,]*$_)"} @contains)
              . ".*?,$begins.*$ends\$";
    return qr/$regex/;
}
my $re = gen_regex('AB', 'O', qw(AAA O));
and then use it like this:
while (<>) { say $. if /$re/ }
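The same generator can be sketched in Python to check it against the sample rows. The function name mirrors the Perl one; the default arguments and the re.escape calls are assumptions added here, so that any of the three filters can be left out and the substrings are treated literally.

```python
import re

def gen_regex(begins='', ends='', contains=()):
    # One lookahead per required substring, each confined to the 2nd field
    lookaheads = ''.join('(?=[^,]*{})'.format(re.escape(c)) for c in contains)
    return re.compile(
        '^.*?,' + lookaheads + '.*?,' + re.escape(begins) + '.*' + re.escape(ends) + '$'
    )

rows = [
    "-1,AAABO,ABOAO",
    "-2,ABBBO,BABBO",
    "-3,AAACO,ACAAO",
    "-4,ABDDO,BADDO",
    "-5,AAABF,ABFAA",
    "-6,BBBGO,BGBBO",
]

def matching_rows(regex):
    # 1-based row numbers, like $. in the Perl loop
    return [n for n, row in enumerate(rows, start=1) if regex.search(row)]

print(matching_rows(gen_regex('AB', 'O', ['AAA', 'O'])))
print(matching_rows(gen_regex(contains=['AA', 'B'])))
```

The first call combines all three filters, and the second only applies the "contains" filter on the 2nd field.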

Scala Regular Expression Oddity

I have this regular expression:
^(10)(1|0)(.)(.)(.)(.{18})((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+AY([0-9]*)AZ(.*)$
To give it a bit of organization, there's really 3 parts:
// Part 1
^(10)(1|0)(.)(.)(.)(.{18})
// Part 2
// Optional Elements that begin with two characters and is terminated by a |
// May appear at most once
((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+
// Part 3
AY([0-9]*)AZ(.*)$
Part 2 is the part that I'm having trouble with but I believe the current regular expression says any of these given elements will appear one or more times. I could have done something like: (AB.*?|) but I don't need the pipe in my group and wasn't quite sure how to express it.
This is my sample input - it's SIP2 if you've seen it before (please disregard checksum, I know it's not valid):
101YNY201406120000091911AOa|ABb|AQc|AJd|CKe|AFf|CSg|CRh|CTi|CVj|CYk|DAl|AY1AZAA71
This is my snippet of Scala code:
val regex = """^(10)(1|0)(.)(.)(.)(.{18})((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+AY([0-9]*)AZ(.*)$""".r
val msg = "101YNY201406120000091911AOa|ABb|AQc|AJd|CKe|AFf|CSg|CRh|CTi|CVj|CYk|DAl|AY1AZAA71"
val m = regex.findFirstMatchIn(msg) match {
  case None => println("No match")
  case Some(x) =>
    for (i <- 0 to x.groupCount) {
      println(i + " " + x.group(i))
    }
}
This is my output:
0 101YNY201406120000091911AOa|ABb|AQc|AJd|CKe|AFf|CSg|CRh|CTi|CVj|CYk|DAl|AY1AZAA71
1 10
2 1
3 Y
4 N
5 Y
6 201406120000091911
7 DAl|
8 ABb
9 AQc
10 AJd
11 AFf
12 CSg
13 CRh
14 CTi
15 CKe
16 CVj
17 CYk
18 DAl
19 AOa
20 1
21 AA71
Note the entry that starts with 7. Can anyone explain why that's there?
I'm using Scala 2.10.4, but I believe regular expressions in Scala simply use Java's regular expressions. I'm certainly open to other suggestions for parsing strings.
EDIT: Based on wingedsubmariner's response, I was able to fix my regular expression:
^(10)(1|0)(.)(.)(.)(.{18})(?:AB([^|]*)\||AQ([^|]*)\||AJ([^|]*)\||AF([^|]*)\||CS([^|]*)\||CR([^|]*)\||CT([^|]*)\||CK([^|]*)\||CV([^|]*)\||CY([^|]*)\||DA([^|]*)\||AO([^|]*)\|)+AY([0-9]*)AZ(.*)$
Basically adding ?: to indicate I was not interested in the group!
You get a matched group for each set of parentheses, ordered by the position of the opening parenthesis in the regex. Matched group 7 corresponds to the opening parenthesis that begins your "Part 2":
((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+
^
|
This parenthesis
Each matched group takes on the value of the last piece of text it matched, which in this case is DAl| because that was the last piece of text to match the "Part 2" expression.
Here is a simpler example that demonstrates the behavior:
val regex = """((A)\||(B)\|)+""".r
val msg = "A|B|A|B|"
regex.findFirstMatchIn(msg) match {
  case None => println("No match")
  case Some(x) =>
    for (i <- 0 to x.groupCount) {
      println(i + " " + x.group(i))
    }
}
Which produces:
0 A|B|A|B|
1 B|
2 A
3 B
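The same behavior can be reproduced outside Scala. Here is a small Python sketch (Python's re module is used purely for illustration, not because the question involves it) that compares the capturing outer group with the non-capturing (?:...) form from the question's edit:

```python
import re

msg = "A|B|A|B|"

# Outer parentheses capture: group 1 holds the last repetition of the outer group
capturing = re.match(r'((A)\||(B)\|)+', msg)

# Outer group marked non-capturing: only the inner groups are numbered
non_capturing = re.match(r'(?:(A)\||(B)\|)+', msg)

print(capturing.groups())
print(non_capturing.groups())
```

With the non-capturing form the extra entry disappears and the remaining groups shift down by one.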

How to get values inside nested braces using perl

I have a string having list of expressions inside braces. I want to get the details by splitting it in an array.
I have tried like this.
#!/usr/bin/perl
sub main() {
    my $string = <STDIN>;
    while ($string =~ /(\((?:(?1)|[^()]*+)++\))|[^()\s]++/g)
    {
        print "$&\n";
    }
}
main();
InPut : (+ (+ 4 3) ( - 3 2) 5)
Output should be : (+ (+ 3 4) ( - 2 3) 5)
(+ 3 4)
( - 2 3)
I'm trying to store these in an array and then evaluate them separately, but I'm not sure that's the right approach.
Basically I'm trying to evaluate an expression as below.
4+3 = 7, 3-2 = 1, and then 7+1+5 = 13
The final output should be 13.
Can anyone kindly help me with this?
Use the following expression /(?=(\((?>[^()]+|(?1))*\)))/g
See it in action here: http://regex101.com/r/eI7iP5
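For the evaluation step itself, a regex is not strictly required. The following Python sketch swaps in a different technique: it tokenizes the input and evaluates it recursively, since Python's re module has no (?1) recursion. The function name evaluate and the operator table are assumptions for illustration, and only + and - on integers are handled.

```python
import operator
import re
from functools import reduce

# Hypothetical operator table; extend it as needed
OPS = {'+': operator.add, '-': operator.sub}

def evaluate(expr):
    # Tokens are single parentheses or runs of non-space, non-parenthesis characters
    tokens = re.findall(r'[()]|[^\s()]+', expr)
    pos = 0

    def parse():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != '(':
            return int(token)          # a plain number
        op = OPS[tokens[pos]]          # first token after "(" is the operator
        pos += 1
        args = []
        while tokens[pos] != ')':
            args.append(parse())       # nested expressions recurse here
        pos += 1                       # skip the closing ")"
        return reduce(op, args)        # fold the operator over all arguments

    return parse()

print(evaluate("(+ (+ 4 3) ( - 3 2) 5)"))
```

For the sample input this works through 4+3 and 3-2 first and then adds the results to 5.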