Combining multiple regular expressions

I have a text file that contains the following lines
! R1 R(1,2) 1.0881
! R2 R(1,3) 1.0881
! R3 R(1,4) 1.0881
! R4 R(1,5) 1.0881
! A1 A(2,1,3) 109.4712
! A2 A(2,1,4) 109.4712
! A3 A(2,1,5) 109.4712
! A4 A(3,1,4) 109.4712
! A5 A(3,1,5) 109.4712
! A6 A(4,1,5) 109.4712
! D1 D(2,1,4,3) -120.0
! D2 D(2,1,5,3) 120.0
! D3 D(2,1,5,4) -120.0
! D4 D(3,1,5,4) 120.0
To match everything, I am using two different Regular expressions.
RE1 = !\s\w(\d)\s+R\((\d),(\d+)\)\s+(\d\.\d+)
RE2 = !\s\w(\d)\s+\w\((\d)+,\d,\d\)?,?\d?\s?\)\s+\d?\-?\d\d\d?.\d?\d?\d?\d?
How do I go about combining these two REs so that the code checks for either one? Based on some posts on SO, I have tried using '|' to concatenate the two expressions, but all my attempts have resulted in a TypeError. Here is one of my attempts:
pattern = re.compile(re.compile(r'!\s\w(\d)\s+R\((\d),(\d+)\)\s+(\d\.\d+)') | re.compile(r'!\s\w(\d)\s+\w\((\d)+,\d,\d\)?,?\d?\s?\)\s+\d?\-?\d\d\d?.\d?\d?\d?\d?'))
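For reference, the attempt above fails because `|` is applied to two compiled pattern objects, which do not support that operator. Alternation is part of the regex syntax, so it has to go inside the pattern string. A minimal sketch, assuming RE1 is meant to end with a closing parenthesis as in the attempt; the names `combined`, `matched`, and `raised` are only illustrative:

```python
import re

# The two patterns from the question, kept as plain strings.
RE1 = r"!\s\w(\d)\s+R\((\d),(\d+)\)\s+(\d\.\d+)"
RE2 = r"!\s\w(\d)\s+\w\((\d)+,\d,\d\)?,?\d?\s?\)\s+\d?\-?\d\d\d?.\d?\d?\d?\d?"

# Join the strings with "|" and compile once. Each alternative is wrapped
# in a non-capturing group so the "|" applies to the whole pattern.
combined = re.compile("(?:%s)|(?:%s)" % (RE1, RE2))

lines = [
    "! R1 R(1,2) 1.0881",
    "! A1 A(2,1,3) 109.4712",
    "! D1 D(2,1,4,3) -120.0",
]

# Keep the lines that match either alternative.
matched = [line for line in lines if combined.search(line)]
print(len(matched), "of", len(lines), "lines matched")

# Applying "|" to compiled pattern objects is what raises the TypeError.
raised = False
try:
    re.compile(RE1) | re.compile(RE2)
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```

Note that the capture groups of both alternatives are numbered consecutively, so groups belonging to the alternative that did not match come back as None.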

This should get everything you need in a single regex
([A-Z])(\d+)\s+\1\((\d+(?:,\d+)*)\)\s+(-?\d+\.\d+)
https://regex101.com/r/bJdcSc/1
( [A-Z] ) # (1)
( \d+ ) # (2)
\s+ \1 \(
( # (3 start)
\d+
(?: , \d+ )*
) # (3 end)
\) \s+
( -? \d+ \. \d+ ) # (4)
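As a rough illustration of how this single pattern could be used from Python, the sketch below runs it over a few of the sample lines and converts the captured groups. The helper name `parse` and the choice to convert the fields to `int` and `float` are assumptions made for the example, not part of the answer above.

```python
import re

# The pattern from the answer: a letter, an index, the same letter again
# (backreference \1), a comma-separated list of integers in parentheses,
# and a signed decimal value.
PATTERN = re.compile(r"([A-Z])(\d+)\s+\1\((\d+(?:,\d+)*)\)\s+(-?\d+\.\d+)")

SAMPLE = """\
! R1 R(1,2) 1.0881
! A1 A(2,1,3) 109.4712
! D1 D(2,1,4,3) -120.0
"""


def parse(text):
    """Return (kind, index, atom numbers, value) for every matching line."""
    rows = []
    for kind, index, atoms, value in PATTERN.findall(text):
        atom_list = [int(a) for a in atoms.split(",")]
        rows.append((kind, int(index), atom_list, float(value)))
    return rows


rows = parse(SAMPLE)
for row in rows:
    print(row)
```

Because group 3 captures the whole comma-separated list, the same pattern covers the two-, three-, and four-number cases, and the list is split afterwards in ordinary code.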

Maybe,
!\s+[A-Z](\d)\s{2,}[A-Z]\((\d+),(\d+)?,?(\d+)?,?(\d+)?,?\)\s{2,}(-?\d+\.\d*)
might be close to what you want to write.
Demo
Test
import re
regex = r"!\s+[A-Z](\d)\s{2,}[A-Z]\((\d+),(\d+)?,?(\d+)?,?(\d+)?,?\)\s{2,}(-?\d+\.\d*)"
string = """
! R1 R(1,2) 1.0881
! R2 R(1,3) 1.0881
! R3 R(1,4) 1.0881
! R4 R(1,5) 1.0881
! A1 A(2,1,3) 109.4712
! A2 A(2,1,4) 109.4712
! A3 A(2,1,5) 109.4712
! A4 A(3,1,4) 109.4712
! A5 A(3,1,5) 109.4712
! A6 A(4,1,5) 109.4712
! D1 D(2,1,4,3) -120.0
! D2 D(2,1,5,3) 120.0
! D3 D(2,1,5,4) -120.0
! D4 D(3,1,5,4) 120.0
"""
print(re.findall(regex, string))
Output
[('1', '1', '2', '', '', '1.0881'), ('2', '1', '3', '', '', '1.0881'),
('3', '1', '4', '', '', '1.0881'), ('4', '1', '5', '', '', '1.0881'),
('1', '2', '1', '3', '', '109.4712'), ('2', '2', '1', '4', '',
'109.4712'), ('3', '2', '1', '5', '', '109.4712'), ('4', '3', '1',
'4', '', '109.4712'), ('5', '3', '1', '5', '', '109.4712'), ('6', '4',
'1', '5', '', '109.4712'), ('1', '2', '1', '4', '3', '-120.0'), ('2',
'2', '1', '5', '3', '120.0'), ('3', '2', '1', '5', '4', '-120.0'),
('4', '3', '1', '5', '4', '120.0')]
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also step through how it matches against some sample inputs.


Converting full-width characters to half-width characters

I have a program for converting full width characters to half width. It works fine, except for the number zero. Full-width zero is not converting to half-width zero.
Perl
use strict;
use warnings;
use warnings qw(FATAL utf8);
use utf8;
use feature qw(unicode_strings);
use open qw(:std :utf8);
unless ( @ARGV == 2 ) {
print "Usage: script.pl input_file output_file\n";
exit;
}
my %fwhw = (
'０' => '0', '１' => '1', '２' => '2', '３' => '3', '４' => '4',
'５' => '5', '６' => '6', '７' => '7', '８' => '8', '９' => '9',
'Ａ' => 'A', 'Ｂ' => 'B', 'Ｃ' => 'C', 'Ｄ' => 'D', 'Ｅ' => 'E',
'Ｆ' => 'F', 'Ｇ' => 'G', 'Ｈ' => 'H', 'Ｉ' => 'I', 'Ｊ' => 'J',
'Ｋ' => 'K', 'Ｌ' => 'L', 'Ｍ' => 'M', 'Ｎ' => 'N', 'Ｏ' => 'O',
'Ｐ' => 'P', 'Ｑ' => 'Q', 'Ｒ' => 'R', 'Ｓ' => 'S', 'Ｔ' => 'T',
'Ｕ' => 'U', 'Ｖ' => 'V', 'Ｗ' => 'W', 'Ｘ' => 'X', 'Ｙ' => 'Y',
'Ｚ' => 'Z', 'ａ' => 'a', 'ｂ' => 'b', 'ｃ' => 'c', 'ｄ' => 'd',
'ｅ' => 'e', 'ｆ' => 'f', 'ｇ' => 'g', 'ｈ' => 'h', 'ｉ' => 'i',
'ｊ' => 'j', 'ｋ' => 'k', 'ｌ' => 'l', 'ｍ' => 'm', 'ｎ' => 'n',
'ｏ' => 'o', 'ｐ' => 'p', 'ｑ' => 'q', 'ｒ' => 'r', 'ｓ' => 's',
'ｔ' => 't', 'ｕ' => 'u', 'ｖ' => 'v', 'ｗ' => 'w', 'ｘ' => 'x',
'ｙ' => 'y', 'ｚ' => 'z', '－' => '-', '、' => ', ', '　' => ' ',
'／' => '/',);
sub slurp {
my $file = shift;
open my $fh_read, '<', $file or die "Could not open file: $!";
return do {local $/; <$fh_read>};
}
sub convert {
my $sub_string = shift;
$sub_string =~ s/(.)/$fwhw{$1}?$fwhw{$1}:$1/seg;
return $sub_string;
}
my $string = slurp($ARGV[0]);
$string =~ s/<target>\s*<g id="\d+">\K(.*?)(?=<\/g>\s*<\/target>)/convert($1)/seg;
open my $fh_write, ">", $ARGV[1] or die "Could not open file: $!";
print $fh_write $string;
close $fh_write;
Here is what I have tried
I have made sure that the number 0 (zero) and the letter O (oh) are indeed different by checking their code points. Full width 0 is \x{ff10}. Full width letter O is \x{ff2f}. I checked this using this code:
use Encode;
sub codepoint_hex {
sprintf "%04x", ord Encode::decode("UTF-8", shift);
}
my $codepoint = codepoint_hex('０');
print $codepoint, "\n";
I have checked that the hash is indeed loading all of the keys and values correctly.
What I haven't tried yet:
I haven't tried to duplicate this oddity on Linux yet. I am using ActiveState Perl 5.24 on Windows 10.
If anyone has any suggestions or sees my mistake, I would be very grateful for the guidance.
Since $fwhw{'０'} returns the string '0', and since '0' is false in Perl, the conditional takes the else branch and the replacement doesn't occur. Replace
$sub_string =~ s/(.)/$fwhw{$1}?$fwhw{$1}:$1/seg;
with
$sub_string =~ s/(.)/exists($fwhw{$1})?$fwhw{$1}:$1/seg;
If that still doesn't work, use sprintf "%vX", $str to see what you really have.
By the way,
sub convert {
my $sub_string = shift;
$sub_string =~ s/(.)/exists($fwhw{$1})?$fwhw{$1}:$1/seg;
return $sub_string;
}
would be much faster if replaced with the following (state requires use feature 'state', and the /r flag requires Perl 5.14 or later):
sub convert {
state $chars = join '', keys(%fwhw);
state $re = qr/([\Q$chars\E])/;
return $_[0] =~ s/$re/$fwhw{$1}/gr;
}
Faster yet,
sub convert {
state $s = join '', keys(%fwhw);
state $r = join '', values(%fwhw);
state $tr = eval("sub { \$_[0] =~ tr/\Q$s\E/\Q$r\E/r }");
return $tr->($_[0]);
}
You don't need such a huge dictionary with lots of supporting functions like that. Just a simple sed is enough
halfwidth='!"#$%&'\''()*+,-.\/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~⦅⦆¢£¬¯¦¥₩ '
fullwidth='！＂＃＄％＆＇（）＊＋，－．／０１２３４５６７８９：；＜＝＞？＠ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ［＼］＾＿｀ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ｛｜｝～｟｠￠￡￢￣￤￥￦　'
sed -i -e "y/$fullwidth/$halfwidth/" your_file
If you want to do that in perl it's pretty simple too
perl -Mutf8 -i -C -pe 'BEGIN{ use open qw/:std :utf8/; } tr#！＂＃＄％＆＇（）＊＋，－．／０１２３４５６７８９：；＜＝＞？＠ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ［＼］＾＿｀ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ｛｜｝～｟｠￠￡￢￣￤￥￦　#!"\#$%&'\''()*+,-.\/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~⦅⦆¢£¬¯¦¥₩ #' your_file
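For comparison, the same character mapping can be sketched in Python. It relies on the fact that the full-width forms U+FF01 to U+FF5E mirror ASCII U+0021 to U+007E at a fixed offset of 0xFEE0, so no hand-written dictionary is needed. The names `FULL_TO_HALF` and `to_halfwidth` are illustrative, and the sketch covers only the ASCII variants and the ideographic space, not the extra symbols listed in the sed answer.

```python
# Full-width ASCII variants sit exactly 0xFEE0 above their ASCII counterparts.
FULL_TO_HALF = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
FULL_TO_HALF[0x3000] = 0x20  # ideographic space -> ASCII space


def to_halfwidth(text):
    """Replace full-width ASCII variants with their half-width forms."""
    # str.translate looks characters up by code point, so a mapped '0'
    # is handled like any other character (no truthiness test involved).
    return text.translate(FULL_TO_HALF)


print(to_halfwidth("ＡＢＣ　０１２３"))
```

If a broader compatibility folding is acceptable, `unicodedata.normalize("NFKC", text)` from the standard library also converts these characters, but it changes many other characters as well.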

Why does the implementation of set() in Python involve randomization?

I already know that when I convert a list A to a set and then back to a list (call it B), the order changes. But I'm confused about why the order of list B also changes every time the code is run.
This is my code:
if __name__ == '__main__':
    my_list = ['1', '2', '3', '4', '5']
    print(my_list)
    print(list(set(my_list)))
And this is the result:
1st run:
['1', '2', '3', '4', '5']
['2', '3', '5', '1', '4']
2nd run:
['1', '2', '3', '4', '5']
['2', '4', '3', '5', '1']
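The usual explanation for this behaviour is string hash randomization: since Python 3.3, `str` hashes are salted with a per-process random value by default, and the iteration order of a set depends on the hashes of its elements. The sketch below, which is only an experiment and not an official recipe, runs the same snippet in two child interpreters with the `PYTHONHASHSEED` environment variable fixed, so both runs see identical hashes. The helper name `run_with_seed` is made up for the example.

```python
import os
import subprocess
import sys

SNIPPET = "print(list(set(['1', '2', '3', '4', '5'])))"


def run_with_seed(seed):
    """Run SNIPPET in a fresh interpreter with a fixed string-hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


first = run_with_seed(0)
second = run_with_seed(0)
print(first)
print(second)
print("same order:", first == second)
```

With the seed fixed, the order is still arbitrary but repeats from run to run. Code that needs a stable order should sort the result or avoid the round trip through a set.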

How to read data from text file using python2.7?

Can anybody help me read numbers from a file in Python and append each number to an array?
I wrote the following code; it does the job, but it reads 10 as two numbers:
with open("test.dat") as infile:
    for i, line in enumerate(infile):
        if i == 0:
            for x in range(0, len(line)):
                if(line[x] == ' ' or line[x] == " "):
                    continue
                else:
                    print(x, " " , line[x], ", ")
                    initial_state.append(line[x])
---Results:
(0, ' ', '1', ', ')
(2, ' ', '2', ', ')
(4, ' ', '3', ', ')
(6, ' ', '4', ', ')
(8, ' ', '5', ', ')
(10, ' ', '6', ', ')
(12, ' ', '7', ', ')
(14, ' ', '8', ', ')
(16, ' ', '9', ', ')
(18, ' ', '1', ', ')
(19, ' ', '0', ', ')
(21, ' ', '1', ', ')
(22, ' ', '1', ', ')
(24, ' ', '1', ', ')
(25, ' ', '2', ', ')
(27, ' ', '1', ', ')
(28, ' ', '3', ', ')
(30, ' ', '1', ', ')
(31, ' ', '4', ', ')
(33, ' ', '1', ', ')
(34, ' ', '5', ', ')
(36, ' ', '0', ', ')
(37, ' ', '\n', ', ')
The indices include the spaces. This is the line of numbers I'm trying to add to the array:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0
Use .split() to break the line into fields and then loop over them. The following code should do it:
with open("test.dat") as infile:
    for i, line in enumerate(infile):
        if i == 0:  # first line only
            # split on whitespace, so "10" stays a single field
            fields = line.split()
            for x in range(0, len(fields)):
                initial_state_arr.append(fields[x])
If you are sure each number is separated by a single space, why not just split the line and print each element:
with open("test.dat") as infile:
    content = infile.read().split()
    for index, number in enumerate(content):
        print ((index*2, number))
And what is your exact input and expected result? Does the file have multiple spaces between numbers?
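Putting the suggestions above together, a minimal sketch could look like the following. It assumes the numbers are whitespace-separated integers on the first line of the file; the helper name `read_first_line_numbers` is made up for the example, and a temporary file stands in for test.dat so the snippet is self-contained.

```python
import os
import tempfile


def read_first_line_numbers(path):
    """Return the whitespace-separated integers on the first line of a file."""
    with open(path) as infile:
        first_line = infile.readline()
    # split() with no argument splits on any run of whitespace and drops
    # the trailing newline, so "10" stays a single token.
    return [int(token) for token in first_line.split()]


# Stand-in for test.dat, holding the line shown in the question.
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.write("1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0\n")
    path = tmp.name

initial_state = read_first_line_numbers(path)
os.remove(path)
print(initial_state)
```

Splitting on whitespace instead of walking the line character by character is what keeps multi-digit numbers together.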

Text to a list python

I'm trying to read a sudoku from a text file and put it in a list.
The file looks like this:
0,0,0,0,7,0,2,6,0
0,6,0,8,0,2,0,3,5
0,0,5,3,0,0,0,7,0
0,7,6,0,0,0,0,2,0
0,8,9,6,0,0,0,4,0
0,3,0,5,4,0,0,8,0
0,0,0,2,8,0,0,0,0
0,2,0,4,0,0,0,0,3
0,0,8,7,0,3,6,0,0
I need to convert it into a list like this:
board = [['0', '0', '0', '0', '7', '0', '2', '6', '0'], ['0', '6', '0', '8',
'0', '2', '0', '3', '5'], ['0', '0', '5', '3', '0', '0', '0', '7', '0'],
['0','7', '6', '0', '0', '0', '0', '2', '0'], ['0', '8', '9', '6', '0',
'0', '0','4', '0'], ['0', '3', '0', '5', '4', '0', '0', '8', '0'],
['0', '0', '0', '2','8', '0', '0', '0', '0'], ['0', '2', '0', '4', '0',
'0', '0', '0', '3'], ['0','0', '8', '7', '0', '3', '6', '0', '0']]
I'm using this code, but I have a problem:
tablero = open('sd1.txt', 'r')
board = [line.split(',') for line in tablero.readlines()]
The result is:
board = [['0', '0', '0', '0', '7', '0', '2', '6', '0\n'], ['0', '6', '0',
'8', '0', '2', '0', '3', '5\n'], ['0', '0', '5', '3', '0', '0', '0', '7',
'0\n'], ['0', '7', '6', '0', '0', '0', '0', '2', '0\n'], ['0', '8', '9',
'6', '0', '0', '0', '4', '0\n'], ['0', '3', '0', '5', '4', '0', '0', '8',
'0\n'], ['0', '0', '0', '2', '8', '0', '0', '0', '0\n'], ['0', '2', '0',
'4', '0', '0', '0', '0', '3\n'], ['0', '0', '8', '7', '0', '3', '6', '0',
'0\n']]
Use .strip() to remove leading and trailing whitespace (including the trailing newline that is causing your trouble):
board = [line.strip().split(',') for line in tablero.readlines()]
Since the problem is only at the end of the line, you can use rstrip() instead. It works like the strip() in Jez's answer but removes whitespace only from the right side of the string.
board = [line.rstrip().split(',') for line in tablero.readlines()]
I guess you need to remove the '\n' by using line.strip('\n\r').
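Or you could use line[:-1].split(','), which drops the last character of each line. This removes the newline, but it also cuts off the final digit if the last line of the file has no trailing newline.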
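The difference between these variants shows up on a last line that lacks a trailing newline. The sketch below uses an in-memory file and a hypothetical helper `load_board` to illustrate this; the three sample rows are taken from the question, and skipping blank lines is an extra assumption.

```python
import io

# Three rows from the question; the last one deliberately has no newline.
SAMPLE = "0,0,0,0,7,0,2,6,0\n0,6,0,8,0,2,0,3,5\n0,0,5,3,0,0,0,7,0"


def load_board(lines):
    """Split each non-blank line on commas after removing the line ending."""
    return [line.rstrip("\r\n").split(",") for line in lines if line.strip()]


board = load_board(io.StringIO(SAMPLE))
for row in board:
    print(row)

# Slicing off the last character instead would damage the final row.
sliced = [line[:-1].split(",") for line in io.StringIO(SAMPLE)]
print(sliced[-1][-1] == "")
```

Stripping only line-ending characters leaves the data untouched whether or not the file ends with a newline.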

Find if certain characters are present in a string

I have the following code and want to see if the string 'userFirstName' contains any of the characters in the char array. If the string does I want it to ask the user to reenter their first name and then check the new name for invalid characters and so on.
char invalidCharacter[] = { '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '~', '`',
';', ':', '+', '=', '-', '_', '*', '/', '.', '<', '>', '?', ',', '[', ']', '{', '}',
'0', '1', '2', '3', '4', '5', '6', '7', '8', '9' };
cout << "Please enter your first name: " << endl;
cin >> userFirstName;
Use string::find_first_of to do it.
Assuming that userFirstName is a string:
size_t pos = userFirstName.find_first_of(invalidCharacter, 0, sizeof(invalidCharacter));
if (pos != string::npos) {
// username contains an invalid character at index pos
}