How to print a Perl character class? - regex

I was in a code review this morning and came across a bit of code that was wrong, but I couldn't tell why.
$line =~ /^[1-C]/;
This line was supposed to match a hex character between 1 and C, but I assume it does not do that. The question is not what it should match, but what it actually matches. Can I print out all characters in a character class? Something like below?
say join(', ', [1-C]);
Alas,
# Examples:
say join(', ', 1..9);
say join(', ', 'A'..'C');
say join(', ', 1..'C');
# Output
Argument "C" isn't numeric in range (or flop) at X:\developers\PERL\Test.pl line 33.
1, 2, 3, 4, 5, 6, 7, 8, 9
A, B, C

It matches every code point from U+0031 ("1") to U+0043 ("C").
The simple answer is to use
map chr, ord("1")..ord("C")
instead of
"1".."C"
as you can see in the following demonstration:
$ perl -Mcharnames=:full -E'
say sprintf " %s U+%05X %s", chr($_), $_, charnames::viacode($_)
for ord("1")..ord("C");
'
1 U+00031 DIGIT ONE
2 U+00032 DIGIT TWO
3 U+00033 DIGIT THREE
4 U+00034 DIGIT FOUR
5 U+00035 DIGIT FIVE
6 U+00036 DIGIT SIX
7 U+00037 DIGIT SEVEN
8 U+00038 DIGIT EIGHT
9 U+00039 DIGIT NINE
: U+0003A COLON
; U+0003B SEMICOLON
< U+0003C LESS-THAN SIGN
= U+0003D EQUALS SIGN
> U+0003E GREATER-THAN SIGN
? U+0003F QUESTION MARK
@ U+00040 COMMERCIAL AT
A U+00041 LATIN CAPITAL LETTER A
B U+00042 LATIN CAPITAL LETTER B
C U+00043 LATIN CAPITAL LETTER C
If you have Unicode::Tussle installed, you can get the same output from the following shell command:
unichars -au '[1-C]'
You might be interested in wasting time browsing the Unicode code charts. (This particular range is covered by "Basic Latin (ASCII)".)
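For comparison, the same enumeration can be sketched in Python rather than Perl. The helper name chars_in_range is made up for illustration; the sketch walks the code points from ord("1") to ord("C") and looks up each name with the standard unicodedata module.

```python
import unicodedata


def chars_in_range(first, last):
    """Return every character from `first` to `last`, inclusive, by code point.

    Hypothetical helper: it mirrors what a regex range such as [1-C] covers.
    """
    return [chr(cp) for cp in range(ord(first), ord(last) + 1)]


for ch in chars_in_range("1", "C"):
    # Same columns as the Perl one-liner: character, code point, Unicode name.
    print(f"{ch} U+{ord(ch):05X} {unicodedata.name(ch)}")
```

The list contains 19 characters, including the seven punctuation characters that sit between "9" and "A".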

This is a simple program to test the range of that regex:
use strict;
use warnings;
use Test::More qw(no_plan);
for(my $i=ord('1'); $i<=ord('C'); $i++ ) {
my $char = chr($i);
ok $char =~ /^[1-C]/, "match: $char";
}
It generates this result:
ok 1 - match: 1
ok 2 - match: 2
ok 3 - match: 3
ok 4 - match: 4
ok 5 - match: 5
ok 6 - match: 6
ok 7 - match: 7
ok 8 - match: 8
ok 9 - match: 9
ok 10 - match: :
ok 11 - match: ;
ok 12 - match: <
ok 13 - match: =
ok 14 - match: >
ok 15 - match: ?
ok 16 - match: @
ok 17 - match: A
ok 18 - match: B
ok 19 - match: C
1..19

[1-9A-C] is the class that matches a hex digit between 1 and C.
A range [x-y] matches every character between x and y in the Unicode table.
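A minimal Python sketch of the difference between the two classes, assuming Python's re module treats these simple ranges the same way Perl does (it orders them by code point as well). The variable names are illustrative only.

```python
import re

loose = re.compile(r"^[1-C]")      # the class from the code review
strict = re.compile(r"^[1-9A-C]")  # the class that was intended

# Try every character from "0" to "F" against both classes.
candidates = [chr(cp) for cp in range(ord("0"), ord("F") + 1)]
loose_hits = [c for c in candidates if loose.match(c)]
strict_hits = [c for c in candidates if strict.match(c)]

# Characters accepted by [1-C] that are not hex digits at all.
extra = [c for c in loose_hits if c not in strict_hits]
print("only in [1-C]:", " ".join(extra))
```

The characters listed in extra are exactly the ones the original pattern lets through by accident.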

Related

Capture 1-9 after the last occurrence of 0

I want to capture all digits between 1 and 9 after the last occurrence of zero, except when the zero is the last digit. I tried this pattern, but it doesn't seem to work.
Pattern: [1-9].*
DATA
0100179835
3000766774
1500396843
1500028408
1508408637
3105230262
3005228061
3105228407
3105228940
0900000000
2100000000
0800000000
1000000001
2200000001
0800000001
1300000001
1000000002
2200000002
0800000002
1300000002
1000000003
2200000003
0800000003
1300000003
1000000004
2200000004
0800000004
1300000004
1000000005
2200000005
0800000005
1300000005
1000000006
2300000006
0800000006
0900000006
1000000007
2300000007
0900000007
0800000007
1000000008
2300000008
0900000008
0800000008
1100000009
2300000009
0900000009
0800000009
1000005217
2000000429
1100000020
1000005000
3000000070
2000000400
1000020000
3000200000
2906000000
Desired Result
179835
766774
396843
28408
8408637
5230262
5228061
5228407
5228940
0
0
0
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
5
5
5
5
6
6
6
6
7
7
7
7
8
8
8
8
9
9
9
9
5217
429
20
5000
70
400
20000
200000
6000000
You can anchor the end of the string and match non-zero digits with an optional trailing zero. Ensure that there is at least one matching digit with a positive lookahead pattern:
(?=\d)[1-9]*0?$
Demo: https://regex101.com/r/uggV37/2
To get desired result:
(?:^0*[1-9]+0*\K0|0\K[1-9]+(?:0[1-9]*|0+)?)$
Explanation
(?: Non capture group for the alternatives
^ Start of string
0*[1-9]+0* Match 1+ digits 1-9 between optional zeroes
\K0 Forget what is matched so far and then match a zero
| Or
0\K Match a zero and forget what is matched so far
[1-9]+ Match 1+ digits 1-9
(?: Non capture group for the alternatives
0[1-9]* Match a zero and optional digits 1-9
| Or
0+ Match 1+ zeroes
)? Close the non capture group
) Close the non capture group
$ End of string
See a regex demo.
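Python's standard re module does not support \K, so a port of the pattern above has to use capture groups instead. The following is a sketch under that assumption; the function name tail_after_zero and the choice of sample values (taken from the question's data) are illustrative.

```python
import re

# Same alternation as above, with each \K replaced by a capture group
# around the part that should be kept.
PATTERN = re.compile(r"(?:^0*[1-9]+0*(0)|0([1-9]+(?:0[1-9]*|0+)?))$")


def tail_after_zero(number):
    """Return the wanted tail of `number`, or None if nothing matches."""
    m = PATTERN.search(number)
    if m is None:
        return None
    # Exactly one of the two groups takes part in a successful match.
    return m.group(1) or m.group(2)


for sample in ["0100179835", "1500028408", "0900000000", "1000005000", "2906000000"]:
    print(sample, "->", tail_after_zero(sample))
```

Group 1 covers the case where only a trailing zero should be returned; group 2 covers everything else.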
Match one item per line:
'0123056'.match(/(?<=0)[1-9]*0?$/g).filter(m => m != '')
Match multiple items per line:
'0123056 0000210 1205000 1204566 0123456 0012340 0123400'.match(/(?<=0)[1-9]*0?\b/g).filter(m => m != '')

Detecting Special Characters with Regular Expression in python?

df
Name
0 ##
1 R##
2 ghj##
3 Ray
4 *#+
5 Jack
6 Sara123#
7 ( 1234. )
8 Benjamin k 123
9 _
10 _!##_
11 _#_&#+-
12 56##!
Output:
Bad_Name
0 ##
1 *#+
2 _
3 _!##_
4 _#_&#+-
I need to detect special characters with a regular expression. If a string contains any letter or digit, it is valid; otherwise it is considered a bad string.
I was using the regex '^\W*$'. Everything worked fine, except that a string containing '_' (underscore) is not treated as a bad string.
Use pandas.Series.str.contains:
df[~df['Name'].str.contains('[a-z0-9]', case=False)]
Output:
Name
0 ##
4 *#+
9 _
10 _!##_
11 _#_&#+-
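The underscore slips through '^\W*$' because \w includes "_", so \W never matches it. A small sketch with only the standard re module (no pandas) shows both behaviours on the names from the question; the helper is_bad is a hypothetical name.

```python
import re

names = ["##", "R##", "ghj##", "Ray", "*#+", "Jack", "Sara123#",
         "( 1234. )", "Benjamin k 123", "_", "_!##_", "_#_&#+-", "56##!"]


def is_bad(name):
    """A name is bad when it contains no ASCII letter or digit at all."""
    return re.search(r"[A-Za-z0-9]", name) is None


# The pattern from the question: misses every name containing an underscore.
original = [n for n in names if re.search(r"^\W*$", n)]
# The letter-or-digit test: treats underscores like any other special character.
bad = [n for n in names if is_bad(n)]

print("^\\W*$ finds:", original)
print("is_bad finds:", bad)
```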

Regex for float numbers with two decimals in 0-1 range

I am trying to make a HTML pattern / regex to allow only float numbers between 0 and 1 with maximum two decimals.
So, the following will be correct:
0
0.1
0.9
0.11
0.99
1
And these will be incorrect:
00
0.111
0.999
1.1
2
10
I have no knowledge of regex and I don't understand its syntax and I haven't found one online tool to generate a regex.
I've come up with something from what I've gathered from online examples:
^(0[0-1]|\d)(\.\d{1,2})?$
I have added 0[0-1] to set a 0-1 range but it does not work. This regex matches every number between 0 and 9 that can also have maximum 2 decimals.
Try using an alternation where the 0 part can be followed by an optional dot and 2 digits and the 1 part can be followed by an optional dot and 1 or 2 times a zero.
^(?:0(?:\.\d{1,2})?|1(?:\.0{1,2})?)$
^ Start of string
(?: Non capturing group
0(?:\.\d{1,2})? Match 0 and optionally a dot and 1-2 digits
| Or
1(?:\.0{1,2})? Match 1 and optionally a dot and 1-2 zeroes
) Close group
$ End of string
Regex demo
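As a quick check, the pattern can be run against the examples from the question. This sketch uses Python's re module, on the assumption that the pattern behaves the same there as in an HTML pattern attribute for these inputs.

```python
import re

PATTERN = re.compile(r"^(?:0(?:\.\d{1,2})?|1(?:\.0{1,2})?)$")

valid = ["0", "0.1", "0.9", "0.11", "0.99", "1"]
invalid = ["00", "0.111", "0.999", "1.1", "2", "10"]

for value in valid + invalid:
    verdict = "ok" if PATTERN.match(value) else "rejected"
    print(f"{value:>6} {verdict}")
```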
If you are not at ease with regex, you can use some code to check whether the input meets your needs, such as:
function ValidateNumber(num)
{
const floatNumber = Number(num);
// NaN never compares equal to anything, so use Number.isNaN instead of != NaN.
return !Number.isNaN(floatNumber) && 0 <= floatNumber && floatNumber <= 1 && ('' + num).length <= 4;
}
const TestArray = [ '42', 42, 0, '0', '1', '1.00', '1.01', '0.01', '0.99', '0.111', 'zero' ]
TestArray.forEach(function(element) {
console.log(element + ' is ' + (ValidateNumber(element) ? '' : 'not ') + 'a valid number');
});

Removing special characters while retaining alpha numeric words

I'm in the middle of cleaning a data set that has this:
[IN]
my_Series = pd.Series(["-","ASD", "711-AUG-M4G","Air G2G", "Karsh"])
my_Series.str.replace("[^a-zA-Z]+", " ")
[OUT]
0
1 ASD
2 AUG M G
3 Air G G
4 Karsh
[IDEAL OUT]
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
My goal is to remove special characters and numbers, but if there's a word that contains both letters and digits, it should stay. Can anyone help?
Try with apply to achieve your ideal output.
>>> my_Series = pd.Series(["-","ASD", "711-AUG-M4G","Air G2G", "Karsh"])
Output:
>>> my_Series.apply(lambda x: " ".join(['' if word.isdigit() else word for word in x.replace('-', ' ').split()]))
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
dtype: object
Explanation:
I replace - with a space and split the string on spaces, then check whether each word consists only of digits.
If it does, it is replaced with an empty string; otherwise the word is kept.
Finally, the list is joined back together.
Edit 1:
regex solution :-
>>> my_Series.str.replace("((\d+)(?=.*\d))|([^a-zA-Z0-9 ])", " ")
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
dtype: object
Explanation:
Using lookaround.
((\d+)(?=.*\d))|([^a-zA-Z0-9 ])
(a run of digits that is followed later in the string by another digit) OR (any character that is not a letter, a digit, or a space)
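A token-based alternative, sketched without pandas: split on any run of non-alphanumeric characters instead of only on "-", then drop the tokens that are purely digits. The function name keep_words is made up here; with pandas it could be passed to Series.apply.

```python
import re


def keep_words(text):
    """Drop special characters and all-digit tokens, keep mixed tokens like M4G."""
    tokens = re.split(r"[^A-Za-z0-9]+", text)
    # Empty strings appear when the text starts or ends with a separator.
    kept = [t for t in tokens if t and not t.isdigit()]
    return " ".join(kept)


for item in ["-", "ASD", "711-AUG-M4G", "Air G2G", "Karsh"]:
    print(repr(keep_words(item)))
```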

string padded with optional blank with max length

I have a problem building a regex. this is a sample of the text:
text 123 12345 abc 12 def 67 i 89 o 0 t 2
The numbers are sometimes padded with blanks to the max length (3).
e.g.:
"1" can be "1" or "1 "
"13" can be "13" or "13 "
My regex is at the moment this:
\b([\d](\s*)){1,3}\b
The results of this regex are the following: (. = blank for better visibility)
123.
12....
67.
89.
0....
2
But I need this: (. = blank for better visibility)
123
12.
67.
89.
0..
2
How can I tell the regex engine to count the blanks into the {1,3} option?
Try this:
\b(?:\d[\d\s]{0,2})(?:(?<=\s)|\b)
This will also cover strings like text 123 1 23 12345 123abc 12 def 67 i 89 o 0 t 2 and results in:
123
1.
23.
12.
67.
89.
0..
2
Does this do what you want?
\b(\d){1,3}\s*\b
This will also include whitespace (if available) after the selection.
I think you want this
\b(?:\d[\d\s]{0,2})(?!\d)
See it here on Regexr
The word boundary will not work at the end, because if the match ends in whitespace, there is no word boundary. Therefore I use a negative lookahead (?!\d) to ensure that no digit follows.
But if you have a string like "1 23", it will match only the "1" and the "23", but not the whitespace after the "1".
Assuming you want to use the padded numbers somewhere else, break the problem apart into two; (simple) parsing the numbers, and (simple) formatting the numbers (including padding).
while ( $text =~ /\b(\d{1,3})\b/g ) {
printf( "%-3d\n", $1 );
}
Alternatively:
@padded_numbers = map { sprintf( "%-3d", $_ ) } ( $text =~ /\b(\d{1,3})\b/g );
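The same parse-then-format split can be sketched in Python. The function name padded_numbers and the width parameter are assumptions for illustration; the idea is to extract the bare numbers first and pad them afterwards, so the regex never has to count blanks.

```python
import re


def padded_numbers(text, width=3):
    """Find standalone numbers of up to `width` digits and left-justify them."""
    pattern = rf"\b\d{{1,{width}}}\b"
    return [number.ljust(width) for number in re.findall(pattern, text)]


text = "text 123 12345 abc 12 def 67 i 89 o 0 t 2"
for number in padded_numbers(text):
    # Show blanks as dots, as in the question.
    print(number.replace(" ", "."))
```

Longer runs such as 12345 are skipped, because no word boundary falls inside a run of digits.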