RegEx match all word characters, with umlauts from different languages [duplicate]

This question already has answers here:
Why do Perl string operations on Unicode characters add garbage to the string?
(7 answers)
Closed 3 years ago.
I want to check whether a person's name is valid.
The check should accept Latin letters, including umlauts and accented characters (e.g. öäüÖÄÜé).
Unfortunately, nothing I have tried works.
According to several sources, such as
https://www.regular-expressions.info/unicode.html
Regex for word characters in any language
\p{L} should work, but it doesn't work for me.
Do I have to use a library for this?
use strict;
use warnings;
my $test = "testString";
print $1 if ($test =~ m/^(\p{L}+)$/); #testString
$test = "testStringö";
print $1 if ($test =~ m/^(\p{L}+)$/); #no print msg
$test = "testéString";
print $1 if ($test =~ m/^(\p{L}+)$/); #no print msg

You need to tell Perl that the source code of your file is encoded in UTF-8. Add
use utf8;
after
use strict;
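A minimal sketch of the fix applied to the test strings from the question, assuming the file is saved as UTF-8. The binmode call and the extra "test123" entry are additions for illustration, not part of the original code; binmode is only there so that the non-ASCII letters are encoded on output.

```perl
use strict;
use warnings;
use utf8;                            # string literals in this file are UTF-8 text
binmode STDOUT, ':encoding(UTF-8)';  # assumption: the terminal expects UTF-8

# The three strings from the question, plus one that should be rejected.
my @names = ("testString", "testStringö", "testéString", "test123");

# \p{L} matches any Unicode letter once the literals are decoded characters.
my @valid = grep { /^\p{L}+$/ } @names;

print "valid: $_\n" for @valid;
```

With use utf8 in place, the first three strings pass and "test123" is rejected because of its digits.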

Related

perl regrex that captures substring between tic marks

I am trying to find a solution in perl that captures the filename in the following string -- between the tic marks.
my $str = "Saving to: ‘wapenc?T=mavodi-7-13b-2b-3-96-1e3431a’";
(my $results) = $str =~ /‘(.*?[^\\])‘/;
print $results if $results;
I need to end up with wapenc?T=mavodi-7-13b-2b-3-96-1e3431a

The closing tick in your regex differs from the one in the input string: the regex ends with character 8216 (LEFT SINGLE QUOTATION MARK, U+2018), whereas the string ends with character 8217 (RIGHT SINGLE QUOTATION MARK, U+2019). Also, when using Unicode characters in the source, be sure to include
use utf8;
and save the file UTF-8 encoded.
After fixing these two issues, the code worked for me:
#! /usr/bin/perl
use warnings;
use strict;
use utf8;
my $str = "Saving to: ‘wapenc?T=mavodi-7-13b-2b-3-96-1e3431a’";
(my $results) = $str =~ /‘(.*?[^\\])’/;
print $results if $results;
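Because the two quotation marks are easy to confuse on screen, one option is to write them as code point escapes. The following is only a sketch of that idea, not the answerer's code; it needs neither use utf8 nor a particular file encoding, since the source stays pure ASCII.

```perl
use strict;
use warnings;

# \x{2018} is LEFT SINGLE QUOTATION MARK, \x{2019} is RIGHT SINGLE QUOTATION MARK.
my $str = "Saving to: \x{2018}wapenc?T=mavodi-7-13b-2b-3-96-1e3431a\x{2019}";

# Capture everything between the opening and the closing mark.
my ($results) = $str =~ /\x{2018}(.*?)\x{2019}/;

print "$results\n" if defined $results;
```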

Your tick characters aren't in the 7-bit ASCII character set, so there is a whole character-encoding rabbit hole to go down here. But the quick and dirty solution is to capture everything between the non-ASCII characters.
my ($result) = $str =~ /[^\0-\x7f]+(.*?)[^\0-\x7f]/;
[^\0-\x7f] matches any character whose value is not between 0 and 127, i.e., anything that is not a 7-bit ASCII character (the ASCII range itself includes newlines, tabs, and other control characters). This regular expression will work whether your input is UTF-8 encoded or has already been decoded, and may work for other character encodings, too.
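To illustrate the claim about encoded versus decoded input, here is a small sketch that runs the same pattern against both forms of the string. The variable names are made up for the example, and Encode is used only to produce the UTF-8 byte string.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $pattern = qr/[^\0-\x7f]+(.*?)[^\0-\x7f]/;

# Decoded text: each quotation mark is a single character above 127.
my $decoded = "Saving to: \x{2018}wapenc?T=mavodi-7-13b-2b-3-96-1e3431a\x{2019}";

# UTF-8 bytes: each quotation mark is three bytes, all above 127.
my $encoded = encode('UTF-8', $decoded);

my ($from_decoded) = $decoded =~ $pattern;
my ($from_encoded) = $encoded =~ $pattern;

print "$from_decoded\n$from_encoded\n";
```

In both cases the capture contains only ASCII characters, so the two results are the same string.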

String to split a complicated string in Perl [duplicate]

This question already has answers here:
How can I parse quoted CSV in Perl with a regex?
(7 answers)
Closed 7 years ago.
I have a string that looks something like this:
'som,ething', another, 'thin#'g', 'her,e', gggh*
I am trying to get it to split on the commas that are NOT in the elements, like this:
'som,ething'
another
'thin#'g'
'her,e'
gggh*
I am using parse_line(q{,}, 1, $string) but it seems to fail when the string has single quotes in it. Is there something I'm missing?

#!/usr/bin/perl
use strict;
use warnings;
my $string = q{'som,ething', another, 'thin'g', 'her,e', gggh*};
my @splitted = split(/,(?=\s+)/, $string);
print $_."\n" foreach @splitted;
Output:
'som,ething'
another
'thin'g'
'her,e'
gggh*

It looks like you're trying to parse comma-separated values. The answer is to use Text::CSV_XS since that handles the various weird cases you're likely to find in the data. See How can I parse quoted CSV in Perl with a regex?

Using split is not the way to go. If you are sure your string is well formatted, a global match is simpler. For example:
my $line = "'som,ething', another , 'thin#'g', 'her,e' , gggh*";
my @list = $line =~ /\s*('[^#']*(?:#.[^#']*)*+'|[^,]+(?<=\S))\s*/g;
print join("|", @list);
(the (?<=\S) is only here to trim items on the right)

Replacing backslashes with two backslashes using regex [duplicate]

This question already has answers here:
Perl regex: replace all backslashes with double-backslashes
(6 answers)
Closed 8 years ago.
How do I replace single backslashes in a string with double backslashes?
I've tried things such as
s/\\(?!\\)/\\\\/g
s/\\/\\\\/g
s/[^//]/\\\\/g
But they all produce multiple backslashes after each other.
So I want:
\test
to be replaced with
\\test
Edit: Sorry I should also mention that the regex is in a loop so I need a regex that only matches the string if there is ONLY ONE backslash. Once there is more than one backslash then the regex should reject the string. Apologies

The most helpful thing to note is to use a different delimiter for the regex, so things don't get jumbled by all the leaning towers:
my $str = '\test';
$str =~ s{\\}{\\\\}g;
print $str;
Outputs:
\\test
Update
Per your revised specification, if you only want to escape a single backslash, and ignore all others, then just use a negative lookahead and lookbehind assertion:
my $str = <<'END_STR';
\one \\two \\\three
END_STR
print $str;
$str =~ s{(?<!\\)\\(?!\\)}{\\\\}g;
print $str;
Outputs:
\one \\two \\\three
\\one \\two \\\three

echo '\replace' | perl -pe 's/\\/\\\\/g'
\\replace
Or with sed:
echo '\replace' | sed 's/\\/\\\\/g'
\\replace

Perl alphabetic sort with alphanumeric string and regex

I am trying to sort alphanumeric strings with Perl and am facing the following problem:
The strings look like this: "XXXXX-1.0.0", where each number can have one or two digits and represents a release version.
My problem is that when I have the following two strings:
- XXXXX-1.9.9
- XXXXX-1.10.0
XXXXX-1.9.9 is considered greater than XXXXX-1.10.0, since 9 is greater than 1.
So I am trying to force the numbers to have two digits with a regexp.
Here is the piece of code I am testing:
my $string = "XXXXXX-1.9.9";
my $pre = "0";
$string =~ s/(\.|-)+(\d{1})($|\.)/$1$pre$2$3/g;
print "$string\n";
This gives me the result "XXXXXX-01.9.09", which is not what I am looking for since it will not be sorted correctly. So I have to do this:
my $string = "XXXXXX-1.9.9";
my $pre = "0";
$string =~ s/(\.|-)+(\d{1})($|\.)/$1$pre$2$3/g;
$string =~ s/(\.|-)+(\d{1})($|\.)/$1$pre$2$3/g;
print "$string\n";
to get "XXXXXX-01.09.09".
My question is twofold:
- Is there a way to sort my strings with Perl without using regex?
- If I have to use regex, is there a way to write it so that I do not have to execute it twice?
Thank you in advance.

You can use lookaround assertions to make sure there are no digits on either side of the digit you are replacing. Because lookarounds do not consume the neighbouring characters, a single pass pads every lone digit.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my @strings = qw( XXXXX-1.9.9
XXXXX-1.9.10
XXXXX-1.10.0 );
s/(?<=[^0-9]) ([0-9]) (?=[^0-9]|$) /0$1/xg for @strings;
@strings = sort @strings;
s/(?<=[^0-9]) 0+ ([0-9]) /$1/xg for @strings;
say for @strings;
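As an alternative to padding and un-padding, the version components can be compared numerically. The following is only a sketch of that approach, not the answerer's method. The helpers version_key and compare_keys are hypothetical names, and it assumes every string has the form NAME-N.N.N; a regex is still used to pull the string apart, but the strings themselves are never modified.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "XXXXX-1.10.0" into [ "XXXXX", 1, 10, 0 ].
# Strings that do not match the assumed format are kept as a name only.
sub version_key {
    my ($string) = @_;
    my ($name, $version) = $string =~ /^(.*)-([0-9]+(?:\.[0-9]+)*)$/
        or return [ $string ];
    return [ $name, split /\./, $version ];
}

# Hypothetical helper: compare names as strings, then each component as a number.
sub compare_keys {
    my ($x, $y) = @_;
    my $cmp = $x->[0] cmp $y->[0];
    return $cmp if $cmp;
    my $last = $#$x > $#$y ? $#$x : $#$y;
    for my $i (1 .. $last) {
        $cmp = ($x->[$i] // 0) <=> ($y->[$i] // 0);   # missing parts count as 0
        return $cmp if $cmp;
    }
    return 0;
}

my @strings = qw( XXXXX-1.10.0 XXXXX-1.9.10 XXXXX-1.9.9 );

# Compute each key once, sort on the keys, then recover the original strings.
my @sorted = map  { $_->[1] }
             sort { compare_keys($a->[0], $b->[0]) }
             map  { [ version_key($_), $_ ] } @strings;

print "$_\n" for @sorted;
```

This orders the sample as XXXXX-1.9.9, XXXXX-1.9.10, XXXXX-1.10.0 and also copes with components of more than two digits, which fixed-width padding does not.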

Match string in Perl with space or with no-space in it

$search_buffer="this text has teststring in it, it has a Test String too";
@waitfor=('Test string','some other string');
foreach my $test (@waitfor)
{
eval ('if (lc $search_buffer =~ lc ' . $test . ') ' .
'{' .
' $prematch = $`;' .
' $match = $&; ' .
' $postmatch = ' . "\$';" .
'}');
print "prematch=$prematch\n";
print "match=$match\n"; #I want to match both "teststring" and "Test String"
print "postmatch=$postmatch\n";
}
I need to print both teststring and Test String. Can you please help? Thanks.

use feature 'say';
my $search_buffer="this text has teststring in it, it has a Test String too";
my $pattern = qr/test ?string/i;
say "Match found: $1" while $search_buffer =~ /($pattern)/g;

That is a horrible piece of code you have there. Why are you using eval and trying to concatenate strings into code, remembering to interpolate some variables and forgetting about some? There is no reason to use eval in that situation at all.
I assume that by using lc you are trying to make the match case-insensitive. This is best done with the /i modifier on your regex:
$search_buffer =~ /$test/i; # case insensitive match
In your case, you are trying to match some strings against another string, and you want to compensate for case and for possible whitespace inside. I assume that your strings are generated in some way, and not hard-coded.
What you could do is simply make use of the /x modifier, which makes the regex ignore literal whitespace in the pattern.
Something that you should take into consideration is metacharacters inside your strings. If, for example, you have a string such as foo?, the metacharacter ? will alter the meaning of your regex. You can disable metacharacters inside a regex with the \Q ... \E escape sequence. Note, however, that \Q also escapes whitespace, so a space quoted this way is matched literally even under /x.
So the solution is
use strict;
use warnings;
use feature 'say';
my $s = "this text has teststring in it, it has a Test String too";
my #waitfor= ('Test string','some other string', '#test string');
for my $str (#waitfor) {
if ($s =~ /\Q$str/xi) {
say "prematch = $`";
say "match = $&";
say "postmatch = $'";
}
}
Output:
prematch = this text has teststring in it, it has a
match = Test String
postmatch = too
Note that I use
use strict;
use warnings;
These two pragmas are vital to learning how to write good Perl code, and there is no (valid) reason you should ever write code without them.

This would work for your specific example.
test\s?string
Basically, it marks the space as optional with \s?.
The problem that I'm seeing with this is that it requires you to know exactly where there might be a space inside the string you're searching for.
Note: You might also have to use the case-insensitive flag, which would give /Test[\s]?String/i
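One way around having to know where the space goes is to build the pattern from the search string itself. The sketch below does this under the assumption that any run of whitespace in the search string should be optional in the text. The helper loose_pattern is a hypothetical name, and each word is passed through quotemeta so that metacharacters stay literal.

```perl
use strict;
use warnings;

# Hypothetical helper: build a case-insensitive pattern in which every run of
# whitespace in the search string may match zero or more whitespace characters.
sub loose_pattern {
    my ($string) = @_;
    my $body = join '\s*', map { quotemeta($_) } split ' ', $string;
    return qr/$body/i;
}

my $search_buffer = "this text has teststring in it, it has a Test String too";
my $pattern       = loose_pattern('Test string');

# Collect every match in the buffer, not just the first one.
my @matches = $search_buffer =~ /($pattern)/g;

print "match=$_\n" for @matches;
```

For the sample buffer this finds both teststring and Test String.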
