escape special character in perl when splitting a string - regex

I have a file in this format:
string: string1
string: string2
string: string3
I want to split the lines on spaces and :, so initially I wrote this:
my @array = split(/[:\s]/,$lineOfFile);
The result wasn't what I expected, because split also inserted white space into @array. After some research I understood that I have to escape the \s, so I wrote
my @array = split(/[:\\s]/,$lineOfFile);
Why do I have to escape \s? The character : isn't a special character, is it?
Can someone explain that to me?
Thanks in advance.

You don't have to double up the backslash. Have you tried it?
split /[:\\s]/, $line
will split on a colon :, a backslash \, or a lowercase s, giving
("", "tring", " ", "tring1")
which isn't what you want at all. I suggest you split on a colon followed by zero or more spaces
my @fields = split /:\s*/, $line
which gives this result
("string", "string1")
which I think is what you want.

You do not need to double escape \s and the colon is not a character of special meaning. But in your case, it makes sense to avoid using a character class altogether and split on a colon followed by whitespace "one or more" times.
my @array = split(/:\s+/, $lineOfFile);

The problem is that /[:\s]/ matches only a single character. Thus, when applying this regex, you get something like
print $array[0], ' - ', $array[1], ' - ', $array[2];
string - - string1
because it splits between : and the whitespace before string1. The string string: string1 is therefore split into three parts: string, the empty string between : and the whitespace, and string1. However, allowing more characters
my @array = split(/[:\s]+/,$lineOfFile);
works well, since the colon and the whitespace after it are consumed together as one separator.
print $array[0], ' - ', $array[1];
string - string1
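For readers who want to check the four patterns discussed above outside Perl, here is a minimal sketch in Python, whose re.split gives the same fields for these particular patterns. The helper name split_demo is an illustrative assumption, not something from the original thread.

```python
import re

line = "string: string1"  # sample line from the question


def split_demo(pattern: str) -> list[str]:
    """Split the sample line on the given regex pattern."""
    return re.split(pattern, line)


# One separator character at a time: the empty string between ':' and ' ' survives.
print(split_demo(r"[:\s]"))    # ['string', '', 'string1']
# Doubled backslash: the class now holds ':', a literal backslash and the letter 's'.
print(split_demo(r"[:\\s]"))   # ['', 'tring', ' ', 'tring1']
# A colon followed by optional whitespace counts as one separator.
print(split_demo(r":\s*"))     # ['string', 'string1']
# A run of separator characters is consumed in one go.
print(split_demo(r"[:\s]+"))   # ['string', 'string1']
```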

Related

RegEx giving non characters with \w

c = re.split(r'\w+', message)
print(c)
message contains '!nano speak', but the regex is giving me this in return:
>>> ['!', ' ', '\r\n']
I'm very new to regex, but this seems like something I should get, and I can't seem to find this problem in search. It seems like it's doing exactly the opposite, and I'm sure it's a lower-case w.
re.split is using the regex as a delimiter to split the string. You set the delimiter to be any number of alphanumeric characters. This means that it will return everything between words.
In order to get the tokens defined by the regex you can use re.findall:
>>> re.findall(r'\w+', '!nano speak')
['nano', 'speak']
\w matches a word character (alphanumeric or underscore), so in the string "!nano speak" it matches everything except "!" and the space. The string is then split on "nano" and "speak", so you get "!", " " and "\r\n" (the trailing "\r\n" suggests the message ends with a line break).
To remove all non-letter characters, you can use
re.sub("[^a-zA-Z]+", "", "!nano speak")
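The three calls mentioned in this thread can be compared side by side. This is a small sketch that assumes the message ends with "\r\n", which is what the output in the question suggests; the variable names are illustrative.

```python
import re

message = "!nano speak\r\n"  # assumed input, including the trailing line break

# re.split returns the text *between* the matches of \w+.
between = re.split(r"\w+", message)
# re.findall returns the matches themselves.
words = re.findall(r"\w+", message)
# re.sub deletes every run of characters that are not ASCII letters.
letters_only = re.sub(r"[^a-zA-Z]+", "", message)

print(between)       # ['!', ' ', '\r\n']
print(words)         # ['nano', 'speak']
print(letters_only)  # nanospeak
```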

Remove special characters from a string except whitespace

I am looking for a regular expression to remove all special characters from a string, except whitespace. And maybe replace runs of multiple whitespace characters with a single one.
For example "[one# !two three-four]" should become "one two three-four"
I tried using str = Regex.Replace(strTemp, "^[-_,A-Za-z0-9]$", "").Trim() but it does not work. I also tried a few more, but they either get rid of the whitespace or do not replace all the special characters.
[ ](?=[ ])|[^-_,A-Za-z0-9 ]+
Try this. Replace by empty string. See demo:
http://regex101.com/r/lZ5mN8/69
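Use the regex [^\w\s] to remove every character that is not a word character or whitespace, then collapse double spaces:
Regex.Replace("[one# !two three-four]", "[^\w\s]", "").Replace("  ", " ").Trim
Note that [^\w\s] also removes the hyphen; use [^\w\s-] if three-four should stay intact.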
METHOD:
Instead of replace, use replaceAll, e.g.:
String InputString= "[one# !two three-four]";
String testOutput = InputString.replaceAll("[\\[!,*)##%(&$_?.^\\]]", "").replaceAll("( )+", " ");
Log.d("THE OUTPUT", testOutput);
This will give an output of one two three-four.
EXPLANATION:
.replaceAll("[\\[!,*)##%(&$_?.^\\]]", "") removes every special character listed between the outer brackets []
.replaceAll("( )+", " ") replaces a run of spaces with a single space
REPLACING THE - symbol:
To remove - as well, add the escaped symbol to the character class, like this: .replaceAll("[\\[\\-!,*)##%(&$_?.^\\]]", "")
Hope this helps :)
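The "negated class, then squeeze whitespace" idea from the answers above can be sketched in Python as follows. The function name clean is hypothetical, and keeping the hyphen in the class is an assumption based on the expected output in the question. Note that \w also keeps underscores and, in Python 3, non-ASCII letters.

```python
import re


def clean(text: str) -> str:
    """Drop everything except word characters, whitespace and '-', then squeeze whitespace."""
    stripped = re.sub(r"[^\w\s-]", "", text)
    return re.sub(r"\s+", " ", stripped).strip()


print(clean("[one# !two three-four]"))  # one two three-four
```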

Perl Split using "*"

If I use split like this:
my @split = split(/\s*/, $line);
print "$split[1]\n";
with input:
cat dog
I get:
a
However if I use \s+ in split, I get:
dog
I'm curious as to why they don't produce the same result? Also, what is the proper way to split a string by character?
Thanks for your help.
\s* effectively means zero or more whitespace characters. Between c and a in cat are zero spaces, yielding the result you're seeing.
To the regex engine, your string looks as follows:
c
zero spaces
a
zero spaces
t
multiple spaces
d
zero spaces
o
zero spaces
g
Following this logic, if you use \s+ as a separator, it will only match the multiple spaces between cat and dog.
* matches 0 or more times, which means it can match the empty string between characters. + matches 1 or more times, which means it must match at least one character.
This is described in the documentation for split:
If PATTERN matches the empty string, the EXPR is split at the match position (between characters).
Additionally, when you split on whitespace, most of the time you really want to use a literal space:
.. split ' ', $line;
As described here:
As another special case, "split" emulates the default behavior of the
command line tool awk when the PATTERN is either omitted or a literal
string composed of a single space character (such as ' ' or "\x20",
but not e.g. "/ /"). In this case, any leading whitespace in EXPR is
removed before splitting occurs, and the PATTERN is instead treated as
if it were "/\s+/"; in particular, this means that any contiguous
whitespace (not just a single space character) is used as a separator.
However, this special treatment can be avoided by specifying the
pattern "/ /" instead of the string " ", thereby allowing only a
single space character to be a separator.
If you want to split a string into a list of individual characters then you should use an empty regex pattern for split, like this
my $line = 'cat';
my @split = split //, $line;
print "$_\n" for @split;
output
c
a
t
Some people prefer unpack, like this
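my @split = unpack '(A1)*', $line;
which gives the same result for this input. (The A format strips trailing spaces, so a space character in $line would come back as an empty string.)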
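The difference between * and + can also be seen at the matching level. The sketch below uses Python and assumes the input had several spaces between the words. It deliberately avoids splitting on \s*, because Python's re.split handles empty matches differently from Perl's split (it keeps leading and trailing empty fields).

```python
import re

line = "cat   dog"  # assumed input with several spaces between the words

# \s* succeeds at the very start of the string without consuming anything ...
empty_match = re.match(r"\s*", line).group()
# ... whereas \s+ needs at least one whitespace character and fails there.
no_match = re.match(r"\s+", line)
# Splitting on \s+ therefore only breaks at the run of spaces.
words = re.split(r"\s+", line)
# Splitting a string into single characters needs no regex in Python.
chars = list("cat")

print(repr(empty_match))  # ''
print(no_match)           # None
print(words)              # ['cat', 'dog']
print(chars)              # ['c', 'a', 't']
```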

How to split string not showing anything in double or single quotes?

I get lines from the text file and then need to split them into words. Everything in single or double quotes should be ignored.
For example: use line; "$var", print 'comment': "get 'comment % two'"
should be inserted in an array as use, line, print. Everything else should just be ignored.
Also, I need to check whether % is sitting inside single or double quotes (as in the example above).
my @array = $file_line =~ /[\$A-z_]{2,}/g; gives all the words (plus anything that contains $), but I can't work out how to ignore the characters in the quotes.
Any ideas?
Thanks
I agree with the answer that you can first remove the quoted words
using
$line =~ s/ ( ["'] ) .*? \1 //xg;
However, you should be aware that your regular expression
[\$A-z_]
picks up all the ASCII characters between 'A' and 'z', in particular,
the following punctuation characters:
[ \ ] ^ _ `
So you should either be more explicit in your regular expression
[\$A-Za-z_]
or you should add the case-insensitive flag "i" to your match
and just use one case in the regular expression:
$file_line =~ /[\$A-Z_]{2,}/gi;
You can first remove all the quoted words, for example using:
$line =~ s/ ( ["'] ) .*? \1 //xg;
You might want to slightly change it depending on how you want to handle nested quotes, unclosed quotes etc.
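Both points from the answers (strip the quoted segments first, and spell out A-Za-z instead of A-z) can be tried together. This is a rough Python sketch; bare_words is a hypothetical helper, and nested or unclosed quotes are not handled beyond what the lazy .*? with a backreference gives you.

```python
import re

line = "use line; \"$var\", print 'comment': \"get 'comment % two'\""


def bare_words(text: str) -> list[str]:
    """Remove quoted segments, then collect runs of two or more letters, '$' or '_'."""
    unquoted = re.sub(r"""(["']).*?\1""", "", text)
    return re.findall(r"[$A-Za-z_]{2,}", unquoted)


print(bare_words(line))  # ['use', 'line', 'print']

# The range A-z also covers the punctuation that sits between 'Z' and 'a' in ASCII.
print(bool(re.fullmatch(r"[A-z]", "^")))     # True
print(bool(re.fullmatch(r"[A-Za-z]", "^")))  # False
```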

Regex to remove what ever comes in front of "\" using powershell

I need some help: I want a regex to eliminate a "\" and whatever comes before it.
The input is "vmvalidate\administrator"
and the output should be just "administrator"
$result = $subject -creplace '^[^\\]*\\', ''
removes any non-backslash characters at the start of the string, followed by a backslash:
Explanation:
^ # Start of string
[^\\]* # Match zero or more non-backslash characters
\\ # Match a backslash
This means that if there is more than one backslash in the string, only the first one (and the text leading up to it) will be removed. If you want to remove everything until the last backslash, use
$result = $subject -creplace '(?s)^.*\\', ''
No need to use regex, try the split method:
$string.Split('\')[-1]
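To see how the two regexes and the split approach above differ when the string holds more than one backslash, here is a small Python sketch. The function names and the a\b\c test string are illustrative assumptions. The patterns themselves are the ones from the answer.

```python
import re


def after_first_backslash(s: str) -> str:
    """Remove everything up to and including the first backslash."""
    return re.sub(r"^[^\\]*\\", "", s)


def after_last_backslash(s: str) -> str:
    """Remove everything up to and including the last backslash (greedy .*)."""
    return re.sub(r"(?s)^.*\\", "", s)


name = "vmvalidate\\administrator"
print(after_first_backslash(name))      # administrator
print(after_last_backslash(name))       # administrator
print(name.split("\\")[-1])             # administrator

# With two backslashes the variants diverge.
print(after_first_backslash(r"a\b\c"))  # b\c
print(after_last_backslash(r"a\b\c"))   # c
```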
"vmvalidate\administrator" -replace "^.*?\\"
^ - from the beginning of the string
.* - any number of any characters
? - makes the quantifier lazy, so it stops at the first backslash
\\ - a backslash, escaped with the escape character "\"
All together it means "Replace all characters from the beginning of the string up to and including the first backslash"
This is the way I used to do things before I learned about regex or splitting.
"vmvalidate\administrator".SubString("vmvalidate\administrator".IndexOf('\')+1)