Add an optional white space between characters in a string - regex

I have a string, for example:
string SampleString = "F456-G12345-9090-GHI"
I need to allow an optional white space between all characters in the above string.
The above string needs to match the same string, which may or may not have white space between each character. The other string will look like this:
string samplestring1 = "F456-G12345- 9090 -GHI"
Thanks
Padma

I'm not positive that I understand what you'll be matching. If you're looking for a specific string, then the easiest way is probably to replace all white space with '' across the string and then do the match.
In perl I'd do:
$string =~ s/\s//g;
while ($string =~ m/F456-G12345-9090-GHI/g) {
# Do something
}
If you're looking for multiple strings, and not just a specific one, you might just want to add \s as a potential match [\w\s-]+
However, if you're going to be matching against a specific string, I'd just toss the whitespace whole cloth first rather than performing an expensive regex checking for (and discarding) any whitespace found before checking the string.
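For comparison, here is a minimal sketch of the same "strip first, then compare" idea in Python rather than Perl. The function name matches_ignoring_whitespace is made up for illustration, and it assumes you want an exact match of the whole string:

```python
import re

def matches_ignoring_whitespace(candidate: str, target: str) -> bool:
    """Return True if candidate equals target once all whitespace is removed."""
    # Drop every whitespace character, then do a plain string comparison.
    stripped = re.sub(r"\s+", "", candidate)
    return stripped == target

print(matches_ignoring_whitespace("F456-G12345- 9090 -GHI", "F456-G12345-9090-GHI"))
```

Once the whitespace is gone, no regex is needed for the comparison itself.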

You will probably have to add \s* between each character (or whatever other pattern covers the whitespace characters you expect).
\s*F\s*4\s*5\s*6\s*-\s*G\s*1\s*2\s*3\s*4\s*5\s*-\s*9\s*0\s*9\s*0\s*-\s*G\s*H\s*I\s*
Or, depending on your regex dialect, you might be able to pass an option to ignore whitespace in the source text, but it would depend on which regex library you're using.
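Writing that pattern by hand is error-prone, so it may be easier to generate it. A rough sketch in Python (not the asker's language; whitespace_tolerant_pattern is a hypothetical helper name) that escapes each character and joins them with \s*:

```python
import re

def whitespace_tolerant_pattern(target: str) -> re.Pattern:
    """Build a regex that matches target with optional whitespace around every character."""
    # Escape each character so that '-', '+', '.' and friends are taken literally.
    parts = (re.escape(ch) for ch in target)
    return re.compile(r"\s*" + r"\s*".join(parts) + r"\s*")

pattern = whitespace_tolerant_pattern("F456-G12345-9090-GHI")
print(bool(pattern.fullmatch("F456-G12345- 9090 -GHI")))
```

This assumes the whole input should match; use search instead of fullmatch if the string may be embedded in longer text.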

Related

Eliminate whitespace around single letters

I frequently receive PDFs that contain (when converted with pdftotext) whitespaces between the letters of some arbitrary words:
This i s a n example t e x t that c o n t a i n s strange spaces.
For further automated processing (looking for specific words) I would like to remove all whitespace between "standalone" letters (single-letter words), so the result would look like this:
This isan example text that contains strange spaces.
I tried to achieve this with a simple perl regex:
s/ (\w) (\w) / $1$2 /g
Which of course does not work, as after the first and second standalone letters have been moved together, the second one no longer is a standalone, so the space to the third will not match:
This is a n example te x t that co n ta i ns strange spaces.
So I tried lookahead assertions, but failed to achieve anything (also because I did not find any example that uses them in a substitution).
As usual with PRE, my feeling is that there must be a very simple and elegant solution for this...
Just match a continuous series of single letters separated by spaces, then delete all spaces from that using a nested substitution (the /e eval modifier).
s{\b ((\w\s)+\w) \b}{ my $s = $1; $s =~ s/ //g; $s }xge;
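The same nested-substitution idea can be sketched in Python, where a replacement function plays the role of Perl's /e modifier. This is only an illustration under the assumption that the stray separators are plain spaces; join_spaced_letters is an invented name:

```python
import re

def join_spaced_letters(text: str) -> str:
    """Collapse runs of single letters separated by whitespace into one word."""
    # Outer match: a run like "t e x t" delimited by word boundaries.
    # Inner step: remove the spaces from just that run.
    return re.sub(
        r"\b((?:\w\s)+\w)\b",
        lambda m: m.group(1).replace(" ", ""),
        text,
    )

sample = "This i s a n example t e x t that c o n t a i n s strange spaces."
print(join_spaced_letters(sample))
```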
Excess whitespace can be removed with a regex, but Perl by itself cannot know what is correct English. With that caveat, this seems to work:
$ perl -pe's/(?<!\S)(\S) (?=\S )/$1/g' spaces.txt
This isan example text that contains strange spaces.
Note that i s a n cannot be distinguished from a normal 4-letter word; fixing that requires human correction or some language module.
Explanation:
(?<!\S) negative look-behind assertion checks that the character behind is not a non-whitespace.
(\S) next must follow a non-whitespace, which we capture with parens, followed by a whitespace, which we will remove (or not put back, as it were).
(?=\S ) next we check with a look-ahead assertion that what follows is a non-whitespace followed by a whitespace. We do not change the string there.
Then put back the character we captured with $1
It might be more correct to use [^ ] instead of \S. Since you only seem to have a problem with spaces being inserted, there is no need to match tabs, newlines or other whitespace. Feel free to do that change if you feel it is appropriate.
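If it helps to experiment outside the shell, the look-around version carries over to Python almost unchanged. This is just a sketch (squeeze_single_letters is a made-up name), using the same pattern with \1 in place of $1:

```python
import re

def squeeze_single_letters(text: str) -> str:
    """Remove the space after a standalone character when another standalone character follows."""
    # (?<!\S)  : nothing but whitespace (or start of string) before the character
    # (\S)     : the standalone character, kept via \1, then the space to drop
    # (?=\S )  : the next character is also followed by a space (not consumed)
    return re.sub(r"(?<!\S)(\S) (?=\S )", r"\1", text)

sample = "This i s a n example t e x t that c o n t a i n s strange spaces."
print(squeeze_single_letters(sample))
```

Because the look-ahead does not consume anything, each standalone letter is still available as the start of the next match, which is what the capturing attempt in the question was missing.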

Regular expression to match any word followed by a literal string

So I have the following:
^[a-zA-Z]+\b(myword+-)\b*
which I thought would match
^ start of string
[a-zA-Z] any alpha character
+ of one or more characters
\b followed by a word break
(myword+-) followed by myword which could include one or more special characters
\b followed by a word break
\* followed by anything at all
One: it does not work - it does not match anything
Two: any special characters included in (myword+-) throw an error
I could escape the special characters, but I don't know in advance what they might be, so I would have to escape all the possibilities, or perhaps I could just escape every character in (\m\y\w\o\r\d\\+\\-)
Edited to add:
Sorry, I knew I should have given more information
I have a series of strings to search through in the form:
extra android-sdk and more that is of no interest
extra android-ndk and more that is of no interest
extra anjuta-extra and more that is of no interest
community c++-gtk-utils and more that is of no interest
and I have a list of items to search for in the strings:
android-sdk
android-ndk
extra
c++-gtk-utils
The item should only match if the second word in the string is an exact match to the item, so:
android-sdk will match the first string
android-ndk will match the second string
extra will NOT match the third string
c++-gtk-utils will match the fourth string
So (myword+-) is the item I am searching for "which could include one or more special characters"
Thanks for the help
Andrew
OK, with the help from above I worked it out.
This regex does exactly what I wanted, bear in mind that I am working in tcl (note the spaces to delimit the search word):
^[a-zA-Z]+\y extra \y *
where the search word is "extra".
It is necessary to escape any characters in the search string that regex may interpret as quantifiers or other metacharacters, e.g. +
So this will also work:
^[a-zA-Z]+\y dbus-c\+\+ \y *
Andrew
Strong recommendation: if you want to match literal strings, don't use regular expressions.
If we have this sample data:
set strings {
    {extra android-sdk and more that is of no interest}
    {extra android-ndk and more that is of no interest}
    {extra anjuta-extra and more that is of no interest}
    {community c++-gtk-utils and more that is of no interest}
}
set search_strings {
    android-sdk
    android-ndk
    extra
    c++-gtk-utils
}
Then, to find matches in the 2nd word of each string, we'll just use the eq string equality operator
foreach string $strings {
    foreach search $search_strings {
        if {[lindex [split $string] 1] eq $search} {
            puts "$search matches $string"
        }
    }
}
outputs
android-sdk matches extra android-sdk and more that is of no interest
android-ndk matches extra android-ndk and more that is of no interest
c++-gtk-utils matches community c++-gtk-utils and more that is of no interest
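For readers who do not use Tcl, the same "no regex, just compare the second word" approach might look like this in Python. It is only a sketch of the idea; second_word_matches is an illustrative name, and it assumes words are separated by whitespace:

```python
lines = [
    "extra android-sdk and more that is of no interest",
    "extra android-ndk and more that is of no interest",
    "extra anjuta-extra and more that is of no interest",
    "community c++-gtk-utils and more that is of no interest",
]
items = ["android-sdk", "android-ndk", "extra", "c++-gtk-utils"]

def second_word_matches(line: str, item: str) -> bool:
    """True if the second whitespace-separated word of line is exactly item."""
    words = line.split()
    return len(words) > 1 and words[1] == item

for line in lines:
    for item in items:
        if second_word_matches(line, item):
            print(f"{item} matches {line}")
```

Since this is plain string equality, characters such as + and - need no escaping at all.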
If you insist on regular expression matching, you can escape any special characters to take away their usual regex meaning. Here, we'll take the brute force approach: any non-word chars will get escaped, so that the pattern may look like ^\S+\s+c\+\+\-gtk\-utils
foreach string $strings {
    foreach search $search_strings {
        set pattern "^\\S+\\s+[regsub -all {\W} $search {\\&}]"
        if {[regexp $pattern $string]} {
            puts "$search matches $string"
        }
    }
}
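Some regex libraries ship an escaping helper so you do not have to do the substitution yourself. As a sketch in Python (not Tcl), re.escape does the escaping; the trailing (?!\S) is my own addition, not part of the Tcl pattern above, assumed here so that the item must be the whole second word rather than a prefix of it. The name second_word_regex is hypothetical:

```python
import re

def second_word_regex(item: str) -> re.Pattern:
    """Regex matching lines whose second word is exactly item."""
    # re.escape neutralises metacharacters such as + in "c++-gtk-utils".
    # (?!\S) is an added assumption: the word must end right after the item.
    return re.compile(r"^\S+\s+" + re.escape(item) + r"(?!\S)")

line = "community c++-gtk-utils and more that is of no interest"
print(bool(second_word_regex("c++-gtk-utils").search(line)))
```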
I was hoping to be able to make a portion of a regular expression to be a literal string, like
set pattern "^\\S+\\s+(***=$string)"
set pattern "^\\S+\\s+((?q)$string)"
but both failed.
Tcl regular expressions are documented at
https://www.tcl.tk/man/tcl8.6/TclCmd/re_syntax.htm
Also note your pattern ^[a-zA-Z]+\b(myword+-)\b* does not provide for any whitespace between the first and second words.
Disclaimer: Since your question lacks information about what input and output are expected, I will only try to explain why your regex isn't working at all. Since it's not a full answer, you might not want to mark it as accepted, and instead wait for someone to give you a working solution once you provide the necessary information.
Notes:
quantifier characters (*, +, ? etc.) apply to the single element that precedes them: a literal character, a character class (a.k.a. character group, namely characters/ranges inside [ ]) or a parenthesized group. When you write (myword+-) in your regex, the only thing the + sign applies to is the letter 'd', nothing else.
what is myword in your regex? If you want a set of characters, use [ ] combined with character ranges and/or character tokens such as \w (word characters: letters, digits and underscore) or \d (digit characters)
you also seem to misunderstand and misuse groups ("( )"), character classes ("[ ]") and quantifier notation ("{ }")
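The first note is easy to check by experiment. A small sketch in Python (the variable names are made up; any regex engine with the usual quantifier rules should behave the same way for this pattern):

```python
import re

# Unescaped: the + quantifies only the preceding "d".
unescaped = re.compile(r"myword+-")
# Escaped: every character, including + and -, is taken literally.
literal = re.compile(re.escape("myword+-"))

print(bool(unescaped.fullmatch("myworddd-")))  # one or more "d", then "-"
print(bool(unescaped.fullmatch("myword+-")))   # the literal "+" is not accepted
print(bool(literal.fullmatch("myword+-")))     # literal match after escaping
```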

Adding a space character to my regex

I would like some help in getting this regex to accept the space character.
The following regex works ^a|a$|a but this one doesn't ^tips to|tips to$|tips to.
Space is just as-is in a regex (you just put the space character, that should work). Alternatively you can use \s special character. For example, in Perl:
my $test = "Hello world";
if ($test =~ m/ /)
{
    print("Has space\n");
}
Also if you can specify more what you want to use the regex for, we might be able to help better.
Try escaping just the last space (the regex engine will then "see" that "tips to" is one block, at least for the last OR):
^tips to|tips to$|tips\ to
Or, to be on the safe side, group what you're searching for:
^(tips to)|(tips to)$|(tips to)
[EDIT 1]
so here's the solution the OP is using:
^"tips to"|"tips to"$|"tips to"
The regular expression that matches 1 space character is 1 space character.
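A quick way to convince yourself is to try it in whatever engine you have at hand. Here is a sketch in Python (the thread does not say which engine the asker uses, so treat this as an illustration only; has_phrase is an invented name). Note that the third alternative is unanchored, so the pattern as a whole matches "tips to" anywhere in the input:

```python
import re

# A literal space in the pattern matches a literal space in the input.
pattern = re.compile(r"^tips to|tips to$|tips to")

def has_phrase(text: str) -> bool:
    """True if "tips to" occurs anywhere in text."""
    return pattern.search(text) is not None

print(has_phrase("some useful tips to share"))
print(has_phrase("tipsto"))
```

If the space still does not match in your tool, the cause is probably how the tool splits or quotes its input, not the regex itself.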

How to test to see if a string is only whitespace in perl

What's a good way to test whether a string consists only of whitespace characters, using a regex?
if($string=~/^\s*$/){
#is 100% whitespace (remember 100% of the empty string is also whitespace)
#use /^\s+$/ if you want to exclude the empty string
}
(I have decided to edit my post to include concepts in the below conversation with tobyodavies.)
In most instances, you want to determine whether or not something is whitespace, because whitespace is relatively insignificant and you want to skip over a string consisting of merely whitespace. So, I think what you want to determine is whether or not there are significant characters.
So I tend to use the reverse test: $str =~ /\S/. Determining the predicate "string contains one Significant character".
However, to apply your particular question, this can be determined in the negative by testing: $str !~ /\S/
Your regex statement should look for ^\s+$. It will require at least one whitespace.
In case you were wondering, "white space is defined as [\t\n\f\r\p{Z}]". See http://userguide.icu-project.org/strings/regexp.
\t Match a HORIZONTAL TABULATION, \u0009.
\n Match a LINE FEED, \u000A.
\f Match a FORM FEED, \u000C.
\r Match a CARRIAGE RETURN, \u000D.
\p{UNICODE PROPERTY NAME} Match any character with the specified Unicode Property.
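The three tests discussed above (allowing the empty string, excluding it, and looking for a significant character) can be put side by side. A sketch in Python rather than Perl, with made-up function names; Python's \s is assumed here, which may differ slightly from Perl's or ICU's definition of whitespace:

```python
import re

def is_blank(s: str) -> bool:
    """Only whitespace, empty string included (like /^\\s*$/)."""
    return re.fullmatch(r"\s*", s) is not None

def is_blank_nonempty(s: str) -> bool:
    """At least one character, all of them whitespace (like /^\\s+$/)."""
    return re.fullmatch(r"\s+", s) is not None

def has_significant_char(s: str) -> bool:
    """The reverse test: contains at least one non-whitespace character (like /\\S/)."""
    return re.search(r"\S", s) is not None

for s in ["", " \t\n", " a "]:
    print(repr(s), is_blank(s), is_blank_nonempty(s), has_significant_char(s))
```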

Perl - Substitute any non-alphanumeric character with "_"

In Perl I want to replace any character that is not [A-Z]i or [0-9] with "_", but only if this non-alphanumeric character occurs between two alphanumeric characters. I do not want to touch non-alphanumerics at the beginning or end of the string.
I know enough regex to replace them, just not to only replace ones in the middle of the string.
s/(\p{Alnum})\P{Alnum}(\p{Alnum})/${1}_${2}/g;
Of course that would hurt your chances with "#A#B%C", so you might use look-arounds:
s/(?<=\p{Alnum})\P{Alnum}(?=\p{Alnum})/_/g;
That way you isolate it to just the non "alnum" character.
Or you could use the "keep flag", as well and get the same thing done.
s/\p{Alnum}\K\P{Alnum}(?=\p{Alnum})/_/g;
EDIT based on input:
To not eat a newline, you could do the following:
s/\p{Alnum}\K[^\p{Alnum}\n](?=\p{Alnum})/_/g;
Try this:
my $str = 'a-2=c+a()_';
# The look-arounds capture nothing, so the replacement is just "_" (there is no \1 or \2 here).
$str =~ s/(?<=[A-Z0-9])[^A-Z0-9](?=[A-Z0-9])/_/gi;
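To see why the look-around form is preferable to the capturing form, it can help to run both on the "#A#B%C" example. A sketch in Python instead of Perl, with hypothetical function names and ASCII-only character classes assumed (Perl's \p{Alnum} is broader):

```python
import re

ALNUM = "A-Za-z0-9"

def underscore_inner(s: str) -> str:
    """Look-around version: neighbours are checked but not consumed."""
    return re.sub(rf"(?<=[{ALNUM}])[^{ALNUM}](?=[{ALNUM}])", "_", s)

def underscore_inner_capturing(s: str) -> str:
    """Capturing version: each match consumes both neighbours."""
    return re.sub(rf"([{ALNUM}])[^{ALNUM}]([{ALNUM}])", r"\1_\2", s)

print(underscore_inner("a-2=c+a()_"))
print(underscore_inner("#A#B%C"))
print(underscore_inner_capturing("#A#B%C"))
```

In the capturing version the "B" is consumed by the first match, so the "%" after it no longer has an alphanumeric character available on its left and is skipped.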