regex double white space separation problem - regex

I want a regex to match words that are delimited by double or more space characters, e.g.
ABC  DE  FGHIJ  KLM  NO  P  QRST
Notice the two or more spaces between the words. Writing a regex for this case is easy (I only need the first 4 words), since each word can be matched with \S+ or \S+?
However, in my problem a single white space CAN occur inside a word, for example
AB C  DE  FG HIJ  KLM  NO  P  QRST
Here AB C is a word and FG HIJ is a word as well. In short, we want to isolate runs of characters that are separated by two or more white spaces. I tried using this regex:
.+?  +.+?  +.+?  +.+?  +
it matches very swiftly, but it takes too much time for strings it doesn't match. (4 matches are given as an example here, in practice I need to match more).
I need a better regex to accomplish this, so that all the backtracking can be avoided. [^ ]* is a regex which will match up until a space is encountered. Can't we specify a negated character set where we continue matching in case of a single space and stop when 2 are encountered? I've tried using positive lookahead but failed miserably.
I would really appreciate your help. Thanks in advance.
Saad

The simplest solution is to split on \s{2,} to get the "words" you want, but if you insist on scanning for the tokens, then whereas before you had \S+, what you have now is \S+(\s\S+)*. That's exactly what it says: \S+, followed by zero or more (\s\S+). You can use a non-capturing group for performance, i.e. \S+(?:\s\S+)*. You can even make each repetition possessive if your flavor supports it for an extra boost, i.e. \S++(?:\s\S++)*+.
Here's a Java snippet to demonstrate:
// Words are separated by two spaces; a single space stays inside a word.
String text = "AB C  DE  FG HIJ  KLM  NO  P  QRST";
Matcher m = Pattern.compile("\\S++(?:\\s\\S++)*+").matcher(text);
while (m.find()) {
    System.out.println("[" + m.group() + "]");
}
This prints:
[AB C]
[DE]
[FG HIJ]
[KLM]
[NO]
[P]
[QRST]
You can of course substitute just the space character instead of \s if that's your requirement.
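As a rough cross-check in another language, here is a minimal Python sketch of both approaches from this answer: splitting on \s{2,} and scanning with \S+(?:\s\S+)*. The sample string assumes exactly two spaces between words. Possessive quantifiers are left out because Python's re module only supports them from version 3.11 on.

```python
import re

# Assumption: words are separated by two spaces, single spaces stay inside words.
text = "AB C  DE  FG HIJ  KLM  NO  P  QRST"

# Approach 1: split on runs of two or more whitespace characters.
words_by_split = re.split(r"\s{2,}", text)

# Approach 2: scan for tokens, i.e. non-space runs joined by single whitespace.
words_by_scan = re.findall(r"\S+(?:\s\S+)*", text)

print(words_by_split)
print(words_by_scan)
```

Both lists should come out identical for this input; the split version is usually the easier one to read.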
References
regular-expressions.info/Character Class, Brackets for Grouping, Repetition, Possessive

If you know what the delimiter is (\s\s+), you could split instead of match.
Simply split on two or more spaces.
Regards
rbo

I think this is even more simple to match 2 or more whitespaces:
\s{2,}
In PHP the split would look like this
$list = preg_split('/\s{2,}/', $string);

What about using this pattern:
\s{2,}

Why not something like \s\s+ (one whitespace character, then one or more whitespace characters)?
Edit: it strikes me that whatever language/toolkit you're using may not support "splitting" a string using a regex directly. In that case, you may want to implement that functionality, and instead of attempting to match the WORDS in the input, match the SPACES, and use the information from those matches (position, length) to extract the words between the matches. In some languages (.NET, others) this functionality is built-in.
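For a toolkit without a regex split, the idea above (match the SPACES, then cut the words out using the match positions) could be sketched like this in Python. The function name split_on_gaps is made up for illustration; Python itself does have re.split, so this is only meant to show the mechanics.

```python
import re

def split_on_gaps(text, gap=re.compile(r"\s{2,}")):
    """Hypothetical helper: extract the words between separator matches."""
    words = []
    start = 0
    for m in gap.finditer(text):
        # Everything from the end of the previous separator up to this one.
        words.append(text[start:m.start()])
        start = m.end()
    # The tail after the last separator.
    words.append(text[start:])
    return words

print(split_on_gaps("AB C  DE  FG HIJ"))
```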

If you want to match all the words (allowing one space in a row), try \S+(?:[ ]\S+)* (the character class isn't necessary and can just be a space character, but I included it for clarity). It specifies that at least one non-whitespace character is required, and a space cannot be followed by another one.
You didn't mention what language you're using, but here's an example in PHP:
$string = "AB C  DE  FG HIJ  KLM  NO  P  QRST";
$matches = array();
preg_match_all('/\S+(?:[ ]\S+)*/', $string, $matches);
// $matches[0] will contain 'AB C', 'DE', 'FG HIJ', 'KLM', 'NO', 'P', 'QRST'
If the requirements are at most one space per word, just change the * at the end to a ?: \S+(?:[ ]\S+)?.
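To see the difference between the * and ? variants, here is a small Python sketch on a made-up input in which one "word" contains two single spaces. The input string is an assumption chosen for illustration.

```python
import re

# Assumption: "A B C" is one word with two inner spaces, "DE" is a second word.
text = "A B C  DE"

# Any number of single inner spaces per word.
any_inner_spaces = re.findall(r"\S+(?:[ ]\S+)*", text)

# At most one inner space per word.
one_inner_space = re.findall(r"\S+(?:[ ]\S+)?", text)

print(any_inner_spaces)
print(one_inner_space)
```

With the ? variant the third letter is cut off into a word of its own, which may or may not be what you want.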

Related

Eliminate whitespace around single letters

I frequently receive PDFs that contain (when converted with pdftotext) whitespaces between the letters of some arbitrary words:
This i s a n example t e x t that c o n t a i n s strange spaces.
For further automated processing (looking for specific words) I would like to remove all whitespace between "standalone" letters (single-letter words), so the result would look like this:
This isan example text that contains strange spaces.
I tried to achieve this with a simple perl regex:
s/ (\w) (\w) / $1$2 /g
Which of course does not work, as after the first and second standalone letters have been moved together, the second one no longer is a standalone, so the space to the third will not match:
This is a n example te x t that co n ta i ns strange spaces.
So I tried lookahead assertions, but failed to achieve anything (also because I did not find any example that uses them in a substitution).
As usual with PRE, my feeling is, that there must be a very simple and elegant solution for this...
Just match a continuous series of single letters separated by spaces, then delete all spaces from that using a nested substitution (the /e eval modifier).
s{\b ((\w\s)+\w) \b}{ my $s = $1; $s =~ s/ //g; $s }xge;
Excess whitespace can be removed with a regex, but Perl by itself cannot know what is correct English. With that caveat, this seems to work:
$ perl -pe's/(?<!\S)(\S) (?=\S )/$1/g' spaces.txt
This isan example text that contains strange spaces.
Note that i s a n cannot be distinguished from a normal 4-letter word that has been spaced out; fixing that requires human correction or some language module.
Explanation:
(?<!\S) negative look-behind assertion checks that the character behind is not a non-whitespace.
(\S) next must follow a non-whitespace, which we capture with parens, followed by a whitespace, which we will remove (or not put back, as it were).
(?=\S ) next we check with a look-ahead assertion that what follows is a non-whitespace followed by a whitespace. We do not change the string there.
Then put back the character we captured with $1
It might be more correct to use [^ ] instead of \S. Since you only seem to have a problem with spaces being inserted, there is no need to match tabs, newlines or other whitespace. Feel free to do that change if you feel it is appropriate.
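For comparison, both ideas from this thread (the nested substitution and the look-around version) might be written in Python roughly as follows. The function names are made up for illustration, and the sample line is the one from the question.

```python
import re

text = "This i s a n example t e x t that c o n t a i n s strange spaces."

def join_by_callback(s):
    # Match a run of single characters separated by whitespace,
    # then strip the spaces inside that run in a callback.
    return re.sub(r"\b((?:\w\s)+\w)\b",
                  lambda m: m.group(1).replace(" ", ""), s)

def join_by_lookaround(s):
    # Drop the space after a standalone character when the next
    # character is standalone too.
    return re.sub(r"(?<!\S)(\S) (?=\S )", r"\1", s)

print(join_by_callback(text))
print(join_by_lookaround(text))
```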

Regex to match and capture a word

I wrote a regex in a Perl script to find and capture a word that contains the sequence "fp", "fd", "sp" or "sd" in a sentence. However, the word may contain some non-word characters like θ or ð. The word may be at the beginning or end of the sentence. When I tested this regex on regex101.com, it matches even when the input is empty.
The way I interpret this regex is: match one of the patterns "fp", "fd", "sp" or "sd" and capture everything around it until either a whitespace or the beginning of the line on the left side and a whitespace or the end of the line on the right side. This is the regex:
^|\s(.*[fs][ˈ|ˌ]?[pd].*)\s|$
I also tried using the ? quantifier to make the .* pattern lazy, but it still shows a match when the input is empty.
Here are some examples of what I need it to capture in parentheses:
(fpgθ) tig <br/>
tig (gfpθ) tig<br/>
tig (gθfp)<br/>
Edit: I forgot to explain the middle part. The [ˈˌ]? part (I made a mistake, I don't need the |) just allows for those characters to be between the [fs] and [pd]. I wouldn't want it to match things like tigf pg. I want it to match any word (defined by the space around it, so in a sentence like tig you rθð the words it contains are tig, you, and rθð). This "word" could be at the end, at the beginning, or in the middle of the sentence. Is there a way to assert the position at the beginning of the string within a bracket? I think that would solve my problem. Also, I tried using \w, but because I have things like θ or ð it doesn't match those.
There is still a little openness in the description, but this works with shown data
use warnings;
use strict;
use feature 'say';
use utf8;
use open ":std", ":encoding(UTF-8)";
my @strs = (
    '(fpgθ) tig <br/>',
    'tig (gfpθ) tig<br/>',
    'tig (gθfp)<br/>',
);
for (@strs)
{
    # /g in list context returns all captures; /x allows spaces in the pattern
    my @m = /\b( \S*? [fs][pd] \S*? )\b/gx;
    say "@m" if @m; # no space allowed by the pattern
}
Depending on clarifications you may want to tweak the \S and \b that are used. I capture into an array, with /g, for strings with more than one match. I left parentheses in for an additional test.
The use utf8 allows UTF-8 in the source, so it's for my @strs array only.
The use open pragma, however, is essential as it sets default (PerlIO) input and output layers, in this case standard streams for UTF-8. So you can read from a file and print to a file or console.
find and capture a word that contains the sequence "fp", "fd", "sp" or
"sd" in a sentence. However, the word may contain some non-word
characters like θ or ð.
You should match Unicode letters \p{L} instead of regular word characters \w:
\p{L}*[fs][pd]\p{L}*
I have simplified the pattern according to your latest edits.
use warnings;
use strict;
use utf8;
use open ":std", ":encoding(UTF-8)";
my $regex = qr/\p{L}*[fs][pd]\p{L}*/;
my @strs = 'fpgθ tig <br/>
tig gfpθ tig<br/>
tig gθfp<br/>
fptig gfpθ tig<br/>
sddgsdθ(θ#) tig gθfp<br/>';
for (@strs)
{
    # /g in list context returns every match in the multi-line string
    my @m = /$regex/g;
    print "@m" if @m; # no space allowed by the pattern
}
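A rough Python equivalent, for readers outside Perl: the standard re module has no \p{L}, so this sketch substitutes [^\W\d_] (word characters minus digits and underscore) as an approximation of "any Unicode letter". That substitution is an assumption of this sketch, not part of the answer above, and the helper name find_words is made up.

```python
import re

# Approximation of \p{L}: Unicode word characters minus digits and underscore.
LETTER = r"[^\W\d_]"
PATTERN = re.compile(rf"{LETTER}*[fs][pd]{LETTER}*")

def find_words(line):
    """Hypothetical helper: return all words containing fp, fd, sp or sd."""
    return PATTERN.findall(line)

for line in ["fpgθ tig", "tig gfpθ tig", "tig gθfp", "tigf pg"]:
    print(line, "->", find_words(line))
```

The last line is the "tigf pg" case from the question, which should produce no match because the f and the p are in different words.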

Regular Expression with wildcards to match any character

I am new to regex and I am trying to come up with something that will match a text like below:
ABC: (z) jan 02 1999 \n
Notes:
text will always begin with "ABC:"
there may be zero, one or more spaces between ':' and (z)
variations of (z) are also possible, (zz), (zzzzzz), etc., but always non-digit characters enclosed in "()"
there may be zero, one or more spaces between (z) and jan
jan could be jan, january, etc.
the date could be in any format and may or may not contain other text as part of it
So I would really like to know if there is a regex I can use to capture anything and everything that is found between '(z)' and '\n'.
Any help is greatly appreciated! Thank you
The following should work:
ABC: *\([a-zA-Z]+\) *(.+)
Explanation:
ABC:           # match literal characters 'ABC:'
 *             # zero or more spaces
\([a-zA-Z]+\)  # one or more letters inside of parentheses
 *             # zero or more spaces
(.+)           # capture one or more of any character (except newlines)
To get your desired grouping based on the comments below, you can use the following:
(ABC:) *(\([a-zA-Z]+\).+)
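Here is a short Python sketch of the two patterns from this answer, run against the sample line from the question. Note that the captured date keeps its trailing space, because . matches everything except the newline.

```python
import re

line = "ABC: (z) jan 02 1999 \n"

# First pattern: capture only what follows the parenthesised part.
m1 = re.match(r"ABC: *\([a-zA-Z]+\) *(.+)", line)
date_part = m1.group(1)

# Second pattern: capture 'ABC:' and the rest as two groups.
m2 = re.match(r"(ABC:) *(\([a-zA-Z]+\).+)", line)
prefix, rest = m2.groups()

print(repr(date_part))
print(repr(prefix), repr(rest))
```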
Without knowing the exact regex implementation you're making use of, I can only give general advice. (The syntax I use will be Perl as that's what I know; some languages will require tweaking.)
Looking at ABC: (z) jan 02 1999 \n
The first thing to match is ABC: so our regex starts as /ABC:/
You say ABC is always at the start of the string, so /^ABC/ will ensure that ABC is at the start of the string.
You can match whitespace with the \s (note the case) directive. With all directives you can match one or more with + (or 0 or more with *)
You need to escape ( and ) as they are reserved characters, so \( and \)
You can match any character except a newline with .
You can match anything at all with .* but you need to be careful you're not too greedy and capture everything.
So in order to capture what you've asked. I would use /^ABC:\s*\(.+?\)\s*(.+)$/
Which I read as:
Begins with ABC:
May have some spaces
has (
has some characters
has )
may have some spaces
then capture everything until the end of the line (which is $).
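The step-by-step reading above can be written down as a commented pattern. This is a sketch in Python using re.VERBOSE, which ignores whitespace in the pattern and allows # comments; the variable names are made up for illustration.

```python
import re

# Each line mirrors one step of the reading above.
ABC_LINE = re.compile(r"""
    ^ABC:     # begins with ABC:
    \s*       # may have some spaces
    \(        # has (
    .+?       # has some characters
    \)        # has )
    \s*       # may have some spaces
    (.+)$     # capture everything until the end of the line
""", re.VERBOSE)

m = ABC_LINE.search("ABC: (z) jan 02 1999 \n")
captured = m.group(1) if m else None
print(repr(captured))
```

In Python, $ also matches just before a trailing newline, so the sample line from the question still matches.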
I highly recommend keeping a copy of the following laying about
http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
This should fulfill your requirements.
ABC:\s*(\(\D+\)\s*.*?)\\n
Here it is with some tests http://www.regexplanet.com/cookbook/ahJzfnJlZ2V4cGxhbmV0LWhyZHNyDgsSBlJlY2lwZRiEjiUM/index.html
Further reading on regular expressions: http://www.regular-expressions.info/characters.html

Perl - Substitute any non-alphanumerical character with "_"

In Perl I want to replace any character that is not in [A-Z] (case-insensitive) or [0-9] with "_", but only if this non-alphanumerical character occurs between two alphanumerical characters. I do not want to touch non-alphanumericals at the beginning or end of the string.
I know enough regex to replace them, just not to only replace ones in the middle of the string.
s/(\p{Alnum})\P{Alnum}(\p{Alnum})/${1}_${2}/g;
Of course that would hurt your chances with "#A#B%C", because the second alphanumeric is consumed by the first match, so you might use look-arounds:
s/(?<=\p{Alnum})\P{Alnum}(?=\p{Alnum})/_/g;
That way you isolate it to just the non "alnum" character.
Or you could use the "keep flag", as well and get the same thing done.
s/\p{Alnum}\K\P{Alnum}(?=\p{Alnum})/_/g;
EDIT based on input:
To not eat a newline, you could do the following:
s/\p{Alnum}\K[^\p{Alnum}\n](?=\p{Alnum})/_/g;
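To make the "#A#B%C" problem concrete, here is a small Python sketch comparing the consuming pattern with the look-around pattern. It assumes ASCII letters and digits only (the Perl \p{Alnum} is broader), and the function names are made up.

```python
import re

ALNUM = "A-Za-z0-9"  # assumption: ASCII only, unlike \p{Alnum}

def consuming(s):
    # The neighbours are part of the match, so overlapping cases are skipped.
    return re.sub(rf"([{ALNUM}])[^{ALNUM}]([{ALNUM}])", r"\g<1>_\g<2>", s)

def lookaround(s):
    # Only the non-alphanumeric character itself is matched and replaced.
    return re.sub(rf"(?<=[{ALNUM}])[^{ALNUM}](?=[{ALNUM}])", "_", s)

print(consuming("#A#B%C"))
print(lookaround("#A#B%C"))
print(lookaround("a-2=c+a()_"))
```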
Try this:
my $str = 'a-2=c+a()_';
# The look-arounds capture nothing, so the replacement is just "_".
$str =~ s/(?<=[A-Z0-9])[^A-Z0-9](?=[A-Z0-9])/_/gi;

extract word with regular expression

I have a string 1/temperatoA,2/CelcieusB!23/33/44,55/66/77 and I would like to extract the words temperatoA and CelcieusB.
I have this regular expression (\d+/(\w+),?)*! but I only get the match 1/temperatoA,2/CelcieusB!
Why?
Your whole match evaluates to '1/temperatoA,2/CelcieusB' because that matches the following expression:
qr{ ( # begin group
\d+ # at least one digit
/ # followed by a slash
(\w+) # followed by at least one word characters
,? # maybe a comma
)* # ANY number of repetitions of this pattern.
}x;
'1/temperatoA,' fulfills capture #1 first, but since you are asking the engine to capture as many of those as it can it goes back and finds that the pattern is repeated in '2/CelcieusB' (the comma not being necessary). So the whole match is what you said it is, but what you probably weren't expecting is that '2/CelcieusB' replaces '1/temperatoA,' as $1, so $1 reads '2/CelcieusB'.
Anytime you want to capture anything that fits a certain pattern in a certain string it is always best to use the global flag and assign the captures into an array. Since an array is not a single scalar like $1, it can hold all the values that were captured for capture #1.
When I do this:
use Data::Dumper;
my $str = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
my $regex = qr{(\d+/(\w+))};
# With /g in list context, every capture of every match is returned.
if ( my @matches = $str =~ /$regex/g ) {
    print Dumper( \@matches );
}
I get this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB',
'23/33',
'33',
'55/66',
'66'
];
Now, I figure that's probably not what you expected. But '3' and '6' are word characters, and so--coming after a slash--they comply with the expression.
So, if this is an issue, you can change your regex to qr{(\d+/(\p{Alpha}\w*))}, specifying that the first character must be an alpha followed by any number of word characters. Then the dump looks like this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB'
];
And if you only want 'temperatoA' or 'CelcieusB', then you're capturing more than you need to and you'll want your regex to be qr{\d+/(\p{Alpha}\w*)}.
However, the secret to capturing more than one chunk in a capture expression is to assign the match to an array, you can then sort through the array to see if it contains the data you want.
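The same behaviour can be reproduced in Python, as a sketch: a repeated capture group only remembers its last repetition, while a global search (re.findall here) collects every capture. [A-Za-z] stands in for \p{Alpha}, which is an assumption of this sketch.

```python
import re

s = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

# Repeated group: only the last repetition is kept in the group.
m = re.search(r"(\d+/(\w+),?)*!", s)
whole, last_word = m.group(0), m.group(2)

# Global search: one capture per match.
all_words = re.findall(r"\d+/(\w+)", s)

# Requiring a leading letter drops the purely numeric hits.
letter_words = re.findall(r"\d+/([A-Za-z]\w*)", s)

print(whole, last_word)
print(all_words)
print(letter_words)
```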
The question here is: why are you using a regular expression that’s so obviously wrong? How did you get it?
The expression you want is simply as follows:
(\w+)
With a Perl-compatible regex engine you can search for
(?<=\d/)\w+(?=.*!)
(?<=\d/) asserts that there is a digit and a slash before the start of the match
\w+ matches the identifier. This allows for letters, digits and underscore. If you only want to allow letters, use [A-Za-z]+ instead.
(?=.*!) asserts that there is a ! ahead in the string, i.e. the regex will fail once we have passed the !.
Depending on the language you're using, you might need to escape some of the characters in the regex.
E.g., for use in C (with the PCRE library), you need to escape the backslashes:
myregexp = pcre_compile("(?<=\\d/)\\w+(?=.*!)", 0, &error, &erroroffset, NULL);
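The same pattern also works in Python's re module, since the lookbehind has a fixed width. A minimal sketch on the string from the question:

```python
import re

s = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

# Digit and slash behind, a '!' somewhere ahead; only the word itself is matched.
words = re.findall(r"(?<=\d/)\w+(?=.*!)", s)
print(words)
```

Everything after the ! is skipped because the lookahead can no longer find a ! ahead of the candidate match.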
Will this work?
/([[:alpha:]]\w+)\b(?=.*!)
I made the following assumptions...
A word begins with an alphabetic character.
A word always immediately follows a slash. No intervening spaces, no words in the middle.
Words after the exclamation point are ignored.
You have some sort of loop to capture more than one word. I'm not familiar enough with the C library to give an example.
[[:alpha:]] matches any alphabetic character.
The \b matches a word boundary.
And the (?=.*!) came from Tim Pietzcker's post.