Regex to match and capture a word - regex

I wrote a regex in a Perl script to find and capture a word that contains the sequence "fp", "fd", "sp" or "sd" in a sentence. However, the word may contain some non-word characters like θ or ð. The word may be at the beginning or end of the sentence. When I tested this regex on regex101.com, it matched even when the input was empty. The way I interpret this regex is: match one of the patterns "fp", "fd", "sp" or "sd" and capture everything around it until either a whitespace or the beginning of the line on the left side and a whitespace or the end of the line on the right side. This is the regex: ^|\s(.*[fs][ˈ|ˌ]?[pd].*)\s|$ I also tried using the ? quantifier to make the .* pattern lazy, but it still shows a match when the input is empty.
Here are some examples of what I need it to capture in parentheses:
(fpgθ) tig <br/>
tig (gfpθ) tig<br/>
tig (gθfp)<br/>
Edit: I forgot to explain the middle part. The [ˈˌ]? part (I made a mistake, I don't need the |) just allows for those characters to be between the [fs] and [pd]. I wouldn't want it to match things like tigf pg. I want it to match any word (defined by the space around it - so in a sentence like tig you rθð the words it contains are tig, you, and rθð). This "word" could be at the end, at the beginning, or in the middle of the sentence. Is there a way to assert the position at the beginning of the string within a bracket? I think that would solve my problem. Also, I tried using \w, but because I have things like θ or ð it doesn't match those.

There is still a little openness in the description, but this works with the data shown:
use warnings;
use strict;
use feature 'say';
use utf8;
use open ":std", ":encoding(UTF-8)";
my @strs = (
    '(fpgθ) tig <br/>',
    'tig (gfpθ) tig<br/>',
    'tig (gθfp)<br/>',
);
for (@strs)
{
    my @m = /\b( \S*? [fs][pd] \S*? )\b/gx;
    say "@m" if @m; # no space allowed by the pattern
}
Depending on clarifications you may want to tweak the \S and \b that are used. I capture into an array, with /g, for strings with more than one match. I left parentheses in for an additional test.
The use utf8 allows UTF-8 in the source, so it's for my @strs array only.
The use open pragma, however, is essential as it sets default (PerlIO) input and output layers, in this case standard streams for UTF-8. So you can read from a file and print to a file or console.

find and capture a word that contains the sequence "fp", "fd", "sp" or
"sd" in a sentence. However, the word may contain some non-word
characters like θ or ð.
You should match Unicode letters \p{L} instead of regular word characters \w:
\p{L}*[fs][pd]\p{L}*
I have simplified the pattern according to your latest edits.
use warnings;
use strict;
use utf8;
use open ":std", ":encoding(UTF-8)";
my $regex = qr/\p{L}*[fs][pd]\p{L}*/mp;
my @strs = 'fpgθ tig <br/>
tig gfpθ tig<br/>
tig gθfp<br/>
fptig gfpθ tig<br/>
sddgsdθ(θ#) tig gθfp<br/>';
for (@strs)
{
    my @m = /$regex/gm;
    print "@m" if @m; # no space allowed by the pattern
}
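Neither answer above handles the optional stress mark ([ˈˌ]?) from the question's edit, nor does either define a word strictly by the whitespace around it. The following is a minimal sketch of one way to do both, assuming a "word" is any run of non-whitespace characters; the helper name words_with_cluster and the sample sentences are made up for illustration. The lookarounds (?<!\S) and (?!\S) also succeed at the start and end of the string, which is the assertion the question asks about.

```perl
use strict;
use warnings;
use utf8;                              # the source contains θ, ð, ˈ and ˌ literally
use open ':std', ':encoding(UTF-8)';   # encode printed output as UTF-8

# Hypothetical helper: return every whitespace-delimited word that contains
# f or s, an optional stress mark, then p or d.
sub words_with_cluster {
    my ($sentence) = @_;
    # (?<!\S) : not preceded by a non-space (so: start of string or a space)
    # (?!\S)  : not followed by a non-space (so: end of string or a space)
    return $sentence =~ /(?<!\S)(\S*[fs][ˈˌ]?[pd]\S*)(?!\S)/g;
}

for my $s ('fpgθ tig', 'tig gfpθ tig', 'tig gθfp', 'tigf pg', 'tig sˈdað') {
    my @words = words_with_cluster($s);
    print "$s => ", (@words ? "@words" : '(no match)'), "\n";
}
```

Call the helper in list context to get all matching words; 'tigf pg' yields nothing because the f and the p sit in different words.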

Related

Eliminate whitespace around single letters

I frequently receive PDFs that contain (when converted with pdftotext) whitespaces between the letters of some arbitrary words:
This i s a n example t e x t that c o n t a i n s strange spaces.
For further automated processing (looking for specific words) I would like to remove all whitespace between "standalone" letters (single-letter words), so the result would look like this:
This isan example text that contains strange spaces.
I tried to achieve this with a simple perl regex:
s/ (\w) (\w) / $1$2 /g
Which of course does not work, as after the first and second standalone letters have been moved together, the second one no longer is a standalone, so the space to the third will not match:
This is a n example te x t that co n ta i ns strange spaces.
So I tried lookahead assertions, but failed to achieve anything (also because I did not find any example that uses them in a substitution).
As usual with PRE, my feeling is that there must be a very simple and elegant solution for this...
Just match a continuous series of single letters separated by spaces, then delete all spaces from that using a nested substitution (the /e eval modifier).
s{\b ((\w\s)+\w) \b}{ my $s = $1; $s =~ s/ //g; $s }xge;
Excess whitespace can be removed with a regex, but Perl by itself cannot know what is correct English. With that caveat, this seems to work:
$ perl -pe's/(?<!\S)(\S) (?=\S )/$1/g' spaces.txt
This isan example text that contains strange spaces.
Note that i s a n cannot be distinguished from a normal 4-letter word; resolving that requires human correction or some language module.
Explanation:
(?<!\S) negative look-behind assertion checks that the character behind is not a non-whitespace.
(\S) next must come a non-whitespace, which we capture with parens, followed by a whitespace, which we will remove (or not put back, as it were).
(?=\S ) next we check with a look-ahead assertion that what follows is a non-whitespace followed by a whitespace. We do not change the string there.
Then put back the character we captured with $1.
It might be more correct to use [^ ] instead of \S. Since you only seem to have a problem with spaces being inserted, there is no need to match tabs, newlines or other whitespace. Feel free to make that change if you feel it is appropriate.
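As a rough sketch of the [^ ] variant suggested above, the substitution can be wrapped in a small function. The name join_single_letters is made up for illustration; the sample line is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the look-around substitution, with [^ ] in
# place of \S so that only literal spaces are considered.
sub join_single_letters {
    my ($line) = @_;
    # Drop the space after a one-character token when the next token is also
    # a single character followed by a space.
    $line =~ s/(?<![^ ])([^ ]) (?=[^ ] )/$1/g;
    return $line;
}

print join_single_letters(
    'This i s a n example t e x t that c o n t a i n s strange spaces.'
), "\n";
# prints: This isan example text that contains strange spaces.
```

Because the look-ahead requires a space after the next letter, a spaced-out word at the very end of a line keeps its last gap ('t e x t' becomes 'tex t'); whether that matters depends on the input.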

Problems with perl regex

I need a Perl regex to match A.CC3 on a line beginning with something, followed by anything, then my 'A.CC3', and then anything...
I am surprised this (text =~ /^\W+\CC.*\A\.CC\[3].*/) is not working
Thanks
\A is an escape sequence that anchors the match at the beginning of the string, much like the ^ at the start of your regex. Remove the backslash to make it match a literal A.
Edit: You also seem to have \C in there. You should only use backslash to escape meta characters such as period ., or to create escape sequences, such as \Q .. \E.
At its simplest, a regex to match A.CC3 would be
$text =~ /A\.CC3/
That's all you need. This will match any string with A.CC3 in it. In the comments you mention the string you are matching is this:
my $text = "//%CC Unused Static Globals, A.CC3, Halstead Progam Volume";
You might want to avoid partial matches, in which case you can use word boundary \b
$text =~ /\bA\.CC3\b/
You might require that a line begins with //%
$text =~ m#^//%.*\bA\.CC3\b#
Of course, only you know which parts of the string should be matched and in what way. "Something followed by anything followed by A.CC3 followed by anything" really just needs the first simple regex.
It doesn't seem like you're trying to capture anything. If that's the case, and all you need to do is find lines that contain A.CC3 then you can simply do
if ( index( $str, 'A.CC3' ) >= 0 ) # Found it...
No need for a regex.
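To make the difference between the word-boundary regex and the plain index check concrete, here is a small sketch. The helper names has_token_regex and has_token_index and the two extra test strings are assumptions for illustration; only the first string comes from the question.

```perl
use strict;
use warnings;

my $text = '//%CC Unused Static Globals, A.CC3, Halstead Progam Volume';

# Hypothetical helpers, one per approach discussed above.
sub has_token_regex { my ($s) = @_; return $s =~ /\bA\.CC3\b/ ? 1 : 0 }
sub has_token_index { my ($s) = @_; return index($s, 'A.CC3') >= 0 ? 1 : 0 }

for my $s ($text, 'prefix AXCC3', 'A.CC30') {
    printf "%-60s regex=%d index=%d\n",
        $s, has_token_regex($s), has_token_index($s);
}
```

Both checks accept the sample line and reject 'AXCC3' (the dot is literal in both). They differ only on 'A.CC30': index finds the substring, while \b rejects it because a word character follows the 3.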
Try to give this a shot:
^.*?A\.CC.*$
That will match anything until it reaches A, then a literal ., followed by CC, then anything until end of string.
It depends what you want to match. If you want to pull back the whole line in which the A.CC3 pattern occurs then something like this should work:
^.*A\.CC3.*$

regex double white space separation problem

I want a regex to match words that are delimited by double or more space characters, e.g.
ABC  DE  FGHIJ  KLM  NO  P  QRST
Notice double or more spaces between the alphabets. Writing regex for such a problem is easy as I only need the first 4 words, as we may search for a word by using \S+ or \S+?
However, for my problem, only 1 white space CAN occur in a word, for example
AB C  DE  FG HIJ  KLM  NO  P  QRST
Here AB C is a word and FG HIJ is a word as well. In short, we want to isolate characters that are separated by double or more white spaces. I tried using this regex:
.+?  +.+?  +.+?  +.+?  +
it matches very swiftly, but it takes too much time for strings it doesn't match. (4 matches are given as an example here, in practice I need to match more).
I am in need of a better regex to accomplish this, so that all the backtracking can be avoided. [^ ]* is a regex which will match up until a space is encountered. Can't we specify a negated character set where we continue matching in case of a single space and break when 2 are encountered? I've tried using positive lookahead but failed miserably.
I would really appreciate your help. Thanks in advance.
Saad
The simplest solution is to split on \s{2,} to get the "words" you want, but if you insist on scanning for the tokens, then whereas before you had \S+, what you have now is \S+(\s\S+)*. That's exactly what it says: \S+, followed by zero or more (\s\S+). You can use a non-capturing group for performance, i.e. \S+(?:\s\S+)*. You can even make each repetition possessive if your flavor supports it for an extra boost, i.e. \S++(?:\s\S++)*+.
Here's a Java snippet to demonstrate:
String text = "AB C  DE  FG HIJ  KLM  NO  P  QRST";
Matcher m = Pattern.compile("\\S++(?:\\s\\S++)*+").matcher(text);
while (m.find()) {
System.out.println("[" + m.group() + "]");
}
This prints:
[AB C]
[DE]
[FG HIJ]
[KLM]
[NO]
[P]
[QRST]
You can of course substitute just the space character instead of \s if that's your requirement.
References
regular-expressions.info/Character Class, Brackets for Grouping, Repetition, Possessive
If you know what the delimiter is (\s\s+), you could split instead of match.
Simply split on two or more spaces.
Regards
rbo
I think this is even more simple to match 2 or more whitespaces:
\s{2,}
In PHP the split would look like this
$list = preg_split('/\s{2,}/', $string);
What about using this pattern:
\s{2,}
Why not something like \s\s+ (one whitespace character, then one or more whitespace characters)?
Edit: it strikes me that whatever language/toolkit you're using may not support "splitting" a string using a regex directly. In that case, you may want to implement that functionality, and instead of attempting to match the WORDS in the input, match the SPACES, and use the information from those matches (position, length) to extract the words between the matches. In some languages (.NET, others) this functionality is built-in.
If you want to match all the words (allowing one space in a row), try \S+(?:[ ]\S+)* (the character class isn't necessary and can just be a space character, but I included it for clarity). It specifies that at least one non-whitespace character is required, and a space cannot be followed by another one.
You didn't mention what language you're using, but here's an example in PHP:
$string = "AB C  DE  FG HIJ  KLM  NO  P  QRST";
$matches = array();
preg_match_all('/\S+(?:[ ]\S+)*/', $string, $matches);
// $matches will contain 'AB C', 'DE', 'FG HIJ', 'KLM', 'NO', 'P', 'QRST'
If the requirements are at most one space per word, just change the * at the end to a ?: \S+(?:[ ]\S+)?.
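The two approaches from the answers above (split on the delimiter, or match the words) can be sketched in Perl as follows. The helper names fields_by_split and fields_by_match are made up for illustration, and the sample string is built with join so that the double spaces are explicit.

```perl
use strict;
use warnings;

# Two spaces separate the fields; a single space belongs to a field.
my $text = join '  ', 'AB C', 'DE', 'FG HIJ', 'KLM', 'NO', 'P', 'QRST';

# Hypothetical helpers, one per approach.
sub fields_by_split { my ($s) = @_; return split /\s{2,}/, $s }
sub fields_by_match { my ($s) = @_; return $s =~ /\S+(?:[ ]\S+)*/g }

print join('|', fields_by_split($text)), "\n";
print join('|', fields_by_match($text)), "\n";
# both print: AB C|DE|FG HIJ|KLM|NO|P|QRST
```

One difference to keep in mind: if the input starts with two or more spaces, split returns an empty leading field, whereas the matching version simply skips the leading whitespace.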

extract word with regular expression

I have a string 1/temperatoA,2/CelcieusB!23/33/44,55/66/77 and I would like to extract the words temperatoA and CelcieusB.
I have this regular expression (\d+/(\w+),?)*! but I only get the match 1/temperatoA,2/CelcieusB!
Why?
Your whole match evaluates to '1/temperatoA,2/CelcieusB' because that matches the following expression:
qr{ ( # begin group
\d+ # at least one digit
/ # followed by a slash
(\w+) # followed by at least one word character
,? # maybe a comma
)* # ANY number of repetitions of this pattern.
}x;
'1/temperatoA,' fulfills capture #1 first, but since you are asking the engine to capture as many of those as it can it goes back and finds that the pattern is repeated in '2/CelcieusB' (the comma not being necessary). So the whole match is what you said it is, but what you probably weren't expecting is that '2/CelcieusB' replaces '1/temperatoA,' as $1, so $1 reads '2/CelcieusB'.
Anytime you want to capture anything that fits a certain pattern in a certain string it is always best to use the global flag and assign the captures into an array. Since an array is not a single scalar like $1, it can hold all the values that were captured for capture #1.
When I do this:
use Data::Dumper;
my $str = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
my $regex = qr{(\d+/(\w+))};
if ( my @matches = $str =~ /$regex/g ) {
    print Dumper( \@matches );
}
I get this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB',
'23/33',
'33',
'55/66',
'66'
];
Now, I figure that's probably not what you expected. But '3' and '6' are word characters, and so--coming after a slash--they comply with the expression.
So, if this is an issue, you can change your regex to the equivalent: qr{(\d+/(\p{Alpha}\w*))}, specifying that the first character must be an alpha followed by any number of word characters. Then the dump looks like this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB'
];
And if you only want 'temperatoA' or 'CelcieusB', then you're capturing more than you need to and you'll want your regex to be qr{\d+/(\p{Alpha}\w*)}.
However, the secret to capturing more than one chunk in a capture expression is to assign the match to an array, you can then sort through the array to see if it contains the data you want.
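Putting the last two points together, a minimal sketch with the narrowed regex and a list-context match might look like this. The helper name words_after_slash is made up for illustration.

```perl
use strict;
use warnings;

my $str = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';

# Hypothetical helper: with a single capture group and /g, a match in list
# context returns every captured word rather than only the last one.
sub words_after_slash {
    my ($s) = @_;
    return $s =~ m{\d+/(\p{Alpha}\w*)}g;
}

my @words = words_after_slash($str);
print "@words\n";   # prints: temperatoA CelcieusB
```

The numeric fields after the ! are skipped here only because they do not start with a letter, not because of the ! itself.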
The question here is: why are you using a regular expression that’s so obviously wrong? How did you get it?
The expression you want is simply as follows:
(\w+)
With a Perl-compatible regex engine you can search for
(?<=\d/)\w+(?=.*!)
(?<=\d/) asserts that there is a digit and a slash before the start of the match
\w+ matches the identifier. This allows for letters, digits and underscore. If you only want to allow letters, use [A-Za-z]+ instead.
(?=.*!) asserts that there is a ! ahead in the string - i. e. the regex will fail once we have passed the !.
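In Perl the same lookaround pattern can be applied with /g in list context; this is only a sketch, and the helper name words_before_bang is made up for illustration.

```perl
use strict;
use warnings;

my $str = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';

# Hypothetical helper: the pattern has no capture group, so /g in list
# context returns the matched text of every match.
sub words_before_bang {
    my ($s) = @_;
    return $s =~ m{(?<=\d/)\w+(?=.*!)}g;
}

print join(',', words_before_bang($str)), "\n";   # prints: temperatoA,CelcieusB
```

Using m{...} as the delimiter avoids having to escape the slash inside the look-behind.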
Depending on the language you're using, you might need to escape some of the characters in the regex.
E. g., for use in C (with the PCRE library), you need to escape the backslashes:
myregexp = pcre_compile("(?<=\\d/)\\w+(?=.*!)", 0, &error, &erroroffset, NULL);
Will this work?
/([[:alpha:]]\w+)\b(?=.*!)
I made the following assumptions...
A word begins with an alphabetic character.
A word always immediately follows a slash. No intervening spaces, no words in the middle.
Words after the exclamation point are ignored.
You have some sort of loop to capture more than one word. I'm not familiar enough with the C library to give an example.
[[:alpha:]] matches any alphabetic character.
The \b matches a word boundary.
And the (?=.*!) came from Tim Pietzcker's post.

How can I preserve whitespace when I match and replace several words in Perl?

Let's say I have some original text:
here is some text that has a substring that I'm interested in embedded in it.
I need the text to match a part of it, say: "has a substring".
However, the original text and the matching string may have whitespace differences. For example the match text might be:
has a
substring
or
has a substring
and/or the original text might be:
here is some
text that has
a substring that I'm interested in embedded in it.
What I need my program to output is:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
I also need to preserve the whitespace pattern in the original and just add the start and end markers to it.
Any ideas about a way of using Perl regexes to get this to happen? I tried, but ended up getting horribly confused.
Been some time since I've used perl regular expressions, but what about:
$text =~ s/(has\s+a\s+substring)/[$1]/ig
This matches one or more whitespace characters, including newlines, between the words. It will wrap the entire match with brackets while maintaining the original separation. It ain't automatic, but it does work.
You could play games with this, like taking the string "has a substring" and doing a transform on it to make it "has\s+a\s+substring" to make this a little less painful.
EDIT: Incorporated ysth's comments that the \s metacharacter matches newlines and hobbs corrections to my \s usage.
This pattern will match the string that you're looking to find:
(has\s+a\s+substring)
So, when the user enters a search string, replace any whitespace in the search string with \s+ and you have your pattern. Then, just replace every match with [match starts here]$1[match ends here] where $1 is the matched text.
In regexes, you can use + to mean "one or more." So something like this
/has\s+a\s+substring/
matches has followed by one or more whitespace chars, followed by a followed by one or more whitespace chars, followed by substring.
Putting it together with a substitution operator, you can say:
my $str = "here is some text that has a substring that I'm interested in embedded in it.";
$str =~ s/(has\s+a\s+substring)/\[match starts here]$1\[match ends here]/gs;
print $str;
And the output is:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
As many have suggested, use \s+ to match whitespace. Here is how you do it automatically:
my $original = "here is some text that has a substring that I'm interested in embedded in it.";
my $search = "has a\nsubstring";
my $re = $search;
$re =~ s/\s+/\\s+/g;
$original =~ s/\b$re\b/[match starts here]$&[match ends here]/g;
print $original;
Output:
here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.
You might want to escape any meta-characters in the string. If someone is interested, I could add it.
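One possible way to add that escaping, sketched under the assumption that the search text should be treated as literal words separated by arbitrary whitespace: split the search string into words, run each through Perl's built-in quotemeta, and join them with \s+. The helper name mark_match and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every occurrence of $search in $text with the
# markers, treating any run of whitespace in $search as "one or more
# whitespace characters" and everything else as literal text.
sub mark_match {
    my ($text, $search) = @_;
    # split ' ' breaks on runs of whitespace and ignores leading whitespace;
    # quotemeta escapes regex metacharacters in each word.
    my $re = join '\s+', map { quotemeta($_) } split ' ', $search;
    # /e builds the replacement by concatenation, so the "[" after the
    # captured text can never be parsed as an array subscript.
    (my $marked = $text) =~
        s/($re)/'[match starts here]' . $1 . '[match ends here]'/ge;
    return $marked;
}

my $original = "here is some\ntext that has\na substring (sic) that I'm interested in.";
print mark_match($original, "has a\nsubstring"), "\n";
print mark_match($original, "substring \t (sic)"), "\n";
```

This sketch leaves out the \b anchors used above, because a word boundary does not exist next to a search term that starts or ends with punctuation such as "(sic)"; add them back if the search text is known to consist of word characters only.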
This is an example of how you could do that.
#! /opt/perl/bin/perl
use strict;
use warnings;
my $submatch = "has a\nsubstring";
my $str = "
here is some
text that has
a substring that I'm interested in, embedded in it.
";
print substr_match($str, $submatch), "\n";
sub substr_match{
my($string,$match) = @_;
$match =~ s/\s+/\\s+/g;
# This isn't safe the way it is now, you will need to sanitize $match
$string =~ /\b$match\b/;
}
This currently does not do anything to check the $match variable for unsafe characters.