Word removal using re results in wrong words being removed - regex

Given a text "article_utf8", I want to remove a list of words:
remove = "el|la|de|que|y|a|en|un|ser|se|no|haber|..."
regex = re.compile(r'\b('+remove+r')\b', flags=re.IGNORECASE)
article_out = regex.sub("", article_utf8)
However, this incorrectly removes some words and parts of words, for example:
1- aseguro becomes seguro
2- sería becomes í
3- coma becomes com
4- miercoles becomes 'ercoles'

Technically, parts of a word can match a regexp. To solve this, you have to make sure that whatever sequence of letters your regexp matches is a whole word and not part of one.
One way would be to make the regexp contain leading and trailing spaces, but words can also be separated by periods or commas, so you would have to take those into account too if you want to catch all instances.
Alternatively, you can try splitting the text into words first using the built-in split method (https://docs.python.org/2/library/stdtypes.html#str.split). Then check each word in the resulting list, remove the ones you don't want, and rejoin the rest. This method, however, doesn't even need regexps, so it's probably not what you intended, despite being simple and practical.
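A minimal sketch of the split-and-rejoin approach described above, written for Python 3. STOPWORDS holds only the words visible in the question (the original list is truncated), and remove_stopwords is a hypothetical helper name. Stripping surrounding punctuation before the comparison is an added assumption, so that "la," still counts as "la".

```python
import string

# Only the words shown in the question; the original list is truncated.
STOPWORDS = {"el", "la", "de", "que", "y", "a", "en", "un", "ser", "se", "no", "haber"}

def remove_stopwords(text):
    """Drop whole words found in STOPWORDS; words are whitespace-delimited."""
    kept = []
    for token in text.split():
        # Compare without surrounding ASCII punctuation and ignoring case.
        bare = token.strip(string.punctuation).lower()
        if bare not in STOPWORDS:
            kept.append(token)
    return " ".join(kept)

print(remove_stopwords("Aseguro que sería la coma, de miercoles."))
# Aseguro sería coma, miercoles.
```

Because only whole tokens are compared, "aseguro", "sería" and "coma" are left alone. The trade-off is that runs of whitespace are collapsed to single spaces when the text is rejoined.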

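After much testing, the following removes the small words from a natural language string without removing them from parts of other words. The trailing delimiter is written as a lookahead because \b inside a character class means a backspace character, not a word boundary:
regex = re.compile(r'\s?\b(' + remove + r')(?=[\s.,]|$)', flags=re.IGNORECASE)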
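One possible explanation for the "sería becomes í" case, assuming article_utf8 is a UTF-8 encoded byte string as its name suggests: with a bytes pattern, \b only treats ASCII letters, digits and "_" as word characters, so the bytes of "í" look like word boundaries. The sketch below (Python 3, using the truncated word list from the question) compares a str pattern with a bytes pattern. It does not claim to explain the other three cases.

```python
import re

remove = "el|la|de|que|y|a|en|un|ser|se|no|haber"  # truncated list from the question
text = "aseguro que sería la coma de miercoles"

# str pattern: \b is Unicode-aware, so "í" counts as a word character.
str_regex = re.compile(r"\b(?:" + remove + r")\b", flags=re.IGNORECASE)
cleaned = re.sub(r"\s+", " ", str_regex.sub("", text)).strip()
print(cleaned)  # aseguro sería coma miercoles

# bytes pattern: the two bytes of "í" are non-word bytes, so "ser" and the
# final "a" of "sería" both sit between word boundaries and get removed.
bytes_regex = re.compile(rb"\b(?:" + remove.encode("ascii") + rb")\b", flags=re.IGNORECASE)
damaged = bytes_regex.sub(b"", "sería".encode("utf-8")).decode("utf-8")
print(damaged)  # í
```

If that is the cause, decoding the text to a Unicode string before applying the regex (and, in Python 2, adding re.UNICODE to the flags) should keep accented words intact.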

Related

Regex for sections of the text that contain a set of words

I have a large text with a few paragraphs. I want to find the section of the text that contains a set of words, in any order, for example {"word3", "word2", "word1"}. I need to return that section, which can span multiple sentences or paragraphs.
What is the regular expression for this, please?
You need a way to declare how this "section of text" starts and ends.
I will assume that your sections stop at a newline character (\n).
Something like:
(\n?).+(word1|word2|word3).+(\n|\.)
could make it work. This returns the whole paragraph (assuming that each paragraph is separated from the next by a \n). Note that it matches a paragraph containing any one of the words, not necessarily all of them.
Lookaheads can be used to enforce multiple conditions. The general form is
(?=.*word1.*$)(?=.*word2.*$)(?=.*word3.*$).*$
where $ can be replaced by whatever marks the end of a section.
Word boundaries can be used to avoid sub-word matches, and the s (dotall) flag may be used if . should also match newline characters.
(?=.*\bword1\b.*$)(?=.*\bword2\b.*$)(?=.*\bword3\b.*$).*$
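A short Python sketch of the lookahead form above, assuming a section is a single line (so . is left unable to cross newlines and re.MULTILINE lets ^ and $ match at line ends). The sample text is made up for illustration.

```python
import re

words = ["word1", "word2", "word3"]  # placeholder keywords from the question

# One lookahead per keyword, then consume the rest of the line.
lookaheads = "".join(r"(?=.*\b" + re.escape(w) + r"\b)" for w in words)
pattern = re.compile("^" + lookaheads + ".*$", flags=re.MULTILINE)

text = (
    "word3 then word1 and word2 here.\n"
    "only word1 and word2 here.\n"
    "word2, word1, word3!"
)
sections = [m.group(0) for m in pattern.finditer(text)]
print(sections)
# ['word3 then word1 and word2 here.', 'word2, word1, word3!']
```

The second line is skipped because it lacks word3. Each lookahead starts from the same position, which is why the order of the keywords in the text does not matter.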
I agree with mpliax: you must have a way to delimit these sections, a way to define what a paragraph or a sentence is.
Assuming your paragraphs are separated by newlines, and that we're looking for "grep", "contains", and "text", you could use a series of lookaheads to match that paragraph:
(?=[^\n]*grep)(?=[^\n]*text)(?=[^\n]*contains)[^\n]+
Or this slightly different pattern, which assumes a sentence must end with a period, question mark, or exclamation point (a bad assumption?), and tries to match just the sentence:
(?=[^.?!]*grep)(?=[^.?!]*text)(?=[^.?!]*contains)[^.?!]+
Both patterns follow this structure: (?=[NON-delimiter]*keyword) lookaheads, as many as we want one after the other, so that we know we can "see" each keyword before the next delimiter happens. Because a lookahead consumes nothing, every check starts from the same position and the keywords may appear in any order. Then we just match the whole paragraph with the last token [NON-delimiter]+.
If you ignore the order of the words, there are several possible orderings for a set of 3 words, e.g. abc, acb, bca, bac, cab, cba.
So the regex needs to match every ordering of the 3 words:
a(bc|cb)|b(ca|ac)|c(ab|ba)
Here a stands for word1, b for word2, and c for word3.
Of course, since the words are separated by whitespace, the regex needs the whitespace too, so it would look like this:
word1 (word2 word3|word3 word2)|word2 (word3 word1|word1 word3)|word3 (word1 word2|word2 word1)
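For more than a handful of words, writing the orderings by hand gets tedious. A rough Python sketch that generates a flat alternation equivalent to the factored pattern above, using itertools.permutations (the sample strings are made up):

```python
import itertools
import re

words = ["word1", "word2", "word3"]

# One alternative per ordering, with the words separated by a single space.
alternatives = [
    " ".join(re.escape(w) for w in perm)
    for perm in itertools.permutations(words)
]
pattern = re.compile("|".join(alternatives))

print(len(alternatives))                                   # 6
print(pattern.search("xx word2 word3 word1 yy").group(0))  # word2 word3 word1
print(pattern.search("word1 and word2 word3"))             # None
```

As the last line shows, this only matches when the words are adjacent, and the number of alternatives grows factorially with the number of words.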

Regular Expression Match (get multiple stuff in a group)

I'm having trouble with this regular expression.
Here is the string, on one line. I want to extract the keys in swatchColorList, specifically the words Natural Burlap, Navy, and Red.
I have tried '[(.*?)]' to get everything inside the brackets, but what I really want is to do it in one line. Is that possible, or do I need to do this in two steps?
Thanks
{"id":"1349306","categoryName":"Kids","imageSource":"7/optimized/8769127_fpx.tif","swatchColorList":[{"Natural Burlap":"8/optimized/8769128_fpx.tif"},{"Navy":"5/optimized/8748315_fpx.tif"},{"Red":"8/optimized/8748318_fpx.tif"}],"suppressColorSwatches":false,"primaryColor":"Natural Burlap","clickableSwatch":true,"selectedColorNameID":"Natural Burlap","moreColors":false,"suppressProductAttribute":false,"colorFamily":{"Natural Burlap":"Ivory/Cream"},"maxQuantity":6}
You can try this regex
(?<=[[,]\{\")[^"]+
If lookbehind is not supported, you can use
[[,]\{"([^"]+)
This will save the needed word in group 1.
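Both patterns can be tried quickly in Python. This is only a sketch on a shortened copy of the string from the question. The opening bracket inside the character class is escaped as \[, which is equivalent and avoids a warning in recent Python versions.

```python
import re

# Shortened version of the string from the question.
data = (
    '{"id":"1349306","swatchColorList":[{"Natural Burlap":"8/optimized/8769128_fpx.tif"},'
    '{"Navy":"5/optimized/8748315_fpx.tif"},{"Red":"8/optimized/8748318_fpx.tif"}],'
    '"colorFamily":{"Natural Burlap":"Ivory/Cream"}}'
)

# Lookbehind version: the match itself is the colour name.
with_lookbehind = re.findall(r'(?<=[\[,]\{")[^"]+', data)

# Capturing-group version: findall returns the content of group 1.
with_group = re.findall(r'[\[,]\{"([^"]+)', data)

print(with_lookbehind)  # ['Natural Burlap', 'Navy', 'Red']
print(with_group)       # ['Natural Burlap', 'Navy', 'Red']
```

Note that both patterns rely on the exact layout of the string: any other array of objects in the same document would be picked up as well.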
import json
# Named "data" rather than "str" to avoid shadowing the built-in type
data = '{"id":"1349306","categoryName":"Kids","imageSource":"7/optimized/8769127_fpx.tif","swatchColorList":[{"Natural Burlap":"8/optimized/8769128_fpx.tif"},{"Navy":"5/optimized/8748315_fpx.tif"},{"Red":"8/optimized/8748318_fpx.tif"}],"suppressColorSwatches":false,"primaryColor":"Natural Burlap","clickableSwatch":true,"selectedColorNameID":"Natural Burlap","moreColors":false,"suppressProductAttribute":false,"colorFamily":{"Natural Burlap":"Ivory/Cream"},"maxQuantity":6}'
obj = json.loads(data)
words = []
# Each entry of swatchColorList is a one-key object; the key is the colour name
for thing in obj["swatchColorList"]:
    for word in thing:
        words.append(word)
        print(word)
Output will be
Natural Burlap
Navy
Red
The words are also stored in the words list. I realize this is not a regex, but I want to discourage using regex on serialized object notations, since regular expressions are not intended for parsing strings with nested structures.

Regex command, OR doesn't seem to work

I have part of the following text that I'm reading with C#
"I have to see your driver’s license and print you an ID tag before I can send you through," he said in a flat, automatic sort of way, staring at the horns with blank-eyed fascination.
I'm reading in some lines of this one book, and I'd like to create strings out of all the words, including those with apostrophes. I'd like to split the lines based on non word characters, but I want apostrophes to be included with the word characters, so I ultimately get a list of strings with just words, so that the word "driver's" is together.
I'm using Sublime to test out the expressions, but when I do (\W+|\'), apostrophes are still captured. I don't want to split something like "you'd" into two strings. \W+ is perfect, but I'd just like to include apostrophes. How could I do that?
If you're looking for a regex matching "between" the words:
[^\w']+
should do.
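A quick way to check the pattern, sketched in Python rather than C# (in C#, Regex.Split accepts the same kind of pattern). The sample line is adapted from the question. Adding the typographic apostrophe ’ to the character class is an assumption that goes beyond the pattern above; it is there because the quoted book text uses that character in "driver’s".

```python
import re

line = "\"I have to see your driver’s license,\" he said. I don't think you'd mind."

# Split on runs of characters that are neither word characters nor apostrophes.
# The ’ is an added assumption so that "driver’s" stays in one piece.
# Splitting leaves empty strings at the edges, so those are filtered out.
tokens = [t for t in re.split(r"[^\w'’]+", line) if t]
print(tokens)
```

This yields "driver’s", "don't" and "you'd" as single tokens, with the quotes, commas and periods dropped.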
You can try String.Split: example follows
string _input ="I have to see your driver’s license and print you an ID tag before I can send you through";
string[] _words = _input.Split(' ');
If you also want to remove other characters, for example the single quote (apostrophe) "'" and the comma ",", use Replace(), like:
_input = _input.Replace("'", String.Empty).Replace(",",String.Empty);
string[] _words = _input.Split(' ');
You can also use Regex, but its performance is worse than that of these methods (if it matters).
Also, as an example, you can try my 'semantic analyzer' app at: http://webinfocentral.com/TECH/SemanticAnalyzer.aspx . It does all of this and much more (the characters to exclude are listed in the left pane).

Regular expression to select entire word except first letter, including words such as "Jack's" and "merry-go-round"

I'm trying to use a regular expression to select all of each word except the first character, much as @mahdaeng wanted to do here. The solution offered to his question was to use \B[a-z]. This works fine, except when a word contains some form of punctuation, such as "Jack's" and "merry-go-round". Is there a way to select the entire word including any contained punctuation? (Not including outside punctuation such as "? , ." etc.)
If you can enumerate the acceptable in-word punctuation, you could just expand upon the answer you linked:
\B[a-zA-Z'-]+
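Since the question doesn't name a language, here is a small Python check of this pattern. The sentence is made up for illustration.

```python
import re

sentence = "Jack's merry-go-round revolves way too fast!"

# \B only matches inside a word, so each match starts at the second letter.
tails = re.findall(r"\B[a-zA-Z'-]+", sentence)
print(tails)
# ["ack's", 'erry-go-round', 'evolves', 'ay', 'oo', 'ast']
```

The trailing "!" is excluded because it is not in the character class. One-letter words produce no match at all, since there is nothing after the first letter.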
A regex really isn't necessary for the selection itself, since you can just split your text on whitespace and deal with each word accordingly. Since you don't mention an underlying language, here's an implementation in Perl:
use strict;
use warnings;
$_ = "Jack's merry-go-round revolves way too fast!";
my @words = split /\s+/;
foreach my $word (@words)
{
    my $stripped_word = substr($word, 1);
    $stripped_word =~ s/[^a-z]$//i; #stripping out end punctuation
    print "$stripped_word\n";
}
The output is:
ack's
erry-go-round
evolves
ay
oo
ast
\B[^\s]+
(where ^\s means "not whitespace") should get you what you want assuming the words are whitespace-delimited. If they're also punctuation-delimited, you might need to enumerate the punctuation:
\B[^\s,.?!]+
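To see the difference between the two patterns, a small Python comparison on a made-up sentence:

```python
import re

sentence = "Jack's merry-go-round revolves way too fast!"

# Anything that is not whitespace: trailing punctuation is kept.
loose = re.findall(r"\B[^\s]+", sentence)

# Also excluding , . ? ! drops the trailing punctuation.
strict = re.findall(r"\B[^\s,.?!]+", sentence)

print(loose[-1])   # ast!
print(strict[-1])  # ast
```

The other five matches are identical for both patterns, because only the last word has punctuation attached.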

Regular expression for a list of items separated by comma or by comma and a space

Hey,
I can't figure out how to write a regular expression for my website. I would like to let the user input a list of items (tags) separated by a comma or by a comma and a space, for example "apple, pie,applepie". Would it be possible to have such a regexp?
Thanks!
EDIT:
I would like a regexp for javascript in order to check the input before the user submits a form.
What you're looking for is deceptively easy:
[^,]+
This will give you every comma-separated token and will exclude empty tokens (if the user enters "a,,b" you will only get 'a' and 'b'), BUT it will break if they enter "a, ,b", because the whitespace-only token is kept.
If you want to strip the spaces from either side properly (and exclude whitespace-only elements), then it gets a tiny bit more complicated:
[^,\s](?:[^,]*[^,\s])?
However, as has been mentioned in some of the comments, why do you need a regex where a simple split and trim will do the trick?
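For comparison, the split-and-trim route looks like this. The question asks for JavaScript, which has equivalent split and trim methods; this is a Python sketch, and parse_tags is a hypothetical helper name.

```python
def parse_tags(raw):
    """Split on commas, trim each piece, and drop empty or whitespace-only pieces."""
    return [piece.strip() for piece in raw.split(",") if piece.strip()]

print(parse_tags("apple, pie,applepie"))  # ['apple', 'pie', 'applepie']
print(parse_tags("a, ,b"))                # ['a', 'b']
print(parse_tags("a,,b"))                 # ['a', 'b']
```

This handles both problem inputs mentioned above without any regex. It parses rather than validates, so it is a fit only if silently dropping empty items is acceptable.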
Assuming the words in your list may be letters from a to z and you allow, but do not require, a space after the comma separators, your reg exp would be
[a-z]+(,\s*[a-z]+)*
This matches "ab" or "ab, de" in full, but not "ab ,dc".
Here's a simpler solution:
console.log("test, , test".match(/[^,(?! )]+/g));
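It doesn't break on empty properties and strips spaces before and after properties. Be aware that inside a character class (?! ) is not a lookahead: the class simply excludes commas, spaces, and the literal characters (, ?, ! and ), so tags containing any of those are split as well.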
This thread is almost 7 years old and was last active 5 months ago, but I wanted to achieve the same result as the OP and, after reading this thread, came across a nifty solution that seems to work well:
.match(/[^,\s?]+/g)
Regarding the regular expression: a more accurate description is "match any run of characters that are not a comma, whitespace, or a question mark", since ? inside a character class is a literal character rather than a quantifier.
I often work with comma-separated patterns, and for me this works:
((^|[,])pattern)+
where "pattern" is the single element regexp
This might work:
([^,]*)(, ?([^,]*))*
([^,]*)
Look for commas within the given string, then split on them. As for the whitespace, can't you just split on commas and then remove the whitespace?
I needed strict validation for comma-separated input of alphabetic characters, with no spaces. I ended up using this one, in case anyone needs it:
/^[a-z]+(,[a-z]+)*$/
Or, to support lower- and uppercase words:
/^[A-Za-z]+(?:,[A-Za-z]+)*$/
In case one needs to allow whitespace between words (around the comma in the first pattern, only after it in the second):
/^[A-Za-z]+(?:\s*,\s*[A-Za-z]+)*$/
/^[A-Za-z]+(?:,\s*[A-Za-z]+)*$/
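The whitespace-tolerant validation pattern can be exercised like this. It is a Python sketch of the JavaScript pattern, where re.fullmatch stands in for the ^ and $ anchors; TAG_LIST and is_valid_tag_list are hypothetical names.

```python
import re

# Same body as the pattern that allows whitespace around the comma.
TAG_LIST = re.compile(r"[A-Za-z]+(?:\s*,\s*[A-Za-z]+)*")

def is_valid_tag_list(raw):
    # fullmatch requires the whole input to match, like ^...$ in JavaScript.
    return TAG_LIST.fullmatch(raw) is not None

print(is_valid_tag_list("apple, pie,applepie"))  # True
print(is_valid_tag_list("apple,,pie"))           # False
print(is_valid_tag_list("apple pie"))            # False
```

Unlike the tokenizing patterns earlier in the thread, this one only answers yes or no, which fits the stated goal of checking the input before the form is submitted.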
You can try this, it worked for me:
/.+?[\|$]/g
or
/[^\|?]+/g
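but replace '|' with the separator you need. Also, don't forget to escape it where required. Note that $ and ? are literal characters inside a character class, so the first pattern only matches items that are followed by the separator or by a literal $.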
something like this should work: ((apple|pie|applepie),\s?)*