I'm going nuts trying to get a regex to detect spam of keywords in the user inputs. Usually there is some normal text at the start and the keyword spam at the end, separated by commas or other chars.
What I need is a regex to count the number of keywords to flag the text for a human to check it.
The text is usually like this:
[random text, with commas, dots and all]
keyword1, keyword2, keyword3, keyword4, keyword5,
Keyword6, keyword7, keyword8...
I've tried several regex to count the matches:
-This only gets one out of two keywords
[,-](\w|\s)+[,-]
-This also matches the random text
(?:([^,-]*)(?:[^,-]|$))
Can anyone tell me a regex to do this? Or should I take a different approach?
Thanks!
To answer your question, here is a regexp to match a string that occurs between two commas.
(?<=,)[^,]+(?=,)
This regexp does not match, and hence does not consume, the delimiting commas.
This regexp would match " and hence does not consume" in the previous sentence.
The fact that your regexp matched and consumed the commas is why your attempt only matched every other candidate: the closing comma of one match was no longer available as the opening comma of the next.
Also, if the whole input is a single string, you will want to keep matches from spanning line breaks. In that case, use:
(?<=,)[^,\n]+(?=,)
http://www.phpliveregex.com/p/1DJ
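To make the "every other candidate" effect concrete, here is a small Python sketch (Python is used only for illustration; the sample string is made up). It compares a pattern that consumes the delimiters with the lookaround version, which leaves them in place for the next match.

```python
import re

# Hypothetical sample: four keywords, each surrounded by commas.
text = ",k1,k2,k3,k4,"

# Consuming version: every match swallows its closing comma, so the next
# keyword has no opening comma left to match against.
consuming = re.findall(r"[,-][\w\s]+[,-]", text)

# Lookaround version: the commas are only asserted, never consumed.
lookaround = re.findall(r"(?<=,)[^,\n]+(?=,)", text)

print(consuming)   # [',k1,', ',k3,']
print(lookaround)  # ['k1', 'k2', 'k3', 'k4']
```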
As others have said this is potentially a very tricky thing to do... It suffers from all of the same failures as general "word filtering" (e.g. people will "mask" the input). It is made even more difficult without plenty of example posts to test against...
Solution
Anyway, assuming that the keywords are on separate lines from the rest of the input and are separated by commas, you can match the keyword lines like this:
Regex
#(?:^)((?:(?:[\w\.]+)(?:, ?|$))+)#m
Input
Taken from your question above:
[random text, with commas, dots and all]
keyword1, keyword2, keyword3, keyword4, keyword5,
Keyword6, keyword7, keyword8...
Output
// preg_match_all('#(?:^)((?:(?:[\w\.]+)(?:, ?|$))+)#m', $string, $matches);
// var_dump($matches);
array(2) {
[0]=>
array(2) {
[0]=>
string(49) "keyword1, keyword2, keyword3, keyword4, keyword5,"
[1]=>
string(31) "Keyword6, keyword7, keyword8..."
}
[1]=>
array(2) {
[0]=>
string(49) "keyword1, keyword2, keyword3, keyword4, keyword5,"
[1]=>
string(31) "Keyword6, keyword7, keyword8..."
}
}
Explanation
#(?:^)((?:(?:[\w]+)(?:, ?|$))+)#m
1. # => Starting delimiter
2. (?:^) => Matches start of line in a non-capturing group (you could just use ^; I was using |\n originally and didn't update)
3. ( => Start a capturing group
4. (?: => Start a non-capturing group
5. (?:[\w]+) => A non-capturing group to match one or more word characters a-zA-Z0-9_ (using a character class so that you can add to it if you need to)
6. (?:, ?|$) => A non-capturing group to match either a comma (with an optional space) or the end of the string/line
7. )+ => End the non-capturing group (4) and repeat 5/6 to find multiple matches in the line
8. ) => Close the capture group (3)
9. # => Ending delimiter
10. m => Multi-line modifier
Follow up from number 2:
#^((?:(?:[\w]+)(?:, ?|$))+)#m
Counting keywords
Having now returned an array of lines containing only keywords, you can count the number of commas and thus get the number of keywords:
$key_words = implode(', ', $matches[1]); // Join lines returned by preg_match_all
echo substr_count($key_words, ','); // 8
N.B. In most circumstances this will return NUMBER_OF_KEY_WORDS - 1 (i.e. in your case 7); it returns 8 because you have a comma at the end of your first line of key words.
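Because counting commas depends on whether a line ends with a trailing comma, counting the non-empty pieces may be more robust. Here is a rough Python sketch of the same idea as the PHP pattern above (the names KEYWORD_LINE and count_keywords are made up for this example, and the character class [\w.] is an assumption that keywords contain only word characters and dots).

```python
import re

# A line qualifies when it consists only of keywords, each followed by a
# comma (with an optional space) or by the end of the line.
KEYWORD_LINE = re.compile(r"^((?:[\w.]+(?:, ?|$))+)", re.MULTILINE)

def count_keywords(text):
    """Count the non-empty comma-separated pieces on keyword-only lines."""
    total = 0
    for line in KEYWORD_LINE.findall(text):
        total += sum(1 for part in line.split(",") if part.strip())
    return total

sample = (
    "[random text, with commas, dots and all]\n"
    "keyword1, keyword2, keyword3, keyword4, keyword5,\n"
    "Keyword6, keyword7, keyword8..."
)
print(count_keywords(sample))  # 8
```

Note that a line holding a single ordinary word also qualifies as a "keyword line", so a threshold on the count is still needed before flagging anything.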
Links
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://www.regular-expressions.info/
http://php.net/substr_count
Why not just use explode and trim?
$keywords = array_map ('trim', explode (',', $keywordstring));
Then do a count() on $keywords.
If you think keywords with spaces in them are spam, then you can iterate over the $keywords array and look for any that contain whitespace. There might be legitimate reasons for having spaces in a keyword though. If you're talking about superheroes on your system, for example, someone might enter The Tick or Iron Man as a keyword.
I don't think counting keywords and looking for spaces in keywords are really very good strategies for detecting spam though. You might want to look into other bot protection strategies instead, or even use manual moderation.
How to match on the String of text between the commas?
This SO post was marked as a duplicate of my question. It is not a duplicate, and none of the answers here showed how to also match on the strings between the commas, so see below for how to take this a step further.
How to Match on single digit values in a CSV String
For example, suppose the task is to search the comma-separated values for a single 7, 8 or 9, without matching combinations such as 17, 77 or 78.
The answer is to use lookarounds and place your search pattern between them:
(?<=^|,)[789](?=,|$)
See live demo.
The pattern above is the more concise one, but here are both patterns that were provided as solutions to this question of matching on strings between the commas:
(?<=^|,)[789](?=,|$) Provided by #Bohemian and chosen as the Correct Answer
(?:(?<=^)|(?<=,))[789](?:(?=,)|(?=$)) Provided in comments by #Ouroborus
Demo: https://regex101.com/r/fd5GnD/1
Your first regexp doesn't need a preceding comma
[\w\s]+[,-]
A regex that will match strings between two commas or start or end of string is
(?<=,|^)[^,]*(?=,|$)
Or, a bit more efficient:
(?<![^,])[^,]*(?![^,])
See the regex demo #1 and demo #2.
Details:
(?<=,|^) / (?<![^,]) - start of string or a position immediately preceded with a comma
[^,]* - zero or more chars other than a comma
(?=,|$) / (?![^,]) - end of string or a position immediately followed with a comma
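The negative-lookaround form is also the more portable one. As far as I know, Python's re module rejects (?<=,|^) because the lookbehind alternatives have different widths, while (?<![^,]) is accepted. A short sketch, with made-up sample strings and [^,]+ instead of [^,]* so that empty fields are skipped:

```python
import re

# "Not preceded by a non-comma" means: start of string or right after a comma.
# "Not followed by a non-comma" means: end of string or right before a comma.
SINGLE_789 = re.compile(r"(?<![^,])[789](?![^,])")
FIELD = re.compile(r"(?<![^,])[^,]+(?![^,])")

digits = SINGLE_789.findall("7,17,8,77,9,78")
fields = FIELD.findall("a,b c,d")

print(digits)  # ['7', '8', '9']
print(fields)  # ['a', 'b c', 'd']
```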
If people still search for this in 2021
([^,\n])+
Match anything except new line and comma
regexr.com/60eme
I think the difficulty is that the random text can also contain commas.
If the keywords are all on one line and it is the last line of the text as a whole, trim the whole text removing new line characters from the end. Then take the text from the last new line character to the end. This should be your string containing the keywords. Once you have this part singled out, you can explode the string on comma and count the parts.
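<?php
$string = " some gibberish, some more gibberish, and random text
keyword1, keyword2, keyword3
";
// Trim once and reuse the result, so the offset found by strrpos()
// refers to the same string that is passed to substr().
$trimmed = trim($string);
$lastEOL = strrpos($trimmed, PHP_EOL);
// Take everything after the last line break (or the whole text if there is none).
$keywordLine = ($lastEOL === false) ? $trimmed : substr($trimmed, $lastEOL + strlen(PHP_EOL));
$keywords = array_map('trim', explode(',', $keywordLine));
echo "Number of keywords: " . count($keywords);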
I know it is not a regex, but I hope it helps nevertheless.
The only way to find a solution is to find something that separates the random text from the keywords and that does not occur within the keywords. If a new line can occur within the keywords, you cannot use it. But what about 2 consecutive new lines? Or some other characters?
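$string = " some gibberish, some more gibberish, and random text

keyword1, keyword2, keyword3,
keyword4, keyword5, keyword6,
keyword7, keyword8, keyword9
";
// Trim once and reuse the result, so the offset matches the string being cut.
$trimmed = trim($string);
$separator = PHP_EOL . PHP_EOL; // 2 end of lines after random text
$lastEOL = strrpos($trimmed, $separator);
$keywordLine = ($lastEOL === false) ? $trimmed : substr($trimmed, $lastEOL + strlen($separator));
$keywords = array_map('trim', explode(',', $keywordLine));
echo "Number of keywords: " . count($keywords);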
(edit: added example for more new lines - long shot)
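The same "cut at the last separator, then split on commas" idea can be sketched in Python. This is only an illustration under the assumption that a blank line separates the random text from the keywords; the function name is made up.

```python
def keywords_after_last_blank_line(text):
    """Return the comma-separated items that follow the last blank line."""
    # Normalise Windows line endings, then keep what follows the last blank line.
    block = text.replace("\r\n", "\n").strip().rsplit("\n\n", 1)[-1]
    # Drop empty pieces so trailing commas do not inflate the count.
    return [part.strip() for part in block.split(",") if part.strip()]

sample = (
    " some gibberish, some more gibberish, and random text\n"
    "\n"
    "keyword1, keyword2, keyword3,\n"
    "keyword4, keyword5, keyword6,\n"
    "keyword7, keyword8, keyword9\n"
)
keywords = keywords_after_last_blank_line(sample)
print("Number of keywords:", len(keywords))  # Number of keywords: 9
```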
Supposing I have a group of words as a sentence like this :
Aujourd'hui séparer l'élément en deux
And want the result to be as an individual words (after the split) :
Aujourd'hui | séparer | l' | élément | en | deux
Note : as you can see, « aujourd'hui » is a single word.
What would be the best regex to use here ?
With my current knowledge, all I can achieve is this basic operation:
QString sentence("Aujourd'hui séparer l'élément en deux");
QStringList list = sentence.split(" ");
Output :
Aujourd'hui / Séparer / l'élément / en / deux
Here are the two questions closest to mine : this and this.
Since the contractions that you want to consider as separate words are usually a single letter + an apostrophe in French (like l'huile, n'en, d'accord) you can use a pattern that either matches 1+ whitespace chars, or a location that is immediately preceded with a start of a word, then 1 letter and then an apostrophe.
I also suggest taking into account curly apostrophes. So, use
\s+|(?<=\b\p{L}['’])\b
See the regex demo.
Details
\s+ - 1+ whitespaces
| - or
(?<=\b\p{L}['’])\b - a word boundary (\b) location that is preceded with a start of word (\b), a letter (\p{L}) and a ' or ’.
In Qt, you may use
QStringList result = text.split(
QRegularExpression(R"(\s+|(?<=\b\p{L}['’])\b)",
QRegularExpression::PatternOption::UseUnicodePropertiesOption)
);
The R"(...)" is a raw string literal notation, you may use "\\s+|(?<=\\b\\p{L}['’])\\b" if you are using a C++ environment that does not allow raw string literals.
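For readers who want to try the pattern outside Qt, here is a rough Python equivalent. Python's built-in re module does not support \p{L}, so [^\W\d_] is used as a stand-in for "any letter"; that substitution and the function name are assumptions of this sketch. Splitting on a zero-width match requires Python 3.7 or later.

```python
import re

# \s+                        : one or more whitespace characters, or
# (?<=\b[^\W\d_]['’])\b      : the boundary right after a single-letter word
#                              followed by a straight or curly apostrophe.
SPLIT_RE = re.compile(r"\s+|(?<=\b[^\W\d_]['’])\b")

def split_french(sentence):
    """Split on whitespace and after one-letter elisions such as l' or d'."""
    return [part for part in SPLIT_RE.split(sentence) if part]

words = split_french("Aujourd'hui séparer l'élément en deux")
print(" | ".join(words))  # Aujourd'hui | séparer | l' | élément | en | deux
```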
Not sure if I understood what you are saying but this might help you
QString sentence("Aujourd'hui séparer l'élément en deux");
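// Split on either a space or an apostrophe; a plain string separator " '"
// would only match the two-character sequence space + apostrophe.
QStringList list = sentence.split(QRegularExpression("[ ']"));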
I don't know C++ but I guess it supports negative lookbehind.
Have a try with:
(?: |(?<!\w{2})')
This will split on a space, or on an apostrophe that is not preceded by 2 word characters.
Demo & explanation
Well, you're dealing with a natural language, here, and the first - and toughest - problem to answer is: Can you actually come up with a fixed rule, when splits should happen? In this particular case, there is really no logical reason, why French considers "aujourd'hui" as a single word (when logically, it could be parsed as "au jour de hui").
I'm not familiar with all the possible pitfalls in French, but if you really want to make sure to cover all obscure cases, you'll have to look for a natural language tokenizer.
Anyway, for the example you give, it may be good enough to use a QRegularExpression with negative lookbehind to omit splits when more than one letter precedes the apostrophe:
sentence.split(QRegularExpression("(?<![\\w][\\w])'"));
I'm using Notepad++ v6.9.2. I need to find ICD9 Codes which will take the following forms:
(X##.), (X##.#) or (X##.##) where X is a letter and always at the beginning and # is a number
(##.), (##.#), (##.##), (###.), (###.#), (###.##) or (###.###) where # is a number
and
replace the opening ( with | and the closing ) plus the single space after it with |.
EXAMPLE
(305.11) TOBACCO ABUSE-CONTINUOUS
Becomes:
|305.11|TOBACCO ABUSE-CONTINUOUS
OTHER CONSIDERATIONS:
There are other places with parentheses but will only contain letters. Those do not need to be changed. Some examples:
UE (Major) Amputation
(282.45) THALASSEMIA (ALPHA)
(284.87) RED CELL APLASIA (W/THYMOMA)
Pain (non-headache) (338.3) Neoplasm related pain (acute) (chronic)
Becomes
UE (Major) Amputation
|282.45|THALASSEMIA (ALPHA)
|284.87|RED CELL APLASIA (W/THYMOMA)
Pain (non-headache) |338.3|Neoplasm related pain (acute) (chronic)
You can use a regex like this to match ICD9 codes:
[EV]?\d+\.?\d*
This covers both E and V codes and cases where the . is omitted (in my experience this is not uncommon). Use this regex to match the portions of text you need:
\(([EV]?\d+\.?\d*)\)\s?
The outer parentheses are escaped to match literal ( and ) characters, and the inner parentheses create a group for replacement (\1). The \s? at the end matches an optional space after the closing parenthesis, so that it is removed as well.
So your Notepad++ Replace dialog should be filled in like this:
Find what: \(([EV]?\d+\.?\d*)\)\s?
Replace with: |\1|
Search Mode: Regular expression
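If you want to check the pattern outside Notepad++, the same find/replace can be sketched in Python. The function name is hypothetical, and the replacement |\1| is inferred from the desired output in the question.

```python
import re

# Literal parentheses around an optional E/V prefix, digits, an optional dot
# and optional decimals; \s? also removes one space after the closing paren.
ICD9_IN_PARENS = re.compile(r"\(([EV]?\d+\.?\d*)\)\s?")

def pipe_delimit(line):
    """Rewrite '(305.11) TEXT' as '|305.11|TEXT'; other parentheses are untouched."""
    return ICD9_IN_PARENS.sub(r"|\1|", line)

print(pipe_delimit("(305.11) TOBACCO ABUSE-CONTINUOUS"))
print(pipe_delimit("UE (Major) Amputation"))
print(pipe_delimit("Pain (non-headache) (338.3) Neoplasm related pain (acute) (chronic)"))
```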
I'm editing some text directly from OCR engine and in some paragraphs the OCR engine ignores the opening and closing quotes. I prefer editing in HTML mode and as a result end up with some text like:
<p>“Wait a moment,” Jacey said. The street light lit up his aged, rat face. Who’s on the move?”</p>
Notice the missing “.
Another sentence:
<p>“He said he’ coming afer you,” Harry said, and he’ bringing the boys too!”</p>
I use this regex : ([>\.\,])(.*?)” which seems to do the job for the second sentence but not for the first. This is because the regex is matching from left to right and so matched the extra sentence The street light lit up his aged, rat face. which should not be within the quotes.
I was thinking that the problem could be solved if the matching was done from right to left. I know this is an option available in C# but I'm using the regex engine of text-based editors to edit a simple text file. Is there a way to locate just the last sentence before the ”, which is the sentence Who’s on the move?
[EDIT]
I have been trying using the lookbehind regex: (?<=(?:\. |, |>)(\w)(.*?))(”) which seems to find all sentences with missing open quotes, “, but the problem is I cannot replace the contents inside the (?<=) construct with \3“\1\2\3 because lookbehind is 0 length. Instead the text is just duplicated. For example with the above regex the sentence Who’s on the move?” becomes Who’s on the move?”“Who’s on the move?”
Any ideas will be appreciated.
Thanks
Recursion and Defined Subroutines
The following regex checks that strings are balanced. The code below (see its output in the online demo) checks several strings. The explanations are in the comments.
$balanced_string_regex = "~(?sx) # Free-Spacing
(?(DEFINE) # Define a few subroutines
(?<double>&ldquo;(?:(?!&[lr]squo;).)*&rdquo;) # full set of doubles (no quotes inside)
(?<single>&lsquo;(?:(?!&[lr]dquo;).)*&rsquo;) # full set of singles (no quotes inside)
(?<notquotes>(?:(?!&[lr][sd]quo;).)*) # chars that are not quotes
) # end DEFINE
^ # Start of string
(?: # Start non-capture group
(?&notquotes) # Any non-quote chars
&l(?<type>[sd])quo; # Opening quote, capture single or double type
# any full singles, doubles, not quotes or recursion
(?:(?&single)|(?&double)|(?&notquotes)|(?R))*
&r\k<type>quo; # Closing quote of the correct type
(?&notquotes) #
)++ # Repeat non-capture group
$ # End of string
~";
$string = "&ldquo;He said &rdquo; &lsquo;He said &rsquo;";
check_string($string);
$string = "<p>&ldquo;Wait a moment,&rdquo; Jacey said. The street light lit up his aged, rat face. Who&rsquo;s on the move?&rdquo;</p>";
check_string($string);
$string = "<p>&ldquo;Wait a moment,&rdquo; Jacey said. The street light lit up his aged, rat face. &lsquo;Whos on the &ldquo;move?&rdquo; &rsquo;</p>";
check_string($string);
$string = "<p>&ldquo;He said he&rsquo; coming afer you,&rdquo; Harry said, and he&rsquo; bringing the boys too!&rdquo;</p>";
check_string($string);
$string = "<p>&ldquo;He &lsquo;said he&rsquo; coming afer you,&rdquo; Harry said, and he&ldquo; bringing the boys too!&rdquo;</p>";
check_string($string);
function check_string($string) {
global $balanced_string_regex;
echo (preg_match($balanced_string_regex, $string)) ?
"Balanced!\n" :
" Nah... Not Balanced.\n" ;
}
Output
Balanced!
Nah... Not Balanced.
Balanced!
Nah... Not Balanced.
Balanced!
Replacing Missing Quotes
As I've indicated in the comments, IMO replacing missing quotes is hazardous: before or after what word should the missing quote fall? If there was any kind of nesting, can we be sure that we've correctly identified the missing quote? For that reason, if you're going to do anything, my inclination would be to match the balanced portion (hoping it is correct) and remove any extra quotes.
The pattern above lends itself to all kinds of tweaks. For instance, on this regex demo, we match and replace an unbalanced quote. Since this was requested, I'll offer a second potential tweak with some reluctance—this one inserts a missing left quote at the beginning of the phrase preceding the unmatched right quote.
I can't seem to make this regex work.
The input is as follows. It's really on one row, but I have inserted line breaks after each \r\n so that it's easier to see, so no check for space characters is needed.
01-03\r\n
01-04\r\n
TEXTONE\r\n
STOCKHOLM\r\n
350,00\r\n ---- 350,00 should be the last value in the first match
12-29\r\n
01-03\r\n
TEXTTWO\r\n
COPENHAGEN\r\n
10,80\r\n
This could go on with another 01-31 and 02-01, marking another new match (these are dates).
I would like to have a total of 2 matches for this input.
My problem is that I can't figure out how to look ahead and match the start of a new match (two consecutive dates) without including those dates in the first match. They should belong to the second match.
It's hard to explain, but I hope someone will get me.
This is what I've got so far, but it's not even close:
(.*?)((?<=\\d{2}-\\d{2}))
The matches I want are:
1: 01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n
2: 12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n
After that I can easily separate the columns with \r\n.
Could this more explicit pattern work for you?
(\d{2}-\d{2})\r\n(\d{2}-\d{2})\r\n(.*)\r\n(.*)\r\n(\d+(?:,?\d+))
Here's another option for you to try:
(.+?)(?=\d{2}-\d{2}\\r\\n\d{2}-\d{2}|$)
Rubular
/
\G
(
(?:
[0-9]{2}-[0-9]{2}\r\n
){2}
(?:
(?! [0-9]{2}-[0-9]{2}\r\n ) [^\n]*\n
)*
)
/xg
Why do so much work?
$string = q(01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n);
for (split /(?=(?:\d{2}-\d{2}\\r\\n){2})/, $string) {
print join( "\t", split /\\r\\n/), "\n"
}
Output:
01-03 01-04 TEXTONE STOCKHOLM 350,00
12-29 01-03 TEXTTWO COPENHAGEN 10,80
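The split-on-lookahead approach carries over to other languages. Here is a hedged Python sketch of the same idea, assuming the input contains real CR/LF characters rather than the literal text \r\n (the function name is made up, and splitting on a zero-width match needs Python 3.7 or later).

```python
import re

# A record starts wherever two date lines follow each other; the lookahead
# marks that position without consuming the dates.
RECORD_START = re.compile(r"(?=(?:\d{2}-\d{2}\r\n){2})")

def split_records(data):
    """Split the stream into records, then each record into its columns."""
    records = [chunk for chunk in RECORD_START.split(data) if chunk]
    return [[field for field in record.split("\r\n") if field] for record in records]

data = (
    "01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n"
    "12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n"
)
rows = split_records(data)
for row in rows:
    print("\t".join(row))
```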