How to unpunctuate, lowercase, de-space and hyphenate a string with regex? - regex

If I have a string like this
Newsflash: The Big(!) Brown Dog's Brother (T.J.) Ate The Small Blue Egg
how would I convert that into the following using regex:
newsflash-the-big-brown-dogs-brother-tj-ate-the-small-blue-egg
In other words, punctuation is discarded and spaces are replaced with hyphens.

It sounds like you want to create a "URL slug" -- a URL-friendly version of an article's title, for example. That means you'll want to make sure you remove all possible non-URL-friendly characters, not just a few. You might do it this way (in order):
Remove all non-letter non-number non-space characters by:
Replacing regex [^A-Za-z0-9 ] with the empty string "".
Replace all spaces with a dash by:
Replacing regex \s+ with the string "-".
Lower-case the string by:
Java s = s.toLowerCase();
JavaScript s = s.toLowerCase();
C# s = s.ToLower();
Perl $s = lc($s);
Python s = s.lower()
PHP $s = strtolower($s);
Ruby s = s.downcase
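As a minimal sketch, the three steps above can be chained in Python. The function name slugify is only an illustrative choice, and the strip() call is an addition (not part of the steps) to avoid a leading or trailing dash.

```python
import re

def slugify(title):
    """Hypothetical helper: apply the three steps in order."""
    s = re.sub(r"[^A-Za-z0-9 ]", "", title)  # 1. keep only letters, digits and spaces
    s = re.sub(r"\s+", "-", s.strip())       # 2. each run of spaces becomes one dash
    return s.lower()                         # 3. lower-case the result

title = "Newsflash: The Big(!) Brown Dog's Brother (T.J.) Ate The Small Blue Egg"
print(slugify(title))
# newsflash-the-big-brown-dogs-brother-tj-ate-the-small-blue-egg
```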

Replace the regex [\s-]+ with "-", then replace [^\w-] with "".
Then, call ToLowerCase or equivalent.
In Javascript:
var s = "Newsflash: The Big(!) Brown Dog's Brother (T.J.) Ate The Small Blue Egg";
alert(s.replace(/[\s-]+/g, '-').replace(/[^\w-]/g, '').toLowerCase());

Replace /\W+/ with '-', that will replace all non-word characters with a dash.
Then, collapse dashes by replacing /-+/ with '-'.
Then, lowercase the string - pure regex solutions cannot do that. You didn't say which language you are using, so I cannot give you an example, but your language might have String.toLowerCase() or a tr/// call (tr/A-Z/a-z/, for example, in Perl).
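For comparison, here is a rough Python sketch of this \W+ variant (the function name is made up for illustration). Note that apostrophes and periods turn into dashes rather than disappearing, so the result is not identical to the output requested in the question.

```python
import re

def dash_nonword(title):
    """Hypothetical helper: non-word runs -> dash, collapse dashes, lower-case."""
    s = re.sub(r"\W+", "-", title)  # every run of non-word characters becomes a dash
    s = re.sub(r"-+", "-", s)       # collapse repeated dashes
    return s.lower()

title = "Newsflash: The Big(!) Brown Dog's Brother (T.J.) Ate The Small Blue Egg"
print(dash_nonword(title))
# newsflash-the-big-brown-dog-s-brother-t-j-ate-the-small-blue-egg
```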

Related

Regex for text (and numbers and special characters) between multiple commas [duplicate]

I'm going nuts trying to get a regex to detect spam of keywords in the user inputs. Usually there is some normal text at the start and the keyword spam at the end, separated by commas or other chars.
What I need is a regex to count the number of keywords to flag the text for a human to check it.
The text is usually like this:
[random text, with commas, dots and all]
keyword1, keyword2, keyword3, keyword4, keyword5,
Keyword6, keyword7, keyword8...
I've tried several regex to count the matches:
-This only gets one out of two keywords
[,-](\w|\s)+[,-]
-This also matches the random text
(?:([^,-]*)(?:[^,-]|$))
Can anyone tell me a regex to do this? Or should I take a different approach?
Thanks!
Per your answer to my question, here is a regexp to match a string that occurs between two commas.
(?<=,)[^,]+(?=,)
This regexp does not match, and hence does not consume, the delimiting commas.
This regexp would match " and hence does not consume" in the previous sentence.
The fact that your regexp matched and consumed the commas is why it only matched every other candidate.
Also, if the whole input is a single string, you will want to prevent matches from spanning line breaks. In that case, use:
(?<=,)[^,\n]+(?=,)
http://www.phpliveregex.com/p/1DJ
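The difference between consuming the commas and only looking at them can be seen in a small Python sketch (the sample string is invented for illustration):

```python
import re

text = "a,b,c,d,e"

# The commas are part of each match, so the comma that closes one match
# is no longer available to open the next one.
consuming = re.findall(r",[^,\n]+,", text)

# Lookbehind/lookahead only assert that the commas are there.
lookaround = re.findall(r"(?<=,)[^,\n]+(?=,)", text)

print(consuming)   # [',b,', ',d,']
print(lookaround)  # ['b', 'c', 'd']
```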
As others have said this is potentially a very tricky thing to do... It suffers from all of the same failures as general "word filtering" (e.g. people will "mask" the input). It is made even more difficult without plenty of example posts to test against...
Solution
Anyway, assuming that keywords will be on separate lines to the rest of the input and separated by commas you can match the lines with keywords in like:
Regex
#(?:^)((?:(?:[\w]+)(?:, ?|$))+)#m
Input
Taken from your question above:
[random text, with commas, dots and all]
keyword1, keyword2, keyword3, keyword4, keyword5,
Keyword6, keyword7, keyword8
Output
// preg_match_all('#(?:^)((?:(?:[\w]+)(?:, ?|$))+)#m', $string, $matches);
// var_dump($matches);
array(2) {
[0]=>
array(2) {
[0]=>
string(49) "keyword1, keyword2, keyword3, keyword4, keyword5,"
[1]=>
string(28) "Keyword6, keyword7, keyword8"
}
[1]=>
array(2) {
[0]=>
string(49) "keyword1, keyword2, keyword3, keyword4, keyword5,"
[1]=>
string(28) "Keyword6, keyword7, keyword8"
}
}
Explanation
#(?:^)((?:(?:[\w]+)(?:, ?|$))+)#m
1. # => Starting delimiter
2. (?:^) => Matches start of line in a non-capturing group (you could just use ^; I was using |\n originally and didn't update)
3. ( => Start a capturing group
4. (?: => Start a non-capturing group
5. (?:[\w]+) => A non-capturing group to match one or more word characters a-zA-Z0-9_ (using a character class so that you can add to it if you need to)
6. (?:, ?|$) => A non-capturing group to match either a comma (with an optional space) or the end of the string/line
7. )+ => End the non-capturing group (4) and repeat 5/6 to find multiple matches in the line
8. ) => Close the capture group 3
9. # => Ending delimiter
10. m => Multi-line modifier
Follow up from number 2:
#^((?:(?:[\w]+)(?:, ?|$))+)#m
Counting keywords
Having now obtained an array of lines containing only keywords, you can count the commas to get the number of keywords:
$key_words = implode(', ', $matches[1]); // Join lines returned by preg_match_all
echo substr_count($key_words, ','); // 8
N.B. In most circumstances this will return NUMBER_OF_KEY_WORDS - 1 (i.e. in your case 7); it returns 8 because you have a comma at the end of your first line of key words.
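A rough Python equivalent of the same idea is sketched below, assuming (as above) that keyword lines consist only of word characters separated by commas. It counts the non-empty pieces instead of the commas, so a trailing comma does not change the result.

```python
import re

text = (
    "[random text, with commas, dots and all]\n"
    "keyword1, keyword2, keyword3, keyword4, keyword5,\n"
    "Keyword6, keyword7, keyword8"
)

# Same pattern as above, with the redundant groups removed.
KEYWORD_LINE = re.compile(r"^((?:\w+(?:, ?|$))+)", re.MULTILINE)

lines = KEYWORD_LINE.findall(text)
keywords = [k.strip() for line in lines for k in line.split(",") if k.strip()]
print(len(keywords))  # 8
```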
Links
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://www.regular-expressions.info/
http://php.net/substr_count
Why not just use explode and trim?
$keywords = array_map ('trim', explode (',', $keywordstring));
Then do a count() on $keywords.
If you think keywords with spaces in them are spam, then you can iterate over the $keywords array and look for any that contain whitespace. There might be legitimate reasons for having spaces in a keyword though. If you're talking about superheroes on your system, for example, someone might enter The Tick or Iron Man as a keyword.
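The same split-and-trim idea, sketched in Python with an invented sample string. Unlike a plain explode, this version also drops empty pieces (for example after a trailing comma), which is an assumption about what you would want to count.

```python
keyword_string = "The Tick, Iron Man, keyword3 ,keyword4,"

# Split on commas, trim each piece, and ignore empty pieces.
keywords = [k.strip() for k in keyword_string.split(",") if k.strip()]

# Keywords that still contain whitespace after trimming.
with_spaces = [k for k in keywords if any(c.isspace() for c in k)]

print(len(keywords))  # 4
print(with_spaces)    # ['The Tick', 'Iron Man']
```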
I don't think counting keywords and looking for spaces in keywords are really very good strategies for detecting spam though. You might want to look into other bot protection strategies instead, or even use manual moderation.
How to match on the String of text between the commas?
This SO post was marked as a duplicate of my question. It is NOT a duplicate, however, and none of the answers here covered how to also match on the strings between the commas, so see below for how to take this a step further.
How to Match on single digit values in a CSV String
For example, suppose the task is to search the comma-separated values for a single 7, 8 or 9, without matching combinations such as 17, 77 or 78.
The answer is to use lookarounds and place your search pattern between them:
(?<=^|,)[789](?=,|$)
See live demo.
The above pattern is the more concise one; for reference, here are the two patterns provided as solutions to the question of matching on strings within the commas:
(?<=^|,)[789](?=,|$) Provided by @Bohemian and chosen as the Correct Answer
(?:(?<=^)|(?<=,))[789](?:(?=,)|(?=$)) Provided in comments by @Ouroborus
Demo: https://regex101.com/r/fd5GnD/1
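If you try this in Python, be aware that the built-in re module only accepts fixed-width lookbehind, so (?<=^|,) is rejected there. A sketch of one way around that is to use negated lookarounds ("not preceded by a non-comma", "not followed by a non-comma"); the sample data is made up:

```python
import re

# A 7, 8 or 9 that has no non-comma character directly before or after it.
SINGLE_DIGIT = re.compile(r"(?<![^,])[789](?![^,])")

csv_line = "7,17,8,77,9,78"
print(SINGLE_DIGIT.findall(csv_line))  # ['7', '8', '9']
```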
Your first regexp doesn't need a preceding comma
[\w\s]+[,-]
A regex that will match strings between two commas or start or end of string is
(?<=,|^)[^,]*(?=,|$)
Or, a bit more efficient:
(?<![^,])[^,]*(?![^,])
See the regex demo #1 and demo #2.
Details:
(?<=,|^) / (?<![^,]) - start of string or a position immediately preceded with a comma
[^,]* - zero or more chars other than a comma
(?=,|$) / (?![^,]) - end of string or a position immediately followed with a comma
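A small Python check of the second pattern (the first one uses a variable-width lookbehind, which Python's built-in re module does not accept). The input string is invented; note that the empty field between two adjacent commas is returned as an empty string.

```python
import re

FIELD = re.compile(r"(?<![^,])[^,]*(?![^,])")

fields = FIELD.findall("a,b c,,d")
print(fields)  # ['a', 'b c', '', 'd']
```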
If people still search for this in 2021
([^,\n])+
Match anything except new line and comma
regexr.com/60eme
I think the difficulty is that the random text can also contain commas.
If the keywords are all on one line and it is the last line of the text as a whole, trim the whole text removing new line characters from the end. Then take the text from the last new line character to the end. This should be your string containing the keywords. Once you have this part singled out, you can explode the string on comma and count the parts.
<?php
$string = " some gibberish, some more gibberish, and random text
keyword1, keyword2, keyword3
";
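$string = trim($string); // trim once, so the offset below refers to the same string
$lastEOL = strrpos($string, PHP_EOL);
$keywordLine = substr($string, $lastEOL + strlen(PHP_EOL)); // text after the last line break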
$keywords = explode(',', $keywordLine);
echo "Number of keywords: " . count($keywords);
I know it is not a regex, but I hope it helps nevertheless.
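The same "take the last line and split it" idea looks like this in Python, as a sketch under the same assumption that all keywords sit on the final line:

```python
text = " some gibberish, some more gibberish, and random text\nkeyword1, keyword2, keyword3\n"

# Trim the whole text, then keep only what follows the last newline.
keyword_line = text.strip().rsplit("\n", 1)[-1]
keywords = [k.strip() for k in keyword_line.split(",")]

print(len(keywords))  # 3
```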
The only way to find a solution is to find something that separates the random text from the keywords and that never occurs within the keywords. If a newline can occur within the keywords, you cannot use it as the separator. But what about two consecutive newlines, or some other character sequence?
$string = " some gibberish, some more gibberish, and random text

keyword1, keyword2, keyword3,
keyword4, keyword5, keyword6,
keyword7, keyword8, keyword9
";
$string = trim($string); // trim once, so the offset below refers to the same string
$lastEOL = strrpos($string, PHP_EOL . PHP_EOL); // 2 end of lines after random text
$keywordLine = substr($string, $lastEOL + 2 * strlen(PHP_EOL));
$keywords = explode(',', $keywordLine);
echo "Number of keywords: " . count($keywords);
(edit: added example for more new lines - long shot)

I want to remove symbols from a string in dart

I want to remove all symbols from a string, keeping only the characters (Japanese hiragana, kanji, and Roman alphabet) that match this regex.
var reg = RegExp(
r'([\u3040-\u309F]|\u3000|[\u30A1-\u30FC]|[\u4E00-\u9FFF]|[a-zA-Z]|[々〇〻])');
I don't know what to put in this "?".
text=text.replaceAll(?,"");
a="「私は、アメリカに行きました。」、'I went to the United States.'"
b="私はアメリカに行きましたI went to the United States"
I want to make a into b.
You can use
String a = "「私は、アメリカに行きました。」、'I went to the United States.'";
a = a.replaceAll(RegExp(r'[^\p{L}\p{M}\p{N}\s]+', unicode: true), '');
Also, if you just want to remove any punctuation or math symbols, you can use
.replaceAll(RegExp(r'[\p{P}\p{S}]+', unicode: true), '')
Output:
私はアメリカに行きましたI went to the United States
The [^\p{L}\p{M}\p{N}\s]+ regex matches one or more chars other than letters (\p{L}), diacritics (\p{M}), digits (\p{N}) and whitespace chars (\s).
The [\p{P}\p{S}]+ regex matches one or more punctuation proper (\p{P}) or symbol (\p{S}, which includes math symbols) chars.
The unicode: true enables the Unicode property class support in the regex.
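For readers who want the same effect outside Dart: Python's built-in re module has no \p{...} classes, so the sketch below approximates [^\p{L}\p{M}\p{N}\s] with the standard unicodedata module. The function name is hypothetical, and str.isspace() is only an approximation of \s.

```python
import unicodedata

def strip_symbols(text):
    """Keep letters (L*), marks (M*), numbers (N*) and whitespace; drop the rest."""
    return "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch)[0] in "LMN"
    )

a = "「私は、アメリカに行きました。」、'I went to the United States.'"
print(strip_symbols(a))
# 私はアメリカに行きましたI went to the United States
```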
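You need to pass the pattern (RegExp) to replaceAll. Because replaceAll removes whatever the pattern matches, the pattern has to match the characters you want to drop, so your character list goes into a negated class:
// Creating the regEx/Pattern: [^...] matches every character that is NOT in your list
// (an ASCII space is added to the list so the English words stay separated, as in b)
var reg = RegExp(r'[^\u3040-\u309F\u3000\u30A1-\u30FC\u4E00-\u9FFFa-zA-Z々〇〻 ]');
// Applying it to your text.
text=text.replaceAll(reg,"");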
You can learn more about it here:
https://api.flutter.dev/flutter/dart-core/String/replaceAll.html

python 3 regex string matching ignore whitespace and string.punctuation

I am new to regex and would like to know how to pattern match two strings. The use case would be something like finding a certain phrase in some text. I'm using python 3.7 if that makes a difference.
phrase = "some phrase" #the phrase I'm searching for
Possible matches:
text = "some##$#phrase"
^^^^ #non-alphanumeric can be treated like a single space
text = "some phrase"
text = "!!!some!!! phrase!!!"
These are not matches:
text = "some phrases"
^ #the 's' on the end makes it false
text = "ssome phrase"
text = "some other phrase"
I have tried using something like:
re.search(r'\b'+phrase+'\b', text)
I would very much appreciate an explanation of why the regex works if you provide a valid solution.
You should use something like this:
re.search(r'\bsome\W+phrase\b', text)
'\W' means non-word character
'+' means one or more times
In case you have a given phrase in a variable, you could try this before:
some_phrase = some_phrase.replace(r' ', r'\W+')
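Two details are worth spelling out. First, the attempt in the question fails partly because the closing '\b' is not a raw string, so Python reads it as a backspace character rather than a word boundary. Second, if the phrase comes from a variable, it is safer to escape each word before joining with \W+. A sketch, with a made-up helper name:

```python
import re

def contains_phrase(phrase, text):
    """Hypothetical helper: words of `phrase` separated by one or more non-word chars."""
    words = [re.escape(w) for w in phrase.split()]
    # \b at both ends keeps "ssome phrase" and "some phrases" from matching.
    pattern = r"\b" + r"\W+".join(words) + r"\b"
    return re.search(pattern, text) is not None

for text in ["some##$#phrase", "some phrase", "!!!some!!! phrase!!!"]:
    print(text, contains_phrase("some phrase", text))   # all True

for text in ["some phrases", "ssome phrase", "some other phrase"]:
    print(text, contains_phrase("some phrase", text))   # all False
```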

Using Regex is there a way to match outside characters in a string and exclude the inside characters?

I know I can exclude outside characters in a string using look-ahead and look-behind, but I'm not sure about characters in the center.
What I want is to get a match of ABCDEF from the string ABC 123 DEF.
Is this possible with a Regex string? If not, can it be accomplished another way?
EDIT
For more clarification, in the example above I can use the regex string /ABC.*?DEF/ to sort of get what I want, but this includes everything matched by .*?. What I want is to match with something like ABC(match whatever, but then throw it out)DEF resulting in one single match of ABCDEF.
As another example, I can do the following (in pseudo-code and regex):
string myStr = "ABC 123 DEF";
string tempMatch = RegexMatch(myStr, "(?<=ABC).*?(?=DEF)"); //Returns " 123 "
string FinalString = myStr.Replace(tempMatch, ""); //Returns "ABCDEF". This is what I want
Again, is there a way to do this with a single regex string?
Since the regex replace feature in most languages does not change the string it operates on (but produces a new one), you can do it as a one-liner in most languages. Firstly, you match everything, capturing the desired parts:
^.*(ABC).*(DEF).*$
(Make sure to use the single-line/"dotall" option if your input contains line breaks!)
And then you replace this with:
$1$2
That will give you ABCDEF in one assignment.
Still, as outlined in the comments and in Mark's answer, the engine does match the stuff in between ABC and DEF. It's only the replacement convenience function that throws it out. But that is supported in pretty much every language, I would say.
Important: this approach will of course only work if your input string contains the desired pattern only once (assuming ABC and DEF are actually variable).
Example implementation in PHP:
$output = preg_replace('/^.*(ABC).*(DEF).*$/s', '$1$2', $input);
Or JavaScript (which had no single-line mode until ES2018 added the s flag, hence [\s\S] instead of .):
var output = input.replace(/^[\s\S]*(ABC)[\s\S]*(DEF)[\s\S]*$/, '$1$2');
Or C#:
string output = Regex.Replace(input, @"^.*(ABC).*(DEF).*$", "$1$2", RegexOptions.Singleline);
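A Python version of the same replacement, as a sketch; re.DOTALL plays the role of the single-line option and the input string is invented:

```python
import re

s = "noise ABC 123\n456 DEF more noise"

# Match the whole input, keep only the two captured parts.
output = re.sub(r"^.*(ABC).*(DEF).*$", r"\1\2", s, flags=re.DOTALL)
print(output)  # ABCDEF
```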
A regular expression can contain multiple capturing groups. Each group must consist of consecutive characters so it's not possible to have a single group that captures what you want, but the groups themselves do not have to be contiguous so you can combine multiple groups to get your desired result.
Regular expression
(ABC).*(DEF)
Captures
ABC
DEF
See it online: rubular
Example C# code
string myStr = "ABC 123 DEF";
Match m = Regex.Match(myStr, "(ABC).*(DEF)");
if (m.Success)
{
string result = m.Groups[1].Value + m.Groups[2].Value; // Gives "ABCDEF"
// ...
}
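The same group-concatenation approach in Python, as a short sketch:

```python
import re

my_str = "ABC 123 DEF"
m = re.search(r"(ABC).*(DEF)", my_str)

# Join the captured groups; the text matched by .* is simply not used.
result = "".join(m.groups()) if m else None
print(result)  # ABCDEF
```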

Regex: match whole line except first string and #comment lines

I tried (\s|\t).*[\b\w*\s\b]; this one is almost OK, but I also want to exclude lines starting with #.
#Name Type Allowable values
#========================== ========= ========================================
_absolute-path-base-uri String -
add-xml-decl Boolean y/n, yes/no, t/f, true/false, 1/0
As @anubhava said in his answer, it looks like you just need to check for # at the beginning of the line. The regex for that is simple, but the mechanics of applying the regex vary wildly, so it would help if we knew which regex flavor/tool you're using (e.g. PHP, .NET, Notepad++, EditPad Pro, etc.). Here's a JavaScript version:
/^[^#].*$/mg
Notice the modifiers: m ("multiline") allows ^ and $ to match at line boundaries, and g ("global") allows you to find all the matches, not just the first one.
Now let's look at your regex. [\b\w*\s\b] is a character class that matches a word character (\w), a whitespace character (\s), an asterisk (*), or a backspace (\b). In other words, both * and \b lose their special meanings when they appear in a character class.
\s matches any whitespace character including \t, so (\s|\t) is needlessly redundant, and may not be needed at all. What it's actually doing in your case is matching the newline before each matched line. There's no need for that when you can use ^ in multiline mode. If you want to allow for horizontal whitespace (i.e., spaces and tabs) before the #, you can do this:
/^(?![ \t]*#).*$/mg
(?![ \t]*#) is a negative lookahead; it means "from this position, it is impossible to match zero or more tabs or spaces followed by #". Coming right after the ^ line anchor as it does, "this position" means the beginning of a line.
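In Python the same pattern can be tried like this (a sketch; re.MULTILINE corresponds to the m modifier, findall to g, and the sample text is shortened and partly invented):

```python
import re

text = (
    "#Name Type\n"
    "#==== ====\n"
    "_absolute-path-base-uri String -\n"
    "  # indented comment\n"
    "add-xml-decl Boolean y/n"
)

# Lines that do not start with optional spaces/tabs followed by '#'.
non_comment = re.findall(r"^(?![ \t]*#).*$", text, flags=re.MULTILINE)
print(non_comment)
# ['_absolute-path-base-uri String -', 'add-xml-decl Boolean y/n']
```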
Try this:
^[A-Za-z0-9_-]+\s+(.+)$
Assuming your first string will consist of only letters, numbers, underscores or hyphens, the first part will match that. Then we match whitespace, and then capture the rest. However, this is all dependent on the regular expression engine being used. Is this using language support for regexes, a specific editor, or a certain library? Which one? There isn't a standard: each regex engine works slightly differently.
Try this:
^[^#].*?(\s|\t)(?<Group>.*)$
After a match is found, the Group group will contain your string.
I would use this regex. In English, this says "First character is not a pound sign (#), then non-white space to match the first 'word', then white space, then match the rest of the line."
^[^#]\S*\s+(.+)$
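A quick way to try this pattern in Python is to apply it one line at a time, which avoids \s+ accidentally running across a line break. This is only a sketch: re.match already anchors at the start, so the leading ^ is dropped, and the sample lines are shortened.

```python
import re

LINE = re.compile(r"[^#]\S*\s+(.+)$")

lines = [
    "#Name Type Allowable values",
    "_absolute-path-base-uri String -",
    "add-xml-decl Boolean y/n, yes/no",
]

# Group 1 is everything after the first word; comment lines do not match.
rest = [m.group(1) for m in map(LINE.match, lines) if m]
print(rest)  # ['String -', 'Boolean y/n, yes/no']
```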
Can I suggest another approach though? It looks like there are tabs between each field in the text, so why not just read the text line-by-line and split by tab into an array?
Here is an example in C# (untested):
using(StreamReader sr = new StreamReader("C:\\Path\\to\\file.txt"))
{
    string line;
    // ReadLine() returns null at end of file, so every line (including the last) is processed
    while((line = sr.ReadLine()) != null)
    {
        //skip the comment lines
        if(line.StartsWith("#"))
            continue;
        string[] fields = line.Split(new string[] {"\t"}, StringSplitOptions.RemoveEmptyEntries);
        //now fields[0] contains the Name field
        //fields[1] contains the Type field
        //fields[2] contains the Allowable Values field
    }
}
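The same line-by-line idea as a Python sketch. It assumes, like the answer above, that the fields really are tab-separated; the sample text is a shortened stand-in for the file contents.

```python
text = (
    "#Name\tType\tAllowable values\n"
    "_absolute-path-base-uri\tString\t-\n"
    "add-xml-decl\tBoolean\ty/n, yes/no"
)

rows = [
    [field for field in line.split("\t") if field]  # drop empty fields
    for line in text.splitlines()
    if not line.startswith("#")                     # skip the comment lines
]

for name, type_, allowed in rows:
    print(name, "|", type_, "|", allowed)
```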
Try this code in php:
<?php
$s="#Name Type Allowable values
#========================== ========= ========================================
_absolute-path-base-uri String -
add-xml-decl Boolean y/n, yes/no, t/f, true/false, 1/0 ";
$a = explode("\n", $s);
foreach($a as $str) {
preg_match('~^[^#].*$~', $str, $m);
var_dump($m);
}
?>
OUTPUT
array(0) {
}
array(0) {
}
array(1) {
[0]=>
string(79) "_absolute-path-base-uri String - "
}
array(1) {
[0]=>
string(77) "add-xml-decl Boolean y/n, yes/no, t/f, true/false, 1/0 "
}
The code is pretty simple: the pattern refuses to match when # is the first character of a line, so those lines are ignored completely.