I want to remove symbols from a string in dart - regex

I want to remove every symbol from a string and keep only the characters (Japanese hiragana, kanji, and the Roman alphabet) that match this regex.
var reg = RegExp(
r'([\u3040-\u309F]|\u3000|[\u30A1-\u30FC]|[\u4E00-\u9FFF]|[a-zA-Z]|[々〇〻])');
I don't know what to put in place of the "?" below.
text=text.replaceAll(?,"");
a="「私は、アメリカに行きました。」、'I went to the United States.'"
b="私はアメリカに行きましたI went to the United States"
I want to turn a into b.

You can use
String a = "「私は、アメリカに行きました。」、'I went to the United States.'";
a = a.replaceAll(RegExp(r'[^\p{L}\p{M}\p{N}\s]+', unicode: true), '');
Also, if you just want to remove any punctuation or math symbols, you can use
.replaceAll(RegExp(r'[\p{P}\p{S}]+', unicode: true), '')
Output:
私はアメリカに行きましたI went to the United States
The [^\p{L}\p{M}\p{N}\s]+ regex matches one or more chars other than letters (\p{L}), diacritics (\p{M}), numbers (\p{N}) and whitespace chars (\s).
The [\p{P}\p{S}]+ regex matches one or more punctuation (\p{P}) or symbol (\p{S}) chars; \p{S} covers math, currency and other symbols.
The unicode: true enables the Unicode property class support in the regex.
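For comparison, the same two patterns can be tried in JavaScript, whose regex engine also supports Unicode property escapes when the u flag is set. This is only a minimal sketch; the function names keepLettersNumbersSpaces and stripPunctuationAndSymbols are made up for illustration.

```javascript
// Keep only letters, combining marks, numbers and whitespace.
function keepLettersNumbersSpaces(text) {
  return text.replace(/[^\p{L}\p{M}\p{N}\s]+/gu, "");
}

// Remove only punctuation (\p{P}) and symbols (\p{S}).
function stripPunctuationAndSymbols(text) {
  return text.replace(/[\p{P}\p{S}]+/gu, "");
}

const question = "「私は、アメリカに行きました。」、'I went to the United States.'";
console.log(keepLettersNumbersSpaces(question));
console.log(stripPunctuationAndSymbols(question));
```

For this particular input both functions give the same result, because the string contains nothing but letters, spaces and punctuation.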

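You need to pass the pattern (RegExp) you want to apply to the replaceAll method. Note that the pattern in the question matches the characters you want to keep, so passing it unchanged would delete exactly those characters; merge the ranges into one negated character class instead.
// Creating the RegExp: a negated class matches every character NOT in the allowed ranges.
// The ASCII space inside the class is an assumption, added so the spaces in the English sentence survive.
var reg = RegExp(r'[^\u3040-\u309F\u3000\u30A1-\u30FC\u4E00-\u9FFFa-zA-Z 々〇〻]');
// Applying it to your text.
text = text.replaceAll(reg, "");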
You can learn more about it here:
https://api.flutter.dev/flutter/dart-core/String/replaceAll.html
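To see what an explicit negated character class does with the ranges from the question, here is a small JavaScript sketch. It assumes that an ASCII space should be kept as well (the question's regex does not list it), and keepAllowed is a hypothetical helper name.

```javascript
// Negated character class built from the ranges in the question.
// Assumption: an ASCII space is added to the class so English words stay separated.
const notAllowed = /[^\u3040-\u309F\u3000\u30A1-\u30FC\u4E00-\u9FFFa-zA-Z 々〇〻]/g;

function keepAllowed(text) {
  // Every character outside the allowed ranges is replaced with nothing.
  return text.replace(notAllowed, "");
}

console.log(keepAllowed("「私は、アメリカに行きました。」、'I went to the United States.'"));
```

Unlike the \p{L}-based pattern, this version also drops digits and any script not listed in the class.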

Related

Powershell find non printing characters [duplicate]

I would appreciate your help with this, since I do not know which range of characters to use, or whether there is a character class like the [[:cntrl:]] that I have found in Ruby.
By non-printable, I mean that I want to delete all characters that are not shown in the output when one prints the input string. Please note that I am looking for a C# regex; I do not have a problem with my code.
You may remove all control and other non-printable characters with
s = Regex.Replace(s, @"\p{C}+", string.Empty);
The \p{C} Unicode category class matches all control characters, even those outside the ASCII table because in .NET, Unicode category classes are Unicode-aware by default.
Breaking it down into subcategories
To only match basic control characters you may use \p{Cc}+, see 65 chars in the Other, Control Unicode category. It is equal to a [\u0000-\u001F\u007F-\u009F]+ regex.
To only match 161 other format chars including the well-known soft hyphen (\u00AD), zero-width space (\u200B), zero-width non-joiner (\u200C), zero-width joiner (\u200D), left-to-right mark (\u200E) and right-to-left mark (\u200F) use \p{Cf}+. The equivalent including astral plane code points is a (?:[\xAD\u0600-\u0605\u061C\u06DD\u070F\u08E2\u180E\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]|\uD804[\uDCBD\uDCCD]|\uD80D[\uDC30-\uDC38]|\uD82F[\uDCA0-\uDCA3]|\uD834[\uDD73-\uDD7A]|\uDB40[\uDC01\uDC20-\uDC7F])+ regex.
To match 137,468 Other, Private Use code points you may use \p{Co}+, or its equivalent including astral plane code points, (?:[\uE000-\uF8FF]|[\uDB80-\uDBBE\uDBC0-\uDBFE][\uDC00-\uDFFF]|[\uDBBF\uDBFF][\uDC00-\uDFFD])+.
To match 2,048 Other, Surrogate code points, which are used in pairs to encode astral plane characters such as some emojis, you may use \p{Cs}+, or [\uD800-\uDFFF]+ regex.
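The difference between the subcategories can be checked with a short JavaScript sketch. One assumption to keep in mind: with the u flag JavaScript works on whole code points, so an emoji stored as a surrogate pair counts as one symbol and is not matched by \p{C}, whereas the .NET behaviour described above works on UTF-16 code units. The function names are illustrative only.

```javascript
// Each helper strips one "Other" category using Unicode property escapes.
function stripControl(text) {
  return text.replace(/\p{Cc}+/gu, ""); // control characters only
}
function stripFormat(text) {
  return text.replace(/\p{Cf}+/gu, ""); // format characters only
}
function stripAllOther(text) {
  return text.replace(/\p{C}+/gu, ""); // every "Other" subcategory
}

// BEL (Cc), zero-width space (Cf), a private-use char (Co) and an emoji.
const sample = "a\u0007b\u200Bc\uE000d😎";
for (const f of [stripControl, stripFormat, stripAllOther]) {
  // Print the remaining code points in hex to make invisible chars visible.
  console.log(f.name, [...f(sample)].map((c) => c.codePointAt(0).toString(16)));
}
```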
You can try with :
string s = "Täkörgåsmrgås";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);
Updated answer after comments:
Documentation about non-printable character:
https://en.wikipedia.org/wiki/Control_character
Char.IsControl Method:
https://msdn.microsoft.com/en-us/library/system.char.iscontrol.aspx
Maybe you can try:
string input; // this is your input string
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());
To remove all control and other non-printable characters:
Regex.Replace(s, @"\p{C}+", String.Empty);
To remove the control characters only (if you don't want to remove the emojis 😎):
Regex.Replace(s, @"\p{Cc}+", String.Empty);
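You can try this extension method, which keeps only printable ASCII (space through ~). It has to be declared inside a static class:
public static string TrimNonAscii(this string value)
{
    // [^ -~] matches any character outside the printable ASCII range
    string pattern = "[^ -~]+";
    Regex reg_exp = new Regex(pattern);
    return reg_exp.Replace(value, "");
}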
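The printable-ASCII range trick is not specific to C#. As a rough JavaScript sketch (trimNonAscii is just an illustrative name), the same character class behaves like this:

```javascript
// Keep only printable ASCII: everything from space (0x20) to tilde (0x7E).
function trimNonAscii(value) {
  return value.replace(/[^ -~]+/g, "");
}

console.log(trimNonAscii("Täkörgåsmrgås"));
```

Note that this also removes tabs, line breaks and every accented letter, so it is much more aggressive than \p{C}.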

How can I use RegEx to remove certain words from a string

I need to clean some cells and only keep important words to generate a search index.
Eg. "How to make an account recovery request" would be trimmed to "Make Account Recovery Request" because "How, To, An" would be in a list of words to be filtered out.
The other complexity is that it will also be in French and Spanish, which means that I have to deal with part-word like d'.
So far I've been trying to use a function but it doesn't work with part-word (d') and if "de" and "des" are listed in the same cell, it will remove DE from DES and then only keep the lonely S because DES is not recognized anymore:
Function ClearWords(s As String, rWords As Range) As String
    Static RX As Object
    If RX Is Nothing Then
        Set RX = CreateObject("VBScript.RegExp")
        RX.Global = True
        RX.IgnoreCase = True
    End If
    RX.Pattern = "\b" & Replace(Join(Application.Transpose(rWords), "|"), ".", "\.") & "\b"
    ClearWords = Application.Trim(RX.Replace(s, ""))
End Function
If you plan to support English, French, and other European languages, you may leverage the regex I posted at Regular expression not working for at least one European character, (?![×÷])[A-Za-zÀ-ÿ]. This is a pattern that is supposed to match all the alphabetic chars you need to support. Since you are going to use it in VBA, it makes sense to replace literal extended letters with \uXXXX entities, and convert it to a single character class, [A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF] ([A-Za-zÀ-ÖØ-öø-ÿ] with literal chars).
Now, you need to build the custom boundaries. The initial boundary is either start of the string, ^, or any char other than the letters above (and possibly digits, and _, if you want to fully emulate \b). Since you want to replace, you need to put these two patterns into a (^|[^A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]) capturing group and use $1 in the replacement pattern to restore the value in order not to lose it. The trailing boundary is any char other than the letters above (or digits / _) and end of the string. Since VBA regex supports lookaheads, we may just use a negative lookahead, (?![A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]).
Putting it together:
RX.Pattern = "(^|[^A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF])(?:" & Replace(Join(Application.Transpose(rWords), "|"), ".", "\.") & ")(?![A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF])"
ClearWords = Application.Trim(RX.Replace(s, "$1"))
See this regex demo.
To also remove spaces before, replace "(^|[^A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF])(?:" with (?:\s+|(^|[^A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]))(?:. See this regex demo.
Bonus: you seem to need to escape the words to use them in a regex:
Dim regExEscape As New RegExp
With regExEscape
    .pattern = "[-/\\^$*+?.()|[\]{}]"
    .Global = True
    .MultiLine = False
End With
Just make sure you process all words you have instead of Replace(Join(Application.Transpose(rWords), "|"), ".", "\.").
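The custom-boundary idea and the escaping step can be put together in a small JavaScript sketch. It is not the VBA solution itself, only an illustration of the same pattern; clearWords, escapeRegExp and LETTERS are hypothetical names, and the final whitespace cleanup stands in for what Application.Trim does.

```javascript
// Letters treated as "word" characters, as in the character class above.
const LETTERS = "A-Za-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u00FF";

// Escape regex metacharacters so each stop word is matched literally.
function escapeRegExp(word) {
  return word.replace(/[-\/\\^$*+?.()|[\]{}]/g, "\\$&");
}

function clearWords(text, words) {
  const alternation = words.map(escapeRegExp).join("|");
  // Leading boundary is captured; trailing boundary is a negative lookahead.
  const pattern = new RegExp(
    "(^|[^" + LETTERS + "])(?:" + alternation + ")(?![" + LETTERS + "])",
    "gi"
  );
  // $1 restores the leading boundary char; then collapse leftover spaces.
  return text.replace(pattern, "$1").replace(/\s+/g, " ").trim();
}

console.log(clearWords("How to make an account recovery request", ["how", "to", "an"]));
console.log(clearWords("Liste des prix de vente", ["de", "des"]));
```

Because the trailing lookahead rejects a following letter, "de" can no longer eat the first two letters of "des": the engine backtracks and tries the longer alternative instead.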

Regex replacing special characters in a string

I have numerical values that contain special characters, and I would like to replace those special characters with "x".
I already tried [^\w*], but it only works when there is one special character.
When there is more than one, as in 1234?12?, it won't capture the second special character. What am I doing wrong?
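Here is something you could use. It removes all non-numeric characters; pass "x" instead of '' as the replacement if you want to substitute them rather than delete them. Good luck!
var str = "rt5121212?232?2*dse%e&323";
// [^0-9] matches any single non-digit; the g flag replaces every occurrence
var pattern = /[^0-9]/g;
var sanitized = str.replace(pattern, '');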
console.log(sanitized);
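Since the question asks for an "x" in place of each special character, a minimal sketch of that variant follows. It assumes "special" means "anything that is not a digit", and maskNonDigits is an illustrative name.

```javascript
// Replace every non-digit with "x"; the g flag makes replace() global.
function maskNonDigits(value) {
  return value.replace(/[^0-9]/g, "x");
}

console.log(maskNonDigits("1234?12?"));
// Without the g flag only the first match is replaced:
console.log("1234?12?".replace(/[^0-9]/, "x"));
```

The second call shows the likely cause of the original problem: a missing global flag, not the character class.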

Regex: match whole line except first string and #comment lines

I tried (\s|\t).*[\b\w*\s\b]; this one is almost OK, but I also want to exclude lines starting with #.
#Name Type Allowable values
#========================== ========= ========================================
_absolute-path-base-uri String -
add-xml-decl Boolean y/n, yes/no, t/f, true/false, 1/0
As @anubhava said in his answer, it looks like you just need to check for # at the beginning of the line. The regex for that is simple, but the mechanics of applying the regex vary wildly, so it would help if we knew which regex flavor/tool you're using (e.g. PHP, .NET, Notepad++, EditPad Pro, etc.). Here's a JavaScript version:
/^[^#].*$/mg
Notice the modifiers: m ("multiline") allows ^ and $ to match at line boundaries, and g ("global") allows you to find all the matches, not just the first one.
Now let's look at your regex. [\b\w*\s\b] is a character class that matches a word character (\w), a whitespace character (\s), an asterisk (*), or a backspace (\b). In other words, both * and \b lose their special meanings when they appear in a character class.
\s matches any whitespace character including \t, so (\s|\t) is needlessly redundant, and may not be needed at all. What it's actually doing in your case is matching the newline before each matched line. There's no need for that when you can use ^ in multiline mode. If you want to allow for horizontal whitespace (i.e., spaces and tabs) before the #, you can do this:
/^(?![ \t]*#).*$/mg
(?![ \t]*#) is a negative lookahead; it means "from this position, it is impossible to match zero or more tabs or spaces followed by #". Coming right after the ^ line anchor as it does, "this position" means the beginning of a line.
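Applied to a multi-line string, the lookahead version might be used like this. It is only a sketch: nonCommentLines is a made-up helper, and the indented comment line is added to the sample to exercise the [ \t]* part.

```javascript
// Return every line that does not start with optional spaces/tabs plus "#".
function nonCommentLines(input) {
  return input.match(/^(?![ \t]*#).*$/mg) || [];
}

const sampleConfig = [
  "#Name Type Allowable values",
  "  # indented comment",
  "_absolute-path-base-uri String -",
  "add-xml-decl Boolean y/n, yes/no, t/f, true/false, 1/0",
].join("\n");

console.log(nonCommentLines(sampleConfig));
```

Because .* can match nothing, empty lines in the input would show up as empty strings in the result; filter them out if that matters.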
Try this:
^[A-Za-z0-9_-]+\s+(.+)$
Assuming your first string will consist of only letters, numbers, underscores or hyphens, the first part will match that. Then we match whitespace, and then capture the rest. However, this is all dependent on the regular expression engine being used. Is this using language support for regexes, a specific editor, or a certain library? Which one? There isn't a standard: each regex engine works slightly differently.
Try this:
^[^#].*?(\s|\t)(?<Group>.*)$
After a match is found, the Group group will contain your string.
I would use this regex. In English, this says: "First character is not a pound sign (#), then non-whitespace to match the first 'word', then whitespace, then capture the rest of the line."
^[^#]\S*\s+(.+)$
Can I suggest another approach though? It looks like there are tabs between each field in the text, so why not just read the text line-by-line and split by tab into an array?
Here is an example in C# (untested):
using (StreamReader sr = new StreamReader("C:\\Path\\to\\file.txt"))
{
    string line;
    // ReadLine returns null at end of file, so the last line is processed too
    while ((line = sr.ReadLine()) != null)
    {
        // skip the comment lines
        if (line.StartsWith("#"))
            continue;
        string[] fields = line.Split(new string[] {"\t"}, StringSplitOptions.RemoveEmptyEntries);
        // now fields[0] contains the Name field
        // fields[1] contains the Type field
        // fields[2] contains the Allowable Values field
    }
}
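The same split-by-tab idea, sketched in JavaScript for an in-memory string rather than a file. It assumes the fields really are tab-separated, which the sample only suggests; parseFields is a hypothetical name.

```javascript
// Split text into lines, drop comments and blanks, then split each line on tabs.
function parseFields(input) {
  return input
    .split(/\r?\n/)
    .filter((line) => line.length > 0 && !line.startsWith("#"))
    .map((line) => line.split("\t").filter((field) => field.length > 0));
}

const rows = parseFields(
  "#Name\tType\tAllowable values\n" +
  "_absolute-path-base-uri\tString\t-\n" +
  "add-xml-decl\tBoolean\ty/n, yes/no, t/f, true/false, 1/0"
);
// Each row is [name, type, allowableValues].
console.log(rows);
```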
Try this code in php:
<?php
$s="#Name Type Allowable values
#========================== ========= ========================================
_absolute-path-base-uri String -
add-xml-decl Boolean y/n, yes/no, t/f, true/false, 1/0 ";
$a = explode("\n", $s);
foreach($a as $str) {
preg_match('~^[^#].*$~', $str, $m);
var_dump($m);
}
?>
OUTPUT
array(0) {
}
array(0) {
}
array(1) {
[0]=>
string(79) "_absolute-path-base-uri String - "
}
array(1) {
[0]=>
string(77) "add-xml-decl Boolean y/n, yes/no, t/f, true/false, 1/0 "
}
The code is pretty simple: the pattern refuses to match a # at the start of a line, thus ignoring those lines completely.

How to unpunctuate, lowercase, de-space and hyphenate a string with regex?

If I have a string like this
Newsflash: The Big(!) Brown Dog's Brother (T.J.) Ate The Small Blue Egg
how would I convert that into the following using regex:
newsflash-the-big-brown-dogs-brother-tj-ate-the-small-blue-egg
In other words, punctuation is discarded and spaces are replaced with hyphens.
It sounds like you want to create a "URL slug" -- a URL-friendly version of an article's title, for example. That means you'll want to make sure you remove all possible non-URL-friendly characters, not just a few. You might do it this way (in order):
Remove all non-letter non-number non-space characters by:
Replacing regex [^A-Za-z0-9 ] with the empty string "".
Replace all spaces with a dash by:
Replacing regex \s+ with the string "-".
Lower-case the string by:
Java s = s.toLowerCase();
JavaScript s = s.toLowerCase();
C# s = s.ToLower();
Perl $s = lc($s);
Python s = s.lower()
PHP $s = strtolower($s);
Ruby s = s.downcase
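The three steps above can be chained in a few lines of JavaScript. This is a minimal sketch of that recipe, with slugify as an illustrative name; it does not handle accented letters, which the first step simply deletes.

```javascript
function slugify(title) {
  return title
    .replace(/[^A-Za-z0-9 ]/g, "") // 1. drop everything but letters, digits, spaces
    .replace(/\s+/g, "-")          // 2. turn each run of whitespace into one dash
    .toLowerCase();                // 3. lower-case the result
}

console.log(
  slugify("Newsflash: The Big(!) Brown Dog's Brother (T.J.) Ate The Small Blue Egg")
);
```

Leading or trailing spaces in the title would become leading or trailing dashes, so trimming the input first may be worthwhile.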
Replace the regex [\s-]+ with "-", then replace [^\w-] with "".
Then, call ToLowerCase or equivalent.
In Javascript:
var s = "Newsflash: The Big(!) Brown Dog's Brother (T.J.) Ate The Small Blue Egg";
alert(s.replace(/[\s-]+/g, '-').replace(/[^\w-]/g, '').toLowerCase());
Replace /\W+/ with '-', that will replace all non-word characters with a dash.
Then, collapse dashes by replacing /-+/ with '-'.
Then, lowercase the string - pure regex solutions cannot do that. You didn't say which language you are using, so I cannot give you an example, but your language might have String.toLowercase() or a tr/// call (tr/A-Z/a-z/, for example, in Perl).