Flutter Dart: How to extract a number from a string using RegEx - regex

I want to extract a number (integer, decimal or 12:30 formats) from a string. I have used the following RegEx but to no avail:
final RegExp numberExp = new RegExp(
"[a-zA-Z ]*\\d+.*",
caseSensitive: false,
multiLine: false
);
final RegExp numberExp = new RegExp(
"/[+-]?\d+(?:\.\d+)?/g",
caseSensitive: false,
multiLine: false
);
String result = value.trim();
result = numberExp.stringMatch (result);
result = result.replaceAll("[^0-9]", "");
result = result.replaceAll("[^a-zA-Z]", "");
So far, nothing works perfectly.
Any help appreciated.

const text = '''
Lorem Ipsum is simply dummy text of the 123.456 printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an 12:30 unknown printer took a galley of type and scrambled it to make a
23.4567
type specimen book. It has 445566 survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
''';
final intRegex = RegExp(r'\s+(\d+)\s+', multiLine: true);
final doubleRegex = RegExp(r'\s+(\d+\.\d+)\s+', multiLine: true);
final timeRegex = RegExp(r'\s+(\d{1,2}:\d{2})\s+', multiLine: true);
void main() {
// group(1) is the captured number; group(0) would also include the surrounding whitespace.
print(intRegex.allMatches(text).map((m) => m.group(1)));
print(doubleRegex.allMatches(text).map((m) => m.group(1)));
print(timeRegex.allMatches(text).map((m) => m.group(1)));
}
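For comparison, here is a minimal sketch of a single-pattern approach, written in Python rather than Dart. The name extract_numbers and the shortened sample string are made up for illustration. It assumes the time form should be tried before plain numbers so that 12:30 stays in one piece. The same alternation should carry over to a Dart RegExp, but that is an assumption and not something verified here.

```python
import re

# The time alternative comes first so "12:30" is not split into 12 and 30.
# Caveat: the optional sign means "2020-06" would yield "2020" and "-06".
NUMBER_RE = re.compile(r"\d{1,2}:\d{2}|[+-]?\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return every time, decimal, or integer token found in text, in order."""
    return NUMBER_RE.findall(text)


sample = "of the 123.456 printing, since the 1500s, when an 12:30 unknown, has 445566 survived"
found = extract_numbers(sample)
print(found)
```

Unlike the \s+(...)\s+ patterns above, this does not require whitespace around the number, so it also picks up the 1500 in "1500s".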

For one-line strings you can simply use:
final intValue = int.parse(stringValue.replaceAll(RegExp('[^0-9]'), ''));

That's how I solved my problem:
bool isNumber(String item){
return '0123456789'.split('').contains(item);
}
List<String> numbers = ['1','a','2','b','3','c','4','d','5','e','6','f','7','g','8','h','9','i','0'];
print(numbers);
numbers.removeWhere((item) => !isNumber(item));
print(numbers);
And here's the output:
[1, a, 2, b, 3, c, 4, d, 5, e, 6, f, 7, g, 8, h, 9, i, 0]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 0]

Try this to detect phone numbers that start with a + and a country code in a multiline string:
\b[+][(]{0,1}[6-9]{1,4}[)]{0,1}[-\s.0-9]\b

Related

Dart - Extract number from list

I have a List that contains lottery numbers
How can I separate the numbers from the text and then put the numbers into a list? For example, given:
List<String> n = ["ABC23", "21A23", "12A312","32141A"];
Thank you!
You can do it by using forEach like below:
List<String> n = ["23", "2123", "12312","32141"];
n.forEach((element) =>
print(element)
);
And to separate number and text you can use the following code:
const text = '''
Lorem Ipsum is simply dummy text of the 123.456 printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an 12:30 unknown printer took a galley of type and scrambled it to make a
23.4567
type specimen book. It has 445566 survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
''';
final intRegex = RegExp(r'\s+(\d+)\s+', multiLine: true);
final doubleRegex = RegExp(r'\s+(\d+\.\d+)\s+', multiLine: true);
final timeRegex = RegExp(r'\s+(\d{1,2}:\d{2})\s+', multiLine: true);
void main() {
// group(1) is the captured number; group(0) would also include the surrounding whitespace.
print(intRegex.allMatches(text).map((m) => m.group(1)));
print(doubleRegex.allMatches(text).map((m) => m.group(1)));
print(timeRegex.allMatches(text).map((m) => m.group(1)));
}
Please refer to this link for more information.
To do something with each element:
List<String> n = ["23", "2123", "12312","32141"];
n.forEach((element) {
int number = int.parse(element);
...
}
);
To create a list of ints:
List<String> n = ["23", "2123", "12312","32141"];
final intList = n.map((element) => int.parse(element)).toList();
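The answers above start from a list that already contains only digits. As a rough sketch of the missing step (getting from "ABC23" to "23"), here is one way to split each item into its digits and its letters. It is written in Python rather than Dart, and the helper name split_digits_letters is hypothetical.

```python
import re


def split_digits_letters(item):
    """Return (digits, letters) for one item, each keeping its original order."""
    digits = re.sub(r"\D", "", item)            # drop everything that is not a digit
    letters = re.sub(r"[^A-Za-z]", "", item)    # drop everything that is not an ASCII letter
    return digits, letters


n = ["ABC23", "21A23", "12A312", "32141A"]
pairs = [split_digits_letters(item) for item in n]
numbers = [int(digits) for digits, _ in pairs]
print(pairs)
print(numbers)
```

In Dart the same idea would presumably be replaceAll with a RegExp, as in the one-line answer in the first thread.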

Dynamic String Masking in scala

Is there a simple way to do data masking in Scala? I want to dynamically replace text that matches a set of patterns with X, keeping the length of each matched keyword.
Example:
patterns to mask:
Narendra\s*Modi
Trump
JUN-\d\d
Input:
Narendra Modi pm of india 2020-JUN-03
Donald Trump president of USA
Output:
XXXXXXXX XXXX pm of india 2020-XXX-XX
Donald XXXXX president of USA
Note: Only characters should be masked; I want to retain the space or hyphen in the output for matching patterns.
So you have an input String:
val input =
"Narendra Modi of India, 2020-JUN-03, Donald Trump of USA."
Masking off a given target with a given length is trivial.
input.replaceAllLiterally("abc", "XXX")
If you have many such targets of different lengths then it becomes more interesting.
"India|USA".r.replaceAllIn(input, "X" * _.matched.length)
//res0: String = Narendra Modi of XXXXX, 2020-JUN-03, Donald Trump of XXX.
If you have a mix of masked characters and kept characters, multiple targets can still be grouped together, but they must have the same number of sub-groups and the same pattern of masked-group to kept-group.
In this case the pattern is (mask)(keep)(mask).
raw"(Narendra)(\s+)(Modi)|(Donald)(\s+)(Trump)|(JUN)([-/])(\d+)".r
.replaceAllIn(input,{m =>
val List(a,b,c) = m.subgroups.flatMap(Option(_))
"X"*a.length + b + "X"*c.length
})
//res1: String = XXXXXXXX XXXX of India, 2020-XXX-XX, XXXXXX XXXXX of USA.
Something like that?
val pattern = Seq("Modi", "Trump", "JUN")
val str = "Narendra Modi pm of india 2020-JUN-03 Donald Trump president of USA"
def mask(pattern: Seq[String], str: String): String = {
var s = str
for (elem <- pattern) {
s = s.replaceAll(elem,elem.toCharArray.map(s=>"X").mkString)
}
s
}
print(mask(pattern,str))
out:
Narendra XXXX pm of india 2020-XXX-03 Donald XXXXX president of USA
scala> val pattern = Seq("Narendra\\s*Modi", "Trump", "JUN-\\d\\d", "Trump", "JUN")
pattern: Seq[String] = List(Narendra\s*Modi, Trump, JUN-\d\d, Trump, JUN)
scala> print(mask(pattern,str))
XXXXXXXXXXXXXXX pm of india 2020-XXXXXXXX Donald XXXXX president of USA
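Yes, it should work; try it as shown above. Note that with regex patterns the mask has the length of the pattern string, not of the matched text.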
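To get the output the question asks for (same length, spaces and hyphens kept), one option is to mask the matched text itself and not the pattern string. Below is a minimal sketch of that idea in Python rather than Scala. The function name mask and the choice to mask only ASCII letters and digits are assumptions.

```python
import re


def mask(text, patterns):
    """Replace letters and digits inside every pattern match with 'X'."""
    combined = re.compile("|".join("(?:%s)" % p for p in patterns))

    def repl(match):
        # Mask only alphanumerics, so spaces and hyphens in the match survive.
        return re.sub(r"[A-Za-z0-9]", "X", match.group(0))

    return combined.sub(repl, text)


patterns = [r"Narendra\s*Modi", "Trump", r"JUN-\d\d"]
line1 = mask("Narendra Modi pm of india 2020-JUN-03", patterns)
line2 = mask("Donald Trump president of USA", patterns)
print(line1)
print(line2)
```

In Scala the equivalent would presumably be Regex.replaceAllIn with a function of the match, as in the first answer.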
Please find the regex and code explanation inline
import org.apache.spark.sql.functions._
object RegExMasking {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
import spark.implicits._
//Regex to fetch the word
val regEx : String = """(\s+[A-Za-z]+\s)"""
//load your Dataframe
val df = List("Narendra Modi pm of india 2020-JUN-03",
"Donald Trump president of USA ").toDF("sentence")
df.withColumn("valueToReplace",
//Fetch the 1st word from the regex parse expression
regexp_extract(col("sentence"),regEx,0)
)
.map(row => {
val sentence = row.getString(0)
//Trim for extra spaces
val valueToReplace : String = row.getString(1).trim
//Create masked string of equal length
val replaceWith = List.fill(valueToReplace.length)("X").mkString
// Return sentence , masked sentence
(sentence,sentence.replace(valueToReplace,replaceWith))
}).toDF("sentence","maskedSentence")
.show()
}
}

read csv setting fields with spaces to NA

I have a csv file that looks like this:
A, B, C,
1, 2 1, 3,
3, 1, 0,
4, 1, 0 5,
...
is it possible to set the na.string to assign all fields with space to NA (e.g. something like regex function(x){x[grep(patt="\\ ", x)]<-NA;x}), i.e.
A, B, C,
1, NA, 3,
3, 1, 0,
4, 1, NA,
We can loop over the columns and set it to NA by converting to numeric
df1[] <- lapply(df1, as.numeric)
NOTE: Here, I assumed that the columns are character class. If it is factor, do lapply(df1, function(x) as.numeric(as.character(x)))
Variation on @akrun's answer (which I like).
library(dplyr)
read.csv("test.csv", colClasses="character") %>% mutate_each(funs(as.numeric))
This reads the file assuming all columns are character, then converts all to numeric with mutate_each from dplyr.
Using colClasses="numeric" already in the read call didn't work (and I don't know why :( ), since
> as.numeric("2 1")
[1] NA
From How to read data when some numbers contain commas as thousand separator? we learn that we can make a new function to do the conversion.
setAs("character", "numwithspace", function(from) as.numeric(from) )
read.csv("test.csv", colClasses="numwithspace")
which gives
A B C
1 1 NA 3
2 3 1 0
3 4 1 NA
I don't know how this would translate to R, but I would use the following regex to match fields containing spaces:
[^, ]+ [^, ]+
That is:
some characters other than a comma or a space ([^, ]+)
followed by a space ( )
and some more characters other than a comma or a space ([^, ]+)
You can see it in action here.
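To make the rule itself explicit (a field with a space inside it becomes NA), here is a small sketch in Python rather than R. The function name read_with_na is hypothetical, None stands in for NA, and the trailing comma on each line is assumed to be noise.

```python
import csv
import io


def read_with_na(csv_text):
    """Parse CSV text; any field that still contains a space after trimming becomes None."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        cells = [cell.strip() for cell in row]
        # A trailing comma yields one empty field at the end; drop it.
        if cells and cells[-1] == "":
            cells.pop()
        rows.append([None if " " in cell else cell for cell in cells])
    return rows


sample = "A, B, C,\n1, 2 1, 3,\n3, 1, 0,\n4, 1, 0 5,\n"
table = read_with_na(sample)
for row in table:
    print(row)
```

The R answers above get the same effect implicitly, because as.numeric("2 1") is NA.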

Count strings in a file, some single words, some full sentences

I want to count the occurrence of certain words and names in a file. The code below incorrectly counts fish and chips as one case of fish and one case of chips, instead of one count of fish and chips.
ngh.txt = 'test file with words fish, steak fish chips fish and chips'
import re
from collections import Counter
wanted = '''
"fish and chips"
fish
chips
steak
'''
cnt = Counter()
words = re.findall(r'\w+', open('ngh.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)
Output:
Counter({'fish': 3, 'chips': 2, 'and': 1, 'steak': 1})
What I want is:
Counter({'fish': 2, 'fish and chips': 1, 'chips': 1, 'steak': 1})
(And ideally, I can get the output like this:
fish: 2
fish and chips: 1
chips: 1
steak: 1
)
Definition:
Wanted item: A string that is being searched for within the text.
To count wanted items without re-counting them inside longer wanted items, first count how many times each one occurs in the text. Then go through the wanted items from longest to shortest, and whenever a shorter wanted item occurs inside a longer one, subtract the longer item's (already adjusted) count from the shorter item's count. For example, assume your wanted items are "a", "a b", and "a b c", and your text is "a/a/a b/a b c". Searching for each of those individually produces: { "a": 4, "a b": 2, "a b c": 1 }. The desired result is: { "a b c": 1, "a b": #("a b") - #("a b c") = 2 - 1 = 1, "a": #("a") - #("a b c") - #("a b") = 4 - 1 - 1 = 2 }, where the #("a b") subtracted in the last step is the adjusted value 1, not the raw count 2.
import re

def get_word_counts(text, wanted):
    counts = {}  # The number of times each wanted item was read
    # Dictionary mapping item lengths onto wanted items
    # (in the form of a dictionary where keys are wanted items)
    lengths = {}
    # Find the number of times each wanted item occurs
    for item in wanted:
        matches = re.findall('\\b' + item + '\\b', text)
        counts[item] = len(matches)
        l = len(item)  # Length of wanted item
        # No wanted item of the same length has been encountered
        if l not in lengths:
            # Create new dictionary of items of the given length
            lengths[l] = {}
        # Add wanted item to dictionary of items with the given length
        lengths[l][item] = 1
    # Get and sort lengths of wanted items from largest to smallest
    keys = sorted(lengths.keys(), reverse=True)
    # Remove overlapping wanted items from the counts working from
    # largest strings to smallest strings
    for i in range(1, len(keys)):
        for j in range(0, i):
            for i_item in lengths[keys[i]]:
                for j_item in lengths[keys[j]]:
                    #print str(i)+','+str(j)+': '+i_item+' , '+j_item
                    matches = re.findall('\\b' + i_item + '\\b', j_item)
                    counts[i_item] -= len(matches) * counts[j_item]
    return counts
The following code contains test cases:
tests = [
{
'text': 'test file with words fish, steak fish chips fish and '+
'chips and fries',
'wanted': ["fish and chips","fish","chips","steak"]
},
{
'text': 'fish, fish and chips, fish and chips and burgers',
'wanted': ["fish and chips","fish","fish and chips and burgers"]
},
{
'text': 'fish, fish and chips and burgers',
'wanted': ["fish and chips","fish","fish and chips and burgers"]
},
{
'text': 'My fish and chips and burgers. My fish and chips and '+
'burgers',
'wanted': ["fish and chips","fish","fish and chips and burgers"]
},
{
'text': 'fish fish fish',
'wanted': ["fish fish","fish"]
},
{
'text': 'fish fish fish',
'wanted': ["fish fish","fish","fish fish fish"]
}
]
for i in range(0, len(tests)):
    test = tests[i]['text']
    print(test)
    print(get_word_counts(test, tests[i]['wanted']))
    print('')
The output is as follows:
test file with words fish, steak fish chips fish and chips and fries
{'fish and chips': 1, 'steak': 1, 'chips': 1, 'fish': 2}
fish, fish and chips, fish and chips and burgers
{'fish and chips': 1, 'fish and chips and burgers': 1, 'fish': 1}
fish, fish and chips and burgers
{'fish and chips': 0, 'fish and chips and burgers': 1, 'fish': 1}
My fish and chips and burgers. My fish and chips and burgers
{'fish and chips': 0, 'fish and chips and burgers': 2, 'fish': 0}
fish fish fish
{'fish fish': 1, 'fish': 1}
fish fish fish
{'fish fish fish': 1, 'fish fish': 0, 'fish': 0}
So this solution works with your test data (and with some added terms to the test data, just to be thorough), though it can probably be improved upon.
The crux of it is to find occurrences of 'and' in the words list, replace 'and' and its neighbours with a compound word (concatenating the neighbours with 'and'), and add this back to the list along with a copy of 'and'.
I also converted the 'wanted' string to a list to handle the 'fish and chips' string as a distinct item.
import re
from collections import Counter
# changed 'wanted' string to a list
wanted = ['fish and chips','fish','chips','steak', 'and']
cnt = Counter()
words = re.findall(r'\w+', open('ngh.txt').read().lower())
for word in words:
    # look for 'and', replace it and neighbours with 'comp_word'
    # slice, concatenate, and append to make new words list
    if word == 'and':
        and_pos = words.index('and')
        comp_word = str(words[and_pos-1]) + ' and ' + str(words[and_pos+1])
        words = words[:and_pos-1] + words[and_pos+2:]
        words.append(comp_word)
        words.append('and')
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)
The output from your text would be:
Counter({'fish':2, 'and':1, 'steak':1, 'chips':1, 'fish and chips':1})
As noted in the comment above, it's unclear why you want/expect output to be 2 for fish, 2 for chips, and 1 for fish-and-chips in your ideal output. I'm assuming it's a typo, since the output above it has 'chips':1
I am suggesting two algorithms that will work on any patterns and any file.
The first algorithm has run time proportional to (number of characters in the file)* number of patterns.
1> For every pattern search all the patterns and create a list of super-patterns. This can be done by matching one pattern such as 'cat' against all patterns to be searched.
patterns = ['cat', 'cat and dogs', 'cat and fish']
superpattern['cat'] = ['cat and dogs', 'cat and fish']
2> Search for 'cat' in the file, let's say result is cat_count
3> Now search for every super-pattern of 'cat' in the file and subtract their counts
for sp in superpattern['cat']:
    sp_count = number of matches of sp in the file
    cat_count = cat_count - sp_count
This is a general, brute-force solution. We should be able to come up with a linear-time solution if we arrange the patterns in a Trie.
Root-->f-->i-->s-->h-->a and so on.
Now when you are at h of the fish, and you do not get an a, increment fish_count and go to root. If you get 'a' continue. Anytime you get something un-expected, increment count of most recently found pattern and go to root or go to some other node (the longest match prefix that is a suffix of that other node). This is the Aho-Corasick algorithm, you can look it up on wikipedia or at:
http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf
This solution is linear in the number of characters in the file.
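A shorter alternative to the approaches above, offered only as a sketch: let the regex engine prefer the longest wanted item at each position by putting the items into one alternation, longest first. This is a different technique from both the subtraction method and Aho-Corasick, and the function name count_wanted is made up. It assumes the wanted items are plain strings, hence re.escape.

```python
import re
from collections import Counter


def count_wanted(text, wanted):
    """Count wanted items, letting a longer item win over the shorter ones it contains."""
    # Alternation tries branches left to right, so longest-first gives longest match.
    ordered = sorted(wanted, key=len, reverse=True)
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(w) for w in ordered))
    return Counter(pattern.findall(text))


text = "test file with words fish, steak fish chips fish and chips"
counts = count_wanted(text, ["fish and chips", "fish", "chips", "steak"])
for item, count in counts.most_common():
    print("%s: %d" % (item, count))
```

Because findall consumes each match, the words inside a matched "fish and chips" are not counted again on their own.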

Find multiline blocks of text in R and R-Studio using Regular Expressions

I was quite good at using regular expressions in PHP, but now I'm completely stuck with a new programming language, R.
An example of the working code is here https://regex101.com/r/mO1yR3/2
What I want to do, is to find and replace block of texts containing (Mini) at the header. Just delete those blocks from the text and save it to a file.
I spent a day looking for a solution and I'm on the brink of dumping R. It's much faster to do it in PHP, Perl or Python.
The code I used for R:
library(readr)
library(stringr)
contentsTosCSV <- read_file("d:/ProVallue/Provalue Group/BackTesting/SPY/2015/test.txt")
contentsTosCSV <- str_replace_all(contentsTosCSV, '\\r', '')#deleting \r
matches <- grep('|(^.*\\(Mini\\)(?s).*?\\n\\n)|mg', contentsTosCSV, ignore.case = FALSE, perl = TRUE, value = TRUE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
print(matches)
It matches the whole string in contentsTosCSV
Then I tried these:
matches <- grep('(?m)(^.*\\(Mini\\)(?s).*?\\n\\n)', contentsTosCSV, ignore.case = TRUE, perl = TRUE, value = TRUE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
print(matches)
and replacing the m modifier with [.\n], both with and without (?m) and (?s):
matches <- grep('(^.*\\(Mini\\)[.\\n]*?\\n\\n)', contentsTosCSV, ignore.case = TRUE, perl = TRUE, value = TRUE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
print(matches)
example text:
Last,Net Chng,Volume,Open,High,Low
209.79,-.71,"113,965,728",210.46,210.53,208.65
JUL4 15 (-10) 100 (Weeklys)
,,Mark,Last,Delta,Impl Vol,Open.Int,Volume,Bid,Ask,Exp,Strike,Bid,Ask,Mark,Last,Delta,Impl Vol,Open.Int,Volume,,
,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,
SEP 15 (46) 100
,,Mark,Last,Delta,Impl Vol,Open.Int,Volume,Bid,Ask,Exp,Strike,Bid,Ask,Mark,Last,Delta,Impl Vol,Open.Int,Volume,,
,,129.790,127.28,1.00,0.00%,0,0,129.55,129.87,SEP 15,80,0,.01,.005,.01,.00,81.60%,"2,964",0,,
,,.005,.01,.00,14.71%,"4,563",0,0,.01,SEP 15,245,36.19,36.52,36.355,0,-.94,27.65%,0,0,,
,,.005,.02,.00,16.54%,"2,473",0,0,.01,SEP 15,250,41.19,41.49,41.340,38.87,-.94,30.20%,118,0,,
SEP 15 (46) 10 (Mini)
,start,Mark,Last,Delta,Impl Vol,Open.Int,Volume,Bid,Ask,Exp,Strike,Bid,Ask,Mark,Last,Delta,Impl Vol,Open.Int,Volume,,
,,20.165,15.70,.91,21.64%,1,0,17.75,22.58,SEP 15,190,.52,4.99,2.755,2.22,-.19,32.90%,26,0,,
,,19.165,0,.91,20.79%,0,0,16.80,21.53,SEP 15,191,0,4.99,2.495,0,-.19,30.53%,0,0,,
,,18.230,21.31,.90,20.46%,2,0,15.83,20.63,SEP 15,192,0,4.99,2.495,2.90,-.19,29.45%,6,0,,end
SEP5 15 (58) 100 (Quarterlys)
,,Mark,Last,Delta,Impl Vol,Open.Int,Volume,Bid,Ask,Exp,Strike,Bid,Ask,Mark,Last,Delta,Impl Vol,Open.Int,Volume,,
,,134.790,132.33,1.00,0.00%,0,0,134.54,134.88,SEP5 15,75,0,.02,.010,.01,.00,81.69%,"2,375",0,,
,,129.790,127.37,1.00,0.00%,0,0,129.54,129.88,SEP5 15,80,0,.02,.010,.01,.00,76.86%,620,0,,
OCT 15 (74) 100
,,Mark,Last,Delta,Impl Vol,Open.Int,Volume,Bid,Ask,Exp,Strike,Bid,Ask,Mark,Last,Delta,Impl Vol,Open.Int,Volume,,
,,73.790,0,1.00,0.00%,0,0,73.56,73.89,OCT 15,136,.01,.03,.020,0,.00,33.93%,0,0,,
,,72.790,0,1.00,0.00%,0,0,72.57,72.89,OCT 15,137,.01,.03,.020,0,.00,33.35%,0,0,,
,,71.790,0,1.00,0.00%,0,0,71.57,71.89,OCT 15,138,.01,.04,.025,.04,.00,33.54%,300,0,,
,,70.790,0,1.00,0.00%,0,0,70.57,70.90,OCT 15,139,.02,.04,.030,0,.00,33.62%,0,0,,
I based this on your regex101 sample and modified your regex using the [\s\S] trick.
You can use a regex like this:
(\(Mini\)[\s\S]*?end)
Here you can find the update:
Working demo
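Since the question mentions that this is quick in Python, here is a minimal sketch of the block-removal idea without one large regex. It assumes, as the \n\n in the question's own patterns suggests, that blocks are separated by blank lines and that "(Mini)" appears on the first line of a block. The name drop_mini_blocks and the tiny sample are made up for illustration.

```python
import re


def drop_mini_blocks(text):
    """Remove every blank-line-separated block whose header line contains '(Mini)'."""
    text = text.replace("\r", "")          # normalise Windows line endings first
    blocks = re.split(r"\n{2,}", text)     # one block per blank-line-separated chunk
    kept = [b for b in blocks if "(Mini)" not in b.split("\n", 1)[0]]
    return "\n\n".join(kept)


sample = (
    "SEP 15 (46) 100\nrow1\n\n"
    "SEP 15 (46) 10 (Mini)\nrow2\nrow3\n\n"
    "OCT 15 (74) 100\nrow4\n"
)
cleaned = drop_mini_blocks(sample)
print(cleaned)
```

The same split-filter-join approach should be possible in R with strsplit and grepl, though that translation is not shown or tested here.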