String to numeric conversion using regex in Scala

I have an array of numbers stored as strings:
val original_array = Array("-0,1234567",......)
and I want to convert it to a numeric array:
val new_array = Array("1234567", ........)
How can I achieve this in Scala?
Using original_array.toDouble gives an error.

The simple answer is ...
val arrNums = Array("123", "432", "99").map(_.toDouble)
... but this is a little dangerous because it will throw if any of the strings are not proper numbers.
This is safer...
val arrNums = Array("123", "432", "99").collect {
  case n if n matches """\d+""" => n.toDouble
}
... but you'll want to use a regex pattern that covers all cases. This example won't recognize floating point numbers ("1.1") or negatives ("-4"). Something like """-?\d*\.?\d+""" might fit your requirements.
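For comparison, the same filter-then-convert idea can be sketched in Python. This is only an illustration under assumptions: the function name to_numbers is hypothetical, and the pattern is the one suggested above, so inputs such as "1." or "1e5" (and comma decimals like "1,5") are still skipped rather than converted.

```python
import re

# Pattern suggested above: optional minus sign, optional integer digits,
# optional decimal point, at least one trailing digit.
NUMBER = re.compile(r"-?\d*\.?\d+")

def to_numbers(strings):
    """Convert only the strings that fully match NUMBER; skip the rest."""
    return [float(s) for s in strings if NUMBER.fullmatch(s)]

print(to_numbers(["123", "-4", "1.1", "cat", "1,5"]))
```

Using fullmatch rather than search matters here: a partial match inside "1,5" would otherwise let a string through that float cannot parse as a single number.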

Related

How do I match Regex Pattern on List to filter out decimal elements in Scala

I am wondering without creating a function, how can I filter out numbers from a list with both numbers and strings:
val a = sc.parallelize(List("cat","horse",4.0,3.5,2,"dog"))
I believe my question indeed is looking for how to use regex in Scala to find out matched pattern
----Updated on 20180302 11pm EST:
Thanks to @Nyavro, whose answer is the closest. I modified it slightly as below:
val doubles = a.collect {
case v: Double => v
case v: Int => v
}
Now I get:
res10: Array[Double] = Array(4.0, 3.5, 2.0)
Just curious: can types be mixed in a collect result in Scala?
Thanks to all replies.
Use collect:
val doubles = a.collect {
case v: Double => v
}
To filter for elements of type Int and Double, and to retain their respective types, you might try this.
a.flatMap {
case v: Int => Some(v)
case v: Double => Some(v)
case _ => None
}
//res0: List[AnyVal] = List(4.0, 3.5, 2)
To help understand why this is a really bad idea, read this question, and its answers.
You can use isInstanceOf to check whether an element of your list is a string.
val l = List("cat","horse",4.0,3.5,2,"dog")
l.filter(_.isInstanceOf[String])
>> List[Any] = List(cat, horse, dog)
Regex is (largely) irrelevant here because you do not have strings, you have a List[Any] that you're turning into an RDD[Any]. (The RDD is largely irrelevant here, too, except RDDs have no filterNot and Lists do--I can't tell if you want to keep the strings or drop the strings.)
Note also that filter takes a function as an argument--having some function here is inescapable, even if it's anonymous, as it is in my example.
I have an inkling, though, that I've given an answer to the opposite of what you're asking, and you have an RDD[String] that you want to convert to RDD[Double], throwing away the strings that don't convert. In that case, I would try to convert the strings to doubles, wrapping that in a Try and check for success, using the result to filter:
import scala.util.Try
def isDouble(s: String) = Try(s.toDouble).isSuccess
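The same "attempt the conversion and keep what succeeds" idea, sketched in Python for illustration. The names is_double and values are hypothetical, and float() is slightly more permissive than a strict numeric pattern (it accepts, for example, "nan" and surrounding whitespace), so treat this as an assumption-laden sketch.

```python
def is_double(s):
    """Return True if float() accepts s, similar in spirit to Try(s.toDouble).isSuccess."""
    try:
        float(s)
        return True
    except ValueError:
        return False

values = ["cat", "4.0", "3.5", "2", "dog"]
# Keep only the strings that convert, then convert them.
doubles = [float(s) for s in values if is_double(s)]
print(doubles)
```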

How can I extract a file name based on number string?

I have a list of filenames in a struct array, example:
4x1 struct array with fields:
name
date
bytes
isdir
datenum
where files.name
ans =
ts.01094000.crest.csv
ans =
ts.01100600.crest.csv
etc.
I have another list of numbers (say, 1094000). And I want to find the corresponding file name from the struct.
Please note, that 1094000 doesn't have preceding 0. Often there might be other numbers. So I want to search for '1094000' and find that name.
I know I can do it using regex, but I have never used it before, and I am finding it difficult to search for numbers instead of text using strfind. Any suggestion or other method is welcome.
What I have tried:
regexp(files.name,'ts.(\d*)1094000.crest.csv','match');
I think the regular expression you'd want is more like
filenames = {'ts.01100600.crest.csv','ts.01094000.crest.csv'};
matches = regexp(filenames, ['ts\.0*' num2str(1094000) '\.crest\.csv']);
matches = ~cellfun('isempty', matches);
filenames(matches)
For a solution with strfind...
Pre-16b:
match = ~cellfun('isempty', strfind({files.name}, num2str(1094000)),'UniformOutput',true)
files(match)
16b+:
match = contains({files.name}, string(1094000))
files(match)
However, the strfind way might have issues if the number you are looking for exists in unexpected places such as looking for 10 in ["01000" "00101"].
If your filenames match the pattern ts.NUMBER.crest.csv, then in 16b+ you could do:
str = {files.name};
str = extractBetween(str,4,'.');
str = strip(str,'left','0');
matches = str == string(1094000);
files(matches)
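For readers more familiar with Python, the leading-zero-tolerant match can be sketched as follows. It assumes the names follow the ts.NUMBER.crest.csv pattern from the question; find_by_number is a hypothetical helper name.

```python
import re

def find_by_number(filenames, number):
    """Return filenames of the form ts.<optional zeros><number>.crest.csv."""
    pattern = re.compile(r"ts\.0*" + re.escape(str(number)) + r"\.crest\.csv")
    # fullmatch avoids hits where the number only appears inside a longer one.
    return [name for name in filenames if pattern.fullmatch(name)]

names = ["ts.01100600.crest.csv", "ts.01094000.crest.csv"]
print(find_by_number(names, 1094000))
```

Because the whole name must match, looking for 10 does not pick up ts.01000.crest.csv or ts.00101.crest.csv, which is the pitfall noted for the plain substring search.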

Finding the shortest repetitive pattern in a string

I was wondering if there is a way to do pattern matching in Octave / Matlab? I know Maple 10 has commands to do this, but I am not sure what I need to do in Octave / Matlab. So if a number was 12341234123412341234 the pattern match would be 1234. I'm trying to find the shortest pattern that upon repetition generates the whole string.
Please note: the numbers (only numbers will be used) won't be this simple. Also, I won't know the pattern ahead of time (that's what I'm trying to find). Please see the Maple 10 example below which shows that the pattern isn't known ahead of time but the command finds the pattern.
Example of Maple 10 pattern matching:
ns:=convert(12341234123412341234,string);
ns := "12341234123412341234"
StringTools:-PrimitiveRoot(ns);
"1234"
How can I do this in Octave / Matlab?
Ps: I'm using Octave 3.8.1
To find the shortest pattern that upon repetition generates the whole string, you can use regular expressions as follows:
result = regexp(str, '^(.+?)(?=\1*$)', 'match');
Some examples:
>> str = '12341234123412341234';
>> result = regexp(str, '^(.+?)(?=\1*$)', 'match')
result =
'1234'
>> str = '1234123412341234123';
>> result = regexp(str, '^(.+?)(?=\1*$)', 'match')
result =
'1234123412341234123'
>> str = 'lullabylullaby';
>> result = regexp(str, '^(.+?)(?=\1*$)', 'match')
result =
'lullaby'
>> str = 'lullaby1lullaby2lullaby1lullaby2';
>> result = regexp(str, '^(.+?)(?=\1*$)', 'match')
result =
'lullaby1lullaby2'
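The same pattern also works with Python's re module, which supports a backreference inside a lookahead. A minimal sketch, assuming the input contains no newlines (the function name shortest_repeating_unit is hypothetical):

```python
import re

def shortest_repeating_unit(s):
    """Shortest prefix whose repetition yields s; s itself if nothing repeats."""
    # Lazy group 1 grows one character at a time; the lookahead requires
    # the remainder of the string to be zero or more copies of group 1.
    m = re.match(r"^(.+?)(?=\1*$)", s)
    return m.group(1) if m else s

print(shortest_repeating_unit("12341234123412341234"))
print(shortest_repeating_unit("1234123412341234123"))
```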
I'm not sure if this can be accomplished with regular expressions. Here is a script that will do what you need in the case of a repeated word called pattern.
It loops through the characters of a string called str, trying to match against another string called pattern. If matching fails, the pattern string is extended as needed.
EDIT: I made the code more compact.
str = 'lullabylullabylullaby';
pattern = str(1);
matchingState = false;
sPtr = 1;
pPtr = 1;
while sPtr <= length(str)
if str(sPtr) == pattern(pPtr) %// if match succeeds, keep looping through pattern string
matchingState = true;
pPtr = pPtr + 1;
pPtr = mod(pPtr-1,length(pattern)) + 1;
else %// if match fails, extend pattern string and start again
if matchingState
sPtr = sPtr - 1; %// don't change str index when transitioning out of matching state
end
matchingState = false;
pattern = str(1:sPtr);
pPtr = 1;
end
sPtr = sPtr + 1;
end
display(pattern);
The output is:
pattern =
lullaby
Note:
This doesn't allow arbitrary delimiters between occurrences of the pattern string. For example, if str = 'lullaby1lullaby2lullaby1lullaby2';, then
pattern =
lullaby1lullaby2
This also allows the pattern to end mid-way through a cycle without changing the result. For example, str = 'lullaby1lullaby2lullaby1'; would still result in
pattern =
lullaby1lullaby2
To fix this you could add the lines
if pPtr ~= 1 %// pPtr wraps back to 1 only after a complete cycle of the pattern
pattern = str;
end
Another approach is as follows:
1. Determine the length of the string, and find all possible factors of that length.
2. For each possible factor length, reshape the string and check for a repeated substring.
To find all possible factors, see this solution on SO. The next step can be performed in many ways, but I implement it in a simple loop, starting with the smallest factor length.
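function repeat = repeats_in_string(str)
ns = numel(str);
nf = find(rem(ns, 1:ns) == 0); % all divisors of the string length
for ii = 1:numel(nf)
    repeat = str(1:nf(ii));
    % one chunk per row; 'rows' compares whole chunks rather than single characters
    if all(ismember(reshape(str, nf(ii), [])', repeat, 'rows'))
        break;
    end
end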
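The divisor-based approach can also be sketched in Python, where string repetition stands in for the reshape-and-compare step. This is an illustrative sketch, not a translation of the function above; it simply tries every length that divides the string length, smallest first.

```python
def repeats_in_string(s):
    """Return the shortest substring whose repetition reproduces s exactly."""
    n = len(s)
    for size in range(1, n + 1):
        # Only lengths that divide n can tile the string without a remainder.
        if n % size == 0 and s[:size] * (n // size) == s:
            return s[:size]
    return s  # only reached for the empty string

print(repeats_in_string("lullabylullabylullaby"))
```

Comparing the whole rebuilt string, rather than checking characters individually, is what rules out false positives such as "abba", where both halves use the same letters in a different order.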
This problem is a great Rorschach test for your approach to problem solving. I'll add a signal engineering solution, which should be simple since the signal is expected to be perfectly repetitive, assuming this holds: find the shortest pattern that upon repetition generates the whole string.
In the following str fed to the function is actually a column vector of floats, not a string, the original string having been converted with str2num(str2mat(str)'):
function res=findshortestrepel(str);
[~,ii] = max(fft(str-mean(str)));
res = str(1:round(numel(str)/(ii-1)));
I performed a small test, comparing this to the regexp solution and found it to be faster overall (blue squares), although somewhat inconsistently, and only if you don't consider the time required to convert the string into a vector of floats (green squares). However I did not pursue this further (not breaking records with this):
[Figure: timing comparison of the two approaches, times in seconds.]

Convert list of one number to int

I have a regular expression that parses a line# string from a log. That line# is then subjected to another regular expression to just extract the line#.
For example:
Part of this regex:
m = re.match(r"^(\d{4}-\d{2}-\d{2}\s*\d{2}:\d{2}:\d{2}),?(\d{3}),?(?:\s+\[(?:[^\]]+)\])+(?<=])(\s+?[A-Z]+\s+?)+(\s?[a-zA-Z0-9\.])+\s?(\((?:\s?\w)+\))\s?(\s?.)+", line)
Will match this:
(line 206)
Then this regex:
re.findall(r'\b\d+\b', linestr)
Gives me
['206']
In order to further process my information I need to have the line number as an integer and am lost for a solution as to how to do that.
You may try:
line_int = int(re.findall(r'\b\d+\b', linestr)[0])
or if you have more than one element in the list:
lines_int = [int(i) for i in re.findall(r'\b\d+\b', linestr)]
or even
lines_int = list(map(int, re.findall(r'(\b\d+\b)+', linestr)))
I hope it helps -^.^-
Use int() to convert your list of one "string number" to an int:
myl = ['206']
int(myl[0])
206
If you have a list of these, you can convert them all to ints using a list comprehension:
[int(i) for i in myl]
resulting in a list of ints.
You can hook this into your code as best fits, e.g.,
int(re.findall(r'\b\d+\b', linestr)[0])
int(re.findall(r'\b\d+\b', linestr)[0])
?
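One caveat with indexing [0] directly: it raises IndexError when the string contains no number. A small sketch that guards against that case, assuming you only need the first number (first_int is a hypothetical helper name):

```python
import re

def first_int(text):
    """Return the first whole number in text as an int, or None if there is none."""
    m = re.search(r"\b\d+\b", text)
    return int(m.group()) if m else None

print(first_int("(line 206)"))
print(first_int("(no number here)"))
```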

How to accept numbers and specific words?

I am validating a clothes size field, and want it to accept only numbers and specific "words" like S, M, XL, XXL etc. But I am unsure how to add the words to the pattern. For example, I want it to match something like "2, 5, 23, S, XXXL", which are valid sizes, but not random combinations of letters like "2X3, SLX".
OK, since people are not suggesting regexp solutions, I should say that this is part of a larger validation method which uses regexp. For convenience and code consistency I want to do this with regexp.
Thanks
If they're a known set of values, I am not sure a regex is the best way to do it. But here is one regex that is basically a brute-force match of your values, each with a \b (word boundary) anchor
\b2\b|\b5\b|\b23\b|\bXXXL\b|\bXL\b|\bM\b|\bS\b
Sorry for not giving you a straight answer. regexp might be overkill in your case. A solution without it could, depending on your needs, be more maintainable.
I don't know which language you use so I will just pick one randomly. You could treat it as a piece of pseudo code.
In PHP:
function isValidSize($size) {
    $validSizeTags = array("S", "M", "XL", "XXL");
    $minimumSize = 2;
    $maximumSize = 23;
    if (ctype_digit(strval($size))) { // explanation for not using is_numeric/is_int below
        // numeric sizes are valid only inside the configured range
        return $size >= $minimumSize && $size <= $maximumSize;
    }
    // anything else must be one of the known size tags
    return in_array($size, $validSizeTags);
}
$size = "XL";
$isValid = isValidSize($size); // true
$size = 23;
$isValid = isValidSize($size); // true
$size = "23";
$isValid = isValidSize($size); // true, is_int would return false here
$size = 50;
$isValid = isValidSize($size); // false
$size = 15.5;
$isValid = isValidSize($size); // false, is_numeric would return true here
$size = "diadjsa";
$isValid = isValidSize($size); // false
(The reason for using ctype_digit(strval($size)) instead of is_int or is_numeric is that the first one will only return true for real integers, not strings like "15". And the second one will return true for all numeric values, not just integers. ctype_digit will however return true for strings containing numeric characters, but return false for integers. So we convert the value to a string using strval before sending it to ctype_digit. Welcome to the world of PHP.)
With this logic in place you can easily inject validSizeTags, maximumSize and minimumSize from a configuration file or a database where you store all valid sizes for this specific product. That would get much messier using regular expressions.
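To illustrate the point about injecting the limits, here is a rough Python sketch of the same logic with the valid tags and range passed in as parameters. The function name and the default values are assumptions taken from the PHP example, not a definitive implementation.

```python
def is_valid_size(size, valid_tags=("S", "M", "XL", "XXL"), minimum=2, maximum=23):
    """Accept whole numbers within [minimum, maximum] or one of valid_tags."""
    text = str(size)
    if text.isdecimal():
        # Digits only: treat as a numeric size and check the range.
        return minimum <= int(text) <= maximum
    return text in valid_tags

print(is_valid_size("XL"), is_valid_size(23), is_valid_size(15.5))
# A different product could pass its own configuration:
print(is_valid_size(40, valid_tags=("S", "M", "L"), minimum=30, maximum=50))
```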
Here is an example in JavaScript:
var patt = /^(?:\d{1,2}|X{0,3}[SML])$/i;
patt.test("2"); // true
patt.test("23"); // true
patt.test("XXXL"); // true
patt.test("S"); // true
patt.test("SLX"); // false
Use Array Membership Instead of Regular Expressions
Some problems are easier to deal with by using a different approach to representing your data. While regular expressions can be powerful, you might be better off with an array membership test if you are primarily interested in well-defined fixed values. For example, using Ruby:
sizes = %w[2 5 23 S XXXL].map(&:upcase)
size = 'XXXL'
sizes.include? size.to_s.upcase # => true
size = 'XL'
sizes.include? size.to_s.upcase # => false
Seeing as it is harder than I had thought, I am thinking of storing the individual matched values in an array and matching those individually against accepted values. I will use something like
[0-9]+|s|m|l|xl|xxl
and store the matches in the array.
Then I will check each array element against [0-9]+ and s|m|l|xl|xxl, and if it matches any of these, it's valid. Maybe there is a better way, but I can't dwell on this for too long.
Thanks for your help
This will accept the alternatives one or more times, separated by whitespace or punctuation. It should be easy enough to expand the separator character class if you think you need to.
^([Xx]{0,3}[SsMmLl]|[0-9]+)([ ,:;-]+([Xx]{0,3}[SsMmLl]|[0-9]+))*$
If you can interpolate the accepted pattern into a string before using it as a regex, you can reduce the code duplication.
This is a regular egrep pattern. Regex dialects differ between languages, so you might need to tweak something in order to adapt it to your language of choice (PHP? It's good form to include this information in the question).
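As a sketch of the interpolation idea in Python, the single-item pattern can be defined once and reused for both the first item and the repeated part. This assumes every item in the list may be either a size tag or a number; the names ITEM, SIZES and is_size_list are hypothetical.

```python
import re

# One accepted item: up to three X's followed by S, M or L, or a plain number.
ITEM = r"(?:[Xx]{0,3}[SsMmLl]|[0-9]+)"
# First item, then any number of (separator, item) pairs.
SIZES = re.compile(rf"^{ITEM}(?:[ ,:;-]+{ITEM})*$")

def is_size_list(text):
    """True if text is one or more accepted sizes separated by space or punctuation."""
    return SIZES.match(text) is not None

print(is_size_list("2, 5, 23, S, XXXL"))
print(is_size_list("2X3, SLX"))
```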