matlab - extracting numbers from (odd) string - regex

I have a series of strings in a CSV file; they all look like the two below:
7336598,"[4125420656L, 2428145712L, 1820029797L, 1501679119L, 1980837904L, 380501274L]"
7514340,"[507707719L, 901144614L, 854823005L]"
....
How can I extract the numbers from them?
That is, retrieve 7336598, 4125420656, etc.
I tried textscan and regexp, without much success.
Sorry for the beginner's question, and thank you for having a look! :)
Edit: the size of each line is variable.

You can use textread and regexp to extract only the numbers from your CSV file:
C = textread('file.csv', '%s', 'delimiter', '\n');
C = regexp(C, '\d+', 'match');
The regular expression is quite simple. In MATLAB's regexp pattern, \d denotes a digit, and the + indicates that the digit must occur at least once. The 'match' option tells regexp to return the matched strings.
The result is a cell array of strings. You can go further and convert the strings to numerical values:
C = cellfun(@(x)str2num(sprintf('%s ', x{:})), C, 'Uniform', false)
The result is still stored in a cell array. If you can guarantee that every row contains the same number of numerical values, you can convert the cell array to a matrix:
A = cell2mat(C);

I don't have MATLAB to test, but does '[0-9]+' do the job?
It works for me outside MATLAB:
echo '7336598,"[4125420656L, 2428145712L, 1820029797L, 1501679119L, 1980837904L, 380501274L]"' | grep -o '[0-9]\+'
7336598
4125420656
2428145712
1820029797
1501679119
1980837904
380501274
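The same digit-run pattern carries over to other regex engines. As a rough sketch in Python rather than MATLAB (the function name extract_numbers is made up for illustration), re.findall returns every run of digits on a line, so the trailing L suffixes, quotes and brackets are simply skipped:

```python
import re

def extract_numbers(line):
    """Return every run of digits in the line as an int, in order of appearance."""
    return [int(token) for token in re.findall(r"[0-9]+", line)]

line = '7514340,"[507707719L, 901144614L, 854823005L]"'
print(extract_numbers(line))
```

Because each line is handled on its own, a variable number of values per line is not a problem here.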

Related

Find group of strings starting and ending by a character using regular expression

I have a string, and I want to extract, using regular expressions, groups of characters that are between the character : and the other character /.
typically, here is a string example I'm getting:
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
So I want to retrieve 45.72643,4.91203 and also hereanotherdata,
as they are both between the characters : and /.
I tried this syntax on an easier string where the pattern occurs only once:
[tt]=regexp(str,':(\w.*)/','match')
tt = ':45.72643,4.91203/'
but it works only if the pattern occurs once. If I use it on a string containing the pattern multiple times, I get everything between the first : and the last /.
How can I specify that the pattern will occur multiple times, and how can I retrieve each match?
Use lookaround and a lazy quantifier:
regexp(str, '(?<=:).+?(?=/)', 'match')
Example (Matlab R2016b):
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = regexp(str, '(?<=:).+?(?=/)', 'match')
result =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
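For comparison, the same lookaround-plus-lazy-quantifier idea can be sketched in Python (not MATLAB; the helper name between is an assumption for illustration). The lookbehind and lookahead keep the delimiters out of the match, and .+? stops at the first / instead of the last:

```python
import re

def between(text, start=":", end="/"):
    """Return each shortest run of characters found between start and end."""
    pattern = "(?<=" + re.escape(start) + ").+?(?=" + re.escape(end) + ")"
    return re.findall(pattern, text)

s = "abcd:45.72643,4.91203/Rou:hereanotherdata/defgh"
print(between(s))

# The greedy version from the question runs from the first ':' to the last '/'.
print(re.findall(r":(.*)/", s))
```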
In most languages this is hard to do with a single regexp. Ultimately you'll only ever get back the one string, and you want to get back multiple strings.
I've never used Matlab, so it may be possible in that language, but based on other languages, this is how I'd approach it...
I can't give you the exact code, but a search indicates that in Matlab there is a function called strsplit, example...
C = strsplit(data,':')
That should break your original string up into an array of strings, using the ":" as the break point. You can then ignore the first array element (as it contains the text before the first ":"), loop over the rest of the array and use a regexp to extract everything that comes before a "/".
So for instance...
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
Breaks down into an array with parts...
1 - 'abcd'
2 - '45.72643,4.91203/Rou'
3 - 'hereanotherdata/defgh'
Then Ignore 1, and extract everything before the "/" in 2 and 3.
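The split-then-trim steps above can be sketched without any regular expression. This is a Python illustration of the idea only, not tested MATLAB, and the function name between_by_split is hypothetical:

```python
def between_by_split(text, start=":", end="/"):
    """Split on start, drop the leading piece, keep what precedes end in each remaining piece."""
    pieces = text.split(start)[1:]  # the first piece is the text before any start
    # A piece without end has no closing delimiter, so it is skipped.
    return [piece.split(end, 1)[0] for piece in pieces if end in piece]

s = "abcd:45.72643,4.91203/Rou:hereanotherdata/defgh"
print(between_by_split(s))
```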
As John Mawer and Adriaan mentioned, strsplit is a good place to start with. You can use it for both ':' and '/', but then you will not be able to determine where each of them started. If you do it with strsplit twice, you can know where the ':' starts :
A='abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
B=cellfun(@(x) strsplit(x,'/'),strsplit(A,':'),'uniformoutput',0);
Now each cell of B holds the pieces of one ':'-delimited segment, and a segment that contained a '/' has been split into more than one piece. You can extract the results by selecting the cells of B that hold more than one piece and taking the first piece of each:
C=cellfun(@(x) x{1},B(cellfun('length',B)>1),'uniformoutput',0)
C =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
Starting in R2016b you can use extractBetween:
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = extractBetween(str,':','/')
result =
2×1 cell array
{'45.72643,4.91203'}
{'hereanotherdata' }
If all your text elements have the same number of delimiters this can be vectorized too.

Reg exp in matlab

I'm analyzing a file in matlab and I want to find the number of occurrences of the letter I (capitalized). I'm confused on how to write the regular expression for this step. Would it be something like (lines,'.I.')? Any help would be greatly appreciated.
If you want to count the number of capital 'I's in a file, assuming you have read the file in as a string, you could just do this:
count = sum(file_string == 'I');
If, as in this case, the file is read into a cell-string, one possible way of doing this would be to use:
count = sum(strcat(file_cellstr{:}) == 'I');
strcat will concatenate all of the strings passed to it into a single string. Passing file_cellstr{:} to strcat is essentially concatenating each of the cells (i.e. each line in your case) into a single string, then searching through it for the letter 'I'. If you wanted to find a whole word, you could use
count = length(strfind(strcat(file_cellstr{:}),'word'));
If you wanted a regular expression match, you could do the following:
count = length(regexp(strcat(file_cellstr{:}),'[a-z]+'));
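As a cross-check of the counting logic, here is a small Python sketch (not MATLAB; the sample lines and function names are made up). It counts per line and sums the results instead of concatenating first, which avoids gluing the last word of one line to the first word of the next:

```python
import re

def count_letter(lines, letter="I"):
    """Count occurrences of a single character across all lines."""
    return sum(line.count(letter) for line in lines)

def count_matches(lines, pattern):
    """Count non-overlapping regex matches across all lines."""
    return sum(len(re.findall(pattern, line)) for line in lines)

lines = ["I think", "If I can", "no capitals here"]
print(count_letter(lines))
print(count_matches(lines, r"[a-z]+"))
```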

How to print an integer with a thousands separator in Matlab?

I would like to turn a number into a string using a comma as a thousands separator. Something like:
x = 120501231.21;
str = sprintf('%0.0f', x);
but with the effect
str = '120,501,231.21'
If the built-in fprintf/sprintf can't do it, I imagine a cool solution could be made using regular expressions, perhaps by calling Java (which I assume has some locale-based formatter), or with a basic string-insertion operation. However, I'm not an expert in either Matlab regexps or calling Java from Matlab.
Related question: How can I print a float with thousands separators in Python?
Is there any established way to do this in Matlab?
One way to format numbers with thousands separators is to call the Java locale-aware formatter. The "formatting numbers" article at the "Undocumented Matlab" blog explains how to do this:
>> nf = java.text.DecimalFormat;
>> str = char(nf.format(1234567.890123))
str =
1,234,567.89
where the char(…) converts the Java string to a Matlab string.
voilà!
Here's the solution using regular expressions:
%# 1. create your formatted string
x = 12345678;
str = sprintf('%.4f',x)
str =
12345678.0000
%# 2. use regexprep to add commas
%# flip the string to start counting from the back
%# and make use of the fact that Matlab regexp matches don't overlap
%# The three parts of the regex are
%# (\d+\.)? - looks for any number of digits followed by a dot
%# before starting the match (or nothing at all)
%# (\d{3}) - a packet of three digits that we want to match
%# (?=\S+) - requires that there's at least one non-whitespace character
%# after the match to avoid results like ",123.00"
str = fliplr(regexprep(fliplr(str), '(\d+\.)?(\d{3})(?=\S+)', '$1$2,'))
str =
12,345,678.0000
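The flip-and-replace trick is not specific to MATLAB. A hedged Python port of the same pattern (the function name group_thousands is an assumption) looks like this; a replacement function is used so that the optional first group can be absent:

```python
import re

def group_thousands(number_string):
    """Insert commas by working on the reversed string, as in the regexprep call above."""
    flipped = number_string[::-1]
    grouped = re.sub(
        r"(\d+\.)?(\d{3})(?=\S+)",
        lambda m: (m.group(1) or "") + m.group(2) + ",",
        flipped,
    )
    return grouped[::-1]

print(group_thousands("12345678.0000"))
print(group_thousands("1234567"))
```

Tracing the pattern by hand suggests it misplaces the comma for an input like '123.0000' (the result is '123.0,000'), so treat this as a sketch of the technique rather than a robust formatter.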

Put thousand separator using REGEX replace method

I have to put a thousands separator between numbers. Here is what I have so far:
Input string
1852
2589653
586699
8542.28
The find pattern
(?<=\d)(?=(?:\d{3})+(?!\d))
replace-with
,
result
1,852
2,589,653
586,699
8,542.28
TODO
I want to exclude all years in the range 1700 to 2010 from the match collection.
Does anyone have any idea? All suggestions are welcome. Thanks in advance.
This is not a good way to use regular expressions.
Instead, use the string formatting features of your language:
Use a simple regular expression to find the numbers, if you have to extract them from text.
Convert them to floating point or integer numbers (as appropriate)
Use a string format specifier to say you would like them output with a thousands separator.
For example, here's a shell transcript where I extract a number from a string and format it with a comma thousands separator: (Python 2.x)
In [12]: import re
In [13]: number_pattern = re.compile(r"\d+(?:\.\d+)?") #non-negative integer or floating point number
In [14]: mystring = "The size of the rocket is 3141592.6."
In [15]: number_string = number_pattern.search(mystring).group() #extract the number as a string
In [16]: number_string
Out[16]: '3141592.6'
In [18]: number = float(number_string) #convert to number
In [19]: '{:,}'.format(number) #format with thousands separator
Out[19]: '3,141,592.6'
Doing it this way also makes eliminating ranges of numbers trivial.
if 1700 <= number <= 2010:
    pass #looks like a year: leave it unformatted
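Putting the three steps together, a minimal sketch might look like the following. The function name add_separators, the year_range parameter and the rule that only integers without a decimal point count as years are all assumptions made for this illustration:

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")  # integer, optionally followed by a decimal part

def add_separators(text, year_range=(1700, 2010)):
    """Add thousands separators to numbers in text, leaving plain integers in year_range alone."""
    def fmt(match):
        token = match.group()
        if "." in token:
            whole, frac = token.split(".")
            return "{:,}".format(int(whole)) + "." + frac
        if year_range[0] <= int(token) <= year_range[1]:
            return token  # treated as a year, not reformatted
        return "{:,}".format(int(token))
    return NUMBER.sub(fmt, text)

print(add_separators("1852 2589653 586699 8542.28"))
```

Formatting only the integer part keeps the decimals exactly as they appeared in the input, which a round trip through float would not guarantee.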
Here is a RegEx Example in PowerShell:
[Regex]::Replace(12345, '[0-9](?=(?:[0-9]{3})+(?![0-9]))', '$0''')
12'345

c++ regular expression matching whole line

I am trying to parse a text file that contains numeric data. I have a lot of lines that look like
129.3 72.7 121.6 173.6 203.3 120.7 40.5 79.2 94.0 123.2 165.8 178.8 135.5 78.5 66.2
but the length of the lines varies. Each line is also preceded by a few spaces.
I would like to use regular expressions to parse the line and place each number into an array that I can then manipulate later.
Using
std::getline(is, line);
std::tr1::regex rx("[0-9-\.]+");
std::tr1::cmatch res;
std::tr1::regex_search(line.c_str(), res, rx);
only matches the first number. If instead I use line anchors such as
"^[0-9-\.]+$"
"^[0-9-\.]+"
I get no matches and
"[0-9-\.]+$"
just matches the last number. So I am probably doing something wrong. Thanks for any help.
Um, pseudocode
for str in strtok(input string)
vector[index] = convert str to float
Here's an example using lots of stream magic: Split a string in C++?
Here's an example using a vector:
Splitting a string by whitespace in c++
But plain old strtok is probably easiest:
http://www.cplusplus.com/reference/clibrary/cstring/strtok/
in which case you'll get something like
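#include <cstdlib>  // std::atof
#include <cstring>  // std::strtok
#include <vector>
std::vector<double> flts;
// str must be a writable char buffer, because strtok modifies it in place
for (char *cp = std::strtok(str, " "); cp != NULL; cp = std::strtok(NULL, " ")) {
    flts.push_back(std::atof(cp));
}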
Now, that's very C like because I'm out of practice for C++, but the key point here is that by trying to use regex, you make it overcomplicated.
You need to include the space between the numbers in your match to match the whole line.
BTW, take a look at C++ tokenize a string using a regular expression to see a rather closely related answer.
You really shouldn't be using arrays here, use the standard containers for safety, convenience and sanity of anyone who has to look at this code later.
It looks like the regex has a small issue:
"[0-9-\.]+"
should be more like:
"[0-9\.]"
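Your regex might be incorrect; you should try:
[0-9\.]+
Also keep in mind that std::tr1::cmatch holds the sub-matches of a single match: res[0] is the whole match and res[1], res[2], ... are capture groups, so res[2] will not contain 72.7. To visit every number, iterate over the matches, for example with std::tr1::regex_iterator.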
Using egrep you can experiment a bit:
egrep "[0-9-\.]+" /tmp/x
egrep: Invalid range end
but
egrep "^[0-9\.]+" /tmp/x
matches only
129.3
and
egrep "[0-9\.]+" /tmp/x
matches all
129.3 72.7 121.6 173.6 203.3 120.7 40.5 79.2 94.0 123.2 165.8 178.8 135.5 78.5 66.2
You don't need ^ in front because it anchors the match to the start of the string, i.e. you get only the first sequence of numbers.
You don't need $ because it anchors the match to the end of the string, so you get only the last sequence of numbers.
You need + since you want to match whole runs of characters of type [0-9\.].
You can also get a short guide to regex matching on any Unix system by issuing
man -S 7 regex
p.s. /tmp/x is a file with the line that is provided in the question.
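The effect of the anchors is much the same across regex engines, so it can be sketched quickly in Python instead of C++ (the sample line is shortened and the variable names are made up). With leading spaces on the line, ^ finds nothing at all, which matches what the question reports:

```python
import re

line = "   129.3  72.7 121.6 173.6"

# No anchors: every run of digits and dots is found.
all_numbers = [float(tok) for tok in re.findall(r"[0-9.]+", line)]

# Anchored at the start: fails here because the line begins with spaces.
anchored_start = re.findall(r"^[0-9.]+", line)

# Anchored at the end: only the last number can match.
anchored_end = re.findall(r"[0-9.]+$", line)

# Allowing leading whitespace after ^ recovers the first number only.
first_after_spaces = re.findall(r"^\s*([0-9.]+)", line)

print(all_numbers, anchored_start, anchored_end, first_after_spaces)
```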