regexp with varying integer lengths - regex

I want to split strings like these:
CH1Avg
Ch2Avg
Ch3
Ch4Avg
Ch5
Ch6Avg
Chan7
Channel9
Ch010
Ch011Avg
Chann12Average
...up to...
Ch100AVG
I need to split them into their constituent parts:
"Ch", ##, "Avg"
The 1st and 3rd components vary in length and form. I want to split on the 2nd component, an integer of varying length from 0 to 100, which may or may not be zero-padded.
Any thoughts? I am trying to use () without much success.

To split the string into the constituent parts, I suggest using named tokens for convenience:
strCell = {'CH1Avg'
'Ch2Avg'
'Ch3'
'Ch4Avg'
'Ch5'
'Ch6Avg'
'Chan7'
'Channel9'
'Ch010'
'Ch011Avg'
'Chann12Average'}
out = regexp(strCell,'(?<channelName>\D+)(?<channelNum>\d+)(?<channelType>\w*)','names')
out = [out{:}];
out(end)
ans =
channelName: 'Chann'
channelNum: '12'
channelType: 'Average'
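For readers outside MATLAB, the same named-token idea can be sketched in Python. This is only an illustration, not part of the answer above: the helper name split_channel is hypothetical, and Python spells named groups as (?P<name>...) rather than (?<name>...).

```python
import re

# Same three-part pattern as the MATLAB answer, in Python's named-group syntax.
PATTERN = re.compile(r"(?P<channelName>\D+)(?P<channelNum>\d+)(?P<channelType>\w*)")

def split_channel(label):
    """Return the name, number and type parts as a dict, or None if the label has no number."""
    match = PATTERN.fullmatch(label)
    return match.groupdict() if match else None

# Hypothetical sample labels taken from the question.
for label in ["CH1Avg", "Ch3", "Ch010", "Chann12Average", "Ch100AVG"]:
    print(label, split_channel(label))
```

The number is returned as a string, so zero padding such as '010' is preserved; convert with int() if the numeric value is needed.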

Split on (\d+). The parentheses ensure that the number you're splitting on will also become part of the array.
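As a rough illustration of this behaviour, here is a Python sketch (the answer does not name a language, so Python's re.split is an assumption, and split_on_number is a hypothetical helper). With a capturing group in the pattern, the matched number is kept in the result list.

```python
import re

def split_on_number(label):
    """Split a label at its first run of digits, keeping the digits in the result."""
    # The capturing group makes re.split include the separator itself.
    return re.split(r"(\d+)", label, maxsplit=1)

print(split_on_number("Ch011Avg"))   # ['Ch', '011', 'Avg']
print(split_on_number("Channel9"))   # ['Channel', '9', '']
```

Note that a label with nothing after the number yields an empty string as its third part.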

Related

How to parse a string with comma separated values in AWS IoT SQL?

I am trying to parse a long string with comma-separated values such as "lat,long,distance,,elevation". The string is actually quite long, and I need to fetch each value and save the fetched values in different columns in DynamoDB. I am using a dynamoDBv2 rule. The functions I found that could be useful are substring(String, Int [, Int]), length(String), indexof(String, String) and get().
For example I get data like this:
{
LOCATION_DATA: "lat,long,distance,,elevation"
}
Here is what I have done so far,
//first value - 0 to next comma
substring(LOCATION_DATA, 0, indexof(LOCATION_DATA, ',')) as latitude,
//second value - substring starting from last substring to next comma
substring(substring(LOCATION_DATA, indexof(LOCATION_DATA, ',') +1 ) ,
0,
indexof(substring(LOCATION_DATA, indexof(LOCATION_DATA, ',') +1 ), ',')
) as longitude,
...
But this gets too verbose, and moving to each subsequent comma-separated value becomes increasingly difficult. Is there a way to convert the comma-separated values to an array and then fetch them with get(0), get(1), ...? I have to fetch around 20 fields this way!
Also, the values can be of varying length, and some fields can be empty, such as value between "distance,,elevation" in example strings. These empty values can be ignored.
As far as I know, there is no way to create and store custom functions, or to use any function other than those provided in http://docs.aws.amazon.com/iot/latest/developerguide/iot-sql-functions.html.
In Ruby (for example in a Rails app), you can convert a string to an array based on a separator:
Example
LOCATION_DATA = "lat,long,distance,,elevation"
myarray = LOCATION_DATA.split(',')
Then you can read the elements:
myarray[0] # => "lat"
myarray[1] # => "long"
myarray[2] # => "distance"
myarray[3] # => ""
myarray[4] # => "elevation"
You can also convert these strings to integer or float as:
myarray[0].to_i
myarray[2].to_f
Hope This Helps

Dynamic regexprep in MATLAB

I have the following strings in a long string:
a=b=c=d;
a=b;
a=b=c=d=e=f;
I want to first search for above mentioned pattern (X=Y=...=Z) and then output like the following for each of the above mentioned strings:
a=d;
b=d;
c=d;
a=b;
a=f;
b=f;
c=f;
d=f;
e=f;
In general, I want every variable to be set equal to the last variable on the extreme right of the string. Is there a way to do this using regexprep in MATLAB? I am able to do it for a fixed-length string, but I have no idea how to achieve it for variable lengths. Any help is appreciated.
My attempt for the case of two equal signs is as follows:
funstr = regexprep(funstr, '([^;])+\s*=\s*+(\w+)+\s*=\s*([^;])+;', '$1 = $3; \n $2 = $3;\n');
This is not a regexp, but if you stick to MATLAB you can use the cellfun function to avoid an explicit loop:
str = 'a=b=c=d=e=f;' ; %// input string
list = strsplit(str,'=') ;
strout = cellfun( @(a) [a,'=',list{end}] , list(1:end-1), 'uni', 0).' %'// Horchler simplification of the previous solution below
%// this does the same as above but is more convoluted
%// strout = cellfun( @(a,b) cat(2,a,'=',b) , list(1:end-1) , repmat(list(end),1,length(list)-1) , 'uni',0 ).'
Will give you:
strout =
'a=f;'
'b=f;'
'c=f;'
'd=f;'
'e=f;'
Note: As Horchler rightly pointed out in the comments, although cellfun makes your code more compact, it is just a loop in disguise. Moreover, since it operates on cell arrays, it is notoriously slow. You won't see the difference on such simple inputs, but reserve this approach for cases where performance is not a major concern.
Now, if you like regex, you must like black-magic code. If all your strings are in a cell array from the start, you can (over)abuse cellfun to obscure your code and do it all in one line.
Consider:
strlist = {
'a=b=c=d;'
'a=b;'
'a=b=c=d=e=f;'
};
Then you can have all your substring with:
strout = cellfun( @(s)cellfun(@(a,b)cat(2,a,'=',b),s(1:end-1),repmat(s(end),1,length(s)-1),'uni',0).' , cellfun(@(s) strsplit(s,'=') , strlist , 'uni',0 ) ,'uni',0)
>> strout{:}
ans =
'a=d;'
'b=d;'
'c=d;'
ans =
'a=b;'
ans =
'a=f;'
'b=f;'
'c=f;'
'd=f;'
'e=f;'
This gives you a 3x1 cell array, with one cell for each group of substrings. Each group is itself a column cell array, so if you want to concatenate them all, stack them vertically: strall = cat(1,strout{:});
I haven't had much experience with MATLAB, but your problem can be solved with a simple string split:
parts = strsplit( funstr, {' ', '='}, 'CollapseDelimiters', true );
Now take the last element of parts and iterate over the elements before it:
len = length( parts );
for i = 1:len-1
disp( [parts{i}, ' = ', parts{len}] ) % concatenate the char contents of the cells
end
disp writes the text to the command window; use fprintf instead if you need more control over the formatting.
There isn't a single regex that will cover all the cases, as explained in this answer:
https://stackoverflow.com/a/5019658/3393095
However, you have a few alternatives to achieve your final result:
1. Get all the values in the line with regexp, pick the last value, then loop over the other values to generate the output. The regex to get the values would be this:
matchStr = regexp(str,'([^=;\s]*)','match')
2. If you want to use regexprep at all costs, write a pattern generator and a replacement-expression generator based on the number of '=' in the input string, and pass their results as arguments to regexprep.
3. Forget about regex and split the input, then loop over the values to generate the output (similar to alternative #1).
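The first alternative (extract the values with a regex, take the last one, loop over the rest) can be sketched as follows. This is a Python illustration rather than MATLAB, and the function name expand_chain is hypothetical. It assumes each statement has the form X=Y=...=Z; and that variable names contain no '=', ';' or whitespace.

```python
import re

def expand_chain(statement):
    """Turn 'a=b=c=d;' into ['a=d;', 'b=d;', 'c=d;']."""
    # Same character class as the MATLAB pattern above: anything except '=', ';' and whitespace.
    names = re.findall(r"[^=;\s]+", statement)
    if len(names) < 2:
        return []  # nothing to pair up
    last = names[-1]
    return [f"{name}={last};" for name in names[:-1]]

for statement in ["a=b=c=d;", "a=b;", "a=b=c=d=e=f;"]:
    print(expand_chain(statement))
```

A long string holding several statements could be handled by splitting it on ';' first and passing each piece to the function.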

Map sequences of numbers to single characters in Scala

Given an input string, map each of three possible kinds of digit sequence contained in the string to a single digit and leave the other elements of the string unchanged:
Single number should be mapped to the char 1: "help3me" -> "help1me"
Two numbers in a row should be mapped to the char 2: "help18me" -> "help2me"
Three or more numbers in a row should be mapped to 3: "test3432help234312me" -> "test3help3me"
Our input strings can contain any number of 1,2,3+ length sequences of digits so that a valid input example is "help3490897test73me23435please5"
What is an effective solution to the above problem in Scala? Does it just involve enumerating the three possible cases as a regex?
Use a regular expression and the replaceAllIn method. The second argument is a function that takes a Match object and maps it to the length of the match, capped at 3.
val str = "help3me34"
val expr = "(\\d+)".r
expr.replaceAllIn(str, x => (x.group(0).length min 3).toString)
res2: String = help1me2
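For comparison, the same idea can be sketched in Python, where re.sub accepts a replacement function much like replaceAllIn does. The function name collapse_digit_runs is hypothetical.

```python
import re

def collapse_digit_runs(text):
    """Replace each run of digits with its length, capped at 3."""
    return re.sub(r"\d+", lambda m: str(min(len(m.group(0)), 3)), text)

print(collapse_digit_runs("help3me34"))
print(collapse_digit_runs("help3490897test73me23435please5"))
```

Because \d+ is greedy, each maximal run of digits is matched once, so the three cases do not need to be enumerated separately.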

How to separate a line of input into multiple variables?

I have a file that contains rows and columns of information like:
104857 Big Screen TV 567.95
573823 Blender 45.25
I need to parse this information into three separate items, a string containing the identification number on the left, a string containing the item name, and a double variable containing the price. The information is always found in the same columns, i.e. in the same order.
I am having trouble accomplishing this. Even when not reading from the file and just using a sample string, my attempt just outputs a jumbled mess:
string input = "104857 Big Screen TV 567.95";
string tempone = "";
string temptwo = input.substr(0,1);
tempone += temptwo;
for(int i=1 ; temptwo != " " && i < input.length() ; i++)
{
temptwo = input.substr(i,i);
tempone += temptwo;
}
cout << tempone;
I've tried tweaking the above code for quite some time, but no luck, and I can't think of any other way to do it at the moment.
You can find the first space and the last space using std::string::find_first_of and std::string::find_last_of. Use these positions to split the string into three parts: the first space comes after the first field, the last space comes before the third field, and everything in between is the second field.
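The first-space/last-space idea can be sketched like this. It is written in Python for brevity; str.find and str.rfind play the role of the C++ member functions, and parse_line is a hypothetical name. It assumes the id and the price contain no spaces.

```python
def parse_line(line):
    """Split 'id name with spaces price' into (id, name, price)."""
    line = line.strip()
    first = line.find(" ")    # space after the id
    last = line.rfind(" ")    # space before the price
    if first == -1 or first == last:
        raise ValueError("expected at least three space-separated fields")
    item_id = line[:first]
    name = line[first + 1:last].strip()
    price = float(line[last + 1:])
    return item_id, name, price

print(parse_line("104857 Big Screen TV 567.95"))
print(parse_line("573823 Blender 45.25"))
```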
How about following pseudocode:
string input = "104857 Big Screen TV 567.95";
string[] parsed_output = input.split(" "); // split input string with 'space' as delimiter
// parsed_output[0] = 104857
// parsed_output[1] = Big
// parsed_output[2] = Screen
// parsed_output[3] = TV
// parsed_output[4] = 567.95
int id = stringToInt(parsed_output[0]);
string product = concat(parsed_output[1], parsed_output[2], ... ,parsed_output[length-2]);
double price = stringToDouble(parsed_output[length-1]);
I hope, that's clear.
Well, try breaking down the file's components:
You know a number always comes first, and a number has no whitespace.
The string following the number CAN have whitespace, but won't contain any digits (I would assume).
After this title, you're going to have another number (with no whitespace).
From these components, you can deduce:
Grabbing the first number is as simple as reading it in with the file stream's >> operator.
Getting the string requires you to read one character at a time, appending each to a string, until you reach a digit. The last number is read just like the first, using the file stream's >> operator.
This seems like homework, so I'll let you put the rest together.
I would try a regular expression, something along these lines:
^([0-9]+)\s+(.+)\s+([0-9]+\.[0-9]+)$
I am not very good at regex syntax, but ([0-9]+) corresponds to a sequence of digits (this is the id), ([0-9]+\.[0-9]+) is the floating point number (price) and (.+) is the string that is separated from the two numbers by sequences of "space" characters: \s+.
The next step would be to check if you need this to work with prices like ".50" or "10".
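A sketch of that regular expression, widened so the price may also look like ".50" or "10", could be written as below. It uses Python's re module for illustration; the widened price pattern and the name parse_with_regex are assumptions, not part of the answer above.

```python
import re

# id, then the name (lazy, may contain spaces), then a price such as 567.95, 10 or .50
LINE = re.compile(r"^(\d+)\s+(.+?)\s+(\d+(?:\.\d*)?|\.\d+)$")

def parse_with_regex(line):
    """Return (id, name, price) or None if the line does not match."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    item_id, name, price = match.groups()
    return item_id, name, float(price)

print(parse_with_regex("104857 Big Screen TV 567.95"))
print(parse_with_regex("573823 Blender .50"))
```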

awk split by sequence of whitespaces

Can split(string, array, separator) in awk use sequence of whitespaces as the separator (or more generally any regexp as the separator)?
Obviously, one could use the internal autosplit (which runs on each line of the input with the value of the FS variable as the separator) and do the trick with a simple for loop and some $0 magic. However, I was just wondering if there's a more straightforward way using split itself.
The GNU Awk User's Guide states:
split(string, array, fieldsep)
This divides string into pieces separated by fieldsep, and stores the
pieces in array. The first piece is stored in array[1], the second
piece in array[2], and so forth. The string value of the third
argument, fieldsep, is a regexp describing where to split string (much
as FS can be a regexp describing where to split input records). If
the fieldsep is omitted, the value of FS is used. split returns the
number of elements created. The split function, then, splits strings
into pieces in a manner similar to the way input lines are split into
fields
Here is a short (somewhat silly) example that uses a simple regular expression ".s " that will match any single character followed by a lower-case s and a space. The result of the split is put into array a. Note that the parts that match are not placed into the array.
BEGIN {
s = "this isn't a string yes isodore?"
count = split(s, a, ".s ")
printf("number of splits: %d\n", count)
print "Contents of array:"
for (i = 1; i <= count; i++)
printf "a[%d]: %s\n", i, a[i]
}
The output:
$ awk -f so.awk
number of splits: 3
Contents of array:
a[1]: th
a[2]: isn't a string y
a[3]: isodore?
The article Advanced Awk for Sysadmins shows an example of parsing a line using split(). This page contains an example of using a regular expression to split data into an array.
From the GNU awk(1) manual page:
split(s, a [, r])
Splits the string s into the array a on the regular expression r, and returns the number of fields. If r is omitted, FS is used instead.
The point here is that you can use any regular expression to perform field splitting--at least you can with gawk. If you're using something else, you'll need to check your documentation.