Multilevel list natural sort with regexp


Here is an example of data:
'1.' 'Numeric types'
'1.1.' 'Integer'
'1.2.' 'Float'
...
'1.10' 'Double'
To sort it naturally we could use string_to_array with '.' as the separator, cast the text[] to int[] and sort by the integer array. However, the field itself is of type text, and a user might enter non-numeric symbols, e.g. 1.1.3a, which would cause a cast error.
To address that I decided to use regexp:
select regexp_matches('1.2.3.4.', E'(?:(\\d+)\.?)+')
The expected result is the array {'1', '2', '3', '4'}, but instead I get only the last element of that array. However, if I use the following regexp:
select regexp_matches('1.2.3.4.', E'((?:\\d+)\.?)+')
The result is {'1.2.3.4.'}.
Using the global flag 'g' is not an option, because regexp_matches then returns a column.
Is there any way to convert '1.2.3.4a.'::text to {1, 2, 3, 4}::int[] using only one regexp_matches?
Fiddle.

You can use the global 'g' flag with regexp_matches, but you need to aggregate the values into an array (most simply with the array() constructor):
select array(select m[1] from regexp_matches(dt_code, '(\d+)', 'g') m)::int[] nums, *
from data_types
order by 1;
Or you can split your string into an array with string_to_array(), but you still need a regexp to remove any non-numeric characters (the 'g' flag makes regexp_replace remove every such run, not just the first one):
select string_to_array(trim(regexp_replace(dt_code, '[^\d\.]+', '', 'g'), '.'), '.')::int[] nums, *
from data_types
order by 1;
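The behaviour described in the question, and the "collect every digit run" idea from the queries above, can be illustrated outside PostgreSQL. The following is a minimal Python sketch, not PostgreSQL code: Python's re module is a different regex engine, and the names numeric_key and codes are made up for this illustration. It shows that a repeated capture group keeps only its last iteration, while a global search returns every run of digits.

```python
import re

# A repeated capture group only remembers its LAST iteration,
# which is what the question observed with regexp_matches.
last_only = re.match(r"(?:(\d+)\.?)+", "1.2.3.4.").groups()
print(last_only)  # ('4',)


def numeric_key(code):
    """Return every run of digits in `code` as a list of ints (hypothetical helper)."""
    return [int(digits) for digits in re.findall(r"\d+", code)]


# Sample codes modelled on the question's data; '1.1.3a' has a stray letter.
codes = ["1.10", "1.2.", "1.", "1.1.3a", "1.1."]
sorted_codes = sorted(codes, key=numeric_key)
print(sorted_codes)
```

Lists of integers compare element by element, which is the same ordering the int[] cast relies on in the SQL version.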
For more advanced natural-like sorting, you need to split your text into tokens yourself. See the related SO question for more info.
I came up with a simplified, reusable function:
create or replace function natural_order_tokens(text)
returns table (
txt text,
num int,
num_rep text
)
language sql
strict
immutable
as $func$
select m[1], (case m[2] when '' then '0' else m[2] end)::int, m[2]
from regexp_matches($1, '(\D*)(\d*)', 'g') m
where m[1] != '' or m[2] != ''
$func$;
With this function, natural sorting becomes this easy:
select *
from data_types
order by array(select t from natural_order_tokens(dt_code) t);
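To make the tokenising idea easier to follow, here is a rough Python sketch of the same approach. It is an illustration under assumptions, not a port of the SQL function: it reuses the function name and the (\D*)(\d*) pattern from above, but the sample data is invented and Python's tuple comparison stands in for PostgreSQL's row comparison.

```python
import re


def natural_order_tokens(text):
    """Split `text` into (non-digit text, number, number as written) tokens."""
    tokens = []
    for txt, num in re.findall(r"(\D*)(\d*)", text):
        # Skip the empty match the pattern produces at the end of the string.
        if txt or num:
            tokens.append((txt, int(num) if num else 0, num))
    return tokens


# Invented sample data in the style of the question.
codes = ["1.10", "1.2.", "1.1.", "1."]
sorted_codes = sorted(codes, key=natural_order_tokens)
print(sorted_codes)
```

Each token compares first by its text part, then by its numeric value, so '1.2.' sorts before '1.10' even though it would not as plain text.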
SQLFiddle

Related

Removing Measurement Units from Cell Array

I am trying to remove the units out of a column of cell array data i.e.:
cArray =
time temp
2022-05-10 20:19:43 '167 °F'
2022-05-10 20:19:53 '173 °F'
2022-05-10 20:20:03 '177 °F'
...
2022-06-09 20:18:10 '161 °F'
I have tried str2double but get all NaN.
I have found some info on regexp but don't follow exactly as the example is not the same.
Can anyone help me get the temp column to only read the value i.e.:
cArray =
time temp
2022-05-10 20:19:43 167
2022-05-10 20:19:53 173
2022-05-10 20:20:03 177
...
2022-06-09 20:18:10 161

For some cell array of data
cArray = { ...
1, '123 °F'
2, '234 °F'
3, '345 °F'
};
The easiest option is if we can safely assume the temperature data always starts with numeric values, and you want all of the numeric values. Then we can use regex to match only numbers
temps = regexp( cArray(:,2), '\d+', 'match', 'once' );
The match option causes regexp to return the matching string rather than the index of the match, and once means "stop at the first match" so that we ignore everything after the first non-numeric character.
The pattern '\d+' means "one or more digits". You could expand it to match numbers with a decimal part using '\d+(\.\d+)?' instead if that's a requirement.
Then if you want to actually output numbers, you should use str2double. You could do this in a loop, or use cellfun which is a compact way of achieving the same thing.
temps = cellfun( @str2double, temps, 'uni', 0 ); % 'uni'=0 to retain cell array
Finally you can overwrite the column in cArray
cArray(:,2) = temps;
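For readers who want to see the "first numeric match, then convert" logic in isolation, here is a small Python sketch of the same two steps. It is only an analogy to the MATLAB calls above: the helper name first_number and the sample strings (including the decimal one) are assumptions made for the example.

```python
import re


def first_number(text):
    """Return the first number in `text` as a float, or NaN if there is none."""
    # \d+(\.\d+)? is the decimal-aware pattern mentioned above;
    # re.search stops at the first match, like the 'once' option.
    match = re.search(r"\d+(\.\d+)?", text)
    return float(match.group(0)) if match else float("nan")


# Hypothetical readings; the last one exercises the optional decimal part.
readings = ["167 °F", "173 °F", "98.6 °F"]
temps = [first_number(reading) for reading in readings]
print(temps)
```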

T-SQL RegExp to find sequential repeated characters

I'm looking for a regular expression (RegExp) to find consecutive duplicated characters anywhere in a word in SQL Server. For example:
"AAUGUST" matches (AA)
"ANDREA" doesn't match (there are two "A" vowels, but they are separated)
"ELEEPHANT" matches (EE)
I was trying with:
SELECT field1
FROM exampleTable
WHERE field1 like '%([A-Z]){2}%'
But it doesn't work.
I'd appreciate your help.
Thanks!

You can't do what you're asking with T-SQL's LIKE.
Your best bet is to look at using the Common Language Runtime (CLR), but it can also be achieved (albeit painfully slowly) using, for example, a scalar value function as follows:
create function dbo.ContainsRepeatingAlphaChars(@str nvarchar(max)) returns bit
as begin
    declare @p int, -- the position we're looking at
            @c char(1) -- the previous char
    if @str is null or len(@str) < 2 return 0;
    select @c = substring(@str, 1, 1), @p = 1;
    while (1=1) begin
        set @p = @p + 1; -- move position pointer ahead
        if @p > len(@str) return 0; -- if we're at the end of the string and haven't already exited, we haven't found a match
        if @c like '[A-Z]' and @c = substring(@str, @p, 1) return 1; -- if last char is A-Z and matches the current char then return "found!"
        set @c = substring(@str, @p, 1); -- Get next char
    end
    return 0; -- this will never be hit but stops SQL Server complaining that not all paths return a value
end
GO
-- Example usage:
SELECT field1
FROM exampleTable
WHERE dbo.ContainsRepeatingAlphaChars(field1) = 1
Did I mention that it would be slow? Don't use this on a large table. Go CLR.
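For comparison, in a regex engine that supports backreferences (which LIKE does not), the check reduces to a single pattern. The sketch below uses Python only to show that pattern; it is not T-SQL and says nothing about how a CLR function would be wired up. The name has_repeated_letter is made up, and the pattern is case-sensitive, whereas the LIKE '[A-Z]' test above depends on the column's collation.

```python
import re

# ([A-Z]) captures one uppercase letter; \1 requires the same letter again
# immediately afterwards, i.e. two identical letters in a row.
REPEATED_LETTER = re.compile(r"([A-Z])\1")


def has_repeated_letter(word):
    """Return True if `word` contains the same A-Z letter twice in a row."""
    return REPEATED_LETTER.search(word) is not None


words = ["AAUGUST", "ANDREA", "ELEEPHANT"]
matches = [word for word in words if has_repeated_letter(word)]
print(matches)
```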

How to change multiple rows in a column from unicode to timestamp in python

I am a beginner learning Python. I would like to convert column values from unicode time ('1383260400000') to a timestamp (1970-01-01 00:00:01). I have read up and tried the following, but it's giving me an error.
ti=datetime.datetime.utcfromtimestamp(int(arr[1]).strftime('%Y-%m-%d %H:%M:%S');
It says invalid syntax. I read up and tried a few other things, but I cannot get it right. Any suggestions?
And another one: in the same file I have some empty cells that I would like to replace with 0. I tried this too, and it's giving me invalid syntax:
smsin=arr[3];
if arr[3]='' :
smsin='0';
Please help. Thank you a lot.

You seem to have forgotten a closing bracket after int(arr[1]).
import datetime
arr = ['23423423', '1163838603', '1263838603', '1463838603']
ti = datetime.datetime.utcfromtimestamp(int(arr[1])).strftime('%Y-%m-%d %H:%M:%S')
print(ti)
# => 2006-11-18 08:30:03
To replace empty strings with '0's in your list you could do:
arr = ['123', '456', '', '789', '']
arr = [x if x else '0' for x in arr]
print(arr)
# => ['123', '456', '0', '789', '0']
Note that the latter only works correctly since the empty string '' is the only string with a truth value of False. If you had other data types within arr (e.g. 0, 0L, 0.0, (), [], ...) and only wanted to replace the empty strings you would have to do:
arr = [x if x != '' else '0' for x in arr]
More efficient yet would be to modify arr in place instead of recreating the whole list.
for index, item in enumerate(arr):
    if item == '':  # comparison needs ==, a single = is a syntax error here
        arr[index] = '0'
But if that is not an issue (e.g. your list is not too large) I would prefer the former (more readable) way.
Also you don't need to put ;s at the end of your code lines as Python does not require them to terminate statements. They can be used to delimit statements if you wish to put multiple statements on the same line but that is not the case in your code.
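One more point about the timestamp conversion: the sample value in the question, '1383260400000', has 13 digits, which suggests it may be in milliseconds rather than seconds. That is an assumption about the asker's data, not something the question states. If it holds, the value has to be divided by 1000 first. A minimal sketch for Python 3, with a made-up helper name:

```python
from datetime import datetime, timezone


def ms_to_utc_string(ms_text):
    """Convert a millisecond Unix timestamp given as text to a UTC string."""
    seconds = int(ms_text) // 1000  # drop the millisecond part
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')


converted = ms_to_utc_string('1383260400000')
print(converted)
# => 2013-10-31 23:00:00
```

Passing the raw 13-digit value as seconds would instead point tens of thousands of years into the future, which datetime cannot represent.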

Dynamic regexprep in MATLAB

I have the following strings in a long string:
a=b=c=d;
a=b;
a=b=c=d=e=f;
I want to first search for above mentioned pattern (X=Y=...=Z) and then output like the following for each of the above mentioned strings:
a=d;
b=d;
c=d;
a=b;
a=f;
b=f;
c=f;
d=f;
e=f;
In general, I want all the variables to have an equal sign with the last variable on the extreme right of the string. Is there a way I can do it using regexprep in MATLAB. I am able to do it for a fixed length string, but for variable length, I have no idea how to achieve this. Any help is appreciated.
My attempt for the case of two equal signs is as follows:
funstr = regexprep(funstr, '([^;])+\s*=\s*+(\w+)+\s*=\s*([^;])+;', '$1 = $3; \n $2 = $3;\n');

Not a regexp, but if you stick to MATLAB you can use the cellfun function to avoid a loop:
str = 'a=b=c=d=e=f;' ; %// input string
list = strsplit(str,'=') ;
strout = cellfun( @(a) [a,'=',list{end}] , list(1:end-1), 'uni', 0).' %// Horchler's simplification of the previous solution below
%// this does the same as above, but is more convoluted
%// strout = cellfun( @(a,b) cat(2,a,'=',b) , list(1:end-1) , repmat(list(end),1,length(list)-1) , 'uni',0 ).'
Will give you:
strout =
'a=f;'
'b=f;'
'c=f;'
'd=f;'
'e=f;'
Note: As Horchler rightly pointed out in a comment, although cellfun lets you write more compact code, it is just a loop in disguise. Moreover, since it runs on cells, it is notoriously slow. You won't see the difference on such simple inputs, but reserve this approach for cases where performance is not a major concern.
Now, if you like regex you must like black magic code. If all your strings are in a cell array from the start, there is a way to (over)abuse the cellfun capabilities to obscure your code and do it all in one line.
Consider:
strlist = {
'a=b=c=d;'
'a=b;'
'a=b=c=d=e=f;'
};
Then you can get all your substrings with:
strout = cellfun( @(s)cellfun(@(a,b)cat(2,a,'=',b),s(1:end-1),repmat(s(end),1,length(s)-1),'uni',0).' , cellfun(@(s) strsplit(s,'=') , strlist , 'uni',0 ) ,'uni',0)
>> strout{:}
ans =
'a=d;'
'b=d;'
'c=d;'
ans =
'a=b;'
ans =
'a=f;'
'b=f;'
'c=f;'
'd=f;'
'e=f;'
This gives you a 3x1 cell array, with one cell for each group of substrings. Each group is a column cell array, so if you want to concatenate them all, stack them vertically: strall = cat(1,strout{:});

I haven't had much experience with MATLAB, but your problem can be solved with a simple string split function.
[parts, m] = strsplit( funstr, {' ', '='}, 'CollapseDelimiters', true )
Now, take the last element of parts, and iterate over parts up to that element:
len = length( parts )
for i = 1:len-1
    disp( [parts{i}, ' = ', parts{len}] ) % parts is a cell array, so index it with {}
end
Here disp prints each line; swap in fprintf if you need different formatting.

There isn't a single regex that will cover all the cases, as explained in this answer:
https://stackoverflow.com/a/5019658/3393095
However, you have a few alternatives to achieve your final result:
1. Get all the values in the line with regexp, pick the last value, then use a for loop over the other values to generate the output. The regex to get the values would be this:
matchStr = regexp(str,'([^=;\s]*)','match')
2. If you want to use regexprep at all costs, write a pattern generator and a replacement-expression generator based on the number of '=' in the input string, and pass their results as the arguments of regexprep.
3. Forget about regex and split the input, then generate the output by looping over the values (similarly to alternative #1).
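The split-and-loop alternative is language independent, so here is a short Python sketch of it for illustration. It is not MATLAB code, and the function name expand_chain is invented. It assumes each statement has the form X=Y=...=Z; with optional spaces around the equal signs.

```python
def expand_chain(statement):
    """Turn 'a=b=c=d;' into ['a=d;', 'b=d;', 'c=d;']."""
    # Drop the trailing semicolon, then split on '=' and trim spaces.
    names = [part.strip() for part in statement.strip().rstrip(";").split("=")]
    if len(names) < 2:
        return []  # nothing to assign
    last = names[-1]
    # Pair every variable except the last with the right-most one.
    return ["{}={};".format(name, last) for name in names[:-1]]


for line in ["a=b=c=d;", "a=b;", "a=b=c=d=e=f;"]:
    print(expand_chain(line))
```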

Perl hash substitution with special characters in keys

My current script will take an expression, ex:
my $expression = '( a || b || c )';
and go through each boolean combination of inputs using sub/replace, like so:
my $keys = join '|', keys %stimhash;
$expression =~ s/($keys)\b/$stimhash{$1}/g;
So for example expression may hold,
( 0 || 1 || 0 )
This works great.
However, I would like to allow the variables (also in %stimhash) to contain a tag, *.
my $expression = '( a* || b* || c* )';
Also, printing the keys of the stimhash returns:
a*|b*|c*
It is not properly substituting/replacing with the extra special character, *.
It gives this warning:
Use of uninitialized value within %stimhash in substitution iterator
I tried using quotemeta() but did not have good results so far.
It will drop the values. An example after the substitution looks like:
( * || * || * )
Any suggestions are appreciated,
John

Problem 1
You use the pattern a* thinking it will match only a*, but a* means "0 or more a". You can use quotemeta to convert text into a regex pattern that matches that text.
Replace
my $keys = join '|', keys %stimhash;
with
my $keys = join '|', map quotemeta, keys %stimhash;
Problem 2
\b
is basically
(?<!\w)(?=\w)|(?<=\w)(?!\w)
But * (like the space) isn't a word character. The solution might be to replace
s/($keys)\b/$stimhash{$1}/g
with
s/($keys)(?![\w*])/$stimhash{$1}/g
though the following makes more sense to me
s/(?<![\w*])($keys)(?![\w*])/$stimhash{$1}/g
Personally, I'd use
s{([\w*]+)}{ $stimhash{$1} // $1 }eg
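The two fixes (escaping the keys and replacing \b with explicit lookarounds) are not specific to Perl. Below is a rough Python sketch of the same idea, for illustration only: re.escape plays the role of quotemeta, the dictionary values are invented, and sorting the keys longest first is an extra precaution added here so that a longer key is tried before a shorter one that is its prefix.

```python
import re

# Hypothetical stimulus values; the keys carry the '*' tag from the question.
stimhash = {"a*": "0", "b*": "1", "c*": "0"}

# Escape each key so '*' is matched literally instead of meaning "0 or more".
keys = "|".join(re.escape(key) for key in sorted(stimhash, key=len, reverse=True))

# \b does not work next to '*', so use lookarounds that treat word
# characters and '*' as part of a name.
pattern = re.compile(r"(?<![\w*])(" + keys + r")(?![\w*])")

expression = "( a* || b* || c* )"
result = pattern.sub(lambda match: stimhash[match.group(1)], expression)
print(result)
```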
