Vim - use regex to lexicographically compare strings (to find earlier/later dates) - regex

I want to write a simple regex, in vim, that will find all strings lexicographically smaller than another string.
Specifically, I want to use this to compare dates formatted as 2014-02-17. These dates are lexicographically sortable, which is why I use them.
My specific use case: I'm trying to run through a script and find all the dates that are earlier than today's today.
I'm also OK with comparing these as numbers, or any other solution.

I don't think there is anyway to do this easily in regex. For matching any date earlier than the current date you can use run the function below (Some of the stuff was stolen from benjifisher)
function! Convert_to_char_class(cur)
if a:cur =~ '[2-9]'
return '[0-' . (a:cur-1) . ']'
endif
return '0'
endfunction
function! Match_number_before(num)
let branches = []
let init = ''
for i in range(len(a:num))
if a:num[i] =~ '[1-9]'
call add(branches, init . Convert_to_char_class(a:num[i]) . repeat('\d', len(a:num) - i - 1))
endif
let init .= a:num[i]
endfor
return '\%(' . join(branches, '\|') .'\)'
endfunction
function! Match_date_before(date)
if a:date !~ '\v\d{4}-\d{2}-\d{2}'
echo "invalid date"
return
endif
let branches =[]
let parts = split(a:date, '-')
call add(branches, Match_number_before(parts[0]) . '-\d\{2}-\d\{2}')
call add(branches, parts[0] . '-' . Match_number_before(parts[1]) . '-\d\{2}')
call add(branches, parts[0] . '-' . parts[1] . '-' .Match_number_before(parts[2]))
return '\%(' . join(branches, '\|') .'\)'
endfunction
To use you the following to search for all matches before 2014-02-24.
/<C-r>=Match_date_before('2014-02-24')
You might be able to wrap it in a function to set the search register if you wanted to.
The generated regex for dates before 2014-02-24 is the following.
\%(\%([0-1]\d\d\d\|200\d\|201[0-3]\)-\d\{2}-\d\{2}\|2014-\%(0[0-1]\)-\d\{2}\|2014-02-\%([0-1]\d\|2[0-3]\)\)
It does not do any validation of dates. It assumes if you are in that format you are a date.
Equivalent set of functions for matching after the passed in date.
function! Convert_to_char_class_after(cur)
if a:cur =~ '[0-7]'
return '[' . (a:cur+1) . '-9]'
endif
return '9'
endfunction
function! Match_number_after(num)
let branches = []
let init = ''
for i in range(len(a:num))
if a:num[i] =~ '[0-8]'
call add(branches, init . Convert_to_char_class_after(a:num[i]) . repeat('\d', len(a:num) - i - 1))
endif
let init .= a:num[i]
endfor
return '\%(' . join(branches, '\|') .'\)'
endfunction
function! Match_date_after(date)
if a:date !~ '\v\d{4}-\d{2}-\d{2}'
echo "invalid date"
return
endif
let branches =[]
let parts = split(a:date, '-')
call add(branches, Match_number_after(parts[0]) . '-\d\{2}-\d\{2}')
call add(branches, parts[0] . '-' . Match_number_after(parts[1]) . '-\d\{2}')
call add(branches, parts[0] . '-' . parts[1] . '-' .Match_number_after(parts[2]))
return '\%(' . join(branches, '\|') .'\)'
endfunction
The regex generated was
\%(\%([3-9]\d\d\d\|2[1-9]\d\d\|20[2-9]\d\|201[5-9]\)-\d\{2}-\d\{2}\|2014-\%([1-9]\d\|0[3-9]\)-\d\{2}\|2014-02-\%([3-9]\d\|2[5-9]\)\)

You do not say how you want to use this; are you sure that you really want a regular expression? Perhaps you could get away with
if DateCmp(date, '2014-02-24') < 0
" ...
endif
In that case, try this function.
" Compare formatted date strings:
" #param String date1, date2
" dates in YYYY-MM-DD format, e.g. '2014-02-24'
" #return Integer
" negative, zero, or positive according to date1 < date2, date1 == date2, or
" date1 > date2
function! DateCmp(date1, date2)
let [year1, month1, day1] = split(a:date1, '-')
let [year2, month2, day2] = split(a:date2, '-')
if year1 != year2
return year1 - year2
elseif month1 != month2
return month1 - month2
else
return day1 - day2
endif
endfun
If you really want a regular expression, then try this:
" Construct a pattern that matches a formatted date string if and only if the
" date is less than the input date. Usage:
" :echo '2014-02-24' =~ DateLessRE('2014-03-12')
function! DateLessRE(date)
let init = ''
let branches = []
for c in split(a:date, '\zs')
if c =~ '[1-9]'
call add(branches, init . '[0-' . (c-1) . ']')
endif
let init .= c
endfor
return '\d\d\d\d-\d\d-\d\d\&\%(' . join(branches, '\|') . '\)'
endfun
Does that count as a "simple" regex? One way to use it would be to type :g/ and then CRTL-R and = and then DateLessRE('2014-02-24') and Enter, followed by the rest of your command. In other words,
:g/<C-R>=DateLessRE('2014-02-24')<CR>/s/foo/bar
EDIT: I added a concat (:help /\&) that matches a complete "formatted date string". Now, there is no need to anchor the pattern.

Use nested subpatterns. It starts simple, with the century:
[01]\d\d\d-\d\d-\d\d|20
As for each digit to follow, use one of the following patterns; you may want to replace .* by an appropriate sequence of \d and -.
for 0: (0
for 1: (0.*|1
for 2: ([01].*|2
for 3: ([0-2].*|3
for 4: ([0-3].*|4
for 5: ([0-4].*|5
for 6: ([0-5].*|6
for 7: ([0-6].*|7
for 8: ([0-7].*|8
for 9: ([0-8].*|9
For the last digit, you only need the digit range, e.g.:
[0-6]
Finally, all parentheses should be closed:
)))))
In the example of 2014-02-17, this becomes:
[01]\d\d\d-\d\d-\d\d|20
(0\d-\d\d-\d\d|1
([0-3]-\d\d-\d\d|4
-
(0
([01]-\d\d|2
-
(0\d|1
[0-6]
)))))
Now in one line:
[01]\d\d\d-\d\d-\d\d|20(0\d-\d\d-\d\d|1([0-3]-\d\d-\d\d|4-(0([01]-\d\d|2-(0\d|1[0-6])))))
For VIM, let's not forget to escape (, ) and |:
[01]\d\d\d-\d\d-\d\d\|20\(0\d-\d\d-\d\d\|1\([0-3]-\d\d-\d\d\|4-\(0\([01]-\d\d\|2-\(0\d\|1[0-6]\)\)\)\)\)
Would be best to try and generate this (much like in FDinoff's answer), rather than write it yourself...
Update:
Here is a sample AWK script to generate the correct regex for any date yyyy-mm-dd.
#!/usr/bin/awk -f
BEGIN { # possible overrides for non-VIM users
switch (digit) {
case "ascii" : digit = "[0-9]"; break;
case "posix" : digit = "[:digit:]"; break;
default : digit = "\\d";
}
switch (metachar) {
case "unescaped" : escape = ""; break;
default : escape = "\\";
}
}
/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ {
print BuildRegex($0);
}
function BuildRegex(s) {
if (s ~ /^[1-9][^1-9]*$/) {
regex = LessThanOnFirstDigit(s);
}
else {
regex = substr(s, 1, 1) BuildRegex(substr(s, 2)); # recursive call
if (s ~ /^[1-9]/) {
regex = escape "(" LessThanOnFirstDigit(s) escape "|" regex escape ")";
}
}
return regex;
}
function LessThanOnFirstDigit(s) {
first = substr(s, 1, 1) - 1;
rest = substr(s, 2);
gsub(/[0-9]/, digit, rest);
return (first ? "[0-" first "]" : "0") rest;
}
Call it like this:
echo 2014-02-17 | awk -f genregex.awk
Of course, you can write such a simple generator in any language you like.
Would be nice to do it in Vimscript, but I have no experience with that, so I will leave that as a home assignment.

If you wanted to search for all dates that were less than 2014-11-23, inclusive, you would use the following regex.
2014-(?:[1-9]|1[0-1])-(?:[1-9]|1[0-9]|2[0-3])
for a better explanation of the regex visit regex101.com and paste the regex in. You can also test it by using that site.
The basics of the regex are to search all dates that:
start with 2014-
either contain a single character from 1 - 9
or a 1 and a single character from 0 - 1, i.e. numbers from 1 - 11
finished by - and numbers from 1 - 23 done in the same style as the second term

Related

Split string by . but ignore whats inside []

I have a string:
s='articles[zone.id=1].comments[user.status=active].user'
Looking to split (via split(some_regex_here)). The split needs to occur on every period other than those inside the bracketed substring.
Expected output:
["articles[zone.id=1]", "comments[user.status=active]", "user"]
How would I go about this? Or is there something else besides split(), I should be looking at?
Try this,
s.split(/\.(?![^\[]*\])/)
I got this result,
2.3.2 :061 > s.split(/\.(?![^\[]*\])/)
=> ["articles[zone.id=1]", "comments[user.status=active]", "user"]
You can also test it here:
https://rubular.com/r/LaxEFQZJ0ygA3j
I assume the problem is to split on periods that are not within matching brackets.
Here is a non-regex solution that works with any number of nested brackets. I've assumed the brackets are all matched, but it would not be difficult to check that.
def split_it(s)
left_brackets = 0
s.each_char.with_object(['']) do |c,a|
if c == '.' && left_brackets.zero?
a << '' unless a.last.empty?
else
case c
when ']' then left_brackets -= 1
when '[' then left_brackets += 1
end
a.last << c
end
end.tap { |a| a.pop if a.last.empty? }
end
split_it '.articles[zone.id=[user.loc=1]].comments[user.status=active].user'
#=> ["articles[zone.id=[user.loc=1]]", "comments[user.status=active]", "user"]

python replace line text with weired characters

How do I replace the following using python
GSA*HC*11177*NYSfH-EfC*23130303*0313*1*R*033330103298
STEM*333*3001*0030303238
BHAT*3319*33*33377*23330706*031829*RTRCP
NUM4*41*2*My Break Room Place*****6*1133337
I want to replace the all character after first occurence of '*' . All characters must be replace except '*'
Example input:
NUM4*41*2*My Break Room Place*****6*1133337
example output:
NUM4*11*1*11 11111 1111 11111*****1*1111111
Fairly simple, use a callback to return group 1 (if matched) unaltered, otherwise
return replacement 1
Note - this also would work in multi-line strings.
If you need that, just add (?m) to the beginning of the regex. (?m)(?:(^[^*]*\*)|[^*\s])
You'd probably want to test the string for the * character first.
( ^ [^*]* \* ) # (1), BOS/BOL up to first *
| # or,
[^*\s] # Not a * nor whitespace
Python
import re
def repl(m):
if ( m.group(1) ) : return m.group(1)
return "1"
str = 'NUM4*41*2*My Break Room Place*****6*1133337'
if ( str.find('*') ) :
newstr = re.sub(r'(^[^*]*\*)|[^*\s]', repl, str)
print newstr
else :
print '* not found in string'
Output
NUM4*11*1*11 11111 1111 11111*****1*1111111
If you want to use regex, you can use this one: (?<=\*)[^\*]+ with re.sub
inputs = ['GSA*HC*11177*NYSfH-EfC*23130303*0313*1*R*033330103298',
'STEM*333*3001*0030303238',
'BHAT*3319*33*33377*23330706*031829*RTRCP',
'NUM4*41*2*My Break Room Place*****6*1133337']
outputs = [re.sub(r'(?<=\*)[^\*]+', '1', inputline) for inputline in inputs]
Regex explication here

Need to know logic behind few regular expression

I am having a code as
$wrk = OC192-1-1-1;
#temp = split (/-/, $wrk);
if ($temp1[3] =~ /101 || 102 /)
{
print "yes";
} else {
print "no";
}
Output :
yes
Need to know why this is printing yes. I know for regular expression | is supported for OR operator. But need to know why || is giving as "yes" as output
It is because || will make regex match succeed by matching with nothing all the time.
So it is essentially matching $temp1[3] (which doesn't exist) with anyone of the following
"101 "
""
" 102 "
I added double quotes just for explanation.
/101 || 102 / regex tries to match '101 ', or '' (empty string), or ' 102 '.
Since empty string can always be matched, it always returns true in your condition.
In addition to the regex-relevant answer from #anubhava, note that: OC192-1-1-1 is same as 0-1-1-1, which is just "-3", therefore #temp evaluates to ( "", "3" )
And of course there's no such thing as $temp1

Vim regular expression for range of numbers - Not working

I have the following regular expression for range 288-303 but it is not working in GVim.
The regexp is :/28[89]|29[0-9]|30[0-3]/.
Could anyone please point out the reason. I referred Stack Overflow and got the regexp from http://utilitymill.com/utility/Regex_For_Range/42.
You have to escape the pipe in Vim:
:/28[89]\|29[0-9]\|30[0-3]/
Edit:
Per #Tim's comment, you can optionally prefix the pattern with \v instead of escaping the individual pipe characters:
:/\v28[89]|29[0-9]|30[0-3]/
Thanks #Tim.
Based on Jim answer I made a little script to search for integers in a given range. You use the command like :
:Range 341 752
This will match each sequence of digit between the two numbers 341 and 752.
Using a search like
/\%(3\%(\%(4\%([1-9]\)\)\|\%([5-9]\d\{1}\)\|\%(9\%([0-9]\)\)\)\)\|\%([4-7]\d\{2}\)\|\%(7\%(\%(0\%([0-9]\)\)\|\%([1-5]\d\{1}\)\|\%(5\%([0-2]\)\)\)\)
Just add that to your vimrc
function! RangeMatch(min,max)
let l:res = RangeSearchRec(a:min,a:max)
execute "/" . l:res
let #/=l:res
endfunction
"TODO if both number don't have same number of digit
function! RangeSearchRec(min,max) " suppose number with the same number of digit
if len(a:max) == 1
return '[' . a:min . '-' . a:max . ']'
endif
if a:min[0] < a:max[0]
" on cherche de a:min jusqu'à 99999 x times puis de (a:min[0]+1)*10^x à a:max[0]*10^x
let l:zeros=repeat('0',len(a:max)-1) " string (a:min[0]+1 +) 000000
let l:res = '\%(' . a:min[0] . '\%(' . RangeSearchRec( a:min[1:], repeat('9',len(a:max)-1) ) . '\)\)' " 657 à 699
if a:min[0] +1 < a:max[0]
let l:res.= '\|' . '\%('
let l:res.= '[' . (a:min[0]+1) . '-' . a:max[0] . ']'
let l:res.= '\d\{' . (len(a:max)-1) .'}' . '\)' "700 a 900
endif
let l:res.= '\|' . '\%(' . a:max[0] . '\%(' . RangeSearchRec( repeat('0',len(a:max)-1) , a:max[1:] ) . '\)\)' " 900 a 957
return l:res
else
return '\%(' . a:min[0] . RangeSearchRec(a:min[1:],a:max[1:]) . '\)'
endif
endfunction
command! -nargs=* Range call RangeMatch(<f-args>)
Note that the \%(\) matching parenthesis instead of \(\) avoid a ERROR E872: (NFA regexp) Too many '('
The script look between 341-399 or 400-699 or 700-752

Matching math expression with regular expression?

For example, these are valid math expressions:
a * b + c
-a * (b / 1.50)
(apple + (-0.5)) * (boy - 1)
And these are invalid math expressions:
--a *+ b # 1.5.0 // two consecutive signs, two consecutive operators, invalid operator, invalid number
-a * b + 1) // unmatched parentheses
a) * (b + c) / (d // unmatched parentheses
I have no problem with matching float numbers, but have difficulty with parentheses matching. Any idea? If there is better solution than regular expression, I'll accept as well. But regex is preferred.
========
Edit:
I want to make some comments on my choice of the “accepted answer”, hoping that people who have the same question and find this thread will not be misled.
There are several answers I consider “accepted”, but I have no idea which one is the best. So I chose the accepted answer (almost) randomly. I recommend reading Guillaume Malartre’s answer as well besides the accepted answer. All of them give practical solutions to my question. For a somewhat rigorous/theoretical answer, please read David Thornley’s comments under the accepted answer. As he mentioned, Perl’s extension to regular expression (originated from regular language) make it “irregular”. (I mentioned no language in my question, so most answerers assumed the Perl implementation of regular expression – probably the most popular implementation. So did I when I posted my question.)
Please correct me if I said something wrong above.
Use a pushdown automaton for matching paranthesis http://en.wikipedia.org/wiki/Pushdown_automaton (or just a stack ;-) )
Details for the stack solution:
while (chr available)
if chr == '(' then
push '('
else
if chr == ')' then
if stack.elements == 0 then
print('too many or misplaced )')
exit
else
pop //from stack
end while
if (stack.elements != 0)
print('too many or misplaced(')
Even simple: just keep a counter instead of stack.
Regular expressions can only be used to recognize regular languages. The language of mathematical expressions is not regular; you'll need to implement an actual parser (e.g. LR) in order to do this.
Matching parens with a regex is quite possible.
Here is a Perl script that will parse arbitrary deep matching parens. While it will throw out the non-matching parens outside, I did not design it specifically to validate parens. It will parse arbitrarily deep parens so long as they are balanced. This will get you started however.
The key is recursion both in the regex and the use of it. Play with it, and I am sure that you can get this to also flag non matching prens. I think if you capture what this regex throws away and count parens (ie test for odd parens in the non-match text), you have invalid, unbalanced parens.
#!/usr/bin/perl
$re = qr /
( # start capture buffer 1
\( # match an opening paren
( # capture buffer 2
(?: # match one of:
(?> # don't backtrack over the inside of this group
[^()]+ # one or more
) # end non backtracking group
| # ... or ...
(?1) # recurse to opening 1 and try it again
)* # 0 or more times.
) # end of buffer 2
\) # match a closing paren
) # end capture buffer one
/x;
sub strip {
my ($str) = #_;
while ($str=~/$re/g) {
$match=$1; $striped=$2;
print "$match\n";
strip($striped) if $striped=~/\(/;
return $striped;
}
}
while(<DATA>) {
print "start pattern: $_";
while (/$re/g) {
strip($1) ;
}
}
__DATA__
"(apple + (-0.5)) * (boy - 1)"
"((((one)two)three)four)x(one(two(three(four))))"
"a) * (b + c) / (d"
"-a * (b / 1.50)"
Output:
start pattern: "(apple + (-0.5)) * (boy - 1)"
(apple + (-0.5))
(-0.5)
(boy - 1)
start pattern: "((((one)two)three)four)x(one(two(three(four))))"
((((one)two)three)four)
(((one)two)three)
((one)two)
(one)
(one(two(three(four))))
(two(three(four)))
(three(four))
(four)
start pattern: "a) * (b + c) / (d"
(b + c)
start pattern: "-a * (b / 1.50)"
(b / 1.50)
I believe you will be better off implementing a real parser to accomplish what you're after.
A parser for simple mathematical expressions is "Parsing 101", and there are several examples to be found online.
Some examples include:
ANTLR: Expression Evaluator Sample (ANTLR grammars can target several languages)
pyparsing: http://pyparsing.wikispaces.com/file/view/fourFn.py (pyparsing is a Python library)
Lex & Yacc: http://epaperpress.com/lexandyacc/ (contains a PDF tutorial and sample code for a calculator)
Note that the grammar you will need for validating expressions is simpler than the examples above, since the examples also implement evaluation of the expression.
You can't use regex to do things like balance parenthesis.
This is tricky with one single regular expression, but quite easy using mixed regexp/procedural approach. The idea is to construct a regexp for the simple expression (without parenthesis) and then repeatedly replace ( simple-expression ) with some atomic string (e.g. identifier). If the final reduced expression matches the same `simple' pattern, the original expression is considered valid.
Illustration (in php).
function check_syntax($str) {
// define the grammar
$number = "\d+(\.\d+)?";
$ident = "[a-z]\w*";
$atom = "[+-]?($number|$ident)";
$op = "[+*/-]";
$sexpr = "$atom($op$atom)*"; // simple expression
// step1. remove whitespace
$str = preg_replace('~\s+~', '', $str);
// step2. repeatedly replace parenthetic expressions with 'x'
$par = "~\($sexpr\)~";
while(preg_match($par, $str))
$str = preg_replace($par, 'x', $str);
// step3. no more parens, the string must be simple expression
return preg_match("~^$sexpr$~", $str);
}
$tests = array(
"a * b + c",
"-a * (b / 1.50)",
"(apple + (-0.5)) * (boy - 1)",
"--a *+ b # 1.5.0",
"-a * b + 1)",
"a) * (b + c) / (d",
);
foreach($tests as $t)
echo $t, "=", check_syntax($t) ? "ok" : "nope", "\n";
The above only validates the syntax, but the same technique can be also used to construct a real parser.
For parenthesis matching, and implementing other expression validation rules, it is probably easiest to write your own little parser. Regular expressions are no good in this kind of situation.
Ok here's my version of parenthesis finding in ActionScript3, using this approach give a lot of traction to analyse the part before the parenthesis, inside the parenthesis and after the parenthis, if some parenthesis remains at the end you can raise a warning or refuse to send to a final eval function.
package {
import flash.display.Sprite;
import mx.utils.StringUtil;
public class Stackoverflow_As3RegexpExample extends Sprite
{
private var tokenChain:String = "2+(3-4*(4/6))-9(82+-21)"
//Constructor
public function Stackoverflow_As3RegexpExample() {
// remove the "\" that just escape the following "\" if you want to test outside of flash compiler.
var getGroup:RegExp = new RegExp("((?:[^\\(\\)]+)?) (?:\\() ( (?:[^\\(\\)]+)? ) (?:\\)) ((?:[^\\(\\)]+)?)", "ix") //removed g flag
while (true) {
tokenChain = replace(tokenChain,getGroup)
if (tokenChain.search(getGroup) == -1) break;
}
trace("cummulativeEvaluable="+cummulativeEvaluable)
}
private var cummulativeEvaluable:Array = new Array()
protected function analyseGrammar(matchedSubstring:String, capturedMatch1:String, capturedMatch2:String, capturedMatch3:String, index:int, str:String):String {
trace("\nanalyseGrammar str:\t\t\t\t'"+str+"'")
trace("analyseGrammar matchedSubstring:'"+matchedSubstring+"'")
trace("analyseGrammar capturedMatchs:\t'"+capturedMatch1+"' '("+capturedMatch2+")' '"+capturedMatch3+"'")
trace("analyseGrammar index:\t\t\t'"+index+"'")
var blank:String = buildBlank(matchedSubstring.length)
cummulativeEvaluable.push(StringUtil.trim(matchedSubstring))
// I could do soo much rigth here!
return str.substr(0,index)+blank+str.substr(index+matchedSubstring.length,str.length-1)
}
private function replace(str:String,regExp:RegExp):String {
var result:Object = regExp.exec(str)
if (result)
return analyseGrammar.apply(null,objectToArray(result))
return str
}
private function objectToArray(value:Object):Array {
var array:Array = new Array()
var i:int = 0
while (true) {
if (value.hasOwnProperty(i.toString())) {
array.push(value[i])
} else {
break;
}
i++
}
array.push(value.index)
array.push(value.input)
return array
}
protected function buildBlank(length:uint):String {
var blank:String = ""
while (blank.length != length)
blank = blank+" "
return blank
}
}
}
It should trace this:
analyseGrammar str: '2+(3-4*(4/6))-9(82+-21)'
analyseGrammar matchedSubstring:'3-4*(4/6)'
analyseGrammar capturedMatchs: '3-4*' '(4/6)' ''
analyseGrammar index: '3'
analyseGrammar str: '2+( )-9(82+-21)'
analyseGrammar matchedSubstring:'2+( )-9'
analyseGrammar capturedMatchs: '2+' '( )' '-9'
analyseGrammar index: '0'
analyseGrammar str: ' (82+-21)'
analyseGrammar matchedSubstring:' (82+-21)'
analyseGrammar capturedMatchs: ' ' '(82+-21)' ''
analyseGrammar index: '0'
cummulativeEvaluable=3-4*(4/6),2+( )-9,(82+-21)