I want to generate LaTeX table code from Unix cal output. For example, it should look like:
Mo & Tu & We & Th & Fr \\
& & 1 & 2 & 3 \\
6 & 7 & 8 & 9 & 10 \\
13 & 14 & 15 & 16 & 17 \\
20 & 21 & 22 & 23 & 24 \\
27 & 28 & & & \\
I've come up with the following solution:
cal | sed -e '1d; /^$/d; s/^\(...\)\?\(...\)\?\(...\)\?\(...\)\?\(...\)\?\(...\)\?.*/\2 \& \3 \& \4 \& \5 \& \6 \\\\/'
Works like a charm! But I'm not sure if the result is defined. Wouldn't it be correct behaviour, e.g. for the first group to match the empty string, and for the second group to match the first three chars of any line (instead of chars 4-6)? And if not, would there be some switch to make a variation of it a correct behaviour (so I can know how to avoid it / control the behaviour)?
Well if you can use awk:
cal | awk 'BEGIN { OFS = " & " }
NR == 1 || $0 ~ "^$" { next }
NR == 2 { for (i=1;i<NF;i++) { printf("%-2s%s",$i,OFS) }
printf("%s %s\n",$NF," \\\\")
next
}
{ for (i=1;i<NF;i++) { printf("% 2i%s",$i,OFS) }
printf("% 2i%s\n",$NF," \\\\")
}'
will do something very similar without too much regex...
Anyway, from my point of view you don't need those \? as the captured groups must always be present.
My regex is fulfilling the spec. That is because the expression tree is expanded greedily from the left, so if there is a possible match which includes the first subexpression, then it will take this one.
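As a cross-check that does not depend on how optional groups are matched, the same table can be built by slicing fixed-width columns. This is only a sketch in Python, not the sed or awk approach from the thread: `cal_to_latex` is a hypothetical helper, and it assumes the usual `cal` layout of seven 3-character columns starting with Sunday. The sample month is generated with Python's `calendar` module so the example is self-contained.

```python
import calendar

def cal_to_latex(cal_text):
    """Turn `cal`-style text into LaTeX rows for the Mo..Fr columns."""
    rows = []
    for line in cal_text.splitlines()[1:]:       # drop the "Month Year" title line
        if not line.strip():                     # drop empty lines
            continue
        padded = line.ljust(21)                  # 7 columns, 3 characters each
        # Columns 1..5 (Mo..Fr) start at offsets 3, 6, 9, 12, 15.
        cells = [padded[i:i + 3].strip() for i in range(3, 18, 3)]
        rows.append(" & ".join(cells) + r" \\")
    return rows

# Assumption: a Sunday-first layout, as printed by a typical `cal`.
sample = calendar.TextCalendar(firstweekday=calendar.SUNDAY).formatmonth(2023, 2)
latex_rows = cal_to_latex(sample)
print("\n".join(latex_rows))
```

Padding each line to the full width first means short lines (the last week) and lines with leading blanks (the first week) are handled the same way.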
I'm trying to write a Ruby method that will return true only if the input is a valid phone number, which means, among other rules, it can have spaces and/or dashes between the digits, but not before or after the digits.
In a sense, I need a method that does the opposite of String#strip! (remove all spaces except leading and trailing spaces), plus the same for dashes.
I've tried using String#gsub!, but when I try to match a space or a dash between digits, then it replaces the digits as well as the space/dash.
Here's an example of the code I'm using to remove spaces. I figure once I know how to do that, it will be the same story with the dashes.
def valid_phone_number?(number)
phone_number_pattern = /^0[^0]\d{8}$/
# remove spaces
number.gsub!(/\d\s+\d/, "")
return number.match?(phone_number_pattern)
end
What happens is if I call the method with the following input:
valid_phone_number?(" 09 777 55 888 ")
I get false because line 5 transforms the number into " 0788 ", i.e. it gets rid of the digits around the spaces as well as the spaces. What I want it to do is just to get rid of the inner spaces, so as to produce " 0977755888 ".
I've tried
number.gsub!(/\d(\s+)\d/, "") and number.gsub!(/\d(\s+)\d/) { |match| "" } to no avail.
Thank you!!
If you want to return a boolean, you might for example use a pattern that accepts leading and trailing spaces, and matches 10 digits (as in your example data) where there can be optional spaces or hyphens in between.
^ *\d(?:[ -]?\d){9} *$
For example
def valid_phone_number?(number)
phone_number_pattern = /^ *\d(?:[ -]*\d){9} *$/
return number.match?(phone_number_pattern)
end
See a Ruby demo and a regex demo.
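For comparison, here is a minimal Python sketch of the same validation idea; `valid_phone_number` is a hypothetical name. It uses `re.fullmatch`, so the whole string must match. In Ruby, `^` and `$` match at line boundaries, so `\A` and `\z` are the stricter anchors there. Like the pattern above, this sketch only checks for 10 digits with optional inner separators and does not enforce the other rules from the question.

```python
import re

# 10 digits, optional spaces/hyphens between them, optional outer spaces.
PHONE = re.compile(r" *\d(?:[ -]*\d){9} *")

def valid_phone_number(number):
    # fullmatch anchors the pattern at both ends of the whole string.
    return PHONE.fullmatch(number) is not None

print(valid_phone_number(" 09 777 55 888 "))
print(valid_phone_number("-0977755888"))
```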
To remove spaces and hyphens between digits, try:
(?:\d+|\G(?!^)\d+)\K[- ]+(?=\d)
See an online regex demo
(?: - Open non-capture group;
\d+ - Match 1+ digits;
| - Or;
\G(?!^)\d+ - Assert position at end of previous match but (negate start-line) with following 1+ digits;
)\K - Close non-capture group and reset matching point;
[- ]+ - Match 1+ space/hyphen;
(?=\d) - Assert position is followed by digits.
p " 09 777 55 888 ".gsub(/(?:\d+|\G(?!^)\d+)\K[- ]+(?=\d)/, '')
Prints: " 0977755888 "
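The original `gsub!` attempt removed the digits because they were part of the match. A lookbehind and a lookahead keep the digits out of the match, so only the separators are replaced. A small Python sketch of that idea (`strip_inner_separators` is a hypothetical name):

```python
import re

def strip_inner_separators(number):
    # (?<=\d) and (?=\d) only assert that a digit is adjacent; the digits
    # themselves are not consumed, so only the separators are removed.
    return re.sub(r"(?<=\d)[ -]+(?=\d)", "", number)

print(repr(strip_inner_separators(" 09 777 55 888 ")))
```

Leading and trailing spaces are left alone because they do not have a digit on both sides.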
Using a very simple regex (/\d/ tests for a digit):
str = " 09 777 55 888 "
r = str.index(/\d/)..str.rindex(/\d/)
str[r] = str[r].delete(" -")
p str # => " 0977755888 "
Passing a block to gsub is an option, capture groups available as globals:
>> str = " 09 777 55 888 "
# simple, easy to understand
>> str.gsub(/(^\s+)([\d\s-]+?)(\s+$)/){ "#$1#{$2.delete('- ')}#$3" }
=> " 0977755888 "
# a different take on #steenslag's answer, to avoid using range.
>> s = str.dup; s[/^\s+([\d\s-]+?)\s+$/, 1] = s.delete("- "); s
=> " 0977755888 "
Benchmark, not that it matters that much:
require 'benchmark'

n = 1_000_000
puts(Benchmark.bmbm do |x|
# just a match
x.report("match") { n.times {str.match(/^ *\d(?:[ -]*\d){9} *$/) } }
# use regex in []=
x.report("[//]=") { n.times {s = str.dup; s[/^\s+([\d\s-]+?)\s+$/, 1] = s.delete("- "); s } }
# use range in []=
x.report("[..]=") { n.times {s = str.dup; r = s.index(/\d/)..s.rindex(/\d/); s[r] = s[r].delete(" -"); s } }
# block in gsub
x.report("block") { n.times {str.gsub(/(^\s+)([\d\s-]+?)(\s+$)/){ "#$1#{$2.delete('- ')}#$3" }} }
# long regex
x.report("regex") { n.times {str.gsub(/(?:\d+|\G(?!^)\d+)\K[- ]+(?=\d)/, "")} }
end)
Rehearsal -----------------------------------------
match 0.997458 0.000004 0.997462 ( 0.998003)
[//]= 1.822698 0.003983 1.826681 ( 1.827574)
[..]= 3.095630 0.007955 3.103585 ( 3.105489)
block 3.515401 0.003982 3.519383 ( 3.521392)
regex 4.761748 0.007967 4.769715 ( 4.772972)
------------------------------- total: 14.216826sec
user system total real
match 1.031670 0.000000 1.031670 ( 1.032347)
[//]= 1.859028 0.000000 1.859028 ( 1.860013)
[..]= 3.074159 0.003978 3.078137 ( 3.079825)
block 3.751532 0.011982 3.763514 ( 3.765673)
regex 4.634857 0.003972 4.638829 ( 4.641259)
I have a file with contents like this -
Random text
+-------------------+------+-------+-----------+-------+
| Data | A | B | C | D |
+-------------------+------+-------+-----------+-------+
| Data 1 | 1403 | 0 | 2520 | 55.67 |
| Data 2 | 1365 | 2 | 2520 | 54.17 |
| Data 3 | 1 | 3 | 1234 | 43.12 |
Some more random text
I want to extract the value of column D of row Data 1 i.e. I want to extract the value 55.67 from the example above. I am parsing this file line by line using getline -
while(getline(inputFile1,line)) {
if(line.find("| Data 1") != string::npos) {
subString = //extract the desired value
}
}
How can I extract the desired substring from the line? Is there any way to do this using boost::regex?
While regex may have its uses, it's probably overkill for this.
Bring in a trim function and:
char delim;
std::string line, data;
int a, b, c;
double d;
while(std::getline(inputFile1, line)) {
std::istringstream is(line);
if( std::getline(is >> delim, data, '|') >>
a >> delim >> b >> delim >> c >> delim >> d >> delim)
{
trim(data);
if(data == "Data 1") {
std::cout << a << ' ' << b << ' ' << c << ' ' << d << '\n';
}
}
}
Demo
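The split-on-delimiter idea can also be sketched without stream extraction. The Python below is an illustration only, not a translation of the C++ above: `extract_cell` is a hypothetical helper that treats the first `|` row as the header and looks the column up by name.

```python
SAMPLE = """Random text
+-------------------+------+-------+-----------+-------+
| Data | A | B | C | D |
+-------------------+------+-------+-----------+-------+
| Data 1 | 1403 | 0 | 2520 | 55.67 |
| Data 2 | 1365 | 2 | 2520 | 54.17 |
| Data 3 | 1 | 3 | 1234 | 43.12 |
Some more random text"""

def extract_cell(lines, row_label, column):
    """Return the cell (as a string) for row_label/column, or None."""
    header = None
    for line in lines:
        if not line.startswith("|"):      # skip borders and free text
            continue
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if header is None:
            header = cells                # first table row holds column names
        elif cells[0] == row_label:
            return cells[header.index(column)]
    return None

print(extract_cell(SAMPLE.splitlines(), "Data 1", "D"))
```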
Yes, it is easily possible to extract your substring with a regex. There is no need to use boost, you can also use the existing C++ regex library.
We read all lines of the source in a simple for loop and use std::regex_match to match each line against our regex. If a match is found, the result is in group 1 of the std::smatch sm.
Because the regex is designed to find a double value, we get exactly what we need, without any surrounding spaces.
We then convert it to a double and print it. Since the regex only accepts a well-formed number, std::stod will succeed.
The resulting program is rather straightforward:
#include <iostream>
#include <string>
#include <sstream>
#include <regex>
// Please note. For std::getline, it does not matter, if we read from a
// std::istringstream or a std::ifstream. Both are std::istream's. And because
// we do not have files here on SO, we will use an istringstream as data source.
// If you want to read from a file later, simply create an std::ifstream inputFile1
// Source File with all data
std::istringstream inputFile1{ R"(
Random text
+-------------------+------+-------+-----------+-------+
| Data | A | B | C | D |
+-------------------+------+-------+-----------+-------+
| Data 1 | 1403 | 0 | 2520 | 55.67 |
| Data 2 | 1365 | 2 | 2520 | 54.17 |
| Data 3 | 1 | 3 | 1234 | 43.12 |
Some more random text)"
};
// Regex for finding the desired data
const std::regex re(R"(\|\s+Data 1\s+\|.*?\|.*?\|.*?\|\s*([-+]?[0-9]*\.?[0-9]+)\s*\|)");
int main() {
// The result will be in here
std::smatch sm;
// Read all lines of the source file
for (std::string line{}; std::getline(inputFile1, line);) {
// If we found our matching string
if (std::regex_match(line, sm, re)) {
// Then extract the column D info
double data1D = std::stod(sm[1]);
// And show it to the user.
std::cout << data1D << "\n";
}
}
}
For most people the tricky part is defining the regular expression. Pages such as online regex testers and debuggers help here; they also give a breakdown of the regex with an understandable explanation.
For our regex
\|\s+Data 1\s+\|.*?\|.*?\|.*?\|\s*([-+]?[0-9]*\.?[0-9]+)\s*\|
we get the following explanation:
\| - matches the character | literally
\s+ - matches one or more whitespace characters (equal to [\r\n\t\f\v ]), as many as possible, giving back as needed (greedy)
Data 1 - matches the characters Data 1 literally (case sensitive)
\s+ - matches one or more whitespace characters (greedy)
\| - matches the character | literally
.*? - matches any character (except line terminators) zero or more times, as few as possible, expanding as needed (lazy)
\| - matches the character | literally
.*? - matches any character (except line terminators), lazy
\| - matches the character | literally
.*? - matches any character (except line terminators), lazy
\| - matches the character | literally
\s* - matches zero or more whitespace characters (greedy)
([-+]?[0-9]*\.?[0-9]+) - 1st capturing group: an optional sign, optional integer digits, an optional decimal point, then one or more digits
\s* - matches zero or more whitespace characters (greedy)
\| - matches the character | literally
By the way, a safer (stricter) regex would be:
\|\s+Data 1\s+\|\s*?\d+\s*?\|\s*?\d+\s*?\|\s*?\d+\s*?\|\s*([-+]?[0-9]*\.?[0-9]+)\s*\|
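To see why the stricter pattern is safer, both patterns can be tried on a line with an extra column. The sketch below uses Python's `re` purely for illustration (the patterns are copied unchanged; `LOOSE`, `STRICT` and the 5-column test line are made up for this example). With the loose pattern, a lazy `.*?` can still stretch across a `|` when the whole line has to match, so the wrong column is captured.

```python
import re

NUM = r"([-+]?[0-9]*\.?[0-9]+)"
LOOSE = re.compile(r"\|\s+Data 1\s+\|.*?\|.*?\|.*?\|\s*" + NUM + r"\s*\|")
STRICT = re.compile(
    r"\|\s+Data 1\s+\|\s*?\d+\s*?\|\s*?\d+\s*?\|\s*?\d+\s*?\|\s*" + NUM + r"\s*\|"
)

good = "| Data 1 | 1403 | 0 | 2520 | 55.67 |"
odd = "| Data 1 | 1 | 2 | 3 | 4 | 5.5 |"    # hypothetical line with one column too many

def column_d(pattern, line):
    m = pattern.fullmatch(line)               # whole line must match, like std::regex_match
    return float(m.group(1)) if m else None

print(column_d(LOOSE, good), column_d(STRICT, good))
print(column_d(LOOSE, odd), column_d(STRICT, odd))
```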
I need to make sure the users' time stamp input is
(0 to infinity) days
(0 to 23) hours
(0 to 59) minutes
with spaces between the parts, so good examples are
22h 40m
2d 20m
1d
but not
0d
2d22h20m
What I have so far is:
^(0|[1-9][0-9]*d)?( [0|[1-5][0-9]*h)?(0| [1-5][0-9]*m)?$
which gets:
22d 3h 40m
2d 40m
but not
40m
40h
it seems trivial, and I've composed it from several SO questions, but nothing matches exactly this.
Edit
Note that my original attempt has a mistake: it allows 1-59 hours.
It is quite verbose, but one option could be to use an alternation and list all the allowed possibilities using your patterns for d, h and m.
Matching (d h m) or (d h) or (d m) or (h m) or (d) or (h) or (m) you might use:
^(?:[1-9][0-9]*d (?:2[0-3]|1[0-9]|[1-9])h (?:[1-5][0-9]|[1-9])m|[1-9][0-9]*d (?:(?:2[0-3]|1[0-9]|[1-9])h|(?:[1-5][0-9]|[1-9])m)|(?:2[0-3]|1[0-9]|[1-9])h (?:[1-5][0-9]|[1-9])m|[1-9][0-9]*d|(?:2[0-3]|1[0-9]|[1-9])h|(?:[1-5][0-9]|[1-9])m)$
regex101 demo
That will match:
^ Start of string
(?: Non capturing group
[1-9][0-9]*d (?:2[0-3]|1[0-9]|[1-9])h (?:[1-5][0-9]|[1-9])m Match full part
| Or
[1-9][0-9]*d Match d
(?: Non capturing group
(?:2[0-3]|1[0-9]|[1-9])h Match h
| Or
(?:[1-5][0-9]|[1-9])m Match m
) close non capturing group
| Or
(?:2[0-3]|1[0-9]|[1-9])h (?:[1-5][0-9]|[1-9])m Match h and m
| Or
[1-9][0-9]*d Match d
| Or
(?:2[0-3]|1[0-9]|[1-9])h Match h
| Or
(?:[1-5][0-9]|[1-9])m Match m
) Close non capturing group
$ End of string
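An alternative to enumerating every combination is to split on single spaces and check each part. The Python sketch below is an assumption-laden illustration, not part of the answer: `is_valid_duration` is a hypothetical name, and it follows the regex above in rejecting zero values (days from 1, hours 1-23, minutes 1-59) and requiring the order d, h, m.

```python
import re

TOKEN = re.compile(r"([1-9][0-9]*)([dhm])")
LIMITS = {"d": None, "h": 23, "m": 59}   # None means no upper bound
ORDER = "dhm"

def is_valid_duration(text):
    last_rank = -1
    for part in text.split(" "):          # exactly one space between parts
        m = TOKEN.fullmatch(part)
        if m is None:                     # empty part, leading zero, unknown unit
            return False
        value, unit = int(m.group(1)), m.group(2)
        rank = ORDER.index(unit)
        if rank <= last_rank:             # repeated or out-of-order unit
            return False
        if LIMITS[unit] is not None and value > LIMITS[unit]:
            return False
        last_rank = rank
    return True

for sample in ["22h 40m", "2d 20m", "1d", "0d", "2d22h20m", "40h"]:
    print(sample, is_valid_duration(sample))
```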
Edit
The only suggestions (so far) are really to list all possibilities, so I'll be doing that here.
regex101
^((([1-9][0-9]*)d (2[0-3]|1[0-9]|[1-9])h ([1-5][0-9]|[1-9])m)|(([1-9][0-9]*)d (2[0-3]|1[0-9]|[1-9])h)|(([1-9][0-9]*)d ([1-5][0-9]|[1-9])m)|((2[0-3]|1[0-9]|[1-9])h ([1-5][0-9]|[1-9])m)|([1-9][0-9]*)d|(2[0-3]|1[0-9]|[1-9])h|([1-5][0-9]|[1-9])m)$
meaning: (d h m) or (d h) or (d m) or (h m) or (d) or (h) or (m)
I am trying to create a regex for [lon,lat] coordinates.
The code first checks if the input starts with '['.
If it does we check the validity of the coordinates via a regex
/([\[][-+]?(180(\.0{1,15})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,15})?),[-+]?([1-8]?\d(\.\d{1,15})?|90(\.0{1,15})?)[\]][\;]?)+/gm
The regex tests for [lon,lat] with 15 decimals [+- 180degrees, +-90degrees]
it should match :
single coordinates :
[120,80];
[120,80]
multiple coordinates
[180,90];[180,67];
[180,90];[180,67]
with newlines
[123,34];[-32,21];
[12,-67]
it should not match:
semicolon separator missing - single
[25,67][76,23];
semicolon separator missing - multiple
[25,67]
[76,23][12,90];
I currently have problems with the ; between coordinates (see 4 & 5)
jsfiddle equivalent here : http://regex101.com/r/vQ4fE0/4
You can try with this (human readable) pattern:
$pattern = <<<'EOD'
~
(?(DEFINE)
(?<lon> [+-]?
(?:
180 (?:\.0{1,15})?
|
(?: 1(?:[0-7][0-9]?)? | [2-9][0-9]? | 0 )
(?:\.[0-9]{1,15})?
)
)
(?<lat> [+-]?
(?:
90 (?:\.0{1,15})?
|
(?: [1-8][0-9]? | 9 | 0 )
(?:\.[0-9]{1,15})?
)
)
)
\A
\[ \g<lon> , \g<lat> ] (?: ; \n? \[ \g<lon> , \g<lat> ] )* ;?
\z
~x
EOD;
Explanation:
When you have to deal with a long pattern in which the same subpatterns are repeated several times, you can use several features to make it more readable.
The best known is the free-spacing mode (the x modifier), which lets you indent the pattern as you want (all spaces are ignored) and, if needed, add comments.
The second is to define subpatterns in a definition section (?(DEFINE)...), where you can declare named subpatterns to be used later in the main pattern.
Since I don't want to repeat the large subpatterns that describe the longitude and latitude numbers, I have created two named patterns, "lon" and "lat", in the definition section. To use them in the main pattern, I only need to write \g<lon> and \g<lat>.
javascript version:
var lon_sp = '(?:[+-]?(?:180(?:\\.0{1,15})?|(?:1(?:[0-7][0-9]?)?|[2-9][0-9]?|0)(?:\\.[0-9]{1,15})?))';
var lat_sp = '(?:[+-]?(?:90(?:\\.0{1,15})?|(?:[1-8][0-9]?|9|0)(?:\\.[0-9]{1,15})?))';
var coo_sp = '\\[' + lon_sp + ',' + lat_sp + '\\]';
var regex = new RegExp('^' + coo_sp + '(?:;\\n?' + coo_sp + ')*;?$');
var coordinates = new Array('[120,80];',
'[120,80]',
'[180,90];[180,67];',
'[123,34];[-32,21];\n[12,-67]',
'[25,67][76,23];',
'[25,67]\n[76,23]');
for (var i = 0; i<coordinates.length; i++) {
console.log("\ntest "+(i+1)+": " + regex.test(coordinates[i]));
}
fiddle
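The same two readability ideas (free-spacing mode and reusing named pieces) can be sketched in Python, which has no `(?(DEFINE)...)` but supports `re.VERBOSE` and plain string composition. The names `LON`, `LAT`, `PAIR` and `is_valid` are made up for this illustration. As an assumption on my part, `0` is listed among the latitude alternatives so that a latitude of 0 is accepted.

```python
import re

# Free-spacing fragments: whitespace inside them is ignored under re.VERBOSE.
LON = r"""[+-]? (?: 180 (?:\.0{1,15})?
              | (?: 1(?:[0-7][0-9]?)? | [2-9][0-9]? | 0 ) (?:\.[0-9]{1,15})? )"""
LAT = r"""[+-]? (?: 90 (?:\.0{1,15})?
              | (?: [1-8][0-9]? | 9 | 0 ) (?:\.[0-9]{1,15})? )"""   # "| 0" is an assumption
PAIR = r"\[ " + LON + " , " + LAT + r" \]"

# One pair, then any number of ";"-separated pairs, then an optional final ";".
COORDS = re.compile(PAIR + r" (?: ; \n? " + PAIR + r" )* ;?", re.VERBOSE)

def is_valid(text):
    return COORDS.fullmatch(text) is not None

for s in ["[120,80];", "[123,34];[-32,21];\n[12,-67]", "[25,67][76,23];"]:
    print(repr(s), is_valid(s))
```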
Try this out:
^(\[([+-]?(?!(180\.|18[1-9]|19\d{1}))\d{1,3}(\.\d{1,15})?,[+-]?(?!(90\.|9[1-9]))\d{1,2}(\.\d{1,15})?(\];$|\]$|\];\[)){1,})
Demo: http://regex101.com/r/vQ4fE0/7
Explanation
^(\[
Must start with a bracket
[+-]?
May or may not contain +- in front of the number
(?!(180\.|18[1-9]|19\d{1}))
Should not contain 180., 181-189 nor 19x
\d{1,3}(\.\d{1,15})?
Otherwise, any number containing 1 to 3 digits, with or without decimals (up to 15), is allowed
(?!(90\.|9[1-9]))
The 90 check is similar, but here we are not allowing 90. nor 91-99
\d{1,2}(\.\d{1,15})?
Otherwise, any number containing 1 or 2 digits, with or without decimals (up to 15) are allowed
(\];$|\]$|\];\[)
The ending of a bracket body must have a ; separating two bracket bodies, otherwise it must be the end of the line.
{1,}
The brackets can exist 1 or multiple times
Hope this was helpful.
This might work. Note that you have a lot of capture groups, none of which
will give you good information: they sit inside a repeated group, so each one keeps only its last match.
# /^(\[[-+]?(180(\.0{1,15})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,15})?),[-+]?([1-8]?\d(\.\d{1,15})?|90(\.0{1,15})?)\](?:;\n?|$))+$/
^
( # (1 start)
\[
[-+]?
( # (2 start)
180
( \. 0{1,15} )? # (3)
|
( # (4 start)
( 1 [0-7] \d ) # (5)
|
( [1-9]? \d ) # (6)
) # (4 end)
( \. \d{1,15} )? # (7)
) # (2 end)
,
[-+]?
( # (8 start)
[1-8]? \d
( \. \d{1,15} )? # (9)
|
90
( \. 0{1,15} )? # (10)
) # (8 end)
\]
(?: ; \n? | $ )
)+ # (1 end)
$
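The remark about capture groups can be checked quickly. The Python sketch below uses a deliberately simplified pair pattern (integers only, an assumption made for brevity) to show that a group inside a repeated group keeps only its last iteration, while a separate pass with `findall` recovers every pair.

```python
import re

text = "[1,2];[3,4]"

# Groups inside the repeated (?: ... )+ are overwritten on each iteration.
whole = re.fullmatch(r"(?:\[(-?\d+),(-?\d+)\];?)+", text)
print(whole.groups())

# A second pass over the validated text collects all pairs.
pairs = re.findall(r"\[(-?\d+),(-?\d+)\]", text)
print(pairs)
```

So a single regex can validate the whole input, but extracting the individual coordinates needs a second step.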
Try a function approach, where the function can do some of the splitting for you, as well as delegating the number comparisons away from the regex. I tested it here: http://repl.it/YyG/3
//represents regex necessary to capture one coordinate, which
// looks like 123 or 123.13532
// the decimal part is a non-capture group ?:
var oneCoord = '(-?\\d+(?:\\.\\d+)?)';
//console.log("oneCoord is: "+oneCoord+"\n");
//one coordinate pair is represented by [x,x]
// check start/end with ^, $
var coordPair = '^\\['+oneCoord+','+oneCoord+'\\]$';
//console.log("coordPair is: "+coordPair+"\n");
//the full regex string consists of one or more coordinate pairs,
// but we'll do the splitting in the function
var myRegex = new RegExp(coordPair);
//console.log("my regex is: "+myRegex+"\n");
function isPlusMinus180(x)
{
return -180.0<=x && x<=180.0;
}
function isPlusMinus90(y)
{
return -90.0<=y && y<=90.0;
}
function isValid(s)
{
//if there's a trailing semicolon, remove it
if(s.slice(-1)==';')
{
s = s.slice(0,-1);
}
//remove all newlines and split by semicolon
var all = s.replace(/\n/g,'').split(';');
//console.log(all);
for(var k=0; k<all.length; ++k)
{
var match = myRegex.exec(all[k]);
if(match===null)
return false;
console.log(" match[1]: "+match[1]);
console.log(" match[2]: "+match[2]);
//break out if one pair is bad
if(! (isPlusMinus180(match[1]) && isPlusMinus90(match[2])) )
{
console.log(" one of matches out of bounds");
return false;
}
}
return true;
}
var coords = new Array('[120,80];',
'[120.33,80]',
'[180,90];[180,67];',
'[123,34];[-32,21];\n[12,-67]',
'[25,67][76,23];',
'[25,67]\n[76,23]',
'[190,33.33]',
'[180.33,33]',
'[179.87,90]',
'[179.87,91]');
var s;
for (var i = 0; i<coords.length; i++) {
s = coords[i];
console.log((i+1)+". ==== testing "+s+" ====");
console.log(" isValid? => " + isValid(s));
}
Consider the string "AB 1 BA 2 AB 3 BA". How can I match the content between "AB" and "BA" in a non-greedy fashion (in awk)?
I have tried the following:
awk '
BEGIN {
str="AB 1 BA 2 AB 3 BA"
regex="AB([^B][^A]|B[^A]|[^B]A)*BA"
if (match(str,regex))
print substr(str,RSTART,RLENGTH)
}'
with no output. I believe the reason for no match is that there is an odd number of characters between "AB" and "BA". If I replace str with "AB 11 BA 22 AB 33 BA" the regex seems to work..
Merge your two negated character classes and remove the [^A] from the second alternation:
regex = "AB([^AB]|B|[^B]A)*BA"
This regex fails on the string ABABA, though - not sure if that is a problem.
Explanation:
AB # Match AB
( # Group 1 (could also be non-capturing)
[^AB] # Match any character except A or B
| # or
B # Match B
| # or
[^B]A # Match any character except B, then A
)* # Repeat as needed
BA # Match BA
Since the only way to match an A in the alternation is by matching a character except B before it, we can safely use the simple B as one of the alternatives.
The other answer didn't really answer: how to match non-greedily?
Looks like it can't be done in (G)AWK. The manual says this:
awk (and POSIX) regular expressions always match the leftmost, longest
sequence of input characters that can match.
https://www.gnu.org/software/gawk/manual/gawk.html#Leftmost-Longest
And the whole manual contains neither the word "greedy" nor "lazy". It mentions Extended Regular Expressions, but for non-greedy matching you'd need Perl-Compatible Regular Expressions. So… no, it can't be done with a quantifier.
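For contrast, this is what a lazy quantifier gives in an engine that has one. The sketch uses Python's `re`, not awk, only to show the behaviour the question is after.

```python
import re

s = "AB 1 BA 2 AB 3 BA"

greedy = re.search(r"AB(.*)BA", s)    # .* takes as much as possible
lazy = re.search(r"AB(.*?)BA", s)     # .*? takes as little as possible

print(repr(greedy.group(0)))
print(repr(lazy.group(0)))
```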
For general expressions, I'm using this as a non-greedy match:
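# Non-greedy variant of match(): starting from the leftmost match, keep
# re-matching r against the text cut one character short of the current match.
# m and n are extra parameters so that they stay local to the function.
function smatch(s, r,    m, n) {
if (match(s, r)) {
m = RSTART
do {
n = RLENGTH
} while (match(substr(s, m, n - 1), r))
RSTART = m
RLENGTH = n
return RSTART
} else return 0
}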
smatch behaves like match, returning:
the position in s where the regular expression r occurs, or 0 if it does not. The variables RSTART and RLENGTH are set to the position and length of the matched string.