Perl - Format Strings in Array using Regular Expressions

I have an array of strings that represent sizes.
A list of all format variations is below:
2x3
3.6x5.6
6'RD
Goal: Convert to these formats:
2' x 3'
3' 6'' x 5' 6''
6' ROUND
All values are currently being pushed into an array called @sizearray like so:
push(@sizearray, $data[$size_num]);
Then,
my @formattedsize = @sizearray;
foreach (@formattedsize) {
if ($formattedsize[$_] =~ /(\d+)x(\d+)/) {
#convert to
if (???) {
#???
}
if (???) {
#???
}
}
How would I go through each element in the array and save the values into a new array with the new format?

You are trying to solve 2 problems:
Parse the input to extract "meaningful" data, i.e. geometry (rectangular, round, etc) and parameters (aspect ratio, diameter, etc). Before you can do that you must establish the "universe" of possibilities. Are there more than just rectangular and round? This is the harder part.
Take the extracted data and normalize/standardize the format. This is the easy part.
Let's say you only have two options, rectangular and round. Rectangular seems to be defined by a pair of real numbers separated by an 'x', so a regex for that might be
(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)
What you have here is two expressions for real numbers with a separator between them:
1 or more digits followed by an optional group of a dot and one or more digits
optional whitespace, an x and more optional whitespace
1 or more digits followed by an optional group of a dot and one or more digits
The outer parentheses around each number expression form a capturing group, which makes the regex engine return whatever matched in the results. The inner parentheses (?:\.\d+)? form a non-capturing group (the ?: part). This lets you apply the trailing ? quantifier (0 or 1) to the decimal portion without capturing it separately.
If the input doesn't match this, you move on to the next pattern, looking for a round specification. Repeat as needed for all possibilities.
For the above expression
# assume string to be parsed is in $_
if (my ($h, $w) = /(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)/)
{
printf "%s x %s\n", $h, $w;
}
I haven't tested this so there may be a typo... but this is the general idea.
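The try-one-pattern-then-the-next approach described above can be sketched end to end. This is a minimal sketch in Python rather than Perl, and it rests on two assumptions read off the question's examples: the part after the dot is a count of inches (3.6 means 3' 6''), and RD is the only marker for round sizes. The names format_size and feet_inches are hypothetical.

```python
import re

# Assumption: "3.6" means 3 feet 6 inches, as in the question's goal formats.
RECT = re.compile(r"^(\d+)(?:\.(\d+))?\s*x\s*(\d+)(?:\.(\d+))?$", re.IGNORECASE)
# Assumption: round sizes look like 6'RD (the foot mark is treated as optional).
ROUND = re.compile(r"^(\d+)(?:\.(\d+))?'?\s*RD$", re.IGNORECASE)


def feet_inches(feet, inches):
    """Render one dimension; inches is None when the input had no decimal part."""
    return f"{feet}' {inches}''" if inches else f"{feet}'"


def format_size(raw):
    """Try each known pattern in turn; pass unknown inputs through unchanged."""
    text = raw.strip()
    m = RECT.match(text)
    if m:
        return f"{feet_inches(m[1], m[2])} x {feet_inches(m[3], m[4])}"
    m = ROUND.match(text)
    if m:
        return f"{feet_inches(m[1], m[2])} ROUND"
    return raw


sizes = ["2x3", "3.6x5.6", "6'RD"]
formatted = [format_size(s) for s in sizes]
print(formatted)  # ["2' x 3'", "3' 6'' x 5' 6''", "6' ROUND"]
```

Building a new list instead of editing in place keeps the original values around, which is what the question asks for.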

Related

Regex split string by two consecutive pipe ||

I want to split the string below on two consecutive pipes (||) using a regex.
Input String
value1=data1||value2=da|ta2||value3=test&user01|
Expected Output
value1=data1
value2=da|ta2
value3=test&user01|
I tried ([^||]+), but it also splits on a single pipe |.
value2 contains a single pipe, which should not be treated as a separator.
I am using a Lua script like this:
for pair in string.gmatch(params, "([^||]+)") do
print(pair)
end

You can explicitly find each ||.
$ cat foo.lua
s = 'value1=data1||value2=da|ta2||value3=test&user01|'
offset = 1
for idx in string.gmatch(s, '()||') do
print(string.sub(s, offset, idx - 1) )
offset = idx + 2
end
-- Deal with the part after the right-most `||`.
-- Must +1 or it'll fail to handle s like "a=b||".
if offset <= #s + 1 then
print(string.sub(s, offset) )
end
$ lua foo.lua
value1=data1
value2=da|ta2
value3=test&user01|
Regarding ()||, see Lua's documentation on Patterns (Lua does not have regex support):
Captures:
A pattern can contain sub-patterns enclosed in parentheses; they describe captures. When a match succeeds, the substrings of the subject string that match captures are stored (captured) for future use. Captures are numbered according to their left parentheses. For instance, in the pattern "(a*(.)%w(%s*))", the part of the string matching "a*(.)%w(%s*)" is stored as the first capture, and therefore has number 1; the character matching "." is captured with number 2, and the part matching "%s*" has number 3.
As a special case, the capture () captures the current string position (a number). For instance, if we apply the pattern "()aa()" on the string "flaaap", there will be two captures: 3 and 5.

The easiest way is to replace the two-character sequence || with a single character (e.g. ;) that does not occur in the data, and then use that character as the separator:
local params = "value1=data1||value2=da|ta2||value3=test&user01|"
for pair in string.gmatch(params:gsub('||',';'), "([^;]+)") do
print(pair)
end
If every printable character can occur in the data, a non-printable character can be used instead, written by its code: string.char("10") == "\10" == "\n"
Even the character with code 1 works: "\1"
string.gmatch( params:gsub('||','\1'), "([^\1]+)" )
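The reason ([^||]+) splits on a single pipe is that a bracket expression is a set of characters, so [^||] means the same as [^|]. A minimal Python sketch (not Lua) of the difference between the set-based pattern and a split on the literal two-character separator; the variable names are only illustrative:

```python
import re

params = "value1=data1||value2=da|ta2||value3=test&user01|"

# A negated character class excludes single characters, so every "|" ends a match.
by_class = re.findall(r"[^||]+", params)

# Splitting on the literal two-character sequence keeps single pipes intact.
by_separator = params.split("||")

print(by_class)      # ['value1=data1', 'value2=da', 'ta2', 'value3=test&user01']
print(by_separator)  # ['value1=data1', 'value2=da|ta2', 'value3=test&user01|']
```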

Regex substitution does not replace match character for character

I am trying to use Regex to dynamically capture all numbers in a string such as 1234-12-1234 or 1234-123-1234 without knowing the number of characters that will occur in each string segment. I have been able to capture this using positive look ahead via the following expression: [0-9]*(?=-). However, when I try to replace the numbers to Xs such that each number that occurs before the last dash is replaced by an X, the Regex does not return X's for numbers 1:1. Instead, each section returns exactly two X's. How can I get the regex to return the following:
1234-123-1234 -> XXXX-XXX-1234
1234-12-1234 -> XXXX-XX-1234
instead of the current
1234-123-1234 -> XX-XX-1234
?

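The problem is that with the * placed directly after the digit class, a whole run of digits is a single match and is replaced with a single X. The engine then finds an empty match just before the dash (zero digits also satisfy [0-9]*), and that is replaced with a second X. So every run of digits, whatever its length, ends up as two X's.
Match one digit at a time instead, requiring only that a dash appears somewhere later in the string: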
[0-9](?=.*-)
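A short Python sketch of the one-digit-per-match pattern in use; the function name mask_before_last_dash is hypothetical. Each digit is its own match, so each one is replaced by exactly one X, and digits after the last dash have no dash ahead of them and are left alone.

```python
import re


def mask_before_last_dash(text):
    """Replace every digit that still has a dash somewhere after it."""
    return re.sub(r"[0-9](?=.*-)", "X", text)


print(mask_before_last_dash("1234-123-1234"))  # XXXX-XXX-1234
print(mask_before_last_dash("1234-12-1234"))   # XXXX-XX-1234
```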

prevent nested groups from spoiling regexp matches (TCL)

I have a file with a number of (multi)space separated floats. Number of floats could vary. For the sake of the argument let's say it's 5. I picked up a regexp from this tutorial page :
www.regular-expressions.info/floatingpoint.html
[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
To catch multiple floats I stuck this into a group, added some spaces, and grouped it again with a + quantifier.
(([-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)\s+)+
I understand that has created nested groups and that's where my knowledge ends. When I test the regexp I get undesired matches of 'sub' groups i.e. the exponents.
So my question is: how do I capture only the 'first level' groups that are my full floats?
A sample test data set (note varying number of spaces):
set x " 1.0034e-09 -0.34e+07 -3 0.46 3.445e+03 "
Thanks,
Gert

The fact that your expression contains nested capturing groups does not mean you can access every repeated capture. Only the text captured during the last iteration is accessible.
Also, Tcl returns every capturing group, so convert any group you do not need into a non-capturing one: ([eE][-+]?[0-9]+)? => (?:[eE][-+]?[0-9]+)?
To match all the numbers in your testing set, you may use
set x { 1.0034e-09 -0.34e+07 -3 0.46 3.445e+03 }
set RE {[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?}
set res [regexp -all -inline $RE $x]
puts $res
NOTE that the [-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)? regex matches integer OR float values. To match only floats, use [-+]?[0-9]*\.[0-9]+(?:[eE][-+]?[0-9]+)? (drop the ? quantifier after \., which made the dot optional).
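The same effect shows up outside Tcl. As an illustration only, here is a Python sketch using re.findall, which returns the group contents when the pattern has a capturing group and the whole match when it has none; the sample string is the one from the question.

```python
import re

x = " 1.0034e-09 -0.34e+07 -3 0.46 3.445e+03 "

# With a capturing group, findall reports the group (the exponent), not the float.
with_capture = re.findall(r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?", x)

# With a non-capturing group, findall reports each full match.
without_capture = re.findall(r"[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?", x)

print(with_capture)     # ['e-09', 'e+07', '', '', 'e+03']
print(without_capture)  # ['1.0034e-09', '-0.34e+07', '-3', '0.46', '3.445e+03']
```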

Split by regex with capturing groups in lookahead produces repeating fragments in results

I was hoping for a one-liner to insert thousands separators into string of digits with decimal separator (example: 78912345.12). My first attempt was to split the string in places where there is either 3 or 6 digits left until decimal separator:
console.log("5789123.45".split(/(?=([0-9]{3}\.|[0-9]{6}\.))/));
which gave me the following result (notice how fragments of original string are repeated):
[ '5', '789123.', '789', '123.', '123.45' ]
I found out that "problem" (please read problem here as my obvious misunderstanding) comes from using a group within lookahead expression. This simple expression works "correctly":
console.log("abcXdeYfgh".split(/(?=X|Y)/));
when executed prints:
[ 'abc', 'Xde', 'Yfgh' ]
But the moment I surround X|Y with parentheses:
console.log("abcXdeYfgh".split(/(?=(X|Y))/));
the resulting array looks like:
[ 'abc', 'X', 'Xde', 'Y', 'Yfgh' ]
Moreover, when I change the group to a non-capturing one, everything comes back to "normal":
console.log("abcXdeYfgh".split(/(?=(?:X|Y))/));
this yields again:
[ 'abc', 'Xde', 'Yfgh' ]
So, I could do the same trick (changing to non-capturing group) within original expression (and it indeed works), but I was hoping for an explanation of this behavior I cannot understand. I experience identical results when trying to do the same in .NET so it seems like a fundamental thing with how regular expression lookaheads work. This is my question: why lookahead with capturing groups produces those "strange" results?

Capturing groups inside a regex pattern inside a regex split method/function make the captured texts appear as separate elements in the resulting array (for most of the major languages).
Here is C#/.NET reference:
If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, if you split the string "plum-pear" on a hyphen placed within capturing parentheses, the returned array includes a string element that contains the hyphen.
Here is JavaScript reference:
If separator is a regular expression that contains capturing parentheses, then each time separator is matched, the results (including any undefined results) of the capturing parentheses are spliced into the output array. However, not all browsers support this capability.
Just a note: the same behavior is observed with
PHP (with preg_split and PREG_SPLIT_DELIM_CAPTURE flag):
print_r(preg_split("/(?<=(X))/","XYZ",-1,PREG_SPLIT_DELIM_CAPTURE));
// --> [0] => X, [1] => X, [2] => YZ
Ruby (with string.split):
"XYZ".split(/(?<=(X))/) # => X, X, YZ
But it is the opposite in Java, the captured text is not part of the resulting array:
System.out.println(Arrays.toString("XYZ".split("(?<=(X))"))); // => [X, YZ]
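And in Python, the re module's re.split could not split on a zero-width assertion before version 3.7, so older versions did not split the string below. From 3.7 on, the split happens and the captured text is included, as in the first group of languages:
print(re.split(r"(?<=(X))","XXYZ")) # Python 3.7+ => ['X', 'X', 'X', 'X', 'YZ']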
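The question's X|Y experiment can be reproduced in Python as a quick check of the rule that captured text is spliced into the result. This sketch assumes Python 3.7 or later, where re.split does split on empty (lookahead-only) matches.

```python
import re

text = "abcXdeYfgh"

# Capturing group inside the lookahead: the captured X / Y is added to the output.
with_group = re.split(r"(?=(X|Y))", text)

# Non-capturing group: only the pieces between split points are returned.
without_group = re.split(r"(?=(?:X|Y))", text)

print(with_group)     # ['abc', 'X', 'Xde', 'Y', 'Yfgh']
print(without_group)  # ['abc', 'Xde', 'Yfgh']
```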

Here is a simple way to do it in JavaScript:
number.toString().replace(/\B(?=(\d{3})+(?!\d))/g, ",")

Capturing groups inside a lookahead can add extra elements to the result of a split.
You are on the right track, but the pattern has no natural anchor.
When every character in the string is of the same type (digits, in your case),
a lookahead that only counts a fixed number of them is not enough
to split at the right places.
The engine bumps along one character at a time, splits wherever the lookahead
succeeds, and includes the captured text as elements.
You could consume the capture as part of the match,
as in (?=(\d{3}))\1, but that not only splits at the wrong place, it also
injects an empty element into the array.
The solution is to use the natural anchor, the dot, and split wherever
a multiple of 3 digits remains before it.
This forces the engine to find the positions that are a multiple of 3 digits
away from the anchor.
No captures are needed, and the split is exact.
Regex: (?=(?:[0-9]{3})+\.)
Formatted:
(?=
(?: [0-9]{3} )+
\.
)
C#:
string[] ary = Regex.Split("51234555632454789123.45", @"(?=(?:[0-9]{3})+\.)");
int size = ary.Length;
for (int i = 0; i < size; i++)
Console.WriteLine(" {0} = '{1}' ", i, ary[i]);
Output:
0 = '51'
1 = '234'
2 = '555'
3 = '632'
4 = '454'
5 = '789'
6 = '123.45'
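The dot-anchored split can be turned into the one-liner the question was after. This Python sketch (3.7 or later, for splitting on empty matches) makes one addition that is not in the answer: a lookbehind (?<=[0-9]) so that no split happens at the very start when the integer part has exactly 3, 6, 9, ... digits. The function name add_thousands_separators is hypothetical.

```python
import re

# Split where a multiple of 3 digits remains before the dot.
# Assumption: the (?<=[0-9]) lookbehind is added here to avoid an empty
# leading piece for inputs such as "123.45".
SPLIT_POINTS = re.compile(r"(?<=[0-9])(?=(?:[0-9]{3})+\.)")


def add_thousands_separators(number_text, sep=","):
    """Join the digit groups of a decimal string such as '5789123.45'."""
    return sep.join(SPLIT_POINTS.split(number_text))


print(SPLIT_POINTS.split("5789123.45"))         # ['5', '789', '123.45']
print(add_thousands_separators("5789123.45"))   # 5,789,123.45
print(add_thousands_separators("78912345.12"))  # 78,912,345.12
```

Like the answer's pattern, this relies on the dot being present; a string without a decimal part is returned unchanged.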

How to identify a given string is hex color format

I'm looking for a regular expression to validate hex colors in ASP.NET C# and
am also looking code for validation on server side.
For instance: #CCCCCC

Note: This is strictly for validation, i.e. accepting a valid hex color. For actual parsing you won't get the individual parts out of this.
^#(?:[0-9a-fA-F]{3}){1,2}$
For ARGB:
^#(?:[0-9a-fA-F]{3,4}){1,2}$
Dissection:
^ anchor for start of string
# the literal #
( start of group
?: indicate a non-capturing group that doesn't generate backreferences
[0-9a-fA-F] hexadecimal digit
{3} three times
) end of group
{1,2} repeat either once or twice
$ anchor for end of string
This will match an arbitrary hexadecimal color value that can be used in CSS, such as #91bf4a or #f13.

Minor disagreement with the other solution. I'd say
^#(([0-9a-fA-F]{2}){3}|([0-9a-fA-F]){3})$
The reason is that this (correctly) captures the individual RGB components. The other expression broke #112233 in three parts, '#' 112 233. The syntax is actually '#' (RR GG BB) | (R G B)
The slight disadvantage is more backtracking is required. When parsing #CCC you don't know that the second C is the green component until you hit the end of the string; when parsing #CCCCCC you don't know that the second C is still part of the red component until you see the 4th C.
It also works well for RGBA, which the other solution does not handle correctly:
const thisRegex = /#(([0-9a-fA-F]{2}){3,4}|([0-9a-fA-F]){3,4})/g
document.write("#fff;ae#rvaerv c #fffaff---#afd #ffff".match(thisRegex))
// #fff,#fffaff,#afd,#ffff
The other solution does not recognize #fffaff correctly:
const theOtherSolutionRegex = /#(?:[0-9a-fA-F]{3,4}){1,2}/g
document.write("#fff;ae#rvaerv c #fffaff---#afd #ffff".match(theOtherSolutionRegex))
// #fff,#fffa,#afd,#ffff
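Whether a repeated group such as ([0-9a-fA-F]{2}){3} exposes all three components depends on the engine: some keep only the last repetition (Python's re module is one of them). A sketch that sidesteps this by giving each component its own group; parse_hex_color is a hypothetical helper and handles only the #RGB and #RRGGBB forms.

```python
import re

HEX = "[0-9a-fA-F]"
LONG_FORM = re.compile(f"#({HEX}{{2}})({HEX}{{2}})({HEX}{{2}})")  # #RRGGBB
SHORT_FORM = re.compile(f"#({HEX})({HEX})({HEX})")                # #RGB


def parse_hex_color(text):
    """Return (red, green, blue) as integers, or None if text is not a hex color."""
    m = LONG_FORM.fullmatch(text)
    if m:
        return tuple(int(part, 16) for part in m.groups())
    m = SHORT_FORM.fullmatch(text)
    if m:
        # In the short form each digit is doubled: #f13 means #ff1133.
        return tuple(int(part * 2, 16) for part in m.groups())
    return None


print(parse_hex_color("#112233"))  # (17, 34, 51)
print(parse_hex_color("#CCC"))     # (204, 204, 204)
print(parse_hex_color("#12345"))   # None
```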

All answers mention the RGB format.
Here is a regex for the ARGB format (the alternatives are grouped so that ^ and $ apply to every one of them):
^#(?:[0-9a-fA-F]{8}|[0-9a-fA-F]{6}|[0-9a-fA-F]{4}|[0-9a-fA-F]{3})$

This should match any #rgb, #rgba, #rrggbb, and #rrggbbaa syntax:
/^#(?:(?:[\da-f]{3}){1,2}|(?:[\da-f]{4}){1,2})$/i
break down:
^ // start of line
# // literal pound sign, followed by
(?: // either:
(?: // a non-capturing group of:
[\da-f]{3} // exactly 3 of: a single digit or a letter 'a'–'f'
){1,2} // repeated exactly 1 or 2 times
| // or:
(?: // a non-capturing group of:
[\da-f]{4} // exactly 4 of: a single digit or a letter 'a'–'f'
){1,2} // repeated exactly 1 or 2 times
)
$ // end of line
i // ignore case (let 'A'–'F' match 'a'–'f')
Notice that the above is not equivalent to this syntax, which is incorrect:
/^#(?:[\da-f]{3,4}){1,2}$/i
This would allow a group of 3 followed by a group of 4, such as #1234567, which is not a valid hex color.
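The difference between the two patterns is easy to check. A minimal Python sketch, with [0-9a-f] written out in place of \d (in Python 3, \d on text also matches non-ASCII digits) and fullmatch standing in for the ^...$ anchors; the names STRICT, LOOSE, and is_hex_color are illustrative.

```python
import re

# Groups of 3 repeated once or twice, or groups of 4 repeated once or twice.
STRICT = re.compile(r"#(?:(?:[0-9a-f]{3}){1,2}|(?:[0-9a-f]{4}){1,2})", re.IGNORECASE)
# The shorter form lets a group of 3 be followed by a group of 4.
LOOSE = re.compile(r"#(?:[0-9a-f]{3,4}){1,2}", re.IGNORECASE)


def is_hex_color(text):
    """True for #rgb, #rgba, #rrggbb, and #rrggbbaa."""
    return STRICT.fullmatch(text) is not None


for sample in ["#f13", "#F13A", "#91bf4a", "#91bf4a80", "#1234567", "#12345"]:
    print(sample, is_hex_color(sample), LOOSE.fullmatch(sample) is not None)
```

Only #1234567 separates the two: the strict pattern rejects it and the loose one accepts it.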

Use these if you want to accept named colors and rgb(a,b,c) too. The final "i" makes the match case-insensitive.
HTML colors (#123, rgb not accepted)
/^(#[a-f0-9]{6}|black|green|silver|gray|olive|white|yellow|maroon|navy|red|blue|purple|teal|fuchsia|aqua)$/i
CSS colors (#123, rgb accepted)
/^(#[a-f0-9]{6}|#[a-f0-9]{3}|rgb *\( *[0-9]{1,3}%? *, *[0-9]{1,3}%? *, *[0-9]{1,3}%? *\)|rgba *\( *[0-9]{1,3}%? *, *[0-9]{1,3}%? *, *[0-9]{1,3}%? *, *[0-9]{1,3}%? *\)|black|green|silver|gray|olive|white|yellow|maroon|navy|red|blue|purple|teal|fuchsia|aqua)$/i

Based on MSalters' answer, but preventing an incorrect match, the following works
^#(([0-9a-fA-F]{2}){3}|([0-9a-fA-F]){3})$
Or for an optional hash # symbol:
^#?(([0-9a-fA-F]{2}){3}|([0-9a-fA-F]){3})$
And without back references being generated:
^#?(?:(?:[0-9a-fA-F]{2}){3}|(?:[0-9a-fA-F]){3})$

Ruby
In Ruby, you have access to the \h (hexadecimal) character class. You also have to take more care of line endings, hence the \A...\z instead of the more common ^...$
/\A#(\h{3}){1,2}\z/
This will match 3 or 6 hexadecimal characters following a #. So no RGBA. It's also case-insensitive, despite not having the i flag.
