I want to trim every string between double quotes to 16 characters - regex

In Powershell, I have a string of this form:
"abcdefghijk","hijk","lmnopqrstuvwxyzabcdefghij"
Assuming that I know how to escape a character (thus, if I were actually to write it including the string markers):
"`"abcdefghijk`",`"hijk`",`"lmnopqrstuvwxyzabcdefghij`""
... how would I trim anything between double quotes to only 16 characters?
The expected output is therefore:
"abcdefghijk","hijk","lmnopqrstuvwxyza"
I thought this:
% {$_ -replace "([^`"]{16})([^`"]+)", "$1"}
would match any relevant blocks as two backreferences, one of length 16 and one of unlimited length, and return only the first. However, that just removes everything of length 16 or more, resulting in:
"abcdefghijk","hijk",""
This isn't what I expected at all! What am I doing wrong?

The answer is as simple as this: change the double quotes around $1 to single quotes:
-replace "([^`"]{16})([^`"]+)", '$1'
It is a bit counter-intuitive, but here is the reason: inside double quotes, PowerShell treats $1 as a variable and expands it before the string ever reaches the regex engine. No variable named $1 is defined here, so it expands to an empty string and every match is replaced with nothing.
Or, you can escape the $ as well:
-replace "([^`"]{16})([^`"]+)", "`$1"
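For comparison, here is a minimal sketch of the same capture-and-truncate idea in Python, where a raw string keeps the backreference from being interpreted before it reaches the regex engine. The function name truncate_quoted and its limit parameter are hypothetical, chosen only for illustration.

```python
import re

def truncate_quoted(text: str, limit: int = 16) -> str:
    """Cut every run of non-quote characters down to `limit` characters."""
    # Group 1 holds the first `limit` characters of a run; the rest of the
    # run is matched by [^"]+ and dropped by the replacement.
    pattern = r'([^"]{%d})[^"]+' % limit
    # r"\1" is a raw string, so the backslash reaches re.sub untouched.
    return re.sub(pattern, r"\1", text)

print(truncate_quoted('"abcdefghijk","hijk","lmnopqrstuvwxyzabcdefghij"'))
# prints "abcdefghijk","hijk","lmnopqrstuvwxyza"
```

Runs shorter than the limit never match, so they pass through unchanged.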

I came up with the following which should be pretty close to what you're trying to do:
(\w+)\W+(\w+)\W+(\w+){16}
https://regex101.com/r/kfEAFl/1
Here you have the three groups and you can join them in any way.

Here's an example to do it without regex:
$longArray = @("1221111111111111111111111111", "213")
$shortArray = $longArray.ForEach{$_.ToString().PadRight(16, " ").SubString(0,16)}
write-host $shortArray
# Writes 1221111111111111 213

Related

Replace a sequence of characters with a sequence of different characters of same length using regular expressions

I have a string which starts with spaces. I want to replace the leading spaces with equal number of dashes -. I don't want to replace any other spaces which may occur elsewhere in the string.
If I use /^\s*/-/, it replaces all the leading spaces with a single dash. If I use /^\s/-/, it only replaces the first space with a dash. If I remove the anchor and use /\s/-/, it replaces every space in the string, which is not acceptable.
My string looks like this in general:
<n-leading-spaces><a-non-space-character><remaining-characters>
Example (pipes added to show the boundary):
|   ajfn ssfdjn ng jnv sjfj%nv sjfj n s ;sn |
After substitution (pipes added to show the boundary):
|---ajfn ssfdjn ng jnv sjfj%nv sjfj n s ;sn |
NOTE: I cannot use any code snippet. I just want to know whether this can be done using just regex patterns. (Forgive my formatting as I'm new to markdown. I welcome formatting corrections)
You can use the following solution to replace a sequence of characters with a sequence of different characters of same length using regular expressions:
my $string = '    ajfn ssfdjn ng jnv sjfj%nv sjfj n s ;sn ';
$string =~ s/^(\s+)/"-" x length($1)/eg;
print $string;
Returns '----ajfn ssfdjn ng jnv sjfj%nv sjfj n s ;sn '
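The Perl answer works because /e evaluates code in the replacement, so the dash count can depend on the length of the match. A rough Python equivalent passes a function as the replacement instead; dash_leading_spaces is a hypothetical name used only for this sketch.

```python
import re

def dash_leading_spaces(text: str) -> str:
    """Replace each leading whitespace character with a dash."""
    # The lambda receives the match object and returns as many dashes
    # as there were leading whitespace characters.
    return re.sub(r"^\s+", lambda m: "-" * len(m.group()), text)

print("|" + dash_leading_spaces("   ajfn ssfdjn ng ") + "|")
# prints |---ajfn ssfdjn ng |
```

Like the Perl version, this still needs a callable replacement; a pattern plus a fixed replacement string cannot produce a variable number of dashes in a single substitution.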

How to use sed/perl to find only 2d arrays and replace text?

Currently I have tons of code that looks like this:
static double testVar1 [2][8] = {0.0} ; /* This is for testing var 1 */
static double var_test2 [3][2] = {0.0} ; /* This is for testing var 2 */
static double var_test3 [4] = {0.0} ; /* This is for testing var 3 */
2d arrays in c++ initialize with double curly brackets, so I need to only find the 2d arrays and change it like this:
static double testVar1 [2][8] = {{0.0}} ; /* This is for testing var 1 */
static double var_test2 [3][2] = {{0.0}} ; /* This is for testing var 2 */
static double var_test3 [4] = {0.0} ; /* This is for testing var 3 */
I have been trying with sed to use groupings, but I can't figure out how to escape the brackets, some posts suggest not escaping at all. I have also tried without extended regular expressions.
Just now, I found out only 9 groupings in sed are possible, so now completely stuck. Any suggestions?
sed -i -r 's/(.*)(\[)([0-9]+)(\])(\[)([0-9]+)(\])(.*)(\{)(0.0)(\})(.*)/echo "\1\2\3"/ge'
Use a perl script with the following regex:
\w+\s*(?:\[\d+\]){2}\s*=\s*\K\{([\d.]*)\}
And replace this with \{\{\1\}\}, see a demo on regex101.com.
Broken down, this says:
\w+ # at least one word character
\s* # Zero or more spaces
(?:\[\d+\]){2} # [0][1] or any other two-dimensional array
\s*=\s* # spaces = and spaces
\K # "forget" everything
\{([\d.]*)\} # match and capture {number}
A Perl one-liner, cautious about literals such as 2u and 1e-06l (etc)
perl -pe's/(?:\[ [^]]+ \]){2} \s*=\s* \K (\{ [^}]+ \})/{$1}/x' in > out
The (?:...) construct groups without capturing, and (?:\[[^]]+\]){2} matches [n][m]. The \K acts like a positive lookbehind: it drops everything matched so far from the match, so we don't have to put it back in the replacement.
With an integer inside [] being just digits and with a float in {} being n.m this simplifies
perl -pe's/(?:\[\d+\]){2}\s*=\s*\K( \{[\d.]+\} )/{$1}/x' in > out
Note that [\d.] allows for all kinds of wrong things, like .2..3, but that is a different issue.
However, watch out for use of literals for numbers such as 2u (with the suffix) which are fine as indices as well, along with vec[1.2e+01] or even vec[1.2]. The varied notation for float/double literals is also more likely to show up in data. Altogether I'd go with a more rounded pattern like
perl -pe's/(?:\[ [\d\w+-.]+ \]){2}\s*=\s*\K(\{ [\d\w+-.]+ \})/{$1}/x' in > out
Keep in mind that this allows various wrong formats and so it doesn't check data well.
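For readers without Perl at hand, the same substitution can be sketched in Python. Python's re module has no \K, so this version captures the prefix in group 1 and writes it back explicitly. The function name double_braces_2d is hypothetical, and the pattern assumes plain digit indices and an n.m style initializer, as in the simplified one-liner above.

```python
import re

# Group 1: two bracketed indices followed by "=" (the part \K would discard).
# Group 2: the single-brace initializer that needs a second pair of braces.
PATTERN_2D = re.compile(r"((?:\[\d+\]){2}\s*=\s*)(\{[\d.]+\})")

def double_braces_2d(line: str) -> str:
    """Wrap the initializer of a 2D array declaration in extra braces."""
    return PATTERN_2D.sub(r"\1{\2}", line)

for line in (
    "static double testVar1 [2][8] = {0.0} ;",
    "static double var_test3 [4] = {0.0} ;",
):
    print(double_braces_2d(line))
# prints static double testVar1 [2][8] = {{0.0}} ;
# prints static double var_test3 [4] = {0.0} ;
```

The one-dimensional declaration is left alone because the {2} quantifier requires two consecutive bracketed indices.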
Here is a sed attempt with the regex wrinkles ironed out.
sed -i -r 's/(.*\[[0-9]+\]\[[0-9]+\].*)(\{0.0\})(.*)/\1{\2}\3/'
You had significant amounts of unmotivated additional grouping parentheses so \1\2\3 would only refer to the very beginning of the match. I simply took them out. Remember, the captures are ordered from left to right, so the first left parenthesis creates group \1, the second captures into \2, etc.
The GNU sed extension /e allows you to invoke the shell on the replacement string but in this case this added no value and introduced significant additional possible errors, so taking it out was a no-brainer. The /g option would make sense if you expected multiple matches per line, but your example shows no examples of input lines with multiple matches, and the entire script would need to be rather more complex in order to support that, so I took that out as well.
Depending on the language you are attempting to process and the regularity of the files, you might want to permit whitespace between the closing and opening square brackets, or not; and the "anything" wildcard between the closing square bracket and the opening curly bracket looks somewhat prone to false positives (matching where you don't want it to) -- maybe change it to only permit whitespace and an equals sign, like [ =]* instead of .*
Another approach with sed:
sed -i -r 's/((\[[0-9]\]){2} *= )(\{[^}]*\})/\1{\3}/' file
and same in BRE mode :
sed -i 's/\(\(\[[0-9]\]\)\{2\} *= \)\({[^}]*}\)/\1{\3}/' file
sed -i '/]\[/s/[{}]/&&/g' file

Perl regex weird behavior: works with smaller strings, fails with longer repetitions of the smaller ones

Here is a regex in Perl that I use to identify strings matching this pattern: any number of occurrences of any character except single quote ' or backslash \, with ' and \ allowed only in escaped form (\' and \\ respectively), and finally a terminating (non-escaped) single quote '.
foo.pl
#!/usr/bin/perl
my $line;
my $matchString;
Main();
sub Main() {
foreach $line( <STDIN> ) {
$line =~ m/(^(([^\\\']*?(\\\')*?(\\\\)*?)*?\'))/g;
$matchString = $1;
print "matchString:$matchString\n"
}
}
It seems to work fine for strings like :
./foo.pl
asasas'
sdsdsdsdsdsd'
\\\'sdsdsdsdsd\\\'sdsdsdsd\\'
\'sddsd\\sdsdsds\\\\\\sdsdsdsd\\\\\\'
matchString:asasas'
matchString:sdsdsdsdsdsd'
matchString:\\\'sdsdsdsdsd\\\'sdsdsdsd\\'
matchString:\'sddsd\\sdsdsds\\\\\\sdsdsdsd\\\\\\'
Then I create a file with the following recurring pattern :
AAAAAAAAAAAAAAAAAAAAAAAAAAAAA\\BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\'CCCCCCCCCCCCCCCCCCCCCC\\sdsdsd\\\\\' ZZZZ\'GGGGGG
A string built by repeating this pattern one or more times and appending a single quote ' should match the regex. I created a file called zz3 with 16 repetitions of the above pattern. I then created a file called ZZ6 with 18 repetitions of zz3, and another one called ZZ7 with the contents of ZZ6 plus one additional instance of zz3, hence 19 repetitions of zz3.
By adding a single quote at the end of zz3 it results in a match. By adding a single quote at the end of ZZ6 it also results in a match as expected.
Now here is the tough part, by adding a single quote at the end of ZZ7 does not result in a match!
here is a link to the 3 files :
https://drive.google.com/file/d/0BzIKyGguqkWvOWdKaElGRjhGdjg/view?usp=sharing
The Perl version I am using is v5.16.3 on FreeBSD, but I tried various versions on both FreeBSD and Linux with identical results. It seems to me that either Perl has a problem somewhere between 34274 bytes (ZZ6) and 36178 bytes (ZZ7), or I am missing something badly.
Your regular expression leads to catastrophic backtracking because you have nested quantifiers.
If you change it to
(^(([^\\\']*+(\\')*+(\\\\)*+)*?'))
(using possessive quantifiers to avoid backtracking), it should work.
I just would like to note that the whole problem appeared in an effort to re-engineer an old in-house program to parse escaped PostgreSQL bytea values.
Following this discussion, it is clear that Perl cannot repeat any pattern other than a plain dot (.) more than 32766 (= 32K - 2) times.
The solution is to masquerade the \\ and \' sequences with some chars that are certain to not appear in the input, such as Device Ctrl1 (\x11) and Device Ctrl2 (\x12), (presented as ^Q, ^R in vi respectively) :
$dataField =~ s/\\\\/\x11/g;
$dataField =~ s/\\\'/\x12/g;
then try to match non greedily any input till the first single quote.
$dataField =~ m/(^.*?\')/s;
$matchString = $1;
and finally substitute the above Ctrl chars back to their initial values
$matchString =~ s/\x11/\\\\/g;
$matchString =~ s/\x12/\\\'/g;
This is very fast. Another solution would be to parse till the first single quote and count the number of \'s. If it is even then we have found our last non escaped single quote in the text so we have found our desired match, otherwise the single quote is an escape one and thus considered part of the text, so we keep this value and iterate to the next single quote and repeat the same logic, by concatenating the value to the previous value. This tends to be very slow for big files with many intermediate escaped single quotes.
Perl regex's seem to be much faster than Perl code.
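The masking approach described above can be sketched in Python as follows. This is only an illustration of the idea, not the original program: the function name first_quoted_prefix is hypothetical, and the sketch assumes, as the answer does, that \x11 and \x12 never occur in the input.

```python
import re

def first_quoted_prefix(data: str):
    """Return the text up to and including the first unescaped single quote."""
    # Mask \\ first, so that a quote preceded by an escaped backslash
    # is correctly left visible as an unescaped quote.
    masked = data.replace("\\\\", "\x11").replace("\\'", "\x12")
    # After masking, any remaining quote is unescaped; match lazily up to it.
    match = re.match(r".*?'", masked, re.S)
    if match is None:
        return None
    # Restore the two escape sequences in the matched part.
    return match.group().replace("\x11", "\\\\").replace("\x12", "\\'")

unit = r"AAAA\\BBBB\'CCCC\\sdsdsd\\\\\' ZZZZ\'GGGGGG"
long_input = unit * 400 + "'"
print(first_quoted_prefix(long_input) == long_input)
# prints True
```

Because the pattern after masking contains no nested quantifiers, the match takes a single pass over the input regardless of how many escape sequences it contains.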

powershell regex remove digit

Using Powershell, I'm trying to remove leading zeros of the numeric part of a string. Here are a couple of examples and expected results:
AB012 needs to become AB12
ABC010 needs to become ABC10
I've tried the following, but it gets rid of all zeros. How can I tell it to only remove the leading zeros? The format of the string will always be letters followed by numbers. However, the length of the letters may vary.
$x = 'ABC010'
$y = $x -replace '[0]'
$y
This will display ABC1, and I wish to display ABC10.
Thanks.
This regex searches for a letter followed by any number of 0's and keeps the letter in the replacement pattern but strips the zeros:
$x = $x -replace '([a-z])0*', '$1'
Try this regex with look-behind and look-ahead assertions:
'(?<=.+)(0)(?=.+)'
The problem is that a string like "0AB0101" becomes "0AB11". In that case use:
'(?<=\D)(0)(?=.+)'
Here's another option that evaluates the digits as a number:
PS> [regex]::replace('ABC010','(\d+)',{iex $args[0]})
ABC10
Or using the TrimStart method:
PS> [regex]::replace('ABC010','(\d+)',{$args[0].ToString().TrimStart('0')})
ABC10
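The "evaluate the digits as a number" idea can also be sketched in Python, using a function as the replacement. The name strip_leading_zeros is hypothetical, and the sketch assumes the input format described in the question (letters followed by digits).

```python
import re

def strip_leading_zeros(text: str) -> str:
    """Rewrite every run of digits without its leading zeros."""
    # int() discards leading zeros; str() turns the number back into text.
    return re.sub(r"\d+", lambda m: str(int(m.group())), text)

print(strip_leading_zeros("AB012"))   # prints AB12
print(strip_leading_zeros("ABC010"))  # prints ABC10
```

A run consisting only of zeros collapses to a single 0 here, whereas trimming zeros as characters would leave nothing.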

perl regex grouping overload

I am using the following perl regex lines
$myalbum =~ s/[-_'&’]/ /g;
$myalbum =~ s/[,’.]//g;
$myalbum =~ m/([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+)/i;
to match the following strings
"30_Seconds_To_Mars_-_30_Seconds_To_Mars"
"30_Seconds_To_Mars_-_A_Beautiful_Lie"
"311_-_311"
"311_-_From_Chaos"
"311_-_Grassroots"
"311_-_Sound_System"
What I am experiencing is that for strings with less than 5 matching groups (ex. 311_-_311), attempting to print $1 $2 $3 prints nothing at all. Only strings with more than 5 matches will print.
How do I resolve this?
It looks like you just want the words in separate groups. To me, it seems like you're abusing regexes to do that when you could just run your substitutions and then split. Just do:
$myalbum =~ s/[-_'&’]/ /g;
$myalbum =~ s/[,’.]//g;
my @myalbum_list = split(/\s+/, $myalbum);  # \s+ so runs of spaces yield no empty fields
#Print out whatever it is you want/ test length, etc...
print "$myalbum_list[0] $myalbum_list[1] $myalbum_list[2]";
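The substitute-then-split approach can be sketched in Python as well. The function name album_words is hypothetical; the two character classes are copied from the question.

```python
import re

def album_words(name: str) -> list:
    """Normalize separators and punctuation, then split into words."""
    name = re.sub(r"[-_'&’]", " ", name)  # separators become spaces
    name = re.sub(r"[,’.]", "", name)     # remaining punctuation is dropped
    # split() with no argument treats any run of whitespace as one separator.
    return name.split()

print(album_words("311_-_311"))
# prints ['311', '311']
print(album_words("30_Seconds_To_Mars_-_A_Beautiful_Lie"))
# prints ['30', 'Seconds', 'To', 'Mars', 'A', 'Beautiful', 'Lie']
```

Since the result is a list, short names simply produce fewer elements instead of failing to match a fixed five-group pattern.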
The + character means one or more matches, so your regex m/([A-Z0-9\$]+) +([A-Z0-9\$]+) + ... requires all five fields to be present for it to be considered a match. You are not capturing anything because the regex is not matching at all.
You are probably looking for the * character, which means zero or more rather than one or more like +.
I suppose your capturing groups are empty for "311 - 311" because this string doesn't match your regex.
How to resolve? Use * instead of + to permit empty sequences.
Edit: From your post I guess you want to extract the album name, i.e. the part before the minus sign.
Why not match against '(.*) - (.*)', with the first group being the album and the second the title? The problem is strings like "Album with minus - sign - First track" or "My Album - Track is one - two - three". But even a human couldn't tell where the album ends and the track starts in those.