What regular expression can remove duplicate items from a string?

Given a string of identifiers separated by :, is it possible to construct a regular expression to extract the unique identifiers into another string, also separated by :?
How is it possible to achieve this using a regular expression? I have tried s/(:[^:])(.*)\1/$1$2/g with no luck, because the (.*) is greedy and skips to the last match of $1.
Example: a:b:c:d:c:c:x:c:c:e:e:f should give a:b:c:d:x:e:f
Note: I am coding in Perl, and I would very much prefer a regex-based solution.

In .NET, which supports unbounded repetition inside lookbehind, you could search for
(?<=\b\1:.*)\b(\w+):?
and replace all matches with the empty string.
Perl (at least Perl 5) only supports fixed-length lookbehinds, so you can try the following (using lookahead, with a subtly different result):
\b(\w+):(?=.*\b\1:?)
If you replace that with the empty string, all previous repetitions of a duplicate entry will be removed; the last one will remain. So instead of
a:b:c:d:x:e:f
you would get
a:b:d:x:c:e:f
If that is OK, you can use
$subject =~ s/\b(\w+):(?=.*\b\1:?)//g;
Explanation:
First regex:
(?<=\b\1:.*): Check if you can match the contents of backreference no. 1, followed by a colon, somewhere before in the string.
\b(\w+):?: Match an identifier (from a word boundary to the next :), optionally followed by a colon.
Second regex:
\b(\w+):: Match an identifier and a colon.
(?=.*\b\1:?): Then check whether you can match the same identifier, optionally followed by a colon, somewhere ahead in the string.
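
For readers who want to try the lookahead variant outside Perl, here is a minimal Python sketch. Python is not part of the original answer, and its re module has no unbounded lookbehind either, so only the second regex is shown. The function names are made up for illustration, and the stricter pattern in the second function is an assumption about what is usually wanted, not something the answer proposes.

```python
import re

def remove_earlier_duplicates(ids: str) -> str:
    """Drop every identifier that occurs again later; the last copy survives."""
    # Same pattern as the Perl substitution in the answer.
    return re.sub(r'\b(\w+):(?=.*\b\1:?)', '', ids)

def remove_earlier_duplicates_strict(ids: str) -> str:
    """Variant (an assumption, not from the answer): \\b after \\1 requires
    the later occurrence to be a whole identifier, not just a prefix."""
    return re.sub(r'\b(\w+):(?=.*\b\1\b)', '', ids)

print(remove_earlier_duplicates("a:b:c:d:c:c:x:c:c:e:e:f"))         # a:b:d:x:c:e:f
print(remove_earlier_duplicates_strict("a:b:c:d:c:c:x:c:c:e:e:f"))  # a:b:d:x:c:e:f
```

With multi-character identifiers the two differ: for the input a:ab the first function removes a: (because ab starts with a), while the strict one leaves the string unchanged.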

Check out: http://www.regular-expressions.info/duplicatelines.html
Always a useful site when thinking about any regular expression.

use feature 'say';
$str = q!a:b:c:d:c:c:x:c:c:e:e:f!;
1 while ($str =~ s/(:[^:]+)(.*?)\1/$1$2/g);
say $str;
Output:
a:b:c:d:x:e:f
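
The same repeat-until-stable idea can be sketched in Python; this is a translation for illustration, with a hypothetical function name, not code from the answer. The pattern is copied unchanged, so it inherits the same limits: the first identifier has no leading colon and is therefore never treated as the original of a later duplicate.

```python
import re

def dedupe_keep_first(ids: str) -> str:
    """Apply the substitution repeatedly until the string stops changing."""
    pattern = re.compile(r'(:[^:]+)(.*?)\1')
    while True:
        # Keep the first copy (\1) and the text in between (\2), drop the repeat.
        reduced = pattern.sub(r'\1\2', ids)
        if reduced == ids:
            return ids
        ids = reduced

print(dedupe_keep_first("a:b:c:d:c:c:x:c:c:e:e:f"))  # a:b:c:d:x:e:f
print(dedupe_keep_first("a:b:a"))                    # a:b:a (first item is not colon-prefixed)
```

The loop is needed because a single global pass only removes non-overlapping repeats; each pass shortens the string, so it terminates.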

Here's an awk version that needs no regex.
$ echo "a:b:c:d:c:c:x:c:c:e:e:f" | awk -F":" '{for(i=1;i<=NF;i++)if($i in a){continue}else{a[$i];printf $i}}'
abcdxef
Split the fields on ":", loop over them, and record each one in an array. If a field is already in the array, skip it; otherwise print it. Note that printf $i prints the fields without the : separators, which is why the output is abcdxef. You can translate this easily into Perl code.
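
As a rough illustration of the same split-and-remember approach, here is a short Python sketch (the answer suggests Perl; Python is used here only as a neutral example, and the function name is hypothetical). Unlike the awk one-liner, it joins the result with the separator again, which is what the question asks for.

```python
def unique_in_order(ids: str, sep: str = ":") -> str:
    """Keep the first occurrence of each field, preserving order."""
    seen = set()
    kept = []
    for item in ids.split(sep):
        if item not in seen:      # first time we meet this identifier
            seen.add(item)
            kept.append(item)
    return sep.join(kept)

print(unique_in_order("a:b:c:d:c:c:x:c:c:e:e:f"))  # a:b:c:d:x:e:f
```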

If the identifiers are sorted, you may be able to do it using lookahead/lookbehind. If they aren't, then this is beyond the computational power of a formal regular expression. Just because it's impossible with a formal regex doesn't mean it's impossible with some Perl-specific regex feature, but if you want to keep your regexes portable, you need to handle this string in a language that supports variables.

Related

Perl - Split a string containing tuples using regex

I have a string containing tuples like below:
"(-0.345205479452055,1.3543),(-0.26027397260274,1.218),(-0.183561643835616,1.3028)"
I am trying to split this string to an array containing just the tuples: [(-0.345205479452055,1.3543),(-0.26027397260274,1.218),(-0.183561643835616,1.3028)]
I cannot use split as shown below, because it also splits inside each tuple. Is there a regex or some clever way to get the tuples as-is?
@Tuples = split /,/,$myString;

split can be used for this but requires a slightly more detailed expression.
my $str = "(-0.345205479452055,1.3543),(-0.26027397260274,1.218),(-0.183561643835616,1.3028)";
my @arr1 = split(/(?<=\)),(?=\()/, $str);
The key here is the use of a zero-width look-behind assertion for checking for a closing paren and the use of a zero-width look-ahead assertion (not exactly necessary here but useful to see) to check for an open paren. Check the perlre docs for more info on these.
Alternatively, if you want to avoid split altogether then you can use a global match as well.
my @arr2 = $str =~ /(\([^)]+\))/g;
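
Both approaches carry over to other regex engines with lookaround support. The following Python sketch is only an illustration (the answer itself is Perl, and the variable names are arbitrary); it uses the same two patterns on the string from the question.

```python
import re

data = "(-0.345205479452055,1.3543),(-0.26027397260274,1.218),(-0.183561643835616,1.3028)"

# Split only on commas that sit between ")" and "(".
by_split = re.split(r'(?<=\)),(?=\()', data)

# Or skip splitting and collect every parenthesised group directly.
by_match = re.findall(r'\([^)]+\)', data)

print(by_split)
print(by_split == by_match)  # True
```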

If your data is structured consistently the way you showed, you can use a lookbehind to check and see if the comma comes after a parenthesis.
/(?<=\)),/
You could also use a negative lookbehind to see if a number is before the comma, and not split there, though that could be confusing to understand.
/(?<!\d),/

If there are no parentheses inside the tuples, and no parentheses outside the tuples, you could simply use the following regex:
my @array = $str =~ /(\(.*?\))/sg;
assuming that there is always a starting parenthesis and a matching end parenthesis for each tuple.
Here
*? is a nongreedy quantifier, see perlretut for more information,
the flag s is a regex modifier that lets the . also match newline characters (in case your string contains newlines), see perlre for more information,
the flag g stands for global matching and allows the matching operator to match within a string as many times as possible.
In list context, g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regexp, see perlretut for more information.

How can I replace this data in between certain delimiter with Notepad ++?

I have a list of data in this format
0000000000000000|000|000|00000|000000|CITY|GA|123456|8001234567
I need to replace the last piece of data with the word N/A so there is no phone number in the list.
0000000000000000|000|000|00000|000000|CITY|GA|123456|N/A
Thank you for the assistance, much appreciated.

The simplest and fastest solution for that would be to search for
[^|\r\n]+$
and replace all matches with N/A.
Explanation:
[^|\r\n]+ matches one or more characters except | or newlines, and $ makes sure that the match only occurs at the end of a line.
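
To check the pattern outside Notepad++, here is a small Python sketch. It is only an approximation of the editor's behaviour: the function name is hypothetical, re.MULTILINE stands in for Notepad++ treating each line separately, and the sample text assumes plain \n line endings.

```python
import re

def blank_last_field(text: str, placeholder: str = "N/A") -> str:
    """Replace the final |-separated field on every line."""
    # With MULTILINE, $ matches at the end of each line, not just the end of the text.
    return re.sub(r'[^|\r\n]+$', placeholder, text, flags=re.MULTILINE)

row = "0000000000000000|000|000|00000|000000|CITY|GA|123456|8001234567"
print(blank_last_field(row))
# 0000000000000000|000|000|00000|000000|CITY|GA|123456|N/A
```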

Do a find/replace, with the mode set to "Regular expression".
Find:
(.*)\|[0-9]*
Replace:
\1|N/A

If your phone numbers contain any non-numeric characters (such as periods, hyphens, spaces, etc.), then I would recommend the following adjustment to the regex given by @Bitwise:
(.*)\|(.*)$
Also, in Notepad++ the backreference in the replacement can be written as
$1
as well as
\1
(older versions may accept only one of the two forms), which means your replace string can also be
$1|N/A

You can use
(?!.*\|)(.+)
to match the last field, that is, everything after the final | on the line.
In Notepad++ you can use the search and replace (regex) function.

Separate backreference followed by numeric literal in perl regex

I found this related question : In perl, backreference in replacement text followed by numerical literal
but it seems entirely different.
I have a regex like this one
s/([^0-9])([xy])/\1 1\2/g
                   ^
                   whitespace here
But that whitespace shows up in the substituted string.
How do I get rid of the whitespace without Perl reading the backreference as \11?
For example, 15+x+y changes to 15+ 1x+ 1y.
I want to get 15+1x+1y.

\1 is a regex atom that matches what the first capture captured. It makes no sense to use it in a replacement expression. You want $1.
$ perl -we'$_="abc"; s/(a)/\1/'
\1 better written as $1 at -e line 1.
In a string literal (including the replacement expression of a substitution), you can delimit $var using curlies: ${var}. That means you want the following:
s/([^0-9])([xy])/${1}1$2/g
The following is more efficient (although gives a different answer for xxx):
s/[^0-9]\K(?=[xy])/1/g
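
The same ambiguity exists in other regex flavours. As an illustrative aside (Python is not part of the original answer, and the function names are made up), Python's counterpart to ${1} is \g<1>, and since Python has no \K, the second variant is sketched with a lookbehind instead.

```python
import re

def add_coefficient(expr: str) -> str:
    # \g<1> ends the group reference explicitly, so the literal 1 that
    # follows is not read as part of the group number (as in \11).
    return re.sub(r'([^0-9])([xy])', r'\g<1>1\2', expr)

def add_coefficient_lookaround(expr: str) -> str:
    # Zero-width match: insert 1 between a non-digit and a following x or y.
    return re.sub(r'(?<=[^0-9])(?=[xy])', '1', expr)

print(add_coefficient("15+x+y"))             # 15+1x+1y
print(add_coefficient_lookaround("15+x+y"))  # 15+1x+1y
print(add_coefficient("xxx"))                # x1xx
print(add_coefficient_lookaround("xxx"))     # x1x1x
```

The last two lines show the difference mentioned above for xxx: the capturing version consumes both characters of each match, while the zero-width version can match at every position.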

Just put braces around the number:
s/([^0-9])([xy])/${1}1${2}/g

Regex detect if a matched comma(,) does not lie in a regex

I am trying to figure out a way to determine if my matched comma (,) does not lie inside a regex. Basically, I do not want to match my character if it lies in a regex.
The regex I have come up with is ,(?<!.+\/)(?!.+\/) but it's not quite working.
Any ideas?
I want to skip /some,regex/ but match any other commas.
Edit:
Live example: http://rubular.com/r/WjrwSnmzyP

Here is the regex that will work for you:
,(?!\s)(?=(?:(?:[^/]*\/){2})*[^/]*$)
Live Demo: http://rubular.com/r/37buDdg1tW
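Explanation: It matches a comma that is followed by an EVEN number of forward slashes (/) up to the end of the string. Hence a comma (,) between two slash (/) characters will NOT be matched, while commas outside will be matched (since those are followed by an even number of / characters). The (?!\s) additionally rejects a comma that is followed by whitespace.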
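
A quick way to see the even-slash rule at work is to list the positions the pattern matches. The sketch below uses Python purely for illustration (the question is about Ruby-style regexes on Rubular); the constant and function names are hypothetical, and the sample string is made up.

```python
import re

# Comma, not followed by whitespace, with an even number of "/" after it.
OUTSIDE_COMMA = re.compile(r',(?!\s)(?=(?:(?:[^/]*/){2})*[^/]*$)')

def commas_outside_regex(text: str) -> list:
    """Return the index of every comma that is not between two slashes."""
    return [m.start() for m in OUTSIDE_COMMA.finditer(text)]

print(commas_outside_regex("/foo,bar/,baz"))  # [9]
```

The approach assumes slashes always come in pairs and are never escaped inside the regex literal.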

A curious thing about regular expressions is that if you want to use them to ignore "something" that is within "something else", you need to match that "something else", prefer matches of it, and then either silently discard or reproduce those matches.
For example, in order to remove all commas from a string unless they are in a regular expression literal—
In Perl:
my $s = "/foo,bar/,baz";
$s =~ s{(/(?:[^/\\]|\\.)+/)|,}{\1}g;
In ECMAScript:
var s = "/foo,bar/,baz";
s = s.replace(/(\/([^\/\\]|\\.)+\/)|,/g, "$1");
or
s = s.replace(new RegExp("(/([^/\\\\]|\\\\.)+/)|,", "g"), "$1");
Note that I am capturing the match for the regular expression literal in the string value, and reproducing it (\1 or $1) if it matched. (If the other part of the alternation – the standalone comma – matched, the empty string is captured, so this simple approach suffices here.)
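
For comparison, the same match-and-reproduce technique can be sketched in Python (not one of the languages in this answer; the names are hypothetical). A replacement function makes the "reproduce the literal or emit nothing" step explicit.

```python
import re

# Alternative 1 captures a whole /.../ literal (allowing backslash escapes);
# alternative 2 is a bare comma outside any literal.
LITERAL_OR_COMMA = re.compile(r'(/(?:[^/\\]|\\.)+/)|,')

def strip_commas_outside_literals(text: str) -> str:
    # If group 1 matched, put the literal back; otherwise drop the comma.
    return LITERAL_OR_COMMA.sub(lambda m: m.group(1) or '', text)

print(strip_commas_outside_literals("/foo,bar/,baz"))  # /foo,bar/baz
```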
For further reading I recommend “Mastering Regular Expressions” by Jeffrey E. F. Friedl. Two rather enlightening example chapters, each from a different edition, are available for free online.

sed: Can my pattern contain an "is not" character? How do I say "is not X"?

How do I say "is not" a certain character in sed?

[^x]
This is a character class that accepts any character except x.

For those not satisfied with the selected answer as per johnny's comment.
'su[^x]' will match 'sum' and 'sun' but not 'su'.
You can tell sed to not match lines with x using the syntax below:
sed '/x/! s/su//' file
See kkeller's answer for another example.

There are two possible interpretations of your question. Like others have already pointed out, [^x] matches a single character which is not x. But an empty string also isn't x, so perhaps you are looking for [^x]\|^$.
Neither of these answers extend to multi-character sequences, which is usually what people are looking for. You could painstakingly build something like
[^s]\|s\($\|[^t]\|t\($\|[^r]\)\)
to compose a regular expression which doesn't match str, but a much more straightforward solution in sed is to delete any line which does match str, then keep the rest;
sed '/str/d' file
Perl 5 introduced a much richer regex engine, which has since become the de facto standard in Java, PHP, Python, etc. Because Perl helpfully supports a subset of sed syntax, you could probably convert a simple sed script to Perl in order to use a feature from this extended regex dialect, such as negative assertions:
perl -pe 's/(?:(?!str).)+/not/' file
will replace a string which is not str with not. The (?:...) is a non-capturing group (unlike in many sed dialects, an unescaped parenthesis is a metacharacter in Perl) and (?!str) is a negative assertion; the text immediately after this position in the string mustn't be str in order for the regex to match. The + repeats this pattern until it fails to match. Notice how the assertion needs to be true at every position in the match, so we match one character at a time with . (newbies often get this wrong, and erroneously only assert at e.g. the beginning of a longer pattern, which could however match str somewhere within, leading to a "leak").
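
To make the "leak" concrete, here is a small Python sketch contrasting the per-character assertion with the mistaken version that asserts only once. Python is used only because its regex dialect has the same negative lookahead; the function names and the sample input are made up for illustration.

```python
import re

def replace_first_run_without(line: str, word: str = "str", replacement: str = "not") -> str:
    """Replace the first run of characters in which `word` never starts."""
    pattern = r'(?:(?!' + re.escape(word) + r').)+'
    return re.sub(pattern, replacement, line, count=1)

def leaky_version(line: str) -> str:
    """Mistaken variant: the assertion is checked only at the first position."""
    return re.sub(r'(?!str).+', 'not', line, count=1)

print(replace_first_run_without("abcstrdef"))  # notstrdef
print(leaky_version("abcstrdef"))              # not
```

In the leaky version the .+ runs straight over the embedded str, because nothing checks the assertion after the first character.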

From my own experience, and the post below supports this, sed doesn't support negating a whole substring with "^"; inside a bracket expression, "^" only negates a single character. I don't think sed has a direct method for negating a longer pattern, but if you check the post below, you'll see some workarounds.
Sed regex and substring negation

In addition to all the provided answers, you can negate a POSIX character class in sed using the notation [^[:C_CLASS:]]. For example, [^[:blank:]] will match any character that is not a blank (a space or a tab).
