better way to replace text - regex

I am trying to split a line of text by inserting a space in the middle.
Here is my code:
$Text= "hellohello"
if($Text -match "(\w+)(o)(\w+)") {$Text = ($Matches[1] + $Matches[2] -replace "o", "o ")+$Matches[3]}
Is there a better way to do it? Say I change the text to "manymany": I want PowerShell to identify the first word automatically and insert a space after it.

The question can be read as a big one or a small one. Identifying a word programmatically is not an easy problem.
For your case, the easy one, I assume you just want to find repeated strings and insert a space between the copies.
A regular expression handles that easily.
"yesyesnonolarryhellohello" -replace "(\w{2,}?)\1",'$1 $1'
Output
yes yesno nolarryhello hello
\w{2,}? matches any run of at least 2 word characters, as few as possible; the () mark it as a capture group
\1 refers back to the first capture group
so (\w{2,}?)\1 matches yesyes, okok and hellohello; in those cases $1 holds yes, ok and hello respectively
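As a cross-check outside PowerShell, the same pattern can be tried in Python, whose re module treats the lazy quantifier and the backreference the same way for these inputs. This is only a sketch; the function name split_repeats is made up for illustration.

```python
import re

def split_repeats(text: str) -> str:
    """Insert a space between two adjacent copies of the same 2+ character run."""
    # (\w{2,}?) lazily captures the shortest run of word characters;
    # \1 requires that same run to appear again immediately afterwards.
    return re.sub(r"(\w{2,}?)\1", r"\1 \1", text)

print(split_repeats("yesyesnonolarryhellohello"))  # yes yesno nolarryhello hello
print(split_repeats("manymany"))                   # many many
```

Note that the pattern only finds a run that is immediately followed by an identical copy; it does not know anything about real words.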

Related

Regex Remove From x To y

I'm new to regex; I know the basics, but only the basics. I need to parse a string and remove everything from one string to another, for every occurrence. For example,
Here is some random text
This wants to stay
foo
This wants to be removed
bar
And this wants to stay
So the desired output would be
Here is some random text
This wants to stay
And this wants to stay
And removed would be
foo
This wants to be removed
bar
It will always follow the pattern of match 'this string' to 'that string' and remove everything in between, including 'this string' and 'that string'.
The file is a text file, for the sake of this question, the pattern will always start with foo and end with bar, removing foo, bar and everything in between.
Foo and Bar ARE part of the file and need removing.
Regexes are probably the wrong tool here. I'd probably use string equality along with the flip-flop operator.
while (<$input_fh>) {
print $output_fh unless ($_ eq "foo\n" .. $_ eq "bar\n");
}
You could do it with a regex and a match operator.
while (<$input_fh>) {
print $output_fh unless /foo/ .. /bar/;
}
That looks neater, but the regexes will match if the strings appear anywhere on an input line.
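For readers who do not know Perl's flip-flop operator, the first loop behaves roughly like a small state machine. Here is a sketch of that idea in Python; the function name drop_blocks and its parameters are hypothetical, and it uses the exact-equality test from the first loop, not the regex one.

```python
def drop_blocks(lines, start="foo", end="bar"):
    """Yield the lines that fall outside start..end blocks (markers are dropped too)."""
    inside = False
    for line in lines:
        stripped = line.rstrip("\n")
        if not inside and stripped == start:
            inside = True       # the start marker opens a block; skip it
        elif inside:
            if stripped == end:
                inside = False  # the end marker closes the block; skip it too
        else:
            yield line

text = [
    "Here is some random text\n",
    "This wants to stay\n",
    "foo\n",
    "This wants to be removed\n",
    "bar\n",
    "And this wants to stay\n",
]
print("".join(drop_blocks(text)), end="")
```

Because the comparison is on the whole line, a line that merely contains foo somewhere in the middle does not start a block.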
Update: Inverted the logic on the tests - so it's now correct.
Are you looking for something like this?
#!/usr/bin/perl
use strict;
use warnings;

my $start = "foo";
my $end   = "bar";
my $str   = "";
while (<STDIN>) {
    $str .= $_;    # slurp all input into one string
}
# /s lets . match newlines; $1 and $2 keep the text around the block
$str =~ s/(.*)$start\n.*$end\n(.*)/$1$2/s;
print $str;
The only part of real importance to you is the regex, I suppose. I declare the start and end markers, read from standard input, and append each successive line to $str. The substitution then says: keep whatever the first parentheses captured before foo, and whatever the second parentheses captured after bar ($1 and $2 in the replacement).
My output from a file containing your lines is:
marshall@marshall-desktop:~$ cat blah | ./haha
Here is some random text
This wants to stay
And this wants to stay
That's not what regexes are for. A regex is there to detect patterns; if you want a simple string slice, iterate over the big string with a plain comparison (or, in languages that provide string operations, use something like indexOf("your string here")).
However, simply using the literal string as the pattern will find the matches:
This wants to be removed will return all occurrences of that specific string, so it would work for you.

Extracting first two words in perl using regex

I want to extract the first two words from a sentence using a Perl function in PostgreSQL. In PostgreSQL, I can do this with:
text = "I am trying to make this work";
Select substring(text from '(^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)');
It would return "I am"
I tried to build a Perl function in Postgresql that does the same thing.
CREATE OR REPLACE FUNCTION extract_first_two (text)
RETURNS text AS
$$
my $my_text = $_[0];
my $temp;
$pattern = '^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)';
my $regex = qr/$pattern/;
if ($my_text=~ $regex) {
$temp = $1;
}
return $temp;
$$ LANGUAGE plperl;
But I receive a syntax error near the regular expression. I am not sure what I am doing wrong.
Extracting words is non-trivial even in English. Take the following contrived example using Locale::CLDR:
use Locale::CLDR;
my $locale = Locale::CLDR->new('en');
my @words = $locale->split_words('adf543. 123.25');
@words now contains
adf543
.
123.25
Note that the full stop after adf543 is split off as a separate word, but the one between 123 and 25 is kept as part of the number 123.25, even though the '.' is the same character.
It gets worse when you look at non-English languages, and much worse when you use non-Latin scripts.
You need to define precisely what you think a word is; otherwise the following French gets split incorrectly.
Je avais dit «Elle a dit «Il a dit «Ni» il ya trois secondes»»
The parentheses are mismatched in your regex pattern. It has three opening parentheses and four closing ones.
Also, you have two single quotes in the middle of a singly-quoted string, so
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)'
is parsed as two separate strings
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'
and
')?(\s+)?\w+)'
But I can't suggest how to fix it as I don't understand your intention.
Did you mean a double quote perhaps? In which case (!|,|\&|")? can be written as [!,&"]?
Update
At a rough guess I think you want this
my $regex = qr{ ^ \w++ \s* [-!,&"]* \s* \w+ }x;
$temp = $1 if $my_text=~ /($regex)/;
but I can't be sure. If you describe what you're looking for in English then I can help you better. For instance, it's unclear why you don't have question marks, full stops, and semicolons in the list of intervening punctuation.
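To see what that rough guess accepts, here is the same pattern sketched in Python. The function name first_two_words is hypothetical, and the possessive \w++ is written as a plain \w+ so the sketch also runs on Python versions older than 3.11; that changes backtracking behaviour, not the intent.

```python
import re

# word, optional whitespace, optional punctuation from the guessed list,
# optional whitespace, then a second word
FIRST_TWO = re.compile(r'^\w+\s*[-!,&"]*\s*\w+')

def first_two_words(text):
    """Return the leading two-word chunk, or None when nothing matches."""
    m = FIRST_TWO.match(text)
    return m.group(0) if m else None

print(first_two_words("I am trying to make this work"))  # I am
print(first_two_words("well-known fact"))                # well-known
```

One consequence of dropping the possessive quantifier: a single word such as "single" still matches, because the first \w+ gives back its last character to satisfy the second \w+.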

Regular expression help in Perl

I have following text pattern
(2222) First Last (ab-cd/ABC1), <first.last@site.domain.com> 1224: efadsfadsfdsf
(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf
I want the numbers (1224, or 1234, 4657) that follow the > in the text above.
I have this
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\@\w*\.domain.com>\s\d+:
which matches the text up to the :. But I want only the part after the email, up to the :
Is there an easy regular expression to do this, or should I use split?
Thanks
Edit: The whole text is returned by a command line tool.
(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf
(3333) - Unique ID
First Last - First and last names
<first.last@site.domain.com> - Email address in format FirstName.LastName@sub.domain.com
1234, 4657 - database primary keys
: xxxx - Headline
What I have to do is process the above and get the database IDs (in the example, 1234 and 4657 are two separate IDs) and query the tables.
The above is the output from the tool, which I call from my Perl script (I will get many entries like this).
My idea was to use a regular expression to get the database IDs. I guess I could use a regular expression for this.
You can fudge the parts you don't care about to make the expression easier: just 'glob' the text between the parentheticals (and the email delimiters) using non-greedy quantifiers:
/(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/ (not tested!)
There are only two capture groups: the ID in the leading parentheses (2222 or 3333) and the number list (1224 or 1234, 4657). From your pattern I can only assume the second means "a digit string, followed by zero or more comma-separated digit strings".
Well, a simple fix is to just allow all the possible characters in a character class. Which is to say change \d to [\d, ] to allow digits, commas and space.
Your regex as it is, though, does not match the first sample line, because it has a dash - in it (ab-cd/ABC1 does not match \w*\/\w+\d*\). Also, it is not a good idea to rely too heavily on the * quantifier, because it does match the empty string (it matches zero or more times), and should only be used for things which are truly optional. Use + otherwise, which matches (1 or more times).
You have a rather strict regex, and with slight variations in your data like this, it will fail. Only you know what your data looks like, and if you actually do need a strict regex. However, if your data is somewhat consistent, you can use a loose regex simply based on the email part:
sub extract_nums {
my $string = shift;
if ($string =~ /<[^>]*> *([\d, ]+):/) {
return $1 =~ /\d+/g; # return the extracted digits in a list
# return $1; # just return the string as-is
} else { return undef }
}
This assumes, of course, that you cannot have <> tags in front of the email part of the line. It will capture any digits, commas and spaces found between a <> tag and a colon, and then return a list of any digits found in the match. You can also just return the string, as shown in the commented line.
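The same loose approach, sketched in Python for comparison. The function name mirrors the Perl one, the pattern is unchanged, and the sample line is taken from the question; an empty list stands in for Perl's undef when nothing matches.

```python
import re

def extract_nums(line):
    """Return the digit strings found between the <...> part and the colon."""
    m = re.search(r"<[^>]*> *([\d, ]+):", line)
    if m is None:
        return []                          # no <...> followed by numbers and a colon
    return re.findall(r"\d+", m.group(1))  # pull the individual numbers out

line = "(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf"
print(extract_nums(line))  # ['1234', '4657']
```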
There would appear to be something missing from your examples. Is this what they're supposed to look like, with email?
(1234) First Last (ab-cd/ABC1), <foo.bar@domain.com> 1224: efadsfadsfdsf
(1234) First Last (abcd/ABC12), <foo.bar@domain.com> 1234, 4657: efadsfadsfdsf
If so, this should work:
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\@\w*\.domain\.com>\s\d+(?:,\s(\d+))?:
$string =~ /.*>\s*(.+):.+/;
$numbers = $1;
That's it.
Tested.
With number catching:
$string =~ /.*>\s*([0-9, ]+):.+/;
$numbers = $1;
Not tested but you get the idea.

perl regex grouping overload

I am using the following perl regex lines
$myalbum =~ s/[-_'&’]/ /g;
$myalbum =~ s/[,’.]//g;
$myalbum =~ m/([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+)/i;
to match the following strings
"30_Seconds_To_Mars_-_30_Seconds_To_Mars"
"30_Seconds_To_Mars_-_A_Beautiful_Lie"
"311_-_311"
"311_-_From_Chaos"
"311_-_Grassroots"
"311_-_Sound_System"
What I am experiencing is that for strings with fewer than 5 words (e.g. 311_-_311), attempting to print $1 $2 $3 prints nothing at all. Only strings with 5 or more words will print.
How do I resolve this?
It looks like you just want the words in separate groups. To me, it seems like you're abusing regexes to do that when you could just run your substitutions and then split. Just do:
$myalbum =~ s/[-_'&’]/ /g;
$myalbum =~ s/[,’.]//g;
my @myalbum_list = split(' ', $myalbum); # ' ' splits on runs of whitespace, so "311   311" gives two fields
#Print out whatever it is you want/ test length, etc...
print "$myalbum_list[0] $myalbum_list[1] $myalbum_list[2]";
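The substitute-then-split idea can be sketched in Python as follows. The two character classes are copied from the question, and the function name album_words is hypothetical.

```python
import re

def album_words(name):
    """Normalise separators to spaces, drop punctuation, and split into words."""
    name = re.sub(r"[-_'&’]", " ", name)  # separators become spaces
    name = re.sub(r"[,’.]", "", name)     # remaining punctuation is removed
    return name.split()                   # split() collapses runs of whitespace

print(album_words("311_-_311"))
print(album_words("30_Seconds_To_Mars_-_A_Beautiful_Lie"))
```

Because the result is a list of whatever length the input produces, there is no fixed number of groups that has to match.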
The + quantifier means one or more matches, which means your regex m/([A-Z0-9\$]+) +([A-Z0-9\$]+) + ... requires all five fields to be present before it counts as a match. You are not capturing anything because the regex is not matching at all.
You are probably looking for the * quantifier, which means zero or more rather than one or more like +.
I suppose your capturing groups are empty for "311 - 311" because this string doesn't match your regex.
How to resolve it? Use * instead of + to permit empty sequences.
Edit: From your post I guess you want to extract the album name, i.e. the part before the minus sign.
Why not match against '(.*) - (.*)', with the first group being the album and the second the title? The problem is with strings like "Album with minus - sign - First track" or "My Album - Track is one - two - three". But even a human wouldn't know where the album ends and the track starts there.

How to match the pattern between a specified characters?

I have a FIX message delimited by "|"; each field between the delimiters is a tag=value pair:
(8=FIX.4.2|9=0360|35=8|49=BLPFT|56=ESP|34=8415|52=20110201-15:59:59|50=MBA|143=LN|115=MSET|57=2457172|30=CHIX|60=20110201-15:59:59.121|150=1|31=56.3100|151=71785|32=137|6=56.4058|37=9D9ZIhgu4BGU9sBtfHcYeQA|38=97370|39=1|40=1|11=20110201-05529|12=0.0012|13=2|14=25585|15=EUR|76=CHIXCCP|17=272674|47=A|167=CS|18=1|48=FR0000131104|20=0|21=1|22=4|113=N|54=1|55=BNP|207=FP|29=1|59=0|10=205|)
How do I extract the data between "11=" and the first occurrence of "|" after it?
For example, I want the value
20110201-05529
which is between "|11=" and "|"
Can you please tell me the regular expression?
The best approach will depend on how much you know about the data you are trying to match. If you know it will be comprised of numbers and dashes only:
m/11=([0-9\-]+)/
Conversely, if the data could contain any kind of characters, use:
m/11=([^|]+)/
Which matches anything that isn't a pipe character. This is probably the most reliable expression.
In both cases, the data you want is captured into the $1 special variable.
If you don't always want to match the value for the key 11, you can use variables in the pattern, so:
my $key = 11; # or any other tag number
if ($text =~ m/\b$key=([^|]+)/) { # \b stops 11= from matching inside 111=
print "I found $1"; # prints "I found 20110201-05529"
}
As always, there is more than 1 way to solve the problem. Therefore, there is no such thing as "the regular expression".
But you will definitely want to read perldoc -f split.
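The split-based route can be sketched like this in Python. The function name parse_fix is hypothetical, and the sample is a shortened version of the message in the question.

```python
def parse_fix(message, delimiter="|"):
    """Split a delimited tag=value message into a dict keyed by tag."""
    fields = {}
    for pair in message.strip("()").split(delimiter):
        if not pair:
            continue                      # skip the empty piece after a trailing delimiter
        tag, _, value = pair.partition("=")
        fields[tag] = value               # a repeated tag overwrites the earlier value
    return fields

sample = "8=FIX.4.2|9=0360|35=8|11=20110201-05529|12=0.0012|10=205|"
print(parse_fix(sample)["11"])  # 20110201-05529
```

Looking a tag up in the dict avoids the question of whether 11= might also match inside a longer tag such as 111=.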
Something like this will match everything other than =, then everything other than |:
[^=]+=([^|]+)