Extracting substrings ending with "mp4" - regex

I have the following input:
string='GET........ref=mp4;GET........ref=flv;GET........ref=mp4;'
It has 3 segments. I need to extract the segments ending with mp4;.
ie.
GET........ref=mp4
GET........ref=mp4
My current regex matches GET........ref=mp4 and GET........ref=flv;GET........ref=mp4;.
My regular expression: GET(.*?)mp4
I don't want the long match that contains flv, and this regex does not work either: GET(.*?)(?!:flv)mp4
I don't know how to solve this, and any help is appreciated.

You can explode the semi-colon separated list and then use preg_grep to get only the elements that end with mp4:
$string='GET........ref=mp4;GET........ref=flv;GET........ref=mp4;';
$res = explode(";", $string);
$res = preg_grep('/mp4$/i', $res);
print_r($res);
See IDEONE demo
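For comparison, the same split-and-filter idea can be sketched in Python. This is only an illustration of the technique, not part of the PHP answer; the function name segments_ending_with is made up for the example.

```python
def segments_ending_with(text, suffix="mp4", sep=";"):
    """Split text on sep and keep the pieces that end with suffix.

    The comparison is case-insensitive, mirroring the /i flag in the
    preg_grep call above.
    """
    pieces = text.split(sep)
    return [p for p in pieces if p.lower().endswith(suffix.lower())]


string = "GET........ref=mp4;GET........ref=flv;GET........ref=mp4;"
print(segments_ending_with(string))
```

The empty piece produced by the trailing semicolon is dropped automatically because it does not end with the suffix.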
If there are no semicolons and everything is glued together:
// NO SEMI_COLONS
$str='GET........ref=mp4GET........ref=flvGET........ref=mp4';
preg_match_all('/GET\b(?:(?!GET\b).)*mp4(?=$|GET\b)/', $str, $res);
print_r($res);
See another IDEONE demo
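The glued-input pattern also works with Python's re module, which may help to see why it avoids the long match. A minimal sketch, assuming the input really has no separators; the helper name find_mp4_segments is hypothetical.

```python
import re

# Each "." is consumed only if a new "GET" does not start at that position,
# so a single match can never run across the start of the next segment.
# The lookahead at the end requires "mp4" to be followed by the end of the
# input or by the next "GET".
GLUED_MP4 = re.compile(r"GET\b(?:(?!GET\b).)*mp4(?=$|GET\b)")


def find_mp4_segments(text):
    """Return every GET...mp4 segment from a separator-free string."""
    return GLUED_MP4.findall(text)


glued = "GET........ref=mp4GET........ref=flvGET........ref=mp4"
print(find_mp4_segments(glued))
```

The flv segment is skipped because the scan for mp4 is not allowed to continue past the next GET.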

First things first, you need to split your string into tokens:
http://get........ref=mp4
http://get........ref=flv
http://get........ref=mp4
and then apply your regex. If you need it to start with http and end with mp4, use "^http.*mp4$".
The ^ means the beginning of the line, $ means the end of the line, and .* means match any character 0 or more times. An example using sed to split the input, for instance:
echo "http://get........ref=mp4;http://get........ref=flv;http://get........ref=mp4a;" | sed s/';'/\\n/g | grep "^http.*mp4$"
EDIT: if ';' is not your real separator, replace it with whatever is the real separator.

If you are looking for a slightly cleaner approach that works with or without the ;
preg_match_all("/GET(?:(?!GET).)*=mp4/", $str, $res);
print_r($res);

Related

can sed replace words in pattern substring match in one line?

original line in file sed.txt:
outer_string_PATTERN_string(PATTERN_And_PATTERN_PATTERN_i)PATTERN_outer_string(i_PATTERN_inner)_outer_string
I only need to replace PATTERN with pattern inside the brackets. It does not have to be lowercasing; the replacement could be any other word.
Expected result:
outer_string_PATTERN_string(pattern_And_pattern_pattern_i)PATTERN_outer_string(i_pattern_inner)_outer_string
I can use the ([^)]*) pattern to find the substrings in which some words should be replaced. But I can't use this pattern to restrict the substitution to those substrings; instead it replaces every PATTERN on the whole line with pattern.
:/tmp$ sed 's/([^)]*)/---/g' sed.txt
outer_string_PATTERN_string---PATTERN_outer_string---_outer_string
:/tmp$ sed '/([^)]*)/s/PATTERN/pattern/g' sed.txt
outer_string_pattern_string(pattern_And_pattern_pattern_i)pattern_outer_string(i_pattern_inner)_outer_string
I also tried to use the regex group in sed to capture and replace the words, but I can't figure out the command.
Can sed implement that? And how to achieve that? THX.
Can sed implement that?
It can be done using GNU sed and basic regular expressions
(BRE):
sed '
s/)/)\n/g
:1
s/\(([^)]*\)PATTERN\([^)]*)\n\)/\1pattern\2/
t1
s/\n//g
' < file
where
1st s inserts a newline after each )
2nd s replaces the last (* is greedy) PATTERN inside ()s with pattern
t loops back if a substitution was made
3rd s strips all inserted newlines
EDIT
2nd substitute command edited according to OP's suggestion
since there is no need to match \n inside ().
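For readers who are not tied to sed, the same effect can be sketched in Python with a replacement callback. This is an illustration under the assumption that the parentheses are balanced and not nested; the function name replace_inside_parens is made up.

```python
import re


def replace_inside_parens(line, word="PATTERN", replacement="pattern"):
    """Replace word with replacement only inside (...) groups.

    Each "(...)" group is matched as a whole and rewritten by the callback,
    so text outside the parentheses is never touched.
    """
    return re.sub(
        r"\([^)]*\)",
        lambda m: m.group(0).replace(word, replacement),
        line,
    )


line = ("outer_string_PATTERN_string(PATTERN_And_PATTERN_PATTERN_i)"
        "PATTERN_outer_string(i_PATTERN_inner)_outer_string")
print(replace_inside_parens(line))
```

No loop is needed here because the callback handles every occurrence inside a group in one pass, which is what the t1 loop emulates in sed.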
Can sed implement that?
Yes. But you do not want to do it in sed. Use other programming language, like Python, Perl, or awk.
how to achieve that?
Implementing non-greedy regex is not simple in sed. In general, it consists of:
taking a chunk of the input
processing the chunk
putting it in hold space
shuffling hold space with pattern space to separate what has already been processed from what has not
repeating
shuffling with hold space again
outputting the result
Anyway, the following script:
#!/bin/bash
sed <<<'outer_string_PATTERN_string(PATTERN_i_PATTERN_PATTERN_i)PATTERN_outer_string(i_PATTERN_inner)_outer_string' '
:loop;
/\([^(]*\)\(([^)]*)\)\(.*\)/{
# Lowercase the second part.
s//\1\L\2\E\n\3/;
# Mix with hold space.
G;
s/\(.*\)\n\(.*\)\n\(.*\)/\3\1\n\2/;
# Put processed stuff into hold space
h; s/\n.*//; x;
# Process the other stuff again.
s/.*\n//;
bloop;
};
# Is hold space empty?
x; /^$/!{
# Pattern space has trailing stuff - add it.
G; s/\n//;
# We will print it.
h;
# Clear hold space
s/.*//
};x;
'
outputs:
PATTERN_outer_string(i_pattern_inner)outer_string_PATTERN_string(pattern_i_pattern_pattern_i)_outer_string
As an alternative, it is easier to do this in gnu awk with RS that matches (...) substring:
awk -v RS='\\([^)]+)' '{gsub(/PATTERN/, "pattern", RT); ORS=RT} 1' file
outer_string_PATTERN_string(pattern_i_pattern_pattern_i)PATTERN_outer_string(i_pattern_inner)_outer_string
Steps:
RS='\\([^)]+)' captures a (...) string as record separator
gsub function then replaces PATTERN with pattern in matched text i.e. RT
ORS=RT sets ORS as the new modified RT
1 prints each record to stdout
Another alternative solution using lookahead assertion in a perl regex:
perl -pe 's/PATTERN(?=[^()]*\))/pattern/g' file
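The same lookahead works unchanged in Python's re module. A small sketch, assuming balanced, non-nested parentheses (the lookahead only checks that a ")" comes before any other bracket); the function name is hypothetical.

```python
import re


def lowercase_in_parens_lookahead(line):
    """Replace PATTERN only where the next bracket to the right is ")".

    [^()]* skips over non-bracket characters, so the lookahead succeeds
    only when the current position is inside an open (...) group.
    """
    return re.sub(r"PATTERN(?=[^()]*\))", "pattern", line)


line = ("outer_string_PATTERN_string(PATTERN_And_PATTERN_PATTERN_i)"
        "PATTERN_outer_string(i_PATTERN_inner)_outer_string")
print(lowercase_in_parens_lookahead(line))
```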
Solved by this:
:/tmp$ sed 's/(/\n(/g' sed.txt | sed 's/)/)\n/g' | sed '/([^)]*)/s/PATTERN/pattern/g' | sed ':a;N;$!ba;s/\n//g'
outer_string_PATTERN_string(pattern_And_pattern_pattern_i)PATTERN_outer_string(i_pattern_inner)_outer_string
put each (...) group on its own line
find the (...) lines and replace PATTERN with pattern
merge the lines back into one line
Thanks to the question "How can I replace a newline (\n) using sed?"

Getting rid of all words that contain a special character in a textfile

I'm trying to filter out all the words that contain any character other than a letter from a text file. I've looked around stackoverflow, and other websites, but all the answers I found were very specific to a different scenario and I wasn't able to replicate them for my purposes; I've only recently started learning about Unix tools.
Here's an example of what I want to do:
Input:
#derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag
Output:
I was there and it was awesome!
So words with punctuation can stay in the file (in fact I need them to stay) but any substring with special characters (including those of punctuation) needs to be trimmed away. This can probably be done with sed, but I just can't figure out the regex. Help.
Thanks!
Here is how it could be done using Perl:
perl -ane 'for $f (@F) {print "$f " if $f =~ /^([a-zA-Z-\x27]+[?!;:,.]?|[\d.]+)$/} print "\n"' file
I am using this input text as my test case:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
#derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag
output:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
I was there; it was awesome!
Command-line options:
-n loop around every line of the input file, do not automatically print it
-a autosplit mode: split input lines into the @F array. Defaults to splitting on whitespace
-e execute the perl code
The perl code splits each input line into the @F array, then loops over every field $f and decides whether or not to print it.
At the end of each line, it prints a newline character.
The regular expression ^([a-zA-Z-\x27]+[?!;:,.]?|[\d.]+)$ is used on each whitespace-delimited word
^ starts with
[a-zA-Z-\x27]+ one or more lowercase or capital letters or a dash or a single quote (\x27)
[?!;:,.]? zero or one of the following punctuation: ?!;:,.
(|) alternately match
[\d.]+ one or more numbers or .
$ end
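To make the word-by-word logic concrete, here is a rough Python equivalent of the one-liner. It is a sketch of the same filter, not a drop-in replacement; the name keep_plain_words is invented for the example.

```python
import re

# A word is kept if it is letters, dashes or apostrophes with at most one
# trailing punctuation mark, or if it consists only of digits and dots.
WORD = re.compile(r"[A-Za-z'-]+[?!;:,.]?|[\d.]+")


def keep_plain_words(line):
    """Drop every whitespace-delimited word that does not fully match WORD."""
    return " ".join(w for w in line.split() if WORD.fullmatch(w))


print(keep_plain_words("#derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag"))
print(keep_plain_words("I'd like 2.5 cups of piping-hot coffee."))
```

Using fullmatch plays the role of the ^ and $ anchors in the Perl regex.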
Your requirements aren't clear at all but this MAY be what you want:
$ awk '{rec=sep=""; for (i=1;i<=NF;i++) if ($i~/^[[:alpha:]]+[[:punct:]]?$/) { rec = rec sep $i; sep=" "} print rec}' file
I was there and it was awesome!
sed -E 's/[[:space:]][^a-zA-Z0-9[:space:]][^[:space:]]*//g' will get rid of any words starting with punctuation. Which will get you half way there.
[[:space:]] is any whitespace character
[^a-zA-Z0-9[:space:]] is any special character
[^[:space:]]* is any number of non whitespace characters
Do it again with a ^ instead of the first [[:space:]] to remove those same words at the start of the line.

what regex to extract all data except within <> in perl?

I have string
Message <Network=Data Center> All Verified
I need to extract the whole string except the part in angle brackets.
I tried
m/(?![^<]*\\>)/s
but it does not give the desired result.
Removing <..> regions
It's easier to remove the <..> parts from the string and then deal with the remaining string.
Try this oneliner:
cat file | perl -pne 's/<[^>]*?>//g;'
For your sample input, this is the output:
Message All Verified
Note that the non-greedy quantifier ? is used in the regex. Also, because this is a one-liner, the s/// search-and-replace construct is applied to the implicit variable $_ (which holds a line from standard input). So after the search and replace has run, $_ will be altered (there will be no <..> regions in it). The -p switch is used to print $_ after running the block of code. You can read more about Perl command-line switches in perlrun.
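The removal approach carries over directly to other regex engines. A minimal Python sketch, with the hypothetical helper name strip_angle_regions:

```python
import re


def strip_angle_regions(text):
    """Delete every <...> region; the text around it is left as it is."""
    return re.sub(r"<[^>]*>", "", text)


result = strip_angle_regions("Message <Network=Data Center> All Verified")
print(repr(result))
```

Note that the spaces on both sides of the removed region are kept, so the result contains two consecutive spaces; if that matters, the whitespace can be collapsed afterwards, for example with " ".join(result.split()).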
This is one solution. Below there is another one:
Capturing regions outside of <..>
On the other hand, you can (if you want) match the parts outside of the <..> regions.
In order to do that, let's build a regex. First, we want a region free of < and >. The following regex matches just that:
$p = ([^<>]*).
Next, we want to match everything before <, and for that we can write (?:$p<) and everything after >, and that's (?:>$p).
Now if we assemble all those parts together we get (?:>$p)|(?:$p<).
Notice that (?:) is a non-capturing group.
So now there are two capturing groups (the two $p you see above) but only one will match at a time, so some of the captures will be undef. We'll have to filter those out.
Finally, we can assemble all the captures, and we're done.
cat file | perl -ne '$p="([^<>]*)";@x=grep{defined} m{(?:>$p)|(?:$p<)}g; print join(" ",@x)."\n";'
Parse::Yapp parser
You might think that using Parse::Yapp for this particular problem is a bit too much (usually, if you have something complicated to parse, you would use a grammar and a parser generator), but why not. :)
Ok, so we need a grammar, here's one right here grammar_file.yp:
#header
%%
#rules
expression:
| exterior '<' interior '>' exterior
| exterior
;
exterior:
| TOK { $_[0]->YYData->{DATA} .= $_[1]; }
| expression
;
interior: TOK;
%%
#footer
sub Error { my ($parser)=shift; }
sub Lexer {
use Data::Dumper;
my($parser)=shift;
$parser->YYData->{INPUT} or return('',undef);
#$parser->YYData->{INPUT}=~s/^\s+//;
for ($parser->YYData->{INPUT}) {
return ('TOK',$1) if(s/^([^<>]+)//);
return ( $1,$1) if(s/^([<>])//);
};
}
You will notice in the grammar above that the interior is completely ignored, and only the terminals from exterior are collected.
Here's a small program, parse.pl, that uses the parser (MyParser.pm, generated from grammar_file.yp):
#!/usr/bin/env perl
use strict;
use warnings;
use MyParser;
my $parser=MyParser->new;
$parser->YYData->{INPUT} = "Message <Network=Data Center> All Verified";
my $value=$parser->YYParse(
yylex => \&MyParser::Lexer,
yyerror => \&MyParser::Error,
#yydebug => 0x1F,
);
my $nberr=$parser->YYNberr();
my $data=$parser->YYData->{DATA};
print "Result=$data"
And now a Makefile and we're done:
generate_parser_module:
yapp -m MyParser grammar_file.yp;
run:
perl parse.pl
all: generate_parser_module
Note
Some more parser generators:
Regexp::Grammars
Parse::RecDescent
Marpa::XS or Marpa::R2
You can do it the other way around: just remove the string in the angle brackets:
s#<.*>##
Or if > is not allowed:
s#<[^>]*>##
You can use sed for that:
cat yourfile |sed 's/<.*>//g' > newfile
If you need perl:
perl -i -pe "s/<.*?>//g" yourfile
Here is a compact approach. The following regex will capture your strings into Group 1:
<[^>]+>|([^<>]*)
What we are interested in here is not the overall match, but just the Group 1 matches.
So we need to iterate over Group 1 matches. I don't code in Perl, but following a recipe from the perlretut tutorial, this should do it:
while ($x =~ /<[^>]+>|([^<>]*)/g) {
print "$1","\n";
}
Please give it a try and let me know if it works for you.
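The "match what you don't want, capture what you do" recipe can be checked quickly in Python. This is a sketch of the same idea rather than tested Perl; the function name outside_angle_regions is made up.

```python
import re


def outside_angle_regions(text):
    """Collect the Group 1 captures, i.e. the text outside <...> regions.

    When the first alternative matches a <...> region, Group 1 is None;
    empty captures (for example at the end of the string) are skipped too.
    """
    pattern = re.compile(r"<[^>]+>|([^<>]*)")
    return [m.group(1) for m in pattern.finditer(text) if m.group(1)]


parts = outside_angle_regions("Message <Network=Data Center> All Verified")
print(parts)
```

Filtering out the empty and missing captures is the step the Perl loop above leaves to the reader.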

Regex to extract content from each line of a log file output from '_m' to the end of the line

Format of log line:
Xxx x xx:xx:xx xmmxxx XXXXXX: XXXXXXX:XXX: xxx_Mxxx_Xxxxxx_mxxxxxmmxx [XXX xxxx.
I want to extract from '_m' to the end of the line, removing the '_' before the 'm'.
New to regex...
Thanks!
If your tool or language supports look-behind, this works: it matches from the first _m to the end of the line and leaves out the leading _
(?<=_)m.*
test with grep:
kent$ echo "Xxx x xx:xx:xx xmmxxx XXXXXX: XXXXXXX:XXX: xxx_Mxxx_Xxxxxx_mxxxxxmmxx [XXX xxxx."|grep -Po '(?<=_)m.*'
mxxxxxmmxx [XXX xxxx.
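If look-behind is not available, a capturing group gives the same result. A short Python sketch, assuming a lowercase m is wanted (as in the question); extract_from_m is a hypothetical helper name.

```python
import re


def extract_from_m(line):
    """Return the text from the first "_m" to the end of the line, without the "_".

    Returns None when the line contains no "_m".
    """
    match = re.search(r"_(m.*)", line)
    return match.group(1) if match else None


log_line = "Xxx x xx:xx:xx xmmxxx XXXXXX: XXXXXXX:XXX: xxx_Mxxx_Xxxxxx_mxxxxxmmxx [XXX xxxx."
print(extract_from_m(log_line))
```

The search is case-sensitive, so the earlier _M in the line is not a match.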
With sed:
sed -n 's/^.*_\(m.*$\)/\1/p' file
It is quite easy:
This example is written in C# however the regex is quite general and will probably work anywhere:
Regex regex = new Regex(@"_(m.*)"); // If you look for _M the regex should be @"_(M.*)"
Match match = regex.Match(logLine);
if (match.Success)
Console.WriteLine(match.Groups[1].Value);
Hope this will help you on your quest.

Regex: Line does NOT contain a number

I've been racking my brain for hours on this and I'm at my wit's end. I'm beginning to think that this isn't possible for a regular expression.
The closest thing I've seen is this post: Regular expression to match a line that doesn't contain a word?, but the solution doesn't work when I replace "hede" with the number.
I want to select EACH line that DOES NOT contain: 377681 so that I can delete it.
^((?!377681).)*$
...doesn't work, along with thousands of other examples/tweaks that I've found or done.
Is this possible?
Would grep -v 377681 input_file solve your problem?
Try this one
^(?!.*377681).+$
See it here on Regexr
It is important to use the m (multiline) modifier, so that ^ matches the start of each line and $ the end of each line; otherwise it will not work.
(Note: I realized that my regex has the same meaning as yours.)
There's probably a better way of doing this, for example iterating over each line and calling a built-in string method such as indexOf or contains, depending on the language you're using.
Could you give us the full example?
<?php
$lines = array(
'434343343776815456565464',
'434343343774815456565464',
'434343343776815456565464'
);
foreach($lines as $key => $value){
if(!preg_match('#(377681)#is', $value)){
unset($lines[$key]);
}
}
print_r($lines);
?>
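The "built-in string method" idea mentioned above needs no regex at all. A minimal Python sketch using the in operator; the function name lines_containing is invented, and the sample lines are the ones from the PHP array.

```python
def lines_containing(lines, needle="377681"):
    """Keep only the lines that contain needle, dropping all the others."""
    return [line for line in lines if needle in line]


lines = [
    "434343343776815456565464",
    "434343343774815456565464",
    "434343343776815456565464",
]
print(lines_containing(lines))
```

Swapping the condition to needle not in line gives the opposite selection, i.e. the lines that do not contain the number.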
You'll need to enable the m (multi-line) flag for the ^ and $ to match the start- and end-of-lines respectively. If you don't, ^ will match the start-of-input and $ will only match the end-of-input.
The following demo:
#!/usr/bin/env php
<?php
$text = 'foo 377681 bar
this can be 3768 removed
377681 more text
remove me';
echo preg_replace('/^((?!377681).)*$/m', '---------', $text);
?>
will print:
foo 377681 bar
---------
377681 more text
---------
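The effect of the multi-line flag can be reproduced in Python, where it is spelled re.M (or re.MULTILINE). A sketch of the same demo, with the hypothetical function name mask_lines_without:

```python
import re


def mask_lines_without(text, number="377681", mask="---------"):
    """Replace every line that does not contain number with mask.

    With re.M, ^ and $ match at the start and end of each line; without
    it they would only match at the start and end of the whole input.
    """
    pattern = re.compile(r"^((?!" + re.escape(number) + r").)*$", re.M)
    return pattern.sub(mask, text)


text = "foo 377681 bar\nthis can be 3768 removed\n377681 more text\nremove me"
print(mask_lines_without(text))
```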