How do I extract particular words from each line? - regex

The text file has many lines of this sort. I want to extract the text after /videos/ up to and including .mp4, plus the very last number (both shown in bold), and write each filtered line to a separate file.
https://videos-a.jwpsrv.com/content/conversions/7kHOkkQa/videos/**S4KWZTyt-32313922.mp4**.m3u8?hdnts=exp=1592315851~acl=*/S4KWZTyt-32313922.mp4.m3u8~hmac=83f4674e6bf2576b070c716a3196cb6a30f35737827ee69c8cf7e0c57a196e51 **1**
Let's say, for example, the text file content is:
https://videos-a.jwpsrv.com/content/conversions/7kHOkkQa/videos/JajSfbVN-32313922.mp4.m3u8?hdnts=exp=1592315891~acl=*/JajSfbVN-32313922.mp4.m3u8~hmac=d3ca7bd5b233a531cfe242d17d2ea0c0167b41b90fff6459e433700ffc969d69 19
https://videos-a.jwpsrv.com/content/conversions/7kHOkkQa/videos/Qs3xZqcv-32313922.mp4.m3u8?hdnts=exp=1592315940~acl=*/Qs3xZqcv-32313922.mp4.m3u8~hmac=c30e2082bf748a6b4d1621c1d33a95319baa61798775e9da8856041951cf5233 20
The output should be
JajSfbVN-32313922.mp4 19
Qs3xZqcv-32313922.mp4 20

You may try the below regex:
.*\/videos\/(.*?mp4).*?(?<= )(\d+)
Explanation of the above regex:
.* - Greedily matches everything before /videos/.
\/videos\/ - Matches /videos/ literally.
(.*?mp4) - First capturing group; lazily matches everything up to and including the first mp4.
.*?(?<= ) - Lazily matches everything up to a position preceded by a space; the lookbehind (?<= ) ensures the digits that follow start right after a space.
(\d+) - Second capturing group; matches the number at the end of the line, as you require.
You can find the demo of the above regex in here.
Command line implementation in linux:
cat regea.txt | perl -ne 'print "$1 $2\n" while /.*\/videos\/(.*?mp4).*?(?<= )(\d+)/g;'> out.txt
You can find the sample implementation of the above command in here.

The proposed regex is probably a better solution, but I'll leave a Python solution that writes each filtered line in a separate file. This script works if every line in the file is like that.
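with open("my_file.txt", "r") as infile:
    for line in infile:
        # The trailing number is the second space-separated field.
        num = line.split(" ")[1].strip()
        # "videos" occurs twice (host name and path); keep what follows the second one.
        rest = line.split("videos")[2][1:]
        name, ext = rest.split(".")[0:2]
        # One output file per input line, named after the part before ".mp4".
        with open(name, "w") as out:
            out.write(name + "." + ext + " " + num)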
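
For comparison, the regex from the first answer can also be applied from Python. This is only a sketch: the function name extract is made up for illustration, and it assumes every line holds one URL followed by a space and a number, as in the samples above.

```python
import re

# Same pattern as the first answer, written as a Python raw string.
PATTERN = re.compile(r".*/videos/(.*?mp4).*?(?<= )(\d+)")


def extract(line):
    """Return 'name.mp4 number' for a matching line, or None if it does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return match.group(1) + " " + match.group(2)


if __name__ == "__main__":
    sample = (
        "https://videos-a.jwpsrv.com/content/conversions/7kHOkkQa/videos/"
        "JajSfbVN-32313922.mp4.m3u8?hdnts=exp=1592315891~acl=*/"
        "JajSfbVN-32313922.mp4.m3u8~hmac="
        "d3ca7bd5b233a531cfe242d17d2ea0c0167b41b90fff6459e433700ffc969d69 19"
    )
    print(extract(sample))
```

Returning None for non-matching lines lets the caller decide whether to skip or report them, instead of failing on the first odd line.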

Related

txt file delete url to last "/" to get files

I have a txt file containing one URL per row, each like:
://url/files.php?file=parent/children/file.pdf
://url/files.php?file=parent/children2/childrenofchildren2/file2.txt
......etc
I need help cutting everything before the last / in each row. This is what I tried in Notepad++ regex mode (it doesn't work):
^.+[/](.*)$
To get:
file.pdf
file2.txt
But I am open to all ways of solving this.
Replace everything from the start of the line up to and including the last / with nothing:
sed 's/.*\///' file
or
sed 's|.*/||' file
Output:
file.pdf
file2.txt
This solution may be more complicated than it needs to be, but it works!
A purely regex-based approach could be as follows:
(([^\/])*)((\n)|($))/g
Basically, it matches a run of characters other than / (([^\/])*) that ends at either a newline \n or the end of the input $. The global flag /g is also set, so it can match more than one instance!
I hope this helped!
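
The sed substitution above (delete everything up to and including the last /) can be sketched in Python as well. The function name strip_to_last_slash is hypothetical, and the sketch assumes it is called on one line at a time.

```python
import re


def strip_to_last_slash(line):
    """Drop everything up to and including the last '/' on the line."""
    # Greedy .* runs to the last slash, like sed 's|.*/||'.
    return re.sub(r"^.*/", "", line)


if __name__ == "__main__":
    for url in (
        "://url/files.php?file=parent/children/file.pdf",
        "://url/files.php?file=parent/children2/childrenofchildren2/file2.txt",
    ):
        print(strip_to_last_slash(url))
```

A line without any / is returned unchanged, which matches what sed does when the pattern does not match.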

Regex Match Paragraph Pattern

I am trying to match a paragraph pattern and I am having trouble.
The pattern is:
[image.gif]
some words, usually a few lines
name
emailaddress<mailto:theemailaddress#mail.com>
I tried matching everything between the gif image and the <mailto: but this happens multiple times in the file meaning I get a bad result.
I tried it with this
(?<=\[image.gif\].*?(\[image.gif\])).*?(?=<mailto:)
Is there a way to use Regex to match the general layout of a paragraph?
"the general layout of a paragraph" needs a better definition. Given the lack of an input plus expected output, I'm having to guess what you want here. I'm also guessing that you will accept any language. Here's perl, almost certainly not a language you're familiar with.
Assumed input:
do not match this line
[image.gif]
some words, usually a few lines
Bobert McBobson
emailaddress<mailto:bobertmb#example.com>
don't match this line either
[image.gif]
another few words
on another few lines
Bobina Robertsdaughter
emailaddress<mailto:bobinard#example.info>
this line is also not for matching
Expected output:
[image.gif]
some words, usually a few lines
Bobert McBobson
emailaddress<mailto:bobertmb#example.com>
---
[image.gif]
another few words
on another few lines
Bobina Robertsdaughter
emailaddress<mailto:bobinard#example.info>
Solution using perl:
#!/usr/bin/perl -n0777
my $sep = "";
while (/(\[image\.gif\].*?<mailto:[^>]*>(\r)?\n)/gms) {
    print $sep . $1;
    $sep = "---$2\n";
}
perl is the king of regex languages; many would say that's all it is good for. Here, we use the -n0777 options to tell it to read the entire contents of each file in one go (slurp mode) and run the code on it, with the contents in the default variable $_.
$sep starts blank because there's nothing to separate until the second match.
Then we loop over each block of text that matches the regex:
matches a literal [image.gif]
then matches as little content following that as possible
then matches a literal <mailto: and continues until the next >
then captures the line break (including optional support for DOS line endings)
(see full regex explanation and example at regex101)
We then print the match and finally set the separator to three dashes and a line break (DOS line endings added when needed).
Now you can run it:
$ perl answer.pl input.txt
[image.gif]
some words, usually a few lines
Bobert McBobson
emailaddress<mailto:bobertmb#example.com>
---
[image.gif]
another few words
on another few lines
Bobina Robertsdaughter
emailaddress<mailto:bobinard#example.info>
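
If perl is not an option, roughly the same idea can be sketched in Python. This is an illustration under the same assumed input as above, not a translation of the script: the names find_blocks and format_blocks are made up, and unlike the perl version it does not capture the trailing line break, so DOS line endings are not specially handled.

```python
import re

# Lazy .*? with DOTALL lets a match span lines but stop at the first <mailto:...>.
BLOCK = re.compile(r"\[image\.gif\].*?<mailto:[^>]*>", re.DOTALL)


def find_blocks(text):
    """Return every [image.gif] ... <mailto:...> block found in text."""
    return BLOCK.findall(text)


def format_blocks(text, separator="---"):
    """Join the blocks with a separator line between them."""
    return ("\n" + separator + "\n").join(find_blocks(text))


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python blocks.py input.txt
    with open(sys.argv[1]) as handle:
        print(format_blocks(handle.read()))
```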

sed - match regex in specific position

I'm having some trouble creating a one liner or a simple script to edit some fixed length files using sed.
Supposing my file has lines in this format:
IPITTYTHEFOOBUTIDONOTPITTYTHEBAR
IPITTYTH BARBUTIDONOTPITTYTH3FOO
Treating each line as a string, I want to test the substring that starts at position 10 and has length 3 against a regex. If it matches, I want to add some other string at the end of that line.
Assuming the matching regex is B.R, and the string to append in the end of the line is NOT, I would want my file to turn into:
IPITTYTHEFOOBUTIDONOTPITTYTHEBAR
IPITTYTH BARBUTIDONOTPITTYTH3FOONOT
The lines in the files are bigger than the ones in this sample.
So far I have this:
sed -i '/B.R/ s/$/NOT/' file.name
The problem is that this ignores the position where the regex is matched, making the first line of the example a match as well:
IPITTYTHEFOOBUTIDONOTPITTYTHEBARNOT
IPITTYTH BARBUTIDONOTPITTYTH3FOONOT
I'm open to use awk as well.
Thanks in advance.
You are almost there. You just need to specify the characters that come before B.R. If B is at the 10th position, then exactly 9 characters must precede it:
sed -i '/^.\{9\}B.R/s/$/NOT/' file.name
Example:
$ sed '/^.\{9\}B.R/s/$/NOT/' file
IPITTYTHEFOOBUTIDONOTPITTYTHEBAR
IPITTYTH BARBUTIDONOTPITTYTH3FOONOT
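
The same position check can be sketched in Python by slicing out the fixed-width field first. The function name append_if_match and its parameters are hypothetical; positions are 1-based as in the question, and lines are assumed to be passed without their trailing newline.

```python
import re


def append_if_match(line, start=10, length=3, pattern=r"B.R", suffix="NOT"):
    """Append suffix when the field at the given 1-based position matches pattern."""
    # Convert the 1-based start position to a 0-based slice.
    field = line[start - 1 : start - 1 + length]
    if re.fullmatch(pattern, field):
        return line + suffix
    return line


if __name__ == "__main__":
    for row in (
        "IPITTYTHEFOOBUTIDONOTPITTYTHEBAR",
        "IPITTYTH BARBUTIDONOTPITTYTH3FOO",
    ):
        print(append_if_match(row))
```

Slicing before matching keeps the regex itself free of position logic, which may be easier to read than counting characters inside the pattern.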

Regex - match up to first literal

I have some lines of code I am trying to remove some leading text from which appears like so:
Line 1: myApp.name;
Line 2: myApp.version
Line 3: myApp.defaults, myApp.numbers;
I am trying and trying to find a regex that will remove anything up to (but excluding) myApp.
I have tried various regular expressions, but they all seem to fail when it comes to line 3 (because myApp appears twice).
The closest I have come so far is:
.*?myApp
Pretty simple - but that matches both occurrences of myApp in Line 3 - whereas I'd like it to match only the first.
There's a few hundred lines - otherwise I'd have deleted them all manually by now.
Can somebody help me? Thanks.
You need to add the anchor ^, which matches the start of a line:
^.*?(myApp)
DEMO
Use the above regex and replace the match with $1 or \1, so that the string myApp is kept in the final result.
Pattern explanation:
^ Start of a line.
.*?(myApp) Shortest possible match up to the first myApp. The string myApp is captured and stored in group 1.
All matched characters are replaced with the chars present inside the group 1.
Your regular expression works in Perl if you add the ^ to ensure that you only match the beginnings of lines:
cat /tmp/test.txt | perl -pe 's/^.*?myApp/myApp/g'
myApp.name;
myApp.version
myApp.defaults, myApp.numbers;
If you wanted to get fancy, you could put the "myApp" into a lookahead using (?=...) syntax, which checks that it follows without making it part of the match. That way it doesn't have to be replaced back in.
cat /tmp/test.txt | perl -pe 's/^.*?(?=myApp)//g'
myApp.name;
myApp.version
myApp.defaults, myApp.numbers;
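
The lookahead version carries over to Python almost unchanged. In this sketch the function name strip_prefix is made up, and re.MULTILINE is added so that ^ anchors at the start of every line when the whole text is processed at once.

```python
import re


def strip_prefix(text):
    """Remove everything before the first myApp on each line of text."""
    # ^ anchors at each line start; the lazy .*? stops at the first myApp,
    # and the lookahead keeps myApp itself out of the match.
    return re.sub(r"^.*?(?=myApp)", "", text, flags=re.MULTILINE)


if __name__ == "__main__":
    sample = (
        "Line 1: myApp.name;\n"
        "Line 2: myApp.version\n"
        "Line 3: myApp.defaults, myApp.numbers;"
    )
    print(strip_prefix(sample))
```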

sed only replacing last occurrence of match - need to match all

I would like to replace all { } on a certain line with [ ], but unfortunately I am only able to match the last occurrence of the regexp.
I have a config file which has structure as follows:
entry {
id 123456789
desc This is a description of {foo} and was added by {bar}
trigger 987654321
}
I have the following sed command, which is able to replace the last match 'bar' but not 'foo':
sed s'/\(desc.*\){\(.*\)}/\1\[\2\]/g' < filename
I anchor this search to the line containing 'desc' as I would hate for it to replace the delimiting braces of each 'entry' block.
For the life of me I am unable to figure out how to replace all of the occurrences.
Any help is appreciated - have been learning all day and unable to read any more tutorials for fear that my corneas might crack.
Thanks!
Try the following:
sed '/desc/ s/{\([^}]*\)}/[\1]/g' filename
The search and replace in the above command will only be done for lines that match the regex /desc/, however I don't think this is actually necessary because sed processes text a line at a time, so even without this you wouldn't be replacing braces on the 'entry' block. This means that you could probably simplify this to the following:
sed 's/{\([^}]*\)}/[\1]/g' filename
Instead of .* inside the capturing group, [^}]* is used, which matches everything except closing braces; that way you won't match from the first opening brace to the last closing brace.
Also, you can just provide the file name as the final argument to sed instead of using input redirection.
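
For anyone doing this outside sed, here is a minimal Python sketch of the same substitution, restricted to lines containing desc. The function name braces_to_brackets is hypothetical, and the plain substring test for "desc" is a simplification of the /desc/ address.

```python
import re


def braces_to_brackets(line):
    """Replace every {...} pair on a 'desc' line with [...]; leave other lines alone."""
    if "desc" not in line:
        return line
    # [^}]* cannot run past a closing brace, so each pair is matched separately.
    return re.sub(r"\{([^}]*)\}", r"[\1]", line)


if __name__ == "__main__":
    print(braces_to_brackets("desc This is a description of {foo} and was added by {bar}"))
```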