Match URL pattern within file using SED, AWK or GREP - regex

I am trying to use grep to extract a list of urls beginning with http and ending with jpg.
grep -o 'picturesite.com/wp-content/uploads/.......' filename
The code above is how far I've gotten; I then need to pass these URLs to curl. A sample line from the file looks like this:
title : "Family Vacation", jpg:"http://picturesite.com/wp-content/uploads/2014/01/mypicture.jpg", owner : "PhotoTaker"

sed -nr 's/http\S*(jpg|gif|other|ext)/\
curl $CURLOPTS & >$OUT/p' <$infile | sh -n
The above command will search $infile for any string beginning with "http", followed by any number of non-whitespace characters, and ending with any of the "|"-separated file extensions contained in the parentheses.
Once it finds such a string, sed substitutes the whole match into the curl command line on the second line wherever "&" appears (in sed's replacement text, "&" stands for the entire match). The resulting command string is then piped to sh for execution.
Remember, sed is the stream editor, not just the stream searcher, so it can very capably pre-process input for other commands to make them do what you want.
Note: sh is currently passed -n (the 'noexec' option), so it only reads and syntax-checks the generated commands without running them (add -v if you want to see them echoed). When you've run it a few times and are satisfied you're doing the right thing, you'll need to remove it for any effect.
Note 2: If there's a chance you'll want to match more than one url per line you'll need the 'g' sed option.
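For the sample line above, a concrete variant of that recipe (a sketch, assuming GNU sed and one URL per line) captures the URL and throws away the surrounding text, so only a clean curl command reaches sh; curl -sO (silent, save under the remote file name) stands in for whatever $CURLOPTS and >$OUT you actually want:
sed -nr 's/.*(http\S*\.(jpg|gif)).*/curl -sO "\1"/p' <"$infile" | sh -n
As above, remove sh's -n when you are ready to actually run the downloads; to preview the generated commands first, replace '| sh -n' with '| cat'.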

You can capture url patterns by doing:
grep -o 'http.*.jpg' file
$ grep -o 'http.*.jpg' <<EOF
> title : "Family Vacation", jpg:"http://picturesite.com/wp-content/uploads/2014/01/mypicture.jpg", owner : "PhotoTaker
> EOF
http://picturesite.com/wp-content/uploads/2014/01/mypicture.jpg
curl does not read URLs from standard input, so your best bet is to store the extracted URLs in a file and then read that file one line at a time, passing each line to the curl command.
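A small sketch of that loop, reusing the grep pattern from above (with the dot before jpg escaped so it matches a literal dot; urls.txt is just an example name, and the curl options are up to you):
grep -o 'http.*\.jpg' filename > urls.txt
while IFS= read -r url; do
    curl -sO "$url"   # fetch each extracted URL
done < urls.txt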

Related

Extract Source IP from log files

I want to extract "srcip=x.x.x.x" from a log file in bash. My log file looks like this:
2019:06:23-17:50:03 myhost ulogd[5692]: id="2021" severity="info" sys="SecureNet" sub="packetfilter" name="Packet dropped (GEOIP)" action="drop" fwrule="60019" initf="eth0" srcmac="3c:1e:04:92:6f:fb" dstmac="00:50:56:97:7c:af" srcip="185.53.91.50" dstip="192.168.50.10" proto="6" length="44" tos="0x00" prec="0x00" ttl="235" srcport="54522" dstport="5038" tcpflags="SYN"
I wrote awk '{print $15}' to extract srcip, but the problem is that the srcip field is not in the same position in each line. How can I extract srcip=x.x.x.x regardless of its position?
With any sed in any shell on every UNIX box:
$ sed -n 's/.*\(srcip="[^"]*"\).*/\1/p' file
srcip="185.53.91.50"
The following command provides the result you expect
grep -o -P 'srcip="(\d{1,3}[.]){3}\d{1,3}"' log
The -o option prints only the matched parts. The -P option enables Perl-compatible regular expressions. The regex matches srcip="<ipv4>", and log is the name of the file you want to extract content from.
Here is a link to regex101 for an explanation for the regex: https://regex101.com/r/hjuZlM/2
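If you only want the bare IP address, without the srcip= prefix and quotes, a small variation (still assuming GNU grep with -P) uses \K to drop the prefix from what -o reports:
$ grep -o -P 'srcip="\K(\d{1,3}[.]){3}\d{1,3}' log
185.53.91.50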
An awk version
awk -F"srcip=" '{split($2,a," ");print FS a[1]}' file
srcip="185.53.91.50"
Split the line on the key word, then take the piece that immediately follows it (everything up to the next space).
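Another awk option (a sketch) is to scan the whitespace-separated fields and print whichever one starts with srcip=; this works because the srcip="..." token itself contains no spaces:
$ awk '{for (i=1; i<=NF; i++) if ($i ~ /^srcip=/) print $i}' file
srcip="185.53.91.50"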

sed / awk - remove space in file name

I'm trying to replace the whitespace inside each file name with underscores, while keeping the single space that separates the file names.
Input:
echo "File Name1.xml File Name3 report.xml" | sed 's/[[:space:]]/__/g'
However, the output is:
File__Name1.xml__File__Name3__report.xml
Desired output
File__Name1.xml File__Name3__report.xml
You named awk in the title of the question, didn't you?
$ echo "File Name1.xml File Name3 report.xml" | \
> awk -F'.xml *' '{for(i=1;i<=NF;i++){gsub(" ","_",$i); printf i<NF?$i ".xml ":"\n" }}'
File_Name1.xml File_Name3_report.xml
$
-F'.xml *' instructs awk to split on a regex, the requested extension plus 0 or more spaces
the loop for(i=1;i<=NF;i++) is executed for all the fields into which the input line(s) is (are) split. Note that the last field is empty (it is what follows the last extension), but we are going to take that into account...
the body of the loop
gsub(" ","_", $i) substitutes all the occurrences of space to underscores in the current field, as indexed by the loop variable i
printf i<NF?$i ".xml ":"\n" outputs different things: if i<NF it's a regular field, so we append the extension and a space; otherwise i equals NF and we just terminate the output line with a newline.
It's not perfect, it appends a space after the last filename. I hope that's good enough...
▶    A D D E N D U M    ◀
I'd like to address:
the little buglet of the last space...
some of the issues reported by Ed Morton
generalize the extension provided to awk
To reach these goals, I've decided to wrap the scriptlet in a shell function which, since it changes spaces into underscores, is named s2u
$ s2u () { awk -F'\.'$1' *' -v ext=".$1" '{
> NF--;for(i=1;i<=NF;i++){gsub(" ","_",$i);printf "%s",$i ext (i<NF?" ":"\n")}}'
> }
$ echo "File Name1.xml File Name3 report.xml" | s2u xml
File_Name1.xml File_Name3_report.xml
$
It's a bit different (better?) because it does not treat the last field specially when printing, but instead special-cases the delimiter appended to each field; the idea of splitting on the extension remains.
This seems a good start if the filenames aren't delineated:
((?:\S.*?)?\.\w{1,})\b
( // start of captured group
(?: // non-captured group
\S.*? // a non-white-space character, then 0 or more any character
)? // 0 or 1 times
\. // a dot
\w{1,} // 1 or more word characters
) // end of captured group
\b // a word boundary
You'll have to look-up how a PCRE pattern converts to a shell pattern. Alternatively it can be run from a Python/Perl/PHP script.
Demo
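For example, with GNU grep's -P option the pattern pulls each filename out of the undelimited list, and a follow-up sed can then replace the internal spaces (a sketch; the post-processing step is just one possibility):
$ echo "File Name1.xml File Name3 report.xml" | grep -oP '((?:\S.*?)?\.\w{1,})\b' | sed 's/ /_/g'
File_Name1.xml
File_Name3_report.xml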
Assuming you are asking how to rename file names, and not remove spaces in a list of file names that are being used for some other reason, this is the long and short way. The long way uses sed. The short way uses rename. If you are not trying to rename files, your question is quite unclear and should be revised.
If the goal is to simply get a list of xml file names and change them with sed, the bottom example is how to do that.
directory contents:
ls -w 2
bob is over there.xml
fred is here.xml
greg is there.xml
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[@]}; i++));do
echo "${a_glob[i]}";
done
shopt -u nullglob
# output
bob is over there.xml
fred is here.xml
greg is there.xml
# then rename them
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[@]}; i++));do
# I prefer 'rename' for such things
# rename 's/[[:space:]]/_/g' "${a_glob[i]}";
# but sed works, can't see any reason to use it for this purpose though
mv "${a_glob[i]}" $(sed 's/[[:space:]]/_/g' <<< "${a_glob[i]}");
done
shopt -u nullglob
result:
ls -w 2
bob_is_over_there.xml
fred_is_here.xml
greg_is_there.xml
globbing is what you want here because of the spaces in the names.
However, this is really a complicated solution, when actually all you need to do is:
cd [your space containing directory]
rename 's/[[:space:]]/_/g' *.xml
and that's it, you're done.
If on the other hand you are trying to create a list of file names, you'd certainly want the globbing method, which if you just modify the statement, will do what you want there too, that is, just use sed to change the output file name.
If your goal is to change the filenames for output purposes, and not rename the actual files:
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[@]}; i++));do
echo "${a_glob[i]}" | sed 's/[[:space:]]/_/g';
done
shopt -u nullglob
# output:
bob_is_over_there.xml
fred_is_here.xml
greg_is_there.xml
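If all you need is the transformed names on stdout, plain bash parameter expansion does the same job without sed (a minimal sketch, printing the same three names):
for f in *.xml; do
    echo "${f// /_}"    # replace every space in the name with an underscore
done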
You could use rename:
rename --nows *.xml
This will replace all the spaces of the xml files in the current folder with _.
Sometimes it comes without the --nows option, so you can then use a search and replace:
rename 's/[[:space:]]/__/g' *.xml
You can also use --dry-run if you want to just print the new file names without actually renaming anything.

Remove the data before the second repeated specified character in linux

I have a text file which has some below data:
AB-NJCFNJNVNE-802ac94f09314ee
AB-KJNCFVCNNJNWEJJ-e89ae688336716bb
AB-POJKKVCMMMMMJHHGG-9ae6b707a18eb1d03b83c3
AB-QWERTU-55c3375fb1ee8bcd8c491e24b2
I need to remove the data before the second hyphen (-) and produce another text file with the below output:
802ac94f09314ee
e89ae688336716bb
9ae6b707a18eb1d03b83c3
55c3375fb1ee8bcd8c491e24b2
I am pretty new to Linux and have been trying sed commands unsuccessfully for the last couple of hours. How can I get the desired output with sed or any other useful command like awk?
You can use a simple cut call:
$ cat myfile.txt | cut -d"-" -f3- > myoutput.txt
Edit:
Some explanation, as requested in the comments:
cut breaks up a string of text to fields according to a given delimiter.
-d defines the delimiter, - in this case.
-f defines which fields to output. In this case, we want to eliminate everything before the second hyphen, or, in other words, return the third field and onwards (3-).
The rest of the command is just piping the output. cating the file into cut, and then saving the result to an output file.
Or, using sed:
cat myfile.txt | sed -e 's/^.\+-//'
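Note that .\+ is greedy, so the sed above strips everything up to the last hyphen; that is fine here because the trailing hash contains no hyphens. A variant that removes exactly the first two hyphen-delimited fields (a sketch) would be:
sed 's/^[^-]*-[^-]*-//' myfile.txt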

Grep match a certain key/value set json

IF THIS IS THE FIRST TIME YOU'RE READING THIS QUESTION, SKIP RIGHT TO THE EDIT
So what I'm trying to do is match everything until a certain word
What I'm working with is similar to this:
{"selling":"0"morestuffhere"notes":"otherthingshere"}unwantedthingshere
The regex I got so far is:
grep -o "\{\"selling\":\"0\""
which will match up to {"selling":"0".
I want it to match {"selling":"0"morestuffhere"notes":"otherthingshere"} but NOT unwantedthingshere.
I don't know beforehand what "morestuffhere", "otherthingshere" and "unwantedthingshere" are going to be. So what I want to do is match everything from what I already have until "notes":"otherthingshere"}.
How do I do this?
EDIT: forgot to mention some key points. Sorry, had to hurry because dinner was ready.
My input consists of a series of key:value sets, as such:
{"key":"value", "otherkey":"othervalue","morekeys":"morevalues"},{"othersetkey":"othersetvalue","otherothersetkey":"otherothersetvalue","othersetmorekeys":"othersetmorevalues"}
and so on.
The first key/value set is different from the rest of them, and I don't want to match that set.
The first key of all sets other than the first is "selling", and I want to match all sets that have a "selling" value of 1. The last key of the set is "notes".
The input is JSON, so I added that to the tags.
Through sed,
sed -r 's/^[^{]*([^}]*).*$/\1}/g' file
Example:
$ echo 'dSDGAadb{"selling":"0"morestuffhere"notes":"otherthingshere"}unwantedthingshere' | sed -r 's/^[^{]*([^}]*).*$/\1}/g'
{"selling":"0"morestuffhere"notes":"otherthingshere"}
I think you want something like this,
$ cat aa
dSDGAadb{"selling":"0"morestuffhere"notes":"otherthingshere"}{"selling":"1"morestuffhere"notes":"otherthingshere"}bgj
$ sed -r 's/.*(\{"selling":"1"[^}]*)}.*/\1}/g' aa
{"selling":"1"morestuffhere"notes":"otherthingshere"}
OR
something like this,
$ cat aa
dSDGAadb{"selling":"0"morestuffhere"notes":"otherthingshere"}{"selling":"1"morestuffhere"notes":"otherthingshere"}bgj{"selling":"1"morestuffhere"notes":"otherthingshere"}
$ grep -oP '{\"selling\":\"1\"[^}]*}' aa
{"selling":"1"morestuffhere"notes":"otherthingshere"}
{"selling":"1"morestuffhere"notes":"otherthingshere"}
You could do this with grep:
grep -o '{[^}]*}' file
This matches an opening curly brace, followed by anything that isn't a closing curly brace, followed by a closing curly brace.
Testing it out on your input:
$ grep -o '{[^}]*}' <<<'{"selling":"0"morestuffhere"notes":"otherthingshere"}unwantedthingshere'
{"selling":"0"morestuffhere"notes":"otherthingshere"}
What's wrong with
>> grep -o ".*}" file.txt
{"selling":"0"morestuffhere"notes":"otherthingshere"}
where file.txt contains your example string?
I've never found a good way to do this kind of thing with json in the shell with basic unix tools like grep, sed, etc. A quick and dirty ruby or python script is your friend,
#!/usr/bin/env ruby
# h.rb - print the value of a given key from a JSON object
require 'json'
key=ARGV.shift    # first argument: the key to look up
json=ARGF.read    # read the JSON from the remaining file arguments, or stdin
h=JSON.parse(json)
puts h.key?(key) ? h[key] : "not found"
And then pipe your json into the script specifying the key as a parameter,
$ echo '{"key":"value", "otherkey":"othervalue","morekeys":"morevalues"}' | /tmp/h.rb otherkey
othervalue
or from a file,
$ cat /tmp/h.json | /tmp/h.rb otherkey
othervalue

Understanding a sed example

I found a solution for extracting the password from a Mac OS X Keychain item. It uses sed to get the password from the security command:
security 2>&1 >/dev/null find-generic-password -ga $USER | \
sed -En '/^password: / s,^password: "(.*)"$,\1,p'
The code is here in a comment by 'sr105'. The part before the | evaluates to password: "secret". I'm trying to figure out exactly how the sed command works. Here are some thoughts:
I understand the flags -En, but what are the commas doing in this example? In the sed docs it says a comma separates an address range, but there are three commas here.
The first 'address' /^password: / has a trailing s; in the docs s is only mentioned as the replace command like s/pattern/replacement/. Not the case here.
The ^password: "(.*)"$ part looks like the Regex for isolating secret, but it's not delimited.
I can understand the end part where the back-reference \1 is printed out, but again, what are the commas doing there??
Note that I'm not interested in an easier alternative to this sed example. This will only be part of a larger bash script which will include some more sed parsing in an .htaccess file, so I'd really like to learn the syntax even if it is obscure.
Thanks for your help!
Here is sed command:
sed -En '/^password: / s,^password: "(.*)"$,\1,p'
Commas are used as the regex delimiter here; it could just as well be another delimiter like #:
sed -En '/^password: / s#^password: "(.*)"$#\1#p'
/^password: / finds an input line that starts with password:
s#^password: "(.*)"$#\1#p finds and captures double-quoted string after password: and replaces the entire line with the captured string \1 ( so all that remains is the password )
First, the command extracts passwords from a file (or stream) and prints them to stdout.
While you "normally" might execute a sed command on all lines of a file, sed offers to specify a regex pattern which describes which lines the following command should get applied to.
In your case
/^password: /
is a regex, saying that the command:
s,^password: "(.*)"$,\1,p
should get executed for all lines looking like password: "secret". The command substitutes those lines with just the password and, thanks to -n together with the p flag, only those substituted lines are printed; all other lines are suppressed.
The substitute command might look uncommon but you can choose the delimiter in an sed command, it is not limited to /. In this case , was chosen.
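You can check the substitution in isolation with a made-up sample line (a sketch, not real keychain output):
$ echo 'password: "secret"' | sed -En '/^password: / s,^password: "(.*)"$,\1,p'
secret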