How to output multiple regex matches through comma on the same line - regex

I want to use grep/awk/sed to extract matched strings for each line of a log file. Then place it into csv file.
Highlighted strings (1432,53,http://www.espn.com/)
If the input is:
2018-10-31
18:48:01.717,INFO,15592.15627,PfbProxy::handlePfbFetchDone(0x1d69850,
pfbId=561, pid=15912, state=4, fd=78, timer=61), FETCH DONE: len=45,
PFBId=561, pid=0, loadTime=1434 ms, objects=53, fetchReqEpoch=0.0,
fetchDoneEpoch:0.0, fetchId=26, URL=http://www.espn.com/
2018-10-31
18:48:01.806,DEBUG,15592.15621,FETCH DONE: len=45, PFBId=82, pid=0,
loadTime=1301 ms, objects=54, fetchReqEpoch=0.0, fetchDoneEpoch:0.0,
fetchId=28, URL=http://www.diply.com/
Expected output for the above log lines:
URL,LoadTime,Objects
http://www.espn.com/,1434,53
http://www.diply.com/,1301,54
This is an example, and the actual Log File will have much more data.
--My-Solution-So-far-
For now I used grep to get all lines containing keyword 'FETCH DONE' (these lines contain strings I am looking for).
I did come up with regular expression that matches the data I need, but when I grep it and put it in the file it prints each string on the new line which is not quite what I am looking for.
The grep and regular expression I use (online regex tool: https://regexr.com/42cah):
echo -en 'url,loadtime,object\n'>test1.csv #add header
grep -Po '(?<=loadTime=).{1,5}(?= )|((?<=URL=).*|\/(?=.))|((?<=objects=).{1,5}(?=\,))'>>test1.csv #get matching strings
Actual output:
URL,LoadTime,Objects
http://www.espn.com
1434
53
http://www.diply.com
1301
54
Expected output:
URL,LoadTime,Objects
http://www.espn.com/,1434,53
http://www.diply.com/,1301,54
I was trying using awk to match multiple regex and print comma in between. I couldn't get it to work at all for some reason, even though my regex matches correct strings.
Another idea I have is to use sed to replace some '\n' for ',':
for(i=1;i<=n;i++)
if(i % 3 != 0){
sed REPLACE "\n" with "," on i-th line
}
Im pretty sure there is a more efficient way of doing it

Using sed:
sed -n 's/.*loadTime=\([0-9]*\)[^,]*, objects=\([0-9]*\).* URL=\(.*\)/\3,\1,\2/p' input | \
sed 1i'URL,LoadTime,Objects'

Related

Regex command line change format of each line

I have a file that contains lines in a format similar to this...
/data/file.geojson?10,20,30,40
/data/file.geojson?bbox=-5.20751953125,49.05227025601607,3.0322265625,56.46249048388979
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-2.8482055664062496,54.38935426009769,-0.300750732421875,55.158473983815306
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
I've tried a combination of grep, sed, gawk, and |(pipes) to try and pattern match and then change the format to be more like this...
[10,40],[30,40],[30,20][10,20],
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979].....
Hopefully you get the idea from the first line so I don't have to type out all the examples manually!
I've got the hang of regex to match the co-ordinates. In fact the input file is the result of extracting from apache access logs. It might be easier to read/understand answers if they just match positive integer numbers, I will then be able to slot in a more complicated pattern to match the right range.
To be able to arrange the results like you which it is important to be able to access the last for values per line.
No pattern matching is required if you use awk. You can split the input strings by a set of delimiters and reassemble the resulting fields. 40 can be accessed as $(NF), 30 as $(NF-1) and so on.
awk -F'[?,=]' '
{printf "[%s,%s],[%s,%s],[%s,%s],[%s,%s]\n",
$(NF-3),$(NF),$(NF-1),$(NF),
$(NF-1),$(NF-2),$(NF-3),$(NF-2)
}' file
I'm using ?, , or = as the field delimiters. This makes it simple to access the columns of interest.
Output:
[10,40],[30,40],[30,20],[10,20]
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979],[3.0322265625,49.05227025601607],[-5.20751953125,49.05227025601607]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-2.8482055664062496,55.158473983815306],[-0.300750732421875,55.158473983815306],[-0.300750732421875,54.38935426009769],[-2.8482055664062496,54.38935426009769]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
Btw, also sed can be used here:
sed -r 's/.*[?=]([^,]+),([^,]+),([^,]+),(.*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
The command is capturing the numbers at the end each in a separate capturing group and re-assembles them in the replacement part.
Not all versions of sed support the + quantifier. The most compatible version would look like this :)
sed 's/.*[?=]\([^,]\{1,\}\),\([^,]\{1,\}+\),\([^,]\{1,\}\),\(.*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
sed strips off items prior to numbers, then awk splits on comma and outputs in different order. Assuming data is in a file called "td.txt"
sed 's/^[^0-9-]*//' td.txt|awk -F, '{print "["$1","$4"],["$3","$4"],["$3","$2"],["$1","$2"],"}'
This might work for you (GNU sed):
sed -r 's/^.*\?[^-0-9]*([^,]*),([^,]*),([^,]*),([^,]*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
Or with more toothpicks:
sed 's/^.*\?[^-0-9]*\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
You can use the following to match:
(\/data\/file\.geojson\?(?:bbox=)?)([0-9.-]+),([0-9.-]+),([0-9.-]+),([0-9.-]+)
And replace with the following:
$1[$2,$3],[$4,$5]
See DEMO

Getting rid of all words that contain a special character in a textfile

I'm trying to filter out all the words that contain any character other than a letter from a text file. I've looked around stackoverflow, and other websites, but all the answers I found were very specific to a different scenario and I wasn't able to replicate them for my purposes; I've only recently started learning about Unix tools.
Here's an example of what I want to do:
Input:
#derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag
Output:
I was there and it was awesome!
So words with punctuation can stay in the file (in fact I need them to stay) but any substring with special characters (including those of punctuation) needs to be trimmed away. This can probably be done with sed, but I just can't figure out the regex. Help.
Thanks!
Here is how it could be done using Perl:
perl -ane 'for $f (#F) {print "$f " if $f =~ /^([a-zA-z-\x27]+[?!;:,.]?|[\d.]+)$/} print "\n"' file
I am using this input text as my test case:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
#derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag
output:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
I was there; it was awesome!
Command-line options:
-n loop around every line of the input file, do not automatically print it
-a autosplit mode – split input lines into the #F array. Defaults to splitting on whitespace
-e execute the perl code
The perl code splits each input line into the #F array, then loops over every field $f and decides whether or not to print it.
At the end of each line, print a newline character.
The regular expression ^([a-zA-z-\x27]+[?!;:,.]?|[\d.]+)$ is used on each whitespace-delimited word
^ starts with
[a-zA-Z-\x27]+ one or more lowercase or capital letters or a dash or a single quote (\x27)
[?!;:,.]? zero or one of the following punctuation: ?!;:,.
(|) alternately match
[\d.]+ one or more numbers or .
$ end
Your requirements aren't clear at all but this MAY be what you want:
$ awk '{rec=sep=""; for (i=1;i<=NF;i++) if ($i~/^[[:alpha:]]+[[:punct:]]?$/) { rec = rec sep $i; sep=" "} print rec}' file
I was there and it was awesome!
sed -E 's/[[:space:]][^a-zA-Z0-9[:space:]][^[:space:]]*//g' will get rid of any words starting with punctuation. Which will get you half way there.
[[:space:]] is any whitespace character
[^a-zA-Z0-9[:space:]] is any special character
[^[:space:]]* is any number of non whitespace characters
Do it again without a ^ instead of the first [[:space:]] to get remove those same words at the start of the line.

Print commands in history consisting in just one word

I want to print lines that contains single word only.
For example:
this is a line
another line
one
more
line
last one
I want to get the ones with single word only
one
more
line
EDIT: Guys, thank you for answers. Almost all of the answers work for my test file. However I wanted to list single lines in bash history. When I try your answers like
history | your posted commands
all of them below fails. Some only prints some numbers (might line numbers?)
You want to get all those commands in history that contain just one word. Considering that history prints the number of the command as a first column, you need to match those lines consisting in two words.
For this, you can say:
history | awk 'NF==2'
If you just want to print the command itself, say:
history | awk 'NF==2 {print $2}'
To rehash your problem, any line containing a space or nothing should be removed.
grep -Ev '^$| ' file
Your problem statement is unspecific on whether lines containing only punctuation might also occur. Maybe try
grep -Ex '[A-Za-z]+' file
to only match lines containing only one or more alphabetics. (The -x option implicitly anchors the pattern -- it requires the entire line to match.)
In Bash, the output from history is decorated with line numbers; maybe try
history | grep -E '^ *[0-9]+ [A-Za-z]+$'
to match lines where the line number is followed by a single alphanumeric token. Notice that there will be two spaces between the line number and the command.
In all cases above, the -E selects extended regular expression matching, aka egrep (basic RE aka traditional grep does not support e.g. the + operator, though it's available as \+).
Try this:
grep -E '^\s*\S+\s*$' file
With the above input, it will output:
one
more
line
If your test strings are in a file called in.txt, you can try the following:
grep -E "^\w+$" in.txt
What it means is:
^ starting the line with
\w any word character [a-zA-Z0-9]
+ there should be at least 1 of those characters or more
$ line end
And output would be
one
more
line
Assuming your file as texts.txt and if grep is not the only criteria; then
awk '{ if ( NF == 1 ) print }' texts.txt
If your single worded lines don't have a space at the end you can also search for lines without an empty space :
grep -v " "
I think that what you're looking for could be best described as a newline followed by a word with a negative lookahead for a space,
/\n\w+\b(?! )/g
example

How to find/extract a pattern from a file?

Here are the contents of my text file named 'temp.txt'
---start of file ---
HEROKU_POSTGRESQL_AQUA_URL (DATABASE_URL) ----backup---> b687
Capturing... done
Storing... done
---end of file ----
I want to write a bash script in which I need to capture the string 'b687' in a variable. this is really a pattern (which is the letter 'b' followed by 'n' number of digits). I can do it the hard way by looping through the file and extracting the desired string (b687 in example above). Is there an easy way to do so? Perhaps by using awk or sed?
Try using grep
v=$(grep -oE '\bb[0-9]{3}\b' file)
This will seach for a word starting with b followed by '3' digits.
regex101 demo
Using sed
v=$(sed -nr 's/.*\b(b[0-9]{3})\b.*/\1/p' file)
varname=$(awk '/HEROKU_POSTGRESQL_AQUA_URL/{print $4}' filename)
what this does is reads the file when it matches the pattern HEROKU_POSTGRESQL_AQUA_URL print the 4th token in this case b687
your other option is to use sed
varname=$(sed -n 's/.* \(b[0-9][0-9]*\)/\1/p' filename)
In this case we are looking for the pattern you mentioned b####... and only print that pattern the -n tells sed not to print line that do not have that pattern. the rest of the sed command is a substitution .* is any string at the beginning. followed by a (...) which forms a group in which we put the regex that will match your b##### the second part says out of all that match only print the group 1 and the p at the end tells sed to print the result (since by default we told sed not to print with the -n)

Grep Regex: List all lines except

I'm trying to automagically remove all lines from a text file that contains a letter "T" that is not immediately followed by a "H". I've been using grep and sending the output to another file, but I can't come up with the magic regex that will help me do this.
I don't mind using awk, sed, or some other linux tool if grep isn't the right tool to be using.
That should do it:
grep -v 'T[^H]'
-v : print lines not matching
[^H]: matches any character but H
You can do:
grep -v 'T[^H]' input
-v is the inverse match option of grep it does not list the lines that match the pattern.
The regex used is T[^H] which matches any lines that as a T followed by any character other than a H.
Read lines from file exclude EMPTY Lines and Lines starting with #
grep -v '^$\|^#' folderlist.txt
folderlist.txt
# This is list of folders
folder1/test
folder2
# This is comment
folder3
folder4/backup
folder5/backup
Results will be:
folder1/test
folder2
folder3
folder4/backup
folder5/backup
Adding 2 awk solutions to the mix here.
1st solution(simpler solution): With simple awk and any version of awk.
awk '!/T/ || /TH/' Input_file
Checking 2 conditions:
If a line doesn't contain T OR
If a line contains TH then:
If any of above condition is TRUE then print that line simply.
2nd solution(GNU awk specific): Using GNU awk using match function where mentioning regex (T)(.|$) and using match function's array creation capability.
awk '
!/T/{
print
next
}
match($0,/(T)(.|$)/,arr) && arr[1]=="T" && arr[2]=="H"
' Input_file
Explanation: firstly checking if a line doesn't have T then print that simply. Then using match function of awk to match T followed by any character OR end of the line. Since these are getting stored into 2 capturing groups so checking if array arr's 1st element is T and 2nd element is H then print that line.