bash: Batch reformatting using sed + date? - regex

I have a bunch of data that looks like this:
"2004-03-23 20:11:55" 3 3 1
"2004-03-23 20:12:20" 1 1 1
"2004-03-31 02:20:04" 15 15 1
"2004-04-07 14:33:48" 141 141 1
"2004-04-15 02:08:31" 2 2 1
"2004-04-15 07:56:01" 1 2 1
"2004-04-16 12:41:22" 4 4 1
and I need to feed this data to a program which only accepts time in UNIX (Epoch) format. Is there a way I can change all the dates in bash? My first instinct tells me to do something like this:
sed 's/"(.*)"/`date -jf "%Y-%m-%d %T" "\1" "+%s"`'
But I am not entirely sure that the \1 inside the date call will properly backreference the regex matched by sed. In fact, when I run this, I get the following response:
sed: 1: "s/(".*")/`date -jf "% ...": unterminated substitute in regular expression
Can anyone guide me in the right direction on this? Thank you.

Nothing is going to be expanded between single quotes. Also, no, the shell expansions are going to happen before the sed \1 expansion, so your code isn't going to work. How about something like this (untested):
while IFS= read -r date time a b c
do
date --date "${date:1} ${time::-1}" # Cut the variables to remove the literal quotes
printf " %s %s %s\n" "$a" "$b" "$c"
done < file

Related

get number value between two strings using regex

I have a string with multiple value outputs that looks like this:
SD performance read=1450kB/s write=872kB/s no error (0 0), ManufactorerID 27 Date 2014/2 CardType 2 Blocksize 512 Erase 0 MaxtransferRate 25000000 RWfactor 2 ReadSpeed 22222222Hz WriteSpeed 22222222Hz MaxReadCurrentVDDmin 3 MaxReadCurrentVDDmax 5 MaxWriteCurrentVDDmin 3 MaxWriteCurrentVDDmax 1
I would like to output only the read value (1450kB/s) using bash and sed.
I tried
sed 's/read=\(.*\)kB/\1/'
but that outputs read=1450kB but I only want the number.
Thanks for any help.
Sample input shortened for demo:
$ echo 'SD performance read=1450kB/s write=872kB/s no error' | sed 's/read=\(.*\)kB/\1/'
SD performance 1450kB/s write=872/s no error
$ echo 'SD performance read=1450kB/s write=872kB/s no error' | sed 's/.*read=\(.*\)kB.*/\1/'
1450kB/s write=872
$ echo 'SD performance read=1450kB/s write=872kB/s no error' | sed 's/.*read=\([0-9]*\)kB.*/\1/'
1450
Since entire line has to be replaced, add .* before and after search pattern
* is greedy, will try to match as much as possible, so in 2nd example it can be seen that it matched even the values of write
Since only numbers after read= is needed, use [0-9] instead of .
Running
sed 's/read=\(.*\)kB/\1/'
will replace read=[digits]kB with [digit]. If you want to replace the whole string, use
sed 's/.*read=\([0-9]*\)kB.*/\1/'
instead.
As Sundeep noticed, sed doesn't support non-greedy pattern, updated for [0-9]* instead

Bash - count a pattern and print the line containing the pattern

everyone! While I was reading this discussion, "Count number of occurrences of a pattern in a file (even on same line)", I wondered if I could add the line containing the pattern next to the count values.
Somehow I wasn't able to add any comment on the discussion, so I'm posting a new question. Can somebody en-light me?
There must be some misunderstanding here, so I put an example.
Let's say, I have a DNA sequence like below and want to find out how many 'CG' are present in each line.
ACAAAGAACTCAAGAAGTTGGACCCCAGAGAACCAAATAACCCTATTAAA
AATTCGGAACAGAGATAAACAAAGAATTCTCAACTGAGGAAACTTGAATG
GGATTTTTTTTTAAGATTCACTTATTTTTATTTTCTGCATGAGTGTTTGC
CTCGATGTATGTACATATACGACATGTGTACGTGGTGCGCAAGTAAGCAG
Additionally, I want to print each line (not the pattern) along with the pattern counts.
0 ACAAAGAACTCAAGAAGTTGGACCCCAGAGAACCAAATAACCCTATTAAA
1 AATTCGGAACAGAGATAAACAAAGAATTCTCAACTGAGGAAACTTGAATG
0 GGATTTTTTTTTAAGATTCACTTATTTTTATTTTCTGCATGAGTGTTTGC
4 CTCGATGTATGTACATATACGACATGTGTACGTGGTGCGCAAGTAAGCAG
I wish the example above will help to understand the question better.
Thank you!
You can do:
printf 'pattern' | tee >(sed 's/$/ : /') | grep -cf - input.txt
Taking help of tee and process substitution.
Example:
% cat file.txt
foobar
spamegg
foo
% printf 'foo' | tee >(sed 's/$/ : /') | grep -cf - file.txt
foo : 2
cat fileName | grep pattern | uniq -c
I just found a really simple and elegant solution using EXCEL.
The formula goes like below...
=(LEN(B2)-LEN(SUBSTITUTE(B2,"CG","")))/2
What this formula basically does is it counts total length of strings in a cell and length after removal of the pattern ("CG" in this case), then subtract them. Since each "CG" is replaced by blanks, 2 strings are missing after substitution, and you can get the number of the pattern by dividing it with length of your pattern which is 2 in this case.
For example, following sequence contains 50 strings and 13 CG's.
CAGTGCACACAACACATGTACGCGCGCGCGCGCGCGCGCGCGCGCGTGTG 50
After substituting "CG" to blanks, you get 24 strings.
CAGTGCACACAACACATGTATGTG 24
To count the "CG" occurances,
(50-24)/2 = 13
If you are looking for "CAG", enter "CAG" instead of "CG" and divide by 3.
How simple is that!
You can see the original post in the following link.
http://fiveminutelessons.com/learn-microsoft-excel/count-occurrences-single-character-cell-excel#sthash.H4VfOkGB.dpbs
English is not my primary language, so please understand errors in my writing.
People are geniuses!

Can adding a particular number to a bunch of "time" strings, be done in Regex

I have a "srt" file(like standard movie-subtitle format) like shown in below link:http://pastebin.com/3k8a53SC
Excerpt:
1
00:00:53,000 --> 00:00:57,000
<any text that may span multiple lines>
2
00:01:28,000 --> 00:01:35,000
<any text that may span multiple lines>
But right now the subtitles timing is all wrong, as it lags behind by 9 seconds.
Is it possible to add 9 seconds(+9) to every time entry with regex ?
Even if the milliseconds is set to 000 then it's fine, but the addition of 9 seconds should adhere to "60 seconds = 1 minute & 60 minutes = 1 hour" rules.
Also the subtitle text after timing entry must not get altered by regex.
By the way the time format for each time string is "Hours:Minutes:Seconds.Milliseconds".
Quick answer is "no", that's not an application for regex. A regular expression lets you MATCH text, but not change it. Changing things is outside the scope of the regex itself, and falls to the language you're using -- perl, awk, bash, etc.
For the task of adjusting the time within an SRT file, you could do this easily enough in bash, using the date command to adjust times.
#!/usr/bin/env bash
offset="${1:-0}"
datematch="^(([0-9]{2}:){2}[0-9]{2}),[0-9]{3} --> (([0-9]{2}:){2}[0-9]{2}),[0-9]{3}"
os=$(uname -s)
while read line; do
if [[ "$line" =~ $datematch ]]; then
# Gather the start and end times from the regex
start=${BASH_REMATCH[1]}
end=${BASH_REMATCH[3]}
# Replace the time in this line with a printf pattern
linefmt="${line//[0-2][0-9]:[0-5][0-9]:[0-5][0-9]/%s}\n"
# Calculate new times
case "$os" in
Darwin|*BSD)
newstart=$(date -v${offset}S -j -f "%H:%M:%S" "$start" '+%H:%M:%S')
newend=$(date -v${offset}S -j -f "%H:%M:%S" "$end" '+%H:%M:%S')
;;
Linux)
newstart=$(date -d "$start today ${offset} seconds" '+%H:%M:%S')
newend=$(date -d "$end today ${offset} seconds" '+%H:%M:%S')
;;
esac
# And print the result
printf "$linefmt" "$newstart" "$newend"
else
# No adjustments required, print the line verbatim.
echo "$line"
fi
done
Note the case statement. This script should auto-adjust for Linux, OSX, FreeBSD, etc.
You'd use this script like this:
$ ./srtadj -9 < input.srt > output.srt
Assuming you named it that, of course. Or more likely, you'd adapt its logic for use in your own script.
No, sorry, you can’t. Regex are a context free language (see Chomsky e.g. https://en.wikipedia.org/wiki/Chomsky_hierarchy) and you cannot calculate.
But with a context sensitive language like perl it will work.
It could be a one liner like this ;-)))
perl -n -e 'if(/^(\d\d:\d\d:\d\d)([-,\d\s\>]*)(\d\d:\d\d:\d\d)(.*)/) {print plus9($1).$2.plus9($3).$4."\n";}else{print $_} sub plus9{ ($h,$m,$s)=split(/:/,shift); $t=(($h*60+$m)*60+$s+9); $h=int($t/3600);$r=$t-($h*3600);$m=int($r/60);$s=$r-($m*60);return sprintf "%02d:%02d:%02d", $h, $m, $s;}‘ movie.srt
with move.srt like
1
00:00:53,000 --> 00:00:57,000
hello
2
00:01:28,000 --> 00:01:35,000
I like perl
3
00:02:09,000 --> 00:02:14,000
and regex
you will get
1
00:01:02,000 --> 00:01:06,000
hello
2
00:01:37,000 --> 00:01:44,000
I like perl
3
00:02:18,000 --> 00:02:23,000
and regex
You can change the +9 in the "sub plus9{...}", if you want another delta.
How does it work?
We are looking for lines that matches
dd:dd:dd something dd:dd:dd something
and then we call a sub, which add 9 seconds to the matched group one ($1) and group three ($3). All other lines are printed unchanged.
added
If you want to put the perl oneliner in a file, say plus9.pl, you can add newlines ;-)
if(/^(\d\d:\d\d:\d\d)([-,\d\s\>]*)(\d\d:\d\d:\d\d)(.*)/) {
print plus9($1).$2.plus9($3).$4."\n";
} else {
print $_
}
sub plus9{
($h,$m,$s)=split(/:/,shift);
$t=(($h*60+$m)*60+$s+9);
$h=int($t/3600);
$r=$t-($h*3600);
$m=int($r/60);
$s=$r-($m*60);
return sprintf "%02d:%02d:%02d", $h, $m, $s;
}
Regular expressions strictly do matching and cannot add/substract. You can match each datetime string using python, for example, add 9 seconds to that, and then rewrite the string in the appropriate spot. The regular expression I would use to match it would be the following:
(?<hour>\d+):(?<minute>\d+):(?<second>\d+),(?<msecond>\d+)
It has labeled capture groups so it's really easy to get each section (you won't need msecond but it's there for visualization, I guess)
Regex101

Awk 3 Spaces + 1 space or hyphen

I have a rather large chart to parse. Each column is separated by either 4 spaces or by 3 spaces and a hyphen (since the numbers in the chart can be negative).
cat DATA.txt | awk "{ print match($0,/\s\s/) }"
does nothing but print a slew of 0's. I'm trying to understand AWK and when to escape, etc, but I'm not getting the hang of it. Help is appreciated.
One line:
1979 1 -0.176 -0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
1979 1 -0.176 0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
I would like to get just, say, the second column. I copied the line, but I'd like to see -0.185 and 0.185.
You need to start by thinking about bash quoting, since it is bash which interprets the argument to awk which will be the awk program. Inside double-quoted strings, bash expands $0 to the name of the bash executable (or current script); that's almost certainly not what you want, since it will not be a quoted string. In fact, you almost never want to use double quotes around the awk program argument, so you should get into the habit of writing awk '...'.
Also, awk regular expressions don't understand \s (although Gnu awk will handle that as an extension). And match returns the position of the match, which I don't think you care about either.
Since by default, awk considers any sequence of whitespace a field separator, you don't really need to play any games to get the fourth column. Just use awk '{print $4}'
Why not just use this simple awk
awk '$0=$4' Data.txt
-0.185
0.185
It sets $0 to value in $4 and does the default action, print.
PS do not use cat with program that can read data itself, like awk
In case of filed 4 containing 0, you can make it more robust like:
awk '{$0=$4}1' Data.txt
If you're trying to split the input according to 3 or 4 spaces then you will get the expected output only from column 3.
$ awk -v FS=" {3,4}" '{print $3}' file
-0.185
0.185
FS=" {3,4}" here we pass a regex as FS value. This regex get parsed and set the Field Separator value to three or four spaces. In regex {min,max} called range quantifier which repeats the previous token from min to max times.

unix regex for adding contents in a file

i have contents in a file
like
asdfb ... 1
adfsdf ... 2
sdfdf .. 3
I want to write a unix command that should be able to add 1 + 2 + 3 and give the result as 6
From what I am aware grep and awk would be handy, any pointers would help.
I believe the following is what you're looking for. It will sum up the last field in each record for the data that is read from stdin.
awk '{ sum += $NF } END { print sum }' < file.txt
Some things to note:
With awk you don't need to declare variables, they are willed into existence by assigning values to them.
The variable NF is the number of fields in the current record. By prepending it with a $ we are treating its value as a variable. At least this is how it appears to work anyway :)
The END { } block is only once all records have been processed by the other blocks.
An awk script is all you need for that, since it has grep facilities built in as part of the language.
Let's say your actual file consists of:
asdfb zz 1
adfsdf yyy 2
sdfdf xx 3
and you want to sum the third column. You can use:
echo 'asdfb zz 1
adfsdf yyy 2
sdfdf xx 3' | awk '
BEGIN {s=0;}
{s = s + $3;}
END {print s;}'
The BEGIN clause is run before processing any lines, the END clause after processing all lines.
The other clause happens for every line but you can add more clauses to change the behavior based on all sorts of things (grep-py things).
This might not exactly be what you're looking for, but I wrote a quick Ruby script to accomplish your goal:
#!/usr/bin/env ruby
total = 0
while gets
total += $1.to_i if $_ =~ /([0-9]+)$/
end
puts total
Here's one in Perl.
$ cat foo.txt
asdfb ... 1
adfsdf ... 2
sdfdf .. 3
$ perl -a -n -E '$total += $F[2]; END { say $total }' foo
6
Golfed version:
perl -anE'END{say$n}$n+=$F[2]' foo
6