How to use sed to match Month day year pattern? - regex

I am working with large documents and have two tasks. The first is to replace every date written like "August 12 2014" or "January 31 1999", so I am using the following sed command:
s/\(jan.\|feb.\|mar.\|apr.\|may\|jun.\|jul.\|aug.\|sep.\|oct.\|nov.\|dec.\) \([0-9]\|[0-9][0-9]\) [1-9][0-9][0-9][0-9]/tokendate/g
However, it does not match full month names such as "august". I know I could change aug. to august, but I'd like sed to match any word that begins with aug, sep, and so on.
Thanks in advance for any help

sed -r 's/(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]* [0-9]{1,2} [1-9][0-9]{3}/tokendate/gi' file
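For comparison, here is a minimal Python sketch of the same idea: a three-letter month prefix, any remaining letters, a one- or two-digit day, and a four-digit year, matched case-insensitively. The helper name tokenize_dates and the added \b word boundaries are assumptions for illustration and are not part of the sed answer. Like the sed version, it accepts any word that merely starts with a month prefix.

```python
import re

# Month prefix plus any trailing letters, then "D" or "DD", then a 4-digit year.
# The \b anchors are an addition in this sketch; the sed command has none.
DATE_RE = re.compile(
    r"\b(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*"
    r" [0-9]{1,2} [1-9][0-9]{3}\b",
    re.IGNORECASE,
)


def tokenize_dates(text):
    """Replace every 'Month day year' occurrence with the literal 'tokendate'."""
    return DATE_RE.sub("tokendate", text)


print(tokenize_dates("Signed on August 12 2014 and renewed January 31 1999."))
```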

Related

RegEx for a multiple line search and replace using sed

I need a regex that takes the \n at the end of a line as its starting point (anything before it is arbitrary) and removes it together with the 15 digits and 49 whitespace characters at the start of the second line, so that the second line is joined to the first.
Attempt
sed -r -e '{N;s/\n[[:digit:]]{15}[[:space:]]{49}//}'
Input
QC H0H 0H0 CA
:70:NOFX TRADE TR
100000100200621 ADE RELATED WOOD PURCHASE
What needs to be removed is the linefeed after TRADE TR, along with the digits and padding that follow it, so that ADE RELATED joins TR and spells TRADE.
Desired Output
QC H0H 0H0 CA
:70:NOFX TRADE TRADE RELATED WOOD PURCHASE
This might work for you (GNU sed):
sed -E 'N;s/\n[[:digit:]]{15}[[:space:]]{49}//;P;D' file
This opens up a two line window and amends the second of them if the substitute command matches. It always prints the first of the two lines and then removes it.
With GNU sed:
$ sed -Ez 's/\n[[:digit:]]{15}[[:space:]]{49}//' file
QC H0H 0H0 CA
:70:NOFX TRADE TRADE RELATED WOOD PURCHASE
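The whole-input approach (as with sed -z) can also be sketched in Python. This assumes the continuation line really starts with 15 digits followed by 49 whitespace characters, as the regex in the answers expects; the sample below builds that padding explicitly because the input shown above has its whitespace collapsed. The function name join_continuations is hypothetical. Unlike the sed -z command, which has no g flag, this replaces every occurrence.

```python
import re

# A newline, then exactly 15 digits, then exactly 49 whitespace characters.
JOIN_RE = re.compile(r"\n[0-9]{15}\s{49}")


def join_continuations(text):
    """Delete each newline + 15 digits + 49 whitespace run, joining the lines."""
    return JOIN_RE.sub("", text)


# Sample data with the 49-space padding rebuilt (an assumption about the real file).
sample = (
    "QC H0H 0H0 CA\n"
    ":70:NOFX TRADE TR\n"
    "100000100200621" + " " * 49 + "ADE RELATED WOOD PURCHASE\n"
)
print(join_continuations(sample), end="")
```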

Using regular expression to search for a specific pattern in UNIX

I have file names like
ABCD20140207090842 ABCD20140207090847 ABCD20140207090849 ABCD20140207090850 ABCD2014556644219268 ABCD20140508525691 tf in my directory.
I want to find the files whose names follow a specific pattern, i.e. FileNameYearMonthDayHourMinSec.txt
Note: files tf and ABCD2014556644219268 should not get matched.
Answer with exact pattern would be appreciated.
Based on NAME, followed by DATE.txt I get this:
find . -regex ".*\(19\|20\)[0-9][0-9]\(0[1-9]\|1[0-2]\)\(0[1-9]\|[12][0-9]\|3[01]\)\([01][0-9]\|2[0-3]\)[0-5][0-9][0-5][0-9]\.txt$"
This doesn't account for leap-years though, and could match dates such as 31 Feb.
Try a different approach. The following script works regardless of leap years, because it uses GNU date to validate the timestamp:
for file in *.txt
do
    time=${file%.*}       # remove the suffix, leaving e.g. ABCD20140207090842
    time=${time:(-14)}    # keep the last 14 characters: 20140207090842
    # convert to a form GNU date can parse: 2014/02/07 09:08:42
    time="${time:0:4}/${time:4:2}/${time:6:2} ${time:8:2}:${time:10:2}:${time:12}"
    date -d "$time" >/dev/null 2>&1 && echo "$file"
done
ABCD20140207090842.txt
ABCD20140207090847.txt
ABCD20140207090849.txt
ABCD20140207090850.txt
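The same validate-by-parsing idea can be sketched in Python with datetime.strptime, which rejects impossible dates such as 31 Feb. The helper name has_valid_timestamp and the assumption that the name part consists only of letters are illustrative and do not come from the question.

```python
import re
from datetime import datetime

# Assumed layout: a letters-only name, exactly 14 digits, then ".txt".
NAME_RE = re.compile(r"^[A-Za-z]+([0-9]{14})\.txt$")


def has_valid_timestamp(filename):
    """Return True if the 14 digits form a real YYYYMMDDHHMMSS timestamp."""
    m = NAME_RE.match(filename)
    if m is None:
        return False
    try:
        datetime.strptime(m.group(1), "%Y%m%d%H%M%S")
    except ValueError:
        # Raised for out-of-range fields and for dates that do not exist.
        return False
    return True


for name in ["ABCD20140207090842.txt", "ABCD2014556644219268.txt", "tf"]:
    print(name, has_valid_timestamp(name))
```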

Regular expression to match second or last decimal number in a string

String:
<LF><CR>A214 pH/ISE,X00066,2.59,ABCDE,10/16/13 22:06:59,ABC1,CH-1,pH,7.00,pH,0.0, mV,25.0,C,100.0,%,M100,#35<LF><CR>
I need to match only the 7.00 - This number could be anywhere from 0.00 - 14.00 (its a pH reading).
Right now I can only come up with [0-9]{1,2}\.[0-9]{2} which also matches the software revision number which appears earlier in the string (2.59)
Any help is greatly appreciated.
EDIT: Thanks everyone. I figured it out by using [0-9]{1,2}\.[0-9]{2}(?=,p)
Simply find all entries and get the last:
>>> import re
>>> s = "A214 pH/ISE,X00066,2.59,ABCDE,10/16/13 22:06:59,ABC1,CH-1,pH,7.00,pH,0.0, mV,25.0,C,100.0,%,M100,#35"
>>> re.findall(r"[0-9]{1,2}\.[0-9]{2}", s)[-1]
'7.00'
You can improve that regex by using the fact that pH is between 0 and 14 (so a leading digit, if present, can only be 1). Or better, just split on commas or use the csv module.
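The comma-splitting suggestion can be sketched with the csv module. This assumes the reading always sits in the ninth field, as it does in the sample line; the function name extract_ph and the range check are illustrative additions.

```python
import csv

line = ("A214 pH/ISE,X00066,2.59,ABCDE,10/16/13 22:06:59,ABC1,CH-1,"
        "pH,7.00,pH,0.0, mV,25.0,C,100.0,%,M100,#35")


def extract_ph(record):
    """Parse one comma-separated record and return the ninth field as a float."""
    fields = next(csv.reader([record]))
    value = float(fields[8])  # index 8 is the ninth field (assumed fixed layout)
    if not 0.0 <= value <= 14.0:
        raise ValueError("pH reading out of range: %r" % fields[8])
    return value


print(extract_ph(line))
```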
Maybe you can use this:
pH,(([0-9]|1[0-4])\.\d{2}),pH
Group 1 holds the number you need, and the surrounding pH fields act as a check on the data.
If the format of the string is fixed, i.e. the data is always in the 9th position when you split on commas, you can use e.g. awk:
$ awk -F, '{print $8, $9}' input
pH 7.00
or using perl in awk-mode:
$ perl -F, -lane 'print $F[8]' input
7.00
Or use this regex:
pH,(\d+\.\d{2})
See it live at http://www.rubular.com/r/3kkWNVBAi8

Regex code for address separated by commas

How can I extract the state text which is before third comma only using the regex code?
54 West 21st Street Suite 603, New York,New York,United States, 10010
I've managed to extract the rest how I wanted but this one is a problem.
Also, how can I extract the "United States" please?
It looks like you want to use capturing groups:
.*,.*,(.*),(.*),.*
The first capturing group will be "New York" and the second will be "United States" (try it on Rubular).
Or you can split by commas (which will probably be even simpler) as @Jerry points out, assuming the language/tool you're using supports that.
You can use this regex:
(?:[^,]*,){2}([^,]*)
Then use capture group 1 to get the string you want.
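As a rough illustration in Python, the skip-n-fields pattern generalises to any field, which also covers the "United States" part of the question. The helper nth_field is hypothetical, and the strip() call is an addition to drop the stray spaces after some commas.

```python
import re

address = "54 West 21st Street Suite 603, New York,New York,United States, 10010"


def nth_field(text, n):
    """Return the n-th comma-separated field (1-based), or None if it is missing."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Skip n-1 "field," groups, then capture the next run of non-comma characters.
    m = re.match(r"(?:[^,]*,){%d}([^,]*)" % (n - 1), text)
    return m.group(1).strip() if m else None


print(nth_field(address, 3))
print(nth_field(address, 4))
```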
TL;DR
A lot depends on your regular expression engine, and whether you really need a regular expression or field-splitting. You can do field-splitting in Ruby and Awk (among others), but sed and grep only do regular expressions. See some examples below to get you started.
Ruby
str = '54 West 21st Street Suite 603, New York,New York,United States, 10010'
str.match /(?:.*?,){2}([^,]+)/
$1
#=> "New York"
GNU sed
$ echo '54 West 21st Street Suite 603, New York,New York,United States, 10010' |
sed -rn 's/([^,]+,){2}([^,]+).*/\2/p'
GNU awk
$ echo '54 West 21st Street Suite 603, New York,New York,United States, 10010' |
awk -F, '{print $3}'

RegExp to match everything up to first blank line

I'm writing a bash script that will show me what TV programs to watch today, it will get this information from a text file.
The text is in the following format:
Monday:
Family Guy (2nd May)

Tuesday:
House
The Big Bang Theory (3rd May)

Wednesday:
The Bill
NCIS
NCIS LA (27th April)

Thursday:
South Park

Friday:
FlashForward

Saturday:

Sunday:
HIGNFY
Underbelly
I'm planning to use 'date +%A' to work out the day of the week and use the output in a grep regex to return the appropriate lines from my text file.
If someone can help me with the regex I should be using, I would be eternally grateful.
Incidentally, this bash script will be used in a Conky dock, so if anyone knows of a better way to achieve this then I'd like to hear about it.
Perl solution:
#!/usr/bin/perl
my $today=`date +%A`;
$today=~s/^\s*(\w*)\s*(?:$|\Z)/$1/gsm;
my $tv=join('',(<DATA>));
for my $t (qw(Monday Tuesday Wednesday Thursday Friday Saturday Sunday)) {
print "$1\n" if $tv=~/($t:.*?)(?:^$|\Z)/sm;
}
print "Today, $1\n" if $tv=~/($today:.*?)(?:^$|\Z)/sm;
__DATA__
Monday:
Family Guy (2nd May)

Tuesday:
House
The Big Bang Theory (3rd May)

Wednesday:
The Bill
NCIS
NCIS LA (27th April)

Thursday:
South Park

Friday:
FlashForward

Saturday:

Sunday:
HIGNFY
Underbelly
sed -n '/^Tuesday:/,/^$/p' list.txt
grep -B10000 -m1 '^$' list.txt
-B10000: print up to 10000 lines before the match
-m1: stop after the first match
^$: match an empty line
Alternatively, you can use this:
awk '/^'`date +%A`':$/,/^$/ {if ($0~/[^:]$/) print $0}' guide.txt
This awk script matches a consecutive group of lines which starts with /^Day:$/ and ends with a blank line. It only prints a line if the line ends with a character that is not a colon. So it won't print "Sunday:" or the blank line.
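The same range logic (start at the "Day:" line, stop at the next blank line) can be sketched in Python. The function name shows_for and the shortened guide text are illustrative. Note that strftime("%A") follows the current locale, so English day names are an assumption here.

```python
import datetime


def shows_for(day, text):
    """Return the lines between 'day:' and the next blank line."""
    shows = []
    in_block = False
    for line in text.splitlines():
        if not in_block:
            # Look for the heading line, e.g. "Tuesday:".
            in_block = (line.strip() == day + ":")
        elif line.strip() == "":
            break  # a blank line ends the block
        else:
            shows.append(line.strip())
    return shows


# Shortened sample guide; blank lines separate the day blocks.
guide = """Monday:
Family Guy (2nd May)

Tuesday:
House
The Big Bang Theory (3rd May)

Saturday:

Sunday:
HIGNFY
"""

today = datetime.date.today().strftime("%A")
print(today, shows_for(today, guide))
```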