How do I fix a regex matching a few unexpected characters? - regex

I am using a regex that should, as a first preference, match the token (a number or an alphanumeric string) immediately following the string "Lecture"; if the line does not contain "Lecture", it should match the last token on the line instead.
Current regex
cat 1.txt | perl -ne 'print "$& \n" while /Lecture\h*\K\w+|^(?!.*Lecture).*\h\K[^.\s]+/g;/^.*?-(.*)/g' | perl -ne 'print "$& \n" while /(\d+\w*)/g'
The data is not very consistent. There may be spaces or hyphens around the string "Lecture" or around the final token, and a line may or may not end in .mp4.
My current regex works almost correctly; it only fails for three lines near the bottom. I could have included only those lines here, but I don't want the solution regex to break the other cases, so all the possibilities are included below.
cat 1.txt
54282068 Lecture74- AS 29 Question.mp4
174424104Lecture 74B - AS 29 Theory.mp4
Branch Accounts Lecture 105
Lecture05 - Practicals AS 28
Submissions 20.mp4
HW Section 77N
Residential status HWS Q.1 to 6 -60A
Residential status HWS Q.7 to 20 -60B
House property all HWS-60C
Salary HWS Q.11 to 13 - 60F
Salary HWS Q.1 to 5-60D
Salary HWS Q.6 to 10-60E
Salary HWS Q.14 to 20-60G
Operating Costing 351
Expected Output
74
74B
105
05
20
77N
60A
60B
60C
60F
60D
60E
60G
351
Exact issue: for the three lines just above the last one, it prints 5, 10 and 20 in addition to the expected final tokens 60D, 60E and 60G.
I believe the issue is somewhere in the last part of my regex and needs only a very small edit to fix. Hopefully someone can help me.

Please check whether the following piece of code meets your requirements:
use strict;
use warnings;
use feature 'say';
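while( <DATA> ) {
    chomp;
    s/\.mp4$//;    # strip a trailing .mp4 extension only
    # prefer the token after "Lecture"; otherwise take the trailing number
    # with an optional letter (\d+ rather than \d{2}, so 351 is kept whole)
    say $1 if /Lecture\s*(\w+)/ or /(\d+[A-Z]?)\Z/;
}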
__DATA__
54282068 Lecture74- AS 29 Question.mp4
174424104Lecture 74B - AS 29 Theory.mp4
Branch Accounts Lecture 105
Lecture05 - Practicals AS 28
Submissions 20.mp4
HW Section 77N
Residential status HWS Q.1 to 6 -60A
Residential status HWS Q.7 to 20 -60B
House property all HWS-60C
Salary HWS Q.11 to 13 - 60F
Salary HWS Q.1 to 5-60D
Salary HWS Q.6 to 10-60E
Salary HWS Q.14 to 20-60G
Output
74
74B
105
05
20
77N
60A
60B
60C
60F
60D
60E
60G
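For what it's worth, the extra 5, 10 and 20 seem to come from the two-stage pipeline in the question: on a line like "Salary HWS Q.1 to 5-60D" the first perl prints 5-60D (everything after the last space), and the second perl's /(\d+\w*)/g then finds both 5 and 60D in it. Below is a minimal single-pass sketch in plain bash. It assumes bash 3.1 or later, and it assumes the wanted token at the end of a line is a run of digits with at most one capital letter, optionally followed by a file extension. lecture_id is a hypothetical helper name, not something from the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the question): print the wanted token for one line.
lecture_id() {
    local line=$1
    # Preference 1: the alphanumeric token right after "Lecture".
    local re_lecture='Lecture[[:space:]]*([[:alnum:]]+)'
    # Preference 2 (assumption): trailing digits plus an optional capital
    # letter, optionally followed by an extension such as .mp4.
    local re_tail='([0-9]+[[:upper:]]?)(\.[[:alnum:]]+)?$'
    if [[ $line =~ $re_lecture ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    elif [[ $line =~ $re_tail ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    fi
}
```

It could be applied to the file with something like: while IFS= read -r line; do lecture_id "$line"; done < 1.txt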

Related

Bash: generate a regexp to match any formatted date between two dates

Using bash, I would like to generate a regexp that matches any formatted date between two dates. (I will later use it in a restricted production environment, so I want to use only bash as far as possible.)
The dates are arbitrary and the range may even cross years; for example, the regexp might match any date between
2022-12-27 and
2023-02-05
so all of the dates
2022-12-27
2022-12-28
2022-12-29
2022-12-30
2022-12-31
2023-01-01
(…)
2023-02-05
The time range comes from two parameters given as input to the future bash script.
Finally, I will use that regexp both to find and manage filenames and to grep some data.
The filename patterns are arbitrary but always contain a date in YYYY-MM-DD format, whether the filename looks like dd_YYYY-MM-DD.xx, aaaa__bbbb_dd_YYYY-MM-DD.xxx_ccc_zzz.log or anything else.
I tried to manage that by separating each year/month/day like
fromDate="$1"
toDate="$2"
# break the dates into array so that we can easily access the day, month and year
#
fdy=$( sed 's/.*\([0-9]\{4\}\).*/\1/' <<< $fromDate )
fdm=$( sed 's/.*-\([0-9]\{2\}\)-.*/\1/' <<< $fromDate )
fdd=$( sed 's/.*-.*-\([0-9]\{2\}\).*/\1/' <<< $fromDate )
#
edy=$( sed 's/.*\([0-9]\{4\}\).*/\1/' <<< $toDate )
edm=$( sed 's/.*-\([0-9]\{2\}\)-.*/\1/' <<< $toDate )
edd=$( sed 's/.*-.*-\([0-9]\{2\}\).*/\1/' <<< $toDate )
then to loop over that with some sort of
#[...]
printf -v date "%d-%02d-%02d" "${from[2]}" "${from[1]}" "${from[0]}"
pattern="$pattern|$date"
((from[0]++))
# reset days and increment month if days exceed 31
if [[ "${from[0]}" -gt 31 ]]
then
from[0]=1
((from[1]++))
#[...]
but I didn't find a way to make this work and output a correct regexp matching any date inside the date range.
This is not a regex solution but may be helpful.
When the date format is yyyy-mm-dd you can compare dates lexicographically.
The string "2023-01-05" is greater than "2022-08-05" because the first character they don't have in common is greater in the first string.
In bash you can therefore just do the comparison [[ "$date" > "$from" ]] && [[ "$to" > "$date" ]] to see whether a date in yyyy-mm-dd string format lies strictly between $from and $to (this test excludes both endpoints).
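A minimal sketch of that comparison applied to filenames, assuming bash and assuming each name contains exactly one relevant YYYY-MM-DD date. date_in_name and in_range are hypothetical helper names, and the bounds are made inclusive here by negating the opposite comparison:

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the original answer.

# date_in_name NAME: print the first YYYY-MM-DD found in NAME, or fail.
date_in_name() {
    local re='[0-9]{4}-[0-9]{2}-[0-9]{2}'
    [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[0]}"
}

# in_range DATE FROM TO: succeed if FROM <= DATE <= TO.
# Plain string comparison is enough because the format is yyyy-mm-dd.
in_range() {
    [[ ! $1 < $2 && ! $1 > $3 ]]
}
```

A loop such as for f in *; do d=$(date_in_name "$f") && in_range "$d" "$fromDate" "$toDate" && echo "$f"; done would then list the matching files without building any regexp.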
There is a nifty feature in bash/readline that you can use here: brace patterns and their expansion. It's not exactly regex, but converting from those to regex is easy.
First, since it's practical in this example to generate files for those dates, let's start with:
$ k=0 b=2022-12-27 a=2022-12-27;
until [[ $b > 2023-01-05 ]]; do
b=$(date -d "$a + $k days" +%F); echo $b; k=$((k+1)); done \
| xargs -t touch
touch 2022-12-27 2022-12-28 2022-12-29 2022-12-30 (...)
$ ls
2022-12-27 2022-12-29 2022-12-31 2023-01-02 2023-01-04 2023-01-06
2022-12-28 2022-12-30 2023-01-01 2023-01-03 2023-01-05
Then check if you have this brace thing enabled.
$ bind -p | grep brace
"\e{": complete-into-braces
And if not, add it to the inputrc file:
$ grep braces ~/.inputrc
"\e{": complete-into-braces
Now, in the empty folder where those files were created, pressing escape+{ generates this string:
$ 202{2-12-{2{7,8,9},3{0,1}},3-01-0{1,2,3,4,5,6}}
Converting that to a regex should be a matter of replacing the commas and braces:
$ echo '202{2-12-{2{7,8,9},3{0,1}},3-01-0{1,2,3,4,5,6}}' | tr '{},' '()|'
202(2-12-(2(7|8|9)|3(0|1))|3-01-0(1|2|3|4|5|6))
Let's test it:
$ (days 2022; days;) | grep -Ee '202(2-12-(2(7|8|9)|3(0|1))|3-01-0(1|2|3|4|5|6))'
361 2022-12-27 December 52 Tuesday
362 2022-12-28 December 52 Wednesday
363 2022-12-29 December 52 Thursday
364 2022-12-30 December 52 Friday
365 2022-12-31 December 52 Saturday
001 2023-01-01 January 52 Sunday
002 2023-01-02 January 01 Monday
003 2023-01-03 January 01 Tuesday
004 2023-01-04 January 01 Wednesday
005 2023-01-05 January 01 Thursday
006 2023-01-06 January 01 Friday
The days script is from my local-bin repo on github.
So it works well enough, but the initial until loop was wrong: it ran one day too far and also created 2023-01-06. We just need to remove the trailing 6 from the regex.
$ (days 2022; days;) | grep -Ee '202(2-12-(2(7|8|9)|3(0|1))|3-01-0(1|2|3|4|5))' | sed -n '1p; $p'
361 2022-12-27 December 52 Tuesday
005 2023-01-05 January 01 Thursday
This takes care of the YYYY-MM-DD portion.
I think this answers the question "how to generate a regex for the dates". If you want to expand the question, and need more hints, leave a comment.
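The complete-into-braces trick needs an interactive readline session, so for a script the alternation may have to be built directly. Here is a rough sketch in pure bash (no external date command), assuming both arguments are valid yyyy-mm-dd dates with the first not later than the second. days_in_month and date_range_regex are hypothetical names, and the result is a plain (d1|d2|...) alternation rather than the compact nested form shown above:

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the original answer.

# days_in_month YEAR MONTH: print the number of days (Gregorian leap rule).
days_in_month() {
    local y=$1 m=$2
    case $m in
        4|6|9|11) echo 30 ;;
        2) if (( (y % 4 == 0 && y % 100 != 0) || y % 400 == 0 )); then
               echo 29
           else
               echo 28
           fi ;;
        *) echo 31 ;;
    esac
}

# date_range_regex FROM TO: print an ERE alternation of every date in the
# inclusive range, e.g. (2022-12-30|2022-12-31|2023-01-01).
date_range_regex() {
    local from=$1 to=$2 y m d cur pattern=
    IFS=- read -r y m d <<< "$from"
    # Force base 10 so that 08 and 09 are not rejected as invalid octal.
    y=$((10#$y)) m=$((10#$m)) d=$((10#$d))
    printf -v cur '%04d-%02d-%02d' "$y" "$m" "$d"
    while [[ ! $cur > $to ]]; do
        pattern+="${pattern:+|}$cur"
        # Advance one day, rolling over the month and year when needed.
        if (( ++d > $(days_in_month "$y" "$m") )); then
            d=1
            if (( ++m > 12 )); then
                m=1
                y=$((y + 1))
            fi
        fi
        printf -v cur '%04d-%02d-%02d' "$y" "$m" "$d"
    done
    printf '(%s)\n' "$pattern"
}
```

The output can be passed to grep -E, for example ls | grep -E "$(date_range_regex "$1" "$2")". For long ranges the alternation gets large, which is where the nested brace form is more compact.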

Looking for regexp to keep some strings

This is the string I'm trying to regexp :
15C (59F) ambient, 22C (71F), 20C (68F), 26C (78F), 21C (69F), 27C (80F), 30C (86F), 33C (91F)
I would like to keep only the temperature values in degrees Celsius, without the letter C, and delete everything else.
Can someone help me to do so ?
Thanks in advance !
With GNU grep:
grep -oP '[0-9]+(?=C)' file
Output:
15
22
20
26
21
27
30
33
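If the available grep has no -P (PCRE) support, a similar result can probably be had with plain extended regexes: match the digits together with the C, then delete the C. This is only a sketch; celsius_values is a hypothetical name, and it assumes -o is supported (GNU and BSD grep both have it):

```shell
#!/usr/bin/env bash
# Hypothetical helper: read text on stdin, print each Celsius value on its own line.
celsius_values() {
    # -o prints only the matched part, e.g. "15C"; tr then drops the letter.
    grep -oE '[0-9]+C' | tr -d 'C'
}
```

Usage would be celsius_values < file.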

How can I extract Twitter #handles from a text with RegEx?

I'm looking for an easy way to create lists of Twitter #handles based on SocialBakers data (copy/paste into TextMate).
I've tried using the following RegEx, which I found here on StackOverflow, but unfortunately it doesn't work the way I want it to:
^(?!.*#([\w+])).*$
While the expression above deletes all lines without #handles, I'd like the RegEx to delete everything before and after the #handle as well as lines without #handles.
Example:
1
katyperry KATY PERRY (#katyperry)
Followings 158
Followers 82 085 596
Rating
5
Worst012345678910Best
2
justinbieber Justin Bieber (#justinbieber)
254 399
74 748 878
2
Worst012345678910Best
3
taylorswift13 Taylor Swift (#taylorswift13)
245
70 529 992
Desired result:
#katyperry
#justinbieber
#taylorswift13
Thanks in advance for any help!
Something like this:
cat file | perl -ne 'while(s/(#[a-z0-9_]+)//gi) { print $1,"\n"}'
This will also work if you have lines with multiple #handles in.
A Twitter handle regex is #\w+. So, to remove everything else, you need to match and capture the pattern and use a backreference to this capture group, and then just match any character:
(#\w+)|.
Use DOTALL mode to also match newline symbols. Replace with $1 (or \1, depending on the tool you are using).
See demo
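Outside an editor, the same #\w+ idea can be sketched with grep -o, which prints only the matched part of each line and silently skips lines without a match. extract_handles is a hypothetical name, and [[:alnum:]_] is used as a portable stand-in for \w:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read text on stdin, print every #handle on its own line.
extract_handles() {
    grep -oE '#[[:alnum:]_]+'
}
```

For example, extract_handles < file should print just the three handles for the sample data in the question.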
Straight regex, tested in Caret:
#.*[^)]
The above matches from the # up to, but not including, the closing parenthesis at the end of the line.
#.*\b
This does the same thing in the Caret text editor.
How to awk and sed this:
Get usernames as well:
$ awk '/#.*/ {print}' test
katyperry KATY PERRY (#katyperry)
justinbieber Justin Bieber (#justinbieber)
taylorswift13 Taylor Swift (#taylorswift13)
Just the Handle:
$ awk -F "(" '/#.*/ {print$2}' test | sed 's/)//g'
#katyperry
#justinbieber
#taylorswift13
A look at the test file:
$ cat test
1
katyperry KATY PERRY (#katyperry)
Followings 158
Followers 82 085 596
Rating
5
Worst012345678910Best
2
justinbieber Justin Bieber (#justinbieber)
254 399
74 748 878
2
Worst012345678910Best
3
taylorswift13 Taylor Swift (#taylorswift13)
245
70 529 992
Bash Version:
$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Copyright (C) 2007 Free Software Foundation, Inc.

Sed regex string substitution from terminal

I have a log file with a standard format, e.g.:
31 Mar - Lorem Ipsom1
31 Mar - Lorem Ipsom2
31 Mar - Lorem Ipsom3
The replacement I want to implement is 31*31 to 31 so I'll end up with a log that has only its last line, in this example it will look like:
31 Mar - Lorem Ipsom3
I wish to perform it on a customized linux machine that has no perl.
I tried to use sed like this:
sed -i -- 's/31*31/31/g' /var/log/prog/logFile
But it did nothing..
Any alternatives involving ninja bash commands are also welcomed.
A way to keep only the last of consecutive lines that match a pattern is
sed -n '/^31/ { :a $!{ h; n; //ba; x; G } }; p' filename
This works as follows:
/^31/ {         # if a line begins with 31
  :a            # jump label for looping
  $!{           # if the end of input has not been reached (otherwise the
                # current line is the last line of the block by virtue of
                # being the last line)
    h           # hold the current line
    n           # fetch the next line. (note that this doesn't print the line
                # because of -n)
    //ba        # if that line also begins with 31, go to :a. // attempts the
                # most recently attempted regex again, which was ^31
    x           # swap hold buffer, pattern space
    G           # append hold buffer to pattern space. The PS now contains
                # the last line of the block followed by the first line that
                # comes after it
  }
}
p               # in the end, print the result
This avoids some problems of multi-line regular expressions, such as matches that begin or end in the middle of a line. It will also not discard lines between two blocks of matching lines, and it keeps the last line of each block.
* is not a wildcard as it is in the shell, it is a quantifier. You need to quantify over . (any character). The regex is thus:
sed ':a;N;$!ba;s/31.*31/31/g'
(I removed the -i flag so you can first test your file safely).
The :a;N;$!ba; part makes it possible to process over new lines.
Note however:
The regex will match any 31 so:
31 Mar - Lorem Ipsom1
31 Mar - Lorem 31 Ipsom2
Will result in
31 Ipsom2
The match is greedy; if the log reads:
31 Mar - Lorem Ipsom1
30 Mar - Lorem Ipsom2
31 Mar - Lorem Ipsom3
it will remove the second line as well.
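You can solve the first problem by writing (with GNU sed; -E enables the grouping and alternation syntax, and \1 puts back the newline consumed by the group):
sed -E ':a;N;$!ba;s/(^|\n)31.*\n31/\131/g'
This forces the second 31 to be located at the beginning of a line.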
I think you might be looking for "tail" to get the last line of the file
e.g.
tail -1 /path/file
or if you want the last entry from each day then "sort" might be your solution
sort -ur -k 1,2 /path/file | sort
the -u flag specifies only a single match for the keyfields will be returned
the -k 1,2 specifies that the keyfields are the first two fields - in this case they are the month and the date - fields by default are separated by white space.
the -r flag reverses the lines such that the last match for each date will be returned. Sort a second time to restore the original order.
If your log file has more than a single month of data, and you wish to preserve order (e.g. if you have Mar 31 and Apr 1 in the same file) you can try:
cat -n tmp2 | sort -nr | sort -u -k 2,3 | sort -n | cut -f 2-
cat -n adds the line number to the log file before sorting.
sort as before but use fields 2 and 3, because field 1 is now the original line number
sort by the original line number to restore the original order.
use cut to remove the line numbers and restore the original line content.
e.g.
$ cat tmp2
30 Mar - Lorem Ipsom2
30 Mar - Lorem Ipsom1
31 Mar - Lorem Ipsom1
31 Mar - Lorem Ipsom2
31 Mar - Lorem Ipsom3
1 Apr - Lorem Ipsom1
1 Apr - Lorem Ipsom2
$ cat -n tmp2 | sort -nr | sort -u -k 2,3 | sort -n | cut -f 2-
30 Mar - Lorem Ipsom1
31 Mar - Lorem Ipsom3
1 Apr - Lorem Ipsom2
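If awk is available on the machine (the question only rules out perl), the same "last entry of each day" result can probably be had in one pass without sorting. This is a sketch only; last_per_day is a hypothetical name, and it assumes the lines for one date are consecutive and that the date is the first two whitespace-separated fields:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a log on stdin, print the last line of each
# run of consecutive lines sharing the same first two fields (day, month).
last_per_day() {
    awk '
        { key = $1 " " $2 }                        # date key of this line
        NR > 1 && key != prev { print last }       # date changed: emit previous block
        { prev = key; last = $0 }                  # remember the most recent line
        END { if (NR > 0) print last }             # emit the final block
    '
}
```

Unlike the sort pipelines it keeps the original order by construction, e.g. last_per_day < tmp2.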

Replace first two whitespace occurrences with a comma using sed

I have a whitespace delimited file with a variable number of entries on each line. I want to replace the first two whitespaces with commas to create a comma delimited file with three columns.
Here's my input:
a b 1 2 3 3 2 1
c d 44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z y 2 3 33
And here's my desired output:
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33
I'm trying to use perl regular expressions in a sed command but I can't quite get it to work. First I try capturing a word, followed by a space, then another word, but that only works for lines 1, 2, and 5:
$ cat test | sed -r 's/(\w)\s+(\w)\s+/\1,\2,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z,y,2 3 33
I also try capturing whitespace, a word, and then more whitespace, but that gives me the same result:
$ cat test | sed -r 's/\s+(\w)\s+/,\1,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z,y,2 3 33
I also try doing this with the .? wildcard, but that does something funny to line 4.
$ cat test | sed -r 's/\s+(.?)\s+/,\1,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh,,77 88 99
z,y,2 3 33
Any help is much appreciated!
How about this:
sed -e 's/\s\+/,/' | sed -e 's/\s\+/,/'
It's probably possible with a single sed command, but this is sure an easy way :)
My output:
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33
Try this:
sed -r 's/\s+(\S+)\s+/,\1,/'
Just replaced \w (one "word" char) with \S+ (one or more non-space chars) in one of your attempts.
You can provide multiple commands to a single instance of sed by just providing multiple -e arguments.
To do the first two, just use:
sed -e 's/\s\+/,/' -e 's/\s\+/,/'
This basically runs both commands on the line in sequence, the first doing the first block of whitespace, the second doing the next.
The following transcript shows this in action:
pax$ echo 'a b 1 2 3 3 2 1
c d 44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z y 2 3 33
' | sed -e 's/\s\+/,/' -e 's/\s\+/,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33
Sed s/// supports a way to say which occurrence of a pattern to replace: just add the n to the end of the command to replace only the nth occurrence. So, to replace the first and second occurrences of whitespace, just use it this way:
$ sed 's/  */,/1;s/  */,/2' input
a,b ,1 2 3 3 2 1
c,d ,44 55 66 2355
line,http://google.com 100,200 300
ef,jh ,77 88 99
z,y 2,3 33
EDIT: reading the other proposed solutions, I noticed that the 1 and 2 after s/  */,/ are not only unnecessary but plainly wrong. By default, s/// just replaces the first occurrence of the pattern. So, if we have two identical s/// commands in sequence, they will replace the first and the second occurrence. What you need is just the following (note the two spaces before the *, so that at least one space is required):
$ sed 's/  */,/;s/  */,/' input
(Note that you can put two sed commands in one expression if you separate them by a semicolon. Some sed implementations do not accept the semicolon after the s/// command; use a newline to separate the commands, in this case.)
A Perl solution is:
perl -pe '$_=join ",", split /\s+/, $_, 3' some.file
Not sure about sed/perl, but here's an (ugly) awk solution. It just prints fields 1-2, separated by commas, then the remaining fields separated by space:
awk '{
    printf("%s,", $1)
    printf("%s,", $2)
    for (i=3; i<=NF; i++)
        printf("%s ", $i)
    printf("\n")
}' myfile.txt
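For completeness, this can also be sketched without sed, perl or awk, using only the bash read builtin: read splits on whitespace and puts everything left over into the last variable. first_two_to_commas is a hypothetical name. The sketch assumes every line has at least three fields and that the input ends with a newline:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read lines on stdin and join the first two
# whitespace-separated fields and the untouched remainder with commas.
first_two_to_commas() {
    local first second rest
    while read -r first second rest; do
        # rest keeps its internal spacing; only the first two gaps become commas
        printf '%s,%s,%s\n' "$first" "$second" "$rest"
    done
}
```

Usage: first_two_to_commas < test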