Looking for regexp to keep some strings - regex

This is the string I'm trying to match with a regexp:
15C (59F) ambient, 22C (71F), 20C (68F), 26C (78F), 21C (69F), 27C (80F), 30C (86F), 33C (91F)
I would like to keep only the temperature values in degrees Celsius, without the letter C, and delete everything else.
Can someone help me do this?
Thanks in advance !

With GNU grep:
grep -oP '[0-9]+(?=C)' file
Output:
15
22
20
26
21
27
30
33
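For readers without GNU grep's -P option, the same lookahead idea can be sketched in Python. This is only an illustration: the function name celsius_values is made up here, and the sample string is the one from the question.

```python
import re

def celsius_values(text):
    # One or more digits, kept only when the next character is "C".
    # The lookahead (?=C) checks for the letter without including it in the match.
    return re.findall(r"[0-9]+(?=C)", text)

line = ("15C (59F) ambient, 22C (71F), 20C (68F), 26C (78F), "
        "21C (69F), 27C (80F), 30C (86F), 33C (91F)")
print("\n".join(celsius_values(line)))
```

The Fahrenheit values are skipped because their digits are followed by F, not C.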

How do i fix regex matching few unexpected characters?

I am using a regex whose first preference is to match the token (numeric or alphanumeric) immediately following the string "Lecture"; if "Lecture" is absent, it should match the last token on the line.
Current regex
cat 1.txt | perl -ne 'print "$& \n" while /Lecture\h*\K\w+|^(?!.*Lecture).*\h\K[^.\s]+/g;/^.*?-(.*)/g' | perl -ne 'print "$& \n" while /(\d+\w*)/g'
The data is not very consistent. There can be spaces or hyphens around the string "Lecture" or around the final token, and a line may not end in .mp4.
My current regex works almost everywhere; it only has issues with the three lines just above the last one. I could have included only those lines here, but I don't want the solution to break the other cases, so all possibilities are included below.
cat 1.txt
54282068 Lecture74- AS 29 Question.mp4
174424104Lecture 74B - AS 29 Theory.mp4
Branch Accounts Lecture 105
Lecture05 - Practicals AS 28
Submissions 20.mp4
HW Section 77N
Residential status HWS Q.1 to 6 -60A
Residential status HWS Q.7 to 20 -60B
House property all HWS-60C
Salary HWS Q.11 to 13 - 60F
Salary HWS Q.1 to 5-60D
Salary HWS Q.6 to 10-60E
Salary HWS Q.14 to 20-60G
Operating Costing 351
Expected Output
74
74B
105
05
20
77N
60A
60B
60C
60F
60D
60E
60G
351
Exact issue: for the three lines just above the last one, it prints 5, 10 and 20 in addition to the final tokens 60D, 60E and 60G.
I believe there is an issue somewhere in the last part of my regex that needs only a very small edit to fix. Hopefully someone can help me.
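Please check whether the following code meets your requirements:
use strict;
use warnings;
use feature 'say';
while( <DATA> ) {
    chomp;
    s/\.mp4$//;    # drop a trailing .mp4 extension, if present
    # Prefer the token right after "Lecture"; otherwise take the trailing number
    # with an optional letter. \d+ (not \d{2}) keeps three-digit values such as 351 whole.
    say $1 if /Lecture\s*(\w+)/ or /(\d+[A-Z]?)\Z/;
}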
__DATA__
54282068 Lecture74- AS 29 Question.mp4
174424104Lecture 74B - AS 29 Theory.mp4
Branch Accounts Lecture 105
Lecture05 - Practicals AS 28
Submissions 20.mp4
HW Section 77N
Residential status HWS Q.1 to 6 -60A
Residential status HWS Q.7 to 20 -60B
House property all HWS-60C
Salary HWS Q.11 to 13 - 60F
Salary HWS Q.1 to 5-60D
Salary HWS Q.6 to 10-60E
Salary HWS Q.14 to 20-60G
Output
74
74B
105
05
20
77N
60A
60B
60C
60F
60D
60E
60G
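The same two-step idea (prefer the token after "Lecture", otherwise fall back to the trailing number) can be sketched in Python. This is a rough equivalent, not the answerer's code: the function name lecture_id is hypothetical, and the fallback uses \d+ as an assumption so that the three-digit 351 from the question's expected output is kept whole.

```python
import re

def lecture_id(name):
    # Drop a trailing ".mp4" so the fallback pattern can anchor at the end.
    name = re.sub(r"\.mp4$", "", name.strip())
    # First choice: the word characters right after "Lecture".
    # Fallback: digits plus an optional capital letter at the very end.
    m = re.search(r"Lecture\s*(\w+)", name) or re.search(r"(\d+[A-Z]?)$", name)
    return m.group(1) if m else None

samples = [
    "54282068 Lecture74- AS 29 Question.mp4",
    "Submissions 20.mp4",
    "Salary HWS Q.1 to 5-60D",
    "Operating Costing 351",
]
for s in samples:
    print(lecture_id(s))
```

Because the fallback is anchored at the end of the line, the stray 5, 10 and 20 from "Q.1 to 5", "Q.6 to 10" and "Q.14 to 20" are never reported.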

Removing Leading 0 and applying Regex to Sed

I have several file names, for ease I've put them in a file as follows:
01.action1.txt
04action2.txt
12.action6.txt
2.action3.txt
020.action9.txt
10action4.txt
15action7.txt
021action10.txt
11.action5.txt
18.action8.txt
As you can see, the formats aren't consistent. What I'm trying to do is extract the leading numbers from these file names: 1, 4, 12, 2, 20, etc.
I have the following regex
(\.)?action\d{1,}.txt
It successfully matches .action[number].txt, but I also need to match the leading 0 and use the whole thing in a sed substitution that replaces the match with nothing, so I'm left with only the leading numbers. I'm having trouble matching the leading 0 and applying the whole thing in sed.
Thanks
With GNU sed:
sed -r 's/0*([0-9]*).*/\1/' file
Output:
1
4
12
2
20
10
15
21
11
18
See: The Stack Overflow Regular Expressions FAQ
I don't know whether the awk below is helpful, but it works as well:
awk '{print $1 + 0}' file
1
4
12
2
20
10
15
21
11
18
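For comparison, here is a small Python sketch of the same extraction. The function name leading_number is made up for this example; converting the matched digits with int() drops the leading zeros, much like the "+ 0" in the awk version.

```python
import re

def leading_number(filename):
    # Match the run of digits at the very start of the name.
    m = re.match(r"[0-9]+", filename)
    # int() discards leading zeros: "020" becomes 20.
    return int(m.group(0)) if m else None

names = [
    "01.action1.txt", "04action2.txt", "12.action6.txt", "2.action3.txt",
    "020.action9.txt", "10action4.txt", "15action7.txt", "021action10.txt",
    "11.action5.txt", "18.action8.txt",
]
for n in names:
    print(leading_number(n))
```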

Bash script to split a file by grep everything till the second time match in a column into one file and the rest into another

I am trying to split a file with data like
2 0.2345
58 0.3608
59 0.3504
60 0.4175
65 0.3995
66 0.3972
67 0.4411
411 0.3455
2 1.3867
3 1.4532
4 1.2925
5 1.2473
6 1.2605
7 1.2463
8 1.1667
9 1.1312
10 1.1502
11 1.1190
12 1.0346
13 1.0291
409 0.8025
410 0.8695
411 0.9154
For this kind of data, I am trying to split this into two files:
File 1 : 2 -411 (first Column match)
File 2 : 2-411 (second occurrence in the first column)
For this, I wrote these two one liners:
awk '1;/411/{exit}' $1 > File1_$1 ;
awk '/411/,0' $1 | awk '{if (NR!=1) {print}}' > File2_$1
The problem is that if "411" matches in the second column (as in "67 0.4411"), my script cuts prematurely at that line.
I am unable to restrict the match to the first column; 411 can occur any number of times in the second column, and those occurrences are not of interest.
Any help would be greatly appreciated.
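One idea is to use this combination of commands:
awk '{ if ($1 >= 2 && $1 <= 411) print $0 }{if ($1=="411") exit}' input > f1
then
grep -vxFf f1 input > f2
Here -F treats each line of f1 as a fixed string rather than a regex and -x requires a whole-line match, so f2 receives only the lines that are not in f1. If your input file contains more blocks, repeat the second step.
I don't know much about Bash, but for the regex I think you should anchor the match to the beginning of the line, like this: ^411[[:space:]].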
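The key point, comparing only the first column against 411, can also be sketched in Python. This is a minimal illustration under the assumption that the file is whitespace-separated and should be cut right after the first line whose first field is exactly 411; the function name split_at_first_key is hypothetical.

```python
def split_at_first_key(lines, key="411"):
    # Compare only the first whitespace-separated field, so a value
    # such as 0.4411 in the second column can never trigger the split.
    for i, line in enumerate(lines):
        fields = line.split()
        if fields and fields[0] == key:
            return lines[: i + 1], lines[i + 1 :]
    # No line starts with the key: everything stays in the first part.
    return lines, []

data = [
    "2 0.2345", "67 0.4411", "411 0.3455",
    "2 1.3867", "410 0.8695", "411 0.9154",
]
first, second = split_at_first_key(data)
print(first)
print(second)
```

Writing first and second to File1 and File2 is then a matter of joining each list with newlines.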

How can I extract Twitter #handles from a text with RegEx?

I'm looking for an easy way to create lists of Twitter #handles based on SocialBakers data (copy/paste into TextMate).
I've tried using the following RegEx, which I found here on StackOverflow, but unfortunately it doesn't work the way I want it to:
^(?!.*#([\w+])).*$
While the expression above deletes all lines without #handles, I'd like the RegEx to delete everything before and after the #handle as well as lines without #handles.
Example:
1
katyperry KATY PERRY (#katyperry)
Followings 158
Followers 82 085 596
Rating
5
Worst012345678910Best
2
justinbieber Justin Bieber (#justinbieber)
254 399
74 748 878
2
Worst012345678910Best
3
taylorswift13 Taylor Swift (#taylorswift13)
245
70 529 992
Desired result:
#katyperry
#justinbieber
#taylorswift13
Thanks in advance for any help!
Something like this:
cat file | perl -ne 'while(s/(#[a-z0-9_]+)//gi) { print $1,"\n"}'
This also works if a line contains multiple #handles.
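A comparable sketch in Python, assuming a handle is a # followed by word characters. The function name extract_handles is made up for this example, and the sample text is shortened from the question.

```python
import re

def extract_handles(text):
    # findall returns every non-overlapping match, so several
    # handles on one line are all picked up.
    return re.findall(r"#\w+", text)

sample = """1
katyperry KATY PERRY (#katyperry)
Followings 158
2
justinbieber Justin Bieber (#justinbieber)
254 399
3
taylorswift13 Taylor Swift (#taylorswift13)
"""
print("\n".join(extract_handles(sample)))
```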
A Twitter handle regex is #\w+. So, to remove everything else, you need to match and capture the pattern and use a backreference to this capture group, and then just match any character:
(#\w+)|.
Use DOTALL mode to also match newline symbols. Replace with $1 (or \1, depending on the tool you are using).
Straight regex, tested in Caret:
#.*[^)]
The above matches from the # to the end of the line, excluding a trailing close parenthesis.
#.*\b
This does the same thing in the Caret text editor.
How to do this with awk and sed:
Get the usernames as well:
$ awk '/#.*/ {print}' test
katyperry KATY PERRY (#katyperry)
justinbieber Justin Bieber (#justinbieber)
taylorswift13 Taylor Swift (#taylorswift13)
Just the Handle:
$ awk -F "(" '/#.*/ {print$2}' test | sed 's/)//g'
#katyperry
#justinbieber
#taylorswift13
A look at the test file:
$ cat test
1
katyperry KATY PERRY (#katyperry)
Followings 158
Followers 82 085 596
Rating
5
Worst012345678910Best
2
justinbieber Justin Bieber (#justinbieber)
254 399
74 748 878
2
Worst012345678910Best
3
taylorswift13 Taylor Swift (#taylorswift13)
245
70 529 992
Bash Version:
$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Copyright (C) 2007 Free Software Foundation, Inc.

grep -c value NH:i:1 only for every line in file, not also NH:i:12

cat samtry.txt | grep -c NH:i:1
Here is an example of three lines. The part marked with ** is what's important:
HWI-ST697:178:D1U9CACXX:1:2111:12787:5687 153 scaffold_1 33005 50 101M * 0 0 GACTAAGGAAGTCATCTGCAGTGCCCCTTGCACTTCCTAATGGGACTTTCCCTGGTTGACTATTCTTACTATGAGAACAATGAGCACCAGCTTCATTCACA DCDDDDDDDDDDDEEEEEEEEFGHGJIHGHFHJIJIJJIJJJJIHJJIJIIIFJJIGGGIJJJIIJJHIGJIJJJGHJJIJIJIGFJJGHHHHFFFFFCCC AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T26G55YT:Z:UU **NH:i:1**
HWI-ST697:178:D1U9CACXX:3:1310:18383:72540 89 scaffold_1 33005 50 101M * 0 0 GACTAAGGAAGTCATCTGCAGTGCCCCTTGCACTTCCTAATGGGACTTTCCCTGGTTGACTATTCTTACTATGAGAACAATGAGCACCAGCTTCATTCACA DDDDDDDDDDDDDEEEEEEFFFHHHIIJJIIIJIJJJJJJJJJJHJJJJJJJJJJJJJIJJJJJJJJIJJJIJJIJJJJJJJJIHFJJHHHHHFFFFFCCC AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T26G55YT:Z:UU **NH:i:11**
HWI-ST697:178:D1U9CACXX:7:1212:17559:76798 89 scaffold_1 33007 50 101M * 0 0 CTAAGGAAGTCATCTGCAGTGCCCCTTGCACTTCCTAATGGGACTTTCCCTGGTTGACTATTCTTACTATGAGAACAATGAGCACCAGCTTCATTCACAAG DDDDDDDDDDDDDEEEECDFFHGHIGJIIHJJJIIJJJJJJHHJJJJJJJJJJJIIIJJJJGIIGBJJIJJJJIJJJJJIHHHFJJIJHHHHGFFFFFCCC AS:i:-11 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:16T26G57YT:Z:UU **NH:i:1**
I am trying to use a shell script to count all the lines in a tab-delimited file (test file: samtry.txt, containing 10 lines to test on) that contain the following regular expression: NH:i:1
The problem is that I get the information I want, but it also counts lines containing NH:i:1x (where x is any digit 0-9).
The NH:i:x field (x = any number up to around 50) is at position 20 on every line of the file; it is not the last position on the line. Every line has 23 'positions'.
Does anyone know how to do this with grep or another tool?
I have around 100 files of roughly 3 GB each, and I don't know how to solve this problem.
I hope I have given enough information. I am grateful for every answer.
Try grep with word boundaries:
grep -c '\<NH:i:1\>' samtry.txt
Or use grep -w:
grep -wc 'NH:i:1' samtry.txt
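The word-boundary idea carries over to other tools as well. Below is a minimal Python sketch; the function name count_matching_lines is hypothetical. It accepts any iterable of lines, so an open file object can be passed in and read line by line instead of loading a 3 GB file into memory.

```python
import re

# \b after the 1 requires that no further digit follows,
# so NH:i:11 and NH:i:12 are not counted.
PATTERN = re.compile(r"\bNH:i:1\b")

def count_matching_lines(lines):
    # Count lines, not occurrences, like grep -c.
    return sum(1 for line in lines if PATTERN.search(line))

example = [
    "read1\tNM:i:2\tNH:i:1\tXS:A:+",
    "read2\tNM:i:2\tNH:i:11\tXS:A:+",
    "read3\tNM:i:2\tNH:i:1",
    "read4\tNM:i:2\tNH:i:12\tXS:A:+",
]
print(count_matching_lines(example))
```

The read names and extra fields in example are invented placeholders, not real SAM records.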