I have a string consisting of words, special characters (*, |, ( etc.) and floating-point numbers. I want to remove whitespace only between words and special characters; spaces between numbers should not be removed. How can I do this in Perl?
E.g.:
Rama 1 * 2.34 * ( L - 0.45 ) XYZ 10 20.05 30.06 40 P > 25.
It should be after conversion:
Rama1*2.34*(L-0.45)XYZ 10 20.05 30.06 40 P>25.
(?<!\d)\h+|\h+(?!\d)
You can use lookarounds here. See demo.
https://regex101.com/r/uF4oY4/62
You may use the below lookaround based regex.
perl -pe 's/\s+(?=\D)|(?<=\D)\s+//g' file
Example:
$ echo 'Rama 1 * 2.34 * ( L - 0.45 ) XYZ 10 20.05 30.06 40 P > 25.' | perl -pe 's/\s+(?=\D)|(?<=\D)\s+//g'
Rama1*2.34*(L-0.45)XYZ10 20.05 30.06 40P>25.
or
$ echo 'Rama 1 * 2.34 * ( L - 0.45 ) XYZ 10 20.05 30.06 40 P > 25.' | perl -pe 's/(?<=[^\s\w])\s+|\s+(?=[^\w\s])//g'
Rama 1*2.34*(L-0.45)XYZ 10 20.05 30.06 40 P>25.
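For comparison, here is a minimal Python sketch of the two lookaround substitutions above. The function names are made up for illustration; the patterns are the ones from the Perl one-liners, and Python's re module accepts them unchanged because both lookbehinds are fixed-width.

```python
import re

# Variant 1: remove whitespace that has a non-digit on at least one side.
NOT_BETWEEN_DIGITS = re.compile(r"\s+(?=\D)|(?<=\D)\s+")

# Variant 2: remove whitespace only when it touches a special character
# (anything that is neither a word character nor whitespace).
NEXT_TO_SPECIAL = re.compile(r"(?<=[^\s\w])\s+|\s+(?=[^\w\s])")


def strip_not_between_digits(text: str) -> str:
    """Hypothetical helper mirroring the first Perl one-liner."""
    return NOT_BETWEEN_DIGITS.sub("", text)


def strip_next_to_special(text: str) -> str:
    """Hypothetical helper mirroring the second Perl one-liner."""
    return NEXT_TO_SPECIAL.sub("", text)


sample = "Rama 1 * 2.34 * ( L - 0.45 ) XYZ 10 20.05 30.06 40 P > 25."
print(strip_not_between_digits(sample))  # Rama1*2.34*(L-0.45)XYZ10 20.05 30.06 40P>25.
print(strip_next_to_special(sample))     # Rama 1*2.34*(L-0.45)XYZ 10 20.05 30.06 40 P>25.
```

Neither variant reproduces the requested output exactly: the first also joins "XYZ10" and "40P", while the second keeps "Rama 1" apart.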
Related
I am trying to write a regex that satisfies the following for a number with at most 10 digits in total.
Tried this so far :
^(\d){0,8}(\.){0,1}(\d){0,2}$
It works fine but fails if I give the following :
123456789.0
Valid example:
1234567890 (total 10 digits)
1234567.1 (total 8 digits)
12345678.10 (total 10 digits)
123456789.1 (total 10 digits)
Invalid example :
12345678901 (11 characters)
Here is a way to go:
^(?:\d{1,10}|(?=\d+\.\d\d?$)[\d.]{3,11})$
Explanation:
^ : beginning of string
(?: : start non capture group
\d{1,10} : 1 up to 10 digits
| : OR
(?= : start look ahead
\d+\.\d\d?$ : 1 or more digits then a dot then 1 or 2 digits
) : end lookahead
[\d.]{3,11} : only digits or dots are allowed, with a length from 3 up to 11
) : end group
$ : end of string
In action:
#!/usr/bin/perl
use Modern::Perl;
my $re = qr~^(?:\d{1,10}|(?=\d+\.\d\d?$)[\d.]{3,11})$~;
while(<DATA>) {
chomp;
say (/$re/ ? "OK: $_" : "KO: $_");
}
__DATA__
1
123
1.2
1234567890
1234567.1
12345678.10
123456789.1
12345678901
1.2.3
Output:
OK: 1
OK: 123
OK: 1.2
OK: 1234567890
OK: 1234567.1
OK: 12345678.10
OK: 123456789.1
KO: 12345678901
KO: 1.2.3
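The same pattern can be tried out in Python. This is only a sketch: the helper name is_valid is an assumption, and re.ASCII is added so that \d is limited to 0-9 (Python's \d would otherwise also accept other Unicode digits).

```python
import re

# Same pattern as in the Perl script above.
# Caveat: in Python, $ also matches just before a trailing newline.
TEN_DIGITS = re.compile(r"^(?:\d{1,10}|(?=\d+\.\d\d?$)[\d.]{3,11})$", re.ASCII)


def is_valid(text: str) -> bool:
    """Hypothetical helper: True if text has at most 10 digits in total,
    optionally with a dot followed by 1 or 2 fraction digits."""
    return TEN_DIGITS.match(text) is not None


for candidate in ["1234567890", "123456789.0", "12345678.10",
                  "12345678901", "123456789.12", "1.2.3"]:
    print(("OK: " if is_valid(candidate) else "KO: ") + candidate)
```

The lookahead enforces the shape (digits, one dot, one or two digits), while [\d.]{3,11} caps the length at 10 digits plus the dot.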
The solution using String.prototype.match() and RegExp.prototype.test() functions:
var isValid = function (num) {
return /^\d+(\.\d+)?$/.test(num) && String(num).match(/\d/g).length <= 10;
};
console.log(isValid(1234567890));
console.log(isValid(12345678.10));
console.log(isValid(12345678901));
console.log(isValid('123d3457'));
You can break your pattern into two steps:
First step
You need 8 digits, followed by an optional dot and up to 2 optional fraction digits
\d{8}\.?\d?\d? Here the . and both digits are optional
Second step
You need 9 digits, followed by an optional dot and at most 1 fraction digit
\d{9}\.?\d? Here the . and the digit are optional
Then you can combine these two rules with the alternation operator |
^(\d{8}\.?\d?\d?|\d{9}\.?\d?)$
Okay, now this regex only matches numbers with 8 to 10 digits in total, of which at most 2 are fraction digits.
It never matches fewer than 8 digits. The trick is to change \d{8} in the first step to \d{1,8}; then it matches everything from 1 to 9999999999, plus 1 or 2 optional fraction digits.
What you want:
^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$
echo 1 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
1
echo 9999999999 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
9999999999
echo 1.1 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
1.1
echo 1.12 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
1.12
echo 1234567.1 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
1234567.1
echo 1234567.12 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
1234567.12
echo 99999999.9 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
99999999.9
echo 99999999.99 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
99999999.99
No match:
echo 1.111 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
echo 1234567.111 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
echo 123456781.11 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
echo 1234567891.1 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
echo 123456789101 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
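A quick way to re-run these checks in one place is a small Python sketch like the one below (the name matches is an assumption). It also shows one edge case the shell tests do not cover: because the dot and the fraction digits are independently optional, a trailing dot with no fraction digits is accepted.

```python
import re

# The final pattern from this answer, unchanged.
PATTERN = re.compile(r"^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$")


def matches(text: str) -> bool:
    """Hypothetical helper: True if the whole string fits the pattern."""
    return PATTERN.match(text) is not None


accepted = ["1", "9999999999", "1.1", "1.12", "1234567.12", "99999999.99"]
rejected = ["1.111", "1234567.111", "123456781.11", "1234567891.1", "123456789101"]

print(all(matches(s) for s in accepted))      # True
print(not any(matches(s) for s in rejected))  # True
print(matches("12345678."))                   # True: a bare trailing dot also matches
```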
I have this expression:
XX h, YY min, ZZ s,
XX, YY and ZZ can each be 1 or 2 digits. Also, "XX h," or "XX h, YY min," may be absent. Can anyone recommend a perl or sed expression to extract XX, YY and ZZ?
I've tried some regexps with capture groups, with no luck.
thanks
EDIT:
example1: 12 h, 23 min, 2 s,
output1: 12 23 2
example2: 3 min, 59 s,
output2: 3 59
echo "12 h, 3 min, 56 s," | tr -cd "0-9 "
Output:
12 3 56
echo "12 h, 3 min, 56 s," | tr "," "\n" | awk '/h/ {print $1}'
12
echo "12 h, 3 min, 56 s," | tr "," "\n" | awk '/min/ {print $1}'
3
echo "12 h, 3 min, 56 s," | tr "," "\n" | awk '/s/ {print $1}'
56
Let's talk about Perl regex. Let's assume you need to be able to extract the following substrings:
12 h, 54 min, 11 s, # you have a trailing comma in your example
1 h, 54 min, 11 s,
54 min, 11 s,
4 min, 11 s,
55 s,
and so on. We will need some building blocks:
\d: any digit
?: when appended to something (a character, a meta-character like \d or a group in brackets), make it optional
( ): brackets for grouping and extracting values into $1, $2, etc.
(?: ): brackets for grouping without extracting
The seconds part will be \d\d? s,.
After adding minutes that can be optional, we'll get (?:\d\d? min, )?\d\d? s,.
After adding hours (also optional), we'll get (?:(?:\d\d? h, )?\d\d? min, )?\d\d? s,. Note that the space after "h," belongs inside the optional hours group; otherwise a string that starts directly with the minutes would not match the minutes part.
Now we'll put brackets around all this stuff to capture the match into $1, and we'll finally get a regex:
/((?:(?:\d\d? h, )?\d\d? min, )?\d\d? s,)/
Oh, and is the trailing comma also optional? Just add ? after it.
If you need the values for h, min, and s, put each \d\d? into a pair of brackets and check $2, $3 and $4:
/((?:(?:(\d\d?) h, )?(\d\d?) min, )?(\d\d?) s,)/
This is not the easiest possible regex for this task but I just wanted to show how you can build them starting from something very simple and then adding more complex things to it.
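The step-by-step construction above carries over to other regex engines. Below is a minimal Python sketch, assuming the space after "h," sits inside the optional hours group so that input starting with the minutes still fills the minutes group. The names DURATION and parse_duration are made up for illustration.

```python
import re

# Optional hours, nested inside optional minutes, followed by mandatory seconds.
# Groups 1, 2 and 3 hold h, min and s; absent parts come back as None.
DURATION = re.compile(r"(?:(?:(\d\d?) h, )?(\d\d?) min, )?(\d\d?) s,")


def parse_duration(text: str):
    """Hypothetical helper: return (h, min, s) as strings or None, or None if no match."""
    m = DURATION.search(text)
    return m.groups() if m else None


print(parse_duration("12 h, 23 min, 2 s,"))  # ('12', '23', '2')
print(parse_duration("3 min, 59 s,"))        # (None, '3', '59')
print(parse_duration("55 s,"))               # (None, None, '55')
```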
Try this (Perl):
my @matches = "1 h, 30 min, 15 s" =~ /(\d{1,2}) [hms]/g;
Or a bit stricter
my @matches = "1 h, 30 min, 15 s" =~ /(\d{1,2}) (?:h|min|s)/g;
if(scalar @matches == 3) {
my ($h, $mi, $s) = @matches;
print "$h : $mi : $s\n";
}
I'm trying to parse a DNA protein file and extract only certain information. I want to keep a line only if it starts with "ATOM" and has G, A, T or C at the end of the fourth column. For example, in the snippet below the DG lines would be kept because DG ends in G. The matching lines should then be saved to a file. I am using bash. What would you use to do this? grep, find, sed, awk or some kind of regular expression?
Thanks for any help!
HETATM 103 HG22 MVA A 8 4.999 -1.260 2.090 1.00 0.00 H
HETATM 104 HG23 MVA A 8 5.639 -2.810 2.604 1.00 0.00 H
TER 105 MVA A 8
ATOM 106 O5' DG C 11 -12.710 1.571 -11.945 1.00 0.00 O
ATOM 107 C5' DG C 11 -13.491 2.438 -11.111 1.00 0.00 C
Additional to the original problem:
Count the total number of lines and the individual G, A, T, C occurrences? Output the counts to a file as Total Lines, TOTAL G, TOTAL T, TOTAL A, TOTAL C.
awk '/^ATOM/&&$4~/[GATC]$/' input > output
Here is an old-fashioned bash way:
while read -ra fld; do
[[ ${fld[0]} == "ATOM" ]] && [[ ${fld[3]} =~ [GATC]$ ]] && echo "${fld[@]}"
done < dnafile.old > dnafile.new
I hope I get the chance to answer this, because the OP raised a question on Kent's answer. Here is the question:
If you notice, on line 3 of the example the 3rd column is blank. Will this matter? It shouldn't in this case because it's not an ATOM line, but what if it was?
So here is the fix (assuming the format and the column positions do not change):
awk '/^ATOM/&&substr($0,20,1)~/[GATC]/' file
Test result:
$ cat file
HETATM 103 HG22 MVA A 8 4.999 -1.260 2.090 1.00 0.00 H
HETATM 104 HG23 MVA A 8 5.639 -2.810 2.604 1.00 0.00 H
ATOM 105 MVA X 8
ATOM 106 O5' DG C 11 -12.710 1.571 -11.945 1.00 0.00 O
ATOM 107 C5' DG C 11 -13.491 2.438 -11.111 1.00 0.00 C
$ awk '/^ATOM/&&substr($0,20,1)~/[GATC]/' file
ATOM 105 MVA X 8
ATOM 106 O5' DG C 11 -12.710 1.571 -11.945 1.00 0.00 O
ATOM 107 C5' DG C 11 -13.491 2.438 -11.111 1.00 0.00 C
Edit for new request.
awk '/^ATOM/&&substr($0,20,1)~/[GATC]/{print;l++;a[substr($0,20,1)]++}END{printf "total line : %s\n",l;for (i in a) printf "%s : %s \n",i,a[i]}' file
ATOM 105 MVA A 8
ATOM 106 O5' DG C 11 -12.710 1.571 -11.945 1.00 0.00 O
ATOM 107 C5' DG C 11 -13.491 2.438 -11.111 1.00 0.00 C
total line : 3
A : 1
G : 2
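The filter-and-count logic can also be sketched in Python. This version follows the whitespace-split rule from the first awk answer (fourth field ends in G, A, T or C), so it shares that answer's weakness with empty columns; the fixed-position variant would slice the line instead. The function name filter_atoms is an assumption.

```python
from collections import Counter


def filter_atoms(lines):
    """Hypothetical helper: keep ATOM lines whose 4th whitespace-separated
    field ends in G, A, T or C, and count those final letters."""
    kept = []
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "ATOM" and fields[3][-1] in "GATC":
            kept.append(line)
            counts[fields[3][-1]] += 1
    return kept, counts


sample = [
    "HETATM 103 HG22 MVA A 8 4.999 -1.260 2.090 1.00 0.00 H",
    "TER 105 MVA A 8",
    "ATOM 106 O5' DG C 11 -12.710 1.571 -11.945 1.00 0.00 O",
    "ATOM 107 C5' DG C 11 -13.491 2.438 -11.111 1.00 0.00 C",
]
kept, counts = filter_atoms(sample)
print(f"total line : {len(kept)}")  # total line : 2
for base in "GATC":
    print(f"{base} : {counts[base]}")
```

Counter returns 0 for letters that never occurred, so all four totals are printed even when some bases are missing.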
Huh... after Kent's excellent awk solution I'm hesitant to write a long regexp :) :)
grep -P 'ATOM\s+\S+\s+\S+\s*\S*[GATC]\s+' dnafile
This needs a grep with -P (Perl regexes).
Without Perl regexes, the standard regex is much longer:
grep 'ATOM *[^ ][^ ]* *[^ ][^ ]* *[^ ][^ ]* *[^ ]*[GATC] *' dnafile
This might work for you (GNU sed):
sed -nr '/^ATOM.{15}[GATC]/w newfile' oldfile
Since columns may be empty, the match must be made on position in the line.
I'm trying to extract the specific lines from a trace file like below:
- 0.118224 0 7 ack 40 ------- 1 2.0 7.0 0 2
r 0.118436 1 2 tcp 40 ------- 2 7.1 2.1 0 1
+ 0.118436 1 2 ack 40 ------- 2 3.1 2.1 0 3
- 0.118436 1 2 ack 40 ------- 2 4.1 2.1 0 3
r 0.120256 0 7 ack 40 ------- 1 2.0 7.0 0 2
I want to extract any line that has the following form:
r x.xxxxx 1 2 xxx xx ------- x numbers.x 2.x x x.
Note: x means any value and numbers could be between 3-to-7.
Here is my attempt, which is not working:
if {[regexp \r+ ([0-9.]+) 1 2.*- ([3-7.]+) 2.*- ([0-9.]+) $line -> time]}
Any suggestions?
Here's another approach: extract the fields you want to use for comparison
while {[gets $f line] != -1} {
lassign [split $line] a - b c - - - - d e - -
if {
$a eq "r" &&
$b == 1 &&
$c == 2 &&
3 <= floor($d) && floor($d) <= 7 &&
floor($e) == 2
} {
puts $line
}
}
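The same field-based idea, sketched in Python for readers who want to test the conditions outside Tcl. The function name wanted and the variable names are assumptions; the field positions mirror the lassign call above (fields 1, 3, 4, 9 and 10, counting from 1).

```python
import math


def wanted(line: str) -> bool:
    """Hypothetical helper: True for lines of the form
    r <time> 1 2 ... <3..7>.x 2.x ..."""
    fields = line.split()
    if len(fields) < 10:
        return False
    try:
        ninth, tenth = float(fields[8]), float(fields[9])
    except ValueError:
        return False  # the 9th or 10th field is not numeric
    return (
        fields[0] == "r"
        and fields[2] == "1"
        and fields[3] == "2"
        and 3 <= math.floor(ninth) <= 7
        and math.floor(tenth) == 2
    )


trace = [
    "- 0.118224 0 7 ack 40 ------- 1 2.0 7.0 0 2",
    "r 0.118436 1 2 tcp 40 ------- 2 7.1 2.1 0 1",
    "+ 0.118436 1 2 ack 40 ------- 2 3.1 2.1 0 3",
    "r 0.120256 0 7 ack 40 ------- 1 2.0 7.0 0 2",
]
for line in trace:
    if wanted(line):
        print(line)  # prints only the "r 0.118436 1 2 tcp ..." line
```

Comparing parsed fields avoids having to get every separator and repetition count right in one long regular expression.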
You have to escape the . with a \, because an unescaped . means "any character" in a regexp.
So your regexp could look like:
if {[regexp {r \d\.\d{5} 1 2 \d{3} \d{2} ------- \d [3-7]\.\d 2\.\d \d \d} $line -> time ]} {
# ...
}
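Now you have to place () around the part you want to capture. Note that this pattern is a literal translation of your description: \d matches digits only, so for the sample lines you would need \w{3} for the tcp/ack field and \d{6} (or \d+) for the six-digit fraction of the timestamp.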
Btw: I used the following transformation on your description of what you want to match:
set input {r x.xxxxx 1 2 xxx xx ------- x numbers.x 2.x x x}
set re [subst [regsub -all {x{2,}} $input {\\\\d{[string length \0]}}]]
set re [string map {. {\.} x {\d} numbers {[3-7]}} $re]
I have a line such as this:
andy_1972 * andy#ip.address 0 0 0 0 0 0 119075 224 1342751704 1348550270
I want the end result to be the bolded characters, like this:
andy_1972 119075
I am trying to just trim the line down to the word and the 4th number from the end of the line.
How can I do this using regex? I'm using Notepad++
This will match the first word and the fourth-from-last number:
^(\w+).* (\d+) \d+ \d+ \d+$
In perl-compatible (perl or PCRE) that would be
$string = "andy_1972 * andy#ip.address 0 0 0 0 0 0 119075 224 1342751704 1348550270";
$string =~ /^(\w+).* (\d+) \d+ \d+ \d+$/;
print "$1 $2\n";
Using cut:
echo andy_1972 \* andy#ip.address 0 0 0 0 0 0 119075 224 1342751704 1348550270 |
cut -d' ' -f1,10
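To check the regex against the sample line outside Notepad++, a short Python sketch could look like this. The function name is made up for illustration, and the pattern is the one given in the first answer.

```python
import re

# First word, then (greedily) anything, then the fourth-from-last number.
LINE = re.compile(r"^(\w+).* (\d+) \d+ \d+ \d+$")


def first_word_and_fourth_last(line: str):
    """Hypothetical helper: return 'word number', or None if the line does not match."""
    m = LINE.match(line)
    return f"{m.group(1)} {m.group(2)}" if m else None


sample = "andy_1972 * andy#ip.address 0 0 0 0 0 0 119075 224 1342751704 1348550270"
print(first_word_and_fourth_last(sample))  # andy_1972 119075
```

Unlike the cut approach, which relies on the number always being the tenth field, the regex counts from the end of the line, so it keeps working if the number of leading fields varies.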