I used the regex given in perlfaq6 to match and remove JavaScript comments, but it causes a segmentation fault when the string is too long. The regex is:
s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;
Can it be improved to avoid the segmentation fault?
[EDIT]
Long input:
<ent r=\"6\" t=\"259\" w=\"252\" /><ent r=\"6\" t=\"257\" w=\"219\" />
repeated about 1000 times.
I suspect the trouble is partly that your 'C code' isn't very much like C code. In C, you can't have the sequence \" outside a pair of quotes, single or double, for example.
I adapted the regex to make it readable and wrapped into a trivial script that slurps its input and applies the regex to it:
#!/usr/bin/env perl
### Original regex from PerlFAQ6.
### s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;
undef $/; # Slurp input
while (<>)
{
print "raw: $_";
s%
/\*[^*]*\*+([^/*][^*]*\*+)*/ # Simple C comments
| //([^\\]|[^\n][\n]?)*?\n # C++ comments, allowing for backslash-newline continuation
| (
"(\\.|[^"\\])*" # Double-quoted strings
| '(\\.|[^'\\])*' # Single-quoted characters
| .[^/"'\\]* # Anything else
)
% defined $3 ? $3 : ""
%egsx;
print "out: $_";
}
I took your line of non-C code, and created files data.1, data.2, data.4, data.8, ..., data.1024 with the appropriate number of lines in each. I then ran a timing loop.
$ for x in 1 2 4 8 16 32 64 128 256 512 1024
> do
> echo
> echo $x
> time perl xx.pl data.$x > /dev/null
> done
$
I've munged the output to give just the real time for the different file sizes:
1 0m0.022s
2 0m0.005s
4 0m0.007s
8 0m0.013s
16 0m0.035s
32 0m0.130s
64 0m0.523s
128 0m2.035s
256 0m6.756s
512 0m28.062s
1024 1m36.134s
I did not get a core dump (Perl 5.16.0 on Mac OS X 10.7.4; 8 GiB main memory). It does begin to take a significant amount of time. While it was running, it was not growing; during the 1024-line run, it was using about 13 MiB of 'real' memory and 23 MiB of 'virtual' memory.
I tried Perl 5.10.0 (the oldest version I have compiled on my machine), and it used slightly less 'real' memory, essentially the same 'virtual' memory, and was noticeably slower (33.3s for 512 lines; 1m 53.9s for 1024 lines).
Just for comparison purposes, I collected some C code that I had lying around in the test directory to create a file of about 88 KiB, with 3100 lines of which about 200 were comment lines. This compares with the size of the data.1024 file which was about 77 KiB. Processing that took between 10 and 20 milliseconds.
Summary
The non-C source you have makes a very nasty test case. Perl shouldn't crash on it.
Which version of Perl are you using, and on which platform? How much memory does your machine have? Total memory is unlikely to be the issue, though (24 MiB is not a problem on most machines that run Perl). If you have a very old version of Perl, the results might be different.
I also note that the regex does not handle some pathological C comments that a C compiler must handle, such as:
/\
\
* Yes, this is a comment *\
\
/
/\
\
/ And so is this
Yes, you'd be right to reject any code submitted for review that contained such comments.
Related
I am trying to write a script which will analyze data from a pipe. The problem is, a single element is described in a variable number of lines. Look at the example data set:
3 14 -30.48 17.23
4 1 -18.01 12.69
4 3 -11.01 2.69
8 12 -21.14 -8.76
8 14 -18.01 -5.69
8 12 -35.14 -1.76
9 2 -1.01 22.69
10 1 -88.88 17.28
10 1 -.88 14.28
10 1 5.88 1.28
10 1 -8.88 -7.28
In this case, the first entry is what defines the event to which the following data belongs. In the case of event number 8, we have data in 3 lines. To simplify the rather complex problem that I am trying to solve, let us imagine, that I want to calculate the following expression:
sum_i($2 * ($3 + $4))
Where i is taken over all lines belonging to a given element. The output I want to produce would then look like:
3=-185.5 [14(-30.48+17.23) ]
4=-30.28 [1(-18.01+12.69) + 3(-11.01+2.69)]
8=-1106.4 [...]
I thus need a script which reads all the lines that have the same index entry.
I am an AWK newbie; I started learning the language a couple of days ago. I am now uncertain whether I will be able to achieve what I want. Therefore:
Is this doable with AWK?
If not, with what? sed?
If yes, how? I would be grateful if one provided a link describing how this can be implemented.
Finally, I know that there is a similar question: Can awk patterns match multiple lines?, however, I do not have a constant pattern which separates my data.
Thanks!
You could try this:
awk '{ar[$1]+=$2*($3+$4)}
END{for (key in ar)
{print key"="ar[key]}}' inputFile
For each input line we do the desired calculation and accumulate the result in an array; $1 serves as the key of the array.
When the entire file has been read, we print the results in the END{...} block.
The output for the given sample input is:
4=-30.28
8=-1133.4
9=43.36
10=-67.2
3=-185.5
If sorting of the output is required, you might want to have a look at gawk's asorti function or the sort command (e.g. awk '{...}' inputFile | sort -n).
This solution does not require that the input is sorted.
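A self-contained way to try the array-summing approach, assuming the sample above is saved as inputFile; since for (key in ar) yields keys in an unspecified order, the demo pipes the result through sort -n to make the order deterministic:

```shell
# Write the sample data set, run the summing awk program, sort numerically.
cat > inputFile <<'EOF'
3 14 -30.48 17.23
4 1 -18.01 12.69
4 3 -11.01 2.69
8 12 -21.14 -8.76
8 14 -18.01 -5.69
8 12 -35.14 -1.76
9 2 -1.01 22.69
10 1 -88.88 17.28
10 1 -.88 14.28
10 1 5.88 1.28
10 1 -8.88 -7.28
EOF

awk '{ar[$1]+=$2*($3+$4)}
     END{for (key in ar) print key"="ar[key]}' inputFile | sort -n
# -> 3=-185.5
#    4=-30.28
#    8=-1133.4
#    9=43.36
#    10=-67.2
```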
awk 'id!=$1{if(id){print id"="sum;sum=0};id=$1}{sum+=$2*($3+$4)} END{print id"="sum}' file
3=-185.5
4=-30.28
8=-1133.4
9=43.36
10=-67.2
yet another similar awk
$ awk -v OFS="=" 'NR==1{p=$1}
p!=$1{print p,s; s=0; p=$1}
{s+=$2*($3+$4)}
END{print p,s}' file
3=-185.5
4=-30.28
8=-1133.4
9=43.36
10=-67.2
P.S. Your calculation for "8" seems off.
I cannot find a regex that, for a given input such as 1000000000, returns the result 214.
Here is the text I need to regex:
lvl=100 (2626 KB for nbparts) 9522 possible passwords
lvl=101 (2652 KB for nbparts) 10 K possible passwords (10604)
lvl=102 (2678 KB for nbparts) 11 K possible passwords (11805)
...
lvl=213 (5564 KB for nbparts) 956 M possible passwords (956026029)
lvl=214 (5590 KB for nbparts) 1 G possible passwords (1058500959)
lvl=215 (5616 KB for nbparts) 1 G possible passwords (1171975083)
...
lvl=400 (10426 KB for nbparts) 29926014 G possible passwords (29926014173292546)
I need to find the level (lvl=) whose number in the brackets () is bigger than the input; e.g., for the input 1000000000 it would be 214, because 1058500959 is bigger than 1000000000. For this job I'm limited to Bash scripting, and I would love to use grep (in basic (BRE), extended (ERE), or Perl-compatible mode) or similar standard GNU/Linux tools that come pre-installed on Ubuntu.
The input can be in the range of
1 => lvl=101
10000000000000000 => lvl=400
Thank you very much.
Perl to the rescue!
perl -ne 'BEGIN { $trsh = shift }
($lvl, $pswd) = /lvl=([0-9]+).*\(([0-9]+)\)$/;
print "$lvl $pswd\n" and exit if $pswd > $trsh;
' 1000000000 input.txt
In the regex, there are two capture groups, the matching strings are assigned to variables $lvl and $pswd. If $pswd is greater than $trsh (assigned from the first parameter at the beginning), the details are printed and the script ends.
I have an "srt" file (the standard movie-subtitle format), like the one shown at this link: http://pastebin.com/3k8a53SC
Excerpt:
1
00:00:53,000 --> 00:00:57,000
<any text that may span multiple lines>
2
00:01:28,000 --> 00:01:35,000
<any text that may span multiple lines>
But right now the subtitle timing is all wrong, as it lags behind by 9 seconds.
Is it possible to add 9 seconds (+9) to every time entry with a regex?
Even if the milliseconds are set to 000, that's fine, but the addition of 9 seconds should adhere to the "60 seconds = 1 minute & 60 minutes = 1 hour" rules.
Also, the subtitle text after each timing entry must not get altered by the regex.
By the way, the time format for each time string is "Hours:Minutes:Seconds,Milliseconds".
Quick answer is "no", that's not an application for regex. A regular expression lets you MATCH text, but not change it. Changing things is outside the scope of the regex itself, and falls to the language you're using -- perl, awk, bash, etc.
For the task of adjusting the time within an SRT file, you could do this easily enough in bash, using the date command to adjust times.
#!/usr/bin/env bash
offset="${1:-0}"
datematch="^(([0-9]{2}:){2}[0-9]{2}),[0-9]{3} --> (([0-9]{2}:){2}[0-9]{2}),[0-9]{3}"
os=$(uname -s)
while IFS= read -r line; do
if [[ "$line" =~ $datematch ]]; then
# Gather the start and end times from the regex
start=${BASH_REMATCH[1]}
end=${BASH_REMATCH[3]}
# Replace the time in this line with a printf pattern
linefmt="${line//[0-2][0-9]:[0-5][0-9]:[0-5][0-9]/%s}\n"
# Calculate new times
case "$os" in
Darwin|*BSD)
newstart=$(date -v${offset}S -j -f "%H:%M:%S" "$start" '+%H:%M:%S')
newend=$(date -v${offset}S -j -f "%H:%M:%S" "$end" '+%H:%M:%S')
;;
Linux)
newstart=$(date -d "$start today ${offset} seconds" '+%H:%M:%S')
newend=$(date -d "$end today ${offset} seconds" '+%H:%M:%S')
;;
esac
# And print the result
printf "$linefmt" "$newstart" "$newend"
else
# No adjustments required, print the line verbatim.
echo "$line"
fi
done
Note the case statement. This script should auto-adjust for Linux, OSX, FreeBSD, etc.
You'd use this script like this:
$ ./srtadj -9 < input.srt > output.srt
Assuming you named it that, of course. Or more likely, you'd adapt its logic for use in your own script.
No, sorry, you can’t. Regular expressions describe regular languages (see the Chomsky hierarchy, e.g. https://en.wikipedia.org/wiki/Chomsky_hierarchy), and they cannot do arithmetic.
But with a general-purpose language like perl it will work.
It could be a one liner like this ;-)))
perl -n -e 'if(/^(\d\d:\d\d:\d\d)([-,\d\s\>]*)(\d\d:\d\d:\d\d)(.*)/) {print plus9($1).$2.plus9($3).$4."\n";}else{print $_} sub plus9{ ($h,$m,$s)=split(/:/,shift); $t=(($h*60+$m)*60+$s+9); $h=int($t/3600);$r=$t-($h*3600);$m=int($r/60);$s=$r-($m*60);return sprintf "%02d:%02d:%02d", $h, $m, $s;}' movie.srt
with movie.srt like
1
00:00:53,000 --> 00:00:57,000
hello
2
00:01:28,000 --> 00:01:35,000
I like perl
3
00:02:09,000 --> 00:02:14,000
and regex
you will get
1
00:01:02,000 --> 00:01:06,000
hello
2
00:01:37,000 --> 00:01:44,000
I like perl
3
00:02:18,000 --> 00:02:23,000
and regex
You can change the +9 in the "sub plus9{...}", if you want another delta.
How does it work?
We are looking for lines that matches
dd:dd:dd something dd:dd:dd something
and then we call a sub, which adds 9 seconds to the matched group one ($1) and group three ($3). All other lines are printed unchanged.
added
If you want to put the perl oneliner in a file, say plus9.pl, you can add newlines ;-)
if(/^(\d\d:\d\d:\d\d)([-,\d\s\>]*)(\d\d:\d\d:\d\d)(.*)/) {
print plus9($1).$2.plus9($3).$4."\n";
} else {
print $_
}
sub plus9{
($h,$m,$s)=split(/:/,shift);
$t=(($h*60+$m)*60+$s+9);
$h=int($t/3600);
$r=$t-($h*3600);
$m=int($r/60);
$s=$r-($m*60);
return sprintf "%02d:%02d:%02d", $h, $m, $s;
}
Regular expressions strictly do matching and cannot add/subtract. You can match each datetime string using Python, for example, add 9 seconds to it, and then rewrite the string in the appropriate spot. The regular expression I would use to match it would be the following:
(?<hour>\d+):(?<minute>\d+):(?<second>\d+),(?<msecond>\d+)
It has labeled capture groups so it's really easy to get each section (you won't need msecond but it's there for visualization, I guess)
Regex101
I'm trying to create a script to process data from ping. So it will come from a file in the standard format with timestamps:
PING google.com (4.34.16.45) 56(84) bytes of data.
[1393790120.617504] 64 bytes from 4.34.16.45: icmp_req=1 ttl=63 time=25.7 ms
[1393790135.669873] 64 bytes from 4.34.16.45: icmp_req=2 ttl=63 time=30.2 ms
[1393790150.707266] 64 bytes from 4.34.16.45: icmp_req=3 ttl=63 time=20.6 ms
[1393790161.195257] 64 bytes from 4.34.16.45: icmp_req=4 ttl=63 time=35.2 ms
--- google.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 45145ms
rtt min/avg/max/mdev = 20.665/27.970/35.246/5.390 ms
I want to cut it to just the timestamp, time and request number like so (note this is from a different data set, given as an example):
0.026202538597014928 26.2 1
0.53210253859701473 24.5 2
1.0482067203067074 32.0 3
1.6627447926949444 139.6 4
2.2686229201578056 237.1 5
I realize I need to use sed to accomplish this. But I'm still really confused as to what the expressions would be to cut to the data properly. I imagine I would have something along these lines:
cat $inFile | grep -o "$begin$regex$end" | sed "s/$end//g" | sed "s/$begin//g" > $outFile
I'm just not sure what $begin and $end would be.
TL;DR Help me understand regular expressions?
You can try following sed command:
sed -ne '
2,/^$/ {
/^$/! {
s/^\[\([^]]*\).*icmp_req=\([0-9]*\).*time=\([0-9.]*\).*$/\1 \3 \2/
p
}
}
' infile
It uses the -n switch to avoid automatic printing of input lines. It selects a range of lines between the second one and the first blank one, and for each line in that range it captures the pieces of text I want to extract.
Assuming infile with the content of the question, it yields:
1393790120.617504 25.7 1
1393790135.669873 30.2 2
1393790150.707266 20.6 3
1393790161.195257 35.2 4
UPDATE with Scrutinizer's simpler solution (see comments):
sed -n 's/^\[\([^]]*\).*icmp_req=\([0-9]*\).*time=\([0-9.]*\).*$/\1 \3 \2/p' infile
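To try it, with a couple of the ping lines from the question saved as infile:

```shell
# Sample ping output; the header line never matches, so -n drops it.
cat > infile <<'EOF'
PING google.com (4.34.16.45) 56(84) bytes of data.
[1393790120.617504] 64 bytes from 4.34.16.45: icmp_req=1 ttl=63 time=25.7 ms
[1393790135.669873] 64 bytes from 4.34.16.45: icmp_req=2 ttl=63 time=30.2 ms
EOF

sed -n 's/^\[\([^]]*\).*icmp_req=\([0-9]*\).*time=\([0-9.]*\).*$/\1 \3 \2/p' infile
# -> 1393790120.617504 25.7 1
#    1393790135.669873 30.2 2
```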
For good measure, here's an awk solution:
awk -F "[][ =]" '/^\[/ { print $2, $13, $9 }' file
Takes advantage of awk's ability to parse lines into fields based on a regex as the separator - here, any of the following characters: [, ], space, or =.
Simply prints out the fields of interest by index, for lines that start with [.
For a pure regex solution, see this expression:
\[([\d\.]*)].*?=(\d+).*?=([\d\.]*) ms
You can view an online demo here:
Regex101.com
Objective
On Linux, I am trying to get an end-user friendly string representing available system memory.
Example:
Your computer has 4 GB of memory.
Success criteria
I consider these aspects end-user friendly (you may disagree):
1G is more readable than 1.0G (1 Vs 1.0)
1GB is more readable than 1G (GB Vs G)
1 GB is more readable than 1GB (space-separated unit of measure)
memory is more readable than RAM, DDR or DDR3 (no jargon)
Starting point
The free utility from procps-ng has an option intended for humans:
-h, --human
Show all output fields automatically scaled to shortest three digit unit
and display the units of print out. Following units are used.
B = bytes
K = kilos
M = megas
G = gigas
T = teras
If unit is missing, and you have petabyte of RAM or swap, the number is
in terabytes and columns might not be aligned with header.
so I decided to start there:
> free -h
total used free shared buffers cached
Mem: 3.8G 1.4G 2.4G 0B 159M 841M
-/+ buffers/cache: 472M 3.4G
Swap: 4.9G 0B 3.9G
3.8G sounds promising so all I have to do now is...
Required steps
Filter the output for the line containing the human-readable string (i.e. Mem:)
Pick out the memory total from the middle of the line (i.e. 3.8G)
Parse out the number and unit of measure (i.e. 3.8 and G)
Format and display a string more to my liking (e.g. G ↝ GB, ...)
My attempt
free -h | \
awk '/^Mem:/{print $2}' | \
perl -ne '/(\d+(?:\.\d+)?)(B|K|M|G|T)/ && printf "%g %sB\n", $1, $2'
outputs:
3.8 GB
Desired solution
I'd prefer to just use gawk, but I don't know how
Use a better, even canonical if there is one, way to parse a "float" out of a string
I don't mind the fastidious matching of "just the recognised magnitude letters" (B|K|M|G|T), even if this would unnecessarily break the match with the introduction of new sizes
I use %g to output 4.0 as 4, which is something you may disagree with, depending on how you feel about these comments: https://unix.stackexchange.com/a/70553/10283.
My question, in summary
Could you do the above in awk only?
Could my perl be written more elegantly than that, keeping the strictness of it?
Remember:
I am a beginner robot. Here to learn. :]
What I learned from Andy Lester
Summarised here for my own benefit: to cement learning, if I can.
Use regex character classes, not regex alternation, to pick out one character from a set
perl has a -a option, which splits $_ from -e or -n into @F:
for example, this gawk:
echo foo bar baz | awk '{print $2}'
can be written like this in perl:
echo foo bar baz | perl -ane 'print "$F[1]\n";'
Unless there is something equivalent to gawk's --field-separator, I think I still like gawk better, although of course doing everything in perl is both cleaner and more efficient. (is there an equivalent?)
EDIT: actually, this proves there is, and it's -F just like in gawk:
echo ooxoooxoooo | perl -Fx -ane 'print join "\n", @F'
outputs:
oo
ooo
oooo
perl has a -l option, which is just awesome: think of it as Python's str.rstrip (see the link if you are not a Python head) applied to $_, except that it also re-appends the \n to the output automatically for you
Thanks, Andy!
Yes, I'm sure you could do this awk-only, but I'm a Perl guy so here's how you'd do it Perl-only.
Instead of (B|K|M|G|T) use [BKMGT].
Use Perl's -l to automatically strip newlines from input and add them on output.
I don't see any reason to have Awk do some of the stripping and Perl do the rest. You can do autosplitting of fields with Perl's -a.
I don't know what the output from free -h is exactly (My free doesn't have an -h option) so I'm guessing at this
free -h | \
perl -alne'/^Mem:/ && ($F[1]=~/(\d+(?:\.\d+)?)([BKMGT])/) && printf( "%g %sB", $1, $2)'
An awk (actually gawk) solution
free -h | awk 'FNR == 2 {if (match($2,"[BKMGT]$",a)) r=sprintf("%.0f %sB",substr($2,1,RSTART-1), a[0]); else r=$2 " B";print "Your computer has " r " of memory."}'
or broken down for readability
free -h | awk 'FNR == 2 {if (match($2,"[BKMGT]$",a)) r=sprintf("%.0f %sB",
substr($2,1,RSTART-1), a[0]); else r=$2 " B";
print "Your computer has " r " of memory."}'
Where
FNR is the current line number (if it equals 2, run the {} commands)
$2 is the 2nd field
if (condition) command; else command;
match(string, regex, array) fills array with the match; the regex says "must end with one of B, K, M, G or T"
r=sprintf(...) sets the variable r, with %.0f printing the float with no decimals
RSTART tells where the match occurred; a[0] is the matched string
Output with the example above:
Your computer has 4 GB of memory.
Another lengthy Perl answer:
free -b |
perl -lane 'if(/Mem/){ @u=("B","KB","MB","GB"); $F[2]/=1024, shift @u while ($F[2]>1024); printf("%.2f %s", $F[2],$u[0])}'