Performance: grep vs perl -ne (regex)

I ran a performance test with a really surprising result: Perl is more than 20 times faster!
Is this normal?
Does it result from my regular expression?
Is egrep far slower than grep?
... I tested on a current Cygwin and a current OpenSuSE 13.1 in VirtualBox.
Fastest test, with Perl:
time zcat log.gz \
| perl -ne 'print if ($_ =~ /^\S+\s+\S+\s+(ERROR|WARNING|SEVERE)\s/ )' \
| tail
2014-06-24 14:51:43,929 SEVERE ajp-0.0.0.0-8009-13 SessionDataUpdateManager cannot register active data when window has no name
2014-06-24 14:52:01,031 ERROR HFN SI ThreadPool(4)-442 CepEventUnmarshaler Unmarshaled Events Duration: 111
2014-06-24 14:52:03,556 ERROR HFN SI ThreadPool(4)-444 CepEventUnmarshaler Unmarshaled Events Duration: 52
2014-06-24 14:52:06,789 SEVERE ajp-0.0.0.0-8009-1 SessionDataUpdateManager cannot register active data when window has no name
2014-06-24 14:52:06,792 SEVERE ajp-0.0.0.0-8009-1 SessionDataUpdateManager cannot register active data when window has no name
2014-06-24 14:52:07,371 SEVERE ajp-0.0.0.0-8009-9 SessionDataUpdateManager cannot register active data when window has no name
2014-06-24 14:52:07,373 SEVERE ajp-0.0.0.0-8009-9 SessionDataUpdateManager cannot register active data when window has no name
2014-06-24 14:52:07,780 SEVERE ajp-0.0.0.0-8009-11 SessionDataUpdateManager cannot register active data when window has no name
2014-06-24 14:52:07,782 SEVERE ajp-0.0.0.0-8009-11 SessionDataUpdateManager cannot register active data when window has no name
2014-06-24 15:06:24,119 ERROR HFN SI ThreadPool(4)-443 CepEventUnmarshaler Unmarshaled Events Duration: 117
real 0m0.151s
user 0m0.062s
sys 0m0.139s
fine!
Far slower test with egrep:
time zcat log.gz \
| egrep '^\S+\s+\S+\s+(ERROR|WARNING|SEVERE)\s' \
| tail
...
real 0m2.454s
user 0m2.448s
sys 0m0.092s
(Output was the same as above...)
Finally, an even slower grep with a different notation (my first try):
time zcat log.gz \
| egrep '^[^\s]+\s+[^\s]+\s+(ERROR|WARNING|SEVERE)\s' \
| tail
...
real 0m4.295s
user 0m4.272s
sys 0m0.138s
(Output was the same as above...)
The file is about 2,000,000 lines and about 500 MB un-gzipped; the matching line count is very small.
My tested versions:
OpenSuSE with grep (GNU grep) 2.14
cygwin with grep (GNU grep) 2.16
Perhaps some bug in newer grep versions?

You should be able to make the Perl a little bit faster by making the parentheses non-capturing:
(?:ERROR|WARNING|SEVERE)
Also, it's unnecessary to match against $_. $_ is assumed if there is nothing specified. That's why it exists.
perl -ne 'print if /^\S+\s+\S+\s+(?:ERROR|WARNING|SEVERE)\s/'

You are being tricked by your operating system's caches. When reading and grepping files, several cache layers sit between you and the disk:
the hard drive's own cache
the OS read cache
To really know what's going on, it's a good idea to warm up these caches by running some test passes that do the same work but don't count. After those warm-up runs, start your timing, as sketched below.
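For example, a minimal warm-up sketch (using the log.gz file from the question; the exact commands are illustrative):
# throwaway pass to populate the page cache; the result is discarded
zcat log.gz > /dev/null
# only now take the measurement
time zcat log.gz | egrep '^\S+\s+\S+\s+(ERROR|WARNING|SEVERE)\s' | tail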

As Chris Hamel commented: when not using the "|" alternation, grep becomes about 10+ times faster - still slower than Perl.
time zcat log.gz \
| egrep '^[^\s]+\s+[^\s]+\s+(ERROR)\s' \
| tail
...
real 0m0.216s
user 0m0.062s
sys 0m0.123s
So with two "|" alternations the grep run gets more than 3 times slower than running three greps one after another - sounds like a bug to me... any earlier grep version to test around? I also have a RedHat 5 box... grep seems similarly slow there...
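For reference, "three greps one after another" means something like this sketch (patterns follow the earlier egrep test; timings will of course vary):
time zcat log.gz | egrep '^\S+\s+\S+\s+ERROR\s' | tail
time zcat log.gz | egrep '^\S+\s+\S+\s+WARNING\s' | tail
time zcat log.gz | egrep '^\S+\s+\S+\s+SEVERE\s' | tail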

Related

Find how big a work directory can get during execution - Linux

I have a cron job that works with lots of data and then deletes all the temp files it creates. During the execution I get 'ERROR: Insufficient space in file WORK.AIB_CUSTOMER_DATA.DATA.' The current work directory has 50G free; when I run the code in another directory with 170G of free space, I don't get the error. I want to track the size of the working directory during the execution.
I'm afraid I might not fully understand your problem.
To get an idea of how fast it is growing in terms of size, you could run a simple script like this:
#!/bin/bash
while true
do
    # uncomment this to check all partitions of the system
    #df -h >> log.log
    # uncomment this to check the files in the current folder
    #du -sh * >> log.log
    sleep 1
done
Then analyze the logs and see the increase in size.
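If you logged the du -sh lines, a quick way to pull out the largest recorded size could be the following one-liner (sort -h understands human-readable units such as K, M, G):
awk '{print $1}' log.log | sort -h | tail -1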
I wrote this script and let it run during the job execution to monitor the directory size and capture the maximum size this work directory reaches.
#!/bin/bash
Max=0
while true
do
    # total size in bytes of all matching work directories (-B1 forces plain byte counts)
    SIZE=$(du -s -B1 /data/work/EXAMPLE_work* | awk '{total += $1} END {print total}')
    echo "size: $SIZE"
    echo "max: $Max"
    if [ "$SIZE" -ge "$Max" ]
    then
        echo "big size: $SIZE" > /home/mmm/Biggestsize.txt
        Max=$SIZE
    else
        echo "small size: $SIZE" > /home/mmm/sizeSmall.txt
    fi
    sleep 1
done

C++ program significantly slower when run in bash

I have a query regarding Bash. I have been running some of my own C++ programs in conjunction with commercial programs and controlling their interaction (via input and output files) through Bash scripting. I am finding that if I run my C++ program alone in a terminal it completes in around 10–15 seconds, but when I run the same program through the bash script it can take up to 5 minutes to complete in each case.
Using System Monitor, I find that 100% of one CPU is consistently used when I run the program directly in the terminal, whereas when I run it through bash (in a loop) at most 60% CPU usage is recorded, which seems to be linked to the longer completion time (although the average CPU usage over the 4 processors is higher).
This is quite frustrating, as until recently this was not a problem.
An example of the code:
#!/usr/bin/bash
DIR="$1"
TRCKDIR=$DIR/TRCKRSLTS
STRUCTDIR=$DIR
SHRTTRCKDIR=$TRCKDIR/SHRT_TCK_FILES
VTAL=VTAL.png
VTAR=VTAR.png
NAL=$(find $STRUCTDIR | grep NAL)
NAR=$(find $STRUCTDIR | grep NAR)
AMYL=$(find $STRUCTDIR | grep AMYL)
AMYR=$(find $STRUCTDIR | grep AMYR)
TCKFLS=($(find $TRCKDIR -maxdepth 1 | grep .fls))
numTCKFLS=${#TCKFLS[@]}
for i in $(seq 0 $((numTCKFLS-1))); do
filenme=${TCKFLS[i]}
filenme=${filenme%.t*}
filenme=${filenme##*/}
if [[ "$filenme" == *VTAL* || "$filenme" == *VTA_L* ]]; then
STREAMLINE_CUTTER -MRT ${TCKFLS[i]} -ROI1 $VTAL -ROI2 $NAL -op "$SHRTTRCKDIR"/"$filenme"_VTAL_NAL.fls
STREAMLINE_CUTTER -MRT ${TCKFLS[i]} -ROI1 $VTAL -ROI2 $AMYL -op "$SHRTTRCKDIR"/"$filenme"_VTAL_AMYL.fls
fi
if [[ "$filenme" == *VTAR* || "$filenme" == *VTA_R* ]];then
STREAMLINE_CUTTER -MRT ${TCKFLS[i]} -ROI1 $VTAR -ROI2 $NAR -op "$SHRTTRCKDIR"/"$filenme"_VTAR_NAR.fls
STREAMLINE_CUTTER -MRT ${TCKFLS[i]} -ROI1 $VTAR -ROI2 $AMYR -op "$SHRTTRCKDIR"/"$filenme"_VTAR_AMYR.fls
fi
done

List all devices, partitions and volumes in Powershell

I have multiple volumes (as nearly everybody does nowadays): on Windows they end up specified as C:, D: and so on. How do I list them all with PowerShell, like "ls /mnt/" on a Unix machine?
To get all of the file system drives, you can use the following command:
gdr -PSProvider 'FileSystem'
gdr is an alias for Get-PSDrive, which includes all of the "virtual drives" for the registry, etc.
Get-Volume
You will get:
DriveLetter, FileSystemLabel, FileSystem, DriveType, HealthStatus, SizeRemaining and Size.
On Windows Powershell:
Get-PSDrive
[System.IO.DriveInfo]::getdrives()
wmic diskdrive
wmic volume
Also the utility dskwipe: http://smithii.com/dskwipe
dskwipe.exe -l
Firstly, on Unix you use mount, not ls /mnt: many things are not mounted in /mnt.
Anyhow, there's the mountvol DOS command, which continues to work in Powershell, and there's the Powershell-specific Get-PSDrive.
Though this isn't PowerShell specific... you can easily list the drives and partitions using diskpart with list volume:
PS C:\Dev> diskpart
Microsoft DiskPart version 6.1.7601
Copyright (C) 1999-2008 Microsoft Corporation.
On computer: Box
DISKPART> list volume
Volume ### Ltr Label Fs Type Size Status Info
---------- --- ----------- ----- ---------- ------- --------- --------
Volume 0 D DVD-ROM 0 B No Media
Volume 1 C = System NTFS Partition 100 MB Healthy System
Volume 2 G C = Box NTFS Partition 244 GB Healthy Boot
Volume 3 H D = Data NTFS Partition 687 GB Healthy
Volume 4 E System Rese NTFS Partition 100 MB Healthy
Run command:
Get-PsDrive -PsProvider FileSystem
For more info see:
PsDrive Documentation
PSProvider Documentation
This is pretty old, but I found the following worth noting:
PS N:\> (measure-command {Get-WmiObject -Class Win32_LogicalDisk|select -property deviceid|%{$_.deviceid}|out-host}).totalmilliseconds
...
928.7403
PS N:\> (measure-command {gdr -psprovider 'filesystem'|%{$_.name}|out-host}).totalmilliseconds
...
169.474
Without filtering the properties, on my test system it is 4319.4196 ms versus 1777.7237 ms. Unless I need a PS-Drive object returned, I'll stick with WMI.
EDIT:
I think we have a winner:
PS N:\> (measure-command {[System.IO.DriveInfo]::getdrives()|%{$_.name}|out-host}).totalmilliseconds
110.9819
We have multiple volumes per drive (some are mounted on subdirectories on the drive). This code shows a list of the mount points and volume labels. Obviously you can also extract free space and so on:
gwmi win32_volume | where-object {$_.filesystem -match "ntfs"} | sort {$_.name} | foreach-object {
    echo "$($_.name) [$($_.label)]"
}
You can also do it on the CLI with
net use
You can use the following to find the "total" disk size on a drive as well.
Get-CimInstance -ComputerName yourhostname win32_logicaldisk | foreach-object {write " $($_.caption) $('{0:N2}' -f ($_.Size/1gb)) GB total, $('{0:N2}' -f ($_.FreeSpace/1gb)) GB free "}
Microsoft have a way of doing this as part of their az vm repair scripts (see: Repair a Windows VM by using the Azure Virtual Machine repair commands).
It is available under MIT license at: https://github.com/Azure/repair-script-library/blob/51e60cf70bba38316394089cee8e24a9b1f22e5f/src/windows/common/helpers/Get-Disk-Partitions.ps1
If the device is present, but not (yet) mounted, this helps:
Get-PnpDevice -PresentOnly -InstanceId SCSI*

Calculate MP3 average volume

I need to know the average volume of an mp3 file so that when I convert it to mp3 (at a different bitrate) I can scale the volume too, to normalize it...
Therefore I need a command line tool / ruby library that gives me the average volume in dB.
You can use SoX (an open-source command-line audio tool, http://sox.sourceforge.net/sox.html) to normalize and transcode your files at the same time.
EDIT
Looks like it doesn't have options for bit-rate. Anyway, sox is probably overkill if LAME does normalization.
You can use LAME to encode to mp3. It has options for normalization, scaling, and bitrate. LAME also compiles to virtually any platform.
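For instance, a minimal sketch of scaling the volume while re-encoding at a different bitrate (file names are placeholders; --scale multiplies the decoded samples, -b sets the bitrate in kbit/s):
lame --mp3input --scale 0.8 -b 128 input.mp3 output.mp3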
I wrote a little wrapper script, based on the above input:
#!/bin/sh
# Get the current volume (will reset to this later).
current=`amixer -c 0 get Master 2>&1 |\
awk '/%/ {
p=substr($4,2,length($4)-2);
if( substr(p,length(p)) == "%" )
{
p = substr(p,1,length(p)-1)
}
print p
}'`
# Figure out how loud the track is. The normal amplitude for a track is 0.1.
# Ludicrously low values are 0.05, high is 0.37 (!!?)
rm -f /tmp/$$.out
/usr/bin/mplayer -vo null -ao pcm:file=/tmp/$$.out "$1" >/dev/null 2>&1
if [ $? = 0 ] ; then
amplitude=`/usr/bin/sox /tmp/$$.out -n stat 2>&1 | awk '/RMS.+amplitude/ {print $NF}'`
fi
rm -f /tmp/$$.out
# Set an appropriate volume for the track.
to=`echo $current $amplitude | awk '{printf( "%.0f%%", $1 * 0.1/$2 );}'`
echo $current $amplitude | awk '{print "Amplitude:", $2, " Setting volume to:", 10/$2 "%, mixer volume:", $1 * 0.1/$2}'
amixer -c 0 set Master $to >/dev/null 2>&1
mplayer -quiet -cache 2500 "$1"
# Reset the volume for next time.
amixer -c 0 set Master "$current%" >/dev/null 2>&1
It takes an extra second to start up playing the file, and relies on alsamixer to adjust the volume, but it does a really nice job of keeping you from having to constantly tweak the master volume. And it doesn't really care what the input format is, since if mplayer can play it at all, it can extract the audio, so it should work fine with MP3, Ogg, AVI, whatever.
http://mp3gain.sourceforge.net/ is a well thought out solution for this.

Capture CPU and Memory usage dynamically

I am running a shell script to execute a C++ application, which measures the performance of an API. I can capture the latency (the time taken to return a value for a given set of parameters) of the API, but I also wish to capture the CPU and memory usage alongside it, at intervals of say 5-10 seconds.
Is there a way to do this without affecting the performance of the system too much, and within the same script? I have found many examples of doing it outside (independently) of the script we are running, but none where it is done within the same script.
If you are looking to capture CPU and memory utilization dynamically for the entire Linux box, the following commands can help you too:
CPU
vmstat -n 15 10| awk '{now=strftime("%Y-%m-%d %T "); print now $0}'> CPUDataDump.csv &
vmstat is used to collect the CPU counters
-n prints the header only once; 15 is the delay, meaning stats are collected every 15 seconds
then 10 is the number of intervals, so there would be 10 iterations in this example
awk '{now=strftime("%Y-%m-%d %T "); print now $0}' prepends a timestamp to each iteration
in the end, the output goes to the dump file, with & to keep it running in the background
Memory
free -m -s 10 -c 10 | awk '{now=strftime("%Y-%m-%d %T "); print now $0}'> DataDumpMemoryfile.csv &
free is for memory stats collection
-m sets the units to megabytes (you can use -b for bytes, -k for kilobytes, -g for gigabytes)
-s 10 sets the interval and -c 10 the number of iterations (there would be 10 iterations in this example)
awk '{now=strftime("%Y-%m-%d %T "); print now $0}' prepends a timestamp to each iteration
in the end, the output goes to the dump file, with & to keep it running in the background
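To do all of this from within the same script, a minimal sketch could be to start both samplers in the background, run the application, and then stop them (./my_api_benchmark and the file names are placeholders):
#!/bin/bash
# sample system-wide CPU and memory every 5 seconds, timestamped, in the background
vmstat -n 5 1000 | awk '{now=strftime("%Y-%m-%d %T "); print now $0}' > CPUDataDump.csv &
VMSTAT_PID=$!
free -m -s 5 -c 1000 | awk '{now=strftime("%Y-%m-%d %T "); print now $0}' > DataDumpMemoryfile.csv &
FREE_PID=$!

./my_api_benchmark          # placeholder for the C++ application under test

# stop the timestamping readers; vmstat and free exit on the broken pipe at their next write
kill "$VMSTAT_PID" "$FREE_PID" 2>/dev/null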
I'd suggest using the 'time' command and also the 'vmstat' command. The first gives the CPU usage of the executable's run, and the second a periodic (e.g. once per second) dump of the system's CPU/memory/IO.
Example:
time dd if=/dev/zero bs=1K of=/dev/null count=1024000
1024000+0 records in
1024000+0 records out
1048576000 bytes (1.0 GB) copied, 0.738194 seconds, 1.4 GB/s
0.218u 0.519s 0:00.73 98.6% 0+0k 0+0io 0pf+0w <== that's the output of time