Extract a numeric substring and another line as variables from multi-line text file - regex

I'm writing a bash shell script that I hope to ultimately use to automate the naming and 'attachment' of scanned documents to our db. The script OCR's a section of the first page of the pdf and outputs a text file containing three lines; a name, unique id, and a datetime string:
Smith, John
Case #: 234567 ( )
09/04/2013 11:34 AM
What I'd like to do is end up with two seperate strings as variables, "Smith, John" and "234567". I'm looking for help using regex with sed/awk/etc to extract this number. One issue is that the OCR will rarely output strings like:
"Case #2 234567 ( )"
or
"Ca$e # 2234567 ( 7"
So I'm thinking to take the only last 6-digits in the string, since only maybe 1 in 10,000+ of these ever get the last 6-digits read incorrectly. This unique ID is only 6 digits, and is always between 200000-999999. I'm learning regex, but it's slow going. Any help is greatly appreciated.
Edit:
For now I am using:
casename="$(cat test.txt | sed '1!d')"
casenum="$(cat test.txt | sed -n -r 's/.*([0-9]{6}).*/\1/p')"
echo ${casenum} ${casename}
234567 Smith, John
Any input for why this might not be a good way to do it, or what could be improved is (very) welcomed.

you might try this Regex (BRE):
[2-9][0-9]\{5\}\>

You could use the following regex for the second line:
^.*(\d{6})[^\d].*$
Here, the first named sub-group would denote the digits of interest.
For example, using Notepad++,
Orginal text:
Replace options:
Resulting text:
The regex should remain more or less the same across environments. You might need to simply change the way the named subexpression ($1 here) is referenced.

You could probably use something like this untested but somewhat-syntactically-valid snippet:
shopt -s extglob
declare -a cases
for casefile in casefiles/*
do
name=""
while read l
do
if [[ -z "$name" ]]
then
[[ "$l" == #(*, *) ]] && name=$l
elif [[ "$l" == +([0-9]) ]]
then
after=${l#*[2-9][0-9][0-9][0-9][0-9][0-9]}
l=${l%$after}
l=${l#${l%[2-9][0-9][0-9][0-9][0-9][0-9]}}
if [[ "$l" == #([2-9][0-9][0-9][0-9][0-9][0-9]) ]]
then
cases[$l]=$name
fi
name=""
fi
done < $casefile
done
The "hard part" prunes the first 6-digit number in your range and everything after it, then removes what's left (the stuff before the number) from the line. Then it removes the number from the beginning of the string, and removes what's left (the part after the number) from the end. If what's left is a 6-digit number in your range, it uses that as the index and the case name as the value in an array which you can later iterate over.
The rest should be pretty straightforward. :) If this doesn't quite work as expected, I blame the fact that I mostly use ksh, not bash. ;)

Related

Regex "start of string" anchor not working

The regex I'm working with at the moment is as follows:
^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$
I'm trying to match all floating point numbers, but ONLY the number. For example, the following should match:
6.0
1.22E3
-2
2.99999e-12
However, the following should NOT match:
somestring///////6.0
I've tested the above regex on this validation site and it works as expected. When running it in my bash script, however, nothing at all matches.
This is my bash code:
if [[ "$VAL" =~ ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$ ]]
then
echo $VAL, "is a number"
else
echo $VAL, "is not a number"
fi
I've tried removing the anchors, and it matches any strings that contain floating points. However, strings like "//////6.00007" match. The $ anchor works as expected; however, the ^ does not.
Please let me know if you have any suggestions for troubleshooting this issue.
Thanks!
Edit 1
Removed bad examples
Edit 2
I ran the regex in its own foo.sh as suggested by #lurker and the code ran as expected with my test cases. So I looked at what was being compared with the regex. When I echoed what was being compared, everything looked fine, so it made zero sense as to why the regex wasn't matching.
Then I began to suspect that echo wasn't displaying what was actually in $VAL for some reason.
So I ran this: NEWVAL=(echo $VAL) as a temporary workaround until I can figure out what's going on.
Your regex does not allow decimals in the exponent. Exponents may have decimals, so your either need to change your definition of what a 'number' is or you need to change your regex.
Assuming the later, here is a correction (Bash 4.4):
echo "6.0
1.22E3.7
-2
2.99999e-0.0001
somestring///////6.0" >/tmp/f1.txt
while IFS= read -r line || [[ -n $line ]]; do
if [[ "$line" =~ ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[-+]?[0-9]*\.?[0-9]+)?$ ]]
then
echo $line, "is a number"
else
echo $line, "is not a number"
fi
done < /tmp/f1.txt
Prints:
6.0, is a number
1.22E3.7, is a number
-2, is a number
2.99999e-0.0001, is a number
somestring///////6.0, is not a number
BUT you should know, that most consider the only two legit numbers in your list to be 6.0 and -2. Easy way to test is with awk:
$ awk '$1+0==$1{print $0 " is a number"; next} {print $0 " not a number"}' /tmp/f1.txt
6.0 is a number
1.22E3.7 not a number
-2 is a number
2.99999e-0.0001 not a number
somestring///////6.0 not a number
The same C language function that awk uses to convert a string to a float is used by many other languages (Ruby, Perl, Python, C, C++, Swift, etc etc) so if you consider your format valid, you are probably going to be writing your own conversion routine as well.
For example, in most languages your can enter 10**1.5 as legit float literal. No language I know of accepts decimal numbers after the e in a string of the form 'xx.zzEyy.y'
As it turns out, the variables that I was comparing with my regex had leading newlines on them (e.g. "\n2.3333") that were being stripped away with echo. So, when I displayed the values to the screen with echo I would see the stripped version of my variable, which was not what was being compared with the regex.
Lesson learned: echo isn't always trustworthy. Per #CharlesDuffy's comment, use one of the following to see what's actually in your variables: declare -p varname or printf '%q\n' "$varname" but do not use echo $varname.

Should I use AWK or SED to remove commas between quotation marks from a CSV file? (BASH)

I have a bunch of daily printer logs in CSV format and I'm writing a script to keep track of how much paper is being used and save the info to a database, but I've come across a small problem
Essentially, some of the document names in the logs include commas in them (which are all enclosed within double quotes), and since it's in comma separated format, my code is messing up and pushing everything one column to the right for certain records.
From what I've been reading, it seems like the best way to go about fixing this would be using awk or sed, but I'm unsure which is the best option for my situation, and how exactly I'm supposed to implement it.
Here's a sample of my input data:
2015-03-23 08:50:22,Jogn.Doe,1,1,Ineo 4000p,"MicrosoftWordDocument1",COMSYRWS14,A4,PCL6,,,NOT DUPLEX,GRAYSCALE,35kb,
And here's what I have so far:
#!/bin/bash
#Get today's file name
yearprefix="20"
currentdate=$(date +"%m-%d-%y");
year=${currentdate:6};
year="$yearprefix$year"
month=${currentdate:0:2};
day=${currentdate:3:2};
filename="papercut-print-log-$year-$month-$day.csv"
echo "The filename is: $filename"
# Remove commas in between quotes.
#Loop through CSV file
OLDIFS=$IFS
IFS=,
[ ! -f $filename ] && { echo "$Input file not found"; exit 99; }
while read time user pages copies printer document client size pcl blank1 blank2 duplex greyscale filesize blank3
do
#Remove headers
if [ "$user" != "" ] && [ "$user" != "User" ]
then
#Remove any file name with an apostrophe
if [[ "$document" =~ "'" ]];
then
document="REDACTED"; # Lazy. Need to figure out a proper solution later.
fi
echo "$time"
#Save results to database
mysql -u username -p -h localhost -e "USE printerusage; INSERT INTO printerlogs (time, username, pages, copies, printer, document, client, size, pcl, duplex, greyscale, filesize) VALUES ('$time', '$user', '$pages', '$copies', '$printer', '$document', '$client', '$size', '$pcl', '$duplex', '$greyscale', '$filesize');"
fi
done < $filename
IFS=$OLDIFS
Which option is more suitable for this task? Will I have to create a second temporary file to get this done?
Thanks in advance!
As I wrote in another answer:
Rather than interfere with what is evidently source data, i.e. the stuff inside the quotes, you might consider replacing the field-separator commas (with say |) instead:
s/,([^,"]*|"[^"]*")(?=(,|$))/|$1/g
And then splitting on | (assuming none of your data has | in it).
Is it possible to write a regular expression that matches a particular pattern and then does a replace with a part of the pattern
There is probably an easier way using sed alone, but this should work. Loop on the file, for each line match the parentheses with grep -o then replace the commas in the line with spaces (or whatever it is you would like to use to get rid of the commas - if you want to preserve the data you can use a non printable and explode it back to commas afterward).
i=1 && IFS=$(echo -en "\n\b") && for a in $(< test.txt); do
var="${a}"
for b in $(sed -n ${i}p test.txt | grep -o '"[^"]*"'); do
repl="$(sed "s/,/ /g" <<< "${b}")"
var="$(sed "s#${b}#${repl}#" <<< "${var}")"
done
let i+=1
echo "${var}"
done

Can adding a particular number to a bunch of "time" strings, be done in Regex

I have a "srt" file(like standard movie-subtitle format) like shown in below link:http://pastebin.com/3k8a53SC
Excerpt:
1
00:00:53,000 --> 00:00:57,000
<any text that may span multiple lines>
2
00:01:28,000 --> 00:01:35,000
<any text that may span multiple lines>
But right now the subtitles timing is all wrong, as it lags behind by 9 seconds.
Is it possible to add 9 seconds(+9) to every time entry with regex ?
Even if the milliseconds is set to 000 then it's fine, but the addition of 9 seconds should adhere to "60 seconds = 1 minute & 60 minutes = 1 hour" rules.
Also the subtitle text after timing entry must not get altered by regex.
By the way the time format for each time string is "Hours:Minutes:Seconds.Milliseconds".
Quick answer is "no", that's not an application for regex. A regular expression lets you MATCH text, but not change it. Changing things is outside the scope of the regex itself, and falls to the language you're using -- perl, awk, bash, etc.
For the task of adjusting the time within an SRT file, you could do this easily enough in bash, using the date command to adjust times.
#!/usr/bin/env bash
offset="${1:-0}"
datematch="^(([0-9]{2}:){2}[0-9]{2}),[0-9]{3} --> (([0-9]{2}:){2}[0-9]{2}),[0-9]{3}"
os=$(uname -s)
while read line; do
if [[ "$line" =~ $datematch ]]; then
# Gather the start and end times from the regex
start=${BASH_REMATCH[1]}
end=${BASH_REMATCH[3]}
# Replace the time in this line with a printf pattern
linefmt="${line//[0-2][0-9]:[0-5][0-9]:[0-5][0-9]/%s}\n"
# Calculate new times
case "$os" in
Darwin|*BSD)
newstart=$(date -v${offset}S -j -f "%H:%M:%S" "$start" '+%H:%M:%S')
newend=$(date -v${offset}S -j -f "%H:%M:%S" "$end" '+%H:%M:%S')
;;
Linux)
newstart=$(date -d "$start today ${offset} seconds" '+%H:%M:%S')
newend=$(date -d "$end today ${offset} seconds" '+%H:%M:%S')
;;
esac
# And print the result
printf "$linefmt" "$newstart" "$newend"
else
# No adjustments required, print the line verbatim.
echo "$line"
fi
done
Note the case statement. This script should auto-adjust for Linux, OSX, FreeBSD, etc.
You'd use this script like this:
$ ./srtadj -9 < input.srt > output.srt
Assuming you named it that, of course. Or more likely, you'd adapt its logic for use in your own script.
No, sorry, you can’t. Regex are a context free language (see Chomsky e.g. https://en.wikipedia.org/wiki/Chomsky_hierarchy) and you cannot calculate.
But with a context sensitive language like perl it will work.
It could be a one liner like this ;-)))
perl -n -e 'if(/^(\d\d:\d\d:\d\d)([-,\d\s\>]*)(\d\d:\d\d:\d\d)(.*)/) {print plus9($1).$2.plus9($3).$4."\n";}else{print $_} sub plus9{ ($h,$m,$s)=split(/:/,shift); $t=(($h*60+$m)*60+$s+9); $h=int($t/3600);$r=$t-($h*3600);$m=int($r/60);$s=$r-($m*60);return sprintf "%02d:%02d:%02d", $h, $m, $s;}‘ movie.srt
with move.srt like
1
00:00:53,000 --> 00:00:57,000
hello
2
00:01:28,000 --> 00:01:35,000
I like perl
3
00:02:09,000 --> 00:02:14,000
and regex
you will get
1
00:01:02,000 --> 00:01:06,000
hello
2
00:01:37,000 --> 00:01:44,000
I like perl
3
00:02:18,000 --> 00:02:23,000
and regex
You can change the +9 in the "sub plus9{...}", if you want another delta.
How does it work?
We are looking for lines that matches
dd:dd:dd something dd:dd:dd something
and then we call a sub, which add 9 seconds to the matched group one ($1) and group three ($3). All other lines are printed unchanged.
added
If you want to put the perl oneliner in a file, say plus9.pl, you can add newlines ;-)
if(/^(\d\d:\d\d:\d\d)([-,\d\s\>]*)(\d\d:\d\d:\d\d)(.*)/) {
print plus9($1).$2.plus9($3).$4."\n";
} else {
print $_
}
sub plus9{
($h,$m,$s)=split(/:/,shift);
$t=(($h*60+$m)*60+$s+9);
$h=int($t/3600);
$r=$t-($h*3600);
$m=int($r/60);
$s=$r-($m*60);
return sprintf "%02d:%02d:%02d", $h, $m, $s;
}
Regular expressions strictly do matching and cannot add/substract. You can match each datetime string using python, for example, add 9 seconds to that, and then rewrite the string in the appropriate spot. The regular expression I would use to match it would be the following:
(?<hour>\d+):(?<minute>\d+):(?<second>\d+),(?<msecond>\d+)
It has labeled capture groups so it's really easy to get each section (you won't need msecond but it's there for visualization, I guess)
Regex101

Get digit from filename immediately preceeding file extension, with other digits in filename

I'm trying to extract the last number before a file extension in a bash script. So the format varies but it'll be some combination of numbers and letters, and the last character will always be a digit. I need to pull those digits and store them in a variable.
The format is generally:
sdflkej10_sdlkei450_sdlekr_1.txt
I want to store just the final digit 1 into a variable.
I'll be using this to loop through a large number of files, and the last number will get into double and triple digits.
So for this file:
kej10_sdlkei450_sdlekr_310.txt
I'd need to return 310.
The number of alphanumeric characters and underscores varies with each file, but the number I want always is immediately before the .txt extension and immediately after an underscore.
I tried:
bname=${f%%.*}
number=$(echo $bname | tr -cd '[[:digit:]]')
but this returns all digits.
If I try
number = $(echo $(bname -2) it changes the number it returns.
The problem i'm having is mostly related to the variability, and the fact that I've been asked to do it in bash. Any help would really be appreciated.
regex='([0-9]+)\.[^.]*$'
[[ $file =~ $regex ]] && number=${BASH_REMATCH[1]}
This uses bash's underappreciated =~ regex operator which stores matches in an array named BASH_REMATCH.
You could do this using parameter substitution
var=kej10_sdlkei450_sdlekr_310.txt
var=${var%.*}
var=${var##*_}
echo $var
310
Use a Series of Bash Shell Expansions
While not the most elegant solution, this one uses a sequence of shell parameter expansions to achieve the desired result without having to define a specific extension. For example, this function uses the length and offset expansions to find the digit after removing filename extensions:
extract_digit() {
local basename=${1%%.*}
echo "${basename:$(( ${#basename} - 1 ))}"
}
Capturing Function Output
You can capture the output in a variable with something like:
$ foo=$(extract_digit sdflkej10_sdlkei450_sdlekr_1.txt)
$ echo $foo
1
Sample Output from Function
$ extract_digit sdflkej10_sdlkei450_sdlekr_1.txt
1
$ extract_digit sdflkej10_sdlkei450_sdlekr_9.txt
9
$ extract_digit sdflkej10_sdlkei450_sdlekr_10.txt
0
This should take care of your situation:
INPUT="some6random7numbers_12345_moreletters_789.txt"
SUBSTRING=`expr match "$INPUT" '.*_\([[:digit:]]*\)'`
echo $SUBSTRING
This will output 789
No need of regex here, you can utilize IFS
var="kej10_sdlkei450_sdlekr_310.txt"
v=$(IFS=[_.] read -ra arr <<< "$var" && echo "${arr[#]:(-2):1}")
echo "$v"
310

In GNU Grep or another standard bash command, is it possible to get a resultset from regex?

Consider the following:
var="text more text and yet more text"
echo $var | egrep "yet more (text)"
It should be possible to get the result of the regex as the string: text
However, I don't see any way to do this in bash with grep or its siblings at the moment.
In perl, php or similar regex engines:
$output = preg_match('/yet more (text)/', 'text more text yet more text');
$output[1] == "text";
Edit: To elaborate why I can't just multiple-regex, in the end I will have a regex with multiple of these (Pictured below) so I need to be able to get all of them. This also eliminates the option of using lookahead/lookbehind (As they are all variable length)
egrep -i "([0-9]+) +$USER +([0-9]+).+?(/tmp/Flash[0-9a-z]+) "
Example input as requested, straight from lsof (Replace $USER with "j" for this input data):
npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)
npviewer. 17875 j 17u REG 8,8 16037387 524273 /tmp/FlashXXIBH29F (deleted)
The end goal is to cp /proc/$var1/fd/$var2 ~/$var3 for every line, which ends up "Downloading" flash files (Flash used to store in /tmp but they drm'd it up)
So far I've got:
#!/bin/bash
regex="([0-9]+) +j +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+)"
echo "npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXYOvS8S (deleted)" |
sed -r -n -e " s%^.*?$regex.*?\$%\1 \2 \3%p " |
while read -a array
do
echo /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
It cuts off the first digits of the first value to return, and I'm not familiar enough with sed to see what's wrong.
End result for downloading flash 10.2+ videos (Including, perhaps, encrypted ones):
#!/bin/bash
lsof | grep "/tmp/Flash" | sed -r -n -e " s%^.+? ([0-9]+) +$USER +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+).*?\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Edit: look at my other answer for a simpler bash-only solution.
So, here the solution using sed to fetch the right groups and split them up. You later still have to use bash to read them. (And in this way it only works if the groups themselves do not contain any spaces - otherwise we had to use another divider character and patch read by setting $IFS to this value.)
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
sed -r -n -e " s%^.*$regex.*\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Note that I had to adapt your last regex group to allow uppercase letters, and added a space at the beginning to be sure to capture the whole block of numbers. Alternatively here a \b (word limit) would have worked, too.
Ah, I forget mentioning that you should pipe the text to this script, like this:
./grep-result.sh < grep-result-test.txt
(provided your files are named like this). Instead you can add a < grep-result-test after the sed call (before the |), or prepend the line with cat grep-result-test.txt |.
How does it work?
sed -r -n calls sed in extended-regexp-mode, and without printing anything automatically.
-e " s%^.*$regex.*\$%\1 \2 \3%p " gives the sed program, which consists of a single s command.
I'm using % instead of the normal / as parameter separator, since / appears inside the regex and I don't want to escape it.
The regex to search is prefixed by ^.* and suffixed by .*$ to grab the whole line (and avoid printing parts of the rest of the line).
Note that this .* grabs greedy, so we have to insert a space into our regexp to avoid it grabbing the start of the first digit group too.
The replacement text contains of the three parenthesed groups, separated by spaces.
the p flag at the end of the command says to print out the pattern space after replacement. Since we grabbed the whole line, the pattern space consists of only the replacement text.
So, the output of sed for your example input is this:
5 11 /tmp/FlashXXu8pvMg
5 17 /tmp/FlashXXIBH29F
This is much more friendly for reuse, obviously.
Now we pipe this output as input to the while loop.
read -a array reads a line from standard input (which is the output from sed, due to our pipe), splits it into words (at spaces, tabs and newlines), and puts the words into an array variable.
We could also have written read var1 var2 var3 instead (preferably using better variable names), then the first two words would be put to $var1 and $var2, with $var3 getting the rest.
If read succeeded reading a line (i.e. not end-of-file), the body of the loop is executed:
${array[0]} is expanded to the first element of the array and similarly.
When the input ends, the loop ends, too.
This isn't possible using grep or another tool called from a shell prompt/script because a child process can't modify the environment of its parent process. If you're using bash 3.0 or better, then you can use in-process regular expressions. The syntax is perl-ish (=~) and the match groups are available via $BASH_REMATCH[x], where x is the match group.
After creating my sed-solution, I also wanted to try the pure-bash approach suggested by Mark. It works quite fine, for me.
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
while read
do
if [[ $REPLY =~ $regex ]]
then
echo cp /proc/${BASH_REMATCH[1]}/fd/${BASH_REMATCH[2]} ~/${BASH_REMATCH[3]}
fi
done
(If you upvote this, you should think about also upvoting Marks answer, since it is essentially his idea.)
The same as before: pipe the text to be filtered to this script.
How does it work?
As said by Mark, the [[ ... ]] special conditional construct supports the binary operator =~, which interprets his right operand (after parameter expansion) as a extended regular expression (just as we want), and matches the left operand against this. (We have again added a space at front to avoid matching only the last digit.)
When the regex matches, the [[ ... ]] returns 0 (= true), and also puts the parts matched by the individual groups (and the whole expression) into the array variable BASH_REMATCH.
Thus, when the regex matches, we enter the then block, and execute the commands there.
Here again ${BASH_REMATCH[1]} is an array-access to an element of the array, which corresponds to the first matched group. ([0] would be the whole string.)
Another note: Both my scripts accept multi-line input and work on every line which matches. Non-matching lines are simply ignored. If you are inputting only one line, you don't need the loop, a simple if read ; then ... or even read && [[ $REPLY =~ $regex ]] && ... would be enough.
echo "$var" | pcregrep -o "(?<=yet more )text"
Well, for your simple example, you can do this:
var="text more text and yet more text"
echo $var | grep -e "yet more text" | grep -o "text"