How does this sed command parse numbers with commas? - regex

I'm having difficulty understanding a number-parsing sed command I saw in this article:
sed -i ':a;s/\B[0-9]\{3\}\>/,&/;ta' numbers.txt
I'm a sed newbie, so this is what I've been able to figure out:
& adds to what's already there rather than substitutes
the :a; ... ;ta runs the substitution repeatedly on the line until the search finds no more matches
Here's what I am hoping folks can explain
What does -i do? I can't seem to find it on the man pages though I'm sure it's there.
I'm a little fuzzy on what the \B is accomplishing here? Perhaps it helps with the left-right parsing priority, but I don't see how. So lastly...
Most importantly, why does this execute right to left instead of left to right? For example, which part of the command keeps this from doing something like: 1234566778,9 ---> 1234,566,778,9

Breaking this command down:
sed -i ':a;s/\B[0-9]\{3\}\>/,&/;ta' numbers.txt
-i # inline editing to save changes in input file
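\B # opposite of \b (word boundary): matches only inside a word, so the 3 digits must be preceded by another word character
[0-9] # match any digit
\{3\} # match exactly 3 digits
\> # end-of-word boundary
& # the matched text, reused in the replacement (so ,& puts a comma in front of it)
:a # define label a
ta # branch back to label a if the last substitution succeeded; the loop ends once \B[0-9]\{3\}\> no longer matches
Yes, this sed command starts matching and replacing at the rightmost 3 digits and keeps moving left for as long as it finds 3 digits preceded by another digit.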
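Update: Rather than running this inefficient sed command in a loop, I recommend this much simpler and faster awk instead. The ' flag (written here as \047) groups digits according to the current locale, so it needs a locale that defines a thousands separator: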
awk '/^[0-9]+$/{printf "%\047.f\n", $1}' file
20,130,607,215,015
607,220,701
992,171
Where input file is:
cat file
20130607215015
607220701
992171

The pattern matches three digits that are NOT preceded by a word boundary and that are followed by an end-of-word boundary, i.e. the rightmost three digits of the number. After inserting the comma, the "goto" makes it match again, but the comma has introduced a new word boundary, so the next match happens three digits further to the left.
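To watch the right-to-left behaviour, the loop can be unrolled by hand. This is only a sketch: it assumes GNU sed (\B and \> are GNU extensions), and trace_commas is a hypothetical helper name that applies the single substitution once per pass and prints each intermediate state.

```shell
# Sketch only: assumes GNU sed, since \B and \> are GNU extensions.
# trace_commas is a hypothetical helper, not part of sed.
trace_commas() {
    current=$1
    printf '%s\n' "$current"
    # Run the substitution once; stop as soon as it changes nothing,
    # which is the point where "ta" would no longer branch.
    while next=$(printf '%s\n' "$current" | sed 's/\B[0-9]\{3\}\>/,&/')
          [ "$next" != "$current" ]
    do
        current=$next
        printf '%s\n' "$current"
    done
}

trace_commas 1234567
```

For 1234567 this should print 1234567, then 1234,567, then 1,234,567: each pass can only match the three digits sitting directly in front of the end of the number or of the comma added by the previous pass.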

Related

Find and replace regular expression with alternate format

I have a file that has lines that contain text like this
something,12:3456789,somethingelse
foobar,12:345678,somethingdifferent
For lines where the second item in the line has 6 digits after the :, I would like to alter its format by adding a 0 at the front and shifting the :. For example, the above would change to:
something,12:3456789,somethingelse
foobar,01:2345678,somethingdifferent
I can't figure out how to do this using sed or any unix command line tool
You just need to match the middle section where you have 2 digits followed by : followed by exactly 6 digits. If you capture the text in individual groups appropriately you can move them around in your result. Note the \b word boundary at the end of the pattern is to ensure that we match on exactly 6 digits and don't match on lines which have the full 7 digits:
/\b(\d)(\d):(\d{6})\b/0\1:\2\3/
 |__________________| |______|
       pattern       replacement
This gives the expected output. You can experiment with it online here
sed doesn't have Perl style specifiers such as \d. Instead, you will need to use [[:digit:]]. Here is the updated regex that works with sed
sed -E 's/\b([[:digit:]])([[:digit:]]):([[:digit:]]{6})\b/0\1:\2\3/g' myfile.txt
As @Jonathan Leffler pointed out, \b doesn't work in Mac's sed, so there you will instead need to match the commas at the front and back of the field in your pattern and then put them back in the replacement.
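A minimal sketch of that comma-anchored variant, assuming the field is always surrounded by commas as in the sample lines. The function name shift_colon is made up for illustration, and [0-9] is used in place of [[:digit:]] for brevity.

```shell
# Sketch only: uses the surrounding commas as anchors instead of \b,
# so it does not rely on GNU-specific word-boundary escapes.
# shift_colon is a hypothetical helper name; it reads lines on stdin.
shift_colon() {
    sed -E 's/,([0-9])([0-9]):([0-9]{6}),/,0\1:\2\3,/'
}

printf '%s\n' \
    'something,12:3456789,somethingelse' \
    'foobar,12:345678,somethingdifferent' | shift_colon
```

The first line has seven digits after the colon, so the trailing comma in the pattern cannot match and the line is left alone. The second line becomes foobar,01:2345678,somethingdifferent.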

Extract words containing question marks

I have tens of long text files (10k - 100k record each) where some characters were lost by careless handling and got replaced with question marks. I need to build a list of corrupted words.
I'm sure the most effective approach would be to regex the file with sed or awk or some other bash tools, but I'm unable to compose regex that would do the trick.
Here are couple of sample records for processing:
?ilkin, Aleksandr, Zahhar, isa
?igadlo-?van, Maria, Karl, abikaasa, 27.10.45, Veli?anõ raj.
Desired output would be:
?ilkin
?igadlo-?van
Veli?anõ
My best result so far seems to retrieve only words from the beginning of records:
awk '$1 ~/\?/ {print $1}' test.txt
->
?ilkin,
?igadlo-?van,
I need to build a list of corrupted words
If the aim is only to search for matches, grep would be the fastest and most powerful tool:
grep -Po '(^|)([^?\s]*?\?[^\s,]*?)(?=\s|,|$)' test.txt
The output:
?ilkin
?igadlo-?van
Veli?anõ
Explanation:
-P option, enables Perl regular expressions
-o option, tells grep to print only the matched substrings
(^|) - matches the start of the string or the empty string (we can't use the word boundary anchor \b here because the question mark ? is not a word character, so \b does not match in front of a leading ?)
[^?\s]*? - lazily matches zero or more characters other than ? and whitespace \s (the part of the word before the first question mark, if any)
\?[^\s,]*? - matches a question mark ? followed, lazily, by any characters other than whitespace \s and the comma , (which can sit at the right edge of the word)
(?=\s|,|$) - positive lookahead assertion; ensures that the matched substring is followed by whitespace \s, a comma , or the end of the string
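Where grep -P is not available, a different technique can give the same list for the sample records: split each record into words first, then keep the words that contain a question mark. This is an alternative sketch, not the answer's method. It assumes words are separated only by spaces and commas, and corrupted_words is a hypothetical helper name.

```shell
# Sketch only: one word per line, then a fixed-string search for "?".
# Assumes spaces and commas are the only separators in the records.
# corrupted_words is a hypothetical helper name; it reads records on stdin.
corrupted_words() {
    tr -s ' ,' '\n' | grep -F '?'
}

printf '%s\n' \
    '?ilkin, Aleksandr, Zahhar, isa' \
    '?igadlo-?van, Maria, Karl, abikaasa, 27.10.45, Veli?anõ raj.' | corrupted_words
```

grep -F treats ? as a literal character, so no escaping is needed.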

grep for words ending in 'ing' immediately after a comma

I am trying to grep files for lines with a word ending in 'ing' immediately after a comma, of the form:
... we gave the dog a bone, showing great generosity ...
... this man, having no home ...
but not:
... this is a great place, we are having a good time ...
I would like to find instances where the 'ing' word is the first word after a comma. It seems like this should be very doable in grep, but I haven't figured out how, or found a similar example.
I have tried
grep -e ", .*ing"
which matches multiple words after the comma. Commands like
grep -i -e ", [a-z]{1,}ing"
grep -i -e ", [a-z][a-z]+ing"
don't do what I expect--they don't match phrases like my first two examples. Any help with this (or pointers to a better tool) would be much appreciated.
Try ,\s*\S+ing
Matches your first two phrases, doesn't match in your third phrase.
\s means 'any whitespace', * means 0 or more of that, \S means 'any non-whitespace' (capitalizing the letter is conventional for inverting the character set in regexes - works for \b \s \w \d), + means 'one or more' and then we match ing.
You can use the \b token to match on word boundaries (see this page).
Something like the following should work:
grep -e ".*, \b\w*ing\b"
EDIT: Except now I realised that the \b is unnecessary, and .*,\s*\w*ing would work, as Patashu pointed out. My regex-fu is rusty.
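The two attempts in the question fail because grep without -E uses basic regular expressions, in which {1,} and + are literal characters rather than quantifiers. Below is a sketch of an extended-regex version using POSIX classes. The function name first_ing_after_comma is made up for illustration, and the trailing group is an added assumption so that a word such as "kingdom" is not counted as ending in "ing".

```shell
# Sketch only: -E enables extended regular expressions, so + is a quantifier.
# first_ing_after_comma is a hypothetical helper name; it reads lines on stdin.
first_ing_after_comma() {
    # comma, optional whitespace, one word made of letters that ends in "ing"
    grep -E -i ',[[:space:]]*[a-z]+ing([^a-z]|$)'
}

printf '%s\n' \
    'we gave the dog a bone, showing great generosity' \
    'this man, having no home' \
    'this is a great place, we are having a good time' | first_ing_after_comma
```

Only the first two sample lines should be printed. In the third, the word right after the comma is "we", and [a-z]+ cannot cross the space to reach "having".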

Linux Rename command uppercase first letter

I'm writing a Bash script for cleaning up in my music.
I wanted it to format all the file names, and so with a little internet search I wrote this line:
sed -i -e 's/[-_]/ /g' -e 's/ \+/ /g' -e **'s/\<[a-z]/\U&/g'** -e "s/$artist //g" -e "s/$album //g"
Which I used to add the file names to a text file and then sed it, but then I didn't know how to apply the new names to the files.
So then I started experimenting with rename and managed to get the exact same result
except for the bolded parts, which is supposed to make every first letter in a word uppercase.
rename 's/[-_]/ /g' * && rename 's/\s+/ /g' * && **rename 's/\s\w{1}/*A-Z*/g' *** && rename 's/^\d+[[:punct:]]\s//g' * && rename "s/$artist\s//g" * && rename "s/$album\s//g" * && rename "s/($ext)//g" *
Now, the code in rename is working (satisfactorily at least), finding only one letter after a space character, but the replacement is problematic. I've tried numerous approaches, all of which leave the matched letter replaced with the literal text A-Z.
In the rename manual page it says to make lower case uppercase you do 's/a-z/A-Z/g' but it's easy to figure that it only applies when it finds a-z A-Z.
So this is what I need help with.
A bonus would be if someone knows how to do it like in the sed example, where the \< matches the beginning of each word, because at the moment, my rename command won't apply to the very first word and neither will it apply if there are multiple discs looking like "Disc name [Disc 1]" for obvious reasons.
This is sort-of a Perl question, since rename is written in Perl, and instructions for how to perform the renaming are a Perl command.
In a s/// in order for the substitution to know which letter to insert the upper-case version of, it has to ‘capture’ the letter from the input. Parentheses in the pattern do this, storing the captured letter in the variable $1. And \u in a substitution makes the next character upper-case.
So you can do:
$ rename 's/\s(\w)/ \u$1/g' *
Note that the replacement part has to insert a space before the upper-case letter, because the pattern includes a space and so both the space and the original letter are being replaced. You can avoid this by using \b, a zero-width assertion which only matches at a word boundary:
$ rename 's/\b(\w)/\u$1/g' *
Also, you don't need the {1} in there, because \w (like other symbols in regexes) matches a single character by default.
Finally, the example in rename(1) is actually y/A-Z/a-z/, using the y/// operator, not s///. y/// is a completely different operator, which replaces all occurrences of one set of letters with another; that isn't of use to you here, where it's only some characters you want making upper-case.
rename -nv 's{ (\A|\s) (\w+) }{$1\u$2}xmsg'
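This looks for the beginning of the string \A or for whitespace \s, followed by one or more word characters \w+ (letters, digits, underscore), and uppercases the first character of each such word. The -n flag makes rename only show what it would do without renaming anything, and -v makes it print the file names.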

Substitution till the end of the line in bash

I have a huge text file with lots of lines like:
asdasdasdaasdasd_DATA_3424223423423423
gsgsdgsgs_DATA_6846343636
.....
For each line, I would like to remove everything after DATA_ up to the end of the line, so that I would get:
asdasdasdaasdasd_DATA_
gsgsdgsgs_DATA_
.....
I know that you can do something similar with:
sed -e "s/^DATA_*$/DATA_/g" filename.txt
but it does not work.
Do you know how?
Thanks
You have two problems: you're unnecessarily matching beginning and end of line with ^ and $, and you're looking for _* (zero or more underscores) instead of .* (zero or more of any character). Here's what you want:
sed -e 's/_DATA_.*/_DATA_/'
The g on the end (global) won't do anything, because you're already going to remove everything from the first instance of "DATA" onward - there can't be another match.
P.S. The -e isn't strictly necessary if you only have one expression, but if you think you might tack more on, it's a convenient habit.
With regular expressions, * means the previous character, any number of times. To match any character, use .
So what you really want is .* which means any character, any number of times, like this:
sed 's/DATA_.*/DATA_/' filename.txt
Also, I removed the ^ which means start of line, since you want to match "DATA_" even if it's not in the beginning of a line.
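The difference between the two patterns is easy to check on the sample lines. A short sketch; strip_after_data is a hypothetical helper name, and the line DATA___ is an invented input added only to show what the original pattern does match.

```shell
# Sketch only: strip_after_data is a hypothetical helper; it reads lines on stdin.
strip_after_data() {
    sed 's/_DATA_.*/_DATA_/'
}

# Corrected pattern: everything after _DATA_ is dropped.
printf '%s\n' \
    'asdasdasdaasdasd_DATA_3424223423423423' \
    'gsgsdgsgs_DATA_6846343636' | strip_after_data

# Original pattern: only matches a line that is exactly "DATA" plus underscores,
# so the first line passes through unchanged and only DATA___ is rewritten.
printf '%s\n' 'gsgsdgsgs_DATA_6846343636' 'DATA___' | sed 's/^DATA_*$/DATA_/'
```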
Using awk: set the field delimiter to "_DATA_", then print field 1 ($1) followed by the delimiter. No explicit regular expression is needed.
$ awk -F"_DATA_" '{print $1"_DATA_"}' file
asdasdasdaasdasd_DATA_
gsgsdgsgs_DATA_