Delete similar lines csh - regex

I've seen several articles on deleting duplicate lines, but I need something a little more specific. Here is an example of some raw data:
11111 AA 1 date1
11111 BB 64 date1
11111 BB 64 date2
...
11111 BB 64 date64
11111 BB 64 date1
11111 BB 64 date2
...
11111 BB 64 date64
11111 BB ## date1
11111 BB ## date2
...
11111 BB ## date##
22222 AA 1 date1
22222 BB 64 date1
22222 BB 64 date2
...
22222 BB 64 date64
22222 BB 64 date1
22222 BB 64 date2
...
22222 BB 64 date64
22222 BB ## date1
22222 BB ## date2
...
22222 BB ## date##
Note: ## is some number < 64.
I need to edit that file so it looks something like this:
11111 AA 1 date1
11111 BB 64 date1
11111 BB 64 date1
11111 BB ## date1
22222 AA 1 date1
22222 BB 64 date1
22222 BB 64 date1
22222 BB ## date1
I've seen several examples of using awk, sed, or ed along with a regex to match the first part of a line. My confusion is with the repeated occurrences of "BB 64" and "BB ##": I don't want to delete all BB lines, just every BB line after the first in each series.
Vital info: this runs as a csh script on Solaris 5.8.
The AA lines are not important in this question except to know they are there (we are not doing anything with them).
Here's essentially what I've got so far (I'm still having syntax issues from adapting examples written for other tools, so please correct it if you can):
sed 'N;(\d{1,8}\sBB\s\d{1,2}.+\n);P;D' filename
If I weren't getting syntax errors, I'm sure this would delete all BB lines but the first "BB 64 date1". My sed regex above is modeled on uniq, but it matches only the first part of the line instead of the entire line, because I need the first date of each BB series (if there is more than one series of BB 64 for a given 11111, 22222, etc., the output should contain one identical BB 64 line, with date1, for each series). Any ideas?

Seems like sort -k4,4 | uniq would do the trick? (or sort +3 if the Solaris version is sufficiently old.)
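Note that sort | uniq reorders the file and would also merge the two identical "11111 BB 64 date1" lines. If, as in the sample, every series literally restarts at date1 (an assumption, not something the question guarantees), an order-preserving awk sketch is to keep only the date1 lines:

```shell
# Sample slice of the raw data; each series restarts at date1.
cat > sample.txt <<'EOF'
11111 AA 1 date1
11111 BB 64 date1
11111 BB 64 date2
11111 BB 64 date64
11111 BB 64 date1
11111 BB 64 date2
22222 AA 1 date1
22222 BB 64 date1
22222 BB 64 date2
EOF

# Keep the first line of every series: assumes the fourth field is
# literally "date1" exactly once per series.
awk '$4 == "date1"' sample.txt
```

This keeps one BB 64 line per series (two for 11111 above), matching the desired output; it breaks if a series can start at any date other than date1.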


How to use two regex capture groups to make two pandas columns

I have a dataframe column of strings and I want to extract numbers to another column:
column
1 abc123
2 def456
3 ghi789jkl012
I've used:
dataframe["newColumn"] = dataframe["column"].str.extract("(\d*\.?\d+)", expand=True)
It works, but only captures the first block of numbers to one column. My desired output is
column newColumn newColumn2
1 abc123 123 NaN
2 def456 456 NaN
3 ghi789jkl012 789 012
but can't figure out how to do it
Use Series.str.extractall with Series.unstack and DataFrame.add_prefix, then attach the result to the original DataFrame with DataFrame.join:
df = dataframe.join(dataframe["column"].str.extractall("(\d*\.?\d+)")[0]
.unstack()
.add_prefix('newColumn'))
print (df)
column newColumn0 newColumn1
1 abc123 123 NaN
2 def456 456 NaN
3 ghi789jkl012 789 012
Or you can use (\d+), thank you @Manakin:
df = dataframe.join(dataframe["column"].str.extractall("(\d+)")[0]
.unstack()
.add_prefix('newColumn'))
print (df)
Can also use split, expand=True and join back to df.
df.join(df.column.str.split('\D+', expand=True)
.replace({None: np.NaN})
.rename({2: 'newColumn2', 1: 'newColumn'}, axis=1)
.iloc[:, -2:])
column newColumn newColumn2
1 abc123 123 NaN
2 def456 456 NaN
3 ghi789jkl012 789 012

Right Align Columns in Text File with Sed

I have a file containing a lot of information that I want to get into a specific format, i.e. add a specific number of spaces between the different columns. I can add the same amount of space to every line, but some of the columns need to be right aligned, meaning that I might need to add more spaces on some lines. I have no idea how to do this, and awk doesn't seem to work since I have more than two lines to modify.
Here's an example:
I have managed to get a file looking something like this
apple 1 33.413 C cat 10
banana 2 21.564 B horse 356
cherry 3 43.223 D cow 32
pear 4 26.432 A goat 22
raspberry 5 72.639 C eagle 4
watermelon 6 54.436 A fox 976
pumpkin 7 42.654 B mouse 1
peanut 8 36.451 B dog 56
orange 9 57.333 C elephant 32
coconut 10 10.445 A frog 3
blueberry 11 46.435 B camel 446
But I want to get the file in this format:
apple       1 33.413 C      cat 10
banana      2 21.564 B    horse 356
cherry      3 43.223 D      cow 32
pear        4 26.432 A     goat 22
raspberry   5 72.639 C    eagle 4
watermelon  6 54.436 A      fox 976
pumpkin     7 42.654 B    mouse 1
peanut      8 36.451 B      dog 56
orange      9 57.333 C elephant 32
coconut    10 10.445 A     frog 3
blueberry  11 46.435 B    camel 446
What bash command can I use to right align the second and fifth columns?
You can use printf with width as you want like this:
awk '{printf "%-15s%3d%10s%2s%15s %-5d\n", $1, $2, $3, $4, $5, $6}' file
apple            1    33.413 C            cat 10
banana           2    21.564 B          horse 356
cherry           3    43.223 D            cow 32
pear             4    26.432 A           goat 22
raspberry        5    72.639 C          eagle 4
watermelon       6    54.436 A            fox 976
pumpkin          7    42.654 B          mouse 1
peanut           8    36.451 B            dog 56
orange           9    57.333 C       elephant 32
coconut         10    10.445 A           frog 3
blueberry       11    46.435 B          camel 446
Feel free to adjust widths to tweak the output.
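If you would rather not hard-code the widths, a two-pass sketch can measure them first. This is an assumption-laden variant, not part of the original answer: it reads the file twice and relies on awk's printf accepting * for dynamic field widths (true of gawk and mawk, though not guaranteed on very old awks):

```shell
cat > file <<'EOF'
apple 1 33.413 C cat 10
banana 2 21.564 B horse 356
cherry 3 43.223 D cow 32
pear 4 26.432 A goat 22
raspberry 5 72.639 C eagle 4
watermelon 6 54.436 A fox 976
pumpkin 7 42.654 B mouse 1
peanut 8 36.451 B dog 56
orange 9 57.333 C elephant 32
coconut 10 10.445 A frog 3
blueberry 11 46.435 B camel 446
EOF

# Pass 1 (NR==FNR) records the widest entry per column; pass 2 prints
# with columns 2 and 5 right-aligned and the rest left-aligned.
awk 'NR == FNR { for (i = 1; i <= NF; i++) if (length($i) > w[i]) w[i] = length($i); next }
     { printf "%-*s %*s %-*s %-*s %*s %s\n", w[1], $1, w[2], $2, w[3], $3, w[4], $4, w[5], $5, $6 }' file file
```

Columns 2 and 5 use %*s (right-aligned), the others %-*s (left-aligned); the last column needs no padding.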

Why is grep showing lines that don't match?

I am trying to print out all lines with at least one character that is NOT numeric.
My grep code looks like this: grep '[^[:digit:]]' GTEST
Where GTEST is this:
TEST
55 55 Pink
123
sss
aaa
ss aaa ss
a 1 b 2 a b a
Doop Dap
12 13
77a
59360
And the output is exactly what is in GTEST, except with the matching parts of the lines (i.e. all of the alphabetic characters) shown in red. Instead of highlighting the matching characters in red, I only want to print out the lines that contain matching characters.
I've been looking through the grep flags (-o, -w, etc.), but none of them seem to do it for me.
Am I missing something?
EDITED:
Expected output would be:
TEST
55 55 Pink
sss
aaa
ss aaa ss
a 1 b 2 a b a
Doop Dap
77a
From your data, I get this output:
grep '[^[:digit:]]' file
TEST
55 55 Pink
sss
aaa
ss aaa ss
a 1 b 2 a b a
Doop Dap
12 13
77a
You get the 12 13 line because the space between 12 and 13 is a non-digit character. Lines with a space before or after the digits, like 123<space>, will match for the same reason.
To overcome this, exclude the space as well:
grep '[^[:digit:] ]' file
TEST
55 55 Pink
sss
aaa
ss aaa ss
a 1 b 2 a b a
Doop Dap
77a
Or even better:
grep '[^[:digit:][:blank:]]' file
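The same filter can also be phrased as an inverted match, which some find easier to read: instead of requiring at least one character that is neither a digit nor a blank, reject lines made up entirely of digits and blanks. A small sketch on a subset of the data:

```shell
printf '%s\n' 'TEST' '55 55 Pink' '123' 'sss' '12 13' '77a' '59360' > GTEST

# -v inverts the match: drop lines consisting only of digits/blanks,
# i.e. 123, 12 13 and 59360, keeping everything else.
grep -v '^[[:digit:][:blank:]]*$' GTEST
```

Both forms also discard empty lines, since an empty line consists (vacuously) of digits and blanks only.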

Comparing datasets

I have 2 datasets: one containing the columns origin_zip (numeric), destination_zip (character), and tracking_number (character), and the other containing zip.
I would like to compare these 2 datasets so I can see all the tracking numbers and destination_zips that are not in the zip column of the second dataset.
Additionally I would like to see all of the tracking_numbers and origin_zips where the origin_zips = the destination_zips.
How would I accomplish this?
origin_zip destination_zip tracking_number
12345 23456 11111
34567 45678 22222
12345 12345 33333
zip
12345
34567
23456
results_tracking_number
22222
33333
Let's start with this... I don't think it completely answers your question, but follow up with comments and I will help if I can...
data zips;
    input origin_zip $ destination_zip $ tracking_number $;
    datalines;
12345 23456 11111
34567 45678 22222
56789 12345 33333
;
data zip;
    input zip $;
    datalines;
12345
54321
34567
76543
56789
;
proc sort data=zips;
    by origin_zip;
run;
proc sort data=zip;
    by zip;
run;
data contained not_contained;
    merge zip(in=a) zips(in=b rename=(origin_zip=zip));
    by zip;
    if a and b then output contained;
    if a and not b then output not_contained;
run;
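For comparison outside SAS, the same two checks can be sketched as one awk pass over headerless text exports (zip.txt and zips.txt are hypothetical file names): load the zip list into an array, then print tracking_number whenever destination_zip is missing from the list or origin_zip equals destination_zip.

```shell
printf '%s\n' '12345' '34567' '23456' > zip.txt
printf '%s\n' '12345 23456 11111' '34567 45678 22222' '12345 12345 33333' > zips.txt

# First file: remember every zip. Second file: $1=origin_zip,
# $2=destination_zip, $3=tracking_number; print $3 for unmatched
# destinations or origin == destination.
awk 'NR == FNR { zip[$1]; next }
     !($2 in zip) || $1 == $2 { print $3 }' zip.txt zips.txt
```

On the question's sample this prints 22222 and 33333, matching the results_tracking_number list.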

Replace first two whitespace occurrences with a comma using sed

I have a whitespace delimited file with a variable number of entries on each line. I want to replace the first two whitespaces with commas to create a comma delimited file with three columns.
Here's my input:
a b 1 2 3 3 2 1
c d 44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z y 2 3 33
And here's my desired output:
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33
I'm trying to use perl regular expressions in a sed command but I can't quite get it to work. First I try capturing a word, followed by a space, then another word, but that only works for lines 1, 2, and 5:
$ cat test | sed -r 's/(\w)\s+(\w)\s+/\1,\2,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z,y,2 3 33
I also try capturing whitespace, a word, and then more whitespace, but that gives me the same result:
$ cat test | sed -r 's/\s+(\w)\s+/,\1,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z,y,2 3 33
I also try doing this with the .? wildcard, but that does something funny to line 4.
$ cat test | sed -r 's/\s+(.?)\s+/,\1,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh,,77 88 99
z,y,2 3 33
Any help is much appreciated!
How about this:
sed -e 's/\s\+/,/' | sed -e 's/\s\+/,/'
It's probably possible with a single sed command, but this sure is an easy way :)
My output:
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33
Try this:
sed -r 's/\s+(\S+)\s+/,\1,/'
Just replaced \w (one "word" char) with \S+ (one or more non-space chars) in one of your attempts.
You can provide multiple commands to a single instance of sed by just providing multiple -e arguments.
To do the first two, just use:
sed -e 's/\s\+/,/' -e 's/\s\+/,/'
This basically runs both commands on the line in sequence, the first doing the first block of whitespace, the second doing the next.
The following transcript shows this in action:
pax$ echo 'a b 1 2 3 3 2 1
c d 44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z y 2 3 33
' | sed -e 's/\s\+/,/' -e 's/\s\+/,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33
Sed's s/// supports a way to say which occurrence of a pattern to replace: just add n to the end of the command to replace only the nth occurrence. So, to replace the first and second occurrences of whitespace, just use it this way:
$ sed 's/ */,/1;s/ */,/2' input
a,b ,1 2 3 3 2 1
c,d ,44 55 66 2355
line,http://google.com 100,200 300
ef,jh ,77 88 99
z,y 2,3 33
EDIT: reading the other proposed solutions, I noticed that the 1 and 2 after s/ */,/ are not only unnecessary but plainly wrong. By default, s/// replaces just the first occurrence of the pattern. So, if we have two identical s/// commands in sequence, they will replace the first and the second occurrences. What you need is just
$ sed 's/ */,/;s/ */,/' input
(Note that you can put two sed commands in one expression if you separate them by a semicolon. Some sed implementations do not accept the semicolon after the s/// command; use a newline to separate the commands, in this case.)
A Perl solution is:
perl -pe '$_=join ",", split /\s+/, $_, 3' some.file
Not sure about sed/perl, but here's an (ugly) awk solution. It just prints fields 1-2, separated by commas, then the remaining fields separated by space:
awk '{
printf("%s,", $1)
printf("%s,", $2)
for (i=3; i<=NF; i++)
printf("%s ", $i)
printf("\n")
}' myfile.txt
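One more awk variant, offered as a sketch: awk's sub() replaces only the first match in the line, so calling it twice mirrors the two-command sed answers without looping over fields.

```shell
printf '%s\n' 'a b 1 2 3 3 2 1' 'line http://google.com 100 200 300' > infile

# Each sub() consumes one run of whitespace, so two calls turn the
# first two runs into commas; the trailing 1 is awk shorthand for print.
awk '{ sub(/[[:space:]]+/, ","); sub(/[[:space:]]+/, ",") } 1' infile
```

Like the sed versions, this leaves all remaining whitespace in the line untouched.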