I have a dataframe column of strings and I want to extract numbers to another column:
column
1 abc123
2 def456
3 ghi789jkl012
I've used:
dataframe["newColumn"] = dataframe["column"].str.extract("(\d*\.?\d+)", expand=True)
It works, but only captures the first block of numbers to one column. My desired output is
column newColumn newColumn2
1 abc123 123 NaN
2 def456 456 NaN
3 ghi789jkl012 789 012
but can't figure out how to do it
Use Series.str.extractall to get every match, reshape the result with Series.unstack, rename the columns with DataFrame.add_prefix, and finally attach them to the original DataFrame with DataFrame.join:
df = dataframe.join(dataframe["column"].str.extractall(r"(\d*\.?\d+)")[0]
                                       .unstack()
                                       .add_prefix('newColumn'))
print(df)
column newColumn0 newColumn1
1 abc123 123 NaN
2 def456 456 NaN
3 ghi789jkl012 789 012
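To see why collecting every match gives one column per match position, here is a minimal plain-Python sketch of the same idea using re.findall. It does not use pandas; the names values, matches, and rows are made up for illustration, and missing positions are padded with None where pandas would show NaN.

```python
import re

values = ["abc123", "def456", "ghi789jkl012"]

# Find every run of digits in each string, not just the first one.
matches = [re.findall(r"\d+", v) for v in values]

# Pad shorter rows with None so every row has one slot per match position.
width = max(len(m) for m in matches)
rows = [m + [None] * (width - len(m)) for m in matches]

for value, row in zip(values, rows):
    print(value, row)
```

Each padded row corresponds to one row of the unstacked result: the first slot becomes newColumn0, the second newColumn1.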
Or you can use (\d+), thanks to @Manakin:
df = dataframe.join(dataframe["column"].str.extractall(r"(\d+)")[0]
                                       .unstack()
                                       .add_prefix('newColumn'))
print(df)
You can also use str.split with expand=True and join the result back to df (this needs import numpy as np):
df.join(df.column.str.split(r'\D+', expand=True).replace({None: np.nan}).rename({2:'newColumn2',1:'newColumn'},axis=1).iloc[:,-2:])
column newColumn newColumn2
1 abc123 123 NaN
2 def456 456 NaN
3 ghi789jkl012 789 012
I have a file containing a lot of information that I want to put into a specific format, i.e. with a specific number of spaces between the columns. I can add the same number of spaces to every line, but some of the columns need to be right-aligned, so some lines need more spaces than others. I have no idea how to do this, and awk doesn't seem to work since I have more than two lines to modify.
Here's an example:
I have managed to get a file looking something like this
apple 1 33.413 C cat 10
banana 2 21.564 B horse 356
cherry 3 43.223 D cow 32
pear 4 26.432 A goat 22
raspberry 5 72.639 C eagle 4
watermelon 6 54.436 A fox 976
pumpkin 7 42.654 B mouse 1
peanut 8 36.451 B dog 56
orange 9 57.333 C elephant 32
coconut 10 10.445 A frog 3
blueberry 11 46.435 B camel 446
But I want to get the file in this format
apple 1 33.413 C cat 10
banana 2 21.564 B horse 356
cherry 3 43.223 D cow 32
pear 4 26.432 A goat 22
raspberry 5 72.639 C eagle 4
watermelon 6 54.436 A fox 976
pumpkin 7 42.654 B mouse 1
peanut 8 36.451 B dog 56
orange 9 57.333 C elephant 32
coconut 10 10.445 A frog 3
blueberry 11 46.435 B camel 446
What bash command can I use to right align the second and fifth columns?
You can use printf with an explicit width for each field, like this (a - flag left-aligns the field; without it the field is right-aligned):
awk '{printf "%-15s%3d%10s%2s%15s %-5d\n", $1, $2, $3, $4, $5, $6}' file
apple            1    33.413 C            cat 10
banana           2    21.564 B          horse 356
cherry           3    43.223 D            cow 32
pear             4    26.432 A           goat 22
raspberry        5    72.639 C          eagle 4
watermelon       6    54.436 A            fox 976
pumpkin          7    42.654 B          mouse 1
peanut           8    36.451 B            dog 56
orange           9    57.333 C       elephant 32
coconut         10    10.445 A           frog 3
blueberry       11    46.435 B          camel 446
Feel free to adjust widths to tweak the output.
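The same width and alignment rules exist in most formatting mini-languages. As a rough sketch, assuming six whitespace-separated fields per line, the awk format string maps onto Python's format spec as follows. The function name format_row is hypothetical.

```python
def format_row(fields):
    """Format six fields with fixed widths, mirroring the awk printf call."""
    name, idx, value, grade, animal, count = fields
    # '<' left-aligns and '>' right-aligns; the number is the minimum width.
    return (f"{name:<15}{int(idx):>3}{value:>10}{grade:>2}"
            f"{animal:>15} {int(count):<5}")

line = "raspberry 5 72.639 C eagle 4"
print(format_row(line.split()))
```

Note that the last field is left-aligned in a width of 5, so each output line carries trailing spaces, just as with %-5d in awk.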
I am trying to print out all lines with at least one character that is NOT numeric.
My grep code looks like this: grep '[^[:digit:]]' GTEST
Where GTEST is this:
TEST
55 55 Pink
123
sss
aaa
ss aaa ss
a 1 b 2 a b a
Doop Dap
12 13
77a
59360
And the output is exactly what is in GTEST, except with the matching parts of lines (AKA all of the alpha characters) in red. Instead of displaying the matching characters in red, I /only/ want to print out the lines that contain matching characters.
I've been looking through the grep options (-o, -w, etc.), but none of them seem to do it for me.
Am I missing something?
EDITED:
Expected output would be:
TEST
55 55 Pink
sss
aaa
ss aaa ss
a 1 b 2 a b a
Doop Dap
77a
From your data, I get this output:
grep '[^[:digit:]]' file
TEST
55 55 Pink
sss
aaa
ss aaa ss
a 1 b 2 a b a
Doop Dap
12 13
77a
You get 12 13 because the space between 12 and 13 is a non-digit character.
The pattern also matches lines with a space before or after the digits, such as 123<space>.
To avoid this, add a space to the bracket expression:
grep '[^[:digit:] ]' file
TEST
55 55 Pink
sss
aaa
ss aaa ss
a 1 b 2 a b a
Doop Dap
77a
Or even better, use [:blank:], which covers tabs as well as spaces:
grep '[^[:digit:][:blank:]]' file
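For comparison, the same filter can be sketched in Python. This assumes ASCII digits and treats "blank" as space or tab, which is roughly what the bracket expression means in the C locale. The names NON_DIGIT_NON_BLANK and lines_with_non_digits are illustrative.

```python
import re

# Matches any character that is neither an ASCII digit nor a space/tab.
NON_DIGIT_NON_BLANK = re.compile(r"[^0-9 \t]")

def lines_with_non_digits(lines):
    """Return only the lines that contain at least one matching character."""
    return [line for line in lines if NON_DIGIT_NON_BLANK.search(line)]

gtest = ["TEST", "55 55 Pink", "123", "sss", "aaa", "ss aaa ss",
         "a 1 b 2 a b a", "Doop Dap", "12 13", "77a", "59360"]
print("\n".join(lines_with_non_digits(gtest)))
```

Lines such as 12 13 drop out because every character in them is either a digit or a blank.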
I have two datasets: one containing the columns origin_zip (number), destination_zip (char), and tracking_number (char), and the other containing zip.
I would like to compare these 2 datasets so I can see all the tracking numbers and destination_zips that are not in the zip column of the second dataset.
Additionally I would like to see all of the tracking_numbers and origin_zips where the origin_zips = the destination_zips.
How would I accomplish this?
origin_zip destination_zip tracking_number
12345 23456 11111
34567 45678 22222
12345 12345 33333
zip
12345
34567
23456
results_tracking_number
22222
33333
Let's start with this...I don't think this completely answers your question, but follow up with comments and I will help if I can...
data zips;
input origin_zip $ destination_zip $ tracking_number $;
datalines;
12345 23456 11111
34567 45678 22222
56789 12345 33333
;
data zip;
input zip $;
datalines;
12345
54321
34567
76543
56789
;
Proc sort data=zips;
by origin_zip;
run;
Proc sort data=zip;
by zip;
run;
Data contained not_contained;
merge zip(in=a) zips(in=b rename=(origin_zip=zip));
by zip;
if a and b then output contained;
if a and not b then output not_contained;
run;
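To make the two conditions from the question concrete, here is a small sketch in plain Python rather than SAS. It assumes all zip codes are stored as strings (the question mixes numeric and character columns) and uses the question's sample data. The names shipments, known_zips, and flag_shipments are hypothetical.

```python
# Sample rows from the question: (origin_zip, destination_zip, tracking_number).
shipments = [
    ("12345", "23456", "11111"),
    ("34567", "45678", "22222"),
    ("12345", "12345", "33333"),
]
known_zips = {"12345", "34567", "23456"}

def flag_shipments(rows, zips):
    """Return tracking numbers whose destination is unknown or equals the origin."""
    flagged = []
    for origin, destination, tracking in rows:
        if destination not in zips or origin == destination:
            flagged.append(tracking)
    return flagged

print(flag_shipments(shipments, known_zips))
```

With the sample data this flags 22222 (destination not in the zip list) and 33333 (origin equals destination), matching results_tracking_number in the question.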
I have a whitespace delimited file with a variable number of entries on each line. I want to replace the first two whitespaces with commas to create a comma delimited file with three columns.
Here's my input:
a b 1 2 3 3 2 1
c d 44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z y 2 3 33
And here's my desired output:
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33
I'm trying to use perl regular expressions in a sed command but I can't quite get it to work. First I try capturing a word, followed by a space, then another word, but that only works for lines 1, 2, and 5:
$ cat test | sed -r 's/(\w)\s+(\w)\s+/\1,\2,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z,y,2 3 33
I also try capturing whitespace, a word, and then more whitespace, but that gives me the same result:
$ cat test | sed -r 's/\s+(\w)\s+/,\1,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z,y,2 3 33
I also try doing this with the .? wildcard, but that does something funny to line 4.
$ cat test | sed -r 's/\s+(.?)\s+/,\1,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line http://google.com 100 200 300
ef jh,,77 88 99
z,y,2 3 33
Any help is much appreciated!
How about this:
sed -e 's/\s\+/,/' | sed -e 's/\s\+/,/'
It's probably possible with a single sed command, but this is sure an easy way :)
My output:
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33
Try this:
sed -r 's/\s+(\S+)\s+/,\1,/'
Just replaced \w (one "word" char) with \S+ (one or more non-space chars) in one of your attempts.
You can provide multiple commands to a single instance of sed by just providing multiple -e arguments.
To do the first two, just use:
sed -e 's/\s\+/,/' -e 's/\s\+/,/'
This basically runs both commands on the line in sequence, the first doing the first block of whitespace, the second doing the next.
The following transcript shows this in action:
pax$ echo 'a b 1 2 3 3 2 1
c d 44 55 66 2355
line http://google.com 100 200 300
ef jh 77 88 99
z y 2 3 33
' | sed -e 's/\s\+/,/' -e 's/\s\+/,/'
a,b,1 2 3 3 2 1
c,d,44 55 66 2355
line,http://google.com,100 200 300
ef,jh,77 88 99
z,y,2 3 33
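The "replace the first run, then the next run" idea can also be written as a single substitution with a limit. A minimal Python sketch, with the hypothetical helper name first_two_to_commas:

```python
import re

def first_two_to_commas(line):
    """Replace the first two runs of whitespace with commas."""
    # count=2 stops after two substitutions, like running s/\s\+/,/ twice.
    return re.sub(r"\s+", ",", line, count=2)

for line in ["a b 1 2 3 3 2 1", "line http://google.com 100 200 300"]:
    print(first_two_to_commas(line))
```

As with the sed version, the remaining whitespace after the second comma is left untouched.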
Sed s/// supports a way to say which occurrence of a pattern to replace: just add the n to the end of the command to replace only the nth occurrence. So, to replace the first and second occurrences of whitespace, just use it this way:
$ sed 's/  */,/1;s/  */,/2' input
a,b ,1 2 3 3 2 1
c,d ,44 55 66 2355
line,http://google.com 100,200 300
ef,jh ,77 88 99
z,y 2,3 33
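EDIT: after reading the other proposed solutions, I noticed that the 1 and 2 after the s/// commands are not only unnecessary but plainly wrong. By default, s/// replaces only the first occurrence of the pattern. Once the first command has replaced the first run of spaces, the original second run becomes the first remaining one, so two identical s/// commands in sequence replace the first and the second occurrence. What you need is just the following (the pattern is a space followed by " *", that is, one or more spaces):
$ sed 's/  */,/;s/  */,/' input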
(Note that you can put two sed commands in one expression if you separate them by a semicolon. Some sed implementations do not accept the semicolon after the s/// command; use a newline to separate the commands, in this case.)
A Perl solution is:
perl -pe '$_=join ",", split /\s+/, $_, 3' some.file
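The split-with-a-limit approach carries over to other languages. As a sketch, Perl's limit of 3 fields corresponds to maxsplit=2 in Python. The function name split_three is made up for this example.

```python
def split_three(line):
    """Split into at most three fields and join them with commas."""
    # maxsplit=2 yields up to three pieces; the third keeps its inner spaces.
    return ",".join(line.rstrip("\n").split(None, 2))

print(split_three("ef jh 77 88 99\n"))
```

Because the split stops after two separators, everything from the third field onward stays intact as one string.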
Not sure about sed/perl, but here's an (ugly) awk solution. It prints fields 1 and 2, each followed by a comma, then the remaining fields separated by single spaces:
awk '{
    printf("%s,%s,", $1, $2)
    for (i = 3; i <= NF; i++)
        printf("%s%s", $i, (i < NF ? " " : ""))   # no trailing space after the last field
    printf("\n")
}' myfile.txt