How to use two regex capture groups to make two pandas columns - regex

I have a dataframe column of strings and I want to extract numbers to another column:
column
1 abc123
2 def456
3 ghi789jkl012
I've used:
dataframe["newColumn"] = dataframe["column"].str.extract("(\d*\.?\d+)", expand=True)
It works, but only captures the first block of numbers to one column. My desired output is
column newColumn newColumn2
1 abc123 123 NaN
2 def456 456 NaN
3 ghi789jkl012 789 012
but I can't figure out how to do it.

Use Series.str.extractall with Series.unstack and DataFrame.add_prefix, then join the result back to the original DataFrame with DataFrame.join:
df = dataframe.join(dataframe["column"].str.extractall("(\d*\.?\d+)")[0]
                             .unstack()
                             .add_prefix('newColumn'))
print (df)
column newColumn0 newColumn1
1 abc123 123 NaN
2 def456 456 NaN
3 ghi789jkl012 789 012
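If you want the exact headers from the question (newColumn and newColumn2) instead of newColumn0/newColumn1, you can rename the unstacked columns before joining. A minimal self-contained sketch, assuming the three-row example data above:

import pandas as pd

dataframe = pd.DataFrame({"column": ["abc123", "def456", "ghi789jkl012"]},
                         index=[1, 2, 3])

# extractall returns one row per match with a MultiIndex (row label, match number);
# unstack pivots the match number into separate columns 0, 1, ...
wide = dataframe["column"].str.extractall(r"(\d*\.?\d+)")[0].unstack()

# rename the match-number columns to the names used in the question
wide.columns = ["newColumn", "newColumn2"][:len(wide.columns)]

df = dataframe.join(wide)
print(df)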
Or you can use (\d+), thank you @Manakin:
df = dataframe.join(dataframe["column"].str.extractall("(\d+)")[0]
                             .unstack()
                             .add_prefix('newColumn'))
print (df)

You can also use split with expand=True and join the result back to df (np.NaN needs numpy imported as np):
import numpy as np

df.join(df.column.str.split('\D+', expand=True)
          .replace({None: np.NaN})
          .rename({2: 'newColumn2', 1: 'newColumn'}, axis=1)
          .iloc[:, -2:])
column newColumn newColumn2
1 abc123 123 NaN
2 def456 456 NaN
3 ghi789jkl012 789 012

Related

Remove "." from digits

I have a string like this:
"lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."
I want to convert it into:
"lmn abc 40mg 350 mg over 12 days. Standing nebs."
That is, I only want to convert a.b -> ab where a and b are integers. I am waiting for help.
Assuming you are using Python, you can use capture groups in the regex, either numbered or named, and then reference those groups in the replacement while leaving out the dot.
import re
text = "lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."
Numbered: you reference each group (the content inside the parentheses) by its index.
text = re.sub("(\d+)\.(\d+)", "\\1\\2", text)
Named: you reference each group by the name you gave it.
text = re.sub("(?P<before>\d+)\.(?P<after>\d+)", "\g<before>\g<after>", text)
Each of which returns:
print(text)
> lmn abc 40mg 350 mg over 12 days. Standing nebs.
However, you should be aware that leaving out the dot in a decimal number changes its value, so be careful with whatever you do with these numbers afterwards.
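For instance, once such a string is parsed as a number later on, the difference is large:

print(float("4.0"))   # 4.0
print(float("40"))    # 40.0 -- ten times the original value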
Using any sed in any shell on every Unix box:
$ sed 's/\([0-9]\)\.\([0-9]\)/\1\2/g' file
"lmn abc 40mg 350 mg over 12 days. Standing nebs."
Using sed
$ cat input_file
"lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs. a.b.c."
$ sed 's/\([a-z0-9]*\)\.\([a-z0-9]\)/\1\2/g' input_file
"lmn abc 40mg 350 mg over 12 days. Standing nebs. abc."
echo '1.2 1.23 12.34 1. .2' |
ruby -p -e '$_.gsub!(/\d+\K\.(?=\d+)/, "")'
Output
12 123 1234 1. .2
If performance matters:
echo '1.2 1.23 12.34 1. .2' |
ruby -p -e 'BEGIN{$regex = /\d+\K\.(?=\d+)/; $empty_string = ""}; $_.gsub!($regex, $empty_string)'

Regular expression in Oracle to filter particular characters only

I have a scenario
Case 1: "NO 41 ABC STREET"
Case 2: "42 XYZ STREET"
There are almost 100,000 rows like this in my table.
I want a regexp that omits 'NO 41' and leaves 'ABC STREET' as the output in case 1, whereas in case 2 I want '42 XYZ STREET' as the output, unchanged.
regexp_replace('NO 41 ABC STREET', 'NO [0-9]+ |([0-9]+)', '\1') outputs ABC STREET.
regexp_replace('42 XYZ STREET', 'NO [0-9]+ |([0-9]+)', '\1') outputs 42 XYZ STREET.
You have provided only two examples of your data. Assuming you only want to strip a leading "NO" followed by digits and spaces from the column values, you could use this.
SQL Fiddle
Query:
select s,REGEXP_REPLACE(s,'^NO +\d+ +') as r FROM data
Results:
| S | R |
|------------------|---------------|
| NO 41 ABC STREET | ABC STREET |
| 42 XYZ STREET | 42 XYZ STREET |
If you have more complex data to be filtered, please edit your question and describe it clearly.
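If you want to sanity-check the pattern outside the database first, here is a rough Python sketch of the same regular expression (not Oracle syntax, just the two sample values from the question):

import re

for s in ["NO 41 ABC STREET", "42 XYZ STREET"]:
    # same idea as REGEXP_REPLACE(s, '^NO +\d+ +'): strip a leading "NO <digits> "
    print(re.sub(r"^NO +\d+ +", "", s))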

counting a unique string in line

I'm trying to use "uniq -c" to count the 2nd string in each line.
My file A has around 500,000 lines and looks like this:
File_A
30-Nov 20714 GHI 235
30-Nov 10005 ABC 101
30-Nov 10355 DEF 111
30-Nov 10005 ABC 101
30-Nov 10005 ABC 101
30-Nov 10355 DEF 111
30-Nov 10005 ABC 101
30-Nov 20714 GHI 235
...
The command I used
sort -k 2 File_A | uniq -c
I find that the result I get doesn't match up with the lines.
How can I fix this problem, or is there another way to count a unique string in a line?
The result I want is similar to this (I just made up the numbers):
70 30-Nov 10005 ABC 101
5 30-Nov 10355 DEF 111
55 30-Nov 20714 GHI 235
You need to also tell uniq to consider only that field, the same way you did with sort. Perhaps you can use -f or --skip-fields for that. The problem you then have is that uniq doesn't take a "number of fields to check".
Otherwise, if you don't need to keep the original string you can just:
cut -d' ' -f2 | sort ...
Here are a couple, or three, other ways to do it. These solutions have the benefit that the file is not sorted - rather they rely on hashes (associative arrays) to keep track of unique occurrences.
Method 1:
perl -ane 'END{print scalar keys %h,"\n"}$h{$F[1]}++' File_A
The "-ane" makes Perl loop through the lines in File_A, and sets elements of the array F[] equal to the fields of each line as it goes. So your unique numbers end up in F[1]. %h is a hash. The hash element indexed by $F[1] is incremented as each line is processed. At the end, the END{} block is run, and it simply prints the number of elements in the hash.
Method 2:
perl -ane 'END{print "$u\n"}$u++ if $h{$F[1]}++==1' File_A
Similar to the method above, but this time a variable $u is incremented each time incrementing the hash results in it becoming 1 - i.e. the first time we see that number.
I am sure @mpapec or @fedorqui could do it in half the code, but you get the idea!
Method 3:
awk 'FNR==NR{a[$2]++;next}{print a[$2],$0}END{for(i in a)u++;print u}' File_A File_A
Result:
2 30-Nov 20714 GHI 235
4 30-Nov 10005 ABC 101
2 30-Nov 10355 DEF 111
4 30-Nov 10005 ABC 101
4 30-Nov 10005 ABC 101
2 30-Nov 10355 DEF 111
4 30-Nov 10005 ABC 101
2 30-Nov 20714 GHI 235
3
This uses awk and runs through your file twice - that is why it appears twice at the end of the command. On the first pass, the code in curly braces after "FNR==NR" is run and it increments the element of associative array a[] as indexed by field 2 ($2) so it is essentially counting the number of times each id in field 2 is seen. Then, on the second pass, the part in the second set of curly braces is run and it prints the total number of times the id was seen on the first pass, plus the current line. At the end, the END{} block is run and it counts the elements in associative array a[] and prints that out.
If your intention is to count the unique values in the second column, the one that has 20714, 10005, ... in it, then you need to extract it first using cut.
cut -d' ' -f 2 File_A | sort | uniq -c
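If you would rather skip the shell pipeline entirely, a Python sketch of the same count using collections.Counter (assuming the whitespace-separated layout shown above and a file named File_A):

from collections import Counter

with open("File_A") as fh:
    counts = Counter(line.split()[1] for line in fh if line.strip())

# print count and value, most frequent first
for value, count in counts.most_common():
    print(count, value)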

Print specific number of lines furthest from the current pattern match and just before matching another pattern

I have a tab-delimited file such as the one below. I want to find a specific number of minimum values in each group. A group starts at a line with E in the last column, and the records within a group are sorted on that last column. For example, I want to print the two lines (records) that are furthest from the first occurrence of E (Jack's group here) and, likewise, the two after the second occurrence of E (Gareth's group).
Jack 2 98 E
Jones 6 25 8.11
Mike 8 11 5.22
Jasmine 5 7 4
Simran 5 7 3
Gareth 1 85 E
Jones 4 76 178.32
Mark 11 12 157.3
Steve 17 8 88.5
Clarke 3 7 12.3
Vid 3 7 2.3
I want my result to be
Jasmine 5 7 4
Simran 5 7 3
Clarke 3 7 12.3
Vid 3 7 2.3
There can be a different number of records in each group. I tried grep:
grep -B 2 F$ inputfile.txt
but it repeats the lines with E and also does not work for the last records.
Quick & dirty - keep the previous two lines in variables a and b, print them whenever a line ending in E shows up, and print the final pair in the END block so the last group is covered too:
kent$ awk '/E$/&&a&&b{print b RS a;a=b="";next}{b=a;a=$0}END{print b RS a}' file
Jasmine 5 7 4
Simran 5 7 3
Clarke 3 7 12.3
Vid 3 7 2.3
Using arrays of arrays in GNU Awk version 4, you can try
gawk -vnum=2 -f e.awk input.txt
where e.awk is:
$4=="E" {
N[j++]=i
i=0
}
{
l[j][++i]=$0
}
END {
N[j]=i; ngr=j
for (i=1; i<=ngr; i++) {
m=N[i]
for (j=m-num+1; j<=m; j++)
print l[i][j]
}
}
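For comparison, a rough Python sketch of the same grouping idea (assuming the whitespace-delimited layout above, a file named input.txt, and num=2):

num = 2
groups, current = [], []

with open("input.txt") as fh:
    for line in fh:
        fields = line.split()
        if fields and fields[-1] == "E":   # a new group starts at a line ending in E
            if current:
                groups.append(current)
            current = []
        current.append(line.rstrip("\n"))
    if current:
        groups.append(current)

# print the last `num` records of each group
for block in groups:
    print("\n".join(block[-num:]))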
I don't see an F in your last column, but assuming you want to get the 2 lines above every line ending in E:
grep -B2 'E$' <(cat inputfile.txt;echo "E")|sed "/E$\|^--/d"
Should do the trick
'E$' looks for an "E" at the end of a line
the -B2 gets the 2 lines before it as well
<(cat inputfile.txt;echo "E") adds an "E" as the last line so the final records are matched too (this does not change the actual file)
sed "/E$\|^--/d" deletes all lines ending in "E" or beginning with "--" (grep's group separator)
awk '$2 ~/5|3/ && $3 ~/7/' file
Jasmine 5 7 4
Simran 5 7 3
Clarke 3 7 12.3
Vid 3 7 2.3

Comparing datasets

I have 2 datasets: one containing the columns origin_zip (numeric), destination_zip (char), and tracking_number (char), and the other containing zip.
I would like to compare these 2 datasets so I can see all the tracking numbers and destination_zips that are not in the zip column of the second dataset.
Additionally, I would like to see all of the tracking_numbers and origin_zips where origin_zip equals destination_zip.
How would I accomplish this?
origin_zip destination_zip tracking_number
12345 23456 11111
34567 45678 22222
12345 12345 33333
zip
12345
34567
23456
results_tracking_number
22222
33333
Let's start with this...I don't think this completely answers your question, but follow up with comments and I will help if I can...
data zips;
    input origin_zip $ destination_zip $ tracking_number $;
    datalines;
12345 23456 11111
34567 45678 22222
56789 12345 33333
;

data zip;
    input zip $;
    datalines;
12345
54321
34567
76543
56789
;

Proc sort data=zips;
    by origin_zip;
run;

Proc sort data=zip;
    by zip;
run;

Data contained not_contained;
    merge zip(in=a) zips(in=b rename=(origin_zip=zip));
    by zip;
    if a and b then output contained;
    if a and not b then output not_contained;
run;
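If you ever need the same comparison in pandas rather than SAS, here is a sketch using merge with indicator=True, built from the sample values in the question:

import pandas as pd

zips = pd.DataFrame({"origin_zip": ["12345", "34567", "12345"],
                     "destination_zip": ["23456", "45678", "12345"],
                     "tracking_number": ["11111", "22222", "33333"]})
zip_df = pd.DataFrame({"zip": ["12345", "34567", "23456"]})

# tracking numbers whose destination_zip does not appear in the zip table
merged = zips.merge(zip_df, how="left", left_on="destination_zip",
                    right_on="zip", indicator=True)
not_in_zip = merged.loc[merged["_merge"] == "left_only",
                        ["tracking_number", "destination_zip"]]

# tracking numbers where origin_zip equals destination_zip
same_zip = zips.loc[zips["origin_zip"] == zips["destination_zip"],
                    ["tracking_number", "origin_zip"]]

print(not_in_zip)
print(same_zip)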