Problem with string to float conversion of values in pandas - regex

My pandas dataframe column holds prices that are mostly in the format r'\d+\.\d+', which is what you would expect. But when I try to convert the column to float with astype, it complains that a few values are in the format \d+\.\d+\.\d+, like this one: '6041.60.1'.
How do I go about converting all of them to the format \d+\.\d+ with series.str.replace()? The expected result is '6041.60'.

I'd recommend using .apply
df1["column"] = df1["column"].apply(lambda x: "".join(x.rsplit(".", 1)))  # remove the last "."
df1["column"] = df1["column"].astype("float")

Related

Is there a way to assign a default value as int8 to a pandas dataframe column in a single line?

When assigning a default value to a new pandas dataframe column, I noticed that the type is int64. To use less memory I converted it to int8 with a second line. However, I was wondering if there was a way to do it in a single line instead of two.
# Create new column with default value 1
df['reordered'] = 1
# Convert it from int64 to int8
df['reordered'] = df['reordered'].astype('int8')
Thank you for helping a neophyte
import numpy as np
import pandas as pd

df = pd.DataFrame([1], columns=['A'])
df['B'] = np.int8(1)
df
   A  B
0  1  1
df.dtypes
A    int64
B     int8
dtype: object
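Applied back to the original column, a minimal sketch (the frame contents here are hypothetical):
import numpy as np
import pandas as pd

df = pd.DataFrame({'order_id': [10, 11, 12]})  # hypothetical frame
df['reordered'] = np.int8(1)  # the column is created as int8 directly, no second astype needed
print(df.dtypes)  # order_id: int64, reordered: int8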

Convert date format to character string

I have a column of format DATETIME23. like this:
14.02.2017 13:00:25
I want to convert it to a string so that later I can modify it; for example, the final version would look like:
2017-02-14 13:00:25.000
The problem occurs when I try to convert the date to a character format: as a result I get a string like 1802700293, which is the number of seconds.
I tried:
format date $23.0
or
date = put(date, $23.0)
P.S. This is another try:
data a;
format d date9.;
d = '12jan2016'd;
dtms = cat(day(d),'-',month(d),'-',year(d),' 00:00:00.000');
/* if a two-digit day and month are strictly required, use this kludge: */
if day(d) < 10 then dd=cat('0',put(day(d),$1.));
else dd=put(day(d),$2.);
if month(d) < 10 then mm=cat('0',put(month(d),$1.));
else mm=put(month(d),$2.);
yyyy=put(year(d),$4.);
dtms2 = cat(dd,'-',mm,'-',yyyy,' 00:00:00.000');
run;
But, strangely, the dtms2 concatenation drops the leading zero in the month element.
If your datetime is stored as a SAS datetime, just use the appropriate format:
data test ;
dt = '09feb2017:13:53:26'dt ; /* specify a datetime constant */
new_dt = put(dt,E8601DT23.3) ; /* ISO datetime format */
run ;
Output
dt new_dt
1802267606 2017-02-09T13:53:26.000
If you need to replace the 'T' with a space, simply add a translate function around the put().
For your dtms solution you can use put and the Z2. format to keep the leading zero when you concatenate:
dtms = cat(day(d),'-', put(month(d),z2.),'-',year(d),' 00:00:00.000');
You should be able to just use put(date, datetime23.) for your problem, though, instead of $23., which merely converts the number of seconds to a string of length 23. However, as a comment has mentioned, datetime23. is not the exact format from your example.

Adding comma separators to a string in a Dataframe Column with pandas

I am trying to add comma separators (to indicate thousands) to the values in a dataframe column. Can someone help with format? I do not understand how to apply this to an entire column in a dataframe.
Top15['PopEst'] = re.sub("(\d)(?=(\d{3})+(?!\d))", r"\1,", "%d" % Top15['PopEst'])
I think what you are looking for is this:
Top15["PopEst"] = Top15["PopEst"].map(lambda x: "{:,}".format(x))
The "{:,}".format() would work as a thousand separator for a single string/float/int, so you can use map() to apply it to each of the elements in column.
Top15['PopEst'] = (Top15['Energy Supply'] / Top15['Energy Supply per Capita'])
df = Top15[['PopEst']]
df = df.reset_index()
i = 0
while(len(df) > i):
    v = str(df.iloc[i]['PopEst']).split('.')
    strr = str(format(int(v[0]), ',d')) + "." + v[1]
    df.iloc[i] = df.iloc[i].replace(df.iloc[i]['PopEst'], strr)
    i = i + 1
df.set_index(['Country Name'], inplace=True)
I think this will help you.
v = format(12345678, ',d')
print(v)
# 12,345,678

How to change string to timeformat with numpy.genfromtxt

I have a CSV file; one of the columns holds timestamps, but when I use numpy.genfromtxt it reads them as strings. My goal is to create a graph with a normal time format.
this is my array:
array([('0:00:00',), ('0:00:00.001000',), ('0:00:00.002000',),
('0:00:00.081000',), ('0:00:00.095000',), ('0:00:00.195000',),
('0:00:00.294000',), ...
this is my code:
col1 = numpy.genfromtxt("mycsv.csv",usecols=(1),delimiter=',',dtype=None, names=True)
I found an answer and it is like so:
from datetime import datetime
convertfunc = lambda x: datetime.strptime(x, '%H:%M:%S.%f')
col1 = numpy.genfromtxt('00_12_4B_00_58_9B.csv', usecols=(1), delimiter=',', dtype=("object", float), names=True, converters={1: convertfunc})
But what I still need is for missing millisecond values to default to 0 instead of ending up as None.
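One way to handle that (a sketch, untested against your file; the filename and column layout are taken from the snippets above) is to fall back to a format without the fractional part when a value has no dot, which leaves the microseconds at 0:
from datetime import datetime
import numpy

def parse_time(value):
    # genfromtxt may hand the converter bytes, depending on the numpy version
    text = value.decode() if isinstance(value, bytes) else value
    # without a fractional part, strptime with %H:%M:%S leaves microseconds at 0
    fmt = '%H:%M:%S.%f' if '.' in text else '%H:%M:%S'
    return datetime.strptime(text, fmt)

col1 = numpy.genfromtxt('mycsv.csv', usecols=(1), delimiter=',', dtype=None, names=True, converters={1: parse_time})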

The result is cast to double in Pig but is still being ordered as a string

I encountered the following problem:
First my data is a string that looks like this:
decimals, decimals
example: 1.345, 3.456
I used the following Pig script to split this column, say QQ, into two columns:
result = FOREACH old_table GENERATE FLATTEN(STRSPLIT(QQ, ',')) as (COL1: double, COL2: double);
Then, I want to order it by first field, then second field.
result_ordered = ORDER result BY COL1, COL2;
However, I got the result like the following:
> 59.619198977071434 -151.4586740547339
> 60.52611316847121 -150.8005347076273
> 64.8310014577408 -147.84786488835852
> 7.059652849999997 125.59985130999996
which implies that my data is still being ordered as a string and not as a double. Has anyone encountered this issue and found a way to solve it? Thank you in advance!
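That symptom is classic lexicographic ordering; as a small illustration in Python (values copied from the output above), '7.05…' sorts last as a string because '7' > '6', but first as a number:
values = ['59.619198977071434', '60.52611316847121', '64.8310014577408', '7.059652849999997']
print(sorted(values))             # string sort: '7.059...' comes last
print(sorted(values, key=float))  # numeric sort: 7.059... comes first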
I'm not sure why STRSPLIT is returning a chararray even though you explicitly state the fields are doubles. But if you look at http://pig.apache.org/docs/r0.10.0/basic.html#arithmetic, notice that chararrays can't be multiplied by 1.0 to get doubles, but bytearrays can. Therefore you can do something like:
result = FOREACH old_table
GENERATE FLATTEN(STRSPLIT(QQ, ',')) AS (COL1: bytearray, COL2: bytearray);
B = FOREACH result GENERATE 1.0 * COL1 AS COL1, 1.0 * COL2 AS COL2 ;
result_ordered = ORDER B BY COL1, COL2;
Which gives me the correct output of:
result_ordered: {COL1: double,COL2: double}
(7.059652849999997,125.59985130999996)
(59.619198977071434,-151.4586740547339)
(60.52611316847121,-150.8005347076273)
(64.8310014577408,-147.84786488835852)
Instead of assigning the output of FLATTEN to a schema with two doubles, try actually casting the fields with (double). It may be that Pig only uses the :double schema syntax for type checking but requires an explicit cast to convert the types during execution.