How to change string to timeformat with numpy.genfromtxt - python-2.7

I have a CSV file in which one column holds timestamps, but when I use numpy.genfromtxt the values come back as strings. My goal is to plot a graph with a proper time axis.
This is my array:
array([('0:00:00',), ('0:00:00.001000',), ('0:00:00.002000',),
('0:00:00.081000',), ('0:00:00.095000',), ('0:00:00.195000',),
('0:00:00.294000',), ...
This is my code:
col1 = numpy.genfromtxt("mycsv.csv",usecols=(1),delimiter=',',dtype=None, names=True)

I found an answer and it is like so:
convertfunc = lambda x: datetime.strptime(x, '%H:%M:%S.%f')
col1 = numpy.genfromtxt('00_12_4B_00_58_9B.csv',usecols=(1),delimiter=',',dtype=("object",float), names=True, converters={1: convertfunc})
What I still need is for the milliseconds to be set to 0 when they are missing, rather than getting None.
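One way to handle values without a fractional part, such as '0:00:00', is to choose the format string per value. The sketch below is only an assumption about what such a converter could look like: parse_time is a hypothetical name, and the bytes check is there because genfromtxt may hand converters byte strings depending on the NumPy and Python versions.

```python
from datetime import datetime


def parse_time(value):
    """Parse 'H:MM:SS' or 'H:MM:SS.ffffff'; a missing fraction becomes 0."""
    # genfromtxt may pass bytes rather than str (assumption, version dependent)
    if isinstance(value, bytes):
        value = value.decode("ascii")
    value = value.strip()
    # Pick the format that matches the presence of a fractional part
    fmt = "%H:%M:%S.%f" if "." in value else "%H:%M:%S"
    return datetime.strptime(value, fmt)
```

It could then be passed as converters={1: parse_time} in the genfromtxt call. Because strptime fills in the microsecond field with 0 when the format has no %f, no None values are produced.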

Related

Problem with string to float conversion of values in pandas

The prices in one column of my pandas DataFrame are mostly in the format r'\d+\.\d+', which is what you would expect. But when I try to convert the column with astype(float), it fails because a few values are in the format \d+\.\d+\.\d+, like '6041.60.1'.
How do I convert all of them to the format \d+\.\d+ with series.str.replace()? The expected result is '6041.60'.
I'd recommend using .apply:
df1["column"] = df1["column"].apply(lambda x: ".".join(x.split(".")[:2]))  # keep only the first two dot-separated parts
df1["column"] = df1["column"].astype("float")
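Since the question asks about a regex replacement, the same clean-up can be sketched with the standard re module. This is an illustration under assumptions: normalize_price is a hypothetical helper, and every value is assumed to be a string. The same pattern and replacement should also work with series.str.replace when regex=True is passed.

```python
import re

# A valid price followed by one or more extra ".digits" groups
_EXTRA_GROUPS = re.compile(r"^(\d+\.\d+)(?:\.\d+)+$")


def normalize_price(text):
    """Drop extra dot-separated groups; well-formed prices are unchanged."""
    return _EXTRA_GROUPS.sub(r"\1", text)
```

Values that already match \d+\.\d+ do not match the pattern, so they pass through untouched and can be converted to float afterwards.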

Convert date format to character string

I have a column of format DATETIME23. like this:
14.02.2017 13:00:25
I want to convert it to a string so that I can modify it later; for example, the final version would look like:
2017-02-14 13:00:25.000
The problem occurs when I try to convert the date to a character format: the result is a string like 1802700293, which is the number of seconds.
I tried:
format date $23.0
or
date = put(date, $23.0)
P.S. This is another try:
data a;
format d date9.;
d = '12jan2016'd;
dtms = cat(day(d),'-',month(d),'-',year(d),' 00:00:00.000');
/* if two-digit day and month are strictly required, then this kludge: */
if day(d) < 10 then dd=cat('0',put(day(d),$1.));
else ddday=put(day(d),$2.);
if month(d) < 10 then mm=cat('0',put(month(d),$1.));
else mm=put(month(d),$2.);
yyyy=put(year(d),$4.);
/*dtms2 = cat(dd,'-',mm,'-',yyyy,' 00:00:00.000');*/
dtms2 = cat(dd,'-',mm,'-',yyyy,' 00:00:00.000');
dtms = cat(day(d),'-',month(d),'-',year(d),' 00:00:00.000');
run;
BUT, strangely, the dtms2 concatenation drops the zero in the month element.
If your datetime is stored as a SAS datetime, just use the appropriate format :
data test ;
dt = '09feb2017:13:53:26'dt ; /* specify a datetime constant */
new_dt = put(dt,E8601DT23.3) ; /* ISO datetime format */
run ;
Output
dt new_dt
1802267606 2017-02-09T13:53:26.000
If you need to replace the 'T' with a space, simply add a translate function around the put().
For your dtms solution you can use put and the Z2. format to keep the leading zero when you concatenate:
dtms = cat(day(d),'-', put(month(d),z2.),'-',year(d),' 00:00:00.000');
You should be able to just use put(date, datetime23.) for your problem instead of $23., which only converts the number of seconds to a string of length 23. However, as a comment has mentioned, datetime23. is not the format from your example.

Adding comma separators to a string in a Dataframe Column with pandas

I am trying to add comma separators to mark the thousands in the values of a DataFrame column. Can someone help with the format? I do not understand how to apply this to an entire column of a DataFrame.
Top15['PopEst'] = re.sub("(\d)(?=(\d{3})+(?!\d))", r"\1,", "%d" % Top15['PopEst'])
I think what you are looking for is this:
Top15["PopEst"] = Top15["PopEst"].map(lambda x: "{:,}".format(x))
The "{:,}".format() call adds thousands separators to a single int or float (it raises an error for a str), so you can use map() to apply it to each element of the column.
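If the column actually holds numeric strings, as the question's wording suggests, the values need converting before formatting. A minimal sketch, assuming the strings are plain integers or decimals; with_thousands is a hypothetical name:

```python
def with_thousands(value):
    """Format a number, or a plain numeric string, with comma separators."""
    if isinstance(value, str):
        # Assumption: strings look like "1234567" or "1234.5"
        value = float(value) if "." in value else int(value)
    return "{:,}".format(value)
```

The function could be passed to map() in place of the lambda.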
Top15['PopEst']=(Top15['Energy Supply']/Top15['Energy Supply per Capita'])
df = Top15[['PopEst']]
df = df.reset_index()
i=0
while(len(df)>i):
    v =str(df.iloc[i]['PopEst']).split('.')
    strr = str(format(int(v[0]), ',d'))+"."+v[1]
    df.iloc[i] = df.iloc[i].replace(df.iloc[i]['PopEst'],strr)
    i = i+1
df.set_index(['Country Name'], inplace=True)
I think this will help you.
v= format(12345678, ',d')
print(v)
# 12,345,678

Split Pandas Column by values that are in a list

I have three lists that look like this:
age = ['51+', '21-30', '41-50', '31-40', '<21']
cluster = ['notarget', 'cluster3', 'allclusters', 'cluster1', 'cluster2']
device = ['htc_one_2gb','iphone_6/6+_at&t','iphone_6/6+_vzn','iphone_6/6+_all_other_devices','htc_one_2gb_limited_time_offer','nokia_lumia_v3','iphone5s','htc_one_1gb','nokia_lumia_v3_more_everything']
I also have column in a df that looks like this:
campaign_name
0 notarget_<21_nokia_lumia_v3
1 htc_one_1gb_21-30_notarget
2 41-50_htc_one_2gb_cluster3
3 <21_htc_one_2gb_limited_time_offer_notarget
4 51+_cluster3_iphone_6/6+_all_other_devices
I want to split the column into three separate columns based on the values in the above lists. Like so:
age cluster device
0 <21 notarget nokia_lumia_v3
1 21-30 notarget htc_one_1gb
2 41-50 cluster3 htc_one_2gb
3 <21 notarget htc_one_2gb_limited_time_offer
4 51+ cluster3 iphone_6/6+_all_other_devices
First thought was to do a simple test like this:
ages_list = []
for i in age:
    if i in df['campaign_name'][0]:
        ages_list.append(i)
print ages_list
>>> ['<21']
I was then going to convert ages_list to a Series and combine it with the other two to get the end result above, but I assume there is a more Pythonic way of doing it?
The idea is to build a regular expression from the values you already have. For example, to capture any value from your age list, join the values with '|' to form an alternation, and do the same for cluster and device.
Two details need care. Some values contain a + sign (for example '51+' and 'iphone_6/6+_at&t'), which means "one or more" in a regex, so each value is passed through re.escape to match it literally. Some device names are also prefixes of others ('htc_one_2gb' and 'htc_one_2gb_limited_time_offer'), and an alternation takes the first alternative that matches, so the values are sorted longest first.
import re
import pandas as pd

df = pd.DataFrame({'campaign_name' : ['notarget_<21_nokia_lumia_v3' , 'htc_one_1gb_21-30_notarget' , '41-50_htc_one_2gb_cluster3' , '<21_htc_one_2gb_limited_time_offer_notarget' , '51+_cluster3_iphone_6/6+_all_other_devices'] })

def make_pattern(values):
    # escape regex metacharacters and try longer values first
    return '|'.join(re.escape(v) for v in sorted(values, key=len, reverse=True))

def split_df(row):
    campaign_name = row['campaign_name']
    row['age'] = re.findall(make_pattern(age), campaign_name)[0]
    row['cluster'] = re.findall(make_pattern(cluster), campaign_name)[0]
    row['device'] = re.findall(make_pattern(device), campaign_name)[0]
    return row

df.apply(split_df, axis = 1 )
If you want to drop the original column, you can do this:
df.apply(split_df, axis = 1 ).drop( 'campaign_name', axis = 1)
Here I'm assuming that every row contains a match for each list. If that is not guaranteed, check the result of re.findall before indexing it.
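The check for rows without a match could be sketched as follows, without pandas so that it runs on its own. first_match and split_campaign are hypothetical names; the three lists are copied from the question, and returning None for a missing value is an assumption about the desired behaviour.

```python
import re

age = ['51+', '21-30', '41-50', '31-40', '<21']
cluster = ['notarget', 'cluster3', 'allclusters', 'cluster1', 'cluster2']
device = ['htc_one_2gb', 'iphone_6/6+_at&t', 'iphone_6/6+_vzn',
          'iphone_6/6+_all_other_devices', 'htc_one_2gb_limited_time_offer',
          'nokia_lumia_v3', 'iphone5s', 'htc_one_1gb',
          'nokia_lumia_v3_more_everything']


def first_match(values, text):
    """Return the first listed value found in text, or None if none is."""
    # Escape metacharacters such as '+', and prefer longer values so that
    # 'htc_one_2gb_limited_time_offer' wins over 'htc_one_2gb'
    ordered = sorted(values, key=len, reverse=True)
    found = re.search('|'.join(re.escape(v) for v in ordered), text)
    return found.group(0) if found else None


def split_campaign(name):
    """Split one campaign name into its age, cluster and device parts."""
    return {'age': first_match(age, name),
            'cluster': first_match(cluster, name),
            'device': first_match(device, name)}
```

A row-wise function for df.apply could call split_campaign on the campaign_name value and decide what to do with any None entries.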

The result being cast to double in Pig but is still being ordered as a string

I encountered the following problem:
First my data is a string that looks like this:
decimals, decimals
example: 1.345, 3.456
I used the following pig script to put this column, say QQ, into two columns:
result = FOREACH old_table GENERATE FLATTEN(STRSPLIT(QQ, ',')) as (COL1: double, COL2: double);
Then, I want to order it by first field, then second field.
result_ordered = ORDER result BY COL1, COL2;
However, I got the result like the following:
> 59.619198977071434 -151.4586740547339
> 60.52611316847121 -150.8005347076273
> 64.8310014577408 -147.84786488835852
> 7.059652849999997 125.59985130999996
which implies that my data is still being ordered as a string and not as a double. Has anyone encountered this issue and know how to solve it? Thank you in advance!
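The symptom can be reproduced outside Pig. As an illustration only (Python, not Pig), sorting the first-column values as strings gives the order shown above, while sorting them as numbers puts the value starting with 7 first:

```python
# First-column values from the output above, kept as strings
values = ["59.619198977071434", "60.52611316847121",
          "64.8310014577408", "7.059652849999997"]

# Lexicographic comparison: '7...' sorts after '6...' character by character
as_strings = sorted(values)

# Numeric comparison: 7.05... is the smallest value
as_numbers = sorted(values, key=float)
```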
I'm not sure why STRSPLIT is returning a chararray though you explicitly state they are doubles. But if you look at http://pig.apache.org/docs/r0.10.0/basic.html#arithmetic, notice that chararrays can't be multiplied by 1.0 to doubles, but bytearrays can. Therefore you can do something like:
result = FOREACH old_table
GENERATE FLATTEN(STRSPLIT(QQ, ',')) AS (COL1: bytearray, COL2: bytearray);
B = FOREACH result GENERATE 1.0 * COL1 AS COL1, 1.0 * COL2 AS COL2 ;
result_ordered = ORDER B BY COL1, COL2;
Which gives me the correct output of:
result_ordered: {COL1: double,COL2: double}
(7.059652849999997,125.59985130999996)
(59.619198977071434,-151.4586740547339)
(60.52611316847121,-150.8005347076273)
(64.8310014577408,-147.84786488835852)
Instead of assigning the output of FLATTEN to a schema with two doubles, try actually casting the fields with (double). It may be that Pig only uses the :double syntax for schema checking, but requires the explicit cast to convert the types during execution.