I'm trying to generate a QuickSight analysis with a simple .csv file. The file contains some arbitrary data like
Yifei, 24, Male, 2
Joe, 30, Male, 3
Winston, 40, Male, 7
Emily, 18, Female, 5
Wendy, 32, Female, 4
I placed the file in an S3 bucket, and then used AWS Athena to parse it into a table. The table treats all columns as strings, and I can query it properly:
SELECT * FROM users
returns
name age gender consumed
1 Yifei 24 Male 2
2 Joe 30 Male 3
3 Winston 40 Male 7
4 Emily 18 Female 5
5 Wendy 32 Female 4
Okay, so far so good. Then in QuickSight, I imported the table as a dataset, and it's properly displayed under fields with the correct values. The only remaining problem is that age and consumed are treated as strings, not numbers. So, I created two calculated fields:
age_calc: parseInt({age})
consumed_calc: parseInt({consumed})
That works just fine: under the fields I can now see the newly created fields with correct values. However, once I try to create an actual visualization (for example, a pie chart of how much everyone consumed) using the field consumed_calc, the value of consumed_calc is just null.
I found the issue. The .csv file has a space after each comma, so the values are stored with a leading space. The calculated fields show the correct result in the preview, but when the visualization is built, parsing a field like " 23" produces an error. Removing the spaces in the original .csv file solved the issue.
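If editing the file by hand is not practical, the spaces can be trimmed before the file is uploaded to S3. The following is a minimal sketch using Python's standard csv module; the helper name strip_csv_fields is made up for illustration, and the sample rows are the first two from the question.

```python
import csv
import io


def strip_csv_fields(src, dst):
    """Copy CSV rows from src to dst, trimming whitespace around every field."""
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        writer.writerow([field.strip() for field in row])


# In-memory stand-ins for the input and output files.
raw = io.StringIO("Yifei, 24, Male, 2\nJoe, 30, Male, 3\n")
cleaned = io.StringIO()
strip_csv_fields(raw, cleaned)
print(cleaned.getvalue())
```

With real files, the two StringIO objects would be replaced by files opened with newline="" as the csv module documentation recommends.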
I have a data frame with 4 columns. It will typically have 200 or more rows; I have an example below showing 4 rows. There is a column for account number, and an account number may appear multiple times in that column. I have a separate Excel sheet with 2 columns, listing account number and account name. I want to replace each account number with the corresponding account name shown on my Excel sheet. I cannot manually type out a replace call for every account number, as there are hundreds of account names and numbers. Is there a way I can replace the account numbers with their relevant account names, or perhaps append a new column showing the relevant account name?
If I understand correctly, you want something like the following (You can get l1 and l2 by parsing the excel sheet):
import pandas as pd
l1 = [100, 200]
l2 = [1000, 2000]
z = pd.DataFrame({"one": [100,100,300,200], "two": [100,100,300,200]})
"""
one two
0 100 100
1 100 100
2 300 300
3 200 200
"""
print(z)
# Replace each value found in l1 with the value at the same position in l2.
z["two"] = z["two"].replace(l1, l2)
"""
one two
0 100 1000
1 100 1000
2 300 300
3 200 2000
"""
print(z)
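To append a new column with the account name instead of overwriting the number, one option is to build a lookup from the Excel sheet and use Series.map. This is only a sketch: the column names and account names below are hypothetical, and the lookup frame stands in for whatever pd.read_excel would return for the real sheet.

```python
import pandas as pd

# Hypothetical stand-in for the Excel sheet (account number -> account name).
lookup = pd.DataFrame({
    "account_number": [100, 200],
    "account_name": ["Alpha Ltd", "Beta Inc"],
})

# Hypothetical main data frame; account numbers may repeat.
df = pd.DataFrame({
    "account_number": [100, 100, 300, 200],
    "amount": [10, 20, 30, 40],
})

# Index the names by account number, then look up every row's number.
name_by_number = lookup.set_index("account_number")["account_name"]
df["account_name"] = df["account_number"].map(name_by_number)
print(df)
```

Account numbers that are missing from the lookup (300 here) end up as NaN in the new column, which makes unmatched accounts easy to spot.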
I am working in Stata with a dataset on electric vehicle charging stations. Variables include
station_name name of charging station
review_text all of the customer reviews for a specific station delimited by }{
num_reviews number of customer reviews.
I'm trying to make a new file where each observation represents one customer review in a new variable customer_review and another variable station_id has the name of the corresponding station. So, if the original dataset had 100 observations (one per station) with 5 reviews each, the new file should have 500 observations.
How can I do this? I would include some code I have tried but I have no idea how to start.
If your data look like this:
station reviews n
1. 1 {good}{bad}{great} 3
2. 2 {poor}{excellent} 2
Then the following:
split reviews, parse("}{")
drop reviews n
reshape long reviews, i(station) j(review_num)
drop if reviews==""
replace reviews = subinstr(reviews, "}","",.)
replace reviews = subinstr(reviews, "{","",.)
will produce:
station review~m reviews
1. 1 1 good
2. 1 2 bad
3. 1 3 great
4. 2 1 poor
5. 2 2 excellent
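For comparison, the same split-then-reshape idea can be sketched outside Stata with pandas. This assumes the same two example rows as above; the frame and column names are chosen for illustration only.

```python
import pandas as pd

stations = pd.DataFrame({
    "station": [1, 2],
    "reviews": ["{good}{bad}{great}", "{poor}{excellent}"],
})

# Drop the outer braces, then split each string on the "}{" delimiter.
review_lists = stations["reviews"].map(lambda s: s.strip("{}").split("}{"))

# explode() turns each list element into its own row (one review per row).
long_df = (
    stations.assign(reviews=review_lists)
    .explode("reviews")
    .reset_index(drop=True)
)

# Number the reviews within each station, like review_num in the Stata output.
long_df["review_num"] = long_df.groupby("station").cumcount() + 1
print(long_df)
```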
I have 2 reports/data sets to create a dashboard in Visual Insight. One data set is from Teradata (directly connected to MicroStrategy). The other data set is from Google BigQuery (connected to MicroStrategy via Intelligent Cube connector). The key of these 2 data sets is Categories.
The problem is that the Categories attribute in Teradata holds numeric values (e.g. 55, 45, 14, 29, 30), while the Categories values in the BQ data set are text (e.g. Food, Fashion). Food consists of numbers 55, 45 and 14; numbers 29 and 30 make up Fashion. I tried grouping the numbers under the corresponding text names, but the new grouped Teradata attribute doesn't link properly with the other data set.
So my challenge is how to align these 2 data sets on the key attribute and link them properly. I'm thinking of creating a new attribute using a Case/If function but couldn't figure it out. Any other suggestions would also be very much appreciated!
Thank you very much,
Willow
You need to create a new table or a view in MicroStrategy that holds both CategoryDESC and CategoryID, given the following:
Teradata
Column1
55
45
14
29
30
BigQuery
Column1
Food
Fashion
New table
Column1 Column2
Food 55
Food 45
Food 14
Fashion 29
Fashion 30
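To show why such a mapping table makes the two sources line up, here is a small pandas sketch of the same join logic. It is not MicroStrategy code: the sales and visits columns and their values are hypothetical, and only the Food/Fashion mapping comes from the question.

```python
import pandas as pd

# Mapping table from the answer: one row per (description, id) pair.
category_map = pd.DataFrame({
    "CategoryDESC": ["Food", "Food", "Food", "Fashion", "Fashion"],
    "CategoryID": [55, 45, 14, 29, 30],
})

# Hypothetical sample rows standing in for the two data sets.
teradata = pd.DataFrame({"CategoryID": [55, 45, 29], "sales": [10, 20, 5]})
bigquery = pd.DataFrame({"CategoryDESC": ["Food", "Fashion"], "visits": [100, 40]})

# Attach the text label to the numeric rows, roll up to the label, then join.
labelled = teradata.merge(category_map, on="CategoryID", how="left")
by_label = labelled.groupby("CategoryDESC", as_index=False)["sales"].sum()
combined = by_label.merge(bigquery, on="CategoryDESC", how="inner")
print(combined)
```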
Here is my dataframe:
ID AMT DATE
0 1496846 54.76 2015-02-11
1 1496846 195.00 2015-01-09
2 1571558 11350.00 2015-04-30
3 1498812 135.00 2014-07-11
4 1498812 157.00 2014-08-04
5 1498812 110.00 2014-09-23
6 1498812 1428.00 2015-01-28
7 1558450 4355.00 2015-01-26
8 1858606 321.52 2015-03-27
9 1849431 1046.81 2015-03-19
I would like to make this a dataframe consisting of time series data for each ID. That is, each column name is a date (sorted), and it is indexed by ID, and the values are the AMT values corresponding to each date. I can get so far as doing something like
df.set_index("DATE").T
but from here I'm stuck.
I also tried
df.pivot(index='ID', columns='DATE', values='AMT')
but this gave me an error on having duplicate entries (the IDs).
I envision it as transposing DATE, and then grouping by unique ID and melting AMT underneath.
You want to use pivot_table, whose aggfunc parameter controls how duplicate index/column combinations are aggregated.
df.pivot_table(values='AMT', index='ID', columns='DATE', aggfunc='sum')
You'll need to choose how to handle the duplicates. I put 'sum' in there; it defaults to 'mean'.
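A small runnable sketch of that call, with ID as the index and DATE as the columns as the question asks. The first four rows are taken from the question's data; the last row is a hypothetical duplicate (same ID and DATE) added only to show the aggregation.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1496846, 1496846, 1571558, 1498812, 1498812],
    "AMT": [54.76, 195.00, 11350.00, 135.00, 15.00],
    "DATE": pd.to_datetime([
        "2015-02-11", "2015-01-09", "2015-04-30", "2014-07-11",
        "2014-07-11",  # hypothetical duplicate of the previous ID/DATE pair
    ]),
})

# One row per ID, one column per date; duplicate ID/DATE pairs are summed.
wide = df.pivot_table(values="AMT", index="ID", columns="DATE", aggfunc="sum")
print(wide)
```

ID/DATE combinations with no data come out as NaN; passing fill_value=0 to pivot_table is one way to change that if zeros are preferred.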
I have three different questions about modifying a dataset in SAS. My data contains the day and the specific tag number that was registered by an antenna on that day.
I have three separate questions:
1) The tag numbers are continuous and range from 1 to 560. Can I easily add the numbers within this range that were not registered on a specific day? For example, if 160-280 were not registered on 23-May and 40-190 on 24-May, can I add these non-registered numbers only for that specific day? (In reality the non-registered numbers are much more scattered, and for a dataset spanning a few weeks that is too much to do by hand.)
2) Furthermore, I want to make a new variable indicating whether a tag has been registered (1) or not (0). Would it work to create this variable and set it to 1, then add the missing numbers and (assuming the new variable is not set for the added numbers) set the missing values to 0?
3) The last question concerns the format of the registered numbers, which is along the lines of 528 000000000400 and 000 000000000054. I am only interested in the last three digits of the number and want to remove the others. If I could add the missing numbers, I could make a new variable after the data has been sorted by date and the original transponder code, but otherwise what would you suggest?
I would love some suggestions and thank you in advance.
I am inventing some data here; I hope I understood your questions correctly.
/* One row per possible tag number (1 to 560). */
data chickens;
  do tag=1 to 560;
    output;
  end;
run;
data registered;
input date mmddyy8. antenna tag;
format date date7.;
datalines;
01012014 1 1
01012014 1 2
01012014 1 6
01012014 1 8
01022014 1 1
01022014 1 2
01022014 1 7
01022014 1 9
01012014 2 2
01012014 2 3
01012014 2 4
01012014 2 7
01022014 2 4
01022014 2 5
01022014 2 8
01022014 2 9
;
run;
/* Every date/antenna combination in the data, crossed with every tag. */
proc sql;
  create table dates as
  select distinct date, antenna
  from registered;
  create table DatesChickens as
  select date, antenna, tag
  from dates, chickens
  order by date, antenna, tag;
quit;
proc sort data=registered;
  by date antenna tag;
run;
/* Registered is 1 for rows present in the original data, 0 for added rows. */
data registered;
  merge registered(in=INR) DatesChickens;
  by date antenna tag;
  Registered=INR;
run;
data registeredNumbers;
input Numbers $16.;
datalines;
528 000000000400
000 000000000054
;
run;
/* Keep the characters from position 14 onward, i.e. the last three digits. */
data registeredNumbers;
  set registeredNumbers;
  NewNumbers=substr(Numbers,14);
run;
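For readers more comfortable with Python, the three steps above can be sketched in pandas as well. This is an illustration under assumptions, not a translation of the SAS code line by line: the readings are invented, the antenna column is left out, and the tag range is 1 to 5 instead of 1 to 560 to keep the output short.

```python
import pandas as pd

# Hypothetical readings: one row per (date, raw transponder code) detection.
registered = pd.DataFrame({
    "date": ["2014-01-01", "2014-01-01", "2014-01-02"],
    "code": ["528 000000000001", "000 000000000003", "528 000000000002"],
})

# Question 3: keep only the last three digits as the tag number.
registered["tag"] = registered["code"].str[-3:].astype(int)
registered["registered"] = 1

# Question 1: every (date, tag) combination for the dates in the data.
full_index = pd.MultiIndex.from_product(
    [sorted(registered["date"].unique()), range(1, 6)],
    names=["date", "tag"],
)

# Question 2: combinations missing from the readings get registered = 0.
flags = (
    registered.set_index(["date", "tag"])["registered"]
    .reindex(full_index, fill_value=0)
    .reset_index()
)
print(flags)
```

The reindex step assumes each tag appears at most once per date; repeated detections would need to be dropped or aggregated first.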
I do not know SAS, but here is how I would do it in SQL - may give you an idea of how to start.
1 - Birds that have not registered through pophole that day
SELECT b.BirdId
FROM Birds b
WHERE NOT EXISTS
(SELECT 1 FROM Pophole_Visits p WHERE b.BirdId = p.BirdId AND p.date = ????)
2 - Birds registered through pophole
If you have a dataset with pophole data, you can query that to find out whether a bird has been through. What would your flag be doing: finding a bird that has never been through any popholes? Looking for dodgy sensor tags or dead birds?
3 - Data code
You might have more joy with the SUBSTRING function
Good luck