Illegal Characters in Parquet file - amazon-web-services

I recently got data from Google Analytics (GA) and wanted to store it in AWS as a Parquet file.
When I tried to preview the file in the web UI, I got an error.
It took me a while to realise that the "pagePath" column coming from GA is the cause, as I am able to preview the data once I remove that column.
I can't share any data, but are there any "illegal" characters that lead to such failures?
I have >10k unique page paths and I can't figure out what the problem is.
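One way to narrow this down is to scan the pagePath values for characters that commonly break parsers and preview tools, such as control or format characters. A minimal sketch in Python, assuming the GA export is still available as a local CSV (the file name is a placeholder):

```python
import unicodedata

import pandas as pd

# Assumption: the GA export is available locally as a CSV with a "pagePath" column.
df = pd.read_csv("ga_export.csv")


def suspicious_chars(value):
    """Return control/format characters, which often break previews or parsers."""
    return [
        f"U+{ord(ch):04X} ({unicodedata.category(ch)})"
        for ch in str(value)
        if unicodedata.category(ch) in ("Cc", "Cf")  # control and format characters
    ]


flagged = df[df["pagePath"].map(lambda v: bool(suspicious_chars(v)))]
for path in flagged["pagePath"].head(20):
    print(repr(path), suspicious_chars(path))
```

Whatever this turns up, checking a handful of flagged paths by hand is usually enough to spot the pattern (for example embedded newlines or null bytes inside the page path).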

Related

Refresh Power BI dataset based on OneDrive Excel file

I'm trying to understand the basics of the Power BI service refreshing reports from datasets, and despite several questions on here (like this one and this one) and several Microsoft articles, I'm still lost.
Let's say I go to my corporate OneDrive account and create a very basic Excel file: two columns, some fake names and fake numbers.
Then I close that, open Power BI Desktop, and create a very basic report where I just dump that data into a table. It's also worth pointing out that I bring the data in using this method, copying and pasting the SharePoint path to the file (and doing all the other steps on that page as far as authenticating my work account, getting the data, etc.). I do that because I read elsewhere that if the OneDrive path is the path as it's seen on your computer, it won't work.
I then publish the report to the service online. Works great so far.
I now want to refresh my report based on a change I made to the data. So I open the OneDrive Excel doc, change a number, save and close.
Now, no matter what I do, the report won't refresh. When I click to refresh here:
It gives me this error:
When I google that error, it says I need to change the permissions, so I go to the data source settings in Power BI Desktop and have tried all sorts of things in various attempts, e.g. "clear permissions" or "edit permissions" and making sure it says "none", etc.
What am I misunderstanding?
Lastly, I don't understand it at all, but I've heard previously that I may need a "gateway". A week or two ago I had my company's IT department install a gateway, and as far as I can tell it's on right now:
I was told I need to select OAuth2 as the authentication method, but where I would expect to find the menu for dataset credentials (pic below; I'd expect it to read dataset description, then gateway connection, then data source credentials) I don't even see the option:
Also, that warning on the screen (can't refresh because data sources don't support refresh) obviously sounds relevant, but (a) it looks like the whole error isn't even there, and (b) it's just an Excel file on OneDrive, shouldn't that work?
After publishing the .pbix to the Power BI service, you need to configure the data source credentials to use OAuth2.

S3 to DynamoDB item logs

I am working on a data pipeline where I am importing CSV data from S3 into my DynamoDB table. Since the data set is very large, it is difficult to spot a discrepancy in any individual row. My import fails with the error "The provided key element does not match the schema", but the row it points to looks fine and does not have any weird values.
What I want to do now:
I want to somehow log each item as it is taken from the CSV in S3 and written to DynamoDB. That way I can find the row which caused the error.
Is there any way to log that?
Thank you for your suggestions.
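One way to get per-item visibility is to drive the load yourself with a small script that logs every row before writing it, so the offending row (and its line number) shows up in the log. A rough sketch with boto3, assuming the bucket, key and table name below are placeholders and that the table's key attributes are plain string columns in the CSV:

```python
import csv
import io
import logging

import boto3

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("s3-to-dynamodb")

# Assumptions: bucket, key and table name are placeholders.
BUCKET = "my-bucket"
KEY = "exports/data.csv"
TABLE_NAME = "my-table"

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table(TABLE_NAME)

# Stream the CSV straight from S3 (recent botocore versions expose the body
# as a file-like object, so TextIOWrapper can decode it on the fly).
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"]
reader = csv.DictReader(io.TextIOWrapper(body, encoding="utf-8"))

for row_number, row in enumerate(reader, start=2):  # row 1 is the header
    try:
        table.put_item(Item=row)
        log.info("row %d written: %s", row_number, row)
    except Exception:
        # The failing row and its line number end up in the log with the full traceback.
        log.exception("row %d failed: %s", row_number, row)
```

This is slower than a bulk import, but it pinpoints exactly which row triggers "The provided key element does not match the schema" (often an empty or differently named key column).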

Amazon Athena with big gzip JSON files?

I'm taking my first steps with Amazon Athena and I don't know why I'm not getting the expected results.
I'm dealing with big JSON files, gzip-compressed and stored in S3, and I cannot get results for even a simple count query.
Right now I'm testing with two files, each one about 10 GB of compressed JSON.
When I test the table with LIMIT 10, I get results, so the table is created and working. But when I run any other query, even with a simple WHERE, the query never ends; I had to stop it after 30 minutes without a response.
I've read about data partitioning and I know big files are not the best way to store data in S3 if you want to use Athena.
Despite this, I've been searching around and found tests where people query big files (70-80 GB) and get results in about 10 seconds.
Athena seems very easy to use, so there must be something I'm doing wrong in addition to the unpartitioned data.
Could you give me any tips, or is there no solution for this situation?
Thank you
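One likely culprit: gzip is not a splittable format, so a single ~10 GB .json.gz object is processed by a single reader and cannot be parallelised, which matches the "LIMIT 10 works, anything else hangs" symptom. A common workaround is to re-shard each big object into many smaller ones (or, even better, convert them to a columnar format like Parquet). A rough sketch that splits one object into smaller gzip parts, assuming the data is newline-delimited JSON and the bucket/key names are placeholders:

```python
import gzip
import io

import boto3

s3 = boto3.client("s3")

# Assumptions: bucket, keys and sizes below are placeholders; the source object
# is newline-delimited JSON (one record per line), compressed with gzip.
SRC_BUCKET = "my-bucket"
SRC_KEY = "raw/big-file.json.gz"
DST_PREFIX = "split/"
LINES_PER_PART = 500_000  # tune so each part lands around 100-250 MB compressed


def upload_part(part_no, compressed_bytes):
    """Upload one re-compressed chunk as its own S3 object."""
    s3.put_object(
        Bucket=SRC_BUCKET,
        Key=f"{DST_PREFIX}part-{part_no:05d}.json.gz",
        Body=compressed_bytes,
    )


body = s3.get_object(Bucket=SRC_BUCKET, Key=SRC_KEY)["Body"]
reader = gzip.GzipFile(fileobj=body)  # decompress while streaming, no temp file

part_no, line_count = 0, 0
buf = io.BytesIO()
writer = gzip.GzipFile(fileobj=buf, mode="wb")

for line in reader:
    writer.write(line)
    line_count += 1
    if line_count >= LINES_PER_PART:
        writer.close()
        upload_part(part_no, buf.getvalue())
        part_no += 1
        line_count = 0
        buf = io.BytesIO()
        writer = gzip.GzipFile(fileobj=buf, mode="wb")

if line_count:
    writer.close()
    upload_part(part_no, buf.getvalue())
```

Pointing the table's LOCATION at the split/ prefix lets Athena fan the work out over many files instead of one, and partitioning on top of that (for example by date) cuts down how much data each query has to scan.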

"No data" message in Google Data Studio chart after connecting dataset from BigQuery?

I am trying to connect and visualise aggregation of metrics from a wildcard table in BigQuery. This is the first time I am connecting a table from this particular Google Cloud project to Data Studio. Prior to this, I have successfully connected and visualised metrics from other BigQuery tables from other Google Cloud projects in Google Data Studio and never encountered this issue. Any ideas? Could this be something to do with project-level permissions for Google Data Studio to access a BigQuery table for the first time?
More details of this instance: the dataset itself seems to be connected to Data Studio successfully, so no errors were encountered there. After adding some charts connected to that data source and aggregating metrics, no other Data Studio error messages were encountered either, just the words "No data" displayed in the chart. Could this also be a formatting issue in the BigQuery table itself? The BigQuery table in question was created via pandas-gbq in a loop that splits the original dataset into individual daily _YYYYMMDD tables. However, this has been done before and never presented a problem.
I had been struggling with the same problem for a while, and eventually I found out that, at least in my case, it is related to the date I add as the suffix (_YYYYMMDD). If I use "today" as the suffix, Data Studio won't recognise it and will display "No data", but if I change it to "yesterday" (a day earlier), it then displays the data correctly. I think it is probably related to timezones, e.g. "today" here has not arrived yet in the US, so the system can't show it. Hopefully this helps.
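If that is the cause, the fix on the write side can be as simple as dating the shard with the previous day (or an explicit timezone) before uploading with pandas-gbq. A small sketch, assuming a DataFrame df and placeholder project/dataset names:

```python
from datetime import datetime, timedelta, timezone

import pandas as pd
import pandas_gbq

# Assumptions: the DataFrame, project id and dataset/table names are placeholders.
df = pd.DataFrame({"metric": [1, 2, 3]})

# Date the shard explicitly (here: "yesterday" in UTC) instead of the local "today",
# so the _YYYYMMDD suffix lines up with the dates Data Studio expects to find.
suffix = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y%m%d")

pandas_gbq.to_gbq(
    df,
    destination_table=f"my_dataset.daily_table_{suffix}",
    project_id="my-gcp-project",
    if_exists="replace",
)
```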

AWS Personalize complaining about user_id not present in CSV when it is. Any suggestions?

I have a CSV with user-interaction data and I am trying to create a dataset import job in Amazon Personalize, but it keeps failing, saying the column user_id does not exist. Can someone please help?
I have tried renaming the column to different things and changing the schema accordingly, but it still fails on that first column.
I figured it out myself. It's kind of annoying, but the first column in the CSV cannot be any of the columns that Personalize requires. So just add some random key or other value as the first column and it will pass their validation. I hope this helps anyone with the same issue.
I've also had this issue recently. Turns out my file was encoded in UTF-16 and that didn't play too well with Amazon's systems.
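Both findings above are easy to check and fix locally before re-uploading. A small sketch, assuming the interactions file name is a placeholder and that your Personalize schema expects a user_id field (use the exact field names from your own schema):

```python
import pandas as pd

# Assumptions: file names are placeholders; adjust the encoding to whatever the
# exported file actually uses ("utf-16", or "utf-8-sig" if it only has a BOM).
df = pd.read_csv("interactions.csv", encoding="utf-16")

# Inspect what Personalize will actually see as the header row; an encoding issue
# or a stray BOM often makes the first column name differ from the schema field.
print(df.columns.tolist())

# Re-save as plain UTF-8 so the header bytes match the schema field names exactly.
df.to_csv("interactions_utf8.csv", index=False, encoding="utf-8")
```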