Athena is not loading Database and Table list - amazon-web-services

It's as simple as that. Athena used to load the databases and tables that I crawled using Glue. The data is present in S3 and Athena worked before, but all of a sudden the loading icon just spins and the list of databases and tables never loads.
I'm in the right region. It works when I send queries through Python/SageMaker, i.e. I use awswrangler and the data it returns is fine. But it's not possible to query within Athena itself, even though I used to be able to.
I'm totally stumped on what the problem could be, as I have no clues.
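For reference, the working Python/SageMaker path looks roughly like this; a minimal sketch assuming awswrangler, with placeholder database and table names:

import awswrangler as wr

# Minimal sketch of the working path described above: querying the same
# Glue catalog from Python/SageMaker. Database and table names are placeholders.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM my_table LIMIT 10",
    database="my_glue_database",
)
print(df.head())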

This has been solved. I am not sure what the fix was (this has been an issue for at least 3 months, and I had tried solving it before with similar methods).
But I did two things before it 'fixed itself':
Tried changing the Athena output location through the workgroup settings.
Tried changing the same setting (I'm not sure if both point to the same property) through the settings icon at the top right of the page.
And suddenly the list of databases and tables shows up in the Query Editor page.
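For anyone who wants to script that first step, changing the workgroup's query result location looks roughly like this with boto3 (a sketch only; the workgroup name and bucket are placeholders):

import boto3

athena = boto3.client("athena")

# Sketch of the first step above: pointing the workgroup's query result
# location at an S3 bucket. Workgroup and bucket names are placeholders.
athena.update_work_group(
    WorkGroup="primary",
    ConfigurationUpdates={
        "ResultConfigurationUpdates": {
            "OutputLocation": "s3://my-athena-results-bucket/queries/"
        }
    },
)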

Related

AWS Glue: Add an Attribute to a CSV to Distinguish Between Data Sets

I need to pull two companies' data from their respective AWS S3 buckets, map their columns in Glue, and export them to a specific schema in a Microsoft SQL database. The schema is to have one table, with the companies' data being distinguished with attributes for each of their sites (each company has multiple sites).
I am completely new to AWS and SQL; would someone mind explaining to me how to add an attribute to the data, or point me to some good literature on this? I feel like manipulating the .csv in the Python script I'm already running (which automatically downloads the data from another site and then uploads it to S3) could be an option - deleting NaN columns and adding a column for site name - but I'm not entirely sure.
I apologize if this has already been answered elsewhere. Thanks!
I generally find this website pretty helpful for figuring out SQL; I've linked to the ALTER TABLE commands that would allow you to do this through SQL.
If you are running a Python script to edit the .csv to start with, then I would edit the data there, personally. Depending on the size of the data sets, you can run your script as a Lambda or Batch job to grab, edit, and then upload to S3. Then you can run your Glue crawler or whatever process you're using to map the columns.
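A minimal sketch of that edit-then-upload step, assuming pandas and boto3; the file names, bucket, and the site_name column are placeholders:

import boto3
import pandas as pd

# Sketch of the "edit the CSV in your Python script" suggestion above.
# File names, the site label, and the column name are placeholders.
df = pd.read_csv("company_a_site_1.csv")

df = df.dropna(axis=1, how="all")      # drop columns that are entirely NaN
df["site_name"] = "company_a_site_1"   # add the distinguishing attribute

df.to_csv("company_a_site_1_clean.csv", index=False)
boto3.client("s3").upload_file(
    "company_a_site_1_clean.csv", "my-bucket", "clean/company_a_site_1.csv"
)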

How should I create this DynamoDB table for almost 100000 records?

I've got a CSV with nearly 100000 records in it and I'm trying to query this data in AWS DynamoDB, but I don't think I did it right.
When creating the table I added a key column, airport_id, and then wrote a small console app to import all the records, setting a new GUID for the key column of each record. I've only ever used relational databases before, so I don't know if this is supposed to work differently.
The problem came when querying the data: picking a record somewhere in the middle and querying for it using the AWS SDK in .NET produced no results at all. I can only put this down to bad DB design on my part, as it did return some results depending on the query.
What I'd like is to be able to query the data, for example:
Get all the records that have iso_country of JP (Japan) - doesn't find many
Get all the records that have municipality of Inverness - doesn't find any; the Inverness records are halfway through the set
There are records there, but I reckon the table doesn't have the right design to retrieve them in a timely fashion.
How should I create the DynamoDB table based on the below screenshot?
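For context on why those lookups come back empty: with a random GUID as the only key, DynamoDB can only query efficiently by that GUID; filtering on attributes such as iso_country or municipality requires either a full scan or a global secondary index on that attribute. A rough boto3 sketch, with a placeholder table name and a hypothetical GSI:

import boto3
from boto3.dynamodb.conditions import Attr, Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("airports")   # table name is a placeholder

# Without an index on the attribute, the only option is a (slow) full scan:
scan_resp = table.scan(FilterExpression=Attr("iso_country").eq("JP"))

# With a global secondary index whose partition key is iso_country
# (hypothetical name "iso_country-index"), a targeted query works:
query_resp = table.query(
    IndexName="iso_country-index",
    KeyConditionExpression=Key("iso_country").eq("JP"),
)
print(len(scan_resp["Items"]), len(query_resp["Items"]))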

Copying BigQuery result to another table is not working?

I have noticed a weird problem with BigQuery in the last 2-3 days; earlier it was working fine.
I have a BigQuery table in a dataset located in the EU region. I ran a simple SELECT query on that table and it completed without any issue.
Now, when I try to save that query result into another BigQuery table in the same dataset, it gives the error below:
To copy a table, the destination and source datasets must be in the
same region. Copy an entire dataset to move data between regions.
The strange part is that other alternatives are working fine, such as:
Copying the source table to a new table works fine.
When I set the destination table in the query settings and run the query, it saves the query result into that configured table (see the sketch below).
If I run the query, access the temporary table where BigQuery actually stores the query result, and then copy that temporary table to the destination table, that also works.
Not sure why only the 'save results' option is not working; it was working before.
Does anyone have any idea if something has changed on GCP recently?
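For reference, the destination-table workaround from the list above can also be done from code; a minimal sketch using the google-cloud-bigquery Python client, where the project, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Sketch of the destination-table workaround mentioned above.
# Project, dataset, and table names are placeholders.
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_eu_dataset.results_table",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
job = client.query(
    "SELECT * FROM `my-project.my_eu_dataset.source_table`",
    job_config=job_config,
)
job.result()  # wait for the query job to finish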
You can try CREATE OR REPLACE TABLE 'abc.de.omg' AS SELECT ... to store the same result.
edit: another workaround is to set it up as a scheduled query and run it as a backfill once.
On another note, anyone finding this can comment on the reported bug here: https://issuetracker.google.com/issues/233184546 (I'm not the original poster)
I tried to save the query results as a BigQuery table as below; I manually gave the dataset name and it worked.
This happens when you have the source and destination datasets in different regions.
You can share source and destination dataset screenshots to check the region.
I tried to reproduce the issue:
Error when I typed the dataset name manually.
Successful when I selected the dataset from the drop-down - strange.

How to access AWS public dataset using Databricks?

For one of my classes, I have to analyze a "big data" dataset. I found the following dataset on the AWS Registry of Open Data that seems interesting:
https://registry.opendata.aws/openaq/
How exactly can I create a connection and load this dataset into Databricks? I've tried the following:
df = spark.read.format("text").load("s3://openaq-fetches/")
However, I receive the following error:
java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
Also, it seems that this dataset has multiple folders. How do I access a particular folder in Databricks, and if possible, can I focus on a particular time range? Let's say, from 2016 to 2020?
Ultimately, I would like to perform various SQL queries in order to analyze the dataset and perhaps create some visualizations as well. Thank you in advance.
If you browse the bucket, you'll see that there are multiple datasets there, in different formats, that will require different access methods. So you need to point to the specific folder (and maybe its subfolder) to load the data. For example, to load the daily dataset you need to use the CSV format:
df = spark.read.format("csv").option("inferSchema", "true")\
.option("header", "false").load("s3://openaq-fetches/daily/")
To load only a subset of the data you can use path filters, for example. See the Spark documentation on loading data.
P.S. inferSchema isn't very optimal from a performance standpoint, so it's better to provide the schema explicitly when reading.
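A rough sketch of narrowing the load to 2016-2020, assuming the objects under daily/ are keyed by date (worth verifying the bucket's actual layout first); registering the result as a temp view also enables the SQL queries mentioned in the question:

# Sketch: restrict the load to 2016-2020, assuming files under daily/
# are named by date (verify the actual key layout in the bucket first).
paths = [f"s3://openaq-fetches/daily/{year}-*" for year in range(2016, 2021)]
df = (spark.read
      .format("csv")
      .option("header", "false")
      .load(paths))

df.createOrReplaceTempView("openaq_daily")   # enables SQL queries on the data
spark.sql("SELECT COUNT(*) FROM openaq_daily").show()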

Can I raise the point limit on a geospatial map in Quicksight?

I have a CSV with approx. 108,000 rows, each of which is a unique long/lat combination. The file is uploaded into S3 and visualised in QuickSight.
My problem is that QuickSight is only showing the first 10,000 points. The points that it shows are in the correct place and the map works perfectly; it's just missing 90%+ of the points I wish to show. I don't know if it makes a difference, but I am using an admin-enabled role for both S3 and QuickSight as this is a dev environment.
Is there a way to increase this limit so that I can show all of my data points?
I have looked in the visualisation settings (the drop-down in the viz) and explored the tab on the left as much as I can. I am quite new to AWS, so this may be a really easy one.
Thanks in advance!
You could consider combining lat/lng points that are near each other, based on some rule you come up with when preparing your data (see the sketch below).
There appear to be limitations on how many rows and columns you can serve to QuickSight:
https://docs.aws.amazon.com/quicksight/latest/user/data-source-limits.html
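A minimal sketch of that combine-nearby-points idea using pandas, done before uploading to S3; the file and column names are placeholders, and the rounding precision controls how aggressively points are merged:

import pandas as pd

# Sketch of the "combine nearby points" suggestion above: snap coordinates
# to a coarse grid and aggregate, so the row count drops below the limit.
# File and column names are placeholders.
df = pd.read_csv("points.csv")

df["lat_bin"] = df["latitude"].round(2)    # roughly a 1 km grid at this precision
df["lng_bin"] = df["longitude"].round(2)

binned = (df.groupby(["lat_bin", "lng_bin"])
            .size()
            .reset_index(name="point_count"))

binned.to_csv("points_binned.csv", index=False)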