I am new to GCP. My goal is to download the result of a query against the patents dataset, but the result is too large: I cannot download it directly because GCP only supports downloading 16,000 rows of data.
I select only a few columns and the result is already too large:
SELECT country_code, kind_code, application_kind, family_id, publication_date, filing_date,
       cpc.code AS cpc_code, ipc.code AS ipc_code
FROM `patents-public-data.patents.publications` p
CROSS JOIN UNNEST(p.cpc) AS cpc
CROSS JOIN UNNEST(p.ipc) AS ipc
I would like to be able to download the result as a table, or to split it by country_code and download one table per country.
To complement @Christopher's response, here are the steps to perform to achieve your download:
Perform your query
Save the result in a (temporary) table
Extract the table to a Google Cloud Storage bucket
Download the file(s) wherever you want, manually in the console or with the gsutil tool
Note that there is no size limitation, but you can end up with more than one file if the result is huge. Take care with the export format for nested fields, and prefer gzip compression for a faster download! (A SQL sketch combining the query, save, and extract steps follows below.)
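As a minimal, hedged sketch of the first three steps, BigQuery's EXPORT DATA statement can run the query and write the result straight to a bucket in one go; the bucket path below is a placeholder, and the wildcard in the URI is required because a large result is sharded across several files:

EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/patents/result-*.csv.gz',   -- placeholder bucket and path
  format = 'CSV',
  compression = 'GZIP',
  header = TRUE,
  overwrite = TRUE
) AS
SELECT country_code, kind_code, application_kind, family_id, publication_date, filing_date,
       cpc.code AS cpc_code, ipc.code AS ipc_code
FROM `patents-public-data.patents.publications` p
CROSS JOIN UNNEST(p.cpc) AS cpc
CROSS JOIN UNNEST(p.ipc) AS ipc;

The exported files can then be pulled down with gsutil as in the last step.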
You can write the result to another table, or export the table data to Cloud Storage (take note of the export limitations).
I took a look at the link and am trying to understand what S3 Select is.
Most applications have to retrieve the entire object and then filter out only the required data for further analysis. S3 Select enables applications to offload the heavy lifting of filtering and accessing data inside objects to the Amazon S3 service.
Based on the statement above, I am trying to imagine what the proper use case is.
Is the idea that if I have a single Excel file with 100 million rows sitting on S3, I can use S3 Select to query only some of the rows, instead of downloading all 100 million?
There are many use cases. But two cases that are apparent are centralization and time efficiency.
Let's say you have this "single Excel file with 100 million rows" in S3. Now if you have several people/departments/branches that need to access it, all of them would have to download, store, and process it. Since each of them would download it separately, in no time you would end up with all of them either having an old version of the file (a new version could have been uploaded to S3) or simply different versions: one person working on today's version, another on last week's. With S3 Select, all of them query and get data from the one version of the object stored in S3.
Also, if you have 100 million records, getting only the selected data can save you a lot of time. Just imagine one person needing only 10 records from this file and another person needing 1,000. Instead of downloading 100 million records, the first person uses S3 Select to fetch just the 10 records, while the other gets his or her 1,000, all without anyone needing to download the full 100 million.
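As a small hedged illustration of that second case, this is roughly the kind of SQL expression the first person could submit through the S3 console or the SelectObjectContent API; the column names and filter are made up for the example, and the comment is just annotation:

-- hypothetical CSV object with a header row; names are placeholders
SELECT s.record_id, s.customer_name
FROM S3Object s
WHERE s.region = 'EU'
LIMIT 10

Only the ten matching rows travel over the network; the rest of the object never leaves S3.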
Even more benefits come from using S3 Select on data in Glacier, from which you can't readily download your files when needed.
We can query results from Google BigQuery in any language using the predefined client methods (see the docs).
Alternatively, we can also query the results and store them in Cloud Storage, for example as a .csv (see the docs on exporting data to GCS).
When we repeatedly need to extract the same data, e.g. 100 times per day, does it make sense to cache the data in Cloud Storage and load it from there, or to redo the BigQuery requests?
Which is more cost efficient, and how would I obtain the unit cost of these requests to estimate the percentage difference?
BigQuery's on-demand pricing model is based on how many bytes your query scans.
So if you want to re-query the results, store them as a BigQuery table: set a destination table when you run your query.
There is no point in reloading previous results from GCS. The cost will be the same; you just complicate things.
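For example (dataset and table names here are invented), you can materialise the expensive query once as a destination table and point the 100-a-day reads at that much smaller table, so each repeat is billed only on the bytes of the small table:

-- one-off (or scheduled) materialisation; mydataset.big_source_table is hypothetical
CREATE OR REPLACE TABLE mydataset.daily_summary AS
SELECT customer_id, SUM(amount) AS total_amount
FROM mydataset.big_source_table
GROUP BY customer_id;

-- the repeated reads scan only the small summary table
SELECT * FROM mydataset.daily_summary WHERE customer_id = 42;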
For a project we've inherited, we have a large-ish set of legacy data (600 GB) that we would like to archive, but still have available if need be.
We're looking at using AWS Data Pipeline to move the data from the database into S3, according to this tutorial:
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
However, we would also like to be able to retrieve a 'row' of that data if we find the application is actually using a particular row.
Apparently that tutorial puts all of the data from a table into a single massive CSV file.
Is it possible to split the data up into separate files, with 100 rows of data in each file, and give each file a predictable file name, such as:
foo_data_10200_to_10299.csv
So that if we realise we need to retrieve row 10239, we can know which file to retrieve, and download just that, rather than all 600GB of the data.
If your data is stored in CSV format in Amazon S3, there are a couple of ways to easily retrieve selected data:
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
S3 Select (currently in preview) enables applications to retrieve only a subset of data from an object by using simple SQL expressions.
These work on compressed (gzip) files too, to save storage space.
See:
Welcome - Amazon Athena
S3 Select and Glacier Select – Retrieving Subsets of Objects
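For instance, here is a hedged sketch of the Athena route; the bucket path and column list are placeholders for whatever the Data Pipeline export actually produces. You define an external table over the CSV export once, and can then pull back a single row by its key without touching the rest of the 600 GB:

CREATE EXTERNAL TABLE legacy_archive (
  row_id BIGINT,
  col_a  STRING,
  col_b  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-archive-bucket/legacy-export/';

SELECT * FROM legacy_archive WHERE row_id = 10239;

With this approach there is no need to pre-split the export into 100-row files with predictable names; Athena scans the objects for you.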
I have a pipe delimited text file that is 360GB, compressed (gzip).
It has over 1,620 columns. I can't show the exact field names, but here's basically what it is:
primary_key|property1_name|property1_value|property800_name|property800_value
12345|is_male|1|is_college_educated|1
Seriously, there are over 800 of these property name/value fields.
There are roughly 280 million rows.
The file is in an S3 bucket.
I need to get the data into Redshift, but the column limit in Redshift is 1,600.
The users want me to pivot the data. For example:
primary_key|key|value
12345|is_male|1
12345|is_college_educated|1
What is a good way to pivot the file in the AWS environment? The data is in a single file, but I'm planning on splitting it into many different files to allow for parallel processing.
I've considered using Athena. I couldn't find anything that states the maximum number of columns allowed by Athena. But, I found a page about Presto (on which Athena is based) that says “there is no exact hard limit, but we've seen stuff break with more than few thousand.” (https://groups.google.com/forum/#!topic/presto-users/7tv8l6MsbzI).
Thanks.
First, pivot your data, then load to Redshift.
In more detail, the steps are:
Run a Spark job (using EMR or possibly AWS Glue) which reads in your source S3 data and writes out a pivoted version to a different S3 folder. By this I mean that if a row has 800 name/value pairs, you would write out 800 rows. At the same time, you can split the output into multiple files to enable a parallel load (see the sketch after the next step).
"COPY" this pivoted data into Redshift
What I have learnt most of the time with AWS is that if you are hitting a limit, you are doing it the wrong way or not in a scalable way; most of the time the services are designed with scalability and performance in mind.
We had a similar problem, with 2,000 columns. Here is how we solved it:
Split the file across 20 different tables, with 100 columns plus the primary key in each.
Do a select across all those tables in a single query to return all the data you want.
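As a small illustration of that second step (table names are invented, and the two property columns are borrowed from the example in the question), each narrow table carries the primary key, so the original wide row can be reassembled with joins only when it is actually needed:

SELECT p1.primary_key,
       p1.is_male,
       p2.is_college_educated
FROM properties_part_01 p1
JOIN properties_part_02 p2
  ON p1.primary_key = p2.primary_key;
-- join further part tables in the same way only when their columns are needed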
If you say you want to see all 1,600 columns in a single select, then the business user is probably looking at the wrong columns for their analysis, or even for machine learning.
To load 10 TB+ of data, we split the data into multiple files and loaded them in parallel; that way the load was faster.
Between Athena and Redshift, performance is the only real difference; the rest is much the same. Redshift performs better than Athena: Athena's initial load time and scan time are higher than Redshift's.
Hope it helps.
I have a star-schema-like database structure: one fact table holding all the IDs and surrogate keys, and multiple dimension tables holding the actual ID, code, and descriptions for the IDs referenced in the fact table.
We are moving all of these tables (fact and dimensions) to S3 individually, and each table's data is split into multiple Parquet files under its own S3 location (one location per table).
Question: I need to perform a transformation in the cloud, i.e. strip out all the IDs and surrogate keys referenced in the fact table, replace them with the actual codes residing in the dimension tables, create another file, and store the final output back in an S3 location. This file will later be consumed by Redshift for analytics.
My questions:
What's the best way to achieve this, given that for cost and storage optimisation I don't need the raw data (surrogate keys and IDs) in Redshift?
Do we need to first combine these split (Parquet) files into one large file before performing the data transformation? Also, after the transformation I am planning to save the final output in Parquet format, but the catch is that Redshift doesn't allow COPY of Parquet files, so is there a workaround for that?
I am not a hardcore programmer and want to avoid using Scala/Python on EMR, but I am good at SQL, so is there a way to perform the data transformation in the cloud through SQL (for example on EMR) and save the output to a file or files? Please advise.
You should be able to run Redshift-style queries directly against your S3 Parquet data by using Amazon Athena.
Some information on that:
https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
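If it helps, here is a hedged, SQL-only sketch of the transformation in Athena; every table, column, and bucket name below is a placeholder for your actual star schema. A CREATE TABLE AS SELECT joins the fact table to its dimensions and writes the denormalised result straight back to S3 as Parquet, which Athena (or Redshift Spectrum) can then query:

CREATE TABLE denormalised_fact
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/denormalised/'
) AS
SELECT c.customer_code,
       p.product_code,
       f.quantity,
       f.amount
FROM fact_sales   f
JOIN dim_customer c ON f.customer_skey = c.customer_skey
JOIN dim_product  p ON f.product_skey  = p.product_skey;

There is no need to merge the split Parquet files first; Athena reads all the files under each table's S3 location.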