Incremental update in Sqoop - HDFS

Hi, I loaded data from MySQL into HDFS via the Sqoop connector. Now, if a row in the existing data gets updated, is there any way to update the value of that existing row in Sqoop? I am aware of incremental imports. Will an incremental import also update existing rows? I am new to Sqoop.

Yes, but you should use the lastmodified mode when you perform your incremental imports. According to the documentation:
An alternate table update strategy supported by Sqoop is called
lastmodified mode. You should use this when rows of the source table
may be updated, and each such update will set the value of a
last-modified column to the current timestamp. Rows where the check
column holds a timestamp more recent than the timestamp specified with
--last-value are imported.
At the end of an incremental import, the value which should be
specified as --last-value for a subsequent import is printed to the
screen. When running a subsequent import, you should specify
--last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import
as a saved job, which is the preferred mechanism for performing a
recurring incremental import. See the section on saved jobs later in
this document for more information.
Bear in mind that this mode requires a column which holds a date value (such as date, time, datetime and timestamp).
This answer shows an alternative strategy for importing updated existing values using --merge-key.
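For illustration, here is a minimal sketch of such an import; the connection string, table name, and the column names last_updated and id are assumptions, so substitute your own. The --merge-key flag is what folds updated rows into the rows already present in the target directory:
# incremental import in lastmodified mode, merging updated rows on the primary key
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop_user -P \
  --table orders \
  --target-dir /data/orders \
  --incremental lastmodified \
  --check-column last_updated \
  --last-value "2021-01-01 00:00:00" \
  --merge-key id
After the run, Sqoop prints the --last-value to use for the next import, or you can wrap the whole thing in a saved job so this bookkeeping is handled for you.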

Related

How to fetch the latest schema change in BigQuery and restore deleted column within 7 days

Right now I fetch the columns and data types of BQ tables via the command below:
SELECT COLUMN_NAME, DATA_TYPE
FROM `Dataset`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name="User"
But if I drop a column using the command ALTER TABLE User DROP COLUMN blabla,
the column blabla is not actually deleted for 7 days (TTL), according to the official documentation.
If I use the above command, the column is still there in the schema as well as in the table Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS.
It is just that I cannot insert data into that column or view it in the GCP console. This inconsistency really causes an issue.
I want to write a bash script to monitor schema changes and perform some operations based on them,
so I need more visibility into the table schema of BigQuery. The least I need is for
Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS to store a flag column that indicates deleted or TTL: 7 days.
My questions are:
How can I fetch the correct schema in Spanner which reflects the recently deleted column?
If the column is not actually deleted, is there any way to easily restore it?
If you want to fetch the recently deleted column you can try searching through Cloud Logging. I'm not sure what tools Spanner supports but if you want to use Bash you can use gcloud to fetch logs. Though it will be difficult to parse the output and get the information you want.
The command below fetches the logs for google.cloud.bigquery.v2.JobService.InsertJob, since an ALTER TABLE is considered an InsertJob, and filters them based on the actual query text containing drop. The regex I used is not strict (for the sake of the example); I suggest making it stricter.
gcloud logging read 'protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" AND protoPayload.metadata.jobChange.job.jobConfig.queryConfig.query=~"Alter table .*drop.*"'
The output of the command above includes the query that dropped the column (in this example, the column PADDING was dropped).
If you have options other than Bash, I suggest creating a BigQuery sink for your logs; you can then run queries there to get this information. You can also use client libraries such as Python, NodeJS, etc. to either query the sink or query GCP Logging directly.
As per this SO answer, you can use the time travel feature of BigQuery to query the deleted column. That answer also explains BigQuery's behavior of retaining a deleted column for 7 days and a workaround to delete the column instantly; see the linked answer for the actual query used to retrieve the deleted column and for the workaround.
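For reference, here is a rough sketch of what such a time travel query could look like, reusing the Dataset, User and blabla names from the question; the interval is an assumption and must point to a moment before the column was dropped:
-- query the table as it looked one hour ago, when the dropped column still existed
SELECT blabla
FROM `Dataset.User`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);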

Is it possible to delete an entire table stored in S3 buckets from an Athena query?

I want a table to store the history of an object for a week and then replace it with the history of the next week. What would be the best way to achieve this in AWS?
The data is stored in JSON format in S3 as a weekly dump. The pipeline runs the script once a week and dumps data into S3 for analysis. For the next run of the script I do not need the previous week's (week-1) data, so it needs to be replaced with the new week's (week-2) data. The schema of the table remains constant but the data changes every week.
I would recommend using data partitioning to solve your issue without deleting the underlying S3 files from previous weeks (which is not possible via an Athena query).
Thus, the idea is to use a partition key based on the date, and then use this partition key in the WHERE clause of your Athena query, which will cause Athena to ignore previous files (those not under the latest partition).
For example, if you use the file dump date as the partition key (let's say we choose to name it dump_key), your files will have to be stored in subfolders like
s3://your-bucket/subfolder/dump_key=2021-01-01-13-00/files.csv
s3://your-bucket/subfolder/dump_key=2021-01-07-13-00/files.csv
Then, during your data processing, you'll first need to create your table and specify a partition key with the PARTITIONED BY option.
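As a minimal sketch of such a table definition (the column names and the JSON SerDe are assumptions; replace them with your actual schema and format):
-- external table over the weekly JSON dumps, partitioned by the dump date
CREATE EXTERNAL TABLE your_table (
  object_id string,
  payload string
)
PARTITIONED BY (dump_key string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/subfolder/';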
Then, you'll have to make sure you add a new partition using the ALTER TABLE ... ADD PARTITION command every time it's necessary for your use case:
ALTER TABLE your_table ADD PARTITION (dump_key='2021-01-07-13-00') location 's3://your-bucket/subfolder/dump_key=2021-01-07-13-00/'
Then you'll be able to query your table by filtering previous data using the right WHERE clause:
SELECT * FROM your_table WHERE dump_key >= '2021-01-05-00-00'
This will cause Athena to ignore files in previous partitions when querying your table.
Documentation here:
https://docs.aws.amazon.com/athena/latest/ug/partitions.html

How to get rid of __key__ columns in BigQuery table for every 'Record' Type field?

For every 'Record' type field in my Firestore table, BigQuery automatically adds __key__ columns. I do not want these added for each 'Record' type field. How can I get rid of these extra columns that BigQuery adds automatically? (I want to remove the __key__ columns from my BigQuery table schema.)
This is intended behavior; citing the BigQuery GCP documentation:
Each document in Firestore has a unique key that contains
information such as the document ID and the document path. BigQuery
creates a RECORD data type (also known as a STRUCT) for the key,
with nested fields for each piece of information, as described in the
following table.
Because the Firestore export method is fully integrated with the GCP managed import and export service, you can't change this behavior to exclude the __key__.* properties that are created for each RECORD field in the target BigQuery table.
I suspect that in your use case, modifying the BigQuery table will require some hands-on intervention, since it means manually changing the schema.
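If the goal is only to hide these columns from consumers rather than remove them from the export, one partial hands-on workaround is a view; this is just a sketch with assumed dataset and table names, and note that SELECT * EXCEPT only drops top-level columns, so key fields nested inside other RECORDs would still have to be projected out explicitly:
-- view that exposes the export table without the top-level __key__ column
CREATE VIEW `my_dataset.my_view` AS
SELECT * EXCEPT (__key__)
FROM `my_dataset.firestore_export_table`;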
To request this capability, I would encourage you to raise a feature request with the vendor via the Google public issue tracker.

Read Spanner data from a table which is simultaneously being written to

I'm copying Spanner data to BigQuery through a Dataflow job. The job is scheduled to run every 15 minutes. The problem is that if the data is read from a Spanner table which is also being written to at the same time, some of the records get missed while copying to BigQuery.
I'm using readOnlyTransaction() while reading Spanner data. Is there any other precaution that I must take while doing this activity?
It is recommended to use Cloud Spanner commit timestamps to populate columns like update_date. Commit timestamps allow applications to determine the exact ordering of mutations.
By using commit timestamps for update_date and reading at an exact timestamp, the Dataflow job will be able to find all records written/committed since the previous run.
https://cloud.google.com/spanner/docs/commit-timestamp
https://cloud.google.com/spanner/docs/timestamp-bounds
if the data is read from a Spanner table which is also being written at the same time, some of the records get missed while copying to BigQuery
This is how transactions work. They present a 'snapshot view' of the database at the time the transaction was created, so any rows written after this snapshot is taken will not be included.
As @rose-liu mentioned, using commit timestamps on your rows and keeping track of the timestamp of your last export (available from the ReadOnlyTransaction object) will allow you to accurately select 'new/updated rows since the last export'.
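Putting the two answers together, a minimal sketch (table and column names are assumptions): give the table a commit-timestamp column, have the export read at an exact timestamp, and select only rows committed after the previous run's read timestamp.
-- commit timestamp column, filled in automatically at commit time
CREATE TABLE orders (
  order_id INT64 NOT NULL,
  status STRING(32),
  update_date TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp = true)
) PRIMARY KEY (order_id);

-- in the export job, with @last_export_time set to the previous run's read timestamp
SELECT * FROM orders WHERE update_date > @last_export_time;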

Reflecting changes on big tables in HDFS

I have an order table in the OLTP system.
Each order record has an OrderStatus field.
When end users create an order, the OrderStatus field is set to "Open".
When somebody cancels the order, the OrderStatus field is set to "Canceled".
When an order process is finished (the order is transformed into an invoice), the OrderStatus field is set to "Close".
There are more than one hundred million records in the table in the OLTP system.
I want to design and populate a data warehouse and data marts on the HDFS layer.
In order to design the data marts, I need to import the whole order table to HDFS and then reflect changes on the table continuously.
First, I can import the whole table into HDFS in the initial load process by using Sqoop. It may take a long time, but I will only do this once.
When an order record is updated or a new order record is entered, I need to reflect the changes in HDFS. How can I achieve this in HDFS for such a big transaction table?
Thanks
One of the easier ways is to work with database triggers in your OLTP source DB: every time an update happens, use a trigger to push an update event to your Hadoop environment.
On the other hand (this depends on the requirements for your data users) it might be enough to reload the whole data dump every night.
Also, if there is some kind of last-changed timestamp, it might be possible to load only the newest data and do some kind of delta check.
This all depends on your data structure, your requirements and the resources at hand.
There are several other ways to do this, but usually those involve messaging, development and new servers, and I suppose in your case this infrastructure or those resources are not available.
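For the trigger-based option, a rough sketch of what it could look like in MySQL; the order_changes staging table and the column names are assumptions, and the staging table would then be exported to Hadoop (for example with Sqoop) and truncated after each load:
-- staging table that records every order change
CREATE TABLE order_changes (
  OrderId INT NOT NULL,
  OrderStatus VARCHAR(16),
  ChangedAt TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- trigger that logs each update to the orders table
CREATE TRIGGER orders_after_update
AFTER UPDATE ON orders
FOR EACH ROW
INSERT INTO order_changes (OrderId, OrderStatus) VALUES (NEW.OrderId, NEW.OrderStatus);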
EDIT
Since you have a last changed date, you might be able to pull the data with a statement like
SELECT columns FROM table WHERE lastchangedate > (now - 24 hours)
or whatever your interval for loading might be.
Then process the data with Sqoop, ETL tools, or the like. If a record is already available in your Hadoop environment, you will want to UPDATE it. If it is not available, INSERT it with your appropriate mechanism. This is also sometimes called UPSERTING.
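As a sketch, this delta load could be set up as a Sqoop saved job, which tracks --last-value between runs automatically and merges changed rows on the key; the connection string, table and column names below are assumptions:
# create the job once; Sqoop stores and updates --last-value after each successful run
sqoop job --create order_delta -- import \
  --connect jdbc:mysql://oltp.example.com/sales \
  --username sqoop_user -P \
  --table Orders \
  --target-dir /warehouse/orders \
  --incremental lastmodified \
  --check-column LastChangedDate \
  --merge-key OrderId

# run it on whatever schedule fits, e.g. nightly
sqoop job --exec order_delta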