How to pull data from an API and store it in HDFS

I am aware of Flume and Kafka, but these are event-driven tools. I don't need the import to be event-driven or real-time; scheduling it once a day would be enough.
What data ingestion tools are available for importing data from APIs into HDFS?
I am not using HBase either, just HDFS and Hive.
I have used the R language for this for quite some time, but I am looking for a more robust, perhaps Hadoop-native, solution.

Look into using Scala or Python for this. There are a couple of ways to approach pulling from an API into HDFS. The first approach is to write a script that runs on your edge node (essentially just a Linux server), pulls data from the API, and lands it in a directory on the Linux file system. The script can then use HDFS file system commands to put the data into HDFS.
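For illustration, here is a minimal Python sketch of that first approach. The API endpoint, local path, and HDFS directory are placeholders, and it assumes the Hadoop client is installed on the edge node:

```python
# Sketch of the edge-node approach: pull from a (placeholder) REST API,
# land the response on the local file system, then push it into HDFS
# with the standard `hdfs dfs -put` command.
import subprocess
from datetime import date

import requests

API_URL = "https://api.example.com/v1/records"   # placeholder endpoint
LOCAL_PATH = f"/tmp/records_{date.today()}.json"
HDFS_DIR = "/data/raw/records"                   # placeholder HDFS directory

response = requests.get(API_URL, timeout=60)
response.raise_for_status()

with open(LOCAL_PATH, "w") as f:
    f.write(response.text)

# Put the landed file into HDFS (requires the Hadoop client on the edge node).
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", LOCAL_PATH, HDFS_DIR], check=True)
```

Scheduled from cron once a day, this covers the non-real-time requirement in the question.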
The second approach is to use Scala or Python with Spark to call the API and load the data directly into HDFS via a spark-submit job. Again, this script would be run from an edge node; it just uses Spark to avoid having to land the data on the local file system first.
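A rough PySpark sketch of this second approach, fanning the API calls out across executors. The endpoint, account IDs, and HDFS path are placeholders, and the requests package is assumed to be available on the executors:

```python
# Submit with: spark-submit pull_api.py
import json

import requests
from pyspark.sql import SparkSession

BASE_URL = "https://api.example.com/v1/accounts/{}/records"  # placeholder endpoint
HDFS_PATH = "hdfs:///data/raw/records"                       # placeholder target path
ACCOUNT_IDS = [1, 2, 3, 4]                                   # placeholder IDs

def fetch(account_id):
    """Fetch one account's records; runs on the executors (requests must be installed there)."""
    resp = requests.get(BASE_URL.format(account_id), timeout=60)
    resp.raise_for_status()
    return [json.dumps(rec) for rec in resp.json()]

spark = SparkSession.builder.appName("api-to-hdfs").getOrCreate()
sc = spark.sparkContext

# One partition per account ID so the API calls run in parallel.
json_strings = sc.parallelize(ACCOUNT_IDS, numSlices=len(ACCOUNT_IDS)).flatMap(fetch)
df = spark.read.json(json_strings)   # spark.read.json accepts an RDD of JSON strings

df.write.mode("overwrite").json(HDFS_PATH)
spark.stop()
```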
The first option is easier to implement. The second option is worth looking into if you have large data volumes or an API that can be parallelized by making calls for multiple IDs/accounts at once.

Related

Data streaming from Raspberry Pi CSV files to a BigQuery table

I have some CSV files generated by a Raspberry Pi that need to be pushed into BigQuery tables.
Currently, we have a Python script using bigquery.LoadJobConfig for batch uploads, and I run it manually. The goal is to have streaming data (or a load every 15 minutes) in a simple way.
I explored different solutions:
Using Airflow to run the Python script (high complexity and maintenance)
Dataflow (I am not familiar with it, but if it does the job I will use it)
Scheduling a pipeline to run the script through GitLab CI (cron syntax: */15 * * * *)
Could you please suggest the best way to push CSV files into BigQuery tables in real time or every 15 minutes?
Good news, you have many options! Perhaps the easiest would be to automate the Python script that you currently have, since it already does what you need. Assuming you are running it manually on a local machine, you could upload it to a lightweight VM on Google Cloud, then use cron on the VM to automate running it. I have used this approach in the past and it worked well.
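As a rough sketch, the script on the VM might look something like this. The table name and file path are placeholders, and this is only one way to shape the load job:

```python
# load_csv.py - sketch of a batch load using LoadJobConfig, close to what the
# question describes. Table ID and file path are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.sensor_data.readings"      # placeholder table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

with open("/data/readings.csv", "rb") as f:       # placeholder file
    job = client.load_table_from_file(f, table_id, job_config=job_config)

job.result()   # wait for the load to finish
```

On the VM, a crontab entry such as */15 * * * * python3 /home/user/load_csv.py (path is a placeholder) would run it every 15 minutes.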
Another option would be to deploy your Python code as a Google Cloud Function, a way to let GCP run the code without you having to worry about maintaining the backend resources.
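If the CSVs land in a Cloud Storage bucket first (an assumption, since the question's files live on the Pi), a GCS-triggered function could run the same load automatically. Bucket and table names below are placeholders:

```python
# main.py - sketch of a GCS-triggered Cloud Function (background function
# signature) that loads each newly uploaded CSV into BigQuery.
from google.cloud import bigquery

TABLE_ID = "my-project.sensor_data.readings"      # placeholder table

def load_csv_to_bq(event, context):
    """Triggered by a google.storage.object.finalize event."""
    if not event["name"].endswith(".csv"):
        return
    uri = f"gs://{event['bucket']}/{event['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.load_table_from_uri(uri, TABLE_ID, job_config=job_config).result()
```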
Find out more about Cloud Functions here: https://cloud.google.com/functions
A third option, depending on where your .csv files are being generated, perhaps you could use the BigQuery Data Transfer service to handle the imports into BigQuery.
More on that here: https://cloud.google.com/bigquery/docs/dts-introduction
Good luck!
Adding to Ben's answer, you can also use Cloud Composer to orchestrate this workflow. It is built on Apache Airflow, so you get Airflow-native tools such as the Airflow web interface, command-line tools, and the Airflow scheduler, without having to worry about infrastructure and maintenance.
You can implement a DAG to:
upload the CSV from local storage to GCS, then
load from GCS to BigQuery using GCSToBigQueryOperator (a sketch follows below).
More on Cloud Composer
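As a hedged sketch of such a DAG: the bucket, table, file paths, and the 15-minute schedule are placeholders, and the upload step assumes the CSV has already been made available to the Airflow workers:

```python
# Sketch of a Composer/Airflow DAG for the two steps described above.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="csv_to_bigquery",
    start_date=datetime(2023, 1, 1),
    schedule_interval="*/15 * * * *",   # every 15 minutes, as in the question
    catchup=False,
) as dag:

    upload_csv = LocalFilesystemToGCSOperator(
        task_id="upload_csv_to_gcs",
        src="/data/readings.csv",                # placeholder path on the worker
        dst="incoming/readings.csv",
        bucket="my-staging-bucket",              # placeholder bucket
    )

    load_to_bq = GCSToBigQueryOperator(
        task_id="gcs_to_bigquery",
        bucket="my-staging-bucket",
        source_objects=["incoming/readings.csv"],
        destination_project_dataset_table="my-project.sensor_data.readings",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )

    upload_csv >> load_to_bq
```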

Faster development turnaround time with AWS Glue

AWS Glue looks promising, but I'm struggling with the development cycle time. If I edit PySpark scripts through the AWS console, it takes several minutes to run even on a minimal test dataset. That makes it hard to iterate quickly when I have to wait 3-5 minutes just to see whether I called the right method on glueContext or understood a particular DynamicFrame behavior.
What techniques would allow me to iterate faster?
I suppose I could develop Spark code locally and deploy it to Glue as an execution framework, but if I need to test code with Glue-specific extensions, I am stuck.
For developing and testing scripts, Glue has Development Endpoints, which you can use with notebooks such as Zeppelin installed either on a local machine or on an Amazon EC2 instance (other options are a REPL shell and PyCharm Professional).
Please don't forget to remove the endpoint when you are done testing, since you pay for it even while it is idle.
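If you script your workflow with boto3, a small clean-up call like the following makes that harder to forget (the endpoint name is a placeholder):

```python
# Delete a Glue development endpoint when you're done, since it bills while idle.
import boto3

glue = boto3.client("glue")
glue.delete_dev_endpoint(EndpointName="my-dev-endpoint")   # placeholder name
```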
I keep the PySpark code in a separate class/module file and the Glue code in another file; we use Glue only for reading and writing data. We do test-driven development with pytest on a local machine, so there is no need for a dev endpoint or Zeppelin. Once all syntactic and business-logic bugs are fixed in the PySpark code, end-to-end testing is done with Glue. We also wrote a shell script that uploads the latest code to an S3 bucket, from which the Glue job is run (see the links and the sketch below).
https://github.com/fatangare/aws-glue-deploy-utility
https://github.com/fatangare/aws-python-shell-deploy
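As a minimal illustration of that split (the function, file names, and test are hypothetical, and the two "files" are collapsed into one block for brevity), the pure PySpark logic can be tested locally with pytest while the Glue script only handles I/O:

```python
# transforms.py - pure PySpark logic, no Glue imports, testable locally.
import pytest
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def add_total(df: DataFrame) -> DataFrame:
    """Example business rule: total = price * quantity."""
    return df.withColumn("total", F.col("price") * F.col("quantity"))


# test_transforms.py - run with `pytest` on a local machine; in practice the
# transform lives in its own module imported by both the tests and the Glue script.
@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_add_total(spark):
    df = spark.createDataFrame([(2.0, 3)], ["price", "quantity"])
    result = add_total(df).collect()[0]
    assert result["total"] == 6.0
```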

Apache Spark/AWS EMR and tracking of processed files

I have an AWS S3 folder where a large number of JSON files is stored. I need to ETL these files with AWS EMR over Spark and store the transformed data in AWS RDS.
I have implemented the Spark job for this purpose in Scala and everything is working fine. I plan to execute this job once a week.
From time to time, external logic can add new files to the S3 folder, so the next time my Spark job starts I'd like to process only the new (unprocessed) JSON files.
Right now I don't know where to store information about which JSON files have already been processed, so that the Spark job can decide which files/folders to handle. Could you please advise me on the best practice (and how) to track these changes with Spark/AWS?
If it is a Spark Streaming job, checkpointing is what you are looking for; it is discussed here.
Checkpointing stores the state information (i.e. offsets, already-seen files, etc.) in an HDFS or S3 location, so when the job is started again, Spark picks up only the unprocessed files. Checkpointing also gives better fault tolerance in case of failures, as the state is handled automatically by Spark itself.
Again, checkpointing only works in the streaming mode of a Spark job.
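A PySpark sketch of that idea (the question's job is in Scala, but the same options exist there), assuming Structured Streaming with a file source over the S3 folder. The paths, schema, and JDBC settings are placeholders, and a one-shot trigger is used so the job can still be run weekly:

```python
# File-source stream with a checkpoint location: on each run, only files not
# yet recorded in the checkpoint are processed. Requires the JDBC driver on
# the classpath for the RDS write.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("incremental-json").getOrCreate()

# File sources require an explicit schema (placeholder fields).
schema = StructType([StructField("id", StringType()), StructField("payload", StringType())])

stream = (spark.readStream
          .schema(schema)
          .json("s3://my-bucket/input/"))          # placeholder folder

def write_batch(batch_df, batch_id):
    # Transform and push each micro-batch to RDS over JDBC (placeholder settings).
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://my-rds:5432/mydb")
        .option("dbtable", "events")
        .option("user", "etl")
        .option("password", "secret")
        .mode("append")
        .save())

(stream.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "s3://my-bucket/checkpoints/incremental-json/")
    .trigger(once=True)                            # process whatever is new, then stop
    .start()
    .awaitTermination())
```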

Google App Engine Parse Logs in DataStore Save to Table

I am new to GAE and I am trying to quickly find a way to retrieve logs from Datastore, clean them to my specs, and then save them to a table to be called on later for a reports view in my app. I was thinking of using Google Cloud Dataflow and creating batch jobs (the app is Python/Django), but the documentation does not seem to fit my use case, so maybe Dataflow is not the answer. I could create a Python script with BigQuery and schedule it through cron, but then I would have to contend with errors, and it seems there should be a faster way to solve this problem.
Any help/thoughts/suggestions is always greatly appreciated.
You can use the Dataflow/Beam Python SDK to develop a pipeline that reads entities from Datastore [1], transforms the data, and writes a table to BigQuery [2]. To schedule this job to run regularly you'll have to use an external mechanism such as a cron job. Note that Dataflow performs automatic scaling and retries to handle errors, so you are not expected to address these complexities manually.
[1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/datastore/v1/datastoreio.py
[2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
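A rough sketch of such a pipeline. The project, entity kind, output table, and schema are placeholders, and it uses the newer v1new Datastore connector rather than the v1 module linked in [1]:

```python
# Read entities of one kind from Datastore, reshape them, and write to BigQuery.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query

PROJECT = "my-project"          # placeholder project

def entity_to_row(entity):
    # Keep only the fields the report needs (placeholder property names).
    props = entity.properties
    return {"timestamp": str(props.get("timestamp")), "message": props.get("message")}

# Add runner="DataflowRunner" and region=... to run on Dataflow instead of locally.
options = PipelineOptions(project=PROJECT, temp_location="gs://my-bucket/tmp")

with beam.Pipeline(options=options) as p:
    (p
     | "ReadLogs" >> ReadFromDatastore(Query(kind="LogEntry", project=PROJECT))
     | "Clean" >> beam.Map(entity_to_row)
     | "WriteBQ" >> beam.io.WriteToBigQuery(
           "my-project:reports.cleaned_logs",
           schema="timestamp:STRING,message:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```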

How data gets into the HDFS file system

I am trying to understand how data from multiple sources and systems gets into HDFS. I want to push web server log files from 30+ systems. These logs are sitting on 18 different servers.
Thx
Veer
You can create a MapReduce job. The input for your mapper would be a file sitting on a server, and your reducer would determine which path to put the file under in HDFS. You can either aggregate all of your files in your reducer or simply write each file as-is to the given path.
You can use Oozie to schedule the job, or you can run it on demand by submitting the MapReduce job on the server that hosts the JobTracker service.
You could also create a Java application that uses the HDFS API. The FileSystem object can be used to do standard file system operations, like writing a file to a given path.
Either way, the write has to go through the HDFS API, because the NameNode is responsible for splitting the file into blocks and writing them to the distributed servers.
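The answer above points at the Java FileSystem API; as a rough Python counterpart, assuming WebHDFS is enabled and the hdfs PyPI package is acceptable (host, port, and paths are placeholders):

```python
# Push a local log file into HDFS over WebHDFS using the `hdfs` package.
from hdfs import InsecureClient

# Placeholder NameNode host/port (9870 is the default WebHDFS port on Hadoop 3).
client = InsecureClient("http://namenode-host:9870", user="hadoop")

# The NameNode decides how the file is split into blocks and which DataNodes
# receive them; the client just streams the bytes.
client.makedirs("/data/weblogs/server01")
client.upload("/data/weblogs/server01/access.log", "/var/log/httpd/access.log", overwrite=True)
```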