Upload Spark RDD to REST webservice POST method - web-services

Frankly, I'm not sure if this feature exists; sorry for that.
My requirement is to send Spark-analysed data to a file server on a daily basis. The file server supports file transfer through SFTP and a REST webservice POST call.
My initial thought was to save the Spark RDD to HDFS and transfer it to the file server through SFTP.
I would like to know whether it is possible to upload the RDD directly by calling the REST service from the Spark driver class, without saving to HDFS.
The size of the data is less than 2 MB.
Sorry for my bad English!

There is no specific way to do that with Spark. With that kind of data size it will not be worth it to go through HDFS or another type of storage. You can collect that data in your driver's memory and send it directly. For a POST call you can just use plain old java.net.URL, which would look something like this:
import java.net.{URL, HttpURLConnection}
// The RDD you want to send
val rdd = ???
// Collect the data to the driver and turn it into a string with newlines
val body = rdd.collect.mkString("\n")
// Open a connection
val url = new URL("http://www.example.com/resource")
val conn = url.openConnection.asInstanceOf[HttpURLConnection]
// Configure for a POST request
conn.setDoOutput(true)
conn.setRequestMethod("POST")
// Write the collected data to the request body
val os = conn.getOutputStream
os.write(body.getBytes("UTF-8"))
os.flush()
os.close()
// Reading the response code actually sends the request
val responseCode = conn.getResponseCode
A much more complete discussion of using java.net.URL can be found at this question. You could also use a Scala library to handle the ugly Java stuff for you, like akka-http or Dispatch.

Spark itself does not provide this functionality (it is not a general-purpose HTTP client).
You might consider using an existing REST client library such as akka-http, spray, or some other Java/Scala client library.
That said, you are by no means obliged to save your data to disk before operating on it. You could, for example, use the collect() or foreach() methods on your RDD in combination with your REST client library.
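To illustrate that second option, here is a minimal PySpark sketch (the endpoint URL is hypothetical, and it assumes the requests library is available on the executors) that posts each partition straight from the executors via foreachPartition, so nothing needs to be collected on the driver:
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-rest").getOrCreate()
sc = spark.sparkContext

# Example RDD standing in for the analysed data
rdd = sc.parallelize(["record 1", "record 2", "record 3"], 2)

def post_partition(records):
    # Runs on the executors: POST one partition as a newline-separated body
    body = "\n".join(records)
    if body:
        resp = requests.post(
            "http://www.example.com/resource",  # hypothetical endpoint
            data=body.encode("utf-8"),
            headers={"Content-Type": "text/plain"},
            timeout=30,
        )
        resp.raise_for_status()

# Each executor sends its own partitions; the driver never holds the full payload
rdd.foreachPartition(post_partition)
For a payload under 2 MB, collecting on the driver as in the Scala example above is just as reasonable; foreachPartition mainly helps if the data grows or the endpoint accepts many small uploads.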

Related

Correct way to fetch data from an aws server into a flutter app?

I have a general understanding question. I am building a Flutter app that relies on a content library containing text files, LaTeX equations, images, PDFs, videos, etc.
The content lives on an AWS Amplify backend. Depending on the user's navigation in the app, the corresponding data is fetched and displayed.
I am not sure about the correct way of fetching the data. The current method (which works) is that the data is stored in an S3 bucket. When data is requested, it is downloaded to a temporary directory and then opened and processed in the app. This is actually not slow, but I feel that it is not the way it should be done.
When data is downloaded, a file transfer notification pops up, which bothers me because it is shown all the time. I would also like to read the data directly with something like a GET request, without downloading the file first (especially for text files, which I would like to read directly into a String). But I don't know how that works, because I don't see a way to store data in a file system with the other Amplify services like DataStore or the REST API. On the other hand, an S3 bucket is an intuitive way of storing data that is easy to use for the content creators at my company, so to me S3 seems like the way to go. However, with S3 I have only figured out the download method for fetching data.
Could someone give me a hint on what is the correct approach for this use case? Thank you very much!

Is there a way to deal with changes in log schema?

I am in a situation where I need to extract log JSON data, which might have changes in its data structure, to AWS S3 in a real-time manner.
I am thinking of using AWS S3 + AWS Glue Streaming ETL. The thing is, the structure or schema of the log JSON data might change (and these changes are unpredictable), so my solution needs to be aware of such changes and should still stream the log data smoothly without causing errors. But as far as I know, all the AWS Glue tutorials show the demo as if there are no changes in the structure of the incoming data.
Can you recommend a solution within AWS that's suitable for my case?
Thanks.

Need recommendation to create an API by aggregating data from multiple source APIs

Before I start doing this I wanted to get advice from the community on the best and most efficient manner to go about doing it.
Here is what I want to do:
Ingest data from multiple APIs that return JSON
Store it in either S3 or DynamoDB
Modify the data to use my JSON structure
Pipe out the aggregate data as an API
The data will be updated twice a day, so I would pull in the data from the source APIs and put it through my pipeline twice a day.
So basically I want to create an API by aggregating data from multiple source APIs.
I've started playing with Lambda and created the following function using Python.
# https://stackoverflow.com/a/41765656
import requests
import json

def lambda_handler(event, context):
    # https://www.nylas.com/blog/use-python-requests-module-rest-apis/ USEFUL!!!
    # https://stackoverflow.com/a/65896274
    response = requests.get("https://remoteok.com/api")
    # print(response.json())
    return {
        'statusCode': 200,
        'body': response.json()
    }

# https://stackoverflow.com/questions/63733410/using-lambda-to-add-json-to-dynamodb DYNAMODB
This works and returns a JSON response.
Here are my questions:
Should I store the data on S3 or DynamoDB?
Which AWS service should I use to aggregate the data into my JSON structure?
Which service should I use to publish the aggregate data as an API, API Gateway?
However, before I go further I would like to know the best way to go about doing this.
If you have experience with this I would love to hear from you.
The answer will vary depending on the quantity of data you're planning to mine. Lambdas are designed for short-duration, high-frequency workloads and thus might not be suitable.
I would recommend looking into AWS Glue, as this seems like a fairly typical ETL (Extract, Transform, Load) problem. You can set up Glue jobs to run on a schedule, and as for data aggregation, that's the T in ETL.
It's simple to output the Glue DataFrame (the result of a transformation) as S3 files, which can then be queried directly by Amazon Athena (as if they were database content).
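As a rough sketch of that Glue-to-S3 step (the bucket names and prefixes are hypothetical, and the awsglue imports only resolve inside a Glue job environment), the skeleton of such a job might look like this; the output prefix can then be registered for Athena with a Glue crawler or a manually defined table:
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the raw JSON that was landed in S3 (hypothetical bucket/prefix)
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-aggregator-bucket/raw/"]},
    format="json",
)

# ... transformations (the T in ETL) would go here ...

# Write the aggregated result back to S3 as JSON files for Athena to query
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-aggregator-bucket/aggregated/"},
    format="json",
)

job.commit()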
As for exposing that data via an API, the serverless framework or SST are great tools for taking the sting out of spinning up a serverless API and associated resources.

ELK stack (Elasticsearch, Logstash, Kibana) - is logstash a necessary component?

We're currently processing daily mobile app log data with AWS Lambda and posting it into Redshift. The Lambda structures the data, but it is essentially raw. The next step is to do some actual processing of the log data into sessions etc., for reporting purposes. The final step is to have something do feature engineering, and then use the data for model training.
The steps are
Structure the raw data for storage
Sessionize the data for reporting
Feature engineering for modeling
For step 2, I am looking at using QuickSight and/or Kibana to create a reporting dashboard. But the typical stack as I understand it is to do the log processing with Logstash, then have it go to Elasticsearch and finally to Kibana/QuickSight. Since we're already handling the initial log processing through Lambda, is it possible to skip this step and pass it directly into Elasticsearch? If so, where does this happen: in the Lambda function, or from Redshift after it has been stored in a table? Or can Elasticsearch just read it from the same S3 bucket where I'm posting the data for ingestion into a Redshift table?
Elasticsearch uses JSON to perform all operations. For example, to add a document to an index, you use a PUT operation (copied from docs):
PUT twitter/_doc/1
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
Logstash exists to collect log messages, transform them into JSON, and make these PUT requests. However, anything that produces correctly-formatted JSON and can perform an HTTP PUT will work. If you already invoke Lambdas to transform your S3 content, then you should be able to adapt them to write JSON to Elasticsearch. I'd use separate Lambdas for Redshift and Elasticsearch, simply to improve manageability.
Performance tip: you're probably processing lots of records at a time, in which case the bulk API will be more efficient than individual PUTs. However, there is a limit on the size of a request, so you'll need to batch your input.
Also: you don't say whether you're using an AWS Elasticsearch cluster or self-managed. If the former, you'll also have to deal with authenticated requests, or use an IP-based access policy on the cluster. You don't say what language your Lambdas are written in, but if it's Python you can use the aws-requests-auth library to make authenticated requests.
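For concreteness, a bulk call from a Python Lambda might look roughly like the sketch below (the domain endpoint, index name, and record shape are all hypothetical, and the requests dependency has to be packaged with the function; if the domain requires signed requests, an auth object from aws-requests-auth can be passed via the auth parameter of requests.post):
import json
import requests

ES_ENDPOINT = "https://search-my-domain.us-east-1.es.amazonaws.com"  # hypothetical domain

def lambda_handler(event, context):
    # Suppose earlier processing produced a list of dicts, one per log record
    records = [
        {"user": "kimchy", "message": "trying out Elasticsearch"},
        {"user": "someone", "message": "another record"},
    ]

    # Build an NDJSON bulk body: an action line followed by the document, per record
    lines = []
    for doc in records:
        lines.append(json.dumps({"index": {"_index": "app-logs"}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"  # the bulk API requires a trailing newline

    resp = requests.post(
        f"{ES_ENDPOINT}/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        timeout=30,
    )
    resp.raise_for_status()
    result = resp.json()
    if result.get("errors"):
        # Individual items can fail even when the HTTP request itself succeeds
        raise RuntimeError("Some bulk items failed: " + json.dumps(result))
    return {"indexed": len(records)}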

Serialize Json / Web Services to Observable Collection Model

I want to ask: I've consumed a web service API and then serialized the response into an ObservableCollection of my Model.
My question is how I can use this ObservableCollection everywhere, so I don't have to call/get/consume the web service every time.
So I'd just call the API one time and then use the data every time without calling the API again?
Thanks
As @thang mentioned above, there are many ways to store the data in the app so you don't have to call the web service each time.
I will suggest the way I am doing it:
1. When I retrieve the JSON data from the Web API, I parse it into an ObservableCollection:
ObservableCollection<User> usersList = JsonConvert.DeserializeObject<ObservableCollection<User>>(responseJson);
2. Once I have my list, I can also save the serialized objects (in JSON format) to a text file (remember that JSON is just a string):
private async void saveUsersToFile(string serializedUsersListAsJson)
{
    StorageFolder storageFolder = ApplicationData.Current.LocalFolder;
    StorageFile usersFile = await storageFolder.CreateFileAsync("users.txt", CreationCollisionOption.OpenIfExists);
    await FileIO.WriteTextAsync(usersFile, serializedUsersListAsJson);
}
This step allows you to store the data even if the app is closed and relaunched.
3. When you launch the app, you can invoke the method below to read the data from the file:
private async void retrieveNotes()
{
    StorageFolder storageFolder = ApplicationData.Current.LocalFolder;
    StorageFile usersFile = await storageFolder.CreateFileAsync("users.txt", CreationCollisionOption.OpenIfExists);
    string serializedUsersList = await FileIO.ReadTextAsync(usersFile);
    // Deserialize the JSON list to the ObservableCollection:
    if (serializedUsersList != null)
    {
        var usersList = JsonConvert.DeserializeObject<ObservableCollection<User>>(serializedUsersList);
    }
}
4. The last step is to declare an ObservableCollection field in the Pages where you need to use it. For instance, if you need to pass this list between Pages, you can just use:
Frame.Navigate(typeof(MainPage), usersList);
Remember to read the data from the file once the app is launched. After that you can just use it while the app is running. My suggestion is to cache the data each time you connect to the Web API to retrieve new data.
Hope this will help. If you want to read more about data storage, please read the post below on my blog:
https://mobileprogrammerblog.wordpress.com/2016/05/23/universal-windows-10-apps-data-storage/
To save the data for the next time the user opens the app:
Store the data in a local SQLite database, or serialize the collection to a local file to use later.
To use the data in the same session:
Store the data in a common object, and retrieve it every time you need to initialize the ViewModel.
To use the data across Windows 10 devices:
Store the file / database in OneDrive and sync when needed.
If the data size is small and you don't have a critical need for 100% synced data, store it inside the roaming folder.