Dataflow Job - HTTP 400 Non Empty Data - google-cloud-platform

I launched a Dataflow batch job to load CSV data from GCS to Pub/Sub.
The Dataflow job is failing with the following log portion:
Error message from worker: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "One or more messages in the publish request is empty. Each message must contain either non-empty data, or at least one attribute.",
    "reason" : "badRequest"
  } ],
  "message" : "One or more messages in the publish request is empty. Each message must contain either non-empty data, or at least one attribute.",
  "status" : "INVALID_ARGUMENT"
}
com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:443)
com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1108)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:541)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:474)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:591)
org.apache.beam.sdk.io.gcp.pubsub.PubsubJsonClient.publish(PubsubJsonClient.java:138)
org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Write$PubsubBoundedWriter.publish(PubsubIO.java:1195)
org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Write$PubsubBoundedWriter.finishBundle(PubsubIO.java:1184)
So basically it's saying that at least one line is empty, but the CSV data contains some empty fields, not a fully empty line.
Here is a sample below:
2019-12-01 00:00:00 UTC,remove_from_cart,5712790,1487580005268456287,,f.o.x,6.27,576802932,51d85cb0-897f-48d2-918b-ad63965c12dc
2019-12-01 00:00:00 UTC,view,5764655,1487580005411062629,,cnd,29.05,412120092,8adff31e-2051-4894-9758-224bfa8aec18
2019-12-01 00:00:02 UTC,cart,4958,1487580009471148064,,runail,1.19,494077766,c99a50e8-2fac-4c4d-89ec-41c05f114554
2019-12-01 00:00:05 UTC,view,5848413,1487580007675986893,,freedecor,0.79,348405118,722ffea5-73c0-4924-8e8f-371ff8031af4
2019-12-01 00:00:07 UTC,view,5824148,1487580005511725929,,,5.56,576005683,28172809-7e4a-45ce-bab0-5efa90117cd5
2019-12-01 00:00:09 UTC,view,5773361,1487580005134238553,,runail,2.62,560109803,38cf4ba1-4a0a-4c9e-b870-46685d105f95
2019-12-01 00:00:18 UTC,cart,5629988,1487580009311764506,,,1.19,579966747,1512be50-d0fd-4a92-bcd8-3ea3943f2a3b
Any help?
Thanks
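The 400 means at least one published message had neither data nor attributes. Empty fields in a row are fine, since the row itself is still non-empty data; what triggers this error is a completely blank line in the input, for example stray extra newlines at the end of the file. A minimal local check, assuming you can pull down a copy of the file (the file name below is a placeholder):

# Quick local check (not the Dataflow job itself): look for lines that would
# become empty Pub/Sub messages.
with open("events.csv", "rb") as f:  # placeholder for a local copy of the GCS object
    for lineno, raw in enumerate(f, start=1):
        if not raw.rstrip(b"\r\n"):
            print(f"empty line at {lineno}")

If blank lines do turn up, dropping them with a filter step before the Pub/Sub write should avoid the error.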

Related

AWS CloudWatch: How to specify which field to use for the timestamp in JSON?

I have
datetime_format = "%Y-%m-%dT%H:%M:%S.%f%z"
in /etc/awslogs/awslogs.conf
And I have a log like this:
{
"level": "info",
"ts": "2023-01-08T21:46:03.381067Z",
"caller": "bot/bot.go:172",
"msg": "Creating test subscription declined",
"user_id": "0394c017-2a94-416c-940c-31b1aadb12ee"
}
However, the timestamp is not parsed.
I see this warning in the logs:
2023-01-08 21:46:03,423 - cwlogs.push.reader - WARNING - 9500 - Thread-4 - Fall back to previous event time: {'timestamp': 1673211877689, 'start_position': 6469L, 'end_position': 6640L}, previousEventTime: 1673211877689, reason: timestamp could not be parsed from message.
Update:
I tried removing the level field:
{
"ts": "2023-01-08T23:15:00.518545Z",
"caller": "bot/bot.go:172",
"msg": "Creating test subscription declined",
"user_id": "0394c017-2a94-416c-940c-31b1aadb12ee"
}
and it still does not work.
There are 2 different formats of CloudWatch log configuration:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html. This is deprecated as mentioned in the alert section of the page.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html. This is the configuration for new unified cloudwatch agent and it doesn't have the parameter datetime_format to configure. Instead it has the timestamp_format.
Since you have mentioned datetime_format, I'm assuming you are using the old agent. In that case, %z refers to a UTC offset in the form +HHMM or -HHMM (e.g. +0000, -0400, +1030), as per the linked documentation [1 above]. Your timestamp doesn't have an offset, hence your format should be %Y-%m-%dT%H:%M:%S.%fZ. Here the Z, like the T, just represents a literal character. Also, specify the time_zone as UTC.
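The old agent's datetime_format takes Python strptime-style directives, so the suggested format can be sanity-checked locally; a minimal sketch, not the agent itself:

from datetime import datetime

# The trailing "Z" in the format is matched as a literal character, so no UTC offset
# is required in the log timestamp.
ts = "2023-01-08T21:46:03.381067Z"
print(datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ"))  # 2023-01-08 21:46:03.381067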

Regex - extract IP

I'm trying to pull some data from a plain log file with a JSON converter.
This is the log entry:
01/04/2022 15:29:34.2934 +03:00 - [INFO] - [w3wp/LPAPI-Last Casino/177] - AppsFlyerPostback?re_targeting_conversion_type=&is_retargeting=false&app_id=id624512118&platform=ios&event_type=in-app-event&attribution_type=organic&ip=8.8.8.8&name=blabla
This is the regex I'm using:
(?P<date>[0-9]{2}\/[0-9]{2}\/[0-9]{4}).(?P<time>\s*[0-9]{2}:[0-9]{2}:[0-9]{2}).*(?P<level>\[\D+\]).-.\[(?P<application_subsystem_thread>.*)\].-.(?P<message>.*)
This is the output I'm getting:
{
"application_subsystem_thread": "w3wp/LPAPI-Last Casino/177",
"date": "01/04/2022",
"level": "[INFO]",
"message": "AppsFlyerPostback?re_targeting_conversion_type=&is_retargeting=false&app_id=id624512118&platform=ios&event_type=in-app-event&attribution_type=organic&ip=8.8.8.8&name=blabla",
"time": "15:29:34"
}
As you can see, the convertor is using the group names as the json key.
I would like to get the following output instead:
{
"application_subsystem_thread": "w3wp/LPAPI-Last Casino/177",
"date": "01/04/2022",
"level": "[INFO]",
"message": "AppsFlyerPostback?re_targeting_conversion_type=&is_retargeting=false&app_id=id624512118&platform=ios&event_type=in-app-event&attribution_type=organic&ip=8.8.8.8&name=blabla",
"time": "15:29:34",
"ip": "8.8.8.8"
}
As you can see, I would like to get the IP as well. How can I do it?
You could extract it from the message part.
As defined in the message, the IP could be captured with:
ip\=(?P<ip_address>(?:[0-9]+\.){3}[0-9]+)
So then we incorporate it as part of the greater message group:
(?P<message>.*ip\=(?P<ip_address>(?:[0-9]+\.){3}[0-9]+).*)
Resulting in the final expression:
(?P<date>[0-9]{2}\/[0-9]{2}\/[0-9]{4}).(?P<time>\s*[0-9]{2}:[0-9]{2}:[0-9]{2}).*(?P<level>\[\D+\]).-.\[(?P<application_subsystem_thread>.*)\].-.(?P<message>.*ip\=(?P<ip_address>(?:[0-9]+\.){3}[0-9]+).*)
var message = `01/04/2022 15:29:34.2934 +03:00 - [INFO] - [w3wp/LPAPI-Last Casino/177] - AppsFlyerPostback?re_targeting_conversion_type=&is_retargeting=false&app_id=id624512118&platform=ios&event_type=in-app-event&attribution_type=organic&ip=8.8.8.8&name=blabla`;
// NOTE - The regex in this code sample has been modified to be ECMAScript compliant
console.log(/(?<date>[0-9]{2}\/[0-9]{2}\/[0-9]{4}).(?<time>\s*[0-9]{2}:[0-9]{2}:[0-9]{2}).*(?<level>\[\D+\]).-.\[(?<application_subsystem_thread>.*)\].-.(?<message>.*ip\=(?<ip_address>(?:[0-9]+\.){3}[0-9]+).*)/gm.exec(message).groups)

SageMaker batch transform job failure for 'batchStrategy: MultiRecord' along with data processing

We are using a SageMaker Batch Transform job, and to fit as many records in a mini-batch as can fit within the MaxPayloadInMB limit, we are setting BatchStrategy to MultiRecord and SplitType to Line.
Input to the SageMaker batch transform job is:
{"requestBody": {"data": {"Age": 90, "Experience": 26, "Income": 30, "Family": 3, "CCAvg": 1}}, "mName": "loanprediction", "mVersion": "1", "testFlag": "false", "environment": "DEV", "transactionId": "5-687sdf87-0bc7e3cb3454dbf261ed1353", "timestamp": "2022-01-15T01:45:32.955Z"}
{"requestBody": {"data": {"Age": 55, "Experience": 26, "Income": 450, "Family": 3, "CCAvg": 1}}, "mName": "loanprediction", "mVersion": "1", "testFlag": "false", "environment": "DEV", "transactionId": "5-69e22778-594916685f4ceca66c08bfbc", "timestamp": "2022-01-15T01:46:32.386Z"}
This is the SageMaker batch transform job config:
apiVersion: sagemaker.aws.amazon.com/v1
kind: BatchTransformJob
metadata:
  generateName: '...-batchtransform'
spec:
  batchStrategy: MultiRecord
  dataProcessing:
    JoinSource: Input
    OutputFilter: $
    inputFilter: $.requestBody
  modelClientConfig:
    invocationsMaxRetries: 0
    invocationsTimeoutInSeconds: 3
  mName: '..'
  region: us-west-2
  transformInput:
    contentType: application/json
    dataSource:
      s3DataSource:
        s3DataType: S3Prefix
        s3Uri: s3://....../part-
    splitType: Line
  transformOutput:
    accept: application/json
    assembleWith: Line
    kmsKeyId: '....'
    s3OutputPath: s3://..../batch_output
  transformResources:
    instanceCount: ..
    instanceType: '..'
The SageMaker batch transform job fails with:
Error in batch transform data-log -
2022-01-27T00:55:39.781:[sagemaker logs]: ephemeral-dev-435945521637/loanprediction-usw2-dev/my-loanprediction/1/my-pipeline-9v28r/part-00001-99fb4b99-e8e7-4945-ac44-b6c5a95a2ffe-c000.txt:
2022-01-27T00:55:39.781:[sagemaker logs]: ephemeral-dev-435945521637/loanprediction-usw2-dev/my-loanprediction/1/my-pipeline-9v28r/part-00001-99fb4b99-e8e7-4945-ac44-b6c5a95a2ffe-c000.txt: 400 Bad Request
2022-01-27T00:55:39.781:[sagemaker logs]: ephemeral-dev-435945521637/loanprediction-usw2-dev/my-loanprediction/1/my-pipeline-9v28r/part-00001-99fb4b99-e8e7-4945-ac44-b6c5a95a2ffe-c000.txt: Failed to decode JSON object: Extra data: line 2 column 1 (char 163)
Observation:
This issue occurs when we provide batchStrategy: MultiRecord in the manifest along with these data processing configs:
dataProcessing:
  JoinSource: Input
  OutputFilter: $
  inputFilter: $.requestBody
NOTE: If we put batchStrategy: SingleRecord along with the aforementioned data processing configs, it just works fine (job succeeds)!
Question: How can we achieve successful run with batchStrategy: MultiRecord along with the aforementioned data processing config?
A successful output with batchStrategy: SingleRecord looks like this:
{"SageMakerOutput":{"prediction":0},"environment":"DEV","transactionId":"5-687sdf87-0bc7e3cb3454dbf261ed1353","mName":"loanprediction","mVersion":"1","requestBody":{"data":{"Age":90,"CCAvg":1,"Experience":26,"Family":3,"Income":30}},"testFlag":"false","timestamp":"2022-01-15T01:45:32.955Z"}
{"SageMakerOutput":{"prediction":0},"environment":"DEV","transactionId":"5-69e22778-594916685f4ceca66c08bfbc","mName":"loanprediction","mVersion":"1","requestBody":{"data":{"Age":55,"CCAvg":1,"Experience":26,"Family":3,"Income":450}},"testFlag":"false","timestamp":"2022-01-15T01:46:32.386Z"}
Relevant resource ARN:
arn:aws:sagemaker:us-west-2:435945521637:transform-job/my-pipeline-9v28r-bat-e548fbfb125946528957e0f123456789
When your input data is in JSON Lines format and you choose a SingleRecord BatchStrategy, your container will receive a single JSON payload body like the one below:
{ <some JSON data> }
However, if you use MultiRecord, Batch Transform will split your JSON Lines input (which might contain 100 lines, for example) into multiple records (say 10 records), all sent at once to your container as shown below:
{ <some JSON data> }
{ <some JSON data> }
{ <some JSON data> }
{ <some JSON data> }
.
.
.
{ <some JSON data> }
Therefore, your container should be able to handle such input for it to work. However, from the error message, I can see it is complaining about invalid JSON as it reads the second row of the request.
I also noticed that you have supplied ContentType and AcceptType as application/json, but they should instead be application/jsonlines.
Could you please test your container to see if it can handle multiple JSON Lines records per single invocation?
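As a rough sketch of what that means for the container (the function name and the placeholder prediction below are hypothetical, not part of any SageMaker API), the invocation handler has to treat the request body as JSON Lines rather than a single JSON document:

import json

def handle_invocations(body: str) -> str:
    # With MultiRecord + splitType: Line, one request body can carry several JSON
    # objects, one per line, so parse line by line instead of one json.loads call.
    predictions = []
    for line in body.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)              # one input record per line
        predictions.append({"prediction": 0})  # placeholder; a real handler would score `record`
    # Respond in JSON Lines as well: one prediction per input line, in the same order.
    return "\n".join(json.dumps(p) for p in predictions)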
You cannot use dataProcessing to join the input with the prediction when using MultiRecord. You will need to merge the input with the predictions manually after the batch transform job. The MultiRecord strategy processes records in batches with the maximum size specified in MaxPayloadInMB.
For the input:
input1
input2
input3
input4
input5
input6
The output would be in the below format:
output1,output2,output3
output4,output5,output6
You will need to process the output file and merge it with the input data to get the desired outcome. I expect a JSON array of predictions, as your output format is JSON. You can explode the array and merge it with the input; you need to maintain the order of the predictions when you explode them. You can find more details at:
https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_batch_transform/introduction_to_batch_transform/batch_transform_pca_dbscan_movie_clusters.html#Batch-prediction-on-new-data
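A minimal sketch of that manual join, assuming each output line is a JSON array with one prediction per input record of that mini-batch, in input order (the file names are placeholders):

import json

with open("input.jsonl") as f:                 # the original JSON Lines input
    inputs = [json.loads(line) for line in f if line.strip()]

predictions = []
with open("batch_output.jsonl") as f:          # the assembled batch transform output
    for line in f:
        if line.strip():
            predictions.extend(json.loads(line))   # explode each mini-batch array

assert len(inputs) == len(predictions), "counts must match to join by position"

merged = [dict(record, SageMakerOutput=pred) for record, pred in zip(inputs, predictions)]
with open("merged.jsonl", "w") as f:
    f.write("\n".join(json.dumps(m) for m in merged))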

Dataprep BigQuery running in a different region

So I get the following error when running Dataprep:
java.io.IOException: Query job beam_job_9e016180fbb74637b35319c89b6ed6d7_clouddataprepleads6085795bynick-query-d23eb37a1bee4a788e7b16c1de1f92e6 failed, status: {
  "errorResult" : {
    "message" : "Not found: Dataset lumi-210601:attribution was not found in location US",
    "reason" : "notFound"
  },
  "errors" : [ {
    "message" : "Not found: Dataset lumi-210601:attribution was not found in location US",
    "reason" : "notFound"
  } ],
  "state" : "DONE"
}
I've tried doing the following already:
Under Dataprep "Project Settings" I've set the regional endpoint to asia-east1 and the zone to australia-southeast-1a
Under "Profile" I have set all of the upload/job run dir/temp dir paths to new directories in a bucket that belongs to the Australia south-east region
In the flow output I'm just doing a basic CSV output to test, with the Dataprep execution settings set to the Australia region as well
I can't seem to find any other references to US anywhere. If I go into the Dataflow UI I can see that the jobs ran and failed in the asia-east region as well.
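One quick way to confirm where the dataset actually lives (the failing query job was issued in location US) is to read its metadata with the BigQuery client; a small sketch, using the project and dataset IDs taken from the error message:

from google.cloud import bigquery

# Prints the dataset's actual location; the query job must run in the same location
# for the dataset to be found.
client = bigquery.Client(project="lumi-210601")
dataset = client.get_dataset("attribution")
print(dataset.location)  # e.g. "australia-southeast1" or "US"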

Grok Filter for Confluence Logs

I am trying to write a Grok expression to parse Confluence logs and I am partially successful.
My current Grok pattern is:
%{TIMESTAMP_ISO8601:conflog_timestamp} %{LOGLEVEL:conflog_severity} \[%{APPNAME:conflog_ModuleName}\] \[%{DATA:conflog_classname}\] (?<conflog_message>(.|\r|\n)*)
APPNAME [a-zA-Z0-9\.\#\-\+_%\:]+
And I am able to parse the below log line :
Log line 1:
2020-06-14 10:44:01,575 INFO [Caesium-1-1] [directory.ldap.cache.AbstractCacheRefresher] synchroniseAllGroupAttributes finished group attribute sync with 0 failures in [ 2030ms ]
However I do have other log lines such as :
Log line 2:
2020-06-15 09:24:32,068 WARN [https-jsse-nio2-8443-exec-13] [atlassian.confluence.pages.DefaultAttachmentManager] getAttachmentData Could not find data for attachment:
-- referer: https://confluence.jira.com/index.action | url: /download/attachments/393217/global.logo | traceId: 2a0bfc77cad7c107 | userName: abcd
and Log Line 3 :
2020-06-12 01:19:03,034 WARN [https-jsse-nio2-8443-exec-6] [atlassian.seraph.auth.DefaultAuthenticator] login login : 'ABC' tried to login but they do not have USE permission or weren't found. Deleting remember me cookie.
-- referer: https://confluence.jira.com/login.action?os_destination=%2Findex.action&permissionViolation=true | url: /dologin.action | traceId: 8744d267e1e6fcc9
Here the params "userName", "referer", "url" and "traceId" may or may not be present in the log line.
I can write concrete grok expressions for each of these. Instead, can we handle all of these in the same grok expression?
In short - match all log lines:
If the log line has the "referer" param, store it in a variable. If not, proceed to match the rest of the params.
If the log line has the "url" param, store it. If not, try to match the rest of the params.
Repeat for 'traceId' and 'userName'.
Thank you..
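Grok patterns are regular expressions underneath, so one way to approach this is to give each trailing parameter its own optional non-capturing group. The Python re sketch below is only an illustration of the idea (a grok pattern would use the same (?:...)? construct), and it assumes that the params, when present, appear in the order shown in the sample lines:

import re

# Each trailing parameter gets its own optional group, so lines with or without
# "referer", "url", "traceId" or "userName" all match; missing ones come back as None.
pattern = re.compile(
    r"(?P<message>[^\n]*)"                        # first line of the log message
    r"(?:.*?referer:\s*(?P<referer>\S+))?"
    r"(?:.*?url:\s*(?P<url>\S+))?"
    r"(?:.*?traceId:\s*(?P<traceId>\S+))?"
    r"(?:.*?userName:\s*(?P<userName>\S+))?",
    re.DOTALL,
)

line = ("'ABC' tried to login but they do not have USE permission.\n"
        "-- referer: https://confluence.jira.com/login.action | url: /dologin.action"
        " | traceId: 8744d267e1e6fcc9")
print(pattern.match(line).groupdict())
# {'message': "'ABC' tried to login but they do not have USE permission.",
#  'referer': 'https://confluence.jira.com/login.action', 'url': '/dologin.action',
#  'traceId': '8744d267e1e6fcc9', 'userName': None}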