Dataflow pipeline "lost contact with the service" - google-cloud-platform

I'm running into trouble with an Apache Beam pipeline on Google Cloud Dataflow.
The pipeline is simple: read JSON from GCS, extract text from some nested fields, and write the results back to GCS.
It works fine when testing with a smaller subset of input files, but when I run it on the full data set I get the following error, after it has processed around 260M items without problems.
Somehow the "worker eventually lost contact with the service".
(8662a188e74dae87): Workflow failed. Causes: (95e9c3f710c71bc2): S04:ReadFromTextWithFilename/Read+FlatMap(extract_text_from_raw)+RemoveLineBreaks+FormatText+WriteText/Write/WriteImpl/WriteBundles/Do+WriteText/Write/WriteImpl/Pair+WriteText/Write/WriteImpl/WindowInto(WindowIntoFn)+WriteText/Write/WriteImpl/GroupByKey/Reify+WriteText/Write/WriteImpl/GroupByKey/Write failed., (da6389e4b594e34b): A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
extract-tags-150110997000-07261602-0a01-harness-jzcn,
extract-tags-150110997000-07261602-0a01-harness-828c,
extract-tags-150110997000-07261602-0a01-harness-3w45,
extract-tags-150110997000-07261602-0a01-harness-zn6v
The stack trace shows a "Failed to update work status" / "Progress reporting thread got error" error:
Exception in worker loop: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 776, in run
    deferred_exception_details=deferred_exception_details)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 629, in do_work
    exception_details=exception_details)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py", line 168, in wrapper
    return fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 490, in report_completion_status
    exception_details=exception_details)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 298, in report_status
    work_executor=self._work_executor)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/workerapiclient.py", line 333, in report_status
    self._client.projects_locations_jobs_workItems.ReportStatus(request))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/internal/clients/dataflow/dataflow_v1b3_client.py", line 467, in ReportStatus
    config, request, global_params=global_params)
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 723, in _RunMethod
    return self.ProcessHttpResponse(method_config, http_response, request)
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 729, in ProcessHttpResponse
    self.__ProcessHttpResponse(method_config, http_response, request))
  File "/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py", line 600, in __ProcessHttpResponse
    http_response.request_url, method_config, request)
HttpError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects/qollaboration-live/locations/us-central1/jobs/2017-07-26_16_02_36-1885237888618334364/workItems:reportStatus?alt=json>: response: <{'status': '400', 'content-length': '360', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF', '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Wed, 26 Jul 2017 23:54:12 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json; charset=UTF-8'}>, content <{ "error": { "code": 400, "message": "(7f8a0ec09d20c3a3): Failed to publish the result of the work update. Causes: (7f8a0ec09d20cd48): Failed to update work status. Causes: (afa1cd74b2e65619): Failed to update work status., (afa1cd74b2e65caa): Work \"6306998912537661254\" not leased (or the lease was lost).", "status": "INVALID_ARGUMENT" } } >
And finally:
HttpError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects/[projectid-redacted]/locations/us-central1/jobs/2017-07-26_18_28_43-10867107563808864085/workItems:reportStatus?alt=json>: response: <{'status': '400', 'content-length': '358', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF', '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Thu, 27 Jul 2017 02:00:10 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json; charset=UTF-8'}>, content <{ "error": { "code": 400, "message": "(5845363977e915c1): Failed to publish the result of the work update. Causes: (5845363977e913a8): Failed to update work status. Causes: (44379dfdb8c2b47): Failed to update work status., (44379dfdb8c2e88): Work \"9100669328839864782\" not leased (or the lease was lost).", "status": "INVALID_ARGUMENT" } } >
at __ProcessHttpResponse (/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py:600)
at ProcessHttpResponse (/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py:729)
at _RunMethod (/usr/local/lib/python2.7/dist-packages/apitools/base/py/base_api.py:723)
at ReportStatus (/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/internal/clients/dataflow/dataflow_v1b3_client.py:467)
at report_status (/usr/local/lib/python2.7/dist-packages/dataflow_worker/workerapiclient.py:333)
at report_status (/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py:298)
at report_completion_status (/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py:490)
at wrapper (/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py:168)
at do_work (/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py:629)
at run (/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py:776)
This looks like an error in the Dataflow internals to me. Can anyone confirm? Are there any workarounds?

The HttpError typically appears after the workflow has failed and is part of the failure/teardown process.
It looks like there were other errors reported in your pipeline as well. Note that if the same work item fails 4 times, the pipeline is marked as failed.
Try looking at the Stack Traces section in the UI to identify the other errors and their stack traces. Since this only occurs on the larger dataset, consider the possibility of there being malformed elements that only exist in the larger dataset.
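If malformed input turns out to be the culprit, a common pattern is to catch exceptions inside the extraction step and route bad records to a separate output instead of letting the work item fail repeatedly. A minimal sketch, assuming a DoFn along the lines of your extract_text_from_raw (the 'text' field is just a placeholder, not your actual schema):
import json
import logging

import apache_beam as beam


class ExtractTextFromRaw(beam.DoFn):
    """Extracts text from raw JSON lines; bad records go to a dead-letter output."""
    DEAD_LETTER = 'malformed'

    def process(self, line):
        try:
            record = json.loads(line)
            yield record['text']  # placeholder for your nested-field extraction
        except Exception as err:  # malformed JSON, missing keys, bad encodings, ...
            logging.warning('Skipping bad element (%s): %r', err, line[:200])
            yield beam.pvalue.TaggedOutput(self.DEAD_LETTER, line)

# Usage sketch:
# outputs = lines | beam.ParDo(ExtractTextFromRaw()).with_outputs(
#     ExtractTextFromRaw.DEAD_LETTER, main='extracted')
# outputs.extracted  -> good elements, continue the pipeline as before
# outputs.malformed  -> write somewhere (e.g. GCS) for inspection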

Related

Box SDK client as_user request requires higher privileges than provided by the access token

I have this code in my Django project:
# implementation
import os
from boxsdk import JWTAuth, Client

module_dir = os.path.dirname(os.path.dirname(os.path.dirname(__file__)))  # go up three directory levels from this file
box_config_path = os.path.join(module_dir, 'py_scripts/transactapi_funded_trades/config.json')  # the config JSON downloaded from Box
config = JWTAuth.from_settings_file(box_config_path)  # create the JWT auth config from the JSON file
client = Client(config)  # create a client from the config
user_to_impersonate = client.user(user_id='8********6')  # get the main user
user_client = client.as_user(user_to_impersonate)  # impersonate the main user
The above code is what I use to switch from the service account created by Box to the main account user with ID 8********6. No error is thrown so far, but when I try to implement the actual logic to retrieve the files, I get this:
[2022-09-13 02:50:26,146: INFO/MainProcess] GET https://api.box.com/2.0/folders/0/items {'headers': {'As-User': '8********6',
'Authorization': '---LMHE',
'User-Agent': 'box-python-sdk-3.3.0',
'X-Box-UA': 'agent=box-python-sdk/3.3.0; env=python/3.10.4'},
'params': {'offset': 0}}
[2022-09-13 02:50:26,578: WARNING/MainProcess] "GET https://api.box.com/2.0/folders/0/items?offset=0" 403 0
{'Date': 'Mon, 12 Sep 2022 18:50:26 GMT', 'Transfer-Encoding': 'chunked', 'x-envoy-upstream-service-time': '100', 'www-authenticate': 'Bearer realm="Service", error="insufficient_scope", error_description="The request requires higher privileges than provided by the access token."', 'box-request-id': '07cba17694f7ea32f0c2cd42790bce39e', 'strict-transport-security': 'max-age=31536000', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"'}
b''
[2022-09-13 02:50:26,587: WARNING/MainProcess] Message: None
Status: 403
Code: None
Request ID: None
Headers: {'Date': 'Mon, 12 Sep 2022 18:50:26 GMT', 'Transfer-Encoding': 'chunked', 'x-envoy-upstream-service-time': '100', 'www-authenticate': 'Bearer realm="Service", error="insufficient_scope", error_description="The request requires higher privileges than provided by the access token."', 'box-request-id': '07cba17694f7ea32f0c2cd42790bce39e', 'strict-transport-security': 'max-age=31536000', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"'}
URL: https://api.box.com/2.0/folders/0/items
Method: GET
Context Info: None
It says it needs higher access. What might I be doing wrong? I've been stuck on this particular problem for a little over a week now, so any help is highly appreciated.
Can you test to see if the user is in fact being impersonated?
Something like this:
from boxsdk import JWTAuth, Client


def main():
    """main function"""
    auth = JWTAuth.from_settings_file('./.jwt.config.json')
    auth.authenticate_instance()
    client = Client(auth)

    me = client.user().get()
    print(f"Service account user: {me.id}:{me.name}")

    user_id_to_impersonate = '18622116055'
    folder_of_user_to_impersonate = '0'

    user_to_impersonate = client.user(user_id=user_id_to_impersonate).get()
    # the .get() is just to be able to print the impersonated user
    print(f"User to impersonate: {user_to_impersonate.id}:{user_to_impersonate.name}")

    user_client = client.as_user(user_to_impersonate)

    items = user_client.folder(folder_id=folder_of_user_to_impersonate).get_items()
    print(f"Items in folder:{items}")

    # we need a loop to actually get the items info
    for item in items:
        print(f"Item: {item.type}\t{item.id}\t{item.name}")


if __name__ == '__main__':
    main()
Check out my output:
Service account user: 20344589936:UI-Elements-Sample
User to impersonate: 18622116055:Rui Barbosa
Items in folder:<boxsdk.pagination.limit_offset_based_object_collection.LimitOffsetBasedObjectCollection object at 0x105fffe20>
Item: folder 172759373899 Barduino User Folder
Item: folder 172599089223 Bookings
Item: folder 162833533610 Box Reports
Item: folder 163422716106 Box UI Elements Demo

How to use Runner v2 for an Apache Beam Dataflow job?

My Python code for the Dataflow job looks like this:
import apache_beam as beam
from apache_beam.io.external.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions
topic1="topic1"
conf={'bootstrap.servers':'gcp_instance_public_ip:9092'}
pipeline = beam.Pipeline(options=PipelineOptions())
(pipeline
| ReadFromKafka(consumer_config=conf,topics=['topic1'])
)
pipeline.run()
As I am using KafkaIO in Python code, someone suggested that I use Dataflow Runner v2 (I think Runner v1 doesn't support this in Python).
As per the Dataflow documentation, I am using this parameter to enable Runner v2: --experiments=use_runner_v2 (I have not made any code-level changes to switch from v1 to v2).
I am getting the error below:
http_response, method_config=method_config, request=request)
apitools.base.py.exceptions.HttpBadRequestError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects/metal-voyaasfger-23424/locations/us-central1/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'date': 'Wed, 08 Jul 2020 07:23:21 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'status': '400', 'content-length': '544', '-content-encoding': 'gzip'}>, content <{
"error": {
"code": 400,
"message": "(5fd1bf4d41e8b7e): The workflow could not be created. Causes: (5fd1bf4d41e8018): The workflow could not be created due to misconfiguration. If you are trying any experimental feature, make sure your project and the specified region support that feature. Contact Google Cloud Support for further help. Experiments enabled for project: [enable_streaming_engine, enable_windmill_service, shuffle_mode=service], experiments requested for job: [use_runner_v2]",
"status": "INVALID_ARGUMENT"
}
}
I have already added the service account (which has project owner permission) using the export GOOGLE_APPLICATION_CREDENTIALS command.
Can someone help me find where my mistake is? Am I using Runner v2 incorrectly?
I would also be really thankful if someone could briefly explain the difference between using Runner v1 and Runner v2.
Thanks ... :)
I was able to reproduce your issue. The error message is complaining that use_runner_v2 can't be used, because Runner v2 isn't enabled for batch jobs.
Experiments enabled for project: [enable_streaming_engine, enable_windmill_service, shuffle_mode=service], experiments requested for job: [use_runner_v2]",
Please try running your job with the --streaming flag added.
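If it's easier, the same flags can also be set in code when constructing the pipeline options. A minimal sketch, assuming the job is launched from Python (the project, region, and bucket values below are placeholders, not taken from your job):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values: replace project, region and temp_location with your own.
options = PipelineOptions(
    runner='DataflowRunner',
    project='your-project-id',
    region='us-central1',
    temp_location='gs://your-bucket/tmp',
    streaming=True,                 # run as a streaming job, as suggested above
    experiments=['use_runner_v2'],  # request Dataflow Runner v2
)

pipeline = beam.Pipeline(options=options)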

How to stream a Word document stored in AWS S3 as bytes using boto3

Using boto3, I am trying to retrieve a Microsoft Word document stored in S3. However, when I access the object by calling client.get_object(), the content-length of the Word document is 0, while files with a .txt extension return the correct content-length. Is there a way to decode the Word document so I can write its output to a stream?
I have tested this with .txt files and .docx files, and I have also tried using the .decode() method after reading the file, but based on the content being returned there doesn't seem to be anything to decode.
Accessing a .txt document, I notice that the content-length is 17 (the number of characters in the file) and its contents can be read by calling txt_file.read().
import boto3

s3 = boto3.client('s3')
txt_file = s3.get_object(Bucket="test_bucket", Key="test.txt")
>>> txt_file
{
    u'Body': <botocore.response.StreamingBody object at 0x7fc5f0074f10>,
    u'AcceptRanges': 'bytes',
    u'ContentType': 'text/plain',
    'ResponseMetadata': {
        'HTTPStatusCode': 200,
        'RetryAttempts': 0,
        'HTTPHeaders': {
            'content-length': '17',
            'accept-ranges': 'bytes',
            'server': 'AmazonS3',
            'last-modified': 'Sat, 06 Jul 2019 02:13:45 GMT',
            'date': 'Sat, 06 Jul 2019 15:58:21 GMT',
            'x-amz-server-side-encryption': 'AES256',
            'content-type': 'text/plain'
        }
    }
}
Accessing a .docx document, I notice that the content-length is 0 (while the document has the same string written to it as the .txt file) and calling word_file.read() outputs the empty string u''.
s3 = boto3.client('s3')
word_file = s3.get_object(Bucket="test_bucket", Key="test.docx")
>>> word_file
{
    u'Body': <botocore.response.StreamingBody object at 0x7fc5f0074f10>,
    u'AcceptRanges': 'bytes',
    u'ContentType': 'binary/octet-stream',
    'ResponseMetadata': {
        'HTTPStatusCode': 200,
        'RetryAttempts': 0,
        'HTTPHeaders': {
            'content-length': '0',
            'accept-ranges': 'bytes',
            'server': 'AmazonS3',
            'last-modified': 'Thu, 04 Jul 2019 21:51:53 GMT',
            'date': 'Sat, 06 Jul 2019 15:58:30 GMT',
            'x-amz-server-side-encryption': 'AES256',
            'content-type': 'binary/octet-stream'
        }
    }
}
I expect the content-length of both files to reflect the number of bytes in the file; however, only the .txt file is returning data.
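For reference, this is roughly how I intend to stream the body out once it comes back non-empty (same bucket and key as above; the local file name is just an example):
import boto3

s3 = boto3.client('s3')
response = s3.get_object(Bucket="test_bucket", Key="test.docx")
body = response['Body']  # a botocore StreamingBody

# Write the document to a local file in chunks instead of loading it all at once.
with open('test_copy.docx', 'wb') as out_file:
    for chunk in iter(lambda: body.read(8192), b''):
        out_file.write(chunk)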

Trouble saving to S3 from Jenkins using Django

I am running tests in Jenkins for a Django application, but I receive 403 errors when running the tests that upload to S3. The values are exported as environment variables and accessed in the settings file with values.Value() (https://django-configurations.readthedocs.org/en/stable/values/).
# settings.py
AWS_ACCESS_KEY_ID = values.Value()
AWS_SECRET_ACCESS_KEY = values.Value()
My console output looks like this:
[EnvInject] - Injecting as environment variables the properties content
AWS_ACCESS_KEY_ID='ABC123'
AWS_SECRET_ACCESS_KEY='blah'
[EnvInject] - Variables injected successfully.
...
+ python manage.py test
======================================================================
ERROR: test_document (documents.tests.DocTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/var/lib/jenkins/jobs/Job/workspace/myapp/documents/tests.py", line 25, in setUp
self.document.doc.save('test_file', File(f), save=True)
File "/var/lib/jenkins/jobs/Job/workspace/.venv/local/lib/python2.7/site-packages/django/db/models/fields/files.py", line 89, in save
self.name = self.storage.save(name, content)
File "/var/lib/jenkins/jobs/Job/workspace/.venv/local/lib/python2.7/site-packages/django/core/files/storage.py", line 51, in save
name = self._save(name, content)
File "/var/lib/jenkins/jobs/Job/workspace/.venv/local/lib/python2.7/site-packages/storages/backends/s3boto.py", line 385, in _save
key = self.bucket.get_key(encoded_name)
File "/var/lib/jenkins/jobs/Job/workspace/.venv/local/lib/python2.7/site-packages/boto/s3/bucket.py", line 192, in get_key
key, resp = self._get_key_internal(key_name, headers, query_args_l)
File "/var/lib/jenkins/jobs/Job/workspace/.venv/local/lib/python2.7/site-packages/boto/s3/bucket.py", line 230, in _get_key_internal
response.status, response.reason, '')
S3ResponseError: S3ResponseError: 403 Forbidden
-------------------- >> begin captured logging << --------------------
boto: DEBUG: path=/documents/test_file
boto: DEBUG: auth_path=/my-bucket/documents/test_file
boto: DEBUG: Method: HEAD
boto: DEBUG: Path: /documents/test_file
boto: DEBUG: Data:
boto: DEBUG: Headers: {}
boto: DEBUG: Host: my-bucket.s3.amazonaws.com
boto: DEBUG: Port: 443
boto: DEBUG: Params: {}
boto: DEBUG: Token: None
boto: DEBUG: StringToSign:
HEAD
Sun, 13 Sep 2015 06:02:36 GMT
/my-bucket/documents/test_file
boto: DEBUG: Signature:
AWS 'ABC123':RanDoM123#*$
boto: DEBUG: Final headers: {'Date': 'Sun, 13 Sep 2015 06:02:36 GMT', 'Content-Length': '0', 'Authorization': u"AWS 'ABC123':RanDoM123#*$, 'User-Agent': 'Boto/2.38.0 Python/2.7.3 Linux/3.2.0-4-amd64'}
boto: DEBUG: Response headers: [('x-amz-id-2', 'MoReRanDom123*&^'), ('server', 'AmazonS3'), ('transfer-encoding', 'chunked'), ('x-amz-request-id', '3484RANDOM19394'), ('date', 'Sun, 13 Sep 2015 06:02:36 GMT'), ('content-type', 'application/xml')]
--------------------- >> end captured logging << ---------------------
Am I missing something important in order to upload files to S3 from Jenkins? I'm having no issues on my local machine.
Does your CORS configuration on this bucket have any restrictions per IP? For example, if AllowedOrigin specifies an IP, that could be one reason why it fails.
<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <CORSRule>
        <AllowedOrigin>*</AllowedOrigin>
        ...
    </CORSRule>
</CORSConfiguration>
I would also print out your AWS values on Jenkins for debugging, just to confirm that the correct values are being used in that environment.
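For example, a throwaway snippet like this (names taken from your question; remove it after debugging), dropped into settings.py or the test's setUp, will show exactly what the Jenkins environment provides without printing the full secret:
import os

for name in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    value = os.environ.get(name) or ""
    # repr() makes stray quotes or whitespace visible; only a short prefix is shown.
    print("%s = %r... (length %d)" % (name, value[:4], len(value)))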

Facebook graph API : can post on "me/feed" but not on "page_id/feed" (error : 1455002)

I guess the answer to this one is straightforward but I cannot find it. Any help would be very much appreciated.
I. Use case
The application (back-end in Python / Django) should write to a Facebook page.
II. Symptoms
When running the code below on "me/feed", the post is correctly inserted
When running the code below on "PAGE_ID/feed", there is an exception (see below in section IV.)
The scope of the authorisation is publish_stream, manage_pages
Also, the user_token is from a user in the test domain
III. Code
## Getting the user_access_token is dealt with before
from httplib2 import Http
from urllib import urlencode  # Python 2

h = Http()
data = dict(message="Hello", access_token=user_access_token['access_token'])
resp, content = h.request("https://graph.facebook.com/PAGE_ID/feed", "POST", urlencode(data))
IV. Exception generated (using /PAGE_ID/feed)
resp : Response: {'status': '400', 'content-length': '119', 'expires': 'Sat, 01 Jan 2000 00:00:00 GMT', 'www-authenticate':
'OAuth "Facebook Platform" "invalid_request" "(#1) An unknown error occurred"', 'x-fb-rev': '976458',
'connection': 'keep-alive', 'pragma': 'no-cache', 'cache-control': 'no-store', 'date': 'Tue, 22 Oct 2013 21:45:20
GMT', 'access-control-allow-origin': '*', 'content-type': 'text/javascript; charset=UTF-8', 'x-fb-debug':
'HFItWh64ob+3hErv+rgYdFzHlRBVHP7Pg0Eg4hvqYlY='}
content str: {"error":{"message":"(#1) An unknown error occurred","type":"OAuthException","code":1,"error_data":
{"kError":1455002}}}