IBM Rational Synergy CLI get list of work area conflicts - cm-synergy

When using the Synergy GUI you can "Sync Work Area" and then get a list of files with conflicts.
I want to get the same list using the CLI, but I have had no luck with this.
I've tried
ccm conflicts -r NAME:Version
and it prints the following:
Checking for missing fix tasks., 11%...
Collecting objects and tasks beyond baseline., 32%...
Collecting objects and tasks beyond baseline., 46%...
Collecting objects and tasks beyond baseline., 48%...
Collecting objects and tasks beyond baseline., 49%...
Collecting objects and tasks beyond baseline., 50%...
Finding objects that are not included., 90%...
Conflict detection completed., 100%
Collecting objects and tasks beyond baseline., 20%...
Collecting objects and tasks beyond baseline., 26%...
Collecting objects and tasks beyond baseline., 32%...
Collecting objects and tasks beyond baseline., 38%...
Collecting objects and tasks beyond baseline., 44%...
Collecting objects and tasks beyond baseline., 49%...
Finding objects that are not included., 90%...
Conflict detection completed., 100%
Finding objects that are not included., 90%...
Conflict detection completed., 100%
Conflict detection completed., 100%
Getting explicitly included objects., 16%...
Finding objects that are not included., 90%...
Conflict detection completed., 100%
But I get no list of files.
If I do the same through the GUI, i.e. call "Sync Work Area" on members only, I get a list of about 30 files.
What am I doing wrong?
Do I need to run a query instead? And where do I find information on what I can query, e.g. keywords etc.?
Regards

Try the command
ccm reconcile -s -p NAME -cu
This will do your "sync work area".
Further information about the command is documented in the Synergy online help:
https://www.ibm.com/support/knowledgecenter/en/SSRNYG_7.2.1/com.ibm.rational.synergy.manage.doc/topics/sc_t_h_show_work_area_conflicts.html
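If you want to capture that conflict list from a script rather than reading it off the terminal, here is a minimal sketch (Python purely for illustration) that just runs the command above and prints whatever it reports; the project spec NAME is a placeholder.
# Minimal sketch: run the reconcile command suggested above from a script
# and capture its output. "NAME" is a placeholder for your project spec.
import subprocess

result = subprocess.run(
    ["ccm", "reconcile", "-s", "-p", "NAME", "-cu"],
    capture_output=True, text=True, check=False,
)
print(result.stdout)
if result.returncode != 0:
    print(result.stderr)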

Related

Deleting millions of files from S3

I need to delete 64 million objects from a bucket, leaving about the same number of objects untouched. I have created an inventory of the bucket and used that to create a filtered inventory that has only the objects that need to be deleted.
I created a Lambda function that uses NodeJS to 'async' delete the objects that are fed to it.
I have created smaller inventories (10s, 100s and 1000s of objects) from the filtered one, and used S3 Batch Operation jobs to process these, and those all seem to check out: the expected files were deleted, and all other files remained.
Now, my questions:
Am I doing this right? Is this the preferred method to delete millions of files, or did my Googling misfire?
Is it advised to just create one big batch job and let that run, or is it better to break it up into chunks of, say, a million objects?
How long will this take (approx. of course)? Will S3 Batch go through the list and do each file sequentially? Or does it automagically scale out and do a whole bunch in parallel?
What am I forgetting?
Any suggestions, thoughts or criticisms are welcome. Thanks!
You might have a look at the Step Functions Distributed Map feature. I do not know your specific use case, but it could help you get the proper scaling.
Here is a short blog entry on how you can achieve it.
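Not the Batch Operations path itself, but for comparison, here is a minimal client-side sketch of chunked deletion with boto3; the bucket name and key list are placeholders, and delete_objects accepts at most 1,000 keys per call.
# Minimal client-side sketch (not S3 Batch Operations): delete keys from the
# filtered inventory in chunks of up to 1,000 per delete_objects call.
# Bucket name and key list are placeholders.
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"
keys_to_delete = ["path/obj-1", "path/obj-2"]  # loaded from the filtered inventory

for i in range(0, len(keys_to_delete), 1000):
    chunk = keys_to_delete[i:i + 1000]
    s3.delete_objects(
        Bucket=bucket,
        Delete={"Objects": [{"Key": k} for k in chunk], "Quiet": True},
    )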

AWS Glue ETL: Reading a huge JSON file to process, but got an OutOfMemoryError

I am working on the AWS Glue ETL part for reading a huge JSON file (only testing 1 file, around 9 GB) to work in the ETL process, but I got an error from AWS Glue of java.lang.OutOfMemoryError: Java heap space after running and processing for a while.
My code and flow are as simple as:
df = spark.read.option("multiline", "true").json(f"s3/raw_path")
# ...
# and write it as source_df to another object in S3
df.write.json(f"s3/source_path", lineSep=",\n")
In the error/log it seems it failed and the container was terminated while reading this huge file. I have already tried upgrading the worker type to G.1X with a sample number of worker nodes; however, I would like to ask about and find another solution that does not amount to vertical scaling by just adding resources.
I am very new to this area and service, so I want to optimize cost and time as much as possible :-)
Thank you all in advance.
After looking into Glue and Spark, I found that to get the benefit of parallel processing across multiple executors, in my case I had to split the (large) file into multiple smaller files, and it worked! The files are distributed to multiple executors.
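For reference, a rough sketch of what that looks like, assuming the same SparkSession ("spark") as in the question and placeholder S3 paths: with the input split into many smaller files, each file can be handled by a different executor.
# Sketch with placeholder paths: many smaller files let Spark spread the read
# across executors instead of loading one 9 GB document on a single executor.
df = spark.read.option("multiline", "true").json("s3://my-bucket/raw_split_path/")

# Optionally rebalance before writing so the output is also spread out.
df.repartition(64).write.mode("overwrite").json("s3://my-bucket/source_path/")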

Code changes needed for custom distributed ML Engine Experiment

I completed this tutorial on distributed tensorflow experiments within an ML Engine experiment and I am looking to define my own custom tier instead of the STANDARD_1 tier that they use in their config.yaml file. If using the tf.estimator.Estimator API, are any additional code changes needed to create a custom tier of any size? For example, the article suggests: "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches." so this would suggest the config.yaml file below would be possible
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: complex_model_m
  workerCount: 10
  parameterServerCount: 4
Are any code changes needed to the MNIST tutorial to be able to use this custom configuration? Would this distribute the X number of batches across the 10 workers as the tutorial suggests would be possible? I poked around some of the other ML Engine samples and found that reddit_tft uses distributed training, but they appear to have defined their own runconfig.cluster_spec within their trainer package: task.py, even though they are also using the Estimator API. So, is there any additional configuration needed? My current understanding is that if you use the Estimator API (even within your own defined model), there should not need to be any additional changes.
Does any of this change if the config.yaml specifies using GPUs? This article suggests for the Estimator API "No code changes are necessary as long as your ClusterSpec is configured properly. If a cluster is a mixture of CPUs and GPUs, map the ps job name to the CPUs and the worker job name to the GPUs." However, since the config.yaml is specifically identifying the machine type for parameter servers and workers, I am expecting that within ML-Engine the ClusterSpec will be configured properly based on the config.yaml file. However, I am not able to find any ml-engine documentation that confirms no changes are needed to take advantage of GPUs.
Last, within ML Engine I am wondering if there are any ways to identify usage of different configurations. The line "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches." suggests that the benefit of additional workers would be roughly linear, but I don't have any intuition about how to determine whether more parameter servers are needed. What would one be able to check (either within the cloud dashboards or TensorBoard) to determine whether they have a sufficient number of parameter servers?
are any additional code changes needed to create a custom tier of any size?
No; no changes are needed to the MNIST sample to get it to work with a different number or type of worker. To use a tf.estimator.Estimator on CloudML Engine, you must have your program invoke learn_runner.run, as exemplified in the samples. When you do so, the framework reads in the TF_CONFIG environment variable and populates a RunConfig object with the relevant information such as the ClusterSpec. It will automatically do the right thing on Parameter Server nodes and it will use the provided Estimator to start training and evaluation.
Most of the magic happens because tf.estimator.Estimator automatically uses a device setter that distributes ops correctly. That device setter uses the cluster information from the RunConfig object whose constructor, by default, uses TF_CONFIG to do its magic (e.g. here). You can see where the device setter is being used here.
This all means that you can just change your config.yaml by adding/removing workers and/or changing their types and things should generally just work.
For sample code using a custom model_fn, see the census/customestimator example.
That said, please note that as you add workers, you are increasing your effective batch size (this is true regardless of whether or not you are using tf.estimator). That is, if your batch_size was 50 and you were using 10 workers, that means each worker is processing batches of size 50, for an effective batch size of 10*50=500. Then if you increase the number of workers to 20, your effective batch size becomes 20*50=1000. You may find that you may need to decrease your learning rate accordingly (linear seems to generally work well; ref).
I poked around some of the other ML Engine samples and found that reddit_tft uses distributed training, but they appear to have defined their own runconfig.cluster_spec within their trainer package: task.py, even though they are also using the Estimator API. So, is there any additional configuration needed?
No additional configuration needed. The reddit_tft sample does instantiate its own RunConfig; however, the constructor of RunConfig grabs any properties not explicitly set during instantiation from TF_CONFIG. And it does so only as a convenience, to figure out how many Parameter Servers and workers there are.
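For illustration, this is roughly the shape of the TF_CONFIG variable the service sets, and how one could count workers and parameter servers from it the way RunConfig does; the cluster addresses below are made-up placeholders.
import json
import os

# Roughly the structure CloudML Engine puts in TF_CONFIG; addresses are fake.
os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {
        "master": ["master-0:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps": ["ps-0:2222"],
    },
    "task": {"type": "worker", "index": 0},
}))

tf_config = json.loads(os.environ["TF_CONFIG"])
cluster = tf_config["cluster"]
print("workers:", len(cluster.get("worker", [])))
print("parameter servers:", len(cluster.get("ps", [])))
print("this task:", tf_config["task"])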
Does any of this change if the config.yaml specifies using GPUs?
You should not need to change anything to use tf.estimator.Estimator with GPUs, other than possibly needing to manually assign ops to the GPU (but that's not specific to CloudML Engine); see this article for more info. I will look into clarifying the documentation.
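Purely as an illustration of that manual assignment (TF 1.x-era graph API, to match the samples discussed; normally the Estimator's device setter handles placement for you):
import tensorflow as tf

# Pin an op to the first GPU by hand; with tf.estimator this is rarely needed.
with tf.device("/gpu:0"):
    x = tf.random_normal([8, 32])
    logits = tf.layers.dense(x, units=10)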

How should I parallelize a mix of cpu- and network-intensive tasks (in Celery)

I have a job that scans a network file system (which can be remote), pulls many files, runs a computation on them and pushes the results (per file) into a DB. I am in the process of moving this to Celery so that it can be scaled up. The number of files can get really huge (1M+).
I am not sure what design approach to take, specifically:
Uniform "end2end" tasks
A task gets a batch (list of N files), pulls them, computes and uploads results.
(Using batches rather than individual files is meant to optimize the connection to the remote file system and the DB, although it is purely a heuristic at this point.)
Clearly, a task would spend a large part of its time waiting for I/O, so we'll need to play with the number of worker processes (much more than the # of CPUs), so that I have enough tasks running (computing) concurrently.
pro: simple design, easier coding and control.
con: probably will need to tune the process pool size individually per installation, as it would depend on environment (network, machines etc.)
Split into dedicated smaller tasks
download, compute, upload (again, batches).
This option is appealing intuitively, but I don't actually see the advantage.
I'd be glad to get some references to tutorials on concurrency design, as well as design suggestions.
How long does it take to scan the network file system compared to the computation per file?
What does the hierarchy of the remote file system look like? Are the files evenly distributed? How can you use this to your advantage?
I would follow a process like this:
1. In one process, list the first two levels of the root remote target folder.
2. For each of the discovered folders, spin up a separate celery process that further lists the content of those folders. You may also want to save the location of the discovered files just in case things go wrong.
3. After you have listed the content of the remote file system and all Celery processes that list files have terminated, you can go into processing mode.
4. You may want to list files with 2 processes and use the rest of your cores to start doing per file work.
NB: Before doing everything in Python, I would also investigate how shell tools like xargs and find work together for remote file discovery. xargs allows you to spin up multiple processes that do what you want. That might be the most efficient way to do the remote file discovery, and you can then pipe everything to your Python code.
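As a rough sketch of the fan-out idea in steps 1-4 above, with made-up task names and hypothetical helpers for the listing, fetching, compute and DB steps:
# Rough sketch: one task lists a folder and dispatches per-file work to the
# worker pool. Task names, broker URL and helpers are illustrative only.
from celery import Celery

app = Celery("scanner", broker="redis://localhost:6379/0")

@app.task
def scan_folder(path):
    for file_path in list_remote_files(path):   # hypothetical listing helper
        process_file.delay(file_path)

@app.task
def process_file(file_path):
    data = fetch_remote_file(file_path)         # hypothetical I/O-bound helper
    result = compute(data)                      # hypothetical CPU-bound step
    save_result(file_path, result)              # hypothetical DB write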
Instead of Celery, you can write a simple Python script that runs k * cpu_count threads just to connect to remote servers and fetch files, without Celery.
Personally, I found that a k value between 4 and 7 gives better results in terms of CPU utilization for IO-bound tasks. Depending on the number of files produced, or the rate at which you want to consume them, you can use a suitable number of threads.
Alternatively, you can use Celery + gevent, or Celery with threads, if your tasks are IO bound.
For the computation and updating the DB you can use Celery, so that you can scale dynamically as per your requirements. If you have too many tasks at a time that need a DB connection, you should use DB connection pooling for the workers.
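A minimal sketch of that k * cpu_count thread-pool fetcher; fetch_file is a hypothetical I/O-bound helper and K is the tuning knob mentioned above.
import os
from concurrent.futures import ThreadPoolExecutor

K = 5  # the answer suggests values around 4-7 for IO-bound work

def fetch_file(path):
    ...  # connect to the remote server and download one file

def fetch_all(paths):
    # Fan the downloads out over K threads per CPU core.
    with ThreadPoolExecutor(max_workers=K * os.cpu_count()) as pool:
        return list(pool.map(fetch_file, paths))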

Using sorl-thumbnail with MongoDB storage

I've extended sorl-thumbnail's KVStoreBase class, and made a key-value backend that uses a single MongoDB collection.
This was done in order to avoid installing a discrete key-value store (e.g. Redis).
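For context, this is roughly the shape of the backend, assuming the _get_raw/_set_raw/_delete_raw/_find_keys hooks that sorl-thumbnail's bundled key-value backends implement; the database and collection names are illustrative.
import re
from pymongo import MongoClient
from sorl.thumbnail.kvstores.base import KVStoreBase

class MongoKVStore(KVStoreBase):
    # Stores each cache entry as {"_id": key, "value": serialized_value}.
    def __init__(self):
        super().__init__()
        self._col = MongoClient()["sorl"]["thumbnail_kvstore"]

    def _get_raw(self, key):
        doc = self._col.find_one({"_id": key})
        return doc["value"] if doc else None

    def _set_raw(self, key, value):
        self._col.replace_one({"_id": key}, {"_id": key, "value": value}, upsert=True)

    def _delete_raw(self, *keys):
        self._col.delete_many({"_id": {"$in": list(keys)}})

    def _find_keys(self, prefix):
        return [d["_id"] for d in self._col.find(
            {"_id": {"$regex": "^" + re.escape(prefix)}}, {"_id": 1})]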
Should I clear the collection every once in a while?
What are the downsides?
Only clear the collection if low disk usage is more important to you than fast access times.
The downsides are that your users will all hit un-cached thumbnails simultaneously (and simultaneously trigger their recomputation).
Just run python manage.py thumbnail cleanup
This cleans stale cache entries out of the key-value store: it removes references to images that no longer exist, along with thumbnail references and their actual thumbnail files for those missing images, and it removes thumbnails for unknown images.