Multi-threading writing to multiple distributed tables in a DolphinDB database - concurrency

I am trying to perform concurrent writes in DolphinDB and have set enableChunkGranularityConfig = true. I have three tables, oq, mk, and ts, all under the 1db3tb database. I want to write data to these tables and test the concurrent performance. How do I write a DolphinDB script (.dos) that performs the writes concurrently?
I created a dos file as follows, but it still seems to run single-threaded. How can I append data to the tables at the same time?
tableObj1.append!(t1)
tableObj2.append!(t2)
tableObj3.append!(t3)

You can refer to the following script:
def writeData(dbName, tableName, data){
    timer loadTable(dbName, tableName).append!(data)
}
// submit one background job per table so the three writes run concurrently
submitJob("write1", "write data", writeData, dbName, tableName1, data1)
submitJob("write2", "write data", writeData, dbName, tableName2, data2)
submitJob("write3", "write data", writeData, dbName, tableName3, data3)
getRecentJobs()
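Each submitJob call runs writeData on a separate background worker, so the three writes proceed in parallel. As a rough sketch, assuming the job ids were not already taken (submitJob returns the actual id it assigned), you can inspect a finished job like this:
getJobMessage("write1")  // console output of the job, including the timer line
getJobReturn("write1")   // return value of writeData, if any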

Related

BIGQUERY csv file load with an additional column with a default value

From the example given by Google, I have managed to load CSV files into a BigQuery (BQ) table by following the guide (link and code below).
Now I want to add several files into BQ, and I want to add a new column filename that contains the file name.
Is there a way to add a column with default data?
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv
// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');
// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();
/**
* This sample loads the CSV file at
* https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.csv
*
* TODO(developer): Replace the following lines with the path to your file.
*/
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.csv';
async function loadCSVFromGCS() {
// Imports a GCS file into a table with manually defined schema.
/**
* TODO(developer): Uncomment the following lines before running the sample.
*/
// const datasetId = 'my_dataset';
// const tableId = 'my_table';
// Configure the load job. For full list of options, see:
// https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
const metadata = {
sourceFormat: 'CSV',
skipLeadingRows: 1,
schema: {
fields: [
{name: 'name', type: 'STRING'},
{name: 'post_abbr', type: 'STRING'},
// {name: 'filename', type: 'STRING', value: filename} // I WANT TO ADD A COLUMN WITH THE FILE NAME HERE
],
},
location: 'US',
};
// Load data from a Google Cloud Storage file into the table
const [job] = await bigquery
.dataset(datasetId)
.table(tableId)
.load(storage.bucket(bucketName).file(filename), metadata);
// load() waits for the job to finish
console.log(`Job ${job.id} completed.`);
// Check the job's status for errors
const errors = job.status.errors;
if (errors && errors.length > 0) {
throw errors;
}
}
I would say you have a few choices.
Add a column to the CSV before uploading, e.g. with awk or preprocessing in JS.
Add the individual CSV files to separate tables. You can easily query across many tables as one in BigQuery. This way you can easily see which data comes from which file, and you can use the table metadata to recover the file name.
Post-process the data by adding the column after the data is loaded, using normal SQL/API calls (a sketch follows below).
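For that post-processing option, a rough sketch in BigQuery standard SQL: load each file into a staging table first, then copy the rows into the final table while stamping the file name (the table names and the literal file name are placeholders your loader script would supply):
INSERT INTO my_dataset.us_states (name, post_abbr, filename)
SELECT name, post_abbr, 'us-states.csv'  -- file name known to the loader at this point
FROM my_dataset.us_states_staging;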
See also this possible duplicate How to add new column with metadata value to csv when loading it to bigquery
According to BigQuery’s documentation [1], there is no option to set a default value for columns. The closest option without any post-processing would be to use a NULL value for nullable columns.
However, a possible post-processing workaround for this would be to create a view of the raw table and add a script that maps the NULL value to any default value. Here’s some information about scripting in BigQuery [2].
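For example, the raw table could be wrapped in a view that maps the NULL column to a fixed default (the table, view, and default value below are hypothetical):
CREATE OR REPLACE VIEW my_dataset.us_states_view AS
SELECT
  name,
  post_abbr,
  IFNULL(filename, 'unknown.csv') AS filename  -- map NULL to a default value
FROM my_dataset.us_states_raw;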
If it is possible to add pre-processing code, adding the value to the source file would be easy to achieve with any scripting language.
I think static and function-based default values would be a good feature for BigQuery in the future.
[1] https://cloud.google.com/bigquery/docs
[2] https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting
You have multiple options:
you could rebuild your CSV with the filename as column data
you can load the data into a temporary table, then move it to the final table in a second step, specifying the missing file name column
convert the example to use an external table, where _FILE_NAME is available as a pseudocolumn; later you can query it and move the data to a final table (see the sketch just below). See the BigQuery external table documentation for more about this.
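A sketch of that last option (the external table name is hypothetical; _FILE_NAME is a pseudocolumn that is only available when querying external tables and holds the full GCS URI of the source file):
CREATE OR REPLACE TABLE my_dataset.us_states AS
SELECT
  name,
  post_abbr,
  _FILE_NAME AS filename  -- URI of the file each row came from
FROM my_dataset.us_states_external;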

Cannot create index on non-empty table

I'm currently using AWS Lambda (NodeJS) with AWS QLDB.
The scenario is like this.
I have the first table and its indexes when I deploy the service, so the table and indexes will be created. My problem is that once I need to add a new table and its indexes, it can't create the index because there's an existing table.
My workaround to be able to create a new table even if there's an existing table in my ledger is to query the list of tables I have.
const getTables = async (transactionExecutor: TransactionExecutor) => {
const statement = `SELECT name FROM information_schema.user_tables`;
return await transactionExecutor.execute(statement);
};
Then I have this condition to check if the table already exists:
const tables = JSON.stringify(result.getResultList());
if (
!JSON.parse(tables).some((object): boolean => object.name === process.env.TABLE_NAME)
) {
console.log('TABLE A NOT EXISTING');
await createTable(transactionExecutor, process.env.TABLE_NAME);
}
if (
!JSON.parse(tables).some(
(object): boolean => object.name === process.env.TABLE_NAME_1,
)
) {
console.log('TABLE B NOT EXISTING');
await createTable(transactionExecutor, process.env.TABLE_NAME_1);
}
I don't know how to do it with indexes; I tried using SQL commands in QLDB, but they're not working.
I hope you can help me.
Thank you
I'm not quite sure what your question is (the post title and body hint at different things), but I'm going to do my best to answer.
First, QLDB stores data in Ion, not JSON. So, please use the Ion APIs to parse data and not the JSON ones. The reason your code works at all is because Ion is a superset of JSON and the result set doesn't include types that are unknown to JSON. So, for example, if the result set was changed to include an Ion Timestamp, then your code would break.
Next, actually getting a list of tables has first class support in the driver. Simply use driver.getTableNames.
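For example, the existence check from the question could be written against the driver directly; a sketch assuming an already constructed amazon-qldb-driver-nodejs QldbDriver named driver and the same environment variable:
const tableName = process.env.TABLE_NAME ?? 'TableA';
const existingTables: string[] = await driver.getTableNames();
if (!existingTables.includes(tableName)) {
  // create the table only when it is missing
  await driver.executeLambda(async (txn) => {
    await txn.execute(`CREATE TABLE ${tableName}`);
  });
}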
Third, I think you have a question "can I add an index to a non-empty table?". The answer is "no". This is planned functionality and I will update this answer when it is available. UPDATE: Now you can! https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-qldb-launches-index-improvements/
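With that launch, adding an index to a populated table is just another PartiQL statement run through the driver; a sketch (the table and field names here are made up):
await driver.executeLambda(async (txn) => {
  await txn.execute('CREATE INDEX ON TableA (studentId)');
});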
Finally, I think you're also asking if there is a way to list indexes on a table in the same way as you can list tables in a ledger. The answer to that is 'yes'. The documents returned in information_schema.user_tables look like this:
{
tableId:"...",
name:"THE_TABLE_NAME",
indexes:[
{
expr:"[THE_FIELD_BEING_INDEXED]"
}
],
status:"ACTIVE"
}

Consistency level LOCAL_ONE is not supported for this operation. Supported consistency levels are: LOCAL_QUORUM

I am working with AWS Keyspaces and trying to insert data from C#, but I am getting this error: "Consistency level LOCAL_ONE is not supported for this operation. Supported consistency levels are: LOCAL_QUORUM". Can anyone please help out here?
AWS keyspace
CREATE KEYSPACE IF NOT EXISTS "DevOps"
WITH REPLICATION={'class': 'SingleRegionStrategy'} ;
Table
CREATE TABLE IF NOT EXISTS "DevOps"."projectdetails" (
"id" UUID PRIMARY KEY,
"name" text,
"lastupdatedtime" timestamp,
"baname" text,
"customerid" UUID)
C# code
public async Task AddRecord(List<projectdetails> projectDetails)
{
try
{
if (projectDetails.Count > 0)
{
foreach (var item in projectDetails)
{
projectdetails projectData = new projectdetails();
projectData.id = item.id;
projectData.name = item.name;
projectData.baname = "Vishal";
projectData.lastupdatedtime = item.lastupdatedtime;
projectData.customerid = 1;
await mapper.InsertAsync<projectdetails>(projectData);
}
}
}
catch (Exception e)
{
}
}
The error clearly says that you need to use the correct consistency level, LOCAL_QUORUM, instead of the LOCAL_ONE that is used by default. The AWS documentation says that for write operations it is the only consistency level supported. You can set the consistency level by using the version of InsertAsync that accepts CqlQueryOptions, like this (ideally, create the query options instance only once, during initialization of the application):
mapper.InsertAsync<projectdetails>(projectData,
new CqlQueryOptions().SetConsistencyLevel(ConsistencyLevel.LocalQuorum))
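Applied to the loop in the question, that could look roughly like this (a sketch; the options instance is created once and reused, as suggested above):
// created once, e.g. during application initialization
private static readonly CqlQueryOptions WriteOptions =
    new CqlQueryOptions().SetConsistencyLevel(ConsistencyLevel.LocalQuorum);

// inside AddRecord
foreach (var item in projectDetails)
{
    // ... populate projectData as in the question ...
    await mapper.InsertAsync<projectdetails>(projectData, WriteOptions);
}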

Simple list of applicants webapp

I am creating a web application that lists applicants and their position on a waiting list.
We need to be able to add new applicants to this list and remove applicants from the list. There will be under 10k applicants in the list.
Specifics:
I plan to write the app in Golang.
The list needs to be safe; if the program shuts down, it should be recoverable.
The app should contain this data for every applicant: Name, Student ID, position.
Questions:
How do I secure the list (with a lock?) so it is updated correctly if two updates are made at the same time?
Should I save the data in a database or use a file?
I need your help!
UPDATE:
Mockup code:
package main
import (
"log"
"sync"
"time"
"github.com/boltdb/bolt"
)
type applicant struct {
FirstName string
LastName string
StudentID string
Position int
}
type priorityList struct {
sync.Mutex
applicants []applicant
}
func (l *priorityList) newApplicant(fn string, ln string, sid string) error {
// add applicant to priorityList
return nil
}
func (l *priorityList) removeApplicant(sid string) error {
// remove applicant from priorityList
return nil
}
func (l *priorityList) editApplicant(sid string) error {
// edit applicant in priorityList
return nil
}
func main() {
// Database
db, err := bolt.Open("priorityList.db", 0600, &bolt.Options{Timeout: 1 * time.Second})
if err != nil {
log.Fatal(err)
}
defer db.Close()
}
If you use a file, you could use a Mutex to block concurrent writes.
Otherwise a database would be fine. For example, BoltDB could be suitable. It is pure Go and runs within your program.
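As a minimal sketch, here is one way the mockup's newApplicant could be filled in using the mutex already embedded in priorityList (the BoltDB persistence step is left out; field names match the question's structs):
func (l *priorityList) newApplicant(fn string, ln string, sid string) error {
	l.Lock()
	defer l.Unlock()
	// the new applicant goes to the back of the waiting list
	l.applicants = append(l.applicants, applicant{
		FirstName: fn,
		LastName:  ln,
		StudentID: sid,
		Position:  len(l.applicants) + 1,
	})
	return nil
}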
There are many approaches. You can use a file and protect it with a Go mutex or a system lock. You can memory-map the file for performance. Or you can use BoltDB, which is a nice piece of software that provides the needed machinery and works in-process. If you write rarely and mostly read, then a constant DB such as https://github.com/colinmarc/cdb also looks interesting.
But a classic SQL DB has some advantages:
You can use a third-party store for the data and easily migrate when needed
You can access your data from a third-party app or with a plain SQL query
You can think about the data schema and code logic separately

jpa FlushModeType COMMIT

In FlushModeType.AUTO mode, the persistence context is synchronized with the database at the following times:
before each SELECT operation
at the end of a transaction
after a flush or close operation on the persistence context
FlushModeType.COMMIT means that the provider does not have to flush
the persistence context before executing a query, because you have indicated that there is no changed data in memory that would affect the results of the database query.
I have made an example in JBoss AS 6.0:
@Stateless
public class SessionBeanTwoA implements SessionBeanTwoALocal {
@PersistenceContext(unitName = "entity_manager_trans_unit")
protected EntityManager em;
@EJB
private SessionBeanTwoBLocal repo;
@Override
@TransactionAttribute(TransactionAttributeType.REQUIRED)
public void findPersonByEmail(String email) {
1. List<Person> persons = repo.retrievePersonByEmail(email);
2. Person person = persons.get(0);
3. System.out.println(person.getAge());
4. person.setAge(2);
5. persons = repo.retrievePersonByEmail(email);
6. person=persons.get(0);
7. System.out.println(person.getAge());
}
}
@Stateless
public class SessionBeanTwoB extends GenericCrud implements SessionBeanTwoBLocal {
@Override
public List<Person> retrievePersonByEmail(String email) {
Query query = em.createNamedQuery("Person.findAllPersonByEmail");
query.setFlushMode(FlushModeType.COMMIT);
query.setParameter("email", email);
List<Person> persons;
persons = query.getResultList();
return persons;
}
}
FlushModeType.COMMIT does not seem to work. At line 1., the person's age is read from the database, and 35 is printed at line 3. At line 4., the person is updated within the persistence context, but at line 7. the person's age is 2.
The JPA 2.0 spec says:
If FlushModeType.COMMIT is set, the effect of updates made to entities in the persistence context upon queries is
unspecified.
But many books explain it the way I wrote at the beginning of this post.
So what does FlushModeType.COMMIT really do?
Thanks in advance for your help.
The javadocs mention this for FlushModeType.COMMIT:
Flushing to occur at transaction commit. The provider may flush at
other times, but is not required to.
So if the provider thinks it should, it can flush even though it is configured to flush on commit. With the AUTO setting, the provider typically flushes at various times (which requires an expensive traversal of all managed entities, especially if there are many of them, to check whether any database updates/deletes need to be scheduled), so if we are sure that no database changes are happening, we may use the COMMIT setting to cut down on those frequent checks and save some CPU cycles.
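As a small illustration of where the setting goes (a sketch only; as described above, it is a hint to the provider, not a guarantee):
// per query, as in SessionBeanTwoB:
Query query = em.createNamedQuery("Person.findAllPersonByEmail");
query.setFlushMode(FlushModeType.COMMIT); // provider may defer flushing pending changes until commit

// or as the default for every query of this EntityManager:
em.setFlushMode(FlushModeType.COMMIT);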