How do I sort data in Apache Arrow

I can't seem to find any examples of sorting data with Apache Arrow. The closest I have found is this, which sorts the data in userspace.
More specifically, I'm interested in the JS version.

It appears that the Rust implementation is getting a sort implementation.
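For what it's worth, the "userspace" approach mentioned above boils down to computing a sorted index permutation over the key column and then gathering every other column through it. Here is a minimal sketch of that pattern (plain C++ with std::vector standing in for Arrow column data; the same idea carries over to the JS library):

    // Minimal sketch of the "userspace" sort pattern: argsort one column,
    // then gather all columns through the resulting index permutation.
    // Plain std::vector stands in for Arrow column data here.
    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <numeric>
    #include <string>
    #include <vector>

    int main() {
        std::vector<int32_t>     age  = {42, 17, 30};
        std::vector<std::string> name = {"carol", "alice", "bob"};

        // argsort: indices 0..n-1 ordered by the sort-key column
        std::vector<size_t> idx(age.size());
        std::iota(idx.begin(), idx.end(), 0);
        std::sort(idx.begin(), idx.end(),
                  [&](size_t a, size_t b) { return age[a] < age[b]; });

        // gather: materialize every column in the new order
        for (size_t i : idx) {
            std::cout << name[i] << " " << age[i] << "\n";
        }
        return 0;
    }

The advantage over sorting rows directly is that only the small index array is shuffled; each column is then materialized in a single pass.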

Related

Apache Arrow Flight: Getting sorted data from multiple endpoints

According to the documentation (https://arrow.apache.org/docs/dev/format/Flight.html), an Apache Arrow Flight client cannot get sorted data from multiple endpoints. It seems that this is by design.
In the introductory blog post (https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/), they say "While Flight streams are not necessarily ordered, we provide for application-defined metadata which can be used to serialize ordering information." But I think the application-defined metadata is not very useful, since a general client (like a BI application) that uses a wrapper - for example, Apache Arrow Flight SQL, let alone a wrapper of a wrapper such as the Apache Arrow Flight SQL JDBC driver - does not know about it.
Is there any standard way to get sorted data from multiple Apache Arrow Flight endpoints? If not, why did the designers choose not to support that feature?
Thanks.
It was not considered at the time, but you are right: it would be useful to have a way to indicate this so that various wrappers and projects building on top have a standardized way to know how to handle this.
The main idea is that if data is sorted, you should return a single endpoint. I believe the reasoning was that it would be rare to have an implementation capable of doing sorting across multiple endpoints, since that would be expensive to implement. Of course, that isn't very useful if your backend can actually sort data across multiple workers!
I (as one of the contributors to the project) am planning to put up a proposal to handle this case. If you are interested, please keep an eye on the mailing list: dev@arrow.apache.org.
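To illustrate what handling this on the client side would involve: if a server did return several endpoints whose streams were each locally sorted, the client itself would have to k-way merge them to recover a global order. A conceptual sketch only (plain vectors stand in for the data read from each endpoint; nothing here is part of the Flight API):

    // Conceptual sketch: k-way merge of per-endpoint streams that are each
    // already sorted, to recover a globally sorted order on the client.
    // Plain vectors stand in for the data read from each Flight endpoint.
    #include <iostream>
    #include <queue>
    #include <tuple>
    #include <vector>

    int main() {
        std::vector<std::vector<int>> endpoints = {
            {1, 4, 9}, {2, 3, 10}, {5, 6, 7}};

        // (value, endpoint index, position within that endpoint), min-heap by value
        using Item = std::tuple<int, size_t, size_t>;
        std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;

        for (size_t e = 0; e < endpoints.size(); ++e)
            if (!endpoints[e].empty()) heap.emplace(endpoints[e][0], e, 0);

        while (!heap.empty()) {
            auto [value, e, pos] = heap.top();
            heap.pop();
            std::cout << value << " ";
            if (pos + 1 < endpoints[e].size())
                heap.emplace(endpoints[e][pos + 1], e, pos + 1);
        }
        std::cout << "\n";  // prints 1 2 3 4 5 6 7 9 10
        return 0;
    }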

Integrating Google maps with C++ Program

I am making an artificial-intelligence-based shortest-distance finder between two points in C++. My code for that is complete and working fine. Now I want to integrate it with the Google Maps API: I want to show the shortest distance graphically on Google Maps, exactly as Google Maps shows directions. I am stuck and can't find any help. I know I have to do socket programming for this. Please guide me with proper steps and code snippets. Thanks in advance!
Check out the official API:
https://cloud.google.com/maps-platform/
For interacting with the HTTP API you could use a client such as curl (http://curl.haxx.se/), or maybe Boost.Asio if you are using Boost. See this question on SO: Boost.ASIO-based HTTP client library (like libcurl)
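For example, fetching a route from the Directions web service with libcurl could look roughly like this (a sketch only: YOUR_API_KEY, START and END are placeholders, and the JSON response still has to be parsed with a JSON library before you can draw anything):

    // Rough sketch: fetch a route from the Google Maps Directions web service
    // with libcurl. YOUR_API_KEY / START / END are placeholders; the JSON
    // response still needs to be parsed before it can be drawn.
    #include <curl/curl.h>
    #include <iostream>
    #include <string>

    // libcurl write callback: append received bytes to a std::string
    static size_t write_cb(char* data, size_t size, size_t nmemb, void* userp) {
        static_cast<std::string*>(userp)->append(data, size * nmemb);
        return size * nmemb;
    }

    int main() {
        const std::string url =
            "https://maps.googleapis.com/maps/api/directions/json"
            "?origin=START&destination=END&key=YOUR_API_KEY";

        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL* curl = curl_easy_init();
        if (!curl) return 1;

        std::string response;
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

        CURLcode rc = curl_easy_perform(curl);
        if (rc != CURLE_OK)
            std::cerr << "curl error: " << curl_easy_strerror(rc) << "\n";
        else
            std::cout << response << "\n";  // JSON with routes/legs/steps

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return 0;
    }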
There is a C++ client/helper available. Disclosure: I have not tried it yet.
https://google.github.io/google-api-cpp-client/latest/
Many links seem to be broken (e.g. the samples directory link), so I'm not sure how well supported it is, but it looks like it might be helpful.
The installation page is at:
https://google.github.io/google-api-cpp-client/latest/start/installation.html
If that fails, there are samples in other languages that you may have to translate by hand (better than nothing):
https://developers.google.com/api-client-library/

Text indexing library in C/C++

I am developing a Windows desktop product which requires a text indexing library in C/C++. I want to give it a series of words and a record that needs to be stored against those words. Searching for those words should quickly bring back one or more records. Data will be stored on disk.
I have searched this forum and found Lucene, but it is basically Java. There is also a CLucene C++ port, but I am not sure whether it is suitable (lightweight enough?) for a small Windows desktop product.
I have found other .NET-based libraries, but nothing lightweight and for C++.
Can you help please?
Have you considered SQLite? An RDBMS might be a little heavy, but I believe it is used inside some web browsers to implement HTML5 "Local Databases".
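If you go that route, SQLite's full-text search extension (FTS5, assuming your SQLite build has it enabled) maps fairly directly onto "store a record against a series of words, search the words, get the records back". A rough sketch using the C API (table and column names are made up):

    // Sketch: store a record against a set of words and search them back
    // with SQLite's FTS5 extension (requires a build with FTS5 enabled).
    // Table/column names are illustrative only.
    #include <sqlite3.h>
    #include <cstdio>

    int main() {
        sqlite3* db = nullptr;
        if (sqlite3_open("index.db", &db) != SQLITE_OK) return 1;

        // One FTS5 virtual table: 'words' is indexed text, 'record' is the payload
        sqlite3_exec(db,
            "CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(words, record);",
            nullptr, nullptr, nullptr);
        sqlite3_exec(db,
            "INSERT INTO docs(words, record) "
            "VALUES ('invoice shipping overdue', 'record #1001');",
            nullptr, nullptr, nullptr);

        // Search: return every record whose words match the query term
        sqlite3_stmt* stmt = nullptr;
        sqlite3_prepare_v2(db,
            "SELECT record FROM docs WHERE docs MATCH ?;", -1, &stmt, nullptr);
        sqlite3_bind_text(stmt, 1, "overdue", -1, SQLITE_TRANSIENT);

        while (sqlite3_step(stmt) == SQLITE_ROW)
            std::printf("%s\n",
                reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0)));

        sqlite3_finalize(stmt);
        sqlite3_close(db);
        return 0;
    }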

Very Simple C++ Web Crawler/Spider?

I am trying to write a very simple web crawler/spider app in C++. I have been searching Google for a simple one to understand the concept. I found this:
spider_simpleCrawler
However, it is too complicated for me to understand, since I started learning C++ about a month ago.
This is, for example, what I'm trying to do:
Enter the URL: www.example.com (I will use bash/wget to get the contents/source code),
look for, say, "a href" links, and then store them in some data file.
Is there a simpler tutorial or guide on the Internet?
All right, I'll try to point you in the right direction. Conceptually, a webcrawler is pretty simple. It revolves around a FIFO queue data structure which stores pending URLs. C++ has a built-in queue structure in the standard library, std::queue, which you can use to store URLs as strings.
The basic algorithm is pretty straightforward:
1. Begin with a base URL that you select, and place it at the top of your queue.
2. Pop the URL at the top of the queue and download it.
3. Parse the downloaded HTML file and extract all links.
4. Insert each extracted link into the queue.
5. Go to step 2, or stop once you reach some specified limit.
Now, I said that a webcrawler is conceptually simple, but implementing it is not so simple. As you can see from the above algorithm, you'll need: an HTTP networking library to allow you to download URLs, and a good HTML parser that will let you extract links. You mentioned you could use wget to download pages. That simplifies things somewhat, but you still need to actually parse the downloaded HTML docs. Parsing HTML correctly is a non-trivial task. A simple string search for <a href= will only work sometimes. However, if this is just a toy program that you're using to familiarize yourself with C++, a simple string search may suffice for your purposes. Otherwise, you need to use a serious HTML parsing library.
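To make the queue-plus-string-search idea concrete, here is a toy sketch that shells out to wget (as you planned) and does the naive href search described above. It deliberately ignores robots.txt, politeness delays, relative URLs and proper HTML parsing:

    // Toy crawler sketch: BFS over a URL queue, pages fetched by shelling out
    // to wget, links found with a naive string search for href="...".
    // Ignores robots.txt, politeness, relative URLs and real HTML parsing.
    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <queue>
    #include <set>
    #include <sstream>
    #include <string>

    int main() {
        std::queue<std::string> pending;
        std::set<std::string> visited;
        pending.push("http://www.example.com");

        const std::size_t limit = 10;  // stop after this many pages

        while (!pending.empty() && visited.size() < limit) {
            std::string url = pending.front();
            pending.pop();
            if (!visited.insert(url).second) continue;  // already crawled

            // 1. Download the page with wget into a temporary file
            std::string cmd = "wget -q -O page.html \"" + url + "\"";
            if (std::system(cmd.c_str()) != 0) continue;

            // 2. Read the file and do a naive search for href="..."
            std::ifstream in("page.html");
            std::stringstream buf;
            buf << in.rdbuf();
            std::string html = buf.str();

            std::size_t pos = 0;
            while ((pos = html.find("href=\"", pos)) != std::string::npos) {
                pos += 6;  // skip past href="
                std::size_t end = html.find('"', pos);
                if (end == std::string::npos) break;
                std::string link = html.substr(pos, end - pos);
                pos = end;

                // 3. Only absolute http(s) links go back into the queue
                if (link.rfind("http", 0) == 0) {
                    std::cout << url << " -> " << link << "\n";
                    pending.push(link);
                }
            }
        }
        return 0;
    }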
There are also other considerations you need to take into account when writing a webcrawler, such as politeness. People will be pissed and possibly ban your IP if you attempt to download too many pages, too quickly, from the same host. So you may need to implement some sort of policy where your webcrawler waits for a short period before downloading each site. You also need some mechanism to avoid downloading the same URL again, obey the robots exclusion protocol, avoid crawler traps, etc... All these details add up to make actually implementing a robust webcrawler not such a simple thing.
That said, I agree with larsmans in the comments. A webcrawler isn't the greatest way to learn C++. Also, C++ isn't the greatest language to write a webcrawler in. The raw performance and low-level access you get in C++ are useless when writing a program like a webcrawler, which spends most of its time waiting for URLs to resolve and download. A higher-level scripting language like Python is better suited for this task, in my opinion.
Check out this web crawler and indexer written in C++: Mitza web crawler.
The code can be used as a reference. It is clean and provides a good start for
webcrawler coding. Sequence diagrams can be found on the pages linked above.
A web-crawler has the following components in it:
Downloading an HTML file
Extracting links from it
Pushing all the links into a queue
(web indexing and ranking, if necessary)
Repeating this with the front element of the queue
This one has it all: Web-Crawler.
It is very helpful for beginners who want a complete understanding of a web crawler, along with the concepts of multithreading and web ranking.

Wiki with good support for page moves?

We use DokuWiki to manage our internal documentation, but page renames/moves are not supported very well (there is no built-in way other than messing with the raw files manually, and the third-party 'pagemove' plugin is no longer developed), which is a pain.
I'm looking for an alternative that is about as simple as DokuWiki (it must be filesystem-based) but handles page renames/moves well. Any suggestions?
For anyone whose search lands them on this page: you might also be interested in the plugin that keeps links working for moved and renamed pages in DokuWiki:
http://www.dokuwiki.org/plugin:move
Starting with Comparison of wiki software and sorting by data backend, there seem to be quite a few filesystem-based wikis. Skipping the webpages that are down or incomprehensible turns up the following viable candidates:
MoinMoin
TWiki
PmWiki (after installing a plugin)
JSPWiki
In the end it's up to you to decide which of these best suits your needs and supports migrating your existing content to the new wiki (no small feat), but at least it's a start.