ConcurrentHashMap put - concurrency

I have a following map
ConcurrentMap<K, V> myMap = new ConcurrentHashMap<>();
and now I would like to insert new object into that map atomically, so I can do something like
V myMethod(K key, int val) {
return map.putIfAbsent(key, new V(val));
}
but it is not atomic, because firstly new V will be created and then inserted into map. Is there any way to do this without using synchronized (or using synchronized is the fastest way here)?

But... ConcurrentHashMap already uses synchronized internally.
/*
* ORACLE PROPRIETARY/CONFIDENTIAL. Use is subject to license terms.
* Written by Doug Lea with assistance from members of JCP JSR-166
* Expert Group and released to the public domain, as explained at
* http://creativecommons.org/publicdomain/zero/1.0/
*/
/** Implementation for put and putIfAbsent */
final V putVal(K key, V value, boolean onlyIfAbsent) {
if (key == null || value == null) throw new NullPointerException();
int hash = spread(key.hashCode());
int binCount = 0;
for (Node<K,V>[] tab = table;;) {
Node<K,V> f; int n, i, fh;
if (tab == null || (n = tab.length) == 0)
tab = initTable();
else if ((f = tabAt(tab, i = (n - 1) & hash)) == null) {
if (casTabAt(tab, i, null,
new Node<K,V>(hash, key, value, null)))
break; // no lock when adding to empty bin
}
else if ((fh = f.hash) == MOVED)
tab = helpTransfer(tab, f);
else {
V oldVal = null;
synchronized (f) {
if (tabAt(tab, i) == f) {
if (fh >= 0) {
binCount = 1;
for (Node<K,V> e = f;; ++binCount) {
K ek;
if (e.hash == hash &&
((ek = e.key) == key ||
(ek != null && key.equals(ek)))) {
oldVal = e.val;
if (!onlyIfAbsent)
e.val = value;
break;
}
Node<K,V> pred = e;
if ((e = e.next) == null) {
pred.next = new Node<K,V>(hash, key,
value, null);
break;
}
}
}
else if (f instanceof TreeBin) {
Node<K,V> p;
binCount = 2;
if ((p = ((TreeBin<K,V>)f).putTreeVal(hash, key,
value)) != null) {
oldVal = p.val;
if (!onlyIfAbsent)
p.val = value;
}
}
}
}
if (binCount != 0) {
if (binCount >= TREEIFY_THRESHOLD)
treeifyBin(tab, i);
if (oldVal != null)
return oldVal;
break;
}
}
}
addCount(1L, binCount);
return null;
}
And
* Insertion (via put or its variants) of the first node in an
* empty bin is performed by just CASing it to the bin. This is
* by far the most common case for put operations under most
* key/hash distributions. Other update operations (insert,
* delete, and replace) require locks. We do not want to waste
* the space required to associate a distinct lock object with
* each bin, so instead use the first node of a bin list itself as
* a lock. Locking support for these locks relies on builtin
* "synchronized" monitors.
You shouldn't need to specify synchronized yourself.

It appears you are looking for computeIfAbsent in Java 8.
V myMethod(K key, int val) {
return map.computeIfAbsent(key, () -> new V(val));
}
This will only create a V object once per key. Note: it will still create a lambda object each time (unless Escape Analysis places it on the stack)

The Javadoc explicitly states that putIfAbsent() is atomic.

Related

Sort file names func [duplicate]

I'm sorting strings that are comprised of text and numbers.
I want the sort to sort the number parts as numbers, not alphanumeric.
For example I want: abc1def, ..., abc9def, abc10def
instead of: abc10def, abc1def, ..., abc9def
Does anyone know an algorithm for this (in particular in c++)
Thanks
I asked this exact question (although in Java) and got pointed to http://www.davekoelle.com/alphanum.html which has an algorithm and implementations of it in many languages.
Update 14 years later: Dave Koelle’s blog has gone off line and I can’t find his actual algorithm, but here’s an implementation.
https://github.com/cblanc/koelle-sort
Several natural sort implementations for C++ are available. A brief review:
natural_sort<> - based on Boost.Regex.
In my tests, it's roughly 20 times slower than other options.
Dirk Jagdmann's alnum.hpp, based on Dave Koelle's alphanum algorithm
Potential integer overlow issues for values over MAXINT
Martin Pool's natsort - written in C, but trivially usable from C++.
The only C/C++ implementation I've seen to offer a case insensitive version, which would seem to be a high priority for a "natural" sort.
Like the other implementations, it doesn't actually parse decimal points, but it does special case leading zeroes (anything with a leading 0 is assumed to be a fraction), which is a little weird but potentially useful.
PHP uses this algorithm.
This is known as natural sorting. There's an algorithm here that looks promising.
Be careful of problems with non-ASCII characters (see Jeff's blog entry on the subject).
Partially reposting my another answer:
bool compareNat(const std::string& a, const std::string& b){
if (a.empty())
return true;
if (b.empty())
return false;
if (std::isdigit(a[0]) && !std::isdigit(b[0]))
return true;
if (!std::isdigit(a[0]) && std::isdigit(b[0]))
return false;
if (!std::isdigit(a[0]) && !std::isdigit(b[0]))
{
if (a[0] == b[0])
return compareNat(a.substr(1), b.substr(1));
return (toUpper(a) < toUpper(b));
//toUpper() is a function to convert a std::string to uppercase.
}
// Both strings begin with digit --> parse both numbers
std::istringstream issa(a);
std::istringstream issb(b);
int ia, ib;
issa >> ia;
issb >> ib;
if (ia != ib)
return ia < ib;
// Numbers are the same --> remove numbers and recurse
std::string anew, bnew;
std::getline(issa, anew);
std::getline(issb, bnew);
return (compareNat(anew, bnew));
}
toUpper() function:
std::string toUpper(std::string s){
for(int i=0;i<(int)s.length();i++){s[i]=toupper(s[i]);}
return s;
}
Usage:
std::vector<std::string> str;
str.push_back("abc1def");
str.push_back("abc10def");
...
std::sort(str.begin(), str.end(), compareNat);
To solve what is essentially a parsing problem a state machine (aka finite state automaton) is the way to go. Dissatisfied with the above solutions i wrote a simple one-pass early bail-out algorithm that beats C/C++ variants suggested above in terms of performance, does not suffer from numerical datatype overflow errors, and is easy to modify to add case insensitivity if required.
sources can be found here
For those that arrive here and are already using Qt in their project, you can use the QCollator class. See this question for details.
Avalanchesort is a recursive variation of naturall sort, whiche merge runs, while exploring the stack of sorting-datas. The algorithim will sort stable, even if you add datas to your sorting-heap, while the algorithm is running/sorting.
The search-principle is simple. Only merge runs with the same rank.
After finding the first two naturell runs (rank 0), avalanchesort merge them to a run with rank 1. Then it call avalanchesort, to generate a second run with rank 1 and merge the two runs to a run with rank 2. Then it call the avalancheSort to generate a run with rank 2 on the unsorted datas....
My Implementation porthd/avalanchesort divide the sorting from the handling of the data using interface injection. You can use the algorithmn for datastructures like array, associative arrays or lists.
/**
* #param DataListAvalancheSortInterface $dataList
* #param DataRangeInterface $beginRange
* #param int $avalancheIndex
* #return bool
*/
public function startAvalancheSort(DataListAvalancheSortInterface $dataList)
{
$avalancheIndex = 0;
$rangeResult = $this->avalancheSort($dataList, $dataList->getFirstIdent(), $avalancheIndex);
if (!$dataList->isLastIdent($rangeResult->getStop())) {
do {
$avalancheIndex++;
$lastIdent = $rangeResult->getStop();
if ($dataList->isLastIdent($lastIdent)) {
$rangeResult = new $this->rangeClass();
$rangeResult->setStart($dataList->getFirstIdent());
$rangeResult->setStop($dataList->getLastIdent());
break;
}
$nextIdent = $dataList->getNextIdent($lastIdent);
$rangeFollow = $this->avalancheSort($dataList, $nextIdent, $avalancheIndex);
$rangeResult = $this->mergeAvalanche($dataList, $rangeResult, $rangeFollow);
} while (true);
}
return $rangeResult;
}
/**
* #param DataListAvalancheSortInterface $dataList
* #param DataRangeInterface $range
* #return DataRangeInterface
*/
protected function findRun(DataListAvalancheSortInterface $dataList,
$startIdent)
{
$result = new $this->rangeClass();
$result->setStart($startIdent);
$result->setStop($startIdent);
do {
if ($dataList->isLastIdent($result->getStop())) {
break;
}
$nextIdent = $dataList->getNextIdent($result->getStop());
if ($dataList->oddLowerEqualThanEven(
$dataList->getDataItem($result->getStop()),
$dataList->getDataItem($nextIdent)
)) {
$result->setStop($nextIdent);
} else {
break;
}
} while (true);
return $result;
}
/**
* #param DataListAvalancheSortInterface $dataList
* #param $beginIdent
* #param int $avalancheIndex
* #return DataRangeInterface|mixed
*/
protected function avalancheSort(DataListAvalancheSortInterface $dataList,
$beginIdent,
int $avalancheIndex = 0)
{
if ($avalancheIndex === 0) {
$rangeFirst = $this->findRun($dataList, $beginIdent);
if ($dataList->isLastIdent($rangeFirst->getStop())) {
// it is the last run
$rangeResult = $rangeFirst;
} else {
$nextIdent = $dataList->getNextIdent($rangeFirst->getStop());
$rangeSecond = $this->findRun($dataList, $nextIdent);
$rangeResult = $this->mergeAvalanche($dataList, $rangeFirst, $rangeSecond);
}
} else {
$rangeFirst = $this->avalancheSort($dataList,
$beginIdent,
($avalancheIndex - 1)
);
if ($dataList->isLastIdent($rangeFirst->getStop())) {
$rangeResult = $rangeFirst;
} else {
$nextIdent = $dataList->getNextIdent($rangeFirst->getStop());
$rangeSecond = $this->avalancheSort($dataList,
$nextIdent,
($avalancheIndex - 1)
);
$rangeResult = $this->mergeAvalanche($dataList, $rangeFirst, $rangeSecond);
}
}
return $rangeResult;
}
protected function mergeAvalanche(DataListAvalancheSortInterface $dataList, $oddListRange, $evenListRange)
{
$resultRange = new $this->rangeClass();
$oddNextIdent = $oddListRange->getStart();
$oddStopIdent = $oddListRange->getStop();
$evenNextIdent = $evenListRange->getStart();
$evenStopIdent = $evenListRange->getStop();
$dataList->initNewListPart($oddListRange, $evenListRange);
do {
if ($dataList->oddLowerEqualThanEven(
$dataList->getDataItem($oddNextIdent),
$dataList->getDataItem($evenNextIdent)
)) {
$dataList->addListPart($oddNextIdent);
if ($oddNextIdent === $oddStopIdent) {
$restTail = $evenNextIdent;
$stopTail = $evenStopIdent;
break;
}
$oddNextIdent = $dataList->getNextIdent($oddNextIdent);
} else {
$dataList->addListPart($evenNextIdent);
if ($evenNextIdent === $evenStopIdent) {
$restTail = $oddNextIdent;
$stopTail = $oddStopIdent;
break;
}
$evenNextIdent = $dataList->getNextIdent($evenNextIdent);
}
} while (true);
while ($stopTail !== $restTail) {
$dataList->addListPart($restTail);
$restTail = $dataList->getNextIdent($restTail);
}
$dataList->addListPart($restTail);
$dataList->cascadeDataListChange($resultRange);
return $resultRange;
}
}
My algorithm with test code of java version. If you want to use it in your project you can define a comparator yourself.
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;
public class FileNameSortTest {
private static List<String> names = Arrays.asList(
"A__01__02",
"A__2__02",
"A__1__23",
"A__11__23",
"A__3++++",
"B__1__02",
"B__22_13",
"1_22_2222",
"12_222_222",
"2222222222",
"1.sadasdsadsa",
"11.asdasdasdasdasd",
"2.sadsadasdsad",
"22.sadasdasdsadsa",
"3.asdasdsadsadsa",
"adsadsadsasd1",
"adsadsadsasd10",
"adsadsadsasd3",
"adsadsadsasd02"
);
public static void main(String...args) {
List<File> files = new ArrayList<>();
names.forEach(s -> {
File f = new File(s);
try {
if (!f.exists()) {
f.createNewFile();
}
files.add(f);
} catch (IOException e) {
e.printStackTrace();
}
});
files.sort(Comparator.comparing(File::getName));
files.forEach(f -> System.out.print(f.getName() + " "));
System.out.println();
files.sort(new Comparator<File>() {
boolean caseSensitive = false;
int SPAN_OF_CASES = 'a' - 'A';
#Override
public int compare(File left, File right) {
char[] csLeft = left.getName().toCharArray(), csRight = right.getName().toCharArray();
boolean isNumberRegion = false;
int diff=0, i=0, j=0, lenLeft=csLeft.length, lenRight=csRight.length;
char cLeft = 0, cRight = 0;
for (; i<lenLeft && j<lenRight; i++, j++) {
cLeft = getCharByCaseSensitive(csLeft[i]);
cRight = getCharByCaseSensitive(csRight[j]);
boolean isNumericLeft = isNumeric(cLeft), isNumericRight = isNumeric(cRight);
if (isNumericLeft && isNumericRight) {
// Number start!
if (!isNumberRegion) {
isNumberRegion = true;
// Remove prefix '0'
while (i < lenLeft && cLeft == '0') i++;
while (j < lenRight && cRight == '0') j++;
if (i == lenLeft || j == lenRight) break;
}
// Diff start: calculate the diff value.
if (cLeft != cRight && diff == 0)
diff = cLeft - cRight;
} else {
if (isNumericLeft != isNumericRight) {
// One numeric and one char.
if (isNumberRegion)
return isNumericLeft ? 1 : -1;
return cLeft - cRight;
} else {
// Two chars: if (number) diff don't equal 0 return it.
if (diff != 0)
return diff;
// Calculate chars diff.
diff = cLeft - cRight;
if (diff != 0)
return diff;
// Reset!
isNumberRegion = false;
diff = 0;
}
}
}
// The longer one will be put backwards.
return (i == lenLeft && j == lenRight) ? cLeft - cRight : (i == lenLeft ? -1 : 1) ;
}
private boolean isNumeric(char c) {
return c >= '0' && c <= '9';
}
private char getCharByCaseSensitive(char c) {
return caseSensitive ? c : (c >= 'A' && c <= 'Z' ? (char) (c + SPAN_OF_CASES) : c);
}
});
files.forEach(f -> System.out.print(f.getName() + " "));
}
}
The output is,
1.sadasdsadsa 11.asdasdasdasdasd 12_222_222 1_22_2222 2.sadsadasdsad 22.sadasdasdsadsa 2222222222 3.asdasdsadsadsa A__01__02 A__11__23 A__1__23 A__2__02 A__3++++ B__1__02 B__22_13 adsadsadsasd02 adsadsadsasd1 adsadsadsasd10 adsadsadsasd3
1.sadasdsadsa 1_22_2222 2.sadsadasdsad 3.asdasdsadsadsa 11.asdasdasdasdasd 12_222_222 22.sadasdasdsadsa 2222222222 A__01__02 A__1__23 A__2__02 A__3++++ A__11__23 adsadsadsasd02 adsadsadsasd1 adsadsadsasd3 adsadsadsasd10 B__1__02 B__22_13
Process finished with exit code 0
// -1: s0 < s1; 0: s0 == s1; 1: s0 > s1
static int numericCompare(const string &s0, const string &s1) {
size_t i = 0, j = 0;
for (; i < s0.size() && j < s1.size();) {
string t0(1, s0[i++]);
while (i < s0.size() && !(isdigit(t0[0]) ^ isdigit(s0[i]))) {
t0.push_back(s0[i++]);
}
string t1(1, s1[j++]);
while (j < s1.size() && !(isdigit(t1[0]) ^ isdigit(s1[j]))) {
t1.push_back(s1[j++]);
}
if (isdigit(t0[0]) && isdigit(t1[0])) {
size_t p0 = t0.find_first_not_of('0');
size_t p1 = t1.find_first_not_of('0');
t0 = p0 == string::npos ? "" : t0.substr(p0);
t1 = p1 == string::npos ? "" : t1.substr(p1);
if (t0.size() != t1.size()) {
return t0.size() < t1.size() ? -1 : 1;
}
}
if (t0 != t1) {
return t0 < t1 ? -1 : 1;
}
}
return i == s0.size() && j == s1.size() ? 0 : i != s0.size() ? 1 : -1;
}
I am not very sure if it is you want, anyway, you can have a try:-)

Extending python3, how does the garbage collection work

I'm making my own PriorityQueue in C as a python module. I read the basics of python ownership and reference system, so I thought I'd do the following:
In push(): Accept an priority(int) and an object to be saved. Increment the reference count on the object to be saved, since we will be keeping that.
In pop(): Delete the object from my priorityqueue, but don't decrement the reference counter, since that might destroy the object. Instead I transfer my reference ownership to the python function calling my function.
This seemed to work at first hand. But when actually using it in an application I get the following error:
Fatal Python error: GC object already tracked
What does this mean? The stacktrace is not useful at all, it's all inside python files I don't recognize(sre_parse and apport_python_hook).
Just for clarity, these are my C push and pop functions:
(self->heap[index]->key is the priority of the element at that index
self->heap[index]->value is the object)
PyObject* pop(CDSHeap *self) {
//If there aare no elements
if (self->heap[0].value == 0 || self->end == 0) {
Py_RETURN_NONE;
}
//If there is only one element
if (self->end == 1) {
PyObject* result = self->heap[0].value;
self->heap[0].key = 0;
self->end = 0;
return result;
}
//Two or more elements:
//First save the result:
PyObject* result = self->heap[0].value;
//Get the last element, and place it at the top
while (self->heap[self->end].value == 0) self->end--;
self->heap[0].value = self->heap[self->end].value;
self->heap[0].key = self->heap[self->end].key;
self->heap[self->end].value = 0;
//Reheapify the heap
int ptr = 0;
while (self->end >= ptr) {
if (self->heap[ptr*2+1].value != 0 && self->heap[ptr*2+1].key < self->heap[ptr].key
&& (self->heap[ptr*2+2].value == 0 || self->heap[ptr*2+1].key <= self->heap[ptr*2+2].key)) {
swapElement(self->heap, ptr, ptr*2+1);
ptr = ptr*2+1;
}else
if (self->heap[ptr*2+2].value != 0 && self->heap[ptr*2+2].value < self->heap[ptr].value) {
swapElement(self->heap, ptr, ptr*2+2);
ptr = ptr*2+2;
} else {
break;
}
}
return result;
}
PyObject* push(CDSHeap *self, PyObject* args) {
int k;
PyObject *obj;
if (!PyArg_ParseTuple(args, "iO",&k, &obj)){
return NULL;
}
Py_INCREF(obj);
//Add the element to the end of the heap
self->heap[self->end].key = k;
self->heap[self->end].value = obj;
//Increment the size and reheapify
int ptr = self->end++;
while (ptr > 0) {
int parent = (ptr-1)/2;
if (self->heap[ptr].key < self->heap[parent].key) {
swapElement(self->heap, ptr, parent);
ptr = parent;
} else {
Py_RETURN_NONE;
}
}
Py_RETURN_NONE;
}

Dangling memory/ unallocated memory issue

I've this piece of code, which is called by a timer Update mechanism.
However, I notice, that the memory size of the application, while running, continuously increases by 4, indicating that there might be a rogue pointer, or some other issue.
void RtdbConnection::getItemList()
{
std::vector<CString> tagList = mItemList->getItems();
//CString str(_T("STD-DOL1"));
PwItemList* pil = mPwSrv->GetItemList();
CPwItem pw ;
for(auto it = tagList.begin(); it != tagList.end(); ++it)
{
pw = mPwSrv->GetItem(*it);
pil->AddItem(&(PwItem)pw);
}
pil->AddInfo(DB_DESC); //Description
pil->AddInfo(DB_QALRM); // Alarm Status
pil->AddInfo(DB_QUNAK); //UNACK status
pil->AddInfo(DB_AL_PRI); // Priority of the alarm tag
pil->ExecuteQuery();
int i = 0;
for (auto it = tagList.begin(); i < pil->GetInfoRetrievedCount() && it != tagList.end(); i+=4, it++)
{
//item = {0};
CString str(*it);
PwInfo info = pil->GetInfo(i);
CString p(info.szValue().c_str());
bool isAlarm = pil->GetInfo(i+1).bValue();
bool isAck = pil->GetInfo(i+2).bValue();
int priority = pil->GetInfo(i+3).iValue();
item = ItemInfo(str, p, isAlarm, isAck, priority);
//int r = sizeof(item);
mItemList->setInfo(str, item); // Set the details for the item of the List
}
delete pil;
pil = NULL;
}
I cannot seem to find a memory block requiring de-allocation here. Nor is there any allocation of memory when I step inside the following function :
mItemList->setInfo(str, item);
which is defined as :
void ItemList::setInfo(CString tagname, ItemInfo info)
{
int flag = 0;
COLORREF tempColour;
std::map<CString, ItemInfo>::iterator tempIterator;
if ( (tempIterator = mAlarmListMap.find(tagname)) !=mAlarmListMap.end() )
{
//remove the current iteminfo and insert new one
if(mAlarmListMap[tagname].getPriority() != info.getPriority() && (mAlarmListMap[tagname].getPriority()!=0))
{
mAlarmListMap[tagname].updatePriority(info.getPriority());
mAlarmListMap[tagname].mPrioChanged = TRUE;
}
else
{
mAlarmListMap[tagname].mPrioChanged = FALSE;
((mAlarmListMap[tagname].getPrevPriority() != 0)?(mAlarmListMap[tagname].ResetPrevPriority()):TRUE);
mAlarmListMap[tagname].setPriority(info.getPriority());
}
mAlarmListMap[tagname].setDescription(info.getDescription());
mAlarmListMap[tagname].setAlarm(info.getAlarmStat());
mAlarmListMap[tagname].setAlarmAck(info.getAckStat());
tempColour = mColourLogic->setUpdatedColour(mAlarmListMap[tagname].getAlarmStat(), mAlarmListMap[tagname].getAckStat(), flag);
mAlarmListMap[tagname].setColour(tempColour);
if(!(info.getAlarmStat() || info.getAckStat()))
{
flag = 1;
mAlarmListMap[tagname].mIsRTN = true;
mAlarmListMap[tagname].setDisplayCondition(false);
}
else
{
mAlarmListMap[tagname].setDisplayCondition(true);
}
//((mAlarmListMap[tagname].mIsRTN == true)?
}
else
{
tempIterator = mAlarmListMap.begin();
tempColour = mColourLogic->fillColourFirst(info.getAlarmStat(), info.getAckStat());
info.setColour(tempColour);
mAlarmListMap.insert(tempIterator, std::pair<CString,ItemInfo>(tagname,info));
}
}
I tried juggling with the allocations, but the increase is always a constant 4.
Could anyone kindly look and highlight where the issue could be?
Thanks a lot.

Algorithm for finding the maximum difference in an array of numbers

I have an array of a few million numbers.
double* const data = new double (3600000);
I need to iterate through the array and find the range (the largest value in the array minus the smallest value). However, there is a catch. I only want to find the range where the smallest and largest values are within 1,000 samples of each other.
So I need to find the maximum of: range(data + 0, data + 1000), range(data + 1, data + 1001), range(data + 2, data + 1002), ...., range(data + 3599000, data + 3600000).
I hope that makes sense. Basically I could do it like above, but I'm looking for a more efficient algorithm if one exists. I think the above algorithm is O(n), but I feel that it's possible to optimize. An idea I'm playing with is to keep track of the most recent maximum and minimum and how far back they are, then only backtrack when necessary.
I'll be coding this in C++, but a nice algorithm in pseudo code would be just fine. Also, if this number I'm trying to find has a name, I'd love to know what it is.
Thanks.
This type of question belongs to a branch of algorithms called streaming algorithms. It is the study of problems which require not only an O(n) solution but also need to work in a single pass over the data. the data is inputted as a stream to the algorithm, the algorithm can't save all of the data and then and then it is lost forever. the algorithm needs to get some answer about the data, such as for instance the minimum or the median.
Specifically you are looking for a maximum (or more commonly in literature - minimum) in a window over a stream.
Here's a presentation on an article that mentions this problem as a sub problem of what they are trying to get at. it might give you some ideas.
I think the outline of the solution is something like that - maintain the window over the stream where in each step one element is inserted to the window and one is removed from the other side (a sliding window). The items you actually keep in memory aren't all of the 1000 items in the window but a selected representatives which are going to be good candidates for being the minimum (or maximum).
read the article. it's abit complex but after 2-3 reads you can get the hang of it.
The algorithm you describe is really O(N), but i think the constant is too high. Another solution which looks reasonable is to use O(N*log(N)) algorithm the following way:
* create sorted container (std::multiset) of first 1000 numbers
* in loop (j=1, j<(3600000-1000); ++j)
- calculate range
- remove from the set number which is now irrelevant (i.e. in index *j - 1* of the array)
- add to set new relevant number (i.e. in index *j+1000-1* of the array)
I believe it should be faster, because the constant is much lower.
This is a good application of a min-queue - a queue (First-In, First-Out = FIFO) which can simultaneously keep track of the minimum element it contains, with amortized constant-time updates. Of course, a max-queue is basically the same thing.
Once you have this data structure in place, you can consider CurrentMax (of the past 1000 elements) minus CurrentMin, store that as the BestSoFar, and then push a new value and pop the old value, and check again. In this way, keep updating BestSoFar until the final value is the solution to your question. Each single step takes amortized constant time, so the whole thing is linear, and the implementation I know of has a good scalar constant (it's fast).
I don't know of any documentation on min-queue's - this is a data structure I came up with in collaboration with a coworker. You can implement it by internally tracking a binary tree of the least elements within each contiguous sub-sequence of your data. It simplifies the problem that you'll only pop data from one end of the structure.
If you're interested in more details, I can try to provide them. I was thinking of writing this data structure up as a paper for arxiv. Also note that Tarjan and others previously arrived at a more powerful min-deque structure that would work here, but the implementation is much more complex. You can google for "mindeque" to read about Tarjan et al.'s work.
std::multiset<double> range;
double currentmax = 0.0;
for (int i = 0; i < 3600000; ++i)
{
if (i >= 1000)
range.erase(range.find(data[i-1000]));
range.insert(data[i]);
if (i >= 999)
currentmax = max(currentmax, *range.rbegin());
}
Note untested code.
Edit: fixed off-by-one error.
read in the first 1000 numbers.
create a 1000 element linked list which tracks the current 1000 number.
create a 1000 element array of pointers to linked list nodes, 1-1 mapping.
sort the pointer array based on linked list node's values. This will rearrange the array but keep the linked list intact.
you can now calculate the range for the first 1000 numbers by examining the first and last element of the pointer array.
remove the first inserted element, which is either the head or the tail depending on how you made your linked list. Using the node's value perform a binary search on the pointer array to find the to-be-removed node's pointer, and shift the array one over to remove it.
add the 1001th element to the linked list, and insert a pointer to it in the correct position in the array, by performing one step of an insertion sort. This will keep the array sorted.
now you have the min. and max. value of the numbers between 1 and 1001, and can calculate the range using the first and last element of the pointer array.
it should now be obvious what you need to do for the rest of the array.
The algorithm should be O(n) since the delete and insertion is bounded by log(1e3) and everything else takes constant time.
I decided to see what the most efficient algorithm I could think of to solve this problem was using actual code and actual timings. I first created a simple solution, one that tracks the min/max for the previous n entries using a circular buffer, and a test harness to measure the speed. In the simple solution, each data value is compared against the set of min/max values, so that's about window_size * count tests (where window size in the original question is 1000 and count is 3600000).
I then thought about how to make it faster. First off, I created a solution that used a fifo queue to store window_size values and a linked list to store the values in ascending order where each node in the linked list was also a node in the queue. To process a data value, the item at the end of the fifo was removed from the linked list and the queue. The new value was added to the start of the queue and a linear search was used to find the position in the linked list. The min and max values could then be read from the start and end of the linked list. This was quick, but wouldn't scale well with increasing window_size (i.e. linearly).
So I decided to add a binary tree to the system to try to speed up the search part of the algorithm. The final timings for window_size = 1000 and count = 3600000 were:
Simple: 106875
Quite Complex: 1218
Complex: 1219
which was both expected and unexpected. Expected in that using a sorted linked list helped, unexpected in that the overhead of having a self balancing tree didn't offset the advantage of a quicker search. I tried the latter two with an increased window size and found that the were always nearly identical up to a window_size of 100000.
Which all goes to show that theorising about algorithms is one thing, implementing them is something else.
Anyway, for those that are interested, here's the code I wrote (there's quite a bit!):
Range.h:
#include <algorithm>
#include <iostream>
#include <ctime>
using namespace std;
// Callback types.
typedef void (*OutputCallback) (int min, int max);
typedef int (*GeneratorCallback) ();
// Declarations of the test functions.
clock_t Simple (int, int, GeneratorCallback, OutputCallback);
clock_t QuiteComplex (int, int, GeneratorCallback, OutputCallback);
clock_t Complex (int, int, GeneratorCallback, OutputCallback);
main.cpp:
#include "Range.h"
int
checksum;
// This callback is used to get data.
int CreateData ()
{
return rand ();
}
// This callback is used to output the results.
void OutputResults (int min, int max)
{
//cout << min << " - " << max << endl;
checksum += max - min;
}
// The program entry point.
void main ()
{
int
count = 3600000,
window = 1000;
srand (0);
checksum = 0;
std::cout << "Simple: Ticks = " << Simple (count, window, CreateData, OutputResults) << ", checksum = " << checksum << std::endl;
srand (0);
checksum = 0;
std::cout << "Quite Complex: Ticks = " << QuiteComplex (count, window, CreateData, OutputResults) << ", checksum = " << checksum << std::endl;
srand (0);
checksum = 0;
std::cout << "Complex: Ticks = " << Complex (count, window, CreateData, OutputResults) << ", checksum = " << checksum << std::endl;
}
Simple.cpp:
#include "Range.h"
// Function to actually process the data.
// A circular buffer of min/max values for the current window is filled
// and once full, the oldest min/max pair is sent to the output callback
// and replaced with the newest input value. Each value inputted is
// compared against all min/max pairs.
void ProcessData
(
int count,
int window,
GeneratorCallback input,
OutputCallback output,
int *min_buffer,
int *max_buffer
)
{
int
i;
for (i = 0 ; i < window ; ++i)
{
int
value = input ();
min_buffer [i] = max_buffer [i] = value;
for (int j = 0 ; j < i ; ++j)
{
min_buffer [j] = min (min_buffer [j], value);
max_buffer [j] = max (max_buffer [j], value);
}
}
for ( ; i < count ; ++i)
{
int
index = i % window;
output (min_buffer [index], max_buffer [index]);
int
value = input ();
min_buffer [index] = max_buffer [index] = value;
for (int k = (i + 1) % window ; k != index ; k = (k + 1) % window)
{
min_buffer [k] = min (min_buffer [k], value);
max_buffer [k] = max (max_buffer [k], value);
}
}
output (min_buffer [count % window], max_buffer [count % window]);
}
// A simple method of calculating the results.
// Memory management is done here outside of the timing portion.
clock_t Simple
(
int count,
int window,
GeneratorCallback input,
OutputCallback output
)
{
int
*min_buffer = new int [window],
*max_buffer = new int [window];
clock_t
start = clock ();
ProcessData (count, window, input, output, min_buffer, max_buffer);
clock_t
end = clock ();
delete [] max_buffer;
delete [] min_buffer;
return end - start;
}
QuiteComplex.cpp:
#include "Range.h"
template <class T>
class Range
{
private:
// Class Types
// Node Data
// Stores a value and its position in various lists.
struct Node
{
Node
*m_queue_next,
*m_list_greater,
*m_list_lower;
T
m_value;
};
public:
// Constructor
// Allocates memory for the node data and adds all the allocated
// nodes to the unused/free list of nodes.
Range
(
int window_size
) :
m_nodes (new Node [window_size]),
m_queue_tail (m_nodes),
m_queue_head (0),
m_list_min (0),
m_list_max (0),
m_free_list (m_nodes)
{
for (int i = 0 ; i < window_size - 1 ; ++i)
{
m_nodes [i].m_list_lower = &m_nodes [i + 1];
}
m_nodes [window_size - 1].m_list_lower = 0;
}
// Destructor
// Tidy up allocated data.
~Range ()
{
delete [] m_nodes;
}
// Function to add a new value into the data structure.
void AddValue
(
T value
)
{
Node
*node = GetNode ();
// clear links
node->m_queue_next = 0;
// set value of node
node->m_value = value;
// find place to add node into linked list
Node
*search;
for (search = m_list_max ; search ; search = search->m_list_lower)
{
if (search->m_value < value)
{
if (search->m_list_greater)
{
node->m_list_greater = search->m_list_greater;
search->m_list_greater->m_list_lower = node;
}
else
{
m_list_max = node;
}
node->m_list_lower = search;
search->m_list_greater = node;
}
}
if (!search)
{
m_list_min->m_list_lower = node;
node->m_list_greater = m_list_min;
m_list_min = node;
}
}
// Accessor to determine if the first output value is ready for use.
bool RangeAvailable ()
{
return !m_free_list;
}
// Accessor to get the minimum value of all values in the current window.
T Min ()
{
return m_list_min->m_value;
}
// Accessor to get the maximum value of all values in the current window.
T Max ()
{
return m_list_max->m_value;
}
private:
// Function to get a node to store a value into.
// This function gets nodes from one of two places:
// 1. From the unused/free list
// 2. From the end of the fifo queue, this requires removing the node from the list and tree
Node *GetNode ()
{
Node
*node;
if (m_free_list)
{
// get new node from unused/free list and place at head
node = m_free_list;
m_free_list = node->m_list_lower;
if (m_queue_head)
{
m_queue_head->m_queue_next = node;
}
m_queue_head = node;
}
else
{
// get node from tail of queue and place at head
node = m_queue_tail;
m_queue_tail = node->m_queue_next;
m_queue_head->m_queue_next = node;
m_queue_head = node;
// remove node from linked list
if (node->m_list_lower)
{
node->m_list_lower->m_list_greater = node->m_list_greater;
}
else
{
m_list_min = node->m_list_greater;
}
if (node->m_list_greater)
{
node->m_list_greater->m_list_lower = node->m_list_lower;
}
else
{
m_list_max = node->m_list_lower;
}
}
return node;
}
// Member Data.
Node
*m_nodes,
*m_queue_tail,
*m_queue_head,
*m_list_min,
*m_list_max,
*m_free_list;
};
// A reasonable complex but more efficent method of calculating the results.
// Memory management is done here outside of the timing portion.
clock_t QuiteComplex
(
int size,
int window,
GeneratorCallback input,
OutputCallback output
)
{
Range <int>
range (window);
clock_t
start = clock ();
for (int i = 0 ; i < size ; ++i)
{
range.AddValue (input ());
if (range.RangeAvailable ())
{
output (range.Min (), range.Max ());
}
}
clock_t
end = clock ();
return end - start;
}
Complex.cpp:
#include "Range.h"
template <class T>
class Range
{
private:
// Class Types
// Red/Black tree node colours.
enum NodeColour
{
Red,
Black
};
// Node Data
// Stores a value and its position in various lists and trees.
struct Node
{
// Function to get the sibling of a node.
// Because leaves are stored as null pointers, it must be possible
// to get the sibling of a null pointer. If the object is a null pointer
// then the parent pointer is used to determine the sibling.
Node *Sibling
(
Node *parent
)
{
Node
*sibling;
if (this)
{
sibling = m_tree_parent->m_tree_less == this ? m_tree_parent->m_tree_more : m_tree_parent->m_tree_less;
}
else
{
sibling = parent->m_tree_less ? parent->m_tree_less : parent->m_tree_more;
}
return sibling;
}
// Node Members
Node
*m_queue_next,
*m_tree_less,
*m_tree_more,
*m_tree_parent,
*m_list_greater,
*m_list_lower;
NodeColour
m_colour;
T
m_value;
};
public:
// Constructor
// Allocates memory for the node data and adds all the allocated
// nodes to the unused/free list of nodes.
Range
(
int window_size
) :
m_nodes (new Node [window_size]),
m_queue_tail (m_nodes),
m_queue_head (0),
m_tree_root (0),
m_list_min (0),
m_list_max (0),
m_free_list (m_nodes)
{
for (int i = 0 ; i < window_size - 1 ; ++i)
{
m_nodes [i].m_list_lower = &m_nodes [i + 1];
}
m_nodes [window_size - 1].m_list_lower = 0;
}
// Destructor
// Tidy up allocated data.
~Range ()
{
delete [] m_nodes;
}
// Function to add a new value into the data structure.
void AddValue
(
T value
)
{
Node
*node = GetNode ();
// clear links
node->m_queue_next = node->m_tree_more = node->m_tree_less = node->m_tree_parent = 0;
// set value of node
node->m_value = value;
// insert node into tree
if (m_tree_root)
{
InsertNodeIntoTree (node);
BalanceTreeAfterInsertion (node);
}
else
{
m_tree_root = m_list_max = m_list_min = node;
node->m_tree_parent = node->m_list_greater = node->m_list_lower = 0;
}
m_tree_root->m_colour = Black;
}
// Accessor to determine if the first output value is ready for use.
bool RangeAvailable ()
{
return !m_free_list;
}
// Accessor to get the minimum value of all values in the current window.
T Min ()
{
return m_list_min->m_value;
}
// Accessor to get the maximum value of all values in the current window.
T Max ()
{
return m_list_max->m_value;
}
private:
// Function to get a node to store a value into.
// This function gets nodes from one of two places:
// 1. From the unused/free list
// 2. From the end of the fifo queue, this requires removing the node from the list and tree
Node *GetNode ()
{
Node
*node;
if (m_free_list)
{
// get new node from unused/free list and place at head
node = m_free_list;
m_free_list = node->m_list_lower;
if (m_queue_head)
{
m_queue_head->m_queue_next = node;
}
m_queue_head = node;
}
else
{
// get node from tail of queue and place at head
node = m_queue_tail;
m_queue_tail = node->m_queue_next;
m_queue_head->m_queue_next = node;
m_queue_head = node;
// remove node from tree
node = RemoveNodeFromTree (node);
RebalanceTreeAfterDeletion (node);
// remove node from linked list
if (node->m_list_lower)
{
node->m_list_lower->m_list_greater = node->m_list_greater;
}
else
{
m_list_min = node->m_list_greater;
}
if (node->m_list_greater)
{
node->m_list_greater->m_list_lower = node->m_list_lower;
}
else
{
m_list_max = node->m_list_lower;
}
}
return node;
}
// Rebalances the tree after insertion
void BalanceTreeAfterInsertion
(
Node *node
)
{
node->m_colour = Red;
while (node != m_tree_root && node->m_tree_parent->m_colour == Red)
{
if (node->m_tree_parent == node->m_tree_parent->m_tree_parent->m_tree_more)
{
Node
*uncle = node->m_tree_parent->m_tree_parent->m_tree_less;
if (uncle && uncle->m_colour == Red)
{
node->m_tree_parent->m_colour = Black;
uncle->m_colour = Black;
node->m_tree_parent->m_tree_parent->m_colour = Red;
node = node->m_tree_parent->m_tree_parent;
}
else
{
if (node == node->m_tree_parent->m_tree_less)
{
node = node->m_tree_parent;
LeftRotate (node);
}
node->m_tree_parent->m_colour = Black;
node->m_tree_parent->m_tree_parent->m_colour = Red;
RightRotate (node->m_tree_parent->m_tree_parent);
}
}
else
{
Node
*uncle = node->m_tree_parent->m_tree_parent->m_tree_more;
if (uncle && uncle->m_colour == Red)
{
node->m_tree_parent->m_colour = Black;
uncle->m_colour = Black;
node->m_tree_parent->m_tree_parent->m_colour = Red;
node = node->m_tree_parent->m_tree_parent;
}
else
{
if (node == node->m_tree_parent->m_tree_more)
{
node = node->m_tree_parent;
RightRotate (node);
}
node->m_tree_parent->m_colour = Black;
node->m_tree_parent->m_tree_parent->m_colour = Red;
LeftRotate (node->m_tree_parent->m_tree_parent);
}
}
}
}
// Adds a node into the tree and sorted linked list
void InsertNodeIntoTree
(
Node *node
)
{
Node
*parent = 0,
*child = m_tree_root;
bool
greater;
while (child)
{
parent = child;
child = (greater = node->m_value > child->m_value) ? child->m_tree_more : child->m_tree_less;
}
node->m_tree_parent = parent;
if (greater)
{
parent->m_tree_more = node;
// insert node into linked list
if (parent->m_list_greater)
{
parent->m_list_greater->m_list_lower = node;
}
else
{
m_list_max = node;
}
node->m_list_greater = parent->m_list_greater;
node->m_list_lower = parent;
parent->m_list_greater = node;
}
else
{
parent->m_tree_less = node;
// insert node into linked list
if (parent->m_list_lower)
{
parent->m_list_lower->m_list_greater = node;
}
else
{
m_list_min = node;
}
node->m_list_lower = parent->m_list_lower;
node->m_list_greater = parent;
parent->m_list_lower = node;
}
}
// Red/Black tree manipulation routine, used for removing a node
Node *RemoveNodeFromTree
(
Node *node
)
{
if (node->m_tree_less && node->m_tree_more)
{
// the complex case, swap node with a child node
Node
*child;
if (node->m_tree_less)
{
// find largest value in lesser half (node with no greater pointer)
for (child = node->m_tree_less ; child->m_tree_more ; child = child->m_tree_more)
{
}
}
else
{
// find smallest value in greater half (node with no lesser pointer)
for (child = node->m_tree_more ; child->m_tree_less ; child = child->m_tree_less)
{
}
}
swap (child->m_colour, node->m_colour);
if (child->m_tree_parent != node)
{
swap (child->m_tree_less, node->m_tree_less);
swap (child->m_tree_more, node->m_tree_more);
swap (child->m_tree_parent, node->m_tree_parent);
if (!child->m_tree_parent)
{
m_tree_root = child;
}
else
{
if (child->m_tree_parent->m_tree_less == node)
{
child->m_tree_parent->m_tree_less = child;
}
else
{
child->m_tree_parent->m_tree_more = child;
}
}
if (node->m_tree_parent->m_tree_less == child)
{
node->m_tree_parent->m_tree_less = node;
}
else
{
node->m_tree_parent->m_tree_more = node;
}
}
else
{
child->m_tree_parent = node->m_tree_parent;
node->m_tree_parent = child;
Node
*child_less = child->m_tree_less,
*child_more = child->m_tree_more;
if (node->m_tree_less == child)
{
child->m_tree_less = node;
child->m_tree_more = node->m_tree_more;
node->m_tree_less = child_less;
node->m_tree_more = child_more;
}
else
{
child->m_tree_less = node->m_tree_less;
child->m_tree_more = node;
node->m_tree_less = child_less;
node->m_tree_more = child_more;
}
if (!child->m_tree_parent)
{
m_tree_root = child;
}
else
{
if (child->m_tree_parent->m_tree_less == node)
{
child->m_tree_parent->m_tree_less = child;
}
else
{
child->m_tree_parent->m_tree_more = child;
}
}
}
if (child->m_tree_less)
{
child->m_tree_less->m_tree_parent = child;
}
if (child->m_tree_more)
{
child->m_tree_more->m_tree_parent = child;
}
if (node->m_tree_less)
{
node->m_tree_less->m_tree_parent = node;
}
if (node->m_tree_more)
{
node->m_tree_more->m_tree_parent = node;
}
}
Node
*child = node->m_tree_less ? node->m_tree_less : node->m_tree_more;
if (node->m_tree_parent->m_tree_less == node)
{
node->m_tree_parent->m_tree_less = child;
}
else
{
node->m_tree_parent->m_tree_more = child;
}
if (child)
{
child->m_tree_parent = node->m_tree_parent;
}
return node;
}
// Red/Black tree manipulation routine, used for rebalancing a tree after a deletion
void RebalanceTreeAfterDeletion
(
Node *node
)
{
Node
*child = node->m_tree_less ? node->m_tree_less : node->m_tree_more;
if (node->m_colour == Black)
{
if (child && child->m_colour == Red)
{
child->m_colour = Black;
}
else
{
Node
*parent = node->m_tree_parent,
*n = child;
while (parent)
{
Node
*sibling = n->Sibling (parent);
if (sibling && sibling->m_colour == Red)
{
parent->m_colour = Red;
sibling->m_colour = Black;
if (n == parent->m_tree_more)
{
LeftRotate (parent);
}
else
{
RightRotate (parent);
}
}
sibling = n->Sibling (parent);
if (parent->m_colour == Black &&
sibling->m_colour == Black &&
(!sibling->m_tree_more || sibling->m_tree_more->m_colour == Black) &&
(!sibling->m_tree_less || sibling->m_tree_less->m_colour == Black))
{
sibling->m_colour = Red;
n = parent;
parent = n->m_tree_parent;
continue;
}
else
{
if (parent->m_colour == Red &&
sibling->m_colour == Black &&
(!sibling->m_tree_more || sibling->m_tree_more->m_colour == Black) &&
(!sibling->m_tree_less || sibling->m_tree_less->m_colour == Black))
{
sibling->m_colour = Red;
parent->m_colour = Black;
break;
}
else
{
if (n == parent->m_tree_more &&
sibling->m_colour == Black &&
(sibling->m_tree_more && sibling->m_tree_more->m_colour == Red) &&
(!sibling->m_tree_less || sibling->m_tree_less->m_colour == Black))
{
sibling->m_colour = Red;
sibling->m_tree_more->m_colour = Black;
RightRotate (sibling);
}
else
{
if (n == parent->m_tree_less &&
sibling->m_colour == Black &&
(!sibling->m_tree_more || sibling->m_tree_more->m_colour == Black) &&
(sibling->m_tree_less && sibling->m_tree_less->m_colour == Red))
{
sibling->m_colour = Red;
sibling->m_tree_less->m_colour = Black;
LeftRotate (sibling);
}
}
sibling = n->Sibling (parent);
sibling->m_colour = parent->m_colour;
parent->m_colour = Black;
if (n == parent->m_tree_more)
{
sibling->m_tree_less->m_colour = Black;
LeftRotate (parent);
}
else
{
sibling->m_tree_more->m_colour = Black;
RightRotate (parent);
}
break;
}
}
}
}
}
}
// Red/Black tree manipulation routine, used for balancing the tree
void LeftRotate
(
Node *node
)
{
Node
*less = node->m_tree_less;
node->m_tree_less = less->m_tree_more;
if (less->m_tree_more)
{
less->m_tree_more->m_tree_parent = node;
}
less->m_tree_parent = node->m_tree_parent;
if (!node->m_tree_parent)
{
m_tree_root = less;
}
else
{
if (node == node->m_tree_parent->m_tree_more)
{
node->m_tree_parent->m_tree_more = less;
}
else
{
node->m_tree_parent->m_tree_less = less;
}
}
less->m_tree_more = node;
node->m_tree_parent = less;
}
// Red/Black tree manipulation routine, used for balancing the tree
void RightRotate
(
Node *node
)
{
Node
*more = node->m_tree_more;
node->m_tree_more = more->m_tree_less;
if (more->m_tree_less)
{
more->m_tree_less->m_tree_parent = node;
}
more->m_tree_parent = node->m_tree_parent;
if (!node->m_tree_parent)
{
m_tree_root = more;
}
else
{
if (node == node->m_tree_parent->m_tree_less)
{
node->m_tree_parent->m_tree_less = more;
}
else
{
node->m_tree_parent->m_tree_more = more;
}
}
more->m_tree_less = node;
node->m_tree_parent = more;
}
// Member Data.
Node
*m_nodes,
*m_queue_tail,
*m_queue_head,
*m_tree_root,
*m_list_min,
*m_list_max,
*m_free_list;
};
// A complex but more efficent method of calculating the results.
// Memory management is done here outside of the timing portion.
clock_t Complex
(
int count,
int window,
GeneratorCallback input,
OutputCallback output
)
{
Range <int>
range (window);
clock_t
start = clock ();
for (int i = 0 ; i < count ; ++i)
{
range.AddValue (input ());
if (range.RangeAvailable ())
{
output (range.Min (), range.Max ());
}
}
clock_t
end = clock ();
return end - start;
}
Idea of algorithm:
Take the first 1000 values of data and sort them
The last in the sort - the first is range(data + 0, data + 999).
Then remove from the sort pile the first element with the value data[0]
and add the element data[1000]
Now, the last in the sort - the first is range(data + 1, data + 1000).
Repeat until done
// This should run in (DATA_LEN - RANGE_WIDTH)log(RANGE_WIDTH)
#include <set>
#include <algorithm>
using namespace std;
const int DATA_LEN = 3600000;
double* const data = new double (DATA_LEN);
....
....
const int RANGE_WIDTH = 1000;
double range = new double(DATA_LEN - RANGE_WIDTH);
multiset<double> data_set;
data_set.insert(data[i], data[RANGE_WIDTH]);
for (int i = 0 ; i < DATA_LEN - RANGE_WIDTH - 1 ; i++)
{
range[i] = *data_set.end() - *data_set.begin();
multiset<double>::iterator iter = data_set.find(data[i]);
data_set.erase(iter);
data_set.insert(data[i+1]);
}
range[i] = *data_set.end() - *data_set.begin();
// range now holds the values you seek
You should probably check this for off by 1 errors, but the idea is there.

How to implement a natural sort algorithm in c++?

I'm sorting strings that are comprised of text and numbers.
I want the sort to sort the number parts as numbers, not alphanumeric.
For example I want: abc1def, ..., abc9def, abc10def
instead of: abc10def, abc1def, ..., abc9def
Does anyone know an algorithm for this (in particular in c++)
Thanks
I asked this exact question (although in Java) and got pointed to http://www.davekoelle.com/alphanum.html which has an algorithm and implementations of it in many languages.
Update 14 years later: Dave Koelle’s blog has gone off line and I can’t find his actual algorithm, but here’s an implementation.
https://github.com/cblanc/koelle-sort
Several natural sort implementations for C++ are available. A brief review:
natural_sort<> - based on Boost.Regex.
In my tests, it's roughly 20 times slower than other options.
Dirk Jagdmann's alnum.hpp, based on Dave Koelle's alphanum algorithm
Potential integer overlow issues for values over MAXINT
Martin Pool's natsort - written in C, but trivially usable from C++.
The only C/C++ implementation I've seen to offer a case insensitive version, which would seem to be a high priority for a "natural" sort.
Like the other implementations, it doesn't actually parse decimal points, but it does special case leading zeroes (anything with a leading 0 is assumed to be a fraction), which is a little weird but potentially useful.
PHP uses this algorithm.
This is known as natural sorting. There's an algorithm here that looks promising.
Be careful of problems with non-ASCII characters (see Jeff's blog entry on the subject).
Partially reposting my another answer:
bool compareNat(const std::string& a, const std::string& b){
if (a.empty())
return true;
if (b.empty())
return false;
if (std::isdigit(a[0]) && !std::isdigit(b[0]))
return true;
if (!std::isdigit(a[0]) && std::isdigit(b[0]))
return false;
if (!std::isdigit(a[0]) && !std::isdigit(b[0]))
{
if (a[0] == b[0])
return compareNat(a.substr(1), b.substr(1));
return (toUpper(a) < toUpper(b));
//toUpper() is a function to convert a std::string to uppercase.
}
// Both strings begin with digit --> parse both numbers
std::istringstream issa(a);
std::istringstream issb(b);
int ia, ib;
issa >> ia;
issb >> ib;
if (ia != ib)
return ia < ib;
// Numbers are the same --> remove numbers and recurse
std::string anew, bnew;
std::getline(issa, anew);
std::getline(issb, bnew);
return (compareNat(anew, bnew));
}
toUpper() function:
std::string toUpper(std::string s){
for(int i=0;i<(int)s.length();i++){s[i]=toupper(s[i]);}
return s;
}
Usage:
std::vector<std::string> str;
str.push_back("abc1def");
str.push_back("abc10def");
...
std::sort(str.begin(), str.end(), compareNat);
To solve what is essentially a parsing problem a state machine (aka finite state automaton) is the way to go. Dissatisfied with the above solutions i wrote a simple one-pass early bail-out algorithm that beats C/C++ variants suggested above in terms of performance, does not suffer from numerical datatype overflow errors, and is easy to modify to add case insensitivity if required.
sources can be found here
For those that arrive here and are already using Qt in their project, you can use the QCollator class. See this question for details.
Avalanchesort is a recursive variation of naturall sort, whiche merge runs, while exploring the stack of sorting-datas. The algorithim will sort stable, even if you add datas to your sorting-heap, while the algorithm is running/sorting.
The search-principle is simple. Only merge runs with the same rank.
After finding the first two naturell runs (rank 0), avalanchesort merge them to a run with rank 1. Then it call avalanchesort, to generate a second run with rank 1 and merge the two runs to a run with rank 2. Then it call the avalancheSort to generate a run with rank 2 on the unsorted datas....
My Implementation porthd/avalanchesort divide the sorting from the handling of the data using interface injection. You can use the algorithmn for datastructures like array, associative arrays or lists.
/**
* #param DataListAvalancheSortInterface $dataList
* #param DataRangeInterface $beginRange
* #param int $avalancheIndex
* #return bool
*/
public function startAvalancheSort(DataListAvalancheSortInterface $dataList)
{
$avalancheIndex = 0;
$rangeResult = $this->avalancheSort($dataList, $dataList->getFirstIdent(), $avalancheIndex);
if (!$dataList->isLastIdent($rangeResult->getStop())) {
do {
$avalancheIndex++;
$lastIdent = $rangeResult->getStop();
if ($dataList->isLastIdent($lastIdent)) {
$rangeResult = new $this->rangeClass();
$rangeResult->setStart($dataList->getFirstIdent());
$rangeResult->setStop($dataList->getLastIdent());
break;
}
$nextIdent = $dataList->getNextIdent($lastIdent);
$rangeFollow = $this->avalancheSort($dataList, $nextIdent, $avalancheIndex);
$rangeResult = $this->mergeAvalanche($dataList, $rangeResult, $rangeFollow);
} while (true);
}
return $rangeResult;
}
/**
* #param DataListAvalancheSortInterface $dataList
* #param DataRangeInterface $range
* #return DataRangeInterface
*/
protected function findRun(DataListAvalancheSortInterface $dataList,
$startIdent)
{
$result = new $this->rangeClass();
$result->setStart($startIdent);
$result->setStop($startIdent);
do {
if ($dataList->isLastIdent($result->getStop())) {
break;
}
$nextIdent = $dataList->getNextIdent($result->getStop());
if ($dataList->oddLowerEqualThanEven(
$dataList->getDataItem($result->getStop()),
$dataList->getDataItem($nextIdent)
)) {
$result->setStop($nextIdent);
} else {
break;
}
} while (true);
return $result;
}
/**
* #param DataListAvalancheSortInterface $dataList
* #param $beginIdent
* #param int $avalancheIndex
* #return DataRangeInterface|mixed
*/
protected function avalancheSort(DataListAvalancheSortInterface $dataList,
$beginIdent,
int $avalancheIndex = 0)
{
if ($avalancheIndex === 0) {
$rangeFirst = $this->findRun($dataList, $beginIdent);
if ($dataList->isLastIdent($rangeFirst->getStop())) {
// it is the last run
$rangeResult = $rangeFirst;
} else {
$nextIdent = $dataList->getNextIdent($rangeFirst->getStop());
$rangeSecond = $this->findRun($dataList, $nextIdent);
$rangeResult = $this->mergeAvalanche($dataList, $rangeFirst, $rangeSecond);
}
} else {
$rangeFirst = $this->avalancheSort($dataList,
$beginIdent,
($avalancheIndex - 1)
);
if ($dataList->isLastIdent($rangeFirst->getStop())) {
$rangeResult = $rangeFirst;
} else {
$nextIdent = $dataList->getNextIdent($rangeFirst->getStop());
$rangeSecond = $this->avalancheSort($dataList,
$nextIdent,
($avalancheIndex - 1)
);
$rangeResult = $this->mergeAvalanche($dataList, $rangeFirst, $rangeSecond);
}
}
return $rangeResult;
}
protected function mergeAvalanche(DataListAvalancheSortInterface $dataList, $oddListRange, $evenListRange)
{
$resultRange = new $this->rangeClass();
$oddNextIdent = $oddListRange->getStart();
$oddStopIdent = $oddListRange->getStop();
$evenNextIdent = $evenListRange->getStart();
$evenStopIdent = $evenListRange->getStop();
$dataList->initNewListPart($oddListRange, $evenListRange);
do {
if ($dataList->oddLowerEqualThanEven(
$dataList->getDataItem($oddNextIdent),
$dataList->getDataItem($evenNextIdent)
)) {
$dataList->addListPart($oddNextIdent);
if ($oddNextIdent === $oddStopIdent) {
$restTail = $evenNextIdent;
$stopTail = $evenStopIdent;
break;
}
$oddNextIdent = $dataList->getNextIdent($oddNextIdent);
} else {
$dataList->addListPart($evenNextIdent);
if ($evenNextIdent === $evenStopIdent) {
$restTail = $oddNextIdent;
$stopTail = $oddStopIdent;
break;
}
$evenNextIdent = $dataList->getNextIdent($evenNextIdent);
}
} while (true);
while ($stopTail !== $restTail) {
$dataList->addListPart($restTail);
$restTail = $dataList->getNextIdent($restTail);
}
$dataList->addListPart($restTail);
$dataList->cascadeDataListChange($resultRange);
return $resultRange;
}
}
My algorithm with test code of java version. If you want to use it in your project you can define a comparator yourself.
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;
public class FileNameSortTest {
private static List<String> names = Arrays.asList(
"A__01__02",
"A__2__02",
"A__1__23",
"A__11__23",
"A__3++++",
"B__1__02",
"B__22_13",
"1_22_2222",
"12_222_222",
"2222222222",
"1.sadasdsadsa",
"11.asdasdasdasdasd",
"2.sadsadasdsad",
"22.sadasdasdsadsa",
"3.asdasdsadsadsa",
"adsadsadsasd1",
"adsadsadsasd10",
"adsadsadsasd3",
"adsadsadsasd02"
);
public static void main(String...args) {
List<File> files = new ArrayList<>();
names.forEach(s -> {
File f = new File(s);
try {
if (!f.exists()) {
f.createNewFile();
}
files.add(f);
} catch (IOException e) {
e.printStackTrace();
}
});
files.sort(Comparator.comparing(File::getName));
files.forEach(f -> System.out.print(f.getName() + " "));
System.out.println();
files.sort(new Comparator<File>() {
boolean caseSensitive = false;
int SPAN_OF_CASES = 'a' - 'A';
#Override
public int compare(File left, File right) {
char[] csLeft = left.getName().toCharArray(), csRight = right.getName().toCharArray();
boolean isNumberRegion = false;
int diff=0, i=0, j=0, lenLeft=csLeft.length, lenRight=csRight.length;
char cLeft = 0, cRight = 0;
for (; i<lenLeft && j<lenRight; i++, j++) {
cLeft = getCharByCaseSensitive(csLeft[i]);
cRight = getCharByCaseSensitive(csRight[j]);
boolean isNumericLeft = isNumeric(cLeft), isNumericRight = isNumeric(cRight);
if (isNumericLeft && isNumericRight) {
// Number start!
if (!isNumberRegion) {
isNumberRegion = true;
// Remove prefix '0'
while (i < lenLeft && cLeft == '0') i++;
while (j < lenRight && cRight == '0') j++;
if (i == lenLeft || j == lenRight) break;
}
// Diff start: calculate the diff value.
if (cLeft != cRight && diff == 0)
diff = cLeft - cRight;
} else {
if (isNumericLeft != isNumericRight) {
// One numeric and one char.
if (isNumberRegion)
return isNumericLeft ? 1 : -1;
return cLeft - cRight;
} else {
// Two chars: if (number) diff don't equal 0 return it.
if (diff != 0)
return diff;
// Calculate chars diff.
diff = cLeft - cRight;
if (diff != 0)
return diff;
// Reset!
isNumberRegion = false;
diff = 0;
}
}
}
// The longer one will be put backwards.
return (i == lenLeft && j == lenRight) ? cLeft - cRight : (i == lenLeft ? -1 : 1) ;
}
private boolean isNumeric(char c) {
return c >= '0' && c <= '9';
}
private char getCharByCaseSensitive(char c) {
return caseSensitive ? c : (c >= 'A' && c <= 'Z' ? (char) (c + SPAN_OF_CASES) : c);
}
});
files.forEach(f -> System.out.print(f.getName() + " "));
}
}
The output is,
1.sadasdsadsa 11.asdasdasdasdasd 12_222_222 1_22_2222 2.sadsadasdsad 22.sadasdasdsadsa 2222222222 3.asdasdsadsadsa A__01__02 A__11__23 A__1__23 A__2__02 A__3++++ B__1__02 B__22_13 adsadsadsasd02 adsadsadsasd1 adsadsadsasd10 adsadsadsasd3
1.sadasdsadsa 1_22_2222 2.sadsadasdsad 3.asdasdsadsadsa 11.asdasdasdasdasd 12_222_222 22.sadasdasdsadsa 2222222222 A__01__02 A__1__23 A__2__02 A__3++++ A__11__23 adsadsadsasd02 adsadsadsasd1 adsadsadsasd3 adsadsadsasd10 B__1__02 B__22_13
Process finished with exit code 0
// -1: s0 < s1; 0: s0 == s1; 1: s0 > s1
static int numericCompare(const string &s0, const string &s1) {
size_t i = 0, j = 0;
for (; i < s0.size() && j < s1.size();) {
string t0(1, s0[i++]);
while (i < s0.size() && !(isdigit(t0[0]) ^ isdigit(s0[i]))) {
t0.push_back(s0[i++]);
}
string t1(1, s1[j++]);
while (j < s1.size() && !(isdigit(t1[0]) ^ isdigit(s1[j]))) {
t1.push_back(s1[j++]);
}
if (isdigit(t0[0]) && isdigit(t1[0])) {
size_t p0 = t0.find_first_not_of('0');
size_t p1 = t1.find_first_not_of('0');
t0 = p0 == string::npos ? "" : t0.substr(p0);
t1 = p1 == string::npos ? "" : t1.substr(p1);
if (t0.size() != t1.size()) {
return t0.size() < t1.size() ? -1 : 1;
}
}
if (t0 != t1) {
return t0 < t1 ? -1 : 1;
}
}
return i == s0.size() && j == s1.size() ? 0 : i != s0.size() ? 1 : -1;
}
I am not very sure if it is you want, anyway, you can have a try:-)