Finding duplicates in a list/file [Groovy/Java]

I have an input file where each line is a special record.
I would gladly work at the file level, but it might be more convenient to read the file into a list (each object in the list = one row in the file).
In the input file, there can be several duplicate rows.
The goal: split the given file/list into unique records and duplicate records. That is, for records that appear multiple times, keep one occurrence and store the remaining duplicates in a new list.
I found an easy way to remove duplicates, but never found a way to store them.
List inputList = []
File inputFile = new File("....")
inputFile.eachLine { inputList.add(it) } // fill the list
// for illustration, assume the list now holds:
inputList = [1,1,3,3,1,2,2,3,4,1,5,6,7,7,8,9,8,10]
inputList = inputList.unique() // remove duplicates
println inputList
// inputList = [1, 3, 2, 4, 5, 6, 7, 8, 9, 10]
The output should look like this: two lists/files, one with the duplicates removed and one with the duplicates themselves.
inputList = [1,3,2,4,5,6,7,8,9,10] // only one occurrence of each line
listOfDuplicates = [1,1,1,3,3,2,7,8] // duplicates removed from the original list
The output does not need to correspond with the initial order of items.
Thank you for help, Matt

You could simply iterate over the list yourself:
def inputList = [1,1,3,3,1,2,2,3,4,1,5,6,7,7,8,9,8,10]
def uniques = []
def duplicates = []
inputList.each { uniques.contains(it) ? duplicates << it : uniques << it }
assert inputList.size() == uniques.size() + duplicates.size()
assert uniques == [1,3,2,4,5,6,7,8,9,10] // only one occurrence of each line
assert duplicates == [1,3,1,2,3,1,7,8] //duplicates removed from original list
inputList = uniques // if desired

There are many ways to do this; the following is the simplest:
def list = [1,1,3,3,1,2,2,3,4,1,5,6,7,7,8,9,8,10]
def unique=[]
def duplicates=[]
list.each {
    if (unique.contains(it))
        duplicates.add(it)
    else
        unique.add(it)
}
println list //[1, 1, 3, 3, 1, 2, 2, 3, 4, 1, 5, 6, 7, 7, 8, 9, 8, 10]
println unique //[1, 3, 2, 4, 5, 6, 7, 8, 9, 10]
println duplicates //[1, 3, 1, 2, 3, 1, 7, 8]
Hope this helps.

Something very straight-forward:
List inputList = [1,1,3,3,1,2,2,3,4,1,5,6,7,7,8,9,8,10]
def uniques = [], duplicates = []
Iterator iter = inputList.iterator()
iter.each {
    iter.remove()
    inputList.contains( it ) ? ( duplicates << it ) : ( uniques << it )
}
assert [2, 3, 4, 1, 5, 6, 7, 9, 8, 10] == uniques
assert [1,1,3,3,1,2,7,8] == duplicates

If order of duplicates isn't important:
def list = [1,1,3,3,1,2,2,3,4,1,5,6,7,7,8,9,8,10]
def (unique, dups) = list.groupBy().values()*.with{ [it[0..0], tail()] }.transpose()*.sum()
assert unique == [1,3,2,4,5,6,7,8,9,10]
assert dups == [1,1,1,3,3,2,7,8]

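This code should solve the problem:
List listOfDuplicates = inputList.clone()
// drop every value that occurs only once; note that every occurrence
// of a repeated value stays in the list, including the first one
listOfDuplicates.removeAll {
    listOfDuplicates.count(it) == 1
}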

The more the merrier:
groovy:000> list.groupBy().values()*.tail().flatten()
===> [1, 1, 1, 3, 3, 2, 7, 8]
1. Group by identity (this is basically a "frequencies" function).
2. Take just the values.
3. Clip the first element of each group.
4. Combine the lists.
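For readers working in Python rather than Groovy, the four steps above can be sketched roughly as follows. This is only an illustration, not part of the original answers: it assumes Python 3.7+ (so that `Counter` keeps first-seen order) and hashable items, and the names `input_list`, `uniques` and `duplicates` are hypothetical.

```python
from collections import Counter

# Same sample data as in the question.
input_list = [1, 1, 3, 3, 1, 2, 2, 3, 4, 1, 5, 6, 7, 7, 8, 9, 8, 10]

# "Group by identity": count how often each value occurs.
counts = Counter(input_list)

# One occurrence of every value, in first-seen order.
uniques = list(counts)

# "Clip the first element" of each group: keep n - 1 copies of each value.
duplicates = [value for value, n in counts.items() for _ in range(n - 1)]

print(uniques)     # [1, 3, 2, 4, 5, 6, 7, 8, 9, 10]
print(duplicates)  # [1, 1, 1, 3, 3, 2, 7, 8]
```

As in the group-by answers, the duplicates come out grouped by value rather than in their original positions.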

Related

Accumulate monotonically increasing sequence list to a linear list

Hi, I'm new to Python and to programming in general.
I have a list made of monotonically increasing runs that reset to 0, e.g. L = [1,2,3,4,0,1,2,3,4,5,0,1,2,3,4,0,1...]
Whenever the series reaches 0, I would like to take the previous value and add it to the rest of the list, so that the list becomes linear.
L_sequence = [1,2,3,4,0,1,2,3,4,5,0,1,2,3,4,0,1...]
L_linear = [1,2,3,4,4,5,6,7,8,9,9,10,11,12,13,13,14...]
I know a nasty way to do this, but if any of you have a good solution, you are welcome to share.
I think it should solve your problem:
l = [1,2,3,4,0,1,2,3,4,5,0,1,2,3,4,0,1]
x = []
helperOne = 0
helperTwo = 0
goForIt = False
for i in l:
    if i == 0:
        # a new run starts: carry over the last value of the previous run
        goForIt = True
        helperTwo = helperOne + helperTwo
    if goForIt==True:
        t = helperTwo + i
        x.append(t)
    else:
        x.append(i)
    helperOne = i
print(x)
Output:
[1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 9, 10, 11, 12, 13, 13, 14]
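The same idea can be written a little more compactly with a single running offset. The following is a sketch only, assuming that a value of 0 always marks the start of a new run; the function name `linearize` is hypothetical.

```python
def linearize(seq):
    """Turn runs that reset to 0 into one non-decreasing sequence."""
    offset = 0      # sum of the last values of all completed runs
    previous = 0    # last value seen in the current run
    result = []
    for value in seq:
        if value == 0:
            # A reset: fold the end of the previous run into the offset.
            offset += previous
        result.append(offset + value)
        previous = value
    return result


L_sequence = [1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 0, 1]
print(linearize(L_sequence))
# [1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 9, 10, 11, 12, 13, 13, 14]
```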

Best way to shift a list in Python?

I have a list of numbers, let's say :
my_list = [2, 4, 3, 8, 1, 1]
From this list, I want to obtain a new list that starts at the maximum value and runs to the end, followed by the first part (from the beginning up to just before the maximum), like this:
my_new_list = [8, 1, 1, 2, 4, 3]
(basically it corresponds to a horizontal graph shift...)
Is there a simple way to do so ? :)
Apply these as many times as you want.
To the left:
my_list.append(my_list.pop(0))
To the right:
my_list.insert(0, my_list.pop())
How about something like this:
max_idx = my_list.index(max(my_list))
my_new_list = my_list[max_idx:] + my_list[0:max_idx]
Alternatively you can do something like the following,
import itertools

def shift(l, n):
    return itertools.islice(itertools.cycle(l), n, n + len(l))

my_list = [2, 4, 3, 8, 1, 1]
list(shift(my_list, 3))  # [8, 1, 1, 2, 4, 3]
Elaborating on Yasc's solution for moving the order of the list values, here's a way to shift the list to start with the maximum value:
# Find the max value:
max_value = max(my_list)
# Move the last value from the end to the beginning,
# until the max value is the first value:
while my_list[0] != max_value:
    my_list.insert(0, my_list.pop())
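Another option, not mentioned in the answers above, is `collections.deque`, whose `rotate` method does the shifting in one call. The sketch below assumes that the first occurrence of the maximum should come first and that the input list should be left unchanged; the function name `rotate_to_max` is hypothetical.

```python
from collections import deque


def rotate_to_max(values):
    """Return a new list rotated so that it starts at the first maximum."""
    if not values:
        return []
    rotated = deque(values)
    # A negative argument rotates to the left by that many positions.
    rotated.rotate(-values.index(max(values)))
    return list(rotated)


my_list = [2, 4, 3, 8, 1, 1]
print(rotate_to_max(my_list))  # [8, 1, 1, 2, 4, 3]
```

Unlike the `while` loop above, this version does not modify `my_list` and returns an empty list for empty input instead of raising an error.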

Making a dictionary? from 2 lists / columns

I have a large database with several columns, and I need data from 2 of them.
The end result should be 2 drop-down menus: the first one selects a "name" and the second one offers the "numbers" values that have been merged into that name. I just need the data available so I can feed it into another program.
So I need a list or dictionary that contains the unique values of the "names" list, with the matching numbers from the "numbers" list attached to them.
# Just a list of random names and numbers for testing
names = [
"Cindi Brookins",
"Cumberband Hamberdund",
"Roger Ramsden",
"Cumberband Hamberdund",
"Lorean Dibble",
"Lorean Dibble",
"Coleen Snider",
"Rey Bains",
"Maxine Rader",
"Cindi Brookins",
"Catharine Vena",
"Lanny Mckennon",
"Berta Urban",
"Rey Bains",
"Roger Ramsden",
"Lanny Mckennon",
"Catharine Vena",
"Berta Urban",
"Maxine Rader",
"Coleen Snider"
]
numbers = [
6,
5,
7,
10,
3,
9,
1,
1,
2,
7,
4,
2,
8,
3,
8,
10,
4,
9,
6,
5
]
So in the above example "Berta Urban" would appear once, but still have the numbers 8 and 9 assigned, "Rey Bains" would have 1 and 3.
I have tried with
mergedlist = dict(zip(names, numbers))
But that only assigns the last of the numbers to the name.
I am not sure if i can make a dictionary with Unique "names" that holds multiple "numbers".
You only get the last number associated with each name because dictionary keys are unique (otherwise they wouldn't be much use). So if you do
mergedlist["Berta Urban"] = 8
and after that
mergedlist["Berta Urban"] = 9
the result will be
{'Berta Urban': 9}
Just as if you did:
berta_urban = 8
berta_urban = 9
In that case you would expect the value of berta_urban to be 9 and not [8,9].
So, as you can see, you need an append not an assignment to your dict entry.
from collections import defaultdict
mergedlist = defaultdict(list)
for name, number in zip(names, numbers):
    mergedlist[name].append(number)
This gives:
{'Coleen Snider': [1, 5],
'Cindi Brookins': [6, 7],
'Cumberband Hamberdund': [5, 10],
'Roger Ramsden': [7, 8],
'Lorean Dibble': [3, 9],
'Rey Bains': [1, 3],
'Maxine Rader': [2, 6],
'Catharine Vena': [4, 4],
'Lanny Mckennon': [2, 10],
'Berta Urban': [8, 9]
}
which is what I think you want. Note that you will get duplicates, as in 'Catharine Vena': [4, 4] and you will also get a list of numbers for each name, even if the list has only one number in it.
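If repeated numbers such as the [4, 4] for 'Catharine Vena' are not wanted in a drop-down menu, one possible variation is to skip numbers that are already stored for a name. This is a sketch under that assumption; the function name `merge_unique` is hypothetical, and the sample data is a short excerpt of the lists from the question.

```python
from collections import defaultdict


def merge_unique(names, numbers):
    """Map each name to its numbers, in order, without repeats."""
    merged = defaultdict(list)
    for name, number in zip(names, numbers):
        # Only append numbers we have not stored for this name yet.
        if number not in merged[name]:
            merged[name].append(number)
    return dict(merged)


# Excerpt of the question's data, for illustration only.
sample_names = ["Berta Urban", "Catharine Vena", "Berta Urban", "Catharine Vena"]
sample_numbers = [8, 4, 9, 4]
print(merge_unique(sample_names, sample_numbers))
# {'Berta Urban': [8, 9], 'Catharine Vena': [4]}
```

The `not in` check scans a list, which is fine for a handful of numbers per name; for many numbers per name a set per name would be the more usual choice, at the cost of losing the original order.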
You cannot have multiple keys of the same name in a dict, but your dict keys can be unique while holding a list of matching numbers. Something like:
mergedlist = {}
for i, v in enumerate(names):
    mergedlist[v] = mergedlist.get(v, []) + [numbers[i]]
print(mergedlist["Berta Urban"]) # prints [8, 9]
Not terribly efficient, though. Depending on the database you're using, chances are that the database can give you the results in the form you want faster than you can post-process and reconstruct the data.

Python 2.7 current row index on 2d array iteration

When iterating on a 2d array, how can I get the current row index? For example:
x = [[ 1. 2. 3. 4.]
[ 5. 6. 7. 8.]
[ 9. 0. 3. 6.]]
Something like:
for rows in x:
    print x current index (for example, when iterating on [ 5. 6. 7. 8.], return 1)
enumerate is a built-in Python function. Its usefulness cannot be summarized in a single line, yet many newcomers and even some experienced programmers are unaware of it. It lets us loop over something while keeping an automatic counter. Here is an example:
for counter, value in enumerate(some_list):
    print(counter, value)
And there is more! enumerate also accepts an optional argument which makes it even more useful.
my_list = ['apple', 'banana', 'grapes', 'pear']
for c, value in enumerate(my_list, 1):
    print(c, value)
# Output:
# 1 apple
# 2 banana
# 3 grapes
# 4 pear
The optional argument allows us to tell enumerate from where to start the index. You can also create tuples containing the index and list item using a list. Here is an example:
my_list = ['apple', 'banana', 'grapes', 'pear']
counter_list = list(enumerate(my_list, 1))
print(counter_list)
# Output: [(1, 'apple'), (2, 'banana'), (3, 'grapes'), (4, 'pear')]
Use enumerate:
In [42]: x = [[ 1, 2, 3, 4],
...: [ 5, 6, 7, 8],
...: [ 9, 0, 3, 6]]
In [43]: for index, rows in enumerate(x):
...: print('current index {}'.format(index))
...: print('current row {}'.format(rows))
...:
current index 0
current row [1, 2, 3, 4]
current index 1
current row [5, 6, 7, 8]
current index 2
current row [9, 0, 3, 6]
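When the column index is needed as well, `enumerate` can be nested. The following sketch assumes a plain list of lists like the `x` above; the helper name `find_positions` is hypothetical.

```python
x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 0, 3, 6]]


def find_positions(grid, target):
    """Return (row_index, column_index) for every cell equal to target."""
    return [
        (row_index, col_index)
        for row_index, row in enumerate(grid)
        for col_index, value in enumerate(row)
        if value == target
    ]


print(find_positions(x, 3))  # [(0, 2), (2, 2)]
```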

How do i check for duplicate values present in a Dictionary?

I want to write a function that takes a dictionary as input and returns a list of keys.
The list must contain only the keys whose values are unique in the dictionary.
So, this is what I have done.
bDict={}
for key,value in aDict.items():
    if bDict.has_key(value) == False:
        bDict[value]=key
    else:
        bDict.pop(value,None)
This is the output :
>>> aDict.keys()
Out[4]: [1, 3, 6, 7, 8, 10]
>>> aDict.values()
Out[5]: [1, 2, 0, 0, 4, 0]
>>> bDict.keys()
Out[6]: [0, 1, 2, 4]
>>> bDict.values()
Out[7]: [10, 1, 3, 8]
But the expected output for bDict.values() is: [1, 3, 8]
This may help.
CODE
aDict = { 1:1, 3:2, 6:0, 7:0, 8:4, 10:0, 11:0}
bDict = {}
for i,j in aDict.items():
    if j not in bDict:
        bDict[j] = [i]
    else:
        bDict[j].append(i)
print map(lambda x: x[0],filter(lambda x: len(x) == 1,bDict.values()))
OUTPUT
[1, 3, 8]
So it appears you're creating a new dictionary with the keys and values inverted, keeping pairs where the value is unique. You can figure out which of the items are unique first then build a dictionary off of that.
def distinct_values(d):
    from collections import Counter
    counts = Counter(d.itervalues())
    return { v: k for k, v in d.iteritems() if counts[v] == 1 }
This yields the following result:
>>> distinct_values({ 1:1, 3:2, 6:0, 7:0, 8:4, 10:0 })
{1: 1, 2: 3, 4: 8}
Here is a solution (with two versions of aDict, to test a case that failed in another solution):
#aDict = { 1:1, 3:2, 6:0, 7:0, 8:4, 10:0}
aDict = { 1:1, 3:2, 6:0, 7:0, 8:4, 10:0, 11:2}
seenValues = {}
uniqueKeys = set()
for aKey, aValue in aDict.items():
    if aValue not in seenValues:
        # Store the key of the value, and assume it is unique
        seenValues[aValue] = aKey
        uniqueKeys.add(aKey)
    elif seenValues[aValue] in uniqueKeys:
        # The value has been seen before, and the assumption of
        # it being unique was wrong, so remove it
        uniqueKeys.remove(seenValues[aValue])
        print "Remove non-unique key/value pair: {%d, %d}" % (aKey, aValue)
    else:
        print "Non-unique key/value pair: {%d, %d}" % (aKey, aValue)
print "Unique keys: ", sorted(uniqueKeys)
And this produces the output:
Remove non-unique key/value pair: {7, 0}
Non-unique key/value pair: {10, 0}
Remove non-unique key/value pair: {11, 2}
Unique keys: [1, 8]
Or with original version of aDict:
Remove non-unique key/value pair: {7, 0}
Non-unique key/value pair: {10, 0}
Unique keys: [1, 3, 8]
As a python 2.7 one-liner,
[k for k,v in aDict.iteritems() if aDict.values().count(v) == 1]
Note that the above:
1. Calls aDict.values() many times, once for each entry in the dictionary, and
2. Calls aDict.values().count(v) multiple times for each replicated value.
This is not a problem if the dictionary is small. If the dictionary isn't small, the creation and destruction of those duplicative lists and the duplicative calls to count() may be costly. It may help to cache the value of aDict.values(), and it may also help to build a dictionary that maps each value to its number of occurrences.
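One way to follow that advice is to count all values once with `collections.Counter` and then filter the keys. This is a sketch for Python 3 (the thread itself uses Python 2.7, where `iteritems()` would replace `items()`); the function name `keys_with_unique_values` is hypothetical.

```python
from collections import Counter


def keys_with_unique_values(d):
    """Return the keys whose value occurs exactly once in d."""
    # One pass to count every value, instead of one count() call per key.
    counts = Counter(d.values())
    return [key for key, value in d.items() if counts[value] == 1]


a_dict = {1: 1, 3: 2, 6: 0, 7: 0, 8: 4, 10: 0}
print(keys_with_unique_values(a_dict))  # [1, 3, 8]
```

This makes two passes over the dictionary regardless of how many values repeat, at the cost of one extra mapping of values to counts.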