Duplicate key exception reading map from edn file - clojure

I am persisting a Clojure map to a file using
(spit (str tmpdir "/" "results.edn") (.toString c))
where c is a very large hash-map (the file is 201MB). The keys of the map are strings and the values are numbers.
When I try to read the map back in from the file I'm getting a duplicate key exception. Here is how I'm trying to read it back in
(def phrases (edn/read
(PushbackReader.
(io/reader
"/tmp/mednotes8910030368496690883/results.edn"))))
How can the map end up with duplicate keys when it is written out to the edn file? Any ideas?
Here's the meat of the exception:
Caused by java.lang.IllegalArgumentException
Duplicate key: ? 5
PersistentHashMap.java: 67 clojure.lang.PersistentHashMap/createWithCheck
RT.java: 1538 clojure.lang.RT/map
EdnReader.java: 631 clojure.lang.EdnReader$MapReader/invoke
EdnReader.java: 142 clojure.lang.EdnReader/read
EdnReader.java: 108 clojure.lang.EdnReader/read
edn.clj: 35 clojure.edn/read
edn.clj: 33 clojure.edn/read
AFn.java: 154 clojure.lang.AFn/applyToHelper
AFn.java: 144 clojure.lang.AFn/applyTo
As requested, here is a sample of what is in the map:
{"cervical region of" 32,
"partial brachial" 64,
"is an effective medication for" 32,
", as stopping them" 32,
", supportive treatment" 160,
"should be eating a normal" 128,
"o call back if supplement" 32,
"diagnosiscod diabc }" 32,
"days in case of allergies" 32,
"ointment 8 drops" 32,
"leg from pinched" 32,
...
#fl00r suggested in the comments to grep the results.edn file for the string "? 5". Using egrep to show the 20 characters before and after this string, I get 4 entries with that key:
[/tmp/mednotes8910030368496690883]> egrep -o ".{20}\"\? 5\" .{20}" results.edn
oothbrush for" 160, "? 5" 32, ". ) during his
t with sutures" 32, "? 5" 352, "4.81 pounds" 1
"being up all" 32, "? 5" 32, "limited financi
everytime she" 32, "? 5" 32, "had a partial m
interesting that 3 of the 4 have exactly the same counts. Still not sure how this can happen when written from a seemingly good map that doesn't have these duplications.

Related

How can find some pin point with regex

I am trying to analyze a string with regex (e.g. 20, 38,, 20, 24 n2,, 20, 28, 38,, 851, 859 n3,) in XML files.
Example text:
<p>Gilmer v Interstate/Johnson Lane Corp. (1991) 500 US 20, 38, 111 S Ct 1647:</p>
<p>Gilmer v Interstate/Johnson Lane Corp. (1991) 500 US 20, 24 n2, 111 S Ct 1647</p>
<p>Gilmer v Interstate/Johnson Lane Corp.</italic> (1991) 500 US 20, 28, 38, 111 S Ct 1647</p>
<p>International Bhd. of Elec. Workers v Hechler (1987) 481 US 851, 859 n3, 107 S Ct 2161:</p>
I want to modify the (\([^()]*)|([0-9]+,)\s*[0-9]+,?\s*[0-9]+, regex because I am replacing the text with $1$2.
(https://regex101.com/r/jWt2w1/2)
Use
(\([^()]*)|([0-9]+,)\s*[0-9]+(?:\s+[a-z]+)?,?\s*[0-9]+(?:\s+[a-z]+)?,
See proof
The (?:\s+[a-z]+)? optionally matches one or more whitespace characters and one or more letters.

Remove regex pattern from string and store in csv

I am trying to clean up a CSV by using regex. I have accomplished the first part which extracts the regex pattern from the address table and writes it to the street_numb field. The part I need help with is removing that same pattern from the street field so I only end up with the following (i.e., Steinway St, 31 St, 82nd Rd, and 19th St) stored in the street field. Hence these values would be removed (-78, -45, -35, -54) from the street field.
b street_numb street address zipcode
1 246 FIFTH AVE 246 FIFTH AVE 11215
2 30 -78 -78 STEINWAY ST 30 -78 STEINWAY ST 11016
3 25 -45 -45 31ST ST 25 -45 31ST ST 11102
4 123 -35 -35 82ND RD 123 -35 82ND RD 11415
5 22 -54 -54 19TH ST 22 -54 19TH ST 11105
Sample Data (above)
import csv
import re
path = '/Users/darchcruise/Desktop/bldg_zip_codes.csv'
with open(path, 'rU') as infile, open(path+'out.csv', 'w') as outfile:
fieldnames = ['b', 'street_numb', 'street', 'address', 'zipcode']
readablefile = csv.DictReader(infile)
writablefile = csv.DictWriter(outfile, fieldnames=fieldnames)
for row in readablefile:
add = re.match(r'\d+\s*-\s*\d+', row['address'])
if add:
row['street_numb'] = add.group()
# row['street'] = remove re.string (add.group()) from street field
writablefile.writerow(row)
else:
writablefile.writerow(row)
What code in line 12 (# remove re.string from row['street']) could be used to resolve my issue (removing -78, -45, -35, -54 from the street field)?
You can use capturing group with findall like this
[x for x in re.findall("(\d+\s*(-\s*\d+\s+)?)((\w|\s)+)", row['address'])][0][0]-->gives street number
[x for x in re.findall("(\d+\s*(-\s*\d+\s+)?)((\w|\s)+)", row['address'])][0][2]-->gives address

Repeating Capture Groups Regex

I have a large chunk of class data that I need to run a regular expression on and get data back from. The problem is that I need a repeating capturing group in order to acomplish that.
Womn St 157A QUEERHISTORY MAKING
CCode Typ Sec Unt Instructor Time Place Max Enr Req Rstr Status
32680 LEC A 4 SHAH, P. TuTh 11:00-12:20p IAB 131 35 37 60 FULL
Womn St 171 SEX/RACE & CONQUEST
CCode Typ Sec Unt Instructor Time Place Max Enr Req Rstr Status
32710 LEC A 4 O'TOOLE, R. TuTh 2:00- 3:20p DBH 1300 52 13/45 24 OPEN
~ Same as 25610 (GlblClt 103B, Lec A); 26350 (History 169, Lec A); and
~ 60320 (Anthro 139, Lec B).
32711 DIS 1 0 MONSON, A. W 9:00- 9:50 HH 105 25 5/23 8 OPEN
O'TOOLE, R.
~ Same as 25612 (GlblClt 103B, Dis 1); 26351 (History 169, Dis 1); and
~ 60321 (Anthro 139, Dis 1).
The result I need would return two matches
Match
Group1:Womn St 157A
Group2:QUEERHISTORY MAKING
Group3:32680
Group4:LEC
Group5:A
Group6:SHAH, P.
Group7:TuTh 11:00-12:20p
Group8:IAB 13
Match
Group1:Womn St 171
Group2:SEX/RACE & CONQUEST
Group3:32710
Group4:LEC
Group5:A
Group6:O'TOOLE, R.
Group7:TuTh 2:00- 3:20p
Group8:DBH 1300
Group9:25610
Group10:26350
Group11:60320
Group12:32711
Group13:DIS
Group14:1
Group15:MONSON, A.
Group16: W 9:00- 9:50
Group17:HH 105
Group18:25612
Group19:26351
Group20:60321

How to extract character component in values and replace the values with -99

My data looks like:
VAR_A: 134, 15M3, 2004, 301ME, 201E, 41, 53, 22
I'd like to change this vector like below:
VAR_A: 134, -99, 2004, -99, -99, 41, 53, 22
If a value contain characters (e.g., M, E), I want to change those values with -99.
How could I do it in R? I've heard that regular expression would be a possible way, but I'm not good at it.
It seems to me you want to replace the values that are not digits, if that is the case ...
x <- c('134', '15M3', '2004', '301ME', '201E', '41', '53', '22')
sub('.*\\D.*', '-99', x)
# [1] "134" "-99" "2004" "-99" "-99" "41" "53" "22"
Or essentially you could do:
x[grepl('\\D', x)] <- -99
as.numeric(x)
# [1] 134 -99 2004 -99 -99 41 53 22

JasperReports: Converting String into array and populating List with it

What I have is this String [125, 154, 749, 215, 785, 1556, 3214, 7985]
(string can have anything from 1 to 15 ID's in it and the reason it is a string and not a List is that, its being sent through a URL)
I need to populate a List called campusAndFaculty with it
I am using iReport 5.0.0
I've tried entering this in the campusAndFaculty default value Expression field
Array.asList(($P{campusAndFacultyString}.substring( 1, ($P{campusAndFacultyString}.length() -2 ))).split("\\s*,\\s*"))
But it does not populate the campusAndFaculty List
Any idea how I can populate the List campusAndFaculty using that String ("campusAndFacultyString")?
======================
UPDATE
I have these variables in iReport (5.0.0)
String campusAndFacultyFromBack = "[111, 125, 126, 4587, 1235, 1259]"
String noBrackets = $P{campusAndFacultyFromBack}.substring(1 ($P{campusAndFacultyFromBack}.length() -1 ))
List campusAndFacultyVar = java.util.Arrays.asList(($V{noBrackets}).split("\\s*,\\s*"))
When I print campusAndFacultyVar It returns "[111, 125, 126, 4587, 1235, 1259]"
but when I use it in a Filter I get the "Cannot evaluate the following expression: org_organisation.org_be_id in null"
This works for me:
String something = "[125, 154, 749, 215, 785, 1556, 3214, 7985]";
Arrays.asList((something.substring(1, (something.length() -1 ))).split("\\s*,\\s*"));
Which means you can do this in iReport:
java.util.Arrays.asList(($P{campusAndFacultyString}.substring(1, (something.length() -1 ))).split("\\s*,\\s*"));
Differences with your snippet:
It's Arrays, not Array
You should take 1, not 2 from the length
Fully qualified reference to Arrays class (which may or may not matter depending on how your iReport is configured)