Why does OCONV() with the 'MCT' code change the original string? - universe

I have a string with some Spanish characters, and when I use U2's OCONV() function with the 'MCT' code, it changes the Spanish character to something else. Does anyone know why?
STRING: T r L=16 `CITáN, MOR 32000'
TEST.MCT: 5: STR2 = OCONV(STR,'MCT')
:: S
TEST.MCT: 6: CRT STR2
:: S
Cit?9: Mor 32000

Note that I created the following program, and I do not see the issue.
CT BP SO
SO
0001 STR = "CIT":CHAR(225):"N, MOR 3200"
0002 STR2 = OCONV(STR, "MCT")
0003 PRINT STR
0004 PRINT STR2
0005 PRINT SEQ(STR2[4,1])
When I compile and run it, I get the following:
CITáN, MOR 3200
Citán, Mor 3200
225
>
Note that I tested on UniVerse 11.2.2 on Windows. Can you try the sample code I provided from the HS.SALES account, and let me know what it does?
If it still has an issue, please let us know the full UniVerse version, and the OS you are running this on.
Added info:
I also tested on UniVerse 11.1.1 on AIX 6.1 and it worked for me. If you are still having issues, I suggest you contact your UniVerse maintenance provider.

It is hard to read your output since it combines lines.
My run through RAID shows the correct information.
RAID BP SO
SO: 1: STR = "CIT":CHAR(225):"N, MOR 3200"
:: S
SO: 2: STR2 = OCONV(STR, "MCT")
:: S
SO: 3: PRINT STR
:: S
CITáN, MOR 3200
SO: 4: PRINT STR2
:: S
Citán, Mor 3200
SO: 5: PRINT SEQ(STR2[4,1])
:: S
225
Yet, I have LANG=en_US in my UNIX environment variables.
So there may be an issue with your environment depending on what LANG is set to; I suggest you contact your U2 maintenance provider.
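For reference, MCT converts text to initial capitals (first letter of each word upper case, the rest lower case). A rough Python analogue of what a locale-sensitive implementation does (illustrative only, not UniVerse code):
# \xe1 is CHAR(225), 'a' with acute accent in Latin-1 / Windows-1252;
# whether a case-conversion routine treats byte 225 as a letter
# depends on the locale/encoding in effect.
s = u"CIT\xe1N, MOR 3200"
print(s.title())  # prints: Citán, Mor 3200 (when 225 is recognized as a letter)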

How to remove everything but letters, numbers and ! ? . ; , # ' using regex in a Python pandas df?

I am trying to remove everything but letters, numbers and ! ? . ; , # ' from the text column of my pandas DataFrame.
I have already read some other questions on the topic, but still cannot make mine work.
Here is an example of what I am doing:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'text': ['hey+ guys! wuzup',
                            'hello p3ople!What\'s up?',
                            'hey, how- thing == do##n',
                            'my name is bond, james b0nd']})
Then we have the following table:
id text
1 hey+ guys! wuzup
2 hello p3ople!What\'s up?
3 hey, how- thing == do##n
4 my name is bond, james b0nd
Now, trying to remove everything but letters, numbers and ! ? . ; , # '
First try:
df.loc[:,'text'] = df['text'].str.replace(r"^(?!(([a-zA-z]|[\!\?\.\;\,\#\'\"]|\d))+)$",' ',regex=True)
output
id text
1 hey+ guys! wuzup
2 hello p3ople!What's up?
3 hey, how- thing == do##n
4 my name is bond, james b0nd
Second try
df.loc[:,'text'] = df['text'].str.replace(r"(?i)\b(?:(([a-zA-Z\!\?\.\;\,\#\'\"\:\d])))",' ',regex=True)
output
id text
1 ey+ uys uzup
2 ello 3ople hat p
3 ey ow- hing == o##
4 y ame s ond ames 0nd
Third try
df.loc[:,'text'] = df['text'].str.replace(r'(?i)(?<!\w)(?:[a-zA-Z\!\?\.\;\,\#\'\"\:\d])',' ',regex=True)
output
id text
1 ey+ uys! uzup
2 ello 3ople! hat' p?
3 ey, ow- hing == o##
4 y ame s ond, ames 0nd
Afterwards, I also tried the re.sub() function with the same regex patterns, but still did not get the expected result, which is as follows:
id text
1 hey guys! wuzup
2 hello p3ople!What's up?
3 hey, how- thing don
4 my name is bond, james b0nd
Can anyone help me with that?
Links that I have seen on the topic:
Is there a way to remove everything except characters, numbers and '-' from a string
How do check if a text column in my dataframe, contains a list of possible patterns, allowing mistyping?
removing newlines from messy strings in pandas dataframe cells?
https://stackabuse.com/using-regex-for-text-manipulation-in-python/
Is this what you are looking for?
df.text.str.replace(r"(?i)[^0-9a-z!?.;,#' -]", '', regex=True)
Out:
0 hey guys! wuzup
1 hello p3ople!What's up?
2 hey, how- thing don
3 my name is bond, james b0nd
Name: text, dtype: object
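The trick relative to the earlier attempts is the negated character class: instead of describing what to keep, [^...] matches every character that is not in the keep set, and each match is deleted. The same logic outside pandas, as a minimal re.sub sketch:
import re

unwanted = re.compile(r"(?i)[^0-9a-z!?.;,#' -]")    # anything NOT in the keep set
print(unwanted.sub('', "hey+ guys! wuzup"))         # -> hey guys! wuzup
print(unwanted.sub('', "hello p3ople!What's up?"))  # -> hello p3ople!What's up? (unchanged)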

spaCy to CoNLL format without using spaCy's sentence splitter

This post shows how to get the dependencies of a block of text in CoNLL format with spaCy's taggers. This is the solution posted:
import spacy

nlp_en = spacy.load('en')
doc = nlp_en(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - sent[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,          # there's a word.i attr that's the position in *doc*
            word,
            word.lemma_,
            word.tag_,      # fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_))     # relation
It outputs this block:
1 Bob bob NNP PERSON 2 nsubj
2 bought buy VBD 0 ROOT
3 the the DT 4 det
4 pizza pizza NN 2 dobj
5 to to IN 2 dative
6 Alice alice NNP PERSON 5 pobj
I would like to get the same output WITHOUT using doc.sents.
Indeed, I have my own sentence splitter. I would like to use it, and then give spaCy one sentence at a time to get POS, NER, and dependencies.
How can I get the POS, NER, and dependencies of one sentence in CoNLL format with spaCy, without having to use spaCy's sentence splitter?
A Doc in spaCy is iterable, and the documentation states that it iterates over Tokens:
| __iter__(...)
| Iterate over `Token` objects, from which the annotations can be
| easily accessed. This is the main way of accessing `Token` objects,
| which are the main way annotations are accessed from Python. If faster-
| than-Python speeds are required, you can instead access the annotations
| as a numpy array, or access the underlying C data directly from Cython.
|
| EXAMPLE:
| >>> for token in doc
Therefore I believe you would just have to make a Doc for each of your split sentences, and then do something like the following:
def printConll(split_sentence_text):
    doc = nlp(split_sentence_text)
    for i, word in enumerate(doc):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - doc[0].i + 1  # doc, not sent: the whole doc is one sentence here
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,          # there's a word.i attr that's the position in *doc*
            word,
            word.lemma_,
            word.tag_,      # fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_))     # relation
Of course, following the CoNLL format you would have to print a newline after each sentence.
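A hypothetical usage sketch, where raw_text and my_sentence_splitter are placeholders for your own input and splitter (not a real API), and nlp is the global used inside printConll:
import spacy

nlp = spacy.load('en')                            # same model name as in the question
for sentence in my_sentence_splitter(raw_text):   # my_sentence_splitter: your own code
    printConll(sentence)
    print()                                       # blank line between sentences, per CoNLL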
This post is about a user facing unexpected sentence breaks from spaCy's sentence boundary detection. One of the solutions proposed by the spaCy developers (as in that post) is to add the flexibility to define one's own sentence boundary detection rules. This problem is solved in conjunction with dependency parsing in spaCy, not before it. Therefore, I don't think what you're looking for is supported by spaCy at the moment, though it might be in the near future.
@ashu's answer is partly right: dependency parsing and sentence boundary detection are tightly coupled by design in spaCy. However, there is a simple sentencizer:
https://spacy.io/api/sentencizer
It seems the sentencizer just uses punctuation (not the perfect way), but if such a sentencizer exists, then you can create a custom one using your own rules, and it will affect sentence boundaries for sure.
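As a sketch of that idea (spaCy 2.x style, where a custom pipeline component may set is_sent_start before the parser runs; the semicolon rule below is just an illustration):
import spacy

def custom_boundaries(doc):
    # Illustrative rule: start a new sentence after every semicolon.
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load('en')
nlp.add_pipe(custom_boundaries, before='parser')  # spaCy 2.x add_pipe signature
doc = nlp(u'This is one sentence; this will be another.')
for sent in doc.sents:
    print(sent.text)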

Rearrange words in a given sentence in Python

I'm very new to Python.
I receive serial data on a COM port in a fixed format, as a string like this:
"21-12-2015 10:12:05 005 100 10.5 P"
The format is 'date time id count data data'.
Here I don't need the count or the first data field; instead I want to add one more field and send the result out through another COM port.
I want to rearrange this and give output as
21-12-2015 10:12:05
SI.NO: 1451
Result: 10.5 P
My attempt:
ip = '21-12-2015_10:12:05_005_100_10.5 P'
dt = ip[0]+ip[1]+ip[3]+..... #save date as dt
tm = ip[9]+ip[10]+ip[11]+.... etc
and at the end
Result = dt + tm +"\n" + " "+ "SI.NO"+.......
Please suggest a good way to do this in Python 2.7.11.
If you can mention some ideas, I will search for the code.
Thank you
You can split your string on whitespace into fields with split() and build a new string using Python's string formatting syntax:
ip = "21-12-2015 10:12:05 005 100 10.5 P"
fields = ip.split()
s = '{date} {time}\n SI.NO: {sino}\n Result: {x} {y}'.format(
    date=fields[0],
    time=fields[1],
    sino=1451,  # provide your own counter here
    x=fields[4],
    y=fields[5])
print s
21-12-2015 10:12:05
SI.NO: 1451
Result: 10.5 P
It isn't clear from your question whether your fields are separated by spaces or underscores. In the latter case, use fields = ip.split('_').
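For what it's worth, a quick check of both separators (Python 2, to match the question) shows why it matters; note how split('_') keeps '10.5 P' together, because that last space is not an underscore:
ip_spaces = "21-12-2015 10:12:05 005 100 10.5 P"
ip_under = "21-12-2015_10:12:05_005_100_10.5 P"
print ip_spaces.split()     # ['21-12-2015', '10:12:05', '005', '100', '10.5', 'P']
print ip_under.split('_')   # ['21-12-2015', '10:12:05', '005', '100', '10.5 P']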

R: need to replace invisible/accented characters with regex

I'm working with a file generated by several different machines that had different locale settings, so I ended up with a data frame column containing different writings of the same word:
CÓRDOBA
CÓRDOBA
CÒRDOBA
I'd like to convert all those to CORDOBA. I've tried doing
t<-gsub("Ó|Ó|Ã’|°|°|Ò","O",t,ignore.case = T) # t is the vector of names
Which works until it finds some "invisible" characters:
As you can see, I'm not able to see, in R, the additional character that lies between à and \ (if I copy-paste into MS Word, Word shows it as an empty rectangle). I've tried to dput the vector, but it shows exactly as on screen (without the "invisible" character).
I ran Encoding(t), and it returns unknown for all values.
My system configuration follows:
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Spanish_Colombia.1252 LC_CTYPE=Spanish_Colombia.1252 LC_MONETARY=Spanish_Colombia.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Colombia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] zoo_1.7-12 dplyr_0.4.2 data.table_1.9.4
loaded via a namespace (and not attached):
[1] R6_2.1.0 assertthat_0.1 magrittr_1.5 plyr_1.8.3 parallel_3.2.1 DBI_0.3.1 tools_3.2.1 reshape2_1.4.1 Rcpp_0.11.6 stringi_0.5-5
[11] grid_3.2.1 stringr_1.0.0 chron_2.3-47 lattice_0.20-31
I've used saveRDS to save a file with a data frame of actual and expected toy values, which can be read with readRDS from here. I'm not absolutely sure it will load with the same problems I have (depending on your locale), but I hope it does, so you can provide some help.
At the end, I'd like to convert all those special characters to unaccented ones (Ó to O, etc.), hopefully without having to manually enter each of the special ones into a regex (in other words, I'd like, if possible, some sort of gsub("[:weird:]","[:equivalentToWeird:]",t)). If that's not possible, I'd at least like to be able to find (and replace) those "invisible" characters.
Thanks,
############## EDIT TO ADD ###################
If I run the following code:
d<-readRDS("c:/path/to(downloaded/Dropbox/file/inv_char.Rdata")
stri_escape_unicode(d$actual)
This is what I get:
[1] "\\u00c3\\u201cN N\\u00c2\\u00b0 08 \\\"CACIQUE CALARC\\u00c3\\u0081\\\" - ARMENIA"
[2] "\\u00d3N N\\u00b0 08 \\\"CACIQUE CALARC\\u00c1\\\" - ARMENIA"
[3] "\\u00d3N N\\u00b0 08 \\\"CACIQUE CALARC\\u00c1\\\" - ARMENIA(ALTERNO)"
Normal output is:
> d$actual
[1] ÓN N° 08 "CACIQUE CALARCÃ" - ARMENIA
[2] ÓN N° 08 "CACIQUE CALARCÁ" - ARMENIA
[3] ÓN N° 08 "CACIQUE CALARCÁ" - ARMENIA(ALTERNO)
With the help of @hadley, who pointed me towards stringi, I ended up discovering the offending characters and replacing them. This was my initial attempt:
# needs library(stringi)
unweird <- function(t) {
  t <- stri_escape_unicode(t)
  t <- gsub("\\\\u00c3\\\\u0081|\\\\u00c1", "A", t)
  t <- gsub("\\\\u00c3\\\\u02c6|\\\\u00c3\\\\u2030|\\\\u00c9|\\\\u00c8", "E", t)
  t <- gsub("\\\\u00c3\\\\u0152|\\\\u00c3\\\\u008d|\\\\u00cd|\\\\u00cc", "I", t)
  t <- gsub("\\\\u00c3\\\\u2019|\\\\u00c3\\\\u201c|\\\\u00c2\\\\u00b0|\\\\u00d3|\\\\u00b0|\\\\u00d2|\\\\u00ba|\\\\u00c2\\\\u00ba", "O", t)
  t <- gsub("\\\\u00c3\\\\u2018|\\\\u00d1", "N", t)
  t <- gsub("\\u00a0|\\u00c2\\u00a0", "", t)
  t <- gsub("\\\\u00f3", "o", t)
  t <- stri_unescape_unicode(t)
}
which produced the expected result. I was a little curious about the other stringi functions, so I wondered whether its substitution function could be faster on my 3.3 million rows. I then tried stri_replace_all_regex like this:
stri_unweird <- function(t) {
  stri_unescape_unicode(stri_replace_all_regex(
    stri_escape_unicode(t),
    c("\\\\u00c3\\\\u0081|\\\\u00c1",
      "\\\\u00c3\\\\u02c6|\\\\u00c3\\\\u2030|\\\\u00c9|\\\\u00c8",
      "\\\\u00c3\\\\u0152|\\\\u00c3\\\\u008d|\\\\u00cd|\\\\u00cc",
      "\\\\u00c3\\\\u2019|\\\\u00c3\\\\u201c|\\\\u00c2\\\\u00b0|\\\\u00d3|\\\\u00b0|\\\\u00d2|\\\\u00ba|\\\\u00c2\\\\u00ba",
      "\\\\u00c3\\\\u2018|\\\\u00d1",
      "\\u00a0|\\u00c2\\u00a0",
      "\\\\u00f3"),
    c("A", "E", "I", "O", "N", "", "o"),
    vectorize_all = FALSE))
}
As a side note, I ran microbenchmark on both methods; these are the results:
g<-microbenchmark(unweird(t),stri_unweird(t),times = 100L)
summary(g)
             expr      min       lq     mean   median       uq      max neval cld
1      unweird(t) 423.0083 425.6400 431.9609 428.1031 432.6295 490.7658   100   b
2 stri_unweird(t) 118.5831 119.5057 121.2378 120.3550 121.8602 138.3111   100   a
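As an aside, the general "Ó to O" goal is often handled by Unicode normalization rather than a hand-written regex per character; in R, stringi offers stri_trans_general(t, "Latin-ASCII") for this. A minimal sketch of the underlying decompose-then-drop-marks idea in Python (shown only to illustrate the approach; it handles genuine accented characters, while the mojibake above would first need re-decoding):
# -*- coding: utf-8 -*-
import unicodedata

def strip_accents(s):
    # Decompose accented characters into base letter + combining mark (NFKD),
    # then drop the combining marks (Unicode category 'Mn').
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

print(strip_accents(u'CÓRDOBA'))  # -> CORDOBA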

Fortran95 -- Reading from a formatted text file

I need to read some values from a table. These are the first five rows, to give you some idea of what it should look like:
1 + 3 98 96 1
2 + 337 2799 2463 1
3 + 2801 3733 933 1
4 + 3734 5020 1287 1
5 + 5234 5530 297 1
My interest is in the first four columns of each row. I need to read these into arrays. I used the following code:
program ----
  implicit none
  integer, parameter :: totbases = 4639675, totgenes = 4395
  integer :: codtot, ks
  integer, dimension(totgenes) :: ngene, lend, rend
  character :: genome*4639675, sign*4
  open(1, file='e_coli_g_info')
  open(2, file='e_coli_g_str')
  do ks = 1, totgenes
     read(1,100) ngene(ks), sign(ks:ks), lend(ks), rend(ks)
  end do
100 format(1x,i4,8x,a1,2(5x,i7),22x)
  do ks = 1, 100
     write(*,*) ngene(ks), sign(ks:ks), lend(ks), rend(ks)
  end do
end program
The loop at the end of the program is to print the first hundred entries, to test that they are being read correctly. The problem is that I am getting this garbage (the fourth column is the problem):
1 + 3 757934891
2 + 337 724249387
3 + 2801 757803819
4 + 3734 757803819
5 + 5234 757935405
Clearly, the fourth column is way off. In fact, I cannot find these values anywhere in the file that I am reading from. I am using the gfortran compiler on Ubuntu 12.04. I would greatly appreciate it if somebody would point me in the right direction; it's likely that I'm missing something very obvious, because I'm new to Fortran.
Fortran formats are (traditionally; there's some newer stuff that I won't go into here) fixed format, that is, they are best suited to files with fixed columns: column N always starts at character position M, no ifs or buts. If your file format is more free-form, that is, the columns are separated by whitespace, it's often easier and more robust to read the data using list-directed formatting. That is, try writing your read loop as:
do ks = 1, totgenes
   read(1, *) ngene(ks), sign(ks:ks), lend(ks), rend(ks)
end do
Also, as general advice, when opening your own files, start from unit 10 and go upwards from there. Fortran implementations typically use some of the low-numbered units for standard input, output, and error (a common choice is units 1, 5, and 6), and you probably don't want to redirect those.
PS 2: I haven't tried your code, but it seems that you have a bounds overflow in the sign variable: it's declared with length 4, but you then assign to index ks, which goes all the way up to totgenes. Declaring it as character(len=totgenes) :: sign, for example, would make those substring assignments valid. As you're using gfortran on Ubuntu 12.04 (that is, gfortran 4.6), compile with the options "-O1 -Wall -g -fcheck=all" while developing; the runtime checks will catch errors like this.