Converting a single txt file to arff file automatically - weka

I have a single .txt file containing a lot of Arabic text, and I want to convert it to an .arff file automatically so I can use it in Weka to extract rules from it.
As my professor requested, I need 30 attributes, each of which should list all the words in the text file as its possible values. Each data line will hold one real sentence split into words separated by commas; if a sentence has fewer than 30 words, the remaining positions are filled with ?.
The arff file should look like the following:
@relation RelName
@attribute 'x1' {*will include all words in the text file*}
@attribute 'x2' {*will include all words in the text file*}
.
.
.
@attribute 'x30' {*will include all words in the text file*}
@data
Wordx,Wordy,Wordz,Wordq,Wordw,?,?,?,?,?...................,? // up to 30 words
.
.
.
.
and so on
So is there any way to generate an .arff file in this format from a single .txt file automatically? Thank you for your help.

You can use the arff package (version 0.9), which works with Python 2.x and 3.x.
For example:
import arff
data = [[1,2,3], [10, 20, 30]]
arff.dump('result.arff', data, relation="test", names=['one', 'two', 'three'])
This call writes result.arff with a relation named test and three attributes, 'one', 'two' and 'three'. The first column contains 1 and 10, the second 2 and 20, and the third 3 and 30.
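For the exact layout described in the question, a minimal standard-library sketch might look like the following. It writes the ARFF text by hand instead of relying on the arff package. The function name text_to_arff, the whitespace tokenization, and the decision to truncate sentences longer than 30 words are all assumptions, since the question does not specify them.

```python
def text_to_arff(lines, relation="RelName", n_attrs=30):
    """Build ARFF text with n_attrs nominal attributes from an iterable of sentences."""

    def quote(word):
        # Quote every nominal value so punctuation inside a word cannot break the syntax.
        return "'" + word.replace("\\", "\\\\").replace("'", "\\'") + "'"

    # One row per non-empty sentence; words beyond n_attrs are dropped (assumption).
    rows = [line.split()[:n_attrs] for line in lines if line.strip()]

    # Every attribute shares the same nominal domain: all words that appear in the rows.
    vocab = sorted({word for row in rows for word in row})
    nominal = "{" + ",".join(quote(word) for word in vocab) + "}"

    out = ["@relation " + relation, ""]
    for i in range(1, n_attrs + 1):
        out.append("@attribute x%d %s" % (i, nominal))
    out += ["", "@data"]
    for row in rows:
        # Pad short sentences with '?', the ARFF marker for a missing value.
        cells = [quote(word) for word in row] + ["?"] * (n_attrs - len(row))
        out.append(",".join(cells))
    return "\n".join(out) + "\n"
```

To run it on a real file, the sentences could be read with open('input.txt', encoding='utf-8') and the returned string written to an .arff file in the same encoding; both file names are placeholders.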


How to remove unsorted duplicate lines in Notepad++ using Regex

I have my file (link is in comment)
A Sample of Data
Yn2STc5A
MBI1irwA
Yn2STc5A
agCGRvWu
KZIcwFII
414PGEBK
MBI1irwA
KZIcwFII
lln5OKRi
Yn2STc5A
6gCsLHJA
Yn2STc5A
MBI1irwA
KZIcwFII
MBI1irwA
22LYWQsX
22LYWQsX
Yn2STc5A
KZIcwFII
agCGRvWu
lln5OKRi
This file has 528 lines; every line is a repetition of one of 13 lines, and each of those 13 lines is a code for a team link.
I have searched for and tried many regexes,
but only these two came a bit close to what I needed:
Find: ^(.{8}\n)([\S\s]+?\1) and this too ^(.*)([\S\s]+?\1)
Replace All: $2
But I have to press Replace All repeatedly (at least 47 times) to reach my goal...
My desired output, taken from the complete file, should be:
1:22LYWQsX
2:414PGEBK
3:6gCsLHJA
4:C6C8JOnf
5:KZIcwFII
6:MBI1irwA
7:NQid5EnY
8:P68A94uk
9:Yn2STc5A
10:agCGRvWu
11:jbsO5Pzk
12:lln5OKRi
13:vWSvMjaa
Thanks in advance
I recommend using the standard functions of Notepad++ (my version is 8.1.9, 64-bit) where they cover your needs.
First open the sample data file (*.txt) in Notepad++
From the main menu go to Edit > Line Operations > Remove Duplicate Lines
Go to Edit > Line Operations > Sort Lines Lexicographically Ascending
Format the result as desired for your needs.
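For comparison, the same two operations (remove duplicates, then sort) can be sketched in Python. This is only an illustration and not part of Notepad++; the helper name unique_sorted_lines is made up, and the sample is the first few lines from the question.

```python
def unique_sorted_lines(text):
    """Return the distinct non-empty lines of text in ascending code-point order."""
    return sorted({line.strip() for line in text.splitlines() if line.strip()})


sample = "Yn2STc5A\nMBI1irwA\nYn2STc5A\nagCGRvWu\nKZIcwFII\n"

# Number the surviving lines in the "1:code" style of the desired output.
for number, code in enumerate(unique_sorted_lines(sample), start=1):
    print("%d:%s" % (number, code))
```

Sorting by code point places digits before uppercase letters and uppercase before lowercase, which is the order shown in the desired output.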

How to split a textfile at a certain character creating multiple text files

I have a text file containing information about different groups, and the groups are separated by a '='. I want to split this file into multiple text files for later editing.
The text-file looks like this:
GROUP 001
LISA ----- 134.5
ROLF ----- 122.0
NICOLAS -- 103.4
=
GROUP 002
NICOLE --- 141.1
ADAM ----- 98.2
And I want two separate text files (preferably called 01.txt and 02.txt) with:
LISA ----- 134.5
ROLF ----- 122.0
NICOLAS -- 103.4
And the other file
NICOLE --- 141.1
ADAM ----- 98.2
I simply tried to read the file and split it at the '=' sign, but this gives me back a list containing all the other groupinfo as an entry.
groups = open('input.txt').read()
groups_divided = groups.split("=\n")
print groups_divided
You started well; the following is one way to finish the task:
groups = open('input.txt').read()
groups_divided = groups.split('=\n')
for group in groups_divided:
    temp = group.split('\n')
    with open(temp[0].split()[1] + '.txt', 'w') as out:
        out.write("\n".join(temp[1:]))
What you got after the groups.split('=\n') was a list of grouped lines in the form of strings. This program processes each string-group in that list, i.e. each physical group, and saves the processed version to a file.
It first splits the string-group by the newline character '\n' creating temp. It then extracts the group number for the output file name. Finally it saves all the lines in the group (stored in temp) except for the first one which is the GROUP 00# line. When saving, it joins all the saved lines with newline characters otherwise removed by split('\n').
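The code above names each file after the group number in the header (001.txt, 002.txt). If the 01.txt, 02.txt naming from the question is preferred, a small variant could number the groups by position instead. This is a sketch under that assumption; split_groups is a hypothetical helper that returns the file contents instead of writing them, so it can be checked before touching the disk.

```python
def split_groups(text):
    """Map an output file name such as '01.txt' to the member lines of each group."""
    files = {}
    index = 0
    for block in text.split("=\n"):
        if not block.strip():
            continue  # ignore empty chunks, e.g. after a trailing '='
        index += 1
        lines = block.strip().split("\n")
        # lines[0] is the 'GROUP 00#' header; everything after it is group data.
        files["%02d.txt" % index] = "\n".join(lines[1:]) + "\n"
    return files


sample = (
    "GROUP 001\nLISA ----- 134.5\nROLF ----- 122.0\nNICOLAS -- 103.4\n"
    "=\n"
    "GROUP 002\nNICOLE --- 141.1\nADAM ----- 98.2\n"
)
result = split_groups(sample)
```

Writing the files is then a loop over result.items() that opens each name in 'w' mode and writes the associated string.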

How do I split a file into list using white spaces and new line in Python?

I have a config.txt file having data in the following format
2
B 6.5 5001
F 2.2 5005
The first line of this file indicates the number of neighbors for Router A. Following this, there is one line dedicated to each neighbor. It starts with the neighbor ID, followed by the cost to reach this neighbor and finally the port number that this neighbor is using for listening.
I am trying to implement the Bellman-Ford routing algorithm, and for that I am passing this file as a command-line argument.
I want to turn it into a list in order to store it in a data structure for later use, but I am not sure whether the list should contain only 3 elements, i.e. the 3 lines, or whether I need to split each line by whitespace and store the resulting strings in a list as well.
I am able to use the split function to make a list of the 3 lines, but how do I make a list of the strings within a line?
Well, you can always do a double-split - first by line, then by whitespace:
data = """2
B 6.5 5001
F 2.2 5005"""
parsed = [line.split() for line in data.split("\n") if line]
# [['2'], ['B', '6.5', '5001'], ['F', '2.2', '5005']]
You can then iterate through it and convert your fields to whatever you need.
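As one possible way to do that conversion, the sketch below turns the parsed rows into a dictionary keyed by neighbor ID. The function name parse_config and the choice of a dict of (cost, port) tuples are assumptions made for illustration; a Bellman-Ford implementation may prefer a different structure.

```python
def parse_config(text):
    """Return {neighbor_id: (cost, port)} from the config format in the question."""
    rows = [line.split() for line in text.splitlines() if line.strip()]

    # The first row holds the number of neighbor lines that follow.
    count = int(rows[0][0])

    neighbors = {}
    for node_id, cost, port in rows[1:count + 1]:
        neighbors[node_id] = (float(cost), int(port))
    return neighbors


data = """2
B 6.5 5001
F 2.2 5005"""
neighbors = parse_config(data)
```

Reading the real file would only require replacing data with the contents of the path given on the command line, for example via sys.argv[1].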

Reading difficult text file in python: no spaces and split numbers in lines

I've been stuck on how to read the following text file (extract of the text file I would like to read) in Python. This file contains 374 numbers per line and 201 lines in total. However, the file has the following inconveniences:
1) Numbers are not separated by white spaces or any other delimiter.
2) The first number of each line never has a leading 0 (for instance, it is written .15 instead of 0.15).
Do you have any ideas or solutions for how to read this file in Python? I have never had to deal with such a difficult text file (I receive the text files from another person, and I cannot ask that person to save the data again).
Thank you in advance,
That looks like numbers in "scientific notation", with two-digit exponents. In your sample it looks like there's never a zero or other digit before the decimal point. Try reading them like this:
import re

with open("file.txt") as inp:
    text = inp.read()
numbers = re.findall(r"[.\d]*?D[+-]\d\d", text)
If you want to convert the number strings to floats:
values = [ float(n.replace("D", "e")) for n in numbers ]
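To see how the pattern behaves, here is a small self-contained sketch that runs it on an invented sample line (not taken from the real file). The optional leading minus sign in the pattern is an assumption added here; the pattern in the answer matches unsigned mantissas only. The name parse_d_numbers is hypothetical.

```python
import re

# Mantissa of digits and dots, the letter D, a signed two-digit exponent.
# The leading "-?" is an assumption for negative values, not part of the answer.
D_NUMBER = re.compile(r"-?[.\d]*?D[+-]\d\d")


def parse_d_numbers(text):
    """Extract D-exponent numbers from undelimited text and return them as floats."""
    return [float(token.replace("D", "e")) for token in D_NUMBER.findall(text)]


# Invented sample: three numbers written back to back with no delimiter.
sample = ".1500D+01.2345D-02-.9990D+00"
values = parse_d_numbers(sample)
```

A quick sanity check on the real data is that every line yields 374 values, as stated in the question.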

Notepad++ find and replace with word containing first word?

I have one file with words in two columns, where the two words in each row represent an interaction.
A0AV96 P25515
A6H8V1 A4D1U5
A8YXX4 A6NCZ6
B0ZBE0 A8BBF9
B3KWQ6 B3KRK5
B4E398 A4D1N9
B6ZGU1 B3KPR3
In a second file I have a little more information about the interactors, like this. Each row gives extended information about a word from the file above.
A0AV96 RBM47_HUMAN
P25515 E2F8_HUMAN
A0JLT2 MED19_HUMAN
A1ZBR5 AKTP2_DROME
A1ZBT5 MED8_DROME
A2A3L6 TTC24_HUMAN
What I need is some way to merge, or find and replace, the words from these two files so that I get from this
A0AV96 P25515
to this
A0AV96 RBM47_HUMAN P25515 E2F8_HUMAN
For now I have found a solution that needs no programming. There are at least two ways to do this. First, I separate each column into a single-column list file, like a strand of DNA. Then I use 'Find and Replace Multiple Items At Once Software' to change each line the way I need. After that I merge the two lists back into a two-column structure. The only problem is that you must properly align each column from the two-column file with the column from the file with the additional information.
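If a scripted route is acceptable, the merge can also be sketched as a dictionary lookup in Python, which avoids the manual alignment step. This is an illustration only: annotate_pairs is a hypothetical helper, and leaving IDs that are missing from the second file without a name is an assumption, since the question does not say what should happen to them.

```python
def annotate_pairs(pairs_text, info_text):
    """Append the name from info_text after every ID found in pairs_text."""
    # Build an ID -> name lookup from the two-column information file.
    names = {}
    for line in info_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            names[fields[0]] = fields[1]

    annotated = []
    for line in pairs_text.splitlines():
        ids = line.split()
        if not ids:
            continue
        parts = []
        for identifier in ids:
            parts.append(identifier)
            # IDs missing from the second file are kept without a name (assumption).
            if identifier in names:
                parts.append(names[identifier])
        annotated.append(" ".join(parts))
    return annotated


pairs = "A0AV96 P25515\nA6H8V1 A4D1U5\n"
info = "A0AV96 RBM47_HUMAN\nP25515 E2F8_HUMAN\nA0JLT2 MED19_HUMAN\n"
merged = annotate_pairs(pairs, info)
```

Because the lookup is keyed by ID, the two files do not need to be in the same order.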