Extracting values from columns in tab delim text file [closed] - regex

I have a tab-delimited text file with 3 columns like the following:
Name Description Ontology
dda1 box1_homodomain gn=box1 os=homo C:Cell;C:surface;F:binding;P:toy
dda2 sox2_plurinet gn=plu os=mouse C:Organ;F:transport;P:carrier;P:avi
dd13 klf4_iPSC gn=klf os=Bos C:Cell;F:tiad;P:abs;P:digestion
Now I would like to split the values (gn=xxx and os=xxx) in the Description column and the values in the Ontology column (C:xxx;F:xxx;P:xxx) into separate columns, like the following:
Name Description gn os C F P
dda1 box1_homodomain box1 homo Cell;surface binding toy
dda2 sox2_plurinet plu mouse Organ; transport carrier;avi
dd13 klf4_iPSC klf Bos Cell; tiad abs;digestion
I want to export this as a tab-delimited file or an Excel file. It would be really great if someone could guide me on how to achieve that in Perl. Please help me.
Thanks in advance

I came across this Perl question after five years of Java and was keen to try the exercise. The code below is what I remembered how to do. To finish it, extend the same code to the last column, 'Ontology', using a regexp and the same hash approach. There are many ways to do this in Perl, and this may be more code than necessary, but this is the way I remember.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my %output;
open(my $in, '<', 'stack.txt') or die "Cannot open stack.txt: $!";
while (my $line = <$in>) {
    # Skip the header row.
    next if $line =~ /^Name\b/;
    # Capture the name, the description, the two key=value pairs,
    # and whatever is left (the Ontology column).
    my ($name, $desc, $key1, $val1, $key2, $val2, $ontology) =
        $line =~ m/(\w+)\s+(\w+)\s+(\w+)=(\w+)\s+(\w+)=(\w+)\s+(.*)/
        or next;
    # Column 1
    push @{ $output{'Name'} }, $name;
    # Column 2
    push @{ $output{'Description'} }, $desc;
    # Column 3 (keyed by the captured name, e.g. gn)
    push @{ $output{$key1} }, $val1;
    # Column 4 (keyed by the captured name, e.g. os)
    push @{ $output{$key2} }, $val2;
    # Column ...
}
close($in);
print Dumper(\%output);
$VAR1 = {
'gn' => [
'box1',
'plu',
'klf'
],
'os' => [
'homo',
'mouse',
'Bos'
],
'Name' => [
'dda1',
'dda2',
'dd13'
],
'Description' => [
'box1_homodomain',
'sox2_plurinet',
'klf4_iPSC'
]
};
Note: the output is shown above. If you still can't figure out how to finish this program, let me know and I'll spend more time on it.
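As a rough illustration of the remaining step (splitting the Ontology column by its C/F/P prefixes), here is a minimal sketch in Python rather than Perl. The names `ROW`, `HEADER` and `split_row` are made up for this example, and it assumes every row has exactly the layout shown in the question (name, description, gn=..., os=..., ontology). It does not reproduce the trailing semicolons that appear in the question's expected output.

```python
import re
from collections import defaultdict

HEADER = ["Name", "Description", "gn", "os", "C", "F", "P"]

# Assumed row layout: name, description, gn=..., os=..., then the Ontology string.
ROW = re.compile(r"(\w+)\s+(\w+)\s+gn=(\w+)\s+os=(\w+)\s+(\S+)")

def split_row(line):
    """Split one data row into the seven output columns."""
    name, description, gn, os_value, ontology = ROW.match(line).groups()
    by_prefix = defaultdict(list)
    for item in ontology.split(";"):
        if ":" in item:  # skip empty pieces left by a trailing ';'
            prefix, value = item.split(":", 1)
            by_prefix[prefix].append(value)
    grouped = [";".join(by_prefix[p]) for p in ("C", "F", "P")]
    return [name, description, gn, os_value] + grouped

rows = [
    "dda1 box1_homodomain gn=box1 os=homo C:Cell;C:surface;F:binding;P:toy",
    "dda2 sox2_plurinet gn=plu os=mouse C:Organ;F:transport;P:carrier;P:avi",
    "dd13 klf4_iPSC gn=klf os=Bos C:Cell;F:tiad;P:abs;P:digestion",
]
table = [HEADER] + [split_row(r) for r in rows]

# Print as tab-delimited text; redirect to a file to save it.
for r in table:
    print("\t".join(r))
```

Writing the rows joined with tabs gives a file that spreadsheet programs can open directly.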

Related

Google script regex match without results [closed]

I have the following bit in a Google script that parses PDFs:
function extractPDFtext(text){
  const regexp = /[w,W,s,S]*(\d{3}).?(\d{3}).?(\d{3}).?(\d{3})?.?(\d{3})?[\w\W]*?(\d+.\d+)/gm;
  try{
    let array = [...text.match(regexp)];
    return array;
  }catch(e){
    let array = ["No items found"]
    return array;
  }
};
The existing regex only partially works (because the PDFs are not all alike), so I have to restrict the matching to the text between specific words, and when I try to do that I get no results. I would like to retrieve the digits that follow the Reference and Amount tags while ignoring any words and digits in between. This is where I'm having trouble: on regex101 I get the full match plus the correct capturing groups, but in the script I get no results.
This is an example regex based on a suggestion from another question of mine, but it has the same problem as my other attempts:
^Reference\b[^\d\n]*[\t ](\d{3})[\t ]*(\d{3})[\t ]*(\d{3})[\t ]*(\d{3})[\t ]*(\d{3})(?:\n(?!Amount\b)\S.*)*\nAmount\b[^\d\n]*[\t ](\d+(?:,\d+)?)\b
So I'm wondering whether the problem is with the regex or with the script, and how to solve it in either case.
Below is some dummy text for the variable text that the regex is applied to, bearing in mind that each "tag" can be followed by more words (for example: Reference of something // Amount of first payment:) and may or may not end with a colon.
Some dummy text that may have words in common like `reference` or `amount` throughout the document
Reference: 245 154 343 345 345
Entity: 34567
Amount: 11,11
Payment date: 14/07/2022
Some more text
Maybe you're trying to do too much with one command. Try breaking it up as I show below.
console.log(text);
// match() returns null when nothing matches, so test the result itself
let ref = text.match(/Reference.+/gm);
if (ref) {
  ref = ref[0].match(/\d.+/);
  if (ref) console.log(ref[0]);
}
ref = text.match(/Amount.+/);
if (ref) {
  ref = ref[0].match(/\d.+/);
  if (ref) console.log(ref[0]);
}
Execution log
8:55:50 AM Notice Execution started
8:55:50 AM Info Some dummy text that may have words in common like `reference` or `amount` throughout the document
Reference: 245 154 343 345 345
Entity: 34567
Amount: 11,11
Payment date: 14/07/2022
Some more text
8:55:50 AM Info 245 154 343 345 345
8:55:50 AM Info 11,11
8:55:50 AM Notice Execution completed
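The same two-step idea (first find the tagged line, then pull the digits out of it) can be sketched in Python as well. This is only an illustration under stated assumptions: the function name `extract_reference_and_amount` is made up here, the tags are assumed to start a line and to be capitalised exactly as in the dummy text, and the amount is assumed to use a comma as the decimal separator.

```python
import re

def extract_reference_and_amount(text):
    """Find the Reference and Amount lines, then extract their digits."""
    result = {}
    # Step 1: isolate the line that starts with the tag.
    ref_line = re.search(r"^Reference\b.*$", text, flags=re.MULTILINE)
    if ref_line:
        # Step 2: collect the three-digit groups on that line only.
        result["reference"] = re.findall(r"\d{3}", ref_line.group(0))
    amount_line = re.search(r"^Amount\b.*$", text, flags=re.MULTILINE)
    if amount_line:
        amount = re.search(r"\d+(?:,\d+)?", amount_line.group(0))
        if amount:
            result["amount"] = amount.group(0)
    return result

sample = (
    "Some dummy text that may have words in common like `reference` or `amount`\n"
    "Reference: 245 154 343 345 345\n"
    "Entity: 34567\n"
    "Amount: 11,11\n"
    "Payment date: 14/07/2022\n"
    "Some more text\n"
)
found = extract_reference_and_amount(sample)
print(found)
```

Because each pattern is anchored to the start of a line, the lowercase `reference` and `amount` in the running text are not picked up.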

How to use regex or "wildcard" to search text or substring? [closed]

How can I use the CONTAINS operator with a regex or wildcard to find, in an array of texts, the entries that match a given word?
I use select * from person where tasks contains "*service"; to match texts with the contains operator, but it does not find substrings.
Here is the record in the table:
[
{
"time": "125.3642ms",
"status": "OK",
"result": [
{
"age": 48,
"tasks": [
"car service",
"whasing floor",
"changes windows",
"repair doors",
"pool service",
"repair roofs"
],
"id": "person:services"
}
]
}
]

CLI method for vlookup like search [closed]

I have a huge CSV file, demo.csv (a few GB in size), which has 3 columns like the following:
$ cat demo.csv
call_start_time,called_no,calling_no
43284.85326,1111111111,2222222222
43284.83192,3333333333,1111111111
43284.83205,2222222222,1111111111
43284.81304,4444444444,3333333333
I am trying to find the rows in which the same pair of values appears in columns 2 and 3, in either order. For example, this should be the output for the data shown above:
call_start_time,called_no,calling_no
43284.85326,1111111111,2222222222
43284.83205,2222222222,1111111111
I tried to use csvkit:
csvsql --query "select called_no, calling_no, call_start_time, count(1) from file123 group by called_no,calling_no having count(1)>1" file123.csv > new.csv
With awk you can build an associative array a with records as values and keys k built from fields $2 and $3, sorted and joined with a pipe.
awk -F, 'NR==1; { k=($3<$2) ? $3"|"$2 : $2"|"$3; if (a[k]) { if (a[k]!="#") {print a[k];a[k]="#"} print} else a[k]=$0}' file
If the current record has a key that already exists, the stored record is printed (only the first time) and the current record is printed too.
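For readers who prefer to see the keying idea spelled out, here is a minimal Python sketch of the same approach. The function name `repeated_pairs` is made up for this example, and it assumes the file has a header row and that columns 2 and 3 hold the two numbers. Like the awk version, it holds one entry per distinct pair in memory.

```python
import csv
import io

def repeated_pairs(lines):
    """Return the header plus every row whose (col 2, col 3) pair,
    taken in either order, occurs more than once."""
    reader = csv.reader(lines)
    header = next(reader)
    first_seen = {}  # key -> first row with that key, or None once emitted
    out = [header]
    for row in reader:
        key = tuple(sorted(row[1:3]))  # order-independent key
        if key in first_seen:
            if first_seen[key] is not None:
                out.append(first_seen[key])  # emit the stored row once
                first_seen[key] = None
            out.append(row)
        else:
            first_seen[key] = row
    return out

sample = """call_start_time,called_no,calling_no
43284.85326,1111111111,2222222222
43284.83192,3333333333,1111111111
43284.83205,2222222222,1111111111
43284.81304,4444444444,3333333333
"""
result = repeated_pairs(io.StringIO(sample))
for row in result:
    print(",".join(row))
```

For a real file, pass an open file object instead of the `io.StringIO` wrapper used here for the sample.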
$ awk -F, '
NR==1 { print; next }
{ key = ($2>$3 ? $2 FS $3 : $3 FS $2) }
seen[key]++ { print orig[key] $0; delete orig[key]; next }
{ orig[key] = $0 ORS }
' file
call_start_time,called_no,calling_no
43284.85326,1111111111,2222222222
43284.83205,2222222222,1111111111

Extracting each unique data from a string [closed]

I have been trying to set up a regex extraction process for the following, to no avail.
I have a set of date values in the formats described below. I need to be able to extract these as unique individual dates.
If there is a single value, it is in the standard simple format mm/dd/yyyy. That one is easy.
If there is more than one date value, then it can be in a format as follows:
Feb 5, 12, 19, 26, Mar 4, 11 2016
I need to turn these into 02/05/2016, 02/12/2016, etc.
Eventually I will be inserting these dates into a database.
Am I going about this in the wrong way? Thanks for advice.
This will be complete spaghetti if you try to do it with one regex:
You will have to hardcode the month names and their corresponding numbers somewhere.
The year does not follow each month's list of days; it comes once, after all the month names and days that belong to that year.
However, with a little help from a normal programming language you can still get a short, regex-centric solution. Here is a small Ruby snippet to show the general idea:
# this is the input
dates = "Feb 5, 12, 19, 26, Mar 4, 11 2016, Jul 5, 7, 19, 26, May 4, 11 2017"
# a hash with month name => month number
MONTHS = {
'Jan' => '01',
'Feb' => '02',
'Mar' => '03',
'Apr' => '04',
'May' => '05',
'Jun' => '06',
'Jul' => '07',
'Aug' => '08',
'Sep' => '09',
'Oct' => '10',
'Nov' => '11',
'Dec' => '12',
}
# match and extract three things:
# month - the first found month name (three letters)
# days - list of days separated by commas and spaces for this month
# for example 5, 12, 19, 26,
# year - the first found year (four digits)
# ,? is because we don't have , after the last day of the year
while dates =~ /(\w{3}) ((?:\d\d?,? )+).*?(\d{4})/
  month, days, year = $1, $2, $3
  # to each day collate a date in the wanted format
  # MONTHS[month] gets the month number from the hash above
  # sprintf simply makes sure that one digit days will have a leading 0
  dates_this_month = days.split(/,? /).map do |day|
    "#{MONTHS[month]}/#{sprintf('%02d', day)}/#{year}"
  end.join ', '
  # substitute the dates for this month with the new format
  dates.sub! "#{month} #{days}", "#{dates_this_month}, "
end
# remove leftover years
dates.gsub!(/, \d{4}/, '')
Now dates is in the desired format.
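A different way to reach the same result, sketched here in Python, is to tokenize the string and carry the current month along until a year appears. This is not a translation of the Ruby code above but a swapped-in technique; the function name `expand_dates` is made up for this example, and it assumes only the multi-date format from the question (three-letter English month names, one- or two-digit days, four-digit years).

```python
import re

MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
MONTHS = {name: f"{number:02d}" for number, name in enumerate(MONTH_NAMES, start=1)}

def expand_dates(text):
    """Expand 'Feb 5, 12, Mar 4, 11 2016' into a list of mm/dd/yyyy strings."""
    result = []
    pending = []  # (month, day) pairs still waiting for their year
    month = None
    # Tokens are month names, four-digit years, or one/two-digit days.
    for token in re.findall(r"[A-Za-z]{3}|\d{4}|\d{1,2}", text):
        if token in MONTHS:
            month = MONTHS[token]
        elif len(token) == 4:
            # A year closes every month/day pair collected so far.
            result.extend(f"{m}/{d}/{token}" for m, d in pending)
            pending = []
        elif token.isdigit():
            pending.append((month, f"{int(token):02d}"))
    return result

dates = expand_dates("Feb 5, 12, 19, 26, Mar 4, 11 2016")
print(", ".join(dates))
```

The returned list can be inserted into a database one element at a time, without re-parsing a joined string.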
Assuming that there are no deviations or anomalies in the data that you're matching, the following regex can be applied with case-sensitivity set and gives you access to the information you need. With regexes it's important to "know your data", because this can greatly alter how a regex is constructed. The balance between specificity and clarity matters, since regexes can easily become unwieldy and cryptic.
Save the months as: ([A-Z][a-z][a-z]) // this can be your $1 variable (useful later)
Save the day values as: \s*(?:([0-9]?[0-9]),?\s)* // $2 gives access to these values; the comma is optional because the last day before the year has none
Save the year values as: ([0-9]{4}) // $3 gives access to these values. NOTE: this only works for #### formatted years by design, although it can be altered to handle different formats; I'm just going off the example you provided
Stringing it all together you get: (?:([A-Z][a-z][a-z])\s*(?:([0-9]?[0-9]),?\s)*)+([0-9]{4})
Note that in most regex engines a repeated capturing group keeps only its last match, so to collect every month and day you need to apply the month and day parts repeatedly (for example with a global match) rather than rely on a single pass. You can then construct objects with these values so that you don't end up with a bunch of chaotic data. Let me know if I addressed your problem properly. If there's something that I missed or some additional functionality that you forgot to mention, I will be happy to assist.
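To see what the capture groups actually hold, here is a small Python check. It uses the combined pattern with the comma after each day treated as optional (an assumption made here so that the final day, which has no trailing comma, is matched). The names `combined`, `chunk` and `by_month` are made up for this sketch.

```python
import re

text = "Feb 5, 12, 19, 26, Mar 4, 11 2016"

# Combined pattern: months and days sit inside repeated groups.
combined = re.compile(
    r"(?:([A-Z][a-z][a-z])\s*(?:([0-9]?[0-9]),?\s)*)+([0-9]{4})"
)
match = combined.search(text)
# Python's re keeps only the LAST value of a repeated group.
last_only = match.groups()
year = match.group(3)

# Two-step alternative: find each month chunk, then the days inside it.
chunk = re.compile(r"([A-Z][a-z][a-z])\s+((?:[0-9]?[0-9],?\s)+)")
by_month = [
    (month, re.findall(r"[0-9]+", days))
    for month, days in chunk.findall(text)
]

print(last_only)
print(year, by_month)
```

The single-pass match is still useful for validating the whole string and reading the year, while the two-step pass recovers every month and day.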

Best way to store hair colors, eyes colors with django [closed]

I have a small problem: I'd like to know the best way to store eye or hair colors with Django.
This is what I did before asking myself what the best way is :)
HAIR_COLORS = (
('1', 'braun'),
('2', 'blond'),
('3', 'red'))
hair_color = models.IntegerField(max_length=1, choices=HAIR_COLORS)
Is it a good choice? Do I have to replace the IntegerField() with a CharField() with choices like:
HAIR_COLORS = (
('braun', 'braun'),
('blond', 'blond'),
('red', 'red'))
In fact I'm not satisfied. What is your favorite way to implement this?
Thank you.
Best way is:
BROWN = 10
BLOND = 20
RED = 30
HAIR_COLORS = (
(BROWN, 'braun'),
(BLOND, 'blond'),
(RED, 'red'))
hair_color = models.IntegerField(choices=HAIR_COLORS)
The reasons are:
Spacing the numbers out like this means you can always insert an additional color at any position in the choices.
Use constants rather than the raw numbers, because checks and updates in the code are far clearer and more explicit:
if obj.hair_color == BROWN instead of if obj.hair_color == 10
or
obj.hair_color = BROWN compared to obj.hair_color = 10
A better naming convention is also recommended, for example a prefix for the constants, like EYES_BROWN, and a suffix for the choices, like HAIR_COLORS_CHOICES. These statements should live outside the model so they are easier to import in other parts of the project.
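The constants-plus-choices pattern can be tried out without a Django project. The sketch below uses only the standard library's `enum.IntEnum`; the names `HairColor` and `HAIR_COLOR_CHOICES` are made up for this example, and the resulting tuple has the same shape as the `choices` argument shown above. Recent Django versions also ship `models.IntegerChoices`, which packages a similar idea.

```python
from enum import IntEnum

class HairColor(IntEnum):
    # Gaps of 10 leave room to insert new colors between existing ones.
    BROWN = 10
    BLOND = 20
    RED = 30

# Same shape as a Django `choices` tuple: (stored value, label).
HAIR_COLOR_CHOICES = tuple(
    (color.value, color.name.lower()) for color in HairColor
)

stored_value = 10  # what an integer column would hold
is_brown = stored_value == HairColor.BROWN  # named comparison, no magic number

print(HAIR_COLOR_CHOICES)
print(is_brown)
```

Because `IntEnum` members compare equal to plain integers, code can test a stored value against a name instead of a bare number.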
For those types of definitions, I create a global_defs.py file.
global_defs.py:
HAIR_COLOR_BROWN = 1
HAIR_COLOR_BLACK = 2
...
In models:
import global_defs as defs
...
class person(models.Model):
...
color = models.IntegerField(default=defs.HAIR_COLOR_BROWN)
In the end you refer to those values by name throughout your code, so it reads nicely. However, in the database you cannot directly see what each value means.