Convert alphanumeric string to 16 digit GCID - openoffice-calc

I'm building our inventory feed for Amazon Seller Central in OpenOffice Calc but can't work out how to convert our in-house product IDs to GCID, the format Amazon requires.
The standard-product-id must have a specific number of characters according to type: GCID (16 alphanumeric characters), UPC (12 digit number), EAN (13 digit number) or GTIN (14 digit number).
Our product IDs vary by manufacturer, e.g.:
123456
AB123456
1234AB
Where the ID is numerical only I can format the cells with leading zeros, however this doesn't work if the cell contains letters.
My file has over 10,000 products so I'm wondering if there is a formula I can apply to all cells to instantly convert them to GCID?

It seems the question was asked under a misapprehension. Still, the example (123456, AB123456, 1234AB) represents three different IDs, and padding to a specified length is quite a common requirement (e.g. see String.PadLeft Method), so a suggestion for OpenOffice might be of use to someone one day.
Convention is to pad with 0s, but since some spreadsheets automatically strip these off the front of numbers (as in the first example) and databases tend to prefer fields of consistent format, I suggest separating the padding from the ID with a hyphen, both to aid identification of alphanumeric codes and to force text format:
=REPT(0;15-LEN(A1))&"-"&A1
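For anyone preparing the feed outside the spreadsheet, the same padding can be sketched in Python. This is only an illustration of the formula above; the function name pad_gcid and the default width of 16 are assumptions for the example, and nothing here checks whether the padded value is accepted as a real GCID.

```python
def pad_gcid(product_id, width=16):
    """Left-pad an ID with zeros plus a hyphen, like =REPT(0;15-LEN(A1))&"-"&A1."""
    product_id = str(product_id).strip()
    zeros = width - 1 - len(product_id)  # one character is reserved for the hyphen
    if zeros < 0:
        raise ValueError(f"{product_id!r} is too long to pad to {width} characters")
    return "0" * zeros + "-" + product_id


if __name__ == "__main__":
    # The three sample IDs from the question; each result is 16 characters long.
    for sample in ["123456", "AB123456", "1234AB"]:
        print(pad_gcid(sample))
```

Raising an error for IDs longer than 15 characters is a design choice for the sketch; the spreadsheet formula would simply fail on a negative repeat count.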

Related

Meaning of 3F7.1 in Fortran data format

I am trying to create an MDM file using HLM 7 Student version, but since I don't have access to SPSS I am trying to import my data using ASCII input. As part of this process I am required to input the data format Fortran style. Try as I might I have not been able to understand this step. Could someone familiar with Fortran (or even better HLM itself) explain to me how this works? Here is my current understanding
From the example EG3.DAT they give
(A4,1X,3F7.1)
I think
A4 signifies that the ID is 4 characters long.
1X means skip a space.
F.1 means that it should read 1 decimal place.
I am very confused about what 3F7 might mean.
EG3.DAT
2020 380.0 40.3 12.5
2040 502.0 83.1 18.6
2180 777.0 96.6 44.4
Below are examples from the help documents (images): rules for format statement, format statement example, EG1 data format, EG2 data format, EG3 data format.
One similar question is Explaining Fortran Write Format. Unfortunately it does not explicitly treat the F descriptor.
3F7.1 means 3 floating point numbers, each occupying 7 characters, each with one digit after the decimal point. Leading characters are blanks.
For reading, the .1 only matters when a field contains no explicit decimal point (it then sets where the decimal point is assumed to be). When the decimal point is present in the data, as here, the number is simply read from those 7 characters.
You guessed the meaning of A4 (string of four characters) and 1X (one blank) correctly.
In Fortran, so-called data edit descriptors (which format the input or output of data) may have repeat specifications.
In the format (A4,1X,3F7.1) the data edit descriptors are A4 and F7.1. Only F7.1 has a repeat specification (the number before the F). This simply means that the format is as though the descriptor appeared repeated: like F7.1, F7.1, F7.1. With a repeat specification of 1, or not given, there is just the single appearance.
The format of the question, then, is like
(A4,1X,F7.1,F7.1,F7.1)
This format is one that is covered by the rules provided in one of the images of the question. In particular, the aspect of repeat specification is given in rule 2 with the corresponding example of rule 3.
Further, in Fortran proper, a repeat count specifier may also be * as a special case: that's like an exceptionally large repeat count. *(F7.1) would be like F7.1, F7.1, F7.1, .... I see no indication that this is supported by HLM, but if it is needed a very large repeat count may be given instead.
In 1X the 1 isn't a repeat specification but an integral, and necessary, part of the position edit descriptor.
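To make the column arithmetic concrete, here is a minimal Python sketch that expands the repeat count and slices one record according to (A4,1X,3F7.1). It is not how HLM or a Fortran runtime is implemented. The function names are made up for the illustration, only the A, X and F descriptors are handled, and the sample line is padded by hand so that each numeric field is exactly 7 characters wide (fields without an explicit decimal point are not treated specially).

```python
import re


def expand_format(fmt):
    """Expand repeat counts: '(A4,1X,3F7.1)' -> ['A4', '1X', 'F7.1', 'F7.1', 'F7.1']."""
    items = []
    for token in fmt.strip().strip("()").split(","):
        token = token.strip().upper()
        data = re.fullmatch(r"(\d*)(A\d+|F\d+\.\d+)", token)
        if data:
            # An optional leading number is the repeat specification.
            items.extend([data.group(2)] * int(data.group(1) or 1))
        elif re.fullmatch(r"\d+X", token):
            # In nX the number is part of the descriptor, not a repeat count.
            items.append(token)
        else:
            raise ValueError(f"unsupported descriptor: {token}")
    return items


def read_record(line, fmt):
    """Slice one fixed-width line into values according to the format."""
    values, pos = [], 0
    for item in expand_format(fmt):
        if item.endswith("X"):
            pos += int(item[:-1])  # skip that many characters
            continue
        width = int(re.match(r"[AF](\d+)", item).group(1))
        field = line[pos:pos + width]
        pos += width
        values.append(field if item[0] == "A" else float(field))
    return values


if __name__ == "__main__":
    # 4-character ID, one blank, then three right-aligned 7-character fields.
    line = "2020" + " " + "  380.0" + "   40.3" + "   12.5"
    print(expand_format("(A4,1X,3F7.1)"))
    print(read_record(line, "(A4,1X,3F7.1)"))
```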
Procedure for making MDM file from excel for HLM:
-Make sure ALL the characters in ALL the columns line up
Select a column, then right click and select Format Cells
Then click on 'Custom', go to the 'Type' box and enter the number of 0s you need to line everything up
-Remove all the tabs from the document and replace them with spaces
Open the document in Word and use find and replace
-To save the document as .dat
First save it as .txt
Then open it in Notepad and save it as .dat
To enter the data format (FORTRAN-style):
The program reads the data file character by character, so the format has to match the column layout exactly for the whole data set to be read properly.
If something is off, even by a single space, your descriptive statistics will differ from what you get when you check them in another program.
Enclose the format in parentheses ()
Separate the entries with commas ,
-Need an ID column for all levels
The ID column needs to be sorted in order from smallest to largest
Use A# with # being the number of characters in the ID
Use 1X to skip the space between the ID and the next column
-Need to say how many characters are needed in each column
Use F# (# = the number of characters in that column)
There need to be enough characters to provide one 'gap' space between each column
There need to be enough characters to allow for the decimal point
As part of the F you need to specify the number of decimal places
You do this by adding a decimal point after the F number and then the number of decimal places you need -F#.#
You can put a number in front of the F to 'repeat' it. Not necessary though. -#F#.#
All in all, it should look something like this:
(A4,1X,F4.0,F5.1)
Helpful links:
https://books.google.de/books?id=VdmVtz6Wtc0C&pg=PA78&lpg=PA78&dq=data+format+fortran+style+hlm&source=bl&ots=kURJ6USN5e&sig=fdtsmTGSKFxn04wkxvRc2Vw1l5Q&hl=en&sa=X&ved=0ahUKEwi_yPurjYrYAhWIJuwKHa0uCuAQ6AEIPzAC#v=onepage&q&f=false
http://www.ssicentral.com/hlm/help6/error/Problems_creating_MDM_files.pdf
http://www.ssicentral.com/hlm/help7/faq/FAQ_Format_specifications_for_ASCII_data.pdf
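As an alternative to lining the columns up by hand, the rows can be written with fixed widths from a script. The sketch below is an assumption-laden illustration, not part of the HLM documentation: it writes a 4-character ID, one blank, a 4-character number with no decimals and a 5-character number with one decimal, which is the layout an A4, one blank, F4.0, F5.1 format would read. The function name and the sample values are invented for the example.

```python
def format_row(record_id, first, second):
    """Build one fixed-width line: 4-char ID, 1 blank, 4-char and 5-char numbers."""
    fields = [
        f"{record_id:<4}",   # A4: ID, left-aligned in 4 characters
        " ",                 # one blank between the ID and the data
        f"{first:4.0f}",     # 4 characters, no decimals
        f"{second:5.1f}",    # 5 characters, one decimal
    ]
    line = "".join(fields)
    # Python widens a field instead of truncating it, so catch any overflow.
    if len(line) != 4 + 1 + 4 + 5:
        raise ValueError(f"a value does not fit its column: {line!r}")
    return line


if __name__ == "__main__":
    rows = [("1001", 12, 3.5), ("1002", 7, 10.25)]  # hypothetical data
    for row in sorted(rows):  # keep IDs in ascending order
        print(format_row(*row))
```

Checking the total length catches the case described above, where a single extra character shifts every later column.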

How to use numbers present as text with different unit prefixes in calculations

I have data in a spreadsheet describing amount of data transferred over a mobile network: data in one column (over 300 rows) has three possible forms:
123,45KB
123,45MB
1,23GB
How can I transform or use this data in order to sum or do other calculations on numbers properly?
Assuming your data is in column A and there are always two characters as unit ("KB", "MB" or "GB") at the end, then the formula for transforming the data to numeric could be:
=--LEFT(A2;LEN(A2)-2)*10^(IF(RIGHT(A2;2)="KB";3;IF(RIGHT(A2;2)="MB";6;IF(RIGHT(A2;2)="GB";9))))
Put the formula in B2 and fill downwards as needed.
I have assumed the decimal delimiter in your locale is the comma. If not, please state what it is.
Also, since this site is in English, I have used English function names. You may need to translate them into your language version.
If the decimal delimiter in your locale is not the comma, you need to substitute the comma with your decimal delimiter to get a proper numeric value.
For example if the decimal delimiter is dot, then:
=SUBSTITUTE(LEFT(A2,LEN(A2)-2),",",".")*10^(IF(RIGHT(A2,2)="KB",3,IF(RIGHT(A2,2)="MB",6,IF(RIGHT(A2,2)="GB",9))))
An alternative formula:
=LEFT(A1,LEN(A1)-2)*10^(3*MATCH(RIGHT(LEFT(A1,LEN(A1)-1)),{"K","M","G"},0))
Uses the position of the next to last character in an array to determine the factor.
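The same conversion can be sketched outside the spreadsheet. The Python below mirrors the formulas above under the same assumptions: a comma as decimal delimiter, a two-character unit at the end, and decimal factors (10^3, 10^6, 10^9). The function name to_bytes is made up for the illustration.

```python
# Decimal factors, as in the spreadsheet formulas above.
FACTORS = {"KB": 10**3, "MB": 10**6, "GB": 10**9}


def to_bytes(text):
    """Convert strings like '123,45KB' or '1,23GB' to a number of bytes."""
    text = text.strip()
    number, unit = text[:-2], text[-2:].upper()
    if unit not in FACTORS:
        raise ValueError(f"unknown unit in {text!r}")
    # Swap the decimal comma for a dot so float() can parse the number.
    return float(number.replace(",", ".")) * FACTORS[unit]


if __name__ == "__main__":
    samples = ["123,45KB", "123,45MB", "1,23GB"]
    print(sum(to_bytes(s) for s in samples))
```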

How to improve a twitter sentiment analyzer?

I'm working on a C++ Twitter company sentiment analysis tool. The user inputs a company and the tool analyzes a number of tweets and returns a sentiment.
So far I did the following:
limit tweets to English and recent
make lowercase
remove RT, # symbol, @usernames and URLs
remove characters like &^%$(){}... etc
I then parse the tweet into words and check the words against two dictionaries of positive and negative words. I create a total sentiment for each tweet. Then I count the number of positive, neutral and negative tweets to come up with a final answer. No weights are used.
I am thinking of implementing the following two things:
remove stop words from tweets
remove special characters and emoticons from tweets (non-English Unicode basically)
However, even with this, most of the searches end up being very neutral. For example if I search "Apple" in 100 tweets I get say 30 positives, 10 negatives and 60 neutral.
Questions:
1. Is there any way to lower the neutrals?
2. What kind of positive and negative words should I add to represent my search criteria (companies)?
You say no weighting is used, but why not add it? Assign each +/- word a base weight of 1, then maybe apply some of the following conditions:
If they use words like "very", "extremely", etc., weight the following adjective more heavily (or, without weighting, just count both of them as a +/- word)
Rather than changing everything to lowercase, if capslock is used for a word, weight that word more heavily with a multiplier
Rate words like "fantastic" more heavily than words like "good"
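The three suggestions above could look roughly like this. The sketch is in Python for brevity even though the tool in question is C++, and every word list, weight and multiplier in it is a made-up placeholder rather than a tuned value.

```python
# Hypothetical lexicons: stronger words carry a larger base weight.
POSITIVE = {"good": 1.0, "fantastic": 2.0}
NEGATIVE = {"bad": 1.0, "terrible": 2.0}
# Intensifiers scale the next sentiment word.
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}
CAPS_MULTIPLIER = 1.5  # assumed boost for words written in capitals


def score_tweet(text):
    """Return a signed score: above 0 positive, below 0 negative, 0 neutral."""
    score, boost = 0.0, 1.0
    for raw in text.split():
        word = raw.strip(".,!?;:\"'()")
        lower = word.lower()
        if lower in INTENSIFIERS:
            boost = INTENSIFIERS[lower]  # remember it for the following word
            continue
        weight = POSITIVE.get(lower, 0.0) - NEGATIVE.get(lower, 0.0)
        if weight:
            if len(word) > 1 and word.isupper():
                weight *= CAPS_MULTIPLIER  # check case before lowercasing is lost
            score += weight * boost
        boost = 1.0  # an intensifier only affects the word right after it
    return score


if __name__ == "__main__":
    for tweet in ["very good phone", "FANTASTIC service!", "extremely bad update"]:
        print(tweet, score_tweet(tweet))
```

Because the score is no longer a plain count, a tweet with one strong word can outweigh one with a single mild word, which is the point of the weighting.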

Which format mask of phone number is safe for input for all countries?

I am making an input form for registering employees. I thought a phone number in any country would fit /\+(\d+)-\d\d\d-\d\d\d-\d\d\d\d/, where (\d+) is the country code, which I assume can vary in length, followed by exactly 10 digits. I need to create input fields and validation rules that make input as protected and unambiguous as possible, but I am also worried that there could be real numbers that don't fit this regex. Is there a safe international standard way of writing numbers?
There is no single format that you can apply to phone numbers from all countries.
However, \d* is one option you could proceed with, though it is not the best either.
You may check National conventions for writing telephone numbers
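One loose check that does not depend on national grouping is the general E.164 shape: a plus sign followed by at most 15 digits, the first of which is not 0. The Python sketch below strips common separators and tests only that shape. It does not know any country's numbering plan, and the function name is invented for the example.

```python
import re

# A plus sign, then 2 to 15 digits, the first one non-zero (general E.164 shape).
E164_SHAPE = re.compile(r"\+[1-9][0-9]{1,14}")


def normalize_international(text):
    """Strip spaces, dots, hyphens and parentheses, then check the E.164 shape."""
    compact = re.sub(r"[\s().\-]", "", text)
    if E164_SHAPE.fullmatch(compact) is None:
        raise ValueError(f"not in international format: {text!r}")
    return compact


if __name__ == "__main__":
    print(normalize_international("+1-800-432-4500"))
    print(normalize_international("+44 20 7123 4567"))
```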

telephone number regex

I am currently trying to validate UK telephone numbers:
The format I'm looking for is: 01234 567891 or 01234567891. So I need the number to have 5 digits, then a space, then 6 digits, or simply 11 digits.
The number must start with a 0.
I've had a look at a couple of examples:
/^[0-9]{10,11} - to check that the chars are all numbers
/^0[0-9]{9,10}$/ - to check that the first number is a 0
I'm just unsure how to put all these together and check if there is a space or not.
Could someone help me with this regex?
Thanks
Try this regex:
/^0\d{4}\s?\d{6}$/
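A quick way to try this pattern against the two formats from the question is a small Python check. The function name is made up, and re.fullmatch is used in place of the ^ and $ anchors.

```python
import re

# Same pattern as above; fullmatch anchors it at both ends.
UK_SIMPLE = re.compile(r"0\d{4}\s?\d{6}")


def is_simple_uk_number(text):
    """True for '01234 567891' or '01234567891' style input."""
    return UK_SIMPLE.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["01234 567891", "01234567891", "1234 567891"]:
        print(candidate, is_simple_uk_number(candidate))
```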
Many people try to do input validation and formatting in a single step.
It is better to separate these processes.
Match UK telephone number in any format
^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$
The above pattern allows the user to enter the number in any format they are comfortable with. Don't constrain the user into entering specific formats.
Extract NSN, prefix and extension
^(\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)(44)\)?[\s-]?)?\(?0?(?:\)[\s-]?)?([1-9]\d{1,4}\)?[\d\s-]+)((?:x|ext\.?|\#)\d{3,4})?$
Next, extract the various elements.
$2 will be '44' if international format was used, otherwise assume national format with leading '0'.
$4 contains the extension number if present.
$3 contains the NSN part.
Validation and formatting
Use further RegEx patterns to check the NSN has the right number of digits for this number range. Finally, store the number in E.164 format or display it in E.123 format.
There's a very detailed list of validation and display formatting RegEx patterns for UK numbers at:
http://www.aa-asterisk.org.uk/index.php/Regular_Expressions_for_Validating_and_Formatting_UK_Telephone_Numbers
It's too long to reproduce here and it would be difficult to maintain multiple copies of this document.
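The "validate first, format afterwards" idea can be sketched without the long patterns. The Python below is a deliberately simplified illustration, not a replacement for them: it ignores extensions, accepts only the +44, 0044 and leading-0 prefixes, and assumes a national significant number (NSN) of 9 or 10 digits. The function name is invented for the example.

```python
import re


def uk_to_e164(text):
    """Turn a UK number in national or international form into +44 form."""
    compact = re.sub(r"[\s()\-]", "", text.strip())
    if compact.startswith("+44"):
        nsn = compact[3:]
    elif compact.startswith("0044"):
        nsn = compact[4:]
    elif compact.startswith("0"):
        nsn = compact[1:]
    else:
        raise ValueError(f"unrecognised prefix: {text!r}")
    if nsn.startswith("0"):
        nsn = nsn[1:]  # handles the '+44 (0)20 ...' style
    # Assumption for this sketch: the NSN has 9 or 10 digits.
    if not re.fullmatch(r"[1-9][0-9]{8,9}", nsn):
        raise ValueError(f"unexpected number of digits: {text!r}")
    return "+44" + nsn


if __name__ == "__main__":
    for number in ["01234 567891", "020 7123 4567", "+44 (0)20 7123 4567"]:
        print(uk_to_e164(number))
```

Storing the compact +44 form keeps one canonical value per number, while any display format can be produced from it later.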
If you are looking for all UK numbers, I'd look for a bit more than just that format; some are written as 020 7123 4567, etc.
^\s*\(?(020[78]{1}\)?[ ]?[1-9]{1}[0-9]{2}[ ]?[0-9]{4})|(0[1-8]{1}[0-9]{3}\)?[ ]?[1-9]{1}[0-9]{2}[ ]?[0-9]{3})\s*$
/^[\d()+-]+$/
A simple telephone regex that allows +, ( ) and - anywhere, as well as digits.
I think ^0[\d]{4}\s?[\d]{5,6}$ will work for you. I have used [\d] instead of [0-9].
I find that RegExr is a useful online tool to check and try your regular expressions. It also has a nice library of examples to help point you in the right direction.
"You should just count the number of digits and check that it's 10."
Some UK numbers have only 9 digits, not 10 (not including the leading 0).
These include 40 of the 01 area codes (using "4+5" format), the 016977 area code (using "5+4" format), all 0500 numbers and some 0800 numbers.
There's a list at: http://www.aa-asterisk.org.uk/index.php/01_numbers
This US number pattern also accepts the following phone numbers:
800-432-4500, Opt: 9, Ext: 100316
800-432-4500, Opt: 9, Ext: X100316
800-432-4500, Option #3
(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4}),?(?:\s*(?:#|x\.?|opt(\.|:|\.:)?|option)\s*#?(\d+))?,?(?:\s*(?:#|x\.?|ext(\.|:|\.:)?|extension)\s*(\d+))?
(I used this answer from another topic as a starting point.)