How does Stack Overflow generate its SEO-friendly URLs? - regex

What is a good complete regular expression or some other process that would take the title:
How do you change a title to be part of the URL like Stack Overflow?
and turn it into
how-do-you-change-a-title-to-be-part-of-the-url-like-stack-overflow
that is used in the SEO-friendly URLs on Stack Overflow?
The development environment I am using is Ruby on Rails, but if there are some other platform-specific solutions (.NET, PHP, Django), I would love to see those too.
I am sure I (or another reader) will come across the same problem on a different platform down the line.
I am using custom routes, and I mainly want to know how to alter the string so that all special characters are removed, it's all lowercase, and all whitespace is replaced.

Here's how we do it. Note that there are probably more edge conditions than you realize at first glance.
This is the second version, unrolled for 5x more performance (and yes, I benchmarked it). I figured I'd optimize it because this function can be called hundreds of times per page.
/// <summary>
/// Produces optional, URL-friendly version of a title, "like-this-one".
/// hand-tuned for speed, reflects performance refactoring contributed
/// by John Gietzen (user otac0n)
/// </summary>
public static string URLFriendly(string title)
{
if (title == null) return "";
const int maxlen = 80;
int len = title.Length;
bool prevdash = false;
var sb = new StringBuilder(len);
char c;
for (int i = 0; i < len; i++)
{
c = title[i];
if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
{
sb.Append(c);
prevdash = false;
}
else if (c >= 'A' && c <= 'Z')
{
// tricky way to convert to lowercase
sb.Append((char)(c | 32));
prevdash = false;
}
else if (c == ' ' || c == ',' || c == '.' || c == '/' ||
c == '\\' || c == '-' || c == '_' || c == '=')
{
if (!prevdash && sb.Length > 0)
{
sb.Append('-');
prevdash = true;
}
}
else if ((int)c >= 128)
{
int prevlen = sb.Length;
sb.Append(RemapInternationalCharToAscii(c));
if (prevlen != sb.Length) prevdash = false;
}
if (i == maxlen) break;
}
if (prevdash)
return sb.ToString().Substring(0, sb.Length - 1);
else
return sb.ToString();
}
To see the previous version of the code this replaced (but is functionally equivalent to, and 5x faster), view revision history of this post (click the date link).
Also, the RemapInternationalCharToAscii method source code can be found here.
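For readers on other platforms, the single-pass loop above can be sketched in Python. This is a simplified illustration rather than the exact production logic: the name url_friendly is made up here, and non-ASCII characters are dropped instead of being remapped.

```python
def url_friendly(title, maxlen=80):
    """Lowercase ASCII letters/digits, collapse separator runs into one hyphen."""
    if not title:
        return ""
    out = []
    prev_dash = False
    for i, ch in enumerate(title):
        if ch.isascii() and ch.isalnum():
            out.append(ch.lower())
            prev_dash = False
        elif ch in " ,./\\-_=":
            # Emit at most one hyphen per run, and never as the first character.
            if not prev_dash and out:
                out.append("-")
                prev_dash = True
        # Assumption of this sketch: every other character is dropped.
        if i == maxlen:
            break
    # Remove a trailing hyphen left by a final separator.
    return "".join(out).rstrip("-")


print(url_friendly("How do you change a title to be part of the URL like Stack Overflow?"))
```

As in the original, the length check compares the input index, not the output length, so the result can be shorter than maxlen when characters are dropped.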

Here is my version of Jeff's code. I've made the following changes:
The hyphens were appended in such a way that one could be added, and then need removing as it was the last character in the string. That is, we never want “my-slug-”. This means an extra string allocation to remove it on this edge case. I’ve worked around this by delay-hyphening. If you compare my code to Jeff’s the logic for this is easy to follow.
His approach is purely lookup based and missed a lot of characters I found in examples while researching on Stack Overflow. To counter this, I first perform a normalisation pass (AKA collation mentioned in Meta Stack Overflow question Non US-ASCII characters dropped from full (profile) URL), and then ignore any characters outside the acceptable ranges. This works most of the time...
... For when it doesn’t I’ve also had to add a lookup table. As mentioned above, some characters don’t map to a low ASCII value when normalised. Rather than drop these I’ve got a manual list of exceptions that is doubtless full of holes, but it is better than nothing. The normalisation code was inspired by Jon Hanna’s great post in Stack Overflow question How can I remove accents on a string?.
The case conversion is now also optional.
public static class Slug
{
public static string Create(bool toLower, params string[] values)
{
return Create(toLower, String.Join("-", values));
}
/// <summary>
/// Creates a slug.
/// References:
/// http://www.unicode.org/reports/tr15/tr15-34.html
/// https://meta.stackexchange.com/questions/7435/non-us-ascii-characters-dropped-from-full-profile-url/7696#7696
/// https://stackoverflow.com/questions/25259/how-do-you-include-a-webpage-title-as-part-of-a-webpage-url/25486#25486
/// https://stackoverflow.com/questions/3769457/how-can-i-remove-accents-on-a-string
/// </summary>
/// <param name="toLower"></param>
/// <param name="value"></param>
/// <returns></returns>
public static string Create(bool toLower, string value)
{
if (value == null)
return "";
var normalised = value.Normalize(NormalizationForm.FormKD);
const int maxlen = 80;
int len = normalised.Length;
bool prevDash = false;
var sb = new StringBuilder(len);
char c;
for (int i = 0; i < len; i++)
{
c = normalised[i];
if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
{
if (prevDash)
{
sb.Append('-');
prevDash = false;
}
sb.Append(c);
}
else if (c >= 'A' && c <= 'Z')
{
if (prevDash)
{
sb.Append('-');
prevDash = false;
}
// Tricky way to convert to lowercase
if (toLower)
sb.Append((char)(c | 32));
else
sb.Append(c);
}
else if (c == ' ' || c == ',' || c == '.' || c == '/' || c == '\\' || c == '-' || c == '_' || c == '=')
{
if (!prevDash && sb.Length > 0)
{
prevDash = true;
}
}
else
{
string swap = ConvertEdgeCases(c, toLower);
if (swap != null)
{
if (prevDash)
{
sb.Append('-');
prevDash = false;
}
sb.Append(swap);
}
}
if (sb.Length == maxlen)
break;
}
return sb.ToString();
}
static string ConvertEdgeCases(char c, bool toLower)
{
string swap = null;
switch (c)
{
case 'ı':
swap = "i";
break;
case 'ł':
swap = "l";
break;
case 'Ł':
swap = toLower ? "l" : "L";
break;
case 'đ':
swap = "d";
break;
case 'ß':
swap = "ss";
break;
case 'ø':
swap = "o";
break;
case 'Þ':
swap = "th";
break;
}
return swap;
}
}
For more details, the unit tests, and an explanation of why Facebook's URL scheme is a little smarter than Stack Overflow's, I've got an expanded version of this on my blog.
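The normalise-then-lookup idea can be sketched in Python with the standard unicodedata module. The function name to_ascii and the EDGE_CASES table are illustrative only; the table copies the exceptions listed in ConvertEdgeCases and is just as incomplete.

```python
import unicodedata

# Characters that NFKD does not decompose into an ASCII base letter.
EDGE_CASES = {"ı": "i", "ł": "l", "Ł": "L", "đ": "d", "ß": "ss", "ø": "o", "Þ": "th"}


def to_ascii(text):
    out = []
    # NFKD splits accented letters into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFKD", text):
        if ch in EDGE_CASES:
            out.append(EDGE_CASES[ch])
        elif ord(ch) < 128:
            out.append(ch)
        # Combining marks and unmapped characters are dropped.
    return "".join(out)


print(to_ascii("Crème brûlée"))
```

The result would still need the hyphenation and case handling from the Slug class; this only shows the character-mapping step.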

You will want to set up a custom route to point the URL to the controller that will handle it. Since you are using Ruby on Rails, here is an introduction to using their routing engine.
In Ruby, you will need a regular expression, as you already know. Here is one to use:
def permalink_for(str)
str.gsub(/[^\w\/]|[!\(\)\.]+/, ' ').strip.downcase.gsub(/\ +/, '-')
end

You can also use this JavaScript function for in-form generation of the slugs (this one is based on/copied from Django):
function makeSlug(urlString, filter) {
    // Changes, e.g., "Petty theft" to "petty-theft".
    // When filter is truthy, remove all these words from the string before URLifying
    var removelist = [];
    if (filter) {
        removelist = ["a", "an", "as", "at", "before", "but", "by", "for", "from",
            "is", "in", "into", "like", "of", "off", "on", "onto", "per",
            "since", "than", "the", "this", "that", "to", "up", "via", "het", "de", "een", "en",
            "with"];
    }
    var s = urlString;
    if (removelist.length > 0) {
        // Build the stop-word pattern only when there is something to remove
        var r = new RegExp('\\b(' + removelist.join('|') + ')\\b', 'gi');
        s = s.replace(r, '');
    }
    s = s.replace(/[^-\w\s]/g, ''); // Remove unneeded characters
    s = s.replace(/^\s+|\s+$/g, ''); // Trim leading/trailing spaces
    s = s.replace(/[-\s]+/g, '-'); // Convert spaces to hyphens
    s = s.toLowerCase(); // Convert to lowercase
    return s;
}
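The same stop-word filtering can be sketched in Python. The short STOP_WORDS list and the name make_slug are placeholders for illustration, not Django's actual list or API.

```python
import re

# Hypothetical, deliberately short stop-word list.
STOP_WORDS = ["a", "an", "the", "of", "to", "in", "on", "for"]


def make_slug(text, stop_words=STOP_WORDS):
    if stop_words:
        # Whole-word, case-insensitive removal of the stop words.
        pattern = r"\b(?:" + "|".join(re.escape(w) for w in stop_words) + r")\b"
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    text = re.sub(r"[^-\w\s]", "", text)  # drop punctuation
    text = text.strip()                   # trim leading/trailing whitespace
    text = re.sub(r"[-\s]+", "-", text)   # whitespace/hyphen runs -> one hyphen
    return text.lower()


print(make_slug("The Art of Computer Programming"))
```

Removing stop words before collapsing whitespace matters: the gaps they leave behind are absorbed by the later whitespace-to-hyphen step.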

For good measure, here's the PHP function in WordPress that does it... I'd think that WordPress is one of the more popular platforms that uses fancy links.
function sanitize_title_with_dashes($title) {
$title = strip_tags($title);
// Preserve escaped octets.
$title = preg_replace('|%([a-fA-F0-9][a-fA-F0-9])|', '---$1---', $title);
// Remove percent signs that are not part of an octet.
$title = str_replace('%', '', $title);
// Restore octets.
$title = preg_replace('|---([a-fA-F0-9][a-fA-F0-9])---|', '%$1', $title);
$title = remove_accents($title);
if (seems_utf8($title)) {
if (function_exists('mb_strtolower')) {
$title = mb_strtolower($title, 'UTF-8');
}
$title = utf8_uri_encode($title, 200);
}
$title = strtolower($title);
$title = preg_replace('/&.+?;/', '', $title); // kill entities
$title = preg_replace('/[^%a-z0-9 _-]/', '', $title);
$title = preg_replace('/\s+/', '-', $title);
$title = preg_replace('|-+|', '-', $title);
$title = trim($title, '-');
return $title;
}
This function as well as some of the supporting functions can be found in wp-includes/formatting.php.

If you are using Rails edge, you can rely on Inflector.parameterize. Here's the example from the documentation:
class Person
def to_param
"#{id}-#{name.parameterize}"
end
end
@person = Person.find(1)
# => #<Person id: 1, name: "Donald E. Knuth">
<%= link_to(@person.name, person_path(@person)) %>
# => <a href="/person/1-donald-e-knuth">Donald E. Knuth</a>
Also, if you need to handle more exotic characters such as accents (éphémère) in previous versions of Rails, you can use a mixture of PermalinkFu and DiacriticsFu:
DiacriticsFu::escape("éphémère")
=> "ephemere"
DiacriticsFu::escape("räksmörgås")
=> "raksmorgas"

I am not familiar with Ruby on Rails, but the following is (untested) PHP code. You can probably translate this very quickly to Ruby on Rails if you find it useful.
$sURL = "This is a title to convert to URL-format. It has 1 number in it!";
// To lower-case
$sURL = strtolower($sURL);
// Replace all non-word characters with spaces
$sURL = preg_replace("/\W+/", " ", $sURL);
// Remove trailing spaces (so we won't end with a separator)
$sURL = trim($sURL);
// Replace spaces with separators (hyphens)
$sURL = str_replace(" ", "-", $sURL);
echo $sURL;
// outputs: this-is-a-title-to-convert-to-url-format-it-has-1-number-in-it
I hope this helps.

I don't know much about Ruby or Rails, but in Perl, this is what I would do:
my $title = "How do you change a title to be part of the url like Stackoverflow?";
my $url = lc $title; # Change to lower case and copy to URL.
$url =~ s/^\s+//g; # Remove leading spaces.
$url =~ s/\s+$//g; # Remove trailing spaces.
$url =~ s/\s+/\-/g; # Change one or more spaces to single hyphen.
$url =~ s/[^\w\-]//g; # Remove any non-word characters.
print "$title\n$url\n";
I just did a quick test and it seems to work. Hopefully this is relatively easy to translate to Ruby.

T-SQL implementation, adapted from dbo.UrlEncode:
CREATE FUNCTION dbo.Slug(@string varchar(1024))
RETURNS varchar(3072)
AS
BEGIN
DECLARE @count int, @c char(1), @i int, @slug varchar(3072)
SET @string = replace(lower(ltrim(rtrim(@string))),' ','-')
SET @count = Len(@string)
SET @i = 1
SET @slug = ''
WHILE (@i <= @count)
BEGIN
SET @c = substring(@string, @i, 1)
IF @c LIKE '[a-z0-9--]'
SET @slug = @slug + @c
SET @i = @i +1
END
RETURN @slug
END

I know it's a very old question, but since most browsers now support Unicode URLs, I found a great solution in XRegExp that converts everything except letters (in all languages) to '-'.
That can be done in several programming languages.
The pattern is \\p{^L}+; use it to replace every run of non-letters with '-'.
Working example in node.js with the xregexp module:
var text = 'This ! can # have # several $ letters % from different languages such as עברית or Español';
var slugRegEx = XRegExp('((?!\\d)\\p{^L})+', 'g');
var slug = XRegExp.replace(text, slugRegEx, '-').toLowerCase();
console.log(slug); // ==> "this-can-have-several-letters-from-different-languages-such-as-עברית-or-español"
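A rough Python equivalent needs no extra module, because \W is Unicode-aware for str patterns in Python 3. The name unicode_slug is illustrative. Like the XRegExp pattern above, this sketch keeps digits as well as letters.

```python
import re


def unicode_slug(text):
    # [\W_]+ matches runs of anything that is not a letter or digit in any
    # script (underscore is treated as a separator too).
    return re.sub(r"[\W_]+", "-", text).strip("-").lower()


text = "This ! can # have # several $ letters % from different languages such as עברית or Español"
print(unicode_slug(text))
```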

Assuming that your model class has a title attribute, you can simply override the to_param method within the model, like this:
def to_param
title.downcase.gsub(/ /, '-')
end
This Railscast episode has all the details. You can also ensure that the title only contains valid characters using this:
validates_format_of :title, :with => /^[a-z0-9-]+$/,
:message => 'can only contain letters, numbers and hyphens'

Brian's code, in Ruby:
title.downcase.strip.gsub(/\ /, '-').gsub(/[^\w\-]/, '')
downcase turns the string to lowercase, strip removes leading and trailing whitespace, the first gsub call globally substitutes spaces with dashes, and the second removes everything that isn't a word character (a letter, digit, or underscore) or a dash.
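The same four steps can be written out one per line in Python. This is only a sketch: slug_from_title is an illustrative name, and re.ASCII is passed so that \w behaves like Ruby's ASCII-only \w.

```python
import re


def slug_from_title(title):
    s = title.lower().strip()                       # downcase, strip
    s = s.replace(" ", "-")                         # first gsub: spaces -> dashes
    return re.sub(r"[^\w-]", "", s, flags=re.ASCII)  # second gsub: drop the rest


print(slug_from_title("  Hello, World!  "))
print(slug_from_title("C# & F#"))
```

The second call shows a limitation of this chain: removing characters after the spaces were converted can leave consecutive dashes.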

There is a small Ruby on Rails plugin called PermalinkFu, that does this. The escape method does the transformation into a string that is suitable for a URL. Have a look at the code; that method is quite simple.
To remove non-ASCII characters it uses the iconv lib to translate to 'ascii//ignore//translit' from 'utf-8'. Spaces are then turned into dashes, everything is downcased, etc.

You can use the following helper method. It can convert the Unicode characters.
public static string ConvertTextToSlug(string s)
{
StringBuilder sb = new StringBuilder();
bool wasHyphen = true;
foreach (char c in s)
{
if (char.IsLetterOrDigit(c))
{
sb.Append(char.ToLower(c));
wasHyphen = false;
}
else
if (char.IsWhiteSpace(c) && !wasHyphen)
{
sb.Append('-');
wasHyphen = true;
}
}
// Avoid trailing hyphens
if (wasHyphen && sb.Length > 0)
sb.Length--;
return sb.ToString().Replace("--","-");
}

Here's my (slower, but fun to write) version of Jeff's code:
public static string URLFriendly(string title)
{
char? prevRead = null,
prevWritten = null;
var seq =
from c in title
let norm = RemapInternationalCharToAscii(char.ToLowerInvariant(c).ToString())[0]
let keep = char.IsLetterOrDigit(norm)
where prevRead.HasValue || keep
let replaced = keep ? norm
: prevWritten != '-' ? '-'
: (char?)null
where replaced != null
let s = replaced + (prevRead == null ? ""
: norm == '#' && "cf".Contains(prevRead.Value) ? "sharp"
: norm == '+' ? "plus"
: "")
let _ = prevRead = norm
from written in s
let __ = prevWritten = written
select written;
const int maxlen = 80;
return string.Concat(seq.Take(maxlen)).TrimEnd('-');
}
public static string RemapInternationalCharToAscii(string text)
{
var seq = text.Normalize(NormalizationForm.FormD)
.Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);
return string.Concat(seq).Normalize(NormalizationForm.FormC);
}
My test string:
" I love C#, F#, C++, and... Crème brûlée!!! They see me codin'... they hatin'... tryin' to catch me codin' dirty... "

The Stack Overflow solution is great, but modern browsers (excluding IE, as usual) now handle UTF-8 encoding nicely.
So I upgraded the proposed solution:
public static string ToFriendlyUrl(string title, bool useUTF8Encoding = false)
{
...
else if (c >= 128)
{
int prevlen = sb.Length;
if (useUTF8Encoding )
{
sb.Append(HttpUtility.UrlEncode(c.ToString(CultureInfo.InvariantCulture),Encoding.UTF8));
}
else
{
sb.Append(RemapInternationalCharToAscii(c));
}
...
}
Full Code on Pastebin
Edit: Here's the code for RemapInternationalCharToAscii method (that's missing in the pastebin).

I liked the way this is done without using regular expressions, so I ported it to PHP. I just added a function called is_between to check characters:
function is_between($val, $min, $max)
{
$val = (int) $val; $min = (int) $min; $max = (int) $max;
return ($val >= $min && $val <= $max);
}
function international_char_to_ascii($char)
{
if (mb_strpos('àåáâäãåą', $char) !== false)
{
return 'a';
}
if (mb_strpos('èéêëę', $char) !== false)
{
return 'e';
}
if (mb_strpos('ìíîïı', $char) !== false)
{
return 'i';
}
if (mb_strpos('òóôõö', $char) !== false)
{
return 'o';
}
if (mb_strpos('ùúûüŭů', $char) !== false)
{
return 'u';
}
if (mb_strpos('çćčĉ', $char) !== false)
{
return 'c';
}
if (mb_strpos('żźž', $char) !== false)
{
return 'z';
}
if (mb_strpos('śşšŝ', $char) !== false)
{
return 's';
}
if (mb_strpos('ñń', $char) !== false)
{
return 'n';
}
if (mb_strpos('ýÿ', $char) !== false)
{
return 'y';
}
if (mb_strpos('ğĝ', $char) !== false)
{
return 'g';
}
if (mb_strpos('ř', $char) !== false)
{
return 'r';
}
if (mb_strpos('ł', $char) !== false)
{
return 'l';
}
if (mb_strpos('đ', $char) !== false)
{
return 'd';
}
if (mb_strpos('ß', $char) !== false)
{
return 'ss';
}
if (mb_strpos('Þ', $char) !== false)
{
return 'th';
}
if (mb_strpos('ĥ', $char) !== false)
{
return 'h';
}
if (mb_strpos('ĵ', $char) !== false)
{
return 'j';
}
return '';
}
function url_friendly_title($url_title)
{
if (empty($url_title))
{
return '';
}
$url_title = mb_strtolower($url_title);
$url_title_max_length = 80;
$url_title_length = mb_strlen($url_title);
$url_title_friendly = '';
$url_title_dash_added = false;
$url_title_char = '';
for ($i = 0; $i < $url_title_length; $i++)
{
$url_title_char = mb_substr($url_title, $i, 1);
if (strlen($url_title_char) == 2)
{
$url_title_ascii = ord($url_title_char[0]) * 256 + ord($url_title_char[1]);
}
else
{
$url_title_ascii = ord($url_title_char);
}
if (is_between($url_title_ascii, 97, 122) || is_between($url_title_ascii, 48, 57))
{
$url_title_friendly .= $url_title_char;
$url_title_dash_added = false;
}
elseif(is_between($url_title_ascii, 65, 90))
{
$url_title_friendly .= chr(($url_title_ascii | 32));
$url_title_dash_added = false;
}
elseif($url_title_ascii == 32 || $url_title_ascii == 44 || $url_title_ascii == 46 || $url_title_ascii == 47 || $url_title_ascii == 92 || $url_title_ascii == 45 || $url_title_ascii == 47 || $url_title_ascii == 95 || $url_title_ascii == 61)
{
if (!$url_title_dash_added && mb_strlen($url_title_friendly) > 0)
{
$url_title_friendly .= chr(45);
$url_title_dash_added = true;
}
}
else if ($url_title_ascii >= 128)
{
$url_title_previous_length = mb_strlen($url_title_friendly);
$url_title_friendly .= international_char_to_ascii($url_title_char);
if ($url_title_previous_length != mb_strlen($url_title_friendly))
{
$url_title_dash_added = false;
}
}
if ($i == $url_title_max_length)
{
break;
}
}
if ($url_title_dash_added)
{
return mb_substr($url_title_friendly, 0, -1);
}
else
{
return $url_title_friendly;
}
}

All browsers now handle UTF-8 encoding nicely, so you can use the WebUtility.UrlEncode method. It is like the HttpUtility.UrlEncode used by @giamin, but it also works outside of a web application.
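For comparison, here is what UTF-8 percent-encoding of a slug looks like in Python, using urllib.parse.quote from the standard library. The wrapper name encode_slug is illustrative.

```python
from urllib.parse import quote, unquote


def encode_slug(slug):
    # Non-ASCII characters are encoded as UTF-8 bytes, each written as %XX.
    return quote(slug, safe="-")


encoded = encode_slug("español")
print(encoded)           # espa%C3%B1ol
print(unquote(encoded))  # español
```

Browsers typically display the decoded form in the address bar while sending the percent-encoded form over the wire.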

I ported the code to TypeScript. It can easily be adapted to JavaScript.
I am adding a .contains method to the String prototype; if you're targeting the latest browsers or ES6, you can use .includes instead.
if (!String.prototype.contains) {
String.prototype.contains = function (check) {
return this.indexOf(check, 0) !== -1;
};
}
declare interface String {
contains(check: string): boolean;
}
export function MakeUrlFriendly(title: string) {
if (title == null || title == '')
return '';
const maxlen = 80;
let len = title.length;
let prevdash = false;
let result = '';
let c: string;
let cc: number;
let remapInternationalCharToAscii = function (c: string) {
let s = c.toLowerCase();
if ("àåáâäãåą".contains(s)) {
return "a";
}
else if ("èéêëę".contains(s)) {
return "e";
}
else if ("ìíîïı".contains(s)) {
return "i";
}
else if ("òóôõöøőð".contains(s)) {
return "o";
}
else if ("ùúûüŭů".contains(s)) {
return "u";
}
else if ("çćčĉ".contains(s)) {
return "c";
}
else if ("żźž".contains(s)) {
return "z";
}
else if ("śşšŝ".contains(s)) {
return "s";
}
else if ("ñń".contains(s)) {
return "n";
}
else if ("ýÿ".contains(s)) {
return "y";
}
else if ("ğĝ".contains(s)) {
return "g";
}
else if (c == 'ř') {
return "r";
}
else if (c == 'ł') {
return "l";
}
else if (c == 'đ') {
return "d";
}
else if (c == 'ß') {
return "ss";
}
else if (c == 'Þ') {
return "th";
}
else if (c == 'ĥ') {
return "h";
}
else if (c == 'ĵ') {
return "j";
}
else {
return "";
}
};
for (let i = 0; i < len; i++) {
c = title[i];
cc = c.charCodeAt(0);
if ((cc >= 97 /* a */ && cc <= 122 /* z */) || (cc >= 48 /* 0 */ && cc <= 57 /* 9 */)) {
result += c;
prevdash = false;
}
else if ((cc >= 65 && cc <= 90 /* A - Z */)) {
result += c.toLowerCase();
prevdash = false;
}
else if (c == ' ' || c == ',' || c == '.' || c == '/' || c == '\\' || c == '-' || c == '_' || c == '=') {
if (!prevdash && result.length > 0) {
result += '-';
prevdash = true;
}
}
else if (cc >= 128) {
let prevlen = result.length;
result += remapInternationalCharToAscii(c);
if (prevlen != result.length) prevdash = false;
}
if (i == maxlen) break;
}
if (prevdash)
return result.substring(0, result.length - 1);
else
return result;
}

No, no, no. You are all so very wrong. Except for the diacritics-fu stuff, you're getting there, but what about Asian characters (shame on Ruby developers for not considering their nihonjin brethren).
Firefox and Safari both display non-ASCII characters in the URL, and frankly they look great. It is nice to support links like 'http://somewhere.com/news/read/お前たちはアホじゃないかい'.
So here's some PHP code that'll do it, but I just wrote it and haven't stress tested it.
<?php
function slug($str)
{
$args = func_get_args();
$args = array_filter($args); // remove blanks (array_filter returns a new array)
$slug = mb_strtolower(implode('-', $args));
$real_slug = '';
$hyphen = '';
foreach(SU::mb_str_split($slug) as $c)
{
if (strlen($c) > 1 && mb_strlen($c)===1)
{
$real_slug .= $hyphen . $c;
$hyphen = '';
}
else
{
switch($c)
{
case '&':
$hyphen = $real_slug ? '-and-' : '';
break;
case 'a':
case 'b':
case 'c':
case 'd':
case 'e':
case 'f':
case 'g':
case 'h':
case 'i':
case 'j':
case 'k':
case 'l':
case 'm':
case 'n':
case 'o':
case 'p':
case 'q':
case 'r':
case 's':
case 't':
case 'u':
case 'v':
case 'w':
case 'x':
case 'y':
case 'z':
case 'A':
case 'B':
case 'C':
case 'D':
case 'E':
case 'F':
case 'G':
case 'H':
case 'I':
case 'J':
case 'K':
case 'L':
case 'M':
case 'N':
case 'O':
case 'P':
case 'Q':
case 'R':
case 'S':
case 'T':
case 'U':
case 'V':
case 'W':
case 'X':
case 'Y':
case 'Z':
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9':
$real_slug .= $hyphen . $c;
$hyphen = '';
break;
default:
$hyphen = $hyphen ? $hyphen : ($real_slug ? '-' : '');
}
}
}
return $real_slug;
}
Example:
$str = "~!@#$%^&*()_+-=[]\{}|;':\",./<>?\n\r\t\x07\x00\x04 コリン ~!@#$%^&*()_+-=[]\{}|;':\",./<>?\n\r\t\x07\x00\x04 トーマス ~!@#$%^&*()_+-=[]\{}|;':\",./<>?\n\r\t\x07\x00\x04 アーノルド ~!@#$%^&*()_+-=[]\{}|;':\",./<>?\n\r\t\x07\x00\x04";
echo slug($str);
Outputs:
コリン-and-トーマス-and-アーノルド
The '-and-' is because &'s get changed to '-and-'.

Rewrite of Jeff's code to be more concise:
public static string RemapInternationalCharToAscii(char c)
{
var s = c.ToString().ToLowerInvariant();
var mappings = new Dictionary<string, string>
{
{ "a", "àåáâäãåą" },
{ "c", "çćčĉ" },
{ "d", "đ" },
{ "e", "èéêëę" },
{ "g", "ğĝ" },
{ "h", "ĥ" },
{ "i", "ìíîïı" },
{ "j", "ĵ" },
{ "l", "ł" },
{ "n", "ñń" },
{ "o", "òóôõöøőð" },
{ "r", "ř" },
{ "s", "śşšŝ" },
{ "ss", "ß" },
{ "th", "Þ" },
{ "u", "ùúûüŭů" },
{ "y", "ýÿ" },
{ "z", "żźž" }
};
foreach(var mapping in mappings)
{
if (mapping.Value.Contains(s))
return mapping.Key;
}
return string.Empty;
}

Related

Split a string by commas, but not commas inside a token

Let's say I have this:
something,"another thing"
This can be split easily with a normal split function.
Now I want to have more complicated syntax and I do:
something,"in a string, oooh",rgba(4,2,0)
This does not work with a regular split function.
I tried using things like replacing commas inside of specific types of tokens, but that became too over-complicated and I feel there has to be a better way.
Then I tried regular expressions, which worked until I had to add a new feature that the regexp I had (which was pretty bad) couldn't handle. Regexp matches can also be slow, and this is supposed to be as fast as possible.
What would be a better way to solve this?
Here is the source repo for extra context https://github.com/hyprland-community/hyprparse
And the format in question is the hyprland config format
Iterate over the string keeping a context state:
None
Inside a "..."
Inside a (...)
Inside a context, comma has no separator meaning.
Limitations: This is a midnight hack!
See also Rust Playground
fn split(s: String) -> Vec<String> {
    // None: top level; Some('"'): inside "..."; Some('('): inside (...).
    let mut context: Option<char> = None;
    // Byte offset at which the current item starts.
    let mut start = 0;
    let mut items = Vec::new();
    // char_indices yields byte offsets, so slicing stays valid for non-ASCII input.
    for (i, c) in s.char_indices() {
        match context {
            Some('"') => {
                if c == '"' {
                    context = None;
                }
            }
            Some(_) => {
                if c == ')' {
                    context = None;
                }
            }
            None => {
                if c == '"' || c == '(' {
                    context = Some(c);
                } else if c == ',' {
                    items.push(s[start..i].to_string());
                    start = i + 1;
                }
            }
        }
    }
    items.push(s[start..].to_string());
    items
}
fn main() {
let s = "something,\"in a string, oooh\",rgba(4,2,0)".to_string();
println!("{:?}", split(s));
// -> ["something", "\"in a string, oooh\"", "rgba(4,2,0)"]
}
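One limitation of a single context flag is nested parentheses. A minimal Python sketch of the same idea with a depth counter follows; split_top_level is an illustrative name, and whether the hyprland config format needs nesting at all is an assumption here.

```python
def split_top_level(s):
    """Split on commas that are outside quotes and outside any parentheses."""
    items = []
    start = 0
    depth = 0          # current parenthesis nesting level
    in_quotes = False  # True while inside "..."
    for i, ch in enumerate(s):
        if in_quotes:
            if ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # tolerate a stray ')'
        elif ch == "," and depth == 0:
            items.append(s[start:i])
            start = i + 1
    items.append(s[start:])
    return items


print(split_top_level('something,"in a string, oooh",rgba(4,2,0)'))
```

Escaped quotes inside strings are not handled; that would need one more state.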
Thanks for all the help everyone, I eventually came up with my own solution with the help of people I knew IRL, here is my solution:
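fn previous_rgb(index: usize, chars: Vec<char>) -> bool {
    // Guard against underflow when '(' is among the first three characters.
    if index < 3 {
        return false;
    }
    let string: String = chars[index - 3..index].iter().collect();
    // "gba" is the tail of "rgba(", and "rgb" the tail of "rgb(".
    string == "gba" || string == "rgb"
}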
fn splitter<Str: ToString>(s: Str) -> Vec<String> {
let mut is_in_rgb = false;
let mut is_in_str = false;
let mut last = 0;
let chars: Vec<_> = s.to_string().chars().collect();
let mut final_str = vec![];
for (index, c) in chars.iter().enumerate() {
if *c == '(' && previous_rgb(index, chars.clone()) && !is_in_str {
is_in_rgb = true;
} else if *c == ')' && is_in_rgb && !is_in_str {
is_in_rgb = false;
} else if *c == '"' && is_in_str {
is_in_str = false;
} else if *c == '"' && !is_in_str {
is_in_str = true;
} else if *c == ',' && !is_in_str && !is_in_rgb {
final_str.push(chars[last..index].iter().collect());
last = index + 1
};
}
final_str.push(chars[last..].iter().collect());
final_str
}
fn main() {
let splitted = splitter(r#"test,test: rgba(5,4,3),"test2, seperated""#);
println!("{splitted:?}");
}

Changing if-else into switch statements?

I am using multiple else/if statements and I want to use switch statements. I tried but don't know how to fill in multiple answers.
How can I convert the following else/if statements to switch?
if (this.state.item.type === 'dashboard') {
settings.view_name = this.state.view_name;
settings.layout = this.state.layout;
settings.inline_edit = this.state.inlineEdit;
settings.show_add = this.state.showAdd;
settings.tab_queries = this.state.tabQueries;
settings.carousel = this.state.carousel;
settings.template = this.state.template;
settings.sort = this.state.sort;
settings.templateOptions = this.state.templateOptions;
} else if (this.state.item.type === 'list_option_percentage_conditions') {
settings.pie_charts = this.state.pie_charts;
} else if (this.state.item.type === 'count_list_options') {
settings.count_list_field = this.state.count_list_field;
settings.date_field = this.state.date_field;
settings.use_creation_date = this.state.use_creation_date;
settings.scheme = this.state.scheme;
} else if (this.state.item.type === 'sum_fields') {
settings.sum_field = this.state.sum_field;
settings.scheme = this.state.scheme;
}
I want it to look something like this:
switch (this.state.item.type) {
case "dashboard":
answer = "";
break;
case "list_option_percentage_conditions":
answer = "";
break;
case "count_list_options":
answer = "";
break;
case "sum_fields":
answer = "";
break;
}

How do I rewrite regex in golang to work around no support for positive lookaheads? [duplicate]

I am trying to write a password validation function with regexp and don't know how to do it.
The regex package provided by the standard API of the Go language is different from those of other languages.
Does anyone have an idea what this regexp pattern should look like?
The pattern should validate:
/*
* Password rules:
* at least 7 letters
* at least 1 number
* at least 1 upper case
* at least 1 special character
*/
That's actually impossible since Go's regex doesn't support backtracking.
However, it's easy to implement, a simple example:
func verifyPassword(s string) (sevenOrMore, number, upper, special bool) {
letters := 0
for _, c := range s {
switch {
case unicode.IsNumber(c):
number = true
case unicode.IsUpper(c):
upper = true
letters++
case unicode.IsPunct(c) || unicode.IsSymbol(c):
special = true
case unicode.IsLetter(c) || c == ' ':
letters++
default:
//return false, false, false, false
}
}
sevenOrMore = letters >= 7
return
}
The right regexp would be... no regexp here.
You can define a custom function that would validate the password, and combine it with other frameworks helping validating a field, like mccoyst/validate (mentioned in this discussion about parameter validation)
You also have go-validator/validator, which allows you to define similar validations (but I would still use a custom validator instead of one or several regexps).
Note: Go's regexp is based on re2, an efficient, principled regular expression library.
So the major trade-offs are no back-references, for example (abc)\1, and no lookahead or lookbehind assertions.
In exchange you get high speed regex.
Building from a neighboring answer, I too wrote a helper function that works well for me. This one just assumes overall password length is satisfactory. Check out the following...
func isValid(s string) bool {
var (
hasMinLen = false
hasUpper = false
hasLower = false
hasNumber = false
hasSpecial = false
)
if len(s) >= 7 {
hasMinLen = true
}
for _, char := range s {
switch {
case unicode.IsUpper(char):
hasUpper = true
case unicode.IsLower(char):
hasLower = true
case unicode.IsNumber(char):
hasNumber = true
case unicode.IsPunct(char) || unicode.IsSymbol(char):
hasSpecial = true
}
}
return hasMinLen && hasUpper && hasLower && hasNumber && hasSpecial
}
isValid("pass") // false
isValid("password") // false
isValid("Password") // false
isValid("P@ssword") // false
isValid("P@ssw0rd") // true
Go Playground example
Based on @OneOfOne's answer, with some error message improvements:
package main
import (
"fmt"
"strings"
"unicode"
)
func verifyPassword(password string) error {
var uppercasePresent bool
var lowercasePresent bool
var numberPresent bool
var specialCharPresent bool
const minPassLength = 8
const maxPassLength = 64
var passLen int
var errorString string
for _, ch := range password {
switch {
case unicode.IsNumber(ch):
numberPresent = true
passLen++
case unicode.IsUpper(ch):
uppercasePresent = true
passLen++
case unicode.IsLower(ch):
lowercasePresent = true
passLen++
case unicode.IsPunct(ch) || unicode.IsSymbol(ch):
specialCharPresent = true
passLen++
case ch == ' ':
passLen++
}
}
appendError := func(err string) {
if len(strings.TrimSpace(errorString)) != 0 {
errorString += ", " + err
} else {
errorString = err
}
}
if !lowercasePresent {
appendError("lowercase letter missing")
}
if !uppercasePresent {
appendError("uppercase letter missing")
}
if !numberPresent {
appendError("at least one numeric character required")
}
if !specialCharPresent {
appendError("special character missing")
}
if !(minPassLength <= passLen && passLen <= maxPassLength) {
appendError(fmt.Sprintf("password length must be between %d and %d characters", minPassLength, maxPassLength))
}
if len(errorString) != 0 {
return fmt.Errorf("%s", errorString)
}
return nil
}
// Let's test it
func main() {
password := "Apple"
err := verifyPassword(password)
fmt.Println(password, " ", err)
}
Below is my implementation based on the answers above, with custom error messages and a few performance-minded tweaks.
package main
import (
"fmt"
"strconv"
"unicode"
)
func main() {
pass := "12345678_Windrol"
// call the password validator and give it field name to be known by the user, password, and the min and max password length
isValid, errs := isValidPassword("Password", pass, 8, 32)
if isValid {
fmt.Println("The password is valid")
} else {
for _, v := range errs {
fmt.Println(v)
}
}
}
func isValidPassword(field, s string, min, max int) (isValid bool, errs []string) {
var (
isMin bool
special bool
number bool
upper bool
lower bool
)
// collect errors; defined before its first use so the code compiles
appendError := func(err string) {
errs = append(errs, field+" "+err)
}
// test for the maximum and minimum characters required for the password string
if len(s) >= min && len(s) <= max {
isMin = true
} else {
appendError("length should be " + strconv.Itoa(min) + " to " + strconv.Itoa(max))
}
for _, c := range s {
// Optimize perf if all become true before reaching the end
if special && number && upper && lower && isMin {
break
}
// else go on switching
switch {
case unicode.IsUpper(c):
upper = true
case unicode.IsLower(c):
lower = true
case unicode.IsNumber(c):
number = true
case unicode.IsPunct(c) || unicode.IsSymbol(c):
special = true
}
}
// Add custom error messages
if !special {
appendError("should contain at least a single special character")
}
if !number {
appendError("should contain at least a single digit")
}
if !lower {
appendError("should contain at least a single lowercase letter")
}
if !upper {
appendError("should contain at least a single uppercase letter")
}
// if there is any error
if len(errs) > 0 {
return false, errs
}
// everything is right
return true, errs
}
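One detail of the length check above: len(s) returns the number of bytes, so a password containing non-ASCII characters is measured as longer than it looks. A minimal sketch of a character-based check follows, assuming that counting runes is what you want; lengthInRange is a hypothetical helper name, not part of the answer above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lengthInRange is a hypothetical helper: it counts runes (characters)
// rather than bytes when checking the allowed length.
func lengthInRange(s string, min, max int) bool {
	n := utf8.RuneCountInString(s)
	return n >= min && n <= max
}

func main() {
	s := "p\u00e4ssw\u00f6rd" // 8 characters, two of which take 2 bytes in UTF-8
	fmt.Println(len(s), utf8.RuneCountInString(s)) // 10 8
	fmt.Println(lengthInRange(s, 8, 8))            // true
}
```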
There are many ways to skin a cat. The other answers seem to veer away from regex completely, so I thought I'd show my method for simple pass/fail testing of a password string. (Note that this doesn't meet the literal "7 letters" requirement in the original question, but it does check the overall length.) To me, this code is fairly simple and easier to read than switch statements or a bunch of if statements:
password := "Pa$$w0rd"
secure := true
tests := []string{".{7,}", "[a-z]", "[A-Z]", "[0-9]", "[^\\d\\w]"}
for _, test := range tests {
t, _ := regexp.MatchString(test, password)
if !t {
secure = false
break
}
}
//secure will be true, since the string "Pa$$w0rd" passes all the tests
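The same idea can be wrapped in a reusable function. This is only a sketch: isSecure and passwordTests are hypothetical names, and the patterns are the ones from the snippet above, precompiled once with regexp.MustCompile so that an invalid pattern fails at startup instead of being silently ignored.

```go
package main

import (
	"fmt"
	"regexp"
)

// passwordTests holds one pattern per rule; every pattern must match.
var passwordTests = []*regexp.Regexp{
	regexp.MustCompile(`.{7,}`),   // at least 7 characters
	regexp.MustCompile(`[a-z]`),   // a lowercase ASCII letter
	regexp.MustCompile(`[A-Z]`),   // an uppercase ASCII letter
	regexp.MustCompile(`[0-9]`),   // an ASCII digit
	regexp.MustCompile(`[^\d\w]`), // anything other than a digit, letter, or underscore
}

// isSecure (hypothetical name) reports whether s passes every test.
func isSecure(s string) bool {
	for _, re := range passwordTests {
		if !re.MatchString(s) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isSecure("Pa$$w0rd"))  // true
	fmt.Println(isSecure("password"))  // false: no uppercase letter
	fmt.Println(isSecure("Pass_w0rd")) // false: '_' is part of \w, so it is not "special"
}
```

Note the last case: because \w includes the underscore, a password whose only non-alphanumeric character is '_' fails the final test.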

How do I detect non-letters with virtual codes in the Windows API?

In the Windows API and Direct2D/DirectWrite, I'm detecting the virtual key code so that text input can be appended in a 2D GUI. While it works fine for letters, how can I include non-letters such as !, ?, ., etc.?
For example, when I press Shift+1, I get '1' instead of '!'. When I press '.', I get a boxed character. Can this be handled in this function somehow?
wchar_t TextBox::charIsPressed(int getKey)
{
char letter = getKey;
// Check for space character
if (letter == ' ')
return (wchar_t)letter;
// Check if the input is a letter; lowercase it unless Shift is held
if ((getKey >= 'A') && (getKey <= 'Z'))
{
if (!GetAsyncKeyState(VK_SHIFT))
letter += 0x20;
}
return (wchar_t)letter;
}
Its calling function:
// Keyboard support
static X2D::Win32::KeyEvent *keyEvent;
if (m_focused)
{
// Check if there's editing space
if ((m_x + m_text.getWidth()) > (m_x + getWidth()))
return;
// Get the latest key event
keyEvent = frm.getKeyEvent();
if (!keyEvent->processed)
{
// Was backspace pressed?
if (keyEvent->virtual_code == VK_BACK)
{
m_text.setText(m_text.getText().substr(0, m_text.getText().length() - 1));
}
else if (keyEvent->virtual_code == VK_RETURN)
{
m_focused = false;
}
else
{
m_text.setText(m_text.getText() + charIsPressed(keyEvent->virtual_code));
}
keyEvent->processed = true;
}
}
Edit:
I found a way to detect single characters, so it's a start.
// Converts '1' to '!'
if (getKey == '1')
{
if (GetAsyncKeyState(VK_SHIFT))
return '!';
}
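Though typing '.' still gives me a semi-snowman ASCII figure.
Virtual-key codes are not characters, so they have to be translated. ToUnicode does this using the current keyboard state (including Shift) and the active keyboard layout. Try something like this (this is Delphi, but the same Windows API calls show the principle of the translation):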
function VKToChar(AVirtualCode: Word; out AChar: WideChar): Boolean;
var
KeyboardState: TKeyboardState;
ScanCode: DWORD;
Temp: UnicodeString;
Char: WideChar;
begin
AChar := #0;
Result := GetKeyboardState(KeyboardState);
if not Result then Exit;
ScanCode := MapVirtualKey(AVirtualCode, MAPVK_VK_TO_VSC);
SetLength(Temp, 3);
if ToUnicode(AVirtualCode, ScanCode, KeyboardState, PWideChar(Temp), Length(Temp), 0) = 1 then
begin
AChar := Temp[1];
Result := True;
end
else
Result := False;
end;

Replace long if, else if... else with FOR or something more compact

I receive a value via POST, then compare it (1, 2, 3, 4, 5) with variables predefined in my code.
Is it possible to do this with a for loop, or to simplify it in some other way, without changing the functionality of the code?
Yes, I need to receive the value as a number and compare it with variables (no MySQL).
Each test sets the name, e.g. $varname = "Paul";
Here's what I'm doing and what I'd like to change.
Thanks a lot.
// from the previous page, with an input named thenumber
$thenumber = $_POST['thenumber'];
$option1 = "1";
$option2 = "2";
$option3 = "3";
$option4 = "4";
$option5 = "5";
$option6 = "6";
...
...
// more options
if ($thenumber == $option1)
{
$varname = "Paul";
}
else if ($thenumber == $option2)
{
$varname = "Louie";
}
else if ($thenumber == $option3)
{
$varname = "Dimitri";
}
...
...
...
It would be much cleaner to do this with a switch; I don't think a for loop is a good fit here.
Be sure to put a break at the end of each case.
The default case runs when $thenumber matches none of the values in the cases.
switch ($thenumber) {
case 1:
$varname = "Paul";
break;
case 2:
$varname = "Louie";
break;
case 3:
$varname = "Dimitri";
break;
...
default:
$varname = "default_name";
break;
}