The “Chad” bug (plus.google.com)
220 points by mrb on Dec 31, 2015 | hide | past | favorite | 153 comments


This is remarkable. I always find it interesting when bugs like this occur.

It reminds me of a hackathon I attended where a food ordering startup (I forget the name, but they were chosen to feed us dinner that night) had a similar bug, which baffled me beyond belief. Without going into crazy detail about my password, it typically follows a certain pattern but is never the same across websites. For some reason, the website kept saying my password was invalid. It met all the password requirements that the website asked for (length, capital letter, etc.).

I forget the exact details, but it ended up being the exact location of a capital letter, the location of a number, or some combination of both. I could never figure out how a bug like that could even be coded up. My best guess is that it was some poorly-formed regex.
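For what it's worth, here's a purely hypothetical Python sketch (not the site's actual code) of how a bug like that could be coded up: the intended rule expressed with lookaheads, versus a version without them that accidentally pins the capital letter and digit to fixed positions.

```python
import re

# Intended rule: "8+ chars with at least one uppercase letter and one
# digit", written with lookaheads so position doesn't matter.
intended = re.compile(r'^(?=.*[A-Z])(?=.*\d).{8,}$')

# Written without lookaheads, the same "requirements" silently require
# the capital letter to come first and the digit to follow it directly
# after a run of lowercase letters.
buggy = re.compile(r'^[a-z]*[A-Z][a-z]*\d.*$')
```

A password like "1Secretword" meets every stated requirement yet fails the buggy pattern, which is exactly the kind of inexplicable rejection described above.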

> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.


"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."

Yes. My favorite was a Coyote Point load balancer bug. If the last character of the HTTP options is "m", the connection will not get past the load balancer.[1] I found this because a web crawler was having trouble with one site. Fortunately, I knew someone with their own Coyote Point load balancer, and was able to establish that the connection went into the load balancer and never came out.

The load balancer has a big file of rules which contain regular expressions. Somewhere, I think there's a "\m" where they meant "\n". Reporting this to the vendor, along with a Python program to demonstrate the problem, was of course futile; they suggested "upgrading the software". I demonstrated that the bug existed on their own load balancer on their own site. I finally added a completely useless field to the HTTP header so that the last character was not "m".
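A hedged reconstruction of the suspected bug, in Python. Older PCRE-style engines silently treated an unknown escape like "\m" as the literal letter "m" (Python's re module now raises "bad escape" for r"\m", so the literal form is written out here); a rule meant to anchor on the "\n" ending a header line would then instead require the request to end in "m".

```python
import re

# What a rule containing "\m" (meant to be "\n") effectively became:
buggy_rule = re.compile(r'.*m\Z', re.S)

def passes_balancer(raw_request):
    """Simulate the balancer: drop any request the buggy rule matches."""
    return buggy_rule.match(raw_request) is None

# The workaround described above: append a useless header field so the
# request no longer ends in "m".
req = 'GET / HTTP/1.1\r\nHost: example.com'
fixed = req + '\r\nX-Ignore: x'
```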

[1] https://www.webmasterworld.com/webmaster_hardware/3312997.ht...


Tell them they've violated their contract and you are migrating from them ASAP.


That doesn't work when it's software on your client's computer. Remember when all the web devs revolted and ditched IE6 back in 2001, because MS was just taking the Mickey? Yeah, me neither.


Right. I didn't have a Coyote Point unit. Many other sites did, and they all appeared to be down to my web crawler until I figured out the problem.

(Current web crawler problem: sites that won't let you read their robots.txt file if they don't like your user-agent string.)


>sites that won't let you read their robots.txt file if they don't like your user-agent string

That's hilarious. So do you borrow a browser's user-agent, or do you ignore the robots.txt?


I'd assume they want you to crawl their website. When they say you are ignoring their robots.txt file, tell them they actively prevented you from seeing it, and you could only assume that meant they WANTED to be crawled.

That would get them to fix the issue pretty quickly :-)


No `robots.txt` indeed means weapons free, but a crawler gleans useful info from it, e.g. which areas of the site are dynamically generated.

More likely the site is trying to serve custom versions of `robots.txt` to different bots, with good intentions, and the code is buggy.


The strict interpretation is that if "robots.txt" returns 403 Forbidden, it's interpreted as "deny all". That's what the Python library does. We list those sites as "Blocked".
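A sketch of that policy with the stdlib parser (the URL is a placeholder; in real use, read() is what fetches the file and applies the 403-means-deny-all rule):

```python
from urllib.robotparser import RobotFileParser

# On a 401/403 response, read() sets disallow_all, so every can_fetch()
# call returns False -- the "Blocked" case.
blocked = RobotFileParser('https://example.com/robots.txt')
blocked.disallow_all = True  # what read() does internally after a 403

# Normal case: parse an actual rule set (fed directly here to stay offline).
rp = RobotFileParser()
rp.parse(['User-agent: *', 'Disallow: /private/'])
rp.modified()  # mark the data as fresh (parse() normally does this itself)
```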


> upgrading the software

they told you this because they probably fixed the bug in a newer release. what response were you expecting, exactly? to dive in with a hex editor?

for what it's worth we run into the same issues with our network devices, it's a pain in the ass, and that's why we're shifting to SDN.


Proper response would be something like

"You've obviously put a lot of time in to this. Thanks. We actually found that bug 4 months ago in R1892, and it's patched in R1899. Please upgrade, or let us know that you're, in fact, using R1899 and still experiencing the issue (it's issue #847 on our bug tracker at foobugz.xyz)"

Or maybe...

"You've obviously put a lot of time in to this. Thanks. We actually found that bug 4 months ago in R1892, and we're planning on a maintenance release next quarter. If you'd like to test out our upgraded version, please contact jaz@balanceco.com and let him know you want to be in the testing group - he'll get you the appropriate documentation and code. Thanks for helping fix and test these issues!"


right, so in other words, a software upgrade.


In other words, acknowledge that someone did some work, and actually ask if they're using the latest version, and vitally - acknowledge that they read the correspondence and that upgrading will address the issue.

"Update your software" is often a BS answer that can introduce many other issues or breakages in production systems. Without a clean "downgrade" path (which many companies don't provide) you run a risk of introducing more problems.


i think you're misunderstanding me. i'm agreeing with you - network device vendors are universally terrible across the board, and that's why nobody should count on them unless they absolutely have to.

in other words, what i'm saying is to expect that kind of response is naive, and that you should seek alternatives.


Besides the usual regex aches and pains, the grammar for email addresses is far more complex than most people realize. According to a highly-voted Stack Overflow answer [1], the current RFC-specified grammar for addresses can't even be matched with regex alone. Combining the edge cases of the grammar with (say) Unicode normalization sounds like a recipe for hours of fun.

[1] https://stackoverflow.com/questions/201323/using-a-regular-e...


I think many people are thinking about email address validation the wrong way.

RFC 822 describes how messages are encoded when email servers talk to each other. It isn't really about email address validation and is not intended to be used to validate a form field on some registration page.

Unless you're writing an MTA or similar piece of infrastructure there is no reason you should be using the RFC grammar. Even if implementation were easy, it probably isn't what you want. For example, the spec permits inline comments but that's a nonsensical thing to have in the middle of an address you typed into an HTML form. Email addresses entered on a web form should be rejected if they contain comments, IMHO.

I think what most developers really want to know is something like: Can this given email address receive messages? Or: Does this given address actually belong to this user? Well, the only way to test that is to send it a message. At best, regex validation might warn you earlier that a given address couldn't possibly work because it's so obviously malformed. But you can't validate your way into getting people to enter their real email address if they don't want to or if they don't know what it is. If your intent is really just to help catch typos and mistakes, you'd be much better off looking to something like mailcheck [0] which will flag common typos like "foo@hotnail.com" even if they result in valid looking addresses.

[0] https://github.com/mailcheck/mailcheck
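For illustration, a minimal sketch of the mailcheck idea in Python. The provider list and similarity cutoff here are arbitrary choices for the example, not the real library's behavior.

```python
import difflib

# Illustrative list of popular providers to fuzzy-match against.
KNOWN_DOMAINS = ['gmail.com', 'hotmail.com', 'yahoo.com', 'outlook.com']

def suggest_domain(address, cutoff=0.8):
    """Return the likely intended domain, or None if it looks fine."""
    local, _, domain = address.rpartition('@')
    if not local or domain in KNOWN_DOMAINS:
        return None
    matches = difflib.get_close_matches(domain, KNOWN_DOMAINS,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

So "foo@hotnail.com" gets flagged with a suggested "hotmail.com", while unknown but plausible domains pass through untouched.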


I agree with you "100% in spirit", but feel the need to point out that RFC 5322 (822 is ancient...) specifies the Internet Message Format, and thereby isn't what is even used "when e-mail servers talk to each other": it is the format of the e-mail message itself, which is parsed only within a mailbox (by either your client or, if you are using IMAP, your server).

The actual standard used by e-mail servers talking to each other is 5321 (which you might still know as 821 ;P), the standard for SMTP: this protocol actually has a different way to write comments and escape characters, as it is embedded into a different structure (which I think decimates the arguments people tend to make that you should validate comments).

Years ago I was working quite in earnest on an e-mail server suite, and at the time I was extremely deep in the various standards, and wrote a comment that goes into somewhat more depth on the semantics of e-mail verification. The example I was really happy with is the notion that you would never ask your user to HTML escape their username or password ;P.

https://news.ycombinator.com/item?id=4486872


I really hate browser- or webapp-side email validation. I've rarely seen a web developer get this right. My advice is to treat email addresses like blobs and attempt delivery with a confirmation link. That will catch users who typo their own address, and it will let people use whatever address format they want (the "+" in an email address not being considered valid is my own personal bugaboo).


I dunno. I think people with edge case emails (i.e. those not easily validated with a simpleish regex) should be nudged towards getting more standardized emails.

Our company won't accept weird emails for free trial sign ups. We should be nudging users towards good behavior.

Of course, if your email address is provided at the corp level, then you don't have as many options.

That non-standard emails can be used as tools for phishing is another reason why we should not accept them.


I very much agree, because in the end the only validation seems to be "does it contain an @ character somewhere in the middle".

For example: example@localhost is valid but is refused by most validators.


Localhost is obviously not a fully qualified domain name which would be expected in any internet setting.


Top level domains do resolve though, email@bs is completely legal to try to send email to.


> Email addresses entered on a web form should be rejected if they contain comments, IMHO.

Why? I use several addresses that start something like myfirstname.mylastname@... . A comment at the start could make it much easier for me to use browser autofill. (Arguably that's "really" a browser UX issue, but we work with the tools we have)


I find that quite unbelievable. When I had a similar problem last year, the first resource I found was a W3C specification[1] about <input type=email>. The specification clearly states that email addresses should match:

    /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/
Since this is an official W3C doc, I see no reason why people shouldn't use this.

Edit: There is also a version by WHATWG[2] here:

    /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/
It apparently does a more thorough validation than the W3C one in the domain part, but the difference between the two is not apparent in practice.

[1]: http://www.w3.org/TR/html-markup/input.email.html [2]: https://html.spec.whatwg.org/multipage/forms.html#e-mail-sta...
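For anyone who wants to try it, here's the WHATWG pattern transcribed for Python's re module (the \/ escape is only needed inside a JavaScript /.../ literal):

```python
import re

# The WHATWG "valid e-mail address" pattern quoted above.
EMAIL_RE = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)

def is_valid_email(addr):
    return EMAIL_RE.match(addr) is not None
```

Note that it accepts single-label domains like "user@localhost" and "+" in the local part, but rejects RFC-valid quoted local parts such as c."@".t@gmail.com.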


The .NET Framework's email address validation has a pretty ... comprehensive ... regular expression [1].

[1] http://referencesource.microsoft.com/#System.ComponentModel....


Holy line of cartoon swear words Batman!

I would hate to be the programmer who had to debug that regex.


You clearly haven't seen the regex that matches emails according to RFC822:

Behold: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html


/me falls off chair


>Since this is an official W3C doc, I see no reason why people shouldn't use this.

Well, as far as authority/canon goes, it's typically dictated by the IETF and not W3. And, sure, they could cooperate with one another but, really -- if there's an authority on the protocols that describe email (SMTP, POP, IMAP, etc) -- it shouldn't be the World Wide Web Consortium.

That said, the IETF does tend to draft RFCs that reflect actual implementations (at least their intended design), but since they often bias towards interoperability, it's unlikely they'd narrow the scope of the email address grammar.


This forces users who have internationalized email addresses (say whatever@maré-design.fr - just a random IDN I know) to lookup the punycode-encoded version of the domain.

Ditto for internationalized TLDs like something@taiwan.台灣.

I'm not sure if this is really that bad, but I don't like it.


It's the job of the validation engine to convert all URLs and email addresses to a canonical form -- either punycode or actual UTF-8, but never both. I prefer standardizing on punycode because it's easier to handle.

Otherwise you're just asking for trouble when the same user who signed up with an email address at maré-design.fr later tries to reset their password with an email address at xn--mar-design-d7a.fr. Sorry, you don't seem to have an account with us.

The same thing happens with URLs. Some browsers send the Host: header in punycode but send the referer in UTF-8. Who knows how they encode CORS headers and all the other newfangled stuff that contains bits and pieces of URLs. You have to consistently convert one to the other before using any of them.

IDNs are a mess.


> It's the job of the validation engine to convert all URLs and email addresses to a canonical form

I don't think <input type="email"> does that though. Maybe it should, but users might be surprised to see their address automatically turn into some ugly unreadable xn--whatver-d7a domain. After form submission then yes of course any sane process will convert them to a canonical form.

I know there were good reasons to use punycode instead of UTF8 for IDNs, but it sure is a mess.


Yeah, I think we'll have to treat email addresses as just another kind of untrustworthy input that needs to be escaped in various ways depending on the context. Punycode for storage and validation, back to UTF-8 for display, back to punycode when it goes in a mailto: link, etc. Ditto for URLs. The days of easy parsing are sadly forever gone.


Practically, those W3C/WHATWG regexes probably are fine, but they aren't correct. For instance they fail to match the valid address c."@".t@gmail.com because they don't handle quoted strings correctly. I'd expect they also match some invalid addresses. My understanding was that only a Perl regex can test an address against the standard: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html


It may be incorrect to reject that email, but it's good to reject that email.


Why reject a perfectly valid email? Why do you get to thumb your nose at that particular one?


It's only 'valid' because of the particular decisions made by someone writing a random RFC. There's next to no reason to allow quoted strings, and even less reason to allow a different character set inside of them.

Email addresses do not need escape sequences.


Where do you draw the line? I occasionally have valid GMail addresses with "+" in them rejected because someone wrote a crappy validator.


Allow all the characters that are supposed to be valid outside of quotation marks. Letters, numbers, .!#$%&'*+-/=?^_`{|}~

That way if you don't mind letting through double periods or domain segments having more than 63 characters, you can validate an email with two character classes separated by an @. The basic regex looks like [xx]+@[yy]+
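Spelled out in Python, that minimal shape looks something like this - knowingly permissive (double dots and over-long labels get through), but it never rejects a legitimate unquoted address:

```python
import re

# Local part: the atext characters valid outside quotation marks.
# Domain part: letters, digits, dots, hyphens. That's it.
SIMPLE_EMAIL = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9.-]+$"
)
```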


Doesn't this sort of RFC typically summarize and codify real-world usage, rather than coming up with new stuff? If a feature is in there, I'd have thought it's because somebody uses it.


At some point you just have to reach the level of "Yeah... don't do that." I would suspect having "@" as part of your email probably approaches that line.


It would be interesting to know the context. If this feature was already in a popular piece of software, then saying "don't do that" in your standard risks fragmenting things rather than codifying them, and that's no good either.


> It's only 'valid' because of the particular decisions made by someone writing a random RFC.

You know that's exactly what "valid" means, when talking about something standardized, right?

In HTML there's no reason to use &quot; outside of an attribute value. Should browsers reject it?


I don't think gmail will let you register that email address.


Okay then, c."@".t@someotherdomain.com


The email input regexp is only about matching the atext@domain subpart of the syntax. The full email address parsers are about parsing the full allowed syntax you can put in an email to: field that includes address lists and display names.

Also W3C produced HTML specs are hardly gospel about email-related things (but I don't know whether there's anything wrong in this case).


Replying to myself (can't edit anymore), looks like the W3C spec bungled it in your [1] by referencing "atext" from the RFC. For example, as specified it disallows addresses with periods in them. In the RFC, "atext" is just a building block in the syntax of the local-part before @ sign.


But the question is, if a signup form is asking for your email address, will you type in your name, a pair of angle brackets, and then your email address in between? That's simply not what the web developer wants here.


Will I type that in? No. Will I copy-paste my email from my address book program? Quite possibly.


Saying "it can't even be matched with a regex" is trivial. Here's another, very simple thing that can't be matched with a regex: Matching pairs of nested parentheses.


The spec of what HTML will accept for an email in an input form is quite different from what email (which predates HTML by a lot, the major RFC 822 being from 1982) actually accepts.


The W3C and WHATWG are web standards bodies. Email isn't in their bailiwick. The acceptable format for email addresses is defined in IETF RFCs.


direct link to image visualizing the complexity of the state machine required: https://i.stack.imgur.com/SrUwP.png


> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Great quote. However, I think regexes got a bad reputation just because of the way people use them. In essence they are a pretty reliable way of parsing because the parsing engine is well tested. But the expression should be kept as simple as possible and developer should avoid using any nonstandard / nonexplicit extensions. I even avoid using \w because, well, what IS a word character? I am sure it is defined somewhere... but I'll always use explicit form (like "[a-zA-Z]" when I want ASCII chars) instead.

Anyway, if you use the form as used in the regex puzzle [0], you'll be fine. As long as you use regex only for what it was meant for, of course [1]...

[0] https://news.ycombinator.com/item?id=10787509

[1] http://blog.codinghorror.com/parsing-html-the-cthulhu-way/
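For example, the \w worry is concrete in Python 3, where \w is Unicode-aware by default and matches far more than the explicit ASCII class:

```python
import re

# \w matches any Unicode "word" character, so the accented ï counts;
# the explicit ASCII class does not match it.
unicode_hit = re.fullmatch(r'\w+', 'naïve')        # matches
ascii_hit   = re.fullmatch(r'[a-zA-Z]+', 'naïve')  # ï is not in [a-zA-Z]
```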


I find a lot of websites will eat passwords that contain special characters. I don't mean that they'll tell you it doesn't match the password policy, I mean that they'll accept the password and then tell you the password is wrong when you come back to sign in. I eventually had to teach my password generator to use only a few usually-properly-handled special characters when generating to avoid the hassle of having to reset the password every time. The same thing is often true of long passwords -- websites will accept the password at the UI level, but it probably gets truncated somewhere in the processing, and you don't know which character you got cut off at, so you have to reset to something shorter.


If it's going to eat a character it should at least be consistent and also eat it at the validation stage.


From what I've seen of "regular people" writing regular expressions, they don't seem to have the slightest clue how to do it.

And then they put the result into the program without testing it properly.

So, sorry, the issue is not regexes, but people just going at it in a "trial and error" fashion (and sometimes just trial).


For people here that may ask themselves just how exactly one could test regular expressions, I recommend a visual tool like [0]. Having the regex structure and meaning drawn in front of you helps tremendously.

That, and for the love of God please comment any non-trivial regexp. Either like this, with 'x' option:

  preg_match('/^
              (?=.{6,})       # Require at least 6 characters (lookahead) ...
              (?=.*\d)        # ... AND at least one digit somewhere (lookahead) ...
              (?=.*[a-zA-Z])  # ... AND at least one letter somewhere (lookahead) ...
              .*              # ... then consume the rest of the string.
              $/x', $password);
... or just with normal programming language comments and stitching regexp from multiple strings in multiple lines. Also give some semantic meaning to groups if you use them, e.g. tag them with constants so that your code isn't full of stuff "result.get(3)", which makes you waste time on trying to recall what was that group 3 in the code from last month.

I know it's pretty much software engineering 101. It's the basics of basics. But from my experience, even the brightest of engineers in most serious projects suddenly forget how to write code when they touch regular expressions.

[0] - https://www.debuggex.com


Using tools is good, but people should also cover regexes in their unit tests, checking what should and should not be accepted.

If you're forbidding items beginning with numbers, just have a test pass in '1a' and expect the match to fail.
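Concretely, with an assumed identifier rule that forbids a leading digit (testing both directions keeps a later edit to the pattern from silently loosening or tightening it):

```python
import re

# Hypothetical rule under test: identifiers start with a letter or
# underscore, never a digit.
IDENT = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')

def test_rejects_leading_digit():
    assert not IDENT.match('1a')

def test_accepts_normal_names():
    assert IDENT.match('a1')
    assert IDENT.match('_private')
```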


I agree of course. Unit tests like that are both basic sanity checks and, more importantly, protect you from stupid mistakes when the regexp has to be changed.


It should be common knowledge at this point, but just in case:

If you're doing regex or any other text manipulation on user input when you ask them to set a password, you're doing it wrong.


What's the best way to deal with this problem, and what's the correct way to deal with this problem?


The input the user puts into the password prompt should be taken exactly as is, pushed into bcrypt/scrypt/etc, then stored as the user's password hash.

I'm not entirely opposed to requiring a minimum length, but imposing max lengths / character class rules / etc ends up hurting people who want to pick strong passwords more than it helps people who would pick weak ones (enforcing character classes just gets us lots of password1A! and similar)
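A minimal sketch of "take the input exactly as typed" with the stdlib scrypt binding - no trimming, no character rules, straight into a KDF. The cost parameters are illustrative, not a tuned production choice.

```python
import hashlib
import hmac
import os

# Illustrative scrypt cost parameters (n=2**14, r=8, p=1 is a common
# interactive-login choice); maxmem raised so OpenSSL doesn't refuse.
def hash_password(password, salt=None):
    salt = salt or os.urandom(16)
    digest = hashlib.scrypt(password.encode('utf-8'), salt=salt,
                            n=2**14, r=8, p=1, maxmem=2**26)
    return salt, digest

def check_password(password, salt, digest):
    candidate = hashlib.scrypt(password.encode('utf-8'), salt=salt,
                               n=2**14, r=8, p=1, maxmem=2**26)
    return hmac.compare_digest(candidate, digest)
```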


I agree that the only real restriction that makes sense is a minimum character count. The others just tend to get in the way. I haven't seen anyone implement it in the wild, but it'd also be cool if there was a wordlist of the 25 most common passwords that the site matched against and refused to accept. I think those two policies, minimum length and no super common passwords, would do a lot to minimize the effectiveness of dictionary attacks.


I personally like the Stanford policy[1]. It basically boils down to: the longer the password, the fewer restrictions. Each password needs to be at least 8 characters, but if you only have 8 characters, you might need uppercase, lowercase, a symbol and a digit. If you have 12 characters, you only need upper/lower/digit. Once you hit 20 characters, you can have whatever you want.

I think that this is a good balance between security for short passwords, while still allowing ridiculously long ones (pass-phrases).

[1] http://arstechnica.com/security/2014/04/stanfords-password-p...


    $ dd if=/dev/urandom bs=32 count=1 | xxd -p
That nearly every site on the Internet will refer to that output as "not strong enough" and instead suggest P@ssword1 as a better alternative definitely speaks to the issue.


Almost like you're validating on entropy and not specific rules..........


The concept is nice, but their numbers are horribly, unforgivably wrong.

8 random upper/lower/digit/symbol characters are equal to 9 random mixed-case letters. Not 16.

Their cutoffs for different mixes are 8, 12, 16, 20. Realistic cutoffs would look more like 10, 11, 11, 14.

Even worse, they encourage counting the individual letters in words. Never do that. Random words are only as good as two random characters.


> I haven't seen anyone implement it in the wild, but it'd also be cool if there was a wordlist of the 25 most common passwords that the site matched against and refused to accept

It's now in django[0]. And it's a hoot to read...

[0] https://github.com/django/django/blob/master/django/contrib/...


> I haven't seen anyone implement it in the wild, but it'd also be cool if there was a wordlist of the 25 most common passwords that the site matched against and refused to accept.

I have implemented that on a site. It wouldn't accept the top 10,000 most common passwords. (or was it 1k).

Otherwise ~1% of your users will pick "password" as a password.


> The input the user puts into the password prompt should by taken exactly as is

I would however suggest trimming trailing spaces.

Also, for a nicer user experience try the password twice: As is, and with case reversed. This lets people login even with capslock on and has little impact on security (i.e. don't be case insensitive! just case reversed.)


I see the reasoning behind trimming trailing spaces, but it could cause at least as many problems as it solves. Plenty of people use automatically generated passwords, and while hopefully most auto-generators don't include spaces, I expect some might. Better might be to warn the user if their password has a trailing space.


Could you not trim spaces upon attempted login as well to make this a non-issue?

I do like the idea of warning the user upon creation.


Good idea; I agree that that's better.


Is there any password generator in this reality which does that? Serious question. More than zero?


Probably not a serious one, but a lot of people write their own little scripts and utilities to do this. So definitely more than zero. Perhaps not a significant number, but I'd still be concerned about changing a user's entered password without letting them know. relearn's solution is a good one though.


I would guess the number of people who accidentally enter "password " is far greater than the number of people who have "p " as their password. Plus, the people who do deliberately input spaces will just have a weaker password, not issues logging in.


I can see how that could fail horribly when you have arbitrary Unicode data.

For instance, the German sharp s (ß) has an asymmetrical casemapping.

From the Unicode standard[0]:

>The German sharp s character has several complications in case mapping. Not only does its uppercase mapping expand in length, but its default case-pairings are asymmetrical. The default case mapping operations follow standard German orthography, which uses the string “SS” as the regular uppercase mapping for U+00DF ß latin small letter sharp s. In contrast, the alternate, single character uppercase form, U+1E9E latin capital letter sharp s, is intended for typographical representations of signage and uppercase titles, and in other environments where users require the sharp s to be preserved in uppercase. Overall, such usage is uncommon. Thus, when using the default Unicode casing operations, capital sharp s will lowercase to small sharp s, but not vice versa: small sharp s uppercases to “SS”, as shown in Figure 5-16. A tailored casing operation is needed in circumstances requiring small sharp s to uppercase to capital sharp s.

[0] http://www.unicode.org/versions/Unicode7.0.0/ch05.pdf#G21180
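You can see the asymmetry directly in Python's default (Unicode-derived) case mappings:

```python
# Small sharp s uppercases to "SS", but capital sharp s lowercases back
# to the single character -- so the round trip is lossy.
upper = 'ß'.upper()               # 'SS'
lower = 'ẞ'.lower()               # 'ß'
round_trip = 'ß'.upper().lower()  # 'ss', not 'ß'
```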


> For instance, the German sharp s (ß) has an asymmetrical casemapping.

As they say "so don't do that".

This is about capslock, not lettercase in general. Only switch characters that change with capslock on the keyboard.

And even if the mapping is not perfect - so what? The worst that will happen is nothing.


Why would that fail horribly?


> imposing max lengths .. ends up hurting people

In that case you'll want to pre-hash your BCrypt-encoded passwords, otherwise you'll be imposing a silent 72 character limit.

Remember to encode the hashes, since BCrypt also silently truncates after NULL bytes.


> In that case you'll want to pre-hash your BCrypt-encoded passwords, otherwise you'll be imposing a silent 72 character limit.

Or just use PBKDF2 or scrypt, neither of which imposes an artificial length limitation on passwords.

Not that it really matters, since a 72-character password will have hundreds of bits of entropy as long as the alphabet is more than two characters; it's overkill.


I wouldn't expect a passphrase that long to be particularly entropy-dense - quite the opposite really. Some people do things like pick phrases from Hamlet and stuff some numbers on the end, so truncating them would seriously compromise their expected strength.

Or XKCD style passphrases. Using the Oxford 3000 dictionary gets you about 1 bit of entropy per character. If I have my password manager generate 128 bit passphrases using them for ease of use in the real world, plain bcrypt will quietly reduce their strength by about 17 orders of magnitude. If nothing else that's obnoxious.

And that's ignoring the soft 55 byte limit both the bcrypt spec and the original scrypt paper mention. If we're looking at less than one bit per byte that's starting to get into worryingly weak territory.

Either way, it's all solved by stuffing a (e.g.) base64 SHA384 in there. Bam, every input bit equally affects the output hash regardless of length and position, and the entire thing even becomes NULL byte safe.
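The pre-hash step is a few lines in Python - SHA-384 is a convenient choice because its 48-byte digest base64-encodes to exactly 64 ASCII characters, safely under bcrypt's 72-byte limit:

```python
import base64
import hashlib

# Fixed-length ASCII output, free of the NUL bytes a raw digest could
# contain, suitable to feed into bcrypt afterwards.
def prehash(password):
    digest = hashlib.sha384(password.encode('utf-8')).digest()
    return base64.b64encode(digest)
```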


The problem is that lots of people will pick "password" as a password. I think it's the responsibility of websites to prevent that from happening.


That's why you download a list of the 10k most common passwords and not let a user choose one from the list. Way better than some arbitrary character class rules that are easily worked around by appending "1A!" to the end of an awful password like GP mentioned.



The correct way is not to use passwords. Use X.509 client certificates, and let the user secure theirs whatever way makes sense to them (whether that's a password, a smartcard, both, or something else). Unfortunately the browser UX for them is terrible.


> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Yeah sure. I think I heard that quote before... just about a million times?

People expect regex to be an easy-to-use tool. Well it's not, and it's a foot gun if you don't take the time to learn it right. But no, people hack up some expressions, shoot themselves in the foot, and blame... the tool of course, not themselves.

Just learn it right, it's a great tool if you know how to make it work for you :)


Regexes in perl5 are what introduced me to test automation. Hard.


I bet there's a hashtable involved somewhere, and Chad's address just happens to hash to, like, 0x00000, and it turns out when that happens, there's a bug.

As a workaround, I bet you can use CHAD@... or chad+blah@...


A hashtable bug seems possible.

(Hangouts Dialer still does not see him if saved as CHAD@. It sees him when saved as chad+bla@ but it's annoying because then his email is wrong in my contact list as his email provider does not support + aliases.)


You could try chad()@. () is an empty e-mail address comment which, by the RFC, is supposed to be ignored during delivery. Not every mail server supports it, but it's worth a try until they fix that bug.


Comments in parentheses are a feature of MIME, and have nothing to do with delivery of e-mail: those comments are not valid in the envelope header as parsed by SMTP. (Put another way, there is no "the RFC": there are multiple RFCs used for different purposes and which have different rules.) MIME certainly has no relevance to your address book, and if your structured address book database is attempting to parse an e-mail address as if it were inside of either a MIME header or an SMTP envelope, you should absolutely complain: that should be considered a bug :/.

Think about this: one would hope that if you use characters in that field which would normally need to be escaped if used in a MIME message, and that e-mail address were to end up in a MIME message, that the e-mail client would get the unescaped e-mail address from the database and would then escape it correctly--and by "correctly", we mean a different way of escaping it for MIME vs. escaping it for SMTP--the same as we would expect its usage in an HTML page, an argument to a shell script, or a value in an SQL statement, to also be escaped for each specific purpose.


Huh, TIL. I've never actually used comments in an email address, as they're pretty damn silly. I must have misunderstood the stackoverflow post I read way back.

Thanks for the information!


Email address comment? What the hell were they thinking?


Think about pre-Outlook days: you might want some notes to remember who you were emailing.


I'm thinking about it. Still doesn't make sense to put it in the address.


Like plus aliases (chad+foo@), Dialer can find chad()@. However the parentheses cause at least 2 minor annoyances (desktop Gmail doesn't let me email such an address, and the Android contact editor won't let me edit the contact.)


Hmm. I guess it's lowercasified before being hashed.
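If that guess is right, here's a sketch of why the CHAD@ workaround would fail while chad+bla@ succeeds (hypothetical normalization, of course):

```python
def lookup_key(addr):
    # hypothetical: trim and lowercase the address before hashing,
    # so case and surrounding whitespace never distinguish two contacts
    return hash(addr.strip().lower())
```

CHAD@ normalizes to the same key as chad@, so it hits the same buggy slot; chad+bla@ is a genuinely different string and gets a different key.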


What about a leading/trailing space?


Nope :/


It's a pity there's no way to report bugs like this to Google.

The only way I've found of getting anything resolved is to forward issues to a friend inside the company, or hope that you can write a blog post which gets enough attention.

I get that filtering and testing millions of random bug reports from all corners of the Internet is hard - but it's a problem which Google desperately needs to solve if it wants to retain the trust of its users.


(Tedious disclaimer: not speaking for anybody else, my opinion only, etc. I'm an SRE at Google.)

> It's a pity there's no way to report bugs like this to Google.

This is a popular myth.

General instructions are here: https://www.google.com/tools/feedback/intl/en/

In this particular case, it's an android app, so what you do is tap on the hamburger menu, hit "help and feedback", then "send feedback".


Having reported many bugs this way - I don't think I've ever had a response, let alone seen anything fixed.

Looking at Android (OS) issues - https://code.google.com/p/android/issues/list?can=2&q=&sort=... - it's clear that the majority of bugs are ignored. Even when they're well described and affect multiple users / devices.


We do not routinely release information about what happens to bugs, so you should not expect a response. I've certainly seen bugs reported on these channels be fixed. I cannot release any statistics.

> Looking at Android (OS) issues - https://code.google.com/p/android/issues/list?can=2&q=&sort=.... - it's clear that the majority of bugs are ignored.

A quick glance at that page appears to disprove this claim. If you flip it to display "all issues", there are 206689 bugs in that tracker at the moment, of which 44583 are open. That tells you that 79% of all bugs filed have been closed - so, at least 79% of bugs were not ignored.

Note that this tracker is for the operating system only, and does not include any of the Google apps that the feedback system covers.


> We do not routinely release information about what happens to bugs, so you should not expect a response.

Which goes back to my original point about customer trust. If you know I've reported a bug, why would you deliberately not tell me that it has been fixed?

> so, at least 79% of bugs were not ignored

Well, take a look at some of the ones which have been closed - https://www.reddit.com/r/androiddev/comments/2on1fe/google_c... and https://news.ycombinator.com/item?id=8803118

I know a good many people who work in Google - they're all smart and dedicated. But there's something about the corporate culture which imposes a "don't listen to external feedback" mindset.

It's your OS and they're your apps - you can do what you like with them. But don't be surprised when users stop trusting you to listen to their concerns.


> If you know I've reported a bug, why would you deliberately not tell me that it has been fixed?

Sounds like a typical fallacy of end users looking at software: assuming that the developer is deliberately denying you a feature, rather than simply not having spent the engineering time to make it possible.

In this case, for instance, it could be that the pipeline to get from external feedback channels to Google's internal bug trackers is very one-way and it's hard to go back. Or there's a disconnect between when the fix ships and when the ticket is closed, and keeping track of the entire chain of data requires some work. Or there are no easy distinguishing features on bugs that came in externally to make them identifiable once fixed. Or there's no process yet for an automated response that says the bug is fixed (should it provide more detail?). Etc.


Hence, my first comment.

> I get that filtering and testing millions of random bug reports from all corners of the Internet is hard - but it's a problem which Google desperately needs to solve if it wants to retain the trust of its users.


> > We do not routinely release information about what happens to bugs, so you should not expect a response.

> Which goes back to my original point about customer trust. If you know I've reported a bug, why would you deliberately not tell me that it has been fixed?

Another example of http://danluu.com/wat/ (discussion: https://news.ycombinator.com/item?id=10811822).


And of those 79%, how many were closed without ever being fixed?


I've reported chrome bugs that were fixed in the next version. I once reported a bug in arc welder and got a response when it was fixed.


LibreOffice, RedHat, Debian, Canonical and Mozilla can do it. This is not a particularly hard problem to solve.


I wonder if this is related to i18n or country lookup. Chad is the only semi-common English-language name that's also a country name, that I can think of.


Jordan, Georgia. I feel like if this were the cause then the bug would be a lot more common.

My guess is the dialler hashes some parts of the contact to get a UUID, but for this contact it happens to be outside the range the dialler can look at - perhaps off-by-one, where the dialler looks for UUIDs of 1 and above and this happens to hash to 0.


My niece is called "Ireland".

Continents too: I met a girl called "Africa", and "Asia" is certainly used as a name.

However I don't ever expect to meet someone called "Democratic People's Republic of Korea"


There are also:

  India
  Georgia
  Jordan
I'd guess they are all as common or more common than Chad.


Somewhat similarly, I encountered a possible bug in Google Docs many years back. I was reorganizing my documents, and I temporarily changed one of the names to "delete". Poof -- I could no longer find it anywhere (or even search for words that I knew were in it). I forget how I got back to it (maybe via my browser history), but I changed the name a bit, and then the document was "found".

This could have simply been a race condition unrelated to the filename, but it's much more amusing to speculate that it was due to a hack introduced during development. I now regret not trying to reproduce it, but I was pretty frustrated after I found my document again. I did contact support, but didn't hear anything back.


This reminds me of my favorite name while testing: "McNulla". I have seen quite a few web forms with a regex to reject any NULL string, which would refuse this name because it contains "Null".
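A contrived Python sketch of the kind of careless check that rejects "McNulla" (hypothetical validator, not the actual forms' code):

```python
import re

def is_valid_name(name):
    # careless validation: reject any name containing "null",
    # case-insensitively, anywhere in the string
    return re.search(r'null', name, re.IGNORECASE) is None
```

The substring test can't tell a sentinel value from four letters that happen to occur inside a legitimate surname.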


I knew a family w/ the last name of "Null" from high school. I wonder, from time to time, if they have suboptimal experiences using the 'net.



I learned there was a place called Nan[1] when my JSON importer crashed on it, interpreting it as NaN (not a number).

[1]:https://en.wikipedia.org/wiki/Nan,_Thailand
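A minimal Python illustration of the trap (not the actual importer): a loader that eagerly coerces strings to float silently turns the town's name into NaN, because `float()` parses "Nan" case-insensitively:

```python
import math

def coerce(value):
    # eager coercion: try to turn every string into a number,
    # falling back to the original string on failure
    try:
        return float(value)   # float("Nan") parses as NaN, not an error
    except ValueError:
        return value
```

"Bangkok" survives the round trip; "Nan" comes out as a float that isn't even equal to itself.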


> I exported the contacts and looked at the raw Google CSV data. One of the 2 problematic contacts had a whitespace character at the end of its phone number. I removed it. Bingo, Dialer can now find it!

This is kind of horrifying. Google being tripped up by trailing whitespace?
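A defensive sketch of the normalization one would hope happens before numbers are stored or matched (hypothetical helper, not the Dialer's code):

```python
def normalize_phone(raw):
    # keep digits and '+'; drop whitespace, dashes, and other punctuation
    return "".join(ch for ch in raw if ch.isdigit() or ch == "+")
```

With a step like this, a trailing space in the CSV could never make two representations of the same number compare unequal.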


When I was setting up an account on Comcast's website, I was consistently getting a nondescript internal server error when submitting the form.

Took me quite a while and many failed attempts to find that Comcast will throw an error when your requested username contains "comcast".


Same for ConEd.

Moneygram will freeze payments with the word "moneygram" in the associated email. Which is great for those of us that use catch-all emails and use the email address to discern what to do with an email...


I've had a number of fun failure modes like that.

My current favorite two include when the change password form permitted longer passwords than the login page, and one where the change password form happily allowed special characters, but if there was e.g. a semicolon in the password, submitting it from the login page would throw a SQL error.
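For contrast, a sketch of how parameter binding sidesteps that class of bug (sqlite3 used purely for illustration): the semicolon and quote travel as data, never as SQL syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, pw TEXT)")

# the '?' placeholders bind values out-of-band, so special characters
# in the password are stored and compared verbatim
conn.execute("INSERT INTO users VALUES (?, ?)", ("alice", "pa;ss'word"))
row = conn.execute(
    "SELECT name FROM users WHERE pw = ?", ("pa;ss'word",)
).fetchone()
```

If the login page had done this, a semicolon in a password would just be a semicolon.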


I wonder if that is to stop people from using names like comcastSucks or worse.


It may not have been the rationale, but it sure must be the most common use-case.


It also avoids security problems with fraudsters creating official looking emails - "comcastsupport" and the like.


> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

jwz's wit was a lot of fun in the early 2000s but that quote is too often used wrong.

This quote is not about regexps, it's about using wrong tools for the job. Using it without context makes it sound dumb. "Well, what if regexp is exactly the right solution for that problem?".


A regexp is almost always the wrong solution. It's a way of representing a finite state machine that obscures the states, which are the only valuable part of the state machine abstraction (they're inherently incomprehensible otherwise). And most implementations these days have random extensions, meaning you have all the performance and safety issues of a turing-complete programming language - but a much worse UX. They may have made sense in the days of ed and the teletype, when a terse incomprehensible expression was better than a slightly longer readable one, but they don't now.
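To make the claim concrete, here is a contrived toy matcher equivalent to the regex `ab*c`, with the states spelled out; it's longer, but each state has a name you can reason about:

```python
def match_abc(s):
    # explicit states of the machine that r"ab*c" describes
    state = "start"
    for ch in s:
        if state == "start" and ch == "a":
            state = "saw_a"
        elif state == "saw_a" and ch == "b":
            state = "saw_a"          # the b* loop
        elif state == "saw_a" and ch == "c":
            state = "done"
        else:
            return False             # "done" accepts no further input
    return state == "done"
```

Whether this trade of terseness for named states is worth it is exactly the disagreement in the replies below.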


> A regexp is almost always the wrong solution.

That's quite a sweeping statement.

Sure, complex regexps can be hard to read but state machines are just hard to read for humans in general, whatever the form.

What do you recommend for parsing simple text entries, then?


If there are no existing libraries, parser combinators. More verbose but so much more readable, and they make it much easier to parse into an actual structure rather than a list of match groups.


> More verbose but so much more readable

First of all, "more readable" is extremely subjective.

Second, there are a lot of different parser combinators, all with very different syntaxes.

Finally, parser combinators are readable by people familiar with them and regexps are readable by people familiar with them. Regexps are also much more widespread and approachable. And very often, writing a parser combinator to parse a simple text entry is way overkill.

There are many good reasons why regexps are so popular.


> Finally, parser combinators are readable by people familiar with them and regexps are readable by people familiar with them. Regexps are also much more widespread and approachable

You don't have to be familiar with them to find something like:

    def emailAddress = userPart ~ "@" ~ hostnamePart ^^
      { case username ~ _ ~ hostname => EmailAddress(username, hostname) }
clearer than any regex. Named capture groups help a little bit but I've never seen people using them (and they don't have a consistent syntax across regex implementations either).
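For what it's worth, Python's dialect of named groups looks like this (one syntax among several, as noted):

```python
import re

# named groups attach a label to each captured piece
EMAIL = re.compile(r'(?P<user>[^@\s]+)@(?P<host>[^@\s]+)')
m = EMAIL.fullmatch("chad@example.com")
```

Whether `(?P<user>...)` reads better than a combinator's named binding is, of course, the question at hand.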

> And very often, writing a parser combinator to parse a simple text entry is way overkill.

Disagree. They can be very much a one-liner.


I think you'd be hard-pressed to find someone who doesn't know parser combinators who would tell you the snippet above is readable.

First of all, what language is this? Well, I know, but you seem to forget that the parser combinator syntax varies per language. What's the equivalent syntax for Java? Or for Python? What about a language that doesn't have a parser combinator library? Or one that has several ones, all slightly different?

I think you're falling prey to the specialist fallacy: you're obviously very comfortable with parser combinators but you've forgotten how long it took you to get there and you now see them as an ultimate solution to all problems without realizing their downsides.


> First of all, what language is this? Well, I know, but you seem to forget that the parser combinator syntax varies per language. What's the equivalent syntax for Java? Or for Python? What about a language that doesn't have a parser combinator library? Or one that has several ones, all slightly different?

I very deliberately didn't mention the language or the library, because I think the snippet is readable without knowing that. Minor syntax differences between libraries matter when writing, but not when reading, and reading is more important. (And it's not like there aren't several slightly different implementations of regexes)

I'm not that committed to parser combinators - I'd be happy to consider alternatives - but anything where you a) name the things you're capturing b) can easily combine several small parsers to make a bigger parser will have a huge readability, testability and maintainability advantage over regexes.


> anything where you a) name the things you're capturing b) can easily combine several small parsers to make a bigger parser will have a huge readability, testability and maintainability advantage over regexes

I think many regex libraries have named captures and in most languages you can either concatenate regexes or build regexes from strings which in turn can be built from concatenation. Furthermore, I thought building regexes by concatenation of smaller pieces was a commonly recommended technique for improving readability (I have no data on whether the recommendation is commonly followed or not, of course).
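A quick sketch of that concatenation technique in Python's dialect (the piece names are illustrative):

```python
import re

# small named pieces, combined by plain string concatenation
USER = r'(?P<user>[A-Za-z0-9._%+-]+)'
HOST = r'(?P<host>[A-Za-z0-9.-]+)'
EMAIL_RE = re.compile(USER + "@" + HOST)

m = EMAIL_RE.fullmatch("chad+bla@example.com")
```

Each fragment can be documented and reused, even if the composition is less flexible than true combinators.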


I've never seen either of them in real-world code. Concatenation is something but I think it's a lot less flexible - if you have a parser for something and want to make a parser for a comma-separated list of that thing, with parser combinators that's one call, whereas with a regex I don't think the string-manipulation is straightforward, and would the named capture groups still work if they were now being hit multiple times?


You're definitely right that concatenation is less composable, and I agree that parser combinators are more powerful. I was just pointing out there are some things you can do in regexes to make things more readable.

And about captures inside repetition: I don't see any reason they couldn't capture a list of strings instead of a string in dynamically typed languages but in all regex libraries I'm aware of they do something that seems useless to me: they only capture the last occurrence! (They probably just overwrite the capture on each repetition.)
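A Python illustration of both points: a group inside a repetition keeps only its last match, while `findall` is the usual workaround for recovering every occurrence:

```python
import re

# group 1 sits inside a repetition; only its final repetition survives
m = re.fullmatch(r'(?:(\w+),)*(\w+)', "a,b,c")
last = m.group(1)

# findall recovers all the pieces instead
items = re.findall(r'\w+', "a,b,c")
```

So the "capture a list" behavior one might hope for simply isn't there; each repetition overwrites the previous capture.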



Ohh, I didn't get it at first either. Thought of the country.

Best image to explain a hanging "chad": http://images1.fanpop.com/images/photos/1400000/Halloween-ho...


Do people with other Android versions or phone makes face the same issue while using this email address?


Many sites don't support the plus in email addresses ("+" = subaddress/alias, supported e.g. by Gmail). Not so funny if the registration process works but the login or password reset features are broken.

Example: a site let me register and log in with the plus. But resetting the password was hard: I had to escape the plus to get it working.
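One plausible culprit (an assumption, not confirmed by the site) is URL decoding: a raw "+" in a query string decodes to a space, so the address has to be percent-encoded as %2B. A Python sketch:

```python
from urllib.parse import quote, parse_qs

email = "chad+foo@example.com"
# encode with no safe characters so '+' becomes %2B, not a literal '+'
qs = "email=" + quote(email, safe="")
decoded = parse_qs(qs)["email"][0]
```

A reset link built without that encoding would hand the server "chad foo@example.com" and silently fail to match the account.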


Is that "chad@" or "chаd@" (homoglyphs)?


No homoglyphs.


So, when computing power increases, we just add useless parsing at every level of software, decreasing performance and causing bugs like these.


I also see this with one of my brothers' names which is Olly <surname>.


A potential workaround...quote the local part:

"chad"@example.com

It seems to be supported by Exchange, Gmail, and a few other MTAs I tested, and gets routed to the right place.



I wonder if this is the only case where the name/first name of the contact is exactly the same as the recipient's address.


Regex bugs are bad; I have been bitten by one before. It's pure technical debt.


The Lando system?


Android is a wasteland.



