Numerous applications rely on email addresses to recognize users. Since email addresses are unique, they serve as a reliable way of verifying the authenticity of a user. However, sending emails to incorrect addresses can result in errors. This is why the format of an email address is typically checked after it is entered. While some developers utilize RegEx for this purpose, is it truly the best practice?
A Tour Around The World Wide Web
Obviously, a lot of programmers have reached out to the community for help with email address validation, and the web is filled with potential solutions.
The RegEx Approach
Many of these responses rely on RegEx. An old article from Wired lists four different expressions for this purpose.
Dirt-simple approach:
.+\@.+\..+
Slightly more strict (but still simple) approach:
[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}
Specify all the domain extensions approach:
([a-z0-9][-a-z0-9_\+\.]*[a-z0-9])@([a-z0-9][-a-z0-9\.]*[a-z0-9]\.(arpa|root|aero|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cx|cy|cz|de|dj|dk|dm|do|dz|ec|ee|eg|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|um|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)|([0-9]{1,3}\.{3}[0-9]{1,3}))
The first two expressions are quite simple but lack strictness. The third option is stricter, though it requires updates every time a new domain extension is released. Since the article was written in 2008, many new domain extensions have been introduced. Additionally, the second expression needs to be updated, as domain extensions can now be longer than four characters.
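The length problem of the second expression can be reproduced with a few lines. This is only a sketch: the class name TldLengthDemo is made up for illustration, and the anchors (^ and $) and the case-insensitive option are my additions so that the whole input has to match.

```csharp
using System.Text.RegularExpressions;

public static class TldLengthDemo
{
    // The "slightly more strict" expression from above, anchored and
    // matched case-insensitively (both are assumptions of this sketch).
    private static readonly Regex Pattern = new Regex(
        @"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // True when the whole input matches the expression.
    public static bool Matches(string input) => Pattern.IsMatch(input);
}
```

TldLengthDemo.Matches("john@example.com") is accepted, while TldLengthDemo.Matches("john@example.photography") is rejected, because the extension has more than four letters.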
The article also mentions an expression from a Perl module with almost 6500 characters. Just for the fun of it, I tried to use the RegEx Source Generator in .NET to generate C# code for the expression:
[GeneratedRegex("(? ... )?;\\s*)", RegexOptions.Compiled)]
private static partial Regex IsEmailAddressRegex();
The final output was around 30000 lines of C# code, which included roughly 1300 lines of XML docs that described the process. Generation and execution were surprisingly fast. However, I would advise against using this, as the ServiceHub.RoslynCodeAnalysisService.exe task seems to have a memory leak, causing memory usage to grow by 100 MB per second with around 10% CPU usage, and it doesn’t seem to stop.
These regular expressions are quite old, and it’s possible that more effective alternatives have been created since then. Nevertheless, we can take away that they require regular maintenance. Just because a particular expression was effective in the past doesn’t mean it’s still suitable for use today. You also don’t want to risk a ReDoS (regular expression denial of service) attack, in which a crafted input makes a backtracking expression consume excessive CPU time.
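If a regular expression has to be used anyway, one common mitigation in .NET is to give the match a time limit. The following is a minimal sketch, not a complete defense: the class and method names are hypothetical, and treating a timeout as "not valid" is a design assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GuardedRegexCheck
{
    // Hypothetical helper: runs a pattern with an upper bound on matching time.
    public static bool IsMatchWithTimeout(string input, string pattern, TimeSpan timeout)
    {
        try
        {
            return Regex.IsMatch(input, pattern, RegexOptions.None, timeout);
        }
        catch (RegexMatchTimeoutException)
        {
            // Treat a timeout as "not valid" instead of letting the caller hang.
            return false;
        }
    }
}
```

A call could look like GuardedRegexCheck.IsMatchWithTimeout(input, @".+@.+\..+", TimeSpan.FromMilliseconds(100)). Since .NET 7, RegexOptions.NonBacktracking is another option, although not every pattern is supported in that mode.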
Using the MailAddress Class
Who knows better whether an input is an email address than the System.Net.Mail.MailAddress class itself? This approach is mentioned in several cases. The class has a TryCreate method, which parses a given input and returns whether the input is an email address or not. The method is even able to split a display name from an email address (what you get when an address is copied in Outlook). The input "John Doe" <john@doe.com> would result in a MailAddress object with DisplayName “John Doe” and Address “john@doe.com”.
Unfortunately, this extended parsing logic leads to an unexpected result. While john@doe.com is obviously an email address, "John Doe" <john@doe.com>, the actual input, isn’t. MailAddress.TryCreate would return true anyway. By comparing the input with the Address property, we can solve this problem:
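using System.Net.Mail;
public static bool IsEmailAddress(string? input)
{
    // TryCreate also accepts inputs with a display name, so only accept
    // the input when it is identical to the parsed address.
    return MailAddress.TryCreate(input, out MailAddress? mailAddress) &&
           mailAddress.Address == input;
}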
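To see what TryCreate extracts from an input, a small helper can return both parsed parts. This is only an illustrative sketch; MailAddressDemo and Parse are hypothetical names.

```csharp
using System.Net.Mail;

public static class MailAddressDemo
{
    // Returns the parsed parts, or null when the input cannot be parsed at all.
    public static (string DisplayName, string Address)? Parse(string? input)
    {
        if (MailAddress.TryCreate(input, out MailAddress? parsed))
        {
            return (parsed.DisplayName, parsed.Address);
        }

        return null;
    }
}
```

For the input "John Doe" <john@doe.com>, the returned Address is john@doe.com, which differs from the input. For the plain input john@doe.com, the Address is identical to the input, and that is the only case the comparison lets through.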
Using the EmailAddressAttribute Class
A more creative solution is using System.ComponentModel.DataAnnotations.EmailAddressAttribute. This is the Attribute that is used to validate a property value in the model binding process.
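using System.ComponentModel.DataAnnotations;
public static bool IsEmailAddress(string? input)
{
    // The attribute itself treats null as valid, so reject it explicitly.
    if (input == null)
    {
        return false;
    }
    EmailAddressAttribute attribute = new EmailAddressAttribute();
    return attribute.IsValid(input);
}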
Although it works, an Attribute is not intended to be used that way.
This class has a long history. Initially, in the .NET Framework, it used RegEx for input validation. However, this changed in 2015 due to the vulnerability advisory CVE-2015-2526, which pointed out that the expression could trigger denial-of-service attacks, as I previously mentioned. The validation was then changed to a simpler logic: the input must contain exactly one @, and it cannot be at the start or end. This basic logic is still in use today, but got some performance improvements.
Not everyone is happy with this simplified solution, since it allows email addresses like x@y or x!x@x.x, but this brings us to the next chapter.
What is a Valid Email Address?
The @ symbol to separate the local part from the host was introduced as early as 1971 to send messages between two computers on the Arpanet. The format of email addresses as we know it today was, based on other RFCs, specified in Section 3.4 of RFC 2822 in 2001 and improved with RFC 5322 in 2008. Since then, an email address has been made up of a local part, the @ symbol, and a domain.
But an email address can be more complicated than expected.
Valid Domain Parts
The domain part may be a domain like gmail.com or outlook.com. Surprisingly, it can also be an IP address. IP addresses must be surrounded by square brackets. Therefore, the following email addresses are possible:
info@[123.123.123.123]
info@[IPv6:2001:0db8:85a3:0000:0000:8a2e:0370:7334]
I tried to send an email to my outlook.com address but used only the IPv4 address as the domain part. The Office Outlook app didn’t even let me send the email because the format of the email address is not supported. The Outlook online app in the browser, on the other hand, was ok with the IP address. However, the sending provider (also outlook.com) was not able to process the email. I didn’t try it with other providers, but I expect to get similar results.
Knowing that the domain could also be an IP address makes the validation more complicated. Of course, there are good reasons to disallow email addresses with IPs in the domain part, since most mail providers probably won’t support IP domains. But at least the domain part is case-insensitive, unlike the local part.
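If you decide to disallow (or at least detect) IP domains, the check could look roughly like the following sketch. DomainChecks and HasIpLiteralDomain are hypothetical names, and the parsing is delegated to IPAddress.TryParse, which is more lenient than the mail specification (for example with shortened IPv4 forms).

```csharp
using System;
using System.Net;
using System.Net.Sockets;

public static class DomainChecks
{
    // Hypothetical helper: true when the part after the last '@' is a
    // bracketed IP literal such as [123.123.123.123] or [IPv6:...].
    public static bool HasIpLiteralDomain(string address)
    {
        int at = address.LastIndexOf('@');
        if (at < 0 || at == address.Length - 1)
        {
            return false;
        }

        string domain = address.Substring(at + 1);
        if (domain.Length < 3 || domain[0] != '[' || domain[^1] != ']')
        {
            return false;
        }

        // Strip the surrounding square brackets.
        string literal = domain.Substring(1, domain.Length - 2);
        const string ipv6Prefix = "IPv6:";
        if (literal.StartsWith(ipv6Prefix, StringComparison.OrdinalIgnoreCase))
        {
            return IPAddress.TryParse(literal.Substring(ipv6Prefix.Length), out IPAddress? v6)
                && v6.AddressFamily == AddressFamily.InterNetworkV6;
        }

        return IPAddress.TryParse(literal, out IPAddress? v4)
            && v4.AddressFamily == AddressFamily.InterNetwork;
    }
}
```

Both example addresses above are detected as IP domains, while an address like info@gmail.com is not.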
Valid Local Parts
The local part of the email address is everything before the last @ symbol. And yes, it can be case-sensitive. The local part can contain letters, digits and a set of special characters. The local part can also contain spaces, horizontal tabs, brackets and even @ symbols. However, these symbols must be in quotes. These are possible email addresses:
john/doe@example.com
" "@example.org
c#@example.org
"b@man & robin! ;-)"@example.com
Only ASCII characters are permitted for email addresses, but some providers also support SMTPUTF8 which allows UTF-8 characters (including emojis).
Just because the specification allows email addresses to contain spaces and special characters doesn’t mean that mail providers support them. Handling the local part is entirely up to the provider. Most providers obviously don’t implement the local part as case-sensitive. Gmail even ignores dots, so the following email addresses all go to the same mailbox:
john.doe@gmail.com
JohnDoe@gmail.com
J.o.h.n.D.o.e@gmail.com
Gmail and other providers also support sub-addressing, which allows you to add a tag to the email address with a + symbol: john.doe+amazon@gmail.com. This can be used to identify where you entered an email address. The specification also allows comments in parentheses in the local part and the domain: john.doe(amazon)@(this)example.com. This is again something that most providers don’t support, and the Outlook app doesn’t even allow us to send emails to domains with comments.
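Extracting the tag of a sub-address could be sketched as follows. The names SubAddressing and GetTag are hypothetical, and the code assumes that the provider uses the + symbol as separator and that the local part is not quoted.

```csharp
public static class SubAddressing
{
    // Hypothetical helper: returns the tag after the first '+' in the
    // local part, or null when the address carries no tag.
    public static string? GetTag(string address)
    {
        int at = address.LastIndexOf('@');
        if (at <= 0)
        {
            return null;
        }

        string localPart = address.Substring(0, at);
        int plus = localPart.IndexOf('+');
        if (plus < 0 || plus == localPart.Length - 1)
        {
            return null;
        }

        return localPart.Substring(plus + 1);
    }
}
```

SubAddressing.GetTag("john.doe+amazon@gmail.com") yields the tag amazon, while an address without a + symbol yields null.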
The local part of email addresses can be rather complex, with each provider having its own unique set of rules. It’s unlikely you’ll encounter a genuine email address that includes spaces, tabs, or quotes, as most people would likely abandon such an address because of the problems they encounter. However, this article focuses on the validation of email addresses, and indeed, unusual email addresses can exist. This Wikipedia article describes more rules and has numerous examples as well.
Length
The length of an email address is easy to validate, and the check is required if the email address is stored in a fixed-size SQL database column. There are several different specifications for the length. A StackOverflow answer references RFC 3696, which specifies a maximum length of 254 characters. Another possible maximum length is 320 characters, according to RFC 5321. The length of 320 is the sum of 64 for the local part, 1 for the @ symbol and 255 for the domain.
Checking the length is recommended, with a limit of either 254 or 320 characters. As a reference, the local part of outlook.com addresses is limited to 65 characters (total size 77), while gmail.com only allows 30 characters (total size 40).
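The limits mentioned above can be combined into one check. This is a minimal sketch with hypothetical names (LengthChecks, HasPlausibleLength). It uses the stricter total of 254 and counts UTF-16 characters, which equals the number of bytes only for ASCII addresses.

```csharp
public static class LengthChecks
{
    public const int MaxTotalLength = 254;     // alternatively 320
    public const int MaxLocalPartLength = 64;
    public const int MaxDomainLength = 255;

    // Hypothetical helper: checks the total length and the length of both parts.
    public static bool HasPlausibleLength(string address)
    {
        if (address.Length > MaxTotalLength)
        {
            return false;
        }

        int at = address.LastIndexOf('@');
        if (at < 0)
        {
            return false;
        }

        int localPartLength = at;
        int domainLength = address.Length - at - 1;
        return localPartLength <= MaxLocalPartLength
            && domainLength <= MaxDomainLength;
    }
}
```

A local part of 64 characters followed by @example.com passes the check, while the same address with 65 characters in the local part does not.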
Applying the New Insights
Can we make sure that an email address is valid? No, but we can make good guesses. Depending on your use case, you can check for the widely used rules (no IP domain and limited character set for the local part) or you can just check for the existence of an @ symbol. To really validate an email address, you must be able to send an email to that address and the expected receiver must be able to receive the email. Even a simple email address could have a typo and therefore reach the wrong receiver. If you can send it and the correct receiver can receive it, the format shouldn’t matter.
With Postel’s Law in mind (“be conservative in what you do, be liberal in what you accept from others”), I would use something simple like the check in EmailAddressAttribute, but as a single method in a static class:
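using System;
using System.Diagnostics.CodeAnalysis;
public static bool IsEmailAddress([NotNullWhen(true)] string? input)
{
    if (input == null)
    {
        return false;
    }
    // Shortest possible form is "x@y"; 254 is the maximum length from RFC 3696.
    int inputLength = input.Length;
    if (inputLength < 3 || inputLength > 254)
    {
        return false;
    }
    // Line breaks are never allowed.
    ReadOnlySpan<char> inputAsSpan = input.AsSpan();
    if (inputAsSpan.ContainsAny('\r', '\n'))
    {
        return false;
    }
    // The first @ symbol must be neither the first nor the last character.
    int indexOfAtSymbol = inputAsSpan.IndexOf('@');
    return (indexOfAtSymbol > 0 && indexOfAtSymbol < inputLength - 1);
}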
This method can be easily implemented in any programming language, runs efficiently, and is immune to RegEx attacks. Yes, it is trivial to enter an invalid email address, but it also won’t block an unexpected, valid address.
When an email address is stored, it is important not to alter its case to either lower or upper, as this may affect the intended recipient. However, when searching or comparing email addresses, make sure to handle them in a case-insensitive manner.
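For the comparison, an ordinal case-insensitive check is usually sufficient. The following sketch uses hypothetical names (EmailComparison, AreSameAddress) and leaves the stored value untouched.

```csharp
using System;

public static class EmailComparison
{
    // Hypothetical helper: compares two addresses without changing their case.
    public static bool AreSameAddress(string? left, string? right)
    {
        return string.Equals(left, right, StringComparison.OrdinalIgnoreCase);
    }
}
```

With this, John.Doe@Example.com and john.doe@example.com are treated as the same address, while the value in the database keeps the casing the user entered. Provider-specific rules, such as Gmail ignoring dots, are deliberately not covered by this sketch.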
Conclusion
In this article, I discussed the complexity of email addresses. Clearly, validating them can be difficult. The final validation method is not very sophisticated, but it accomplishes its goal. Depending on the use case, more checks can be implemented. However, users still can make typos or enter non-existent email addresses. If an email can be sent and received, the specific format shouldn’t be a concern.