Data Sanitization and Validation With WordPress
Proper security is critical to keeping your site or that of your theme or plug-in users safe. Part of that means appropriate data validation and sanitization. In this article we are going to look at why this is important, what needs to be done, and what functions WordPress provides to help.
Since there seem to be various interpretations of what the terms ‘validation’, ‘escaping’ and ‘sanitization’ mean, I’ll first clarify what I mean by them in this article:
- Validation – These are the checks that are run to ensure the data you have is what it should be. For instance, that an e-mail looks like an e-mail address, that a date is a date and that a number is (or is cast as) an integer
- Sanitization / Escaping – These are the filters that are applied to data to make it ‘safe’ in a specific context. For instance, to display HTML code in a text area it would be necessary to replace all the HTML tags by their entity equivalents
Why Is Sanitization Important?
When data is included in some context (say, in an HTML document), it could be misinterpreted as code for that environment (for example, HTML code). If that data contains malicious code, then using it without sanitizing it means that code will be executed. The code doesn’t even have to be malicious to cause undesired effects. The job of sanitization is to make sure that any code in the data isn’t interpreted as code – otherwise you may end up like Bobby Tables’ school…
A seemingly innocuous example might be pre-filling a search field with the currently queried term, printing the unescaped value of the s query variable straight into the field’s value attribute. Suppose someone then visits a URL such as http://yoursite.com?s="/> followed by a script tag. The search term ‘jumps’ out of the value attribute, and the following part of the data is interpreted as code and executed. To prevent this, WordPress provides get_search_query, which returns the sanitized search query. Although this is a ‘harmless’ example, the injected script could be far more malicious, and even at best the form would simply ‘break’ whenever a search term contains double quotes.
How this malicious (or otherwise) code may have found its way onto your site is not the concern here – but rather it is to prevent it from executing. Nor do we make assumptions about the nature of this unwanted code, or its intent – it could have simply been an error on the user’s part. This brings me to rule No.1…
Rule No. 1: Trust Nobody
It’s a common maxim that is used with regard to data sanitization, and it’s a good one. The idea is that you should not assume that any data entered by the user is safe. Nor should you assume that the data you’ve retrieved from the database is safe – even if you had made it ‘safe’ prior to inserting it there. In fact, whether data can be considered ‘safe’ makes no sense without context. Sometimes the same data may be used in multiple contexts on the same page. Titles, for instance, can safely contain single or double quotes when inside header tags – but will cause problems if used (unescaped) inside a title attribute of a link tag. So it is rather pointless to make data ‘safe’ when adding it to the database, since it is often impossible to make data safe for all contexts simultaneously. (Of course it needs to be made safe to add to the database – but we’ll come to that later).
Even if you only intend to use that data in one specific context, say a form, it is still pointless to sanitize the data when writing to the database because, as per Rule No. 1, you cannot trust that it is still safe when you take it out again.
Rule No. 2: Validate on Input, Escape on Output
This is the procedural maxim that sets out when you should validate data, and when you should sanitize it. Simply put – validate your data (check it’s what it should be – and that it’s ‘valid’) as soon as you receive it from the user. When you come to use this data, for example when you output it, you need to escape (or sanitize) it. What form this sanitization takes depends entirely on the context you are using it in.
The best advice is to perform this ‘late’: escape your data immediately before you use or display it. This way you can be confident that your data has been properly sanitized and you don’t need to remember if the data has been previously checked.
Rule No. 3: Trust WordPress
You might be thinking “Ok, validate before writing to database and sanitize when using it. But don’t I need to make sure the data is safe to write to the database?”. In general, yes. When adding data to a database, or simply using an input to interact with a database, you would need to escape the data in case it contained any SQL commands. But this brings me to Rule No. 3, one which flies in the face of Rule No. 1: Trust WordPress.
In a previous article, I took user input (sent from a search form via AJAX) and used it directly with get_posts() to return posts that matched that search query:
$posts = get_posts( array( 's' => $_REQUEST['term'] ) );
An observant reader noticed that I hadn’t performed any sanitization – and they were right. But I didn’t need to. When you use high-level functions such as get_posts(), you don’t need to worry about sanitizing the data – because the database queries are all properly escaped by WordPress’ internals. It’s a different matter entirely if you are using a direct SQL query – but we’ll look at this in a later section. Similarly, functions like the_content() etc. perform their own sanitization (for the appropriate context).
When you receive data entered by a user it’s important to validate it. (The settings API, covered in this series, allows you to specify a callback function to do exactly this). Invalid data is either auto-corrected, or the process is aborted and the user is returned to the form to try again (hopefully with an appropriate error message). The concern here is not safety but rather validity – if you’re doing it right, WordPress will take care of safely adding the data to the database. What ‘valid’ means is up to you – it could mean a valid email address, a positive integer, text of a limited length, or one of an array of specified options. However you aim to determine validity, WordPress offers a lot of functions that can help.
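As a rough sketch of what such a validation step might look like, the following plain PHP checks the three kinds of ‘valid’ mentioned above: a positive integer, text of a limited length, and one of an array of specified options. The function name my_validate_settings and the field names per_page, order and label are hypothetical, and nothing here is WordPress-specific.

```php
<?php
/**
 * Hypothetical validation callback: checks three made-up fields and
 * returns the accepted values together with a list of error messages.
 */
function my_validate_settings( array $input ) {
    $clean  = array();
    $errors = array();

    // A positive integer: reject anything that is not a whole number >= 1.
    $per_page = filter_var(
        isset( $input['per_page'] ) ? $input['per_page'] : null,
        FILTER_VALIDATE_INT,
        array( 'options' => array( 'min_range' => 1 ) )
    );
    if ( false === $per_page ) {
        $errors[] = 'per_page must be a positive integer.';
    } else {
        $clean['per_page'] = $per_page;
    }

    // One of a fixed set of options: compare strictly against a whitelist.
    $allowed_orders = array( 'asc', 'desc' );
    if ( isset( $input['order'] ) && in_array( $input['order'], $allowed_orders, true ) ) {
        $clean['order'] = $input['order'];
    } else {
        $errors[] = 'order must be "asc" or "desc".';
    }

    // Text of a limited length (measured in bytes here, to keep the sketch simple).
    $label = ( isset( $input['label'] ) && is_string( $input['label'] ) ) ? trim( $input['label'] ) : '';
    if ( '' === $label || strlen( $label ) > 40 ) {
        $errors[] = 'label must be between 1 and 40 bytes long.';
    } else {
        $clean['label'] = $label;
    }

    return array( 'clean' => $clean, 'errors' => $errors );
}

// Example: 'order' is not in the whitelist, so it is rejected and reported.
$result = my_validate_settings(
    array( 'per_page' => '10', 'order' => 'sideways', 'label' => ' Latest books ' )
);
```

Here only the values that passed end up in $result['clean'], while $result['errors'] carries the messages you would show when returning the user to the form. Note that this is validation only: the accepted values still need escaping for whatever context they are later printed in.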
When expecting numeric data, it’s possible to check whether the data ‘is some form of number’, for instance with is_int or is_float. Usually, it’s sufficient to simply cast the data as numeric with intval or floatval.
If you need to ensure the number is padded with leading zeros, WordPress provides the function zeroise(), which takes the following parameters:
- Number – the number to pad
- Threshold – the number of digits the number will be padded to
echo zeroise( 70, 4 ); // Prints 0070
To check the validity of e-mails, WordPress has the is_email() function. This function uses simple checks to validate the address. For instance, it checks that it contains the ‘@’ symbol, that it’s longer than 3 characters, that the domain contains only alpha-numerics and hyphens, and so forth. Obviously, it doesn’t check that the e-mail address actually exists. If the e-mail address passes the checks it is returned, otherwise ‘false’ is returned.
$email = is_email( 'someone@e^ample.com' ); // $email is set to false.
$email = is_email( 'firstname.lastname@example.org' ); // $email is set to 'firstname.lastname@example.org'.
Often you may wish to allow only some HTML tags in your data – for instance in comments posted on your site. WordPress provides a family of functions of the form wp_kses_* (KSES Strips Evil Scripts). These functions remove (some subset of) HTML tags, and can be used to ensure that links in the data use only specified protocols. For example, the wp_kses() function accepts three arguments:
- content – (string) Content to filter through kses
- allowed_html – An array where each key is an allowed HTML element and the value is an array of allowed attributes for that element
- allowed_protocols – Optional. Allowed protocols in links (for example http or mailto)
$content = '<em>Click</em> <a href="http://wp.tutsplus.com" title="wptuts+">here</a> to visit <strong>wptuts+</strong>';
echo wp_kses( $content, array( 'strong' => array(), 'a' => array( 'href' => array() ) ) );
// Prints the HTML with the em tags and the title attribute stripped:
// Click <a href="http://wp.tutsplus.com">here</a> to visit <strong>wptuts+</strong>
Of course, specifying every allowed tag and every allowed attribute can be a laborious task. So WordPress provides other functions that allow you to use wp_kses with pre-set allowed tags and protocols, namely the ones used for validating posts and comments: wp_kses_post() for post content and wp_kses_data() for comments.
The above functions are helpful in ensuring that HTML received from the user only contains whitelisted elements. Once we’ve done that, we would also like to ensure that each tag is balanced, that is, that every opening tag has its corresponding closing tag. For this we can use balanceTags(). This function accepts two arguments:
- content – Content to filter and balance tags of
- force balance – True or false, whether to force the balancing of tags
// Content with a missing closing tag
$content = 'Click here to visit <strong>wptuts+';
echo balanceTags( $content, true ); // Prints the HTML: Click here to visit <strong>wptuts+</strong>
If you want to create a file in one of your website’s directories, you will want to ensure the filename is both valid and legal. You will also want to ensure that the filename is unique for that directory. For this WordPress provides:
- sanitize_file_name( $filename ) – sanitizes (or validates) the filename by removing characters that are illegal in filenames on certain operating systems or that would require escaping at the command line. It replaces spaces with dashes and consecutive dashes with a single dash, and removes periods, dashes and underscores from the beginning and end of the filename.
- wp_unique_filename( $dir, $filename ) – returns a unique (for directory $dir), sanitized filename (it uses sanitize_file_name).
Data From Text Fields
When receiving data entered into a text field, you’ll probably want to strip out extra white space, tabs and line breaks, as well as any tags. For this WordPress provides sanitize_text_field.
WordPress also provides sanitize_key. This is a very generic (and occasionally useful) function. It simply ensures the returned value contains only lower-case alpha-numerics, dashes, and underscores.
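To make the behaviour of sanitize_key concrete, here is a small sketch. So that it can be tried outside WordPress it defines a stand-in when the real function is missing; the stand-in is an assumption that follows the description above (lower-case, then drop everything except alpha-numerics, dashes and underscores), and the tab names are hypothetical.

```php
<?php
// Stand-in so the sketch runs outside WordPress. Inside WordPress the real
// sanitize_key() already exists and this definition is skipped.
if ( ! function_exists( 'sanitize_key' ) ) {
    function sanitize_key( $key ) {
        // Lower-case, then remove anything that is not a-z, 0-9, "_" or "-".
        return preg_replace( '/[^a-z0-9_\-]/', '', strtolower( (string) $key ) );
    }
}

// Hypothetical use: a "tab" value that would normally arrive in the query string.
$known_tabs = array( 'general', 'display', 'import-export' );

$requested = sanitize_key( 'Import-Export' );   // 'import-export'
$tab       = in_array( $requested, $known_tabs, true ) ? $requested : 'general';

// Anything unexpected is reduced to safe characters and then fails the whitelist.
$odd     = sanitize_key( 'My Plugin_Option-1!' ); // 'myplugin_option-1'
$odd_tab = in_array( $odd, $known_tabs, true ) ? $odd : 'general';
```

Pairing sanitize_key with a whitelist check, as above, is one way to turn a generic clean-up into actual validation: the key is only accepted if it is one you expect.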
Whereas validation is concerned with making sure data is valid, data sanitization is about making it safe. While some of the validation functions referred to above might be useful in making sure data is safe, in general they are not sufficient. Even ‘valid’ data might be unsafe in certain contexts.
Rule No. 4: Making Data Safe Is About Context
Simply put, you cannot ask “How do I make this data safe?”. Instead you should ask, “How do I make this data safe for use in X?”.
To illustrate this point, suppose you have a widget with a textarea where you intend to allow the user to enter some HTML. Suppose they then enter HTML that itself contains a textarea element, closing </textarea> tag included.
This is perfectly valid, and safe, HTML. However, when you click save, you find that the text has jumped out of the textarea: the HTML code is not safe as a value for the textarea.
What is safe to use in one context, is not necessarily safe in another. Whenever you use or display data you must keep in mind what forms of sanitization need to be done in order to make using that data safe. This is why WordPress often provides several functions for the same content, for instance:
- the_title – for using the title in standard HTML (inside header tags, for example)
- the_title_attribute – for using the title as an attribute value (usually the title attribute in link tags)
- the_title_rss – for using the title in RSS feeds
These all perform the necessary sanitization for a particular context – and if you’re using them you should be sure to use the correct one. Sometimes though, we’ll need to perform our own sanitization – often because we have custom input beyond the standard post title, permalink, content etc. that WordPress handles for us.
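When printing variables to the page we need to be mindful of how the browser will interpret them. Consider printing a variable $title, unescaped, inside a header tag, where $title contains a script tag. Rather than displaying the HTML as text, the browser will interpret it as HTML and run the enclosed script.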
This form of injection (as also demonstrated in the search form example) is called cross-site scripting, and this benign example belies its severity. Injected script can essentially control the browser and ‘act on behalf’ of the user or steal the user’s cookies. This becomes an even more serious issue if the user is logged in. To prevent variables printed inside HTML from being interpreted as HTML, WordPress provides the well-known esc_html function. In this example, we would print esc_html( $title ) instead of $title.
Now consider printing a variable $value into the value attribute of an input tag. If $value contains double quotes then, unescaped, it can jump out of the value attribute and inject script, for example by using the onfocus attribute. To escape unsafe characters (such as single and double quotes in this case), WordPress provides the function esc_attr. Like esc_html, it replaces ‘unsafe’ characters by their entity equivalents. In fact, at the time of writing, these functions are identical – but you should still use the one that is appropriate for the context.
For this example we should print esc_attr( $value ) rather than $value.
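The two cases above, a title inside a header tag and a value inside an input’s value attribute, can be sketched as follows. The variable contents and the field name my_input are made up for illustration. The function_exists guards only supply rough stand-ins, built on PHP’s htmlspecialchars, so that the snippet also runs outside WordPress; they approximate, but are not, WordPress’ own implementations.

```php
<?php
// Rough stand-ins for running outside WordPress. Inside WordPress the real
// esc_html() and esc_attr() exist and these definitions are skipped.
if ( ! function_exists( 'esc_html' ) ) {
    function esc_html( $text ) {
        return htmlspecialchars( (string) $text, ENT_QUOTES, 'UTF-8', false );
    }
}
if ( ! function_exists( 'esc_attr' ) ) {
    function esc_attr( $text ) {
        return htmlspecialchars( (string) $text, ENT_QUOTES, 'UTF-8', false );
    }
}

// Hypothetical untrusted values.
$title = '<script>alert("Injected")</script>';
$value = '" onfocus="alert(1)';

// HTML context: the tags are shown as text instead of being run.
$heading = '<h1>' . esc_html( $title ) . '</h1>';

// Attribute context: the double quotes can no longer close the attribute.
$input = '<input type="text" name="my_input" value="' . esc_attr( $value ) . '" />';

echo $heading, "\n", $input, "\n";
```

Even though the two functions currently do the same thing, calling the one that matches the context documents your intent and keeps the code correct should they ever diverge.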
Both esc_html and esc_attr also come with __, _e and _x variants:
- esc_html__('Text to translate', 'plugin-domain') / esc_attr__ – returns the escaped translated text
- esc_html_e('Text to translate', 'plugin-domain') / esc_attr_e – displays the escaped translated text
- esc_html_x('Text to translate', $context, 'plugin-domain') / esc_attr_x – translates the text according to the passed context, and then returns the escaped translation
HTML Class Names
For class names, WordPress provides sanitize_html_class – this escapes variables for use in class names, simply by restricting the returned value to alpha-numerics, hyphens and underscores. Note: It does not ensure the class name is valid (reference: http://www.w3.org/TR/CSS21/syndata.html#value-def-identifier).
In CSS, identifiers can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code.
Let’s now look at another common practice: printing variables into the href attribute of a link. Clearly this is vulnerable to the same form of attack as illustrated in escaping HTML and attributes. But what if $url were set to a javascript: URL that calls alert()?
On clicking the link, the alert function would be fired. This contains no HTML, nor any quotes that allow it to jump out of the href attribute – so esc_attr is not sufficient here. This is why context matters: esc_attr($url) would be safe in the title attribute, but not in the href attribute. Instead, WordPress provides two functions:
- esc_url – for escaping URLs that will be printed to the page.
- esc_url_raw – for escaping URLs to save to the database or use in URL redirecting.
esc_url_raw is almost identical to esc_url, but it does not replace ampersands and single quotes with their entities (which you don’t want when using the URL as a URL, rather than displaying it).
In this example, we are displaying the URL, so we use esc_url.
Although not necessary in most cases, both functions accept an optional array to specify which protocols (such as http, https, mailto, etc.) you wish to allow.
Sometimes variables are printed directly into inline JavaScript. In fact, if you are doing this, you should almost certainly be using wp_localize_script() instead, which handles sanitization for you. (If anyone can think of a reason why you might need to print the variable inline, I would like to hear it.)
However, to make the inline approach safe, you can use the esc_js function.
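When displaying content in a textarea, esc_html is not sufficient because it does not double encode entities. For example, suppose $var is set to:
text &lt;b&gt;bold&lt;/b&gt;
Escaped with esc_html and printed in the textarea, $var will appear as:
text <b>bold</b>
rather than showing the entities as they were entered, because the & in each entity has not been encoded as well. For this, WordPress provides esc_textarea, which encodes existing entities too.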
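The difference between the two functions on text that already contains entities can be seen in a short sketch. As before, the function_exists guards are stand-ins (assumptions built on PHP’s htmlspecialchars, with and without double encoding) so that the snippet runs outside WordPress; inside WordPress the real functions are used.

```php
<?php
// Stand-ins for running outside WordPress; skipped when the real functions exist.
if ( ! function_exists( 'esc_html' ) ) {
    function esc_html( $text ) {
        // Last argument false: entities already in the text are left alone.
        return htmlspecialchars( (string) $text, ENT_QUOTES, 'UTF-8', false );
    }
}
if ( ! function_exists( 'esc_textarea' ) ) {
    function esc_textarea( $text ) {
        // Default double encoding: the "&" of existing entities is encoded too.
        return htmlspecialchars( (string) $text, ENT_QUOTES, 'UTF-8' );
    }
}

// The user typed entities, and wants to see those entities when editing again.
$var = 'text &lt;b&gt;bold&lt;/b&gt;';

$as_html     = esc_html( $var );     // unchanged, so a textarea would display: text <b>bold</b>
$as_textarea = esc_textarea( $var ); // 'text &amp;lt;b&amp;gt;bold&amp;lt;/b&amp;gt;'

echo '<textarea name="my_textarea">' . $as_textarea . '</textarea>', "\n";
```

With the second form the browser decodes one level of entities, so the textarea shows exactly what was stored and a round trip through the form does not silently change the data.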
Displaying e-mail addresses on your website leaves them prone to e-mail harvesters. One simple method is to disguise the e-mail address. WordPress provides antispambot, which encodes random parts of the e-mail address into their HTML entities (and hexadecimal equivalents if $mailto = 1). On each page load the encoding should be different, and while the returned address renders correctly in the browser, it should appear as gobbledygook to the spambots. The function accepts two arguments:
- e-mail – the address to disguise
- mailto – 1 or 0 (1 if using the mailto protocol in a link tag)
$email = "firstname.lastname@example.org";
$email = sanitize_email( $email );
echo '<a href="mailto:' . antispambot( $email, 1 ) . '">' . antispambot( $email ) . '</a>';
If you wish to add (or remove) variables from a query string (this is very useful if you wish to allow users to select an order for your posts), the easiest way is to use add_query_arg and remove_query_arg. These functions build the new query string for you, but they do not escape the resulting URL for output, so pass it through esc_url before printing it.
add_query_arg accepts two arguments:
- query parameters – an associative array of parameters -> values
- url – the URL to add the parameters and their values to. If omitted, the URL of the current page is used
remove_query_arg also accepts two arguments: the first is an array of parameters to remove, the second is as above.
// If we are at www.example.com/wp-admin/edit.php?post_type=book
$query_params = array( 'page' => 'my-page' );
$url = add_query_arg( $query_params );
// Would set $url to be:
// /wp-admin/edit.php?post_type=book&page=my-page
Validation & Sanitization
As previously mentioned, sanitization doesn’t make much sense without a context – so it’s pretty pointless to sanitize data when writing to the database. Often, you need to store data in its raw format anyway, and in any case – Rule No. 1 dictates that we should always sanitize on output.
Validation of data, on the other hand, should be done as soon as it’s received and before it’s written to the database. The idea is that ‘invalid’ data should either be auto-corrected or be flagged to the user, and only valid data should be given to the database.
That said, you may also want to perform validation when data is displayed. In fact, sometimes ‘validation’ will also ensure the data is safe. But the priority here is on safety, and you should avoid excessive validation that would run on every page load (the wp_kses_* functions, for instance, are very expensive to perform).
When using functions such as get_posts or classes such as WP_User_Query, WordPress takes care of the necessary sanitization in querying the database. However, when retrieving data from a custom table, or otherwise performing a direct SQL query on the database, proper sanitization is up to you. WordPress does, however, provide a helpful class, wpdb (available through the global $wpdb object), that helps with escaping SQL queries.
Let’s consider this basic SELECT command, where $age and $firstname are variables storing an age and name that we are querying:
SELECT * FROM Students WHERE age='$age' AND firstname = '$firstname'
We have not escaped these variables, so potentially further commands could be injected in. Borrowing xkcd’s example from above:
$age = 14;
$firstname = "Robert'; DROP TABLE Students;";
$sql = "SELECT * FROM Students WHERE age='$age' AND firstname = '$firstname';";
$results = $wpdb->query( $sql );
Will run as the command(s):
SELECT * FROM Students WHERE age='14' AND firstname = 'Robert'; DROP TABLE Students;';
And delete our entire Students table.
To prevent this, we can use the $wpdb->prepare method. This accepts two parameters:
- The SQL command as a string, where string variables are replaced by the placeholder %s, integers by the placeholder %d and floats by %f
- An array of values for the above placeholders, in the order they appear in the query
In this example:
$age = 14;
$firstname = "Robert'; DROP TABLE Students;";
$sql = $wpdb->prepare( 'SELECT * FROM Students WHERE age=%d AND firstname = %s;', array( $age, $firstname ) );
$results = $wpdb->get_results( $sql );
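The escaped SQL query ($sql in this example) can then be used with one of the query methods, such as $wpdb->get_results(), $wpdb->get_row(), $wpdb->get_col(), $wpdb->get_var() or $wpdb->query().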
Inserting and Updating Data
For inserting or updating data, WordPress makes life even easier by providing the $wpdb->insert() and $wpdb->update() methods.
The $wpdb->insert() method accepts three arguments:
- Table name – the name of the table
- Data – array of data to insert as column->value pairs
- Formats – array of formats for the corresponding values (‘%s’, ‘%d’ or ‘%f’)
$age = 14;
$firstname = "Robert'; DROP TABLE Students;";
$wpdb->insert( 'Students', array( 'firstname' => $firstname, 'age' => $age ), array( '%s', '%d' ) );
The $wpdb->update() method accepts five arguments:
- Table name – the name of the table
- Data – array of data to update as column->value pairs
- Where – array of data to match as column->value pairs
- Data Format – array of formats for the corresponding data values
- Where Format – array of formats for the corresponding ‘where’ values
// Update Robert'; DROP TABLE Students; to Bobby
$oldname = "Robert'; DROP TABLE Students;";
$newname = "Bobby";
$wpdb->update( 'Students', array( 'firstname' => $newname ), array( 'firstname' => $oldname ), array( '%s' ), array( '%s' ) );
Both the $wpdb->insert() and the $wpdb->update() methods perform all the necessary sanitization for writing to the database.
Since the $wpdb->prepare method uses % to distinguish the placeholders, care needs to be taken when using the % wildcard in SQL LIKE statements. The Codex suggests escaping them with a second %. Alternatively, you can escape the term to be searched for with $wpdb->esc_like (which replaces the like_escape function, deprecated since WordPress 4.0) and then add the wildcard % where appropriate, before including this in the query using the prepare method. For instance:
$age = 14;
$firstname = "Robert'; DROP TABLE Students;";
$sql = "SELECT * FROM Students WHERE age=$age AND (firstname LIKE '%$firstname%');";
Would be made safe with:
$age = 14;
$firstname = "Robert'; DROP TABLE Students;";
$query = $wpdb->prepare( 'SELECT * FROM Students WHERE age=%d AND (firstname LIKE %s);', array( $age, '%' . $wpdb->esc_like( $firstname ) . '%' ) );
This isn’t an exhaustive list of the functions available for validation and sanitization, but it should cover the vast majority of use cases. A lot of these (and other) functions can be found in /wp-includes/formatting.php and I’d strongly recommend digging into the core code and having a look into how WordPress core does validation and sanitization of data.
Did you find this article useful? Do you have any further suggestions on best practices for data validation and sanitization in WordPress? Let us know in the comments below.
Original from: http://wp.tutsplus.com/tutorials/creative-coding/data-sanitization-and-validation-with-wordpress/