REM State

16 May

Safe Token Expansion in HTML

Recently, I wrote a token expander for In Series to allow flexible end-user customization of the plugin output. The initial implementation was faulty because it did not account for every value an expanded token could take. This article touches on some of the problems that were solved, and is intended to provide a basis for others to perform similar HTML processing in a safe, effective manner.

The general premise of my token expander is that it is far more flexible for the user to specify a high-level layout than it is for a configuration UI to attempt to enumerate and describe pre-defined layouts. However, the technique described here is relevant to any program that wishes to insert arbitrary (especially user-supplied) data into an HTML document or fragment. For the sake of discussion, PHP syntax will be used.

When inserting arbitrary text into HTML, two negative outcomes can occur (assuming no processing is performed before text insertion):

  • The text contains malicious HTML and/or JavaScript markup
  • The text — either directly, or due to the context where it is inserted — causes the parent document to be malformed or invalid

The former case is not addressed within my plugin, but the theory is relatively straightforward: any contents that are not trusted should be scrubbed of HTML tags (e.g., using PHP’s strip_tags function). If a subset of tags is allowed, then the attributes on those tags should be scrubbed; further, if a subset of attributes is allowed, then the contents of those attributes should be scrubbed. Whitelisting is the key.
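As a minimal sketch of the scrubbing step, the snippet below calls strip_tags with and without an allowed-tags list. The input string is hypothetical. Note that strip_tags keeps the attributes of any tag it allows, and keeps the text inside removed tags, so attribute scrubbing has to be handled separately.

```php
<?php
// Hypothetical untrusted input, for illustration only.
$untrusted = 'Hello <b onclick="evil()">world</b><script>alert(1)</script>';

// Remove every tag. The text between tags (including script bodies) remains.
$plain = strip_tags($untrusted);

// Allow only <b>. Attributes on the allowed tag are NOT removed by strip_tags,
// so the onclick handler survives and still needs to be scrubbed.
$bold_only = strip_tags($untrusted, '<b>');
```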

The latter case is a little trickier. In my case (and in many others) the page being output is not constructed with a DOM, and even where it is, a document can often pass through many well-formed but invalid states before it is complete. Determining at run-time whether the change you’re about to make will ultimately result in valid, well-formed markup is tricky, and beyond the scope of this article. However, short of complete validation, it is possible to safely insert arbitrary content into HTML provided that three basic assumptions hold true:

  • Tokens are expanded only as HTML, or as a value for an attribute (e.g., <a href="%value">%html</a>, but not <a %attribs>example</a>).
  • All literal instances of > in text and attribute values are escaped as &gt; (e.g., the HTML source reads 2 &gt; 1, not 2 > 1), so that an unescaped > always marks the end of a tag.
  • The fragment being inserted is internally consistent with HTML, and the start/end of the fragment properly merge with the content that the fragment is being inserted into (e.g., the fragment doesn’t start with a /> unless context requires it).

Given the above, we can set up two types of replacements: internal and external. Internal replacements are values of HTML element attributes; these replacements must be escaped such that they can be placed safely between double or single quotes. In PHP, this means using htmlspecialchars with the ENT_QUOTES option. Whether or not HTML should be stripped for internal replacements (via strip_tags prior to escaping) depends on the application; some applications may even want to offer a choice between the two modes, either via configuration or by providing two tokens. External replacements, by contrast, are simple: they exist in the regular flow of HTML, and may be stripped (depending on the application), but generally will not be escaped. The text is dropped into the parent document, and the user is expected to have provided valid content (as per the final assumption from the list above).
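The two replacement types might be sketched as a pair of small helpers. The function names expand_internal and expand_external and the $strip flags are my own, chosen for illustration; they are not taken from the plugin.

```php
<?php
// Hypothetical helper: prepare a value for use inside a quoted attribute.
// Tags are optionally stripped first, then &, <, >, " and ' are escaped.
function expand_internal(string $value, bool $strip = true): string
{
    if ($strip) {
        $value = strip_tags($value);
    }
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// Hypothetical helper: prepare a value for the regular flow of HTML.
// The value is optionally stripped, but never escaped.
function expand_external(string $value, bool $strip = false): string
{
    return $strip ? strip_tags($value) : $value;
}

$title = 'Tom & "Jerry" <em>returns</em>';
$link  = '<a title="' . expand_internal($title) . '">'
       . expand_external($title) . '</a>';
```

Here the attribute receives the escaped, tag-free text, while the link body receives the value untouched, markup included.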

The next trick, then, is to discover which tokens should have internal replacement performed, and which should have the external replacement. Again, given our assumptions, this is straightforward: to start, any tokens that are encountered will be considered an external replacement. If a < character comes up, then any tokens encountered will be considered an internal replacement. Upon finding a > character, the behavior will revert back to external replacements. This mechanism is easy to implement in many languages, without the use of special libraries — simply scan, one character at a time, and use the appropriate replacement mechanism when matching tokens are located.
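The scanning approach described above could look roughly like the following. This is a sketch under assumptions, not the plugin's code: the function name expand_tokens_scan is hypothetical, tokens are assumed to start with % and to be supplied as array keys (e.g., '%title'), and no tag stripping is performed.

```php
<?php
// Hypothetical scanner: $values maps tokens such as '%title' to raw text.
function expand_tokens_scan(string $template, array $values): string
{
    // Try longer tokens first so '%title_long' wins over '%title'.
    $names = array_keys($values);
    usort($names, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    $out = '';
    $inside_tag = false;
    $len = strlen($template);
    for ($i = 0; $i < $len; ) {
        $ch = $template[$i];
        if ($ch === '<') {
            $inside_tag = true;          // switch to internal replacements
        } elseif ($ch === '>') {
            $inside_tag = false;         // revert to external replacements
        } elseif ($ch === '%') {
            foreach ($names as $name) {
                if (substr_compare($template, $name, $i, strlen($name)) === 0) {
                    $out .= $inside_tag
                        ? htmlspecialchars($values[$name], ENT_QUOTES, 'UTF-8')
                        : $values[$name];
                    $i += strlen($name); // skip the token in the template
                    continue 2;          // expanded text is never rescanned
                }
            }
        }
        $out .= $ch;
        $i++;
    }
    return $out;
}
```

Because expanded values are appended to the output rather than written back into the template, a value that happens to contain < or another token cannot affect the scanner's state.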

For PHP, I went ahead and used regular expressions instead of performing the scanning myself. Using "/<[^>]*>/" (internal) and "/(^|>)[^<]*/" (external) with preg_replace_callback, I was able to quickly divvy up the strings into portions that I could perform safe token replacements on.
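A rough sketch of how those two patterns might be wired together follows. The function name expand_tokens_regex is hypothetical, and the use of strtr inside each callback is my own choice for a single-pass replacement within a segment. I run the internal pass first on the assumption that raw external values may themselves contain tags, which a later internal pass would otherwise pick up.

```php
<?php
// Hypothetical wrapper: $values maps tokens such as '%title' to raw text.
function expand_tokens_regex(string $template, array $values): string
{
    // Escaped copies of every value, for use inside tags.
    $escaped = array_map(function ($v) {
        return htmlspecialchars($v, ENT_QUOTES, 'UTF-8');
    }, $values);

    // Pass 1 (internal): segments of the form <...> get escaped values.
    $template = preg_replace_callback('/<[^>]*>/', function ($m) use ($escaped) {
        return strtr($m[0], $escaped);
    }, $template);

    // Pass 2 (external): text between tags gets raw values. Escaped values
    // from pass 1 contain no < or >, so tag boundaries are unchanged.
    return preg_replace_callback('/(^|>)[^<]*/', function ($m) use ($values) {
        return strtr($m[0], $values);
    }, $template);
}
```

For example, with '%t' mapped to the text a "b" <i>c</i>, the template <a title="%t">%t</a> receives the escaped form in the attribute and the raw markup in the link body.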

Note that this divvying is very important: unless you are using tokens that expand to only one value regardless of context (e.g., %title_esc/%title_raw), you will have to perform two replacement passes on the string, one escaped and one raw. This means that the results of the first expansion are subject to being interpreted in the second expansion. Given the fragment %titlee, where %title expands to %titl, the first pass produces %title, and another search/replace on %title will expand it again, which is very likely not the output the user expected. Divvying avoids this problem: by splitting up the “inside” and “outside” portions, we can ensure that each part of the string, as a whole, only has one search-replace performed on it. Finally, it is important to note that doing multiple str_replaces is problematic for the exact same reason as not divvying: prior expansions can lead to further, unintended expansions later on. Passing arrays to str_replace or preg_replace does not help, because both apply their patterns one after another to the running result. You can ensure a single-pass replacement through the use of strtr with an array argument, preg_replace_callback with a single pattern that matches every token, or by careful string manipulation.
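The difference between sequential and single-pass replacement can be seen in a small comparison. The tokens and values here are made up for illustration; the point is only that one value happens to contain text that looks like another token.

```php
<?php
// Hypothetical tokens: the author value contains the literal text "%title".
$values = [
    '%author' => 'Fans of %title',
    '%title'  => 'My Post',
];
$fragment = 'By %author';

// Sequential: each search/replace pair runs on the result of the previous
// one, so the "%title" introduced by the first pair is expanded again.
$sequential = str_replace(array_keys($values), array_values($values), $fragment);

// Single pass: strtr does not search text it has already replaced.
$single = strtr($fragment, $values);
```

With these inputs, $sequential ends up as "By Fans of My Post", while $single is "By Fans of %title", which is what the user actually supplied.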

© 2009 REM State