Data validation — or more accurately, the lack of it — is the single largest cause of vulnerabilities among all the categories of our security frame. From the vulnerabilities of the ’90s, such as the infamous buffer overflow, to the current bane of Web applications, cross-site scripting — all of these are problems that can be easily mitigated by effective data validation. However, all too often data validation is an afterthought, if it happens at all. This can result in embarrassing and dangerous vulnerabilities manifesting themselves in your applications — vulnerabilities that, in hindsight, cause developers to think, “That would have been so easy to fix.”
All too often data validation is not treated as part of software design but rather as something that is almost “common sense” and therefore the responsibility of each individual developer. As someone once said, however, “Common sense is not so common.” Considerations of “performance,” as well as impending project deadlines, often end up trumping the apparent need for data validation. Developers are also prone to considering only best-case scenarios and normal user behavior, overlooking the fact that malicious attackers do not play by the rules or by the norm. What could have been avoided with a simple check — often a single line of code — can then result in a catastrophic vulnerability affecting the entire user population. Conversely, we have found that when data validation is embodied in the architecture and is made a focus early in the application development lifecycle, more often than not the application is secure by default or easily fixed even when new vulnerability types are discovered.
So What Is Data Validation, Then?
Unfortunately, there are a number of false beliefs when it comes to data validation, beginning with the assumption that data validation means input validation. Data validation, however, covers output as well as input. Further, even within input validation (see Figure A), the threats to an application go far beyond user inputs. Consider, for instance, serialized objects or news feeds from third parties. Could those potentially be inputs? Could they be tampered with to cause harmful effects on your application?
At a very fundamental level, data validation essentially comes down to verifying four basic properties of the data: length, range, format and type.
Length, as the name suggests, implies the size of the data: for instance, the number of bytes in a string or the number of characters to be copied into a buffer — and note that these two could differ depending on the encoding formats being used.
Range determines which values are valid and which are not. This will often depend on the business logic. For instance, when dealing with the price of an item in an e-commerce store, a negative value should raise a lot of red flags for violating sanity checks.
Format implies the way the data is meant to look. Is it meant to be, for instance, a sequence of nine digits, with specific rules for blocks within those digits (as, for example, a U.S. Social Security number), or is it meant to be a phone number, and if so, from which country? Is it meant to be an alphanumeric address? Are special characters and punctuation allowed? Valid formats will be dictated by the real-world entity the data describes.
Type describes the nature of the raw data associated with the underlying item. Often the bane of scripting and other loosely typed languages, type mismatches can result in unpredictable results from an application. For instance, consider what happens if, instead of passing in a numeric field such as your age, you pass a string containing SQL fragments.
With these definitions in place, we can now go about defining effective and efficient strategies around data validation.
Strategies for Effective Data Validation
First and foremost, as was mentioned above, it is important that data validation not be left to the individual developer but instead be considered an integral and necessary part of the system architecture. Considering data validation early in the application development lifecycle makes it possible to centralize it across the application or components within the application. A data validation strategy coordinated across multiple components has the advantage of being easy to implement, test, maintain and update. But perhaps more importantly, it frees the individual components and the developers who own those components from the need to validate the data they handle; they can focus instead on application logic and be assured that the data will be validated elsewhere. The end result is the reduction or even the elimination of inconsistencies across modules in the application. Further, common programming errors such as validation exclusively on the client side can be much more easily tested for and discovered. (See sidebar.)
Once you have decided on employing a centralized strategy, the obvious follow-up question is, how do you implement such a strategy? A common approach to this is to employ what is often called a validation funnel. As Figure B illustrates, a validation funnel siphons all inputs through a single validation module and handles outputs similarly. It is important to note, however, that such funnels need not be separate physical components but could in fact be shared classes that filter and sanitize all inputs and outputs based on a set of rules. Ideally this set of rules should be configurable declaratively, without having to rebuild the module. Further, the rules must account for varying degrees of access control. For instance, data that can be influenced by an anonymous Internet user must be treated with far less trust than data that requires you to be an administrator in the first place.
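The funnel idea can be illustrated with a small sketch. Everything named here (the Rule class, the RULES table, the field names and limits) is a hypothetical example rather than part of any framework, and the rule table is hard-coded only to keep the example self-contained; in practice it could be loaded from a configuration file so that rules change without a rebuild. The sketch does not model the varying levels of trust mentioned above.

```python
import re
from dataclasses import dataclass
from typing import Any, Optional, Pattern


class ValidationError(ValueError):
    """Raised when a value fails the central validation rules."""


@dataclass(frozen=True)
class Rule:
    """One declarative rule (hypothetical shape, for illustration only)."""
    kind: type                            # expected type
    max_len: Optional[int] = None         # length limit for strings
    lo: Optional[float] = None            # inclusive lower bound for numbers
    hi: Optional[float] = None            # inclusive upper bound for numbers
    pattern: Optional[Pattern[str]] = None  # format mask for strings


# Hypothetical rule table keyed by field name.
RULES = {
    "username": Rule(kind=str, max_len=32, pattern=re.compile(r"[A-Za-z0-9_]+")),
    "quantity": Rule(kind=int, lo=1, hi=100),
}


def validate(field: str, value: Any) -> Any:
    """Single chokepoint: every input passes through here before it is used."""
    rule = RULES.get(field)
    if rule is None:
        # Fields without a rule are rejected rather than waved through.
        raise ValidationError(f"no rule defined for field {field!r}")
    if type(value) is not rule.kind:
        raise ValidationError(f"{field}: expected {rule.kind.__name__}")
    if rule.max_len is not None and len(value) > rule.max_len:
        raise ValidationError(f"{field}: too long")
    if rule.lo is not None and value < rule.lo:
        raise ValidationError(f"{field}: below minimum")
    if rule.hi is not None and value > rule.hi:
        raise ValidationError(f"{field}: above maximum")
    if rule.pattern is not None and not rule.pattern.fullmatch(value):
        raise ValidationError(f"{field}: bad format")
    return value
```

A caller would write something like validate("quantity", 5) at the point where data enters the application. A field with no rule is treated as an error, which keeps the funnel a white list by construction.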
Once the application architecture has a centralized data validation chokepoint, the next question that needs to be tackled is where to perform this centralized validation and how often to validate. Answering these questions accurately usually requires understanding of the trust boundaries associated with a system. A trust boundary is a logical edge at which one side does not trust the other. For instance, as indicated in the sidebar, a trust boundary could exist between the client and the server or at a remoting API that is shared by both internal applications and partner applications. Most often the trust boundary can be defined at the location where the policies associated with a system change. A good way to identify these is to look for network devices such as firewalls or VLANs or authentication mechanisms.
When identifying trust boundaries, it is important to avoid a few common misconceptions. For instance, data validation is often thought of as concerning only Web applications. However, in today’s ever-more-connected world, organizations are opening more and more of their internal IT systems to partners and telecommuting employees. The traditional notion of a closed environment is slowly dying away, and legacy applications that made assumptions about such environments are becoming prime targets as they inadvertently (or perhaps even intentionally) are exposed to the outside world. The other factor that comes into play here is that data validation is not restricted to user inputs and outputs but extends to nonconventional data paths such as news feeds and object serialization and deserialization as well as data manipulated from sockets, inter-process communication channels, environment variables and log files.
That said, however, developers might also wonder how many levels of data validation they really need. This brings up the most common argument against data validation: performance. In our experience, having looked at hundreds of applications of all hues, the performance impact of necessary and sufficient data validation is rarely significant, especially when compared to other bottlenecks within a typical application architecture, such as network bandwidth or encryption overheads. Further, one has to question the value of a high-performing system if it has been compromised by a malicious attacker. At the same time, it is important to understand the drawbacks of having too much validation.
What To Validate, and How?
The answers to these questions are best understood in terms of the four data properties defined above.
Length. The classic protection against buffer overflows is to validate the length of all buffers before using them. This ensures that the destination buffer is large enough to hold the data about to be copied into that buffer. However, developers often forget to check the other bound on length: the minimum length of a data element. This is especially true when the input obtained from the user is meant to be deserialized into a binary object, for instance.
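A minimal sketch of checking both bounds before deserializing, assuming a made-up wire format (a 2-byte record type followed by a 4-byte payload length) and an arbitrary payload limit. The names HEADER, MAX_PAYLOAD and parse_record are illustrative only.

```python
import struct
from typing import Tuple

# Hypothetical wire format: 2-byte record type, 4-byte payload length (big-endian).
HEADER = struct.Struct("!HI")
MAX_PAYLOAD = 4096  # assumed upper limit for this example


def parse_record(data: bytes) -> Tuple[int, bytes]:
    """Check minimum and maximum length before interpreting any bytes."""
    if len(data) < HEADER.size:
        raise ValueError("record is shorter than its header")
    if len(data) > HEADER.size + MAX_PAYLOAD:
        raise ValueError("record exceeds the maximum allowed size")
    rec_type, payload_len = HEADER.unpack_from(data)
    # The length declared inside the data must agree with what actually arrived.
    if payload_len != len(data) - HEADER.size:
        raise ValueError("declared payload length does not match the data")
    return rec_type, data[HEADER.size:]
```

The minimum-length check is the one that tends to be forgotten: without it, a truncated record would be handed to the unpacking code before anyone noticed it was incomplete.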
Given how long buffer overflows have been around, a number of tried and tested solutions exist for length checking. With a number of the newer “managed” languages, such as Java and C#, length checks are less of a consideration due to automated memory management. However, even in these cases, validating for length can help prevent unnecessary reallocations and memory copies. Further, especially when dealing with data that may be sensitive, developers would like to avoid multiple copies being left behind in memory, with only the garbage collector to control the process of clearing them.
C and C++ application development have also seen solutions to some of the problems traditionally associated with these languages. For instance, the Standard Template Library (STL) provides a number of classes that allow for managed string types, smart pointers and dynamically allocated data structures such as vectors. Similarly, a number of the unsafe ANSI C functions (the str* string functions, for instance) have been replaced by safer alternatives that do perform extensive bounds checking. [1] However, the one problem that remains despite these safe libraries is the inconsistent semantics of a number of commonly used APIs. The biggest source of confusion is situations in which the API expects a particular number of bytes and the developer passes a number of characters (or vice versa). Everything in this case seems to work fine until you encounter, for instance, a UTF-8 encoded string. This is often seen with MultiByteToWideChar on Windows. Confusion also arises as to whether the length parameter is meant to indicate the number of bytes left in the string, or the total size of the buffer; this last issue is especially a problem with the string concatenation functions available in the ANSI C library.
Range. When dealing with an acceptable range of values, ensure that data is validated to be within the expected or acceptable range. The canonical example of this is prices of goods on e-commerce websites. A number of websites and online shopping cart services continue to be vulnerable to “negative price/shipping cost/tax” attacks wherein the attacker can influence the price he or she pays (or indeed ends up being paid!). Such logical problems should be easy to detect and prevent, given the business rules implemented by the system.
Similarly, especially when dealing with numbers, it is important to understand the range of the base numeric type being used to store the number, and the difference between signed and unsigned numbers. For instance, what happens when you increase a number beyond its maximum value or decrease it below its minimum value? How does that affect application logic and security? Likewise, when dealing with values returned from drop-down or list boxes, it is best to implement a data indirection pattern wherein only the option index is obtained from the client; if that index does not fall within an acceptable range, an error is returned.
In general, with regard to range-based validation, there are two common approaches: black-list and white-list. As the names suggest, black-list data validation involves creation of a list of “bad” data items, which are then blocked. White-listing, on the other hand, involves the creation of a list of items accepted based on business rules, dropping everything else. As one would expect, it is much easier to build an all-encompassing white list than it is to build a black list that is effective in blocking all attacks, both current and future. Therein lies the major problem: your black list is only as effective as your current knowledge of attack patterns.
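The price check and the data indirection pattern described above might look roughly like the following. The price ceiling, the option list and the function names are assumptions made for the example; the real limits would come from the business rules of the system.

```python
from decimal import Decimal, InvalidOperation

MAX_PRICE = Decimal("100000.00")  # assumed business limit, for illustration

# The real option values live only on the server; the client sends an index.
SHIPPING_OPTIONS = ("standard", "express", "overnight")


def validate_price(raw: str) -> Decimal:
    """Accept only finite, non-negative prices up to the business limit."""
    try:
        price = Decimal(raw)
    except InvalidOperation:
        raise ValueError("price is not a number") from None
    # is_finite() is checked first so NaN and Infinity never reach the comparisons.
    if not price.is_finite() or price < 0 or price > MAX_PRICE:
        raise ValueError("price is outside the accepted range")
    return price


def shipping_from_index(raw_index: str) -> str:
    """Data indirection: map a client-supplied index onto a server-side list."""
    try:
        index = int(raw_index)
    except ValueError:
        raise ValueError("option index is not an integer") from None
    if not 0 <= index < len(SHIPPING_OPTIONS):
        raise ValueError("option index is out of range")
    return SHIPPING_OPTIONS[index]
```

Because the client can only ever select a position in a list the server owns, the indirection pattern is a white list in its simplest form: anything not on the list cannot be expressed at all.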
Format. This is the aspect that is perhaps most ingrained in the business logic of an application. In most cases, format checks ensure that the programmatic representation of an entity is consistent with its real-world counterpart. There are a number of effective mechanisms that perform such validations, but perhaps the most efficient and elegant approach is to leverage regular expressions. The .NET and Apache Struts frameworks, for instance, provide out-of-the-box support for validating forms and controls through the use of regular expression masks. In the .NET framework, this is done using the asp:RegularExpressionValidator object. [2] Other tools, such as The RegEx Coach [3] and The Regulator, [4] can help beginners not only get more comfortable with regular expressions but also query online libraries for tried and tested expressions.
When dealing with XML data representations, this can be taken even further through the use of an XML Schema (XSD) to perform granular validation against the data elements contained within the XML document. A number of recent attacks have attempted to compromise not the application but the XML parser running within the application. The most common of these is what is usually called XDOS (XML Denial of Service). This attack typically involves feeding the parser with XML that contains embedded entity definitions that are recursive in nature. Given these attacks, it is important that data in XML documents be validated against all such malicious streams before any attempt is even made to parse the document.
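As a rough illustration of a regular expression mask, the sketch below checks a string against the shape of a U.S. Social Security number, including a few commonly cited block rules (no 000, 666 or 9xx area; no 00 group; no 0000 serial). The pattern and the function name are written for this example and should be treated as illustrative, not as an authoritative rule set.

```python
import re

# Illustrative mask: three blocks separated by hyphens, with block-level rules.
# re.ASCII restricts \d to the digits 0-9 (no other Unicode digit characters).
SSN_MASK = re.compile(
    r"(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}",
    re.ASCII,
)


def looks_like_ssn(text: str) -> bool:
    """Return True only if the whole string matches the mask."""
    # fullmatch() rejects trailing data such as a newline, which a pattern
    # anchored with '$' and used with match() would silently accept.
    return SSN_MASK.fullmatch(text) is not None
```

Matching the whole string matters here: a mask that only checks a prefix, or that tolerates a trailing newline, leaves room for an attacker to append data after the part that was validated.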
Format validation has another important dimension that is often forgotten and can be the source of numerous and repeated problems: the same data can be represented in multiple formats. For instance, the “less than” symbol, “<”, can be represented as “&lt;” when HTML-encoded, or as the numeric character references “&#60;” or “&#x3C;”. Other common encoding formats on the Web include URL encoding (“%3C”) and hexadecimal encoding. Given the multitude of formats, canonicalization becomes critical, and all validation must be performed after data has been decoded into its most basic form.
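One way to sketch canonicalize-then-validate in Python is to decode URL and HTML encodings repeatedly until the text stops changing, and only then inspect it. The function names and the round limit are assumptions for this example; whether repeated decoding is appropriate depends on how the application itself decodes data downstream.

```python
import html
from urllib.parse import unquote


def canonicalize(raw: str, max_rounds: int = 5) -> str:
    """Decode URL and HTML encodings until the text reaches a stable form."""
    current = raw
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(current))
        if decoded == current:
            return current
        current = decoded
    # Legitimate input is rarely encoded this many times; treat it as hostile.
    raise ValueError("input is nested too deeply in encodings")


def contains_markup(raw: str) -> bool:
    """Check for '<' only after the input has been reduced to its basic form."""
    return "<" in canonicalize(raw)
```

With this in place, "&lt;", "&#60;", "&#x3C;", "%3C" and the double-encoded "%253C" are all recognized as the same character, whereas a check against the raw string would catch none of them.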
Besides the fact that the same character can be represented in different ways, internationalization comes into play when dealing with languages other than English, especially those from the Far East. The best way to tackle such issues is to use Unicode and UTF-8 when building the application. Regular expressions have also been extended to support non-English character sets. However, the most typical problem that arises after internationalization is the buffer overflow. It is therefore important to remember that while in the United States, for the most part, an 8-character password is represented in 8 bytes, in China an 8-character name will typically occupy 16 bytes in UTF-16 or 24 bytes in UTF-8. Hence, when dealing with buffers, one needs to size and check them in bytes rather than characters, especially when operating in non-English locales.
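The gap between character counts and byte counts is easy to demonstrate. In the sketch below, the sample strings and the helper name fits_in_buffer are chosen for illustration; the Chinese sample is simply the numerals one through eight, each of which takes three bytes in UTF-8 and two in UTF-16.

```python
def fits_in_buffer(text: str, buffer_size: int, encoding: str = "utf-8") -> bool:
    """Compare the encoded size in bytes, not the character count, to the buffer."""
    return len(text.encode(encoding)) <= buffer_size


ascii_password = "password"       # 8 characters, all ASCII
chinese_name = "一二三四五六七八"  # 8 characters, all CJK

# Same number of characters...
assert len(ascii_password) == len(chinese_name) == 8

# ...but very different sizes once encoded.
assert len(ascii_password.encode("utf-8")) == 8
assert len(chinese_name.encode("utf-8")) == 24
assert len(chinese_name.encode("utf-16-le")) == 16
```

A length check written as len(text) <= 8 would accept both strings, yet only the first fits in an 8-byte buffer; fits_in_buffer(chinese_name, 8) returns False.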
A special case, where format checking is concerned, comes up when dealing with file uploads or downloads. With file uploads, checking MIME types and performing selective virus scanning is regarded as a good practice. Similarly, file uploads should be throttled to avoid disk space exhaustion attacks. Where downloads are concerned, developers must ensure that arbitrary files cannot be downloaded from outside the equivalent of a chroot jail. [5] Developers must decide, for instance, whether path components and relative paths will be allowed at all. Similarly, it is important that all access control be performed on the basis of file-system-based access control lists, rather than simply on the name. This is especially significant when 8.3 names are enabled on the system. For instance, ThisIs~1.doc refers to the same file as ThisIsASecretDocument.doc when they are in the same folder. Hence, if your access control is based on matching the whole name, an attacker could trivially subvert your access control mechanism by using the 8.3 file name. All access control must be based on handles rather than names.
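For the download case, a minimal sketch of confining requests to a base directory is shown below. The function name and directory layout are hypothetical. The sketch resolves symbolic links and relative components before comparing paths; it does not address 8.3 short-name aliasing or replace file-system access control lists, and on Windows os.path.commonpath raises ValueError for paths on different drives, which a caller would also need to treat as a rejection.

```python
import os


def resolve_download(base_dir: str, requested: str) -> str:
    """Return the real path of a requested file, or refuse if it leaves base_dir."""
    base = os.path.realpath(base_dir)
    # join() discards base when 'requested' is absolute, so the check below
    # also catches requests such as '/etc/passwd'.
    target = os.path.realpath(os.path.join(base, requested))
    if os.path.commonpath([base, target]) != base:
        raise PermissionError("requested path escapes the download directory")
    return target
```

The comparison is made on fully resolved paths, which is the canonicalization rule applied to file names: "../" sequences and symbolic links are expanded first, and only the result is checked.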
Type. This may be the most often ignored and least-used data attribute, especially when dealing with strongly typed languages from C or C++ to C# and Java. However, it continues to be the bane of the weakly typed scripting languages, such as Perl or JavaScript. In such cases, it is important to ensure that if the application is expecting a string, then that should indeed be what is presented to the application, rather than a numeric type, for example. In the scripting languages this is best done by requiring as a matter of coding standards that all variables be declared with a type before they are ever used. With languages such as C# and Java, the reflection mechanism provides an effective and efficient way of querying object metadata and identifying data types. This is especially useful when dealing with dynamic code or object serialization attacks. In C++, the capabilities are largely limited to the casting operators [6] (such as dynamic_cast) that understand polymorphism and can check whether a cast will be valid before actually performing it. When applied to pointers, dynamic_cast returns a null pointer if the cast fails (a failed reference cast throws std::bad_cast instead), allowing the developer to take remedial action. Old C-style casts, on the other hand, are extremely loose and allow for arbitrary conversions between unrelated data types (especially when data is passed around as void*). These conversions can not only cause exceptions themselves, but can also result in unpredictable application behaviors.
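In a dynamically typed language the type check has to be written explicitly. The sketch below, using the age example from earlier in the article, accepts either an integer or a string of ASCII digits and rejects everything else; the function name and the 0 to 150 bound are assumptions for illustration.

```python
def parse_age(raw: object) -> int:
    """Accept an int or a string of ASCII digits; reject every other type."""
    if isinstance(raw, bool):
        # bool is a subclass of int in Python, so it must be excluded first.
        raise TypeError("age must be an integer, not a boolean")
    if isinstance(raw, int):
        age = raw
    elif isinstance(raw, str) and raw.isascii() and raw.isdigit():
        age = int(raw)
    else:
        raise TypeError("age must be an integer")
    if not 0 <= age <= 150:  # assumed sanity bound for the example
        raise ValueError("age is outside the accepted range")
    return age
```

A string such as "42 OR 1=1" fails the digit test and never reaches the code that builds a query, which is the point of validating type before the value is used anywhere else.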
Conclusion
Data validation is an important aspect in the security of an application. If you consider most common attacks against software, from the age-old buffer overflow to the more recent SQL injection—all of these can be prevented or mitigated by the use of effective data validation. As with most other things in software security, data validation is hard to add on at the end and must be considered from day one as an integral and central part of the application architecture. This ensures that no assumptions are made and that the individual developers do not have to be concerned with data validation in each of their components. Once architected properly, data validation then simply comes down to basic length, range, format and type checks.
Rudolph Araujo is a principal software security consultant at Foundstone, responsible for creating and delivering the threat modeling and security code review service lines. He is also responsible for content creation and training delivery for Foundstone’s Building Secure Software, Writing Secure Code — ASP.NET, and Writing Secure Code — C/C++ classes.
Mark Curphey is the chairman and CTO of SourceClear, an early stage start-up building information security management software. He founded the Open Web Application Security Project (www.owasp.org) and is now working on a new project, the ISM Community (www.ism-community.org). He now lives in the South of France.
[1] msdn2.microsoft.com/en-us/library/ms861501.aspx; buildsecurityin.us-cert.gov/daisy/bsirules/271.html
[2] msdn2.microsoft.com/en-us/library/868290ew.aspx
[3] weitz.de/regex-coach/
[4] sourceforge.net/projects/regulator/
[5] en.wikipedia.org/wiki/Chroot
[6] msdn2.microsoft.com/en-us/library/5f6c9f8h.aspx
Can You Trust Your Client?
Large numbers of developers often make the mistake of trusting the client—assuming that if they themselves wrote the client and accounted for security measures when doing so, it must be secure. The problem with that premise is that the client is typically deployed in an untrustworthy environment — within a browser or as a thick client running on the attacker’s desktop, for instance.
In general it is fair to assume that the client desktop is a hostile environment. Client-side users have access to a variety of tools that can help them circumvent, disable or entirely negate security measures built into the client. These classes of tools include, among others, decompilers, debuggers and proxies. These tools can allow an attacker to do something as simple as viewing “secrets” in client-side application artifacts or something as complex as modifying the data stream and communication channel between the client and the server.
The classic example of this is client-side JavaScript-based data validation. Using a proxy such as Paros, [1] it is easy to selectively change the JavaScript, change the data field values after those values have left the browser, or completely turn the application’s client-side security model on its head. For this reason it is important that the client never be trusted. Validation and authorization checks may be performed on the client side, but only for performance reasons, i.e., to avoid an unnecessary server round trip to deal with an innocent user error. All such checks must still be performed on the server side. Essentially, developers must assume that a custom client written entirely by an attacker will be used to connect to the server.
— Rudolph Araujo and Mark Curphey
[1] www.parosproxy.org/