The riak-users list receives regular questions about how to secure a Riak cluster. This is an overview of the security problem, and some general techniques to approach it.
You can skip this, but it may be a helpful primer.
Consider an application composed of agents (Alice, Bob) and a datastore (Store). All events in the system can be parameterized by time, position (whether the event took place in Alice, Bob, or Store), and the change in state. Of course, these events do not occur arbitrarily; they are connected by causal links (wires, protocols, code, etc.).
If Alice downloads a piece of information from the Store, the two events E (Store sends information to Alice) and F (Alice receives information from store) are causally connected by the edge EF. The combination of state events with causal connections between them comprises a directed acyclic graph.
A secure system can be characterized as one in which only certain events and edges are allowed. For example, only after a nuclear war can persons on boats fire ze missiles.
A system is secure if all possible events and edges fall within the permitted set. If you’re a weirdo math person you might be getting excited about line graphs and dual spaces and possibly lightcones but… let’s bring this back to earth.
Authentication vs Authorization
Authentication is the process of establishing where these events are taking place, in system space. Is the person or agent on the other end of the TCP socket really Alice? Or is it her nefarious twin? Is it the Iranian government?
Authorization is the problem of deciding what edges are allowed. Can Alice download a particular file? Can Bob mark himself as a publisher?
You can usually solve these problems independently of one another.
Asymmetric cryptography combined with PKI allows you to trust big entities, like banks with SSL certificates. Usernames with expensively hashed, salted passwords can verify the repeated identity of a user to a low degree of trust. OAuth providers (like Facebook and Twitter) and OpenID offer another approach to web authentication. You can combine these methods with stronger systems, like RSA secure tokens, challenge-response over a second channel (like texting a code to the user’s cell phone), or one-time passwords for higher guarantees.
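As a rough illustration of "expensively hashed, salted passwords", here is a minimal sketch using only Python's standard library. PBKDF2 stands in for whatever slow hash you actually choose, and the function names and iteration count are assumptions made for this example, not recommendations from the post.

```python
import hashlib
import hmac
import os

# Assumed work factor for illustration; tune it to your own hardware.
ITERATIONS = 100_000


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest) for a newly chosen password."""
    salt = os.urandom(16)  # a fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)
```

The store keeps only the salt and digest, so a leaked account table does not directly reveal passwords, and the iteration count makes each guess costly for an attacker.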
Authorization tends to be expressed (more or less formally) in code. Sometimes it’s called a policy engine. It includes rules saying things like “Anybody can download public files”, “a given user can read their own messages”, and “only sysadmins can access debugging information”.
There are a couple of common ways that security can fail. Sometimes the system, as designed, allows insecure operations. Perhaps a check for user identity is skipped when accessing a certain type of record, letting users view each other’s paychecks. Other times the abstraction fails; the SSL channel you presumed to be reliable was tapped, allowing information to flow to an eavesdropper, or the language runtime allows payloads from the network to be executed as code. Thus, even if your model (for instance, application code) is provably correct, it may not be fully secure.
As with all abstractions on unreliable substrates, any guarantees you can make are probabilistic in nature. Your job is to provide reasonable guarantees without overwhelming cost (in money, time, or complexity). And these problems are hard.
There are some overall strategies you can use to mitigate these risks. One of them is known as defense in depth. You use overlapping systems which prevent insecure things from happening at more than one layer. A firewall prevents network packets from hitting an internal system, but it’s reinforced by an SSL certificate validation that verifies the identity of connections at the transport layer.
You can also simplify building secure systems by choosing to whitelist approved actions, as opposed to blacklisting bad ones. Instead of selecting evil events and causal links (like Alice stealing sensitive data), you enumerate the (typically much smaller) set of correct events and edges, deny all actions, then design your system to explicitly allow the good ones.
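One way to picture the whitelist approach is a default-deny policy check. The sketch below encodes the three example rules mentioned earlier; the `Request` fields and rule list are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    user: str
    role: str
    action: str
    resource_kind: str
    resource_owner: str


# Each rule returns True only for requests it explicitly approves.
ALLOW_RULES = [
    # Anybody can download public files.
    lambda r: r.action == "download" and r.resource_kind == "public_file",
    # A given user can read their own messages.
    lambda r: r.action == "read" and r.resource_kind == "message"
    and r.user == r.resource_owner,
    # Only sysadmins can access debugging information.
    lambda r: r.action == "read" and r.resource_kind == "debug_info"
    and r.role == "sysadmin",
]


def is_allowed(request: Request) -> bool:
    # Default deny: anything not matched by an allow rule is rejected.
    return any(rule(request) for rule in ALLOW_RULES)
```

Because the fallback is rejection, an action nobody thought about (say, `delete`) is refused without anyone having to anticipate it.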
Re-use existing primitives. Standard cryptosystems and protocols exist for preventing messages from being intercepted, validating the identity of another party, verifying that a message has not been tampered with or corrupted, and exchanging sensitive information. A lot of hard work went into designing these systems; please use them.
Create layers. Your system will frequently mediate between an internal high-trust subsystem (like a database) and an untrusted set of events (e.g. the internet). Between them you can introduce a variety of layers, each of which can make stricter guarantees about the safety of the edges between events. In the case of a web service:
- TCP/IP can make a reasonable guarantee that a stream is not corrupted.
- The SSL terminator can guarantee (to a good degree) that the stream of bytes you’ve received has not been intercepted or tampered with.
- The HTTP stack on top of it can validate that the stream represents a valid HTTP request.
- Your validation layer can verify that the parameters involved are of the correct type and size.
- An authentication layer can prove that the originating request came from a certain agent.
- An authorization layer can check that the operation requested by that person is allowed.
- An application layer can validate that the request is semantically valid–that it doesn’t write a check for a negative amount, or overflow an internal buffer.
- The operation begins.
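The upper layers in that list could be sketched as a chain of checks, each of which either passes the request inward or rejects it. Everything here (the token table, the permission set, the request shape) is a hypothetical stand-in for illustration; the transport and HTTP layers are assumed to have already done their work.

```python
class Rejected(Exception):
    """Raised by any layer that refuses to pass the request inward."""


# Hypothetical stand-ins for real credential and permission stores.
TOKENS = {"token-alice": "alice"}
PERMISSIONS = {("alice", "write_check")}


def validate_params(request: dict) -> None:
    amount = request.get("amount")
    # bool is a subclass of int, so exclude it explicitly.
    if not isinstance(amount, int) or isinstance(amount, bool):
        raise Rejected("amount must be an integer")


def authenticate(request: dict) -> None:
    user = TOKENS.get(request.get("token"))
    if user is None:
        raise Rejected("unknown agent")
    request["user"] = user


def authorize(request: dict) -> None:
    if (request["user"], request.get("operation")) not in PERMISSIONS:
        raise Rejected("operation not allowed for this user")


def check_semantics(request: dict) -> None:
    if request["amount"] <= 0:
        raise Rejected("a check must be for a positive amount")


# Outermost layer first; later layers may rely on what earlier ones proved.
LAYERS = [validate_params, authenticate, authorize, check_semantics]


def handle(request: dict) -> str:
    for layer in LAYERS:
        layer(request)
    return f"{request['user']} writes a check for {request['amount']}"
```

Each layer only needs to reason about the guarantees established by the layers outside it, which keeps the individual checks small.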
Minimize trust between discrete systems. Don’t relay sensitive information over channels that are insecure. Force other components to perform their own authentication/authorization to obtain sensitive data.
Minimize the surface area for attack. Write less code, and have fewer ways to interact with the system. The fewer pathways are available, the easier they are to reinforce.
Finally, it’s worth writing evil tests to experimentally verify the correctness of your system. Start with the obvious cases and proceed to harder ones. As the complexity grows, probabilistic methods like QuickCheck or fuzz testing can be useful.
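A very small fuzz test might look like the sketch below. It throws random strings at a hypothetical `parse_amount` validator and checks that anything it accepts satisfies the invariant we care about. The validator, the limit, and the input alphabet are all assumptions for this example; a real project would more likely use a QuickCheck-style library.

```python
import random

MAX_AMOUNT_CENTS = 1_000_000  # assumed upper limit for illustration


def parse_amount(raw: str) -> int:
    """Return a positive amount in cents, or raise ValueError."""
    # isascii() guards against non-ASCII digits that int() would accept.
    if not raw.isascii() or not raw.isdigit() or len(raw) > 7:
        raise ValueError("not a plain decimal number")
    amount = int(raw)
    if not 0 < amount <= MAX_AMOUNT_CENTS:
        raise ValueError("amount out of range")
    return amount


def fuzz(trials: int = 10_000, seed: int = 0) -> int:
    """Feed random junk to parse_amount; return how many inputs it accepted."""
    rng = random.Random(seed)  # seeded so that failures are reproducible
    alphabet = "0123456789-+. eE\x00\u0663"
    accepted = 0
    for _ in range(trials):
        raw = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 10)))
        try:
            amount = parse_amount(raw)
        except ValueError:
            continue  # rejecting bad input is the expected outcome
        # Invariant: whatever gets through must be a sane, positive amount.
        assert 0 < amount <= MAX_AMOUNT_CENTS, repr(raw)
        accepted += 1
    return accepted
```

Any other exception type escaping from `parse_amount`, or a failed assertion, points at an input the validator mishandles.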
Remember those layers of security? Your datastore resides at the very center of that. In any application which has shared state, your most trusted, validated, safe data is what goes into the persistence layer. The datastore is the most trusted component. A secure system isolates that trusted zone with layers of intermediary security connecting it to the outside world.
Those layers perform the critical task of validating edges between database events (e.g. store Alice’s changes to her user record) and the world at large (e.g. Alice submits a user update). If your security model is completely open, you can expose the database directly to the internet. Otherwise, you need code to ensure these actions are OK.
The database can do some computation. It is, after all, software. Therefore it can validate some actions. However, the datastore can only discriminate between actions at the level of its abstraction. That can severely limit its potential.
For instance, all datastores can choose to allow or deny connections. However, only relational stores can allow or deny actions on the basis of the existence of related records, as with foreign key constraints. Only column-oriented stores can validate actions on the basis of columns, and so forth.
Your security model probably has rules like “Only allow HR employees to read other employees’ salaries” and “Only let IT remove servers”. These constructs, “HR employees”, “Salaries”, “IT”, “remove”, and “servers” may not map to the datastore’s abstraction. In a key-value store, “remove” can mean “write a copy of a JSON document without a certain entry present”. The key-value store is blind to the contents of the value, and hence cannot enforce any security policies which depend on it.
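To make the mismatch concrete, here is a small sketch in which an in-memory dict stands in for the key-value store. The store only ever sees opaque bytes, so the "Only let IT remove servers" rule has to live in application code. All names here are hypothetical.

```python
import json

# Stand-in for a key-value store: it sees opaque bytes and nothing else.
_store: dict[str, bytes] = {}


def kv_put(key: str, value: bytes) -> None:
    _store[key] = value


def kv_get(key: str) -> bytes:
    return _store[key]


def remove_server(actor_department: str, server: str) -> None:
    # The policy check lives here because the store cannot express it.
    if actor_department != "IT":
        raise PermissionError("only IT may remove servers")
    inventory = json.loads(kv_get("servers"))
    inventory.pop(server, None)
    # From the store's point of view, this "remove" is just another write.
    kv_put("servers", json.dumps(inventory).encode("utf-8"))
```

Anyone who can call `kv_put` directly can bypass the rule entirely, which is why access to the store itself has to be restricted to trusted code.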
In almost every case, your security model will not be embeddable within the datastore, and the datastore cannot enforce it for you. You will need to apply the security model at least partially at a higher level.
Doing this is easy.
Allow only trusted hosts to initiate connections to the database, using firewall rulesets. Usernames and passwords for database connections typically provide little additional security, as they’re stored in dozens of places across the production environment. Relying on these credentials or any authorization policy linked to them (e.g. SQL GRANT) is worthless when you assume your host, or even client software, has been compromised. The attacker will simply read these credentials from disk or off the wire, or exploit active connections in software.
On trusted hosts, between the datastore and the outside world, write the application which enforces your security model. Separate layers into separate processes and separate hosts, where reasonable. Finally, untrusted hosts connect these layers to the internet. You can have as many or as few layers as you like, depending on how strongly you need to guarantee isolation and security.
Putting it all together
Let’s sell storage in Riak to people, over the web. We’ll present the same API as Riak, over HTTP.
Here’s a security model: Only traffic from users with accounts is allowed. Users can only read and write data from their respective buckets, which are transparently assigned on write. Also, users should only be able to issue x requests/second, to prevent them from interfering with other users on the cluster.
We’re going to presuppose the existence of an account service (perhaps Riak, MySQL, whatever) which stores account information, and a bucket service that registers buckets to users.
- Internet. Users connect over HTTPS to an application node.
- The HTTPS server’s SSL acceptor decrypts the message and ensures transport validity.
- The HTTP server validates that the request is in fact valid HTTP.
- The authentication layer examines the HTTP Authorization header for a valid username and password, comparing them to bcrypt-hashed values on the account service.
- The rate limiter checks that this user has not made too many requests recently, and updates the request rate in the account service.
- The Riak validator checks to make sure that the request is a well-formed request to Riak; that it has the appropriate URL structure, accept header, vclock, etc. It constructs a new HTTP request to forward on to Riak.
- The bucket validator checks with the bucket service to see if the bucket to be used is taken. If it is, it verifies that the current authenticated user matches the bucket owner. If it isn’t, it registers the bucket.
- The application node relays the request over the network to a Riak node.
- Riak nodes are allowed by the firewall to talk only to application nodes. The Riak node executes the request and returns a response.
- The response is immediately returned to the client.
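Two of the steps above, the rate limiter and the bucket validator, can be sketched in a few lines. This version keeps its state in memory instead of calling the account and bucket services, and the class and method names are assumptions made for this example.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most max_requests per user within a sliding time window."""

    def __init__(self, max_requests: int, per_seconds: float = 1.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.per_seconds = per_seconds
        self.clock = clock  # injectable so tests can control time
        self.history = defaultdict(deque)  # user -> timestamps of recent requests

    def allow(self, user: str) -> bool:
        now = self.clock()
        recent = self.history[user]
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] >= self.per_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True


class BucketService:
    """Registers each bucket to the first user who writes to it."""

    def __init__(self):
        self.owners = {}  # bucket -> owning user

    def claim_or_check(self, bucket: str, user: str) -> bool:
        # setdefault registers the bucket if it is free, else returns its owner.
        return self.owners.setdefault(bucket, user) == user
```

In a real deployment both pieces of state would need to be shared between application nodes, which is one reason the walkthrough keeps them in separate services.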
Naturally, this only works for certain operations. Mapreduce, for instance, executes code in Riak. Exposing it to the internet is asking for trouble. That’s why we need a Riak validation layer to ensure the request is acceptable; it can allow only puts and gets.
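A validation layer along those lines might be sketched as an allowlist over method and path. The `/riak/<bucket>/<key>` URL shape and the character and length limits below are assumptions for this example; check them against the Riak version you actually run.

```python
import re

ALLOWED_METHODS = {"GET", "PUT"}

# Assumed URL shape: /riak/<bucket>/<key>, with a conservative character set.
KEY_PATH = re.compile(r"/riak/([A-Za-z0-9_-]{1,64})/([A-Za-z0-9_-]{1,128})")


def validate_riak_request(method: str, path: str) -> tuple[str, str]:
    """Return (bucket, key) for an acceptable request, or raise ValueError."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"method not allowed: {method!r}")
    # fullmatch rejects anything extra, such as query strings or other endpoints.
    match = KEY_PATH.fullmatch(path)
    if match is None:
        raise ValueError(f"path not allowed: {path!r}")
    return match.group(1), match.group(2)
```

Because only the listed methods and one path pattern are accepted, a mapreduce request or a key-listing query never reaches Riak; the proxy rejects it before forwarding anything.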
I hope this gives you some idea of how to architect secure applications. Apologies for the shoddy editing–I don’t have time for a second pass right now and wanted to get this out the door. Questions and suggestions in the comments, please! :-)
This seems like a good model. I have a couple of questions related to my limited understanding of Riak.
I’m not sure whether you’ve discouraged or just avoided this possibility above, but would you see any problems with enforcing application-type constraints in pre-commit hooks? For instance, a user might only be able to create a particular type of object if she already had a valid container for that type of object. This would also be enforced in the “app server” level, but I always appreciated the “defense in depth” that relational db triggers could provide in addition to application logic.
I saw the exchange on the riak-users list that led to this post. I don’t necessarily agree with the more extreme sentiments expressed there, but it does seem that Riak could do a bit more, especially in more “open” situations such as on a shared host. I was going to ask if self-signed certificates would be helpful, and then I took a look at the v1.0 vm.args file and saw the -ssl_dist_opt client_certfile and -ssl_dist_opt server_certfile lines. I need to investigate these! Hopefully these allow TLS from outside the riak cluster as well as within it…