Wednesday, July 29, 2009

Self-tuning databases survey

In 2007, Surajit Chaudhuri and Vivek Narasayya received the 10-year best paper award at the VLDB'07 conference for their paper on AutoAdmin, a tool that helps with difficult tasks such as automatic index tuning in databases. They have written a survey of the work done over the last decade on self-tuning databases, with special attention to physical design. It is a very interesting paper if you want to know what has been done on self-tuning databases so far.

Self-Tuning Database Systems: A Decade of Progress

Thursday, July 23, 2009

Alternatives to relational databases

At a brainstorm meeting with one of my colleagues we briefly discussed the way in which data is being stored in the cloud. Since you’re not sure of where data is being stored, it’s difficult to define relations. Most web applications nowadays use relational databases to store their data, which means it would require a significant data model change if relational data can no longer be used when they move to the cloud.

Almost all current database systems are relational: Oracle, MySQL, SQL Server, etc. One problem with relational databases is that they do not map well onto the object-oriented programming model. In fact, 40% of total effort often goes into writing and maintaining data access code. Another problem with relational databases is that they are difficult to scale, because of the defined relations. In this blog post I present some alternatives to relational databases.

Alternative 1: Key/value database

In a key/value database, key/value pairs are stored in a domain, which is a bucket of data. These buckets have no relation to each other, nor does the data in them. Instead of defining relations, data is duplicated so that queries can be executed easily. It is important to realize that this may result in data inconsistency and in a need for more storage capacity. Because no relations are defined between data, the responsibility for data integrity falls to the application, e.g. if a customer is removed, it is up to the application to remove all the orders made by that customer. Key/value databases are currently the preferred approach in cloud applications, because they are very easily scalable and have no relations defined within them. Because key/value databases are therefore usually multi-tenanted, some sort of mechanism is required to make sure a user can’t overload the system. This is currently done by either limiting the maximum execution time of a query (Amazon SimpleDB) or the number of rows returned by a query (Google AppEngine Datastore).
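
To make the integrity point a bit more concrete, here is a minimal sketch in Python. It is a toy in-memory store, not the API of SimpleDB or the AppEngine Datastore; the class KeyValueStore, the function remove_customer and the domain names "customers" and "orders" are all made up for illustration.

```python
from collections import defaultdict


class KeyValueStore:
    """Toy in-memory key/value store (hypothetical, for illustration only).

    Data lives in independent domains (buckets); the store knows nothing
    about relations between them.
    """

    def __init__(self):
        self._domains = defaultdict(dict)

    def put(self, domain, key, value):
        self._domains[domain][key] = value

    def get(self, domain, key):
        return self._domains[domain].get(key)

    def delete(self, domain, key):
        self._domains[domain].pop(key, None)

    def items(self, domain):
        # Return a copy so callers can delete while iterating.
        return list(self._domains[domain].items())


def remove_customer(store, customer_id):
    """Application-level integrity: the store will not cascade the delete."""
    for order_id, order in store.items("orders"):
        if order["customer_id"] == customer_id:
            store.delete("orders", order_id)
    store.delete("customers", customer_id)


store = KeyValueStore()
store.put("customers", "c1", {"name": "Alice"})
# The customer name is duplicated into the order instead of being joined in.
store.put("orders", "o1", {"customer_id": "c1", "customer_name": "Alice", "total": 10})
store.put("orders", "o2", {"customer_id": "c2", "customer_name": "Bob", "total": 5})

remove_customer(store, "c1")
```

If the application forgot the loop in remove_customer, order o1 would stay behind with a customer name that no longer exists anywhere else, which is exactly the kind of inconsistency described above.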

Alternative 2: Object-oriented database

In an object-oriented database objects are sent to the database, which stores them as objects. The main advantage of an OO database is that there is no need to write a large amount of data access code, as the database is 1:1 with the code. The downside is that aggregate queries are much more expensive than in SQL.

Alternative 3: XML database

XML databases are very good at storing hierarchical data. The downside is that they are slower to query than SQL databases, especially for aggregate functions.

Alternative 4: Document-oriented database

Document-oriented databases are suitable for applications which are document-oriented (which are in fact most current web applications). They are schema-free, which leads to a very flexible database. The main advantage is that they are much simpler than relational databases, and that they are very scalable. Another advantage is that document-oriented databases have a mature implementation, Apache CouchDB. CouchDB is a distributed database system which offers mechanisms for replication and synchronization (scalability), and uses JavaScript to create views of data.

Conclusion

While trying to achieve a scalable database system, it is important to not lose track of the highest requirement: functionality. If we have a perfectly scalable database, but we can’t perform the queries we want, what good is it?

Non-relational databases appear to be useful in less complex web applications, as they seem to lack (or are very slow with) some advanced functionality such as aggregate queries. I do not see a good alternative yet for relational databases concerning this type of query. Especially in complex applications, with functionality such as report generation, relational databases still seem to be the best approach.

Wednesday, July 22, 2009

Database partitioning

Multi-tenant applications rely on database storage to serve their customers. As the number of customers grows, so will the amount of data used by the application. Eventually, this will lead to storage issues as the database is being filled with data. A possible way to cope with these storage issues is to partition the database. Partitioning the database results in two (or more) smaller database partitions which can be stored in different locations. Partitioning can be done in two ways:

  • Horizontal partitioning - The database is split by rows. Every partition keeps the same schema, and a performance boost can be achieved by creating a local index for each partition, which is smaller than the global index.
  • Vertical partitioning - This is done by extracting columns from the database. The advantage is that the database becomes smaller, which leads to faster read times since more rows fit into memory at the same time. Unfortunately, it may become necessary to combine partitions using a JOIN operation.
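
The two partitioning styles can be sketched on plain Python dictionaries. This is only an illustration under my own assumptions: rows are dicts, horizontal partitions are chosen by hashing a tenant_id column, and the function names are hypothetical. A real DBMS does this for you with its own partitioning syntax.

```python
import zlib


def partition_for(tenant_id, partition_count):
    """Pick a horizontal partition by hashing the partitioning key."""
    return zlib.crc32(tenant_id.encode("utf-8")) % partition_count


def split_horizontally(rows, partition_count):
    """Split by rows: every partition keeps the full schema."""
    partitions = [[] for _ in range(partition_count)]
    for row in rows:
        partitions[partition_for(row["tenant_id"], partition_count)].append(row)
    return partitions


def split_vertically(rows, key, hot_columns):
    """Split by columns: frequently read columns go into a narrow 'hot' part."""
    hot = [{c: row[c] for c in (key, *hot_columns)} for row in rows]
    cold = [{c: v for c, v in row.items() if c not in hot_columns} for row in rows]
    return hot, cold


def join_on_key(hot, cold, key):
    """Recombine vertical partitions: the JOIN you may end up needing."""
    cold_by_key = {row[key]: row for row in cold}
    return [{**cold_by_key[row[key]], **row} for row in hot]


rows = [
    {"id": 1, "tenant_id": "acme", "name": "invoice A", "notes": "long text ..."},
    {"id": 2, "tenant_id": "globex", "name": "invoice B", "notes": "long text ..."},
    {"id": 3, "tenant_id": "acme", "name": "invoice C", "notes": "long text ..."},
]
by_tenant = split_horizontally(rows, partition_count=2)
hot, cold = split_vertically(rows, key="id", hot_columns=["name"])
```

Note that both vertical parts keep the key column, which is what makes the join back together possible.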

Partitioning can improve the performance of your database, but only if it is done correctly. It is also important to consider the nature of your data. If it is very dynamic, it is possible that your database needs repartitioning after a certain time (dynamic partitioning).

Most large database management systems offer partitioning functionality. See the following links for more information:

- MySQL :: Improving Database Performance with Partitioning
- Partitioned Tables and Indexes in SQL Server 2005
- Oracle partitioning


Wednesday, July 15, 2009

Salesforce.com gives insight on multi-tenant architecture

Salesforce.com is one of the largest multi-tenant platforms out there. Lately, they have been giving insight into how their database architecture works. Check out the paper The design of the force.com multitenant internet application development platform (for subscribed ACM members only) and the (excellent!) presentation by Craig Weissman, Chief Software Architect at Salesforce.com.

Heap database
In a nutshell, what Salesforce.com does is place all user data in a heap database. Using tenant-specific metadata, virtual database tables can be defined. Imagine we are implementing a stamp collection application for Salesforce.com. The heap could look like:


In this case, the metadata would describe that the columns val0 and val500 for tenant 123 contain the country and price of a stamp.
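
As a rough sketch of how such a heap and its metadata could work together, here is a small Python example using sqlite3. It is my own simplified reading, not Salesforce.com's actual schema: the table names heap and metadata, the decision to store every value as text, and the function read_virtual_table are assumptions, and I only create the two value columns mentioned above.

```python
import sqlite3
from decimal import Decimal

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name

# One shared heap for all tenants; the value columns are untyped text (assumption).
conn.execute(
    "CREATE TABLE heap (tenant_id INTEGER, object_id INTEGER, val0 TEXT, val500 TEXT)"
)
# Tenant-specific metadata says what each heap column means for that tenant.
conn.execute(
    "CREATE TABLE metadata ("
    "tenant_id INTEGER, heap_column TEXT, field_name TEXT, field_type TEXT)"
)
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?, ?)",
    [(123, "val0", "country", "text"), (123, "val500", "price", "decimal")],
)
conn.executemany(
    "INSERT INTO heap VALUES (?, ?, ?, ?)",
    [
        (123, 1, "Netherlands", "0.44"),
        (123, 2, "Belgium", "0.59"),
        (456, 1, "2009-07-15", "open"),  # another tenant, other meaning
    ],
)

CASTS = {"text": str, "decimal": Decimal}


def read_virtual_table(conn, tenant_id):
    """Rebuild a tenant's virtual table from the heap using its metadata."""
    fields = conn.execute(
        "SELECT heap_column, field_name, field_type FROM metadata WHERE tenant_id = ?",
        (tenant_id,),
    ).fetchall()
    rows = conn.execute(
        "SELECT * FROM heap WHERE tenant_id = ? ORDER BY object_id", (tenant_id,)
    ).fetchall()
    return [
        {f["field_name"]: CASTS[f["field_type"]](row[f["heap_column"]]) for f in fields}
        for row in rows
    ]


stamps = read_virtual_table(conn, 123)
```

For tenant 123 the heap columns val0 and val500 come back as country and price; for tenant 456 the very same columns hold something else entirely.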

Indexing
Since all columns may contain different types of data for each tenant, creating indexes on the heap makes no sense. Therefore, Salesforce.com creates tenant-specific indexes by copying the (small) parts of data which require indexing per tenant. Although this sounds like a good way of allowing indexing in a multi-tenant application, I wonder about the performance penalty of having so many small indexes.
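
One way to picture this copying is a separate, narrow table that holds only the values that need indexing, keyed by tenant. The sketch below is an assumption on my part (the table tenant_index, the index idx_tenant_value and both functions are hypothetical) and it uses a single shared index table rather than one physical index per tenant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE heap (tenant_id INTEGER, object_id INTEGER, val0 TEXT, val500 TEXT)"
)
# Copies of the values that need indexing, stored per tenant and per field.
conn.execute(
    "CREATE TABLE tenant_index ("
    "tenant_id INTEGER, field_name TEXT, value TEXT, object_id INTEGER)"
)
conn.execute(
    "CREATE INDEX idx_tenant_value ON tenant_index (tenant_id, field_name, value)"
)


def insert_stamp(conn, tenant_id, object_id, country, price):
    """Write the heap row and copy the indexed value in one transaction."""
    with conn:
        conn.execute(
            "INSERT INTO heap VALUES (?, ?, ?, ?)",
            (tenant_id, object_id, country, price),
        )
        conn.execute(
            "INSERT INTO tenant_index VALUES (?, 'country', ?, ?)",
            (tenant_id, country, object_id),
        )


def find_by_country(conn, tenant_id, country):
    """Look up via the small index table, then fetch the matching heap rows."""
    return conn.execute(
        "SELECT h.object_id, h.val0, h.val500 "
        "FROM tenant_index AS i "
        "JOIN heap AS h ON h.tenant_id = i.tenant_id AND h.object_id = i.object_id "
        "WHERE i.tenant_id = ? AND i.field_name = 'country' AND i.value = ? "
        "ORDER BY h.object_id",
        (tenant_id, country),
    ).fetchall()


insert_stamp(conn, 123, 1, "Netherlands", "0.44")
insert_stamp(conn, 123, 2, "Belgium", "0.59")
insert_stamp(conn, 123, 3, "Netherlands", "0.77")
dutch_stamps = find_by_country(conn, 123, "Netherlands")
```

The price of this design is visible in insert_stamp: every write to an indexed field becomes two writes that have to stay in sync.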

It is very nice to see that larger companies like Salesforce.com are beginning to open up and publish more details about their architecture. It is very cool and useful to learn from industrial cases like this one!

Cost advantages of multi-tenancy

I read an interesting article on the cost advantages of multi-tenancy over single-tenant applications, Multitenancy Can Have a 16:1 Cost Advantage Over Single-Tenant. The author also gives a nice, clear definition of multi-tenancy:
"Multitenancy is the ability to run multiple customers on a single software instance installed on multiple servers to increase resource utilization by allowing load balancing among tenants, and to reduce operational complexity and cost in managing the software to deliver the service. Tenants on a multitenant system can operate as though they have an instance of the software entirely to themselves which is completely secure and insulated from any impact by other tenants."
In the rest of the article, examples of cost savings from implementing multi-tenant solutions are given. Finally, the author indicates that providing quality software is much more important for companies offering SaaS solutions, as they are paid on a monthly basis rather than receiving a large amount of money when their software is sold, as is the case with traditional software companies.

To conclude, I would like to quote the author with a statement which is very true:
"Focusing on quality actually lowers costs."
So, let's develop our software with the highest quality standards possible, and save some money in these financially hard times!

Tuesday, July 14, 2009

Best practices for multi-tenant applications

On Designing and Deploying Internet-Scale Services is an interesting article in which Hamilton, who was doing research on the Windows Live Platform when he wrote it, describes best practices for designing and deploying internet-scale services, such as Hotmail. The recommendations are made with the goal of optimizing the cost of operations, but I believe they can improve the quality of your software, including multi-tenant applications, in general.

Hamilton points out three design principles:

  1. Expect failures. Hardware and software will fail, so better be prepared, otherwise your complete system will crash.
  2. Keep it simple. Simple installation and maintenance procedures will result in fewer mistakes. Also, keep dependencies as simple as possible to make sure that e.g. replacing a server is easy.
  3. Automate everything. Staff is expensive and will make mistakes. An automated process can be tested and is repeatable.

The rest of the paper is mostly a list of best practices with a short description. I will give a short overview of what I believe to be the most important ones in a multi-tenancy environment. I encourage you all to read the full paper, as it contains excellent recommendations for developing web applications in general.

Application design
  • Develop in a complete environment. Although unit testing is essential, make sure that your component works in the complete system.
  • Zero trust of underlying components. Always validate input, as the component it came from may not have done this. Never trust that another component does the validation for you: better safe than sorry.
  • Do not duplicate functionality in different components. Code duplication will result in more difficult maintenance.
  • Understand access patterns. Improve your application by understanding how it is being used; e.g. by improving paths for shorter latency.
  • Version everything. Without versioning it is impossible to keep track of which features have been added or removed.
  • Avoid single points of failure. When a single point of failure stops working, your whole system may fail. Therefore, always use redundancy and replication.
Automatic management
  • Be restartable. If a service cannot restart when it is in faulted state, the whole system needs to be restarted.
  • Keep deployment simple. The simpler the deployment process, the simpler it is to automate.
Dependency management
  • Use stable, proven components. Make sure you use components which are reliable. Alpha and beta components may contain many errors, which may cause your system to crash.
Release management
  • Allow rollbacks to previous versions. Necessary in case of an error in an update.
  • Monitor and instrument everything, and give enough fault information for diagnosis. Save more information than e.g. just 'A query has failed'. Save query, time, error message and if possible the state of the application.
  • Make everything configurable. Make diagnosis options configurable, rather than adding them when a system is failing. Adding monitoring to a failing system is asking for problems.
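
To illustrate the point about fault information, here is a small Python sketch of what saving more than 'A query has failed' could look like. The logger name, the function names and the shape of the fault record are my own assumptions, not something taken from Hamilton's paper.

```python
import json
import logging
import sqlite3
from datetime import datetime, timezone

logger = logging.getLogger("app.diagnostics")  # hypothetical logger name


def build_fault_record(sql, params, exc, app_state):
    """Collect the query, the time, the error message and the application state."""
    return {
        "time": datetime.now(timezone.utc).isoformat(),
        "query": sql,
        "params": list(params),
        "error": str(exc),
        "state": app_state or {},
    }


def run_query(conn, sql, params=(), app_state=None):
    """Run a query; on failure, log a full fault record and re-raise."""
    try:
        return conn.execute(sql, params).fetchall()
    except sqlite3.Error as exc:
        record = build_fault_record(sql, params, exc, app_state)
        # default=str keeps logging from failing on values JSON cannot encode.
        logger.error(json.dumps(record, default=str))
        raise
```

Because the record is built in one place, how much of it is written out can later be driven by configuration instead of by a code change on a failing system.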
On Designing and Deploying Internet-Scale Services contains many more of these recommendations. What are your recommendations for building high quality software?

Monday, July 13, 2009

What does this mean? Part I

In a meeting with colleagues I discovered that there was some slight confusion over the terminology used in multi-tenant systems. Therefore, I will give a short description of the differences between the terms which I regularly use and may cause confusion.

Multi-user vs single-user
A single-user application allows one user to use the application. Examples of this are most desktop applications, which allow customization (e.g. MS Word) for one user only. Multi-user applications allow multiple users to concurrently use the application, for example a system which would let you log in. An advantage of multi-user systems is that they do not require a new instance for each application user.

Multi-tenant vs single-tenant
In a single-tenant system, all users run their own application and database instance. In a multi-tenant system this instance is shared. I also refer to my introductory post on multi-tenancy.

Multi-tenant vs multi-user
Any system may have multiple users. In a multi-user system multiple users can use the application (e.g. Exact Synergy). The term multi-user does not imply anything for the architecture of the system. On the other hand, while a multi-tenant system is a multi-user system, multi-tenancy tells us something about the architecture of the system: namely that multiple users share the same application and database instance. Note that it is possible to have a multi-user system, which is not multi-tenant.

Web service vs business service
A business service is a service provided by a company, with which they (usually) make money. Examples are cooking lunch, offering a financial application and driving a cab. A web service is a method of accessing a specific application. Examples of a web service are generating a token, returning the server time and storing data in a database. The web service layer is implemented between the application and the user, and offers a user the possibility to easily integrate an application without deep knowledge of its architecture. For example, if I wanted to integrate eBay auctions into my application, I would use their web service to communicate with the eBay platform rather than finding out how their platform is implemented (also, they would probably not let me!).

If any other terms (or even this explanation!) confuse you, please let me know!

Friday, July 10, 2009

Benefits and disadvantages of multi-tenancy

One of the main advantages of an ideal multi-tenant application is the operational benefit. Because all application code is in one place, it is much easier (and cheaper!) to maintain, update and back up the application and its data.

Another advantage of multi-tenancy is the lower system requirements. Because an application and database are shared by multiple clients, it is not necessary to have a dedicated server for every client. This is a clear improvement in resource utilization.

Because multiple clients share a server, scalability may become a problem. In single-tenant applications, all clients have their own resources and whenever a new client wants to use the application, resources are added. In multi-tenant applications, all clients share the same resources and it is possible that, at some point, these resources become overloaded. How quickly this point is reached depends, among other things, on the database implementation of the application (see 'Supporting Database Applications as a Service'). One can define roughly three implementations of multi-tenancy databases:
  • Independent database, independent database instances (IDII)
  • Independent tables, shared database instances (ITSI)
  • Shared tables, shared database instances (STSI)
Clearly, IDII is not a real multi-tenant database approach. However, it is one that is quite often used as it is very easy to implement. The obvious downside of this approach is that it is very heavy on resources: for example, starting a MySQL database requires about 30 MB of memory. When multiple instances are started, the system will run out of memory quickly.

ITSI is a semi-multi-tenant solution, in which all clients use the same database, but each has its own tables. This approach suffers from the same problem as IDII, although it takes longer before the limits are reached, as a table instance requires less memory than a database instance.

With regard to resources, the ideal solution is STSI. All tenants are in the same table and retrieve their records using the SQL syntax 'SELECT .... WHERE tenant_id = xxx'.
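
A minimal sketch of the STSI idea in Python with sqlite3 could look as follows. The invoices table, its columns and the function invoices_for are hypothetical; the only part taken from the text above is that every query is filtered on tenant_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One shared table for all tenants; tenant_id is part of the primary key.
conn.execute(
    "CREATE TABLE invoices ("
    "tenant_id INTEGER NOT NULL, invoice_id INTEGER NOT NULL, amount REAL, "
    "PRIMARY KEY (tenant_id, invoice_id))"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [(1, 1, 100.0), (1, 2, 250.0), (2, 1, 75.0)],
)


def invoices_for(conn, tenant_id):
    """Every read goes through the tenant filter, passed as a bound parameter."""
    return conn.execute(
        "SELECT invoice_id, amount FROM invoices "
        "WHERE tenant_id = ? ORDER BY invoice_id",
        (tenant_id,),
    ).fetchall()


tenant_one = invoices_for(conn, 1)
tenant_two = invoices_for(conn, 2)
```

Funnelling all access through a function like invoices_for means there is a single place where the tenant filter can be forgotten, which already hints at the isolation problem.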

The problem with the STSI approach can be described as an isolation problem. Because application and database are shared, it is important that tenants are isolated from each other regarding security, customization, performance, etc. In a future blog post I will discuss ways to isolate tenants from each other in more detail.

Wednesday, July 8, 2009

What is multi-tenancy?

Back in the day, when computers still had CD-ROM drives and BSODs were daily routine (though some may argue that this is still the case), the software industry was quite different from today. Software was sold on CD-ROMs (those round, shiny things), from which the software could be installed. After this era, fast Internet access became available and CD-ROMs slowly disappeared. Now, a customer would download software and install it on their computer. This type of software, which is installed on the customer's computer, is also known as on-premises software.

Although this system worked, there are some downsides to it. An example is the upgrading process: since all software copies are on different computers, a software upgrade must be pushed to all these machines.

Wouldn't it be cool if we could perform the update in one place instead?

This is exactly what multi-tenancy is about. Multi-tenancy allows multiple users (tenants) to use an application which runs on the same computer. This can be done in two ways:

  • Multiple instances
  • Shared instance

The multiple instances pattern runs an application instance for each user, for example by using virtual machines. The obvious downside of this is the resource requirement, as each instance requires allocated memory. Since this pattern is in fact single-tenancy, I will not discuss it further.

The shared instance pattern shares the application instance and database amongst multiple users (see figure). This means that we only have to upgrade one instance - imagine how much we could save on costs and time! Of course this isn't the only advantage of running a multi-tenancy application over a single-tenancy one. I will discuss the advantages (and disadvantages) of native multi-tenancy applications in a future blog post.

First post

Hi, my name is Cor-Paul and I am 24 years old, living in the Netherlands. I just started my PhD research on multi-tenancy and decided to maintain a blog to let you know what I’m up to, and to (hopefully) receive some feedback on the work I’ve been doing. As I have just started doing research, there isn’t much on this blog yet. But be sure to follow it as many interesting articles will follow! :)