Wednesday, October 14, 2009

Multi-tenancy explained from a non-technical perspective

Recently, I had to give a short presentation about the topic of my PhD research to a group of people with no background in software engineering. I decided to explain multi-tenancy using a metaphor. As I have noticed many people do not understand exactly what multi-tenancy and its benefits are, I’ll elaborate on the metaphor in this blog post.

Imagine you are required to provide housing for a number of people, e.g. because you have a number of foreign employees currently working in your country. The first option to provide housing is to rent a complete house to everyone:

Although this works, there are some disadvantages to this situation:
  1. Costs: As every employee is required to rent a complete house, the costs are very high.

  2. Resource utilization: For most clients, a complete house is too large to use entirely, which results in many spare rooms.

  3. Maintenance: In order to do maintenance, e.g. replace the alarm system due to a bug, a maintenance guy (or girl) must visit all houses and replace the system in every one of them.

The second option is to rent apartments to everyone, rather than complete houses:

This situation does not have the disadvantages of option 1:

  1. Costs: Renting an apartment is much cheaper than renting a house.

  2. Resource utilization: As the rented space is smaller, it is more likely that the space better fits the needs of the user. This leads to fewer spare rooms.

  3. Maintenance: As all apartments are in the same location, maintenance can be performed more easily (and therefore more cheaply).

Note that both a decrease in resource utilization and an increase in the maintenance complexity/effort also cause the costs to increase.

Although this example is about housing, the same ideas apply to software. In many ‘traditional’ situations, software is installed on the client side, either on the customer's desktop or on a dedicated server:

For many customers, this is not an ideal solution, especially not for small companies:
  1. Cost: This solution requires a large investment, due to the requirement of an application server, database server, etc.

  2. Resource utilization: Many small businesses do their administrative tasks perhaps once a week, which means that their servers are idle during the rest of the week.

  3. Maintenance: For every software upgrade, all servers/installations must be upgraded individually.

As you can see, this approach is not very efficient for smaller companies. Luckily, we can apply the apartment principle to software as well. By letting multiple customers share the same application and database server, we can achieve the same benefits:

  1. Cost: As resource costs are much lower, software (or rather: a service, SaaS) can be offered to the customer at a much lower price.

  2. Resource utilization: All customers use the same application and database instance, which results in high utilization of these instances.

  3. Maintenance: All upgrades must be applied to one instance only, which results in lower maintenance costs.

In software engineering terms, we call this situation multi-tenancy:


Unfortunately, multi-tenancy also introduces some new problems and emphasizes some existing ones. I’ll elaborate on these problems in a future blog post.

Wednesday, September 16, 2009

Early death of a blog?

Definitely not, but I've been very busy with holidays, a conference visit (ESEC/FSE'09) and studying ASP.NET. So, this is just a short note to let you know I'll keep updating this blog, hopefully more often than in the last 1.5 months :)

Wednesday, July 29, 2009

Self-tuning databases survey

In 2007, Surajit Chaudhuri and Vivek Narasayya received the 10-year best paper award at the VLDB'07 conference for their paper on AutoAdmin, a tool for helping with difficult tasks such as automatic index tuning in databases. They have written a survey of the work that has been done over the last decade on self-tuning databases, with special attention to physical design. A very interesting paper if you want to know what has been done on self-tuning databases so far.

Self-Tuning Database Systems: A Decade of Progress

Thursday, July 23, 2009

Alternatives to relational databases

At a brainstorm meeting with one of my colleagues we briefly discussed the way in which data is being stored in the cloud. Since you’re not sure of where data is being stored, it’s difficult to define relations. Most web applications nowadays use relational databases to store their data, which means it would require a significant data model change if relational data can no longer be used when they move to the cloud.

Almost all current database systems are relational: Oracle, MySQL, SQL Server, etc. A problem of relational databases is that they do not map naturally onto the object-oriented programming model. In fact, 40% of total effort often goes into writing and maintaining data access code. Another problem with relational databases is that they are difficult to scale, because of the defined relations. In this blog post I present some alternatives to relational databases.

Alternative 1: Key/value database

In a key/value database, key/value pairs are stored in a domain, which is a bucket of data. These buckets have no relation to each other, nor does the data in them. Instead of defining relations, data is duplicated so that queries can be executed easily. It is important to realize that this may result in data inconsistency and a need for more storage capacity. Because no relations are defined between data, the responsibility for data integrity falls to the application, e.g. if a customer is removed, it is up to the application to remove all the orders made by that customer. Key/value databases are currently the preferred option for cloud applications, because they scale very easily and have no relations defined within them. Because these databases are usually shared by multiple tenants, some sort of mechanism is required to make sure a single user can’t overload the system. This is currently done by either limiting the maximum execution time of a query (Amazon SimpleDB) or the number of rows returned by a query (Google AppEngine Datastore).
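To make the integrity point concrete, here is a minimal sketch in Python that uses plain dictionaries as a stand-in for two domains. The domain names, keys and the remove_customer helper are hypothetical and not taken from any real key/value store; the point is only that the cleanup of related data has to be written by hand.

```python
# Two independent "domains" (buckets). The store itself knows nothing about
# how they relate; all names and data here are made up for illustration.
customers = {"c1": {"name": "Alice"}, "c2": {"name": "Bob"}}

# The customer name is duplicated into every order so that orders can be
# listed without a join.
orders = {
    "o1": {"customer_id": "c1", "customer_name": "Alice", "total": 30},
    "o2": {"customer_id": "c1", "customer_name": "Alice", "total": 12},
    "o3": {"customer_id": "c2", "customer_name": "Bob", "total": 8},
}


def remove_customer(customer_id):
    """Delete a customer and, by hand, every order that refers to it."""
    customers.pop(customer_id, None)
    orphaned = [key for key, order in orders.items()
                if order["customer_id"] == customer_id]
    for key in orphaned:
        del orders[key]
    return len(orphaned)


removed = remove_customer("c1")
```

In a relational database a foreign key with a cascading delete would do this work; here, forgetting the loop would silently leave orphaned orders behind.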

Alternative 2: Object-oriented database

In an object-oriented database objects are sent to the database, which stores them as objects. The main advantage of an OO database is that there is no need to write a large amount of data access code, as the database is 1:1 with the code. The downside is that aggregate queries are much more expensive than in SQL.

Alternative 3: XML database

XML databases are very good at storing hierarchical data. The downside is that they are slower to query than SQL databases, especially for aggregate functions.

Alternative 4: Document-oriented database

Document-oriented databases are suitable for applications which are document-oriented (which are in fact most current web applications). They are schema-free, which leads to a very flexible database. The main advantage is that they are much simpler than relational databases, and that they are very scalable. Another advantage is that document-oriented databases have a mature implementation, Apache CouchDB. CouchDB is a distributed database system which offers mechanisms for replication and synchronization (scalability), and uses JavaScript to create views of data.
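The idea of creating views over schema-free documents can be sketched roughly as follows. CouchDB itself expresses views as JavaScript functions; the Python below only mimics the concept of a map step followed by a reduce step, and the documents, the build_view helper and all field names are assumptions made for illustration.

```python
from collections import defaultdict

# Schema-free documents: each one may carry different fields.
docs = [
    {"_id": "1", "type": "invoice", "customer": "acme", "amount": 120},
    {"_id": "2", "type": "invoice", "customer": "acme", "amount": 80},
    {"_id": "3", "type": "note", "text": "call back on Monday"},
    {"_id": "4", "type": "invoice", "customer": "globex", "amount": 50},
]


def map_invoice(doc):
    """Emit (customer, amount) for invoices and nothing for other documents."""
    if doc.get("type") == "invoice":
        yield doc["customer"], doc["amount"]


def build_view(documents, map_fn, reduce_fn):
    """Group the emitted values by key and reduce each group to one value."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


totals = build_view(docs, map_invoice, sum)
```

Because the map function decides which documents it cares about, documents with a different shape (the note above) simply do not show up in the view.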

Conclusion

While trying to achieve a scalable database system, it is important to not lose track of the highest requirement: functionality. If we have a perfectly scalable database, but we can’t perform the queries we want, what good is it?

Non-relational databases appear to be useful in less complex web applications, as they seem to lack (or are very slow with) some advanced functionality such as aggregate queries. I do not see a good alternative yet for relational databases concerning this type of query. Especially in complex applications, with functionality such as report generation, relational databases still seem to be the best approach.

Wednesday, July 22, 2009

Database partitioning

Multi-tenant applications rely on database storage to serve their customers. As the number of customers grows, so will the amount of data used by the application. Eventually, this will lead to storage issues as the database is being filled with data. A possible way to cope with these storage issues is to partition the database. Partitioning the database results in two (or more) smaller database partitions which can be stored in different locations. Partitioning can be done in two ways:

  • Horizontal partitioning - This is done by splitting the database by rows. Every partition keeps the same schema, and a performance boost can be achieved by creating local indexes for each partition, which are smaller than the global index.
  • Vertical partitioning - This is done by extracting columns from the database. The advantage is that the database becomes smaller, which leads to faster read times since more rows fit into memory at the same time. Unfortunately, it may become necessary to combine partitions using a JOIN operation.
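Both ways of partitioning can be sketched on a small in-memory table. This is only an illustration in Python: the rows, the column names and the choice to partition horizontally on the tenant number modulo the number of partitions are all assumptions, and real database systems offer their own partitioning schemes.

```python
rows = [
    {"id": 1, "tenant": 10, "name": "Alice", "notes": "a"},
    {"id": 2, "tenant": 11, "name": "Bob", "notes": "b"},
    {"id": 3, "tenant": 10, "name": "Carol", "notes": "c"},
    {"id": 4, "tenant": 12, "name": "Dave", "notes": "d"},
]


def horizontal_partition(table, n):
    """Split by rows: every partition keeps the full schema."""
    partitions = [[] for _ in range(n)]
    for row in table:
        partitions[row["tenant"] % n].append(row)
    return partitions


def vertical_partition(table, key, extracted):
    """Split by columns: move the extracted columns to a side table."""
    main = [{k: v for k, v in row.items() if k not in extracted}
            for row in table]
    side = [{k: v for k, v in row.items() if k == key or k in extracted}
            for row in table]
    return main, side


def join_partitions(main, side, key):
    """Recombine vertical partitions, i.e. the JOIN mentioned above."""
    side_by_key = {row[key]: row for row in side}
    return [{**row, **side_by_key[row[key]]} for row in main]


by_rows = horizontal_partition(rows, 2)
main, side = vertical_partition(rows, "id", {"notes"})
rejoined = join_partitions(main, side, "id")
```

Reading only main is cheaper because the rows are narrower, but any query that needs the notes column pays for the join.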

Partitioning can improve the performance of your database, but only if it is done correctly. It is also important to consider the nature of your data. If it is very dynamic, it is possible that your database needs repartitioning after a certain time (dynamic partitioning).

Most large database management systems offer partitioning functionality. See the following links for more information:

- MySQL :: Improving Database Performance with Partitioning
- Partitioned Tables and Indexes in SQL Server 2005
- Oracle partitioning


Wednesday, July 15, 2009

Salesforce.com gives insight on multi-tenant architecture

Salesforce.com is one of the largest multi-tenant platforms out there. Lately, they have been giving some insight into how their database architecture works. Check out the paper The design of the force.com multitenant internet application development platform (for subscribed ACM members only) and the (excellent!) presentation by Craig Weissman, Chief Software Architect at Salesforce.com.

Heap database
In a nutshell, what Salesforce.com does is place all user data in a heap database. Using tenant-specific metadata, virtual database tables can be defined. Imagine we are implementing a stamp collection application for Salesforce.com. The heap could look like:


In this case, the metadata would describe that the columns val0 and val500 for tenant 123 contain the country and price of a stamp.
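A rough sketch of how metadata can turn a shared heap into a tenant-specific virtual table is given below. It is not Salesforce.com's actual implementation: the sample rows, the second tenant (456), the storage of all values as strings and the virtual_table helper are assumptions; only the mapping of val0 and val500 to country and price for tenant 123 comes from the description above.

```python
# One shared heap for all tenants; the generic columns mean something
# different for each tenant. Values are stored as strings in this sketch.
heap = [
    {"tenant_id": 123, "val0": "Netherlands", "val500": "0.44"},
    {"tenant_id": 456, "val0": "2009-07-15", "val500": "open"},
    {"tenant_id": 123, "val0": "Belgium", "val500": "0.59"},
]

# Tenant-specific metadata: heap column -> (logical column name, type).
metadata = {
    123: {"val0": ("country", str), "val500": ("price", float)},
    456: {"val0": ("due_date", str), "val500": ("status", str)},
}


def virtual_table(tenant_id):
    """Return the tenant's rows with logical column names and typed values."""
    columns = metadata[tenant_id]
    return [
        {name: cast(row[column]) for column, (name, cast) in columns.items()}
        for row in heap
        if row["tenant_id"] == tenant_id
    ]


stamps = virtual_table(123)
```

The tenant only ever sees country and price; the generic column names stay an internal detail of the heap.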

Indexing
Since all columns may contain different types of data for each tenant, creating indexes on the heap makes no sense. Therefore, Salesforce.com creates tenant-specific indexes by copying the (small) parts of data which require indexing per tenant. Although this sounds like a good way of allowing indexing in a multi-tenant application, I wonder about the performance penalty of having so many small indexes.
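The idea of a tenant-specific index can be illustrated with a small Python sketch that copies the values of one heap column, for one tenant only, into a separate lookup structure. The row identifiers, the sample data and the build_tenant_index helper are hypothetical, and the real mechanism is certainly more involved than this.

```python
from collections import defaultdict

heap = [
    {"row_id": 1, "tenant_id": 123, "val0": "Netherlands"},
    {"row_id": 2, "tenant_id": 456, "val0": "2009-07-15"},
    {"row_id": 3, "tenant_id": 123, "val0": "Belgium"},
    {"row_id": 4, "tenant_id": 123, "val0": "Netherlands"},
]


def build_tenant_index(tenant_id, column):
    """Copy one tenant's values for one column into a value -> row ids map."""
    index = defaultdict(list)
    for row in heap:
        if row["tenant_id"] == tenant_id:
            index[row[column]].append(row["row_id"])
    return dict(index)


country_index = build_tenant_index(123, "val0")
```

Each such index stays small because it only covers one tenant's data, but every tenant and every indexed column adds another structure that must be kept in sync with the heap.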

It is very nice to see that larger companies like Salesforce.com are beginning to open up and publish more details about their architecture. It is very cool and useful to learn from industrial cases like this one!

Cost advantages of multi-tenancy

I read an interesting article on the cost advantages of multi-tenancy over single-tenant applications, Multitenancy Can Have a 16:1 Cost Advantage Over Single-Tenant. The author also gives a nice, clear definition of multi-tenancy:
"Multitenancy is the ability to run multiple customers on a single software instance installed on multiple servers to increase resource utilization by allowing load balancing among tenants, and to reduce operational complexity and cost in managing the software to deliver the service. Tenants on a multitenant system can operate as though they have an instance of the software entirely to themselves which is completely secure and insulated from any impact by other tenants."
In the rest of the article, examples of cost savings achieved by implementing multi-tenant solutions are given. Finally, the author indicates that providing quality software is much more important for companies offering SaaS solutions, as they are paid on a monthly basis, rather than receiving a large amount of money when their software is sold, as is the case with traditional software companies.

To conclude, I would like to quote the author with a statement which is very true:
"Focusing on quality actually lowers costs."
So, let's develop our software to the highest quality standards possible, and save some money in these financially hard times!