Today was one of these days that I learned a lot of something I already knew… index cardinality.
I was working with a collegue in an attempt to optimize a project that has been acting up lately and was passed on to us. The website itself is a Drupal website with a custom module that retrieves data from a MongoDB database. In this database, there’s a collection (for those not familiar with MongoDB, that’s the equivalent of a table) that contains information about roughly 4 million flight from several airlines. The symptoms where simple: page loading times were to high, and when clearing the Drupal cache, the MongoDB server was flooded with more queries than it could handle. So there were two problems to investigate: why was Drupal doing so many queries, and why wasn’t MongoDB able to handle them in a timely fashion? My collegue took on the Drupal side of the problems, I focussed on the MongoDB.
So the first thing I did was track down what queries were actually running. The query that popped out the most, was a query that would filter for a flight from a particular airport to another airport and back, limited by a date and a price range (where do you want to go, when do you want to go and how much are you willing to pay). The query used 6 fields to filter on and took on average anywhere from 1 to 1.5 seconds to execute. The next thing I did, was have a look at the indexes on the collection. To my surprise, the table has 6 ‘relevant’ indexes, one for every columns they were searching on. Since only one index can be used, at least 5 of them were useless. Using explain(), I learned that the index that was used was that of the arrival airport of the outbound flight. This index still resulted in over 430,000 rows that had to be scanned to filter on the other criteria.
So I did what most of us would probably do. I made a best-effort guess on what would be a relavant query, and decided to use the 4 airport fields (from-to, for both the outbound and the inbound flight. BTW all of them were arrays, as a city can have more than one airport) and the price (as the results were sorted by price) as a 5 column compound index. I dropped the existing indexes, replaced them by my compound index, sat back, relaxed and saw how the server load went through the roof. Wait! What?!? Queries started piling up and taking forever to complete. CPU was burning 100% and the website wasn’t really doing anything anymore. I was shocked. How on earth could it be that my brilliantly clever compound index was performing worse than a single column index yielding a 430,000 rows scan?
So I re-added the old indexes and the server gradually recoverd from all the work. When I re-ran the explain() on the query, I noticed that the MongoDB optimizer actually prefered the single column index instead of my compound index (which clearly had to be a bug in MongoDB) . When running explain with index hinting, it showed that my compound index actually resulted in a 420,000 rows scan. The difference between the two indexes was only about 10,000 rows on a set of 4 million rows (that’s 0.25%). As my compound index was roughly 4 times as large as the single column index, performance was actually a lot worse.
So why did this happen? The truth lies in the data. The website is build around the airport of Brussels (BRU). Turns out that the dataset we’re dealing with had already been stripped of a lot of unrelated flights (any flight that does not depart from, or arrive at BRU). So almost by definition, if the outbound destination airport was not BRU, then the flight would come from BRU to that airport, and the return flight would go from that airport back to BRU. This meant that the number of unique combinations in the index (cardinality) was actually almost the same as that of the single column index. So basically, the fact that I did not know (or failed to check) what dataset I was dealing with, I did not pick the right index.
All’s well that ends well, I did manage to find the “perfect” index for the query. A compound index on arrival airport on the outbound flight, combined with the departure date of the outbound flight and the price. This 3 column index manages to limit the result to only a couple of hundreds of results and does so in about 50 milliseconds. My collegue also managed to decimate the number of queries by combining and caching them on the Drupal side. A great improvement for the site, but above all, a great lesson for myself.
Hi there! My name is Tom. I’m a Linux System Administrator and PHP Developer. My job is to keep PHP websites running as smooth as possible. Being both on the ‘dev’ and ‘ops’ side gives me a broad skillset across the entire stack.