According to (my) theory of database query evolution most startups who choose SQL based databases have the following steps of database/query optimization:
1) Using default queries of chosen framework. This may be ActiveRecord for Rails (doing select * from ….) or any other Object-relational mapping (ORM) bridging database and chosen programming language.
2) “This and that action seems to be slow for the end user, could you check them?” hand-picked actions (say reports) get custom SQL queries, performing better than SQL generated by ORM by default. this is usually done using
EXPLAIN SELECT 1;
where select 1 is your actual query producing something big. At this stage a couple of queries are optimized for the shortest running time possible.
3) You discover that your database was installed via a package manager, and is still using the default configuration. This is a big problem as the default configuration for PostgreSQL is meant to allow it to boot on fairly weak machines. This “fail-safe” configuration needs to be adjusted in order to use production machine resources fully.
4) “You can not improve it if you don’t measure it”. At this stage your service gets a lot of monitoring, the end user time for your services is measured, and also server running time for each request is measured and divided into “computation time”, “database time”, “third party requests”, “cache” etc.
5) Someone, say DBA, analyzes “database time” for first time, and goes on Sick leave for a few weeks without any explanation.
6) A real DBA is hired, to check the “database getting slow”. Database starts logging things separately from the application log. The long queries’ log appears to be showing that BizDev analytics are really slow, but not visible in the application log at all. Also statement statistics are created, showing some queries being run too often.
7) Your first 100M-row table, and the database outgrows RAM. At this time you hope that at least indexes will always fit into memory. Your next discovery is that PostgreSQL is fast, even faster when doing a sequence scan.
EXPLAIN SELECT 1;
is not informative enough anymore and (for PostgreSQL database) you start using a stronger weapon:
EXPLAIN (ANALYZE, VERBOSE, BUFFERS, COSTS, TIMING) SELECT 1;
This runs a query for real, showing full names of everything involved, displaying how many blocks it had to read to deliver data, standard cost estimates, and also real timings so you can compare them to estimates. As the query is running for real, in most cases you will need to surround it in transactions, so as not to modify your real data. In case you’re still wondering – a block is a unit of data that your database reads or writes at a time. For PostgreSQL this is 16kb by default.
Here you will get not only data on sequence scan VS index scan, but you can also start experiment on what indexes work better. At this time you start to optimize not for speed, but for “hitting as few blocks as possible”, because disk is slow. PostgreSQL distinguishes whether a block was ‘hit’ in cache (means in memory, no actual HDD read) or ‘read’ from the hard drive.
There is a separate story with writes. Block writes may be intended (like updating and storing new data) or may occur as a side effect – running a query causes some data to be “swapped” from memory before being returned to the user. I would like not to discuss writes in this blog post, so mentions on block writes end here for this post.
These 7 steps are just a beginning, if your service really takes off, you may eventually get into the “big data problem”. However, most startups and services will never have this problem. Because some 100 000 000 or even 500 000 000 rows of data is not “big data” yet. Even your single PostgreSQL instance, which is installed on a standard desktop workstation in your garage will be able to deal with small data.
Be careful with your queries, though. For each request ask only the data that you actually need, and be prepared to explain what you are doing!