The database is often the bottleneck in our application, both in terms of performance and developer velocity. Changing the database is a lot harder than changing the code. Because of this, I prefer to defer database structure decisions as late as possible.
I have mostly worked with PostgreSQL and Ruby on Rails ActiveRecord in the last 10-15 years.
I'm splitting this post on tips into two parts:
General Tips
Tips, I’m going to cover today:
So, here are the general tips 👉
Primary keys
Every table should have an "id" column for the primary key, which should be "bigint" or "UUID." I usually use "bigint."
NEVER use it as a domain value for the primary key of a table like email, IBAN, company identifier, etc.
Data is not immutable - people can change emails.
Can leak private data - user email shouldn’t be in the "users/:id" URL
It can block the making of business rules and changes.
I know about a company that used BULSTAT (company identifier) as the primary key of their accounts table, and this blocked them from supporting individual accounts because individuals don't have BULSTAT. 🤦♂️
Keys and indexes
Always have a foreign key. Except in cases where the table is for logging purposes, and data should be retained even when the related record is deleted. 🔐
In PostgreSQL, when there is a foreign key, and index isn't automatically generated. Indexes are created separately.
The big difference between an index and a foreign key is that a foreign key ensures that a column from one table matches another. For example, "posts.user_id" matches the "users. id" column. However, foreign keys don't make queries faster; indexes are used to improve performance.
Use compound indexes (index on multiple columns). If applicable, also add unique indexes.
If there are indexes on "user_id" and "user_id, slug," for example, the "user_id" index isn't needed because the second index covers it. Order matters - "slug, user_id" doesn't use an index search on "user_id".
When having a "has_one" relationship, use the UNIQUE index.
Naming
I follow the following naming conventions.
Date columns end with "_date"
Times columns end with "_at"
Boolean columns don't start with "is_" unless I'm using Ruby on Rails; otherwise, they start with "is_" / "are_"
Suffix columns with "_eid" for external system references, like "stripe_subscription_eid"
Suffix tables with "_log" for write-only logging tables, like "sign_in_attempts_logs"
Use "kind" instead of "type" because Ruby on Rails uses a type of Single Table Inheritance
Use "record" (record_type/record_id) for polymorphic associations when I can't find a better domain name
Use “source” as a term when one record is responsible for creating another. An example transactions table can have a source which will be what triggered the transaction
Avoid nulls
I always try to force "NOT NULL" as much as possible.
Number columns, for example, are always "NOT NULL DEFAULT 0" by default.
Boolean columns should always be "NOT NULL" and, when possible, have a default value.
Sometimes, you may be tempted to use null for a boolean, like when you have a new notification type and want null to indicate that the user hasn't decided on this option. There are better ways to handle this, like:
Have an enum with "default," "on," and "off" values. This enables adding possible values like "always" or "only important ones" in the future.
We have a separate table, "notification_preferences," with columns "name" and "enabled." The user hasn't picked a value if there is no notification record. This allows new notification types to be added without changing the database.
Often, wanting not to have null pushes me to have a better design. 🤩
Enums
I prefer strings for enum columns instead of the enum type in PostgreSQL.
This helps me avoid:
I don't have to run a database migration when I add new enum values. Changing code is easier than changing the database.
I don't have to think about how to name the enum type. Naming is hard.
I'm not tempted to share enum between tables because enum values look similar.
(I know) I'm missing some performance gains and database constraints. When needed, I use the enum type; however, this was needed very rarely.
Rails enums, by default, are numbers - don't do this! Use the strings:
enum status: { pending: 'pending', sent: 'sent', failed: 'failed' }
Having numbers for enum can lead to "fun" situations.
Example: You have user roles: operator(1), supervisor(2), admin(3). Then you need a director role, which should be between supervisor and admin. What number will you use? 🤔
If 4. It will be weird that 4 roles have fewer privileges than 3. Plus, the code compares roles for permission. This will mess this up.
If 3. Then move admin to 100 (because you want to avoid getting into this situation again). This will require costly migration and changes in the application code.
When you write an analytics query, you will waste time remembering what "4" stands for. 🫠
Time as boolean
Instead of having a boolean column, use a nullable timestamp column.
Example: Marking an apartment as archived. There are two options:
Boolean column - "archived BOOLEAN NOT NULL DEFAULT 'FALSE'"
Timestamp column "archived_at TIMESTAMP WITHOUT TIME ZONE"
I often will like to know when record was archived, so keep this as column "archived_at".
Then, I have an index:
```
CREATE INDEX index_apartments_on_archived_at ON public.apartments USING btree (archived_at) WHERE (archived_at IS NOT NULL);
```
This technique is helpful when a state needs to be changed at a certain time. Example: "posts.published_at"
Not every boolean should be a timestamp. My question when deciding is, "Do I care when this happened?". ⏰
Avoid soft-delete
This one might be a bit controversial. 😅
Adding the "delete_at" column sounds very simple. But it rarely is.
When we haven't soft-deleted as a concern from day one, all the queries aren't expected to be filtered by this column.
When we soft-delete a record, we need to soft-delete everything related to it. So, one soft-deletable record makes everything around it soft-deletable.
Plus, we must upgrade the "deleted" records when we make database changes.
I often prefer a "deleted records" table where I copy data from the deleted records. I use the [auditable](https://rubygems.org/gems/auditable) for this.
The question always should be, why do we need deleted records to be around? How often are we going to restore deleted records?
It is always about context and tradeoffs.
Sometimes, a "soft delete" is needed and can't be avoided. In this case, try to avoid its blast radius.
This is a good post about this subject 👉 Soft Deletion Probably Isn't Worth It.
Record of who did it
Often, I need a record of who performed a certain action, together with some extra information like location, browser information, and so on.
I used to attach this information to the table showing what was acted on.
The table becomes too big - with a lot of columns, which are often unnecessary when selecting data
Data loss - there is only a record of the last performed action and who did it
Explosion of foreign keys/indexes to the "users" table
A better strategy is to have a separate table for user actions, something like the following:
CREATE TABLE user_action_logs (
id bigint NOT NULL,
record_type string NOT NULL,
record_id bigint NOT NULL,
user_id bigint NOT NULL,
action_name string,
action_info jsonb,
request_country string,
request_ip string,
request_user_agent string,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL
);
Split data out of tables
When there is a big table, it attracts more and more columns. It is like a black hole. Those tables are usually the "Got object" tables in the project. ✂️
I’m always very careful when I need to add new columns to those tables.. In many cases, I prefer to have separate tables. Having separate tables is great because they are another bucket to which I can attach things.
Often, those tables start as a "has_one" relationship to their parent table.
Perform analytics query on a follower database, not primary one
The analytics queries are very different than application queries.
The application queries must be fast, select only a small subset of data, and avoid aggregations.
The analytics queries are the exact opposite—they use a lot of aggregations, select a larger set of data, and can be slow.
Usually, the database schema is optimized for application queries.
It is a mistake to run those different types of queries in the same database instance.
If you can't move your data to a separate analytics database like Amazon Redshift and Snowflake. Second best choice is to perform analytics database queries on a follower database instance. This instance can be smaller and cheaper than the primary database.
At Product Hunt, we had a follower database just for analytics. Then, we moved our daily newsletter there because newsletter generation was slowing the whole website. We later opted out of the delivery, but this was a nice temporary fix.
Conclusion
This is the end of part 1.
When I design a new feature, I put extra care in my database design. Database changes are one of the hardest migrations you need to make occasionally.
Next week, the next part should focus on tips on Ruby on Rails and databases. I had originally tried to list them here, but they made the post too long 😅
Update: Part 2 is published
If you have any tips, you strongly agree or disagree with them or have ideas for something I missed, уou can ping me on Threads, LinkedIn, Mastodon, Twitter, or just leave a comment below 📭