RunLLM If I understand correctly, for frequent loads we should not declare primary keys in tables, as we might run into known issues with updates (DuckDB deletes and then re-inserts a row in parallel). Is this correct?
Yes, that's correct. Declaring primary keys in DuckDB can negatively impact load performance due to the overhead of maintaining the index, as detailed in Performance Guide: Schema. Additionally, updates in the presence of primary keys can lead to unexpected unique constraint violations, because updates are executed as deletions followed by insertions, as explained in Indexes. Would you like more detail on this?
Sources:
- Performance Guide: Schema
- Indexes
- Statements: Insert
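A minimal sketch of the failure mode described above (table and column names are illustrative):

```sql
CREATE TABLE t (id INTEGER PRIMARY KEY, v INTEGER);
INSERT INTO t VALUES (1, 10), (2, 20);

-- DuckDB executes an UPDATE as a delete followed by an insert.
-- Because the inserts may be processed before the matching deletes
-- are purged from the index, this statement can raise a unique
-- constraint violation even though the final state (ids 2 and 3)
-- would be perfectly valid:
UPDATE t SET id = id + 1;
```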
💡 Hint: Mention RunLLM in the thread for followups.
Is it recommended to have a primary key in DuckDB?
It depends on your specific use case. Primary keys enforce data integrity by ensuring uniqueness, but they can degrade load and update performance due to the overhead of maintaining indexes. According to Performance Guide: Schema, for best bulk load performance it's recommended to avoid primary keys unless they are necessary for data integrity. Additionally, Indexing suggests using primary keys only if necessary for enforcing constraints on your data. Would you like more detailed information on this topic?
Sources:
- Performance Guide: Schema
- Indexing
- Indexes
I believe a general recommendation on any DB is to drop indices prior to deleting, loading, or even updating lots of rows; defining the indices after the load is complete is far faster. That does leave the issue of duplicate primary keys, of course, but sorting the data prior to loading will establish whether there are duplicates. An additional factor is the presence of foreign keys: here again, the problem is speed, because foreign keys are checked against primary keys.

Having written all that, I cannot find much advantage yet in using indices with DuckDB. Everything I've tried leaves EXPLAIN showing no access to any of the indices; I probably got something wrong somewhere, but that has been my experience so far. A table with well over 1B rows updates remarkably quickly (seconds); deleting rows in the table is even faster; loading (inserting) is also very fast.
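The drop-indices-then-load pattern described above can be sketched in DuckDB SQL roughly as follows (the file, table, and column names are hypothetical):

```sql
-- Bulk load into a plain table with no PRIMARY KEY or indexes.
CREATE TABLE events AS
    SELECT * FROM read_csv_auto('events.csv');

-- Check for duplicate keys before adding a uniqueness constraint;
-- an aggregation does the same job as pre-sorting the input.
SELECT event_id, count(*) AS n
FROM events
GROUP BY event_id
HAVING count(*) > 1;

-- Only once the data is known to be clean, build the index.
CREATE UNIQUE INDEX events_pk ON events (event_id);
```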