PGConf.dev 2024 Live-Blog
Live-blogging a few sessions at the PostgreSQL Development Conference in Vancouver, BC.
Lessons from the Support Desk
Speaker: Evan D Macbeth
Surface Area vs. Volume
Surface AREA is the MOST important thing for developers to understand.
Every new release, code patch, feature... expands the surface area of your application (in this case, PostgreSQL), and every developer is now responsible for an ever-expanding portion of that surface.
And your Support organization? They're responsible for the VOLUME. Biggest takeaway:
Even making code better increases the work that your support team has to do.
What happens when you ship a fix?
"We only want to make things better, not worse" β Thomas Munro
That sounds great! ... in theory.
What happens when you make things better?
You bump the version number
Now there's a new release in the wild!
You just doubled your Support team's work! :O (They need to support both the users who upgraded, AND the ones who didn't!)
Even reducing code complexity increases support work.
So please keep your users, support engineers, and technical account managers in mind when you're coding.
There's nothing you can do to make their jobs LESS work. (Can make it easier, but not less work.)
Detail your assumptions.
When creating a feature or fix, your utopian case matters!
All problems can be diagnosed as a variance from an assumed state.
"Why did you do it that way?"
"Why did you assume I wouldn't do it that way?"
Detailing assumptions makes it WAY easier for the support team to know what's going on:
What's it supposed to do?
What did you test it on?
Use cases (EXAMPLES!)
ALL examples are good examples, with a bit of context.
Why? Users follow examples before reading documentation.
Users don't read your docs, but your SUPPORT team does!
Most Support contracts are paying the Support team to read the docs for them. 🤣
So have your Support team read your docs before release!
(The PostgreSQL docs are excellent!)
Cautionary tale with ChatDox: it's OK to help users find where to look, but you should not rely on an answer from AI. (Hallucinations are a thing.)
No one upgrades :(
March 2024: for one particular customer, a bit less than half of their servers were still running PostgreSQL 11 (EOL Nov 2023) or below. Six were running 9.6 (EOL Nov 2021) or below. :O
It's great that new versions of PostgreSQL have new features! But, most users aren't going to see those benefits.
An example from a well-known company from ONE WEEK AGO:
https://www.yugabyte.com/blog/yugabytedb-moves-beyond-postgresql-11/
Another example: trouble ticket from January 2024:
The password you entered is too long. Windows NT will not accept a user password that is longer than 14 characters.
[[ scream face ]]
Reality: most of the support issues they're dealing with aren't caused by new features.
That's not what's keeping your support team / customers awake at night. This old stuff is!
Versions matter.
Give me the specific version numbers. Give me an SBOM with the dependencies.
Backward compatibility is critical (and testing that it works!)
Give me a matrix.
This release has been tested / validated on these versions.
Especially in "ship constantly" era.
Test on the older stuff!
Give us roadmaps!
Ansible announcements: 5 pages of changes coming to the project, but tells me nothing. :(
PostgreSQL, on the other hand: the PostgreSQL 17 beta release notes are available for review!
Give me clear roadmaps.
Let us know where things are going.
Then we can tell customers, when interacting with them, what to expect when they upgrade.
Because eventually...
Everyone* Upgrades
(Just not on the timeframe you expect!)
PostgreSQL release cycle is 3 months.
If an organization releases every 6 months, they can't take every minor release, so we can't expect them to.
So. DOCUMENT and talk about what changed and why.
Especially compatibility deprecations / assumptions.
If dropping support for something, BOLD and at the TOP of the release notes.
Don't assume that just because RedHat 6 is EOL, users know you no longer support it. Call it out explicitly.
A 10-release window? It's great!
But it's going to create an EXTRAORDINARILY large amount of work for the Support team. Backward compatibility is no longer guaranteed.
It creates incentives to upgrade... but it also creates a lot of anxiety for them.
But!
The PostgreSQL community let me know this was coming.
That's great! Let us know.
Release notes are your Support team's best friend. <3
Please put ALL the details in there! We read them cover-to-cover.
Take the lesson from PostgreSQL for your own projects. Make your release notes as detailed and high-quality as theirs, and your support team will be very happy. ;)
Everyone Hates Downtime
How many people have been in a meeting with a customer who wants zero downtime? (Lots of people)
How many people believe there is "zero downtime" (no hands :D)
There are only two types of downtime: planned and unplanned. ;)
Users are not patient. They never upgrade, but they're not patient.
Set and manage expectations for changes.
I bet in the PostgreSQL 17 talk they didn't say how long an upgrade from PostgreSQL 16 -> 17 is going to take. ;)
"We can't do that! There are so many variables!"
True! So document your assumptions: I have a 100 GB database, I moved from 16 -> 17, and it took XXX minutes.
Backwards compatibility is SUPER critical.
But PLEASE test.
And not just the immediate previous version to the new version.
Most customers are running something 6-7 versions back.
The more you test, the better off your support team and customers are going to be. :)
This is a lot of work, and a trade-off.
At minimum, test from every major supported version to the latest.
Multiple success paths: give us many ways of successfully completing an upgrade.
Example:
pg_upgrade, PostgreSQL dump/restore
"rolling" upgrades, buld in flexibiltiy as to what rolls when
Recovery
Backups and restores are very important. Why am I talking to developers about this? I want you to try it.
You built a new fix, a new feature, it works! Super great, yay!
Now, back up the system and try a recovery with your new code in it. ;)
Maybe recovery takes a lot longer now? Something to think about...
If your support team is not doing restores from backup all the time, that's a problem. It should be part of the standard toolset, muscle memory.
Support team, Customer, even developers!
Why can upgrades take so long? For example, Federal customers: they can't "certify" PostgreSQL because it's open source software, so they need to thoroughly test and certify it themselves.
Can automation help?
PostgreSQL Users
The most important audience not in the room. :)
"There are only two industries that call their customers "users" β illegal drugs and software." β Edward Tufte
Language matters. There's an underlying assumption of "taking" versus being involved in the process.
Open source users are a bit different.
Goldfish? They have no memory.
One of the things that's very frustrating for Support but very true...
They're not going to learn
They're not going to remember things
Not because they're bad people.
There's turnover. Lost institutional memory.
So important to remember: they won't necessarily know what they knew before.
But there is a degree to which they are all the same and interchangeable... when they forget things, they revert to a state of commonality.
Remember: YOU are a User, too!
Every single person in the room uses PostgreSQL, so you have a perspective on this. Imagine how frustrated you get when a tool isn't doing the thing the way you want it to do it! Congrats, you now have user experience. :) Hold onto that. Because...
Your Users Are Not Like You.
"The median voter is a 50-something white person who didn't go to college" β the people making the decisions aren't like him.
In a similar fashion, we do not know who our PostgreSQL users are!
This is a big deal. A blind spot.
Timescale has tried to address this; they run a survey, and it's great.
That survey had 888 respondents.
That is not a statistically significant sample.
The estimated number of Oracle users is 70 million.
WE DON'T KNOW WHO OUR USERS ARE
So we can't assume they're like us!
But they need what you're building!
And your support team needs your participation/attention!
That helps them understand: Where were you coming from and why?
Suggestion: give it a try for a week!
Promise: Support team will say yes! ;D
Take a week, field tickets, answer calls... get a feel for the experience you're putting your support team and users through. You'll be a better coder for it!
YOU can build a team like YOUR users
You have some influence on the team you're on.
Even as an IC, make suggestions about your team
Help build/influence a team that looks like your user community.
Makes your team stronger!
Don't just add people who are like you.
37% of the US population is not white.
35% of all STEM employees were women in 2021.
27% of STEM graduates were women.
Is your team a quarter women? Is it a quarter non-white?
If not, is your team reflecting your user community? No?
WHY NOT?
Variety of perspectives leads to better solutions.
Study from the Kellogg School of Management. The task: find the murderer. The more diverse the team, the better their chances.
BUT.
Also less confident that they had the right answer.
Less diverse teams were HYPER-confident! Even when they're wrong. ;) (AI hallucinations, anyone? ;))
So, PLEASE think of your users when you write code. :)
Coda: AI
AI as a user (users submitting requests based on advice / answers given by AI)
AI as a developer
AI – "Best" practice
It can make doc writing better! (Clearer, more understandable)
Declarative / short statements are very useful, and hard to do.
Also, test assumptions. Ask ChatGPT a question you expect to work the way you think. The answer back may surprise you! And that's what your customers will get. ;)
Making PostgreSQL Hacking More Inclusive
Speakers:
Amit Langote
Masahiko Sawada
Melanie Plageman
Robert Haas
The PostgreSQL community is amazing! :D But...
Sometimes, we judge people on things other than the quality of their work.
Sometimes, we extend opportunities to people who are more like us.
Sometimes, we behave in ways that are more problematic for some groups of people than others.
Unfairness is a problem.
Looks bad for the project
Drives away potential new contributors (and we need them!)
Drives away / demoralizes existing contributors; contributes to burnout.
Makes hacking on PostgreSQL less enjoyable.
Possible reasons for Inequality of Opportunity
Demographics (race, skin colour, national origin...)
Neurodiversity
Fluency in English
...
Amit Langote's Story
Going to share some of the experiences that had a big personal and psychological impact on me, and kept me engaged for 11 years.
Hoping someone with a similar background / similar personality might find this relevant or helpful.
I'm not planning to speak for everyone who's like me; I'm sharing my experience to bring some benefit to the discussion.
Maintaining Two "Personas" as a PostgreSQL Contributor
Many of us communicate differently in our personal lives (indirect, nuanced communication) vs. how we need to be successful in PostgreSQL discussions (direct, assertive style).
- Maintaining two personas can be draining at times.
An inability to be direct and assertive may result in being overlooked or dismissed. People have let me know that this is limiting; I could be more successful if I made certain points in a more direct/assertive manner. Many others face this same issue. They "have" to be more direct, but they can't... because of their personality or where they are in their journey.
Recommendation: be more understanding / aware that there are people with different experiences here.
Challenges of Non-Native English Speakers
Email communication struggles:
Difficulty expressing complex thoughts in email as a non-native English speaker.
Many folks speak very differently in-person vs. in written communication.
Time-consuming drafting process to sound coherent (especially for "high stakes" conversations), potentially leading to perceived disengagement.
Non-native speakers leading discussions may face reduced engagement / disappear from the community. – It's not disengagement; it just takes time to put a response together.
What can be done?
Have Patience: Allow non-native speakers time to express themselves.
Use Clear Language: Avoid jargon and idiomatic expressions (and e.g. sarcasm) for clarity.
Ask for Clarifications: Request explanations or provide summaries when needed. (Instead of dismissal.)
On Giving a Chance
When you first start out, you are nervous, don't know if you'll be taken seriously.
Entrusting newcomers with important tasks
Example: In 2014, Robert, Alvaro, and others encouraged me, a relative newcomer, to work on a crucial project (table partitioning) in the community.
This was crucial in my sustained involvement.
Give other newcomers this chance, and we have more folks becoming specialists!
Their support, along with backing from in-house managers and colleagues, fueled my enthusiasm for contributing to Postgres hacking.
Diversifying Community Energy
Encouraging newcomers
...
Flipside
On the flipside, when newcomers aim for leadership roles, experienced members may hesitate, fearing it could impede project success, which may hinder recruitment of other experienced contributors.
Confidence in newcomers' abilities has to be balanced with support and mentorship from experienced members.
"Trust, but verify" :)
Off-List Dynamics
Off-list collaboration can be very effective:
Strategic planning and sensitive discussions may require confidentiality.
Building personal relationships and trust through one-on-one interactions.
Addressing urgent matters promptly, without the delay of group discussions.
However, it may lead to some folks feeling excluded:
Business-level discussions may delay patch acceptance, affecting contributor morale
Be transparent about that to the extent possible.
Help people understand how decisions were made. Otherwise it can lead to people feeling demoralized.
Engineers investing in patches feel undervalued when decisions rely on off-list interactions that don't involve them.
- Cite off-list discussions whenever possible, for transparency.
Masahiko Sawada
Sharing both my own experience and the experiences of other Japanese hackers that I've heard about.
- For some, no clear solutions exist.
Not intended as complaints / accusations
Difficulties with English
English is a prerequisite
Most Japanese are not good at communicating in English.
It takes time to read and write emails in English.
- Especially when trying to explain/understand technically complex things in a thread.
Vice versa: native English speakers need time to read English written by a non-native speaker.
Communicate patiently.
Communicating in non-native language takes time.
Example: At conference, another person patiently tried to make an effort to understand what I was explaining. This stuck with me.
Discussion in pgsql-hackers
Misinterpreting the "high-context" discussions in pgsql-hackers:
A non-native English speaker may not be able to understand it correctly,
and could end up giving an inappropriate response.
The experience of emails not being replied to, or failing to get attention.
Try to understand why emails are not being replied to.
Was my explanation in English not clear? Is no one interested? Or is it all good? No idea... :(
Cultural: Japanese prefer indirect communication, esp. for concerns.
Time zone differences
Discussion is very active (140 emails per day).
Peaks around 10am UTC / 4pm UTC (evening and night in Asian countries) – it happens while we're sleeping.
Less opportunity to engage in discussion.
Shifting working hours to European or American time zones:
Occasionally required, e.g. when discussing an important topic, near feature freeze, etc.
Example: I could participate in the discussion live, but it didn't work well.
Vectors: How to better support a nasty data type
Speaker: Jonathan Katz
Agenda
Overview: Why do we care about vector search?
Why use PostgreSQL for vector searches?
Year-in-review of pgvector development
Ongoing work and recommendations
Why vectors? Why do we care?
30-second summary: machine-learning models. Typically, they need lots of CPU/GPU power.
But RAG (Retrieval-Augmented Generation) is changing this. You can feed private data to a foundation model:
Take your data, put it into a common representation (a vector).
Perform a lookup on the database, then feed the results into the foundation model to get the final response.
Vectors used in two parts.
Do we HAVE to use vectors for this? Can we use something else?
You can!
But vectors provide a common format for data coming from different systems.
Mapping raw information into the same machine-learning models.
Challenges working with vectors
It takes time, there's overhead. 100 ms to generate a vector representation × 1M things = LOTS of time.
Embedding size: 1,536 dimensions of 4-byte floats ≈ 6 KiB – a LOT of data packed into what is normally a handful of bytes.
1M rows => 5.7 GB :O
Compression? No, it doesn't work. You can actually end up with something larger than what you started with.
Query time: the distance calculation has to look at EVERY value in the database!
To find the vectors most similar to a given one, you need to look at the ENTIRE table.
So we use ANN => Approximate nearest neighbour
Not the ENTIRE dataset.
Faster than exact nearest neighbour:
50 vs. 1M lookups.
But it introduces a new concept: "recall" – the % of expected results you actually get back.
Not quite "accuracy" – that deals with exact values.
With recall, you might get 8/10, and maybe not in the right order.
Important to keep in mind: it affects the overall user experience.
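To ground the terminology, here's a minimal pgvector sketch (the items table, its column, and the tiny 3-dimension vectors are all made up for illustration). Without an index, ORDER BY with a distance operator is an exact nearest-neighbour search that scans every row; the ANN indexes discussed below trade recall for speed.

```sql
-- Hypothetical table; real embeddings are more like 1,536 dimensions.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (
    id        bigserial PRIMARY KEY,
    embedding vector(3)
);

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

-- With no ANN index, this is an exact scan over every row.
-- <-> is pgvector's L2 (Euclidean) distance operator.
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```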
Key metrics to consider
Index build time
Index size
- 1M 6 KiB vectors = 6 GB.
Account for recall.
Query throughput.
p99 query latency.
PostgreSQL as a "vector database"
What is this? JSON!
In the past, you'd take these and decompose them into a relational mapping.
What if instead, we just put JSON in directly?
Timeline:
PostgreSQL added support for JSON in 2012.
But JSON is just a data type.
Isn't vector just a data type?
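In other words, both fit into an ordinary table next to relational columns. A quick sketch (the documents table and its columns are hypothetical):

```sql
-- JSON and vectors are "just data types" alongside relational data.
CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    body      jsonb,         -- built-in JSON data type
    embedding vector(1536)   -- data type from the pgvector extension
);
```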
Why use it?
SO much developer tooling is built around PostgreSQL.
People became fluent in SQL out of necessity; tooling wasn't available in ORMs.
Also, PostgreSQL is a database providing ACID guarantees.
Why do you care?
Atomicity – all or nothing.
- If I'm doing some kind of procedure, I want it stored, and rolled back if it fails.
Consistency – new data follows the rules of the other data.
Isolation – correctness in returned results. As soon as data is inserted / committed, it's available! (Not true of all systems.)
Durability (most important) – if you say data is stored, you want to make sure it's stored.
PostgreSQL has supported vectors for a very long time (ever since ARRAY) – and cube, though that's limited to 100 dimensions.
But distance operations? Not so much.
Lots of options now.
Focus on pgvector – why? It's popular!
Another thing:
LlamaIndex:
pgvector is the #2 most-downloaded vector store.
This wasn't a given.
Back in April 2023, it was at the bottom.
Why pgvector?
Why is pgvector so popular?
In 2023? "It was there."
You can use existing native binary <> binary PostgreSQL drivers, which really speeds things up.
2024?
LOTS of work on pgvector to bring it up to performance.
Come for the usability, stay for the performance :)
Actively developed project. Community buy-in.
pgvector: Year-in-review timeline
v0.4.x (Jan–Jun 2023)
IVFFlat plan costs:
* 50 ms to 5 seconds :O – we fixed that.
What do users need? Faster queries.
BUT: index building matters. It helps with not just high-quality searches but very fast high-quality searches.
But datasets are growing. We need to manage them without taking up too much space on disk.
Foundation models produce LARGER vectors; we need to reduce the overall data footprint.
Indexing in pgvector
Arrays are not hard (we've had them since 1986) – searching is what's hard.
We took some shortcuts:
L2-normalize the vector.
If you normalize the vector, you reduce its magnitude to 1.
You can now discard some calculations (the division): for unit vectors, cosine similarity is just the dot product.
Indexing techniques:
IVFFlat
- Have all the data pre-loaded, cluster it into centres, and search through the centres.
HNSW (Graph based)
Some folks were confused about this. PostgreSQL can do graph-based queries?
Graphs are a superset of trees, PostgreSQL can deal with trees.
Index access methods let you define custom indexing.
Unlike IVFFlat, you can add data iteratively, and it'll be added in the best place. IVFFlat recall can skew over time; you must rebuild the index.
Most indexes are nice and easy, don't require a lot of thought
Probes are a bit more complex...
If I do one probe: a fast search, just the vectors in that one list.
But I might not get the closest neighbour.
Expand to two probes? Longer query, but closer neighbours.
Searching more of the index gives better answers, but at the cost of query time. (See the sketch below.)
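A sketch of that trade-off against the hypothetical items table (the lists and probes values are arbitrary):

```sql
-- Build time: cluster the vectors into 100 lists (centres).
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

-- Query time: search 2 of the 100 lists – better recall, slower query.
SET ivfflat.probes = 2;
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```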
HNSW is easier on the query side, but harder on the index side.
m = # of links between indexed vectors
ef_construction – keeps track of the nearest neighbours you've seen while building
m = 16 seems to work, but it depends on the embedding model.
A lot of what we do here is overcompensate for the embedding model. If it gave us data that was very easy to cluster, we could use smaller indexing parameters.
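For reference, a build sketch with those two knobs spelled out (16 and 64 are, as far as I know, pgvector's defaults):

```sql
-- m = links per indexed vector; ef_construction = size of the
-- candidate list tracked while building the graph.
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops)
    WITH (m = 16, ef_construction = 64);
```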
How do you build an HNSW index?
Goes through hierarchies
Find the next closest, then the next...
HNSW query parameters
Next, next, next...
Keep a list of all the closest neighbours seen so far.
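At query time, the length of that candidate list is a setting; a quick sketch (100 is an arbitrary value):

```sql
-- Larger hnsw.ef_search = longer candidate list = better recall,
-- slower queries. pgvector's default is 40.
SET hnsw.ef_search = 100;
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```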
Quantization
Reduce the amount of information available, while still maintaining the original data structure.
Map to smaller floating-point values, or map to integer values, but don't change the overall structure of the vector.
1-byte units (not supported in pgvector): now you're at 1/8 the size.
BINARY quantization... 1 or 0!
Above zero it's a 1; below zero it's a 0.
But... tradeoffs
PostgreSQL has an amazing feature called expression indexes.
If you have a function, you can use it to build an index.
Go from a 4-byte × 4-dim vector to a 2-byte × 4-dim vector.
Keep the original value in the table, and give the index the smaller value.
Lots of binary vectors look very similar.
Binary = VERY fast, but you lose recall.
Rerank easily: get the real vector back, and put the results back in the correct order.
Advantages on build time.
Save space, get the same level of performance – that's good!
Binary shrinks things even more, but you need to do more work to get adequate recall.
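A sketch of both ideas against a hypothetical 1,536-dim column ($1 stands for the query vector; syntax is from pgvector 0.7.x, so treat the details as an assumption):

```sql
-- Half-precision expression index: the index stores 2-byte floats,
-- the table keeps the full-precision vector.
CREATE INDEX ON items USING hnsw
    ((embedding::halfvec(1536)) halfvec_l2_ops);

-- Binary quantization: 1 bit per dimension, searched by Hamming
-- distance (<~>), then reranked by exact distance on the real vector.
CREATE INDEX ON items USING hnsw
    ((binary_quantize(embedding)::bit(1536)) bit_hamming_ops);

SELECT id FROM (
    SELECT id, embedding
    FROM items
    ORDER BY binary_quantize(embedding)::bit(1536) <~> binary_quantize($1)
    LIMIT 100
) candidates
ORDER BY embedding <-> $1
LIMIT 10;
```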
Limitations of Quantization
You reduce your information
"Double data" problem. Gigantic vector in table, smaller vector in index, but could also be a large thing.
Pure indexing systems don't do that.
Some techniques only work with LARGE vectors. Enough data, you might have a bunch that look the same, impact recall.
How did pgvector do the past year? (In charts!)
Index build time? 150x speedup!
30-50x speedup even on simpler hardware.
Parallelism – I can use multiple cores to speed it up.
Concurrent inserts speed it up a lot (v0.6).
2 hours is better than 10 hours.
A few seconds is better than a few hours.
Net impact? Proof that build speed matters. (Serial build.)
Recall increases as you increase ef_construction (though with diminishing returns).
With parallelism, you see that same recall increase.
But now with minutes of build time!
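The parallel build rides on PostgreSQL's standard maintenance-worker setting; a sketch (the worker count is arbitrary):

```sql
-- pgvector 0.6+ can parallelize HNSW index builds.
SET max_parallel_maintenance_workers = 7;  -- plus the leader process
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);
```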
Now we can do smaller index sizes:
16x better.
8x–9x smaller.
Query latency p99: shown with higher recall.
HNSW really improved latency compared to IVFFlat.
More incremental after HNSW
Last but not least, throughput.
High concurrency
Expanded version of the ??? data set.
Recall at 90% with huge throughput
80-90% recall seems to resonate with users
What can we do better?
"8K Conundrum"
Page = 8K storage unit
Heap pages are resizable, but index pages are not. :(
This is a REAL problem for vectors.
A 1536-dim 4-byte vector = 6 KiB.
A 3K-dim vector = 12 KiB.
Quantization can help deal with this... but vectors keep getting bigger, and we can only quantize so far.
What about TOAST?
4 types of TOAST storage.
Most common:
PLAIN – keep the vector in the table.
EXTERNAL – keep it outside the table (the default for vector).
Why this design?
You're often not searching over TOASTed data.
It's not the main part of the query.
But vectors ARE on the "hot path" – the primary part of the search. The impact is on the planner: it treats vectors as if they were not on the hot path.
Sequential scan:
6 workers planned.
At 1.5K dimensions, 4 workers.
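The storage strategy is adjustable per column; a sketch (this only helps while rows still fit in the 8 KB page):

```sql
-- Keep vectors inline (PLAIN) instead of TOASTed out of the table;
-- pgvector's default for the vector type is EXTERNAL.
ALTER TABLE items ALTER COLUMN embedding SET STORAGE PLAIN;
```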
Improvements we can make?
Continue investing in quantization.
Improve the planner to understand when TOASTed data is part of the hot path.
A TOAST chaining system:
How do we index things > 8 KB?
TOAST for indexes / page chaining?
Loading multiple pages in an index search can be slow as well.
Modifiable sizes for index pages.
Dynamically modifiable?
Filtering – WHERE clauses.
Strategies:
Partial indexes.
Partitioning – but be careful... if you're not partitioning on the filter key, you run into over-filtering. (See the sketch below.)
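A sketch of the partial-index strategy (tenant_id and the value 42 are invented for illustration):

```sql
-- ANN index over only the rows matching the filter predicate.
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops)
    WHERE tenant_id = 42;

SELECT id FROM items
WHERE tenant_id = 42          -- matches the index predicate
ORDER BY embedding <-> $1
LIMIT 10;
```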
"Hybrid search"
Subset of filtering
Two different indexing / filtering methods combined.
Starting to see this more and more
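For example, combining PostgreSQL's built-in full-text search with vector ordering (a sketch; the content column is invented):

```sql
-- Lexical filter first, then order the survivors by vector distance.
SELECT id
FROM items
WHERE to_tsvector('english', content)
      @@ websearch_to_tsquery('english', 'blue suede shoes')
ORDER BY embedding <-> $1
LIMIT 10;
```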
Approaches:
Multi-column indexes.
Remove extra distance calculations.
Pushdown to covering indexes.
Using other index mechanisms to filter the data set.
LARGE amounts of data still need work.
pgvector and VACUUM
If you want to delete a vector, you can: it hides the entry, then updates.
That can cause slowness, because it's a serial index build.
It's hard for the access method to do that in parallel.
Let's make it easier to do parallel vacuum for other index types.
And implement it on different table types.
HNSW build = 1 week; VACUUM = 3 weeks.
Distributed queries? Not a big deal yet, but it will become one when data size exceeds memory.
Probably coming in the future.
(See blog post.)
Today we only get synchronous queries.
Add merge append as an async option?
No longer a serial lookup across servers:
Do it all at once, then merge the results in.
Parallel query – more important for IVFFlat.
For some reason it's not choosing a parallel plan.
Hardware acceleration
pgvector makes big use of this.
Now incorporating work from PostgreSQL itself:
Upcoming in version 17.
"Can we use GPUs?" <- big topic.
There are opportunities, but probably in index building.
Need efficient transfer between areas of memory.
Dogs not barking?
Still figuring out vectors.
Data doesn't start off there...
Matrices, tensors (a matrix of vectors)
These might make their way in.
How would we solve them?
If we already struggle to fit a vector on a single page, how do we store a tensor?
Different levels of storage / capacity.
Native vector support for PostgreSQL?
50M vectors for a transactional application.
50M rows = 50M vectors.
But now it's 50 GB :O
Native vector support in PostgreSQL might speed up adoption.
Running an extension is hard:
One extra step in the process.
Especially if an admin needs to install it, and you don't have permissions to do that within your existing environment.
We already talked about this last year; the consensus was to hammer on pgvector to make it faster.
Here's the hit list:
Filtering
- Search over a subset, not the WHOLE dataset.
Community! <3
People with strong knowledge of ML + knowledge of DBs can produce dramatic performance gains.
There's a flood of net-new data generation.
We'd love to see it in PostgreSQL,
so let's put it in the best possible place.