Bank transaction deduplication, the easy way, on purpose

Three banking sources, no shared transaction ID, and the Postgres view we shipped in a day instead of the design we wanted in a quarter.

Date

June 8, 2026

Bank statements don't believe in transactions

Bank transactions are the substrate of every credit decision we make. Six months of cashflow tells you what a small business actually does, which is usually different from what the business says it does on the application form, which is usually different from what it thinks it does when nobody is asking. People are full of stories about themselves. Bank statements aren't. Bank statements are full of $4.50 charges to a coffee shop in Brooklyn at 9:14 AM, which is a different and more honest kind of story.

Three sources deliver these statements to us. Two are bank-data aggregators, intermediaries that go and talk to your bank for you, like translators between systems that have nothing else in common. The third is our own document-spreading agent, which reads PDF bank statements with an LLM stack and produces structured rows. The agent treats each statement the way a careful translator treats a difficult passage. Margin notes about confidence. A willingness to admit uncertainty. The understanding that some of these institutions are still printing statements in Courier New for reasons history will not explain.

The three sources don't agree on anything important. One signs its debits positive. Another signs them differently depending on the account type. The OCR-spreader pipeline reads them off the paper however the paper has chosen to print them. The number of decimal places varies. The account identifier varies. The transaction descriptions vary, sometimes in ways that suggest the bank is actively trying to confuse us. None of them give us a sub-day timestamp, because somewhere in the late twentieth century a meeting was held, probably in a wood-panelled room with a slide projector, at which it was decided that "the day on which it happened" was sufficient precision. The decision has not been revisited.

The central absurdity of working with bank transactions, the one you have to make peace with before anything else can happen, is this: a bank transaction does not have a globally unique identifier. Each aggregator mints its own ID, scoped to itself. A statement PDF has no ID at all, because the bank that printed it had no reason to assign one, and probably did not anticipate that anyone would be parsing it with a language model in 2026. The IDs of the world are not federated. The IDs of the world were never federated. The IDs of the world will not be federated, because federation requires agreement and the parties involved are banks.

So when the same dinner-at-Olive-Garden charge arrives twice, once from an aggregator because the customer linked their account, once from a PDF because Operations uploaded the May statement, there is no shared key that says these are the same row. There are only the columns the rows happen to share: account, date, description, amount, debit-or-credit. That is the largest key the three sources have in common. It is the only key you can build.

It is also, unfortunately, the wrong key.

*Two sources, one truth, no shared ID. The classic.*

The coffees, which were not duplicates

Two genuine four-dollar coffees, purchased on the same day, from the same card, at the same coffee shop, with the same description, are indistinguishable from each other on a key composed of (account, date, description, amount, type). They collapse into one row.

This is a real thing that happens. It happens to me, occasionally, on the kind of mornings where I order the first coffee and then realise halfway through that I needed a stronger coffee, and order a second one, and then later in the evening when I am scrolling through my checking account I see two charges to the same place for the same amount and think, briefly: did they charge me twice? And then I remember the morning. And then I think about the deduplication system, which would not remember the morning, and which would silently collapse those two coffees into one and never tell me.

You can try to fix this. You can hash the description with fuzzy matching. You can add within-day counters that depend on the row order in the source table. You can pull pending-versus-settled flags into the comparison and pray. Each patch fixes a specific collision and introduces a new kind of false negative elsewhere. After a couple of rounds you end up with a system that is harder to reason about than the one you started with, and wrong in ways you can no longer predict.

The problem isn't the key. The key is fine. The problem is the unit. The transaction is not the right object to deduplicate on. We knew this at the time. We knew it the way you know certain things in your spine.

The view that shipped on a Tuesday

We built a Postgres view. Three branches in a UNION ALL, one per source, all normalised into the same column shape, keyed on (account, date, description, amount, type). Inside each application, a window function marks lower-priority duplicates with an is_duplicate flag.

We do not drop the duplicates. We mark them. This matters. Any analyst can pull the flagged rows and see exactly what got merged, which means deduplication mistakes appear in the data instead of disappearing into it. The view shows its work. If a model lies, you want it to lie out loud.

Source priority is a single integer per source. We can flip it in one migration when our judgement about which source to trust shifts, which it has, twice, in ways that surprised us. The view runs at read time. No ETL pipeline. No write-side reconciliation. When new data lands in a source table, it appears in the view on the next query, as if by gravity.

*Quarterly review meeting. We're definitely going to fix it. Next quarter.*

The view shipped on a Tuesday and has been the read surface for underwriting for ten months since. The collision rate on the two-coffees case, measured over a quarter, doesn't move a credit decision driven by six months of cashflow rollups. It is wrong in a known, bounded way that doesn't matter at the resolution at which our decisions operate. We measured. We accepted.

This is the version that is still running. It is also, very clearly, the wrong version.

The question that should have been asked

The right question, the one we were trying not to ask because the answer was complicated, is not is this transaction the same as that transaction. It is:

For this account, on this date, which source is authoritative?

Every source we have produces blocks. An aggregator pull is, by claim, complete over a window from the connection date to today. A PDF statement is, by construction, complete over its printed period. Each source's contribution to an account is (start_date, end_date, transactions), with an implicit guarantee that inside the window the source has reported everything it can.

Once you frame the problem that way, deduplication stops being a row problem and becomes a coverage problem.

*Every cross-source deduplication design conversation, eventually.*

For each account, sort the sources by end date, latest first. The source with the most recent end date wins for its full range, because it has the freshest claim. Move to the next source. Take only the transactions whose date falls before the winner's start date. That older period is the part the higher-priority source doesn't reach. Continue down the list. Each source contributes only the older dates that no fresher source has covered.

The output is a clean, gap-free timeline per account, assembled from the freshest authoritative claim available at every point in time.

Both coffees survive. Because no transaction inside a winning block is deduplicated against another transaction inside the same block. Cross-source overlap stops being a question, because for any given date, exactly one source is consulted. The structural failure mode of the composite-key view doesn't exist in this model. It can't, by construction.

For us, this resolves to a clean pattern. The active aggregator wins for everything from the connection date forward, because its end date is "today." Uploaded statements fill in older periods, which by definition the aggregator can't reach. There is no inter-source debate to be had on any given date. The same Olive Garden charge no longer needs to be matched against itself, because only one source is ever consulted for the date it landed on.

Why we shipped the wrong one

Coverage isn't a column on any of the source tables. You model it.

Aggregators don't always give you explicit start and end dates for their coverage window. You infer them from connection events, ingestion history, and the occasional silence that you have to interpret correctly. Connection breakages mid-window need a position taken on them: does the source still claim coverage across the gap, or does the next source fill in? Statement period extraction has to hold up across hundreds of bank templates, some of which appear to have been designed in MS Paint in the 1990s by someone who lost a bet.

None of this is hard, individually. All of it takes weeks. And the right place for it to live is not in a view but in the write path, which is a different muscle from the one a Tuesday-afternoon Postgres view exercises.

The composite-key view skipped all of this. It treated every transaction as an independent object and asked a weaker question. The weaker question was the wrong question. It also shipped in a day, in a quarter when our underwriters needed a unified view and the alternative was merging three sources in a spreadsheet, which is not a thing a fintech underwriter should be doing in 2025.

*Engineering judgement under time pressure, summarised by a Canadian rapper in 2015.*

So we shipped the wrong one. We knew we were shipping the wrong one. We documented in the original PR description that we were shipping the wrong one, in the same way you might leave a note for your future self, saying please remember to come back and fix this, knowing your future self will read the note, sigh, and continue not fixing it.

We are building the block model now. The view will keep serving in the meantime. When the new write path is parity-verified against the view, the view becomes redundant. We delete it. Maybe we hold a small ceremony. I will probably drink a coffee, and possibly two.

The lesson, if there is one

Engineering judgement under time pressure usually comes down to picking the wrong version that ships, picking the right version that doesn't, and being honest about which one you've committed to. There is no version of this in which you get to ship the right thing on the original timeline. There is only the version where you ship the wrong thing with eyes open and a plan, and the version where you ship the wrong thing pretending it is right.

The composite-key view is the wrong version. It does what it has to do at the resolution our decisions operate at, with a failure mode that costs nothing of consequence, and it bought us the room to design the right version without panic. That is the only way design ever happens well.

If you are merging transactional data from multiple sources right now, and you have settled on composite-key deduplication because that is what the columns let you do, the question worth asking is not whether the key can be tightened. It is whether the unit can be coarsened. A block of coverage is almost always a more honest object than the rows inside it.

And if your deduplication ever silently eats one of your customer's two-coffee mornings, remember: it was always going to. You can either build the block model or buy them a coffee. Both are acceptable. I would suggest you do both, in that order, on a Tuesday, when the light is the colour of weak tea, and nobody has the energy to push back.

‍

Nirjhar Debnath

Senior Software Engineer

Nirjhar Debnath works at Efficient Capital Labs. His exact role has never been written down, which everyone seems to have agreed is for the best. Most of what he thinks about engineering ends up on this blog.