Iris — AI Secretary | Marc GAUTHIER

1. Overall architecture

The system is built around a central backend that orchestrates three client surfaces (mobile app, website, push notifications) together with telephony and AI services.

Figure: architecture diagram of the Iris system. The backend orchestrates telephony, the AI voice agent, the database and the client surfaces.

The project's guiding principle: the backend is the conductor. The conversational AI talks with the caller, but every call-routing decision (reject, transfer, destination) is executed by the backend and never delegated to a third-party service. This rule, learned from the most expensive lesson of the project (see section 4), keeps call handling predictable and under our control.

The architecture is multi-tenant by design: each customer has their own dedicated number, and the backend resolves the relevant account from the called number on every incoming webhook.
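A minimal sketch of that per-webhook lookup, assuming the dialled number arrives in E.164 format (Twilio's voice webhook sends it in the `To` parameter). `Tenant`, `TenantDirectory` and the placeholder numbers are hypothetical; the real backend resolves the account from the database rather than from an in-memory map.

```typescript
// Hypothetical tenant record: each customer owns one dedicated Iris number.
interface Tenant {
  userId: string;
  irisNumber: string; // E.164 format, e.g. "+33100000001"
}

// Hypothetical in-memory directory; the real backend would query the database.
class TenantDirectory {
  private byNumber = new Map<string, Tenant>();

  constructor(tenants: Tenant[]) {
    for (const t of tenants) this.byNumber.set(t.irisNumber, t);
  }

  // Resolve the account from the called number (the "To" field of the voice webhook).
  // Returns undefined when the number belongs to no customer.
  resolve(calledNumber: string): Tenant | undefined {
    return this.byNumber.get(calledNumber.trim());
  }
}

// Placeholder numbers, not real customers.
const directory = new TenantDirectory([
  { userId: "user-a", irisNumber: "+33100000001" },
  { userId: "user-b", irisNumber: "+33100000002" },
]);
```

An unresolved number is better ended cleanly than routed to a default account, since a wrong lookup would mix one customer's calls into another's history.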

2. Tech stack

Telephony — Twilio Programmable Voice: numbers, TwiML webhooks, call bridging and the VoIP SDK.
Voice agent — Vapi.ai for real-time STT → LLM → TTS orchestration, with GPT-4.1 for the dialogue, Speechmatics for French transcription and Cartesia for the synthetic voice.
Backend — Node.js + Express: webhooks, classification, call routing, notifications.
Database — Supabase (Postgres hosted in Europe): authentication and per-user data isolation (RLS).
Mobile app — Expo / React Native (TypeScript), Twilio VoIP SDK to receive calls inside the app, OTA updates via EAS Update.
Website — Next.js (App Router, SSR), bilingual FR/EN.
Payments — Stripe: Checkout, Customer Portal and webhooks.
Notifications — Expo / FCM push for every screened call.
Hosting — personal TrueNAS Scale server (Docker Compose), exposed through an outbound HTTPS tunnel.

3. The journey of a call

Entry and immediate triage

Every incoming call triggers a webhook to the backend, which identifies the customer then triages: a blacklisted number is rejected immediately; a whitelisted number is transferred directly, without AI; an unknown number is sent to the voice assistant. In that last case, the backend memorises the identifier of the parent call leg — the keystone of the transfer (see section 4).
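The triage step can be sketched as a pure function. The names (`triage`, `CallerLists`, `pendingParentLegs`) are hypothetical, and giving the blacklist priority when a number appears on both lists is an assumption, not something the write-up specifies.

```typescript
type TriageDecision =
  | { action: "reject" }
  | { action: "transfer"; destination: string }
  | { action: "screen"; parentCallSid: string };

// Hypothetical per-customer lists, numbers stored in E.164 format.
interface CallerLists {
  blacklist: Set<string>;
  whitelist: Set<string>;
}

// Hypothetical registry of parent call legs awaiting a screening verdict (callSid -> userId).
const pendingParentLegs = new Map<string, string>();

function triage(
  userId: string,
  caller: string,
  parentCallSid: string,
  lists: CallerLists,
  destination: string,
): TriageDecision {
  // Assumption: the blacklist wins if a number appears on both lists.
  if (lists.blacklist.has(caller)) return { action: "reject" };
  // Known contact: transfer directly, the AI never answers.
  if (lists.whitelist.has(caller)) return { action: "transfer", destination };
  // Unknown caller: remember the parent leg so the backend can take it back later.
  pendingParentLegs.set(parentCallSid, userId);
  return { action: "screen", parentCallSid };
}
```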

The screening dialogue

The assistant collects the name, the company and the reason, then calls an evaluation tool on the backend. Three verdicts are possible: legitimate call (the transfer is announced, then executed), call to block (the assistant politely declines and hangs up), or ambiguous case, where the assistant decides on its own according to its rules, with a deliberate philosophy: when in doubt, transfer. Better to put through one call too many than to reject a call from family.
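The verdict policy, including the "when in doubt, transfer" default, fits in a few lines. This is an illustrative sketch with hypothetical names: in Iris the ambiguous case is settled by the assistant itself according to its rules, so `assistantChoice` only stands in for that decision.

```typescript
type Verdict = "legitimate" | "block" | "ambiguous";
type Outcome = "transfer" | "decline";

// Hypothetical result of the backend evaluation tool.
interface Evaluation {
  verdict: Verdict;
  // For ambiguous cases only: the assistant's own choice, if it reached a firm one.
  assistantChoice?: Outcome;
}

function decide(e: Evaluation): Outcome {
  if (e.verdict === "legitimate") return "transfer"; // announced, then executed
  if (e.verdict === "block") return "decline"; // polite refusal, then hang up
  // Ambiguous: "when in doubt, transfer" unless the assistant firmly chose to decline.
  return e.assistantChoice ?? "transfer";
}
```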

End of call and traceability

If the recipient doesn't answer, the caller hears a polite message and the user receives a "missed call" notification — since the caller already gave their name and reason, voicemail becomes unnecessary. The end-of-call report (transcript, duration, end reason) then completes the history. Cascading safety nets guarantee that no call is ever orphaned: every call exists in the history, even if the caller hangs up before the screening ends.
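One way to implement the missed-call path, assuming the transfer `<Dial>` is given an `action` URL: Twilio then requests that URL when the dial ends and reports the result in `DialCallStatus`. `onDialAction` and the message wording are placeholders, not the project's code.

```typescript
// Values Twilio reports in the DialCallStatus parameter of a <Dial> action callback.
type DialCallStatus = "completed" | "answered" | "busy" | "no-answer" | "failed" | "canceled";

interface DialOutcome {
  twiml: string; // TwiML returned to the telephony webhook
  notifyMissed: boolean; // whether to push a "missed call" notification
}

// Placeholder wording (contains no XML special characters, so no escaping is needed).
const UNAVAILABLE_MESSAGE =
  "Your correspondent is not available right now. Your name and reason have been passed on.";

function onDialAction(status: DialCallStatus): DialOutcome {
  if (status === "completed" || status === "answered") {
    // The recipient took the call: simply end the parent leg.
    return { twiml: "<Response><Hangup/></Response>", notifyMissed: false };
  }
  // Busy, no answer, failure or cancellation: no voicemail, since the screening
  // already captured who is calling and why.
  return {
    twiml: `<Response><Say>${UNAVAILABLE_MESSAGE}</Say><Hangup/></Response>`,
    notifyMissed: true,
  };
}
```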

4. Telephony: keeping ownership of the call

The hardest part of the project: it took three successive architectures to reach a reliable call transfer.

First dead end — handing over the number

The first version handed the number directly to the voice-agent platform. Simple, but the call then "belongs" to the third-party service: it becomes impossible to redirect it ourselves, and the SIP REFER method, used for transfers, is refused on a regular phone (PSTN) leg. Every call-takeover strategy was blocked by design.

Second dead end — the provider's warm transfer

The transfer mechanism offered by the voice-agent platform turned out to be broken: the destination picked up in under a second, but the leg was systematically cut about three seconds later, without the caller ever being bridged. The issue was proven by reading the carrier's call logs, reproduced on every possible path, and turned out to be internal to the provider — therefore unsolvable on our side.

The final architecture

The number stays managed by our own telephony: our TwiML dials an authenticated SIP leg (digest authentication) towards the voice agent. Two decisive consequences: the leg towards the AI is a genuine SIP leg, and the parent leg remains a call we own — therefore freely redirectable.
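A sketch of the answering TwiML, assuming Twilio's `<Sip>` noun with its `username` and `password` attributes, which supply the credentials used when the SIP endpoint issues a digest challenge. The SIP URI and credentials are placeholders that would come from environment variables. Twilio's helper library also offers a TwiML builder (`VoiceResponse`); plain string assembly is used here only to keep the example dependency-free.

```typescript
// Escape the five XML special characters so credentials or URIs cannot break the document.
function escapeXml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

interface SipCredentials {
  username: string;
  password: string;
}

// Build the TwiML answering the incoming call: dial a digest-authenticated SIP leg
// towards the voice agent, so the parent PSTN call stays owned by our own account.
function buildSipDialTwiml(agentSipUri: string, creds: SipCredentials): string {
  return (
    "<Response><Dial>" +
    `<Sip username="${escapeXml(creds.username)}" password="${escapeXml(creds.password)}">` +
    escapeXml(agentSipUri) +
    "</Sip></Dial></Response>"
  );
}
```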

The transfer then follows a call-takeover strategy (the official "call screening" pattern):

on entry, the backend memorises the parent call identifier in a short-lived in-memory registry;
the assistant screens the call — its SIP leg is only a child of our call;
on a transfer verdict, the backend takes the parent call back through the API and injects a new <Dial> scenario towards the destination;
the AI leg is hung up and the caller is bridged with the recipient: the bridge is 100% telephony, with no dependency on an experimental third-party mechanism.
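The four steps above can be sketched as follows. `CallUpdater` is a hypothetical interface standing in for the telephony REST call (with Twilio, an update of the live Call resource carrying new TwiML), and keying the registry by a screening session id is an assumption: the write-up only says the parent call identifier is kept in a short-lived in-memory registry.

```typescript
// Hypothetical abstraction over the telephony REST API: replace the TwiML of a live call.
interface CallUpdater {
  updateCallTwiml(callSid: string, twiml: string): Promise<void>;
}

interface PendingCall {
  parentCallSid: string;
  expiresAt: number;
}

// Short-lived in-memory registry: screening session id -> parent call leg.
class ScreeningRegistry {
  private entries = new Map<string, PendingCall>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  remember(sessionId: string, parentCallSid: string): void {
    this.entries.set(sessionId, { parentCallSid, expiresAt: this.now() + this.ttlMs });
  }

  // Remove and return the parent leg; undefined if unknown or expired.
  take(sessionId: string): string | undefined {
    const entry = this.entries.get(sessionId);
    this.entries.delete(sessionId);
    if (entry === undefined || entry.expiresAt < this.now()) return undefined;
    return entry.parentCallSid;
  }
}

// On a transfer verdict: take the parent call back and inject a new <Dial>.
// Replacing the parent call's TwiML ends its current <Dial>, which hangs up the AI's SIP leg.
async function takeOverAndTransfer(
  registry: ScreeningRegistry,
  updater: CallUpdater,
  sessionId: string,
  destinationE164: string, // assumed already validated as E.164
): Promise<boolean> {
  const parentCallSid = registry.take(sessionId);
  if (parentCallSid === undefined) return false; // caller gone or session expired
  const twiml = `<Response><Dial><Number>${destinationE164}</Number></Dial></Response>`;
  await updater.updateCallTwiml(parentCallSid, twiml);
  return true;
}
```

Because `take` deletes the entry, a duplicated transfer request cannot redirect the same call twice. Being in memory, the registry is lost on restart, which is tolerable for entries that only live for the duration of a screening.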

Diagnosing this part required a close reading of SIP traces (Via headers, error codes, authentication challenges): genuine network investigation, ultimately validated by real end-to-end calls.

5. Backend and database

The Express backend exposes the telephony webhooks (incoming call, end of bridging), the tool webhooks called by the assistant (evaluation, connection, end-of-call report), VoIP token delivery for authenticated users and a health probe used by the app and the website.
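Input validation for the evaluation tool webhook might look like this sketch. The field names (`callerName`, `company`, `reason`), the length bound and the choice to tolerate missing fields are assumptions made for illustration.

```typescript
// Hypothetical arguments sent by the assistant to the evaluation tool.
interface ScreeningInput {
  callerName: string;
  company: string;
  reason: string;
}

type Validation<T> = { ok: true; value: T } | { ok: false; error: string };

const MAX_FIELD_LENGTH = 300; // illustrative bound on free-text fields

// Missing fields are tolerated as empty strings (a caller may not give a company);
// wrongly typed fields are rejected.
function readText(rec: Record<string, unknown>, key: string): string | Error {
  const v = rec[key];
  if (v === undefined || v === null) return "";
  if (typeof v !== "string") return new Error(`${key} must be a string`);
  return v.trim().slice(0, MAX_FIELD_LENGTH);
}

function validateScreeningInput(body: unknown): Validation<ScreeningInput> {
  if (typeof body !== "object" || body === null || Array.isArray(body)) {
    return { ok: false, error: "body must be a JSON object" };
  }
  const rec = body as Record<string, unknown>;
  const callerName = readText(rec, "callerName");
  if (callerName instanceof Error) return { ok: false, error: callerName.message };
  const company = readText(rec, "company");
  if (company instanceof Error) return { ok: false, error: company.message };
  const reason = readText(rec, "reason");
  if (reason instanceof Error) return { ok: false, error: reason.message };
  return { ok: true, value: { callerName, company, reason } };
}
```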

Internal services are split by responsibility:

call classification (telemarketing/scam keywords, with an optional LLM classifier);
an in-memory registry of calls currently being screened — the core of the takeover strategy;
structured notifications and push;
a multi-tenant storage facade.
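A sketch of the classification order: cheap keyword matching first, the optional LLM classifier only for what the keywords cannot settle. The keyword lists are invented English placeholders (real lists for French calls would differ), and `llm` is a hypothetical injected function, not a specific provider API.

```typescript
type Category = "telemarketing" | "scam" | "legitimate" | "unknown";

// Illustrative keyword lists only; the project's real lists are not published here.
const SCAM_KEYWORDS = ["verification code", "parcel held", "bank security department"];
const TELEMARKETING_KEYWORDS = ["special offer", "energy audit", "survey", "subscription deal"];

// Lower-case and strip accents so accented and unaccented spellings compare equal.
function fold(text: string): string {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase();
}

function keywordCategory(text: string): Category {
  const t = fold(text);
  if (SCAM_KEYWORDS.some((k) => t.includes(fold(k)))) return "scam";
  if (TELEMARKETING_KEYWORDS.some((k) => t.includes(fold(k)))) return "telemarketing";
  return "unknown";
}

// Keywords first (cheap, deterministic); the optional LLM classifier only sees what
// the keywords could not settle, and its failure degrades to "unknown".
async function classify(
  text: string,
  llm?: (text: string) => Promise<Category>,
): Promise<Category> {
  const byKeyword = keywordCategory(text);
  if (byKeyword !== "unknown" || llm === undefined) return byKeyword;
  try {
    return await llm(text);
  } catch {
    return "unknown";
  }
}
```

Plain substring matching is deliberately crude ("survey" also matches "surveyor"); word-boundary matching would be a natural refinement.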

On the robustness side: every call to an external service degrades gracefully instead of blocking the call flow, webhooks validate their inputs, and verdicts have cascading safety nets (immediate verdict → end-of-call report → orphan-row creation → notification).
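The safety-net cascade can be sketched as follows, with a `Map` standing in for the call-history table. The function names, the `unknown` placeholder verdict and the rule that an earlier verdict is never overwritten are assumptions for illustration.

```typescript
type DecisionSource = "immediate" | "end-of-call-report" | "orphan";

interface HistoryRow {
  callSid: string;
  verdict: string;
  source: DecisionSource;
}

// Hypothetical history store keyed by call id (Iris uses a Postgres table).
const history = new Map<string, HistoryRow>();

// Non-blocking degradation: a failing side effect is logged, never thrown into the call flow.
async function nonBlocking(label: string, task: () => Promise<void>): Promise<void> {
  try {
    await task();
  } catch (err) {
    console.warn(`${label} failed:`, err);
  }
}

// Net 1: the evaluation tool records its verdict as soon as it exists.
function recordImmediate(callSid: string, verdict: string): void {
  history.set(callSid, { callSid, verdict, source: "immediate" });
}

// Nets 2 and 3: the end-of-call report fills a missing verdict, or creates an orphan row
// when the caller hung up before any verdict; an earlier verdict is never overwritten.
function recordEndOfCall(callSid: string, reportVerdict?: string): HistoryRow {
  const existing = history.get(callSid);
  if (existing !== undefined) return existing;
  const row: HistoryRow =
    reportVerdict !== undefined
      ? { callSid, verdict: reportVerdict, source: "end-of-call-report" }
      : { callSid, verdict: "unknown", source: "orphan" };
  history.set(callSid, row);
  return row;
}

// Net 4: the notification is attempted last; a push failure cannot lose the history row.
async function finaliseCall(
  callSid: string,
  reportVerdict: string | undefined,
  push: (row: HistoryRow) => Promise<void>,
): Promise<HistoryRow> {
  const row = recordEndOfCall(callSid, reportVerdict);
  await nonBlocking("push notification", () => push(row));
  return row;
}
```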

The Postgres database (Supabase) stores profiles, whitelist and blacklist, and the call history (verdict, decision source, transcript, duration). Access security relies on Row Level Security: an authenticated user can only read and write their own rows; the server key that bypasses those rules is never exposed client-side; the history is read-only for clients. The schema evolved through versioned SQL migrations, including a trigger that automatically creates the business profile at sign-up.

6. Mobile app

The app (Expo / React Native, TypeScript) offers a dashboard, history with transcripts, list management with phone-book import (numbers normalised to international format), a dedicated incoming-call screen and settings — with a light/dark theme.
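A sketch of the phone-book normalisation, assuming a French number when no country code is present. This simplified rule set is only an illustration; a production app would more likely rely on a dedicated library such as libphonenumber, which knows each country's numbering plan.

```typescript
// Normalise a phone-book entry to E.164, assuming a French number when no country code is given.
// Returns null for inputs this simple rule set cannot interpret.
function toE164(raw: string, defaultCountryCode = "33"): string | null {
  // Drop the "(0)" trunk-prefix notation sometimes written as "+33 (0)6 ...".
  const cleaned = raw.trim().replace(/\(0\)/g, "");
  const digits = cleaned.replace(/\D/g, "");
  let international: string;
  if (cleaned.startsWith("+")) {
    international = digits; // already carries a country code
  } else if (digits.startsWith("00")) {
    international = digits.slice(2); // "00" international call prefix
  } else if (digits.startsWith("0")) {
    international = defaultCountryCode + digits.slice(1); // national format
  } else {
    return null;
  }
  // E.164 allows at most 15 digits; the lower bound of 8 is an arbitrary sanity check.
  if (international.length < 8 || international.length > 15) return null;
  return "+" + international;
}
```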

Receiving a call inside the app: the VoIP chain

In "full service" mode, the user's real number is forwarded to Iris, so transferring a legitimate call back to that same number over the regular GSM network would create an infinite loop. Legitimate calls are therefore delivered directly inside the app through the VoIP SDK:

the backend delivers a temporary access token to the authenticated user, and the app registers itself automatically at launch — zero configuration after installation;
with the app closed, a silent push notification wakes the native module, which rings the phone full screen, even when locked;
the call context (name, company, reason) travels with the call and is displayed on the ringing screen.
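The third point can be sketched with TwiML: the `<Client>` noun accepts an `<Identity>` and custom `<Parameter>` elements, which Twilio's Voice SDKs expose as custom parameters of the incoming call invite. The identity scheme and parameter names below are hypothetical.

```typescript
// Escape XML special characters in identities and caller-supplied text.
function escapeXml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Screening context collected by the assistant.
interface CallContext {
  name: string;
  company: string;
  reason: string;
}

// Ring the user's app (registered with the VoIP SDK under a client identity) and attach
// the screening context as custom parameters, readable by the app on the ringing screen.
function buildAppDialTwiml(identity: string, ctx: CallContext): string {
  const fields: Array<[string, string]> = [
    ["name", ctx.name],
    ["company", ctx.company],
    ["reason", ctx.reason],
  ];
  const params = fields
    .map(([key, value]) => `<Parameter name="${escapeXml(key)}" value="${escapeXml(value)}"/>`)
    .join("");
  return (
    "<Response><Dial><Client>" +
    `<Identity>${escapeXml(identity)}</Identity>${params}` +
    "</Client></Dial></Response>"
  );
}
```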

Among the pitfalls encountered: a microphone permission that is declared but not granted at runtime produces a call that connects… in total silence, and a call invite that arrives before JavaScript has started must be explicitly retrieved at launch.

Updates ship over the air (EAS Update): any JavaScript change is pushed in about a minute, with no reinstall and no store review — native rebuilds are only needed when native modules change.

7. Website and payments

The Next.js website (App Router, server-side rendering) is fully bilingual FR/EN. The marketing pages present the service with sourced real-world statistics and animated SVG diagrams of the call journey; the customer area (cookie-based sessions) gives access to the dashboard, the call history and account management.

The payment lifecycle relies on Stripe with a few structuring choices:

subscription via Checkout with a trial period, management (invoices, cancellation) via the Customer Portal;
the signed webhook synchronises state in real time, but the source of truth is a direct reconciliation with Stripe every time the subscription page is displayed, which makes the system resilient to missed webhooks;
every "service active" display (website and app) is conditioned on the actual subscription state;
the app opens the browser already signed in, thanks to a single-use login link generated by the backend;
account deletion (GDPR) cancels the subscription, erases data in cascade and removes the authentication identity, in one click.
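A sketch of the gating and reconciliation rules, assuming Stripe's subscription statuses. Which statuses count as "service active" is a product decision: treating only `active` and `trialing` as live is an assumption, as are the injected `fetchFromStripe` and `saveLocal` functions.

```typescript
// Subscription statuses defined by Stripe.
type SubscriptionStatus =
  | "incomplete"
  | "incomplete_expired"
  | "trialing"
  | "active"
  | "past_due"
  | "canceled"
  | "unpaid"
  | "paused";

// Assumption: the service is live during the trial and while the subscription is active.
// Treating "past_due" as inactive is a product choice, not a Stripe rule.
function isServiceActive(status: SubscriptionStatus | null): boolean {
  return status === "active" || status === "trialing";
}

// Hypothetical reconciliation run when the subscription page is displayed: the status
// fetched from Stripe wins over the last value written by a webhook. If Stripe cannot be
// reached, the locally stored value is kept rather than failing the page.
async function reconcile(
  localStatus: SubscriptionStatus | null,
  fetchFromStripe: () => Promise<SubscriptionStatus | null>,
  saveLocal: (status: SubscriptionStatus | null) => Promise<void>,
): Promise<SubscriptionStatus | null> {
  let remote: SubscriptionStatus | null = localStatus;
  try {
    remote = await fetchFromStripe();
  } catch {
    return localStatus;
  }
  if (remote !== localStatus) await saveLocal(remote);
  return remote;
}
```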

8. Deployment, security and personal data

The backend and the website are deployed as containers (Docker Compose) on my personal TrueNAS server, the same one used for the NAS project. Public exposure goes through an outbound HTTPS tunnel: no port is opened on the home network, and certificates are managed automatically. The main security and personal-data measures are:

non-root containers with health probes;
secrets only in environment variables, never in the code or the repository;
accounts and history hosted in Europe;
short transcript retention and no raw audio stored;
strict per-user data isolation (RLS), per-user VoIP tokens and single-use login links;
data export and deletion built into the product.
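Single-use login links can be sketched as tokens that are deleted on first redemption and expire quickly. The class and its parameters are hypothetical, and the write-up does not say how Iris implements them; the token generator is injected so that production code can supply a cryptographically random one (for example from Node's `crypto` module).

```typescript
interface LinkEntry {
  userId: string;
  expiresAt: number;
}

// Hypothetical store of single-use login tokens (in memory, so lost on restart).
class SingleUseLinks {
  private entries = new Map<string, LinkEntry>();
  private ttlMs: number;
  private newToken: () => string;
  private now: () => number;

  constructor(ttlMs: number, newToken: () => string, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.newToken = newToken;
    this.now = now;
  }

  // Issue a token for an authenticated app user; it becomes the login link's secret.
  issue(userId: string): string {
    const token = this.newToken();
    this.entries.set(token, { userId, expiresAt: this.now() + this.ttlMs });
    return token;
  }

  // Redeem a token: the entry is deleted on first use, whether it is still valid or not.
  redeem(token: string): string | null {
    const entry = this.entries.get(token);
    if (entry === undefined) return null;
    this.entries.delete(token);
    return entry.expiresAt >= this.now() ? entry.userId : null;
  }
}
```

Deleting the entry before checking expiry means an expired link is burned too, so it cannot be retried.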

The whole system now runs continuously and serves as the foundation for the planned next steps: enriching the incoming-call screen (company verification, reason summary), live screening transcripts in the app, and an iOS port.
