[{"content":"A short one — this time about a colleague\u0026rsquo;s work worth sharing.\nYury Usishchev built validgo-gen — an OpenAPI 3.0 → Go code generator focused on the one thing Go\u0026rsquo;s JSON handling is notoriously bad at: input validation.\nThe core problem: after json.Unmarshal, a missing field, an explicit null, and an empty \u0026quot;\u0026quot; all collapse into the same zero value. Your handler can\u0026rsquo;t tell them apart — and for a strict API contract, those are three very different cases.\nvalidgo-gen solves it with two-layer validation:\nPre-deserialization — inspects raw JSON before unmarshaling: missing required fields and invalid nulls get caught while they\u0026rsquo;re still distinguishable. Post-deserialization — go-playground/validator struct tags (min, max, oneof, email, unique) on the resulting structs. What I like about the generated code:\nchi integration — plugs into existing middleware, no custom runtime lock-in. Per-operation interfaces — clean dependency injection instead of one giant server interface. Idiomatic types — *string for optionals, decimal.Decimal, time.Time. No Java-flavored Go. Built via go/ast — output is valid, gofmt-compliant Go by construction, not by template luck. If you\u0026rsquo;ve been burned by oapi-codegen (no validation), ogen (non-standard runtime coupling), or openapi-generator (Java-style everything) — give it a look.\n","permalink":"https://rivik.dev/validgo-gen-openapi-go-validation-done-right/","summary":"My colleague Yury built validgo-gen — an OpenAPI 3.0 → Go generator that finally distinguishes missing fields from explicit nulls from zero values. Two-layer validation, chi integration, idiomatic output.","title":"validgo-gen: OpenAPI → Go Validation Done Right"},{"content":"Are you still storing API tokens as plaintext in ~/.secrets? Then any application you install — and any malware that comes with it — can silently read them. 
On macOS and Windows, \u0026ldquo;disk access\u0026rdquo; is one careless click away.\nSo I built ykvault — a tiny CLI vault where every secret is encrypted with a key derived from your YubiKey (original LinkedIn post). It started as a 90-line script and grew into a small Go tool.\necho \u0026#34;my_api_key\u0026#34; | ykvault set s3-token # store (touch required) ykvault get s3-token # retrieve (touch required) ykvault ls # list secret IDs ykvault rm s3-token # delete How It Works No master password, no cloud, no background daemon. Just YubiKey HMAC-SHA1 challenge-response:\nThe secret ID (e.g. s3-token) is the challenge. The YubiKey computes HMAC-SHA1 of it with a non-extractable hardware secret — physical touch required. SHA-256 of the response → AES-256 key; SHA-256(response‖\u0026ldquo;iv\u0026rdquo;)[:16] → IV. AES-256-CBC encrypts the value into ~/.ykvault/\u0026lt;id\u0026gt;.ykv.slot\u0026lt;N\u0026gt; (mode 0600). Decryption is fully deterministic — nothing extra to store or lose. The encrypted files are useless without your YubiKey.\nSetup:\nykman otp chalresp 2 --touch --generate # one-time YubiKey config go install github.com/sintoniastrategy/ykvault@latest Why Not Bitwarden / KeePassXC CLI? They\u0026rsquo;re fine — and more featureful. But for the narrow \u0026ldquo;give my script a token, with a hardware touch\u0026rdquo; case, ykvault is one binary, zero config, and no vault password that malware can keylog.\nThe Honest Caveat This is hiding secrets, not securing them — like moving from passwords.txt to Bitwarden. Touch-gating stops malware from silently scraping everything at rest, but if active malware can capture the plaintext at the moment you use it, it\u0026rsquo;s already game over. 
Secret IDs (filenames) aren\u0026rsquo;t encrypted either — only values.\nFor SSH specifically, go further: keys that never leave the hardware — see Hardware-Backed SSH Keys.\n","permalink":"https://rivik.dev/ykvault-stop-storing-api-tokens-as-plaintext/","summary":"Are you still keeping API tokens in ~/.secrets? Any app you install can read them. ykvault encrypts every secret with a YubiKey challenge-response key — each get/set requires a physical touch, and the encrypted files are useless without your key.","title":"ykvault: Stop Storing API Tokens as Plaintext"},{"content":"Hi, I\u0026rsquo;m Ilya, I like computers 🥲\nThat\u0026rsquo;s the short version. The slightly longer one: I\u0026rsquo;ve been pulling apart Linux internals and computer networks since I was a teenager, and somewhere along the way it turned into a career — software, HPC, hardware, ML/AI, with a long detour through fintech at Yandex.\nNow I\u0026rsquo;m CTO and co-founder of iProxy.online. We build enterprise mobile proxy infrastructure for web data, which is a fancy way of saying we run a lot of phones in a lot of places — 100+ countries, 600+ mobile carriers, last I checked. It started as a pet project. I kept doing the parts that felt fun (hiring great people, designing infrastructure, sweating developer experience, market research, project-managing the mess), and the rest, my co-founders and our small team handled with grace. None of this would exist without them.\nOn the side, I run Sintonia Strategy \u0026amp; Technology — a loose circle of hackers and MBA-trained strategists, mostly friends and former colleagues, who occasionally team up when something interesting shows up. No headcount, no roadmap — just good people doing good work together, mostly for the fun of it. A longer-horizon project, and the kind of work I want to still be doing in ten years.\nI have a half-superstitious theory about work: chase the fun problem, not the money. 
In my experience, money tends to show up afterwards, slightly surprised to find you there.\nI live in Portugal. I\u0026rsquo;m an engineer, an entrepreneur, and — I hope — a decent builder.\nFind me:\nGitHub: @rivik · sintoniastrategy · iproxy-online in/ilya-rusalowski x.com/IlyaRusalowski dev.to/rivik ","permalink":"https://rivik.dev/about/","summary":"About Ilya Rusalowski","title":"About"},{"content":"I was afraid of agents yolo-mode for half a year. All my systems were backed up, secrets encrypted, credentials scoped (I can\u0026rsquo;t force-push from my daily account, etc.). But I just can\u0026rsquo;t stand it when an agent runs npm install -g (in a Golang repo 😬, when the global md says to always install tools into the project dir).\nI tried Docker, but it slows down my workflow: I have very unusual tooling, from LaTeX and USB devices to qemu-kvm Android emulators and Incus clusters. Docker is ideal for software development, but too much for research and POCs. Managing all of this in Docker amounts to moving my whole VM into Docker.\nAgent sandboxing with this tooling is a pain. Too restrictive, constantly \u0026ldquo;please allow Unsandboxed, this is impossible if isolated\u0026rdquo;.\nI just need \u0026ldquo;do whatever you want, respect unix permissions (use usb if you can), but don\u0026rsquo;t cross project dir!\u0026rdquo;.\nagent-landlock — small Go wrapper around Claude Code / Codex / Gemini that uses Linux Landlock LSM (kernel 6.2+) to make the host filesystem read-only for the agent process, except $PWD and paths you grant explicitly.\nNo containers, no namespaces, no paired UID, no mount tricks. Process-local, the kernel cleans up when the process exits. Reads still work everywhere your user can read, so LSP, git, USB, GPU, qemu-kvm, host networking all keep working.\nagent-landlock claude agent-landlock codex exec ... agent-landlock gemini agent-landlock run -- pytest -x Forces YOLO flags by default. Persistent grants via agent-landlock grant ~/.avd. 
Fails closed if Landlock unavailable.\nGolang, MIT — https://github.com/sintoniastrategy/agent-landlock\n","permalink":"https://rivik.dev/i-was-afraid-of-agents-yolo-mode-for-half-a-year/","summary":"Why I built agent-landlock — a small Go wrapper that uses Linux Landlock LSM to give coding agents YOLO mode without letting them escape the project directory.","title":"I was afraid of agents yolo-mode for half a year"},{"content":"This is a working guide to using a YubiKey for SSH on a real Linux fleet, plus the surrounding landscape — PIV, software-only alternatives, and SSH certificate authorities. The goal is to retire file-based SSH keys without breaking daily operations.\nThe article is structured around four questions:\nWhat does a hardware-backed key actually do, and what knobs do you control? How do you combine those knobs into a policy that works for both root login and Ansible? What if you can\u0026rsquo;t ship YubiKeys? When should you stop managing keys yourself and adopt an SSH CA? The problem with file-based keys Every classic SSH key is a file in ~/.ssh/. That file holds the private key. To log in to a server, your SSH client reads the file and produces a cryptographic signature.\nThere are really two issues here, and they compound:\nThe private key is a file. It exists in the filesystem, can be read by anything with sufficient access, can be copied, backed up, accidentally committed, or extracted via a misconfigured recovery scenario. This is a fundamental property of where the key lives. The discipline that would mitigate this rarely survives daily work. The cryptography is fine; the operational reality isn\u0026rsquo;t. What works in theory: a passphrase-protected key combined with ssh-agent -t 10m is genuinely close to unbreakable. The key is decrypted briefly, signs what it needs, and the agent forgets it. 
What happens in practice: engineers drop passphrases for convenience, or load the key into ssh-agent on first use and leave the agent running for the entire session. Agent forwarding compounds it: with ssh -A, a key that\u0026rsquo;s been unlocked once can sign on the operator\u0026rsquo;s behalf from any forwarded host for the rest of the agent\u0026rsquo;s lifetime. Hardware-backed keys remove the need for that discipline. The private key never leaves the device, and signing requires the device\u0026rsquo;s physical presence — there\u0026rsquo;s nothing to forget to passphrase, nothing to leave running for too long, nothing for a forwarded host to sign with silently.\nYubiKey is the most flexible option because the same device works on Linux, macOS, Windows, iOS, and Android with the same protocol and the same key files. Most of this article is about YubiKey + FIDO2; the alternatives come later.\nHow a YubiKey actually signs [Figure: the laptop\u0026rsquo;s SSH client sends a signature request for a nonce to the YubiKey; no private key is on disk, and the private key never leaves the device.] The SSH client never sees the private key. It hands the YubiKey a small piece of data to sign (a \u0026ldquo;nonce\u0026rdquo;), the YubiKey signs internally and returns the signature. If the device is unplugged, signing is impossible regardless of what\u0026rsquo;s on the laptop.\nThis article uses FIDO2 (the modern protocol; SSH key types sk-ssh-ed25519@openssh.com and sk-ecdsa-sha2-nistp256@openssh.com, generated with ssh-keygen -t ed25519-sk or -t ecdsa-sk). FIDO2 has been first-class in OpenSSH since version 8.2 (February 2020). PIV — the older smartcard protocol — is covered later as an alternative.\nThe four knobs When you generate a FIDO2 key on a YubiKey, four properties determine how it behaves:\nResident vs non-resident — where the credential is stored. Touch — does signing require a tap on the YubiKey? PIN — does signing require the FIDO2 PIN? 
ssh-agent — is the key loaded into ssh-agent, or used directly? These are independent yes/no choices. Combined, they describe what it takes to sign with that particular key. The next four sections take them one at a time.\nKnob 1: resident vs non-resident This is the one most people get wrong, so it gets the most space.\n[Figure: resident vs non-resident storage. Resident: the credential lives on the YubiKey and ~/.ssh/id_root is just a label. Non-resident: the credential is split between an encrypted handle in ~/.ssh/id_wheel and the YubiKey\u0026rsquo;s master secret.] Resident (created with -O resident): the credential lives on the YubiKey itself. The file in ~/.ssh/ is just a pointer — a label that says \u0026ldquo;ask the device for credential 0xA3F2…\u0026rdquo;. If you delete the file, you can recreate it on any machine by running ssh-keygen -K, which queries the YubiKey for all its resident credentials and writes them to disk.\nNon-resident (the default): the credential is split. The YubiKey has a master secret used to derive credentials on demand. The file on disk holds an encrypted handle. To sign, the YubiKey needs the handle from the file plus its own master secret. Without the file, the YubiKey doesn\u0026rsquo;t know which credential to derive. Without the YubiKey, the file is gibberish.\nThe practical consequences:\nQuestion Resident Non-resident Can the file be reconstructed from the YubiKey? Yes (ssh-keygen -K) No Does losing the file matter? No Yes (re-enroll) Does a passphrase on the file add real security? No — file holds an identifier, not a secret Yes — file holds the encrypted credential handle Is ssh-agent needed? 
No, the YubiKey is the agent Usually yes, to avoid re-typing the passphrase The headline rule:\nResident keys don\u0026rsquo;t need a file passphrase, because the file holds nothing secret. Non-resident keys do, because the file holds the part of the credential that isn\u0026rsquo;t on the YubiKey.\nA non-resident key with a passphrase is conceptually identical to a classic passphrase-protected file SSH key — except the actual signing material never leaves the YubiKey. Same mental model, with the YubiKey as a hard-bound second factor.\nKnob 2: touch When you sign with a key, the YubiKey can require you to physically touch the gold disc. This is \u0026ldquo;user presence\u0026rdquo; — proof that a human is at the device.\nTouch required (default): every signing produces a touch prompt. The YubiKey\u0026rsquo;s LED blinks, you tap it, the signing completes. Failure to touch within ~15 seconds aborts the signing. No touch: signings happen automatically as long as the YubiKey is plugged in. Set with -O no-touch-required at generation. The server\u0026rsquo;s authorized_keys must also have no-touch-required for OpenSSH to accept the signature. You turn touch off when an operation produces many signings — Ansible across hundreds of hosts, an rsync of 100k files, a deploy that opens 50 sessions. None of these can realistically prompt for a touch each time.\nDisable touch only if you plan to use a short-lived ssh-agent with a passphrase-protected non-resident key file!\nTouch is a defense against silent malicious signing on a host you\u0026rsquo;ve connected to (with agent forwarding) or on a compromised laptop you happen to be at. It is not a defense against device theft — someone holding the device can touch it.\nKnob 3: PIN The YubiKey has a FIDO2 PIN, set once with ykman fido access change-pin. It\u0026rsquo;s separate from touch.\nPIN required (-O verify-required at generation): every signing prompts for the PIN. 
PIN not required: signing happens without a PIN (subject to touch policy). PIN is a defense against device theft. Touch alone doesn\u0026rsquo;t help you here — the thief can touch. PIN does, because the thief doesn\u0026rsquo;t know it.\nThe same device PIN gates ssh-keygen -K and FIDO2 credential management generally. Even for credentials that don\u0026rsquo;t require PIN to sign, the device PIN is required to extract them. This becomes important in the four-mode model below.\nKnob 4: ssh-agent ssh-agent is a small process that holds keys in memory and signs on behalf of SSH clients that ask. It exists for two reasons:\nYou don\u0026rsquo;t want to re-enter a file passphrase on every connection. Load the key once, use it many times. You want agent forwarding (ssh -A). Connecting to host A and then from inside that session to host B, with B able to ask your laptop\u0026rsquo;s agent for signatures back through the forwarded socket. For YubiKey-backed keys, whether you need an agent depends on the connection pattern, not just on storage:\nConnection pattern Agent needed? Direct SSH (ssh host) No — ssh client talks to the YubiKey directly ProxyJump (ssh -J jump target) No — local ssh signs each hop directly Agent forwarding (ssh -A, in-session multi-hop) Yes — remote host needs to reach your agent Non-resident key with passphrase Yes — to avoid retyping on every connection ProxyJump is the modern multi-hop pattern: the local ssh client opens each connection in sequence, signing each against the YubiKey directly. Nothing is exposed on intermediate hosts. Agent forwarding is the older pattern, used when you\u0026rsquo;re already inside a remote shell and need to reach further (e.g., on host1, running scp host2:file ./).\nFor loading resident keys into the agent (when forwarding needed), no passphrase is required:\nssh-add ~/.ssh/id_sudo # No passphrase prompt; the file holds a reference, not encrypted material. Never do this with no-touch no-pin keys! 
They must be password protected and added to agent like ssh-add -t 10m ~/.ssh/id_wheel\nTouch-required keys make agent forwarding safe again. With file-based keys, an unlocked agent signs anything it\u0026rsquo;s asked to, silently, for the agent\u0026rsquo;s lifetime — agent forwarding became dangerous because a compromised forwarded host could sign as you on every other host you have access to. With FIDO2 touch-required keys, every signing request from a forwarded host produces a touch prompt on your laptop. If you didn\u0026rsquo;t initiate the action, you don\u0026rsquo;t touch, and the signing fails. The classic \u0026ldquo;never use -A\u0026rdquo; advice no longer applies once credentials are hardware-backed and touch-gated.\nThis refines the rule:\nResident is the default. Non-resident is reserved for keys that must live in ssh-agent for the wheel-style mass-automation use case — explained next.\nThe four-mode model A single key configuration cannot serve both rare root login and Ansible across a fleet. Different operations have different blast radius and different frequency, and they want different policies. 
The pragmatic answer is four keys, each a deliberate combination of the four knobs.\n[Figure: four keys, four policies, one device. root (break-glass, ceremonial), sudo (daily admin), and robo (backups, sftp, stage deploys) are resident; wheel (Ansible, fleet rollouts) is non-resident, because non-resident is used only when the key needs ssh-agent.] The same model as a table:\nKey Touch PIN Storage File pass ssh-agent Use root yes yes resident no no Direct root SSH login sudo yes no resident no no Daily admin (TOTP at host) wheel no no non-resident yes yes NOPASSWD mass automation robo no no resident no no Backups, sftp, stage deploys Why three are resident and one isn\u0026rsquo;t wheel is the deliberate exception, for three reasons that compound:\n1. Mass automation must use ssh-agent. Ansible across 300 hosts produces thousands of signing operations per run. A touch on each is unworkable. So wheel is generated no-touch-required AND no verify-required. Once it\u0026rsquo;s loaded into ssh-agent (so it can be reused across the run), the agent holds the key in memory.\n2. The file on disk needs a passphrase. It\u0026rsquo;s to prevent accidental loading, and to force the operator to deliberately type something before the agent gets the key.\n3. The passphrase needs a forcing function. ssh-keygen -K on a new machine writes resident credentials into ~/.ssh — id_root, id_sudo, id_robo — none needing passphrases, because they\u0026rsquo;re just references to material on the device. The flow trains you that \u0026ldquo;resident-export-without-passphrase is safe.\u0026rdquo;\nIf wheel were resident, the same command would write id_wheel, and you\u0026rsquo;d have to remember the one exception: passphrase this file, the others are fine. Humans don\u0026rsquo;t reliably catch that exception. 
Non-resident wheel is structurally outside that flow: ssh-keygen -K can\u0026rsquo;t produce it, and the file you copy from your existing setup already has a passphrase. A physical equivalent: keep wheel on a separate YubiKey with a \u0026ldquo;passphrase required\u0026rdquo; sticker.\nGeneration commands # root: resident, touch + PIN ssh-keygen -t ed25519-sk -O resident -O verify-required \\ -N \u0026#34;\u0026#34; -f ~/.ssh/id_root -C \u0026#34;laptop-root\u0026#34; # sudo: resident, touch only ssh-keygen -t ed25519-sk -O resident \\ -N \u0026#34;\u0026#34; -f ~/.ssh/id_sudo -C \u0026#34;laptop-sudo\u0026#34; # wheel: non-resident, no touch, no PIN, passphrase, used via ssh-agent -t 10m ssh-keygen -t ed25519-sk -O no-touch-required \\ -f ~/.ssh/id_wheel -C \u0026#34;laptop-wheel\u0026#34; # Set a real passphrase when prompted. # robo: resident, no touch, no PIN ssh-keygen -t ed25519-sk -O resident -O no-touch-required \\ -N \u0026#34;\u0026#34; -f ~/.ssh/id_robo -C \u0026#34;laptop-robo\u0026#34; -N \u0026quot;\u0026quot; skips the file passphrase prompt. Used for the three resident keys. wheel is the only one without -N \u0026quot;\u0026quot; — you\u0026rsquo;ll be prompted, and you set a real passphrase.\nServer-side authorized_keys Keys generated with -O no-touch-required need a matching no-touch-required option in authorized_keys, otherwise OpenSSH rejects the signature.\nroot /root/.ssh/authorized_keys:\nsk-ssh-ed25519@openssh.com AAAA... laptop-root wheel ~wheel/.ssh/authorized_keys:\nno-touch-required sk-ssh-ed25519@openssh.com AAAA... laptop-wheel sudo ~admin/.ssh/authorized_keys (the daily-admin user with sudo privileges):\nsk-ssh-ed25519@openssh.com AAAA... 
laptop-sudo Pair the sudo key with pam_google_authenticator.so at the host\u0026rsquo;s sudo PAM stack:\n# /etc/pam.d/sudo auth required pam_google_authenticator.so Per-user TOTP secrets in /etc/google_authenticator (readable only by root) protect against a stolen YubiKey (touch alone is not enough for sudo). They also protect you from an accidental sudo rm -rf /.\nrobo ~robo/.ssh/authorized_keys — the most-restricted, non-prod-fleet entry, constrained at source IP and forced command:\nno-touch-required,from=\u0026#34;10.0.0.0/8\u0026#34;,command=\u0026#34;/usr/local/bin/backup-shell\u0026#34; sk-ssh-ed25519@openssh.com AAAA... laptop-robo PIV — the older alternative protocol YubiKey supports a second SSH path: PIV (Personal Identity Verification), a US-government smartcard standard that predates FIDO2 by about a decade.\nPIV-on-YubiKey gives you:\nMultiple \u0026ldquo;slots\u0026rdquo; (9a, 9c, 9d, 9e, plus retired 82–95) — each holds a separate certificate and key pair. Three touch policies per slot: never, cached (15-second window), always. PIN policies: default, once, always, never. Standard X.509 certificates, which integrate nicely if your environment already uses smartcards for things like email signing, S/MIME, or government identity. A typical setup:\n# Generate ECCP256 key in slot 9a, with cached touch and PIN-once ykman piv keys generate \\ --algorithm ECCP256 \\ --touch-policy CACHED \\ --pin-policy ONCE \\ 9a /tmp/pubkey.pem # Self-signed certificate (or sign with a corporate CA) ykman piv certificates generate \\ --subject \u0026#34;CN=admin\u0026#34; \\ 9a /tmp/pubkey.pem # Use it directly via PKCS#11 ssh -I /usr/lib/x86_64-linux-gnu/libykcs11.so user@host # Or load into ssh-agent ssh-add -s /usr/lib/x86_64-linux-gnu/libykcs11.so On paper, the cached touch policy is exactly what you want. 
One touch unlocks signing for 15 seconds, then it locks again — ideal for rsync or scp of many files where one logical operation triggers many SSH transactions.\nIn practice, the cache behavior depends on how your SSH client handles the PKCS#11 session. Different clients open and close PKCS#11 sessions differently:\nSome open the session once per ssh invocation and keep it open, so the cache works as advertised. Some open and close per cryptographic operation, which resets the cache and produces a touch prompt every signing. Behavior varies between OpenSSH versions, between using ssh-agent vs. direct PKCS#11, between Linux distributions and OS package builds. For a single user on one machine, PIV with cached can be made to work once you\u0026rsquo;ve found the right combination. For a fleet with mixed client versions across Linux, macOS, and Windows, the behavior isn\u0026rsquo;t predictable. You\u0026rsquo;ll get bug reports for years and your runbooks will accumulate if your client is X, do Y branches.\nFIDO2 sidesteps this entirely. Per-credential policy is set at generation time, OpenSSH speaks the protocol natively without PKCS#11 in the middle, and behavior is consistent across clients and platforms.\nUse PIV if you already have smartcard tooling, X.509 workflows, or a strong organizational reason to use the existing standard.\nUse FIDO2 if you\u0026rsquo;re starting fresh and want predictable behavior across a heterogeneous fleet.\nSoftware-only alternatives Hardware tokens cost money and procurement takes time. For distributed contractors, BYOD policies, or organizations without an IT budget for keys, you\u0026rsquo;re sometimes deploying software-only solutions. The options below all keep your private key better-protected than a plain file in ~/.ssh/, but with different trade-offs.\nThe dimension that matters: can the private key be extracted from where it lives?\nSolution Key storage Extractable? 
Notes Secretive (macOS) Apple Secure Enclave No Touch ID per signing. Open source. Windows Hello SSH Windows TPM No TPM-bound; biometric/PIN per signing. Caveats below. KeePassXC SSH agent Encrypted KDBX database Yes (when DB unlocked) Keys are read from disk; the DB is just an extra layer. 1Password SSH agent 1Password vault (cloud-synced) Yes (extractable when vault is unlocked locally) Convenient. You\u0026rsquo;re trusting their infrastructure. LastPass SSH agent LastPass vault (cloud-synced) Yes (2022 breach; weak master passwords brute-forced offline) LastPass had a major vault-data breach in 2022. The categories sort cleanly:\nHardware-backed (Secretive, Windows Hello). The private key is generated inside a secure element and never leaves it. Same security model as a YubiKey, but tied to one device. Strong for \u0026ldquo;I always work from this laptop\u0026rdquo;; weaker for \u0026ldquo;I work from three machines.\u0026rdquo;\nNote on Windows Hello SSH. \u0026ldquo;Windows Hello SSH\u0026rdquo; gets used to describe three different things, only one of which is genuinely the macOS-Secretive equivalent:\nTPM-backed via Virtual Smart Card — the actual TPM-bound SSH path. Requires tpmvscmgr.exe to create a virtual smart card, a self-signed cert via the Microsoft Smart Card Key Storage Provider, and PuTTY/Pageant rather than the default OpenSSH client. tpmvscmgr.exe is Pro/Enterprise/Education only — not available on Windows 11 Home. Windows Hello for Business — the corporate path, requires Entra ID or AD join. Out of scope for a personal laptop. ssh-keygen -t ed25519-sk with Windows Hello as the UV layer — the most-documented \u0026ldquo;Windows Hello SSH\u0026rdquo; path, but Windows Hello is just the UI layer asking for your PIN. The actual FIDO2 authenticator is still a USB device (typically a YubiKey). On Windows 11 Home, this is effectively the only available option, which means you need external hardware anyway. 
The takeaway: on macOS, software-only hardware-backed SSH is one click in Secretive. On Windows it\u0026rsquo;s an enterprise feature with awkward retrofitting, and Home users are pushed toward an external YubiKey regardless. This is one of the practical reasons a YubiKey wins on cross-platform — the same device works the same way on every OS, no per-OS puzzle to solve.\nSoftware-encrypted (KeePassXC). The key is a normal SSH private key, encrypted in a database. Strictly better than a naked file because there\u0026rsquo;s a master password gating access, but the key is still extractable any time the DB is open. Reasonable when you already use KeePassXC for password management.\nCloud-synced (1Password, LastPass). The key is stored in the provider\u0026rsquo;s vault. Whoever can read the vault can read the key. You\u0026rsquo;re trusting the provider\u0026rsquo;s infrastructure and operational security. 1Password\u0026rsquo;s design (Secret Key + master password) makes server-side decryption genuinely difficult; LastPass\u0026rsquo;s 2022 breach demonstrated that vault contents can leak in practice. The convenience is real; the trust assumption is non-trivial.\nPick the strongest option you can ship to your team, and back it with a multi-mode model along the same lines as the YubiKey one — different keys for different operation classes, with the most automated keys getting the strongest restrictions at the server side.\nSSH CAs — Teleport, step-ca, HashiCorp Boundary Everything above is about credential custody: where the private key lives and what\u0026rsquo;s required to use it.\nTeleport, step-ca (Smallstep\u0026rsquo;s open-source CA), and HashiCorp Boundary solve a related but distinct problem: credential lifecycle and access control. Instead of long-lived keys, they issue short-lived SSH certificates that expire automatically. 
They integrate with identity providers (Okta, Google Workspace, Entra ID), log session activity, and can grant just-in-time access that revokes itself.\nWhether you need this depends on scale.\nTeam size Typical reality Recommendation Solo or up to ~15 people You know who has access. authorized_keys is auditable by reading. Offboarding is manual but tractable. YubiKey + four-mode model is enough. A CA adds operational overhead without proportional security gain. 15–100 people, growing New hires need access; departures need offboarding; \u0026ldquo;who can SSH to production?\u0026rdquo; stops being answerable from authorized_keys alone. Onboarding takes a day per person. Adopt a CA system. Pain is real and pays back the investment. Hundreds of devs, regulated industry Manual key management is impossible. You can\u0026rsquo;t audit it, you can\u0026rsquo;t rotate it, you can\u0026rsquo;t prove who logged into what after the fact. CA system is mandatory. Plan around it from day one. The operational pain shows up in roughly this order as you grow:\nAdding a key to N hosts requires Ansible discipline. Doable. Removing a key from N hosts requires the same discipline. Often skipped on departures. Rotating keys regularly across the whole fleet is a project. Answering \u0026ldquo;is this person\u0026rsquo;s access still active?\u0026rdquo; requires querying every host. Expensive. Proving to an auditor what happened in a session three months ago requires session logging that authorized_keys doesn\u0026rsquo;t provide. Each of these gets harder in a known order, and each has a CA-shaped solution.\nThe common confusion: SSH CAs don\u0026rsquo;t replace hardware keys. They complement them.\nWhen you use a CA, the long-term identity authenticates to the CA\u0026rsquo;s enrollment endpoint and gets a short-lived SSH certificate in return. 
That long-term identity needs to be protected — if it\u0026rsquo;s a file-based key, an attacker who steals it can request fresh certificates indefinitely. The CA system has moved the problem rather than solved it.\nThe right shape:\nLong-term identity: YubiKey + the four-mode model (or just sudo/root keys, depending on what the CA expects). Short-term access: SSH certificates issued by the CA, valid for hours, scoped to specific hosts. Audit: CA logs the issuance; session recording captures what happened during use. The hardware-backed identity is the foundation. The CA is the access plane on top of it.\nTL;DR The four knobs:\nResident vs non-resident — where the credential lives. Resident is the default; the file is a label, no passphrase needed. Non-resident is for keys that must be in ssh-agent; the file holds encrypted material and must have a passphrase. Touch — physical proof of presence. Defends against silent signing on a forwarded or compromised host. Not a defense against device theft. PIN — defense against device theft. Also gates ssh-keygen -K extraction of resident credentials. ssh-agent — not needed for direct SSH or ProxyJump. Needed for agent forwarding (-A, including in-session multi-hop) and for non-resident keys with passphrases. With FIDO2 + touch-required keys, agent forwarding is safe again because every signing requires a touch on your laptop — silent signing isn\u0026rsquo;t possible. The four-mode model:\nroot — resident, PIN + touch. Direct root login, rare. sudo — resident, touch only. Daily admin. Pair with PAM TOTP at the host. wheel — non-resident, no touch, passphrase + ssh-agent. NOPASSWD mass automation. Non-resident specifically so device + PIN cannot extract it. robo — resident, no touch, no PIN. Convenience tier, restricted at the server with from= and command=. 
Other paths and where they fit:\nPIV is theoretically cleaner (slots, certificates, cached touch policy) but its caching depends on PKCS#11 session handling that drifts between SSH client versions. Avoid for heterogeneous fleets. Software alternatives sort by extractability. Secretive and Windows Hello are hardware-backed (non-extractable). KeePassXC, 1Password, and LastPass are extractable to varying degrees of \u0026ldquo;the provider can see your key.\u0026rdquo; SSH CAs (Teleport, step-ca, HashiCorp Boundary) solve access management at scale. They don\u0026rsquo;t replace hardware keys — they sit on top of them. Adopt when manual authorized_keys management starts hurting, typically around 15–100 engineers. The shortest possible version: hardware key first, multi-mode policy second, CA system if and when scale demands it.\n","permalink":"https://rivik.dev/hardware-backed-ssh-keys-end-to-end-yubikey-piv-software-alternatives-and-where-ssh-cas-fit-in/","summary":"A working guide to using a YubiKey for SSH on a real Linux fleet — the four knobs (resident, touch, PIN, agent), a four-mode policy for root and Ansible, software-only alternatives, and where SSH CAs fit in.","title":"Hardware-backed SSH keys end to end: YubiKey, PIV, software alternatives, and where SSH CAs fit in"},{"content":"Even in the GPT era, I regularly meet engineers who know ssh user@host and stop there. Yet hiding behind three flags — -D, -R, -L — is a full replacement for a VPN client, a mesh VPN, and a proxy stack. There\u0026rsquo;s also a story below about how engineers at one very big Korean corp used a single reverse tunnel to keep working past locked doors — for years, long before COVID.\n3 flags · 3 bonuses · 1 man page. I hope it\u0026rsquo;s intriguing enough to give it a try :)\n-D — Dynamic Forward: The VPN Hiding Inside Your SSH Not one forwarded port. Every port. Every host. Every DNS name. 
Everything that server can reach — your laptop can reach.\nssh -D 127.0.0.1:9876 user@corp-bastion.com\n-D opens a SOCKS5 proxy locally. Any app that speaks SOCKS5 (Firefox, curl, psql, ssh itself, basically everything) routes through the remote server. Tick Proxy DNS when using SOCKSv5 → even DNS resolves on the remote side.\nResult: your browser lives inside the remote network.\nMost people know -L forwards one port. -D forwards the whole internet that server can see. Very different tool.\nOne flag replaces your corporate VPN:\nInternal corp apps — Grafana, Kibana, Jira, wikis. Open in Firefox, no VPN client, no Tailscale, no admin tickets.\nIPMI / iDRAC / BMC networks — reach the management LAN from your laptop via the one jump host that sees it. No per-port -L gymnastics.\nFirewall / geo bypass — your browser profile exits through the remote country.\nDebug from the server\u0026rsquo;s POV — \u0026ldquo;why does my laptop see this and theirs doesn\u0026rsquo;t\u0026rdquo; becomes answerable.\nNote: the diagram runs -D on a router — turns your whole LAN into a shared SOCKS5 exit. Usually you just run it on your own laptop.\n-R — Reverse Tunnel: You Don\u0026rsquo;t Need Tailscale 10 NAT\u0026rsquo;d boxes — at home, at customers, in random clouds — and you want to reach all of them. You need one VPS (with GatewayPorts yes in its sshd_config; by default sshd binds remote forwards to loopback only) and this on every box:\n# on box1:\nssh -fNT -R 0.0.0.0:2201:127.0.0.1:22 tunnel@my-vps.com\n# on box2:\nssh -fNT -R 0.0.0.0:2202:127.0.0.1:22 tunnel@my-vps.com\n# on box3:\nssh -fNT -R 0.0.0.0:2203:127.0.0.1:22 tunnel@my-vps.com\nFrom anywhere:\nssh -p 2201 user@my-vps.com # → box1\nssh -p 2202 user@my-vps.com # → box2\nssh -p 2203 user@my-vps.com # → box3\nOne public VPS. N tunnels. N NAT\u0026rsquo;d boxes reachable. Wrap in a systemd unit or @reboot cron for persistence :))\nWant to expose a web server? ssh -R 443:localhost:443 vps — done (log in as root on the VPS for this one, since only root can forward a privileged port).\nUnder the hood: one outbound SSH session from a NAT\u0026rsquo;d box makes a public port on the VPS. 
Anyone hitting the public port (vps:2222 in the diagram below) lands on 127.0.0.1:22 of the original box. All through a single TCP connection the firewall already allows.\nDisclaimer: no, it\u0026rsquo;s not actually Tailscale — Tailscale solves different problems and does them far more conveniently. But for \u0026ldquo;I just need to reach my boxes,\u0026rdquo; SSH punches through the same holes :))\n-R — Cautionary Tale: How Engineers Escape Corp Same flag, other direction. Big Korean corp. Office-only desktops, badges, cameras, NAT\u0026rsquo;d grey IPs, firewall cutting everything inbound. And the best part — the office doors lock after 8 hours. Work-life balance, problem solved :))\nExcept not everyone in the world is Korean. On their work desktop, engineers just run:\nssh -fNT -R 0.0.0.0:2222:127.0.0.1:22 root@my-vps.com From anywhere: ssh -p 2222 corp-user@my-vps.com → lands on their locked-down corporate desktop. Through the firewall. Through the grey NAT. For years. Long before COVID.\nThe real lesson: outbound ≈ inbound. Any allowed outbound protocol does the same — HTTPS with a custom client, DNS, whatever. If you can reach out, something can reach in.\n-L — Local Forward: mysql on a Server, mysql on Your Laptop Local login only, no network exposure. But you want to connect from your code, on your laptop.\nssh -L 127.0.0.1:3306:127.0.0.1:3306 user@db-server Now localhost:3306 on your laptop is mysql on the server.\nBonus: -L forwards Unix sockets too.\n# socket → socket ssh -L /tmp/mysql.sock:/var/run/mysqld/mysqld.sock root@db mysql --socket /tmp/mysql.sock --user root # TCP → socket (when mysql only listens on a socket) ssh -L 127.0.0.1:5555:/var/run/mysqld/mysqld.sock root@db mysql --host 127.0.0.1 --port 5555 --user root -R does sockets too. Read the man page :)\n-L — Firewall Bypass: google.com Doesn\u0026rsquo;t Open? 
Your VPS Says Hi ssh -L 0.0.0.0:443:google.com:443 ubuntu@vps Add 127.0.0.1 google.com to your hosts file (or the tunnel box\u0026rsquo;s LAN IP — 192.168.1.1 in the diagram, where it runs on the router). Open Chrome → google.com → works. No VPN, no proxy config, no client software.\nBonus Flags Worth Knowing -J — jump host, no VPN:\nssh -J bastion.corp prod-db.internal -X — remote GUI on a headless server:\nssh -X user@server firefox # window opens locally -w — full L3 VPN via tun devices:\nssh -w 0:0 root@server # real VPN in one command (+ root) Three more flags most engineers have never typed. Read the man page.\nTakeaway: SSH Is Absurdly Powerful. Most Engineers Use 5% of It. Everyone knows -L. Few know it forwards Unix sockets too. The two flags most engineers have never typed — and the two most powerful: -D → full network access through one SSH session. Replaces corp VPN for most read-only needs. -R → one public VPS replaces a mesh VPN for reaching N NAT\u0026rsquo;d boxes. Outbound connections are never \u0026ldquo;safe\u0026rdquo;. Whatever you can reach, can reach you. ChatGPT is decent at \u0026ldquo;is this possible?\u0026rdquo; Bad at syntax. Verify. Read the man page. Seriously. ","permalink":"https://rivik.dev/ssh-tunnel-magic-your-ssh-already-is-tailscale/","summary":"SSH punching for everyone who only knows \u003ccode\u003essh user@host\u003c/code\u003e — how -D replaces a corporate VPN, -R replaces a mesh VPN for NAT\u0026rsquo;d boxes, and -L forwards Unix sockets. 3 flags, 3 bonuses, 1 man page.","title":"SSH Tunnel Magic: Your SSH Already Is Tailscale"},{"content":"Why the era of \u0026ldquo;all data is public\u0026rdquo; demands the same radical fix that HTTPS once brought to open networks\nIn July 2024, a single missing array bounds check in CrowdStrike\u0026rsquo;s Falcon Sensor crashed 8.5 million Windows machines worldwide. Airlines grounded flights. Hospitals cancelled surgeries. Emergency services went dark. 
The total economic damage: at least $10 billion. The root cause was not a sophisticated nation-state attack, nor a novel zero-day exploit. It was the failure to verify that a configuration file had the expected number of fields — a mistake that would earn a failing grade in an undergraduate computer science course.\nIn March 2026, Iran-linked hackers published over 300 personal emails and photographs belonging to FBI Director Kash Patel, stolen from his personal Gmail account. The method was not sophisticated either: the director\u0026rsquo;s credentials had been circulating on dark-web markets for years, harvested from earlier data breaches. He had not enabled phishing-resistant authentication. The head of America\u0026rsquo;s premier law enforcement agency was undone by password reuse.\nThese are not isolated failures. They are symptoms of a structural crisis in how the world builds, secures, and governs digital systems — a crisis with hard data behind it and, perhaps, a surprisingly elegant resolution ahead.\nI. The Numbers Behind the Collapse The intuition that \u0026ldquo;everything is breaking more often\u0026rdquo; is not a feeling. It is a measurable trend.\nIn 2024, the average minute of IT downtime cost organisations $14,056, and for large enterprises the figure exceeded $23,750. Across the Global 2000, outages drained approximately $400 billion annually — and the per-incident cost kept rising even as the number of major incidents saw a slight decline. The cost escalation reflects a deeper truth: modern businesses are so thoroughly fused with their technology stacks that even brief interruptions cascade through supply chains, customer relationships, and balance sheets. 
IT and networking-related outages increased in 2024, reaching 23% of all impactful outages, driven by growing complexity and the shift to cloud and colocation, according to the Uptime Institute\u0026rsquo;s 2025 Annual Outage Analysis.\nIn November 2025, Cloudflare suffered a major global outage that knocked thousands of websites offline — X, ChatGPT, Spotify, Uber, and even the tools that monitor outages, since they too depend on Cloudflare. The cause: a software bug triggered by a configuration change.\nOn the security side, the picture is worse. Compromised accounts surged from approximately 730 million in 2023 to over 5.5 billion in 2024 — roughly 180 accounts breached every second, according to Surfshark\u0026rsquo;s annual analysis.\nThe Identity Theft Resource Center recorded 3,332 data compromises in the United States alone in 2025, a new record and a 79% increase over five years. Breach notification letters reached 1.35 billion in 2024, a 211% increase year-over-year, driven by five \u0026ldquo;mega-breaches\u0026rdquo; each affecting more than 100 million individuals.\nAnd the cost of each breach is substantial: IBM\u0026rsquo;s 2025 Cost of a Data Breach report pegged the global average at $4.44 million per incident, with the United States topping the charts at $10.22 million.\nWhat makes these numbers particularly damning is the banality of the attack vectors. Four of the biggest breaches of 2024 — Ticketmaster, Advance Auto Parts, Change Healthcare, and AT\u0026amp;T — collectively exposed over 1.24 billion records. All four could have been prevented by enabling multi-factor authentication. The Verizon Data Breach Investigations Report 2025 found that only about 3% of compromised passwords met even baseline complexity requirements. Attackers do not need zero-day exploits. As IBM\u0026rsquo;s own X-Force team put it: \u0026ldquo;Attackers simply do not need zero-days — they just need valid credentials and a little bit of patience.\u0026rdquo;\nII. 
The Root Cause: Incentives, Not Ignorance The tempting explanation is that engineers have forgotten how to build reliable systems. The reality is more uncomfortable: the economic incentives have shifted decisively against quality.\nThe cost of poor software quality in the United States has grown to at least $2.41 trillion, according to the Consortium for Information \u0026amp; Software Quality. Technical debt reached $1.52 trillion. Seventy-five percent of business and IT executives now expect their software projects to fail. Sixty-nine percent of developers report losing eight or more hours per week to inefficiencies — a full day of every working week consumed by the consequences of prior shortcuts. Sixty-six percent of global organisations admit they are at risk of a software outage within the next year.\nMeanwhile, executive attention has drifted elsewhere. Beginning around 2023–2024, AI vaulted to the top of C-suite priorities, displacing quality and security concerns that had been mounting for years. In 2025, Google, Amazon, Microsoft, and Meta collectively spent $380 billion on building AI tools.\nBruce Schneier, the Harvard security researcher and one of the most respected voices in the field, has been characteristically blunt. \u0026ldquo;We\u0026rsquo;re moving into a world of untrusted systems,\u0026rdquo; he told IBM in a 2026 cybersecurity trends briefing, adding that even the commonly suggested solution of greater transparency \u0026ldquo;helps, assuming you have a customer base that is sophisticated enough to understand what they\u0026rsquo;re seeing.\u0026rdquo; In an interview reflecting on a decade of data privacy work, he was blunter still: \u0026ldquo;Nothing has changed since 2015. On the corporate side, companies are spying on us even more extensively. More of our data is in the cloud. 
And every one of us carries an incredibly sophisticated surveillance device wherever we go.\u0026rdquo;\nThe problem is compounded by a demographic time bomb.\nCompanies are eliminating junior developer positions in favour of AI coding tools, creating what some analysts call the \u0026ldquo;Junior Gap.\u0026rdquo; Entry-level developer postings dropped 60% between 2022 and 2024. Salesforce\u0026rsquo;s CEO announced the company would hire \u0026ldquo;no new engineers\u0026rdquo; in 2025. A 67% hiring cliff in 2024–2026 means 67% fewer potential engineering leaders in 2031–2036. The Stack Overflow 2025 Developer Survey found that 66% of developers\u0026rsquo; biggest frustration was AI solutions that are \u0026ldquo;almost right, but not quite,\u0026rdquo; and 45% found debugging AI-generated code more time-consuming than writing code from scratch.\nAnd the institutional response to breaches has been opacity, not accountability. In 2020, nearly 100% of breach notifications explained the root cause. By 2025, only 30% did. Companies are not just failing to prevent breaches — they are actively concealing how the breaches occurred.\nIII. A History of Paradigm Shifts To understand where digital security is heading, it helps to understand where it has been. The history of internet security can be read as a series of reluctant recognitions that something previously assumed to be safe was, in fact, fundamentally public.\nThe First Paradigm: \u0026ldquo;All networks are public\u0026rdquo; (circa 2010–2015)\nFor the first two decades of the consumer internet, the implicit assumption was that networks could be trusted — or at least that network security was someone else\u0026rsquo;s problem. WiFi was trivially interceptable. WEP encryption could be cracked from a laptop in a coffee shop. LTE networks had their own man-in-the-middle vulnerabilities. 
Anyone sitting on the same cafe WiFi could capture login credentials, session cookies, and email contents in plaintext.\nThe response, when it finally came, was HTTPS — the encryption of web traffic using TLS. Let\u0026rsquo;s Encrypt launched in 2015, making SSL certificates free and automated. Google began penalising unencrypted sites in search rankings. Within a few years, encrypted web traffic went from roughly 30% to over 95%.\nThe genius of HTTPS was that it solved the problem at the infrastructure level, not the human level. Users did not need to understand public-key cryptography. They did not need to check certificates manually. Browsers handled it automatically and started displaying warnings when encryption was absent. The fix was, in effect, invisible — and that is precisely why it worked.\nThe Second Paradigm: \u0026ldquo;All data is public\u0026rdquo; (circa 2020–present)\nThe current era is defined by a new, equally uncomfortable recognition: it is no longer just the network that is untrustworthy. The data itself — wherever it is stored, however it is protected — should be assumed to be accessible to adversaries.\nThis is not nihilism. It is the formal security doctrine known as \u0026ldquo;Assume Breach,\u0026rdquo; the foundational principle of Zero Trust architecture. As Microsoft\u0026rsquo;s own Zero Trust framework states: \u0026ldquo;Instead of assuming everything behind the corporate firewall is safe, the Zero Trust model assumes breach and verifies each request as though it originates from an open network.\u0026rdquo;\nBut here is where the analogy with HTTPS breaks down — and where the current crisis becomes acute. HTTPS was elegant: one protocol, deployed everywhere, transparent to users. Zero Trust is expensive, complex, and organisationally demanding. Even in organisations that follow Zero Trust principles, sensitive data often remains exposed once an attacker gains application-level access. 
The \u0026ldquo;HTTPS of data security\u0026rdquo; has not yet been found.\nOr has it?\nIV. Passkeys: The HTTPS of Authentication The most promising candidate for the first half of that fix is already being deployed at scale. Passkeys — based on the FIDO2/WebAuthn standard — replace passwords with asymmetric cryptography tied to a user\u0026rsquo;s device and biometrics. There is no shared secret to steal, no password to phish, no credential to stuff. The private key never leaves the device. Authentication happens through Face ID, Touch ID, or a device PIN.\nThe adoption numbers are accelerating rapidly. Google reports over 800 million accounts using passkeys as of early 2026. Amazon saw 175 million users create passkeys in its first year of support. Microsoft made passkeys the default for all new accounts in May 2025, driving a 120% increase in passwordless authentication. Nearly 70% of consumers now hold at least one passkey, up from 39% two years prior.\nThe parallels with the HTTPS rollout are almost exact. Like HTTPS, passkeys are more secure and more convenient — login times drop by up to 17x, and success rates reach 98% on platforms like TikTok. Like HTTPS, adoption is being driven by platform defaults: Apple, Google, and Microsoft now support passkeys at the operating-system level, and NIST\u0026rsquo;s updated Digital Identity Guidelines now cite synced passkeys as phishing-resistant authentication. Like HTTPS, regulatory pressure is accelerating the transition: the UAE mandated elimination of SMS OTPs by March 2026, India follows in April 2026, the Philippines by June 2026, and the EU Digital Identity Wallet rollout happens by end of 2026.\nAnd like HTTPS, passkeys work precisely because they remove the human from the equation.\nThe FBI director\u0026rsquo;s Gmail hack would have been impossible with passkeys enabled — there would have been no password to steal from old breaches, no credential to replay. 
The four mega-breaches of 2024 that lacked MFA would have been blocked entirely. Passkeys do not require users to choose strong passwords, remember them, or avoid reusing them. The device handles everything. Apple even solved the migration problem by allowing passkeys to be created automatically in the background when users sign in with their password — zero friction, no extra steps.\nThis is the HTTPS moment for authentication. Within three to five years, passwords will be a legacy fallback — still present, but increasingly irrelevant, like HTTP sites in 2026. The credential-stuffing attack vector, which has powered the majority of account compromises for a decade, will be effectively eliminated.\nV. The Missing Half: On-Device AI as Data Loss Prevention Passkeys solve the authentication problem. They do not solve the data leakage problem. The Pentagon\u0026rsquo;s \u0026ldquo;Signalgate\u0026rdquo; scandal — in which Defence Secretary Pete Hegseth shared information from a SECRET/NOFORN CENTCOM document into Signal group chats, including one containing his wife, brother, and personal attorney — was not a credential failure. It was a judgment failure. The Pentagon Inspector General concluded that Hegseth violated military regulations and that the information, had it been intercepted, could have endangered American troops. No passkey prevents a human from copying classified text into the wrong app.\nThis is where the next paradigm shift may be emerging, and it follows the same architectural logic as HTTPS and passkeys: solve the problem at the infrastructure level, invisibly, without depending on human discipline.\nApple has already deployed the template. Its Communication Safety feature uses on-device machine learning to detect sensitive visual content in Messages, Photos, AirDrop, and FaceTime. It runs entirely on the device\u0026rsquo;s neural engine. No data is sent to Apple. It is enabled by default for child accounts. The user receives a warning, not a block. 
It shipped with minimal controversy precisely because it was framed as protection, not surveillance.\nThe same architecture could be extended to text. Apple Intelligence already runs local large language models on every text field in iOS — for autocorrect, summarisation, and writing assistance. Adding a classification layer that recognises patterns associated with sensitive data (classification markings, coordinate formats, operational timing language, Social Security numbers, API keys, medical record identifiers) would be technically straightforward. The user experience would mirror Communication Safety: a gentle iOS-style sheet saying \u0026ldquo;This content may contain sensitive information\u0026rdquo; with options to proceed or review.\nThe behavioural economics are well-understood. Apple\u0026rsquo;s analytics opt-in — presented during device setup with a default of acceptance — achieves near-universal consent because nobody unchecks it. The same default bias would drive adoption of on-device data classification. It would be opt-out, buried in Settings, and virtually nobody would disable it.\nThe enterprise DLP market already recognises this direction. A 2025 Cyberhaven study tracking 1.1 million workers found that 11% of data pasted into ChatGPT was classified as sensitive corporate data. Gartner predicts that by 2027, 60% of enterprises will integrate DLP controls directly into their Zero Trust architecture. But traditional enterprise DLP requires Mobile Device Management — the cumbersome infrastructure that officials like Clinton, Patel, and Hegseth have consistently refused to use. On-device AI sidesteps this entirely: personal devices become secure enough without corporate management profiles.\nThe business incentive for Apple is substantial. If iPhones with Apple Intelligence can credibly satisfy government and enterprise security requirements without MDM, Apple eliminates the last argument for issuing locked-down government devices. 
Every Pentagon official, every banker, every healthcare worker becomes an iPhone customer by default.\nThe product would follow the HTTPS pattern: the platform provider (Apple, Google) supplies the infrastructure — the on-device AI classification engine. Organisations supply the policy — the specific patterns and classification rules. And regulators supply the forcing function — mandating phishing-resistant authentication and data-aware devices for handling sensitive information.\nVI. The Agent Paradox: When the User Is the Threat Model Passkeys neutralise stolen credentials. On-device DLP catches careless humans. But a third category of risk is now emerging that neither fix addresses — and it may be the hardest of all.\nAn AI agent needs broad access to be useful. If Microsoft Copilot cannot read your emails, calendar, Slack messages, and OneDrive files, it is useless. But the moment it can read all of that, it can also be tricked into sending all of that somewhere else. The agent is not a tool the user wields; it is the user — with the user\u0026rsquo;s identity, the user\u0026rsquo;s permissions, and none of the user\u0026rsquo;s judgment.\nThis is not a theoretical risk. It is already being exploited.\nIn 2025, researchers disclosed the EchoLeak attack (CVE-2025-32711) against Microsoft Copilot. An attacker sent an email containing hidden prompt-injection instructions to a victim\u0026rsquo;s mailbox. When Copilot ingested the email through retrieval-augmented generation, the hidden instructions caused it to embed sensitive data — chat logs, OneDrive files, SharePoint content, Teams messages — into outbound links. Zero clicks required. The victim\u0026rsquo;s own AI assistant became the exfiltration channel.\nA ServiceNow vulnerability demonstrated something worse: agents can be tricked into recruiting other agents. Low-privileged users embedded malicious instructions in data fields. 
When higher-privileged AI agents later processed those fields, they recruited even more powerful agents to perform unauthorised actions — including assigning administrator roles. The attack chain was entirely autonomous.\nIn a controlled exercise, McKinsey\u0026rsquo;s internal AI platform \u0026ldquo;Lilli\u0026rdquo; was compromised by an autonomous agent that gained broad system access in under two hours. And a Dark Reading poll found that 48% of cybersecurity professionals now identify agentic AI as the single most dangerous attack vector.\nThe readiness numbers are alarming. Over 80% of technical teams have moved past planning into active testing or production with AI agents — but only 14.4% report all agents going live with full security approval. Nearly half still rely on shared API keys for agent-to-agent authentication. And 82% of executives feel confident their existing policies provide adequate protection — which may be the most terrifying statistic of all, because it means the people who control budgets do not yet understand the problem.\nWhat is being tried — and why it is not enough\nThe industry is responding with five overlapping approaches, none yet mature.\nFirst, treating agents as identities, not tools. Microsoft launched Entra Agent ID, giving each agent its own identity with Conditional Access policies — so a risky agent can be blocked the same way a risky user is blocked. The idea is sound: an agent is not a feature of Slack; it is a user of Slack, with its own permissions, audit trail, and revocation path.\nSecond, network egress controls. NVIDIA\u0026rsquo;s AI Red Team published sandboxing guidelines with mandatory restrictions: agents should not have unrestricted outbound network access, file writes should be limited to a designated workspace, and configuration files must be protected. The agent can read your data, but it physically cannot phone home.\nThird, output filtering at the gateway. 
Agent-generated outputs are treated as potential exfiltration channels — data is classified by sensitivity, and agents cannot include more than a threshold amount of confidential content in any external-facing output without human approval.\nFourth, OWASP\u0026rsquo;s \u0026ldquo;autonomy is earned\u0026rdquo; principle. In 2025, OWASP released an Agentic AI Security framework with input from over 100 experts. Its core principle: \u0026ldquo;Autonomy is a feature that should be earned, not a default setting.\u0026rdquo; Agents start with zero permissions and are promoted — like a new employee.\nFifth, human-in-the-loop checkpoints. An agent should never be allowed to transfer funds, delete data, or change access-control policies without explicit human approval. The problem is obvious: this kills the entire point of autonomous agents.\nThe honest assessment is that none of these approaches fully works. The fundamental issue is categorical. When a human copies text from Signal to Telegram, you can intercept at the clipboard. When an agent reads an entire inbox, synthesises it, and sends a summary to an API endpoint, the exfiltration risk is compounded by context-window size. Modern LLMs process hundreds of thousands of tokens in a single pass. An agent can ingest an entire document repository, compress it, and include the synthesis in an outbound API call — all within one inference step.\nThe HTTPS analogy breaks down here because there is no single choke point. The agent is the user and the application and the network layer. It is like trying to apply data loss prevention to someone\u0026rsquo;s brain.\nThe filesystem evolution — and the missing final step\nThe deeper architectural pattern, however, suggests where a solution might eventually emerge. The history of computing can be read as a progressive narrowing of what any single compromised entity can access:\nShared filesystem — every process could read every file. 
Per-application sandbox — on iOS, each app receives a unique home directory randomly assigned at install, and is prevented from accessing data stored by other apps. On macOS, Apple uses \u0026ldquo;Data Vaults\u0026rdquo; to restrict access to an app\u0026rsquo;s data from all other requesting processes. Per-object encryption — on iOS, each file has its own per-file key, wrapped with a class key, with all key handling in the Secure Enclave, never exposed to the application processor. Files carry individual protection classes — some accessible only when the device is unlocked, some after first authentication. Identity-bound decryption — the logical extension being pushed by Microsoft with Purview and Entra ID: documents carry their encryption and access policy with them, so even if the file leaks, only authorised identities can decrypt it.\nEach step narrows the blast radius. iOS is at step three. The industry is struggling with step five: capability-scoped agent access.\nThe conceptual model has a name in computer science theory: capability-based security, dating to Dennis and Van Horn\u0026rsquo;s 1966 paper and implemented in CMU\u0026rsquo;s Hydra operating system. The idea: instead of asking \u0026ldquo;who are you?\u0026rdquo; and then granting access to everything that identity is authorised for, you grant specific capabilities — unforgeable tokens that give access to specific objects for specific durations. An agent would not receive \u0026ldquo;Ilya\u0026rsquo;s full access.\u0026rdquo; It would receive a capability token saying \u0026ldquo;read these three files, write to this one API, for the next thirty minutes.\u0026rdquo;\niOS already implements a version of this: apps must declare specific entitlements to access system resources, and the kernel enforces sandbox restrictions through mandatory access control policies loaded at launch that cannot be modified at runtime. 
The missing piece for the agent era is extending this model to AI: when you give an MCP-connected agent access to your system, it currently inherits your identity. What it should inherit is a scoped capability — with monitored outbound channels and enforced rate limits.\nThe next wave of Data Security Posture Management tools is beginning to close this loop, moving beyond discovery to automatically applying encryption, adjusting access controls, and triggering DLP workflows when data risk rises — compressing the response time from days to seconds.\nThe full trajectory, then: shared filesystem → per-app sandbox → per-object encryption → identity-bound decryption → capability-scoped agent access.\nThe trust trilemma — and why the OS must be God\nBut there is a deeper question that this trajectory leaves unanswered: who enforces the capabilities? And the answer exposes a fundamental architectural contradiction that nobody in the industry wants to say out loud.\nThe model requires the operating system to be God.\nThis is not a bug. It is the only design that works. The reason is a trilemma. You can have at most two of the following three properties, never all three:\nPer-object encryption — files protected from external attackers who gain physical access to storage.\nDLP and agent monitoring — the OS can inspect what flows between applications and agents.\nEnd-to-end encryption between agents — agents can communicate privately, and the OS cannot see the content.\nIf you have the first and third, DLP is blind — data flows between agents in ciphertext the OS cannot inspect. If you have the first and second, agents cannot hide from the OS — which is exactly the model that works. If you try to combine the second and third, you have a contradiction: the OS cannot simultaneously inspect content and be unable to see it.\nApple already made the choice. On iOS, all third-party applications are sandboxed, and they can only communicate through services explicitly provided by the operating system. 
iOS offers very few inter-process communication options compared to Android, deliberately minimising the attack surface. There is no mechanism for App A to talk to App B without passing through Apple\u0026rsquo;s mediated channels. The OS sees everything inside its walls.\nAll file key handling occurs in the Secure Enclave; the file key is never directly exposed to the application processor. Even the CPU does not see the raw keys — only Apple\u0026rsquo;s custom silicon does. The architecture trusts nothing except the hardware it controls.\nThis is precisely the traditional corporate security model, moved down one level of abstraction. In a corporate network, the security team terminates TLS at the proxy, reads everything in plaintext, re-encrypts, and forwards. Employees have no truly private channel that bypasses corporate visibility. The company is God inside the perimeter. In the OS-as-God model, the operating system mediates all inter-process communication, inspects all inter-agent data flow, enforces DLP rules, and agents have no private channel that bypasses the OS. The Secure Enclave is God inside the device.\nFor agents, this creates a strict hierarchy. The Secure Enclave sits at the root, holding all cryptographic keys. The OS kernel enforces sandboxing and mediates inter-process communication. The DLP layer — Apple Intelligence or its equivalent — inspects data flows. The agent runtime operates within its sandbox, with capability-scoped permissions. And agent-to-agent communication passes through OS-mediated channels, inspectable and filterable at every step. An agent cannot establish a channel that bypasses this stack.\nThe per-object encryption protects against external attackers — someone who steals the device or reads the flash storage directly. But it does not protect against the OS itself. That is by design.\nThe uncomfortable implication is that you are not trusting \u0026ldquo;encryption\u0026rdquo; in the abstract. You are trusting Apple. 
Specifically, you are trusting that Apple will not abuse its position as the root of trust; that it will not be compelled by governments to insert backdoors; that the Secure Enclave will not be compromised; and that the DLP rules enforced on your device will be your rules, not Apple\u0026rsquo;s.\nThis is exactly why Apple fights so hard on privacy branding. Their entire business model for this architecture depends on users trusting them as the benevolent God of the device. The moment that trust fractures — as it nearly did in 2021, when Apple proposed on-device scanning of photos for child sexual abuse material before iCloud upload, then reversed course after massive backlash — the whole architecture loses its legitimacy. Not its technical validity, but its social licence.\nIs there an alternative? In theory, yes. Homomorphic encryption and secure multi-party computation could allow agents to process data without the OS ever seeing the plaintext. But this remains largely academic. The performance overhead is enormous, and no production-grade system operates at OS scale. For the foreseeable future, the choice is between a centralised root of trust and no DLP at all.\nThe honest comparison, then, is not between \u0026ldquo;privacy\u0026rdquo; and \u0026ldquo;surveillance.\u0026rdquo; It is between two centralised trust models. Corporate IT departments are staffed by humans who get phished, reuse passwords, leave the company, and operate under variable institutional discipline. Apple\u0026rsquo;s Secure Enclave is hardware. It does not get phished. It does not reuse credentials. And it is identical across two billion devices. The attack surface is radically smaller, even if the trust model is structurally identical.\nIt is not zero trust all the way down. 
It is \u0026ldquo;trust Apple\u0026rsquo;s silicon, trust nothing else.\u0026rdquo; And for most threat models — including the Signalgate scenario, the Patel credential hack, and the agent exfiltration problem — that is actually good enough.\nThe philosophical question remains: should a single company be the root of trust for all human digital communication? That is a political question, not a technical one. And it is the one nobody wants to answer.\nVII. The Human Residual No technical solution eliminates human agency. HTTPS did not prevent users from entering credentials on phishing sites (until browsers added warnings for that too). Passkeys will not prevent users from sharing secrets verbally. On-device DLP will not prevent a determined insider from memorising classified information and writing it on a napkin.\nBut the history of security engineering suggests that the goal is not perfection — it is raising the cost of failure high enough that casual negligence is caught, while accepting that deliberate malice requires a different class of countermeasures. HTTPS eliminated passive network eavesdropping. Passkeys will eliminate credential reuse and phishing. On-device AI classification will eliminate casual data spillage into unclassified channels.\nWhat remains is the irreducible human element: judgment, accountability, and institutional culture. And here, paradoxically, the very visible failures of the past few years may be producing their own corrective. When soldiers see that their commanders leak classified strike plans into family group chats, when employees see that their CEO\u0026rsquo;s email gets hacked because of password reuse, the implicit trust in \u0026ldquo;the system works\u0026rdquo; evaporates. A distributed scepticism emerges — the human equivalent of \u0026ldquo;assume breach.\u0026rdquo;\nThe military has a doctrine for this. 
It is called Auftragstaktik — mission-type tactics, originally Prussian — the principle that subordinates should understand the intent behind an order, not just the order itself, so they can exercise independent judgment when communications fail or leadership is compromised. The American version is \u0026ldquo;Commander\u0026rsquo;s Intent\u0026rdquo;: every order includes the why, so that when the plan falls apart (or when the order seems insane), people down the chain can think for themselves.\nSystem fragility, it turns out, creates human vigilance. Not the ideal security posture — but perhaps a realistic one.\nVIII. What Comes Next The technical trajectory is clear. Within three years, passkeys will be the dominant authentication method for consumer services, and passwords will begin their long retreat to legacy-fallback status. Within five years, on-device AI classification will be a standard feature of mobile operating systems, catching casual data spillage the way spam filters catch phishing emails today — imperfectly, but well enough to make the most common failures rare.\nThe agent security problem will take longer. Unlike passkeys or on-device DLP, there is no single protocol that makes the problem disappear — the agent\u0026rsquo;s need for broad access is fundamental to its utility. The most likely path is a layered defence: capability-scoped identity, network egress controls, output classification, and behavioural monitoring — the spam-filter model rather than the HTTPS model. Within ten years, granting an AI agent your full identity will seem as reckless as browsing the web over HTTP seems today.\nThe institutional trajectory is less certain. 
The incentive structures that reward shipping speed over code quality, that replace junior engineers with AI tools while destroying the talent pipeline, that allow executives to claim \u0026ldquo;total exoneration\u0026rdquo; after inspectors general find they violated regulations — these are not technical problems, and they will not be solved by technical means.\nBut if the history of HTTPS is any guide, the technical fixes do not need institutional virtue to succeed. They need only to be cheaper, easier, and more convenient than the alternatives — and to be shipped as defaults by platform providers with the market power to make them ubiquitous. That is how HTTPS won. That is how passkeys are winning. And that, perhaps, is how on-device AI classification will win too.\nThe era of \u0026ldquo;all data is public\u0026rdquo; is not a prophecy of doom. It is a design constraint — the same way \u0026ldquo;all networks are public\u0026rdquo; was a design constraint in 2012. We did not fix networks by making WiFi secure. We fixed them by making encryption universal. We will not fix data security by making humans disciplined. We will fix it by making the infrastructure smart enough that human discipline becomes optional.\nNot great, not terrible. 
Just engineering.\nSources Outages \u0026amp; Software Quality\nThe Great Software Quality Collapse: How We Normalized Catastrophe — TechTrenches, Sep 2025 Biggest IT Outages of 2023–2025 — Zenduty Uptime Institute Annual Outage Analysis 2025 — Uptime Institute, May 2025 9 Biggest Software Bugs, Fails, Glitches and Outages of 2025 — TestDevLab, Dec 2025 The Hidden $2.4 Trillion Crisis: Why Software Quality Can\u0026rsquo;t Wait — DEV Community, Aug 2025 The Cost of Poor Software Quality Is Higher Than Ever — CDInsights, Jun 2025 The Software Quality and Productivity Crisis Executives Won\u0026rsquo;t Address — FlowChain Sensei, Feb 2026 Data Breaches \u0026amp; Security\nGlobal Data Breach Statistics: A 2024 Recap — Surfshark, Feb 2025 Data Breach Statistics 2025–2026: Global Trends \u0026amp; Costs — DeepStrike U.S. Data Compromises Hit Record in 2025 — HIPAA Journal, Feb 2026 ITRC 2024 Annual Data Breach Report — Identity Theft Resource Center, Jan 2025 More Than 1.7 Billion Individuals Had Data Compromised in 2024 — HIPAA Journal, Apr 2025 Cybersecurity Trends 2026 — IBM Think (includes Bruce Schneier quotes) Expert Opinions\nNearly 10 Years After Data and Goliath, Bruce Schneier Says: Privacy\u0026rsquo;s Still Screwed — ThreatsHub, Feb 2025 Bruce Schneier on Security, Society and Public AI Models — ThreatsHub, Oct 2024 Schneier on Security — AI Finding Zero-Days in OpenSSL — Schneier.com, Mar 2026 Government Breaches\nIranian Hackers Publish Emails Stolen from FBI Director Patel — NBC News, Mar 2026 FBI Director Patel\u0026rsquo;s Email Hacked — Technical Analysis — SafeState, Mar 2026 United States Government Group Chat Leaks (Signalgate) — Wikipedia Pentagon IG Finds Hegseth Violated Military Regulations — NBC News, Dec 2025 Pentagon Warned Against Signal Days After Leak — NPR, Mar 2025 Zero Trust \u0026amp; Assume Breach\nWhat Is Zero Trust Architecture? 
— Palo Alto Networks Zero Trust Security and Strategy — Microsoft Why \u0026ldquo;Assume Breach\u0026rdquo; Must Extend to the Data Layer — PriviCore, Mar 2026 Passkeys \u0026amp; Passwordless Authentication\nPasskeys Are Finally Taking Over in 2025 — Aujas / NUSummit, Dec 2025 Ditching the Password: Passkeys in 2026 — Techpression, Dec 2025 Passwordless Authentication in 2025: The Year Passkeys Went Mainstream — Authsignal, Dec 2025 Passkeys and Passwordless Authentication 2026: FIDO2 Across Apple, Google, and Microsoft — Programming Helper, Jan 2026 Passwordless Adoption Moves from Hype to Habit — Help Net Security, Oct 2025 Passkeys vs Passwords: Why 2026 Could Be the Tipping Point — Pepelac News, Jan 2026 Passkeys Are Taking Over Email Logins — Mailbird, Nov 2025 AI Agents \u0026amp; Capability Security\nEchoLeak Attack (CVE-2025-32711) — Microsoft Copilot prompt injection via email leading to data exfiltration ServiceNow Agent Vulnerability — low-privilege prompt injection causing agent-to-agent privilege escalation McKinsey \u0026ldquo;Lilli\u0026rdquo; AI Platform Compromise — autonomous agent gained broad system access in under two hours during controlled exercise Dark Reading Poll: 48% of Cybersecurity Professionals Identify Agentic AI as Top Attack Vector — Dark Reading, 2025 Agent Security Readiness Survey — 80.9% in production, 14.4% with full security approval, 45.6% on shared API keys Microsoft Entra Agent ID — agent identity with Conditional Access policies NVIDIA AI Red Team Sandboxing Guide — mandatory egress, file write, and configuration protections for AI agents OWASP Agentic AI Security Framework — \u0026ldquo;Autonomy is earned, not a default setting,\u0026rdquo; 100+ expert contributors Dennis, J.B. \u0026amp; Van Horn, E.C. 
(1966) \u0026ldquo;Programming Semantics for Multiprogrammed Computations\u0026rdquo; — capability-based security model Apple Platform Security: File Data Protection — per-file encryption, Secure Enclave key management, protection classes Apple Platform Security: App Sandboxing and IPC — all third-party apps sandboxed, IPC only through OS-mediated services Apple CSAM Scanning Controversy (2021) — proposed on-device photo scanning, reversed after backlash; demonstrates fragility of OS-as-root-of-trust social contract AI, DLP \u0026amp; the Junior Developer Crisis\nWhat Is AI DLP? AI-Powered Data Loss Prevention Explained — ArticlEdge, 2026 The Junior Developer Crisis of 2026: AI Is Creating Developers Who Can\u0026rsquo;t Debug — DEV Community, Mar 2026 The AI Mentorship Crisis: Hollowing Out the Engineering Pipeline — AlgeriaTech, Mar 2026 Junior Developers in the Age of AI — CodeConductor, Jan 2026 ","permalink":"https://rivik.dev/180-breaches-a-second/","summary":"180 accounts are breached every second — and most of it comes down to reused passwords and missing MFA. A look at the software quality collapse behind the headlines, and why the fix is the same infrastructure-level move HTTPS once made: passkeys, on-device DLP, and capability-scoped AI agents.","title":"180 Breaches a Second: How Software Broke Its Promise, and the Radical Fix Hiding in Plain Sight"},{"content":"We run iProxy.online, a mobile proxy infrastructure. Our Android app turns phones into proxy servers across 100+ countries. Last year we shipped an advanced network health checker that runs a lot of probes through these proxies to a controlled server. That’s when things got weird.\nA small but noticeable percentage of devices started failing HTTPS checks. HTTP worked fine. The failure was always at the TLS handshake stage. 
And the behavior was completely non-deterministic: broken for two hours, then fine, then broken for five minutes, then fine again.\nWhat we saw The correlations were weak and noisy. Android v8-v9 devices showed up more often. Cheaper, lower-spec phones were overrepresented. The strongest signal was memory pressure on the device. When we could catch the failure in real time, the device was almost always low on available RAM. But metrics from memory-starved phones are unreliable by definition, so we couldn’t be sure this wasn’t survivorship bias.\nTwo tracks of investigation: find correlations (inconclusive, as described above) and understand what actually breaks.\nThe second track was harder than it sounds. These are not our devices. They sit in remote locations. Physical access happens maybe once a month. We can’t deploy debug builds on demand. The bug is intermittent with no predictable trigger. Direct on-device debugging was effectively impossible.\nNarrowing it down Our metrics pointed at TLS handshake failure, so we tried to reproduce manually. curl through the same proxy, same server (Caddy, default Ubuntu 24.04 repos):\ncurl -x socks5://proxy:port https://our-server.example.com Works perfectly. TLS 1.3, clean handshake, 200 OK. Every time.\nOur network checker is written in Go (1.24 at the time). I built a minimal Go client to isolate the behavior. Here’s where it got interesting:\nClient TLS version Result curl 1.3 works Go 1.3 hangs Go 1.2 (forced) works Forcing TLS 1.2 in Go:\ntlsConfig := \u0026amp;tls.Config{ MaxVersion: tls.VersionTLS12, } This consistently fixed the issue on affected devices.\nA tcpdump on the client side showed the Go client sending ClientHello and then… nothing. No ServerHello coming back to client. The proxy app just sat there.\nIt gets weirder The problem was server-dependent. Go + TLS 1.3 failed against our Caddy server, against Cloudflare, against Google. But it worked against some other sites. 
So the failure depended on the specific TLS implementation on the remote end, not just the client.\nAnd this wasn’t limited to our checker. When the bug was active on a device, Chrome via the same mobile proxy couldn’t open Google over TLS 1.3 either. TLS 1.2 sites loaded fine. This was a device/app-level issue, not something specific to our Go code.\nWhy Go and curl behave differently This is the part that took the longest to reason about.\nGo’s crypto/tls and curl’s underlying OpenSSL/BoringSSL produce different ClientHello messages. Go’s ClientHello has always been somewhat larger due to different extension sets and key share choices. But there’s a much bigger factor here that we didn’t initially consider.\nStarting with Go 1.23, the post-quantum key exchange X25519Kyber768Draft00 is enabled by default when Config.CurvePreferences is nil (which is the standard case). In Go 1.24, this became X25519MLKEM768. The ML-KEM-768 public (encapsulation) key alone is 1184 bytes. This makes the ClientHello big enough to exceed a single TCP packet at a typical 1500-byte MTU.\ncurl (as of the versions we tested) does not send post-quantum key shares by default. Its ClientHello is much smaller and fits comfortably in a single packet.\nThis size difference matters because our Android app is acting as a TCP proxy. It reads data from one socket and writes it to another. The proxy doesn’t terminate TLS; it just forwards bytes. But it still needs to buffer them.\nThe (probable) mechanism Here’s our working theory. There are two chokepoints, not one.\nChokepoint 1: the ClientHello. Go 1.24 sends a large ClientHello due to the ML-KEM post-quantum key share (1184 bytes for the public key alone). curl’s OpenSSL sends ~500-700 bytes with a 32-byte X25519 key share. Chrome 124+ also sends post-quantum key shares by default, producing similarly large ClientHello messages. Our tcpdump on the client side showed the ClientHello leaving the client, then silence. 
This means the proxy either failed to forward the ClientHello to the server, or forwarded it and failed to relay the server’s response back. We can’t distinguish these two cases from a client-side capture alone, but both point to the same thing: the proxy is the bottleneck.\nChokepoint 2: the server flight. This would explain why TLS 1.3 worked with some servers but not others, even when the ClientHello is identical.\nIn TLS 1.3 the server responds with a single flight: ServerHello + EncryptedExtensions + Certificate + CertificateVerify + Finished, all at once. The size of this response depends on the server’s certificate chain. A small site with a single cert and short chain might send 2-3 KB. Google or Cloudflare with full certificate chains send 4-6+ KB. Unfortunately, I don\u0026rsquo;t remember which sites worked with TLS 1.3 or what kind of certificate chains they had, so this is just a guess.\nSo even when the ClientHello makes it through the proxy, the server response might not. A memory-starved proxy might relay a 2 KB response but choke on a 5 KB one. That would explain why some sites worked and others didn’t with the exact same client. It sounds dubious, since the numbers are so small, but I didn\u0026rsquo;t have any other hypotheses.\nTLS 1.2 avoids both problems. Its ClientHello is smaller (no PQ key shares), and the handshake is split across multiple round trips with smaller messages in each direction. No single message is large enough to stress the proxy’s buffers.\nWhy restart fixes it: killing the app clears leaked memory, resets all socket state, and gives the proxy fresh buffer capacity. The fact that this consistently works is the strongest evidence that the root cause is resource exhaustion, not a protocol bug.\nThe fix We needed a production fix, not a research paper. 
We did two things:\nImmediate mitigation: capped TLS to 1.2 for the checker and for proxy traffic where possible.\ntlsConfig := \u0026amp;tls.Config{ MaxVersion: tls.VersionTLS12, } Observability-driven restart: the checker now detects when TLS 1.3 fails but TLS 1.2 succeeds on the same device. When this pattern appears, we send a remote command to fully restart the app (kill the process, clear memory, relaunch). This consistently fixes the problem, which further supports the memory pressure hypothesis.\nWe also found and fixed memory leaks in our app that were contributing to the pressure. The correlation between leak fixes and reduced TLS 1.3 failures was visible in our dashboards.\nWhy we didn’t dig deeper We don’t have a verified root cause. We have a strong hypothesis, consistent correlations, and effective mitigations that support the hypothesis indirectly.\nThe devices are remote, not ours, and physically accessible about once a month. The bug is intermittent with no reliable trigger. Production users were affected. We needed a fast fix, not deep research.\nWe’re now investing in better telemetry and the ability to capture targeted diagnostics remotely. Next time something like this happens, we’ll have the data to pin it down.\nKey takeaways For proxy developers on constrained devices: TLS 1.3 handshake messages can be larger than their TLS 1.2 counterparts, and post-quantum key exchange makes them much larger. If your proxy buffers TCP data, make sure your buffers can handle multi-packet TLS records, especially under memory pressure.\nFor Go developers proxying TLS: be aware that Go 1.23+ sends post-quantum key shares by default. If you’re running through proxies or middleboxes, this can break things. Set CurvePreferences explicitly or use GODEBUG=tlskyber=0 (Go 1.23) / GODEBUG=tlsmlkem=0 (Go 1.24+) to disable it.\nFor anyone debugging intermittent TLS failures: if TLS 1.2 works and TLS 1.3 doesn’t, the problem is almost certainly not in TLS itself. 
It’s in something between client and server that can’t handle the larger messages. Check your middleboxes, proxies, and buffer sizes.\n","permalink":"https://rivik.dev/when-tls-1.3-silently-dies-inside-your-android-proxy/","summary":"A post-mortem of intermittent HTTPS failures across a mobile proxy fleet: TLS 1.3 handshakes silently dying on memory-starved Android devices — large multi-packet handshake messages, inflated by post-quantum key shares, stressing proxy buffers under memory pressure.","title":"When TLS 1.3 Silently Dies Inside Your Android Proxy"},{"content":"Back in 2022-2023, I encountered an interesting bug that caused unreclaimable kernel memory growth on Ubuntu servers running containerized workloads. The key word here is unreclaimable — this memory cannot be freed without rebooting the kernel. No echo 3 \u0026gt; /proc/sys/vm/drop_caches, no service restart, no container reboot. Only a full host node reboot releases it.\nThe Configuration A systemd unit that restarts on failure:\n[Unit] After=network-online.target Wants=network-online.target StartLimitIntervalSec=0 [Service] Type=exec DynamicUser=true ExecStart=bash -c \u0026#39;echo bug; exit 1\u0026#39; ProtectSystem=strict ProtectHome=yes ProtectKernelModules=yes ProtectKernelLogs=yes ProtectControlGroups=yes ProtectClock=yes PrivateDevices=yes PrivateTmp=yes RestartSec=1ms Restart=always [Install] WantedBy=multi-user.target The daemon would exit if a required network interface or config file wasn\u0026rsquo;t present, then restart. During network issues or misconfigurations, this created restart loops — accumulating over days and weeks.\nObserved Behavior SUnreclaim in /proc/meminfo growing slowly but steadily Eventually OOM kills across unrelated processes Container restarts did nothing — the memory is in host kernel space Only host node reboot reclaimed the memory This last point is critical for containerized environments (LXC/LXD/Proxmox). The kernel is shared between all containers. 
When unreclaimable slab grows, you cannot fix it by restarting the affected container. You must reboot the entire host — taking down all workloads on that node.\nSUnreclaim grew steadily at ~60 MB/day: that seems slow, but it multiplies fast in multi-container environments\nRoot Cause: Cgroup Memory Accounting Bug The Linux kernel has a bug in the cgroup memory controller. When systemd creates and destroys cgroups:\nmem_cgroup_css_alloc() allocates kernel structures for each new cgroup mem_cgroup_css_offline() is called when the cgroup is removed But mem_cgroup_css_free() is NOT always called The percpu_ref reference counting mechanism fails to fully release the structures. These allocations go into slab_unreclaimable — kernel memory that cannot be reclaimed by any means except reboot.\nByteDance engineers documented this with kprobe analysis: alloc/offline counts matched, but alloc/free counts diverged over time. Each restart cycle leaked a small amount. Over days of normal operation with occasional restart loops, this accumulated into gigabytes.\nI confirmed the bug by temporarily setting RestartSec=1ms — the rapid restarts made SUnreclaim grow visibly in real time, which confirmed the correlation.\nWhy Type=exec Amplifies the Problem Type=simple:\nDirect fork of the target process Single cgroup creation per service start Type=exec:\nSpawns systemd-executor first Executor configures namespaces, sandboxing, cgroup hierarchy Then exec\u0026rsquo;s the target binary Multiple cgroup operations per restart cycle Switching from Type=exec to Type=simple reduced cgroup churn significantly, which slowed the leak enough to be manageable.\nThe Fix Setting Type=simple is mostly enough. 
Also set RestartSec around ~10-15s.\nAnd, obviously, fix your service not to fail so often 😉\nAlternative Mitigations Upgrade to Ubuntu 24.04+ with latest kernel updates Boot parameter cgroup.memory=nokmem disables kernel memory accounting Disable memory accounting for specific units: MemoryAccounting=no Affected Versions I observed this bug on:\nUbuntu 18.04 (kernel 4.15, cgroups v1) Ubuntu 20.04 (kernel 5.4, cgroups v2) The bug affects both cgroups v1 and v2. Full fixes arrived with later kernel updates in 22.04 (5.15+) or through backported patches.\nMonitoring # Unreclaimable slab — this number should be stable over time watch -n 60 \u0026#39;grep SUnreclaim /proc/meminfo\u0026#39; # If SUnreclaim keeps growing, check which slab caches are responsible slabtop -o | head -20 # Memory cgroup count (growing = potential leak) cat /proc/cgroups | grep memory # Find services with frequent restarts journalctl --since \u0026#34;24 hours ago\u0026#34; | grep -E \u0026#34;Started.*\\.service\u0026#34; | sort | uniq -c | sort -rn | head -10 Related Bug Reports and Articles Kernel/Systemd Issues:\nhttps://bugs.debian.org/cgi-bin/bugreport.cgi?bug=912411 — ByteDance\u0026rsquo;s original report with kprobe analysis https://github.com/systemd/systemd/issues/8015 — systemd-logind memory leak on SSH connections https://github.com/systemd/systemd/issues/14639 — cred_jar slab growth with socket activation https://bugs.launchpad.net/bugs/1750013 — systemd-logind session leak on Ubuntu Kubernetes/Container Ecosystem:\nhttps://github.com/kubernetes/kubernetes/issues/61937 — cgroup leak with kernel memory accounting enabled https://github.com/kubernetes-sigs/kind/issues/421 — memory cgroup leaks in kind clusters https://github.com/canonical/lxd/issues/5405 — unreclaimable slab growth in LXD https://github.com/deckhouse/deckhouse/issues/2152 — kmalloc-4k leak with systemd cgroupDriver Technical Analysis:\nhttps://www.sobyte.net/post/2022-01/memory-cgroup/ — NetEase kernel 
team\u0026rsquo;s deep dive with kernel-level fix Summary The combination of:\nKernel bug in cgroup memory accounting Type=exec overhead (multiple cgroup operations per start) Services that restart frequently over time Containerized environment (shared kernel) Creates a scenario where unreclaimable kernel memory grows until OOM. The debugging path is non-obvious — the leak is in kernel space, and container restarts provide no relief. You only discover it when the host goes down.\nIf you\u0026rsquo;re running Ubuntu 18.04/20.04 LTS with services that restart frequently — monitor your SUnreclaim and consider switching to Type=simple.\n","permalink":"https://rivik.dev/systemd-unreclaimable-kernel-memory-leak/","summary":"How a frequently-restarting systemd unit with Type=exec leaked unreclaimable kernel slab memory (~60 MB/day) on Ubuntu hosts via a cgroup memory-accounting bug — and why only a full host reboot could free it.","title":"Systemd Unreclaimable Kernel Memory Leak"},{"content":"Another bug in Ubuntu with a stopgap fix. Not a single year without them.\nLatest: apt-get update randomly hangs for hours on Ubuntu 24.04 LTS. Known since 22.04, still not fixed.\nThe worst part? It silently blocks unattended-upgrades. Your servers stop receiving security updates and you don\u0026rsquo;t even know (if you have no alerts 🥲).\nThe \u0026ldquo;fix\u0026rdquo;? A cron job that kills stuck apt processes 🙄.\nThat\u0026rsquo;s why Google, Amazon, and others roll their own Linux distributions.\n","permalink":"https://rivik.dev/another-ubuntu-bug-with-a-stopgap-fix-apt-get-update-hangs-for-hours/","summary":"apt-get update randomly hangs for hours on Ubuntu 24.04 LTS — known since 22.04, still not fixed. Worst part: it silently blocks unattended-upgrades, so your servers stop receiving security updates. 
The \u0026lsquo;fix\u0026rsquo; is a cron job that kills stuck apt processes.","title":"Another Ubuntu Bug With a Stopgap Fix: apt-get update Hangs for Hours"},{"content":"Recently, my home internet provider started upselling a 10gbit (XGS-PON) connection. No more words, just take my money =)\nBut I have no suitable devices for this — how to test the difference? 10G USB-C for notebooks is too expensive, buying a new PC just for a test is madness (yeah, say XGS-PON is not =)\nLuckily, I have an old HP G2 mini as my home server for backups. Aaand, a $15 noname M.2 → PCIe expander + a 10G NIC did the stuff! Seriously — I didn\u0026rsquo;t expect this to work, wow =)\nP.S. Don\u0026rsquo;t forget the pcie_aspm=off kernel option — my extender cable is too long to work without it.\nVery tricky setup: M.2 expanders provide four PCIe lanes, but most cheap 10G NICs require eight. Also notice the PCIe version.\nThe exact equipment list:\nExpander: ADT R43UF (PCIe 3.0 x4) NIC: AQC107 (Aquantia/Marvell — happy on x4) Originally posted on LinkedIn.\n","permalink":"https://rivik.dev/10g-at-home-15-m.2-expander-turns-an-old-hp-mini-into-a-10gbe-test-rig/","summary":"My ISP started upselling 10gbit XGS-PON — but how to test it without buying a 10G-capable machine? A $15 noname M.2 → PCIe expander + AQC107 NIC on an old HP G2 mini did the job. Plus the pcie_aspm=off gotcha.","title":"10G at Home: $15 M.2 Expander Turns an Old HP Mini Into a 10GbE Test Rig"}]