How AI development has changed in the last two years — a software company's perspective

Two years ago, most of our engineering team treated AI the way people treat new power tools at a hardware store: interesting, promising, worth a weekend to try out. Clients asked about it occasionally. We ran a few internal experiments. Nobody scoped a production feature around it.

That picture looks nothing like today. AI is now part of how we estimate projects, review pull requests, onboard developers, and talk to clients about what their apps should do. The change happened in fits and starts, but looking back across 2024, 2025, and into 2026, a few shifts stand out as genuinely different.

Abstract digital background with AI icons representing recent advancements in software technology

From demos to production

The biggest change is boring to describe and enormous in practice: AI features actually work in production now. Two years ago, if a client asked for “AI-powered search” or a chatbot that could answer product questions, we built it with a mix of fragile prompt chains, custom retrieval code, and fallback logic for when the model hallucinated. Demos passed. Staging worked. Real users asked a question we hadn’t anticipated, and the whole thing fell over.

Today we ship features that stay up. Part of that is better models — longer context windows, lower error rates, better tool use. Part of it is that the surrounding infrastructure has matured. Vector databases, evaluation frameworks, and observability tools designed specifically for LLM calls existed in rough form two years ago. Now they’re as well-documented as any other part of the stack. We can trace why a model gave a particular answer, replay it, write a regression test, and fix it.

Daily engineering work looks different

If you visited our offices in 2024 and again in 2026, the desks would look the same but the rhythm of work would not. Engineers spend less time typing boilerplate and more time reviewing code they didn’t write. A feature that used to take a week of implementation plus a week of review now often takes two days of implementation and three days of review, testing, and cleanup. The total time didn’t drop as much as some predicted. What changed is where the effort goes.

This has consequences we’re still working through. Junior engineers who would have learned by writing code line by line are now learning by reading, editing, and redirecting code the AI produces. That’s a different skill, and the ones who do it well are the ones who still read documentation, still understand memory models, still argue with senior engineers about architecture. The ones who just accept suggestions end up shipping code they can’t debug.

Code review has become the bottleneck. Two years ago, the slow step was writing the code. Now the slow step is understanding what twenty minutes of agentic coding produced, checking whether it matches the intent, and deciding whether the tests it wrote are the right tests.

What clients ask for has shifted

In early 2024, client conversations about AI were mostly about possibility. “Could we add a chatbot?” “What would it take to summarize these documents?” The questions were hypothetical, and our answers were hedged.

Now the conversations are about trade-offs. Clients know, roughly, what AI can do. They’ve used ChatGPT and Claude themselves. They understand hallucinations the way they understand crashes — something to budget for, not a deal-breaker. The questions have become sharper: what’s the cost per user per month, what happens when the model is wrong, how do we stop it from saying something that gets us sued, who owns the data going through the API.

We spend more time on governance and evaluation than on the AI itself. Writing a prompt that produces good output on a good day is easy. Writing one that fails gracefully on a bad day, with a user who speaks a third language, on a mobile connection that dropped mid-request — that’s the work.

Mobile and AI finally met

Mobile is where the gap between 2024 and now shows up most clearly. Two years ago, AI in a mobile app meant calling a cloud API from the phone and displaying the result. The app was a thin client. The AI lived somewhere else.

That’s still common, but the picture is wider now. On-device models have become small enough and fast enough to run useful tasks — transcription, basic summarization, image classification — without a round trip. Apple Intelligence, Gemini Nano, and a growing set of open models that fit in a few hundred megabytes have pushed work back onto the device. For apps that handle sensitive data or operate offline, this matters more than any cloud benchmark. When we build custom mobile apps at Empat Tech, the first architectural question for an AI feature is no longer “which API do we call” — it’s “what runs locally, what runs in the cloud, and how do we fall back when the connection drops.”

The tooling has caught up too. Core ML, TensorFlow Lite, and MLC have become predictable targets. Two years ago, getting a quantized model to run on a mid-range Android phone felt like a research project. Today it’s a line item in the sprint plan.

Team composition has changed

Hiring looks different. We still need senior engineers, arguably more than before — someone has to review everything the AI produces and know when it’s wrong. Our demand for junior engineers who write CRUD endpoints has dropped. What’s grown is roles that didn’t exist two years ago in any serious way: people who write and maintain evaluation suites, people who own the prompt library the way someone else might own a CSS framework, people who handle data pipelines feeding retrieval systems.

We’ve also hired our first full-time AI product manager — someone whose job is to decide which features should use a model and which shouldn’t. That distinction mattered less when AI was a novelty. Now, with an AI option available for almost any feature, choosing when not to use one is a real skill.

New failure modes

Every new capability brings new ways to break. The failures we spent 2024 firefighting — runaway token costs, prompt injections in customer inputs, models returning confidently wrong answers — are now problems we design around from day one. We rate-limit aggressively. We treat user input to an LLM the same way we treat user input to a SQL query. We write evals before we write prompts.

Data leakage is the concern that keeps clients up at night. A model trained on customer data, or one that logs prompts to a third-party provider with unclear retention policies, can become a compliance problem years after the project ships. Contracts, data residency, and model selection are part of the initial scoping conversation now, not an afterthought.

What comes next

Predicting two years out is a losing game, but a few directions look likely. Agents that run longer — hours instead of seconds — are already reshaping what “build me a feature” means. Voice interfaces are starting to feel less novelty and more default. Models that handle images, audio, and video in one call are changing what a mobile app’s camera is for.

The shift from 2024 to 2026 wasn’t one sudden leap. It was a thousand small improvements and a few big ones. Working through it from inside a software company has felt less like riding a wave and more like learning a new language while still being expected to hold conversations. Two years from now, what feels awkward today will feel ordinary, and something else will feel awkward. That’s the part we’ve gotten used to.