Most of the decisions a small business owner makes about AI are presented as choices between models: which vendor, which tier, which monthly subscription. That is the visible decision. The quieter one, and usually the more consequential, is where the model runs: on a server in Ohio, in a data centre in Dublin, or on a small machine in the corner of your office.
Over the last eighteen months the gap between what you can run locally and what you can run on a public cloud has narrowed significantly. It has not closed (the frontier models from Anthropic, OpenAI, and Google still clearly lead), but it has narrowed to the point where, for a large fraction of business use cases, a model you own is now a viable option. This piece is for the owner who is trying to work out whether their own use cases are among them.
The landscape, briefly
There are three practical places to run a language model today, each with different commercial and operational shapes.
Public cloud APIs — OpenAI, Anthropic, Google, plus a handful of hosted-open-source providers (Together, Fireworks, Groq). You send a request; you pay by the token; you get a response. Set-up time is measured in hours. You never touch infrastructure. You also never see the hardware, the logs, or the training data, and your prompts and outputs transit their systems.
Managed-private cloud — the same API shape, but inside your cloud tenant with a contractual data residency and privacy posture: AWS Bedrock, Azure OpenAI, GCP Vertex. More plumbing, more paperwork, slightly more expensive, but with signed commitments about where your data goes and how it is retained.
Local inference — you run the model on your own hardware, or a machine you rent that nobody else uses. Nothing leaves your network unless you choose to send it. Historically the preserve of research labs; now increasingly plausible for production use, because the open-weight models (Llama 4, Qwen 3, Mistral’s medium-weight releases) have become genuinely capable, and because consumer-grade GPUs and Apple Silicon have become powerful enough to run them usefully.
The reasonable middle path most businesses end up on is a hybrid — local for sensitive or high-volume work, public cloud for bursts and frontier tasks. The interesting part of the decision is where the line between the two gets drawn.
When local inference is the right answer
There are four situations where running models on your own hardware is, for a UK SMB, not just acceptable but actively preferable.
You hold data that is unambiguously sensitive. If your business handles client medical records, legal correspondence, accountants’ books, confidential HR files, or anything that counts as special category data under the UK GDPR, then the question of where processing happens is not a performance question; it is a compliance question and a trust question. Sending that data to a third-party API, even one with excellent data-handling commitments, asks your clients to extend trust to a third party they did not vet. Running the model on your own machine does not.
Most UK solicitors’ firms, private medical clinics, and independent accountancy practices we meet are surprised by how little it costs to host a capable model themselves. A second-hand workstation with a modern consumer GPU — call it £2,500 to £4,000 — will comfortably serve a small firm’s document summarisation, correspondence drafting, and contract review needs.
Your volume is high and predictable. Per-token pricing looks cheap until you run a process that calls a model a hundred thousand times. We have seen small businesses quietly accrue five-figure annual bills with cloud providers by doing relatively mundane batch work (invoice reconciliation, email triage, enrichment of a CRM) that could have been done locally at essentially zero marginal cost.
The break-even arrives earlier than people expect. A mid-tier GPU server (£4,000 to £6,000 capital, perhaps £40 a month in electricity) typically beats per-token cloud costs within nine to fifteen months for any process that runs continuously. If your use case is high-volume and steady, local is a straightforward finance decision.
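The break-even arithmetic is simple enough to sketch. The Python below is a minimal illustration, not a costing tool: the function names are ours, and the call volume, tokens per call, and price per million tokens in the usage note are hypothetical figures chosen only to show the shape of the calculation. It ignores maintenance time and hardware resale value.

```python
from typing import Optional


def cloud_monthly_cost(calls_per_month: int,
                       tokens_per_call: int,
                       gbp_per_million_tokens: float) -> float:
    """Estimated monthly cloud bill in pounds for a steady workload."""
    total_tokens = calls_per_month * tokens_per_call
    return total_tokens * gbp_per_million_tokens / 1_000_000


def breakeven_months(capital_gbp: float,
                     local_monthly_gbp: float,
                     cloud_monthly_gbp: float) -> Optional[float]:
    """Months until the up-front hardware cost is recovered.

    Returns None when running locally costs as much as, or more than,
    the cloud each month, in which case the hardware never pays back.
    """
    monthly_saving = cloud_monthly_gbp - local_monthly_gbp
    if monthly_saving <= 0:
        return None
    return capital_gbp / monthly_saving
```

With the assumed figures, `cloud_monthly_cost(100_000, 2_000, 2.50)` comes to £500 a month, and `breakeven_months(5_000, 40, 500)` gives just under eleven months for a £5,000 server using £40 of electricity a month. Substitute your own invoices before drawing any conclusion.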
Latency matters in an unusual way. Cloud APIs generally feel fast, but they are round-trip-limited. When a human is waiting on the response (a live chat, a voice interface, a real-time summarisation), every hundred milliseconds of network transit is visible. Locally hosted models sitting on the same network as the application avoid most of that transit, and the difference in feel can be noticeable. Whether that matters depends on whether anything about your product or operations is real-time in a way that a human will notice.
You need a guarantee of availability. Public cloud APIs have outages. They are infrequent, mostly minor, and the providers handle them well — but if your business depends on a process that cannot fail when OpenAI has a bad afternoon, local inference removes an entire category of risk. Whether that matters, again, depends on the business. For a marketing agency drafting client deliverables, no. For a medical triage tool used in a walk-in clinic, yes.
When cloud is the right answer
The reverse cases are at least as common.
You need the frontier. The best models from Anthropic, OpenAI, and Google are genuinely ahead of anything you can run at home — not by a small margin, and not on every task, but clearly on tasks that require sophisticated reasoning, coding, or knowledge breadth. If your use case is “help me think through this difficult thing” rather than “process these ten thousand invoices”, you want the best model available, and the best model available is hosted in someone else’s cloud.
Your volume is low or unpredictable. At low volume, the maintenance cost of self-hosting — the hour-a-month of someone keeping the server patched, the occasional driver update, the moment the GPU fan dies — exceeds any plausible saving. Below about fifteen thousand model calls a month, cloud is cheaper even if you value your own time at zero.
You move fast and the model itself is changing. Cloud providers ship new models roughly every quarter. Running a new model locally means downloading weights, updating the serving stack, and re-benchmarking against your workload. That maintenance work is real, and for a small business that mostly cares about outputs rather than models, it is rarely a good use of engineering time.
Your team is distributed, or remote-first. A local server in an office is a single point of access. That is useful for sensitive work, but limiting for a team that works from kitchens in three time zones. Cloud endpoints meet people where they are.
The hybrid, which is what most businesses end up running
In practice, most of the businesses we help end up with a hybrid: a locally hosted small model for the high-volume, lower-judgement work (classify this email, enrich this lead, summarise this document) and a public-cloud frontier model for the work where reasoning quality matters more than unit cost (draft this proposal, answer this unusual client question, triage this edge case). The workflow code routes between the two based on simple rules that the business owner understands.
We tend to resist clients who want to be purist about it. The correct answer almost always involves using both.
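To give a sense of how plain those routing rules can be, here is a minimal Python sketch. The task labels, the `Task` class, and the `route` function are hypothetical names invented for illustration; the rules simply restate the split described above, with sensitive data always kept local.

```python
from dataclasses import dataclass

LOCAL = "local"
CLOUD = "cloud"

# Hypothetical labels for the high-volume, lower-judgement work.
HIGH_VOLUME_TASKS = {"classify_email", "enrich_lead", "summarise_document"}


@dataclass
class Task:
    kind: str
    contains_sensitive_data: bool = False


def route(task: Task) -> str:
    """Decide where a task runs, using rules an owner can read aloud."""
    # Rule 1: sensitive data stays on your own hardware.
    if task.contains_sensitive_data:
        return LOCAL
    # Rule 2: high-volume, lower-judgement work stays local.
    if task.kind in HIGH_VOLUME_TASKS:
        return LOCAL
    # Rule 3: everything else goes to the frontier model in the cloud.
    return CLOUD
```

For example, `route(Task("classify_email"))` returns `"local"`, `route(Task("draft_proposal"))` returns `"cloud"`, and the same proposal flagged as sensitive goes back to `"local"`. The order of the rules is the design choice: the sensitivity check comes first so that no later rule can override it.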
What we recommend, tentatively
If you are a UK SMB reading this and trying to decide, the honest answer is that most businesses should start with cloud, and migrate specific workloads to local inference once those workloads reveal themselves — either by becoming high-volume and predictable, or by touching data that shouldn’t leave your walls.
Starting with local inference for a business whose workloads are still being discovered is almost always the wrong choice. The setup cost is real, the model choice is hard to make before you know what you’re doing, and you end up paying in engineering time for a flexibility you may never use.
The migration, when it comes, is usually painless. Most production AI code we write uses a model-agnostic interface; swapping a cloud endpoint for a local one is a half-day of work.
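As a rough illustration of what a model-agnostic interface means, the Python sketch below keeps the business logic ignorant of where the model runs. Everything in it is an assumption made for the example: `TextModel`, `StubBackend`, and `summarise` are hypothetical names, the URLs are placeholders, and the stub returns a tagged echo where a real backend would make a network call.

```python
from typing import Protocol


class TextModel(Protocol):
    """Anything with a complete() method can act as a model backend."""

    def complete(self, prompt: str) -> str: ...


class StubBackend:
    """Stand-in backend; a real one would call a cloud or local server."""

    def __init__(self, name: str, base_url: str) -> None:
        self.name = name
        self.base_url = base_url  # placeholder, not contacted in this sketch

    def complete(self, prompt: str) -> str:
        # Echo the prompt with a tag so the example runs without a network.
        return f"[{self.name}] {prompt}"


def summarise(document: str, model: TextModel) -> str:
    """Business logic depends only on the TextModel interface."""
    return model.complete(f"Summarise in one sentence: {document}")


# Swapping deployment is a change of configuration, not of business logic.
cloud = StubBackend("cloud", "https://api.example.com/v1")
local = StubBackend("local", "http://localhost:8000/v1")
```

Calling `summarise("Invoice 42", cloud)` and `summarise("Invoice 42", local)` exercises the same code path with a different backend. The real migration effort lies in re-testing output quality on your workload, which this sketch does not attempt.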
A note on the regulatory picture
UK data protection law (the UK GDPR and the Data Protection Act 2018) treats AI processing of personal data like any other automated processing, with some specific implications, particularly around Article 22 (automated decisions with legal or similarly significant effects) and around data transfers to third countries. If your business processes personal data, you will need a lawful basis for any AI processing, a documented data protection impact assessment (DPIA) where the processing is likely to result in a high risk to individuals, and, if you use a cloud provider whose processing happens outside the UK, appropriate transfer safeguards.
These are not reasons to avoid AI; they are reasons to pick your deployment model with the regulatory posture in mind. The cloud providers have thought hard about this and have defensible answers. The local-inference posture is attractive here because it largely sidesteps the third-country transfer question.
We have written about the regulatory picture in more detail in “UK data protection for small businesses adopting AI”.
What this is not
This piece has deliberately avoided technical specifics: which model to pick, which framework to use, how to serve it. The choice of where to run the model is a commercial and operational decision. The choice of what to run is an engineering decision that follows from it, and one that we’d rather discuss with you in the context of a specific workload than in the abstract.
If you are thinking about this for your own business and would like a straight answer to which of the three postures fits, write to us. We will give you a view within the working day, free, without trying to sell you anything you don’t need.