AI Agents

Ollama in production: how to deploy local AI in your SME

2026-06-30 7 min read
Ollama in production: how to deploy local AI in your SME

Many SMEs have already tried Ollama on a laptop: install it, download a model, run a few prompts, and the AI runs locally without sending anything to the cloud. The real challenge begins when you want to move that into production: making it respond to the whole team, keeping it up, making it fast, secure, and integrated. That leap is what we will break down.

The goal is for a real SME to have local AI running 24/7 at a reasonable cost and without relying on third parties.

What is Ollama and why does it matter to an SME?

Ollama is an open-source tool that simplifies running large language models (LLM) on your own hardware. You download compatible models — Llama, Mistral, Qwen, Gemma, and many more — and serve them through a local API. You do not need to know PyTorch, CUDA, or complex containers. With a couple of commands you already have a model running.

For an SME, this changes the rules:

  • Data sovereignty. Your conversations, documents, and queries do not leave your network.
  • Predictable cost. You pay for the hardware once; afterwards you do not pay per token or per user.
  • No vendor dependence. You do not depend on OpenAI, Anthropic, or Google changing prices, policies, or availability.
  • Easy integration. Ollama's REST API connects easily to n8n, your applications, or internal agents.

Ollama is not the model itself: it is the engine that makes it work in your environment.

From testing to production deployment

The difference between "I have Ollama on my laptop" and "Ollama in production" is the same as between a test car and a fleet vehicle. In production you need:

  • Availability. The service must start on its own, survive reboots, and recover from errors.
  • Concurrency. Several employees or processes may query the model at the same time.
  • Stable performance. Response time must be predictable, not sometimes two seconds and other times thirty.
  • Monitoring. Knowing whether the service is down, whether the GPU is at 100%, or whether there are errors.
  • Model management. Being able to update, switch, or add models without stopping everything.
  • Security. Making sure not just anyone can access your local API from outside.

All this can be built with modest hardware and open-source tools. The important thing is to design it before users start depending on it.

Realistic hardware requirements

You do not need a supercomputer. You need to know which model you will run and how many users will use it.

Profile Use Recommended hardware Viable models
Entry Testing, 1-2 users, light tasks Modern CPU, 16 GB RAM, SSD Gemma 2B, Qwen 2.5 3B, Llama 3.2 1B
Standard 3-10 person team, RAG, automations GPU RTX 3060/4060 (12 GB VRAM), 32 GB RAM Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B
Advanced Heavy use, multiple agents, large contexts RTX 4090 (24 GB VRAM) or A6000, 64 GB RAM Llama 3.1 70B (quantised), Mixtral 8x7B, 13B-32B models

VRAM is the bottleneck. A 7B model at 4 bits usually takes about 4-5 GB of VRAM, but long context can double that consumption. If you are going to process long documents or keep long conversations, aim for at least 12 GB of VRAM. Without a GPU, everything runs on CPU, but speed drops by 5 to 20 times: usable for tests, frustrating for production.

Storage also matters. Models weigh between 2 GB and 50 GB. A 500 GB SSD is enough to start; 1 TB gives you room for several models, vector databases, and logs.

Recommended models by use case

Ollama lets you download dozens of models, but not all are suitable for the same purpose. These are the ones we usually recommend at Neurosint:

  • Llama 3.1 / 3.2 (8B). Balance between quality and speed. Ideal for internal chat, email classification, and agents that follow structured instructions.
  • Mistral 7B / Mixtral 8x7B. Mistral 7B is fast and reasons well in Spanish. Mixtral offers quality close to much larger models, but needs more VRAM.
  • Qwen 2.5 (7B-14B). Excellent for multilingual tasks, including Spanish. Useful if your SME operates in several languages.
  • Gemma 2 (9B). Good option for summarising long texts and answering questions about documents if you have enough VRAM.
  • Specialised models. For code, Code Llama or DeepSeek Coder; for embeddings, nomic-embed-text or all-minilm; for vision, Llava if you need to analyse images.

The key is not to use the biggest model, but the right one. A well-configured 7B model solves most SME needs. If you need more precision, increase size; if you need speed, lower quantisation.

Typical deployment architecture

The most solid way to deploy Ollama in an SME is with Docker on a dedicated server or virtual machine. A basic architecture would be:

  1. Physical or virtual server with a GPU (if possible) inside your local network or DMZ.
  2. Docker and Docker Compose to run Ollama reproducibly.
  3. Persistent volume for models, so they are not downloaded every time.
  4. Reverse proxy (nginx, Traefik, or Caddy) that terminates HTTPS and routes to Ollama.
  5. Authentication on the proxy: API key, internal OAuth, or IP/VPN restriction.
  6. Basic monitoring with centralised logs and availability alerts.
  7. Backups of the model volume and configuration.

It is not necessary to expose Ollama to the internet. In most cases, it is enough for it to be accessible from your local network or VPN. If you want remote employees to use it, they connect to the VPN first; you do not open the port to the world.

For modest high availability, you can have a second server with a smaller model as backup, or configure a contingency plan that, in case of failure, routes critical requests to another model or to a deferred queue.

REST API and integration with n8n and agents

Ollama exposes a REST API that is the bridge to your workflows. The most useful endpoints are:

  • /api/generate: for a single response from a prompt.
  • /api/chat: for conversations with history.
  • /api/embeddings: for converting texts into vectors and feeding a vector database in a RAG system.

A basic curl call looks like this:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Summarise this text in three points: ...",
  "stream": false
}'

In n8n, you use the HTTP Request node pointing to your internal Ollama. From there you can build flows such as:

  • Classify customer service emails and draft replies.
  • Summarise attached documents and save the summary in your CRM.
  • Generate product descriptions from technical data sheets.
  • Feed an agent that queries your local knowledge base.

If you use agent frameworks such as LangChain, LlamaIndex, or CrewAI, most allow you to point to the Ollama endpoint as if it were OpenAI. You only change the base URL and the model name. That way your agents run entirely on your infrastructure.

Security: not just because it is local

Running local AI is a big step for privacy, but it is not enough. In production, apply these layers:

  • Isolate the network. Ollama does not need internet access to generate text. Cut its outgoing access except for downloading models, and do those downloads from a management machine.
  • Control access. Never expose port 11434 directly. Use a proxy with HTTPS and authentication. Limit by IP or VPN.
  • Principle of least privilege. Agents that only read should not write. Those that write should not delete. Each integration has its own key.
  • Sandbox. If an agent can execute actions (send emails, modify databases), run those actions inside a controlled environment with strict permissions.
  • Logs and auditing. Record who queries what, when, and with what result. It helps detect problems and comply with GDPR.
  • Updates. Both Ollama and the operating system and models must be updated. Models also receive security patches and improvements.

Remember: a local model with uncontrolled access to your email or ERP can be as dangerous as a badly configured external API.

Common mistakes that block deployment

At Neurosint we have seen the same pitfalls again and again. Avoid them:

  • Expecting cloud speed with CPU. Ollama on CPU works, but not for production with real users. If there is budget, invest in a GPU.
  • Choosing a model that is too big. A 70B model on a 12 GB GPU slows everything down or simply does not fit. Better a flowing 7B-8B quantised model.
  • Leaving the API public. Unprotected port 11434 is an unnecessary risk.
  • Ignoring context size. Huge prompts slow generation and consume VRAM. Trim or summarise context before sending it.
  • Not validating outputs. AI can invent data. If the result goes to an ERP, CRM, or database, add a validation layer.
  • Forgetting backup. One day the disk fails and you realise you had no copy of the models or configuration.
  • Prompt injection. A user or malicious email can redirect the model. Limit what it can do and validate instructions.

Most of these mistakes are solved with good practices, not with more hardware.

Conclusion: local AI is already within reach of your SME

Ollama is not just a tool for testing. With the right hardware and a secure architecture, it can become the AI engine of your SME: customer service, document automation, internal agents, knowledge assistants.

The advantage is also strategic: you control your data, reduce cloud dependence, and amortise an initial investment in months, instead of a subscription that grows with every user.

Start small: one use case, one 7B model on GPU, n8n, and measure the savings. Scaling afterwards is a matter of repeating the formula.


At Neurosint we help SMEs in Bilbao and the surrounding area design, deploy, and secure local AI infrastructures with Ollama, n8n, and open-source models. If you want to move from testing to production without relying on the cloud, let's talk.

Ready for the technology leap?

Don't let your SME fall behind. We implement the AI infrastructure that will give you the competitive edge.

Book Your Free Audit

Keep exploring