
Building a Cost-Efficient LLM Inference API with Ollama and Gemma 3n on Cloud Run

  • Writer: Meltem Seyhan
  • Jul 5
  • 6 min read

Updated: Jul 8

As developers, we often start building new features based on what’s technically possible — only to be forced to rethink based on what’s sustainable. This post is the story of one such pivot, where a promising LLM-powered feature idea led me to explore Gemma 3n models, Ollama, and ultimately deploy a cost-efficient, flexible setup using Google Cloud Run and Cloud Storage FUSE.


My Use Case


I’m developing a mobile quiz app, and I wanted to add a new feature: automatically generate quiz questions from user-uploaded content like PDFs, DOCX files, and image-based documents. My plan was to extract text and images and generate questions using an LLM. Since my app already uses OpenAI APIs for other features, I started testing this with OpenAI APIs.


But I quickly hit a wall — cost. During basic testing, I spent over 20 USD in a single day. It was clear that if I didn’t find a more cost-efficient solution, I wouldn’t be able to finish this feature at all.


Why Gemma 3n?


While searching for affordable LLM options, I came across Gemma 3n — announced by Google on May 20, 2025. It immediately caught my attention.


Gemma 3n is:

  • Lightweight enough to run on edge devices

  • Able to run without a GPU

  • Open-weight and free to use

  • Multimodal, accepting both image and text input (While the official documentation for Gemma 3n indicates support for image inputs, the version served through Ollama does not appear to process images. Although the API accepts image data in the request, the model does not seem to use it during inference.)


These qualities made it ideal for my use case, where I want to keep costs low while still hosting my own model. With a context size of 32K tokens, it’s more than sufficient for my needs, especially since I already have chunking logic in place to manage larger documents.
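

My chunking logic isn’t the focus of this post, but as a rough illustration, a character-based splitter along these lines is enough to keep each request within the context window (the sizes below are arbitrary placeholders, not the values I actually use):

# A minimal sketch of character-based chunking with overlap.
# The max_chars and overlap defaults are illustrative placeholders.
def chunk_text(text: str, max_chars: int = 8000, overlap: int = 500) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back a little so content isn't cut cleanly between chunks
        start = end - overlap
    return chunks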


Ollama offers two variants of the Gemma 3n model: gemma3n:e2b and gemma3n:e4b. In this post, I’m using the 4B version, but I may switch to the lighter 2B model if it proves sufficient for my needs after thorough testing.


Why Ollama?


The next challenge was figuring out how to serve Gemma 3n easily and portably. I discovered Ollama — a tool that makes LLMs available through a simple local API with just one command. Best of all, it provides official Docker images and supports many models.


Since my users don’t interact directly with the model and I already have an application server in place (running on Google Cloud Run), Ollama was a great fit. It gave me a clean, vendor-neutral solution I could host on GCP or move elsewhere if needed. And switching to another model later would be easy because Ollama supports many LLMs out of the box.


Running Ollama on Cloud Run


I decided to deploy the Ollama server on Google Cloud Run. Here’s why:

  • I already use Cloud Run for my app backend, so the setup was familiar

  • Cloud Run scales to zero when the service isn’t receiving traffic, so an unused service costs nothing

  • If I ever need more power or switch to a heavier model, I can attach GPUs

  • It’s serverless, so it scales automatically with demand


That said, I had to solve two main problems:

  1. Authentication – I didn’t want the Ollama server to be public. Only other services I control should be able to access it.

  2. Storage – Ollama downloads the model at runtime, but Cloud Run has no persistent disk. I didn’t want to redownload models every time the container starts.


The solution was using Cloud Storage FUSE. I placed the models into a GCS bucket and mounted that bucket inside the container as the models folder. This way, model files persist across cold starts, without the need for local disk.


By the way, if you’re concerned about data privacy and prefer not to send your content to hosted LLM APIs — or even avoid using the cloud altogether — you can simply deploy the same Docker image to your own Kubernetes cluster instead of Cloud Run.


Step-by-Step Setup


If you want to follow this approach and serve an LLM using Ollama and Cloud Run with persistent model storage, here’s a step-by-step guide. I assume you already have a Google Cloud project and basic familiarity with Docker and the terminal.


Step 1: Prerequisites


  • Install Docker and the gcloud CLI

  • Set up billing on your GCP project

  • Enable these services: Cloud Run, Artifact Registry, Cloud Storage

  • Download and install Ollama for your operating system.

  • Create a folder for your project and open a terminal in that directory


Step 2: GCloud Initial Setup

Authenticate with Google Cloud and set your active project:

gcloud auth login
gcloud config set project [PROJECT_NAME]

Step 3: Create a Cloud Storage Bucket

Create a bucket in the same region as your Cloud Run deployment to store the Ollama models:

gsutil mb -c standard -l [REGION] gs://[BUCKET_NAME]

Step 4: Download and Upload the Model

Pull the model locally using 'ollama pull'. Then upload the model files to the bucket that you have just created:

ollama pull gemma3n:e4b
gsutil -m cp -r ~/.ollama/models/* gs://[BUCKET_NAME]/

Step 5: Dockerfile Setup

Create your Dockerfile with the following content:

FROM ollama/ollama:0.9.5
# Listen on the port Cloud Run expects and load models from the mounted GCS volume
ENV OLLAMA_HOST=0.0.0.0:8080
ENV OLLAMA_MODELS=/models
ENV OLLAMA_DEBUG=true
# Keep the loaded model in memory for the lifetime of the instance
ENV OLLAMA_KEEP_ALIVE=-1
ENTRYPOINT ["ollama", "serve"]

Step 6: Create an Artifact Registry Repository

Create a repository in Google Artifact Registry (or skip this step and use an existing repository if you already have one):

gcloud artifacts repositories create [ARTIFACT_REPOSITORY] --repository-format=docker --location=[REGION] --project=[PROJECT_NAME]

Step 7: Build and Push the Docker Image

Build your Docker image and push it to Artifact Registry using Cloud Build, which automatically handles building for the correct platform architecture required by Cloud Run:

gcloud builds submit --tag [REGION]-docker.pkg.dev/[PROJECT_NAME]/[ARTIFACT_REPOSITORY]/ollama_server:latest .

You could build and push manually with Docker commands (e.g., docker build --platform linux/amd64, docker push …), but Cloud Build streamlines this process—you don’t need to manage architecture flags or worry about matching Cloud Run’s environment.


Step 8: Deploy to Cloud Run with GCS Fuse

Deploy the container image to Cloud Run as a service. The service is configured to be private (`--no-allow-unauthenticated`) and mounts the Cloud Storage bucket for model storage:

gcloud beta run deploy ollamaserver \
      --image [REGION]-docker.pkg.dev/[PROJECT_NAME]/[ARTIFACT_REPOSITORY]/ollama_server:latest \
      --platform managed \
      --region [REGION] \
      --no-allow-unauthenticated \
      --no-cpu-throttling \
      --timeout=3600 \
      --max-instances=1 \
      --min-instances=0 \
      --memory=16Gi \
      --cpu=4 \
      --cpu-boost \
      --execution-environment=gen2 \
      --add-volume=name=model-vol,type=cloud-storage,bucket=[BUCKET_NAME] \
      --add-volume-mount=volume=model-vol,mount-path=/models

Step 9: Configure Permissions

Create a service account:

gcloud iam service-accounts create [SERVICE_ACCOUNT] --display-name "Service Account to Invoke Ollama Server"

Add Cloud Run Invoker role to the service account that you have just created:

gcloud run services add-iam-policy-binding ollamaserver \
      --member="serviceAccount:[SERVICE_ACCOUNT]@[PROJECT_NAME].iam.gserviceaccount.com" \
      --role=roles/run.invoker \
      --platform=managed \
      --region=[REGION]

Step 10: Test the Deployed Service

To test the service from your local machine, you need to generate an identity token. This requires your GCP user account to have permission to act as the service account:

gcloud iam service-accounts add-iam-policy-binding [SERVICE_ACCOUNT]@[PROJECT_NAME].iam.gserviceaccount.com \
      --member="user:$(gcloud config get-value account)" \
      --role="roles/iam.serviceAccountTokenCreator"

Create a temporary identity token by impersonating the service account. This token proves the request is authorized. Replace CLOUD_RUN_SERVICE_URL with the URL of your Cloud Run service. It should look similar to "https://ollamaserver-594837594857.europe-west3.run.app":

export ID_TOKEN=$(gcloud auth print-identity-token --impersonate-service-account="[SERVICE_ACCOUNT]@[PROJECT_NAME].iam.gserviceaccount.com" --audiences="[CLOUD_RUN_SERVICE_URL]")

Call the Ollama server with the token you just generated, using a curl command:

curl -X POST "[CLOUD_RUN_SERVICE_URL]/api/generate" \
          -H "Authorization: Bearer $ID_TOKEN" \
          -H "Content-Type: application/json" \
          -d '{
            "model": "gemma3n:e4b",
            "prompt": "Why is the sky blue?",
            "stream": false
          }'

Step 11 (Optional): Test Locally via Cloud Run Proxy

The proxy command starts a server on localhost that forwards a local port to the Cloud Run service, handling authentication automatically. This is useful for quick local testing. In this scenario you don’t need steps 9 and 10, although you will still need step 9 for production use:

gcloud run services proxy ollamaserver --port=9090

Since the above command keeps the terminal occupied while the proxy is running, open another terminal and call the Ollama server through the proxy on localhost:

curl http://localhost:9090/api/generate -d '{
  "model": "gemma3n:e4b",
  "prompt": "Why is the sky blue?"
}'

How to Call the Ollama Server in Python

First, install the required packages:

pip install ollama
pip install google-auth

Download a service account key JSON file:

gcloud iam service-accounts keys create [KEY_FILE_NAME] --iam-account [SERVICE_ACCOUNT]@[PROJECT_NAME].iam.gserviceaccount.com

Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to your key file (make sure to use the full absolute path). I normally use .env files to manage environment variables, but you can do it your own way:

GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
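
If you go the .env route as well, a small snippet like the following (assuming the python-dotenv package is installed) loads the variable before the Google auth library needs it:

# Assumes python-dotenv is installed: pip install python-dotenv
from dotenv import load_dotenv

load_dotenv()  # reads .env and puts GOOGLE_APPLICATION_CREDENTIALS into the process environment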

Create a custom Ollama client with a valid identity token, and send your requests:

from ollama import Client
from google.auth.transport.requests import Request
from google.oauth2 import id_token

# Fetch an identity token for the Cloud Run service, using the service account
# key referenced by GOOGLE_APPLICATION_CREDENTIALS. The audience must be the
# service URL itself.
ollama_base_url = "[CLOUD_RUN_SERVICE_URL]"
token = id_token.fetch_id_token(Request(), ollama_base_url)

# Point the Ollama client at the Cloud Run service and attach the token
ollama_client = Client(
  host=ollama_base_url,
  headers={
    "Authorization": f"Bearer {token}",
    "Content-Type": "application/json",
  },
)

response = ollama_client.chat(
  model='gemma3n:e4b',
  messages=[
    {
      'role': 'user',
      'content': 'Why is the sky blue?',
    },
  ],
)
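
The generated text can then be read from the response object:

print(response['message']['content'])

Keep in mind that the identity token expires after about an hour, so a long-running backend should fetch a fresh token rather than reusing a client created once at startup.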

Next Step


With this setup in place, I now need to test it across different content types. I may need to increase the CPU and memory of my ollamaserver service, or even attach a GPU.


My first impressions:

  • Gemma 3n works well for content that’s mostly text

  • It struggles with image-heavy content where OCR is required


But this is fine for now. If the feature proves valuable, I can always:

  • Switch to a more powerful model that handles images better

  • Introduce GPU-backed inference for premium users

  • Reintroduce OpenAI for specific cases where budget allows


Once I collect more test results, I might share them in a follow-up post.


If you’ve explored cost-efficient LLM usage and have ideas or suggestions — especially for multi-modal inputs or hybrid pipelines — I’d love to hear from you. Thanks for reading!
