GenAI MVP experiment
GovTechMidtjylland has conducted a GenAI pilot experiment to investigate:
- 1: What does it take for a municipality to develop, run, and implement a GenAI solution in a public context?
- 2: Which use cases can benefit from a GenAI solution?
The GenAI repo can be reviewed here: https://github.com/itk-dev/ai-launchpad
Performance testing
As part of the implementation, a performance test was conducted:
- Testing vLLM instead of Ollama.
- Performance optimization with quantization.
- Currently running on a 2-GPU server at Hetzner.
- Requires 20 GB of VRAM; it therefore cannot run on the available Azure GPUs, which only offer 16 GB.
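The setup below is a minimal sketch of how a model could be served with vLLM on a 2-GPU machine; the model name and memory settings are assumptions, not the pilot's actual configuration.

```python
# Minimal vLLM serving sketch (model name and settings are assumptions).
# tensor_parallel_size=2 splits the model across the server's 2 GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # hypothetical model choice
    tensor_parallel_size=2,                        # use both GPUs
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.2, max_tokens=300)
outputs = llm.generate(["Summarise the case file in 50-150 words."], params)
print(outputs[0].outputs[0].text)
```

In a deployment like this, the model would typically be exposed through vLLM's OpenAI-compatible HTTP server rather than the offline API, so clients can reach it over HTTP.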
1 server (system prompt constrains responses to 50-150 words):
- handles up to 250 simultaneous requests
- breaks at 180
- response time: 0-60 sec.
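The following is a hedged sketch of how such a load test could be reproduced: it fires a number of concurrent streaming requests at vLLM's OpenAI-compatible endpoint and records time-to-first-token. The endpoint URL, model name, prompt, and concurrency level are placeholders, not the actual test harness used in the pilot.

```python
# Hypothetical load-test sketch: N concurrent streaming requests, measuring
# time-to-first-token (TTFT) against an OpenAI-compatible vLLM endpoint.
import asyncio
import time
import httpx

URL = "http://localhost:8000/v1/chat/completions"   # placeholder endpoint
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"        # placeholder model name
CONCURRENCY = 180                                     # placeholder load level

async def one_request(client: httpx.AsyncClient) -> float:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Answer in 50-150 words: ..."}],
        "stream": True,
        "max_tokens": 300,
    }
    start = time.perf_counter()
    async with client.stream("POST", URL, json=payload, timeout=120) as resp:
        async for line in resp.aiter_lines():
            # First server-sent event chunk counts as the first token arriving.
            if line.startswith("data: ") and line != "data: [DONE]":
                return time.perf_counter() - start
    return float("inf")

async def main() -> None:
    limits = httpx.Limits(max_connections=CONCURRENCY)
    async with httpx.AsyncClient(limits=limits) as client:
        ttfts = await asyncio.gather(*(one_request(client) for _ in range(CONCURRENCY)))
    print(f"max TTFT: {max(ttfts):.2f}s, mean TTFT: {sum(ttfts)/len(ttfts):.2f}s")

asyncio.run(main())
```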
1 server: max response time vs. number of requests
If we can accept a first-token response time of up to 4 sec., 1 server can handle approximately 180 requests at a time. Multiply this by the number of servers: 2 servers = 360 simultaneous users.
With 2 servers: 380 simultaneous users at a max first-token response time of 4.5 sec.
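The scaling estimate above is a linear extrapolation from the single-server measurement; a tiny sketch of the arithmetic, with the measured 2-server figure noted for comparison:

```python
# Linear capacity extrapolation from the single-server measurement.
PER_SERVER_AT_4S_TTFT = 180   # requests one server handles within a 4 s first token

def estimated_capacity(servers: int) -> int:
    return servers * PER_SERVER_AT_4S_TTFT

print(estimated_capacity(2))  # 360 estimated; 380 measured at 4.5 s first token
```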
Comparison between AWQ and GPTQ
- Without quantization at 380 simultaneous users.
- At 500 simultaneous users (GPTQ): response time up to 7 sec., but we still respond!
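For context, a sketch of how the two quantization variants could be loaded for such a comparison in vLLM; the checkpoint names are placeholders, as the exact models used in the test are not listed here.

```python
# Hypothetical comparison setup: same base model, AWQ vs. GPTQ checkpoints,
# loaded one at a time and benchmarked with the same load test.
from vllm import LLM

def load_quantized(variant: str) -> LLM:
    """Load one quantization variant for benchmarking (checkpoint names are placeholders)."""
    checkpoints = {
        "awq": "org/model-awq",    # placeholder AWQ checkpoint
        "gptq": "org/model-gptq",  # placeholder GPTQ checkpoint
    }
    return LLM(
        model=checkpoints[variant],
        quantization=variant,      # vLLM can also detect this from the checkpoint
        tensor_parallel_size=2,
    )

llm = load_quantized("gptq")  # then run the same load test against each variant
```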