[AI] (Part 3) Tutorial: Deploying DeepSeek-R1 on Tencent Cloud with HAI

小渣渣 · Posted on 2025-2-5 21:14:04

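Hyper Application Inventor (HAI) is a GPU application service for AI and scientific computing. It provides plug-and-play GPU compute with common environments preinstalled, helping small and medium-sized businesses and individual developers deploy LLMs quickly.

URL: [hyperlink visible after login]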

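HAI vs. GPU servers

HAI significantly lowers the barrier to using GPU cloud servers and improves the product experience in several respects. It works out of the box, as shown below:

[Screenshot: QQ截图20250205200945.jpg]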

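Purchasing HAI compute

Open the purchase page and select the base-environment image "Ubuntu 20.04". Its configuration is Ubuntu 20.04, Driver 525.105.17, Python 3.8, CUDA 12.0, and cuDNN 8, so the GPU driver is already installed. Choose pay-as-you-go billing, as shown below:

[Screenshot: QQ截图20250205191638.jpg]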

GPU memory: 32 GB+
Compute: 15+ TFLOPS (single precision)
CPU: 8 to 10 cores
Memory: 40 GB

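After a few minutes the instance is created. Turn on academic acceleration (学术加速) at the same time, as shown below:

[Screenshot: QQ截图20250205191816.jpg]

On first use you must reset the password; the login username is ubuntu (steps omitted). Then log in to the server and check the NVIDIA GPU driver information with the following command:

[Visible after login.]

[Screenshot]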

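Installing Ollama

Ollama website: [hyperlink visible after login]

Log in to the server with PuTTY and install Ollama with the following command:

[Visible after login.]

When the installation completes, the output looks like this: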

>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.

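Check the version: ollama -v
List the models currently loaded in memory: ollama ps

Create a custom directory for model storage with the following command:

[Visible after login.]

Change the default listen address and the model storage path with the following command. Keep the default port (11434), otherwise ollama commands will fail:

[Visible after login.]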

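Deploying the deepseek-r1 model

Run the deepseek-r1:8b model with the following command:

[Visible after login.]

[Screenshot]

Test a conversation:

[Screenshot]

Allow TCP port 11434 through the firewall, then call the HTTP API, as shown below:

[Screenshot: QQ截图20250205211234.jpg]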

{
  "models": [
    {
      "name": "deepseek-r1:8b",
      "model": "deepseek-r1:8b",
      "size": 6930032640,
      "digest": "28f8fd6cdc677661426adab9338ce3c013d7e69a5bea9e704b364171a5d61a10",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "llama",
        "families": [
          "llama"
        ],
        "parameter_size": "8.0B",
        "quantization_level": "Q4_K_M"
      },
      "expires_at": "2025-02-05T21:14:50.715753614+08:00",
      "size_vram": 6930032640
    }
  ]
}
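
The JSON above has the shape of the response from Ollama's running-models endpoint (GET /api/ps). As a minimal sketch, assuming the server listens on the default port 11434, the following Python fetches that list and summarizes each model's memory footprint. The function names and the base URL are illustrative and not taken from the original post; replace the address with that of your own instance.

```python
import json
import urllib.request
from typing import List


def summarize_running_models(payload: dict) -> List[str]:
    """Turn an /api/ps-style payload into one readable line per model."""
    lines = []
    for m in payload.get("models", []):
        size_gib = m["size"] / 2**30              # total model size in GiB
        vram_gib = m.get("size_vram", 0) / 2**30  # portion resident in GPU memory
        quant = m.get("details", {}).get("quantization_level", "unknown")
        lines.append(
            f"{m['name']}: {size_gib:.2f} GiB total, "
            f"{vram_gib:.2f} GiB in VRAM, {quant}"
        )
    return lines


def fetch_running_models(base_url: str = "http://127.0.0.1:11434") -> dict:
    """Query GET /api/ps; base_url is an assumed placeholder address."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=10) as resp:
        return json.load(resp)


# Abridged copy of the response shown above, so the sketch runs offline.
SAMPLE = {
    "models": [
        {
            "name": "deepseek-r1:8b",
            "size": 6930032640,
            "details": {"quantization_level": "Q4_K_M"},
            "size_vram": 6930032640,
        }
    ]
}

if __name__ == "__main__":
    for line in summarize_running_models(SAMPLE):
        print(line)
    # Against a live server: summarize_running_models(fetch_running_models())
```

When size and size_vram are equal, as here, the whole model sits in GPU memory.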


小渣渣 · Posted on 2025-2-5 21:22:49

If a model receives no requests or input for a period of time, Ollama automatically stops the running model to save resources.
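
How long a model stays loaded can be set globally with OLLAMA_KEEP_ALIVE (see the environment variables later in this thread) or per request with the keep_alive field of Ollama's HTTP API. The sketch below only builds a request for POST /api/generate. The helper name, the prompt, and the base URL are illustrative assumptions. A keep_alive of -1 asks Ollama to keep the model loaded indefinitely, and 0 unloads it right after the response.

```python
import json
import urllib.request


def build_generate_request(model, prompt, keep_alive="5m",
                           base_url="http://127.0.0.1:11434"):
    """Build (but do not send) a non-streaming /api/generate request.

    keep_alive accepts a duration string such as "5m" or "24h",
    -1 to keep the model loaded indefinitely, or 0 to unload it at once.
    base_url is an assumed placeholder address.
    """
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_generate_request("deepseek-r1:8b", "Hello", keep_alive=-1)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To send it (requires a running Ollama server):
    # with urllib.request.urlopen(req, timeout=120) as resp:
    #     print(json.load(resp)["response"])
```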

小渣渣 · Posted on 2025-2-6 09:03:57

Ollama environment variables

Variable	Default Value	Description + Effect + Scenario
OLLAMA_HOST	"127.0.0.1:11434"	Configures the host and scheme for the Ollama server. Effect: Determines the URL used for connecting to the Ollama server. Scenario: Useful when deploying Ollama in a distributed environment or when you need to expose the service on a specific network interface.
OLLAMA_ORIGINS	[localhost, 127.0.0.1, 0.0.0.0] + app://, file://, tauri://	Configures allowed origins for CORS. Effect: Controls which origins are allowed to make requests to the Ollama server. Scenario: Critical when integrating Ollama with web applications to prevent unauthorized access from different domains.
OLLAMA_MODELS	$HOME/.ollama/models	Sets the path to the models directory. Effect: Determines where model files are stored and loaded from. Scenario: Useful for managing disk space on different drives or setting up shared model repositories in multi-user environments.
OLLAMA_KEEP_ALIVE	5 minutes	Sets how long models stay loaded in memory. Effect: Controls the duration models remain in memory after use. Scenario: Longer durations improve response times for frequent queries but increase memory usage. Shorter durations free up resources but may increase initial response times.
OLLAMA_DEBUG	false	Enables additional debug information. Effect: Increases verbosity of logging and debugging output. Scenario: Invaluable for troubleshooting issues or understanding the system's behavior during development or deployment.
OLLAMA_FLASH_ATTENTION	false	Enables experimental flash attention feature. Effect: Activates an experimental optimization for attention mechanisms. Scenario: Can potentially improve performance on compatible hardware but may introduce instability.
OLLAMA_NOHISTORY	false	Disables readline history. Effect: Prevents command history from being saved. Scenario: Useful in security-sensitive environments where command history should not be persisted.
OLLAMA_NOPRUNE	false	Disables pruning of model blobs on startup. Effect: Keeps all model blobs, potentially increasing disk usage. Scenario: Helpful when you need to maintain all model versions for compatibility or rollback purposes.
OLLAMA_SCHED_SPREAD	false	Allows scheduling models across all GPUs. Effect: Enables multi-GPU usage for model inference. Scenario: Beneficial in high-performance computing environments with multiple GPUs to maximize hardware utilization.
OLLAMA_INTEL_GPU	false	Enables experimental Intel GPU detection. Effect: Allows usage of Intel GPUs for model inference. Scenario: Useful for organizations leveraging Intel GPU hardware for AI workloads.
OLLAMA_LLM_LIBRARY	"" (auto-detect)	Sets the LLM library to use. Effect: Overrides automatic detection of LLM library. Scenario: Useful when you need to force a specific library version or implementation for compatibility or performance reasons.
OLLAMA_TMPDIR	System default temp directory	Sets the location for temporary files. Effect: Determines where temporary files are stored. Scenario: Important for managing I/O performance or when system temp directory has limited space.
CUDA_VISIBLE_DEVICES	All available	Sets which NVIDIA devices are visible. Effect: Controls which NVIDIA GPUs can be used. Scenario: Critical for managing GPU allocation in multi-user or multi-process environments.
HIP_VISIBLE_DEVICES	All available	Sets which AMD devices are visible. Effect: Controls which AMD GPUs can be used. Scenario: Similar to CUDA_VISIBLE_DEVICES but for AMD hardware.
OLLAMA_RUNNERS_DIR	System-dependent	Sets the location for runners. Effect: Determines where runner executables are located. Scenario: Important for custom deployments or when runners need to be isolated from the main application.
OLLAMA_NUM_PARALLEL	0 (selected automatically)	Sets the number of parallel model requests. Effect: Controls concurrency of model inference. Scenario: Critical for managing system load and ensuring responsiveness in high-traffic environments.
OLLAMA_MAX_LOADED_MODELS	0 (selected automatically)	Sets the maximum number of loaded models. Effect: Limits the number of models that can be simultaneously loaded. Scenario: Helps manage memory usage in environments with limited resources or many different models.
OLLAMA_MAX_QUEUE	512	Sets the maximum number of queued requests. Effect: Limits the size of the request queue. Scenario: Prevents system overload during traffic spikes and ensures timely processing of requests.
OLLAMA_MAX_VRAM	0 (unlimited)	Sets a maximum VRAM override in bytes. Effect: Limits the amount of VRAM that can be used. Scenario: Useful in shared GPU environments to prevent a single process from monopolizing GPU memory.

Source code: [hyperlink visible after login]

$ ollama help serve
Start ollama

Usage:
  ollama serve [flags]

Aliases:
  serve, start

Flags:
  -h, --help help for serve

Environment Variables:
   OLLAMA_DEBUG             Show additional debug information (e.g. OLLAMA_DEBUG=1)
   OLLAMA_HOST             IP Address for the ollama server (default 127.0.0.1:11434)
   OLLAMA_KEEP_ALIVE       The duration that models stay loaded in memory (default "5m")
   OLLAMA_MAX_LOADED_MODELS Maximum number of loaded models per GPU
   OLLAMA_MAX_QUEUE          Maximum number of queued requests
   OLLAMA_MODELS             The path to the models directory
   OLLAMA_NUM_PARALLEL       Maximum number of parallel requests
   OLLAMA_NOPRUNE          Do not prune model blobs on startup
   OLLAMA_ORIGINS          A comma separated list of allowed origins
   OLLAMA_SCHED_SPREAD       Always schedule model across all GPUs
   OLLAMA_TMPDIR             Location for temporary files
   OLLAMA_FLASH_ATTENTION    Enabled flash attention
   OLLAMA_LLM_LIBRARY       Set LLM library to bypass autodetection
   OLLAMA_GPU_OVERHEAD       Reserve a portion of VRAM per GPU (bytes)
   OLLAMA_LOAD_TIMEOUT       How long to allow model loads to stall before giving up (default "5m")


小渣渣 · Posted on 2025-2-6 09:19:49

Ollama commands

ollama list: list models
ollama show: show information about a model
ollama pull: pull a model
ollama push: push a model
ollama cp: copy a model
ollama rm: remove a model
ollama run: run a model

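小渣渣 · Posted on 2025-2-6 09:33:17

To tune the model service, edit /etc/systemd/system/ollama.service.d/override.conf and add the following:

[Visible after login.]

The complete configuration is as follows:

[Visible after login.]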

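小渣渣 · Posted on 2025-2-6 09:35:26

The --verbose flag displays inference speed (tokens per second). For example:

[Visible after login.]

>>> Hello, who are you?
<think>

</think>

Hello! I am an AI assistant independently developed by the Chinese company DeepSeek (深度求索). I am happy to help you!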

total duration:    489.831897ms
load duration:       24.116423ms
prompt eval count: 10 token(s)
prompt eval duration: 4ms
prompt eval rate:    2500.00 tokens/s
eval count:          36 token(s)
eval duration:       460ms
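eval rate:          78.26 tokens/s
>>> I am 小渣渣. What is my name?
<think>
Hmm, is 小渣渣 my name? Let me think carefully. I remember that when I was little my parents named me "小渣渣" because the family thought I had a head full of big ears, but that was not actually the real reason. Perhaps it was because as a child I always put my fingers in my mouth, like garbage, and so I was called "小渣渣".

However, the name sounds a bit odd now and does not really fit today's naming styles. I should consider choosing a new name that suits me better. First, I need to understand my own personality traits and interests so that I can pick a name that is both meaningful and distinctive.

I like helping others and am good at listening and communicating, so perhaps I could choose a name with "wisdom", "warmth", or "friendliness" in it. I also want to express my individuality, such as a love of books, music, or sports, and could work those elements into the name.

Furthermore, I need to make sure the new name will not be misremembered or misunderstood by others. This means the name needs a certain uniqueness and recognizability; it should be neither too complicated nor too ordinary. I can look at the names of some well-known figures to see which have cultural depth without losing a modern feel.

In addition, I need to consider whether it is easy to pronounce, whether it carries a positive meaning, and how it is expressed in different contexts. This means that a name must not only sound good but also have meaning and value behind it.

Finally, I need to talk with family and friends to make sure they are happy with the new name and that it can become part of my life. After all, a name is not only an identifier but also a part of self-expression.

In short, 小渣渣 is not my real name, and I have the right to choose a new name that suits me better. This takes time and thought, but by carefully analyzing my personality, my interests, and the meaning of names, I believe I can find the name that suits me best.
</think>

Based on the reasoning above, I have decided to name myself "智慧星" (Wisdom Star). The name reflects my intelligence and also carries a sense of warmth and hope, which suits my personality very well. At the same time, "星" (star) stands for light and unlimited possibility, encouraging me to keep pursuing progress and growth.

**Final name: 智慧星 (Wisdom Star)**

This name not only matches my personality but is also full of positive energy, which makes me feel very proud and satisfied.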

total duration:    7.600299527s
load duration:       24.37534ms
prompt eval count: 59 token(s)
prompt eval duration: 10ms
prompt eval rate:    5900.00 tokens/s
eval count:          557 token(s)
eval duration:       6.618s
eval rate:          84.16 tokens/s
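
The rates in the --verbose output are token counts divided by durations. As a minimal sketch, the following recomputes them from the figures printed above. The helper names are illustrative, and parse_duration handles only the ms and s suffixes that appear in this output.

```python
def parse_duration(text: str) -> float:
    """Convert a duration such as '460ms' or '6.618s' to seconds."""
    text = text.strip()
    if text.endswith("ms"):      # check 'ms' before 's'
        return float(text[:-2]) / 1000.0
    if text.endswith("s"):
        return float(text[:-1])
    raise ValueError(f"unsupported duration: {text!r}")


def tokens_per_second(count: int, duration: str) -> float:
    """Token count divided by the elapsed time in seconds."""
    return count / parse_duration(duration)


if __name__ == "__main__":
    # (label, token count, duration) taken from the two runs above.
    samples = [
        ("run 1 prompt eval", 10, "4ms"),
        ("run 1 eval", 36, "460ms"),
        ("run 2 prompt eval", 59, "10ms"),
        ("run 2 eval", 557, "6.618s"),
    ]
    for label, count, duration in samples:
        print(f"{label}: {tokens_per_second(count, duration):.2f} tokens/s")
```

The eval rate is the generation speed; the prompt eval rate covers processing of the input and is much higher because the prompt is evaluated in a batch.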

小渣渣 · Posted on 2025-2-6 10:22:02

Deploying Ollama models on an AMD GPU
ollama-for-amd: [hyperlink visible after login]


小渣渣 · Posted on 2025-2-6 13:26:17

Running the deepseek-r1:32b model

[Screenshot: QQ截图20250206132556.jpg]

root@VM-0-8-ubuntu:~# nvidia-smi
Thu Feb  6 13:25:04 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0    |
|-------------------------------+----------------------+----------------------+
| GPU  Name       Persistence-M| Bus-Id       Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|       Memory-Usage | GPU-Util  Compute M. |
|                            |                   |             MIG M. |
|===============================+======================+======================|
| 0  Tesla V100-SXM2...  On | 00000000:00:08.0 Off |                Off |
| N/A 65C P0 205W / 300W |  21822MiB / 32768MiB |    89%    Default |
|                            |                   |                N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                |
|  GPU GI CI       PID Type Process name                GPU Memory |
|       ID ID                                                 Usage    |
|=============================================================================|
| 0 N/A  N/A    91457    C ...1_avx/ollama_llama_server 21820MiB |
+-----------------------------------------------------------------------------+
root@VM-0-8-ubuntu:~# ollama show deepseek-r1:32b
  Model
architecture       qwen2
parameters       32.8B
context length    131072
embedding length 5120
quantization       Q4_K_M

  Parameters
stop "<｜begin▁of▁sentence｜>"
stop "<｜end▁of▁sentence｜>"
stop "<｜User｜>"
stop "<｜Assistant｜>"

  License
MIT License
Copyright (c) 2023 DeepSeek

root@VM-0-8-ubuntu:~# ollama ps
NAME             ID             SIZE    PROCESSOR UNTIL
deepseek-r1:32b 38056bbcbb2d 23 GB 100% GPU    Forever

小渣渣 · Posted on 2025-2-8 08:34:18

[Repost] How to fix Ollama model pull problems
https://www.itsvse.com/thread-10939-1-1.html

小渣渣 · Posted on 2025-2-13 09:25:04

Trying the DeepSeek R1 32b model on a Jetson AGX Orin (32 GB): [hyperlink visible after login]
Running large language models on Jetson: https://www.jetson-ai-lab.com/models.html
