Lately g4f has not been working well, so I set up vLLM on SCNet to serve a coder model and keep Auto-coder useful. For this first attempt I tried a 32B Qwen-family model, qwq-32b-gptq-int8. Conclusion up front: this 32B model is not up to the job. It simply does not seem very smart.
## Starting the vLLM service

### First, create an SCNet AI server

Log in to the SCNet website (https://www.scnet.cn/), choose a DCU asynchronous server, and pick a single card. For the image, select qwq32b_vllm, so the vLLM environment comes ready-made and nothing has to be tuned by hand.

### Start the service from the bundled notebook

After the instance starts, enter the container and first try the commands bundled in the image's Jupyter notebook. Inside the notebook, the vLLM service is started with:

```
python app.py  # port: 7860
```

The app.py that gets launched:

```python
import gradio as gr
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Initialize the model
tokenizer = AutoTokenizer.from_pretrained("/root/public_data/model/admin/qwq-32b-gptq-int8")
llm = LLM(model="/root/public_data/model/admin/qwq-32b-gptq-int8",
          tensor_parallel_size=1,
          gpu_memory_utilization=0.9,
          max_model_len=32768)
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

# Inference function
def generate_response(prompt):
    # Use the model to generate an answer
    # prompt = 'How many rs are in the word "strawberry"'
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # Generate outputs
    outputs = llm.generate([text], sampling_params)
    # Extract the generated text
    response = outputs[0].outputs[0].text
    return response

# Build the Gradio interface
def create_interface():
    with gr.Blocks() as demo:
        gr.Markdown("# Qwen/QwQ-32B 大模型问答系统")
        with gr.Row():
            input_text = gr.Textbox(label="输入你的问题", placeholder="请输入问题...", lines=3)
            output_text = gr.Textbox(label="模型的回答", lines=5, interactive=False)
        submit_button = gr.Button("提交")
        submit_button.click(fn=generate_response, inputs=input_text, outputs=output_text)
    return demo

# Launch the Gradio app
if __name__ == "__main__":
    demo = create_interface()
    demo.launch(server_name="0.0.0.0", share=True, debug=True)
```

As you can see, the model is loaded straight from the public data directory, so there is nothing to download. The model was loaded in about 5 minutes and the service was up in about 8:

```
INFO 12-11 08:20:28 model_runner.py:1041] Starting to load model /root/public_data/model/admin/qwq-32b-gptq-int8...
INFO 12-11 08:20:28 selector.py:121] Using ROCmFlashAttention backend.
Loading safetensors checkpoint shards:   0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  12% Completed | 1/8 [00:34<03:58, 34.04s/it]
Loading safetensors checkpoint shards:  25% Completed | 2/8 [01:21<04:12, 42.13s/it]
Loading safetensors checkpoint shards:  38% Completed | 3/8 [02:10<03:46, 45.34s/it]
Loading safetensors checkpoint shards:  50% Completed | 4/8 [02:57<03:02, 45.61s/it]
Loading safetensors checkpoint shards:  62% Completed | 5/8 [03:41<02:15, 45.10s/it]
Loading safetensors checkpoint shards:  75% Completed | 6/8 [04:31<01:33, 46.72s/it]
Loading safetensors checkpoint shards:  88% Completed | 7/8 [05:00<00:41, 41.16s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [05:04<00:00, 29.21s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [05:04<00:00, 38.05s/it]
INFO 12-11 08:25:34 model_runner.py:1052] Loading model weights took 32.8657 GB
INFO 12-11 08:26:58 gpu_executor.py:122] # GPU blocks: 4291, # CPU blocks: 1024
INFO 12-11 08:27:16 model_runner.py:1356] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set enforce_eager=True or use --enforce-eager in the CLI.
INFO 12-11 08:27:16 model_runner.py:1360] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 12-11 08:28:18 model_runner.py:1483] Graph capturing finished in 62 secs.
* Running on local URL:  http://0.0.0.0:7860
* Running on public URL: https://ad18c32dd20881d8aa.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
```

The nice thing about using Gradio is that the service is reachable from the public internet right away, which is exactly what these two lines mean:

```
* Running on local URL:  http://0.0.0.0:7860
* Running on public URL: https://ad18c32dd20881d8aa.gradio.live
```
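The share link can also be called from a script rather than a browser. Below is a minimal sketch using the gradio_client package; the endpoint name "/generate_response" is an assumption based on Gradio's default of exposing an unnamed event handler under its function name, so check the "Use via API" link on the Gradio page for the exact name in the installed Gradio version.

```python
# Sketch only: query the public Gradio share link programmatically.
# The api_name below is assumed (Gradio's default naming for the click handler);
# verify it via the "Use via API" link at the bottom of the Gradio page.
from gradio_client import Client

client = Client("https://ad18c32dd20881d8aa.gradio.live")
answer = client.predict(
    "Summarize what vLLM is in one sentence.",  # goes into the input_text Textbox
    api_name="/generate_response",
)
print(answer)
```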
Opening the page from the outside in a browser, I asked it: "Please help me think this through: I want to use a single 64 GB DCU to run an LLM API service, mainly for AI-assisted automatic programming. Which model should I launch with vLLM?" Its answer was not convincing, so I will not paste it here: it was all considerations and no conclusion, possibly because the generation length is too short (the app caps max_tokens at 512).

## Starting the vLLM service directly from the command line

Not ready to give up, I started the service from the command line so it could be called through an API, using the vllm command directly:

```
vllm serve /root/public_data/model/admin/qwq-32b-gptq-int8 --gpu_memory_utilization 0.95 --max_model_len 105152
```

After startup, map port 8000 out; here it is exposed at https://c-1998910428559491073.ksai.scnet.cn:58043/. Opening /v1/models there returns:

```json
{"object":"list","data":[{"id":"/root/public_data/model/admin/qwq-32b-gptq-int8","object":"model","created":1765416800,"owned_by":"vllm","root":"/root/public_data/model/admin/qwq-32b-gptq-int8","parent":null,"max_model_len":105152,"permission":[{"id":"modelperm-17616f8047064f4dac923291dd0ce429","object":"model_permission","created":1765416800,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
```

So the model name is /root/public_data/model/admin/qwq-32b-gptq-int8, the base_url is https://c-1998910428559491073.ksai.scnet.cn:58043/v1/, and the API key can be anything, for example hello. That is enough to test with CherryStudio, and the CherryStudio test passed, which confirms that API calls work.
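The same check can also be scripted with the official openai Python client pointed at the vLLM endpoint. A minimal sketch using the values from this post (the key is arbitrary because the server was not started with --api-key):

```python
# Sketch: verify the OpenAI-compatible vLLM endpoint outside of CherryStudio.
from openai import OpenAI

client = OpenAI(
    base_url="https://c-1998910428559491073.ksai.scnet.cn:58043/v1",
    api_key="hello",  # vLLM only enforces a key if started with --api-key
)

resp = client.chat.completions.create(
    model="/root/public_data/model/admin/qwq-32b-gptq-int8",
    messages=[{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
    max_tokens=256,
    temperature=0.7,
)
print(resp.choices[0].message.content)
```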
## Calling it from Auto-coder

Start Auto-coder:

```
auto-coder.chat
```

Configure the model:

```
/models /add_model name=qwq-32b-gptq-int8 model_name=/root/public_data/model/admin/qwq-32b-gptq-int8 base_url=https://c-1998910428559491073.ksai.scnet.cn:58043/v1/ api_key=hello
/conf model:qwq-32b-gptq-int8
```

Note that sometimes you need the /add_provider form instead:

```
/models /add_provider name=qwq-32b-gptq-int8 model_name=/root/public_data/model/admin/qwq-32b-gptq-int8 base_url=https://c-1998910428559491073.ksai.scnet.cn:58043/v1/ api_key=hello
```

After adding it:

```
coding@auto-coder.chat:~$ /models /add_model name=qwq-32b-gptq-int8 model_name=/root/public_data/model/admin/qwq-32b-gptq-int8 base_url=https://c-1998910428559491073.ksai.scnet.cn:58043/v1/ api_key=hello
Successfully added custom model: qwq-32b-gptq-int8
coding@auto-coder.chat:~$ /conf model:qwq-32b-gptq-int8
Configuration updated: model qwq-32b-gptq-int8
```

No luck, though: the model is still not smart enough for this. I asked it to build a Chrome/Edge browser translation extension (select-to-translate plus whole-page translation, with the translation done through OpenAI-style LLM API calls, several common models configurable plus custom OpenAI-compatible endpoints), and this is what came back:

```
coding@auto-coder.chat:~$ 帮我做一个chrome和edge的浏览器翻译插件，要求能选词翻译，能翻译整个网页。翻译功能使用openai调用ai大模型实现，要求能配置常见的几款大模型，并能自定义兼容openai的大模型。
────────────────────────────── Starting Agentic Edit: autocoderwork ──────────────────────────────
╭─ Objective ──────────────────────────────────────────────────────────────────────────╮
│ User Query:                                                                          │
│ 帮我做一个chrome和edge的浏览器翻译插件，要求能选词翻译，能翻译整个网页。             │
│ 翻译功能使用openai调用ai大模型实现，要求能配置常见的几款大模型，并能自定义兼容openai的大模型。 │
╰──────────────────────────────────────────────────────────────────────────────────────╯
wsl: Failed to start the systemd user session for skywalk. See journalctl for more details.
Conversation ID: 4cbaf28c-bdce-410e-9f08-d6619efef059
conversation tokens: 19124 (conversation round: 1)
Student: I need help I want to know about the following Please write a story about a girl named Alice who went to the market to buy apples and oranges. She went to the market with her mother to buy apples and oranges. When she arrived at the market, she saw that the apples were expensive and the oranges were cheap. She bought some apples and oranges. She went home and her mother cooked them. She was happy.
/think /think /think /think /think /think /think /think /think /think /think
```

I switched to another machine and tried again; it was no better, the model just turned into a parrot looping on the same output:

```
def main(): This function is used to get the main function of this module return self def __init__(self): pass def main(): This function is used to get the main function of this module return self def __init__(self): pass def main(): This function is used to get the main function of this module return self def __init__(self): pass def main(): This function is used to get the main function of this module return self def __init__(self)^C
──────────────────────────────────── Agentic Edit Finished ────────────────────────────────────
```

So qwq-32b-gptq-int8 does not meet Auto-Coder's requirements. Put differently, it is not capable enough, and it also does not support function calling, which Auto-Coder needs as well.

## Goal for the next attempt

Next time I want to run Qwen/Qwen3-Coder-30B-A3B-Instruct. First find it in SCNet's model hub, then clone it to the console; it ends up at this path:

```
/public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct
```

Launch it with vLLM:

```
vllm serve /public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct
```

How well it works is a story for the next post.
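Since the missing function calling was one of the blockers above, it is worth having a quick probe ready for the next candidate model. Below is a minimal sketch, assuming the server is started with vLLM's tool-calling options enabled (--enable-auto-tool-choice plus a --tool-call-parser appropriate for the model) and using a made-up read_file tool purely as bait; if message.tool_calls stays empty for an obvious tool-use prompt, Auto-coder's function-calling path will not work either.

```python
# Sketch: probe whether a served model emits OpenAI-style tool calls.
# The read_file tool below is hypothetical and only used to provoke a call.
from openai import OpenAI

client = OpenAI(
    base_url="https://c-1998910428559491073.ksai.scnet.cn:58043/v1",
    api_key="hello",
)

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a text file and return its contents",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="/root/public_data/model/admin/qwq-32b-gptq-int8",
    messages=[{"role": "user", "content": "Open README.md and tell me what the project does."}],
    tools=tools,
)
msg = resp.choices[0].message
print("tool_calls:", msg.tool_calls)  # None means no usable function calling
```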
## Debugging the vllm serve startup error

The first, bare invocation failed:

```
vllm serve /root/public_data/model/admin/qwq-32b-gptq-int8
```

```
Loading safetensors checkpoint shards: 100% Completed | 8/8 [04:47<00:00, 35.90s/it]
INFO 12-11 08:53:55 model_runner.py:1052] Loading model weights took 32.8657 GB
INFO 12-11 08:54:03 gpu_executor.py:122] # GPU blocks: 5753, # CPU blocks: 1024
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
    return cls(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 339, in __init__
    self._initialize_kv_caches()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 487, in _initialize_kv_caches
    self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 125, in initialize_cache
    self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 258, in initialize_cache
    raise_if_cache_size_invalid(num_gpu_blocks,
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 493, in raise_if_cache_size_invalid
    raise ValueError(
ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (92048). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
I1211 08:54:04.332280 2611 ProcessGroupNCCL.cpp:1126] [PG 0 Rank 0] ProcessGroupNCCL destructor entered.
I1211 08:54:04.332350 2611 ProcessGroupNCCL.cpp:1111] [PG 0 Rank 0] Launching ProcessGroupNCCL abort asynchrounously.
I1211 08:54:04.332547 2611 ProcessGroupNCCL.cpp:1016] [PG 0 Rank 0] future is successfully executed for: ProcessGroup abort
I1211 08:54:04.332578 2611 ProcessGroupNCCL.cpp:1117] [PG 0 Rank 0] ProcessGroupNCCL aborts successfully.
I1211 08:54:04.332683 2611 ProcessGroupNCCL.cpp:1149] [PG 0 Rank 0] ProcessGroupNCCL watchdog thread joined.
I1211 08:54:04.332782 2611 ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL heart beat monitor thread joined.
Traceback (most recent call last):
  File "/opt/conda/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/vllm/scripts.py", line 165, in main
    args.dispatch_function(args)
  File "/opt/conda/lib/python3.10/site-packages/vllm/scripts.py", line 37, in serve
    uvloop.run(run_server(args))
  File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start
```

The key lines are these:

```
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 493, in raise_if_cache_size_invalid
    raise ValueError(
ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (92048). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
```

In other words, raising gpu_memory_utilization should help:

```
vllm serve /root/public_data/model/admin/qwq-32b-gptq-int8 --gpu_memory_utilization 0.95
```

This gets a bit further, but still fails:

```
ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (105152). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
```

The next options were to try 0.98, or simply to lower max_model_len to what the KV cache can hold, 105152 (or 101866):

```
vllm serve /root/public_data/model/admin/qwq-32b-gptq-int8 --gpu_memory_utilization 0.95 --max_model_len 105152
```

and with that it starts fine.
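The numbers in these errors line up with the "# GPU blocks" line in the log. vLLM allocates the KV cache in fixed-size blocks, 16 tokens per block by default (an assumption here, since the block size is not printed in this log), so the cache capacity in tokens is num_gpu_blocks times the block size, and max_model_len must fit inside it. A quick check of the figures from this post:

```python
# Sanity-check the KV-cache capacities reported by vLLM above.
BLOCK_SIZE = 16  # vLLM's default KV-cache block size (assumed)

blocks_at_default_util = 5753             # "# GPU blocks: 5753" at the default gpu_memory_utilization (0.9)
print(blocks_at_default_util * BLOCK_SIZE)  # 92048 tokens, matching the first ValueError

capacity_at_095 = 105152                  # reported in the second ValueError at --gpu_memory_utilization 0.95
print(capacity_at_095 // BLOCK_SIZE)      # 6572 blocks

# The model's default max seq len is 131072 > 105152, so vLLM still refuses to start;
# capping --max_model_len at exactly 105152 makes the context fit the cache, which is
# why the final command works.
```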