<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://project-hami.io/blog</id>
    <title>HAMi Blog</title>
    <updated>2026-01-20T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://project-hami.io/blog"/>
    <subtitle>HAMi Blog</subtitle>
    <icon>https://project-hami.io/img/logo.svg</icon>
    <entry>
        <title type="html"><![CDATA[HAMi v2.8.0 Release: Full DRA Support and High Availability Scheduling—Towards Standardized GPU Resource Management]]></title>
        <id>https://project-hami.io/blog/hami-v2-8-0-release</id>
        <link href="https://project-hami.io/blog/hami-v2-8-0-release"/>
        <updated>2026-01-20T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[HAMi v2.8.0 is now officially released. This milestone release delivers architectural completeness, enhanced scheduling reliability, and ecosystem alignment, featuring DRA support, Leader election mechanism, CDI mode support, and more.]]></summary>
        <content type="html"><![CDATA[<p>The HAMi community is proud to announce the official release of <strong>HAMi v2.8.0</strong>. This release is a milestone in terms of <strong>architectural completeness, scheduling reliability, and ecosystem alignment</strong>.</p>
<p>v2.8.0 not only introduces multiple key feature updates but also delivers systematic enhancements in <strong>Kubernetes native standard alignment, heterogeneous device support, production readiness, and observability</strong>, making HAMi more suitable for AI production clusters that require long-term operation with high stability and clear evolution paths.</p>
<p>This article provides a detailed overview of the major updates in v2.8.0.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="core-features-and-capability-updates">Core Features and Capability Updates<a href="https://project-hami.io/blog/hami-v2-8-0-release#core-features-and-capability-updates" class="hash-link" aria-label="Direct link to Core Features and Capability Updates" title="Direct link to Core Features and Capability Updates" translate="no">​</a></h2>
<p>This section introduces the core features and capability updates in HAMi v2.8.0, covering standard interface support, high availability mechanisms, device compatibility, and more.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="official-support-for-kubernetes-device-resource-assignment-dra">Official Support for Kubernetes Dynamic Resource Allocation (DRA)<a href="https://project-hami.io/blog/hami-v2-8-0-release#official-support-for-kubernetes-device-resource-assignment-dra" class="hash-link" aria-label="Direct link to Official Support for Kubernetes Dynamic Resource Allocation (DRA)" title="Direct link to Official Support for Kubernetes Dynamic Resource Allocation (DRA)" translate="no">​</a></h3>
<p>HAMi v2.8.0 adds support for <strong>Kubernetes Dynamic Resource Allocation (DRA)</strong> and provides an independent implementation project:</p>
<ul>
<li class=""><a href="https://github.com/Project-HAMi/HAMi-DRA" target="_blank" rel="noopener noreferrer" class="">https://github.com/Project-HAMi/HAMi-DRA</a></li>
</ul>
<p>DRA is the next-generation device resource declaration and allocation mechanism being advanced by the Kubernetes community, aiming to provide a <strong>more standardized, composable, and scalable</strong> resource management model for GPUs/AI accelerators and other devices.</p>
<p>HAMi's support for DRA marks the project's transition from "custom device scheduling logic" to <strong>Kubernetes native standard interfaces</strong> in device resource management. This not only lays the foundation for more complex GPU/AI accelerator usage patterns but also opens space for HAMi's long-term evolution in the upstream ecosystem.</p>
<blockquote>
<p>A separate technical article will cover DRA's design philosophy, usage methods, and comparison with existing patterns.</p>
</blockquote>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="leader-election-mechanism-for-multiple-scheduler-instances">Leader Election Mechanism for Multiple Scheduler Instances<a href="https://project-hami.io/blog/hami-v2-8-0-release#leader-election-mechanism-for-multiple-scheduler-instances" class="hash-link" aria-label="Direct link to Leader Election Mechanism for Multiple Scheduler Instances" title="Direct link to Leader Election Mechanism for Multiple Scheduler Instances" translate="no">​</a></h3>
<p>For large-scale clusters or high-availability deployment scenarios, HAMi v2.8.0 introduces a <strong>Leader election mechanism for multiple Scheduler instances</strong> to enhance the stability and operability of the scheduling layer. This mechanism offers the following advantages:</p>
<ul>
<li class="">Avoids resource conflicts from concurrent scheduling by multiple instances</li>
<li class="">Improves the high availability of the Scheduler component</li>
<li class="">Provides a more robust operational model for long-running production clusters</li>
</ul>
<p>This mechanism makes HAMi more suitable for deployment in production environments with high requirements for stability and fault tolerance.</p>
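<p>The general idea behind leader election can be illustrated with a lease-based sketch. This is a generic, in-memory simulation and not HAMi's implementation: the <code>Lease</code> class, the <code>try_acquire</code> function, the instance names, and the 15-second lease duration are all assumptions made for illustration.</p>

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Hypothetical in-memory lease; a real cluster would use a shared lock object."""
    holder: str = ""
    expires_at: float = 0.0


def try_acquire(lease, candidate, now, duration=15.0):
    """Return True if `candidate` holds the lease after this call."""
    expired = now >= lease.expires_at
    if lease.holder == candidate or lease.holder == "" or expired:
        # Take a free or expired lease, or renew a lease we already hold.
        lease.holder = candidate
        lease.expires_at = now + duration
        return True
    # Another instance holds a valid lease, so this one stays on standby.
    return False


lease = Lease()
print(try_acquire(lease, "scheduler-a", now=0.0))   # scheduler-a becomes leader
print(try_acquire(lease, "scheduler-b", now=5.0))   # lease is still held by scheduler-a
print(try_acquire(lease, "scheduler-b", now=20.0))  # lease expired, scheduler-b takes over
```

<p>Only the instance that holds a valid lease makes scheduling decisions, which is what prevents conflicting concurrent allocations; the standby instances keep retrying so that one of them can take over when the lease expires.</p>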
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="nvidia-device-support-for-container-device-interface-cdi-mode">NVIDIA Device Support for Container Device Interface (CDI) Mode<a href="https://project-hami.io/blog/hami-v2-8-0-release#nvidia-device-support-for-container-device-interface-cdi-mode" class="hash-link" aria-label="Direct link to NVIDIA Device Support for Container Device Interface (CDI) Mode" title="Direct link to NVIDIA Device Support for Container Device Interface (CDI) Mode" translate="no">​</a></h3>
<p>HAMi v2.8.0 adds support for <strong>NVIDIA <a href="https://github.com/cncf-tags/container-device-interface" target="_blank" rel="noopener noreferrer" class="">CDI (Container Device Interface)</a></strong> mode, further reducing the coupling between device management and container runtime. Key features include:</p>
<ul>
<li class="">Using more standard device injection methods</li>
<li class="">Providing clearer device declaration and lifecycle management</li>
<li class="">Laying the foundation for future multi-runtime and multi-device models</li>
</ul>
<p>Users can choose between the traditional environment variable mode (<code>envvar</code>) or CDI mode (<code>cdi-annotations</code>) through the <code>deviceListStrategy</code> configuration in <code>values.yaml</code>.</p>
<p>This capability drives HAMi's continued evolution toward <strong>more cloud-native and composable device management</strong>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="alignment-with-nvidia-k8s-device-plugin-v0180">Alignment with NVIDIA k8s-device-plugin v0.18.0<a href="https://project-hami.io/blog/hami-v2-8-0-release#alignment-with-nvidia-k8s-device-plugin-v0180" class="hash-link" aria-label="Direct link to Alignment with NVIDIA k8s-device-plugin v0.18.0" title="Direct link to Alignment with NVIDIA k8s-device-plugin v0.18.0" translate="no">​</a></h3>
<p>In v2.8.0, HAMi has been upgraded to align with <strong>NVIDIA's official <a href="https://github.com/NVIDIA/k8s-device-plugin" target="_blank" rel="noopener noreferrer" class="">k8s-device-plugin</a> v0.18.0</strong>, with the following goals:</p>
<ul>
<li class="">Maintain compatibility with NVIDIA's latest device management models</li>
<li class="">Reduce user adaptation costs in hybrid deployment scenarios</li>
<li class="">Ensure HAMi serves as an "enhancement layer" for device management and scheduling, rather than a forked implementation</li>
</ul>
<p>This alignment helps users smoothly introduce HAMi into their existing NVIDIA GPU ecosystem.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="mock-device-plugin-support">Mock Device Plugin Support<a href="https://project-hami.io/blog/hami-v2-8-0-release#mock-device-plugin-support" class="hash-link" aria-label="Direct link to Mock Device Plugin Support" title="Direct link to Mock Device Plugin Support" translate="no">​</a></h3>
<p>To improve testability and development efficiency in engineering practice, v2.8.0 adds <strong><a href="https://github.com/Project-HAMi/mock-device-plugin" target="_blank" rel="noopener noreferrer" class="">Mock Device Plugin</a></strong> capabilities, suitable for the following scenarios:</p>
<ul>
<li class="">Feature validation and development debugging</li>
<li class="">Device simulation in CI/testing environments</li>
<li class="">Reducing costs for new feature validation and regression testing</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="build-information-and-metrics-system-updates">Build Information and Metrics System Updates<a href="https://project-hami.io/blog/hami-v2-8-0-release#build-information-and-metrics-system-updates" class="hash-link" aria-label="Direct link to Build Information and Metrics System Updates" title="Direct link to Build Information and Metrics System Updates" translate="no">​</a></h3>
<p>HAMi v2.8.0 includes enhancements and refinements in observability, specifically:</p>
<ul>
<li class="">New <code>hami_build_info</code> metric</li>
<li class="">More complete version and build information output at startup</li>
<li class="">Official removal of previously deprecated legacy metrics</li>
</ul>
<p>These improvements make version tracking, issue troubleshooting, and operational visibility clearer in production environments.</p>
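<p>Build-information metrics are conventionally exposed as a constant gauge with value 1 whose labels carry the version details. The sketch below renders such a metric in the Prometheus text exposition format. Only the metric name <code>hami_build_info</code> comes from the release notes; the <code>render_build_info</code> helper and the label names and values are placeholders, not HAMi's actual label set.</p>

```python
def render_build_info(metric, labels):
    """Format a constant gauge (value 1) in the Prometheus text exposition format."""
    # Sort labels so the output is stable; label values are assumed to need no escaping.
    pairs = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"# TYPE {metric} gauge\n{metric}{{{pairs}}} 1"


# Placeholder labels for illustration only.
print(render_build_info("hami_build_info", {"version": "v2.8.0", "revision": "unknown"}))
```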
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="heterogeneous-devices-and-vendor-ecosystem-progress">Heterogeneous Devices and Vendor Ecosystem Progress<a href="https://project-hami.io/blog/hami-v2-8-0-release#heterogeneous-devices-and-vendor-ecosystem-progress" class="hash-link" aria-label="Direct link to Heterogeneous Devices and Vendor Ecosystem Progress" title="Direct link to Heterogeneous Devices and Vendor Ecosystem Progress" translate="no">​</a></h2>
<p>HAMi continues to evolve around the <strong>unified management and scheduling capabilities of multiple GPU/AI accelerator types</strong>.</p>
<p>During the v2.8.0 cycle, the community has continued advancing in the following directions:</p>
<ul>
<li class="">Adaptation and capability enhancement for different GPU/AI accelerator device models</li>
<li class="">Continued support and feature additions for domestic GPU/AI chips</li>
<li class="">Continuous integration of related features and bug fixes (see GitHub PR records for details)</li>
</ul>
<p>These improvements further enhance HAMi's usability and room for expansion in heterogeneous compute environments.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="upstream-and-downstream-ecosystem-integration-progress">Upstream and Downstream Ecosystem Integration Progress<a href="https://project-hami.io/blog/hami-v2-8-0-release#upstream-and-downstream-ecosystem-integration-progress" class="hash-link" aria-label="Direct link to Upstream and Downstream Ecosystem Integration Progress" title="Direct link to Upstream and Downstream Ecosystem Integration Progress" translate="no">​</a></h2>
<p>HAMi is not just an independent project but also continues to co-evolve with key components in the Kubernetes AI ecosystem. Current major integration directions include:</p>
<ul>
<li class=""><strong>Kueue</strong>: The HAMi community has contributed enhancements to the Kueue project that add native support for HAMi's device resource management and scheduling model, bringing heterogeneous device scheduling to batch AI job queue management</li>
<li class=""><strong>vLLM</strong>: Fixed compatibility issues in multi-card scenarios (see related issues <a href="https://github.com/Project-HAMi/HAMi/issues/1461" target="_blank" rel="noopener noreferrer" class="">#1461</a> and <a href="https://github.com/Project-HAMi/HAMi/issues/1381" target="_blank" rel="noopener noreferrer" class="">#1381</a>)</li>
</ul>
<p>These ecosystem integrations help users build more complete compute scheduling and resource management solutions in real AI workloads.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="community-and-project-progress">Community and Project Progress<a href="https://project-hami.io/blog/hami-v2-8-0-release#community-and-project-progress" class="hash-link" aria-label="Direct link to Community and Project Progress" title="Direct link to Community and Project Progress" translate="no">​</a></h2>
<p>HAMi is not just a code repository but also a continuously evolving open-source community and project organization.</p>
<p>During the v2.8.0 cycle, the community has remained active in the following areas:</p>
<ul>
<li class="">Real-world usage feedback from users and vendors, such as the <a href="https://www.cncf.io/case-studies/daocloud/" target="_blank" rel="noopener noreferrer" class="">DaoCloud user case</a> of using HAMi to build GPU clouds, published on the CNCF official website</li>
</ul>
<p>The HAMi community welcomes more developers, users, and ecosystem partners to participate in the project and jointly advance GPU virtualization and device scheduling capabilities.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="summary">Summary<a href="https://project-hami.io/blog/hami-v2-8-0-release#summary" class="hash-link" aria-label="Direct link to Summary" title="Direct link to Summary" translate="no">​</a></h2>
<p>HAMi v2.8.0 is a significant version update focused on <strong>standardization, production readiness, and ecosystem alignment</strong>.</p>
<p>By introducing DRA, improving scheduler high availability, aligning with mainstream device plugins and runtime standards, and continuously expanding heterogeneous device and ecosystem integration, HAMi is steadily moving toward becoming a more mature and sustainable GPU resource management and scheduling platform.</p>]]></content>
        <author>
            <name>HAMi Community</name>
        </author>
        <category label="Release" term="Release"/>
        <category label="GPU" term="GPU"/>
        <category label="Kubernetes" term="Kubernetes"/>
        <category label="DRA" term="DRA"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Source Code Walkthrough of the GPU Pod Scheduling Process in HAMi]]></title>
        <id>https://project-hami.io/blog/hami-gpu-scheduling-source-code</id>
        <link href="https://project-hami.io/blog/hami-gpu-scheduling-source-code"/>
        <updated>2024-12-31T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A detailed source code analysis of HAMi's GPU Pod scheduling process, covering MutatingWebhook, scheduler extension, device registration, scoring algorithms, and binding implementation.]]></summary>
        <content type="html"><![CDATA[<p>When using HAMi, it is common for a newly created Pod to fail to reach the Running state, usually in one of the following two ways:</p>
<ul>
<li class="">Pod UnexpectedAdmissionError</li>
<li class="">Pod Pending</li>
</ul>
<p>This section provides a rough walkthrough of the related code to explain the interactions between components during scheduling and how resources are calculated. Other details may be omitted.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="scheduling-process">Scheduling Process<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#scheduling-process" class="hash-link" aria-label="Direct link to Scheduling Process" title="Direct link to Scheduling Process" translate="no">​</a></h2>
<p>Before diving into the code, it's helpful to first check the official documentation, which provides a clear overview:</p>
<p><img decoding="async" loading="lazy" src="https://github.com/Project-HAMi/HAMi/blob/master/docs/develop/imgs/flowchart.jpeg?raw=true" alt="flowchart" class="img_ev3q"></p>
<p>The process can be broken down into three phases:</p>
<ul>
<li class="">
<p><strong>Preparation Phase</strong>: From the diagram, we can see some prerequisites, such as the need for a Mutating Webhook, device-plugin, etc.<br>
<!-- -->This phase primarily analyzes the preparation of dependencies, which are only needed during the initial service startup.</p>
<p><img decoding="async" loading="lazy" src="https://github.com/elrondwong/elrond.wang/raw/master/img/posts/Hami-GPU-Pod-Scheduler/%E5%87%86%E5%A4%87%E5%B7%A5%E4%BD%9C.png" alt="Preparation before Pod creation" class="img_ev3q"></p>
</li>
<li class="">
<p><strong>Pod Scheduling Phase</strong>: After preparation, the Pod enters the scheduling process.</p>
</li>
<li class="">
<p><strong>Pod Startup Phase</strong>: How the Pod interacts with the GPU on the Node.</p>
</li>
</ul>
<p>This article focuses on analyzing the preparation phase, mainly around the scheduling logic.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="pod-scheduling-process">Pod Scheduling Process<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#pod-scheduling-process" class="hash-link" aria-label="Direct link to Pod Scheduling Process" title="Direct link to Pod Scheduling Process" translate="no">​</a></h2>
<ul>
<li class="">The user sends a request to create a Pod to the kube-apiserver.</li>
<li class="">The Admission Webhook is triggered, updating the <code>schedulerName</code> in the Pod.</li>
<li class="">The kube-apiserver sends the request to the scheduler based on the <code>schedulerName</code>.</li>
<li class="">The scheduler processes the request:<!-- -->
<ul>
<li class="">Collects node device information from node annotations, which the <code>hami-device-plugin</code> DaemonSet writes periodically.</li>
<li class="">Scores nodes based on device information and the Pod’s resource limits, selecting the highest-scoring node.</li>
<li class="">Binds the Pod to the node and completes the Pod creation.</li>
</ul>
</li>
</ul>
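<p>The filter-and-score step in the list above can be sketched as follows. This is a simplified illustration under stated assumptions, not HAMi's scoring algorithm: the <code>pick_node</code> function, the per-GPU free-memory input, and the tightest-fit score are all hypothetical, and <code>requested_mem</code> is assumed to be positive.</p>

```python
def pick_node(nodes, requested_mem):
    """Return the name of the best-scoring node that fits the request, or None.

    `nodes` maps a node name to a list of free memory (MiB) per GPU, standing in
    for the device information the scheduler reads from node annotations.
    """
    best_name, best_score = None, -1.0
    for name, free_per_gpu in sorted(nodes.items()):
        fitting = [free for free in free_per_gpu if free >= requested_mem]
        if not fitting:
            continue  # no single GPU on this node can satisfy the request
        # Hypothetical score: prefer the tightest fit to reduce fragmentation.
        score = requested_mem / min(fitting)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


cluster = {"node1": [4000, 16000], "node2": [8000]}
print(pick_node(cluster, 6000))   # node2 is the tighter fit
print(pick_node(cluster, 20000))  # no node fits, so the Pod would stay Pending
```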
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="common-issue-troubleshooting">Common Issue Troubleshooting<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#common-issue-troubleshooting" class="hash-link" aria-label="Direct link to Common Issue Troubleshooting" title="Direct link to Common Issue Troubleshooting" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="pod-unexpectedadmissionerror">Pod UnexpectedAdmissionError<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#pod-unexpectedadmissionerror" class="hash-link" aria-label="Direct link to Pod UnexpectedAdmissionError" title="Direct link to Pod UnexpectedAdmissionError" translate="no">​</a></h4>
<p>The Pod creation status shows <code>UnexpectedAdmissionError</code>.</p>
<p>In the process above, this error indicates that the kube-apiserver failed to call the extended scheduler. There are two common causes; for other cases, check the kube-apiserver logs.</p>
<ul>
<li class=""><strong>Communication Failure</strong>: The kube-apiserver cannot reach the HTTPS port of the extended scheduler. Possible reasons:<!-- -->
<ul>
<li class="">DNS resolution failure.</li>
<li class="">Cross-node communication issues.</li>
<li class="">Extended scheduler service failure.</li>
</ul>
</li>
<li class=""><strong>TLS Verification Error</strong>: Typically shows <code>webhook x509: certificate signed by unknown authority</code>.<br>
<!-- -->During Helm chart deployment, there's a <code>jobs.batch</code> job called <code>hami-vgpu.admission-patch</code>. If it hasn't completed, this issue may occur.</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="scheduling-issues">Scheduling Issues<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#scheduling-issues" class="hash-link" aria-label="Direct link to Scheduling Issues" title="Direct link to Scheduling Issues" translate="no">​</a></h4>
<p>The Pod remains in the <code>Pending</code> state. Run <code>kubectl describe</code> on the Pod to see the specific reason; common messages are:</p>
<ul>
<li class=""><code>card Insufficient remaining memory</code></li>
<li class=""><code>calcScore: node not fit pod</code></li>
</ul>
<p>The main causes are usually either an actual resource shortage or a misconfiguration.<br>
<!-- -->Misconfiguration usually means an incorrect <code>devicememoryscaling</code> setting. It can be set in two places, and the node-level configuration takes precedence over the global one. A common pitfall is that the <code>name</code> field must exactly match the node name shown by <code>kubectl get node</code>.</p>
<ul>
<li class="">
<p><strong>Global Configuration</strong>: <code>kubectl get cm hami-scheduler-device</code></p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token key atrule">deviceMemoryScaling</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">3</span><br></span></code></pre></div></div>
</li>
<li class="">
<p><strong>Node Configuration</strong>: <code>kubectl get cm hami-device-plugin</code></p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token property">"nodeconfig"</span><span class="token operator" style="color:rgb(137, 221, 255)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token property">"name"</span><span class="token operator" style="color:rgb(137, 221, 255)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"node1"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token property">"devicememoryscaling"</span><span class="token operator" style="color:rgb(137, 221, 255)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">3</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span 
class="token property">"devicesplitcount"</span><span class="token operator" style="color:rgb(137, 221, 255)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">10</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token property">"migstrategy"</span><span class="token operator" style="color:rgb(137, 221, 255)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"none"</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token property">"filterdevices"</span><span class="token operator" style="color:rgb(137, 221, 255)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        </span><span class="token property">"uuid"</span><span class="token operator" style="color:rgb(137, 221, 255)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        </span><span class="token property">"index"</span><span class="token operator" style="color:rgb(137, 221, 255)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token punctuation" style="color:rgb(199, 146, 
234)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><br></span></code></pre></div></div>
</li>
</ul>
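<p>The precedence rule and the exact-name pitfall can be made concrete with a small sketch. The key names mirror the two ConfigMap snippets above, but the <code>effective_memory_scaling</code> function and the fallback value of 1 are assumptions for illustration, not HAMi's actual lookup code.</p>

```python
def effective_memory_scaling(node_name, global_config, node_configs):
    """Return the node-level devicememoryscaling if the name matches exactly,
    otherwise fall back to the global deviceMemoryScaling value."""
    for entry in node_configs.get("nodeconfig", []):
        # The comparison is exact and case-sensitive, like the pitfall described above.
        if entry.get("name") == node_name and "devicememoryscaling" in entry:
            return entry["devicememoryscaling"]
    return global_config.get("deviceMemoryScaling", 1)  # default of 1 is an assumption


global_cfg = {"deviceMemoryScaling": 1}
node_cfg = {"nodeconfig": [{"name": "node1", "devicememoryscaling": 3}]}

print(effective_memory_scaling("node1", global_cfg, node_cfg))  # node-level value applies
print(effective_memory_scaling("Node1", global_cfg, node_cfg))  # name mismatch, global value applies
```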
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="mutatingwebhook">MutatingWebhook<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#mutatingwebhook" class="hash-link" aria-label="Direct link to MutatingWebhook" title="Direct link to MutatingWebhook" translate="no">​</a></h3>
<p>Kubernetes provides admission webhooks, registered here through a <code>MutatingWebhookConfiguration</code>, which the kube-apiserver calls when matching resource operations occur.<br>
<!-- -->Their most common use is intercepting Pod creation and modifying the Pod definition, for example by adding an init container to inject files.</p>
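<p>As a minimal sketch of what a mutating webhook returns, the example below builds an <code>AdmissionReview</code> response carrying a base64-encoded JSONPatch that sets the Pod's <code>schedulerName</code>. The response shape follows the Kubernetes admission API, but the <code>build_admission_response</code> function and the scheduler name <code>hami-scheduler</code> are assumptions for illustration; HAMi's real webhook is written in Go and does more than this.</p>

```python
import base64
import json


def build_admission_response(review, scheduler_name):
    """Build an AdmissionReview response that sets spec.schedulerName via JSONPatch."""
    # A JSON Patch "add" on an existing object member replaces its value.
    patch = [{"op": "add", "path": "/spec/schedulerName", "value": scheduler_name}]
    encoded = base64.b64encode(json.dumps(patch).encode("utf-8")).decode("ascii")
    return {
        # Echo the version and kind of the incoming review.
        "apiVersion": review["apiVersion"],
        "kind": "AdmissionReview",
        "response": {
            "uid": review["request"]["uid"],  # must match the request uid
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": encoded,
        },
    }


# Hypothetical incoming review, reduced to the fields this sketch reads.
sample = {
    "apiVersion": "admission.k8s.io/v1",
    "kind": "AdmissionReview",
    "request": {"uid": "example-uid"},
}
result = build_admission_response(sample, "hami-scheduler")
print(json.loads(base64.b64decode(result["response"]["patch"])))
```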
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="webhook-configuration">Webhook Configuration<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#webhook-configuration" class="hash-link" aria-label="Direct link to Webhook Configuration" title="Direct link to Webhook Configuration" translate="no">​</a></h4>
<p>hami-webhook:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io hami-webhook </span><span class="token parameter variable" style="color:rgb(191, 199, 213)">-o</span><span class="token plain"> yaml</span><br></span></code></pre></div></div>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token key atrule">apiVersion</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> admissionregistration.k8s.io/v1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">kind</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> MutatingWebhookConfiguration</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">metadata</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">annotations</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">meta.helm.sh/release-name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">meta.helm.sh/release-namespace</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> kube</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">system</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">creationTimestamp</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"2024-12-10T03:50:37Z"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">generation</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">5</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">labels</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">app.kubernetes.io/managed-by</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Helm</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">webhook</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">resourceVersion</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"2307810"</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">uid</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 2cdcebe4</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">f561</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">429f</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">9480</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token number" style="color:rgb(247, 140, 108)">701e65980687</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">webhooks</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">admissionReviewVersions</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> v1beta1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">clientConfig</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">caBundle</span><span class="token punctuation" style="color:rgb(199, 146, 
234)">:</span><span class="token plain"> LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkakNDQVJ5Z0F3SUJBZ0lSQUxjd2FQMjUrMlphdGhTTlFMcG1qT0V3Q2dZSUtvWkl6ajBFQXdJd0R6RU4KTUFzR0ExVUVDaE1FYm1sc01UQWdGdzB5TkRFeU1EWXdOekV4TVRWYUdBOHlNVEkwTVRFeE1qQTNNVEV4TlZvdwpEekVOTUFzR0ExVUVDaE1FYm1sc01UQlpNQk1HQnlxR1NNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJDUnlXUDdYCkRmT2N4NEVTMVRYaUs0dnFFU2wrcUFHYjI2YzNrOEdMWlZTL1lHaFpLZVVxaEgydVRhTFdWTW1hZVJFbkxqM0cKSStMVFRVTTR6SVhEUld5alZ6QlZNQTRHQTFVZER3RUIvd1FFQXdJQ0JEQVRCZ05WSFNVRUREQUtCZ2dyQmdFRgpCUWNEQVRBUEJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJTcVV4bWpGa29YUlpRK0xXVzBNM1pJCnMzck1wakFLQmdncWhrak9QUVFEQWdOSUFEQkZBaUJSY2VRL2tJVkR2VTV3Vjl0K3NRWm93TmFhTWhIMTV5K2sKT3VrR0FlRGVtQUloQUxDZzFrM0JQZUJBNG8reWY5emxvVjM2VEk2RHUzaGdMT1B3MXhaZkFvcDMKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">service</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">scheduler</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">namespace</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> kube</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">system</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">path</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span 
class="token plain"> /webhook</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">port</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">443</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">failurePolicy</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Ignore</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">matchPolicy</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Equivalent</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> vgpu.hami.io</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">namespaceSelector</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">matchExpressions</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">key</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token 
plain"> hami.io/webhook</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">operator</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> NotIn</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">values</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> ignore</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">objectSelector</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">matchExpressions</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">key</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami.io/webhook</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key atrule">operator</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> NotIn</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token key 
atrule">values</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> ignore</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">reinvocationPolicy</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Never</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">rules</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token key atrule">apiGroups</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">""</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">apiVersions</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> v1</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">operations</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> CREATE</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">resources</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain"> pods</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">scope</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">'*'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">sideEffects</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> None</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">timeoutSeconds</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">10</span><br></span></code></pre></div></div>
<p>When a Pod is created, the API server calls <code>https://hami-scheduler.kube-system:443/webhook</code> over TLS and verifies the server certificate against the CA configured in <code>caBundle</code>.<br>
<!-- -->If the namespace or the Pod itself carries the label <code>hami.io/webhook: ignore</code>, the webhook is not triggered.</p>
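<p>The selector behaviour described above can be sketched in plain Go. This is only an illustration of how a <code>NotIn</code> requirement evaluates, not the API server's implementation; <code>notIn</code> and <code>webhookApplies</code> are hypothetical helper names, and labels are modelled as simple string maps.</p>

```go
package main

import "fmt"

// notIn reports whether a NotIn label requirement is satisfied: the label is
// either absent, or its value is not one of the listed values.
func notIn(labels map[string]string, key string, values []string) bool {
	v, ok := labels[key]
	if !ok {
		return true // an absent key satisfies NotIn
	}
	for _, candidate := range values {
		if v == candidate {
			return false
		}
	}
	return true
}

// webhookApplies (hypothetical) combines the namespaceSelector and the
// objectSelector from the configuration: both must match for the webhook
// to be called.
func webhookApplies(nsLabels, podLabels map[string]string) bool {
	ignore := []string{"ignore"}
	if !notIn(nsLabels, "hami.io/webhook", ignore) {
		return false
	}
	return notIn(podLabels, "hami.io/webhook", ignore)
}

func main() {
	none := map[string]string{}
	skip := map[string]string{"hami.io/webhook": "ignore"}

	fmt.Println(webhookApplies(none, none)) // true
	fmt.Println(webhookApplies(skip, none)) // false
	fmt.Println(webhookApplies(none, skip)) // false
}
```

<p>Labelling either the namespace or an individual Pod is therefore enough to opt out of the HAMi webhook.</p>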
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="webhook-server-implementation">Webhook Server Implementation<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#webhook-server-implementation" class="hash-link" aria-label="Direct link to Webhook Server Implementation" title="Direct link to Webhook Server Implementation" translate="no">​</a></h4>
<p>The scheduler therefore has to run a TLS-enabled HTTP server that exposes the <code>/webhook</code> endpoint.</p>
<p>cmd/scheduler/main.go:84</p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func start() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> ...</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> router.POST("/webhook", routes.WebHookRoute())</span><br></span></code></pre></div></div>
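<p>For readers unfamiliar with the router used above, the same idea can be sketched with only the Go standard library. This is a minimal sketch, not HAMi's code: <code>newWebhookMux</code> and <code>statusFor</code> are hypothetical names, the response body is a placeholder rather than a complete AdmissionReview, and <code>httptest.NewTLSServer</code> stands in for a server started with real certificates.</p>

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newWebhookMux (hypothetical) registers a POST-only /webhook endpoint.
func newWebhookMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST only", http.StatusMethodNotAllowed)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		// Placeholder body; a real webhook returns a full AdmissionReview.
		fmt.Fprint(w, `{"allowed":true}`)
	})
	return mux
}

// statusFor starts a throwaway TLS test server, sends one request to
// /webhook with the given method, and returns the HTTP status code
// (0 if the request could not be made).
func statusFor(method string) int {
	srv := httptest.NewTLSServer(newWebhookMux())
	defer srv.Close()

	req, err := http.NewRequest(method, srv.URL+"/webhook", strings.NewReader("{}"))
	if err != nil {
		return 0
	}
	// srv.Client() trusts the test server's self-signed certificate.
	resp, err := srv.Client().Do(req)
	if err != nil {
		return 0
	}
	defer resp.Body.Close()
	return resp.StatusCode
}

func main() {
	fmt.Println(statusFor(http.MethodPost)) // 200
	fmt.Println(statusFor(http.MethodGet))  // 405
}
```

<p>In a real deployment the server would instead be started with something like <code>http.ListenAndServeTLS</code> and a certificate signed by the CA referenced in <code>caBundle</code>.</p>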
<p><code>WebHookRoute</code> needs to implement the admission handler interface defined at <code>sigs.k8s.io/controller-runtime@v0.16.3/pkg/webhook/admission/webhook.go:98</code>.</p>
<p>pkg/scheduler/webhook.go:52</p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain"> pod := &amp;corev1.Pod{}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> err := h.decoder.Decode(req, pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Errorf("Failed to decode request: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return admission.Errored(http.StatusBadRequest, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if len(pod.Spec.Containers) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Warningf(template+" - Denying admission as pod has no containers", req.Namespace, req.Name, req.UID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return admission.Denied("pod has no containers")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infof(template, req.Namespace, req.Name, req.UID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> hasResource := false</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for idx, ctr := range pod.Spec.Containers 
{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  c := &amp;pod.Spec.Containers[idx]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if ctr.SecurityContext != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   if ctr.SecurityContext.Privileged != nil &amp;&amp; *ctr.SecurityContext.Privileged {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    klog.Warningf(template+" - Denying admission as container %s is privileged", req.Namespace, req.Name, req.UID, c.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  for _, val := range device.GetDevices() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   found, err := val.MutateAdmission(c, pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    klog.Errorf("validating pod failed:%s", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    return admission.Errored(http.StatusInternalServerError, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   hasResource = hasResource || found</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if !hasResource {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Infof(template+" - Allowing admission for pod: no resource found", req.Namespace, req.Name, req.UID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  //return admission.Allowed("no resource found")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> } else if len(config.SchedulerName) &gt; 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  pod.Spec.SchedulerName = config.SchedulerName</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if pod.Spec.NodeName != "" {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Infof(template+" - Pod already has node assigned", req.Namespace, req.Name, req.UID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   return admission.Denied("pod has node assigned")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> marshaledPod, err := json.Marshal(pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Errorf(template+" - Failed to marshal pod, error: %v", req.Namespace, req.Name, req.UID, err)</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">  return admission.Errored(http.StatusInternalServerError, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return admission.PatchResponseFromRaw(req.Object.Raw, marshaledPod)</span><br></span></code></pre></div></div>
<p>Whether a Pod is handed to the extended scheduler depends mainly on the resource limits declared by its containers.</p>
<p>pkg/device/nvidia/device.go:246</p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (dev *NvidiaGPUDevices) MutateAdmission(ctr *corev1.Container, p *corev1.Pod) (bool, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> /*gpu related */</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> priority, ok := ctr.Resources.Limits[corev1.ResourceName(dev.config.ResourcePriority)]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if ok {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  ctr.Env = append(ctr.Env, corev1.EnvVar{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Name:  util.TaskPriority,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Value: fmt.Sprint(priority.Value()),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  })</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> _, resourceNameOK := ctr.Resources.Limits[corev1.ResourceName(dev.config.ResourceCountName)]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if resourceNameOK {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return resourceNameOK, 
nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> _, resourceCoresOK := ctr.Resources.Limits[corev1.ResourceName(dev.config.ResourceCoreName)]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> _, resourceMemOK := ctr.Resources.Limits[corev1.ResourceName(dev.config.ResourceMemoryName)]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> _, resourceMemPercentageOK := ctr.Resources.Limits[corev1.ResourceName(dev.config.ResourceMemoryPercentageName)]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if resourceCoresOK || resourceMemOK || resourceMemPercentageOK {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if dev.config.DefaultGPUNum &gt; 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   ctr.Resources.Limits[corev1.ResourceName(dev.config.ResourceCountName)] = *resource.NewQuantity(int64(dev.config.DefaultGPUNum), resource.BinarySI)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   resourceNameOK = true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if !resourceNameOK &amp;&amp; 
dev.config.OverwriteEnv {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  ctr.Env = append(ctr.Env, corev1.EnvVar{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Name:  "NVIDIA_VISIBLE_DEVICES",</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Value: "none",</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  })</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return resourceNameOK, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>The webhook mainly checks whether the <code>resources.limits</code> of a container in the Pod include any of the resource names defined in <code>device-config.yaml</code>.<br>
<!-- -->If they do, the Pod goes through the HAMi scheduling process.</p>
<p>An example of <code>device-config</code> for NVIDIA GPUs:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token key atrule">nvidia</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">resourceCountName</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> nvidia.com/gpu</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">resourceMemoryName</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> nvidia.com/gpumem</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">resourceMemoryPercentageName</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> nvidia.com/gpumem</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">percentage</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">resourceCoreName</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> nvidia.com/gpucores</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">resourcePriorityName</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 
nvidia.com/priority</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">overwriteEnv</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token boolean important" style="color:rgb(255, 88, 116)">false</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">defaultMemory</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">0</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">defaultCores</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">0</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">defaultGPUNum</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">deviceSplitCount</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">10</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">deviceMemoryScaling</span><span class="token punctuation" 
style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">3</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">deviceCoreScaling</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token number" style="color:rgb(247, 140, 108)">3</span><br></span></code></pre></div></div>
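<p>Using the resource names from this example configuration, the check performed by <code>MutateAdmission</code> can be approximated as follows. This is a simplified sketch under stated assumptions: <code>deviceConfig</code> and <code>usesHAMi</code> are hypothetical names, limits are modelled as a plain <code>map[string]int64</code> instead of a Kubernetes <code>ResourceList</code>, and the priority and <code>overwriteEnv</code> handling is omitted.</p>

```go
package main

import "fmt"

// deviceConfig (hypothetical) mirrors the resource-name fields of the
// example device-config shown above.
type deviceConfig struct {
	ResourceCountName            string
	ResourceMemoryName           string
	ResourceMemoryPercentageName string
	ResourceCoreName             string
	DefaultGPUNum                int64
}

// usesHAMi reports whether a container with the given limits should be
// scheduled by HAMi. As in the original function, a container that only
// requests memory or cores gets the default GPU count added to its limits.
func usesHAMi(cfg deviceConfig, limits map[string]int64) bool {
	if _, ok := limits[cfg.ResourceCountName]; ok {
		return true
	}
	_, cores := limits[cfg.ResourceCoreName]
	_, mem := limits[cfg.ResourceMemoryName]
	_, memPct := limits[cfg.ResourceMemoryPercentageName]
	if !(cores || mem || memPct) {
		return false
	}
	if cfg.DefaultGPUNum > 0 {
		limits[cfg.ResourceCountName] = cfg.DefaultGPUNum
		return true
	}
	return false
}

func main() {
	cfg := deviceConfig{
		ResourceCountName:            "nvidia.com/gpu",
		ResourceMemoryName:           "nvidia.com/gpumem",
		ResourceMemoryPercentageName: "nvidia.com/gpumem-percentage",
		ResourceCoreName:             "nvidia.com/gpucores",
		DefaultGPUNum:                1,
	}

	memOnly := map[string]int64{"nvidia.com/gpumem": 3000}
	fmt.Println(usesHAMi(cfg, memOnly), memOnly["nvidia.com/gpu"]) // true 1

	cpuOnly := map[string]int64{"cpu": 2}
	fmt.Println(usesHAMi(cfg, cpuOnly)) // false
}
```

<p>A container that asks only for <code>nvidia.com/gpumem</code> is thus treated as requesting <code>defaultGPUNum</code> GPUs, while a container with no HAMi resource names is left to the default scheduler.</p>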
<p>Once it is determined that the Pod should follow the HAMi scheduling process, the webhook returns a patch that sets the Pod's <code>schedulerName</code> to the HAMi scheduler name.</p>
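<p>In the webhook code shown earlier, the patch is derived by <code>admission.PatchResponseFromRaw</code>, which diffs the original and the mutated Pod. As a rough illustration of what such a patch looks like, the sketch below writes the single JSON Patch operation by hand; <code>patchOp</code> and <code>schedulerNamePatch</code> are hypothetical names, and the exact operations produced by the real diff may differ.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is one JSON Patch (RFC 6902) operation with a string value.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// schedulerNamePatch (hypothetical) builds a patch that sets
// spec.schedulerName. "add" is used because, for an object member, it
// creates the field or replaces an existing value.
func schedulerNamePatch(name string) (string, error) {
	ops := []patchOp{{Op: "add", Path: "/spec/schedulerName", Value: name}}
	raw, err := json.Marshal(ops)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

func main() {
	patch, err := schedulerNamePatch("hami-scheduler")
	if err != nil {
		panic(err)
	}
	fmt.Println(patch)
	// [{"op":"add","path":"/spec/schedulerName","value":"hami-scheduler"}]
}
```

<p>The API server applies the returned patch to the Pod before it is persisted, so the Pod is picked up by the scheduler profile with the matching name.</p>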
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="extending-the-kubernetes-scheduler">Extending the Kubernetes Scheduler<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#extending-the-kubernetes-scheduler" class="hash-link" aria-label="Direct link to Extending the Kubernetes Scheduler" title="Direct link to Extending the Kubernetes Scheduler" translate="no">​</a></h3>
<p>The <a href="https://kubernetes.io/docs/reference/config-api/kube-scheduler-config.v1/" target="_blank" rel="noopener noreferrer" class="">KubeSchedulerConfiguration</a> allows the Kubernetes scheduler to be extended, either by implementing plugin extension points or, as HAMi does here, by registering an HTTP extender.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="kubeschedulerconfiguration">KubeSchedulerConfiguration<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#kubeschedulerconfiguration" class="hash-link" aria-label="Direct link to KubeSchedulerConfiguration" title="Direct link to KubeSchedulerConfiguration" translate="no">​</a></h4>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">kubectl get cm hami</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">scheduler</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">newversion </span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">o yaml</span><br></span></code></pre></div></div>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token key atrule">apiVersion</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> v1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">data</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">config.yaml</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">|</span><span class="token scalar string" style="color:rgb(195, 232, 141)"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">    apiVersion: kubescheduler.config.k8s.io/v1beta2</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">    kind: KubeSchedulerConfiguration</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">    leaderElection:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      leaderElect: false</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">    profiles:</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">    - schedulerName: hami-scheduler</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">    extenders:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">    - urlPrefix: "https://127.0.0.1:443"</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      filterVerb: filter</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      bindVerb: bind</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      nodeCacheCapable: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      weight: 1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      httpTimeout: 30s</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      enableHTTPS: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      tlsConfig:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        insecure: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      managedResources:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: 
nvidia.com/gpu</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: nvidia.com/gpumem</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: nvidia.com/gpucores</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: nvidia.com/gpumem-percentage</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: nvidia.com/priority</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: cambricon.com/vmlu</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: 
hygon.com/dcunum</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: hygon.com/dcumem</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: hygon.com/dcucores</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">      - name: iluvatar.ai/vgpu</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token scalar string" style="color:rgb(195, 232, 141)">        ignoredByScheduler: true</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">kind</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> ConfigMap</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">metadata</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">annotations</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">meta.helm.sh/release-name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">meta.helm.sh/release-namespace</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> kube</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">system</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">creationTimestamp</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"2024-12-10T03:50:36Z"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">labels</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">app.kubernetes.io/component</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">scheduler</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">app.kubernetes.io/instance</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token 
plain"> hami</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">app.kubernetes.io/managed-by</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Helm</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">app.kubernetes.io/name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">app.kubernetes.io/version</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 2.4.1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">helm.sh/chart</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">2.4.1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">name</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> hami</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">scheduler</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">newversion</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">namespace</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> kube</span><span class="token punctuation" style="color:rgb(199, 146, 
234)">-</span><span class="token plain">system</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">resourceVersion</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"2316275"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">uid</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 3a61a72c</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">0bab</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">432f</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">b4d7</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">5c1ae46ee14d</span><br></span></code></pre></div></div>
<p>The extended scheduler is customized through <a href="https://kubernetes.io/docs/reference/scheduling/config/#extension-points" target="_blank" rel="noopener noreferrer" class="">extension points</a>.<br>
<!-- -->In this case, the <code>filter</code> and <code>bind</code> extension points are implemented:</p>
<ul>
<li class=""><strong>filter</strong>: Narrows the candidate nodes; HAMi's implementation returns only the single highest-scoring node.</li>
<li class=""><strong>bind</strong>: Creates a <code>binding</code> resource for the Pod.</li>
</ul>
<p>During scheduling, kube-scheduler invokes the extended scheduler's implementations in the order of the extension points.<br>
<!-- -->Here, it first calls <code>https://127.0.0.1:443/filter</code>, followed by <code>https://127.0.0.1:443/bind</code>; each URL is the configured <code>urlPrefix</code> plus the corresponding verb.</p>
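<p>The request flow above can be illustrated with a small, self-contained sketch using only the Go standard library. It is not HAMi's code: <code>filterArgs</code>, <code>filterResult</code>, <code>extenderURL</code>, and <code>filterNodes</code> are hypothetical names, the types are simplified stand-ins for the real extender request and response types, and the "keep the first node" policy is a placeholder. The sketch only shows how a verb is appended to the URL prefix and how a filter handler turns a JSON request into a JSON response.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"strings"
)

// filterArgs and filterResult are simplified, hypothetical stand-ins for the
// extender request/response types; only the fields needed here are kept.
type filterArgs struct {
	PodName   string   `json:"podName"`
	NodeNames []string `json:"nodeNames"`
}

type filterResult struct {
	NodeNames []string `json:"nodeNames,omitempty"`
	Error     string   `json:"error,omitempty"`
}

// extenderURL joins a URL prefix with a verb such as "filter" or "bind".
func extenderURL(urlPrefix, verb string) string {
	return strings.TrimRight(urlPrefix, "/") + "/" + verb
}

// filterNodes is a placeholder policy (an assumption for illustration):
// it keeps only the first candidate node.
func filterNodes(args filterArgs) filterResult {
	if len(args.NodeNames) == 0 {
		return filterResult{Error: "no candidate nodes"}
	}
	return filterResult{NodeNames: args.NodeNames[:1]}
}

// filterHandler decodes the request body, applies filterNodes,
// and writes the result back as JSON.
func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args filterArgs
	var result filterResult
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		result = filterResult{Error: err.Error()}
	} else {
		result = filterNodes(args)
	}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(result); err != nil {
		log.Printf("encode filter result: %v", err)
	}
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/filter", filterHandler)

	// Simulate the scheduler's POST in-process instead of opening a socket.
	body := strings.NewReader(`{"podName":"demo","nodeNames":["node-a","node-b"]}`)
	target := extenderURL("http://127.0.0.1:8080", "filter")
	req := httptest.NewRequest(http.MethodPost, target, body)
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, req)

	fmt.Print(rec.Body.String())
}
```

<p>A <code>bind</code> handler would follow the same decode, act, encode pattern on its own route; it is omitted here to keep the sketch short.</p>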
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="starting-the-extended-scheduler-http-server">Starting the Extended Scheduler HTTP Server<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#starting-the-extended-scheduler-http-server" class="hash-link" aria-label="Direct link to Starting the Extended Scheduler HTTP Server" title="Direct link to Starting the Extended Scheduler HTTP Server" translate="no">​</a></h4>
<p><code>cmd/scheduler/main.go:70</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func start() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> device.InitDevices()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> sher = scheduler.NewScheduler()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> sher.Start()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> defer sher.Stop()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // start monitor metrics</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> go sher.RegisterFromNodeAnnotations()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> go initMetrics(config.MetricsBindAddress)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // start http server</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> router := httprouter.New()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> router.POST("/filter", routes.PredicateRoute(sher))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> router.POST("/bind", 
routes.Bind(sher))</span><br></span></code></pre></div></div>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="filter-implementation">filter implementation<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#filter-implementation" class="hash-link" aria-label="Direct link to filter implementation" title="Direct link to filter implementation" translate="no">​</a></h4>
<p><code>pkg/scheduler/routes/route.go:41</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func PredicateRoute(s *scheduler.Scheduler) httprouter.Handle {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infoln("Into Predicate Route outer func")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return func(w http.ResponseWriter, r *http.Request, _ httprouter.Params) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Infoln("Into Predicate Route inner func")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  checkBody(w, r)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  var buf bytes.Buffer</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  body := io.TeeReader(r.Body, &amp;buf)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  var extenderArgs extenderv1.ExtenderArgs</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  var extenderFilterResult *extenderv1.ExtenderFilterResult</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  
if err := json.NewDecoder(body).Decode(&amp;extenderArgs); err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Errorln("decode error", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   extenderFilterResult = &amp;extenderv1.ExtenderFilterResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    Error: err.Error(),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  } else {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   extenderFilterResult, err = s.Filter(extenderArgs)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    klog.Errorf("pod %v filter error, %v", extenderArgs.Pod.Name, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    extenderFilterResult = &amp;extenderv1.ExtenderFilterResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Error: err.Error(),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if resultBody, err := json.Marshal(extenderFilterResult); err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span 
class="token plain">   klog.Errorf("Failed to marshal extenderFilterResult: %+v, %+v",</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    err, extenderFilterResult)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.Header().Set("Content-Type", "application/json")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.WriteHeader(http.StatusInternalServerError)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.Write([]byte(err.Error()))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  } else {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.Header().Set("Content-Type", "application/json")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.WriteHeader(http.StatusOK)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.Write(resultBody)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p><code>pkg/scheduler/scheduler.go:430</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (s *Scheduler) Filter(args extenderv1.ExtenderArgs) (*extenderv1.ExtenderFilterResult, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.InfoS("begin schedule filter", "pod", args.Pod.Name, "uuid", args.Pod.UID, "namespaces", args.Pod.Namespace)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> nums := k8sutil.Resourcereqs(args.Pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> total := 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, n := range nums {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  for _, k := range n {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   total += int(k.Nums)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if total == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.V(1).Infof("pod %v not find resource", args.Pod.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, fmt.Errorf("does not request any resource"))</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">  return &amp;extenderv1.ExtenderFilterResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   NodeNames:   args.NodeNames,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   FailedNodes: nil,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Error:       "",</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annos := args.Pod.Annotations</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> s.delPod(args.Pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> nodeUsage, failedNodes, err := s.getNodesUsage(args.NodeNames, args.Pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if len(failedNodes) != 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.V(5).InfoS("getNodesUsage failed nodes", "nodes", failedNodes)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> nodeScores, err := 
s.calcScore(nodeUsage, nums, annos, args.Pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  err := fmt.Errorf("calcScore failed %v for pod %v", err, args.Pod.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if len((*nodeScores).NodeList) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.V(4).Infof("All node scores do not meet for pod %v", args.Pod.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, fmt.Errorf("no available node, all node scores do not meet"))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return &amp;extenderv1.ExtenderFilterResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   FailedNodes: failedNodes,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(4).Infoln("nodeScores_len=", len((*nodeScores).NodeList))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> sort.Sort(nodeScores)</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain"> m := (*nodeScores).NodeList[len((*nodeScores).NodeList)-1]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infof("schedule %v/%v to %v %v", args.Pod.Namespace, args.Pod.Name, m.NodeID, m.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annotations := make(map[string]string)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annotations[util.AssignedNodeAnnotations] = m.NodeID</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annotations[util.AssignedTimeAnnotations] = strconv.FormatInt(time.Now().Unix(), 10)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, val := range device.GetDevices() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  val.PatchAnnotations(&amp;annotations, m.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //InRequestDevices := util.EncodePodDevices(util.InRequestDevices, m.devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //supportDevices := util.EncodePodDevices(util.SupportDevices, m.devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //maps.Copy(annotations, InRequestDevices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //maps.Copy(annotations, supportDevices)</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain"> s.addPod(args.Pod, m.NodeID, m.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> err = util.PatchPodAnnotations(args.Pod, annotations)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.delPod(args.Pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringSucceed, []string{m.NodeID}, nil)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> res := extenderv1.ExtenderFilterResult{NodeNames: &amp;[]string{m.NodeID}}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return &amp;res, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>The core logic here consists of two main steps: <code>getNodesUsage</code> retrieves the resource usage of each node, and <code>calcScore</code> scores each node from its allocated and total resources. The nodes are then sorted by score, and the highest-scoring one is selected.</p>
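<p>The selection step can be sketched in isolation as follows. This is a minimal illustration, not HAMi's implementation: <code>nodeScore</code>, <code>nodeScoreList</code>, <code>usageScore</code>, and <code>pickBest</code> are hypothetical names, and the used/total ratio is an assumed scoring rule rather than the formula in <code>calcScore</code>. What it does mirror from the code above is the pattern of sorting in ascending order and taking the last element.</p>

```go
package main

import (
	"fmt"
	"sort"
)

// nodeScore is a hypothetical, simplified per-node score entry.
type nodeScore struct {
	NodeID string
	Score  float64
}

// nodeScoreList implements sort.Interface, ordering by ascending score.
type nodeScoreList []nodeScore

func (l nodeScoreList) Len() int           { return len(l) }
func (l nodeScoreList) Less(i, j int) bool { return l[i].Score < l[j].Score }
func (l nodeScoreList) Swap(i, j int)      { l[i], l[j] = l[j], l[i] }

// usageScore is an assumed scoring rule (allocated divided by total);
// it is for illustration only and is not taken from calcScore.
func usageScore(used, total int) float64 {
	if total == 0 {
		return 0
	}
	return float64(used) / float64(total)
}

// pickBest sorts the list in ascending order and returns the last entry,
// which is the highest-scoring node. It reports false for an empty list.
func pickBest(list nodeScoreList) (nodeScore, bool) {
	if len(list) == 0 {
		return nodeScore{}, false
	}
	sort.Sort(list)
	return list[len(list)-1], true
}

func main() {
	list := nodeScoreList{
		{NodeID: "node-a", Score: usageScore(2, 8)},
		{NodeID: "node-b", Score: usageScore(6, 8)},
		{NodeID: "node-c", Score: usageScore(4, 8)},
	}
	if best, ok := pickBest(list); ok {
		fmt.Println("selected:", best.NodeID)
	}
}
```

<p>Checking for an empty list before indexing corresponds to the <code>len((*nodeScores).NodeList) == 0</code> guard in <code>Filter</code>, which returns early when no node qualifies.</p>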
<h5 class="anchor anchorTargetStickyNavbar_Vzrq" id="retrieving-node-resource-information">Retrieving Node Resource Information<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#retrieving-node-resource-information" class="hash-link" aria-label="Direct link to Retrieving Node Resource Information" title="Direct link to Retrieving Node Resource Information" translate="no">​</a></h5>
<p><code>pkg/scheduler/scheduler.go:241</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (s *Scheduler) getNodesUsage(nodes *[]string, task *corev1.Pod) (*map[string]*NodeUsage, map[string]string, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> overallnodeMap := make(map[string]*NodeUsage)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> cachenodeMap := make(map[string]*NodeUsage)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> failedNodes := make(map[string]string)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //for _, nodeID := range *nodes {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> allNodes, err := s.ListNodes()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return &amp;overallnodeMap, failedNodes, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, node := range allNodes {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  nodeInfo := &amp;NodeUsage{}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  userGPUPolicy := 
config.GPUSchedulerPolicy</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if task != nil &amp;&amp; task.Annotations != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   if value, ok := task.Annotations[policy.GPUSchedulerPolicyAnnotationKey]; ok {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    userGPUPolicy = value</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  nodeInfo.Node = node.Node</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  nodeInfo.Devices = policy.DeviceUsageList{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Policy:      userGPUPolicy,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   DeviceLists: make([]*policy.DeviceListsScore, 0),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  for _, d := range node.Devices {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   nodeInfo.Devices.DeviceLists = append(nodeInfo.Devices.DeviceLists, &amp;policy.DeviceListsScore{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    Score: 0,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    Device: &amp;util.DeviceUsage{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     ID:        d.ID,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token 
plain">     Index:     d.Index,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Used:      0,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Count:     d.Count,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Usedmem:   0,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Totalmem:  d.Devmem,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Totalcore: d.Devcore,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Usedcores: 0,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     MigUsage: util.MigInUse{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      Index:     0,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      UsageList: make(util.MIGS, 0),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     },</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     MigTemplate: d.MIGTemplate,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Mode:        d.Mode,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Type:        d.Type,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Numa:        d.Numa,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     Health:      d.Health,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    },</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   })</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  overallnodeMap[node.ID] = nodeInfo</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> podsInfo := s.ListPodsInfo()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, p := range podsInfo {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  node, ok := overallnodeMap[p.NodeID]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if !ok {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  for _, podsingleds := range p.Devices {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   for _, ctrdevs := range podsingleds {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    for _, udevice := range ctrdevs {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     for _, d := range node.Devices.DeviceLists {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      deviceID := udevice.UUID</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      if strings.Contains(deviceID, "[") {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">       deviceID = strings.Split(deviceID, 
"[")[0]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      if d.Device.ID == deviceID {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">       d.Device.Used++</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">       d.Device.Usedmem += udevice.Usedmem</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">       d.Device.Usedcores += udevice.Usedcores</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">       if strings.Contains(udevice.UUID, "[") {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        tmpIdx, Instance := util.ExtractMigTemplatesFromUUID(udevice.UUID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        if len(d.Device.MigUsage.UsageList) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">         util.PlatternMIG(&amp;d.Device.MigUsage, d.Device.MigTemplate, tmpIdx)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        d.Device.MigUsage.UsageList[Instance].InUse = true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">        klog.V(3).Infoln("add mig usage", d.Device.MigUsage, "template=", d.Device.MigTemplate, "uuid=", d.Device.ID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">       }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     
}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.V(5).Infof("usage: pod %v assigned %v %v", p.Name, p.NodeID, p.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> s.overviewstatus = overallnodeMap</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, nodeID := range *nodes {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  node, err := s.GetNode(nodeID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   // The identified node does not have a gpu device, so the log here has no practical meaning,increase log priority.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.V(5).InfoS("node unregistered", "node", nodeID, "error", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   failedNodes[nodeID] = "node unregistered"</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  cachenodeMap[node.ID] = overallnodeMap[node.ID]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain"> s.cachedstatus = cachenodeMap</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return &amp;cachenodeMap, failedNodes, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
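<p>The per-device accounting inside <code>getNodesUsage</code> can be condensed into a small sketch. This is a simplification under stated assumptions, not HAMi's implementation: <code>deviceUsage</code>, <code>podDevice</code>, <code>baseID</code>, and <code>accumulate</code> are hypothetical names, and a map lookup replaces the linear scan over <code>DeviceLists</code>. It shows only how a MIG suffix (everything from <code>[</code> onward) is stripped before the usage of each assigned device is added to its physical GPU.</p>

```go
package main

import (
	"fmt"
	"strings"
)

// deviceUsage is a simplified, hypothetical stand-in for util.DeviceUsage.
type deviceUsage struct {
	ID        string
	Used      int32
	Usedmem   int32
	Usedcores int32
}

// podDevice is a hypothetical stand-in for one device assigned to a container.
type podDevice struct {
	UUID      string
	Usedmem   int32
	Usedcores int32
}

// baseID drops a MIG suffix such as "[0-1]" so the UUID matches the physical device ID.
func baseID(uuid string) string {
	if i := strings.Index(uuid, "["); i >= 0 {
		return uuid[:i]
	}
	return uuid
}

// accumulate adds every assigned device to the matching entry in devices.
// Assignments that reference an unknown device are skipped.
func accumulate(devices map[string]*deviceUsage, assigned []podDevice) {
	for _, a := range assigned {
		d, ok := devices[baseID(a.UUID)]
		if !ok {
			continue
		}
		d.Used++
		d.Usedmem += a.Usedmem
		d.Usedcores += a.Usedcores
	}
}

func main() {
	devices := map[string]*deviceUsage{"GPU-a": {ID: "GPU-a"}}
	accumulate(devices, []podDevice{
		{UUID: "GPU-a", Usedmem: 1024, Usedcores: 30},
		{UUID: "GPU-a[0-1]", Usedmem: 2048, Usedcores: 20},
		{UUID: "GPU-unknown", Usedmem: 512, Usedcores: 10},
	})
	fmt.Printf("%+v\n", *devices["GPU-a"])
}
```

<p>Both the plain and the MIG-style UUID are counted against the same physical device, while the unknown UUID is ignored, mirroring the <code>d.Device.ID == deviceID</code> check in the original loop.</p>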
<p>To compute the total and allocated resources of each Node, <code>getNodesUsage</code> first gathers the Node information through <code>ListNodes</code>.</p>
<p><code>pkg/scheduler/nodes.go:120</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (m *nodeManager) ListNodes() (map[string]*util.NodeInfo, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> m.mutex.RLock()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> defer m.mutex.RUnlock()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return m.nodes, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>Node information is served from an in-memory cache guarded by a read-write mutex. Entries are added to this cache by the <code>addNode</code> function.</p>
<h6 class="anchor anchorTargetStickyNavbar_Vzrq" id="node-cache">Node Cache<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#node-cache" class="hash-link" aria-label="Direct link to Node Cache" title="Direct link to Node Cache" translate="no">​</a></h6>
<p><code>pkg/scheduler/nodes.go:46</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (m *nodeManager) addNode(nodeID string, nodeInfo *util.NodeInfo) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if nodeInfo == nil || len(nodeInfo.Devices) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> m.mutex.Lock()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> defer m.mutex.Unlock()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> _, ok := m.nodes[nodeID]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if ok {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if len(nodeInfo.Devices) &gt; 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   tmp := make([]util.DeviceInfo, 0, len(nodeInfo.Devices))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   devices := device.GetDevices()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   deviceType := ""</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   for _, val := range devices {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if 
strings.Contains(nodeInfo.Devices[0].Type, val.CommonWord()) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     deviceType = val.CommonWord()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   for _, val := range m.nodes[nodeID].Devices {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if !strings.Contains(val.Type, deviceType) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     tmp = append(tmp, val)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   m.nodes[nodeID].Devices = tmp</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   m.nodes[nodeID].Devices = append(m.nodes[nodeID].Devices, nodeInfo.Devices...)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> } else {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  m.nodes[nodeID] = nodeInfo</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
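<p><code>addNode</code> ignores nodes that report no devices. For a node that is already cached, it uses <code>device.GetDevices()</code> to determine the vendor type of the incoming devices, drops the cached devices of that type, and appends the new ones. For a node that is not yet cached, it stores the node information as-is.</p>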
<p><code>pkg/device/devices.go:81</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func GetDevices() map[string]Devices {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return devices</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>The <code>devices</code> map is itself a cache, which will be analyzed later. First, let's look at where the Node cache is populated.</p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (s *Scheduler) RegisterFromNodeAnnotations() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(5).Infoln("Scheduler into RegisterFromNodeAnnotations")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> ticker := time.NewTicker(time.Second * 15)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  select {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  case &lt;-s.nodeNotify:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  case &lt;-ticker.C:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  case &lt;-s.stopCh:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   return</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  labelSelector := labels.Everything()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if len(config.NodeLabelSelector) &gt; 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   labelSelector = (labels.Set)(config.NodeLabelSelector).AsSelector()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  
}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  rawNodes, err := s.nodeLister.List(labelSelector)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Errorln("nodes list failed", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  var nodeNames []string</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  for _, val := range rawNodes {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   nodeNames = append(nodeNames, val.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   for devhandsk, devInstance := range device.GetDevices() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    health, needUpdate := devInstance.CheckHealth(devhandsk, val)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    klog.V(5).InfoS("device check health", "node", val.Name, "deviceVendor", devhandsk, "health", health, "needUpdate", needUpdate)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if !health {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     err := devInstance.NodeCleanUp(val.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     // If the device is not healthy, the device is removed from the node.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     // At 
the same time, this node needs to be removed from the cache.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      klog.Errorln("node cleanup failed", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     info, ok := s.nodes[val.Name]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     if ok {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      klog.Infof("node %v device %s:%v leave, %v remaining devices:%v", val.Name, devhandsk, info.ID, err, s.nodes[val.Name].Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      s.rmNodeDevice(val.Name, info, devhandsk)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if !needUpdate {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    _, ok := util.HandshakeAnnos[devhandsk]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if ok {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     tmppat := 
make(map[string]string)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     tmppat[util.HandshakeAnnos[devhandsk]] = "Requesting_" + time.Now().Format("2006.01.02 15:04:05")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     klog.V(4).InfoS("New timestamp", util.HandshakeAnnos[devhandsk], tmppat[util.HandshakeAnnos[devhandsk]], "nodeName", val.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     n, err := util.GetNode(val.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      klog.Errorln("get node failed", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     util.PatchNodeAnnotations(n, tmppat)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    nodeInfo := &amp;util.NodeInfo{}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    nodeInfo.ID = val.Name</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    nodeInfo.Node = val</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    nodedevices, err := devInstance.GetNodeDevices(*val)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if err != nil 
{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    nodeInfo.Devices = make([]util.DeviceInfo, 0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    for _, deviceinfo := range nodedevices {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     nodeInfo.Devices = append(nodeInfo.Devices, *deviceinfo)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    s.addNode(val.Name, nodeInfo)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if s.nodes[val.Name] != nil &amp;&amp; len(nodeInfo.Devices) &gt; 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     klog.Infof("node %v device %s come node info=%s,%v total=%v", val.Name, devhandsk, nodeInfo.ID, nodeInfo.Devices, s.nodes[val.Name].Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  _, _, err = s.getNodesUsage(&amp;nodeNames, nil)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Errorln("get node usage failed", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span 
class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
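<p><code>RegisterFromNodeAnnotations</code> runs a loop that wakes every 15 seconds on a ticker, or earlier when <code>s.nodeNotify</code> fires, to retrieve Node information and maintain the Node cache. It returns when <code>s.stopCh</code> fires.</p>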
<p>The core logic of <code>RegisterFromNodeAnnotations</code> is the loop <code>for devhandsk, devInstance := range device.GetDevices()</code>, which iterates over all registered device types.<br>
A separate handler is registered for each device type, and each handler reads the GPU resource information of a Node through <code>devInstance.GetNodeDevices</code>.</p>
<p>In this case, the registered device is NVIDIA, so the NVIDIA implementation of <code>GetNodeDevices</code> is called. The specifics of the <code>device</code> package will be explained later.</p>
<p><code>pkg/device/nvidia/device.go:209</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (dev *NvidiaGPUDevices) GetNodeDevices(n corev1.Node) ([]*util.DeviceInfo, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devEncoded, ok := n.Annotations[RegisterAnnos]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if !ok {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return []*util.DeviceInfo{}, errors.New("annos not found " + RegisterAnnos)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> nodedevices, err := util.DecodeNodeDevices(devEncoded)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.ErrorS(err, "failed to decode node devices", "node", n.Name, "device annotation", devEncoded)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return []*util.DeviceInfo{}, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if len(nodedevices) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.InfoS("no nvidia gpu device found", "node", n.Name, "device annotation", devEncoded)</span><br></span><span class="token-line"
style="color:#bfc7d5"><span class="token plain">  return []*util.DeviceInfo{}, errors.New("no gpu found on node")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, val := range nodedevices {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if val.Mode == "mig" {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   val.MIGTemplate = make([]util.Geometry, 0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   for _, migTemplates := range dev.config.MigGeometriesList {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    found := false</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    for _, migDevices := range migTemplates.Models {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     if strings.Contains(val.Type, migDevices) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      found = true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      break</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if found {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     val.MIGTemplate = append(val.MIGTemplate, migTemplates.Geometries...)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     break</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    
}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devDecoded := util.EncodeNodeDevices(nodedevices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(5).InfoS("nodes device information", "node", n.Name, "nodedevices", devDecoded)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return nodedevices, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>To summarize so far: the scheduler periodically reads the device registration annotation from each Node, decodes it, and keeps the result in the Node cache for use during scheduling. The annotation looks like this:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token key atrule">apiVersion</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> v1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">kind</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Node</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">metadata</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">annotations</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token punctuation" style="color:rgb(199, 146, 234)">...</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">hami.io/node-nvidia-register</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 'GPU</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">7aebc545</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">cbd3</span><span class="token punctuation" 
style="color:rgb(199, 146, 234)">-</span><span class="token plain">18a0</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">afce</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">76cae449702a</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token number" style="color:rgb(247, 140, 108)">10</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token number" style="color:rgb(247, 140, 108)">24576</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token number" style="color:rgb(247, 140, 108)">300</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain">NVIDIA</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">NVIDIA</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      GeForce RTX 3090</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token number" style="color:rgb(247, 140, 108)">0</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain">true</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><br></span></code></pre></div></div>
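<p>To make the annotation above concrete, here is a minimal sketch of how one registered device record could be split into fields. The <code>deviceEntry</code> type and <code>parseDeviceEntry</code> function are hypothetical, and the meaning assigned to each position (ID, count, memory, cores, type, NUMA, health) is an assumption inferred from the sample value rather than HAMi's actual decoder.</p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv">package main

import (
	"fmt"
	"strconv"
	"strings"
)

// deviceEntry is a hypothetical view of one comma-separated device record.
// The field meanings are assumptions inferred from the sample annotation.
type deviceEntry struct {
	ID     string
	Count  int
	Mem    int
	Cores  int
	Type   string
	Numa   int
	Health bool
}

// parseDeviceEntry splits a single record such as
// "GPU-...,10,24576,300,NVIDIA-NVIDIA GeForce RTX 3090,0,true:".
func parseDeviceEntry(s string) (deviceEntry, error) {
	record := strings.TrimSuffix(strings.TrimSpace(s), ":")
	f := strings.Split(record, ",")
	if len(f) != 7 {
		return deviceEntry{}, fmt.Errorf("expected 7 fields, got %d", len(f))
	}
	// Positions 1, 2, 3 and 5 are treated as integers in this sketch.
	nums := make([]int, 0, 4)
	for _, i := range []int{1, 2, 3, 5} {
		n, err := strconv.Atoi(f[i])
		if err != nil {
			return deviceEntry{}, fmt.Errorf("field %d: %w", i, err)
		}
		nums = append(nums, n)
	}
	healthy, err := strconv.ParseBool(f[6])
	if err != nil {
		return deviceEntry{}, fmt.Errorf("field 6: %w", err)
	}
	return deviceEntry{
		ID:     f[0],
		Count:  nums[0],
		Mem:    nums[1],
		Cores:  nums[2],
		Type:   f[4],
		Numa:   nums[3],
		Health: healthy,
	}, nil
}

func main() {
	e, err := parseDeviceEntry("GPU-7aebc545-cbd3-18a0-afce-76cae449702a,10,24576,300,NVIDIA-NVIDIA GeForce RTX 3090,0,true:")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s count=%d mem=%d cores=%d healthy=%t\n", e.Type, e.Count, e.Mem, e.Cores, e.Health)
}
</code></pre></div></div>
<p>A scheduler that refreshes such records on a timer only needs to re-run a parser like this and overwrite the cached entry for the node.</p>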
<p>The scheduler code above calls into the <code>device</code> package again, which we will look into later. For now, let's find out who calls <code>RegisterFromNodeAnnotations</code>.</p>
<p><code>cmd/scheduler/main.go:70</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func start() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> device.InitDevices()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> sher = scheduler.NewScheduler()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> sher.Start()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> defer sher.Stop()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // start monitor metrics</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> go sher.RegisterFromNodeAnnotations()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> go initMetrics(config.MetricsBindAddress)</span><br></span></code></pre></div></div>
<p>The scheduler launches it in a goroutine during startup, which completes the picture. Now let's return to the <code>device</code> package from earlier.</p>
<h6 class="anchor anchorTargetStickyNavbar_Vzrq" id="device">device<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#device" class="hash-link" aria-label="Direct link to device" title="Direct link to device" translate="no">​</a></h6>
<p>The <code>device</code> package is initialized in <code>pkg/device/devices.go:85</code>.</p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func InitDevicesWithConfig(config *Config) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devices = make(map[string]Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> DevicesToHandle = []string{}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devices[nvidia.NvidiaGPUDevice] = nvidia.InitNvidiaDevice(config.NvidiaConfig)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devices[cambricon.CambriconMLUDevice] = cambricon.InitMLUDevice(config.CambriconConfig)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devices[hygon.HygonDCUDevice] = hygon.InitDCUDevice(config.HygonConfig)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devices[iluvatar.IluvatarGPUDevice] = iluvatar.InitIluvatarDevice(config.IluvatarConfig)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devices[mthreads.MthreadsGPUDevice] = mthreads.InitMthreadsDevice(config.MthreadsConfig)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devices[metax.MetaxGPUDevice] = metax.InitMetaxDevice(config.MetaxConfig)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> DevicesToHandle = append(DevicesToHandle, 
nvidia.NvidiaGPUCommonWord)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> DevicesToHandle = append(DevicesToHandle, cambricon.CambriconMLUCommonWord)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> DevicesToHandle = append(DevicesToHandle, hygon.HygonDCUCommonWord)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> DevicesToHandle = append(DevicesToHandle, iluvatar.IluvatarGPUCommonWord)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> DevicesToHandle = append(DevicesToHandle, mthreads.MthreadsGPUCommonWord)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> DevicesToHandle = append(DevicesToHandle, metax.MetaxGPUCommonWord)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, dev := range ascend.InitDevices(config.VNPUs) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  devices[dev.CommonWord()] = dev</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  DevicesToHandle = append(DevicesToHandle, dev.CommonWord())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>Since NVIDIA is used here, we mainly need to focus on <code>InitNvidiaDevice</code>.</p>
<p><code>pkg/device/devices.go:42</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">type Devices interface {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> CommonWord() string</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> MutateAdmission(ctr *corev1.Container, pod *corev1.Pod) (bool, error)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> CheckHealth(devType string, n *corev1.Node) (bool, bool)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> NodeCleanUp(nn string) error</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> GetNodeDevices(n corev1.Node) ([]*util.DeviceInfo, error)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> CheckType(annos map[string]string, d util.DeviceUsage, n util.ContainerDeviceRequest) (bool, bool, bool)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // CheckUUID is check current device id whether in GPUUseUUID or GPUNoUseUUID set, return true is check success.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> CheckUUID(annos map[string]string, d util.DeviceUsage) bool</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> LockNode(n *corev1.Node, p *corev1.Pod) error</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> ReleaseNodeLock(n *corev1.Node, p *corev1.Pod) error</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain"> GenerateResourceRequests(ctr *corev1.Container) util.ContainerDeviceRequest</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> PatchAnnotations(annoinput *map[string]string, pd util.PodDevices) map[string]string</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> CustomFilterRule(allocated *util.PodDevices, request util.ContainerDeviceRequest, toAllicate util.ContainerDevices, device *util.DeviceUsage) bool</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> ScoreNode(node *corev1.Node, podDevices util.PodSingleDevice, policy string) float32</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> AddResourceUsage(n *util.DeviceUsage, ctr *util.ContainerDevice) error</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // This should not be associated with a specific device object</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //ParseConfig(fs *flag.FlagSet)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>Here, the <code>Devices</code> interface is defined, and each device type implements it. The implementations are initialized when the scheduler starts and are used at runtime.</p>
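<p>The pattern of one interface with per-vendor implementations registered in a map at startup can be sketched in a few lines. The names <code>Device</code>, <code>fakeGPU</code>, <code>fakeNPU</code>, and <code>initDevices</code> are hypothetical, and the interface is cut down to a single method, so this is only an illustration of the registration idea.</p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv">package main

import "fmt"

// Device is a hypothetical, cut-down stand-in for the Devices interface.
type Device interface {
	CommonWord() string
}

// fakeGPU and fakeNPU are placeholder vendor implementations.
type fakeGPU struct{}

func (fakeGPU) CommonWord() string { return "GPU" }

type fakeNPU struct{}

func (fakeNPU) CommonWord() string { return "NPU" }

// initDevices registers every implementation under its common word and
// records the order in which device types should be handled.
func initDevices(impls ...Device) (map[string]Device, []string) {
	devices := make(map[string]Device)
	toHandle := []string{}
	for _, d := range impls {
		devices[d.CommonWord()] = d
		toHandle = append(toHandle, d.CommonWord())
	}
	return devices, toHandle
}

func main() {
	devices, toHandle := initDevices(fakeGPU{}, fakeNPU{})
	// Callers look up an implementation by name and use only the interface.
	for _, name := range toHandle {
		fmt.Println(name, devices[name].CommonWord())
	}
}
</code></pre></div></div>
<p>Because callers depend only on the interface, supporting another device type means adding one implementation and one registration call.</p>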
<p>Once the resource information for each device on each node is retrieved, scoring begins.</p>
<h5 class="anchor anchorTargetStickyNavbar_Vzrq" id="scoring-based-on-node-resource-information">Scoring Based on Node Resource Information<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#scoring-based-on-node-resource-information" class="hash-link" aria-label="Direct link to Scoring Based on Node Resource Information" title="Direct link to Scoring Based on Node Resource Information" translate="no">​</a></h5>
<p><code>pkg/scheduler/scheduler.go:458</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain"> nodeScores, err := s.calcScore(nodeUsage, nums, annos, args.Pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  err := fmt.Errorf("calcScore failed %v for pod %v", err, args.Pod.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span></code></pre></div></div>
<p><code>pkg/scheduler/score.go:198</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (s *Scheduler) calcScore(nodes *map[string]*NodeUsage, nums util.PodDeviceRequests, annos map[string]string, task *corev1.Pod) (*policy.NodeScoreList, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> userNodePolicy := config.NodeSchedulerPolicy</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if annos != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if value, ok := annos[policy.NodeSchedulerPolicyAnnotationKey]; ok {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   userNodePolicy = value</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> res := policy.NodeScoreList{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  Policy:   userNodePolicy,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  NodeList: make([]*policy.NodeScore, 0),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //func calcScore(nodes *map[string]*NodeUsage, 
errMap *map[string]string, nums util.PodDeviceRequests, annos map[string]string, task *corev1.Pod) (*NodeScoreList, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // res := make(NodeScoreList, 0, len(*nodes))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for nodeID, node := range *nodes {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  viewStatus(*node)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  score := policy.NodeScore{NodeID: nodeID, Node: node.Node, Devices: make(util.PodDevices), Score: 0}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  score.ComputeDefaultScore(node.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  //This loop is for different container request</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  ctrfit := false</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  for ctrid, n := range nums {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   sums := 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   for _, k := range n {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    sums += int(k.Nums)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   if sums == 0 {</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">    for idx := range score.Devices {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     for len(score.Devices[idx]) &lt;= ctrid {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      score.Devices[idx] = append(score.Devices[idx], util.ContainerDevices{})</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     score.Devices[idx][ctrid] = append(score.Devices[idx][ctrid], util.ContainerDevice{})</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.V(5).InfoS("fitInDevices", "pod", klog.KObj(task), "node", nodeID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   fit, _ := fitInDevices(node, n, annos, task, &amp;score.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   ctrfit = fit</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   if !fit {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    klog.InfoS("calcScore:node not fit pod", "pod", klog.KObj(task), "node", nodeID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    break</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  
}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if ctrfit {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   res.NodeList = append(res.NodeList, &amp;score)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   score.OverrideScore(node.Devices, userNodePolicy)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return &amp;res, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>This logic is mainly divided into two parts: iterating through the nodes to score them, and iterating through the Pod's containers to calculate the score for each container's corresponding device. Finally, all nodes that can accommodate the resource limits required by the Pod are returned.</p>
<h5 class="anchor anchorTargetStickyNavbar_Vzrq" id="calculating-node-scores">Calculating Node Scores<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#calculating-node-scores" class="hash-link" aria-label="Direct link to Calculating Node Scores" title="Direct link to Calculating Node Scores" translate="no">​</a></h5>
<p><code>pkg/scheduler/policy/node_policy.go:68</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (ns *NodeScore) ComputeDefaultScore(devices DeviceUsageList) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> used, usedCore, usedMem := int32(0), int32(0), int32(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, device := range devices.DeviceLists {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  used += device.Device.Used</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  usedCore += device.Device.Usedcores</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  usedMem += device.Device.Usedmem</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(2).Infof("node %s used %d, usedCore %d, usedMem %d,", ns.NodeID, used, usedCore, usedMem)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> total, totalCore, totalMem := int32(0), int32(0), int32(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, deviceLists := range devices.DeviceLists {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  total += deviceLists.Device.Count</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">  totalCore += deviceLists.Device.Totalcore</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  totalMem += deviceLists.Device.Totalmem</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> useScore := float32(used) / float32(total)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> coreScore := float32(usedCore) / float32(totalCore)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> memScore := float32(usedMem) / float32(totalMem)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> ns.Score = float32(Weight) * (useScore + coreScore + memScore)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(2).Infof("node %s computer default score is %f", ns.NodeID, ns.Score)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
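<p>The node scoring rule is relatively simple: it sums the used and total amounts of device count, cores, and memory across all devices on the node, adds the three used-to-total ratios, and multiplies the result by <code>Weight</code>.</p>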
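<p>As a worked example of the formula in <code>ComputeDefaultScore</code>, the sketch below recomputes it for a node with two devices. The <code>usage</code> type and <code>defaultNodeScore</code> function are hypothetical, and the weight of 10 is an assumed value chosen only to make the arithmetic easy to follow.</p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv">package main

import "fmt"

// usage holds used and total amounts for one device (hypothetical type).
type usage struct {
	Used, Count           int32
	UsedCores, TotalCores int32
	UsedMem, TotalMem     int32
}

// defaultNodeScore follows the shape of the formula above:
// weight * (used/total + usedCores/totalCores + usedMem/totalMem),
// where every term is summed over all devices on the node.
func defaultNodeScore(weight float32, devs []usage) float32 {
	var used, usedCores, usedMem int32
	var total, totalCores, totalMem int32
	for _, d := range devs {
		used += d.Used
		usedCores += d.UsedCores
		usedMem += d.UsedMem
		total += d.Count
		totalCores += d.TotalCores
		totalMem += d.TotalMem
	}
	useScore := float32(used) / float32(total)
	coreScore := float32(usedCores) / float32(totalCores)
	memScore := float32(usedMem) / float32(totalMem)
	return weight * (useScore + coreScore + memScore)
}

func main() {
	devs := []usage{
		// One half-used device and one idle device.
		{Used: 5, Count: 10, UsedCores: 50, TotalCores: 100, UsedMem: 12288, TotalMem: 24576},
		{Used: 0, Count: 10, UsedCores: 0, TotalCores: 100, UsedMem: 0, TotalMem: 24576},
	}
	// Each ratio is 0.25, so the score is 10 * 0.75 = 7.5.
	fmt.Println(defaultNodeScore(10, devs))
}
</code></pre></div></div>
<p>A busier node therefore gets a higher default score. Whether a higher score is preferred depends on the scheduling policy that later overrides or ranks these values.</p>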
<h5 class="anchor anchorTargetStickyNavbar_Vzrq" id="calculating-device-scores-for-each-container">Calculating Device Scores for Each Container<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#calculating-device-scores-for-each-container" class="hash-link" aria-label="Direct link to Calculating Device Scores for Each Container" title="Direct link to Calculating Device Scores for Each Container" translate="no">​</a></h5>
<p><code>pkg/scheduler/score.go:149</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func fitInDevices(node *NodeUsage, requests util.ContainerDeviceRequests, annos map[string]string, pod *corev1.Pod, devinput *util.PodDevices) (bool, float32) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //devmap := make(map[string]util.ContainerDevices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devs := util.ContainerDevices{}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> total, totalCore, totalMem := int32(0), int32(0), int32(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> free, freeCore, freeMem := int32(0), int32(0), int32(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> sums := 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // computer all device score for one node</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for index := range node.Devices.DeviceLists {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  node.Devices.DeviceLists[index].ComputeScore(requests)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //This loop is for requests for different devices</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, k := range requests 
{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  sums += int(k.Nums)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if int(k.Nums) &gt; len(node.Devices.DeviceLists) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.InfoS("request devices nums cannot exceed the total number of devices on the node.", "pod", klog.KObj(pod), "request devices nums", k.Nums, "node device nums", len(node.Devices.DeviceLists))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   return false, 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  sort.Sort(node.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  fit, tmpDevs := fitInCertainDevice(node, k, annos, pod, devinput)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if fit {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   for idx, val := range tmpDevs[k.Type] {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    for nidx, v := range node.Devices.DeviceLists {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     //bc node.Devices has been sorted, so we should find out the correct device</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     if v.Device.ID != val.UUID {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     total 
+= v.Device.Count</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     totalCore += v.Device.Totalcore</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     totalMem += v.Device.Totalmem</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     free += v.Device.Count - v.Device.Used</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     freeCore += v.Device.Totalcore - v.Device.Usedcores</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     freeMem += v.Device.Totalmem - v.Device.Usedmem</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     err := device.GetDevices()[k.Type].AddResourceUsage(node.Devices.DeviceLists[nidx].Device, &amp;tmpDevs[k.Type][idx])</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      klog.Errorf("AddResource failed:%s", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      return false, 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     klog.Infoln("After AddResourceUsage:", node.Devices.DeviceLists[nidx].Device)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   devs = append(devs, tmpDevs[k.Type]...)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  } else {</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">   return false, 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  (*devinput)[k.Type] = append((*devinput)[k.Type], devs)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return true, 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>The main logic is as follows:</p>
<ul>
<li class="">Score every device on the node against the container's requests. Then iterate over those requests, sort the devices, and find the devices that can accommodate each request's resource limits.</li>
</ul>
<p><code>pkg/scheduler/policy/gpu_policy.go:58</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (ds *DeviceListsScore) ComputeScore(requests util.ContainerDeviceRequests) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> request, core, mem := int32(0), int32(0), int32(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Here we are required to use the same type device</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, container := range requests {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  request += container.Nums</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  core += container.Coresreq</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if container.MemPercentagereq != 0 &amp;&amp; container.MemPercentagereq != 101 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   mem += ds.Device.Totalmem * (container.MemPercentagereq / 100.0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  mem += container.Memreq</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(2).Infof("device %s user 
%d, userCore %d, userMem %d,", ds.Device.ID, ds.Device.Used, ds.Device.Usedcores, ds.Device.Usedmem)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> usedScore := float32(request+ds.Device.Used) / float32(ds.Device.Count)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> coreScore := float32(core+ds.Device.Usedcores) / float32(ds.Device.Totalcore)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> memScore := float32(mem+ds.Device.Usedmem) / float32(ds.Device.Totalmem)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> ds.Score = float32(Weight) * (usedScore + coreScore + memScore)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(2).Infof("device %s computer score is %f", ds.Device.ID, ds.Score)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
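<p>The device scoring rule is similar to the one used for nodes: it adds the request to the device's current usage, sums the resulting usage ratios for task count, cores, and memory, and multiplies the sum by <code>Weight</code>.</p>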
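<p>To make the formula concrete, here is a minimal sketch of the same arithmetic. The <code>deviceUsage</code> and <code>deviceRequest</code> types and the <code>deviceScore</code> function are hypothetical stand-ins introduced for illustration; they are not HAMi's actual types.</p>

```go
package main

// deviceUsage is a hypothetical, trimmed-down stand-in for the device
// fields that ComputeScore reads.
type deviceUsage struct {
	Used, Count          int32 // tasks currently sharing the device / maximum tasks
	Usedcores, Totalcore int32 // allocated / total core percentage
	Usedmem, Totalmem    int32 // allocated / total device memory
}

// deviceRequest is a hypothetical stand-in for the summed container requests.
type deviceRequest struct {
	Nums, Coresreq, Memreq int32
}

// deviceScore adds the request to the current usage, sums the three usage
// ratios, and scales the sum by weight, following the shape of ComputeScore.
func deviceScore(d deviceUsage, r deviceRequest, weight float32) float32 {
	usedScore := float32(r.Nums+d.Used) / float32(d.Count)
	coreScore := float32(r.Coresreq+d.Usedcores) / float32(d.Totalcore)
	memScore := float32(r.Memreq+d.Usedmem) / float32(d.Totalmem)
	return weight * (usedScore + coreScore + memScore)
}
```

<p>For example, a device already at one of four task slots, 20 of 100 cores, and 4000 of 16000 memory units, receiving a request for one slot, 30 cores, and 4000 memory units, ends up at a ratio of 0.5 for each resource, so a weight of 10 gives a score of 15.</p>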
<p><code>pkg/scheduler/score.go:65</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func fitInCertainDevice(node *NodeUsage, request util.ContainerDeviceRequest, annos map[string]string, pod *corev1.Pod, allocated *util.PodDevices) (bool, map[string]util.ContainerDevices) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> k := request</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> originReq := k.Nums</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> prevnuma := -1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.InfoS("Allocating device for container request", "pod", klog.KObj(pod), "card request", k)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> var tmpDevs map[string]util.ContainerDevices</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> tmpDevs = make(map[string]util.ContainerDevices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for i := len(node.Devices.DeviceLists) - 1; i &gt;= 0; i-- {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.InfoS("scoring pod", "pod", klog.KObj(pod), "Memreq", k.Memreq, "MemPercentagereq", k.MemPercentagereq, "Coresreq", k.Coresreq, "Nums", k.Nums, "device index", i, "device", node.Devices.DeviceLists[i].Device.ID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  found, numa := checkType(annos, 
*node.Devices.DeviceLists[i].Device, k)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if !found {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.InfoS("card type mismatch,continuing...", "pod", klog.KObj(pod), (node.Devices.DeviceLists[i].Device).Type, k.Type)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if numa &amp;&amp; prevnuma != node.Devices.DeviceLists[i].Device.Numa {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.InfoS("Numa not fit, resotoreing", "pod", klog.KObj(pod), "k.nums", k.Nums, "numa", numa, "prevnuma", prevnuma, "device numa", node.Devices.DeviceLists[i].Device.Numa)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   k.Nums = originReq</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   prevnuma = node.Devices.DeviceLists[i].Device.Numa</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   tmpDevs = make(map[string]util.ContainerDevices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if !checkUUID(annos, *node.Devices.DeviceLists[i].Device, k) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.InfoS("card uuid mismatch,", "pod", klog.KObj(pod), "current device info is:", *node.Devices.DeviceLists[i].Device)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  memreq := int32(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if node.Devices.DeviceLists[i].Device.Count &lt;= node.Devices.DeviceLists[i].Device.Used {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if k.Coresreq &gt; 100 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.ErrorS(nil, "core limit can't exceed 100", "pod", klog.KObj(pod))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   k.Coresreq = 100</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   //return false, tmpDevs</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if k.Memreq &gt; 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   memreq = k.Memreq</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if k.MemPercentagereq != 101 &amp;&amp; k.Memreq == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   //This incurs an issue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   memreq = node.Devices.DeviceLists[i].Device.Totalmem * 
k.MemPercentagereq / 100</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if node.Devices.DeviceLists[i].Device.Totalmem-node.Devices.DeviceLists[i].Device.Usedmem &lt; memreq {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.V(5).InfoS("card Insufficient remaining memory", "pod", klog.KObj(pod), "device index", i, "device", node.Devices.DeviceLists[i].Device.ID, "device total memory", node.Devices.DeviceLists[i].Device.Totalmem, "device used memory", node.Devices.DeviceLists[i].Device.Usedmem, "request memory", memreq)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if node.Devices.DeviceLists[i].Device.Totalcore-node.Devices.DeviceLists[i].Device.Usedcores &lt; k.Coresreq {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.V(5).InfoS("card Insufficient remaining cores", "pod", klog.KObj(pod), "device index", i, "device", node.Devices.DeviceLists[i].Device.ID, "device total core", node.Devices.DeviceLists[i].Device.Totalcore, "device used core", node.Devices.DeviceLists[i].Device.Usedcores, "request cores", k.Coresreq)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // Coresreq=100 indicates it want this card exclusively</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if node.Devices.DeviceLists[i].Device.Totalcore == 
100 &amp;&amp; k.Coresreq == 100 &amp;&amp; node.Devices.DeviceLists[i].Device.Used &gt; 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.V(5).InfoS("the container wants exclusive access to an entire card, but the card is already in use", "pod", klog.KObj(pod), "device index", i, "device", node.Devices.DeviceLists[i].Device.ID, "used", node.Devices.DeviceLists[i].Device.Used)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // You can't allocate core=0 job to an already full GPU</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if node.Devices.DeviceLists[i].Device.Totalcore != 0 &amp;&amp; node.Devices.DeviceLists[i].Device.Usedcores == node.Devices.DeviceLists[i].Device.Totalcore &amp;&amp; k.Coresreq == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.V(5).InfoS("can't allocate core=0 job to an already full GPU", "pod", klog.KObj(pod), "device index", i, "device", node.Devices.DeviceLists[i].Device.ID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if !device.GetDevices()[k.Type].CustomFilterRule(allocated, request, tmpDevs[k.Type], node.Devices.DeviceLists[i].Device) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token 
plain">  if k.Nums &gt; 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.InfoS("first fitted", "pod", klog.KObj(pod), "device", node.Devices.DeviceLists[i].Device.ID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   k.Nums--</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   tmpDevs[k.Type] = append(tmpDevs[k.Type], util.ContainerDevice{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    Idx:       int(node.Devices.DeviceLists[i].Device.Index),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    UUID:      node.Devices.DeviceLists[i].Device.ID,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    Type:      k.Type,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    Usedmem:   memreq,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    Usedcores: k.Coresreq,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   })</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if k.Nums == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.InfoS("device allocate success", "pod", klog.KObj(pod), "allocate device", tmpDevs)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   return true, tmpDevs</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if node.Devices.DeviceLists[i].Device.Mode == "mig" {</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">   i++</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return false, tmpDevs</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
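<p>The device list is walked from last to first. For each device, the function checks the type, NUMA, and UUID constraints, and then checks whether the remaining task slots, memory, and cores can accommodate the container's request. Devices that pass are collected until the requested number (<code>k.Nums</code>) is reached, at which point the function returns them; if the list is exhausted first, it returns <code>false</code>.</p>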
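<p>The resource checks can be sketched in isolation as follows. This is a simplified illustration, not the real function: the <code>gpu</code> and <code>gpuRequest</code> types and the <code>fits</code> and <code>pickDevices</code> helpers are hypothetical, and the type, NUMA, UUID, MIG, and custom filter checks are left out.</p>

```go
package main

// gpu is a hypothetical, trimmed-down view of one device's usage.
type gpu struct {
	ID                   string
	Used, Count          int32
	Usedcores, Totalcore int32
	Usedmem, Totalmem    int32
}

// gpuRequest is a hypothetical stand-in for one container's request.
type gpuRequest struct {
	Nums, Coresreq, Memreq int32
}

// fits reports whether a single device has room for one share of the request.
func fits(d gpu, r gpuRequest) bool {
	if d.Count <= d.Used {
		return false // no free task slot
	}
	if d.Totalmem-d.Usedmem < r.Memreq {
		return false // not enough remaining memory
	}
	if d.Totalcore-d.Usedcores < r.Coresreq {
		return false // not enough remaining cores
	}
	if d.Totalcore == 100 && r.Coresreq == 100 && d.Used > 0 {
		return false // exclusive request, but the card is already shared
	}
	if d.Totalcore != 0 && d.Usedcores == d.Totalcore && r.Coresreq == 0 {
		return false // a core=0 job cannot land on a fully used card
	}
	return true
}

// pickDevices walks the list from last to first and stops as soon as
// r.Nums fitting devices have been collected.
func pickDevices(devs []gpu, r gpuRequest) ([]string, bool) {
	picked := []string{}
	remaining := r.Nums
	for i := len(devs) - 1; i >= 0 && remaining > 0; i-- {
		if fits(devs[i], r) {
			picked = append(picked, devs[i].ID)
			remaining--
		}
	}
	return picked, remaining == 0
}
```

<p>With three devices where only the middle one is nearly full, a request for two devices would skip the middle device and return the last and first ones, in that order.</p>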
<p><code>pkg/scheduler/scheduler.go:458</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain"> nodeScores, err := s.calcScore(nodeUsage, nums, annos, args.Pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  err := fmt.Errorf("calcScore failed %v for pod %v", err, args.Pod.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if len((*nodeScores).NodeList) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.V(4).Infof("All node scores do not meet for pod %v", args.Pod.Name)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, fmt.Errorf("no available node, all node scores do not meet"))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return &amp;extenderv1.ExtenderFilterResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   FailedNodes: failedNodes,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token 
plain">  }, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(4).Infoln("nodeScores_len=", len((*nodeScores).NodeList))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> sort.Sort(nodeScores)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> m := (*nodeScores).NodeList[len((*nodeScores).NodeList)-1]</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infof("schedule %v/%v to %v %v", args.Pod.Namespace, args.Pod.Name, m.NodeID, m.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annotations := make(map[string]string)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annotations[util.AssignedNodeAnnotations] = m.NodeID</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annotations[util.AssignedTimeAnnotations] = strconv.FormatInt(time.Now().Unix(), 10)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  for _, val := range device.GetDevices() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  val.PatchAnnotations(&amp;annotations, m.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //InRequestDevices := util.EncodePodDevices(util.InRequestDevices, m.devices)</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain"> //supportDevices := util.EncodePodDevices(util.SupportDevices, m.devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //maps.Copy(annotations, InRequestDevices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //maps.Copy(annotations, supportDevices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> s.addPod(args.Pod, m.NodeID, m.Devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> err = util.PatchPodAnnotations(args.Pod, annotations)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringFailed, []string{}, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.delPod(args.Pod)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> s.recordScheduleFilterResultEvent(args.Pod, EventReasonFilteringSucceed, []string{m.NodeID}, nil)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> res := extenderv1.ExtenderFilterResult{NodeNames: &amp;[]string{m.NodeID}}</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return &amp;res, nil</span><br></span></code></pre></div></div>
<p>After scoring, the nodes are sorted and the one with the highest score is selected. The scheduling result is then written to the Pod as annotations:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token key atrule">apiVersion</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> v1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">kind</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Pod</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">metadata</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">annotations</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">hami.io/vgpu-node</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> node1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">hami.io/vgpu-time</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> </span><span class="token string" style="color:rgb(195, 232, 141)">"1733988480"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span 
class="token key atrule">hami.io/vgpu-devices-allocated</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> GPU</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">7aebc545</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">cbd3</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">18a0</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">afce</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">76cae449702a</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain">NVIDIA</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token number" style="color:rgb(247, 140, 108)">20000</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain">80</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain">;</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">hami.io/vgpu-devices-to-allocate</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> ;</span><br></span></code></pre></div></div>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="binding-implementation">Binding Implementation<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#binding-implementation" class="hash-link" aria-label="Direct link to Binding Implementation" title="Direct link to Binding Implementation" translate="no">​</a></h4>
<p>The bind logic is straightforward: it simply binds the Pod to the Node.</p>
<p><code>pkg/scheduler/routes/route.go:82</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func Bind(s *scheduler.Scheduler) httprouter.Handle {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return func(w http.ResponseWriter, r *http.Request, ps httprouter.Params) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  var buf bytes.Buffer</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  body := io.TeeReader(r.Body, &amp;buf)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  var extenderBindingArgs extenderv1.ExtenderBindingArgs</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  var extenderBindingResult *extenderv1.ExtenderBindingResult</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err := json.NewDecoder(body).Decode(&amp;extenderBindingArgs); err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.ErrorS(err, "Decode extender binding args")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   extenderBindingResult = &amp;extenderv1.ExtenderBindingResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    Error: err.Error(),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">  } else {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   extenderBindingResult, err = s.Bind(extenderBindingArgs)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if response, err := json.Marshal(extenderBindingResult); err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.ErrorS(err, "Marshal binding result", "result", extenderBindingResult)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.Header().Set("Content-Type", "application/json")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.WriteHeader(http.StatusInternalServerError)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   errMsg := fmt.Sprintf("{'error':'%s'}", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.Write([]byte(errMsg))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  } else {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.V(5).InfoS("Return bind response", "result", extenderBindingResult)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.Header().Set("Content-Type", "application/json")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.WriteHeader(http.StatusOK)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   w.Write(response)</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
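<p>The handler follows a decode, delegate, encode pattern. Below is a minimal sketch of that pattern using only the standard library. The <code>bindArgs</code> and <code>bindResult</code> types, their JSON field names, and the <code>bindResponse</code> and <code>bindHandler</code> functions are assumptions made for illustration and do not reproduce the extender API types.</p>

```go
package main

import (
	"encoding/json"
	"io"
	"net/http"
)

// bindArgs and bindResult are hypothetical stand-ins for the extender's
// binding request and response types.
type bindArgs struct {
	PodName      string `json:"podName"`
	PodNamespace string `json:"podNamespace"`
	Node         string `json:"node"`
}

type bindResult struct {
	Error string `json:"error,omitempty"`
}

// bindResponse decodes the request body, delegates to bind, and encodes the
// result. Decode and bind failures are reported inside the result body.
func bindResponse(body []byte, bind func(bindArgs) error) (int, []byte) {
	var args bindArgs
	var res bindResult
	if err := json.Unmarshal(body, &args); err != nil {
		res.Error = err.Error()
	} else if err := bind(args); err != nil {
		res.Error = err.Error()
	}
	payload, err := json.Marshal(res)
	if err != nil {
		return http.StatusInternalServerError, []byte(`{"error":"marshal failed"}`)
	}
	return http.StatusOK, payload
}

// bindHandler adapts bindResponse to net/http.
func bindHandler(bind func(bindArgs) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		status, payload := bindResponse(body, bind)
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(status)
		w.Write(payload)
	}
}
```

<p>Keeping the decode and encode logic in a plain function makes it testable without an HTTP server, and building the error payload with <code>json.Marshal</code> or a fixed literal keeps the response valid JSON.</p>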
<p>The route handler delegates to the scheduler's <code>Bind</code> method:</p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (s *Scheduler) Bind(args extenderv1.ExtenderBindingArgs) (*extenderv1.ExtenderBindingResult, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.InfoS("Bind", "pod", args.PodName, "namespace", args.PodNamespace, "podUID", args.PodUID, "node", args.Node)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> var err error</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> var res *extenderv1.ExtenderBindingResult</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> binding := &amp;corev1.Binding{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  ObjectMeta: metav1.ObjectMeta{Name: args.PodName, UID: args.PodUID},</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  Target:     corev1.ObjectReference{Kind: "Node", Name: args.Node},</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> current, err := s.kubeClient.CoreV1().Pods(args.PodNamespace).Get(context.Background(), args.PodName, metav1.GetOptions{})</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.ErrorS(err, "Get pod failed")</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> node, err := s.kubeClient.CoreV1().Nodes().Get(context.Background(), args.Node, metav1.GetOptions{})</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.ErrorS(err, "Failed to get node", "node", args.Node)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleBindingResultEvent(current, EventReasonBindingFailed, []string{}, fmt.Errorf("failed to get node %v", args.Node))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  res = &amp;extenderv1.ExtenderBindingResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Error: err.Error(),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return res, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> tmppatch := make(map[string]string)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, val := range device.GetDevices() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  err = val.LockNode(node, current)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil 
{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   goto ReleaseNodeLocks</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> tmppatch[util.DeviceBindPhase] = "allocating"</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> tmppatch[util.BindTimeAnnotations] = strconv.FormatInt(time.Now().Unix(), 10)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> err = util.PatchPodAnnotations(current, tmppatch)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.ErrorS(err, "patch pod annotation failed")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err = s.kubeClient.CoreV1().Pods(args.PodNamespace).Bind(context.Background(), binding, metav1.CreateOptions{}); err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.ErrorS(err, "Failed to bind pod", "pod", args.PodName, "namespace", args.PodNamespace, "podUID", args.PodUID, "node", args.Node)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err == nil {</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">  s.recordScheduleBindingResultEvent(current, EventReasonBindingSucceed, []string{args.Node}, nil)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  res = &amp;extenderv1.ExtenderBindingResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Error: "",</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Infoln("After Binding Process")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return res, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">ReleaseNodeLocks:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.InfoS("bind failed", "err", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, val := range device.GetDevices() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  val.ReleaseNodeLock(node, current)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> s.recordScheduleBindingResultEvent(current, EventReasonBindingFailed, []string{}, err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return &amp;extenderv1.ExtenderBindingResult{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  Error: err.Error(),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }, nil</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="node-writes-device-information-to-node-annotation">Node Writes Device Information to Node Annotation<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#node-writes-device-information-to-node-annotation" class="hash-link" aria-label="Direct link to Node Writes Device Information to Node Annotation" title="Direct link to Node Writes Device Information to Node Annotation" translate="no">​</a></h3>
<p>The scheduler obtains a node's device information mainly by reading the node's annotations. Populating those annotations involves the following steps:</p>
<ul>
<li class="">Start the plugin</li>
</ul>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token key atrule">apiVersion</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> v1</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">kind</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Node</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token key atrule">metadata</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  </span><span class="token key atrule">annotations</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">hami.io/node-handshake</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> Requesting_2024.12.24 03</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token datetime number" style="color:rgb(247, 140, 108)">31:30</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">hami.io/node-handshake-dcu</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"> 
Deleted_2024.12.06 07</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token datetime number" style="color:rgb(247, 140, 108)">43:49</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token key atrule">hami.io/node-nvidia-register</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      "GPU</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">7aebc545</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">cbd3</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">18a0</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">afce</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">76cae449702a</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token number" style="color:rgb(247, 140, 108)">10</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token number" style="color:rgb(247, 140, 108)">73728</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token number" style="color:rgb(247, 140, 108)">300</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain">NVIDIA</span><span class="token punctuation" style="color:rgb(199, 146, 234)">-</span><span class="token plain">NVIDIA</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">      GeForce RTX 3090</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span 
class="token number" style="color:rgb(247, 140, 108)">0</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain">true</span><span class="token punctuation" style="color:rgb(199, 146, 234)">:</span><span class="token plain">"</span><br></span></code></pre></div></div>
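<p>To make the layout of the <code>hami.io/node-nvidia-register</code> value concrete, here is a minimal sketch that parses the sample above. It assumes each device entry ends with <code>:</code> and carries seven comma-separated fields. The struct and its field names (<code>Count</code>, <code>Devmem</code>, <code>Devcore</code>, and so on) are hypothetical labels inferred from the sample, not the definitions used in the HAMi source.</p>

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// DeviceInfo holds the seven comma-separated fields of one entry.
// The field names are assumptions inferred from the sample value.
type DeviceInfo struct {
	ID      string
	Count   int
	Devmem  int
	Devcore int
	Type    string
	Numa    int
	Health  bool
}

// parseNodeDevices splits the value on ":" (one entry per device) and
// each entry on "," (one field per position).
func parseNodeDevices(value string) ([]DeviceInfo, error) {
	var devices []DeviceInfo
	for _, entry := range strings.Split(value, ":") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue // the trailing ":" leaves an empty final element
		}
		f := strings.Split(entry, ",")
		if len(f) != 7 {
			return nil, fmt.Errorf("entry %q: want 7 fields, got %d", entry, len(f))
		}
		// Positions 1, 2, 3 and 5 are numeric in the sample.
		nums := make([]int, 0, 4)
		for _, i := range []int{1, 2, 3, 5} {
			n, err := strconv.Atoi(f[i])
			if err != nil {
				return nil, fmt.Errorf("entry %q: field %d: %w", entry, i, err)
			}
			nums = append(nums, n)
		}
		health, err := strconv.ParseBool(f[6])
		if err != nil {
			return nil, fmt.Errorf("entry %q: health: %w", entry, err)
		}
		devices = append(devices, DeviceInfo{
			ID: f[0], Count: nums[0], Devmem: nums[1], Devcore: nums[2],
			Type: f[4], Numa: nums[3], Health: health,
		})
	}
	return devices, nil
}

func main() {
	value := "GPU-7aebc545-cbd3-18a0-afce-76cae449702a,10,73728,300,NVIDIA-NVIDIA GeForce RTX 3090,0,true:"
	devices, err := parseNodeDevices(value)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, d := range devices {
		fmt.Printf("%s type=%q count=%d mem=%d core=%d numa=%d healthy=%t\n",
			d.ID, d.Type, d.Count, d.Devmem, d.Devcore, d.Numa, d.Health)
	}
}
```

<p>Because the entries are positional, a parser like this one has to reject any entry whose field count differs from what it expects rather than guess at the missing values.</p>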
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="starting-the-device-plugin-service">Starting the Device-Plugin Service<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#starting-the-device-plugin-service" class="hash-link" aria-label="Direct link to Starting the Device-Plugin Service" title="Direct link to Starting the Device-Plugin Service" translate="no">​</a></h4>
<p>The service is launched as a command-line application built with the <code>github.com/urfave/cli/v2</code> package. Note that the <code>-v</code> flag here prints the version; it does not set the log level.</p>
<p><code>cmd/device-plugin/nvidia/main.go:40</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func main() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> var configFile string</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> c := cli.NewApp()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> c.Name = "NVIDIA Device Plugin"</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> c.Usage = "NVIDIA device plugin for Kubernetes"</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> c.Version = info.GetVersionString()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> c.Action = func(ctx *cli.Context) error {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return start(ctx, c.Flags)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span></code></pre></div></div>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="starting-the-plugin">Starting the Plugin<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#starting-the-plugin" class="hash-link" aria-label="Direct link to Starting the Plugin" title="Direct link to Starting the Plugin" translate="no">​</a></h4>
<p>Each plugin implements the device-specific methods for one vendor's devices, and the plugin controller defines the lifecycle operations: start, restart, and exit.<br>
<!-- -->The key call here is <code>plugins, restartPlugins, err := startPlugins(c, flags, restarting)</code>.</p>
<p><code>cmd/device-plugin/nvidia/main.go:156</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func start(c *cli.Context, flags []cli.Flag) error {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Info("Starting FS watcher.")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> util.NodeName = os.Getenv(util.NodeNameEnvName)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> watcher, err := newFSWatcher(kubeletdevicepluginv1beta1.DevicePluginPath)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return fmt.Errorf("failed to create FS watcher: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> defer watcher.Close()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //device.InitDevices()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> /*Loading config files*/</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infof("Start working on node %s", util.NodeName)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Info("Starting OS watcher.")</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain"> sigs := newOSWatcher(syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> var restarting bool</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> var restartTimeout &lt;-chan time.Time</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> var plugins []plugin.Interface</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">restart:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // If we are restarting, stop plugins from previous run.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if restarting {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  err := stopPlugins(plugins)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   return fmt.Errorf("error stopping plugins from previous run: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Info("Starting Plugins.")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> plugins, restartPlugins, err := startPlugins(c, flags, restarting)</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return fmt.Errorf("error starting plugins: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if restartPlugins {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Info("Failed to start one or more plugins. Retrying in 30s...")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  restartTimeout = time.After(30 * time.Second)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> restarting = true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Start an infinite loop, waiting for several indicators to either log</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // some messages, trigger a restart of the plugins, or exit the program.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  select {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // If the restart timeout has expired, then restart the plugins</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">  case &lt;-restartTimeout:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   goto restart</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // Detect a kubelet restart by watching for a newly created</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // 'kubeletdevicepluginv1beta1.KubeletSocket' file. When this occurs, restart this loop,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // restarting all of the plugins in the process.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  case event := &lt;-watcher.Events:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   if event.Name == kubeletdevicepluginv1beta1.KubeletSocket &amp;&amp; event.Op&amp;fsnotify.Create == fsnotify.Create {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    klog.Infof("inotify: %s created, restarting.", kubeletdevicepluginv1beta1.KubeletSocket)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    goto restart</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // Watch for any other fs errors and log them.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  case err := &lt;-watcher.Errors:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   
klog.Errorf("inotify: %s", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // Watch for any signals from the OS. On SIGHUP, restart this loop,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // restarting all of the plugins in the process. On all other</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // signals, exit the loop and exit the program.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  case s := &lt;-sigs:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   switch s {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   case syscall.SIGHUP:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    klog.Info("Received SIGHUP, restarting.")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    goto restart</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   default:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    klog.Infof("Received signal \"%v\", shutting down.", s)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    goto exit</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">exit:</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token 
plain"> err = stopPlugins(plugins)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return fmt.Errorf("error stopping plugins: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p><code>cmd/device-plugin/nvidia/main.go:239</code></p>
<p><code>startPlugins</code> loads the configuration, builds the list of plugins, and starts each one with <code>p.Start()</code>:</p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func startPlugins(c *cli.Context, flags []cli.Flag, restarting bool) ([]plugin.Interface, bool, error) {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Load the configuration file</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Info("Loading configuration.")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> config, err := loadConfig(c, flags)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, false, fmt.Errorf("unable to load config: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> disableResourceRenamingInConfig(config)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> /*Loading config files*/</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> //fmt.Println("NodeName=", config.NodeName)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devConfig, err := generateDeviceConfigFromNvidia(config, c, flags)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err 
!= nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Errorf("failed to load config file %s", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, false, err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Update the configuration file with default resources.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Info("Updating config with default resource matching patterns.")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> err = rm.AddDefaultResourcesToConfig(&amp;devConfig)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, false, fmt.Errorf("unable to add default resources to config: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Print the config to the output.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> configJSON, err := json.MarshalIndent(devConfig, "", "  ")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, false, fmt.Errorf("failed to marshal config to JSON: %v", 
err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infof("\nRunning with config:\n%v", string(configJSON))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Get the set of plugins.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Info("Retrieving plugins.")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> pluginManager, err := NewPluginManager(&amp;devConfig)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, false, fmt.Errorf("error creating plugin manager: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> plugins, err := pluginManager.GetPlugins()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return nil, false, fmt.Errorf("error getting plugins: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Loop through all plugins, starting them if they have any devices</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // to serve. 
If even one plugin fails to start properly, try</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // starting them all again.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> started := 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for _, p := range plugins {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // Just continue if there are no devices to serve for plugin p.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if len(p.Devices()) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   continue</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  // Start the gRPC server for plugin p and connect it with the kubelet.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err := p.Start(); err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Error("Could not contact Kubelet. 
Did you enable the device plugin feature gate?")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Error("You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Error("You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   return plugins, true, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  started++</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if started == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Info("No devices found. Waiting indefinitely.")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return plugins, false, nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>Here, <code>p</code> (the plugin) must implement the following interface so that it can be managed:</p>
<p><code>pkg/device-plugin/nvidiadevice/nvinternal/plugin/api.go:37</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">type Interface interface {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> Devices() rm.Devices</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> Start() error</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> Stop() error</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
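<p>As a rough illustration of how the start loop consumes this interface, the sketch below uses a simplified stand-in. The <code>Plugin</code> interface, the <code>fakePlugin</code> type, and the <code>startAll</code> function are all hypothetical, and <code>Devices()</code> returns plain strings instead of <code>rm.Devices</code> so that the example needs only the standard library.</p>

```go
package main

import (
	"errors"
	"fmt"
)

// Plugin is a simplified stand-in for the Interface shown above.
// Returning []string from Devices is an assumption made for this sketch.
type Plugin interface {
	Devices() []string
	Start() error
	Stop() error
}

// fakePlugin is a hypothetical in-memory plugin used only for illustration.
type fakePlugin struct {
	name    string
	devices []string
	fail    bool
}

func (f fakePlugin) Devices() []string { return f.devices }

func (f fakePlugin) Start() error {
	if f.fail {
		return errors.New(f.name + ": could not contact kubelet")
	}
	return nil
}

func (f fakePlugin) Stop() error { return nil }

// startAll follows the same pattern as the loop in startPlugins: plugins
// without devices are skipped, and the first Start failure asks the
// caller to retry all plugins later.
func startAll(plugins []Plugin) (started int, retry bool) {
	for _, p := range plugins {
		if len(p.Devices()) == 0 {
			continue
		}
		if err := p.Start(); err != nil {
			return started, true
		}
		started++
	}
	return started, false
}

func main() {
	plugins := []Plugin{
		fakePlugin{name: "empty"},
		fakePlugin{name: "gpu", devices: []string{"GPU-0"}},
	}
	started, retry := startAll(plugins)
	fmt.Println("started:", started, "retry:", retry)
}
```

<p>Returning a retry flag instead of an error matches the original code's behavior, where a failed <code>p.Start()</code> schedules another attempt after 30 seconds rather than terminating the process.</p>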
<p>Additionally, for kubelet to recognize extended resources such as <code>nvidia.com/gpu: 1</code> in a Pod's resource requests, the plugin must start a gRPC service that implements the device plugin API and listens on a Unix socket under <code>/var/lib/kubelet/device-plugins/</code>.<br>
<!-- -->This is only loosely related to scheduling, so it is not covered in detail here.<br>
<!-- -->For more details, refer to <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/" target="_blank" rel="noopener noreferrer" class="">device-plugins</a>.</p>
<p><code>k8s.io/kubelet@v0.28.3/pkg/apis/deviceplugin/v1beta1/api.pb.go:1419</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">type DevicePluginServer interface {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // GetDevicePluginOptions returns options to be communicated with Device</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Manager</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> GetDevicePluginOptions(context.Context, *Empty) (*DevicePluginOptions, error)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // ListAndWatch returns a stream of List of Devices</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Whenever a Device state change or a Device disappears, ListAndWatch</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // returns the new list</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> ListAndWatch(*Empty, DevicePlugin_ListAndWatchServer) error</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // GetPreferredAllocation returns a preferred set of devices to allocate</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // from a list of available ones. 
The resulting preferred allocation is not</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // guaranteed to be the allocation ultimately performed by the</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // devicemanager. It is only designed to help the devicemanager make a more</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // informed allocation decision when possible.</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> GetPreferredAllocation(context.Context, *PreferredAllocationRequest) (*PreferredAllocationResponse, error)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Allocate is called during container creation so that the Device</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // Plugin can run device specific operations and instruct Kubelet</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // of the steps to make the Device available in the container</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> Allocate(context.Context, *AllocateRequest) (*AllocateResponse, error)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // PreStartContainer is called, if indicated by Device Plugin during registration phase,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // before each container start. 
Device plugin can run device specific operations</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> // such as resetting the device before making devices available to the container</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> PreStartContainer(context.Context, *PreStartContainerRequest) (*PreStartContainerResponse, error)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="implement-nvidia-plugins">Implement nvidia plugins<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#implement-nvidia-plugins" class="hash-link" aria-label="Direct link to Implement nvidia plugins" title="Direct link to Implement nvidia plugins" translate="no">​</a></h4>
<p>The key call here is <code>plugin.WatchAndRegister()</code>, which <code>Start()</code> launches in its own goroutine.</p>
<p><code>pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go:196</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (plugin *NvidiaDevicePlugin) Start() error {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> plugin.initialize()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> err := plugin.Serve()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Infof("Could not start device plugin for '%s': %s", plugin.rm.Resource(), err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  plugin.cleanup()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infof("Starting to serve '%s' on %s", plugin.rm.Resource(), plugin.socket)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> err = plugin.Register()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  
klog.Infof("Could not register device plugin: %s", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  plugin.Stop()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infof("Registered device plugin for '%s' with Kubelet", plugin.rm.Resource())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if plugin.operatingMode == "mig" {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  cmd := exec.Command("nvidia-mig-parted", "export")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  var stdout, stderr bytes.Buffer</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  cmd.Stdout = &amp;stdout</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  cmd.Stderr = &amp;stderr</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  err := cmd.Run()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Fatalf("nvidia-mig-parted failed with %s\n", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  outStr := stdout.Bytes()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  yaml.Unmarshal(outStr, 
&amp;plugin.migCurrent)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  os.WriteFile("/tmp/migconfig.yaml", outStr, os.ModePerm)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if len(plugin.migCurrent.MigConfigs["current"]) == 1 &amp;&amp; len(plugin.migCurrent.MigConfigs["current"][0].Devices) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   idx := 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   plugin.migCurrent.MigConfigs["current"][0].Devices = make([]int32, 0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   for idx &lt; GetDeviceNums() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    plugin.migCurrent.MigConfigs["current"][0].Devices = append(plugin.migCurrent.MigConfigs["current"][0].Devices, int32(idx))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    idx++</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Infoln("Mig export", plugin.migCurrent)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> go func() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  err := plugin.rm.CheckHealth(plugin.stop, plugin.health)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Infof("Failed to start 
health check: %v; continuing with health checks disabled", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> go func() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  plugin.WatchAndRegister()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return nil</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
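<p><code>WatchAndRegister()</code> is a polling loop: it collects the node's device information and writes it to the node's annotations, then sleeps for 30 seconds after a success or 5 seconds after a failure before trying again.</p>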
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (plugin *NvidiaDevicePlugin) WatchAndRegister() {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Info("Starting WatchAndRegister")</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> errorSleepInterval := time.Second * 5</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> successSleepInterval := time.Second * 30</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  err := plugin.RegisterInAnnotation()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Errorf("Failed to register annotation: %v", err)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Infof("Retrying in %v seconds...", errorSleepInterval)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   time.Sleep(errorSleepInterval)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  } else {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Infof("Successfully registered annotation. 
Next check in %v seconds...", successSleepInterval)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   time.Sleep(successSleepInterval)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (plugin *NvidiaDevicePlugin) RegisterInAnnotation() error {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devices := plugin.getAPIDevices()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.InfoS("start working on the devices", "devices", devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annos := make(map[string]string)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> node, err := util.GetNode(util.NodeName)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Errorln("get node error", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  return err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> encodeddevices := util.EncodeNodeDevices(*devices)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annos[nvidia.HandshakeAnnos] = "Reported " + time.Now().String()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> annos[nvidia.RegisterAnnos] = encodeddevices</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.Infof("patch node with 
the following annos %v", fmt.Sprintf("%v", annos))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> err = util.PatchNodeAnnotations(node, annos)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Errorln("patch node error", err.Error())</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return err</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>The data collection itself is implemented in <code>getAPIDevices()</code>:</p>
<p><code>pkg/device-plugin/nvidiadevice/nvinternal/plugin/register.go:110</code></p>
<div class="language-golang codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-golang codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">func (plugin *NvidiaDevicePlugin) getAPIDevices() *[]*util.DeviceInfo {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> devs := plugin.Devices()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> klog.V(5).InfoS("getAPIDevices", "devices", devs)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> nvml.Init()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> res := make([]*util.DeviceInfo, 0, len(devs))</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> for UUID := range devs {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  ndev, ret := nvml.DeviceGetHandleByUUID(UUID)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if ret != nvml.SUCCESS {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Errorln("nvml new device by index error uuid=", UUID, "err=", ret)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   panic(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  idx, ret := ndev.GetIndex()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if ret != nvml.SUCCESS {</span><br></span><span class="token-line" 
style="color:#bfc7d5"><span class="token plain">   klog.Errorln("nvml get index error ret=", ret)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   panic(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  memoryTotal := 0</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  memory, ret := ndev.GetMemoryInfo()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if ret == nvml.SUCCESS {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   memoryTotal = int(memory.Total)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  } else {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Error("nvml get memory error ret=", ret)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   panic(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  Model, ret := ndev.GetName()</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if ret != nvml.SUCCESS {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.Error("nvml get name error ret=", ret)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   panic(0)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  
registeredmem := int32(memoryTotal / 1024 / 1024)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if plugin.schedulerConfig.DeviceMemoryScaling != 1 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   registeredmem = int32(float64(registeredmem) * plugin.schedulerConfig.DeviceMemoryScaling)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Infoln("MemoryScaling=", plugin.schedulerConfig.DeviceMemoryScaling, "registeredmem=", registeredmem)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  health := true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  for _, val := range devs {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   if strings.Compare(val.ID, UUID) == 0 {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    // when NVIDIA-Tesla P4, the device info is : ID:GPU-e290caca-2f0c-9582-acab-67a142b61ffa,Health:Healthy,Topology:nil,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    // it is more reasonable to think of healthy as case-insensitive</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    if strings.EqualFold(val.Health, "healthy") {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     health = true</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    } else {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">     health = false</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    }</span><br></span><span 
class="token-line" style="color:#bfc7d5"><span class="token plain">    break</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  numa, err := plugin.getNumaInformation(idx)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  if err != nil {</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   klog.ErrorS(err, "failed to get numa information", "idx", idx)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  res = append(res, &amp;util.DeviceInfo{</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   ID:      UUID,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Index:   uint(idx),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Count:   int32(plugin.schedulerConfig.DeviceSplitCount),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Devmem:  registeredmem,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Devcore: int32(plugin.schedulerConfig.DeviceCoreScaling * 100),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Type:    fmt.Sprintf("%v-%v", "NVIDIA", Model),</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Numa:    numa,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">   Mode:    plugin.operatingMode,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token 
plain">   Health:  health,</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  })</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">  klog.Infof("nvml registered device id=%v, memory=%v, type=%v, numa=%v", idx, registeredmem, Model, numa)</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"> return &amp;res</span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>The device information is obtained from the NVIDIA driver through NVML. Note the <code>DeviceMemoryScaling</code> setting, which controls memory overcommit: the memory registered for each device (in MiB) is multiplied by this factor whenever it is not 1.<br>
<!-- -->The value of this setting comes from two sources: the scheduler configuration specified via the <code>--config-file</code> parameter when the service is started, and the <code>config/config.json</code> file in the code. The <code>config.json</code> file takes precedence over the <code>--config-file</code> parameter.</p>
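<p>To make the overcommit arithmetic in <code>getAPIDevices()</code> concrete, here is a minimal sketch of the same calculation. The function name <code>registeredMemMiB</code> and the sample values are assumptions for illustration; only the formula follows the code above.</p>

```go
package main

import "fmt"

// registeredMemMiB converts a device's total memory from bytes to MiB and
// applies the overcommit factor, following the arithmetic in getAPIDevices.
// The function name is hypothetical.
func registeredMemMiB(totalBytes uint64, scaling float64) int32 {
	mem := int32(totalBytes / 1024 / 1024)
	if scaling != 1 {
		mem = int32(float64(mem) * scaling)
	}
	return mem
}

func main() {
	const total = 16 * 1024 * 1024 * 1024 // a 16 GiB device, chosen as an example

	fmt.Println(registeredMemMiB(total, 1))   // prints 16384 (no overcommit)
	fmt.Println(registeredMemMiB(total, 1.5)) // prints 24576 (50% overcommit)
}
```

<p>With a scaling factor of 1.5, a 16 GiB card is advertised to the scheduler as 24576 MiB, so the requests placed on it can add up to more than the physical memory.</p>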
<p>At this point, everything the scheduler needs is in place, and the Pod can be assigned to an appropriate node.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references">References<a href="https://project-hami.io/blog/hami-gpu-scheduling-source-code#references" class="hash-link" aria-label="Direct link to References" title="Direct link to References" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://kubernetes.io/" target="_blank" rel="noopener noreferrer" class="">Kubernetes Official Website</a></li>
<li class=""><a href="https://www.qikqiak.com/post/custom-kube-scheduler/" target="_blank" rel="noopener noreferrer" class="">Custom Kubernetes Scheduler</a></li>
<li class=""><a href="https://www.lixueduan.com/posts/kubernetes/21-device-plugin/" target="_blank" rel="noopener noreferrer" class="">Custom Resource Support: From Principle to Implementation of K8s Device Plugin</a></li>
</ul>]]></content>
        <author>
            <name>Elrond Wang</name>
            <uri>https://github.com/elrondwong</uri>
        </author>
        <category label="Kubernetes" term="Kubernetes"/>
        <category label="GPU" term="GPU"/>
        <category label="AI" term="AI"/>
        <category label="Source Code" term="Source Code"/>
        <category label="Scheduling" term="Scheduling"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Introducing HAMi]]></title>
        <id>https://project-hami.io/blog/introducing-hami</id>
        <link href="https://project-hami.io/blog/introducing-hami"/>
        <updated>2024-12-18T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[An introduction to HAMi (Heterogeneous AI Computing Virtualization Middleware), a Kubernetes-native solution for managing heterogeneous AI computing devices with resource isolation and unified management.]]></summary>
        <content type="html"><![CDATA[<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-hami">What is HAMi?<a href="https://project-hami.io/blog/introducing-hami#what-is-hami" class="hash-link" aria-label="Direct link to What is HAMi?" title="Direct link to What is HAMi?" translate="no">​</a></h2>
<p>HAMi (Heterogeneous AI Computing Virtualization Middleware), formerly known as k8s-vGPU-scheduler, is an innovative solution designed to manage heterogeneous AI computing devices within Kubernetes clusters. This all-in-one middleware enables the sharing of various AI devices while ensuring resource isolation among different tasks. It provides a unified multiplexing interface across diverse device types, which improves the utilization of heterogeneous computing devices.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-choose-hami">Why Choose HAMi?<a href="https://project-hami.io/blog/introducing-hami#why-choose-hami" class="hash-link" aria-label="Direct link to Why Choose HAMi?" title="Direct link to Why Choose HAMi?" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="kubernetes-native-api-compatibility">Kubernetes Native API Compatibility<a href="https://project-hami.io/blog/introducing-hami#kubernetes-native-api-compatibility" class="hash-link" aria-label="Direct link to Kubernetes Native API Compatibility" title="Direct link to Kubernetes Native API Compatibility" translate="no">​</a></h3>
<p>One of the standout features of HAMi is its compatibility with Kubernetes' native API. This means that users can upgrade to HAMi without making any changes to their existing configurations, allowing for a seamless transition while maintaining the default behavior of Kubernetes.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="open-and-neutral">Open and Neutral<a href="https://project-hami.io/blog/introducing-hami#open-and-neutral" class="hash-link" aria-label="Direct link to Open and Neutral" title="Direct link to Open and Neutral" translate="no">​</a></h3>
<p>HAMi is a collaborative initiative involving stakeholders from various sectors, including internet services, finance, manufacturing, and cloud providers. The goal is to establish open governance under the Cloud Native Computing Foundation (CNCF), ensuring that HAMi remains neutral and accessible to all users.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="avoid-vendor-lock-in">Avoid Vendor Lock-in<a href="https://project-hami.io/blog/introducing-hami#avoid-vendor-lock-in" class="hash-link" aria-label="Direct link to Avoid Vendor Lock-in" title="Direct link to Avoid Vendor Lock-in" translate="no">​</a></h3>
<p>With HAMi, users can integrate with mainstream cloud providers without being tied to proprietary vendor orchestration. This flexibility allows organizations to choose their preferred cloud solutions while leveraging the capabilities of HAMi.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="resource-isolation">Resource Isolation<a href="https://project-hami.io/blog/introducing-hami#resource-isolation" class="hash-link" aria-label="Direct link to Resource Isolation" title="Direct link to Resource Isolation" translate="no">​</a></h3>
<p>HAMi provides robust resource isolation within containers. Each task running in a container is restricted to its allocated resources, preventing any task from exceeding its quota. This hard isolation enhances security and stability within the computing environment.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="support-for-a-variety-of-heterogeneous-computing-devices">Support for a Variety of Heterogeneous Computing Devices<a href="https://project-hami.io/blog/introducing-hami#support-for-a-variety-of-heterogeneous-computing-devices" class="hash-link" aria-label="Direct link to Support for a Variety of Heterogeneous Computing Devices" title="Direct link to Support for a Variety of Heterogeneous Computing Devices" translate="no">​</a></h3>
<p>HAMi excels in supporting a wide range of heterogeneous computing devices. Whether it's GPUs, MLUs, or NPUs from various manufacturers, HAMi facilitates device sharing and maximizes resource efficiency across different hardware platforms.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="unified-management">Unified Management<a href="https://project-hami.io/blog/introducing-hami#unified-management" class="hash-link" aria-label="Direct link to Unified Management" title="Direct link to Unified Management" translate="no">​</a></h3>
<p>To streamline operations, HAMi offers a unified monitoring system along with configurable scheduling policies such as bin packing and spreading. Managing monitoring and scheduling in one place simplifies the oversight of resources and enhances overall system performance.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion">Conclusion<a href="https://project-hami.io/blog/introducing-hami#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>HAMi represents a significant advancement in managing heterogeneous AI computing resources in Kubernetes environments. Its compatibility with existing systems, commitment to open governance, and robust resource management capabilities make it a valuable tool for organizations looking to optimize their AI computing infrastructure. Join us on this journey towards more efficient and flexible AI computing with HAMi!</p>]]></content>
        <author>
            <name>HAMi Community</name>
        </author>
        <category label="Introduction" term="Introduction"/>
        <category label="GPU Sharing" term="GPU Sharing"/>
        <category label="Kubernetes" term="Kubernetes"/>
    </entry>
</feed>