Site Glossary
A single-page index of cloud native and AI terminology with fast lookup and grouping.
Terms are the doorway into complex systems, not the end of learning.
This page consolidates the site’s key concepts and unique terminology for fast browsing and cross-reference.
Agent-to-Agent, the collaboration and communication pattern between agents.
Agent-to-LLM, the interaction between agents and language models.
Agent-to-Tool, the ability of an agent to call external tools.
Acquisition-Activation-Retention-Revenue-Referral funnel model.
An entity or software component that perceives its environment and acts to achieve goals.
AI that can plan, act, and use tools autonomously.
The execution environment that supports agents.
A web of interacting agents.
Artificial General Intelligence, a type of artificial intelligence (AI) that matches or exceeds
An autonomous entity that perceives and acts to achieve goals.
A pruning method that removes unnecessary search paths.
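This pruning is commonly known as alpha-beta pruning. A minimal sketch, assuming a game tree encoded as nested Python lists whose leaves are numeric scores (the encoding and the name `alphabeta` are illustrative, not taken from this glossary):

```python
def alphabeta(node, maximizing=True, alpha=float("-inf"), beta=float("inf")):
    """Return the minimax value of `node`, skipping branches that cannot matter."""
    if not isinstance(node, list):
        return node  # leaf: a plain numeric score
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # the minimizing parent will never allow this branch
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:
            break  # the maximizing parent already has a better option
    return value
```

On the tree `[[3, 5], [2, 9], [0, 1]]` the leaves 9 and 1 are never evaluated, yet the result is the same as plain minimax.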
A sidecar-less service mesh mode in Istio that implements traffic management through node-level
Permissive license with patent grant and notice requirements.
A server that acts as an API front-end, receiving API requests, enforcing throttling and security
Delivery semantics in which a message may be delivered more than once, usually paired with idempotent consumers.
A mechanism that allows models to focus on important parts of input data, improving model
A technique for visualizing model attention distributions to understand model focus.
The process of verifying the identity of a user or system, typically via credentials and
The function of specifying access rights/privileges to resources.
A widely used algorithm for training feedforward neural networks.
Inference that works backward from goals to conditions.
The number of samples used in one training iteration, affecting training speed and model
A classifier that outputs one of two classes.
Post-incident review focused on system fixes, not blame.
A system in which a record of transactions made in bitcoin or another cryptocurrency is maintained
A zero-downtime deployment strategy using two environments, enabling rapid traffic switching.
A classic ranking function used to evaluate document relevance to queries.
Rate of error budget spend for alerting and release.
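As a rough illustration, burn rate is often computed as the observed error rate divided by the error rate the SLO allows. The sketch below assumes a request-based SLO expressed as a fraction (for example 0.999); the function names are hypothetical.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = error_budget(slo_target)
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return observed_error_rate / budget
```

A burn rate of 1 spends the budget exactly over the SLO window; a sustained value well above 1 is a common trigger for paging or pausing releases.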
Average cost to acquire a customer, used for channel efficiency.
A deployment strategy that gradually directs traffic to the new version, reducing risk and quickly
Huawei Ascend's heterogeneous computing architecture, providing neural network computing engines
Canonical URL to avoid duplicate content and split ranking signals.
Tradeoff: a distributed system cannot guarantee consistency, availability, and partition tolerance all at the same time.
Captures DB changes for downstream sync and streaming pipelines.
Content Delivery Network.
An organization that issues and manages digital certificates, responsible for verifying identities
A mechanism for limiting, accounting for, and isolating the resource usage of a process group.
A prompting technique that enables large language models to solve complex tasks by generating a
An engineering method that improves system resilience by actively injecting failures, helping to
A snapshot of the model training state, used for recovery after training interruption or model
A design approach that decomposes large chips into multiple smaller chips, achieving high-speed
Continuous Integration and Continuous Deployment/Delivery.
A design pattern used in software development to detect failures and encapsulate the logic of
Agreement clarifying contributor copyright permissions.
A command-line interface is a text-based user interface (UI) used to interact with computers.
Contrastive Language-Image Pre-training, a model that connects text and images.
Cumulative Layout Shift; measures visual stability.
An unsupervised method that groups similar data.
Cloud Native Computing Foundation, a sub-foundation of the Linux Foundation.
Container Network Interface, a project to write specifications and libraries for configuring
A class of deep neural networks, most commonly applied to analyzing visual imagery.
Cohort-based analysis of retention and behavior changes.
A token-level vector retrieval method that retains fine-grained matching information.
A Kubernetes resource for storing non-sensitive configuration data, separating configuration from
The set of applicable rules in a rule-based system.
Sharding/cache strategy that minimizes remapping on node changes.
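A minimal sketch of this strategy (consistent hashing), assuming a hash ring with virtual nodes; the class `HashRing`, the `vnodes` parameter, and the node names are hypothetical choices for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is randomized per process.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

When a node is added, only the keys that now fall just before one of its ring points change owner; all other keys stay where they were.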
A standard unit of software that packages up code and all its dependencies so the application runs
The information that surrounds a piece of text and helps to determine its meaning.
Optimizing the use of the context window.
The maximum number of tokens a model can process, determining the model's context understanding
A batching technique that dynamically merges requests to improve GPU utilization, also known as
License model requiring derivatives to remain open source.
The task of identifying reference relationships in text, such as pointing 'he' to a specific person.
A prompting technique that enables large language models to solve complex tasks by breaking them
Crawl quota that affects indexing and refresh rates.
Custom Resource Definition, a mechanism to extend the Kubernetes API.
Conflict-free replicated data types with convergent merges.
Container Runtime Interface, a plugin interface which enables kubelet to use a wide variety of
Checkpoint/Restore in Userspace, an open-source checkpoint/restore tool for Linux that can freeze running applications and save their state to disk, allowing later restoration from the checkpoint.
Container Storage Interface, a standard for exposing file and block storage systems to
NVIDIA's parallel computing platform and programming model that allows developers to use GPUs for
A field of artificial intelligence that trains computers to interpret and understand the visual
Core Web Vitals set measuring page experience.
Application development and deployment methods that fully leverage the advantages of cloud
The technology of packaging applications and their dependencies into containers.
Technology for automating the deployment, scaling, and connection of containers.
A development practice that keeps code in a state ready to be deployed to production at any time.
The practice of automatically deploying tested code changes to production.
A development practice of frequently integrating code changes into the main branch.
Gradually releasing a new version to a subset of users to verify its stability and performance.
NVIDIA's parallel computing platform and programming model that allows developers to use GPUs for
The process of managing system configurations, including creating, updating, and maintaining
A deployment strategy that gradually directs traffic to the new version, reducing risk and quickly
A programming pattern that chains multiple operations or function calls together.
A Kubernetes resource that ensures a Pod copy runs on each node, commonly used for system-level
A text-to-image generation system.
Architecture combining data lake and warehouse capabilities.
Sign-off asserting lawful origin of contributions.
Deep Computing Unit, a coprocessor product launched by Hygon, designed for high-performance computing scenarios such as scientific computing, AI inference, and training, compatible with the CUDA ecosystem.
Part of a broader family of machine learning methods based on artificial neural networks with
A neural network with many layers, like chained microservices.
A training approach combining neural networks and reinforcement learning.
The task of analyzing dependency relationships between words in sentences.
Configuration in Istio that defines policies for service destinations, implementing load balancing,
A plugin mechanism in Kubernetes for hardware device resource extension, supporting specialized
A set of practices that combines software development (Dev) and IT operations (Ops).
Deep Learning GPU Exchange Systems, integrated high-performance AI computing platforms launched by NVIDIA that connect multiple GPUs through NVLink, providing powerful computing power for deep learning.
A statistical method that protects individual privacy by adding noise.
A generative model that produces data by gradually denoising.
A technique that transfers knowledge from large models to small models, maintaining performance
A smaller model trained via distillation for faster, lighter inference.
A technology that tracks the propagation path of requests between microservices, used for
Optimizes models directly from preference data.
Dynamic Resource Allocation, a mechanism that assigns compute resources on demand to workloads.
A graph whose nodes or edges change over time to represent dynamic relationships.
A regularization technique that randomly drops some neurons during training, preventing overfitting.
Processes that run in the background and perform system-level tasks, usually starting automatically
The part of a neural network responsible for converting internal representations into output.
Search quality framework for experience, expertise, authority, trust.
A revolutionary technology with origins in the Linux kernel that can run sandboxed programs in a privileged
Pipeline that extracts, loads, then transforms in target system.
A model that converts text into numerical vectors.
The task of identifying named entities from text, such as person names, place names, etc.
One complete pass through the training dataset, a basic unit of model training.
Allowed SLO failure budget used for release decisions.
A strongly consistent, distributed key-value store that provides a reliable way to store data that
Pipeline that extracts, transforms, then loads data.
End User License Agreement, a legal contract between software vendor and end user that specifies the terms and conditions of software use.
Semantics where messages are processed exactly once.
A rule-based AI system.
A representation method that maps discrete data (such as words) to a continuous vector space.
The ability to automatically adjust resources based on load, including horizontal and vertical
The part of a neural network responsible for converting input into an internal representation.
A computing paradigm that performs computation at the network edge close to data sources, reducing
The property that enables a system to continue operating properly in the event of the failure of
Key properties of data used to train models.
A module that automatically selects key features.
A metric evaluating the contribution of each feature to model predictions.
A privacy-preserving technique that trains models on distributed devices without sharing raw data.
The ability to learn new tasks with only a few samples.
Additional training on top of a pre-trained model to adapt the model to specific tasks or domains.
A chunking strategy that splits documents by fixed size, simple but may break semantics.
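A minimal sketch of fixed-size chunking, assuming chunk size is measured in characters (real pipelines often count tokens instead); the function name is hypothetical.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]
```

For example, `fixed_size_chunks("The cat sat.", 5)` cuts through the words "cat" and "sat", which is the "may break semantics" cost noted in the definition.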
An efficient attention algorithm optimizing memory and speed.
Inference that derives conclusions from known facts.
Free and open source software emphasizing user freedom.
Half Precision Floating Point, a 16-bit floating-point format providing approximately 3-4 decimal digits of precision, reduces memory usage and accelerates computation, widely used in AI inference and some training scenarios.
Single Precision Floating Point, a 32-bit floating-point format providing approximately 6-9 decimal digits of precision, the standard numerical format for deep learning training.
Double Precision Floating Point, a 64-bit floating-point format providing approximately 15-17 decimal digits of precision, commonly used in scientific computing and high-precision numerical simulations.
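The precision differences between the three formats above can be observed with Python's standard `struct` module, which supports the IEEE 754 half, single, and double formats through the `e`, `f`, and `d` format characters. The helper names below are hypothetical.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Store `value` in the given IEEE 754 format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def rounding_errors(value: float) -> dict:
    """Absolute error introduced by storing `value` as FP16, FP32, and FP64."""
    formats = {"FP16": "<e", "FP32": "<f", "FP64": "<d"}
    return {name: abs(round_trip(value, fmt) - value) for name, fmt in formats.items()}
```

Calling `rounding_errors(0.1)` shows the error shrinking from FP16 to FP32, and vanishing for FP64 because Python floats are already double precision.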
A structured-object representation for knowledge.
The mechanism for LLMs to call external functions, enabling integration with external systems.
A cloud computing service model that allows running code without managing servers.
A class of machine learning frameworks designed by Ian Goodfellow and his colleagues in June 2014.
Next-gen Kubernetes gateway spec for L4/L7 ingress control.
An optimization method inspired by evolution using selection, crossover, and mutation.
A weight file format optimized for large language models.
An operational model that takes DevOps best practices used for application development, such as
Actual usable throughput achieved while meeting the SLO, a metric that better reflects the true
Randomized state dissemination for membership and config propagation.
Strong copyleft license requiring derivatives to be open source.
A Transformer-based pretrained language model that generates text via prompts.
A parallel computing workhorse often used for deep learning training.
NVIDIA technology that allows GPUs to directly access network or storage device data, bypassing the
Combining GPUDirect and RDMA technologies to achieve direct high-speed data transfer between GPUs.
The direction that guides parameter updates in optimization algorithms, representing the direction
Limits large gradients to prevent unstable training.
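One common variant clips by global norm: if the L2 norm of all gradients exceeds a threshold, every gradient is scaled down by the same factor. The sketch below uses a flat list of floats in place of real tensors; the function name is hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale `grads` so their L2 norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already within bounds, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]
```

Scaling all components by one factor preserves the gradient's direction and only shortens the step.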
An optimization algorithm used to minimize some function by iteratively moving in the direction of
An open-source visualization monitoring platform supporting multiple data sources and rich panel
A RAG technique combined with knowledge graphs, providing more structured contextual information.
Loop where output feeds the next growth cycle.
Constraint mechanisms that limit the output range of AI models, ensuring output meets expectations
A graphical user interface that allows users to interact with computers using graphical elements like windows, buttons, and menus.
Constraint mechanisms that limit the output range of AI models, ensuring output meets expectations
A phenomenon where a large language model perceives patterns or objects that are nonexistent or
High Bandwidth Memory, high-speed memory used in GPUs, providing higher bandwidth than traditional GDDR.
High Bandwidth Memory 2e, an enhanced version of second-generation high-bandwidth memory, providing higher bandwidth and capacity than HBM2, commonly used in high-performance GPUs.
The package manager for Kubernetes.
A package of templates and configuration used to define, install, and upgrade Kubernetes
Hyper GPU Exchange, a modular GPU platform launched by NVIDIA that provides standardized GPU integration solutions for server manufacturers, supporting large-scale AI computing cluster deployment.
Hierarchical Navigable Small World, an efficient vector indexing algorithm.
An encryption method that allows computation directly on encrypted data.
A mechanism that automatically adjusts the number of Pods based on load, achieving elastic scaling
Automatically scales pod replicas based on metrics.
Language/region tags to avoid incorrect search targeting.
A hub for sharing pretrained AI models and tools.
A retrieval strategy combining keyword search and semantic search.
Tuning hyperparameters like learning rate and batch size to improve training.
A method for periodically checking whether an application or service is running normally.
High-speed memory used in GPUs, providing higher bandwidth than traditional GDDR.
Property where repeated requests yield the same result.
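A common way to obtain this property for operations that are not naturally idempotent is an idempotency key supplied by the client. The sketch below is an in-memory illustration; `PaymentService` and its methods are hypothetical.

```python
class PaymentService:
    """Toy service whose deposits are idempotent per client-supplied key."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # idempotency key -> result of the first call

    def deposit(self, idempotency_key: str, amount: int) -> int:
        if idempotency_key in self._results:
            # Retry of a request already applied: return the stored result.
            return self._results[idempotency_key]
        self.balance += amount
        self._results[idempotency_key] = self.balance
        return self.balance
```

A client that retries `deposit("req-1", 50)` after a timeout gets the same answer and the balance changes only once.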
The task of generating text descriptions based on images.
A large labeled image dataset commonly used for vision models.
The process of using a trained machine learning model to make predictions.
Software or hardware accelerators specifically designed for model inference, optimizing inference
A high-performance computer network communication standard, providing high-bandwidth, low-latency
A Kubernetes API object that manages external access, providing HTTP and HTTPS routing rules.
A helper container that runs before the main container starts, used for initializing configuration
Parameter values before training starts, affecting speed and outcomes.
Interaction to Next Paint; measures responsiveness.
Labels indicating model training style or intended use.
8-bit Integer, an 8-bit integer format used to quantize models to reduce computational load and memory usage, commonly used for AI inference acceleration.
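A minimal sketch of symmetric per-tensor INT8 quantization, one of several schemes in use. It assumes plain Python floats in place of tensors and a single scale factor; the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the quantized integers."""
    return [q * scale for q in quantized]
```

Each recovered value differs from the original by at most about half the scale, which is the accuracy traded for the smaller format.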
A classification task that identifies user query intent.
Inter-Process Communication, a set of mechanisms that allow different processes to exchange data and synchronize information.
A method of managing and configuring infrastructure using code.
Cloud computing services that provide virtualized computing resources.
A high-performance computer network communication standard, providing high-bandwidth, low-latency
JSON Web Token is an open standard (RFC 7519) that defines a compact and self-contained way for
Common abbreviation for Kubernetes, derived from the 8 letters between 'K' and 's'.
A machine learning competition platform for practice and sharing.
The task of extracting key terms from text.
A technique that transfers knowledge from large models to small models.
A knowledge representation method that uses graph structures to represent entities and their
A technique that maps entities and relations in knowledge graphs to vector space.
The command line tool for communicating with a Kubernetes cluster's control plane.
Node agent managing pod lifecycle and container runtime.
A data structure used to store and retrieve key-value pairs.
Logical clocks for ordering events with causal consistency.
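A minimal sketch of a Lamport clock: each process keeps a counter, increments it on local events and sends, and on receive jumps past the timestamp carried by the message. The class and method names are illustrative.

```python
class LamportClock:
    """Per-process logical clock."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event: advance the clock by one."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending counts as an event; the returned value travels with the message."""
        return self.tick()

    def receive(self, message_time: int) -> int:
        """Move past both the local clock and the sender's timestamp."""
        self.time = max(self.time, message_time) + 1
        return self.time
```

If event A causally precedes event B, A's timestamp is smaller than B's; the converse does not hold, so timestamps alone cannot prove causality.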
A retrieval method that computes interactions between all token-level embeddings of a query and a document.
Largest Contentful Paint; measures main content load time.
A hyperparameter that controls the step size of model parameter updates, affecting training
Local Interpretable Model-agnostic Explanations, a method that explains individual predictions by fitting a simple surrogate model around each one.
A health check that detects whether a container is alive, restarting the container if it fails.
A lightweight LLM inference framework for CPU or consumer GPUs.
Large Language Model.
A device that acts as a reverse proxy and distributes network or application traffic across a
The process of distributing a set of tasks over a set of resources (computing units), with the aim
Low-Rank Adaptation, a parameter-efficient fine-tuning technique for large language models.
A compression technique that decomposes weight matrices into the product of two smaller matrices.
Protocol defining interaction between editors and language intelligence features.
Customer lifetime value used to assess payback potential.
Deep learning models with huge parameter scales, typically referring to language models with
A field of inquiry devoted to understanding and building methods that 'learn'.
The task of automatically translating text from one language to another.
A mathematical framework in reinforcement learning describing agent-environment interaction.
A reranking strategy that balances relevance and diversity.
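A minimal sketch of this reranking strategy (Maximal Marginal Relevance), assuming similarities are already computed: `query_sim[i]` is the relevance of document `i` to the query and `doc_sim[i][j]` is the similarity between documents `i` and `j`. The function name and the `lam` parameter are illustrative.

```python
def mmr(query_sim, doc_sim, k, lam=0.5):
    """Greedily pick up to k document indices, trading relevance for diversity."""
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        best, best_score = None, float("-inf")
        for i in candidates:
            # Redundancy: similarity to the closest document already chosen.
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            score = lam * query_sim[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` the result is a pure relevance ranking; lower values push near-duplicate documents further down.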
A protocol that standardizes how models exchange context with external tools and data sources,
A technique for filtering results through metadata in vector retrieval.
Numerical measurable data points used for monitoring and alerting.
An architectural style that structures an application as a collection of services.
Multi-Instance GPU, a technique that divides a single GPU into multiple instances.
A model architecture that processes inputs by activating a subset of expert networks, improving
Machine Learning Unit, an AI accelerator product series launched by Cambricon, optimized for deep learning inference and training tasks, supporting mainstream deep learning frameworks.
Apple's machine learning framework optimized for macOS and Apple Silicon.
A collection of techniques for reducing model size and computational overhead.
A model size of roughly 4 billion parameters.
A machine learning technique where a model is composed of multiple 'expert' networks, each
An optimization technique that accelerates training and reduces oscillation.
Multi-Process Service, a technique that allows multiple processes to share GPU resources.
Mean time between failures; measures stability.
Mutual Transport Layer Security, ensuring service authentication and secure data transmission
Mean time to recovery; measures repair speed.
A deployment architecture that involves multiple clusters.
A mechanism that executes multiple attention operations in parallel, capturing different feature
An architecture involving multiple service meshes.
Generating and retrieving vectors for different parts of a document (such as title, body)
Models or systems that process multiple data types (text, images, audio, etc.).
Moore Threads' Unified System Architecture, supporting general-purpose computing on their GPUs.
A security mechanism where both parties verify each other's identity, enhancing security.
A technique that removes unimportant connections or neurons in neural networks, reducing model size
The task of identifying and classifying named entities from text.
A virtual cluster in Kubernetes for resource isolation, enabling multi-tenancy and resource quota
A state in game theory where no player wants to unilaterally change strategy.
A multi-agent simulation platform for complex systems.
A technique for automatically searching for optimal neural network architectures.
A network or circuit of neurons, or in a modern sense, an artificial neural network, composed of
The smallest computation unit in a neural network, like a function call in code.
Cambricon's AI software stack, including development tools, runtime, and drivers.
A subfield of linguistics, computer science, and artificial intelligence concerned with the
Scales inputs to a common range to speed up convergence.
Core metric guiding long-term growth and alignment.
Neural Processing Unit, a specialized processor designed to accelerate neural network computations with optimized matrix operations and parallel computing capabilities, commonly used for AI inference and training tasks.
Non-Uniform Memory Access, a computer architecture where memory access speed depends on the memory
NVIDIA's data center GPU based on Ampere architecture, providing high-performance computing capabilities and large memory capacity, widely used for AI training and inference tasks.
NVIDIA's GPU architecture used for data center GPUs such as A100, A30, A40, and A6000, providing significant performance improvements and energy efficiency gains.
NVLink, a high-speed serial communication interface used to connect GPUs.
NVIDIA Management Library, a system library for monitoring and managing NVIDIA GPUs.
An open standard for access delegation, commonly used as a way for Internet users to grant websites
Open Container Initiative, an open governance structure for the express purpose of creating open
Original Equipment Manufacturer, a company that produces products for another company, typically sold under the purchasing company's brand.
The original implementation of the BM25 algorithm, widely used in information retrieval systems.
Analytical processing for reporting and data warehousing.
Transactional processing focused on low latency and consistency.
On-call rotation to ensure rapid incident response.
A neural architecture search method that trains once to adapt to multiple deployment scenarios.
A standardized dictionary of domain concepts and relations.
Out of Memory, an error that occurs when a program runs out of memory.
General policy engine using Rego for access/compliance rules.
An open standard for observability data collection, unifying the collection of traces, metrics, and
A controller for encapsulating and managing application operational knowledge in Kubernetes,
Orca, an optimizer for large-scale distributed training.
The automated configuration, coordination, and management of computer systems and software.
Organization maintaining the Open Source Definition and licenses.
A situation where the total allocated resources exceed the physically available resources, commonly
A phenomenon where a model performs well on the training set but has poor generalization ability,
Allocating more resources than currently needed in advance to meet burst demands or ensure high
The ability to understand the internal state of a system through its external outputs, including
Two-phase commit protocol for distributed transactions.
Under a network partition, choose between availability and consistency; otherwise, choose between latency and consistency.
A technique that improves the efficiency of attention mechanisms by using a paging mechanism.
The task of tagging the part of speech for each word in text.
The container responsible for sharing the network namespace in a Pod, also known as the sandbox
Classic consensus algorithm for agreement over unreliable networks.
Time required to recover acquisition cost.
A high-speed serial computer expansion bus standard.
Limits allowable pod disruptions to protect availability.
Parameter-efficient fine-tuning with fewer trainable weights.
An early neural network model, like a single-layer if-else classifier.
Open source license allowing proprietary redistribution.
Process ID, a numerical value used by operating systems to uniquely identify a process.
A high-level interface that wraps model workflows.
Product-led growth driven by self-serve product experience.
Degree of product-market fit and its signals.
A mechanism that controls the number of simultaneous Pod interruptions, guaranteeing minimum
A series of audio programs published online and typically consumed via subscription.
A model trained on large-scale data and reused for transfer learning.
The foundational phase of training a model on a large-scale dataset, learning general knowledge.
An open-source monitoring and alerting system that uses a pull model to collect time-series data.
The input provided to a model to generate a response.
The process of structuring text that can be interpreted and understood by a generative AI model.
Optimizes prompt vectors while keeping model weights frozen.
Prompt Operations.
A technique that removes unimportant parameters or neurons from the model.
A mainstream deep learning framework with flexible, user-friendly APIs.
A technique that adds positional information to each position in a sequence, enabling the model to
A method that only fine-tunes a small number of model parameters, dramatically reducing training
Cloud computing services that provide environments for application development and deployment.
Quality of Service, a metric used to describe the performance and reliability of a system.
A technique that reduces model precision (such as FP32 to INT8) to decrease computational load and
The step of analyzing query intent and semantics, improving retrieval accuracy.
The task of answering questions based on given context.
Read/write succeeds with a majority to ensure consistency.
Consensus algorithm for log replication and state machine consistency.
A method that retrieves external knowledge and combines it with generation to improve accuracy and
A strategy for limiting network traffic.
Role-Based Access Control, a permission management system that defines user permissions through
Remote Direct Memory Access, a direct memory access technique that bypasses the operating system
Reasoning + Acting, an agent framework that combines reasoning and action.
A health check that detects whether a container is ready to serve requests, removing it from the
A document chunking method that recursively splits at the paragraph, sentence, and word levels.
A self-reflection mechanism that enables agents to learn from failures.
A modeling method for predicting continuous values.
Techniques that prevent over-complex models and improve generalization.
A machine learning method that trains agents through trial and error to maximize rewards.
A Kubernetes controller that maintains a set of running Pod replicas, ensuring a specified number
A technique that performs secondary sorting on initial retrieval results to improve relevance.
A machine learning method based on rewarding desired behaviors and/or punishing undesired ones.
Alignment via reinforcement learning from AI feedback.
A technique that trains a reward model directly from human feedback and uses the model to optimize
A class of artificial neural networks where connections between nodes can create a cycle, allowing
Rules file controlling crawler access.
Strong and healthy; vigorous.
The ability of a system to maintain function and performance under disturbances or input variation.
AMD's open GPU computing platform, providing a CUDA-like development experience and supporting AMD
An update strategy that gradually replaces old version Pods, achieving zero-downtime deployment.
A knowledge graph embedding method that models relations as rotations in complex space.
A syndication format for subscribing to and aggregating website updates.
A technique combining information retrieval and generative models to improve the accuracy and
A connection method that skips certain layers, helping gradients propagate better through deep
Policies that limit resource usage in namespaces, including quotas for CPU, memory, storage, and
Policies or mechanisms that set upper bounds on resource usage.
A direct memory access technique that bypasses the operating system kernel, reducing network
A safe and efficient model weight file format.
Long transactions split into local steps with compensations.
A heatmap showing the importance of each part of the input image to the model output.
Bill of materials describing components for security/compliance.
Structured data markup enabling rich search results.
Specification-Driven Development.
Protocols where multiple parties jointly compute a function without revealing their inputs.
The core mechanism in Transformer that computes relationships between elements within a sequence.
A chunking strategy that splits documents based on semantic boundaries, maintaining semantic
A technique that labels image pixels by semantic region.
A web of data that machines can understand.
The task of identifying text sentiment polarity, such as positive, negative, neutral.
Search engine results page; optimized for visibility and clicks.
A cloud computing execution model in which the cloud provider runs the server, and dynamically
A mechanism for identifying microservice identities, used for authentication and authorization
A dedicated infrastructure layer for handling service-to-service communication.
An inference framework optimized for LLMs with programmatic prompting.
SHapley Additive exPlanations, a model interpretation method.
A design pattern where a helper container runs alongside the main application container in the same
Index file helping crawlers discover and update pages.
Stock Keeping Unit, a unique identifier used to track inventory, widely used in product pricing and management.
Service Level Agreement, a formal agreement between service providers and customers.
Service level indicator that quantifies performance.
A document chunking strategy that maintains overlap between adjacent chunks.
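A minimal sketch of sliding-window chunking, assuming sizes are measured in characters; the function name and the `overlap` parameter are hypothetical.

```python
def sliding_window_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Split text into chunks of `size` characters, each sharing `overlap` with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

The overlap means a sentence cut at one chunk boundary still appears whole in a neighbouring chunk, at the cost of storing some text twice.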
Service Level Objective, defining specific targets for service performance.
Streaming Multiprocessor, the basic compute unit of an NVIDIA GPU, which groups many cores that execute threads in parallel.
A self-executing contract with the terms of the agreement between buyer and seller being directly
Short for State-of-the-Art, referring to the best-performing model or method in a specific task or field.
Standard for software license and component identifiers.
Small model proposes tokens; large model verifies.
Service identity standard for workload authentication.
Standards for providing identities in dynamic environments, with SPIRE being the implementation of
SPIFFE runtime for issuing and rotating identities.
Cryptographic protocols designed to provide communications security over a computer network.
A text-to-image generation model based on diffusion models.
A Kubernetes workload resource used to manage stateful applications, providing stable identities
Server GX Module, a GPU form factor developed by NVIDIA for high-performance servers and computing platforms, providing higher bandwidth and power support than PCIe.
Applications that do not save any session state and can scale instances up or down at any time.
A service mesh architecture that does not require deploying proxies next to each application, such
Applications that need to maintain state data, such as databases, where each instance has a unique
The mechanism for automatically detecting and locating available service instances on a network.
A formal agreement between service providers and customers, defining service quality and
A search method based on semantic understanding rather than keyword matching.
A cloud computing service model that provides software applications over the internet.
A design pattern that deploys auxiliary functions alongside the main application, commonly used in
Thermal Design Power, the maximum amount of heat generated by a processor under normal operation, used to guide cooling system design.
Adoption model from innovators to laggards.
The technology for remotely collecting and transmitting data, used for system monitoring and
A graph structure where nodes and edges change over time.
A multidimensional array, the basic data structure for AI computing, used to represent data and
Specialized computing cores in NVIDIA GPUs used to accelerate matrix multiplication operations, significantly improving deep learning training and inference performance.
Data precision types for model weights, such as BF16.
A deep learning framework from Google suited for large-scale training.
NVIDIA's inference optimization engine for accelerating model deployment.
The task of assigning text to predefined categories.
The task of grouping similar texts.
The task of automatically generating text content.
The task of generating a brief summary from a long text.
The task of generating images based on text descriptions.
A model task type where both input and output are text.
Term Frequency-Inverse Document Frequency, a metric measuring the importance of terms in documents.
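As a rough illustration, the score can be computed as term frequency times the log of inverse document frequency. The sketch below assumes one common variant (relative term frequency and unsmoothed `log(N / df)`); libraries often apply smoothing or normalization, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return a term -> score mapping for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


docs = [["mesh", "proxy", "mesh"], ["proxy", "gpu"]]
scores = tf_idf(docs)
print(scores[0]["proxy"])  # 0.0, because "proxy" appears in every document
```

Terms that occur in every document score zero, while terms concentrated in a few documents score higher.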
TensorFloat-32, a numerical format introduced by NVIDIA in the Ampere architecture that combines the dynamic range of FP32 with the precision of FP16, used to accelerate AI training.
Trillion Floating Point Operations Per Second, a metric used to measure computational power.
A technique for GPU sharing through time-slicing, where different processes use the GPU in
Manual repetitive ops work SREs aim to reduce.
The basic unit of text processed by large language models, which can be words, subwords, or
Splits text into tokens, affecting context length and cost.
A tool that splits text into tokens.
The ability of agents to perform external operations, expanding the functional boundaries of AI.
Trillion Operations Per Second, a metric for measuring AI accelerator performance, indicating the
Time Per Output Token, the time interval per token during generation, a metric measuring generation
Tensor Processing Unit, a specialized hardware accelerator developed by Google for machine learning.
A simple knowledge graph embedding method that treats relations as translation vectors.
A deep learning model that adopts the mechanism of self-attention, differentially weighting the
A reasoning method that extends chain of thought into a tree structure, exploring multiple possible
A knowledge representation of subject-predicate-object.
Time To First Token, a metric measuring inference response speed, representing the time from
The technology of converting text to speech.
A text-based user interface that uses text and control characters to create graphical interface elements like windows and buttons.
A large-scale language model from Microsoft.
A time period allocated by the operating system to a process or task for taking turns using compute resources (such as CPU or GPU).
The process of adjusting model parameters using a dataset, enabling the model to learn patterns in
A phenomenon where a model fails to fully learn the features of the training data, usually caused
Unified Virtual Memory, an NVIDIA CUDA technology that allows host and device to share a unified virtual address space, simplifying memory management.
Vector timestamps for causality tracking and conflict detection.
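A minimal sketch of how such timestamps are compared, assuming each clock is a dictionary from node name to counter with missing nodes treated as zero. The helper names `happened_before`, `concurrent`, and `merge` are illustrative, not taken from any particular library.

```python
Clock = dict[str, int]


def happened_before(a: Clock, b: Clock) -> bool:
    """True if a causally precedes b: a <= b on every node and a < b on at least one."""
    nodes = a.keys() | b.keys()
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))


def concurrent(a: Clock, b: Clock) -> bool:
    """True if neither clock precedes the other, which signals a potential conflict."""
    nodes = a.keys() | b.keys()
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    return a_ahead and b_ahead


def merge(a: Clock, b: Clock) -> Clock:
    """Element-wise maximum, used when a node receives another node's clock."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


v1 = {"a": 1}
v2 = {"a": 1, "b": 1}  # node b saw v1, then wrote
v3 = {"a": 2}          # node a wrote again without seeing b's write

print(happened_before(v1, v2))  # True
print(concurrent(v2, v3))       # True
```

Concurrent clocks indicate writes that did not observe each other, which is the case an application has to resolve as a conflict.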
A database that indexes and stores vector embeddings for fast retrieval and similarity search.
A mechanism that automatically adjusts Pod resource requests, optimizing resource utilization.
GPU virtualization technology that divides a physical GPU into multiple virtual GPUs for use by
A coding style emphasizing atmosphere and flow.
Ratio of new users generated per existing user.
An Istio resource that defines traffic routing rules, implementing traffic management features such
A model that applies Transformer architecture to computer vision tasks.
Adjusts pod resource requests/limits to optimize usage.
The process of converting data into vector representations, used in machine learning and
The technology of creating virtual versions of computer system resources.
A binary instruction format that can run in browsers, providing near-native performance.
A proxy component in Istio Ambient mode that handles L7 traffic management and policy enforcement.
A method of augmenting or altering the behavior of a web page or web application with custom
Trainable parameters that determine model predictions.
A technique that shares the same parameters across different parts of the model, reducing
A regularization technique that adds weight norms to the loss function, preventing overfitting.
A digital certificate standard for service authentication, defining the format and distribution of
Baidu's AI chip architecture, specifically designed for deep learning training and inference.
A network security model that does not trust any user or device by default, requiring verification
The ability to complete new tasks without any task-specific examples.
A game scenario where one player's gain equals another player's loss.
The tunnel proxy in Istio Ambient mode, responsible for L4 traffic forwarding and mTLS encryption.
All terms on this page for consistent writing and translation.