Details the Q-Former architecture: a 12-layer BERT-based model using 32 learnable query embeddings. These queries use cross-attention to extract visual information for MLLM input.

Visual Prompt Generation: Cross-Attention in Q-Former

2025/11/20 00:00

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

Figure 7. Overview of QFormer

A. Detailed Architecture of QFormer

Figure 7 gives an overview of the architecture. QFormer is initialized as a BERT-based model [8] with a total of L = 12 layers. Unlike typical BERT models, which process textual inputs, QFormer takes R = 32 learnable query embeddings as inputs. During Stage-1 pretraining in BLIP2 [22], these embeddings are used to extract visual information from the input visual data. After projection, they serve as visual prompt embeddings for the LLM inputs.
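To make the role of the query embeddings concrete, the following is a minimal sketch of queries attending to visual embeddings, not the QFormer implementation. It assumes a single attention head with no learned projection matrices, a toy hidden size of 8, and random values standing in for learned queries and image features. The names `cross_attention`, `random_embeddings`, and `HIDDEN` are hypothetical. Only R = 32 comes from the text.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, visual):
    """Single-head scaled dot-product attention (simplified: no learned
    query/key/value projections). Each query attends over all visual
    embeddings and returns a weighted average of them."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * vi for qi, vi in zip(q, v)) / math.sqrt(dim)
            for v in visual
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, visual)) for j in range(dim)]
        )
    return outputs


def random_embeddings(n, dim, rng):
    """Random stand-ins for learned queries or visual features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


R = 32      # number of learnable query embeddings (from the text)
HIDDEN = 8  # toy hidden size, chosen only for illustration

rng = random.Random(0)
queries = random_embeddings(R, HIDDEN, rng)

# The number of visual embeddings can vary; the output always has R rows.
out_few = cross_attention(queries, random_embeddings(5, HIDDEN, rng))
out_many = cross_attention(queries, random_embeddings(50, HIDDEN, rng))
print(len(out_few), len(out_many))  # 32 32
```

The point of the sketch is the shape behavior: however many visual embeddings are supplied, the output has one row per query, so a fixed number of embeddings is passed on after projection.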

Inside the QFormer, each layer includes a self-attention module composed of a Multi-Head Attention component and a Forward module (consisting of Linear, LayerNorm, and Residual Connection). A cross-attention module, initialized with random values, is inserted every G layers; this is where the learnable query embeddings interact with the visual embeddings. In the main paper, for conciseness, we condensed the multi-head attention and forward modules into self-(cross-)attention modules. Furthermore, we illustrated only the modifications made to the cross-attention module in MIVPG, as the self-attention modules remain unchanged. The final QFormer output is the last layer's query embeddings.
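The layer layout described above can be sketched as a simple plan of which modules each layer contains. This is an illustration under assumptions, not the authors' code. The text does not state the value of G, so G = 2 is used purely as an example. The sketch also assumes 0-indexed layers, with cross-attention placed in layers whose index is divisible by G and applied after self-attention. The function name `qformer_layer_plan` is hypothetical.

```python
def qformer_layer_plan(num_layers=12, every_g=2):
    """Return, for each layer, the list of attention modules it contains.

    Assumptions (not stated in the text): layers are 0-indexed, and a
    cross-attention module sits in every layer whose index is divisible
    by `every_g`, after the self-attention module.
    """
    plan = []
    for index in range(num_layers):
        modules = ["self_attention"]  # present in every layer
        if index % every_g == 0:
            modules.append("cross_attention")  # queries meet visual embeddings
        plan.append(modules)
    return plan


plan = qformer_layer_plan(num_layers=12, every_g=2)
cross_layers = [i for i, modules in enumerate(plan) if "cross_attention" in modules]
print(cross_layers)  # [0, 2, 4, 6, 8, 10]
```

Under these assumptions, half of the 12 layers let the queries read from the visual embeddings, while the remaining layers only mix information among the queries through self-attention.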

For a more comprehensive understanding, readers are encouraged to refer to [22].


:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington (wxz9204@mavs.uta.edu);

(2) Wenyi Wu, Amazon (wenyiwu@amazon.com);

(3) Qi Li, Amazon (qlimz@amazon.com);

(4) Rob Barton, Amazon (rab@amazon.com);

(5) Boxin Du, Amazon (boxin@amazon.com);

(6) Shioulin Sam, Amazon (shioulin@amazon.com);

(7) Karim Bouyarmane, Amazon (bouykari@amazon.com);

(8) Ismail Tutar, Amazon (ismailt@amazon.com);

(9) Junzhou Huang, The University of Texas at Arlington (jzhuang@uta.edu).

:::


:::info This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

:::
