LLT-7053: Raw DNS forwarder #1709
Conversation
Add component that forwards raw DNS queries to upstream resolvers over UDP socket
/// Errors returned when forwarding a DNS query
#[derive(Error, Debug)]
pub enum ForwardError {
    /// Failed upstream socket bind operation
    #[error("Failed socket bind operation: {0}")]
    SocketBind(#[from] io::Error),
    /// Failed to send a DNS query to the upstream resolver
    #[error("Failed to send DNS query: {0}")]
    Send(io::Error),
    /// No upstreams configured
    #[error("No upstream resolvers configured")]
    NoUpstreams,
    /// The upstream resolvers did not respond within the configured timeout
    #[error("DNS query timed out")]
    Timeout,
    /// The forwarder channel was closed
    #[error("Forwarder channel closed")]
    ChannelClosed,
    /// Too many concurrent requests
    #[error("Too many concurrent requests in flight")]
    TooManyRequests,
    /// The DNS packet is too short
    #[error("DNS packet too short")]
    PacketTooShort,
}
nit: SocketBind and Send read like operation names, while the rest of the variants (NoUpstreams, Timeout, ChannelClosed, TooManyRequests, PacketTooShort) describe an error state. Consider renaming them for consistency.
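For illustration, one shape the rename could take (a sketch only; the new variant names are suggestions, not from this PR):

use std::io;
use thiserror::Error;

/// Errors returned when forwarding a DNS query
#[derive(Error, Debug)]
pub enum ForwardError {
    /// The upstream socket could not be bound
    #[error("Failed socket bind operation: {0}")]
    SocketBindFailed(#[from] io::Error),
    /// The DNS query could not be sent to the upstream resolver
    #[error("Failed to send DNS query: {0}")]
    SendFailed(io::Error),
    // ...remaining variants unchanged: NoUpstreams, Timeout,
    // ChannelClosed, TooManyRequests, PacketTooShort
}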
let first_upstream = {
    let locked = upstreams.lock().await;
    locked.first().cloned()
};

let upstream_addr = match first_upstream {
    Some(addr) => addr,
    None => {
        send_channel_response!(msg.respond_to, Err(ForwardError::NoUpstreams));
        return;
    }
};
nit: Consider this:
let Some(upstream_addr) = upstreams.lock().await.first().cloned() else {
    send_channel_response!(msg.respond_to, Err(ForwardError::NoUpstreams));
    return;
};
Would this still hold the lock on upstreams while doing send_channel_response?
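One way to make the lock scope unambiguous regardless of how let-else handles the scrutinee's temporaries (a sketch, reusing the PR's names): bind the cloned value in its own statement, so the guard is dropped before the fallible branch runs.

let first_upstream = upstreams.lock().await.first().cloned();
// The MutexGuard is a temporary of the statement above, so it is
// dropped here, before any response can be sent.
let Some(upstream_addr) = first_upstream else {
    send_channel_response!(msg.respond_to, Err(ForwardError::NoUpstreams));
    return;
};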
}

/// Handle new query
async fn handle_new_query(
nit: Consider extracting the body into a Result-returning helper (e.g. prepare_query) and dispatching the result once at the end. The current shape repeats the match { Ok(v) => v, Err(e) => { send_channel_response!(...); return; } } pattern four times; collapsing them with ? would significantly shorten the function. It is readable as it is, though, so this is just a nitpick, and it is not certain the extracted version would read better.
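A minimal sketch of the pattern (QueryMsg, prepare_query, and the error type here are illustrative stand-ins, not the PR's actual signatures): every fallible step becomes a ? inside the helper, and the caller dispatches the error exactly once.

use tokio::sync::oneshot;

struct QueryMsg {
    respond_to: oneshot::Sender<Result<Vec<u8>, String>>,
    query: Vec<u8>,
}

async fn handle_new_query(msg: QueryMsg) {
    // Single dispatch point: every `?` inside the helper funnels here.
    if let Err(e) = prepare_query(&msg.query).await {
        let _ = msg.respond_to.send(Err(e));
    }
    // On Ok the query stays pending; the real response is sent later.
}

async fn prepare_query(query: &[u8]) -> Result<(), String> {
    let _id = query
        .first()
        .ok_or_else(|| "packet too short".to_string())?; // was: match + early return
    // ...each of the remaining formerly-matched steps collapses the same way.
    Ok(())
}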
let expired_ids: Vec<(u16, bool)> = pending
    .iter()
    .filter(|(_, entry)| entry.deadline <= now || entry.respond_to.is_closed())
    .map(|(&id, entry)| (id, entry.respond_to.is_closed()))
    .collect();
nit: is_closed() is called twice per entry — once in filter, once in map. A single-pass filter_map is possible:
let expired_ids: Vec<(u16, bool)> = pending
    .iter()
    .filter_map(|(&id, entry)| {
        let closed = entry.respond_to.is_closed();
        (entry.deadline <= now || closed).then_some((id, closed))
    })
    .collect();

let is_known_upstream = {
    let locked = upstreams.lock().await;
    locked.iter().any(|u| u.ip() == src.ip())
};
if !is_known_upstream {
    telio_log_warn!("Received DNS response from unknown source: {src}, ignoring");
    continue;
}
nit: this branch has no test
It's not so straightforward, since the check only looks at the IP address, and in unit tests all the injected responses would come from localhost 🤔 It's probably better tested in nat-lab.
Unless we change the check to include the source port as well, which would be stricter. But I don't know how often servers might use ephemeral ports for their responses (for example, for load balancing); this probably needs more investigation.
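For reference, the stricter variant under discussion is a one-line change against the snippet above (a sketch; whether exact-port matching is safe for load-balanced resolvers is exactly the open question):

let is_known_upstream = {
    let locked = upstreams.lock().await;
    // Compare the full SocketAddr (IP and port) instead of only the IP.
    locked.iter().any(|u| *u == src)
};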
fn allocate_id(pending: &HashMap<u16, PendingQuery>, next_id: &mut u16) -> Option<u16> {
    for _ in 0..DNS_ID_SPACE {
        let candidate = *next_id;
        *next_id = next_id.wrapping_add(1);
        if !pending.contains_key(&candidate) {
            return Some(candidate);
        }
    }
    None
}
allocate_id walks pending linearly and bails out after a full sweep — correctness depends on IDs being released back into the pool once a query is delivered or times out. That contract is not directly tested anywhere; a regression here (e.g. an entry forgotten in pending on some error path) would show up as a slow leak, not a unit test failure.
Easy smoke test: drive the forwarder through more queries than DNS_ID_SPACE and assert no TooManyRequests. The looping echo stub used by spawn_multi_stub is a good base — spawn_stub exits after one packet so it cannot serve enough queries here.
#[tokio::test]
async fn ids_are_released_after_response() {
    let (addr, _h) = spawn_multi_stub(DNS_ID_SPACE as usize + 1000).await;
    let forwarder = RawForwarder::new().await.unwrap();
    forwarder.set_upstreams(vec![addr]).await;
    for i in 0..(DNS_ID_SPACE as u32 + 1000) {
        let req = make_dns_packet(i as u16, b"x");
        forwarder.query(&req).await.expect("id pool exhausted — leak in `pending`");
    }
}

/// Handle expired queries
async fn handle_timeouts(
    socket: &UdpSocket,
    upstreams: &Arc<Mutex<Vec<SocketAddr>>>,
    timeout: &Arc<Mutex<Duration>>,
    pending: &mut HashMap<u16, PendingQuery>,
) {
    let now = Instant::now();
    let expired_ids: Vec<(u16, bool)> = pending
        .iter()
        .filter(|(_, entry)| entry.deadline <= now || entry.respond_to.is_closed())
        .map(|(&id, entry)| (id, entry.respond_to.is_closed()))
        .collect();

    if expired_ids.is_empty() {
        return;
    }

    let current_upstreams = {
        let locked = upstreams.lock().await;
        locked.clone()
    };

    for (internal_id, is_closed) in expired_ids {
        let mut entry = match pending.remove(&internal_id) {
            Some(e) => e,
            None => continue,
        };

        if is_closed {
            telio_log_warn!("Caller dropped for: {internal_id}");
            continue;
        }

        let next_index = entry.upstream_index + 1;
        match current_upstreams.get(next_index) {
            Some(&next_upstream) => {
                telio_log_debug!(
                    "Upstream timed out for request: {internal_id}, trying next: {next_upstream}"
                );
                entry.upstream_index = next_index;
                entry.deadline = Instant::now() + *timeout.lock().await;

                if let Err(e) = socket.send_to(&entry.query_bytes, next_upstream).await {
                    send_channel_response!(entry.respond_to, Err(ForwardError::Send(e)));
                    continue;
                }

                pending.insert(internal_id, entry);
            }
            None => {
                telio_log_warn!("All upstreams exhausted for request: {internal_id}");
                send_channel_response!(entry.respond_to, Err(ForwardError::Timeout));
            }
        }
    }
}
Bug: upstream_index is captured at submit time but resolved here against current_upstreams, which may have been replaced via set_upstreams while this query was pending. After a swap, entry.upstream_index + 1 can point at an unrelated resolver — possibly the same one we just timed out on.
Suggestion: store a snapshot of the upstream list (or the next SocketAddr directly) inside PendingQuery so retry behavior is independent of mutations to the shared list.
Regression test that fails on the current implementation with count_a == 2 (the retry hits the same blackhole that just timed out):
// Demonstrates that `PendingQuery::upstream_index` is unstable across
// `set_upstreams`: the index is captured at submit time but resolved later
// against the *current* upstream list, so a reorder/replace mid-flight makes
// the retry land on the wrong resolver — possibly the same one we just
// timed out on.
//
// Sequence:
// 1. upstreams = [addr_a]; submit one query → goes to addr_a (index 0).
// 2. mid-flight, swap upstreams to [echo_addr, addr_a].
// 3. timeout fires; retry uses next_index = upstream_index + 1 = 1.
// 4. current_upstreams[1] is addr_a → blackhole gets a SECOND packet.
//
// The unambiguous bug signal is `count_a == 2`. Whether the query should
// ultimately succeed via echo or fail with Timeout depends on the eventual
// fix design (snapshot at submit time vs. "use latest list, skip already-tried").
#[tokio::test]
async fn retry_targets_wrong_upstream_after_list_reorder() {
    use std::sync::atomic::{AtomicUsize, Ordering};

    // Counting blackhole: records every datagram but never sends a reply.
    let socket_a = UdpSocket::bind("127.0.0.1:0").await.unwrap();
    let addr_a = socket_a.local_addr().unwrap();
    let count_a = Arc::new(AtomicUsize::new(0));
    let count_a_task = count_a.clone();
    let _stub_a = tokio::spawn(async move {
        let mut buf = vec![0u8; 4096];
        loop {
            if socket_a.recv_from(&mut buf).await.is_ok() {
                count_a_task.fetch_add(1, Ordering::SeqCst);
            }
        }
    });

    let (echo_addr, _echo) = spawn_stub(StubBehavior::Echo).await;
    let forwarder = RawForwarder::new().await.unwrap();
    forwarder.set_upstreams(vec![addr_a]).await;
    forwarder.set_timeout(Duration::from_millis(100)).await;

    let f = forwarder.clone();
    let query_handle = tokio::spawn(async move {
        let request = make_dns_packet(TEST_PACKET_ID, TEST_DNS_PAYLOAD);
        f.query(&request).await
    });

    tokio::time::sleep(Duration::from_millis(20)).await;
    assert_eq!(count_a.load(Ordering::SeqCst), 1, "initial query did not reach addr_a yet");

    // Reorder upstreams BEFORE timeout fires:
    //   index 0 -> echo_addr
    //   index 1 -> addr_a   <-- this is what handle_timeouts will pick
    forwarder.set_upstreams(vec![echo_addr, addr_a]).await;

    let _result = query_handle.await.unwrap();
    assert_eq!(
        count_a.load(Ordering::SeqCst),
        1,
        "addr_a was retried after upstream list reorder — got {} packets, expected 1",
        count_a.load(Ordering::SeqCst)
    );
}

/// Extract the 16-bit transaction ID of a DNS packet
fn get_dns_id(packet: &[u8]) -> Result<u16, ForwardError> {
    if packet.len() < DNS_HEADER_OFFSET {
        return Err(ForwardError::PacketTooShort);
    }

    // This is ok because the size is checked above
    #[allow(clippy::indexing_slicing)]
    Ok(u16::from_be_bytes([packet[0], packet[1]]))
}

/// Overwrite the 16-bit transaction ID of a DNS packet
fn set_dns_id(packet: &mut [u8], id: u16) -> Result<(), ForwardError> {
    if packet.len() < DNS_HEADER_OFFSET {
        return Err(ForwardError::PacketTooShort);
    }
    let bytes = id.to_be_bytes();

    // This is ok because the size is checked above
    #[allow(clippy::indexing_slicing)]
    {
        packet[0] = bytes[0];
        packet[1] = bytes[1];
    }
    Ok(())
}
Larger refactor option (only worth it if more header inspection is on the horizon — get_flags, is_response, rcode, qdcount, etc.): split validation from access. A validate_dns_header gate validates once and returns a typed reference; getters/setters take the already-validated &[u8; 12], do not check length, do not return Result, do not need #[allow].
fn validate_dns_header(packet: &[u8]) -> Result<&[u8; DNS_HEADER_OFFSET], ForwardError> {
    packet.first_chunk().ok_or(ForwardError::PacketTooShort)
}

fn validate_dns_header_mut(
    packet: &mut [u8],
) -> Result<&mut [u8; DNS_HEADER_OFFSET], ForwardError> {
    packet.first_chunk_mut().ok_or(ForwardError::PacketTooShort)
}

fn get_dns_id(header: &[u8; DNS_HEADER_OFFSET]) -> u16 {
    u16::from_be_bytes([header[0], header[1]])
}

fn set_dns_id(header: &mut [u8; DNS_HEADER_OFFSET], id: u16) {
    let bytes = id.to_be_bytes();
    header[0] = bytes[0];
    header[1] = bytes[1];
}

Indexing [0]/[1] on &[u8; 12] is compile-time safe (the type guarantees the length), so clippy stays quiet without #[allow]. Call sites validate once on entry to handle_new_query / handle_response and then pass the typed header around.
Trade-off: two layers instead of one for the current ID-only use case. If ID is all you will ever read from the header, this is over-engineered. If more header fields show up later, this scales without duplicating length checks across every getter.
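A possible call-site shape under that refactor (rewrite_query_id is a hypothetical helper, not from this PR): validate once at the boundary, then use the typed header without further checks.

fn rewrite_query_id(packet: &mut [u8], new_id: u16) -> Result<u16, ForwardError> {
    // Validate once; everything past this line is infallible.
    let header = validate_dns_header_mut(packet)?;
    let old_id = get_dns_id(header);
    set_dns_id(header, new_id);
    Ok(old_id)
}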
Problem

Non-.nord DNS queries are forwarded through hickory-server's ForwardAuthority zone. We want to remove hickory dependencies, as hickory adds unnecessary overhead, requires additional maintenance, and was a source of bugs.

Solution

Add a RawForwarder that sends DNS queries directly to upstream resolvers as UDP packets. The forwarder will be integrated in a subsequent PR.
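Inferred from the unit tests quoted in the review above (the upstream address, timeout, and packet here are placeholders), the intended call pattern looks roughly like:

let forwarder = RawForwarder::new().await?;
forwarder.set_upstreams(vec!["203.0.113.53:53".parse()?]).await;
forwarder.set_timeout(Duration::from_secs(2)).await;
// `query` takes a raw DNS packet; per the tests it resolves once an
// upstream answers, or with a ForwardError on timeout/exhaustion.
let response = forwarder.query(&raw_dns_query).await?;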
☑️ Definition of Done checklist