Rust async/await pitfalls: problems I hit on the tokio runtime

Published: May 13, 2025 · 16 min read

I called std::thread::sleep inside an async fn, stalled the runtime, and spent half a day debugging it.

문제 1: Blocking in Async Context

Bad code:

async fn process_audio(path: &str) -> Result<()> {
    let file = std::fs::read(path)?;  // ← Blocking I/O!

    // At this point the runtime worker thread is blocked,
    // so other async tasks scheduled on it make no progress

    let processed = decode(&file).await?;
    Ok(())
}

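Problem: std::fs::read is blocking. A blocking call inside an async fn stalls the worker thread it runs on, so every task scheduled on that thread stops making progress. On a current-thread runtime that means the whole runtime.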

Good code:

async fn process_audio(path: &str) -> Result<()> {
    let file = tokio::fs::read(path).await?;  // ← Non-blocking!

    let processed = decode(&file).await?;
    Ok(())
}

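// Or run the blocking work on tokio's blocking thread pool.
// (block_in_place runs the closure on the current worker thread, not a
// dedicated one, and panics on a current-thread runtime.)
async fn process_audio_alt(path: &str) -> Result<()> {
    let path = path.to_string();

    // First ? handles the JoinError, second ? the io::Error
    let file = tokio::task::spawn_blocking(move || {
        std::fs::read(&path)  // ← runs on a blocking-pool thread
    })
    .await??;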

    let processed = decode(&file).await?;
    Ok(())
}

문제 2: std::thread::sleep in async

A real case:

async fn connect_with_retry(server: &str) -> Result<()> {
    for attempt in 1..=3 {
        match connect(server).await {
            Ok(_) => return Ok(()),
            Err(_) if attempt < 3 => {
                std::thread::sleep(Duration::from_secs(1));  // ❌ Disaster!
            }
            Err(e) => return Err(e),
        }
    }
    Ok(())
}

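When this code runs:

- The first connection attempt fails
- std::thread::sleep(1 second) → the worker thread is blocked (on a current-thread runtime, that is the entire runtime)
- Every other async task on that thread also waits 1 second
- User: "the app froze"

Fix: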

async fn connect_with_retry(server: &str) -> Result<()> {
    for attempt in 1..=3 {
        match connect(server).await {
            Ok(_) => return Ok(()),
            Err(_) if attempt < 3 => {
                tokio::time::sleep(Duration::from_secs(1)).await;  // ✅ OK
            }
            Err(e) => return Err(e),
        }
    }
    Ok(())
}

tokio::time::sleep is non-blocking: it yields the thread back to the scheduler while it waits, so other tasks keep making progress.
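The retry loop above waits a fixed one second between attempts. A minimal sketch of a capped exponential delay that could be passed to tokio::time::sleep instead is shown here. The helper name backoff_delay and its parameters are hypothetical, not something from the original code, and it uses only std so the calculation can be checked on its own.

```rust
use std::time::Duration;

/// Hypothetical helper: delay before retry number `attempt` (1-based).
/// The delay doubles with each attempt and never exceeds `cap`.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    // attempt 1 -> factor 1, attempt 2 -> factor 2, attempt 3 -> factor 4, ...
    let exponent = attempt.saturating_sub(1);
    // Fall back to the largest factor if 2^exponent overflows u32.
    let factor = 2u32.checked_pow(exponent).unwrap_or(u32::MAX);
    // saturating_mul avoids a panic on overflow; min applies the cap.
    base.saturating_mul(factor).min(cap)
}
```

In the loop, the sleep line would then become something like tokio::time::sleep(backoff_delay(attempt, base, cap)).await, with base and cap chosen for the service in question.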

Problem 3: Memory blowup from overusing unbounded channels

Pair a producer that pushes events quickly with a slow consumer, and an unbounded_channel will eventually eat memory and cause an outage.

Recommended pattern:

use tokio::sync::mpsc;
use tokio::time::{timeout, Duration};

let (tx, mut rx) = mpsc::channel::<Job>(256); // bounded channel

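// producer
match timeout(Duration::from_millis(200), tx.send(job)).await {
    Ok(Ok(())) => {}
    Ok(Err(_closed)) => { /* receiver dropped: stop producing */ }
    // queue stayed full for 200 ms: choose a drop / retry policy
    // (job was moved into send(); clone it first if a retry needs it)
    Err(_elapsed) => {}
}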

With a bounded channel you get a signal as soon as load exceeds what the system can handle, and you can spell out the backpressure policy in code.
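To make that "signal" concrete, here is a small sketch that deliberately swaps in std's sync_channel for tokio's mpsc::channel so it runs without a runtime. tokio's bounded Sender has a try_send of the same shape. The function name offer_without_consumer is made up for illustration: it offers jobs to a queue that nobody drains and counts how many are rejected once the buffer is full.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Offers `jobs` items to a bounded queue that nobody drains.
/// Returns (accepted, rejected).
fn offer_without_consumer(capacity: usize, jobs: u32) -> (u32, u32) {
    // Bind the receiver to `_rx` so the channel stays open for the whole call.
    let (tx, _rx) = sync_channel::<u32>(capacity);
    let mut accepted = 0;
    let mut rejected = 0;
    for job in 0..jobs {
        match tx.try_send(job) {
            Ok(()) => accepted += 1,
            // The backpressure signal: the queue is full and the job is handed back,
            // so the caller can drop it, retry later, or slow down.
            Err(TrySendError::Full(_job)) => rejected += 1,
            // The receiver is gone; there is no point in producing more.
            Err(TrySendError::Disconnected(_job)) => break,
        }
    }
    (accepted, rejected)
}
```

An unbounded channel would have accepted every job here and simply kept growing, which is the failure mode described above.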

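Problem 4: Awaiting external I/O without a timeout

Network, DB, and external API calls must always have a timeout. Without one, a single request can wait indefinitely, and when traffic spikes such stuck requests eat into the workers.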

use anyhow::anyhow;
use tokio::time::{timeout, Duration};

let resp = timeout(Duration::from_secs(3), client.get(url).send()).await
    .map_err(|_| anyhow!("request timeout"))??;

In production, it is the worst-case latency, not the average, that causes outages. A timeout is a stability mechanism, not a performance optimization.
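One way to keep the worst case bounded when a timeout is combined with retries is to cap each attempt by whatever is left of an overall budget. The sketch below is only the arithmetic, in std-only Rust. attempt_timeout and the idea of an overall budget are assumptions added for illustration, not part of the original code.

```rust
use std::time::Duration;

/// Hypothetical helper: the timeout to use for the next attempt,
/// or None if the overall budget is already spent.
fn attempt_timeout(
    per_attempt: Duration,
    elapsed: Duration,
    budget: Duration,
) -> Option<Duration> {
    // checked_sub returns None when more time has elapsed than the budget allows.
    let remaining = budget.checked_sub(elapsed)?;
    if remaining.is_zero() {
        return None;
    }
    // Never wait longer than the time left in the budget.
    Some(per_attempt.min(remaining))
}
```

The returned Duration would be what gets passed to tokio::time::timeout for that attempt. With a 3-second per-attempt timeout and a 5-second budget, the second attempt gets at most 2 seconds, so the caller never waits much longer than the budget.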

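Operations checklist

- No std::fs or std::thread::sleep inside async functions
- Move CPU-heavy work to spawn_blocking or a dedicated worker
- Channels are bounded by default, with a size justified by throughput and latency numbers
- Design external I/O with timeout + retry + circuit-breaker policies together
- Handle the shutdown signal (ctrl_c) and verify the graceful shutdown path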
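For the last item, one shutdown path worth testing is "stop producing, let the worker drain the queue, then wait for it". Here is a rough sketch of that shape using std threads and sync_channel as a stand-in for tokio tasks and mpsc. In tokio the equivalent signal is Receiver::recv returning None once every sender is dropped. The function name run_and_drain and the "double each job" work are placeholders.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Sends every job to a worker, then shuts down by closing the channel
/// and waiting for the worker to finish what is still queued.
fn run_and_drain(jobs: Vec<u32>) -> Vec<u32> {
    let (tx, rx) = sync_channel::<u32>(4); // small bounded queue

    let worker = thread::spawn(move || {
        let mut done = Vec::new();
        // The loop ends only when all senders are dropped AND the queue is empty,
        // so no queued job is lost on shutdown.
        for job in rx {
            done.push(job * 2); // placeholder for real work
        }
        done
    });

    for job in jobs {
        // send blocks while the queue is full (backpressure on the producer).
        tx.send(job).expect("worker exited early");
    }

    drop(tx); // shutdown signal: no more jobs will arrive
    worker.join().expect("worker panicked")
}
```

A shutdown test can then assert that every job submitted before the signal shows up in the result, which is the property that tends to break first.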

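Conclusion

async/await is a big productivity boost, but ignore any of three rules, "no blocking / backpressure / timeout", and you get outages. Make these rules your defaults and the stability of a tokio-based service improves noticeably.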

ian.lab

I'm a working developer. I write up problems I ran into in the field and how I solved them. Please send error reports to my contact address.