[BE] Automatically download resource files when deploying an attachment-type preprocessor (GenOS #9078) #176

Open
yspaik wants to merge 2 commits into develop from fix/175-preprocessor-resource-download

yspaik commented Apr 23, 2026

Overview

Adds automatic download of uploaded resource files from MinIO into /app/resource when a preprocessor Pod starts. Fixes GenOS QA issue genonai/GenOS#9078.

Related issues


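Background

Symptom (As-Is)

In preprocessor Pods deployed from the doc_parser Docker image (e.g. preprocessor-226-598d59564b-4r44g), the /app/resource directory is never created, so resource files (jpg, json, png, xlsx, py, html, etc.) uploaded through the admin page (/admin/resource/preprocessor/detail/{id}) are not synced into the Pod. As a result, preprocessor code cannot reference its attached resources.

Reproduction

  1. Create a preprocessor in GenPortal/Admin (using the default preprocessor code)
  2. Upload files under Preprocessor detail > Resource file list (they are stored in the MinIO preprocessor bucket)
  3. Deploy a Pod from the doc_parser image
  4. Inside the Pod, run ls -al ~ and confirm that resource/ is missing

Expected behavior (To-Be)

On Pod startup, automatically download the MinIO prefix preprocessor/{PREPROCESSOR_ID}/resource/ into /app/resource.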


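Approach

Port the pattern from container-services/workflow and mcp-server, which already solved the same problem in GenOS, to the doc_parser preprocessor.

Core mechanism

  1. MinIO prefetch: Minio.list_objects(bucket_name="preprocessor", prefix="{ID}/resource", recursive=True) → download each object with fget_object(), preserving the original directory structure.
  2. Multi-worker race prevention: a FileLock based on fcntl.flock(LOCK_EX | LOCK_NB) prevents concurrent download collisions under gunicorn workers=5. Only the first worker actually downloads; the others wait for the lock and then SKIP via os.path.exists().
  3. Idempotency: if a file already exists, log [SKIP i/N] already exists and move on → no duplicate downloads on restart or reschedule.
  4. Failure propagation: MinIO connection, authentication, and download failures are re-raised, so Pod startup fails and k8s retries.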
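
Steps 1 and 3 above can be sketched without a live MinIO server. The snippet below is only an illustration under stated assumptions: FakeClient and prefetch are hypothetical names introduced here, and the fake mirrors just the two method names (list_objects, fget_object) that the description mentions. It is not the PR's download_resource_files.

```python
import os
import tempfile
from types import SimpleNamespace


class FakeClient:
    """Hypothetical in-memory stand-in for a MinIO client (illustration only)."""

    def __init__(self, objects):
        self.objects = objects  # maps object key -> file content (bytes)
        self.downloads = 0      # counts how many files were actually fetched

    def list_objects(self, bucket_name, prefix=None, recursive=False):
        for name in sorted(self.objects):
            if name.startswith(prefix or ""):
                yield SimpleNamespace(object_name=name, is_dir=False)

    def fget_object(self, bucket_name, object_name, file_path):
        with open(file_path, "wb") as fh:
            fh.write(self.objects[object_name])
        self.downloads += 1


def prefetch(client, bucket, resource_id, dest_root):
    """Download every object under '<id>/resource/' unless it already exists."""
    prefix = f"{resource_id}/resource/"
    for obj in client.list_objects(bucket, prefix=prefix, recursive=True):
        rel_path = obj.object_name[len(prefix):]
        if obj.is_dir or not rel_path:
            continue
        target = os.path.join(dest_root, rel_path)
        if os.path.exists(target):
            continue  # idempotent: a second run does not download again
        os.makedirs(os.path.dirname(target), exist_ok=True)
        client.fget_object(bucket, obj.object_name, target)


root = tempfile.mkdtemp()
client = FakeClient({"226/resource/a.json": b"{}", "226/resource/img/b.png": b"png"})
prefetch(client, "preprocessor", "226", root)
prefetch(client, "preprocessor", "226", root)  # second run skips both files
```

Running prefetch twice against the same directory leaves the download counter unchanged after the first pass, which is the property the [SKIP i/N] log line reports in the real code.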

Changed files

| File | Type | Description |
| --- | --- | --- |
| src/common/settings.py | Modified | Optional import, Settings.PREPROCESSOR_ID: Optional[str] = _ID, MinioConfig class + minio_config singleton |
| src/main.py | Modified | Imports os, sys, settings, download_resource_files; calls the prefetch hook right before DocumentProcessor() initialization |
| src/util/__init__.py | New | Package declaration (empty file) |
| src/util/minio_resource.py | New | FileLock + download_resource_files(bucket_name, resource_id, path) (104 lines, identical to the workflow implementation) |
| tests/unit/test_minio_resource_unit.py | New | 9 unit tests |

Key changes

# main.py: right after @app.get('/healthcheck'), before DocumentProcessor initialization
download_resource_files(
    bucket_name='preprocessor',
    resource_id=settings.PREPROCESSOR_ID,
    path='/app/resource',
)

# common/settings.py
class Settings(BaseSettings):
    PREPROCESSOR_ID: Optional[str] = _ID  # maps to the PREPROCESSOR_ID env var
    ...

class MinioConfig(BaseSettings):
    MINIO_ENDPOINT: str
    MINIO_ACCESS_KEY: str
    MINIO_SECRET_KEY: str

minio_config = MinioConfig()

Prerequisites (already satisfied on the GenOS side)

This PR does not work on its own: GenOS orchestration and infrastructure already inject the following, which is why a change on the doc_parser side alone is sufficient.

  • PREPROCESSOR_ID env injection: admin-api injects it when creating the Pod (admin-api/src/service/system/preprocessor_service.py:652)
  • MinIO connection info injection: the preprocessor k8s deployment template wires in llmops-minio-client-configmap and llmops-minio-client-secret via envFrom (orchestrator/src/k8s-manifest-templates/preprocessor/preprocessor-deployment.yaml)
  • MinIO bucket and upload path: {id}/resource/** in the preprocessor bucket, matching the admin-api upload path
  • Dependency: minio==7.2.20 is already declared in pyproject.toml

Tests

Unit test results

$ pytest tests/unit/test_minio_resource_unit.py -v
9 passed in 1.47s

| # | Test | Coverage |
| --- | --- | --- |
| 1 | test_file_lock_acquires_and_writes_pid | FileLock is acquired normally and records the pid |
| 2 | test_file_lock_times_out_when_held_elsewhere | TimeoutError when the lock is already held elsewhere |
| 3 | test_file_lock_blocks_concurrent_acquire | Sequential acquisition across threads is guaranteed |
| 4 | test_download_resource_files_downloads_all_objects | Path correctness for multiple objects and subdirectories |
| 5 | test_download_resource_files_skips_existing | Existing files are skipped |
| 6 | test_download_resource_files_skips_directory_entries | Entries with is_dir=True are skipped |
| 7 | test_download_resource_files_empty_list | Empty list completes normally |
| 8 | test_download_resource_files_propagates_exception | MinIO exceptions are re-raised |
| 9 | test_download_resource_files_ignores_empty_relative_path | Entries with rel_path == "" are skipped |
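
Test 3 checks serialization across threads. A process-level variant, closer to how gunicorn workers contend for the lock, might look like the sketch below. It is an assumption-laden illustration, not part of the PR's test suite: hold_lock, run_workers, and overlaps are hypothetical helpers, it uses a blocking fcntl.flock rather than the PR's FileLock, and it forces the fork start method, so it is POSIX only.

```python
import fcntl
import multiprocessing as mp
import os
import tempfile
import time


def hold_lock(lock_path, log_path, hold_sec):
    """Worker: take an exclusive flock, record when it was held, release."""
    with open(lock_path, "a+") as fd:
        fcntl.flock(fd.fileno(), fcntl.LOCK_EX)  # blocks until the lock is free
        entered = time.monotonic()
        time.sleep(hold_sec)  # stands in for the download work
        exited = time.monotonic()
        with open(log_path, "a") as log:  # written while the lock is still held
            log.write(f"{entered} {exited}\n")
        fcntl.flock(fd.fileno(), fcntl.LOCK_UN)


def run_workers(n_workers=3, hold_sec=0.1):
    """Start n_workers processes and return their sorted (entered, exited) pairs."""
    tmp = tempfile.mkdtemp()
    lock_path = os.path.join(tmp, "demo.lock")
    log_path = os.path.join(tmp, "intervals.log")
    ctx = mp.get_context("fork")  # POSIX only, like fcntl itself
    procs = [ctx.Process(target=hold_lock, args=(lock_path, log_path, hold_sec))
             for _ in range(n_workers)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    with open(log_path) as log:
        return sorted(tuple(float(x) for x in line.split()) for line in log)


def overlaps(intervals):
    """True if any critical section started before the previous one ended."""
    return any(nxt[0] < prev[1] for prev, nxt in zip(intervals, intervals[1:]))
```

The check relies on the monotonic clock being comparable across processes on the same host. If the lock serializes the workers, overlaps(run_workers()) is False.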

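Integration/manual test checklist

  • Build the image and push it to the registry
  • Redeploy the preprocessor in the QA environment (192.168.76.180:40908/admin/resource/preprocessor/detail/226)
  • Open a K9s shell and run ls -al /app/resource to confirm that all uploaded files are present
  • Check /var/log/supervisor/gunicorn_stdout.log for the Acquired lock / Downloading [i/N] / Completed! logs
  • With gunicorn workers=5, at least two workers print [SKIP …] already exists
  • Call the /run endpoint: no regression in existing preprocessing
  • A preprocessor with no uploaded resources (0 files) starts normally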

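Risks and rollback

Risks

  • Pod startup fails when MinIO is unreachable: intended behavior. The k8s startupProbe waits on healthcheck, so the Pod stays in CrashLoopBackOff until MinIO recovers. This is safer than silently starting without resources.
  • Lock timeout (timeout_sec=3600): with many files and a slow MinIO, a worker can fail to start if the download exceeds one hour. There is ample headroom for actual QA data (a few MB to a few hundred MB).
  • Disk usage: /app/resource is container ephemeral storage. Files are re-downloaded when the Pod restarts (although files that still exist are skipped by the os.path.exists check, leaving only the cost of the restart itself).

Rollback

  • Reverting this PR is enough to roll back (code-only change; no schema or external system changes).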

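Review request points

@claude Please review in detail from the following angles:

  1. FileLock implementation safety

    • fcntl.flock is released automatically when the process crashes, but is the ordering of close() and flock(LOCK_UN) in __exit__ correct?
    • The busy-wait nature of the LOCK_NB + polling loop, and whether poll_interval=0.2s is appropriate
  2. main.py import order and side effects

    • download_resource_files() is called at module level. The FastAPI app has already been created at that point, so healthcheck can respond. However, startupProbe may fail while the gunicorn workers are still booting. Is this covered by the current startupProbe failureThreshold=120, periodSeconds=5 (10-minute limit)?
    • sys.path.append(.../util) is effectively unnecessary because from util.minio_resource import ... already works. Can it be removed?
  3. Settings PREPROCESSOR_ID field name

    • The original issue guide used ID: Optional[str] = _ID, but because main.py references settings.PREPROCESSOR_ID, the field name was unified to PREPROCESSOR_ID, matching the env variable. Is this choice appropriate?
  4. Test coverage

    • An autouse fixture replaces the MinIO/MQ env vars. Does this simulate the real deployment environment well enough?
    • FileLock concurrency is tested with threading. Should a multiprocessing scenario (actual gunicorn) be added as well?
  5. Logging and operability

    • Are the log level and messages on failure sufficient for the QA team to trace issues?
    • Verbosity of the progress logs based on len(objects)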
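
The startupProbe question in point 2 comes down to simple arithmetic. The sketch below uses only the numbers quoted in this description and assumes no initialDelaySeconds; fits_in_probe_budget is a hypothetical helper added for illustration.

```python
# Values quoted in the PR description.
FAILURE_THRESHOLD = 120   # startupProbe failureThreshold
PERIOD_SECONDS = 5        # startupProbe periodSeconds
LOCK_TIMEOUT_SEC = 3600   # FileLock timeout_sec in download_resource_files

# Assuming initialDelaySeconds is 0, the probe tolerates roughly this long
# before the container is considered to have failed startup.
probe_budget_sec = FAILURE_THRESHOLD * PERIOD_SECONDS


def fits_in_probe_budget(download_sec: float) -> bool:
    """True if a download of this duration finishes within the probe budget."""
    return download_sec <= probe_budget_sec


print(probe_budget_sec)                        # 600 seconds, i.e. 10 minutes
print(fits_in_probe_budget(LOCK_TIMEOUT_SEC))  # False
```

Under these assumptions the probe budget (600 s) is shorter than the lock timeout (3600 s), so if the download blocks healthcheck responses, the probe would give up well before the lock does.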

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

Summary by CodeRabbit

  • New Features

    • Added MinIO integration for resource management, enabling automatic resource downloading on application startup.
    • Introduced configurable MinIO connection settings.
  • Tests

    • Added comprehensive test coverage for resource downloading and file locking mechanisms.

- src/common/settings.py: add the PREPROCESSOR_ID field and MinioConfig
- src/util/minio_resource.py: new MinIO download + fcntl FileLock
- src/main.py: insert the /app/resource prefetch hook before DocumentProcessor initialization
- tests/unit/test_minio_resource_unit.py: add 9 unit tests for FileLock and download_resource_files

Refs: https://github.com/genonai/GenOS/issues/9078

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Apr 23, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 917f2aad-79dd-4d43-b857-4086a157ead4

📥 Commits

Reviewing files that changed from the base of the PR and between 692a863 and 15b543c.

📒 Files selected for processing (1)
  • genon/preprocessor/src/main.py
Walkthrough

This PR adds resource file management from MinIO storage. It introduces PREPROCESSOR_ID to application settings, creates a MinioConfig class for MinIO connection details, implements a FileLock context manager for concurrent file access safety, adds a download_resource_files function to fetch preprocessor resources from MinIO, and initializes resource download at application startup.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration Settings: genon/preprocessor/src/common/settings.py | Added PREPROCESSOR_ID optional setting and new MinioConfig class with required MinIO connection fields (MINIO_ENDPOINT, MINIO_ACCESS_KEY, MINIO_SECRET_KEY). Instantiated as module-level minio_config object. |
| Application Initialization: genon/preprocessor/src/main.py | Added imports for OS/system utilities and configuration; integrated minio_resource for resource downloading. Calls download_resource_files at module import time to populate /app/resource directory using PREPROCESSOR_ID. |
| MinIO Resource Utility: genon/preprocessor/src/util/minio_resource.py | Introduced FileLock context manager for exclusive filesystem locks with timeout/polling, and download_resource_files function that lists objects in MinIO bucket and downloads missing files while respecting existing files and directory structures. |
| Unit Tests: genon/preprocessor/tests/unit/test_minio_resource_unit.py | Comprehensive test suite validating FileLock acquisition/timeout/serialization behavior and download_resource_files MinIO interactions, directory creation, file skipping logic, and error handling. |

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant FileLock as FileLock<br/>(Concurrent Safety)
    participant MinIO as MinIO Client
    participant FS as Filesystem
    
    App->>FileLock: Acquire lock (timeout: 600s)
    FileLock->>FS: Create/lock file with PID
    FileLock-->>App: Lock acquired
    
    App->>MinIO: Create client from credentials
    App->>MinIO: list_objects(bucket, prefix)
    MinIO-->>App: Object list
    
    loop For each object
        App->>FS: Check if file exists
        alt File missing
            App->>MinIO: fget_object(object)
            MinIO->>FS: Download file
            FS-->>MinIO: Ack
        else File exists
            App->>FS: Skip (already exists)
        end
    end
    
    App->>FileLock: Release lock
    FileLock->>FS: Unlock and close file
    FileLock-->>App: Lock released

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Poem

🐰 A rabbit hops through MinIO's store,
With locks in place to forego contention's roar,
Resource files flow like carrots in a stream,
Settings configured, a preprocessor's dream! 🥕✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 15.79% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title is written in Korean and directly describes the main feature: automatic resource file download when deploying the attached preprocessor, which aligns with the core objective of the changeset (fetching preprocessor resources from MinIO on Pod startup). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a resource downloading mechanism from Minio, including new configuration settings, a file locking utility to manage concurrent access, and the core download logic. It also adds comprehensive unit tests for these components. The review feedback highlights a potential issue with null preprocessor IDs, suggests removing redundant system path updates, and recommends correcting type hint inconsistencies.

Comment thread genon/preprocessor/src/main.py Outdated
Comment thread genon/preprocessor/src/main.py
self._fd = None


def download_resource_files(bucket_name: str, resource_id: int, path: str):
medium

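The resource_id parameter is type-hinted as int, but the value passed in from settings.py and the actual environment variable is a string (str). This does not cause a runtime error, but changing the hint to str is appropriate for static analysis and readability.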

Suggested change
-def download_resource_files(bucket_name: str, resource_id: int, path: str):
+def download_resource_files(bucket_name: str, resource_id: str, path: str):

@yspaik yspaik added the bug Something isn't working label Apr 23, 2026
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: 백영상 <yspaik@gmail.com>
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@genon/preprocessor/src/util/minio_resource.py`:
- Around line 20-38: The __enter__ method opens self._fd before the lock loop
but may exit on timeout or unexpected OSError without closing it; modify
__enter__ (in the Minio lock/context class containing __enter__, self._fd,
lock_path, timeout_sec, poll_interval) to ensure the file descriptor is closed
on all failure paths by wrapping the acquisition loop in a try/except/finally or
by closing self._fd in each failure branch, and when re-raising unexpected
OSErrors or on timeout use exception chaining (raise ... from e) to satisfy Ruff
B904; keep the successful-lock behavior so __exit__ still handles the normal
close.
- Around line 50-80: In download_resource_files, validate resource_id at the top
(raise/return if None or not a non-empty integer/string) and normalize it (no
slashes); build the MinIO prefix as a directory prefix (e.g.,
f"{resource_id}/resource/") so you only match that directory (avoid matching
sibling keys like "226/resource_backup"); when iterating objects compute
rel_path by removing that exact prefix, reject keys that produce empty rel_path
or contain path traversal segments (e.g., ".." or leading "/" or "\"), then
compute destination_file and after joining check the realpath of
destination_file starts with the realpath of path (fail if not) before creating
parent dirs or writing files to ensure no writes escape the intended directory.
- Around line 58-63: The Minio client is hardcoded with secure=False in the
Minio(...) call inside minio_resource.py; make TLS configurable by adding a
MINIO_SECURE boolean setting to your settings module (e.g.,
genon.preprocessor.src.common.settings) and replace the literal secure=False
with that setting (e.g., secure=MINIO_SECURE) when constructing Minio in the
function or class that instantiates it so deployments can enable TLS via
environment/configuration.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9ba3df8d-77d3-4809-b8d8-86b08761b1f6

📥 Commits

Reviewing files that changed from the base of the PR and between f37af7d and 692a863.

📒 Files selected for processing (5)
  • genon/preprocessor/src/common/settings.py
  • genon/preprocessor/src/main.py
  • genon/preprocessor/src/util/__init__.py
  • genon/preprocessor/src/util/minio_resource.py
  • genon/preprocessor/tests/unit/test_minio_resource_unit.py

Comment on lines +20 to +38
    def __enter__(self):
        os.makedirs(os.path.dirname(self.lock_path), exist_ok=True)
        self._fd = open(self.lock_path, "a+")

        start = time.time()
        while True:
            try:
                fcntl.flock(self._fd.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
                self._fd.seek(0)
                self._fd.truncate()
                self._fd.write(f"pid={os.getpid()} acquired_at={time.time()}\n")
                self._fd.flush()
                return self
            except OSError as e:
                if e.errno not in (errno.EACCES, errno.EAGAIN):
                    raise
                if (time.time() - start) >= self.timeout_sec:
                    raise TimeoutError(f"Timed out acquiring lock: {self.lock_path}")
                time.sleep(self.poll_interval)
⚠️ Potential issue | 🟠 Major

Close the lock file when acquisition fails.

If flock times out or raises an unexpected OSError, __enter__ exits before __exit__ runs, leaving the opened descriptor around. Wrap the acquisition loop so failure paths close _fd; this also satisfies the Ruff B904 exception-chaining warning.

🔒 Proposed fix
     def __enter__(self):
-        os.makedirs(os.path.dirname(self.lock_path), exist_ok=True)
+        lock_dir = os.path.dirname(self.lock_path)
+        if lock_dir:
+            os.makedirs(lock_dir, exist_ok=True)
         self._fd = open(self.lock_path, "a+")
 
         start = time.time()
-        while True:
-            try:
-                fcntl.flock(self._fd.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
-                self._fd.seek(0)
-                self._fd.truncate()
-                self._fd.write(f"pid={os.getpid()} acquired_at={time.time()}\n")
-                self._fd.flush()
-                return self
-            except OSError as e:
-                if e.errno not in (errno.EACCES, errno.EAGAIN):
-                    raise
-                if (time.time() - start) >= self.timeout_sec:
-                    raise TimeoutError(f"Timed out acquiring lock: {self.lock_path}")
-                time.sleep(self.poll_interval)
+        try:
+            while True:
+                try:
+                    fcntl.flock(self._fd.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
+                    self._fd.seek(0)
+                    self._fd.truncate()
+                    self._fd.write(f"pid={os.getpid()} acquired_at={time.time()}\n")
+                    self._fd.flush()
+                    return self
+                except OSError as e:
+                    if e.errno not in (errno.EACCES, errno.EAGAIN):
+                        raise
+                    if (time.time() - start) >= self.timeout_sec:
+                        raise TimeoutError(f"Timed out acquiring lock: {self.lock_path}") from e
+                    time.sleep(self.poll_interval)
+        except Exception:
+            if self._fd:
+                self._fd.close()
+                self._fd = None
+            raise
🧰 Tools
🪛 Ruff (0.15.10)

[warning] 37-37: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)
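
To see the effect of the proposed fix in isolation, here is a condensed, runnable variant of such a lock. It is a sketch, not the file under review: the pid bookkeeping and directory creation are left out, the default timeout and poll interval are shortened for demonstration, and the usage at the bottom relies on the fact that flock treats two separately opened descriptors of the same file as independent, even inside one process (POSIX only).

```python
import errno
import fcntl
import os
import tempfile
import time


class FileLock:
    """Condensed illustration of an flock-based lock that cleans up on failure."""

    def __init__(self, lock_path, timeout_sec=5.0, poll_interval=0.05):
        self.lock_path = lock_path
        self.timeout_sec = timeout_sec
        self.poll_interval = poll_interval
        self._fd = None

    def __enter__(self):
        self._fd = open(self.lock_path, "a+")
        start = time.monotonic()
        try:
            while True:
                try:
                    fcntl.flock(self._fd.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
                    return self
                except OSError as e:
                    if e.errno not in (errno.EACCES, errno.EAGAIN):
                        raise
                    if time.monotonic() - start >= self.timeout_sec:
                        raise TimeoutError(
                            f"Timed out acquiring lock: {self.lock_path}"
                        ) from e
                    time.sleep(self.poll_interval)
        except BaseException:
            # __exit__ never runs when __enter__ raises, so close here.
            self._fd.close()
            self._fd = None
            raise

    def __exit__(self, exc_type, exc, tb):
        if self._fd is not None:
            fcntl.flock(self._fd.fileno(), fcntl.LOCK_UN)
            self._fd.close()
            self._fd = None
        return False


lock_path = os.path.join(tempfile.mkdtemp(), "demo.lock")
timed_out = False
with FileLock(lock_path):
    contender = FileLock(lock_path, timeout_sec=0.2)
    try:
        contender.__enter__()  # the lock is held above, so this must time out
    except TimeoutError:
        timed_out = True
```

After the timeout, contender._fd is None, meaning the descriptor opened in __enter__ was closed rather than left dangling.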

Comment on lines +50 to +80
def download_resource_files(bucket_name: str, resource_id: int, path: str):
    os.makedirs(path, exist_ok=True)

    lock_file = os.path.join(path, ".download_resource_files.lock")

    with FileLock(lock_file, timeout_sec=3600):
        logger.info(f'Acquired lock: {lock_file} (pid={os.getpid()})')

        minio_client = Minio(
            endpoint=minio_config.MINIO_ENDPOINT,
            access_key=minio_config.MINIO_ACCESS_KEY,
            secret_key=minio_config.MINIO_SECRET_KEY,
            secure=False
        )

        prefix = f"{resource_id}/resource"
        objects = list(minio_client.list_objects(bucket_name, prefix=prefix, recursive=True))

        try:
            logger.info(f'Downloading {len(objects)} resource files for {bucket_name} {resource_id}')

            for i, obj in enumerate(objects):
                if obj.is_dir:
                    continue

                rel_path = obj.object_name[len(prefix):].lstrip("/\\")
                if not rel_path:
                    continue

                destination_file = os.path.join(path, rel_path)
                os.makedirs(os.path.dirname(destination_file), exist_ok=True)
⚠️ Potential issue | 🔴 Critical

Harden resource ID and object-key path handling.

resource_id=None currently becomes None/resource, the prefix also matches sibling keys like 226/resource_backup/..., and ../ inside an object key can write outside /app/resource. Fail fast on missing IDs, use a directory prefix, and enforce that resolved destinations stay under path.

🛡️ Proposed fix
-def download_resource_files(bucket_name: str, resource_id: int, path: str):
+def download_resource_files(bucket_name: str, resource_id: int | str, path: str):
+    if resource_id is None or not str(resource_id).strip():
+        raise ValueError("resource_id is required")
+
     os.makedirs(path, exist_ok=True)
+    root_path = os.path.abspath(path)
 
     lock_file = os.path.join(path, ".download_resource_files.lock")
 
     with FileLock(lock_file, timeout_sec=3600):
         logger.info(f'Acquired lock: {lock_file} (pid={os.getpid()})')
@@
-        prefix = f"{resource_id}/resource"
-        objects = list(minio_client.list_objects(bucket_name, prefix=prefix, recursive=True))
+        prefix = f"{resource_id}/resource/"
 
         try:
+            objects = list(minio_client.list_objects(bucket_name, prefix=prefix, recursive=True))
             logger.info(f'Downloading {len(objects)} resource files for {bucket_name} {resource_id}')
 
             for i, obj in enumerate(objects):
                 if obj.is_dir:
                     continue
 
-                rel_path = obj.object_name[len(prefix):].lstrip("/\\")
+                if not obj.object_name.startswith(prefix):
+                    continue
+
+                rel_path = obj.object_name.removeprefix(prefix)
                 if not rel_path:
                     continue
 
-                destination_file = os.path.join(path, rel_path)
+                destination_file = os.path.abspath(os.path.join(root_path, rel_path))
+                if os.path.commonpath([root_path, destination_file]) != root_path:
+                    raise ValueError(f"Unsafe resource object path: {obj.object_name}")
+
                 os.makedirs(os.path.dirname(destination_file), exist_ok=True)
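
The containment check in the proposed fix can be exercised on its own. The helper below is a hypothetical extraction for illustration (safe_destination is not a function in the PR); it applies the same startswith, abspath, and commonpath steps to a single object key and assumes POSIX paths.

```python
import os
from typing import Optional


def safe_destination(root: str, prefix: str, object_name: str) -> Optional[str]:
    """Map an object key to a path under root, or reject it.

    Returns None for keys outside the directory prefix or with an empty
    relative path, and raises ValueError for keys that would escape root.
    """
    if not object_name.startswith(prefix):
        return None  # e.g. a sibling such as '226/resource_backup/...'
    rel_path = object_name[len(prefix):]
    if not rel_path:
        return None
    root_abs = os.path.abspath(root)
    dest = os.path.abspath(os.path.join(root_abs, rel_path))
    if os.path.commonpath([root_abs, dest]) != root_abs:
        raise ValueError(f"Unsafe resource object path: {object_name}")
    return dest


ok = safe_destination("/app/resource", "226/resource/", "226/resource/img/a.png")
sibling = safe_destination("/app/resource", "226/resource/", "226/resource_backup/x")
try:
    safe_destination("/app/resource", "226/resource/", "226/resource/../../etc/passwd")
    escaped = True
except ValueError:
    escaped = False
```

Here ok resolves to /app/resource/img/a.png, sibling is None because the key does not start with the directory prefix, and the traversal key is rejected before any file is written.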
Comment on lines +58 to +63
        minio_client = Minio(
            endpoint=minio_config.MINIO_ENDPOINT,
            access_key=minio_config.MINIO_ACCESS_KEY,
            secret_key=minio_config.MINIO_SECRET_KEY,
            secure=False
        )
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n -C2 'MINIO_(ENDPOINT|SECURE)|secure=False' --iglob '*.py' --iglob '*.env*' --iglob '*.yaml' --iglob '*.yml'

Repository: genonai/doc_parser

Length of output: 2345


🏁 Script executed:

#!/bin/bash
# 1. Check the full MinioConfig class definition
echo "=== MinioConfig class definition ==="
sed -n '66,85p' genon/preprocessor/src/common/settings.py

# 2. Find all Minio() instantiations
echo -e "\n=== All Minio() instantiations ==="
rg -n 'Minio\(' --type py

# 3. Look for deployment/env files
echo -e "\n=== Deployment and env files ==="
find . -type f \( -name '*.env*' -o -name '*deployment*' -o -name '*docker*' \) | head -20

# 4. Check if other config sections define defaults
echo -e "\n=== Other BaseSettings with defaults ==="
grep -A 10 'class.*Config(BaseSettings)' genon/preprocessor/src/common/settings.py | head -40

Repository: genonai/doc_parser

Length of output: 1146


🏁 Script executed:

#!/bin/bash
# 1. Check env files for MinIO configuration
echo "=== .env.dev content (MinIO-related) ==="
grep -i minio ./genon/preprocessor/env/.env.dev || echo "(no MinIO config in .env.dev)"

echo -e "\n=== .env.prod content (MinIO-related) ==="
grep -i minio ./genon/preprocessor/env/.env.prod || echo "(no MinIO config in .env.prod)"

# 2. Look for boolean fields with defaults in settings.py
echo -e "\n=== Boolean config fields with defaults ==="
rg -n ': bool\s*=' genon/preprocessor/src/common/settings.py

# 3. Get context around Minio() instantiation
echo -e "\n=== Full Minio() instantiation context ==="
sed -n '50,75p' genon/preprocessor/src/util/minio_resource.py

Repository: genonai/doc_parser

Length of output: 1254


🏁 Script executed:

#!/bin/bash
# Check all usages of minio_config (not just instantiation)
echo "=== All minio_config usages ==="
rg -n 'minio_config' --type py

# Check imports of Minio class
echo -e "\n=== Minio class imports ==="
rg -n 'from.*minio|import.*Minio' --type py

Repository: genonai/doc_parser

Length of output: 2016


Make MinIO TLS configurable instead of hardcoding plaintext.

secure=False forces HTTP for all environments, including production. Add a MINIO_SECURE boolean setting to enable TLS-based deployments. This is the only MinIO instantiation in the codebase, and environment files currently provide no way to override this hardcoded value.

Proposed fix
         minio_client = Minio(
             endpoint=minio_config.MINIO_ENDPOINT,
             access_key=minio_config.MINIO_ACCESS_KEY,
             secret_key=minio_config.MINIO_SECRET_KEY,
-            secure=False
+            secure=minio_config.MINIO_SECURE
         )

Add to genon/preprocessor/src/common/settings.py:

 class MinioConfig(BaseSettings):
     class Config(BaseConfig):
         pass
 
     MINIO_ENDPOINT: str
     MINIO_ACCESS_KEY: str
     MINIO_SECRET_KEY: str
+    MINIO_SECURE: bool = False
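
With the field declared on a BaseSettings class, the string-to-bool coercion is handled by the settings library. Purely to illustrate how an environment value such as MINIO_SECURE=true would end up as the secure argument, here is a stdlib-only sketch; env_flag and its accepted spellings are assumptions made for this example, not part of the PR or of any library.

```python
import os

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def env_flag(name, default=False, environ=None):
    """Read a boolean flag from the environment, failing loudly on typos."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default  # unset keeps the current plaintext behavior
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"{name} must be a boolean-like value, got {raw!r}")


secure_default = env_flag("MINIO_SECURE", environ={})
secure_enabled = env_flag("MINIO_SECURE", environ={"MINIO_SECURE": "True"})
```

Defaulting to False when the variable is unset keeps existing deployments unchanged, which matches the default in the proposed MinioConfig field.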