[BE] 첨부형 전처리기 배포 시 리소스 파일 자동 다운로드 (GenOS #9078)#176
Conversation
- src/common/settings.py: PREPROCESSOR_ID 필드·MinioConfig 추가 - src/util/minio_resource.py: MinIO 다운로드 + fcntl FileLock 신규 - src/main.py: DocumentProcessor 초기화 전 /app/resource 프리패치 훅 삽입 - tests/unit/test_minio_resource_unit.py: FileLock·download_resource_files 단위 테스트 9종 추가 Refs: https://github.com/genonai/GenOS/issues/9078 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 54 minutes and 21 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR adds resource file management from MinIO storage. It introduces Changes
Sequence Diagram(s)sequenceDiagram
participant App as Application
participant FileLock as FileLock<br/>(Concurrent Safety)
participant MinIO as MinIO Client
participant FS as Filesystem
App->>FileLock: Acquire lock (timeout: 600s)
FileLock->>FS: Create/lock file with PID
FileLock-->>App: Lock acquired
App->>MinIO: Create client from credentials
App->>MinIO: list_objects(bucket, prefix)
MinIO-->>App: Object list
loop For each object
App->>FS: Check if file exists
alt File missing
App->>MinIO: fget_object(object)
MinIO->>FS: Download file
FS-->>MinIO: Ack
else File exists
App->>FS: Skip (already exists)
end
end
App->>FileLock: Release lock
FileLock->>FS: Unlock and close file
FileLock-->>App: Lock released
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a resource downloading mechanism from Minio, including new configuration settings, a file locking utility to manage concurrent access, and the core download logic. It also adds comprehensive unit tests for these components. The review feedback highlights a potential issue with null preprocessor IDs, suggests removing redundant system path updates, and recommends correcting type hint inconsistencies.
| self._fd = None | ||
|
|
||
|
|
||
| def download_resource_files(bucket_name: str, resource_id: int, path: str): |
There was a problem hiding this comment.
resource_id 인자의 타입 힌트가 int로 되어 있으나, settings.py 및 실제 환경 변수로부터 넘어오는 값은 문자열(str)입니다. 런타임 에러는 발생하지 않으나 정적 분석 및 코드 가독성을 위해 str로 변경하는 것이 적절합니다.
| def download_resource_files(bucket_name: str, resource_id: int, path: str): | |
| def download_resource_files(bucket_name: str, resource_id: str, path: str): |
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: 백영상 <yspaik@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@genon/preprocessor/src/util/minio_resource.py`:
- Around line 20-38: The __enter__ method opens self._fd before the lock loop
but may exit on timeout or unexpected OSError without closing it; modify
__enter__ (in the Minio lock/context class containing __enter__, self._fd,
lock_path, timeout_sec, poll_interval) to ensure the file descriptor is closed
on all failure paths by wrapping the acquisition loop in a try/except/finally or
by closing self._fd in each failure branch, and when re-raising unexpected
OSErrors or on timeout use exception chaining (raise ... from e) to satisfy Ruff
B904; keep the successful-lock behavior so __exit__ still handles the normal
close.
- Around line 50-80: In download_resource_files, validate resource_id at the top
(raise/return if None or not a non-empty integer/string) and normalize it (no
slashes); build the MinIO prefix as a directory prefix (e.g.,
f"{resource_id}/resource/") so you only match that directory (avoid matching
sibling keys like "226/resource_backup"); when iterating objects compute
rel_path by removing that exact prefix, reject keys that produce empty rel_path
or contain path traversal segments (e.g., ".." or leading "/" or "\"), then
compute destination_file and after joining check the realpath of
destination_file starts with the realpath of path (fail if not) before creating
parent dirs or writing files to ensure no writes escape the intended directory.
- Around line 58-63: The Minio client is hardcoded with secure=False in the
Minio(...) call inside minio_resource.py; make TLS configurable by adding a
MINIO_SECURE boolean setting to your settings module (e.g.,
genon.preprocessor.src.common.settings) and replace the literal secure=False
with that setting (e.g., secure=MINIO_SECURE) when constructing Minio in the
function or class that instantiates it so deployments can enable TLS via
environment/configuration.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9ba3df8d-77d3-4809-b8d8-86b08761b1f6
📒 Files selected for processing (5)
genon/preprocessor/src/common/settings.pygenon/preprocessor/src/main.pygenon/preprocessor/src/util/__init__.pygenon/preprocessor/src/util/minio_resource.pygenon/preprocessor/tests/unit/test_minio_resource_unit.py
| def __enter__(self): | ||
| os.makedirs(os.path.dirname(self.lock_path), exist_ok=True) | ||
| self._fd = open(self.lock_path, "a+") | ||
|
|
||
| start = time.time() | ||
| while True: | ||
| try: | ||
| fcntl.flock(self._fd.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB) | ||
| self._fd.seek(0) | ||
| self._fd.truncate() | ||
| self._fd.write(f"pid={os.getpid()} acquired_at={time.time()}\n") | ||
| self._fd.flush() | ||
| return self | ||
| except OSError as e: | ||
| if e.errno not in (errno.EACCES, errno.EAGAIN): | ||
| raise | ||
| if (time.time() - start) >= self.timeout_sec: | ||
| raise TimeoutError(f"Timed out acquiring lock: {self.lock_path}") | ||
| time.sleep(self.poll_interval) |
There was a problem hiding this comment.
Close the lock file when acquisition fails.
If flock times out or raises an unexpected OSError, __enter__ exits before __exit__ runs, leaving the opened descriptor around. Wrap the acquisition loop so failure paths close _fd; this also satisfies the Ruff B904 exception-chaining warning.
🔒 Proposed fix
def __enter__(self):
- os.makedirs(os.path.dirname(self.lock_path), exist_ok=True)
+ lock_dir = os.path.dirname(self.lock_path)
+ if lock_dir:
+ os.makedirs(lock_dir, exist_ok=True)
self._fd = open(self.lock_path, "a+")
start = time.time()
- while True:
- try:
- fcntl.flock(self._fd.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
- self._fd.seek(0)
- self._fd.truncate()
- self._fd.write(f"pid={os.getpid()} acquired_at={time.time()}\n")
- self._fd.flush()
- return self
- except OSError as e:
- if e.errno not in (errno.EACCES, errno.EAGAIN):
- raise
- if (time.time() - start) >= self.timeout_sec:
- raise TimeoutError(f"Timed out acquiring lock: {self.lock_path}")
- time.sleep(self.poll_interval)
+ try:
+ while True:
+ try:
+ fcntl.flock(self._fd.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
+ self._fd.seek(0)
+ self._fd.truncate()
+ self._fd.write(f"pid={os.getpid()} acquired_at={time.time()}\n")
+ self._fd.flush()
+ return self
+ except OSError as e:
+ if e.errno not in (errno.EACCES, errno.EAGAIN):
+ raise
+ if (time.time() - start) >= self.timeout_sec:
+ raise TimeoutError(f"Timed out acquiring lock: {self.lock_path}") from e
+ time.sleep(self.poll_interval)
+ except Exception:
+ if self._fd:
+ self._fd.close()
+ self._fd = None
+ raise🧰 Tools
🪛 Ruff (0.15.10)
[warning] 37-37: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@genon/preprocessor/src/util/minio_resource.py` around lines 20 - 38, The
__enter__ method opens self._fd before the lock loop but may exit on timeout or
unexpected OSError without closing it; modify __enter__ (in the Minio
lock/context class containing __enter__, self._fd, lock_path, timeout_sec,
poll_interval) to ensure the file descriptor is closed on all failure paths by
wrapping the acquisition loop in a try/except/finally or by closing self._fd in
each failure branch, and when re-raising unexpected OSErrors or on timeout use
exception chaining (raise ... from e) to satisfy Ruff B904; keep the
successful-lock behavior so __exit__ still handles the normal close.
| def download_resource_files(bucket_name: str, resource_id: int, path: str): | ||
| os.makedirs(path, exist_ok=True) | ||
|
|
||
| lock_file = os.path.join(path, ".download_resource_files.lock") | ||
|
|
||
| with FileLock(lock_file, timeout_sec=3600): | ||
| logger.info(f'Acquired lock: {lock_file} (pid={os.getpid()})') | ||
|
|
||
| minio_client = Minio( | ||
| endpoint=minio_config.MINIO_ENDPOINT, | ||
| access_key=minio_config.MINIO_ACCESS_KEY, | ||
| secret_key=minio_config.MINIO_SECRET_KEY, | ||
| secure=False | ||
| ) | ||
|
|
||
| prefix = f"{resource_id}/resource" | ||
| objects = list(minio_client.list_objects(bucket_name, prefix=prefix, recursive=True)) | ||
|
|
||
| try: | ||
| logger.info(f'Downloading {len(objects)} resource files for {bucket_name} {resource_id}') | ||
|
|
||
| for i, obj in enumerate(objects): | ||
| if obj.is_dir: | ||
| continue | ||
|
|
||
| rel_path = obj.object_name[len(prefix):].lstrip("/\\") | ||
| if not rel_path: | ||
| continue | ||
|
|
||
| destination_file = os.path.join(path, rel_path) | ||
| os.makedirs(os.path.dirname(destination_file), exist_ok=True) |
There was a problem hiding this comment.
Harden resource ID and object-key path handling.
resource_id=None currently becomes None/resource, the prefix also matches sibling keys like 226/resource_backup/..., and ../ inside an object key can write outside /app/resource. Fail fast on missing IDs, use a directory prefix, and enforce that resolved destinations stay under path.
🛡️ Proposed fix
-def download_resource_files(bucket_name: str, resource_id: int, path: str):
+def download_resource_files(bucket_name: str, resource_id: int | str, path: str):
+ if resource_id is None or not str(resource_id).strip():
+ raise ValueError("resource_id is required")
+
os.makedirs(path, exist_ok=True)
+ root_path = os.path.abspath(path)
lock_file = os.path.join(path, ".download_resource_files.lock")
with FileLock(lock_file, timeout_sec=3600):
logger.info(f'Acquired lock: {lock_file} (pid={os.getpid()})')
@@
- prefix = f"{resource_id}/resource"
- objects = list(minio_client.list_objects(bucket_name, prefix=prefix, recursive=True))
+ prefix = f"{resource_id}/resource/"
try:
+ objects = list(minio_client.list_objects(bucket_name, prefix=prefix, recursive=True))
logger.info(f'Downloading {len(objects)} resource files for {bucket_name} {resource_id}')
for i, obj in enumerate(objects):
if obj.is_dir:
continue
- rel_path = obj.object_name[len(prefix):].lstrip("/\\")
+ if not obj.object_name.startswith(prefix):
+ continue
+
+ rel_path = obj.object_name.removeprefix(prefix)
if not rel_path:
continue
- destination_file = os.path.join(path, rel_path)
+ destination_file = os.path.abspath(os.path.join(root_path, rel_path))
+ if os.path.commonpath([root_path, destination_file]) != root_path:
+ raise ValueError(f"Unsafe resource object path: {obj.object_name}")
+
os.makedirs(os.path.dirname(destination_file), exist_ok=True)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@genon/preprocessor/src/util/minio_resource.py` around lines 50 - 80, In
download_resource_files, validate resource_id at the top (raise/return if None
or not a non-empty integer/string) and normalize it (no slashes); build the
MinIO prefix as a directory prefix (e.g., f"{resource_id}/resource/") so you
only match that directory (avoid matching sibling keys like
"226/resource_backup"); when iterating objects compute rel_path by removing that
exact prefix, reject keys that produce empty rel_path or contain path traversal
segments (e.g., ".." or leading "/" or "\"), then compute destination_file and
after joining check the realpath of destination_file starts with the realpath of
path (fail if not) before creating parent dirs or writing files to ensure no
writes escape the intended directory.
| minio_client = Minio( | ||
| endpoint=minio_config.MINIO_ENDPOINT, | ||
| access_key=minio_config.MINIO_ACCESS_KEY, | ||
| secret_key=minio_config.MINIO_SECRET_KEY, | ||
| secure=False | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -n -C2 'MINIO_(ENDPOINT|SECURE)|secure=False' --iglob '*.py' --iglob '*.env*' --iglob '*.yaml' --iglob '*.yml'Repository: genonai/doc_parser
Length of output: 2345
🏁 Script executed:
#!/bin/bash
# 1. Check the full MinioConfig class definition
echo "=== MinioConfig class definition ==="
sed -n '66,85p' genon/preprocessor/src/common/settings.py
# 2. Find all Minio() instantiations
echo -e "\n=== All Minio() instantiations ==="
rg -n 'Minio\(' --type py
# 3. Look for deployment/env files
echo -e "\n=== Deployment and env files ==="
find . -type f \( -name '*.env*' -o -name '*deployment*' -o -name '*docker*' \) | head -20
# 4. Check if other config sections define defaults
echo -e "\n=== Other BaseSettings with defaults ==="
grep -A 10 'class.*Config(BaseSettings)' genon/preprocessor/src/common/settings.py | head -40Repository: genonai/doc_parser
Length of output: 1146
🏁 Script executed:
#!/bin/bash
# 1. Check env files for MinIO configuration
echo "=== .env.dev content (MinIO-related) ==="
grep -i minio ./genon/preprocessor/env/.env.dev || echo "(no MinIO config in .env.dev)"
echo -e "\n=== .env.prod content (MinIO-related) ==="
grep -i minio ./genon/preprocessor/env/.env.prod || echo "(no MinIO config in .env.prod)"
# 2. Look for boolean fields with defaults in settings.py
echo -e "\n=== Boolean config fields with defaults ==="
rg -n ': bool\s*=' genon/preprocessor/src/common/settings.py
# 3. Get context around Minio() instantiation
echo -e "\n=== Full Minio() instantiation context ==="
sed -n '50,75p' genon/preprocessor/src/util/minio_resource.pyRepository: genonai/doc_parser
Length of output: 1254
🏁 Script executed:
#!/bin/bash
# Check all usages of minio_config (not just instantiation)
echo "=== All minio_config usages ==="
rg -n 'minio_config' --type py
# Check imports of Minio class
echo -e "\n=== Minio class imports ==="
rg -n 'from.*minio|import.*Minio' --type pyRepository: genonai/doc_parser
Length of output: 2016
Make MinIO TLS configurable instead of hardcoding plaintext.
secure=False forces HTTP for all environments, including production. Add a MINIO_SECURE boolean setting to enable TLS-based deployments. This is the only MinIO instantiation in the codebase, and environment files currently provide no way to override this hardcoded value.
Proposed fix
minio_client = Minio(
endpoint=minio_config.MINIO_ENDPOINT,
access_key=minio_config.MINIO_ACCESS_KEY,
secret_key=minio_config.MINIO_SECRET_KEY,
- secure=False
+ secure=minio_config.MINIO_SECURE
)Add to genon/preprocessor/src/common/settings.py:
class MinioConfig(BaseSettings):
class Config(BaseConfig):
pass
MINIO_ENDPOINT: str
MINIO_ACCESS_KEY: str
MINIO_SECRET_KEY: str
+ MINIO_SECURE: bool = False🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@genon/preprocessor/src/util/minio_resource.py` around lines 58 - 63, The
Minio client is hardcoded with secure=False in the Minio(...) call inside
minio_resource.py; make TLS configurable by adding a MINIO_SECURE boolean
setting to your settings module (e.g., genon.preprocessor.src.common.settings)
and replace the literal secure=False with that setting (e.g.,
secure=MINIO_SECURE) when constructing Minio in the function or class that
instantiates it so deployments can enable TLS via environment/configuration.
개요
전처리기 Pod 기동 시 MinIO에서 업로드된 리소스 파일을
/app/resource로 자동 다운로드하는 기능 추가. GenOS QA 이슈 genonai/GenOS#9078 수정.관련 이슈
배경
현상 (As-Is)
doc_parser도커 이미지로 배포된 전처리기 Pod(예:preprocessor-226-598d59564b-4r44g)의 파일시스템에/app/resource폴더가 생성되지 않아, 관리자 페이지(/admin/resource/preprocessor/detail/{id})에서 업로드한 리소스 파일(jpg·json·png·xlsx·py·html 등)이 Pod 내부로 동기화되지 않음. 결과적으로 전처리기 코드가 첨부 리소스를 참조할 수 없는 상태.재현
preprocessor버킷에 저장됨)doc_parser이미지로 Pod 배포ls -al ~→resource/부재 확인기대 동작 (To-Be)
Pod 기동 시 MinIO의
preprocessor/{PREPROCESSOR_ID}/resource/프리픽스를/app/resource하위에 자동 다운로드.해결 방식
GenOS 동일 문제를 이미 해결한
container-services/workflow·mcp-server의 패턴을doc_parser전처리기에도 이식.핵심 메커니즘
Minio.list_objects(bucket=\"preprocessor\", prefix=\"{ID}/resource\", recursive=True)→ 각 오브젝트를fget_object()로 다운로드, 원본 디렉터리 구조 유지.gunicorn workers=5에서 동시 다운로드 충돌 방지용fcntl.flock(LOCK_EX | LOCK_NB)기반FileLock. 첫 워커만 실제 다운로드, 나머지는 락 대기 후os.path.exists()로 SKIP.[SKIP i/N] already exists로그 후 넘김 → 재기동·재스케줄 시 중복 다운로드 없음.변경 파일
src/common/settings.pyOptionalimport,Settings.PREPROCESSOR_ID: Optional[str] = _ID,MinioConfig클래스 +minio_config싱글턴src/main.pyos·sys·settings·download_resource_filesimport,DocumentProcessor()초기화 직전 프리패치 훅 호출src/util/__init__.pysrc/util/minio_resource.pyFileLock+download_resource_files(bucket_name, resource_id, path)(104줄, workflow 구현과 동일)tests/unit/test_minio_resource_unit.py주요 변경 포인트
사전 조건 (GenOS 측에서 이미 충족)
이 PR 단독으로 동작하는 것이 아니라 GenOS 오케스트레이션·인프라가 아래를 이미 주입하고 있음 — 그래서 doc_parser 측 변경만으로 해결됨.
PREPROCESSOR_IDenv 주입: admin-api가 Pod 생성 시 주입 (admin-api/src/service/system/preprocessor_service.py:652)envFrom: llmops-minio-client-configmap·llmops-minio-client-secret연결됨 (orchestrator/src/k8s-manifest-templates/preprocessor/preprocessor-deployment.yaml)preprocessor버킷의{id}/resource/**— admin-api 업로드 경로와 일치minio==7.2.20이미pyproject.toml에 선언됨테스트
단위 테스트 결과
test_file_lock_acquires_and_writes_pidtest_file_lock_times_out_when_held_elsewhereTimeoutErrortest_file_lock_blocks_concurrent_acquiretest_download_resource_files_downloads_all_objectstest_download_resource_files_skips_existingtest_download_resource_files_skips_directory_entriesis_dir=True스킵test_download_resource_files_empty_listtest_download_resource_files_propagates_exceptiontest_download_resource_files_ignores_empty_relative_pathrel_path == \"\"스킵통합/수동 테스트 체크리스트
192.168.76.180:40908/admin/resource/preprocessor/detail/226) 전처리기 재배포ls -al /app/resource— 업로드된 파일 전체 확인/var/log/supervisor/gunicorn_stdout.log에Acquired lock/Downloading [i/N]/Completed!로그[SKIP …] already exists출력/run엔드포인트 호출 — 기존 전처리 회귀 없음0 files) 정상 기동리스크·롤백
리스크
startupProbe가 healthcheck 를 기다리므로 MinIO 회복 전까지 CrashLoopBackOff 로 대기. 리소스 없이 조용히 기동되는 것보다 안전.timeout_sec=3600): 대량 파일 + 느린 MinIO 환경에서 1시간 초과 시 워커 기동 실패 가능. 실제 QA 데이터(수 MB ~ 수 백 MB) 기준으로는 여유./app/resource는 컨테이너 ephemeral storage. Pod 재시작 시 재다운로드(단,os.path.exists스킵으로 재시작 자체 비용만).롤백
리뷰 요청 포인트
@claude 아래 관점에서 상세 리뷰 부탁드립니다:
FileLock 구현 안전성
fcntl.flock은 프로세스 크래시 시 자동 해제되지만,__exit__에서close()전flock(LOCK_UN)순서가 올바른지LOCK_NB+ 폴링 루프의 busy-wait 특성과poll_interval=0.2s의 적절성main.py import 순서·부작용
download_resource_files()를 호출 — FastAPI app 이미 생성된 후이므로 healthcheck 에는 응답 가능. 하지만 gunicorn worker 부팅 중이면startupProbe실패 가능성. 현startupProbe failureThreshold=120, periodSeconds=5(10분 한도)로 커버되는지sys.path.append(.../util)은 이미from util.minio_resource import ...가 동작하므로 사실상 불필요. 제거 가능한지Settings
PREPROCESSOR_ID필드명ID: Optional[str] = _ID였으나, main.py가settings.PREPROCESSOR_ID를 참조하도록 되어 있어 필드명을 env 변수와 동일하게PREPROCESSOR_ID로 통일. 이 선택이 적절한지테스트 커버리지
로깅·운영성
len(objects)기준 진행률 로그의 verbosity🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
Summary by CodeRabbit
New Features
Tests