Add Nuplan central token extraction and USDZ dataset support for Alpasim integration #62
WCJ-BERT wants to merge 6 commits into
Add USDZ dataset implementation with the following features:
- Wrap alpasim_utils.Artifact for stable USDZ parsing
- Convert Artifact data to trajdata's standard DataFrame format
- Extract velocity and acceleration from trajectory positions
- Use project's arr_utils.quaternion_to_yaw() for consistency
- Calculate actual dt from timestamps (fallback to 0.1s)
- Optimize derivative computation with single groupby pass
- Support maps and agent metadata extraction
Technical improvements:
- Proper quaternion order conversion ([x,y,z,w] -> [w,x,y,z])
- Dynamic time step calculation from trajectory timestamps
- Efficient pandas operations for velocity/acceleration
- Clean integration with env_utils
Total: 461 lines of new USDZ dataset code
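For readers following the quaternion and time-step points in this commit message, here is a minimal standalone sketch of the idea. It is not the PR's code and does not call arr_utils.quaternion_to_yaw(); the function names below are hypothetical, timestamps are assumed to be in seconds, and the quaternion is assumed to be unit length.

```python
import math
from typing import Sequence, Tuple

DEFAULT_DT = 0.1  # fallback time step in seconds, as stated in the commit message


def xyzw_to_wxyz(q: Sequence[float]) -> Tuple[float, float, float, float]:
    """Reorder a quaternion from [x, y, z, w] to [w, x, y, z]."""
    x, y, z, w = q
    return (w, x, y, z)


def yaw_from_wxyz(q: Sequence[float]) -> float:
    """Heading (rotation about the z axis) of a unit quaternion in [w, x, y, z] order."""
    w, x, y, z = q
    return math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))


def estimate_dt(timestamps_s: Sequence[float]) -> float:
    """Middle value of the positive gaps between timestamps, or DEFAULT_DT if none exist."""
    diffs = sorted(b - a for a, b in zip(timestamps_s, timestamps_s[1:]) if b > a)
    if not diffs:
        return DEFAULT_DT
    return diffs[len(diffs) // 2]
```

Mixing up the two quaternion orders silently produces a wrong heading rather than an error, which is presumably why the commit calls the conversion out explicitly.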
- Add dataset_kwargs parameter to ParallelDatasetPreprocessor
- Pass dataset_kwargs through multiprocessing to child processes
- Save dataset_kwargs in UnifiedDataset for parallel preprocessing
- This enables dataset-specific configurations to work in parallel mode
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…path
Major changes:
- Change dataset_kwargs format from flat to nested dict structure
Old: dataset_kwargs={'param': value} (shared by all datasets)
New: dataset_kwargs={'dataset_name': {'param': value}} (per-dataset)
- Remove yaml_config_path parameter from NuplanDataset and NuPlanObject
Use central_tokens_config directly instead
- Update env_utils.get_raw_datasets() to support nested dict format
Each dataset now receives only its own specific parameters
- Optimize parallel preprocessing to avoid redundant parameter extraction
ParallelDatasetPreprocessor now correctly handles nested dict format
- Clean up df_cache.py: remove unused resolution parameter
Benefits:
- Clearer separation of per-dataset parameters
- More flexible multi-dataset configuration
- Simpler parallel worker logic
Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
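The nested format described in this commit can be illustrated with a small sketch. The helper name kwargs_for, the dataset names, and the parameter values are hypothetical placeholders; the real lookup lives in env_utils.get_raw_datasets(), whose body is not shown in this thread.

```python
from typing import Any, Dict


def kwargs_for(dataset_name: str, dataset_kwargs: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return a copy of only the parameters registered under this dataset's name."""
    return dict(dataset_kwargs.get(dataset_name, {}))


# Hypothetical configuration: names and values are illustrative only.
dataset_kwargs = {
    "nuplan_mini": {"num_timesteps_before": 20, "num_timesteps_after": 30},
    "nusc_mini": {},
}

nuplan_params = kwargs_for("nuplan_mini", dataset_kwargs)  # per-dataset parameters
other_params = kwargs_for("some_other_dataset", dataset_kwargs)  # empty dict, no KeyError
```

Returning a copy keeps a dataset constructor from mutating the shared configuration, which matters once the same dict is also handed to parallel workers.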
Revert the usdz dataset added in c60712d: drop the src/trajdata/dataset_specific/usdz package, the usdz dispatch branch in env_utils, and the README table row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| kdtrees_path, | ||
| rtrees_path, | ||
| ) = DataFrameCache.get_map_paths(cache_path, env_name, map_name, resolution) | ||
| ) = DataFrameCache.get_map_paths(cache_path, env_name, map_name) |
I'm not sure this change was actually needed. It seems like is_map_cached now has an unused argument, and existing callers may have issues.
| class NuplanDataset(RawDataset): | ||
| def __init__( | ||
| self, | ||
| name: str, |
nit: suggest keeping this variable's name the same as in the superclass (env_name). This is especially important since it is different from self.name.
| num_timesteps_before: Optional[int] = None, | ||
| num_timesteps_after: Optional[int] = None, | ||
| use_central_tokens: bool = False, | ||
| **kwargs, |
| central_tokens_config: Optional[List[Dict[str, Any]]] = None, | ||
| num_timesteps_before: Optional[int] = None, | ||
| num_timesteps_after: Optional[int] = None, | ||
| use_central_tokens: bool = False, |
Based on my understanding, this might be redundant. If I read this correctly, use_central_tokens is equivalent to central_tokens_config is not None.
This also makes me wonder if self._use_central_tokens is necessary, or if this could be a property (dependent on central_tokens_config) instead.
This comment also trickles down to the NuplanObject class.
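A minimal sketch of the property suggestion, assuming the flag really is equivalent to the config being set. NuplanDatasetSketch is a hypothetical stand-in, not the PR's class, and it omits every other constructor argument.

```python
from typing import Any, Dict, List, Optional


class NuplanDatasetSketch:
    """Hypothetical stand-in for NuplanDataset; only the flag logic is shown."""

    def __init__(self, central_tokens_config: Optional[List[Dict[str, Any]]] = None) -> None:
        self.central_tokens_config = central_tokens_config

    @property
    def use_central_tokens(self) -> bool:
        # Derived from the config, so the flag and the config can never disagree.
        return self.central_tokens_config is not None
```

Note that with an `is not None` check an empty list still counts as enabled; whether that is the desired behaviour is a question for the author.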
| data_dir: str, | ||
| parallelizable: bool = True, | ||
| has_maps: bool = True, | ||
| central_tokens_config: Optional[List[Dict[str, Any]]] = None, |
I think these optional arguments might currently be incompatible with the way caching works, in that the env_is_cached() method won't be aware of changes to them.
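One possible way to make a cache check sensitive to these arguments is to store a fingerprint of them next to the cached data and compare it on load. This is only a sketch: kwargs_fingerprint is a hypothetical helper, it assumes the arguments are JSON-serializable (anything else is stringified), and how it would plug into env_is_cached() is not shown here.

```python
import hashlib
import json
from typing import Any, Dict


def kwargs_fingerprint(kwargs: Dict[str, Any]) -> str:
    """Stable short hash of the given kwargs; key order does not affect the result."""
    payload = json.dumps(kwargs, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Usage: a changed argument yields a different fingerprint, signalling a stale cache.
cached = kwargs_fingerprint({"num_timesteps_before": 20, "num_timesteps_after": 30})
current = kwargs_fingerprint({"num_timesteps_before": 10, "num_timesteps_after": 30})
cache_is_stale = cached != current
```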
| } | ||
| } | ||
| try: | ||
| import pickle |
This introduces pickle.loads() on values read from the NuPlan SQLite database. That is only safe if we treat the raw NuPlan dataset as trusted input (the concern being malicious pickle execution). Is this required by the NuPlan DB format? If so, can we document the trust assumption clearly, or use an official NuPlan deserialization path instead?
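If pickle cannot be avoided, one mitigation is the "restricting globals" pattern from the Python pickle documentation: subclass pickle.Unpickler and override find_class so that only allowlisted classes can be resolved. The sketch below assumes an empty allowlist, which permits plain data only. Which classes the NuPlan blobs actually need is not known from this thread, and this narrows the attack surface without making untrusted pickles safe.

```python
import io
import pickle
from typing import Any, FrozenSet, Tuple

# Assumption: an explicit allowlist of (module, name) pairs; empty means plain data only.
ALLOWED_GLOBALS: FrozenSet[Tuple[str, str]] = frozenset()


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global not on the allowlist."""

    def find_class(self, module: str, name: str) -> Any:
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is not allowed")


def restricted_loads(data: bytes) -> Any:
    """Drop-in analogue of pickle.loads() that applies the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Primitive containers round-trip as usual, while a payload that references a function or class raises pickle.UnpicklingError instead of importing it.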
| else: | ||
| translation = np.array([trans_obj.x, trans_obj.y, trans_obj.z]) if hasattr(trans_obj, 'x') else np.array([0.0, 0.0, 0.0]) | ||
| except Exception: | ||
| translation = np.array([0.0, 0.0, 0.0]) |
I wonder what the consequences of the default value are here. For instance, if there is an issue loading this data, would it be better to raise the exception or set this to None, to avoid someone accidentally using it downstream and getting corrupt results?
Same question for the quaternions.
| # Try to get quaternion components. | ||
| if hasattr(rot_obj, 'quaternion'): | ||
| q = rot_obj.quaternion | ||
| rotation = np.array([q.w, q.x, q.y, q.z]) if hasattr(q, 'w') else np.array([q.x, q.y, q.z, q.w]) |
This line doesn't quite look right to me: the else branch still reads q.w even when hasattr(q, 'w') is False.
This is another point where I wonder if None is better.
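A sketch of the None-returning alternative, under the assumption that callers can handle a missing rotation. extract_wxyz is a hypothetical helper that returns a plain tuple rather than the np.array used in the PR, and the SimpleNamespace objects are stand-ins for whatever the database yields.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple

Quaternion = Tuple[float, float, float, float]  # always (w, x, y, z)


def extract_wxyz(rot_obj: Any) -> Optional[Quaternion]:
    """Return (w, x, y, z) if all four components are present, otherwise None."""
    q = getattr(rot_obj, "quaternion", rot_obj)
    if not all(hasattr(q, name) for name in ("w", "x", "y", "z")):
        return None  # no silent fallback: the caller must handle missing data
    return (float(q.w), float(q.x), float(q.y), float(q.z))


# Usage with stand-in objects.
complete = SimpleNamespace(quaternion=SimpleNamespace(w=1.0, x=0.0, y=0.0, z=0.0))
missing_w = SimpleNamespace(quaternion=SimpleNamespace(x=0.0, y=0.0, z=0.0))
```

Checking all four attributes up front removes the branch that reads q.w after hasattr has already reported it missing.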
| elif hasattr(intrinsic_obj, '__iter__') and not isinstance(intrinsic_obj, str): | ||
| intrinsic = np.array(intrinsic_obj) | ||
| else: | ||
| intrinsic = np.array([[1545.0, 0.0, 960.0], [0.0, 1545.0, 560.0], [0.0, 0.0, 1.0]]) |
Where do these numbers come from?
Could we add a reference to the source? Or maybe this should be another None case instead of a fallback?
| self.envs: List[RawDataset] = env_utils.get_raw_datasets(data_dirs) | ||
| # Pass dataset-specific kwargs to raw datasets | ||
| dataset_kwargs = dataset_kwargs or {} | ||
| self.dataset_kwargs = dataset_kwargs # Save for later use in parallel preprocessing |
I wonder if we can change this to
self.dataset_kwargs = dataset_kwargs or {}
and then just use self.dataset_kwargs everywhere?
Overview
This PR enables trajdata as a unified data source for Alpasim by adding:
All changes are fully backward compatible with existing trajdata usage, and ready to support Unified Scene Data Flow changes in Alpasim.