It would be neat to have richer support for struct tags for auto-generated schema definitions. I added this feature to a branch off my forked repo and am happy to put up a PR if you guys think this is a good idea! I added documentation on what this would look like (I just copied the updates I made to the README on my branch).
Object Schema Definitions
The sub-package parquetschema/autoschema supports auto-generating schema
definitions for a provided object's type using reflection and struct tags. The
generated schema is meant to be compatible with the reflection-based
marshalling/unmarshalling in the floor sub-package.
Supported Parquet Types
| Parquet Type | Go Types | Note |
| --- | --- | --- |
| BOOLEAN | bool | |
| INT32 | int{8,16,32}, uint{,8,16,32} | |
| INT64 | int{,64}, uint64 | |
| INT96 | [12]byte | Must specify type=INT96 in the parquet struct tag. |
| FLOAT | float32 | |
| DOUBLE | float64 | |
| BYTE_ARRAY | string, []byte | |
| FIXED_LEN_BYTE_ARRAY | []byte, [N]byte | |
Supported Logical Types
| Logical Type | Go Types | Note |
| --- | --- | --- |
| STRING | string, []byte | |
| MAP | map[T1]T2 | Maps with any key and value types. |
| LIST | []T, [N]T | Slices and arrays of any type except for byte. |
| ENUM | string, []byte | |
| DECIMAL | int32, int64, []byte, [N]byte | |
| DATE | int32, time.Time | |
| TIME | int32, int64, goparquet.Time | int32: TIME(MILLIS, {false,true}), int64: TIME({MICROS,NANOS}, {false,true}) |
| TIMESTAMP | int64, time.Time | |
| INTEGER | {,u}int{,8,16,32,64} | |
| JSON | string, []byte | |
| BSON | string, []byte | |
| UUID | [16]byte | |
Pointers are automatically mapped to optional fields. Unsupported Go types
include funcs, interfaces, unsafe pointers, unsigned int pointers, and complex
numbers.
Default Type Mappings
By default, Go types are mapped to Parquet types and in some cases logical
types as well. More specific mappings can be achieved by the use of struct
tags (see below).
| Go Type | Default Parquet Type | Default Logical Type |
| --- | --- | --- |
| bool | BOOLEAN | |
| int{,8,16,32,64} | INT{64,32,32,32,64} | INTEGER({64,8,16,32,64}, true) |
| uint{,8,16,32,64} | INT{32,32,32,32,64} | INTEGER({32,8,16,32,64}, false) |
| string | BYTE_ARRAY | STRING |
| []byte | BYTE_ARRAY | |
| [N]byte | FIXED_LEN_BYTE_ARRAY | |
| time.Time | INT64 | TIMESTAMP(NANOS, true) |
| goparquet.Time | INT64 | TIME(NANOS, true) |
| map | group | MAP |
| slice, array | group | LIST |
| struct | group | |
Struct Tags
Automatic schema definition generation supports the use of the parquet struct
tag for further schema specification beyond the default mappings. Tag fields
have the format key=value and are comma separated. The tags do not support
converted types as these are now deprecated by Parquet. Since converted types
are still required to support backward compatibility, they are automatically
set based on a field's logical type.
| Tag Field | Type | Values | Notes |
| --- | --- | --- | --- |
| name | string | ANY | Defaults to the lower-case struct field name. |
| type | string | INT96 | Unless using a [12]byte field for INT96, this does not ever need to be specified. |
| logicaltype | string | STRING, ENUM, DECIMAL, DATE, TIME, TIMESTAMP, JSON, BSON, UUID | Maps and non-byte slices and arrays are always mapped to MAP and LIST logical types, respectively. |
| timeunit | string | MILLIS, MICROS, NANOS | Only used when the logical type is TIME or TIMESTAMP, defaults to NANOS. |
| isadjustedtoutc | bool | ANY | Only used when the logical type is TIME or TIMESTAMP, defaults to true. |
| scale | int32 | N >= 0 | Only used when the logical type is DECIMAL, defaults to 0. |
| precision | int32 | N >= 0 | Only used when the logical type is DECIMAL, required. |
Tag fields must be prefixed with key. or value. when they apply to the key or
value type of a map, respectively, and with element. when they apply to the
element type of a slice or array. Prefixing name is invalid, since a name can
only apply to the field itself.
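To make the tag format concrete, here is a minimal sketch of how such a tag could be split into scopes. It is not the parser on the branch: parseTag and its return shape are hypothetical, and the sketch checks only the syntax described above (comma-separated key=value pairs, the key., value. and element. prefixes, and the ban on prefixing name), not the allowed values.

```go
package main

import (
	"fmt"
	"strings"
)

// parseTag is a hypothetical helper. It groups tag fields by scope: "" for the
// field itself, and "key", "value" or "element" for prefixed tag fields.
func parseTag(tag string) (map[string]map[string]string, error) {
	out := map[string]map[string]string{}
	for _, part := range strings.Split(tag, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		kv := strings.SplitN(part, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("malformed tag field %q", part)
		}
		scope, name := "", strings.TrimSpace(kv[0])
		for _, prefix := range []string{"key", "value", "element"} {
			if strings.HasPrefix(name, prefix+".") {
				scope, name = prefix, strings.TrimPrefix(name, prefix+".")
				break
			}
		}
		// A name can only apply to the field itself, never to a sub-type.
		if scope != "" && name == "name" {
			return nil, fmt.Errorf("name cannot be prefixed: %q", part)
		}
		if out[scope] == nil {
			out[scope] = map[string]string{}
		}
		out[scope][name] = strings.TrimSpace(kv[1])
	}
	return out, nil
}

func main() {
	fields, err := parseTag("name=decimal_time_map, key.logicaltype=DECIMAL, key.precision=15, value.logicaltype=TIME")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(fields[""]["name"], fields["key"]["logicaltype"], fields["value"]["logicaltype"])
}
```

In real code the tag string would typically come from reflect.StructField.Tag.Get("parquet").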
Object Schema Example
type example struct {
	ByteSlice       []byte
	String          string
	ByteString      []byte          `parquet:"name=byte_string, logicaltype=STRING"`
	Int64           int64           `parquet:"name=int_64"`
	Uint8           uint8           `parquet:"name=u_int_8"`
	Int96           [12]byte        `parquet:"name=int_96, type=INT96"`
	DefaultTS       time.Time       `parquet:"name=default_ts"`
	Timestamp       int64           `parquet:"name=ts, logicaltype=TIMESTAMP, timeunit=MILLIS, isadjustedtoutc=false"`
	Date            time.Time       `parquet:"name=date, logicaltype=DATE"`
	OptionalDecimal *int32          `parquet:"name=decimal, logicaltype=DECIMAL, scale=5, precision=10"`
	TimeList        []int32         `parquet:"name=time_list, element.logicaltype=TIME, element.timeunit=MILLIS"`
	DecimalTimeMap  map[int64]int32 `parquet:"name=decimal_time_map, key.logicaltype=DECIMAL, key.scale=5, key.precision=15, value.logicaltype=TIME, value.timeunit=MILLIS, value.isadjustedtoutc=true"`
	Struct          struct {
		OptionalInt64 *int64   `parquet:"name=int_64"`
		Time          int64    `parquet:"name=time, logicaltype=TIME, isadjustedtoutc=false"`
		StringList    []string `parquet:"name=string_list"`
	} `parquet:"name=struct"`
}
The above struct is equivalent to the following schema definition:
message autogen_schema {
  required binary byteslice;
  required binary string (STRING);
  required binary byte_string (STRING);
  required int64 int_64 (INTEGER(64,true));
  required int32 u_int_8 (INTEGER(8,false));
  required int96 int_96;
  required int64 default_ts (TIMESTAMP(NANOS,true));
  required int64 ts (TIMESTAMP(MILLIS,false));
  required int32 date (DATE);
  optional int32 decimal (DECIMAL(10,5));
  required group time_list (LIST) {
    repeated group list {
      required int32 element (TIME(MILLIS,true));
    }
  }
  optional group decimal_time_map (MAP) {
    repeated group key_value (MAP_KEY_VALUE) {
      required int64 key (DECIMAL(15,5));
      required int32 value (TIME(MILLIS,true));
    }
  }
  required group struct {
    optional int64 int_64 (INTEGER(64,true));
    required int64 time (TIME(NANOS,false));
    required group string_list (LIST) {
      repeated group list {
        required binary element (STRING);
      }
    }
  }
}