
Evaluate_generator produces wrong accuracy scores? #6499

@skoch9

Description

Hello, I'm running a slightly modified version of the Keras fine-tuning example that only fine-tunes the top layers (Keras 2.0.3 with the TensorFlow backend, on Ubuntu with a GPU). It looks like the following:

from keras import applications, optimizers
from keras.layers import Dense, Flatten
from keras.models import Model, Sequential
from keras.preprocessing.image import ImageDataGenerator

img_width, img_height = 150, 150
train_data_dir = 'data/train_s'
validation_data_dir = 'data/val_s'
nb_train_samples = 2000
nb_validation_samples = 800
epochs = 10
batch_size = 16

base_model = applications.VGG16(weights='imagenet', include_top=False,
                                input_shape=(img_width, img_height, 3))

# Freeze the convolutional base so that only the new top layers are trained
for layer in base_model.layers:
    layer.trainable = False

top_model = Sequential()
top_model.add(Flatten(input_shape=base_model.output_shape[1:]))
top_model.add(Dense(256, activation='relu'))
top_model.add(Dense(1, activation='sigmoid'))

model = Model(inputs=base_model.input, outputs=top_model(base_model.output))
model.compile(loss='binary_crossentropy', optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              metrics=['accuracy'])

train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1. / 255)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary', shuffle=False)

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size,
    verbose=2, workers=12)

score = model.evaluate_generator(validation_generator, nb_validation_samples // batch_size, workers=12)

scores = model.predict_generator(validation_generator, nb_validation_samples // batch_size, workers=12)

# Count predictions that match the class implied by each filename
# (shuffle=False, so filenames should line up with the prediction order)
correct = 0
for i, n in enumerate(validation_generator.filenames):
    if n.startswith("cats") and scores[i][0] <= 0.5:
        correct += 1
    if n.startswith("dogs") and scores[i][0] > 0.5:
        correct += 1

print("Correct:", correct, " Total: ", len(validation_generator.filenames))
print("Loss: ", score[0], "Accuracy: ", score[1])

With this, I get inconsistent validation accuracy results. For example, predict_generator classifies 640 out of 800 samples (80%) correctly, whereas evaluate_generator reports an accuracy of 95%. Someone in #3477 suggests removing the rescale=1. / 255 parameter from the validation generator; with that change I get 365/800 ≈ 46% from the manual count and 89% from evaluate_generator.
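
As an additional cross-check (a sketch on my side, not part of the example above), the generator's own labels can be compared against the thresholded predictions, which rules out any filename-parsing mistakes. Since shuffle=False, validation_generator.classes should be aligned with the order in which predict_generator yields predictions:

import numpy as np

# Compare thresholded predictions against the generator's ground-truth labels
y_true = validation_generator.classes[:len(scores)]
y_pred = (scores[:, 0] > 0.5).astype(int)
print("Accuracy vs. generator labels:", np.mean(y_pred == y_true))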

Is there something wrong with my evaluation, or is this a bug? There are many similar issues (e.g. #3849, #6245) where the reported accuracy (during training and afterwards) doesn't match the actual predictions. Could someone experienced shed some light on this problem? Thanks.
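
For what it's worth, here is the variant I would try next to rule out ordering effects from the multi-worker queue (untested sketch; evaluate_generator and predict_generator default to workers=1):

# Reset the iterator so both calls start from the first batch, and use a
# single worker so batches arrive in a deterministic order.
validation_generator.reset()
score = model.evaluate_generator(validation_generator,
                                 nb_validation_samples // batch_size)

validation_generator.reset()
scores = model.predict_generator(validation_generator,
                                 nb_validation_samples // batch_size)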
