Lock remains in HMS if HiveTableOperations gets killed (direct process shutdown - no signals) after lock is acquired  #2301

@pPanda-beta

Although unlock is kept inside the finally block,

cleanupMetadataAndUnlock(threw, newMetadataLocation, lockId);

the lock can still be held FOREVER.

Steps to reproduce :

  1. Start a new commit
  2. Let Iceberg acquire the lock
  3. Kill the job (signal 9 / 19) or disconnect the network. (See below for the actual scenarios where this happened.)
  4. Restart the job with a fresh new commit
  5. That's it, it will never be able to acquire a new lock on the same resource again

Let's make it more spicy

Consider the table create operation 🥰, i.e. the actual "table" does not exist yet, and the job is killed before the table is created.

Steps:

  1. Start creating the table
  2. Let it acquire the lock on the non-existent table resource (HMS allows that)
  3. Kill the job before the table is even created
  4. Go cry, because you cannot remove this lock manually using beeline / hive-cli, since the table doesn't exist.

What is this "Killed" thingy?

Well, here at Rapido we use GCP preemptible VMs for low cost. On this kind of infra, a VM may be preempted at any point in time with a very short notice period (15 sec).
Think of it as unplugging the network cable. We never get a chance to catch an InterruptedException and run cleanup operations.

Why has Hive never faced this issue?

Well, Hive expires its locks, which works as a TTL. So the system recovers eventually.

Disaster recovery steps that we have tried so far:

  1. Ensure no jobs are running for that table (which may not have been created yet)
  2. Use the Java API to connect to the Hive metastore
  3. Delete all locks for that table

Suggestions

  1. Before acquiring a lock, delete all locks on the resource that have not received a heartbeat in the last 'x' minutes (configurable).
  2. After acquiring the lock, keep sending heartbeats to HMS from a separate daemon thread. This keeps the lock from being treated as stale while the writer is alive, so concurrent writes stay protected.
  3. After the operation finishes (success or failure), unlock and cancel the heartbeat thread.
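
The three suggestions above can be sketched roughly as follows. This is only an illustration under stated assumptions: `FakeLockStore` is a hypothetical in-memory stand-in for the HMS lock table (it is not the metastore client API), and `commitWithLock`, `staleAfterMillis` and `heartbeatEveryMillis` are made-up names, not existing Iceberg code or configuration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical in-memory stand-in for the HMS lock table (NOT the real metastore API).
class FakeLockStore {
    private static final class Held {
        final String resource;
        long lastHeartbeatMillis;
        Held(String resource, long nowMillis) {
            this.resource = resource;
            this.lastHeartbeatMillis = nowMillis;
        }
    }

    private final Map<Long, Held> locks = new HashMap<>();
    private long nextId = 1;

    // Returns a new lock id, or -1 if the resource is already locked.
    synchronized long tryLock(String resource, long nowMillis) {
        for (Held h : locks.values()) {
            if (h.resource.equals(resource)) {
                return -1;
            }
        }
        long id = nextId++;
        locks.put(id, new Held(resource, nowMillis));
        return id;
    }

    synchronized void heartbeat(long lockId, long nowMillis) {
        Held h = locks.get(lockId);
        if (h != null) {
            h.lastHeartbeatMillis = nowMillis;
        }
    }

    synchronized void unlock(long lockId) {
        locks.remove(lockId);
    }

    // Suggestion 1: drop locks on the resource whose last heartbeat is older than maxAgeMillis.
    synchronized int expireStale(String resource, long nowMillis, long maxAgeMillis) {
        int before = locks.size();
        locks.values().removeIf(h -> h.resource.equals(resource)
                && nowMillis - h.lastHeartbeatMillis > maxAgeMillis);
        return before - locks.size();
    }
}

class HeartbeatLockSketch {
    // Runs the commit under a lock; returns false if a live holder still owns the resource.
    static boolean commitWithLock(FakeLockStore store, String resource,
                                  long staleAfterMillis, long heartbeatEveryMillis,
                                  Runnable commit) {
        // Suggestion 1: clear locks left behind by holders that stopped heartbeating.
        store.expireStale(resource, System.currentTimeMillis(), staleAfterMillis);
        long lockId = store.tryLock(resource, System.currentTimeMillis());
        if (lockId < 0) {
            return false;
        }
        // Suggestion 2: heartbeat from a daemon thread so it never blocks JVM exit.
        ScheduledExecutorService heartbeats = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "lock-heartbeat");
            t.setDaemon(true);
            return t;
        });
        heartbeats.scheduleAtFixedRate(
                () -> store.heartbeat(lockId, System.currentTimeMillis()),
                heartbeatEveryMillis, heartbeatEveryMillis, TimeUnit.MILLISECONDS);
        try {
            commit.run();
            return true;
        } finally {
            // Suggestion 3: stop heartbeating and unlock on success or failure.
            heartbeats.shutdownNow();
            store.unlock(lockId);
        }
    }
}
```

If the process is killed between `tryLock` and the `finally` block, the heartbeats stop with it, so the next writer's `expireStale` call removes the orphaned lock once `staleAfterMillis` has passed. The staleness threshold would need to be comfortably larger than the heartbeat interval, otherwise a slow but live writer could lose its lock.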
