- When to use batched background migrations
- Isolation
- Accessing data for multiple databases
- Idempotence
- Batched background migrations for EE-only features
- Queueing
- Cleaning up
- Job arguments
- Additional filters
- Example
- Testing
- Best practices
- Additional tips and strategies
- Legacy background migrations
Batched background migrations
Batched Background Migrations should be used to perform data migrations whenever a migration exceeds the time limits in our guidelines. For example, you can use batched background migrations to migrate data that’s stored in a single JSON column to a separate table instead.
When to use batched background migrations
Use a batched background migration when you migrate data in tables containing so many rows that the process would exceed the time limits in our guidelines if performed using a regular Rails migration.
- Batched background migrations should be used when migrating data in high-traffic tables.
- Batched background migrations may also be used when executing numerous single-row queries for every item on a large dataset. Typically, for single-record patterns, runtime is largely dependent on the size of the dataset. Split the dataset accordingly, and put it into background migrations.
- Don’t use batched background migrations to perform schema migrations.
Background migrations can help when:
- Migrating events from one table to multiple separate tables.
- Populating one column based on JSON stored in another column.
- Migrating data that depends on the output of external services. (For example, an API.)
Isolation
Batched background migrations must be isolated and cannot use application code (for example,
models defined in app/models
except the ApplicationRecord
classes).
Because these migrations can take a long time to run, it’s possible
for new versions to deploy while the migrations are still running.
Accessing data for multiple databases
Background Migration contrary to regular migrations does have access to multiple databases
and can be used to efficiently access and update data across them. To properly indicate
a database to be used it is desired to create ActiveRecord model inline the migration code.
Such model should use a correct ApplicationRecord
depending on which database the table is located. As such usage of ActiveRecord::Base
is disallowed as it does not describe a explicitly database to be used to access given table.
# good
class Gitlab::BackgroundMigration::ExtractIntegrationsUrl
class Project < ::ApplicationRecord
self.table_name = 'projects'
end
class Build < ::Ci::ApplicationRecord
self.table_name = 'ci_builds'
end
end
# bad
class Gitlab::BackgroundMigration::ExtractIntegrationsUrl
class Project < ActiveRecord::Base
self.table_name = 'projects'
end
class Build < ActiveRecord::Base
self.table_name = 'ci_builds'
end
end
Similarly the usage of ActiveRecord::Base.connection
is disallowed and needs to be
replaced preferably with the usage of model connection.
# good
Project.connection.execute("SELECT * FROM projects")
# acceptable
ApplicationRecord.connection.execute("SELECT * FROM projects")
# bad
ActiveRecord::Base.connection.execute("SELECT * FROM projects")
Idempotence
Batched background migrations are executed in a context of a Sidekiq process. The usual Sidekiq rules apply, especially the rule that jobs should be small and idempotent. Make sure that in case that your migration job is retried, data integrity is guaranteed.
See Sidekiq best practices guidelines for more details.
Batched background migrations for EE-only features
All the background migration classes for EE-only features should be present in GitLab FOSS. For this purpose, create an empty class for GitLab FOSS, and extend it for GitLab EE as explained in the guidelines for implementing Enterprise Edition features.
Batched Background migrations are simple classes that define a perform
method. A
Sidekiq worker then executes such a class, passing any arguments to it. All
migration classes must be defined in the namespace
Gitlab::BackgroundMigration
. Place the files in the directory
lib/gitlab/background_migration/
.
Queueing
Queueing a batched background migration should be done in a post-deployment
migration. Use this queue_batched_background_migration
example, queueing the
migration to be executed in batches. Replace the class name and arguments with the values
from your migration:
queue_batched_background_migration(
JOB_CLASS_NAME,
TABLE_NAME,
JOB_ARGUMENTS,
JOB_INTERVAL
)
JOB_CLASS_NAME
.Make sure the newly-created data is either migrated, or saved in both the old and new version upon creation. Removals in turn can be handled by defining foreign keys with cascading deletes.
Requeuing batched background migrations
If one of the batched background migrations contains a bug that is fixed in a patch release, you must requeue the batched background migration so the migration repeats on systems that already performed the initial migration.
When you requeue the batched background migration, turn the original
queuing into a no-op by clearing up the #up
and #down
methods of the
migration performing the requeuing. Otherwise, the batched background migration is
queued multiple times on systems that are upgrading multiple patch releases at
once.
When you start the second post-deployment migration, delete the previously batched migration with the provided code:
delete_batched_background_migration(MIGRATION_NAME, TABLE_NAME, COLUMN, JOB_ARGUMENTS)
Cleaning up
Because background migrations can take a long time, you can’t immediately clean things up after queueing them. For example, you can’t drop a column used in the migration process, as jobs would fail. You must add a separate post-deployment migration in a future release that finishes any remaining jobs before cleaning things up. (For example, removing a column.)
To migrate the data from column foo
(containing a big JSON blob) to column bar
(containing a string), you would:
- Release A:
- Create a migration class that performs the migration for a row with a given ID.
- Update new rows using one of these techniques:
- Create a new trigger for simple copy operations that don’t need application logic.
- Handle this operation in the model/service as the records are created or updated.
- Create a new custom background job that updates the records.
- Queue the batched background migration for all existing rows in a post-deployment migration.
- Release B:
- Add a post-deployment migration that checks if the batched background migration is completed.
- Deploy code so that the application starts using the new column and stops to update new records.
- Remove the old column.
Bump to the import/export version may be required, if importing a project from a prior version of GitLab requires the data to be in the new format.
Job arguments
BatchedMigrationJob
provides the job_arguments
helper method for job classes to define the job arguments they need.
Batched migrations scheduled with queue_batched_background_migration
must use the helper to define the job arguments:
queue_batched_background_migration(
'CopyColumnUsingBackgroundMigrationJob',
TABLE_NAME,
'name', 'name_convert_to_text',
job_interval: DELAY_INTERVAL
)
queue_batched_background_migration
raises an error.In this example, copy_from
returns name
, and copy_to
returns name_convert_to_text
:
class CopyColumnUsingBackgroundMigrationJob < BatchedMigrationJob
job_arguments :copy_from, :copy_to
operation_name :update_all
def perform
from_column = connection.quote_column_name(copy_from)
to_column = connection.quote_column_name(copy_to)
assignment_clause = "#{to_column} = #{from_column}"
each_sub_batch do |relation|
relation.update_all(assignment_clause)
end
end
end
Additional filters
By default, when creating background jobs to perform the migration, batched background migrations
iterate over the full specified table. This iteration is done using the
PrimaryKeyBatchingStrategy
. If the table has 1000 records
and the batch size is 100, the work is batched into 10 jobs. For illustrative purposes,
EachBatch
is used like this:
# PrimaryKeyBatchingStrategy
Namespace.each_batch(of: 100) do |relation|
relation.where(type: nil).update_all(type: 'User') # this happens in each background job
end
In some cases, only a subset of records must be examined. If only 10% of the 1000 records need examination, apply a filter to the initial relation when the jobs are created:
Namespace.where(type: nil).each_batch(of: 100) do |relation|
relation.update_all(type: 'User')
end
In the first example, we don’t know how many records will be updated in each batch. In the second (filtered) example, we know exactly 100 will be updated with each batch.
BatchedMigrationJob
provides a scope_to
helper method to apply additional filters and achieve this:
-
Create a new migration job class that inherits from
BatchedMigrationJob
and defines the additional filter:class BackfillNamespaceType < BatchedMigrationJob scope_to ->(relation) { relation.where(type: nil) } operation_name :update_all feature_category :source_code_management def perform each_sub_batch do |sub_batch| sub_batch.update_all(type: 'User') end end end
For EE migrations that definescope_to
, ensure the module extendsActiveSupport::Concern
. Otherwise, records are processed without taking the scope into consideration. -
In the post-deployment migration, enqueue the batched background migration:
class BackfillNamespaceType < Gitlab::Database::Migration[2.1] MIGRATION = 'BackfillNamespaceType' DELAY_INTERVAL = 2.minutes restrict_gitlab_migration gitlab_schema: :gitlab_main def up queue_batched_background_migration( MIGRATION, :namespaces, :id, job_interval: DELAY_INTERVAL ) end def down delete_batched_background_migration(MIGRATION, :namespaces, :id, []) end end
EachBatch
performance.
In the example above we need an index on (type, id)
to support the filters. See the EachBatch
documentation for more information.Example
The routes
table has a source_type
field that’s used for a polymorphic relationship.
As part of a database redesign, we’re removing the polymorphic relationship. One step of
the work is migrating data from the source_id
column into a new singular foreign key.
Because we intend to delete old rows later, there’s no need to update them as part of the
background migration.
-
Start by defining our migration class, which should inherit from
Gitlab::BackgroundMigration::BatchedMigrationJob
:class Gitlab::BackgroundMigration::BackfillRouteNamespaceId < BatchedMigrationJob # For illustration purposes, if we were to use a local model we could # define it like below, using an `ApplicationRecord` as the base class # class Route < ::ApplicationRecord # self.table_name = 'routes' # end operation_name :update_all feature_category :source_code_management def perform each_sub_batch( batching_scope: -> (relation) { relation.where("source_type <> 'UnusedType'") } ) do |sub_batch| sub_batch.update_all('namespace_id = source_id') end end end
Job classes must be subclasses ofBatchedMigrationJob
to be correctly handled by the batched migration framework. Any subclass ofBatchedMigrationJob
is initialized with necessary arguments to execute the batch, as well as a connection to the tracking database. -
Create a database migration that adds a new trigger to the database. Example:
class AddTriggerToRoutesToCopySourceIdToNamespaceId < Gitlab::Database::Migration[2.1] FUNCTION_NAME = 'example_function' TRIGGER_NAME = 'example_trigger' def up execute(<<~SQL) CREATE OR REPLACE FUNCTION #{FUNCTION_NAME}() RETURNS trigger LANGUAGE plpgsql AS $$ BEGIN NEW."namespace_id" = NEW."source_id" RETURN NEW; END; $$; CREATE TRIGGER #{TRIGGER_NAME}() AFTER INSERT OR UPDATE ON routes FOR EACH ROW EXECUTE FUNCTION #{FUNCTION_NAME}(); SQL end def down drop_trigger(TRIGGER_NAME, :routes) drop_function(FUNCTION_NAME) end end
-
Create a post-deployment migration that queues the migration for existing data:
class QueueBackfillRoutesNamespaceId < Gitlab::Database::Migration[2.1] MIGRATION = 'BackfillRouteNamespaceId' DELAY_INTERVAL = 2.minutes restrict_gitlab_migration gitlab_schema: :gitlab_main def up queue_batched_background_migration( MIGRATION, :routes, :id, job_interval: DELAY_INTERVAL ) end def down delete_batched_background_migration(MIGRATION, :routes, :id, []) end end
When queuing a batched background migration, you need to restrict the schema to the database where you make the actual changes. In this case, we are updatingroutes
records, so we setrestrict_gitlab_migration gitlab_schema: :gitlab_main
. If, however, you need to perform a CI data migration, you would setrestrict_gitlab_migration gitlab_schema: :gitlab_ci
.After deployment, our application: - Continues using the data as before. - Ensures that both existing and new data are migrated.
-
In the next release, add a database migration to remove the trigger.
class RemoveNamepaceIdTriggerFromRoutes < Gitlab::Database::Migration[2.1] FUNCTION_NAME = 'example_function' TRIGGER_NAME = 'example_trigger' def up drop_trigger(TRIGGER_NAME, :routes) drop_function(FUNCTION_NAME) end def down # Should reverse the trigger and the function in the up method of the migration that added it end end
-
Add a new post-deployment migration that checks that the batched background migration is completed. For example:
class FinalizeBackfillRouteNamespaceId < Gitlab::Database::Migration[2.1] MIGRATION = 'BackfillRouteNamespaceId' disable_ddl_transaction! restrict_gitlab_migration gitlab_schema: :gitlab_main def up ensure_batched_background_migration_is_finished( job_class_name: MIGRATION, table_name: :routes, column_name: :id, job_arguments: [], finalize: true ) end def down # no-op end end
If the batched background migration is not finished, the system will execute the batched background migration inline. If you don’t want to see this behavior, you need to passfinalize: false
.If the application does not depend on the data being 100% migrated (for instance, the data is advisory, and not mission-critical), then you can skip this final step. This step confirms that the migration is completed, and all of the rows were migrated.
After the batched migration is completed, you can safely depend on the
data in routes.namespace_id
being populated.
Batching over non-distinct columns
The default batching strategy provides an efficient way to iterate over primary key columns. However, if you need to iterate over columns where values are not unique, you must use a different batching strategy.
The LooseIndexScanBatchingStrategy
batching strategy uses a special version of EachBatch
to provide efficient and stable iteration over the distinct column values.
This example shows a batched background migration where the issues.project_id
column is used as
the batching column.
Database post-migration:
class ProjectsWithIssuesMigration < Gitlab::Database::Migration[2.1]
MIGRATION = 'BatchProjectsWithIssues'
INTERVAL = 2.minutes
BATCH_SIZE = 5000
SUB_BATCH_SIZE = 500
restrict_gitlab_migration gitlab_schema: :gitlab_main
disable_ddl_transaction!
def up
queue_batched_background_migration(
MIGRATION,
:issues,
:project_id,
job_interval: INTERVAL,
batch_size: BATCH_SIZE,
batch_class_name: 'LooseIndexScanBatchingStrategy', # Override the default batching strategy
sub_batch_size: SUB_BATCH_SIZE
)
end
def down
delete_batched_background_migration(MIGRATION, :issues, :project_id, [])
end
end
Implementing the background migration class:
module Gitlab
module BackgroundMigration
class BatchProjectsWithIssues < Gitlab::BackgroundMigration::BatchedMigrationJob
include Gitlab::Database::DynamicModelHelpers
operation_name :backfill_issues
def perform
distinct_each_batch do |batch|
project_ids = batch.pluck(batch_column)
# do something with the distinct project_ids
end
end
end
end
end
scope_to
are ignored by LooseIndexScanBatchingStrategy
and distinct_each_batch
.Testing
Writing tests is required for:
- The batched background migrations’ queueing migration.
- The batched background migration itself.
- A cleanup migration.
The :migration
and schema: :latest
RSpec tags are automatically set for
background migration specs. Refer to the
Testing Rails migrations
style guide.
Remember that before
and after
RSpec hooks
migrate your database down and up. These hooks can result in other batched background
migrations being called. Using spy
test doubles with
have_received
is encouraged, instead of using regular test doubles, because
your expectations defined in a it
block can conflict with what is
called in RSpec hooks. Refer to issue #35351
for more details.
Best practices
- Know how much data you’re dealing with.
- Make sure the batched background migration jobs are idempotent.
- Confirm the tests you write are not false positives.
- If the data being migrated is critical and cannot be lost, the clean-up migration must also check the final state of the data before completing.
- Discuss the numbers with a database specialist. The migration may add more pressure on DB than you expect. Measure on staging, or ask someone to measure on production.
- Know how much time is required to run the batched background migration.
Additional tips and strategies
ChatOps integration
The batched background migrations framework has ChatOps support. Using ChatOps, GitLab engineers can interact with the batched background migrations present in the system.
List batched background migrations
To list the batched background migrations in the system, run this command:
/chatops run batched_background_migrations list
This command supports the following options:
- Database selection:
-
--database DATABASE_NAME
: Connects to the given database:-
main
: Uses the main database (default). -
ci
: Uses the CI database.
-
-
- Environment selection:
-
--dev
: Uses thedev
environment. -
--staging
: Uses thestaging
environment. -
--staging_ref
: Uses thestaging_ref
environment. -
--production
: Uses theproduction
environment (default).
-
Output example:
created_at
(DESC).Monitor the progress and status of a batched background migration
To see the status and progress of a specific batched background migration, run this command:
/chatops run batched_background_migrations status MIGRATION_ID
This command supports the following options:
- Database selection:
-
--database DATABASE_NAME
: Connects to the given database:-
main
: Uses the main database (default) -
ci
: Uses the CI database
-
-
- Environment selection:
-
--dev
: Uses thedev
environment. -
--staging
: Uses thestaging
environment. -
--staging_ref
: Uses thestaging_ref
environment. -
--production
: Uses theproduction
environment (default).
-
Output example:
Progress
represents the percentage of the background migration that has been completed.
Definitions of the batched background migration states:
-
Active: Either:
- Ready to be picked by the runner.
- Running batched jobs.
- Finalizing: Running batched jobs.
- Failed: Failed batched background migration.
- Finished: Completed batched background migration.
- Paused: Not visible to the runner.
Pause a batched background migration
If you want to pause a batched background migration, you need to run the following command:
/chatops run batched_background_migrations pause MIGRATION_ID
This command supports the following options:
- Database selection:
-
--database DATABASE_NAME
: Connects to the given database:-
main
: Uses the main database (default). -
ci
: Uses the CI database.
-
-
- Environment selection:
-
--dev
: Uses thedev
environment. -
--staging
: Uses thestaging
environment. -
--staging_ref
: Uses thestaging_ref
environment. -
--production
: Uses theproduction
environment (default).
-
Output example:
active
batched background migrations.Resume a batched background migration
If you want to resume a batched background migration, you need to run the following command:
/chatops run batched_background_migrations resume MIGRATION_ID
This command supports the following options:
- Database selection:
-
--database DATABASE_NAME
: Connects to the given database:-
main
: Uses the main database (default). -
ci
: Uses the CI database.
-
-
- Environment selection:
-
--dev
: Uses thedev
environment. -
--staging
: Uses thestaging
environment. -
--staging_ref
: Uses thestaging_ref
environment. -
--production
: Uses theproduction
environment (default).
-
Output example:
active
batched background migrationsViewing failure error logs
You can view failures in two ways:
- Via GitLab logs:
-
After running a batched background migration, if any jobs fail, view the logs in Kibana. View the production Sidekiq log and filter for:
json.new_state: failed
json.job_class_name: <Batched Background Migration job class name>
json.job_arguments: <Batched Background Migration job class arguments>
-
Review the
json.exception_class
andjson.exception_message
values to help understand why the jobs failed. -
Remember the retry mechanism. Having a failure does not mean the job failed. Always check the last status of the job.
-
-
Via database:
- Get the batched background migration
CLASS_NAME
. -
Execute the following query in the PostgreSQL console:
SELECT migration.id, migration.job_class_name, transition_logs.exception_class, transition_logs.exception_message FROM batched_background_migrations as migration INNER JOIN batched_background_migration_jobs as jobs ON jobs.batched_background_migration_id = migration.id INNER JOIN batched_background_migration_job_transition_logs as transition_logs ON transition_logs.batched_background_migration_job_id = jobs.id WHERE transition_logs.next_status = '2' AND migration.job_class_name = "CLASS_NAME";
- Get the batched background migration
Adding indexes to support batched background migrations
Sometimes it is necessary to add a new or temporary index to support a batched background migration. To do this, create the index in a post-deployment migration that precedes the post-deployment migration that queues the background migration.
See the documentation for adding database indexes for additional information about some cases that require special attention to allow the index to be used directly after creation.
Legacy background migrations
Batched background migrations replaced the legacy background migrations framework. Check that documentation in reference to any changes involving that framework.