Batch Replication
The AIStor Batch Framework allows you to create, manage, monitor, and execute jobs using a YAML-formatted job definition file (a “batch file”). Batch jobs run directly on the AIStor deployment to take advantage of the server-side processing power without the constraints of the local machine where you run the AIStor Client.
The replicate batch job replicates objects from one AIStor deployment (the source deployment) to another AIStor deployment (the target deployment).
The deployment specified as the alias becomes the 'local' deployment for the purposes of replication.
Batch replication between AIStor deployments has the following advantages over using mc mirror:
- Removes the client-to-cluster network as a potential bottleneck.
- A user needs only the permission to start a batch job; the job itself runs entirely server-side on the cluster.
- The job provides for retry attempts in the event that objects do not replicate.
- Batch jobs are one-time, curated processes, allowing for fine-grained control over replication.
- (AIStor to AIStor only) The replication process copies object versions from source to target.
Run batch replication with multiple workers in parallel by specifying the MINIO_BATCH_REPLICATION_WORKERS environment variable.
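For example, to run batch replication jobs with four parallel workers, set the variable in the environment of the AIStor server process before startup; the value 4 here is illustrative:

export MINIO_BATCH_REPLICATION_WORKERS=4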
The other deployment can be either another AIStor deployment or any S3-compatible location that uses a realtime storage class. Use the filtering options in the replication YAML file to exclude objects stored in locations that require rehydration or other restoration methods before they can serve the requested object. Batch replication to these types of remotes uses mc mirror behavior.
Behavior
Access Control and Requirements
Batch replication has access and permission requirements similar to those of bucket replication.
The credentials for the “source” deployment must have a policy similar to the following:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "admin:SetBucketTarget",
        "admin:GetBucketTarget"
      ],
      "Effect": "Allow",
      "Sid": "EnableRemoteBucketConfiguration"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetReplicationConfiguration",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:GetBucketLocation",
        "s3:GetBucketVersioning",
        "s3:GetObjectRetention",
        "s3:GetObjectLegalHold",
        "s3:PutReplicationConfiguration"
      ],
      "Resource": [
        "arn:aws:s3:::*"
      ],
      "Sid": "EnableReplicationRuleConfiguration"
    }
  ]
}
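You can create and attach this policy with mc admin policy. A minimal sketch, where the alias myminio-source, the file source-policy.json, and the user replication-user are all placeholder names:

mc admin policy create myminio-source batch-replication-source source-policy.json
mc admin policy attach myminio-source batch-replication-source --user replication-user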
The credentials for the “remote” deployment must have a policy similar to the following:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetReplicationConfiguration",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:GetBucketLocation",
        "s3:GetBucketVersioning",
        "s3:GetBucketObjectLockConfiguration",
        "s3:GetEncryptionConfiguration"
      ],
      "Resource": [
        "arn:aws:s3:::*"
      ],
      "Sid": "EnableReplicationOnBucket"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetReplicationConfiguration",
        "s3:ReplicateTags",
        "s3:AbortMultipartUpload",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:GetObjectVersionTagging",
        "s3:PutObject",
        "s3:PutObjectRetention",
        "s3:PutBucketObjectLockConfiguration",
        "s3:PutObjectLegalHold",
        "s3:DeleteObject",
        "s3:ReplicateObject",
        "s3:ReplicateDelete"
      ],
      "Resource": [
        "arn:aws:s3:::*"
      ],
      "Sid": "EnableReplicatingDataIntoBucket"
    }
  ]
}
See mc admin user, mc admin user svcacct, and mc admin policy for more complete documentation on adding users, access keys, and policies to an AIStor deployment.
AIStor deployments configured for Active Directory/LDAP or OpenID Connect user management can instead create dedicated access keys for supporting batch replication.
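For example, a sketch of creating a dedicated access key bound to the "remote" policy above with mc admin user svcacct, where the alias myminio-remote, the user identity replication-user, and both keys are placeholders (the exact workflow depends on your identity provider):

mc admin user svcacct add myminio-remote replication-user \
  --access-key "REPLICATION-ACCESS-KEY" \
  --secret-key "REPLICATION-SECRET-KEY" \
  --policy remote-policy.json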
Filter Replication Targets
The batch job definition file can limit replication by bucket, prefix, and/or filters so that only certain objects replicate. The credentials you provide in the YAML for the source and target destinations may further restrict which objects and buckets the replication process can access.
You can replicate from a remote AIStor deployment to the local deployment that runs the batch job.
For example, you can use a batch job to perform a one-time replication sync to push objects from a bucket on a local deployment at minio-local/invoices/ to a bucket on a remote deployment at minio-remote/invoices. You can also pull objects from the remote deployment at minio-remote/invoices to the local deployment at minio-local/invoices.
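A minimal sketch of the push case, assuming minio-local is the alias passed to mc batch start (so its endpoint and credentials are omitted) and the minio-remote endpoint and credentials shown are placeholders:

replicate:
  apiVersion: v1
  source:
    type: minio
    bucket: invoices
  target:
    type: minio
    bucket: invoices
    endpoint: "https://minio-remote.example.net:9000"
    credentials:
      accessKey: REMOTE-ACCESS-KEY
      secretKey: REMOTE-SECRET-KEY

To pull instead, reverse the stanzas: define the remote endpoint and credentials under source and leave target as the local deployment.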
Small File Optimization
Batch replication by default automatically batches and compresses objects smaller than 5MiB to efficiently transfer data between the source and remote. The remote AIStor deployment can check and immediately apply lifecycle management tiering rules to batched objects. The functionality resembles that offered by S3 Snowball Edge small file batching.
You can modify the compression settings in the replicate job configuration.
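For example, this source-side snowball stanza (shown in context in the reference below) enables compression; the values are illustrative:

snowball:
  disable: false      # keep snowball batching enabled
  compress: true      # create S2/Snappy-compressed archives
  smallerThan: 5MiB   # batch objects smaller than this size
  batch: 100          # up to this many objects per archive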
Replicate Batch Job Reference
The YAML must define the source and target deployments. If the source deployment is remote, then the target deployment must be local.
Optionally, the YAML can also define flags to filter which objects replicate, send notifications for the job, or define retry attempts for the job.
Use mc batch generate to create a basic replicate batch job for further customization.
For the local deployment, do not specify the endpoint or credentials. Either delete or comment out those lines in the source or the target section, depending on which one is the local deployment.
replicate:
  apiVersion: v1
  # source of the objects to be replicated
  source:
    type: TYPE # valid values are "s3" or "minio"
    bucket: BUCKET
    prefix: PREFIX # 'PREFIX' is optional
    # If your source is the 'local' alias specified to 'mc batch start', then the 'endpoint' and 'credentials' fields are optional and can be omitted
    # Either the 'source' or the 'target' *must* be the "local" deployment
    endpoint: "http[s]://HOSTNAME:PORT"
    # path: "on|off|auto" # "on" enables path-style bucket lookup. "off" enables virtual host (DNS)-style bucket lookup. Defaults to "auto"
    credentials:
      accessKey: ACCESS-KEY # Required
      secretKey: SECRET-KEY # Required
      # sessionToken: SESSION-TOKEN # Optional, only available when rotating credentials are used
    snowball: # automatically activated if the source is local
      disable: false # optionally turn off snowball archive transfer
      batch: 100 # up to this many objects per archive
      inmemory: true # indicates whether the archive must be staged locally or in memory
      compress: false # S2/Snappy compressed archive
      smallerThan: 5MiB # create archive for all objects smaller than 5MiB
      skipErrs: false # skip any source-side read() errors

  # target where the objects must be replicated
  target:
    type: TYPE # valid values are "s3" or "minio"
    bucket: BUCKET
    prefix: PREFIX # 'PREFIX' is optional
    # If your target is the 'local' alias specified to 'mc batch start', then the 'endpoint' and 'credentials' fields are optional and can be omitted
    # Either the 'source' or the 'target' *must* be the "local" deployment
    endpoint: "http[s]://HOSTNAME:PORT"
    # path: "on|off|auto" # "on" enables path-style bucket lookup. "off" enables virtual host (DNS)-style bucket lookup. Defaults to "auto"
    credentials:
      accessKey: ACCESS-KEY
      secretKey: SECRET-KEY
      # sessionToken: SESSION-TOKEN # Optional, only available when rotating credentials are used

  # NOTE: All flags are optional
  # - filtering criteria only apply to source objects that match the criteria
  # - configurable notification endpoints
  # - configurable retries for the job (each retry skips objects that previously replicated successfully)
  flags:
    filter:
      newerThan: "7d" # match objects newer than this value (e.g. 7d10h31s)
      olderThan: "7d" # match objects older than this value (e.g. 7d10h31s)
      createdAfter: "date" # match objects created after "date"
      createdBefore: "date" # match objects created before "date"
      ## NOTE: tags are not supported when "source" is remote.
      # tags:
      #   - key: "name"
      #     value: "pick*" # match objects with tag 'name', with all values starting with 'pick'
      # metadata:
      #   - key: "content-type"
      #     value: "image/*" # match objects with 'content-type', with all values starting with 'image/'
    notify:
      endpoint: "https://notify.endpoint" # notification endpoint to receive job status events
      token: "Bearer xxxxx" # optional authentication token for the notification endpoint
    retry:
      attempts: 10 # number of retries for the job before giving up
      delay: "500ms" # least amount of delay between each retry