Object Storage Usage¶
About S3¶
S3, or Simple Storage Service, is a cloud storage service developed by Amazon Web Services (AWS). It allows users to store and retrieve any amount of data at any time, from anywhere on the web.
Here are some key points:
- Object Storage: S3 uses an object storage model, where each file is treated as an object and stored in a "bucket" (a storage container).
- Scalability: S3 is designed to scale easily, allowing you to store anything from small amounts to very large quantities of data without any issues.
To authenticate, a user needs an ACCESS_ID
and a SECRET_ID
. The Mésocentre
administrator team will provide you with this information.
Objects¶
Object storage does not store files in a global metadata hierarchy. For example, creating a directory makes no sense on an object storage infrastructure, as there is no data associated with it.
Instead, every file carries its own metadata; one of these is its name, which
can contain /
characters to map a classic hierarchy view. On the datastore this is not considered
a directory, but some CLIs use it to simulate a hierarchy.
Buckets¶
Buckets play a key role in object storage data management. A bucket is a container which stores related data. That relation can be:
- an experiment run name, under which you put all data related to that specific run,
for example
s3://run-20241001
for the run of October 1st, 2024 - a service, under which you put all data related to that specific service, for example
s3://website.domain.org
where you will put all data related to the website http://website.domain.org
It is up to you to choose your bucket organisation and how many buckets you want to have.
s5cmd¶
s5cmd is a powerful S3 CLI which provides basic tools to manage your data, plus some advanced features to build a dataflow.
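One of those dataflow features is the s5cmd run subcommand, which reads one command per line from a file and executes them in parallel. A minimal sketch (the file name commands.txt and the paths are just examples):

```shell
# Write one s5cmd command per line into a batch file
cat > commands.txt <<EOF
cp results/run1.txt s3://run-20241001/run1.txt
cp results/run2.txt s3://run-20241001/run2.txt
EOF

# Execute all commands in the file, in parallel
s5cmd run commands.txt
```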
environment variable credentials¶
s5cmd
can use a credentials file, but this is not very convenient: you
have to pass the --credentials-file
option, and the S3 endpoint cannot be stored in that
file. We recommend using environment variables instead.
$ cat > ceph-meso.sh <<EOF
export AWS_ACCESS_KEY_ID='<your-access-key-id>'
export AWS_SECRET_ACCESS_KEY='<your-secret-access-key>'
export S3_ENDPOINT_URL='<s3-endpoint-url>'
EOF
$ source ./ceph-meso.sh
create a bucket¶
Let's say for this tutorial that you want one bucket per experiment. I have
two experiments in my lab, foo
and bar
. I will create two buckets:
$ s5cmd mb s3://foo
mb s3://foo
$ s5cmd mb s3://bar
mb s3://bar
list available buckets¶
$ s5cmd ls
2024/10/09 11:28:31 s3://bar
2024/10/09 11:28:28 s3://foo
upload a file¶
As described above, object storage is different from POSIX storage:
files are just dropped into a bucket with their own metadata. Fortunately, s5cmd
manages a POSIX-like hierarchy by interpreting /
in file names as a directory separator.
For this tutorial, imagine you want two directory-like prefixes in your bucket: one for input parameters, the other for output results. Input could be the scientific platform parameters used to run the experiment. Output could be the results of your experiment.
Let's upload the input and output of today's run for experiment bar
$ touch input.txt
$ s5cmd cp input.txt s3://bar/input/$(date +%Y%m%d).txt
cp input.txt s3://bar/input/20241009.txt
$ touch output.txt
$ s5cmd cp output.txt s3://bar/output/$(date +%Y%m%d).txt
cp output.txt s3://bar/output/20241009.txt
Note
It is important to understand that the object's full name is input/20241009.txt
. You did not have to create a directory.
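If your inputs already live in a local directory, s5cmd can also upload several files in one command using a wildcard; each file keeps the destination prefix in its object name (the local directory params/ is just an example):

```shell
# Upload every file under params/ to the input/ prefix of the bucket
s5cmd cp 'params/*' s3://bar/input/
```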
listing the content of a bucket¶
Well, now let's see what's inside your bucket
$ s5cmd ls s3://bar/
DIR input/
DIR output/
WHAT!! Directories??
s5cmd
gives you a POSIX-like representation of your bucket. The directories
exist only because at least one object is named input/xxx
and another is named
output/xxx
. Of course, you can list what's inside the input directory with
the command s5cmd ls s3://bar/input/
.
You can also list the whole content of the bucket with
$ s5cmd ls s3://bar/*
$
removing a file¶
Ok, now what happens if you remove a file
$ s5cmd rm s3://bar/input/20241009.txt
rm s3://bar/input/20241009.txt
$ s5cmd ls s3://bar/
DIR output/
As you can see, the input directory has disappeared, simply because there is
no longer any object named input/xxx
.
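The rm command also accepts wildcards, so you can empty a whole prefix (or the entire bucket) in one command. Be careful: this is not reversible.

```shell
# Remove every object whose name starts with output/
s5cmd rm 's3://bar/output/*'
```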
restic¶
Restic is a simple backup tool. Its S3 backend support provides users with a simple way to back up local data to Ceph@Mésocentre.
env variable¶
restic
uses environment variables for its configuration; here is an example of
a basic configuration that you can source every time you want to use restic
.
By default, restic
only reads one file at a time. If you have a lot of
small files, your filesystem will probably not provide enough bandwidth to minimize
the backup duration; you can improve it with the RESTIC_READ_CONCURRENCY
option, which makes restic
read multiple files at the same time.
$ cat > ~/.config/restic << EOF
export AWS_ACCESS_KEY_ID='<your-access-key-id>'
export AWS_SECRET_ACCESS_KEY='<your-secret-access-key>'
export RESTIC_REPOSITORY=s3:<s3-endpoint>/restic
export RESTIC_READ_CONCURRENCY=20
EOF
$ source ~/.config/restic
Note
You can add the export RESTIC_PASSWORD='<password>'
line to your configuration
file :warning: only if you consider that file secure enough :warning:.
Initialize bucket¶
To use restic
you have to create a specific bucket for each directory you want
to back up. restic
will create a secure bucket to store your backups.
Note
The bucket will NOT be human-readable: restic
uses its own way of
organizing your data.
$ restic init
enter password for new repository:
enter password again:
created restic repository 84acbf7a6f at s3:http://<endpoint>/restic
Please note that knowledge of your password is required to access
the repository. Losing your password means that your data is
irrecoverably lost.
Backup¶
Starting a backup is just a one-line command, as follows. If you have registered
RESTIC_PASSWORD
, you can run it from a cron job to avoid starting it manually.
$ restic backup ${directory}
enter password for repository:
repository 84acbf7a opened (version 2, compression level auto)
created new cache in ${directory}
no parent snapshot found, will read all files
[0:00] 0 index files loaded
Files: 89 new, 0 changed, 0 unmodified
Dirs: 57 new, 0 changed, 0 unmodified
Added to the repository: 190.764 KiB (109.085 KiB stored)
processed 89 files, 124.248 KiB in 0:00
snapshot 85d90fdb saved
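If RESTIC_PASSWORD is stored in the configuration file, an unattended backup can be scheduled with cron. A sketch of a crontab entry (the 02:00 schedule and the ~/data path are just examples):

```shell
# crontab -e
# Every day at 02:00: load the restic environment, then back up the data directory
0 2 * * * . ~/.config/restic && restic backup ~/data
```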
Snapshots¶
Every time you start a backup, restic
creates a unique snapshot. A snapshot
references all files that existed at the time of the snapshot, even those that
were not modified.
listing snapshots¶
$ restic snapshots
enter password for repository:
repository 84acbf7a opened (version 2, compression level auto)
ID Time Host Tags Paths Size
----------------------------------------------------------
85d90fdb <date> <hostname> <dir> 124.248 KiB
----------------------------------------------------------
1 snapshots
listing data in a snapshot¶
$ restic ls 85d90fdb
enter password for repository:
repository 84acbf7a opened (version 2, compression level auto)
[0:00] 100.00% 1 / 1 index files loaded
snapshot 85d90fdb of [<dir>] at [...]:
[...]
comparing snapshots¶
$ restic diff 6937cc80 9f69a047
comparing snapshot 6937cc80 to 9f69a047:
M /restic/cmd_diff.go
+ /restic/foo
M /restic/restic
Files: 0 new, 0 removed, 2 changed
Dirs: 1 new, 0 removed
Others: 0 new, 0 removed
Data Blobs: 14 new, 15 removed
Tree Blobs: 2 new, 1 removed
Added: 16.403 MiB
Removed: 16.402 MiB
You can also compare the content of a specific directory between snapshots:
restic diff 6937cc80:/AIA/2024/11/ 9f69a047:/AIA/2024/11/
comparing snapshot 6937cc80 to 9f69a047:
[...]
+ /22/94/2024_11_22__08_59_11_122__SDO_AIA_AIA_94.jp2
+ /22/94/2024_11_22__08_59_47_122__SDO_AIA_AIA_94.jp2
+ /22/94/2024_11_22__09_00_23_122__SDO_AIA_AIA_94.jp2
+ /22/94/2024_11_22__09_00_59_122__SDO_AIA_AIA_94.jp2
Files: 7172 new, 0 removed, 0 changed
Dirs: 0 new, 0 removed
Others: 0 new, 0 removed
Data Blobs: 10071 new, 0 removed
Tree Blobs: 12 new, 12 removed
Added: 7.202 GiB
Removed: 275.983 KiB
restoring data from a snapshot¶
$ restic restore <tag> --target <destination>
enter password for repository:
repository 84acbf7a opened (version 2, compression level auto)
[0:00] 100.00% 1 / 1 index files loaded
snapshot 85d90fdb of [<dir>] at [...]:
[...]
<tag>
can be the ID of a snapshot, or latest for the most recent snapshot
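You do not have to restore a whole snapshot: restic restore accepts the --include option to restrict the restore to a sub-path (the /input path below is just an example):

```shell
# Restore only the input/ sub-directory of the latest snapshot
restic restore latest --target /tmp/restore --include /input
```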
Removing old snapshots¶
restic
does NOT delete snapshots on its own. To avoid storage saturation, you can remove
old snapshots by asking restic
to keep only the last <x>
snapshots. Only data that is no longer referenced by any remaining
snapshot will actually be deleted.
$ restic forget --keep-last <nb> --prune
[...]
where <nb>
is the number of snapshots you want to keep
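Before actually pruning, you can preview what would be removed with the --dry-run option, which prints the planned actions without deleting anything:

```shell
# Show which snapshots would be forgotten, without deleting anything
restic forget --keep-last <nb> --dry-run
```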