Codebase: s4cmd / fresh-snapshots/main
Refresh patches. Debian Janitor, 1 year, 2 months ago.
2 changed files with 6 additions and 471 deletions.

README.md (deleted, 467 lines)
# s4cmd
### Super S3 command line tool
[![Build Status](https://travis-ci.com/bloomreach/s4cmd.svg?branch=master)](https://travis-ci.com/bloomreach/s4cmd) [![Join the chat at https://gitter.im/bloomreach/s4cmd](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/bloomreach/s4cmd?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) [![Packaging status](https://repology.org/badge/tiny-repos/s4cmd.svg)](https://repology.org/project/s4cmd/versions)

----

**Author**: Chou-han Yang ([@chouhanyang](https://github.com/chouhanyang))

**Current Maintainers**: Debodirno Chandra ([@debodirno](https://github.com/debodirno)) | Naveen Vardhi ([@rozuur](https://github.com/rozuur)) | Navin Pai ([@navinpai](https://github.com/navinpai))

----
## What's New in s4cmd 2.x

- Fully migrated from the old boto 2.x to the new [boto3](http://boto3.readthedocs.io/en/latest/reference/services/s3.html) library, which provides a more reliable and up-to-date S3 backend.
- Support S3 `--API-ServerSideEncryption` along with **36 new API pass-through options**. See the API pass-through options section for the complete list.
- Support batch delete (with the delete_objects API) to delete up to 1000 files with a single call. **100+ times faster** than sequential deletion.
- Support the `S4CMD_OPTS` environment variable for commonly used options, such as `--API-ServerSideEncryption`, across all your s4cmd operations (see the example after this list).
- Support moving files **larger than 5GB** with multipart upload. **20+ times faster** than a sequential move operation when moving large files.
- Support timestamp filtering with the `--last-modified-before` and `--last-modified-after` options for all operations. Human-friendly timestamps are supported, e.g. `--last-modified-before='2 months ago'`.
- Faster upload with lazy evaluation of the md5 hash.
- Listing of large numbers of files with S3 pagination; memory is the only limit.
- The new directory-to-directory `dsync` command is a better, standalone implementation that replaces the old `sync` command, which was built on top of the get/put/mv commands. `--delete-removed` works for all cases, including local to s3, s3 to local, and s3 to s3. The `sync` command preserves its old behavior in this version for compatibility.
- [Support for S3-compatible storage services](https://github.com/bloomreach/s4cmd/issues/52) such as DreamHost and Cloudian using `--endpoint-url` (Community Supported Beta Feature).
- Tested on Python 2.7, 3.6, 3.7, 3.8, 3.9, and nightly.
- Special thanks to [onera.com](http://www.onera.com) for supporting s4cmd.

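For instance, a minimal sketch of using `S4CMD_OPTS` (assuming the variable holds options written exactly as they would appear on the command line; the bucket and file names are illustrative):

```
export S4CMD_OPTS='--API-ServerSideEncryption=AES256'
s4cmd put myfile s3://my-bucket/path/   # encryption option applied automatically
```
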
## Motivation

S4cmd is a command-line utility for accessing
[Amazon S3](http://en.wikipedia.org/wiki/Amazon_S3), inspired by
[s3cmd](http://s3tools.org/s3cmd).

We have used s3cmd heavily for a number of scripted, data-intensive
applications. However, as the need for a variety of small improvements arose, we
created our own implementation, s4cmd. It is intended as an alternative to
s3cmd for enhanced performance and for large files, and with a number of
additional features and fixes that we have found useful.

It strives to be compatible with the most common usage scenarios for s3cmd. It
does not offer exact drop-in compatibility, due to a number of corner cases where
different behavior seems preferable, or for bugfixes.

## Features

S4cmd supports the regular commands you might expect for fetching and storing
files in S3: `ls`, `put`, `get`, `cp`, `mv`, `sync`, `del`, `du`.

The main features that distinguish s4cmd are:

- Simple (less than 1500 lines of code) and implemented in pure Python, based
  on the widely used [Boto3](https://github.com/boto/boto3) library.
- Multi-threaded/multi-connection implementation for enhanced performance on all
  commands. As with many network-intensive applications (like web browsers),
  accessing S3 in a single-threaded way is often significantly less efficient than
  having multiple connections actively transferring data at once. In general, we
  get a 2X boost to upload/download speeds from this.
- Path handling: S3 is not a traditional filesystem with built-in support for
  directory structure: internally, there are only objects, not directories or
  folders. However, most people use S3 in a hierarchical structure, with paths
  separated by slashes, to emulate traditional filesystems. S4cmd follows
  conventions to more closely replicate the behavior of traditional filesystems
  in certain corner cases. For example, "ls" and "cp" work much like in Unix
  shells, to avoid odd surprises. (For examples see compatibility notes below.)
- Wildcard support: Wildcards, including multiple levels of wildcards, like in
  Unix shells, are handled. For example (see the usage note after this list):
  `s3://my-bucket/my-folder/20120512/*/*chunk00?1?`
- Automatic retry: Failed tasks are executed again after a delay.
- Multi-part upload support for files larger than 5GB.
- Handling of MD5s properly with respect to multi-part uploads (for the sordid
  details of this, see below).
- Miscellaneous enhancements and bugfixes:
  - Partial file creation: Avoid creating empty target files if the source does not
    exist. Avoid creating partial output files when commands are interrupted.
  - General thread safety: The tool can be interrupted or killed at any time without
    being blocked by child threads or leaving incomplete or corrupt files in
    place.
  - Ensure the exit code is nonzero on all failure scenarios (a very important
    feature in scripts).
  - Expected handling of symlinks (they are followed).
  - Support for both `s3://` and `s3n://` prefixes (the latter is common with
    Amazon Elastic MapReduce).

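When passing wildcards like the one above, it is usually safest to quote them so the local shell does not expand them before s4cmd sees them; an illustrative invocation:

```
s4cmd ls 's3://my-bucket/my-folder/20120512/*/*chunk00?1?'
```
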
Limitations:

- No CloudFront or other feature support.
- Currently, we simulate `sync` with `get` and `put` with `--recursive --force --sync-check`.


## Installation and Setup
You can install `s4cmd` from [PyPI](https://pypi.python.org/pypi/s4cmd).

```
pip install s4cmd
```

- Copy or create a symbolic link so you can run `s4cmd.py` as `s4cmd`. (It is just
  a single file!)
- If you already have a `~/.s3cfg` file from configuring `s3cmd`, credentials
  from this file will be used. Otherwise, set the `S3_ACCESS_KEY` and
  `S3_SECRET_KEY` environment variables to contain your S3 credentials.
- If no keys are provided, but an IAM role is associated with the EC2 instance, it will
  be used transparently.

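For example, to supply credentials through environment variables (the key values below are placeholders):

```
export S3_ACCESS_KEY='AKIA...'   # placeholder access key
export S3_SECRET_KEY='...'       # placeholder secret key
s4cmd ls s3://my-bucket/
```
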
## s4cmd Commands

#### `s4cmd ls [path]`

List all contents of a directory.

* -r/--recursive: recursively display all contents, including subdirectories under the given path.
* -d/--show-directory: show the directory entry instead of its contents.

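For example (bucket and folder names are illustrative):

```
s4cmd ls -r s3://my-bucket/my-folder/   # list everything under my-folder
s4cmd ls -d s3://my-bucket/my-folder/   # show the my-folder entry itself
```
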
#### `s4cmd put [source] [target]`

Upload local files to S3.

* -r/--recursive: also upload directories recursively.
* -s/--sync-check: check the md5 hash to avoid uploading identical content.
* -f/--force: overwrite existing files instead of showing an error message.
* -n/--dry-run: emulate the operation without a real upload.

#### `s4cmd get [source] [target]`

Download files from S3 to the local filesystem.

* -r/--recursive: also download directories recursively.
* -s/--sync-check: check the md5 hash to avoid downloading identical content.
* -f/--force: overwrite existing files instead of showing an error message.
* -n/--dry-run: emulate the operation without a real download.

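For example, an illustrative round trip (local and remote paths are hypothetical):

```
s4cmd put -r -s local-dir s3://my-bucket/backup/      # upload, skipping unchanged files
s4cmd get -r -f s3://my-bucket/backup/ restored-dir   # download, overwriting existing files
```
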
#### `s4cmd dsync [source dir] [target dir]`

Synchronize the contents of two directories. Each directory can be either local or remote, but syncing two local directories is currently not supported.

* -r/--recursive: also sync directories recursively.
* -s/--sync-check: check the md5 hash to avoid syncing identical content.
* -f/--force: overwrite existing files instead of showing an error message.
* -n/--dry-run: emulate the operation without a real sync.
* --delete-removed: delete files that are not in the source directory.

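For example, to mirror a local directory to S3, previewing first with a dry run (paths are illustrative):

```
s4cmd dsync -n --delete-removed local-dir s3://my-bucket/mirror/   # preview changes
s4cmd dsync --delete-removed local-dir s3://my-bucket/mirror/      # perform the sync
```
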
#### `s4cmd sync [source] [target]`

(Obsolete, use `dsync` instead.) Synchronize the contents of two directories. Each directory can be either local or remote, but syncing two local directories is currently not supported. This command simply invokes the get/put/mv commands.

* -r/--recursive: also sync directories recursively.
* -s/--sync-check: check the md5 hash to avoid syncing identical content.
* -f/--force: overwrite existing files instead of showing an error message.
* -n/--dry-run: emulate the operation without a real sync.
* --delete-removed: delete files that are not in the source directory. Only works when syncing a local directory to an s3 directory.

#### `s4cmd cp [source] [target]`

Copy a file or a directory from one S3 location to another.

* -r/--recursive: also copy directories recursively.
* -s/--sync-check: check the md5 hash to avoid copying identical content.
* -f/--force: overwrite existing files instead of showing an error message.
* -n/--dry-run: emulate the operation without a real copy.

#### `s4cmd mv [source] [target]`

Move a file or a directory from one S3 location to another.

* -r/--recursive: also move directories recursively.
* -s/--sync-check: check the md5 hash to avoid moving identical content.
* -f/--force: overwrite existing files instead of showing an error message.
* -n/--dry-run: emulate the operation without a real move.

#### `s4cmd del [path]`

Delete files or directories on S3.

* -r/--recursive: also delete directories recursively.
* -n/--dry-run: emulate the operation without a real delete.

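For example, previewing a recursive delete before running it for real (the path is illustrative):

```
s4cmd del -r -n s3://my-bucket/tmp/   # dry run: show what would be deleted
s4cmd del -r s3://my-bucket/tmp/      # actually delete
```
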
#### `s4cmd du [path]`

Get the size of the given directory.

Available parameters:

* -r/--recursive: also add the sizes of subdirectories recursively.

## s4cmd Control Options

##### `-p S3CFG, --config=[filename]`
path to the s3cfg config file

##### `-f, --force`
force overwriting of files on download or upload

##### `-r, --recursive`
recursively check subdirectories

##### `-s, --sync-check`
check the file md5 before download or upload

##### `-n, --dry-run`
trial run without an actual download or upload

##### `-t RETRY, --retry=[integer]`
number of retries before giving up

##### `--retry-delay=[integer]`
seconds to sleep between retries

##### `-c NUM_THREADS, --num-threads=NUM_THREADS`
number of concurrent threads

##### `--endpoint-url`
endpoint URL used by the boto3 client

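For example, pointing s4cmd at an S3-compatible service (the endpoint URL below is illustrative; substitute your provider's):

```
s4cmd ls --endpoint-url https://objects.example.com s3://my-bucket/
```
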
##### `-d, --show-directory`
show the directory instead of its contents

##### `--ignore-empty-source`
ignore an empty source from s3

##### `--use-ssl`
(obsolete) use an SSL connection to S3

##### `--verbose`
verbose output

##### `--debug`
debug output

##### `--validate`
(obsolete) validate lookup operation

##### `-D, --delete-removed`
delete remote files that do not exist in the source after sync

##### `--multipart-split-size=[integer]`
size in bytes to split multipart transfers

##### `--max-singlepart-download-size=[integer]`
files with a size (in bytes) greater than this will be downloaded in multipart transfers

##### `--max-singlepart-upload-size=[integer]`
files with a size (in bytes) greater than this will be uploaded in multipart transfers

##### `--max-singlepart-copy-size=[integer]`
files with a size (in bytes) greater than this will be copied in multipart transfers

##### `--batch-delete-size=[integer]`
number of files (at most 1000) to be combined in a batch delete

##### `--last-modified-before=[datetime]`
only act on files whose last-modified date is before the given parameter

##### `--last-modified-after=[datetime]`
only act on files whose last-modified date is after the given parameter

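For example, restricting operations to older files (paths are illustrative; the timestamp syntax follows the examples above):

```
s4cmd ls --last-modified-before='2 months ago' s3://my-bucket/logs/
s4cmd del -r -n --last-modified-before='1 year ago' s3://my-bucket/logs/   # dry run
```
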
## S3 API Pass-through Options

These options are translated directly into boto3 API parameters, and each option is passed only to the API calls that accept it. For example, `--API-ServerSideEncryption` is needed only for `put_object` and `create_multipart_upload`, but not for `list_buckets` or `get_object`; providing `--API-ServerSideEncryption` for `s4cmd ls` therefore has no effect.

For more information, please see the boto3 S3 documentation: http://boto3.readthedocs.io/en/latest/reference/services/s3.html

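For example, a typical pass-through usage (bucket and path are illustrative):

```
s4cmd put --API-ServerSideEncryption=AES256 myfile s3://my-bucket/path/
```
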
##### `--API-ACL=[string]`
The canned ACL to apply to the object.

##### `--API-CacheControl=[string]`
Specifies caching behavior along the request/reply chain.

##### `--API-ContentDisposition=[string]`
Specifies presentational information for the object.

##### `--API-ContentEncoding=[string]`
Specifies what content encodings have been applied to the object and thus what decoding mechanisms must be applied to obtain the media-type referenced by the Content-Type header field.

##### `--API-ContentLanguage=[string]`
The language the content is in.

##### `--API-ContentMD5=[string]`
The base64-encoded 128-bit MD5 digest of the part data.

##### `--API-ContentType=[string]`
A standard MIME type describing the format of the object data.

##### `--API-CopySourceIfMatch=[string]`
Copies the object if its entity tag (ETag) matches the specified tag.

##### `--API-CopySourceIfModifiedSince=[datetime]`
Copies the object if it has been modified since the specified time.

##### `--API-CopySourceIfNoneMatch=[string]`
Copies the object if its entity tag (ETag) is different than the specified ETag.

##### `--API-CopySourceIfUnmodifiedSince=[datetime]`
Copies the object if it hasn't been modified since the specified time.

##### `--API-CopySourceRange=[string]`
The range of bytes to copy from the source object. The range value must use the form bytes=first-last, where the first and last are the zero-based byte offsets to copy. For example, bytes=0-9 indicates that you want to copy the first ten bytes of the source. You can copy a range only if the source object is greater than 5 GB.

##### `--API-CopySourceSSECustomerAlgorithm=[string]`
Specifies the algorithm to use when decrypting the source object (e.g., AES256).

##### `--API-CopySourceSSECustomerKeyMD5=[string]`
Specifies the 128-bit MD5 digest of the encryption key according to RFC 1321. Amazon S3 uses this header for a message integrity check to ensure the encryption key was transmitted without error. Please note that this parameter is automatically populated if it is not provided; including it is not required.

##### `--API-CopySourceSSECustomerKey=[string]`
Specifies the customer-provided encryption key for Amazon S3 to use to decrypt the source object. The encryption key provided in this header must be one that was used when the source object was created.

##### `--API-ETag=[string]`
Entity tag returned when the part was uploaded.

##### `--API-Expires=[datetime]`
The date and time at which the object is no longer cacheable.

##### `--API-GrantFullControl=[string]`
Gives the grantee READ, READ_ACP, and WRITE_ACP permissions on the object.

##### `--API-GrantReadACP=[string]`
Allows the grantee to read the object ACL.

##### `--API-GrantRead=[string]`
Allows the grantee to read the object data and its metadata.

##### `--API-GrantWriteACP=[string]`
Allows the grantee to write the ACL for the applicable object.

##### `--API-IfMatch=[string]`
Return the object only if its entity tag (ETag) is the same as the one specified; otherwise return a 412 (precondition failed).

##### `--API-IfModifiedSince=[datetime]`
Return the object only if it has been modified since the specified time; otherwise return a 304 (not modified).

##### `--API-IfNoneMatch=[string]`
Return the object only if its entity tag (ETag) is different from the one specified; otherwise return a 304 (not modified).

##### `--API-IfUnmodifiedSince=[datetime]`
Return the object only if it has not been modified since the specified time; otherwise return a 412 (precondition failed).

##### `--API-Metadata=[dict]`
A map (as a JSON string) of metadata to store with the object in S3.

##### `--API-MetadataDirective=[string]`
Specifies whether the metadata is copied from the source object or replaced with metadata provided in the request.

##### `--API-MFA=[string]`
The concatenation of the authentication device's serial number, a space, and the value that is displayed on your authentication device.

##### `--API-RequestPayer=[string]`
Confirms that the requester knows that she or he will be charged for the request. Bucket owners need not specify this parameter in their requests. Documentation on downloading objects from requester pays buckets can be found at http://docs.aws.amazon.com/AmazonS3/latest/dev/ObjectsinRequesterPaysBuckets.html

##### `--API-ServerSideEncryption=[string]`
The server-side encryption algorithm used when storing this object in S3 (e.g., AES256, aws:kms).

##### `--API-SSECustomerAlgorithm=[string]`
Specifies the algorithm to use when encrypting the object (e.g., AES256).

##### `--API-SSECustomerKeyMD5=[string]`
Specifies the 128-bit MD5 digest of the encryption key according to RFC 1321. Amazon S3 uses this header for a message integrity check to ensure the encryption key was transmitted without error. Please note that this parameter is automatically populated if it is not provided; including it is not required.

##### `--API-SSECustomerKey=[string]`
Specifies the customer-provided encryption key for Amazon S3 to use in encrypting data. This value is used to store the object and then it is discarded; Amazon does not store the encryption key. The key must be appropriate for use with the algorithm specified in the x-amz-server-side-encryption-customer-algorithm header.

##### `--API-SSEKMSKeyId=[string]`
Specifies the AWS KMS key ID to use for object encryption. All GET and PUT requests for an object protected by AWS KMS will fail if not made via SSL or using SigV4. Documentation on configuring any of the officially supported AWS SDKs and CLI can be found at http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingAWSSDK.html#specify-signature-version

##### `--API-StorageClass=[string]`
The type of storage to use for the object. Defaults to 'STANDARD'.

##### `--API-VersionId=[string]`
VersionId used to reference a specific version of the object.

##### `--API-WebsiteRedirectLocation=[string]`
If the bucket is configured as a website, redirects requests for this object to another object in the same bucket or to an external URL. Amazon S3 stores the value of this header in the object metadata.


## Debugging Tips

Simply enable the `--debug` option to see the full log of s4cmd. If you also need to check which APIs s4cmd invokes in boto3, you can run:

```
s4cmd --debug [op] .... 2>&1 >/dev/null | grep S3APICALL
```

to see all the parameters sent to the S3 API.


## Compatibility between s3cmd and s4cmd

Prefix matching: In s3cmd, unlike traditional filesystems, prefix names match listings:

```
>> s3cmd ls s3://my-bucket/ch
s3://my-bucket/charlie/
s3://my-bucket/chyang/
```

In s4cmd, the behavior is the same as with a Unix shell:

```
>> s4cmd ls s3://my-bucket/ch
(empty)
```

To get the prefix behavior, use explicit wildcards instead: `s4cmd ls s3://my-bucket/ch*`

Similarly, the sync and cp commands emulate the Unix cp command, so directory-to-directory
sync uses different syntax:

```
>> s3cmd sync s3://bucket/path/dirA s3://bucket/path/dirB/
```
will copy the contents of dirA to dirB.
```
>> s4cmd sync s3://bucket/path/dirA s3://bucket/path/dirB/
```
will copy dirA *into* dirB.

To achieve the s3cmd behavior, use wildcards:
```
s4cmd sync s3://bucket/path/dirA/* s3://bucket/path/dirB/
```

Note that s4cmd does not treat `dirA` without a trailing slash as `dirA/*`, the
way rsync does.

No automatic overwrite for the put command:
`s4cmd put fileA s3://bucket/path/fileB` will return an error if fileB exists.
Use `-f`, as with the get command.

Bugfixes for handling of non-existent paths: s3cmd often creates empty files when the specified paths do not exist:
`s3cmd get s3://my-bucket/no_such_file` downloads an empty file.
`s4cmd get s3://my-bucket/no_such_file` returns an error.
`s3cmd put no_such_file s3://my-bucket/` uploads an empty file.
`s4cmd put no_such_file s3://my-bucket/` returns an error.


## Additional technical notes

Etags, MD5s and multi-part uploads: Traditionally, the etag of an object in S3
has been its MD5. However, this changed with the introduction of S3 multi-part
uploads; in this case the etag is still a unique ID, but it is not the MD5 of
the file. Amazon has not revealed the definition of the etag in this case, so
there is no way we can calculate and compare MD5s based on the etag header in
general. The workaround we use is to upload the MD5 as a supplemental content
header (called "md5", instead of "etag"). This enables s4cmd to check the MD5
hash before upload or download. The only limitation is that this only works for
files uploaded via s4cmd. Programs that do not understand this header will
still have to download and verify the MD5 directly.

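As a rough sketch of how to inspect that header: it is stored as user metadata on the object, so any tool that can read object metadata should show it. With the AWS CLI, for instance (bucket and key are illustrative, and the exact metadata layout is an assumption based on the description above):

```
aws s3api head-object --bucket my-bucket --key path/to/file
# For files uploaded via s4cmd, the response's "Metadata" map is expected
# to contain the supplemental hash, e.g. "Metadata": {"md5": "..."}
```
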
## Unimplemented features

- CloudFront or other feature support beyond basic S3 access.

## Credits

* Bloomreach http://www.bloomreach.com
* Onera http://www.onera.com
The second changed file is the Debian patch "do not use vendored urllib3" (6 additions, 4 deletions); the refresh rewrites its diff header into the Index: style and adds hunk context:

 Description: do not use vendored urllib3
 Author: Sascha Steinbiss <satta@debian.org>
---- a/s4cmd.py
-+++ b/s4cmd.py
-@@ -23,6 +23,7 @@
+Index: s4cmd.git/s4cmd.py
+===================================================================
+--- s4cmd.git.orig/s4cmd.py
++++ s4cmd.git/s4cmd.py
+@@ -23,6 +23,7 @@ Super S3 command line tool.

 import sys, os, re, optparse, multiprocessing, fnmatch, time, hashlib, errno, pytz
 import logging, traceback, types, threading, random, socket, shlex, datetime, json

 IS_PYTHON2 = sys.version_info[0] == 2

-@@ -271,7 +272,7 @@
+@@ -271,7 +272,7 @@ class BotoClient(object):
 S3RetryableErrors = (
   socket.timeout,
   socket.error if IS_PYTHON2 else ConnectionError,