New Upstream Release - python-srsly
Ready changes
Summary
Merged new upstream version: 2.4.6 (was: 2.4.5).
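
The main functional change in 2.4.6 is the pair of gzipped JSONL helpers, `write_gzip_jsonl` and `read_gzip_jsonl`. The following is a minimal standard-library sketch of the same idea, not the upstream implementation: the `*_sketch` function names are hypothetical, it uses `json` rather than the bundled `ujson`, and it leaves out the `append_new_line` and `skip` options shown in the diff.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Any, Iterable, Iterator


def write_gzip_jsonl_sketch(path: Path, lines: Iterable[Any], append: bool = False) -> None:
    """Write one JSON document per line into a gzip file (hypothetical helper)."""
    # Appending adds a new gzip member to the file; each member is compressed
    # on its own, which is why appended files may compress poorly.
    mode = "ab" if append else "wb"
    with gzip.open(path, mode) as f:
        for line in lines:
            f.write((json.dumps(line) + "\n").encode("utf-8"))


def read_gzip_jsonl_sketch(path: Path) -> Iterator[Any]:
    """Yield the parsed object from each non-blank line (hypothetical helper)."""
    with gzip.open(path, "rb") as f:
        for raw in f:
            text = raw.decode("utf-8").strip()
            if text:  # blank lines are skipped
                yield json.loads(text)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.jsonl.gz"
    data = [{"foo": "bar"}, {"baz": 123}]
    write_gzip_jsonl_sketch(path, data)
    write_gzip_jsonl_sketch(path, data, append=True)
    # The reader is a generator, so it is materialized while the file still exists.
    roundtrip = list(read_gzip_jsonl_sketch(path))
```

As in the upstream tests, the reader is lazy and yields one object per line, so large files do not have to be loaded into memory at once.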
Diff
diff --git a/README.md b/README.md
index 051ae45..70cc285 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ serialization utilities we need in a single binary wheel. Currently supports **J
Serialization is hard, especially across Python versions and multiple platforms.
After dealing with many subtle bugs over the years (encodings, locales, large
files) our libraries like [spaCy](https://github.com/explosion/spaCy) and
-[Prodigy](https://prodi.gy) have steadily grown a number of utility functions to
+[Prodigy](https://prodi.gy) had steadily grown a number of utility functions to
wrap the multiple serialization formats we need to support (especially `json`,
`msgpack` and `pickle`). These wrapping functions ended up duplicated across our
codebases, so we wanted to put them in one place.
@@ -45,8 +45,8 @@ wheel.
`setuptools` and `wheel` are up to date.
```bash
-pip install -U pip setuptools wheel
-pip install srsly
+python -m pip install -U pip setuptools wheel
+python -m pip install srsly
```
Or from conda via conda-forge:
@@ -56,13 +56,38 @@ conda install -c conda-forge srsly
```
Alternatively, you can also compile the library from source. You'll need to make
-sure that you have a development environment consisting of a Python distribution
+sure that you have a development environment with a Python distribution
including header files, a compiler (XCode command-line tools on macOS / OS X or
-Visual C++ build tools on Windows), pip, virtualenv and git installed.
+Visual C++ build tools on Windows), pip and git installed.
+
+Install from source:
```bash
-pip install -r requirements.txt # install development dependencies
-python setup.py build_ext --inplace # compile the library
+# clone the repo
+git clone https://github.com/explosion/srsly
+cd srsly
+
+# create a virtual environment
+python -m venv .env
+source .env/bin/activate
+
+# update pip
+python -m pip install -U pip setuptools wheel
+
+# compile and install from source
+python -m pip install .
+```
+
+For developers, install requirements separately and then install in editable
+mode without build isolation:
+
+```bash
+# install in editable mode
+python -m pip install -r requirements.txt
+python -m pip install --no-build-isolation --editable .
+
+# run test suite
+python -m pytest --pyargs srsly
```
## API
@@ -111,11 +136,11 @@ data = {"foo": "bar", "baz": 123}
srsly.write_json("/path/to/file.json", data)
```
-| Argument | Type | Description |
-| ---------- | ------------ | ------------------------------------------------------ |
-| `path` | str / `Path` | The file path or `"-"` to write to stdout. |
-| `data` | - | The JSON-serializable data to output. |
-| `indent` | int | Number of spaces used to indent JSON. Defaults to `2`. |
+| Argument | Type | Description |
+| -------- | ------------ | ------------------------------------------------------ |
+| `path` | str / `Path` | The file path or `"-"` to write to stdout. |
+| `data` | - | The JSON-serializable data to output. |
+| `indent` | int | Number of spaces used to indent JSON. Defaults to `2`. |
#### <kbd>function</kbd> `srsly.read_json`
@@ -127,7 +152,7 @@ data = srsly.read_json("/path/to/file.json")
| Argument | Type | Description |
| ----------- | ------------ | ------------------------------------------ |
-| `path` | str / `Path` | The file path or `"-"` to read from stdin. |
+| `path` | str / `Path` | The file path or `"-"` to read from stdin. |
| **RETURNS** | dict / list | The loaded JSON content. |
#### <kbd>function</kbd> `srsly.write_gzip_json`
@@ -139,11 +164,27 @@ data = {"foo": "bar", "baz": 123}
srsly.write_gzip_json("/path/to/file.json.gz", data)
```
-| Argument | Type | Description |
-| ---------- | ------------ | ------------------------------------------------------ |
-| `path` | str / `Path` | The file path. |
-| `data` | - | The JSON-serializable data to output. |
-| `indent` | int | Number of spaces used to indent JSON. Defaults to `2`. |
+| Argument | Type | Description |
+| -------- | ------------ | ------------------------------------------------------ |
+| `path` | str / `Path` | The file path. |
+| `data` | - | The JSON-serializable data to output. |
+| `indent` | int | Number of spaces used to indent JSON. Defaults to `2`. |
+
+#### <kbd>function</kbd> `srsly.write_gzip_jsonl`
+
+Create a gzipped JSONL file and dump contents.
+
+```python
+data = [{"foo": "bar"}, {"baz": 123}]
+srsly.write_gzip_jsonl("/path/to/file.jsonl.gz", data)
+```
+
+| Argument | Type | Description |
+| ----------------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `path` | str / `Path` | The file path. |
+| `lines` | - | The JSON-serializable contents of each line. |
+| `append` | bool | Whether or not to append to the location. Appending to .gz files is generally not recommended, as it doesn't allow the algorithm to take advantage of all data when compressing - files may hence be poorly compressed. |
+| `append_new_line` | bool | Whether or not to write a new line before appending to the file. |
#### <kbd>function</kbd> `srsly.read_gzip_json`
@@ -155,9 +196,22 @@ data = srsly.read_gzip_json("/path/to/file.json.gz")
| Argument | Type | Description |
| ----------- | ------------ | ------------------------ |
-| `path` | str / `Path` | The file path. |
+| `path` | str / `Path` | The file path. |
| **RETURNS** | dict / list | The loaded JSON content. |
+#### <kbd>function</kbd> `srsly.read_gzip_jsonl`
+
+Load gzipped JSONL from a file.
+
+```python
+data = srsly.read_gzip_jsonl("/path/to/file.jsonl.gz")
+```
+
+| Argument | Type | Description |
+| ----------- | ------------ | ------------------------- |
+| `path` | str / `Path` | The file path. |
+| **YIELDS** | - | The loaded JSON contents of each line. |
+
#### <kbd>function</kbd> `srsly.write_jsonl`
Create a JSONL file (newline-delimited JSON) and dump contents line by line, or
@@ -170,7 +224,7 @@ srsly.write_jsonl("/path/to/file.jsonl", data)
| Argument | Type | Description |
| ----------------- | ------------ | ---------------------------------------------------------------------------------------------------------------------- |
-| `path` | str / `Path` | The file path or `"-"` to write to stdout. |
+| `path` | str / `Path` | The file path or `"-"` to write to stdout. |
| `lines` | iterable | The JSON-serializable lines. |
| `append` | bool | Append to an existing file. Will open it in `"a"` mode and insert a newline before writing lines. Defaults to `False`. |
| `append_new_line` | bool | Defines whether a new line should first be written when appending to an existing file. Defaults to `True`. |
@@ -186,7 +240,7 @@ data = srsly.read_jsonl("/path/to/file.jsonl")
| Argument | Type | Description |
| ---------- | ---------- | -------------------------------------------------------------------- |
-| `path` | str / Path | The file path or `"-"` to read from stdin. |
+| `path` | str / Path | The file path or `"-"` to read from stdin. |
| `skip` | bool | Skip broken lines and don't raise `ValueError`. Defaults to `False`. |
| **YIELDS** | - | The loaded JSON contents of each line. |
@@ -247,10 +301,10 @@ data = {"foo": "bar", "baz": 123}
srsly.write_msgpack("/path/to/file.msg", data)
```
-| Argument | Type | Description |
-| ---------- | ------------ | ---------------------- |
-| `path` | str / `Path` | The file path. |
-| `data` | - | The data to serialize. |
+| Argument | Type | Description |
+| -------- | ------------ | ---------------------- |
+| `path` | str / `Path` | The file path. |
+| `data` | - | The data to serialize. |
#### <kbd>function</kbd> `srsly.read_msgpack`
@@ -262,7 +316,7 @@ data = srsly.read_msgpack("/path/to/file.msg")
| Argument | Type | Description |
| ----------- | ------------ | --------------------------------------------------------------------------------------- |
-| `path` | str / `Path` | The file path. |
+| `path` | str / `Path` | The file path. |
| `use_list` | bool | Don't use tuples instead of lists. Can make deserialization slower. Defaults to `True`. |
| **RETURNS** | - | The loaded and deserialized content. |
@@ -318,7 +372,7 @@ yaml_string = srsly.yaml_dumps(data)
| ----------------- | ---- | ------------------------------------------ |
| `data` | - | The JSON-serializable data to output. |
| `indent_mapping` | int | Mapping indentation. Defaults to `2`. |
-| `indent_sequence` | int | Sequence indentation. Defaults to `4`. |
+| `indent_sequence` | int | Sequence indentation. Defaults to `4`. |
| `indent_offset` | int | Indentation offset. Defaults to `2`. |
| `sort_keys` | bool | Sort dictionary keys. Defaults to `False`. |
| **RETURNS** | str | The serialized string. |
@@ -348,10 +402,10 @@ srsly.write_yaml("/path/to/file.yml", data)
| Argument | Type | Description |
| ----------------- | ------------ | ------------------------------------------ |
-| `path` | str / `Path` | The file path or `"-"` to write to stdout. |
+| `path` | str / `Path` | The file path or `"-"` to write to stdout. |
| `data` | - | The JSON-serializable data to output. |
| `indent_mapping` | int | Mapping indentation. Defaults to `2`. |
-| `indent_sequence` | int | Sequence indentation. Defaults to `4`. |
+| `indent_sequence` | int | Sequence indentation. Defaults to `4`. |
| `indent_offset` | int | Indentation offset. Defaults to `2`. |
| `sort_keys` | bool | Sort dictionary keys. Defaults to `False`. |
@@ -365,7 +419,7 @@ data = srsly.read_yaml("/path/to/file.yml")
| Argument | Type | Description |
| ----------- | ------------ | ------------------------------------------ |
-| `path` | str / `Path` | The file path or `"-"` to read from stdin. |
+| `path` | str / `Path` | The file path or `"-"` to read from stdin. |
| **RETURNS** | dict / list | The loaded YAML content. |
#### <kbd>function</kbd> `srsly.is_yaml_serializable`
diff --git a/azure-pipelines.yml b/azure-pipelines.yml
index c7c6887..0ca7259 100644
--- a/azure-pipelines.yml
+++ b/azure-pipelines.yml
@@ -3,6 +3,10 @@ trigger:
branches:
include:
- '*'
+pr:
+ paths:
+ exclude:
+ - "*.md"
jobs:
@@ -10,7 +14,7 @@ jobs:
strategy:
matrix:
Python36Linux:
- imageName: 'ubuntu-latest'
+ imageName: 'ubuntu-20.04'
python.version: '3.6'
Python36Windows:
imageName: 'windows-2019'
@@ -44,13 +48,13 @@ jobs:
python.version: '3.10'
Python311Linux:
imageName: 'ubuntu-latest'
- python.version: '3.11.0-rc.2'
+ python.version: '3.11'
Python311Windows:
imageName: 'windows-latest'
- python.version: '3.11.0-rc.2'
+ python.version: '3.11'
Python311Mac:
imageName: 'macos-latest'
- python.version: '3.11.0-rc.2'
+ python.version: '3.11'
maxParallel: 4
pool:
vmImage: $(imageName)
diff --git a/debian/changelog b/debian/changelog
index 0ef2588..816396d 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,10 +1,14 @@
-python-srsly (2.4.5-2) UNRELEASED; urgency=medium
+python-srsly (2.4.6-1) UNRELEASED; urgency=medium
+ [ Andreas Tille ]
* Run build-time test properly
* Standards-Version: 4.6.2 (routine-update)
* Testsuite: autopkgtest-pkg-python (routine-update)
- -- Andreas Tille <tille@debian.org> Sat, 14 Jan 2023 08:03:33 +0100
+ [ Debian Janitor ]
+ * New upstream release.
+
+ -- Andreas Tille <tille@debian.org> Fri, 10 Mar 2023 22:35:08 -0000
python-srsly (2.4.5-1) unstable; urgency=medium
diff --git a/debian/patches/reorder_setup.py b/debian/patches/reorder_setup.py
index 78e8dc7..a39a485 100644
--- a/debian/patches/reorder_setup.py
+++ b/debian/patches/reorder_setup.py
@@ -2,8 +2,10 @@ Author: Andreas Tille <tille@debian.org>
Last-Update: Tue, 29 Nov 2022 16:09:55 +0100
Description: Fix sequence of imports
---- a/setup.py
-+++ b/setup.py
+Index: python-srsly.git/setup.py
+===================================================================
+--- python-srsly.git.orig/setup.py
++++ python-srsly.git/setup.py
@@ -1,8 +1,8 @@
#!/usr/bin/env python
import sys
diff --git a/srsly/__init__.py b/srsly/__init__.py
index 245286b..0c94d82 100644
--- a/srsly/__init__.py
+++ b/srsly/__init__.py
@@ -1,4 +1,5 @@
from ._json_api import read_json, read_gzip_json, write_json, write_gzip_json
+from ._json_api import read_gzip_jsonl, write_gzip_jsonl
from ._json_api import read_jsonl, write_jsonl
from ._json_api import json_dumps, json_loads, is_json_serializable
from ._msgpack_api import read_msgpack, write_msgpack, msgpack_dumps, msgpack_loads
diff --git a/srsly/_json_api.py b/srsly/_json_api.py
index 900e42b..24d25fd 100644
--- a/srsly/_json_api.py
+++ b/srsly/_json_api.py
@@ -1,4 +1,4 @@
-from typing import Union, Iterable, Sequence, Any, Optional
+from typing import Union, Iterable, Sequence, Any, Optional, Iterator
import sys
import json as _builtin_json
import gzip
@@ -56,14 +56,27 @@ def read_json(path: FilePath) -> JSONOutput:
def read_gzip_json(path: FilePath) -> JSONOutput:
"""Load JSON from a gzipped file.
- location (FilePath): The file path.
- RETURNS (JSONOutput): The loaded JSON content.
+ location (FilePath): The file path.
+ RETURNS (JSONOutput): The loaded JSON content.
"""
file_path = force_string(path)
with gzip.open(file_path, "r") as f:
return ujson.load(f)
+def read_gzip_jsonl(path: FilePath, skip: bool = False) -> Iterator[JSONOutput]:
+ """Read a gzipped .jsonl file and yield contents line by line.
+ Blank lines will always be skipped.
+
+ path (FilePath): The file path.
+ skip (bool): Skip broken lines and don't raise ValueError.
+ YIELDS (JSONOutput): The unpacked, deserialized Python objects.
+ """
+ with gzip.open(force_path(path), "r") as f:
+ for line in _yield_json_lines(f, skip=skip):
+ yield line
+
+
def write_json(path: FilePath, data: JSONInput, indent: int = 2) -> None:
"""Create a .json file and dump contents or write to standard
output.
@@ -94,6 +107,30 @@ def write_gzip_json(path: FilePath, data: JSONInput, indent: int = 2) -> None:
f.write(json_data.encode("utf-8"))
+def write_gzip_jsonl(
+ path: FilePath,
+ lines: Iterable[JSONInput],
+ append: bool = False,
+ append_new_line: bool = True,
+) -> None:
+ """Create a .jsonl.gz file and dump contents.
+
+ path (FilePath): The file path.
+ lines (Iterable[JSONInput]): The JSON-serializable contents of each line.
+ append (bool): Whether or not to append to the location. Appending to .gz files is generally not recommended, as it
+ doesn't allow the algorithm to take advantage of all data when compressing - files may hence be poorly
+ compressed.
+ append_new_line (bool): Whether or not to write a new line before appending
+ to the file.
+ """
+ mode = "a" if append else "w"
+ file_path = force_path(path, require_exists=False)
+ with gzip.open(file_path, mode=mode) as f:
+ if append and append_new_line:
+ f.write("\n".encode("utf-8"))
+ f.writelines([(json_dumps(line) + "\n").encode("utf-8") for line in lines])
+
+
def read_jsonl(path: FilePath, skip: bool = False) -> Iterable[JSONOutput]:
"""Read a .jsonl file or standard input and yield contents line by line.
Blank lines will always be skipped.
diff --git a/srsly/about.py b/srsly/about.py
index 71c852b..c9e914f 100644
--- a/srsly/about.py
+++ b/srsly/about.py
@@ -1 +1 @@
-__version__ = "2.4.5"
+__version__ = "2.4.6"
diff --git a/srsly/tests/cloudpickle/cloudpickle_test.py b/srsly/tests/cloudpickle/cloudpickle_test.py
index fe0cf39..b293c53 100644
--- a/srsly/tests/cloudpickle/cloudpickle_test.py
+++ b/srsly/tests/cloudpickle/cloudpickle_test.py
@@ -870,8 +870,12 @@ class CloudPickleTest(unittest.TestCase):
@pytest.mark.skipif(
- platform.machine() == "aarch64" and sys.version_info[:2] >= (3, 10),
- reason="Fails on aarch64 + python 3.10+ in cibuildwheel, currently unable to replicate failure elsewhere")
+ (platform.machine() == "aarch64" and sys.version_info[:2] >= (3, 10))
+ or platform.python_implementation() == "PyPy"
+ or (sys.version_info[:2] == (3, 10) and sys.version_info >= (3, 10, 8))
+ # Skipping tests on 3.11 due to https://github.com/cloudpipe/cloudpickle/pull/486.
+ or sys.version_info[:2] == (3, 11),
+ reason="Fails on aarch64 + python 3.10+ in cibuildwheel, currently unable to replicate failure elsewhere; fails sometimes for pypy on conda-forge; fails for python 3.10.8+ and 3.11")
def test_builtin_classmethod(self):
obj = 1.5 # float object
@@ -1470,6 +1474,7 @@ class CloudPickleTest(unittest.TestCase):
finally:
sys.modules.pop("_faulty_module", None)
+ @pytest.mark.skip(reason="fails for pytest v7.2.0")
def test_dynamic_pytest_module(self):
# Test case for pull request https://github.com/cloudpipe/cloudpickle/pull/116
import py
@@ -1567,6 +1572,8 @@ class CloudPickleTest(unittest.TestCase):
assert isinstance(depickled_t2, MyTuple)
assert depickled_t2 == t2
+ @pytest.mark.skipif(platform.python_implementation() == "PyPy",
+ reason="fails sometimes for pypy on conda-forge")
def test_interactively_defined_function(self):
# Check that callables defined in the __main__ module of a Python
# script (or jupyter kernel) can be pickled / unpickled / executed.
diff --git a/srsly/tests/test_json_api.py b/srsly/tests/test_json_api.py
index dc23952..89ce400 100644
--- a/srsly/tests/test_json_api.py
+++ b/srsly/tests/test_json_api.py
@@ -4,7 +4,14 @@ from pathlib import Path
import gzip
import numpy
-from .._json_api import read_json, write_json, read_jsonl, write_jsonl
+from .._json_api import (
+ read_json,
+ write_json,
+ read_jsonl,
+ write_jsonl,
+ read_gzip_jsonl,
+ write_gzip_jsonl,
+)
from .._json_api import write_gzip_json, json_dumps, is_json_serializable
from .._json_api import json_loads
from ..util import force_string
@@ -204,3 +211,54 @@ def test_unsupported_type_error():
f = numpy.float32()
with pytest.raises(TypeError):
s = json_dumps(f)
+
+
+def test_write_jsonl_gzip():
+ """Tests writing data to a gzipped .jsonl file."""
+ data = [{"hello": "world"}, {"test": 123}]
+ expected = ['{"hello":"world"}\n', '{"test":123}\n']
+
+ with make_tempdir() as temp_dir:
+ file_path = temp_dir / "tmp.json"
+ write_gzip_jsonl(file_path, data)
+ with gzip.open(file_path, "r") as f:
+ assert [line.decode("utf8") for line in f.readlines()] == expected
+
+
+def test_write_jsonl_gzip_append():
+ """Tests appending data to a gzipped .jsonl file."""
+ data = [{"hello": "world"}, {"test": 123}]
+ expected = [
+ '{"hello":"world"}\n',
+ '{"test":123}\n',
+ "\n",
+ '{"hello":"world"}\n',
+ '{"test":123}\n',
+ ]
+ with make_tempdir() as temp_dir:
+ file_path = temp_dir / "tmp.json"
+ write_gzip_jsonl(file_path, data)
+ write_gzip_jsonl(file_path, data, append=True)
+ with gzip.open(file_path, "r") as f:
+ assert [line.decode("utf8") for line in f.readlines()] == expected
+
+
+def test_read_jsonl_gzip():
+ """Tests reading data from a gzipped .jsonl file."""
+ file_contents = [{"hello": "world"}, {"test": 123}]
+ with make_tempdir() as temp_dir:
+ file_path = temp_dir / "tmp.json"
+ with gzip.open(file_path, "w") as f:
+ f.writelines(
+ [(json_dumps(line) + "\n").encode("utf-8") for line in file_contents]
+ )
+ assert file_path.exists()
+ data = read_gzip_jsonl(file_path)
+ # Make sure this returns a generator, not just a list
+ assert not hasattr(data, "__len__")
+ data = list(data)
+ assert len(data) == 2
+ assert len(data[0]) == 1
+ assert len(data[1]) == 1
+ assert data[0]["hello"] == "world"
+ assert data[1]["test"] == 123
diff --git a/srsly/util.py b/srsly/util.py
index 575d3b5..c43120c 100644
--- a/srsly/util.py
+++ b/srsly/util.py
@@ -10,8 +10,8 @@ FilePath = Union[str, Path]
JSONOutput = Union[str, int, float, bool, None, Dict[str, Any], List[Any]]
JSONOutputBin = Union[bytes, str, int, float, bool, None, Dict[str, Any], List[Any]]
# For input, we also accept tuples, ordered dicts etc.
-JSONInput = Union[str, int, float, bool, None, Dict[str, Any], List[Any], Tuple[Any], OrderedDict]
-JSONInputBin = Union[bytes, str, int, float, bool, None, Dict[str, Any], List[Any], Tuple[Any], OrderedDict]
+JSONInput = Union[str, int, float, bool, None, Dict[str, Any], List[Any], Tuple[Any, ...], OrderedDict]
+JSONInputBin = Union[bytes, str, int, float, bool, None, Dict[str, Any], List[Any], Tuple[Any, ...], OrderedDict]
YAMLInput = JSONInput
YAMLOutput = JSONOutput
# fmt: on
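
The `srsly/util.py` change above replaces `Tuple[Any]` with `Tuple[Any, ...]` in the input type aliases. As a brief illustration of why this matters, the sketch below inspects both forms with `typing.get_args` (available from Python 3.8): `Tuple[Any]` describes a tuple of exactly one element, while `Tuple[Any, ...]` describes a tuple of any length. The variable names are illustrative only and do not appear in srsly.

```python
from typing import Any, Tuple, get_args

# A tuple type with exactly one element of any type.
one_element = Tuple[Any]
# A homogeneous tuple type of arbitrary length.
any_length = Tuple[Any, ...]

one_args = get_args(one_element)
var_args = get_args(any_length)

# The second form carries an Ellipsis marker, which type checkers
# interpret as "zero or more elements of this type".
print(one_args)
print(var_args)
```

With the old alias, a static type checker would be entitled to reject a two-element tuple passed as `JSONInput`; the new alias accepts tuples of any length.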