Backlog
Iteration +1
Read and write compressed files.
`orjson` uses `xopen` already, which can do it transparently
Allow processing larger-than-memory files
Allow specifying chunk size
Docs: It is recommended to use JSONL / NDJSON files for processing, as they support streaming
Docs: Report about throughput. 5,000 batch size; Main memory 1 GB.
GT: 11 GB input file; 1,300-1,800 records/s; 22 minutes total runtime; two simple jqlang expressions
Add `--overwrite` option
Unlock providing variable names per `--arg`, `--argjson`, `--slurpfile`, `--rawfile`, `--args`, `--jsonargs`
Make use of https://github.com/ashb/jqrepl?
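The chunking and compression items above can be sketched with the standard library alone. This is only an illustration of the idea, not the project's implementation: the names `open_text` and `read_jsonl_chunks` are hypothetical, `gzip` stands in for `xopen`, and the default chunk size merely mirrors the 5,000 batch size mentioned above.

```python
import gzip
import itertools
import json
from pathlib import Path
from typing import Any, Dict, Iterator, List


def open_text(path: Path):
    """Open a plain or gzip-compressed file in text mode, based on its suffix."""
    if path.suffix == ".gz":
        return gzip.open(path, mode="rt", encoding="utf-8")
    return path.open(mode="rt", encoding="utf-8")


def read_jsonl_chunks(path: Path, chunk_size: int = 5_000) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most `chunk_size` decoded records, one chunk at a time."""
    with open_text(path) as stream:
        # Decode lazily, so only one chunk is held in memory at any time.
        records = (json.loads(line) for line in stream if line.strip())
        while True:
            chunk = list(itertools.islice(records, chunk_size))
            if not chunk:
                break
            yield chunk
```

Because JSONL stores one record per line, the reader never needs to parse the whole file at once, which is why the format suits larger-than-memory inputs.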
Iteration +2
Documentation: jqlang stdlib’s `to_object` function for substructure management
Documentation: Type casting
echo '{"a": 42, "b": {}, "c": []}' | jq -c '.|= (.b |= objects | .c |= objects)'
{"a":42,"b":{}}
Renaming currently needs JSON Pointer support, implemented in Python. Alternatively, can `jq` also do it?
Simple IFTTT: When condition, do that (e.g. add tag)
Documentation: `jq` functions
`builtin.jq`: https://github.com/jqlang/jq/blob/master/src/builtin.jq
`function.jq`
Documentation: Update “What’s Inside”
Documentation: Usage (build (API, from_yaml), apply)
Documentation: How to extend
function.{jq,py}
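For comparison with the type casting example above, the same effect can be sketched in plain Python. The helper name `keep_objects` is hypothetical and not part of any library; it only mirrors what `.b |= objects | .c |= objects` does to the sample document, where a listed key is dropped unless its value is an object.

```python
from typing import Any, Dict, Iterable


def keep_objects(record: Dict[str, Any], keys: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `record`, dropping each listed key whose value is not a dict."""
    drop = {key for key in keys if key in record and not isinstance(record[key], dict)}
    return {key: value for key, value in record.items() if key not in drop}


result = keep_objects({"a": 42, "b": {}, "c": []}, keys=["b", "c"])
print(result)  # {'a': 42, 'b': {}}
```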
Documentation
- Omit records `and .value.bill_contact.id != ""`
# Only accept `email` elements that are objects.
#and (if (.value | index("emails")) then (.value.emails[].type | type) == "object" else true end)
# Exclude a few specific documents.
# TODO: Review documents once more to discover more edge cases.
.[] |= select(
  true
  and ._id != "55d71c8ce4b02210dc47b10f"
)
# Some early `phone` elements have been stored wrongly,
# all others are of type OBJECT.
#and (.value.phone | type) != "array"
# Some early `urls` elements have been stored wrongly,
# all others are of type ARRAY.
#and (.value.urls | type) != "object"
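The filter conditions noted above can also be written as a Python predicate, which may be easier to extend while reviewing edge cases. This is a sketch under assumptions: the function name `accept` and the constant `EXCLUDED_IDS` are hypothetical, and the commented-out `phone` and `urls` conditions are treated as enabled here.

```python
from typing import Any, Dict

# Documents excluded by identifier, taken from the jq expression above.
EXCLUDED_IDS = {"55d71c8ce4b02210dc47b10f"}


def accept(record: Dict[str, Any]) -> bool:
    """Return True if the record passes all filter conditions."""
    if record.get("_id") in EXCLUDED_IDS:
        return False
    value = record.get("value")
    if not isinstance(value, dict):
        value = {}
    # Omit records with an empty billing contact identifier.
    bill_contact = value.get("bill_contact")
    if isinstance(bill_contact, dict) and bill_contact.get("id") == "":
        return False
    # `phone` is expected to be an object; some early records stored an array.
    if isinstance(value.get("phone"), list):
        return False
    # `urls` is expected to be an array; some early records stored an object.
    if isinstance(value.get("urls"), dict):
        return False
    return True


records = [
    {"_id": "55d71c8ce4b02210dc47b10f", "value": {}},
    {"_id": "a", "value": {"bill_contact": {"id": ""}}},
    {"_id": "b", "value": {"phone": []}},
    {"_id": "c", "value": {"urls": {}}},
    {"_id": "d", "value": {"phone": {}, "urls": [], "bill_contact": {"id": "42"}}},
]
accepted = [record["_id"] for record in records if accept(record)]
print(accepted)  # ['d']
```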
Iteration +3
CLI interface
Documentation: Add Python example to “Synopsis” section on /index.html
Documentation: Compare with Seatunnel https://github.com/apache/seatunnel/tree/dev/docs/en/transform-v2
Demonstrate more use cases, like…
math expressions
omit key (recursively)
combine keys
filter on keys and/or values
Pathological cases like “Not defined” in typed fields like `TIMESTAMP`
Use simpleeval, like Meltano, and provide the same built-in functions
Use JSONPath, see https://sdk.meltano.com/en/v0.39.1/code_samples.html#use-a-jsonpath-expression-to-extract-the-next-page-url-from-a-hateoas-response
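Two of the use cases listed above, omitting a key recursively and combining keys, can be sketched in plain Python to show the intended behaviour. The helper names `omit_key` and `combine_keys` are hypothetical and do not exist in the library.

```python
from typing import Any, Dict, Sequence


def omit_key(data: Any, name: str) -> Any:
    """Return a copy of `data` with every occurrence of key `name` removed, at any depth."""
    if isinstance(data, dict):
        return {key: omit_key(value, name) for key, value in data.items() if key != name}
    if isinstance(data, list):
        return [omit_key(item, name) for item in data]
    return data


def combine_keys(record: Dict[str, Any], sources: Sequence[str], target: str, separator: str = " ") -> Dict[str, Any]:
    """Join the values of `sources` into a new key `target`, removing the source keys."""
    out = {key: value for key, value in record.items() if key not in sources}
    out[target] = separator.join(str(record[key]) for key in sources if key in record)
    return out


record = {
    "id": 1,
    "debug": True,
    "meta": {"debug": False, "tags": [{"debug": 1, "name": "x"}]},
    "first": "Ada",
    "last": "Lovelace",
}
cleaned = omit_key(record, "debug")
combined = combine_keys(cleaned, sources=["first", "last"], target="name")
print(combined)
```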
Iteration +4
Moksha transformations on Buckets
Fluent API interface
from tikray.model.fluent import FluentTransformation

transformation = (
    FluentTransformation()
    .jmes("records[?starts_with(location, 'B')]")
    .rename_fields({"_id": "id"})
    .convert_values({"/id": "int", "/value": "float"}, type="pointer-python")
    .jq(".[] |= (.value /= 100)")
)
Investigate using JSON Schema
Mappers do not support external API lookups. To add external API lookups, you can either (a) land all your data and then join it using a transformation tool like dbt, or (b) create a custom mapper plugin with inline lookup logic. => Example from Luftdatenpumpe, using a reverse geocoder
Define schema
Is `jqpy` better than `jq`?
Load XML via Badgerfish or KDL https://github.com/kdl-org/kdl
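Option (b) above, inline lookup logic, might look roughly like the following sketch. Everything here is an assumption made for illustration: `KNOWN_PLACES` is a hypothetical in-memory table standing in for a real reverse geocoder, and `reverse_geocode` and `enrich` are not functions of any existing library. The cache shows how repeated lookups for the same coordinates could avoid hitting an external service more than once.

```python
from functools import lru_cache
from typing import Any, Dict, Tuple

# Hypothetical stand-in for an external reverse geocoding service.
KNOWN_PLACES: Dict[Tuple[float, float], str] = {
    (52.52, 13.4): "Berlin",
    (48.14, 11.58): "Munich",
}


@lru_cache(maxsize=1024)
def reverse_geocode(latitude: float, longitude: float) -> str:
    """Resolve coordinates to a place name; results are cached per coordinate pair."""
    key = (round(latitude, 2), round(longitude, 2))
    return KNOWN_PLACES.get(key, "unknown")


def enrich(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with a `city` field added by inline lookup."""
    out = dict(record)
    out["city"] = reverse_geocode(record["latitude"], record["longitude"])
    return out


readings = [
    {"sensor": "a", "latitude": 52.521, "longitude": 13.404},
    {"sensor": "b", "latitude": 52.521, "longitude": 13.404},
    {"sensor": "c", "latitude": 0.0, "longitude": 0.0},
]
enriched = [enrich(reading) for reading in readings]
```

The second reading is served from the cache, so a real lookup would run only twice for these three records.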
Done
Refactor module namespace to `zyp`, then `loko`, then `tikray`
Documentation
Apply to MongoDB Table Loader in CrateDB Toolkit
Model: Toggle rule active / inactive by respecting `disabled` flag
Documentation: How to delete attributes from lists using jq?
Review and test jqlang stdlib’s `to_object` function
Support for JSONL