
JSON and JSONlines from the command line


At Swoop we have many terabytes of JSON-like data in MongoDB, Redis, ElasticSearch, HDFS/Hadoop and even Amazon Redshift. While the internal representations are typically not JSON but BSON, MsgPack or native encodings, when it comes time to move large amounts of data for easy ad hoc processing I often end up using JSON and its bulk cousin, JSONlines. This post is about what you can quickly do with this type of data from the command line.

The best JSON(lines) command line tools

There has been a marked increase in the number of powerful & robust tools for validating and manipulating JSON and JSONlines from the command line. My favorites are:

  • jq: a blazingly fast, C-based stream processor for JSON documents with an easy yet powerful language. Think of it as sed and awk for JSON but without the 1970s syntax. Simple tasks are trivial. Powerful tasks are possible. The syntax is intuitive. Check out the tutorial and manual. Because of its stream orientation and speed, jq is the most natural fit when processing large amounts of JSONlines data. If you want to push the boundaries of what is sane to do on the command line there are conditionals, variables and UDFs.
  • underscore-cli: this is the Swiss Army knife for manipulating JSON on the command line. Based on Node.js, it supports JavaScript and CoffeeScript expressions with built-in functional programming primitives from the underscore.js library, relatively easy JSON traversal via json:select and more. This also is the best tool for debugging JSON data because of the multitude of output formats. A special plus in my book is that underscore-cli supports MsgPack, which we use in real-time flows and inside memory-constrained caches.
  • jsonpath: Ruby-based implementation of JSONPath with a corresponding command line tool. Speedy it is not but it’s great when you want JSONPath compatibility or can reuse existing expressions. There are some neat features such as pattern-based tree replace operations.
  • json (a.k.a., jsontool): another tool based on Node.js. Not as rich as underscore-cli but has a couple of occasionally useful features having to do with merging and grouping of documents. This tool also has a simple validation-only mode, which is convenient.

Keep in mind that you can modify/extend JSON data with these tools, not just transform it. jsontool can edit documents in place from the command line, something that can be useful for, for example, quickly updating properties in JSON config files.
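jsontool's in-place mode looks something like `json -I -f config.json -e 'this.port = 8080'`. As a sketch of the same edit using jq instead (jq has no in-place flag, so it writes to a temporary file and moves it over the original; the file and property names here are illustrative):

```shell
# Create a small config file, then bump one property "in place"
# by routing jq's output through a temp file:
echo '{"port": 3000, "debug": false}' > config.json
jq '.port = 8080' config.json > config.json.tmp \
  && mv config.json.tmp config.json
cat config.json
```

The temp-file dance matters: redirecting jq's output straight back to `config.json` would truncate the file before jq reads it.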

JSON and 64-bit (BIGINT) numbers

JSON has undefined (as in implementation-specific) semantics when it comes to dealing with 64-bit integers. The problem stems from the fact that JavaScript does not have this data type. There are Python, Ruby and Java JSON libraries that have no problem with 8-byte integers but I'd be suspicious of any Node.js implementation. If you have this type of data, test the edge cases with your tool of choice.
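A quick way to see the problem, assuming Node.js is installed, is to round-trip 2^53 + 1, the first integer an IEEE-754 double cannot represent exactly:

```shell
# JavaScript numbers are IEEE-754 doubles with a 53-bit mantissa,
# so JSON.parse silently rounds 2^53 + 1 down to 2^53:
node -e 'console.log(JSON.parse("{\"id\": 9007199254740993}").id)'
# prints 9007199254740992, off by one from the input
```

Any tool built on the same runtime inherits the same rounding, no matter how careful the tool itself is.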

JSONlines validation & cleanup

There are times when JSONlines data does not come clean. It may include error messages or a mix of STDOUT and STDERR output (something Heroku is notorious for). At those times, it’s good to know how to quickly validate and clean up a large JSONlines file.

To clean up the input, we can use a simple sed incantation that removes all lines that do not begin with [ or {, the start of a JSON array or object. It is hard to think of a bulk export command or script that outputs primitive JSON types. To validate the remaining lines, we can filter through jq and output the type of the root object.

cat data.jsonlines | sed '/^[^[{]/d' > clean_data.jsonlines
cat clean_data.jsonlines | jq 'type' > /dev/null

This will generate output on STDERR with the line & column of any bad JSON.
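To see the pipeline in action on a small inline sample (data fed via printf here rather than from a file):

```shell
# One stray log line mixed in with two valid JSON documents; sed
# drops the stray line and jq reports the type of each remaining root:
printf 'boot: cache warmed\n{"a": 1}\n[1, 2]\n' \
  | sed '/^[^[{]/d' | jq 'type'
# prints:
# "object"
# "array"
```

If any surviving line were malformed, jq would stop there and print a parse error instead of a type.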

Pretty printing JSON

Everyone has their favorite way to pretty print JSON. Mine uses the default jq output because it comes in color and because it makes it easy to drill down into the data structure. Let's use the GitHub API as an example here.

# List of Swoop repos on GitHub
API='https://api.github.com/users/swoop-inc/repos'
alias swoop_repos="curl $API"

# Pretty print the list of Swoop repos on GitHub in color
swoop_repos | jq '.'

JSON arrays to JSONlines

GitHub gives us an array of repo objects but let's say we want JSONlines instead in order to prepare the API output for input into MongoDB via mongoimport. The --compact option of jq is perfect for JSONlines output.

# Swoop repos as JSONlines
swoop_repos | jq -c '.[]'

The .[] filter breaks up an array of inputs into individual inputs.
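On a toy array you can see the effect directly:

```shell
# .[] streams the array's elements; -c prints each one compactly
# on its own line, which is exactly the JSONlines shape:
echo '[{"a":1},{"b":2}]' | jq -c '.[]'
# prints:
# {"a":1}
# {"b":2}
```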

Filtering and selection

Say we want to pull out the full names of Swoop's own repos as a JSON array. "Own" in this case means not forked.

swoop_repos | jq '[.[] | select(.fork == false) | .full_name]'

Let's parse this one piece at a time:

  • The wrapping [...] merges any output into an array.
  • You've seen .[] already. It breaks up the single array input into many separate inputs, one per repo.
  • The select only outputs those repos that are not forked.
  • The .full_name filter plucks the value of that field from the repo data.
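Running the same filter over two made-up repo objects (with -c for compact output) shows only the non-fork surviving:

```shell
# Two fake repos: one original, one fork. Only the original's
# full_name makes it into the result array:
echo '[{"full_name":"swoop-inc/real","fork":false},
       {"full_name":"swoop-inc/copy","fork":true}]' \
  | jq -c '[.[] | select(.fork == false) | .full_name]'
# prints ["swoop-inc/real"]
```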

Here is the equivalent using underscore-cli and a json:select expression:

swoop_repos | underscore select \
    'object:has(.fork:expr(x=false)) > .full_name'

In both cases we are not saving that much code, but not having to create script files keeps things simpler. For comparison, doing the same in Ruby (fetching the API, parsing the JSON and filtering the result) takes a proper script rather than a shell one-liner.
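A minimal sketch of that Ruby version (this is illustrative, not the post's original script; in real use the JSON array would come from the GitHub API via something like net/http, so a tiny inline sample stands in for it here):

```ruby
require 'json'

# Given a JSON array of GitHub repo objects, return the full names
# of the repos that are not forks -- the same logic as the jq filter.
def own_repo_names(json)
  JSON.parse(json).reject { |repo| repo['fork'] }
                  .map { |repo| repo['full_name'] }
end

sample = '[{"full_name": "swoop-inc/real", "fork": false},' \
         ' {"full_name": "swoop-inc/copy", "fork": true}]'
puts own_repo_names(sample)  # prints swoop-inc/real
```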


Tagged: command line, jq, json, jsonlines, jsonpath, shell, Swoop, underscore, underscore-cli
