
SpartaNews

pz → Python instead of Bash

7th May 2021

author: Edvard Rejthar

translation: Petra Raszková

 

I would like to introduce pz (short for pythonize), a utility intended for command-line users who know Python. Current Linux distributions ship with many efficient tools for processing input. But have you ever wished you could use Python syntax instead?

Do you often browse through manual pages, trying to find out how the formatting switches behave? Do you find Bash source code difficult to read? Then this article is meant for you.

You will learn how to write a simple program, how the program is evaluated, and which variables are available. You will also read a few words about auto-import and switches, and see some examples of use.

The command line is an outstanding user interface, distinguished worldwide by its stunning reach, if not by enthusiasm. You can connect a terminal to almost any device, down to the last washing machine. Some users are afraid of the command line, yet so far no graphical application fully covers the range of tasks we face every day.

Maybe you are a GNU coreutils guru who knows every switch of the tr and cut programs, maybe you proofread in sed instead of Microsoft Word and put together your shopping lists in awk. In that case you definitely don't need the pz utility. However, if you have ever spent three hours searching for a forgotten space in a Bash script condition and you can imagine a better way to spend an afternoon, your life may get easier as soon as this evening. There is no need to know several different syntaxes if you know the syntax of Python.

But why not use Python directly from the command line? Why can't you use the -c switch and evaluate the code?

python -c "print(1)" # 1

Of course you can.

But once you try to handle a more complex task, you drown in input and output processing. Even when the computation itself fits on one line, merely importing the libraries for working with the pipe causes a headache, especially since Python is poorly suited to one-liners (try Perl for those).

The pz utility lets you omit all the lines such as the shebang #!/usr/bin/env python3, logging setup, conversion between bytes and strings, and so on: everything you would otherwise have to write over and over again for a one-line piece of functionality.
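For comparison, a stand-alone Python script that does nothing more than the slicing s[1:3] used in the examples of this article might look roughly like the sketch below. The function name slice_lines is made up for illustration, and io.StringIO stands in for the real standard input so that the snippet runs on its own.

```python
#!/usr/bin/env python3
# Hypothetical stand-alone script; a real one would pass sys.stdin and sys.stdout.
import io
import sys


def slice_lines(stdin, stdout):
    """Copy every input line to the output, keeping only the characters at index 1 and 2."""
    for line in stdin:
        s = line.rstrip("\n")        # drop the newline that arrives through the pipe
        stdout.write(s[1:3] + "\n")  # the only line that does the actual work


# Simulate `echo "ahoj" | ./script.py` without touching the real stdin.
fake_stdout = io.StringIO()
slice_lines(io.StringIO("ahoj\n"), fake_stdout)
sys.stdout.write(fake_stdout.getvalue())  # ho
```

Only one line of the function carries the actual logic; the rest is the kind of scaffolding the utility is meant to hide.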

When designing the interface, we tried to follow the rule of intuition: if an ideal program existed, how exactly would the user want to use it? The basis consists of variables, switches and clauses.

Some variables are filled in automatically, others are prepared in advance so that you don't have to initialize them, and still others control the output. Clauses contain your command or a sequence of commands and are evaluated before, during and after input processing. Switches change their behavior.

Let Python do the things it excels at, such as substring slicing [start:stop:step] or generator expressions. The variable s always contains the currently processed line. If you change it, the output changes. We will now send the word "ahoj" to the input and cut out the substring from index 1 up to (but not including) index 3, so we expect to get "ho". (Note: the comment after the hash sign shows the output of the program.)

echo "ahoj" | pz s[1:3]  # "ho"

If we used the --verbose switch to display detailed information, we would find out that the utility noticed in the background that the s variable had not changed during the command execution, so it modified the expression to assign the result back to s: s = s[1:3]. Therefore, there is no need to assign to s explicitly.
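One simple way to get a similar effect in plain Python is sketched below. It is only an illustration under our own assumptions, not the actual code of pz: the function name run_clause is hypothetical, and instead of checking whether s changed, the sketch simply treats any clause that is a pure expression as "s = expression".

```python
def run_clause(clause, s):
    """Run one clause against the line `s` and return the new value of s."""
    scope = {"s": s}
    try:
        # A pure expression such as "s[1:3]" compiles in "eval" mode.
        code = compile(clause, "<clause>", "eval")
    except SyntaxError:
        # Anything else (e.g. "s = s.upper()") is executed as a statement.
        exec(clause, scope)
        return scope["s"]
    # Assign the value of the expression back to s.
    return eval(code, scope)


print(run_clause("s[1:3]", "ahoj"))         # ho
print(run_clause("s = s.upper()", "ahoj"))  # AHOJ
```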

Other automatically filled variables include n, which contains the current line converted to a number (if possible). In the following example, we split the line on the comma, take the element at index 1 and add seven to it. You can also see that pz calls can be chained arbitrarily.

echo "ahoj,5" | pz 's.split(",")[1]' | pz n+7 # 12
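The "converted to a number (if possible)" part can be pictured with the following sketch. The helper to_number is a hypothetical name, and the choice to try int first, then float, and to fall back to None is our assumption rather than documented pz behavior.

```python
def to_number(line):
    """Return the line as int or float, or None when it is not numeric."""
    for convert in (int, float):
        try:
            return convert(line)
        except ValueError:
            continue
    return None


fields = "ahoj,5".split(",")  # first stage of the pipeline: ['ahoj', '5']
n = to_number(fields[1])      # second stage sees the line "5" as the number 5
print(n + 7)                  # 12
```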

Further automatically filled variables include count (the current line number), text (all the text as a single string), lines (all the text as a list of lines) and numbers (all the text as a list in which the lines are converted to numbers).

The utility takes possible memory exhaustion into account (automatic filling of variables can be turned off) as well as the streaming nature of the input: by default, the variable text is available only after the entire input has been processed, so that we can safely process input of any length, even infinite.

Let's take a look at the variable numbers, for example. It gradually receives four numbers, one per line. Our clause sets the output through the variable s to a tuple in the format: current line number, current line, arithmetic mean so far.

$ echo -e "20\n40\n25\n28" | pz 's = count, s, sum(numbers)/count'
1, 20, 20.0
2, 40, 30.0
3, 25, 28.333333333333332
4, 28, 28.25
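The same running mean can be written in plain Python as follows. This is a sketch that mimics the variables count, s and numbers by hand; the function name running_mean is ours, and we assume every line holds an integer.

```python
def running_mean(lines):
    """Yield (line number, line, mean of all numbers seen so far)."""
    numbers = []
    for count, s in enumerate(lines, start=1):
        numbers.append(int(s))
        yield count, s, sum(numbers) / count


for row in running_mean(["20", "40", "25", "28"]):
    # A tuple is printed as one comma-separated line.
    print(", ".join(str(item) for item in row))
```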

So far, we have only seen the main clause, which is executed once for each input line. There are also other clauses: --setup, evaluated at the very beginning (in case some variables need to be initialized), and its mirror image, the --end clause, executed once at the end. As a simple example, let's calculate the length of a text.

echo -e "hello\nworld" | pz --end 'len(text)' # 11
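The three phases can be pictured as an ordinary loop with a step before it and a step after it. The sketch below is only a mental model with hypothetical names (process and its callback parameters); it assumes that text is the input lines joined by newlines, which is consistent with the result 11 above.

```python
def process(lines, setup=None, main=None, end=None):
    """Run `setup` once, `main` per line and `end` once; collect the outputs."""
    scope = {}
    if setup:
        setup(scope)                       # before any input is read
    seen, output = [], []
    for s in lines:
        seen.append(s)
        if main:
            output.append(main(scope, s))  # once per input line
    if end:
        text = "\n".join(seen)             # whole input, known only at the end
        output.append(end(scope, text))
    return output


print(process(["hello", "world"], end=lambda scope, text: len(text)))  # [11]
```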

Filtering is easily done with the --filter switch: a line is passed on depending on the boolean value to which the result of the main clause is cast. The skip variable works in a similar way; just assign a boolean value to it. For example, let's pass on only numbers greater than three.

$ echo -e "1\n2\n3\n4\n5" | pz "skip = not n > 3"
4
5
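In plain Python, the skip logic of this example corresponds roughly to the generator below. The name pass_greater_than is made up, and we assume each line holds an integer.

```python
def pass_greater_than(lines, threshold):
    """Yield only the lines whose numeric value exceeds the threshold."""
    for s in lines:
        n = int(s)
        skip = not n > threshold  # same condition as in the clause above
        if not skip:
            yield s


print(list(pass_greater_than(["1", "2", "3", "4", "5"], 3)))  # ['4', '5']
```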

There are special switches for regular expressions, so there is no need to import functions from the re module and get lost in a maze of quotes and apostrophes. You only need to use --search, --match, --findall or --sub. If the text contains URLs and there may be more than one per line, run each line through the --findall switch. The main clause is then automatically completed to s = re.findall(expression, input).

$ echo "Lorem http://example.com ipsum http://example.cz dolor" | pz --findall "(https?://[^\s]+)"
http://example.com
http://example.cz

Looking at the output, we can see that a single input line (the only one) produced several output lines. Yes, that is possible too.
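Written out by hand, the --findall example corresponds to the following few lines of standard Python. Only the re module is used; the variable names are ours.

```python
import re

line = "Lorem http://example.com ipsum http://example.cz dolor"

# With a single capturing group, re.findall returns the list of group matches.
s = re.findall(r"(https?://[^\s]+)", line)

for url in s:  # a list left in s means one output line per item
    print(url)
```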

The output depends on what is left in the s variable after the line has been processed. A tuple or a generator is printed as a single line with the items separated by commas. If a list remains, we get several lines. If we find a callable, we call it. This is the reason why we can type s.lower without parentheses and still get lower-case letters.

$ echo "HEllO" | pz s.lower  # 's = s.lower()' → "hello"
$ echo "HEllO" | pz len  # 's = len(s)' → "5"
$ echo "25" | pz sqrt | pz round  # 's = math.sqrt(n)' → 's = round(n)' → "5"
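These output rules can be sketched as a small function. This is our own approximation with a hypothetical name, render, and not the real implementation: in particular, how a callable receives its argument (first tried with no arguments, then with the line) is an assumption, and the numeric variable n used by the sqrt example is left out.

```python
import types


def render(s, line):
    """Turn whatever is left in `s` into a list of output lines."""
    if callable(s):
        try:
            s = s()      # bound method such as "HEllO".lower
        except TypeError:
            s = s(line)  # plain function such as len
    if isinstance(s, (tuple, types.GeneratorType)):
        return [", ".join(str(item) for item in s)]  # one comma-separated line
    if isinstance(s, list):
        return [str(item) for item in s]             # one line per item
    return [str(s)]


print(render("HEllO".lower, "HEllO"))  # ['hello']
print(render(len, "HEllO"))            # ['5']
print(render((1, "20", 20.0), ""))     # ['1, 20, 20.0']
print(render(["a", "b"], ""))          # ['a', 'b']
```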

And how does the automatic import work? It would be impractical to have to use the --setup clause just to import even the most basic functions!

On the other hand, loading all available modules at once would slow the utility down to an intolerable level, not to mention the problems arising from identically named functions in different modules. The utility makes a compromise. Some functions are available right from the start, e.g. all symbols from the math library, so that the user can add up all the numbers in the input with just pz --end sum, internally translated to pz --end "s=sum(numbers)". Another large group of functions and modules is ready to be imported at the moment input processing fails because they are missing: for instance the requests module, so that a web page can be loaded with a simple echo "http://example.com" | pz 'requests.get(s).content'; the function sleep from the time module; the function datetime from the datetime module; randint from the random module; or the entire random module, so that the user can easily make a random pick via random.choice.
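The idea of importing only after a failure can be illustrated as follows. This is a sketch under our own assumptions, not the mechanism pz actually uses: the function eval_with_lazy_import is hypothetical, it reads the missing name from the NameError message, and it handles whole modules only (not single functions such as sleep or randint).

```python
import importlib
import re


def eval_with_lazy_import(expression, scope):
    """Evaluate the expression, importing missing modules on demand."""
    while True:
        try:
            return eval(expression, scope)
        except NameError as err:
            match = re.search(r"name '(\w+)' is not defined", str(err))
            if not match or match.group(1) in scope:
                raise
            name = match.group(1)
            try:
                # Import the module only now that it turned out to be needed.
                scope[name] = importlib.import_module(name)
            except ImportError:
                raise err  # no such module: report the original NameError


print(eval_with_lazy_import("math.sqrt(25)", {}))       # 5.0
print(eval_with_lazy_import("random.choice([7])", {}))  # 7
```

The start-up cost stays low because nothing is imported until an expression actually refers to it.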

Finally, let's take on a slightly more complicated challenge. We want to verify what value the random numbers generated by the function random.randint tend towards.

We are not interested in the entire run, only in, say, every ten-thousandth value. How do we do it? First of all, we don't need any input here.

We use the --generate switch with a value of zero to provide infinite input.

We throw away all lines whose number is not divisible by ten thousand and print the rest to the console in the format: current line number, current average of the values. The average stays around fifty.

pz "i+=randint(1,100); s = (count,i/count) if not count % 10000 else None" --generate 0
10000, 50.4153
20000, 50.35645
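For readers who want to see the same experiment without pz, here is a plain Python sketch. The generator name sampled_averages and its parameter every are ours; itertools.count plays the role of the endless input, and islice stops the otherwise infinite loop after two samples.

```python
import random
from itertools import count as line_numbers, islice


def sampled_averages(every=10000):
    """Yield (line number, running average) at every `every`-th line."""
    i = 0
    for count in line_numbers(1):  # endless 1, 2, 3, ...
        i += random.randint(1, 100)
        if not count % every:      # skip everything but each 10000th line
            yield count, i / count


for count, average in islice(sampled_averages(), 2):
    print(f"{count}, {average}")
```

The exact averages differ from run to run, but they should stay close to 50.5, the mean of the integers 1 to 100.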

The utility has no dependencies, only Python version 3.6 or later, which was released four years ago. It can be installed with the pip package manager, or you can simply download and run a single file. Its code is published under the GPLv3 license at https://github.com/CZ-NIC/pz . There you can find complete documentation and many other examples.