Splitting Strings

This post is based on a talk I gave at LRUG in October 2017 which you can stream.

Most modern programming languages have a function somewhere in their standard library for splitting strings. Given a string and a delimiter, this function will return an array of strings:

"hi-there".split("-") #=> ["hi", "there"]

But what does split return when it’s called on the empty string?

"".split("-") #=> ???

What should it return?

Why does this matter?

Imagine a Ruby file that parses output from a UNIX command:

file_count = `ls`.split("\n").length

When this is run in a directory containing the files cat and dog, ls will return "cat\ndog\n", which splits to ["cat", "dog"], which has length 2.

I had a similar piece of code that I needed to convert to Python so it could be integrated with another tool.

I knew that Python is very similar to Ruby. Both are modern high-level languages with rich standard libraries and methods for string manipulation. After rewriting my code in Python, I ended up with something like this:

from subprocess import check_output
file_count = len(check_output("ls", text=True).rstrip("\n").split("\n"))

And this behaves the same as the Ruby version. Most of the time.

They produce different outputs when there aren’t any files in the directory: Ruby returns 0 but Python returns 1! This is because in Ruby:

"".split("-") #=> []

while in Python:

"".split("-") #=> [""]

My incorrect assumption that the Ruby and Python versions would behave the same ultimately caused a bug in production code. When investigating the bug I was surprised to find that there was a difference between the two languages and it wasn’t clear to me why Python and Ruby had chosen to do different things. I decided to investigate.
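
As a rough illustration of the bug, here is a minimal Python sketch. The function names count_lines_naive and count_lines are hypothetical, and the command output is passed in as a plain string so the example runs without calling ls. It assumes the output ends with a single trailing newline, as ls output does.

```python
def count_lines_naive(output):
    # Naive port: drop the trailing newline, then split on "\n".
    # For empty output this splits "" and gets [""], so it reports 1.
    return len(output.rstrip("\n").split("\n"))


def count_lines(output):
    # str.splitlines() returns [] for the empty string,
    # so empty output is counted as 0 lines.
    return len(output.splitlines())


print(count_lines_naive("cat\ndog\n"))  # 2
print(count_lines_naive(""))            # 1
print(count_lines("cat\ndog\n"))        # 2
print(count_lines(""))                  # 0
```

Using splitlines() is only one possible fix; checking for empty output before splitting would work just as well.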

Why does Python do what it does?

This is easy to answer, as the creator of Python, Guido van Rossum, has given a reason:

“In Python the generalization is that since:
"xx".split(",") is ["xx"], and
"x".split(",") is ["x"], it naturally follows that
"".split(",") is [""].”

Guido van Rossum on the Python Mailing list (formatting added)

This generalization reveals a pattern: when a string is split by a delimiter that it doesn’t contain, the output array will contain the input string.
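
The pattern can be checked directly in Python. The sketch below uses two hypothetical helper names. The second helper states a related rule that holds for Python's str.split with an explicit separator: the output always has one more element than the number of delimiters in the input.

```python
def splits_to_itself(s, sep):
    # True when splitting s by sep gives back a one-element list holding s.
    return s.split(sep) == [s]


def length_matches_delimiters(s, sep):
    # With an explicit separator, str.split returns one more element
    # than the number of separators found in the string.
    return len(s.split(sep)) == s.count(sep) + 1


# Strings that don't contain the delimiter, including the empty string.
for s in ["xx", "x", ""]:
    assert splits_to_itself(s, ",")

# Strings with zero or more delimiters.
for s in ["", "x", "x,x", "x,,x", ","]:
    assert length_matches_delimiters(s, ",")
```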

This sounds like a sensible reason to me. Having patterns and rules behind the output makes it easier for programmers to reason about what split will return for various inputs. It’s not just Python that has chosen to do this. String splitting methods in Java, JavaScript, Elixir, Go, Haskell, PHP and Scala all follow the same pattern¹.

So why does Ruby do what it does?

To understand this, I started looking at the documentation for split. As I did, I started to realise how complex split is. It’s full of special cases for different inputs: when the pattern to split on is a space, when it’s a regex, when it’s an empty string, when the pattern is nil… and so on². This complexity is reflected in the main body of its implementation, which is about 210 lines long.

It’s the last line of the documentation that’s relevant to us:

When the input str is empty, an empty Array is returned as the string is considered to have no fields to split.

Not only does this describe the behaviour we see, it contains a clue as to why. The rest of the documentation talks about strings and substrings, but this line uses the word fields.

Fields?

In computing, a field is a piece of data. Records in a database have fields, lines in a CSV are made up of fields, and each line of output of the ls command is a field.

The key to understanding "".split(",") in Ruby is to think of it not as splitting the empty string, but as splitting a record with no fields.

We can make our own logical progression that shows that the length of the output of splitting in Ruby is linked to the number of fields in the input:

"x,x".split(",") has 2 fields and ["x", "x"] has length 2
"x".split(",") has 1 field and ["x"] has length 1
"".split(",") has 0 fields and [] has length 0

Ruby didn’t choose this behaviour in a vacuum: it copied it from Perl, and Perl copied it from AWK. You can see references to AWK in the Ruby source code and the Perl documentation.

What is this AWK thing?

AWK is a programming language that was created in the 70s for text processing. AWK is commonly used to process structured text files and output from UNIX programs that contain fields.

Split in AWK takes three arguments: the input string, the variable to store the output, and the delimiter. And, just like Ruby, AWK splits the empty string into an empty array:

{split("",a,"-"); print length(a); exit} # prints '0'

So Ruby, like Perl, has decided to have AWK-like splitting behaviour for empty strings. This makes it easier to process empty files or output from command-line tools, just as my ls example did earlier.
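
To get similar behaviour in Python, one option is a small wrapper around str.split. The sketch below is an approximation, not a port of Ruby's implementation: split_fields is a hypothetical name, and it only covers the simple case of a plain string delimiter with no limit. It returns an empty list for an empty record and drops trailing empty fields, which is what lets Ruby turn "cat\ndog\n" into two fields.

```python
def split_fields(record, sep):
    # Hypothetical helper: approximates Ruby-style splitting for a plain
    # string delimiter (no regex, no limit, no special-cased " ").
    if record == "":
        # An empty record is treated as having no fields.
        return []
    fields = record.split(sep)
    # Drop trailing empty fields, e.g. the one after a final newline.
    while fields and fields[-1] == "":
        fields.pop()
    return fields


print(split_fields("x,x", ","))          # ['x', 'x']
print(split_fields("x", ","))            # ['x']
print(split_fields("", ","))             # []
print(split_fields("cat\ndog\n", "\n"))  # ['cat', 'dog']
```

With this helper, len(split_fields(output, "\n")) gives 0 for an empty directory listing.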

A Python core developer’s lament

What surprised me most during my research was how complex the behaviour of split can be and how this complexity can lead to differences in behaviour between programming languages. I would like to end by quoting a Python core developer’s response to a proposal to change the behaviour of str.split to be more AWK-like:

“… For years, we've answered arcane questions about [str.split()] and have made multiple revisions to the docs in a never ending quest to precisely describe exactly what it does without just showing the C underlying code.

… Almost any change to str.split() would either complexify the explanation of what it does or would change the behavior in a way the would break somebody's code (perhaps in a subtle ways that are hard to detect).

...In my opinion, str.split() should never be touched again.”
Raymond Hettinger on the Python Mailing List

¹ This is not to say that splitting strings behaves the same in these languages for other inputs. It doesn’t.
² By my count, the documentation for String#split in Ruby describes 12 different behaviours depending on the inputs and global variables.