Engineering Language from a Linguist’s Perspective


Linguists in the NLP world

Imagine you are an expert in the art of drawing circles. That is just your thing. You are aware of circles on so many different levels and have spent a long time learning and researching the topic. Imagine someone wants to hire you for your expertise (drawing circles!). Wouldn’t that be great? Then you discover that you are given only paper: no pen, no pencil, no computer. That’s it. Make it work.

Now change “drawing circles” to “linguistics” and you will understand what it can be like to work in a field that invites linguists onto R&D teams, which then unwittingly place obstacles in our path and treat as true the by-now self-fulfilling prophecy:

“Whenever I fire a linguist, our system performance improves”.

Where the problem lies

Unfortunately, this attitude today doesn’t reflect a problem in the science of linguistics but rather the (in)ability to implement this knowledge technologically. Companies that wish to harness linguistic knowledge in the development of their products need to consider what platform they provide to their domain experts, and whether it is rich and robust enough. Obviously, none of us, whether programmers, NLP experts or linguists, would want to work in an environment where the tools we use actually limit our ability to program, or where the language we use cannot express all the functions we’d like it to, right?

Pairing linguists with programmers doesn’t work

As a result, companies may prefer linguists who can also program, and the sad outcome is that the linguist ends up thinking and solving problems like a programmer and not like a linguist. The other common solution is to pair a linguist with a programmer so that between the two of them they try to come up with a solution. Well, the nice thing is that we all get to make new friends (hurray!), but there’s a huge downside to this: linguists rarely get what they expected, and whatever their platform, it is limited and entirely dependent on future developments which the linguists aren’t able to carry out on their own. The linguist faces the challenge of designing solutions for linguistic problems in a non-linguistic language. The programmer’s implementation has no linguistic design to work with and is consequently based on objects visible to the programmer but not to the linguist.

Consider a day in their lives:

  1. All of the implementation is in the hands of the programmer, who is probably coding in a language that does not reflect the domain in the way the linguist conceives of it. If there’s a problem, the linguist cannot solve it on her own. We could be trying to solve a tiny issue with all the noun phrases that our system has to analyze, but we may not even have a noun phrase class to begin with. The system may not have a definition for what constitutes a noun phrase. This could be true for any linguistic object in the project. The feature could be any string of words where we want to find a linguistic characteristic.

  2. The linguist uses underdeveloped Domain Specific Languages such as scripts or CSV files, where design and flexibility are quite constrained by the very technology being used. The linguist sees patterns flailing their arms at her, just waiting to be generalized into a formalism, but if the resulting rule cannot be easily implemented in the design, it means nothing. The programmer, set on solving the puzzle, ends up having to come up with a formal solution that has nothing to do with the linguistic problem at hand.

Give linguists the tools to apply their knowledge

Now consider a world where software engineers use their expertise to change all this for all the non-programming domain experts out there. Indeed, some have already done just that.

Linguistic programming language

Here at Contextors, we are writing the formalism for a rule-based parser, and we use a very flexible, high-level language that allows any competent linguist to code. Everything, and I mean everything, is expressed in the language of the domain. As far as I’m concerned, the objects to be manipulated are noun phrases, preposition phrases, subjects, etc. [1] Words (nouns, verbs, etc.) have important attributes which I can make use of. I know that these attributes are derived from the lexicon, represented in database tables, that there are ‘nodes’, ‘atoms’ or ‘leaves’ and maybe even things like ‘children’ (it’s actually not so far from linguistic theory after all), but when I write my functions all of this is in the background and is the concern of the programmer. If something doesn’t work, it is very easy to differentiate between the actual linguistic design and other aspects of the parser and/or code.

Let’s take a look at some examples. If I want a noun phrase to project a subject, for instance, I can easily write the module that does just that. Let’s look at this little bit of Ruby code:
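The original listing (1) did not survive here, so what follows is only a minimal sketch of what such a module could look like. The Node class, the NounPhraseProjections module and the project method are illustrative assumptions, not the actual Contextors code.

```ruby
# Hypothetical sketch of (1); none of these names come from the real parser.

# A bare-bones parse tree node: a phrase category plus the functions it projects.
class Node
  attr_reader :category, :functions

  def initialize(category)
    @category = category
    @functions = []
  end

  def project(function)
    @functions << function
  end
end

# The linguistic rule itself: a noun phrase projects a subject.
module NounPhraseProjections
  def self.apply(node)
    node.project(:subject) if node.category == :noun_phrase
    node
  end
end

NounPhraseProjections.apply(Node.new(:noun_phrase)).functions # => [:subject]
NounPhraseProjections.apply(Node.new(:verb_phrase)).functions # => []
```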


OK, but we know that there are limitations to this rule. Not every noun phrase is a legitimate subject. So let’s start writing the validations according to which only certain noun phrases can project the subject function. To do this I need to investigate these noun phrases and be able to determine, for any given noun phrase: is it or isn’t it a valid subject? So I should be able to outline this prediction. What if I want to find the head noun of a given noun phrase (e.g., ‘dog’ in ‘the big black dog’) and check what sort of attributes it has? Is it a count noun (‘dog’) or a non-count noun (‘air’)? Is it plural (‘dogs’) or singular (‘dog’)? No problem. I can do all this because I can look at a noun phrase node and refer to its head, and then retrieve the lexical properties of the head. The generalisation should look something like this:
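As a stand-in for the missing listing (2), here is a minimal sketch of a first draft of the rule. The Noun and NounPhrase types and their attribute names are assumptions made for illustration only; in the real system the lexical properties would come from the lexicon.

```ruby
# Hypothetical sketch of (2); every name here is an illustrative assumption.

# A lexicon entry for a noun, with the two attributes the rule needs.
Noun = Struct.new(:lemma, :countable, :number) do
  def count?
    countable
  end

  def singular?
    number == :singular
  end
end

# A noun phrase node that can refer to its head noun.
class NounPhrase
  attr_reader :head

  def initialize(head)
    @head = head
  end

  # First draft of the rule: a singular count noun cannot project a subject.
  def valid_subject?
    if head.count?
      if head.singular?
        return false
      end
    end
    true
  end
end

NounPhrase.new(Noun.new("dogs", true, :plural)).valid_subject?   # => true
NounPhrase.new(Noun.new("dog", true, :singular)).valid_subject?  # => false
NounPhrase.new(Noun.new("air", false, :singular)).valid_subject? # => true
```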


As you can see, the code is all in a linguistic domain. I can refer to different nodes and to the structural relations between them. I can retrieve the lexical properties of the phrase head of any node in the parse tree. But this rule still needs some elaboration. The rule as it is currently written does not actually reflect the linguistic generalisation ‘a singular count noun cannot project a subject unless it is determined’. I left out the part about whether the noun phrase is determined or not. So now my code will allow ‘the dogs are barking’ and it will not allow ‘dog is barking’, but it will also wrongly disqualify ‘the dog is barking’.

Easy enough to change with a single UNLESS line:
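Again as a stand-in for the missing listing (3), a minimal sketch under the same assumptions: Noun and NounPhrase are hypothetical types, and is_determined? is given a trivial definition here only so that the sketch runs on its own.

```ruby
# Hypothetical sketch of (3); every name here is an illustrative assumption.

Noun = Struct.new(:lemma, :countable, :number) do
  def count?
    countable
  end

  def singular?
    number == :singular
  end
end

class NounPhrase
  attr_reader :head

  def initialize(head, determiner = nil)
    @head = head
    @determiner = determiner
  end

  # Trivial stand-in so the sketch is self-contained.
  def is_determined?
    !@determiner.nil?
  end

  # A singular count noun cannot project a subject unless it is determined.
  def valid_subject?
    if head.count?
      if head.singular?
        return false unless is_determined?
      end
    end
    true
  end
end

dog = Noun.new("dog", true, :singular)
NounPhrase.new(dog).valid_subject?        # => false ('dog is barking')
NounPhrase.new(dog, "the").valid_subject? # => true  ('the dog is barking')
```

The nesting mirrors the English sentence clause by clause, which is the point of writing it this way.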


Ah yes, is_determined? is defined elsewhere.

If you’re a programmer you may be wondering, though: what’s with the hierarchy? Combine all this into one IF rule. Maybe something like this?
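A minimal sketch of what the flattened version (4) might look like, using the same hypothetical Noun and NounPhrase types. It is meant to behave identically to the nested rule; only the shape of the condition changes.

```ruby
# Hypothetical sketch of (4); every name here is an illustrative assumption.

Noun = Struct.new(:lemma, :countable, :number) do
  def count?
    countable
  end

  def singular?
    number == :singular
  end
end

class NounPhrase
  attr_reader :head

  def initialize(head, determiner = nil)
    @head = head
    @determiner = determiner
  end

  # Trivial stand-in so the sketch is self-contained.
  def is_determined?
    !@determiner.nil?
  end

  # The same rule, collapsed into a single condition.
  def valid_subject?
    if head.count? && head.singular? && !is_determined?
      false
    else
      true
    end
  end
end

dog = Noun.new("dog", true, :singular)
NounPhrase.new(dog).valid_subject?        # => false
NounPhrase.new(dog, "the").valid_subject? # => true
```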


Now compare (3) and (4). See how the first one is closer to natural English? I know this may come as a shock, but non-programming people aren’t fluent in operator-ish. We could master it if we really had to (after all, ‘formal semantics/logic 101’ is a must in most linguistics departments), but why? The code in (3) reflects the problem in the way I conceptualise it. This is crucial for when I want to go back and make changes. The language used reflects the rule as it is given in English, which is critical for when other linguists want to rethink what I’ve done. They can easily find the relevant place to make changes without touching anything else. In this case, for non-programming experts, hierarchy is a good thing.

Consider a final case in point. Now we want to try and see what happens when there is a determiner. For this we need to go deeper/lower to the noun phrase level (or class).
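The listing (5) is also missing, so here is a minimal sketch of a determiner and head noun agreement check. The Determiner and Nominal types, the kind attribute and the function name are assumptions for illustration. The two branches hold the same comparison in this sketch; they are kept apart only to show where type-specific rules would go.

```ruby
# Hypothetical sketch of (5); every name here is an illustrative assumption.

Determiner = Struct.new(:text, :kind, :number)
Nominal = Struct.new(:text, :number)

# Formation rule: determiner + nominal => noun phrase.
# The two daughters are sisters, so their properties can be compared directly.
def noun_phrase_agreement?(determiner, nominal)
  case determiner.kind
  when :determinative_phrase
    # 'three girls' is allowed; 'three girl' and 'three air' are not
    determiner.number == nominal.number
  when :preposition_phrase
    # rules specific to preposition phrase determiners would be added here
    determiner.number == nominal.number
  else
    raise ArgumentError, "determiner type not covered: #{determiner.kind}"
  end
end

three = Determiner.new("three", :determinative_phrase, :plural)
noun_phrase_agreement?(three, Nominal.new("girls", :plural))  # => true
noun_phrase_agreement?(three, Nominal.new("girl", :singular)) # => false
noun_phrase_agreement?(three, Nominal.new("air", :singular))  # => false
```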


Here I’ve looked at the properties of two sister nodes (I know they are sisters because the formation rule is defined as ‘determiner + nominal => noun phrase’) and compared them. If they match, good; a mismatch should not be allowed. So ‘three’ agrees with ‘girls’ but not with ‘girl’ or ‘air’, and the parser will parse ‘three girls’ but not ‘three girl’ or ‘three air’. Why have I bothered to open a case? Because the aim was to generalise the rule for determiner and head noun agreement, and I already know from my tests (which are given in the code as examples in comments for your convenience) that I will need to account for two possible determiner phrase types: determinative phrase and preposition phrase (in fact there are additional determiner types, but we are getting ahead of ourselves). The case serves a purpose: it lets me answer the question as it was posed by the data and by the linguistic phenomena. In addition, when I come back to add more rules relevant only to determinative phrase determiners, I’ll easily be able to integrate them in the correct place, in accordance with the rules already there.

A new era ahead of us

At the end of the day, thinking logically and solving puzzles isn’t the domain of engineering alone but of scientific inquiry in general. Once we have the language to do that in, what could stop us? And after all that’s done, we can really start to make use of the vast knowledge that the field of linguistics has to offer the NLP/Computer Science world. Why limit yourself to POS, VP, NP, nsubj, etc. when there is so much more out there that we can accomplish together?

  1. Using full names and not abbreviations actually helps non-linguists learn the linguistic language and enhances communication with the domain experts. See Ubiquitous Language.
