Have you ever talked to… Geppetto?

7/2/2020

Recently I asked an AI what it thinks about existentialism. This was the answer: “L’intricato esistenzialismo diventa il motore di quella riflessione metafisico-filosofica di cui il primo pensatore si avvale” (roughly: “Intricate existentialism becomes the engine of that metaphysical-philosophical reflection on which the first thinker relies”). Very deep, I thought at first. Actually, it was probably too deep for me.

This AI (developed by Aptus.AI in collaboration with the Bruno Kessler Foundation, the University of Groningen and ILC-CNR) is called Geppetto, and it is the first cutting-edge text-generation system that ‘speaks’ Italian. A text-generation system, as the name says, is a system capable of writing text. The technology behind it is GPT-2, the state-of-the-art text-generation model released by OpenAI. GPT-2 is a neural network that is very good at learning statistical relationships between words, and it does so in a way that looks very natural when it writes a snippet of text. So natural that it can sometimes be mistaken for text written by a human.

To learn such complex relationships between words, GPT-2 needs to be ‘trained’ on a massive amount of written text. When OpenAI released it, they said they had used 40GB of text, which is a lot for a language model. The complexity of a neural network is measured in parameters, and GPT-2 has 1.5 billion of them. These parameters act like the knobs and switches of a giant equalizer: they are numbers that encode the network’s knowledge, letting it capture the many relationships among the huge number of words it ‘reads’ during training. This is what makes the training phase incredibly costly in terms of computational power and time. Given the effort needed to train such a huge model, the OpenAI team also released a much smaller and lighter version of GPT-2, which has 117 million parameters and needs less data and less time to train, at the cost of some generation quality, which still remains very high. Geppetto shares the GPT-2 architecture, in its lighter version, and was trained on 13GB of Italian text.
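To make the idea of "counting parameters" concrete, here is a minimal sketch that tallies the weights of a GPT-2-style network. The hyperparameters used (vocabulary of 50257 tokens, context of 1024 positions, hidden size 768, 12 layers) are assumptions taken from the publicly released small GPT-2 configuration, not figures stated in this post, and the function name is purely illustrative.

```python
# Rough parameter count for a GPT-2-style decoder.
# The hyperparameters passed in below are ASSUMED from the public small
# GPT-2 configuration; they are not stated in this post.

def gpt2_param_count(vocab_size: int, context_len: int, d_model: int, n_layers: int) -> int:
    """Count trainable weights, with the output layer sharing the token embeddings."""
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * d_model + context_len * d_model

    layer_norm = 2 * d_model                            # scale + shift
    attention = d_model * 3 * d_model + 3 * d_model     # query/key/value projection
    attention += d_model * d_model + d_model            # attention output projection
    d_ff = 4 * d_model                                  # feed-forward hidden size
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    per_layer = 2 * layer_norm + attention + feed_forward

    final_layer_norm = 2 * d_model
    return embeddings + n_layers * per_layer + final_layer_norm


small = gpt2_param_count(vocab_size=50257, context_len=1024, d_model=768, n_layers=12)
print(f"{small:,} parameters")
```

With these assumed values the count lands at roughly 124 million, a little above the 117 million quoted above. As far as I know, OpenAI later revised its originally reported figure for the small model upward, which would account for the gap.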

But how do you get Geppetto to generate something? This AI acts like a ‘greedy writer’: it needs just a prompt, and it starts writing. And here things get interesting. During training, Geppetto doesn’t just learn to write correct Italian; it also picks up the style of the text it reads. This means that, since it was trained on the whole of Wikipedia, it can literally draw up a Wikipedia page about a subject you have just made up.

For example, you can give it a prompt like this: “Wikipedia Michele Cafagna”. And it produces something like this:
“Wikipedia Michele Cafagna
È stato professore ordinario di Diritto internazionale all’Università di Trieste e all’Università degli Studi di Napoli “Federico II” dal 1992.
Si è occupato di Diritto internazionale e diritti civili a livello internazionale. In tale veste ha fatto parte dell’Accademia Americana delle scienze ed è stato membro del Consiglio Superiore della Giustizia dell’Unione europea.
Ha preso parte alle sessioni di Diritto internazionale dei processi penali in Europa dal 2008 al 2014”.

It can also write a new law. If you give it, for example, “Visto l’articolo 4 del dpcm”, it generates “Visto l’articolo 4 del dpcm 29 dicembre 1999, con il quale viene disposto l’inquadramento tra il personale interessato alle operazioni e le RSU per l’applicazione del quarto comma del predetto articolo 4, paragrafo 1, lettera a) del presente accordo, nei ruoli dei giornalisti, delle altre amministrazioni dello Stato, delle agenzie ed aziende autonome; Vista la legge 29 marzo 1982, n. 93, e successive modificazioni”.
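The prompt-then-continue behaviour in the examples above can be sketched with a deliberately tiny stand-in. What follows is not Geppetto and not a neural network: it is a word-pair counter trained on a made-up four-sentence corpus, and the names `CORPUS`, `train_bigrams` and `generate` are hypothetical. It only illustrates the loop of looking at the text so far, picking a plausible next word, appending it and repeating.

```python
import random
from collections import Counter, defaultdict

# Made-up toy corpus, for illustration only; Geppetto was trained on 13GB of text.
CORPUS = (
    "il gatto dorme sul divano . "
    "il gatto mangia il pesce . "
    "il cane dorme sul tappeto . "
    "il cane mangia la carne ."
)


def train_bigrams(text):
    """Count, for every word, which words follow it and how often."""
    follows = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def generate(follows, prompt, max_new_words=8, seed=None):
    """Extend the prompt one word at a time, sampling each next word by frequency."""
    rng = random.Random(seed)
    words = prompt.split()
    if not words:
        raise ValueError("the prompt must contain at least one word")
    for _ in range(max_new_words):
        candidates = follows.get(words[-1])
        if not candidates:  # last word never seen in training: stop
            break
        choices, weights = zip(*candidates.items())
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)


model = train_bigrams(CORPUS)
print(generate(model, "il gatto", seed=0))
print(generate(model, "il gatto", seed=1))
```

Because each next word is drawn at random in proportion to how often it was seen, the same prompt can lead to different continuations from one run to the next, unless the seed is fixed.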

Furthermore, Geppetto is non-deterministic: if you feed it the same prompt again, it will write something else. It can also reproduce reasonably well the style of the sources it was trained on when you name the source explicitly in the prompt, as long as that style is heavily represented in the training data.

But what happens when you feed it a neutral prompt? Does Geppetto have its own style? Well, not really. It has a writing style, but it is actually a mashup of what it learned during training. Identifying a proper style goes beyond our purposes, but running a linguistic profiling on its output and comparing it with human writing reveals something interesting. Geppetto’s output is very similar to human text, but it is usually simpler: it prefers common words to rare ones, and its sentences are similar in complexity but usually shorter than human ones.

When people are shown Geppetto’s output alongside human text and the output of an old-style probabilistic model, they recognize Geppetto’s text as generated 32% of the time. When asked to judge quality and rank the three, they place Geppetto right after the human text, whereas the probabilistic model is almost always at the bottom of the ranking. The longer the generated text, the better humans are at spotting it as machine-made, but Geppetto still remains far better than probabilistic models in every evaluation.
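To give a feel for what a linguistic profiling measures, here is a minimal sketch that computes just two of the signals mentioned above: average sentence length and the share of common words. The real profiling presumably covers many more features. The function `profile`, the word list `COMMON` and both example texts are hypothetical and made up for illustration.

```python
import re


def profile(text, common_words):
    """Return average sentence length (in words) and the share of common words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    if not sentences or not words:
        return {"avg_sentence_length": 0.0, "common_word_share": 0.0}
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "common_word_share": sum(w in common_words for w in words) / len(words),
    }


# Hypothetical mini word list and example texts, for illustration only.
COMMON = {"il", "la", "è", "un", "una", "di", "e", "gatto", "casa", "grande"}
human = "Il felino domestico sonnecchia placidamente nella dimora avita. La casa è grande."
generated = "Il gatto è in casa. La casa è grande. Il gatto è un gatto."

print(profile(human, COMMON))
print(profile(generated, COMMON))
```

On these invented samples the ‘generated’ text comes out with shorter sentences and a higher share of common words, which is the kind of difference the comparison above describes.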

Even though Geppetto outperforms old-style methods in generation quality, it is far from being the killer app of text generation. Its output loses logical coherence as the text gets longer, and sometimes it also lacks meaning. Still, Geppetto is a major leap forward, and it opens up substantial possibilities, especially for Italian language processing.