Wednesday 17 October 2018

On Echo Chambers, Statistical Principles, Vaccinations and Autism

The Centuries Old Vaccination Debate
The 'vaccination debate' is as old as the introduction of the first vaccinations in the eighteenth century. A minority of anti-vaccinationists rejected vaccinations with four main arguments, which have stayed remarkably constant over the centuries: 1) vaccinations cause all kinds of harm (syphilis, measles, encephalitis, autism, death, et cetera, depending on the time period); 2) vaccinations are unnatural; 3) vaccinations go against God's predestination; 4) vaccinations are forced upon us by Edward Jenner or the pharmaceutical industry to make a profit.[1]

Unfortunately, in the twenty-first century anti-vaccinationists are gaining ground and vaccination rates are dropping. This creates a global public health risk. The anti-vaccination lobby would not be so successful if more people had a better grasp of vaccination history, basic causality, communication science and statistics.[2] As a historian, a data scientist and a father I will argue that the idea that vaccinations cause all kinds of bad side effects, most notably autism, is a dangerous myth. If you are convinced vaccinations cause harm, I have little faith that I can change your mind, even if you are still reading, unless you keep an open mind. This blog is mostly for people who are in doubt or are not sure what to think, and for people who are not in doubt but want more solid arguments to use against anti-vaccinationists. I will demonstrate that, although it is very understandable that people still think vaccinations cause harm, there is no basis in fact or logic for doing so.

Echo Chambers and Confirmation Bias
In this century the vaccination debate has found its way to online communities, where it has been carried on with increased intensity.[3] Despite irrefutable scientific evidence to the contrary, an alarming number of people are convinced that vaccinations cause autism or other unwanted side effects. This has at least partly to do with the online world making it easy to create new 'echo chambers'. Echo chambers are sealed-off (metaphorical) spaces in which like-minded people find each other and confirm each other in their beliefs. In earlier times echo chambers were mostly formed in small communities, but online they can bring together people from all over the world. The internet also makes all kinds of information readily available. It is human to read and believe the information that confirms what you already think you know, even if it is contradicted by many more, and better substantiated, sources. When people lock themselves up in echo chambers where one-sided information is spread among like-minded individuals, it can even seem as if the majority of the population thinks like they do. Any information that says the contrary can likewise be easily dismissed.[4]

The Outliers have more impact on public perception
It is highly unlikely, yet still possible, that a vaccination causes an unwanted side effect. Stories of the failures, however, are the ones that spread rapidly. You rarely hear someone say a vaccination went perfectly, for the simple reason that there is nothing special to report. If something, supposedly, goes wrong, people are far more likely to share their story. This is why the outliers, the results that are most unlike regular results, have more impact. If 1 in 10,000 vaccinations has some kind of bad side effect, this potentially has much more influence on public perception than the 9,999 vaccinations that went just fine. Already in the nineteenth century anti-vaccinationists often had a few horror stories ready to scare parents into not vaccinating. Since the (online) world is a big place, the number of scary experiences can go up quickly, even if in relative terms the number is still small and insignificant. A case file of a hundred 'horror stories' can serve as a scary deterrent that keeps young parents from vaccinating.

Sometimes we have to accept we do not know or can not influence the cause
Something else that is very human is the compulsive need to understand the world and to be able to influence what is going on. In earlier centuries, when a harvest failed it was the work of the devil, or a witch, or a punishment from God. You could try to improve your fate by burning the witch or by praying. Simple 'bad luck' with the weather is more difficult to accept, since you cannot do much about that as an individual. No one likes to be a helpless victim of 'dumb bad luck'. Still, sometimes we have to accept this.

Just because B follows A does not mean A caused B
The difficulty of accepting helplessness is also part of the reason that the link between vaccinations and autism is so persistent. It is unknown why one child is autistic and another is not, except that it seems to have something to do with genetics. Vaccinations are scary, because it is difficult to understand what you are really injecting into your child. If your child also becomes 'visibly autistic' around age two, shortly after receiving its second MMR vaccination, it is natural to think of a causal relation between something scary and something inexplicable. Human instincts are still very 'medieval'. For inexplicable phenomena people look for unlikely causes to channel their feelings of fear and helplessness. This feeling can be stronger than solid evidence that there is no causal relation between autism and vaccinations, and stronger than the exposure as a fraud of the first scientist who made this claim.[5]

Within a mass of big data there are no regular patterns
It is also important to realise that random data do not follow tidy, regular patterns. Statistics can quickly seem false if your own perceptions show a completely different pattern. Imagine that according to statistics 1 in 1000 children gets a high fever after being vaccinated, but that in your near surroundings you already know three children who were struck with high fevers. It is easy to think that the numbers should be 1 in 10 instead of 1 in 1000. The study must be flawed, or maybe the government has made it up! It would, however, be extremely unlikely for a statistical pattern to be perfectly regular, even if all other circumstances were the same. For many professors this knowledge is even a way to quickly spot badly falsified data. A normal pattern is not A, B, C, D, E, A, B, C, D, E, but the more random A, A, A, C, E, A, A, B, A. Results clump together, which is why many gamblers think they are on a 'winning streak'. It is not a winning streak, however, but a normal result within the advertised odds of winning or losing. Every gamble has the same chance of success, regardless of whether the gambler has won or lost ten times before the current gamble.
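The clumping described above is easy to see in a small simulation. The following is a minimal sketch, not part of the original argument: it draws independent fair 'win or lose' outcomes and measures the longest streak. The function names and the seed are assumptions chosen for illustration.

```python
import random


def longest_run(outcomes):
    """Return the length of the longest run of identical consecutive outcomes."""
    best = 0
    current = 0
    previous = object()  # sentinel that is equal to no real outcome
    for outcome in outcomes:
        # Extend the current streak, or start a new one of length 1.
        current = current + 1 if outcome == previous else 1
        best = max(best, current)
        previous = outcome
    return best


def simulate_gambles(n, seed):
    """Draw n independent fair outcomes: 'W' (win) or 'L' (lose)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.choice("WL") for _ in range(n)]


if __name__ == "__main__":
    gambles = simulate_gambles(200, seed=1)
    print("".join(gambles[:40]))
    print("longest streak in 200 fair gambles:", longest_run(gambles))
```

A perfectly 'regular' sequence such as A, B, C, D, E, A, B, C, D, E has a longest run of 1, while the clumpy A, A, A, C, E, A, A, B, A from the text has a run of 3. Running the simulation with different seeds typically shows streaks of several identical outcomes in a row, even though each gamble is independent of the previous one.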

Measuring more does not mean there is more than before
It does not help that autism seems to be a modern phenomenon. People therefore try to blame modern vaccinations, changes in the environment or even eating patterns. It is true that an increasing number of people are diagnosed with autism. This does not (necessarily) mean more people are autistic than before. The 'elf children', 'eccentric uncles' or 'siblings in the lunatic asylum' of centuries gone by would now get a diagnosis of Autism Spectrum Disorder (ASD). Over the past decades the definition of autism has also been extended to include many more variations than before. To give one extreme example: not so long ago doctors could claim that only boys were autistic. Obviously the number of diagnoses will go up when you start including the other half of the earth's population as possible candidates as well. It has also become more difficult for people on the spectrum to go 'unnoticed'. Modern society likes to measure, quantify and categorise everything. Furthermore, and far more damaging, modern society subjects people to many difficult-to-channel impulses. One may wonder if brilliant minds of the past who may have been on the spectrum, like Mozart, Darwin and Einstein, could have flourished in the 21st century like they did in their own time.

Even if all of this were not true, and vaccinations did indeed cause autism, why would any parent prefer to have a child die from measles or polio over having an autistic child?

[1] D. Porter and R. Porter. 1988. The politics of prevention: anti-vaccinationism and public health in nineteenth century England. Medical History, 32:231–252; C. E. Daniels. 1875. De kinderpokinenting in Nederland: meerendeels naar onuitgegeven bescheiden bewerkt: eene medisch-historische studie. Amsterdam; W. Rutten. 1997. De vreselijkste aller harpijen. Pokkenepidemieen en pokkenbestrijding in Nederland in de 18e en 19e eeuw. Universiteit Wageningen.
[2] I owe my basic understanding of statistics to Statistical Reasoning for Everyday Life by Bennett, Briggs and Triola.
[3] Ana Lucia Schmidt, Fabiana Zollo, Antonio Scala, Cornelia Betsch, and Walter Quattrociocchi. 2018. Polarization of the vaccination debate on Facebook. Vaccine, 36:3606–3612.
[4] T. Chamorro-Premuzic. 13 May 2014. How the web distorts reality and impairs our judgement skills. The Guardian; M. del Vicario, G. Vivaldo, A. Bessi, F. Zollo, A. Scala, G. Caldarelli, and W. Quattrociocchi. 2016. Echo chambers: Emotional contagion and group polarization on Facebook. Scientific Reports, 6, 37825.

Tuesday 4 September 2018

Big Data and Autism - A Metaphorical Link

For the past six years I have been working in the field of Digital Humanities, trying to make sense of ‘big data’ for humanities research with computational methods: exploring the possibilities and limitations, seeking out new methodologies and, basically, going where no humanist has gone before.

For the past six years I also have been the father of a daughter, Anna, with autism. What I have been doing professionally she has been doing her entire life: trying to make sense of the ‘big data’ in her head with the methods she has at her disposal.

It is difficult to define what autism is. It is a spectrum with many symptoms, and individuals on the spectrum will be affected differently. For me it helps the most to think of autism as an ‘information processing disorder’. People with autism, or people with ASD (autism spectrum disorder), view the world fundamentally differently from people without. Often they have trouble filtering information and sounds, which can make the outside world an overwhelming place. This filtering of information and data can also be a problem for the processes taking place in the heads of people with autism. It is difficult to find the right piece of data for every situation, turn it into useful information and knowledge, and use that in a practical way. As a consequence, some people with autism barely speak, or are even completely mute. Others do the opposite and will rant endlessly to compensate. Others do both. Some do not speak, but do read and write. People with autism can therefore be very present, or completely silent. People with autism often have difficulty functioning in modern society, with its fast pace and many daily impulses. Autism should, however, be considered a variety, not a disability, and most certainly not a disease. Many people with an autistic brain can function well, some would say even better, if circumstances allow it. Some of the greatest minds in history, like Mozart, Darwin and Einstein, may have had autism. Some people with autism are disabled, because they cannot function in society on their own. Most, however, just function differently. Only time will tell how ‘severe’ my daughter Anna’s autism will continue to be.
She does not speak normally; sometimes she does not speak at all; at this point there is no way of knowing if she will ever function (properly) in society; if she will ever know romantic love; if she will be able to cope with the loss of loved ones; if she will ever deal with her neurotypical brother on equal terms; if she can ever live on her own. At this point there is no greater fear in my life than the thought that she will die alone in some kind of nursing home, frustrated and misunderstood, thirty years after my wife and I are gone.

Coming back to the Big Data Metaphor: when looking at Anna I often wish I could know what she is thinking. And I often wish I could make sense of the ‘big data’ in her brain. It’s like there is a barrier in Anna’s head, which makes it difficult to make sense of all the pieces of data. We do not know how ‘big’ the data in her head is, but we suspect it is big. She was extremely fast as a baby to learn words, phrases, songs, et cetera. Ironically it looked like she would be able to speak well ahead of her age. This changed when her autism stopped her from channeling all the data in her head into speech, when she was around two years old.  It is therefore likely that since that time she has amassed a huge pile of ‘big data’ that begs to be sorted, categorised, and, most importantly, translated into useful information and knowledge.

Some pieces of data are easy for Anna to find, especially colours. She can play happily with a yellow ball while repeating ‘the yellow, the yellow’. Sometimes she even refers to my wife and me by the colour of the clothes we are wearing. When she is searching for a toy she can keep repeating ‘Does Anna want the blue, does Anna want the blue’, leaving my wife and me desperately asking her over and over ‘the blue what, Anna?’ Anna knows the word for the item she is looking for, but it causes her visible pain to delve deeper into her brain and find and use this piece of data.

At times Anna taps into more data to describe a situation, and very rarely this even makes her sound almost neurotypical. Anna’s biggest problem may be grasping the link between words and having a ‘normal’ conversation. She uses the big data in her head to let us know what she wants, or needs. If a child in the playground tries to strike up a conversation with her, she stays mute, unable to respond well to an unexpected query in a different setting from an unknown person.

Anna’s problems in dealing with the big data in her head make her attached to the structures she does know in her life. When we prepare her to go to ‘school’ she knows she has to take the bus to go there. When we walk to the car on Sunday she knows we will do grocery shopping. When we take the car on another day she knows we may do something fun, like going to her grandparents, or to an amusement park or petting zoo. If something does not go according to what she expects, a little drama can unfold. If, for example, we drive in grandfather’s direction but go somewhere else she will protest. If we cannot eat french fries for lunch when ‘going out’ she will protest (even in a pancake house).

As a researcher it is my duty to try to do proper and conscientious research with big humanities data. As a parent it is my duty to try to hand Anna the algorithms to make sense of the big data in her head. The metaphor can be carried on to quite some extent: some things are easy to find with the algorithms, like colours, while others seem to be unattainable at the moment. If something unexpected happens the algorithm will fail. Sometimes we have to accept a less than optimal result. Sometimes we are only able to scrape data, without getting any information or knowledge. And we should always be aware that for some things we may already have reached the summit of what we can achieve.

Fortunately Anna has one advantage to help her out: the human mind is wonderful and powerful. An algorithm or methodology for humanities research can only be improved by the researchers. Anna’s mind is a processing pipeline that she will continue to improve over time. That way she may be able herself to find the proper pathways to make sense of the big data in her head and translate it into information, knowledge and eventually communication. Time will tell.

Tuesday 5 December 2017

Sinterklaas, Digitally

Sinterklaas in 2017

As I write this, it is almost pakjesavond (the evening of presents) for many. Children all over the Netherlands are eagerly awaiting the moment Sinterklaas drops by or leaves a sack of presents at the door. Since the time I was such a child myself, the celebration has lost its innocence for me because of the 'Pieten debate', which reached its low point with the blocking of the motorway to Dokkum to prevent an anti-Piet demonstration. I could have written an entire blog about that subject alone, but fortunately many others have done so better than I ever could. What I can do is offer a digital analysis of how the Sint and his servant appeared in the nineteenth century in the 'general' periodicals De Gids (from 1837, up to 1909) (DG) and Vaderlandsche Letteroefeningen (1776-1876) (VLO), to see what the 'tradition' looked like back then. Not to write a history of Sint and Piet, since their Wikipedia pages are already richly filled, but as an attempt to put the current overwrought mood into perspective.

58 texts mentioning Sinterklaas
A simple word search in DG and VLO gives me 58 texts in which 'Sinterklaas', 'Sint Nicolaas' or a spelling variant occurs. Some of these texts are not about Sinterklaas at all. There is also a town called Sint Nikolaas, some texts deal with invoking the saint or with myths surrounding him, and "Sint Nikolaas" was apparently also a surname. Most references, however, do concern the goedheiligman (the 'good holy man') as we know him today: the children's friend who is known above all from the Sinterklaas celebration. DG and VLO discussed everything that could be of interest to the civilised Dutchman. That the goedheiligman turns up only sporadically in well over 40,000 articles may serve as a first piece of perspective.
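A word search of this kind can be sketched with a regular expression. The following is only an illustration under stated assumptions: the pattern and the spelling variants it covers are hypothetical and are not the actual query used for DG and VLO. It also shows why the hits still have to be read by hand: the search cannot tell the saint from the town or the surname.

```python
import re

# Hypothetical pattern: these spelling variants are assumptions made for
# illustration, not the query actually run against DG and VLO.
SINT_PATTERN = re.compile(
    r"\b(?:sinterklaas|sint[\s-]*ni[ck]olaas|st\.?\s*ni[ck]olaas)",
    re.IGNORECASE,
)


def mentions_sint(text):
    """Return True if the text contains one of the assumed spelling variants."""
    return SINT_PATTERN.search(text) is not None


def count_matching_texts(texts):
    """Count how many texts contain at least one match."""
    return sum(1 for text in texts if mentions_sint(text))


if __name__ == "__main__":
    samples = [
        "Het Sinterklaasfeest nadert.",        # the celebration
        "Hij woonde in de stad Sint Nikolaas.",  # the town: a false positive
        "Kerstmis en Nieuwjaar.",               # no mention at all
    ]
    print(count_matching_texts(samples))
```

The second sample is matched even though it refers to the town, which is exactly the kind of hit that has to be weeded out manually.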

In VLO the Sint is only mentioned from the second quarter of the nineteenth century onwards. DG did not yet exist at that time, but the man is mentioned straight away in its first issue in 1837. There was no Pieten debate whatsoever, which is not very surprising, since the Sint's servant hardly appears at all and is not yet called Piet.

Some data on the Sint and his servant
Some fifty mentions of the Sint, sometimes in articles about something else entirely, do not lend themselves well to quantitative analysis, so I have to limit myself here to some anecdotal finds in an unscientific anthology.

The first thing that stands out is that Sinterklaas apparently used to come down the chimney. Something could, after all, 'drop from the sky like Sinterklaas through a chimney'. More than a hundred years later he has grown too old for that, unlike his fat alter ego from the United States.

Not only children could enjoy themselves around Sinterklaas. Apparently adults also sent each other something now and then, without mentioning that it came from them ('and don't you know from whom?'). A kind of forerunner of today's surprises (disguised gifts), then.

Not all associations with Sinterklaas were equally flattering. He was sometimes associated with deceit, fairy tales and falseness. There is one mention of 'Sinterklaas Christians', and elsewhere Sinterklaas is named in the same breath as 'a bogeyman', a 'dry ditch' and a 'paper wall'. In another text he is mentioned together with 'Bluebeard' as an example of a fable figure. Sometimes Sinterklaas is also associated with wastefulness, through 'scattering numerous gifts all around', and there were calls to think of the poor as well. The Sinterklaas celebration was in any case a good marketing technique, since some products were specifically recommended as Sinterklaas gifts. Confectioners also did good business during Sinterklaas, when the (rich) children saw their 'wildest dreams' come true.

Towards the end of the nineteenth century more and more of the now traditional elements seem to become established, such as the return to Spain, the 'scattering', the 'pepernoten' and the 'sweets' in the 'sack'. All this is presumably largely thanks to Jan Schenkman's booklet from 1850, entitled Sint Nikolaas en zijn knecht. The servant of Sinterklaas was mostly invisible during the nineteenth century. When he does appear, there is moreover only ever one of him. Towards the end of the nineteenth century, however, he was black. When a toddler in a literary text from 1889 looks up in awe at a tall stranger, he asks whether that might be the servant of Sinterklaas. He is then told that this cannot be, because the stranger is white and the servant 'pitzwart' (pitch black). The only time 'the black servant' is mentioned by name in these sources, however, he is called Hansje. Nowhere, incidentally, is it stated why the servant is black.

So, Sinterklaas
What can we learn from this? Not much, but probably above all that in the nineteenth century people were not as overwrought about Sinterklaas as they are now. And that while the celebration had by then already existed for at least three hundred years and had gone through many transformations. In the nineteenth century many changes came to the Sinterklaas celebration without a fight, consciously or unconsciously. Back then there was also no internet to force our opinions on others, no television to celebrate the feast nationally, and no mobile phones to coordinate a road blockade. There were things like the slave trade and smallpox, so there is little reason for nostalgia. People then had plenty of other things to get overwrought about.

Monday 29 August 2016

The Next Rembrandt: why did you do that?!

Michael Crichton's book Jurassic Park caused a true 'dinomania' when Steven Spielberg turned it into a movie in the early 1990s. The brilliance of the book lies in the scary message it conveys: even if science shows that you can do it, there may still be good reasons not to do it. It may not be a good idea to create a weapon of mass destruction (too late for that). It may not be a good idea to clone a dinosaur (still quite impossible). The question all scientists should ask themselves is the same one they ask their children on a daily basis: 'Why did you do that?!' It is tempting to say that the humanities are very good at asking the 'why' question, as suggested by the picture below, but that seems to be just vanity, as if natural scientists never consider whether they should be doing what they are doing.

This year a digital humanities project was in the news that was highly impressive and intriguing, but that also immediately raised the question for me: 'Why did you do that?!' A team from the University of Delft collaborated with Microsoft and the Mauritshuis museum to create the 'next Rembrandt', a painting in the style of the seventeenth-century Dutch artist Rembrandt van Rijn. Their work is generously sponsored by the Dutch ING bank. They have put a very nice video on their website in which they explain how they went about creating the 'next Rembrandt'.

According to ING spokesperson Tjitske Benedictus, ING wanted to bring its 'innovative spirit' to art and culture. Ron Augustus from Microsoft says they use 'technology and data' to create something the way Rembrandt did with his paintbrushes. They had the computer analyse a large number of (portrait) paintings by Rembrandt, to extract the typical Rembrandt 'nose', 'eyes', 'ear' and 'mouth', and calculate the distances between them on a face. They planted these features on a typical Rembrandt subject, which not surprisingly is a 30-to-40-year-old Caucasian male, leading to the 'next Rembrandt' (or maybe the 'average Rembrandt'), printed by a 3D printer.

Even though I am very much impressed by the science behind all of this, the question remains: 'why would you want to do that?' David de Witt from the Mauritshuis mentioned that Rembrandt was famous for being able to portray human emotions better than his contemporaries. Emotions of real people, immortalised by the grand master, are what make a Rembrandt a Rembrandt. This 'next Rembrandt', however, is nothing more than an average of all those emotions, planted on an average face of an average person who never existed, and as such not interesting. The creators may believe that Rembrandt would be pleased if he knew his work still lives on like this, but I find it more probable he would ask: 'Why did you do that? My real paintings are still displayed all over the world.'
At the end of the video ING's T. Benedictus concludes:  'The Next Rembrandt makes you think about where innovation can take us, what’s next?' Despite my reservations about the product itself, I indeed am thinking of where this innovation could take us. The technology used to analyse Rembrandt's paintings could be invaluable for art historians. To name a few possibilities:

1) Use the data on the average Rembrandt to identify unknown paintings as belonging to Rembrandt (or not).
2) Compare the data on the average Rembrandt to data on the average contemporary Whatever Other Painter to see in what ways the grand master was really unique.
3) Compare average styles over time to see what developments took place.
4) Compare average styles per location to see what developments took place.

I would, for example, really like to know, would like to see quantified, how the faces Rembrandt painted were different from those painted by other Dutch seventeenth century masters, sixteenth century predecessors, or eighteenth century followers. Or to know what styles Rembrandt may have borrowed from sources that are not so apparent at first glance. To answer such questions a huge database of paintings from a huge number of artists would have to be analysed in the same way as the work of Rembrandt.  

Fortunately the ING thinks it's important to bring innovation to culture, so it should be a matter of time before they sponsor such a project. 

Tuesday 8 September 2015

Digital Humanities: From Source Criticism to Tool Criticism *

The political history of the county of Holland in the first half of the sixteenth century is rather well documented. There was a lively correspondence between the president of the Council of Holland in The Hague, Gerrit van Assendelft, and the regent and stadtholder in Brussels. It is one of those coincidences of history that we have these sources at our disposal only because stadtholder Anton van Lalaing frequently resided in Brussels (and not in The Hague). This rich correspondence, however, does place the historian, like the younger version of myself ten years ago, in a difficult position. By looking at history only through the eyes of Van Assendelft, the image gets distorted. His correspondence is biased by his personal views on friends, enemies, relatives, and personal interests. Other opinions and views are hardly available, and when they are, for example when Van Assendelft was accused of corruption, heresy and nepotism, it is often difficult for the historian to assess which source is 'right'. Every student of history is therefore trained in proper source criticism and taught to approach the sources related to his/her subject as objectively as possible. Of course there are few people who would claim they can approach a document completely objectively. Everyone is shaped by his/her own time, location and surroundings and develops sympathy/antipathy towards his/her subject.

So far nothing new. A good historian will always look at his/her sources critically and be aware that perspectives, including his/her own, are subject to change. What is less obvious, and what people seem to be aware of only to a small extent, is that tools for digital historical research are also far from objective. Just like a historian, a tool gathers data and uses that to provide a synthesis/answer/visualisation. Just like a historian, a tool is filled with preconceptions/assumptions that can heavily influence the results of research.[1] If a tool always opts for a certain probability, for example that everyone without an exact date of birth lived before the twentieth century, this can be a useful filter for one research question, but could have large and unwanted repercussions for another. This realisation has necessary consequences: every tool a historian uses should be criticised like a fellow historian, or even like a (sometimes very sloppy) co-author or student assistant. This means that all choices made when developing a tool should be made explicit, and that ideally the complex algorithms which form the core of a tool should be understandable for the person using it. There are very few historians, however, who have the necessary technical expertise, at least that of a bachelor in computer science, to truly understand the finer nuances of computer code.
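The effect of such a built-in assumption can be made concrete with a small sketch. The records, names and the parameter below are all hypothetical and serve only to illustrate the point; this is not the code of any actual tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    name: str
    birth_year: Optional[int]  # None when the source gives no exact date


def born_before_1900(people, assume_unknown_is_early=True):
    """Select people born before 1900.

    assume_unknown_is_early is the (hypothetical) hidden default: when True,
    everyone without a known birth year is silently treated as pre-1900.
    """
    selected = []
    for person in people:
        if person.birth_year is None:
            if assume_unknown_is_early:
                selected.append(person)
        elif person.birth_year < 1900:
            selected.append(person)
    return selected


if __name__ == "__main__":
    # Hypothetical records, invented for illustration only.
    people = [
        Person("Person A", 1520),
        Person("Person B", None),
        Person("Person C", 1950),
    ]
    print(len(born_before_1900(people)))                                  # default assumption
    print(len(born_before_1900(people, assume_unknown_is_early=False)))  # assumption switched off
```

With the default, the person of unknown birth year is counted as pre-1900; with the assumption switched off, that person disappears from the result. A historian who does not know which of the two the tool does cannot judge what the resulting numbers mean.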

The question then is what could be done to bridge the gap between the historian and technology. The simplest answer of course would be that the historian must also become a computer scientist [2], or the other way around. Even though in the future there will hopefully be more such hybrid academics than now, it is unlikely that we will have thousands of such people in the near future. One of my history teachers at university once said: 'A historian needs to be an amateur in every field.' Maybe it is enough to become an amateur in computer science as well. Traditionally, historians become amateurs in the fields of law, ancient languages, geography, archival science, art history, psychology, codicology and sociology. Computer science could simply be added to this list. Just like the other fields of study, computer science is an aid to interpreting all of the available data correctly.

It is still an open question what level of amateurism in computer science is acceptable for using digital tools wisely. Since digital humanities is a still-emerging field, this question has many answers. Historians have used methods from other fields to various degrees over the centuries. The historian of a hundred years ago could not have predicted that statistics would now be a widely accepted skill for analysing historical material and that Latin would be becoming obsolete in many curricula. I would say that the necessary and (for now) sufficient conditions for using digital tools properly are: 1) the availability of detailed documentation of the choices made by the computer scientist, and 2) an understanding of how a computer scientist works and why he/she had to make certain choices. Or in other words: to a certain extent we need to master the languages of a computer scientist passively, which is also the level at which historians grasp most other fields. I read medieval French, have a basic knowledge of the work of the sociologist Bourdieu, and I know what the legal terms mean in medieval verdicts. I would however never be able to speak medieval French (or even decent modern French), lack the knowledge to criticise Bourdieu's work and have no clue if a medieval verdict is in line with how justice was applied in general in that time ... and I get away with it.

To grasp how a tool works, historians therefore need not necessarily be able to convert a text to linked data, but should be able to grasp to a basic extent how this process works and what RDF triples are. This would entail a cultural change, in which tools are not only used as household appliances, but are seen as the product of another academic field that needs to be approached critically before you can use them. Often historians stop at asking themselves how a tool can help them to answer their questions, while the importance of knowing how a tool is built and can be approached eludes them. Without such knowledge there can be no decent tool criticism, which will become increasingly important alongside the familiar source criticism.
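For readers unfamiliar with the term: an RDF triple is a statement of the form subject, predicate, object. The sketch below expresses two facts mentioned earlier in this blog as triples, using plain Python tuples and the N-Triples notation. The `http://example.org/` namespace and the predicate names are hypothetical, invented for illustration; a real project would use an established vocabulary.

```python
# Hypothetical namespace and predicate names, invented for illustration.
EX = "http://example.org/"

# Each triple is (subject, predicate, object), all identified by URIs here.
TRIPLES = [
    (EX + "Gerrit_van_Assendelft", EX + "heldOffice", EX + "President_of_the_Council_of_Holland"),
    (EX + "Council_of_Holland", EX + "locatedIn", EX + "The_Hague"),
]


def to_ntriples(triples):
    """Serialise URI-only triples in N-Triples notation, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


def objects_of(triples, subject, predicate):
    """Answer a simple question: which objects does this subject have for this predicate?"""
    return [o for s, p, o in triples if s == subject and p == predicate]


if __name__ == "__main__":
    print(to_ntriples(TRIPLES))
    print(objects_of(TRIPLES, EX + "Council_of_Holland", EX + "locatedIn"))
```

Converting a running text into such statements involves many choices (which entities, which predicates, which facts to leave out), and those choices are exactly what tool criticism needs to make visible.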

* This is a (bad) translation and slight adaption of my blog from 24 June 2014

[1]See the important article of B. Rieder and T. Röhle: 'Digital methods: Five
challenges' .in: D. M. Berry ed., Understanding Digital Humanities (2012) 67–84.
[2] Throughout the text 'computer scientist' can also be read as 'computational linguist'.

Thursday 13 August 2015

Biography of the future? Digital Humanities and a hypothetical biography of John de Witt (1625-1672)

Like any humanities scholar of the twenty-first century, the modern biographer should account for what to do, or not to do, with the advances of the so-called digital humanities. A wide variety of biographical sources, primary sources like correspondence and secondary sources like biographical dictionaries, have been brought online over the past decades and will increasingly be consulted by future biographers. The digital turn does more, however, than only make biographers consult sources from behind their own computer rather than in a library or archive. The digitized sources can be analyzed in new, more advanced ways; visualizations of the material help to reveal patterns; biographies can be presented in different ways; et cetera.
            The question remains if this digital turn, this increasing availability of ‘biographical data’ and new ways to consult them, really changes the biographies of the future.  This blog provides a critical reflection on the possibilities of digital humanities technologies for biographical research. I will take the life of grand pensionary John de Witt, the highest official of the Republic of the Netherlands (1653-1672), as an example of how a biographer could use digital humanities technology for biographical research, and to illustrate its potential and shortcomings. The choice for John de Witt springs forth from personal interest, the availability of plenty of primary sources to consult and the fact that he has been the subject of several larger biographies already, both in and outside the Netherlands.

I John de Witt: a life of diplomacy and writing

John de Witt is one of the most prominent figures in Dutch history. As grand pensionary of the province of Holland, the most powerful province of the Dutch Republic, he was considered to be the leader of the Republic and treated with all honors by foreign rulers. His untimely death in 1672, murdered and ripped apart by an angry mob, has contributed to the fame of his legacy. To date there is no conclusive evidence that lays the blame for this murder on any particular person(s), though his political enemy the prince of Orange, William III (the later king of England), is often mentioned as being (partly) responsible. Several biographies of John de Witt have appeared, in English and in Dutch.[2]

Statue of John de Witt in the Hague
 De Witt's archival legacy is immense. From his role as a state official alone, eight meters of letters are preserved in the Dutch National Archive, bundled in twenty volumes and written in circuitous official jargon.[3] He was not only a politician, but also a mathematician who is still considered to be a founding father of modern life insurance mathematics.
            To write a biography of De Witt is no small feat. The nineteenth-century politician J. R. Thorbecke noted that ‘To give us a life of De Witt worthy of the man is to assure oneself a place among historians of all time.’[4] Rowen needed almost 1000 pages to capture De Witt’s life in 1978. Panhuysen used slightly more than 500 pages to describe the life of De Witt in relation to that of his brother Cornelis in 2005. Recently, Prud’Homme van Reine needed more than 200 pages to describe the murder of De Witt and his brother alone.
            Rowen especially has consulted a tremendous variety of sources to compose his work. Necessarily, however, the biographer of De Witt, and of many other noteworthy persons in history, has to be selective in the topics to address and the sources to consult. One person can only ‘close-read’ a limited number of sources, especially in the twenty-first century, when there is no patience for academic output that takes longer than projects of four or five years. Digital humanities technology allows biographers to combine close-reading with ‘distant-reading’, in which larger texts are analyzed by the computer to help find patterns, test hypotheses and find leads for further research.

II Analyzing the texts
Computer software is very good at reading and interpreting text, as long as it is modern text and available in a software-interpretable (digitized) format. The first problem we encounter when we want to use digital humanities technology for a biography of De Witt is that his correspondence has not been digitized yet, and even if it were, the OCR (Optical Character Recognition) that would have to convert the handwriting to computer readable text might not deliver great results. Computers can be trained, with the help of the crowd, to recognize handwriting better, but handmade transcriptions are definitely to be preferred.[5]
            Let us assume, however, that we have all eight meters of De Witt’s correspondence in a computer readable format of decent quality. The first thing we would want is to learn about the texts themselves, so that they can tell us a bit more about the man we are writing about. Questions we can ask with the help of digital tools, but could previously ask only with great difficulty, are: How long were De Witt’s letters? Did he use many words per sentence and many sentences per letter compared to his contemporaries? (This could tell us something about his work ethic and personality.) How does this change over time, and why?
            Since the correspondence of De Witt is not actually available in a digitized format, we used as an illustration a transcription of a famous political text by De Witt from 1654, his Deduction, in which he defends himself against serious allegations after conceding to the English lord protector Oliver Cromwell that no member of the House of Orange would be appointed to the highest state offices.[6] By using simple and free online tools like Voyant Tools and WordCounter, we find that De Witt used 34,456 words in this political text, of which 5,185 are unique. He uses 749 sentences with an average of 46 words per sentence. The most frequently used noun is ‘provincien’ (provinces). We can also see that he uses that word frequently throughout the entire text and not just in one particular section, which shows the importance of the relation between the different provinces of the Republic that De Witt addresses. A word like ‘Brabant’, the province, is only used in a very restricted part of the text. The name Orange (Oraingne), that of the political adversaries of De Witt, is used often in the beginning of the text, as well as slightly after the middle and at the very end.
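
The counts mentioned above are simple enough to sketch in a few lines of Python. This is not how Voyant Tools or WordCounter work internally, only a rough illustration under stated assumptions: words are runs of letters and digits, sentences end at a full stop, exclamation mark or question mark, and the sample text is invented for the example (it is not a quotation from the Deduction).

```python
import re


def text_stats(text):
    """Return word count, unique-word count, sentence count and mean sentence length."""
    words = re.findall(r"\w+", text.lower())
    # Naive sentence split on ., ! and ?; abbreviations will be miscounted.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "words": len(words),
        "unique_words": len(set(words)),
        "sentences": len(sentences),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }


def dispersion(text, term, segments=10):
    """Count occurrences of `term` in each of `segments` equal slices of the text."""
    words = re.findall(r"\w+", text.lower())
    size = max(1, -(-len(words) // segments))  # ceiling division
    counts = [0] * segments
    for i, word in enumerate(words):
        if word == term.lower():
            counts[min(i // size, segments - 1)] += 1
    return counts


# Invented sample text, for illustration only.
sample = "The provinces are free. The provinces decide. Brabant is a land apart."

print(text_stats(sample))
print(dispersion(sample, "provinces", segments=3))  # [1, 1, 0]
print(dispersion(sample, "Brabant", segments=3))    # [0, 1, 0]
```

The `dispersion` function mirrors the observation about ‘provincien’ and ‘Brabant’: a word that is spread over all slices matters throughout the text, while a word confined to one slice belongs to a single passage. The reported average is also easy to check by hand: 34,456 words divided by 749 sentences is roughly 46.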

Voyant Tools as a means to analyze John de Witt's texts

Another word De Witt uses often is ‘God’ (in several spelling variations), which we find 37 times. If we wanted to write a paragraph in our biography about how religious De Witt was in his thinking, it would be a valuable exercise to compare the relative occurrence of ‘God’ in this political text to mentions of God in texts by other politicians of his time. This is necessary to contextualize De Witt and see how he compares to similar individuals. Once again, this would require access to full-text, computer readable versions of as many political texts as possible.
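
To compare texts of different lengths, raw counts have to be turned into relative frequencies, for example occurrences per 1,000 words. A minimal sketch, assuming that the spelling variants are supplied by hand; the variants in the usage line are illustrative guesses, not the list actually found in the Deduction:

```python
import re


def per_thousand(text, variants):
    """Occurrences of any spelling variant per 1,000 words of `text`."""
    words = re.findall(r"\w+", text.lower())
    wanted = {v.lower() for v in variants}
    hits = sum(1 for w in words if w in wanted)
    return 1000 * hits / len(words) if words else 0.0


# Figures reported above for the Deduction: 37 hits in 34,456 words.
deduction_rate = 1000 * 37 / 34456
print(round(deduction_rate, 2))  # 1.07

# Usage on an invented sentence with two assumed spelling variants.
print(per_thousand("God is good and Godt is great", ["God", "Godt"]))
```

So ‘God’ occurs about once in every thousand words of the Deduction. Whether that is much or little can only be judged once the same figure is available for texts by De Witt's contemporaries.
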
            Finally, when we are dealing with a wide variety of texts, we should definitely consider the popular exercise in digital humanities of topic modelling. With topic modelling the computer extracts topics from text, by looking at the words that are mentioned together in statistically meaningful ways. In this way we could globally assess what is discussed in which documents, without having to read them fully. If we, for example, would want to know in which letters De Witt mentions the strength of the Dutch fleet, topic modelling could point us the way.
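
A real topic model (such as LDA) is beyond a short example, but the raw signal it builds on, words that occur together in the same documents, can be sketched. The following is therefore not topic modelling itself, only a simplified co-occurrence count; the stop word list and the one-line 'letters' are invented for the illustration.

```python
import re
from collections import Counter
from itertools import combinations

# A tiny, assumed stop word list; real projects use much longer ones.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "on"}


def cooccurrence(documents):
    """Count, for each pair of words, the number of documents containing both."""
    pairs = Counter()
    for doc in documents:
        vocab = sorted(set(re.findall(r"[a-z]+", doc.lower())) - STOPWORDS)
        pairs.update(combinations(vocab, 2))
    return pairs


# Invented one-line "letters", for illustration only.
letters = [
    "The fleet needs more ships and sailors",
    "Ships of the fleet sailed for the Sound",
    "The provinces quarrel on taxes",
]

pairs = cooccurrence(letters)
print(pairs[("fleet", "ships")])  # 2
print(pairs[("fleet", "taxes")])  # 0
```

Words that keep appearing together, like 'fleet' and 'ships' here, are the kind of cluster a topic model would present as a topic, and the documents in which they occur are the ones a biographer would then read closely.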

III Patterns in Networks
One of the main techniques for a biographer to contextualize his or her subject is to analyze the networks he or she was part of. The analysis of correspondence with digital methods is a key component in finding out who had contact with whom and for what reasons. John de Witt was a statesman who led the anti-Orange party, who corresponded with foreign colleagues, ambassadors, scholars and international friends and who had an extensive patronage network. It is therefore of great interest to map his correspondence. His biographers have also used his letters extensively to define his relationship to other people.
            A relatively simple exercise would be to analyze all the recipients of letters from John de Witt and all the people who sent him letters. Visualizing this with maps, graphs and figures (several off-the-shelf tools exist to do this) would give a picture that allows us to see patterns we could not see before.[7] Stanford University’s Mapping the Republic of Letters provides good examples of such visualizations. If we take the use case of the French philosopher Voltaire, we can see graphs of the nationality and social background of his correspondents. Everything is also visualized on a map with the most modern techniques. For one, we may deduce that Voltaire’s correspondence was not as cosmopolitan as he might have wanted it to appear.
            An initiative directly relevant for John de Witt is the Circulation of Knowledge and Learned Practices in the 17th-century Dutch Republic. Even though De Witt was primarily a statesman, he is likely to have been in contact with the most prominent intellectuals of his time.[8] When searching in the database we find five letters from the scientist (theoretical physicist) and inventor Christiaan Huygens to De Witt, from 1658 to 1670. These letters seem to give insight into quite a formal relationship between the two, in which Huygens calls himself De Witt’s ‘humble servant’, as was the custom at that time.
            In one of these letters (1 February 1664) we also find a mention of ‘lord Brus, brother-in-law of the lord of Somerdijck’. Ideally, we want to know who this lord Brus is and what his relation to De Witt might have been, and the same goes for any other people who are mentioned in the letters to and from De Witt. We also want to be able to match this lord Brus to other mentions of him in the correspondence. A computer is able to search for other instances of lord Brus, but it cannot (without great difficulty) learn whether two such mentions refer to the same person, other than by giving a negative match if two Brusses are mentioned in letters that are chronologically too far apart.
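
Before anything is drawn on a map, such a network is just a list of letters with a sender and a recipient. A minimal sketch of the counting step, assuming the metadata is available as (sender, recipient, year) records; the correspondents and years below are hypothetical placeholders, not real letters:

```python
from collections import Counter

# Hypothetical letter metadata: (sender, recipient, year). Placeholder names.
letters = [
    ("De Witt", "Correspondent A", 1660),
    ("Correspondent A", "De Witt", 1660),
    ("De Witt", "Correspondent B", 1661),
    ("Correspondent A", "De Witt", 1662),
]


def edge_weights(letters):
    """Number of letters exchanged per unordered pair of correspondents."""
    return Counter(frozenset((sender, recipient)) for sender, recipient, _ in letters)


def degree(letters, person):
    """Number of distinct correspondents of `person`."""
    partners = set()
    for sender, recipient, _ in letters:
        if sender == person:
            partners.add(recipient)
        elif recipient == person:
            partners.add(sender)
    return len(partners)


weights = edge_weights(letters)
print(weights[frozenset(("De Witt", "Correspondent A"))])  # 3
print(degree(letters, "De Witt"))                          # 2
```

The edge weights (how many letters per pair) and the degree (how many distinct correspondents) are the basic quantities that off-the-shelf visualization tools turn into graphs.
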

John de Witt, by Adriaen Hanneman
Another problem is the variation in the spelling of names and in the way people are referred to or refer to themselves. Christiaan Huygens signed the same letter with ‘Chr. Huygens van Zuylichem’, after the castle and estate his father had acquired in 1630. Similarly, a computer would have great difficulty matching a mention of a ‘master Vincent’ to the right person without knowing the context. The problem of recognizing and matching individuals automatically (NERD: named entity recognition and disambiguation) is common in projects that deal with biographical data. However, statistics combined with domain expert knowledge are being applied with increasing success to match names in separate documents.[9]
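
A very rough sketch of such matching, combining string similarity with the chronological rule mentioned earlier. It uses `difflib.SequenceMatcher` from the Python standard library; the threshold of 0.8 and the maximum gap of 60 years are arbitrary assumptions for the illustration, and the second Brus mention (1666) is invented. Real NERD systems are far more sophisticated.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case a name and drop punctuation so spelling noise matters less."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()


def similarity(a, b):
    """Similarity ratio between 0 and 1 from difflib's SequenceMatcher."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def may_be_same_person(mention_a, mention_b, threshold=0.8, max_gap_years=60):
    """Each mention is (name, year). True means 'candidate match', not proof."""
    (name_a, year_a), (name_b, year_b) = mention_a, mention_b
    if abs(year_a - year_b) > max_gap_years:
        return False  # chronologically too far apart: negative match
    return similarity(name_a, name_b) >= threshold


print(may_be_same_person(("lord Brus", 1664), ("Lord Brus", 1666)))  # True
print(may_be_same_person(("lord Brus", 1664), ("lord Brus", 1764)))  # False
print(similarity("Chr. Huygens van Zuylichem", "Christiaan Huygens") >= 0.8)  # False
```

The last line shows the limits of the approach: the two names of Christiaan Huygens fall below the threshold although they refer to the same man, which is exactly where domain expert knowledge has to come in.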

IV Comparisons
When using computers you want to make use of the strength of their calculating power. For biographers it is particularly interesting to compare their subject to similar people, to be able to frame that individual in the context of his or her time.[10] To this end we need structured biographical data on as many people as possible.
            In the case of John de Witt there are several groups of people we want to compare him to. De Witt started to study Law at the University of Leiden at the age of sixteen. If we were able to draw up schematic biographies of all students in the years he studied there, we would know how he compared to them with regard to age, social and geographical background and later careers. De Witt was pensionary of Dordrecht shortly before becoming grand pensionary of Holland. By analyzing the previous office holders, and holders of the same office in other towns, we would get to know how unique his appointment was at that time. The same goes for the office of grand pensionary and practically any other network or group De Witt belonged to. Such structured, prosopographical analyses would allow us to make much better substantiated claims about the person of De Witt than is usual in biographies.
            The reason that such extensive comparisons often do not find their way into biographies is that they are very time consuming. With the help of digital methods, however, such comparisons can be made more easily. In order to do so we would need a large amount of biographical data online in a structured format.
 Digital biographical data can either be ‘digitized’, for example from a scanned book, or ‘digital born’.[11]  We can make a distinction between resources with primarily ‘generic’ biographical data, like dates and places of birth, marriage dates and dates and places of death, and resources with more narrative data with the description of a person’s life. If we, for example, take the Wikipedia entry on John de Witt then we have a quite extensive biography of his life (narrative data), accompanied by an info box with structured data on his life.  For a computer it is relatively easy to read and analyze the info boxes, but difficult to interpret the main text.
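
Why an info box is so much easier for a computer than running prose can be shown with a small sketch. The snippet below is a simplified, hand-written info box in 'key = value' form, not the actual markup of the Wikipedia entry, and the field names are assumptions; real info boxes are messier.

```python
# A simplified, hand-written info box in "| key = value" form.
# Real Wikipedia info boxes are messier (nested templates, references).
infobox = """
| name        = John de Witt
| birth_date  = 24 September 1625
| birth_place = Dordrecht
| death_date  = 20 August 1672
| death_place = The Hague
"""


def parse_infobox(text):
    """Turn '| key = value' lines into a dictionary."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("|") and "=" in line:
            key, value = line[1:].split("=", 1)
            record[key.strip()] = value.strip()
    return record


record = parse_infobox(infobox)
birth_year = int(record["birth_date"].split()[-1])
death_year = int(record["death_date"].split()[-1])
year_span = death_year - birth_year  # difference between the years, not the exact age

print(record["birth_place"])  # Dordrecht
print(year_span)              # 47
```

A dozen lines suffice to extract dates and places from structured fields. Getting the same facts out of the narrative text would require the computer to understand sentences, which is a much harder problem.
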
 Wikipedia is the biggest player in the field of online biographical data. Several studies have shown that for factual knowledge Wikipedia can compete with authoritative sources from professionals, especially in the field of data on people.[12] DBpedia publishes the data of Wikipedia’s info boxes in linked data format. This allows analyses over different datasets and therefore increases the potential to compare John de Witt to different groups of people. Structured data on no less than three quarters of a million individuals worldwide are available through DBpedia. The advantages of having biographical data online are recognized by editors of biographical dictionaries as well.[13] These dictionaries contain relatively short biographies of people who were considered worth describing at the time of publication. Especially since the nineteenth century, such dictionaries have been published in multiple volumes all over Europe, containing descriptions of thousands of people.[14] They form a rich source of information that no human could ever fully analyze with traditional methods.

V Publishing the Findings

Biographers, as historians in general, have a tendency to amass a vast amount of information on the person that is the topic of their research. Already in 1946 the Dutch biographer Jan Romein noted that biographies often were too long. The ideal length of a biography would be 200 pages, in which the biographer made a conscious selection of his source material.[15] When looking at the most prominent biographies of John de Witt, and to the length of modern biographies in general, we must conclude that Romein’s 200 page limit is rarely adhered to. Even if this is not necessarily a problem, it does reflect an above average struggle of biographers to select the right material and to keep a book within a certain page limit. Digital publishing might be the answer to both producing a manuscript of manageable size and being as ‘complete’ as possible. The monograph is not the only way anymore, or maybe even the most evident way, to publish your work.[16] With more authoritative sources online a change of attitude towards digital sources has taken place as well.  Recently, for example, academics have started to prefer the online version of the Oxford Dictionary of National Biography rather than the printed version.[17]
            There are many advantages to digital publishing. First of all, if we published our new biography of John de Witt online, we could easily rectify mistakes. If, for example, we identified the previously mentioned lord Brus in the correspondence incorrectly, we could simply correct that and account for the change.
            Secondly, digital biographies make it easier to let go of the traditional narrative of an individual’s life from birth to death. We could divide our biography into ‘themes’ (e.g. the murder of John de Witt), provide hyperlinks to more detailed information (e.g. the Dutch Republic and the navy) and even publish the source material we used, or left out, for a particular topic (e.g. the letter from Christiaan Huygens to John de Witt discussed in the previous section). The Ludwig Boltzmann Institut für Geschichte und Theorie der Biographie is a forerunner in publishing biographies this way. In particular, they work on alternative modes of presentation for the lives of Austrian writers Ernst Jandl and Karl Kraus. They developed a content management system called Biographeme, “which breaks down the closed linear mode of life narratives in favour of a modular form of biography, the individual components of which can be combined and recombined according to interest or the question asked.”[18]
            Thirdly, the digital era offers unprecedented possibilities for researchers to also put their raw material online. We could put transcriptions of letters of John de Witt, systematically gathered information in a database and, when copyright permits it, original material online as part of an online publication. This would be a highly efficient way to facilitate further research and to allow others to check our findings. It is also a good way to show that the tax money invested in the research was well spent. It could or should be part of any data management policies at academic institutions to facilitate storing these data and making them available for further use. This would of course, also mean that a biographer should not ‘claim’ his or her subject as sometimes is the case, but let go of the subject once a biography is published (or maybe even before that).[19]
            Finally, there is the possibility of working together on a project if you put it online before a printed publication. By bringing your material online you open up a dialogue that sits somewhere between writing and speaking.[20] In our case, for example, we could simply ask online whether anyone knows who lord Brus was. We could also provide our NERD results online and ask visitors of the website to spot incorrect matches. This way we avoid too many computer generated mistakes and get more input to refine our algorithms.


VI Problems
In comparison to previous biographies on John de Witt, this hypothetical biography would be based on more resources, include quantifications on the topics John de Witt addressed most frequently, say more about John de Witt as an individual compared to his contemporaries, would include detailed and strongly visualized network analyses and would be presented in a more dynamic way with room for all the material we have gathered.  Unfortunately, at the moment this still remains largely hypothetical. 
            Perhaps the most fundamental problem which needs to be discussed is that of representativeness. It is self-evident, but cannot be stressed often enough, that the scope of digital research is limited by the availability of computer readable digital data. The results of any exercise with computational methods should therefore be accompanied by a critical account of the completeness and biases of the sources. It would be very nice if we had all correspondence of John de Witt in a machine readable format, but that is not going to happen any time soon. Even only digitizing his correspondence would cost a lot of time and money.[21] In general, there is a huge amount of material that is not digitized (yet) and remains out of the scope of digital humanities research.[22]
            Another problem is the relatively high rate of mistakes and inconsistencies which one would have to deal with when using digital tools on biographical data. The OCR quality of digitized texts alone can lead to problems in the analysis, especially when looking for names.[23] In the case of De Witt, his seventeenth-century handwriting poses another (high) hurdle.
            Finally, computers may be great at making calculations, but are bad at interpreting text. They can only work with the algorithms we feed them. If we ask them to match names from different documents they are bound to make more mistakes than a human would make. It is for example difficult for a computer to separate the lord Wassenaar from the location Wassenaar. They would, however, do the work much faster. A detailed documentation on how computer tools were used for our research on John de Witt should be provided in order for other researchers to check the analyses. Unfortunately, complex tools are more likely to provide precise results, but are more difficult to comprehend on a more than basic level for the digitally lay user.[24]

I discussed the potential of digital humanities tools for a hypothetical biography of John de Witt. The hypothetical biography would be based on more documents and would provide detailed network analyses, comparisons to contemporaries and visualizations. Some questions which could not easily be answered before, like how religiously influenced De Witt is in his writing compared to his contemporaries, could even be answered. The hypothetical biography would be presented in a dynamic and interactive manner, providing possibilities for additions, dialogue and links to more information.
            Despite the apparent opportunities for biographical research, there still is a long way to go before this hypothetical biography of John de Witt could be written. First of all, many more archival sources should be put online in a format that makes it possible to analyze them with digital methods. Right now, we are still at the beginning of digitizing our cultural heritage. Techniques to convert handwriting into computer readable text are likewise still in the early stages of development.
            Some things, however, can already be done for biographical research thanks to digital humanities technologies. New forms of presenting biographies already exist. In this chapter we have also shown some very basic examples of what can be done with digitized texts relating to the life of John de Witt. Even though the analyses do not provide any conclusive evidence, they do inspire new research questions and provide leads into what may be worth investigating further.


