Posts

Today I was struggling with a relatively simple operation: unnest() from the tidyr package. What it’s supposed to do is pretty simple: when you have a data.frame where one or more columns are lists, you can unlist these columns, duplicating the information in the other columns whenever a list element has a length greater than 1.

library(tibble)
df <- tibble(
  a = LETTERS[1:5],
  b = LETTERS[6:10],
  list_column = list(c(LETTERS[1:5]), "F", "G", "H", "I")
)
df

## # A tibble: 5 x 3
##   a     b     list_column
##   <chr> <chr> <list>
## 1 A     F     <chr [5]>
## 2 B     G     <chr [1]>
## 3 C     H     <chr [1]>
## 4 D     I     <chr [1]>
## 5 E     J     <chr [1]>

library(tidyr)
unnest(df, list_column)

## # A tibble: 9 x 3
##   a     b     list_column
##   <chr> <chr> <chr>
## 1 A     F     A
## 2 A     F     B
## 3 A     F     C
## 4 A     F     D
## 5 A     F     E
## 6 B     G     F
## 7 C     H     G
## 8 D     I     H
## 9 E     J     I

I came across this a lot while working on data from Twitter, since individual tweets can contain multiple hashtags, mentions, URLs and so on, which is why they are stored in lists.
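To make the Twitter case concrete, here is a minimal sketch with a tweet-like toy table (the data and column names are made up for illustration):

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: each status carries one or more hashtags,
# stored as a list column.
tweets <- tibble(
  status_id = 1:2,
  hashtags  = list(c("#rstats", "#tidyr"), "#cran")
)

# Unnesting the list column duplicates status_id as needed,
# yielding one row per hashtag.
unnest(tweets, hashtags)
```

Since tidyr 1.0, unnest_longer(tweets, hashtags) is an alternative that does the same for a single list column.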

CONTINUE READING

I’m happy to announce that rwhatsapp is now on CRAN. After a year of testing by users on GitHub, I decided it was time to make the package available to a wider audience. The goal of the package is to make working with ‘WhatsApp’ chat logs as easy as possible. ‘WhatsApp’ seems to be becoming increasingly important not just as a messaging service but also as a social network, thanks to its group chat capabilities.

CONTINUE READING

Today is Valentine’s Day. And since my sweetheart and I are both R enthusiasts, here is how to say “I love you” using a statistical programming language:

library("dplyr")
library("gganimate")
library("ggplot2")

hrt_dat <- data.frame(t = seq(0, 2 * pi, by = 0.01)) %>%
  bind_rows(data.frame(t = rep(max(.$t), 300))) %>% 
  mutate(xhrt = 16 * sin(t) ^ 3,
         yhrt = 13 * cos(t) - 5 * cos(2 * t) - 2 * cos(3 * t) - cos(4 * t),
         frame = seq_along(t)) %>% 
  mutate(text = ifelse(frame > 300, "            J", "")) %>%
  mutate(text = ifelse(frame > 500, "A           J", text)) %>%
  mutate(text = ifelse(frame > 628, "A     +     J", text)) %>% 
  mutate(texty = 0, textx = 0)

ggplot(hrt_dat, aes(x = xhrt, y = yhrt)) +
  geom_line(colour = "#C8152B") +
  geom_polygon(fill = "#C8152B") +
  geom_text(aes(x = textx, y = texty, label = text), 
            size = 18, 
            colour = "white",
            vjust = "center") +
  theme_void() +
  transition_reveal(frame)

CONTINUE READING

Some time ago, I saw a presentation by Wouter van Atteveldt who showed that wordclouds aren’t necessarily stupid. I was amazed, since wordclouds were one of the first things I ever did in R and they are still often shown in introductions to text analysis. But the way they are usually made is, in fact, not very informative: because the positions of the individual words in the cloud do not mean anything, the only information communicated is through the font size and sometimes the font colour of the words.
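The point can be seen in a minimal sketch of such a cloud, built here with ggplot2 (the words, frequencies and positions are made up for illustration):

```r
library(ggplot2)

# Made-up word frequencies; the positions are random, so they carry
# no information -- only the size aesthetic does.
set.seed(1)
words <- data.frame(
  word = c("data", "model", "text", "analysis", "cloud"),
  freq = c(50, 30, 25, 20, 10),
  x = runif(5),
  y = runif(5)
)

p <- ggplot(words, aes(x = x, y = y, label = word, size = freq)) +
  geom_text() +
  scale_size(range = c(3, 12)) +
  theme_void() +
  theme(legend.position = "none")
p
```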

CONTINUE READING

For my PhD project, I want to use Supervised Machine Learning (SML) to replicate my manual coding efforts on a larger data set. That means, however, that I need to put in some manual coding effort before the SML algorithms can do their magic! I have already used a number of programs to analyse texts by hand, and they all come with their upsides and downsides. A while ago, I coded articles to train an SML algorithm with a PDF of the text open on the left side of my screen and an Excel file with my category system on the right.

CONTINUE READING

“My PhD supervisor once told me that everyone doing newspaper analysis starts by writing code to read in files from the ‘LexisNexis’ newspaper archive. However, while I do recommend this exercise, not everyone has the time.” These are the first words of the introduction to my first R package, LexisNexisTools. My PhD supervisor was also the supervisor of my master’s dissertation, and he said these words before he gave me my very first book about R.

CONTINUE READING

A while ago, I was building a database of newspaper articles retrieved from LexisNexis for a research project in which I was working as a research assistant. At some point, we noticed that we seemed to have a lot of duplicates in our database. I had already removed the duplicates with R, so we were really surprised that they were still in there. However, after some investigation, I found that there were indeed small differences between the articles we had manually identified as duplicates in our data.
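A minimal sketch of how such near-duplicates can be flagged in base R, using the relative Levenshtein distance from adist() (the example texts and the 0.2 threshold are made up for illustration, not the exact procedure from the project):

```r
# Three toy articles; the first two differ only slightly.
texts <- c(
  "The quick brown fox jumps over the lazy dog.",
  "The quick brown fox jumped over the lazy dog!",
  "An entirely different article about something else."
)

# Edit distance between every pair, scaled by the longer text's length.
d <- adist(texts) / outer(nchar(texts), nchar(texts), pmax)

# Pairs below an (arbitrary) 20% difference threshold count as near-duplicates.
near_dups <- which(d < 0.2 & upper.tri(d), arr.ind = TRUE)
near_dups
```

This flags texts 1 and 2 as a near-duplicate pair while leaving the unrelated third article alone.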

CONTINUE READING

I have been playing with the idea of getting a personal website for a while now. The concept of a personal website seems to have lost appeal in recent years, due to the omnipresence of social media sites and universities granting some space on their websites for a few key points. However, it seems that most self-respecting academics still operate personal websites. After tweaking my own profile, I learned what could be one of the reasons: I tried to show it to a friend and failed miserably to find it again in the maze that is the UoG website (which is no worse than most other universities’ sites I’ve seen).

CONTINUE READING

Just a post to test whether relative links still end up in the RSS feed.

library(ggplot2)
ggplot(diamonds) +
  stat_density_2d(aes(x = x, y = depth, fill = stat(nlevel)), 
                  geom = "polygon", n = 100, bins = 10, contour = TRUE) +
  facet_wrap(clarity ~ .) +
  scale_fill_viridis_c(option = "A")

CONTINUE READING