The website Footnote 2 was used to gather tweet IDs Footnote 3; this website provides researchers with metadata from a (third-party-collected) corpus of Dutch tweets (Tjong Kim Sang and Van den Bosch, 2013). This circumvents a restriction of the standard Twitter search API (i.e., the historical limit when requesting tweets based on a search query). The R package ‘rtweet’ and its complementary ‘lookup_status’ function were used to collect the tweets in JSON format. The JSON file comprises a table with the tweets’ information, including the creation date, the tweet text, and the source (i.e., the type of Twitter client).
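The paper's collection step was done in R with ‘rtweet’; the sketch below only illustrates, in Python, how the relevant fields (creation date, text, source) can be flattened from such a JSON payload into table rows. The field names follow Twitter's classic v1.1 tweet object and the sample data is invented.

```python
import json

# Invented miniature sample of a status-lookup JSON payload; field names
# (id_str, created_at, text, source) follow Twitter's v1.1 tweet object.
raw = '''[{"id_str": "1",
           "created_at": "Wed Oct 10 20:19:24 +0000 2018",
           "text": "Voorbeeldtweet",
           "source": "Twitter Web Client"}]'''

def tweets_to_table(json_str):
    """Flatten tweet objects into rows holding date, text, and client source."""
    return [{"id": tw["id_str"],
             "created_at": tw["created_at"],
             "text": tw["text"],
             "source": tw["source"]}
            for tw in json.loads(json_str)]

table = tweets_to_table(raw)
print(table[0]["text"])  # -> Voorbeeldtweet
```

The resulting list of rows corresponds to the tabular structure the paper describes converting into an R data frame.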
Data cleaning and preprocessing
The JSON Footnote 4 files were converted into an R data frame object. Non-Dutch tweets, retweets, and automated tweets (e.g., forecast-, advertisement-, and traffic-related tweets) were removed. In addition, we excluded tweets based on three user-related criteria: (1) we removed tweets belonging to the top 0.5 percentile of user activity, because we considered such users non-representative of the normal user population (e.g., users who created more than 2000 tweets within four weeks); (2) tweets from users with early access to the 280-character limit were removed; (3) tweets from users who were not represented in both the pre- and post-CLC datasets were removed. This last procedure ensured a consistent user sample over time (within-group design, Nusers = 109,661). All cleaning procedures and corresponding exclusion numbers are presented in Table 2.
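Two of the user-level criteria above (the activity cut-off and the requirement that a user appear in both periods) can be sketched as a filter over tweet records. This is a toy Python illustration with invented data, not the paper's R code; the 2000-tweet threshold stands in for the top-0.5-percentile rule.

```python
from collections import Counter

# Invented toy records: each tweet carries its author and period label.
tweets = [
    {"user": "a", "period": "pre"},  {"user": "a", "period": "post"},
    {"user": "b", "period": "pre"},                        # only pre-CLC
    {"user": "c", "period": "pre"},  {"user": "c", "period": "post"},
]

def within_group_sample(tweets, max_tweets=2000):
    """Keep tweets from users below the activity cut-off who appear in
    both the pre-CLC and post-CLC data (within-group design)."""
    counts = Counter(t["user"] for t in tweets)
    # (1) drop hyperactive users (stand-in for the 0.5-percentile rule)
    active_ok = {u for u, n in counts.items() if n <= max_tweets}
    # (3) keep only users present in both periods
    pre = {t["user"] for t in tweets if t["period"] == "pre"}
    post = {t["user"] for t in tweets if t["period"] == "post"}
    keep = active_ok & pre & post
    return [t for t in tweets if t["user"] in keep]

kept = within_group_sample(tweets)
print(sorted({t["user"] for t in kept}))  # -> ['a', 'c']
```

User "b" is excluded because it has no post-CLC tweets, mirroring criterion (3).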
The tweet texts were converted into ASCII encoding. URLs, line breaks, tweet headers, screen names, and references to screen names were removed. URLs add to the character count when located within a tweet; however, they do not add to the character count when located at the end of a tweet. To avoid misrepresenting the actual character limit that users faced, tweets containing URLs (but not media URLs, such as attached photos or videos) were excluded.
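The normalisation and URL-based exclusion can be sketched as follows. This is an illustrative Python version under stated assumptions: the regex for mentions and the ‘urls’/‘media_urls’ metadata fields are our own stand-ins, since the paper does not list its exact patterns or field names.

```python
import re

MENTION = re.compile(r"@\w+")  # assumed pattern for screen-name references

def clean_text(text):
    """Remove line breaks and screen-name references from a tweet."""
    text = text.replace("\n", " ")
    return MENTION.sub("", text).strip()

def keep_tweet(tweet):
    """Exclude tweets containing URLs unless every URL is an attached-media
    URL (photo/video); 'urls' and 'media_urls' are assumed metadata fields."""
    non_media = set(tweet["urls"]) - set(tweet["media_urls"])
    return not non_media

tweets = [
    {"text": "kijk dit\n@vriend", "urls": [], "media_urls": []},
    {"text": "lees: https://t.co/x", "urls": ["https://t.co/x"], "media_urls": []},
    {"text": "foto https://t.co/y", "urls": ["https://t.co/y"], "media_urls": ["https://t.co/y"]},
]
kept = [clean_text(t["text"]) for t in tweets if keep_tweet(t)]
print(kept)  # -> ['kijk dit', 'foto https://t.co/y']
```

The second tweet is dropped because its URL is not a media URL, matching the exclusion rule in the text.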
Token and bigram analysis
The R package Footnote 5 ‘quanteda’ was used to tokenize the tweet texts into tokens (i.e., isolated words, punctuation marks, and symbols). In addition, token-frequency matrices were calculated containing: the frequency pre-CLC [f(token pre)], the relative frequency pre-CLC [P(token pre)], the frequency post-CLC [f(token post)], the relative frequency post-CLC [P(token post)], and the T-score. The T-score is similar to a standard T-statistic and computes the statistical difference between means (i.e., between the relative word frequencies). Negative T-scores indicate a relatively higher occurrence of a token pre-CLC, whereas positive T-scores indicate a relatively higher occurrence of a token post-CLC. The T-score formula used in this analysis is presented in Eqs. (1) and (2). N is the total number of tokens per dataset (i.e., pre- and post-CLC). The equation is based on the method for linguistic computations by Church et al. (1991; Tjong Kim Sang, 2011).
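Since Eqs. (1) and (2) are not reproduced in this excerpt, the sketch below implements the standard Church et al. (1991)-style t-score for comparing a token's relative frequency across two corpora; this reconstructed form is an assumption and may differ in detail from the paper's exact equations.

```python
from math import sqrt

def t_score(f_pre, n_pre, f_post, n_post):
    """Church et al. (1991)-style t-score comparing a token's relative
    frequency before and after the CLC.  Negative -> relatively more
    frequent pre-CLC; positive -> relatively more frequent post-CLC.
    (Reconstructed form, since the paper's Eqs. (1)-(2) are not shown.)"""
    p_pre, p_post = f_pre / n_pre, f_post / n_post
    return (p_post - p_pre) / sqrt(p_pre / n_pre + p_post / n_post)

# A token seen 50 times in 10,000 pre-CLC tokens and 200 times in
# 20,000 post-CLC tokens is used relatively more often post-CLC:
t = t_score(50, 10_000, 200, 20_000)
print(round(t, 2))  # -> 5.0
```

The sign convention matches the text: swapping the pre- and post-CLC arguments flips the sign of the score.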
Part-of-speech (POS) analysis
The R package Footnote 6 ‘openNLP’ was used to classify and count POS categories in the tweets (i.e., adjectives, adverbs, articles, conjunctions, interjections, nouns, numerals, prepositions, pronouns, punctuation, verbs, and miscellaneous). The POS tagger operates using a maximum entropy (maxent) probability model that predicts the POS category based on contextual features (Ratnaparkhi, 1996). The Dutch maxent model used for the POS classification was trained on the CoNLL-X Alpino Dutch Treebank data (Buchholz and Marsi, 2006). The openNLP POS model has been reported to reach an accuracy of 87.3% when applied to English social media data (Horsmann et al., 2015). An ostensible limitation of the current study is the accuracy of the POS tagger; however, the same analyses were performed on both the pre-CLC and post-CLC datasets, meaning the tagger's accuracy should be consistent across both datasets. Therefore, we assume there are no systematic confounds.
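Once tweets are tagged, counting POS categories reduces to tallying tags. The Python sketch below illustrates that tally step with invented (token, tag) pairs; the tag labels are Alpino-style guesses, not output from the actual openNLP maxent tagger used in the paper.

```python
from collections import Counter

def pos_distribution(tagged_tokens):
    """Relative frequency of each POS category in a tagged token sequence."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

# Invented tagged tokens with Alpino-style category labels.
tagged = [("ik", "Pron"), ("zie", "V"), ("een", "Art"), ("vogel", "N")]
dist = pos_distribution(tagged)
print(dist["N"])  # -> 0.25
```

Comparing such distributions between the pre- and post-CLC datasets is what the per-category analyses in the paper amount to.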