The 5 Packages You Should Know for Text Analysis with R

A Complete Overview of the Most Useful Packages in R Data Scientists Should Know About for Text Analysis

Céline Van den Rul

1. The All-Encompassing: Quanteda

install.packages("quanteda")
library(quanteda)

Quanteda is the go-to package for quantitative text analysis. Developed by Kenneth Benoit and other contributors, this package is a must for any data scientist doing text analysis.

Why? Because this package allows you to do A LOT. This ranges from the basics of natural language processing (lexical diversity, text preprocessing, constructing a corpus, token objects and document-feature matrices) to more advanced statistical analysis such as Wordscores or Wordfish, document classification (e.g. Naive Bayes) and topic modelling.
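To make the basic workflow concrete, here is a minimal sketch of the corpus, tokens and document-feature matrix steps. The three toy documents and all object names are made up for illustration, and the code assumes a reasonably recent quanteda release (version 3 or later).

```r
library(quanteda)

# Three toy documents (illustrative data, not from the article)
texts <- c(
  doc1 = "Text analysis with R is fun.",
  doc2 = "R has many packages for text analysis.",
  doc3 = "Quanteda builds a document-feature matrix."
)

corp <- corpus(texts)                          # 1. build a corpus
toks <- tokens(corp, remove_punct = TRUE)      # 2. tokenise, dropping punctuation
toks <- tokens_tolower(toks)                   # 3. lowercase all tokens
toks <- tokens_remove(toks, stopwords("en"))   # 4. remove English stopwords
dfmat <- dfm(toks)                             # 5. document-feature matrix

# Most frequent features across the three documents
topfeatures(dfmat, 5)
```

From here, the document-feature matrix is the usual input for the more advanced models mentioned above.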

A useful tutorial of the package is the one developed by Kohei Watanabe and Stefan Müller (link).

2. The Transformer: Text2vec

install.packages("text2vec")
library(text2vec)

Text2vec is an extremely useful package if you're building machine learning algorithms based on text data. This package allows you to construct a document-term matrix (dtm) or term co-occurrence matrix (tcm) from documents. As such, you vectorize text by creating a map from words or n-grams to a vector space. Based on this, you can then fit a model to that dtm or tcm. The available models range from topic modelling (LDA, LSA) and word embeddings (GloVe) to collocations, similarity searches and more.
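As a rough sketch of that vectorization step, the snippet below builds a dtm and a tcm from three toy documents. The documents, ids and window size are assumptions chosen for illustration, and the code assumes text2vec 0.5 or later, where the token iterator can be reused.

```r
library(text2vec)

# Toy documents (illustrative data, not from the article)
docs <- c(
  "text mining with r",
  "machine learning on text data",
  "r packages for text mining"
)

# Iterator over tokenised documents
it <- itoken(docs,
             tokenizer = word_tokenizer,
             ids = paste0("doc", seq_along(docs)),
             progressbar = FALSE)

vocab <- create_vocabulary(it)          # unique terms and their counts
vectorizer <- vocab_vectorizer(vocab)   # maps each term to a column index

dtm <- create_dtm(it, vectorizer)       # sparse document-term matrix
tcm <- create_tcm(it, vectorizer,       # sparse term co-occurrence matrix
                  skip_grams_window = 2L)

dim(dtm)   # documents x terms
dim(tcm)   # terms x terms
```

Both matrices are sparse, which is what keeps this approach practical on larger collections of documents.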

The package is inspired by Gensim, a famous Python library for natural language processing. You can find a useful tutorial of the package here.

3. The Adapter: Tidytext

install.packages("tidytext")
library(tidytext)

Tidytext is an essential package for data wrangling and visualisation. One of its benefits is that it works very well in tandem with other tidy tools in R such as dplyr or tidyr. In fact, it was built for that purpose. Recognising that cleaning data always takes a great deal of effort and that many of these methods aren't easily applicable to text, Silge & Robinson (2016) developed tidytext to make text mining tasks easier, more effective and consistent with tools already in wide use.

As a result, this package provides commands that allow you to convert text to and from tidy formats. The possibilities for analysis and visualisation are numerous: from sentiment analysis to tf-idf statistics, n-grams or topic modelling. The package particularly stands out for the visualisation of the output.
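Here is a small sketch of what converting to and from a tidy format can look like, assuming dplyr is installed alongside tidytext. The two toy documents and the column names are invented for the example.

```r
library(dplyr)
library(tidytext)

# Toy documents (illustrative data, not from the article)
docs <- tibble(
  doc  = c("a", "b"),
  text = c("Text mining is fun and text is everywhere.",
           "Tidy tools make text mining easier.")
)

# To tidy format: one lowercased word per row
tidy_words <- docs %>%
  unnest_tokens(word, text)

# Word counts per document, then tf-idf statistics
word_tfidf <- tidy_words %>%
  count(doc, word, sort = TRUE) %>%
  bind_tf_idf(word, doc, n)

# From tidy format: back to a sparse document-term matrix
dtm_sparse <- word_tfidf %>%
  cast_sparse(doc, word, n)
```

Because every intermediate result is an ordinary data frame, it can be passed straight to dplyr verbs or to ggplot2 for plotting.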

You can find a useful tutorial of the package here.

4. The Matcher: Stringr

install.packages("stringr")
library(stringr)

As a data scientist, you've most likely already worked with strings. They play a big role in many data cleaning and preparation tasks. Part of the tidyverse, an ecosystem of packages that also includes ggplot2 and dplyr, the stringr package provides a cohesive set of functions that allow you to easily work with strings.

When it comes to text analysis, stringr is a particularly handy package for working with regular expressions, as it provides a few useful pattern matching functions. Other functions cover character manipulation (manipulating individual characters within the strings in character vectors) and whitespace tools (adding, removing and manipulating whitespace).
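A few of these functions in action, as a quick sketch. The input strings and the deliberately simple email pattern are made up for illustration and are not meant as a robust email matcher.

```r
library(stringr)

# Toy strings (illustrative data, not from the article)
x <- c("  Text mining  in R ", "contact: jane@example.org", "No match here")

# Pattern matching with regular expressions
has_text <- str_detect(x, "[Tt]ext")                           # TRUE FALSE FALSE
emails   <- str_extract(x, "[A-Za-z0-9._]+@[A-Za-z0-9.]+")     # NA "jane@example.org" NA

# Whitespace tools
squished <- str_squish(x)                     # trims ends, collapses inner runs of spaces
padded   <- str_pad("7", width = 3, pad = "0")  # "007"

# Character manipulation
upper <- str_to_upper(x)
first <- str_sub(squished, 1, 4)              # first four characters of each string
```

All of these functions are vectorised, so they apply to a whole column of text at once.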

The CRAN R project site has a useful tutorial on the package (link).

5. The Show-Off: Spacyr

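install.packages("spacyr")
library(spacyr)
spacy_install()     # one-time setup: installs spaCy in its own Python environment
spacy_initialize()  # run in each R session to open the connection to Python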

Most of you may know the spaCy package in Python. Well, spacyr provides a convenient wrapper around that package in R, making it easy to access the powerful functionality of spaCy in a simple format. In fact, it's a pretty incredible package if you think about it, allowing R to harness the power of Python. To access this Python functionality, spacyr opens a connection to Python when it is initialized within your R session.

This package is essential for more advanced natural language processing tasks (e.g. preparing text for deep learning) and offers other useful functionality such as part-of-speech tagging, tokenization and dependency parsing. It also works well in combination with the quanteda and tidytext packages.

You can find a useful tutorial for the package here.

Source: https://towardsdatascience.com/r-packages-for-text-analysis-ad8d86684adb