What is Tokenization?

Before processing natural language, we need to identify the words that make up a string of characters. That’s why tokenization is a foundational step in Natural Language Processing: the meaning of a text can be interpreted by analyzing the words it contains. Tokenization is the process of breaking the original text apart into individual pieces (tokens) for further analysis. Tokens are pieces of the original text; they are not reduced to a base form. In this blog, we will be using the spaCy library to tokenize some created text documents to help…
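To make the idea concrete, here is a minimal sketch of tokenization using only Python's standard library. It is not spaCy's tokenizer: the regular expression and the `tokenize` helper are assumptions made for illustration, and spaCy applies richer, language-specific rules (for example, it splits contractions differently).

```python
import re

# Hypothetical pattern: a run of word characters (optionally followed by an
# apostrophe and more word characters, e.g. "weren't"), or any single
# punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Split text into tokens, keeping each piece exactly as written."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("The cats weren't running, were they?")
print(tokens)
```

Notice that "cats" and "running" come back unchanged: tokens are pieces of the original text, not base forms such as "cat" or "run".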

In this blog, we will cover the basics of what hyperparameters are and how they relate to grid searching, then walk through an example notebook that uses grid searching to optimize our model.

What is a Hyperparameter?

A hyperparameter is a parameter whose value is not learned from the data. The value of a hyperparameter must be set before a model undergoes its learning process. For example, in a RandomForestClassifier model, some of the hyperparameters include: n_estimators, criterion, max_depth, min_samples_split, etc. (For a full list of the parameters, see scikit-learn’s RandomForestClassifier documentation page.)
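Because hyperparameters are chosen before training, a grid search simply tries every combination of candidate values. The sketch below only enumerates those combinations using the standard library; the candidate values and the `expand_grid` helper are hypothetical, and no model is trained here.

```python
from itertools import product

# Hypothetical candidate values. The names match hyperparameters of
# scikit-learn's RandomForestClassifier.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
}


def expand_grid(grid):
    """Return one dict per combination of candidate hyperparameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


candidates = expand_grid(param_grid)
print(len(candidates))  # 3 * 3 * 2 combinations
print(candidates[0])
```

In practice, each dictionary would be used to configure one model (for instance, passed as keyword arguments to the classifier), and the combination with the best validation score would be kept.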

For the purpose of this blog, we…

Basic Overview of Pipelines

Pipelines are common in machine learning systems; they speed up and simplify preprocessing. A pipeline chains multiple estimators into one, which automates much of the machine learning workflow. This is extremely useful because there is often a fixed sequence of steps in processing the data. Pipelines also make it easy to build baseline models and compare them to see which gives a better result for a particular metric or set of metrics, although it can be tricky to access certain parts of a pipeline. The skeleton of a pipeline for one model is fairly simple.
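The chaining idea can be sketched in plain Python. This is a toy illustration, not scikit-learn's Pipeline class: `SimplePipeline`, `MeanCenterer`, and `SignClassifier` are hypothetical names invented for this example, and the data is made up.

```python
class MeanCenterer:
    """Toy preprocessing step: subtracts the mean learned during fit."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class SignClassifier:
    """Toy final estimator: predicts 1 for positive inputs, otherwise 0."""

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return [1 if x > 0 else 0 for x in X]


class SimplePipeline:
    """Chains (name, step) pairs; the last step is the final estimator."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None):
        # Fit each preprocessing step, passing its output to the next one.
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Reuse the already-fitted preprocessing steps, then predict.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


pipe = SimplePipeline([("center", MeanCenterer()), ("clf", SignClassifier())])
pipe.fit([1.0, 2.0, 3.0, 6.0])
predictions = pipe.predict([2.0, 4.0])
print(predictions)

# Reaching a specific step means looking it up by name.
learned_mean = dict(pipe.steps)["center"].mean_
print(learned_mean)
```

The last two lines hint at why accessing parts of a pipeline can feel awkward: the fitted objects live inside the chain, so you have to look them up by name.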

Our Example Data

Before we begin this fun journey, a word of caution: the focus of this blog is NOT on cleaning the data and checking whether or not the assumptions of linear regression (briefly listed below) are met. Instead, the focus is on how to format the dataset so we can feed it into a linear regression model using PySpark!

The Assumptions of Linear Regression

Linear regression is an analysis that assesses whether one or more feature variables explain the target variable.

Linear regression has 5 key assumptions:

  • Linear relationship
  • Multivariate normality
  • No or little multicollinearity
  • No auto-correlation
  • Homoscedasticity

If you’d like to know more about the…

In this blog, we will briefly cover what Apache Spark and Databricks are, how they are related to each other, and how to use these tools to analyze and model Big Data.

What is Spark?

What is a Spectrogram?

Spectrograms are immensely useful tools that we can use to help dissect information from audio files and process it into images. In a spectrogram, the horizontal axis represents time, the vertical axis represents frequency, and the color intensity represents the amplitude of a frequency at a certain point in time. In case you can’t quite picture that, here is an example of what a spectrogram looks like:

Mel-Spectrogram of Johannes Brahms’s Hungarian Dance №5
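For readers who want to see where the time and frequency axes come from, here is a minimal sketch that builds a spectrogram from scratch with the standard library. It makes several simplifying assumptions: the `spectrogram` helper is hypothetical, it uses a naive discrete Fourier transform with no window function, the test tone is synthetic, and the frequency axis is linear rather than Mel-scaled as in the image above.

```python
import cmath
import math


def spectrogram(signal, frame_size, hop):
    """Return magnitudes indexed as [time_frame][frequency_bin]."""
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        row = []
        # Naive DFT; keep only the non-negative frequency bins.
        for k in range(frame_size // 2 + 1):
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            row.append(abs(coeff))  # amplitude -> color intensity
        frames.append(row)
    return frames


# Hypothetical test signal: a pure tone completing 4 cycles per 32 samples.
frame_size = 32
signal = [math.sin(2 * math.pi * 4 * n / frame_size) for n in range(128)]
spec = spectrogram(signal, frame_size, hop=16)

# For each time frame, find the frequency bin with the largest amplitude.
peaks = [max(range(len(row)), key=row.__getitem__) for row in spec]
print(len(spec), peaks)
```

Each row of `spec` is one step along the horizontal (time) axis, each column is one step along the vertical (frequency) axis, and the stored magnitude is what a plotting library would map to color. Because the tone is constant, the peak stays in the same frequency bin for every frame.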

The cool part about these images is that we can actually use them as a diagnostic tool with Deep Learning and Computer Vision to train convolutional neural networks for the classification of a…

What is Prophet?

In 2017, Facebook open-sourced Prophet, a forecasting library with easy-to-use tools available in Python and R. While it is considered an alternative to ARIMA models, Prophet really shines when applied to time-series data that have strong seasonal effects and several seasons of historical data to work from.

Prophet is, by default, an additive regression model. It is also specifically designed to forecast business data. According to Taylor and Letham, there are four main components in the Prophet model:

  1. A piecewise linear or logistic growth curve trend. …

Google Colab is a great web IDE to use for any type of coding project (especially projects involving bigger datasets or requiring higher computational power), and it is my preferred IDE when creating projects. Think of Google Colab as a Jupyter Notebook that runs entirely in the cloud and comes with many core libraries pre-installed. If you’re wondering how to set up your own notebook in Google Colab, you’ve come to the right place! In this blog, I will walk through how to set up your own Google Colab notebook, and finish by discussing the pros and cons.

How to Set Up:

Christopher Lewis

I am an aspiring Data Scientist and Data Analyst skilled in Python, SQL, Tableau, Computer Vision, Deep Learning, and Data Analytics.
