Recently I have been practicing functional programming with Scala, and I thought a small project would be good practice: automagically downloading the Covid-19 reports from the World Health Organization.
A bit about Scala
Scala (short for "scalable language") is a statically typed language that combines functional and object-oriented programming, and it also lets you write imperative code if need be. It compiles to Java Virtual Machine bytecode and is widely used for big data workloads; Apache Spark, the distributed big data processing engine, is written in Scala.
What all of that means is that you can use existing Java libraries (and the many Stack Overflow answers!) and write elegant, succinct code to get the job done. The sucky part is that you’ll have to learn functional programming. Trust me, though, you won’t regret it.
Step 0: Dependencies and Imports
For this tutorial, I started a new project in IntelliJ using sbt (Scala’s de facto build tool) and added the jsoup library as a dependency in the project’s build.sbt file. We’ll use jsoup to parse the HTML from the WHO website.
We’ll rely on a handful of imports throughout the project. Personally, my knowledge of the Scala standard library is quite limited, so I won’t go into detail on them at this point.
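For reference, the dependency line in build.sbt might look like the following. The group and artifact names are jsoup’s published coordinates, but the version number is an assumption, so check for the current release before copying it.

```scala
// build.sbt
// The version below is an assumption; substitute the current jsoup release.
libraryDependencies += "org.jsoup" % "jsoup" % "1.15.3"
```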
Step 1: Get Links from a Webpage
First, let’s write a function called getLinks that scrapes links off a web page. It takes two parameters: url and selector.
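A minimal sketch of what getLinks could look like with jsoup, assuming Scala 2.13 or later (for scala.jdk.CollectionConverters) and assuming that selector is a CSS query such as a[href]. The extractLinks helper is hypothetical; it is only there so the selection logic can be tried on an HTML string without hitting the network.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object LinkScraper {
  // Hypothetical helper: parse an HTML string and return the href attribute
  // of every element matching the CSS selector.
  def extractLinks(html: String, selector: String): List[String] =
    Jsoup.parse(html).select(selector).asScala.toList.map(_.attr("href"))

  // Fetch the page over HTTP, then apply the same selection to the document.
  // Assumption: selector is a CSS query understood by jsoup's select.
  def getLinks(url: String, selector: String): List[String] =
    Jsoup.connect(url).get().select(selector).asScala.toList.map(_.attr("href"))
}
```

Calling getLinks with a page URL and a selector like "a[href]" returns the raw href values, which may still be relative at this point.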
Step 2: Clean our URLs
The World Health Organization website uses relative URLs in its hrefs, dropping the root URL. To handle this, we need logic that prepends the root URL to each of our links.
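One way to sketch that logic is below. The name cleanUrls and its parameters are my own assumptions, not necessarily what the original project used. Links that already start with http:// or https:// pass through untouched, and everything else gets the root prepended.

```scala
object UrlCleaner {
  // Prepend the root URL to relative links; leave absolute links unchanged.
  // Slashes are normalised so the root and the link are joined by exactly one "/".
  def cleanUrls(root: String, links: List[String]): List[String] =
    links.map {
      case link if link.startsWith("http://") || link.startsWith("https://") => link
      case link => root.stripSuffix("/") + "/" + link.stripPrefix("/")
    }
}
```

For example, cleanUrls("https://www.who.int", List("/docs/a.pdf")) gives back a list with the single full URL https://www.who.int/docs/a.pdf.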
Step 3: Download Files
Lastly, we’ll write our downloadFiles function, which downloads files and writes them to a specified path.
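Here is a rough sketch of downloadFiles using only the Java standard library, assuming Scala 2.13 or later for scala.util.Using. The signature, the fileNameOf helper, and the choice to return a Try per file are all assumptions on my part. Returning a Try means one failed download does not abort the rest.

```scala
import java.net.URI
import java.nio.file.{Files, Path, Paths, StandardCopyOption}
import scala.util.{Try, Using}

object Downloader {
  // Hypothetical helper: use the last path segment as the local file name,
  // dropping any query string first.
  def fileNameOf(url: String): String =
    url.takeWhile(_ != '?').split('/').last

  // Download each URL into targetDir (created if missing).
  // Each element of the result is Success(path) or Failure(error).
  def downloadFiles(urls: List[String], targetDir: String): List[Try[Path]] = {
    val dir = Files.createDirectories(Paths.get(targetDir))
    urls.map { url =>
      // Using closes the stream even if the copy fails.
      Using(new URI(url).toURL.openStream()) { in =>
        val target = dir.resolve(fileNameOf(url))
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
        target
      }
    }
  }
}
```

Existing files with the same name are overwritten, which keeps reruns simple but may not be what you want if you are archiving daily reports.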
Step 4: Let’s Put It All Together
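The wiring of the three steps can be sketched roughly as follows. To keep the block compilable on its own, the steps are passed in as functions; in the project they would be getLinks, the URL-cleaning logic, and downloadFiles. The names ReportPipeline and run are hypothetical, and keeping only links that contain ".pdf" is an assumption based on the reports being PDF files.

```scala
object ReportPipeline {
  // Scrape links, keep the PDF ones, make them absolute, then download them.
  // A is whatever the download step returns per file (for example a Try[Path]).
  def run[A](
      fetchLinks: (String, String) => List[String],
      cleanUrl: String => String,
      download: List[String] => List[A]
  )(pageUrl: String, selector: String): List[A] = {
    // Assumption: report links can be recognised by ".pdf" in the href.
    val pdfLinks = fetchLinks(pageUrl, selector).filter(_.contains(".pdf")).distinct
    download(pdfLinks.map(cleanUrl))
  }
}
```

Passing the steps in as functions also makes it easy to try the pipeline with stubs before pointing it at the real website.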
Step 5: Extract Data from PDF files using Python
I haven’t solved this problem with Scala, but I have previously written about how you can extract data from a PDF file using Python. You can refer to that article for a general idea of how to go about extracting the data.