Whirlpool: A microservice style scalable continuous topical web crawler

dc.contributor.advisor	Soltys, Dr. Michael
dc.contributor.author	Pereira, Rihan Stephen
dc.date.accessioned	2020-01-27T18:02:21Z
dc.date.available	2020-01-27T18:02:21Z
dc.date.issued	2019-12
dc.identifier.uri	http://hdl.handle.net/10211.3/214919
dc.description.abstract	Historically, web crawlers (also called bots or spiders) have been well known for indexing and ranking websites on the internet. This thesis augments the crawling activity but approaches the problem through the lens of a data engineer. Whirlpool, a continuous, topical web crawling tool, is also a data ingestion pipeline implemented from the bottom up using RabbitMQ, a high-performance messaging buffer, to organize the data flow within its network. It is based on the open, standard blueprint design of Mercator. This paper discusses the high- and low-level design of this complex program, covering auxiliary data structures, object-oriented design, scalability concerns, and deployment on AWS. The project name Whirlpool is an analogy to the naturally occurring phenomenon in which opposing water currents in the sea cause water to spin round and round, drawing various objects into it.	en_US
dc.format.extent	118	en_US
dc.language.iso	en_US	en_US
dc.publisher	California State University Channel Islands	en_US
dc.subject	Distributed systems	en_US
dc.subject	Message Queues	en_US
dc.subject	Deduplication	en_US
dc.subject	Amazon Web Services	en_US
dc.subject	Docker	en_US
dc.subject	Computer Science thesis	en_US
dc.title	Whirlpool: A microservice style scalable continuous topical web crawler	en_US
dc.type	Thesis	en_US
dc.contributor.committeeMember	Thoms, Dr. Brian
dc.contributor.committeeMember	Isaacs, Dr. Jason
dc.contributor.committeeMember	Ozturgut, Dr. Osman

Files in this item

Name: Pereira, Rihan MSCS Thesis Fall 19_OCRDone.pdf

Size: 19.71Mb

Format: PDF

This item appears in the following Collection(s)

Computer Science [55]
