Securly PageScan

Securly uses a machine-learning (AI) approach to build its URL categorization database. When Securly was founded 5 years ago, we started with a proprietary database built by crawling millions of websites, and we have added (and deprecated) thousands of sites every week since.

As a result, the technology combines a static database with a machine-learning (AI) component called PageScan, which continually looks at the sites our millions of students visit and categorizes them in real time so they can be added to the static database.

 

PageScan takes inputs from two sources:

  • Crowd-sourced URL scanning: When a Securly student goes to a website that is NOT in one of our deny lists, the website is let through the first time only. That website then goes through PageScan, which dynamically categorizes the site as belonging to one of the following categories – pornography, gambling, games, and anonymous proxies. Securly’s 5M+ enrolled Chromebooks continuously feed information into PageScan’s database and improve coverage for all Securly-filtered devices, including MacBooks, iPads and PCs.
  • Search-engine crawling: In addition to crowd-sourced scanning, we periodically crawl search engines for the top adult keywords kids could be searching for, and then update our databases with the top 50 sites from each keyword search in case we missed any.
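
The crowd-sourced flow described above (allow an unknown site once, categorize it in the background, block it afterwards if it falls into a filtered category) can be sketched roughly as follows. This is a minimal illustration only, not Securly's actual implementation: the class name `FirstVisitFilter`, its methods, and the `classify` callback are all hypothetical, and the real system's queueing and storage are not described in this article.

```python
from collections import deque

# The four categories named in this article.
CATEGORIES = {"pornography", "gambling", "games", "anonymous proxies"}


class FirstVisitFilter:
    """Hypothetical sketch: let unknown sites through once, then categorize them."""

    def __init__(self, deny_list, classify):
        self.deny_list = dict(deny_list)  # domain -> category
        self.seen = set()                 # domains already queued for scanning
        self.pending = deque()            # domains waiting to be categorized
        self.classify = classify          # assumed callback: domain -> category or None

    def handle_request(self, domain):
        if domain in self.deny_list:
            return "blocked"
        if domain not in self.seen:
            # First visit: allow the request, but queue the site for scanning.
            self.seen.add(domain)
            self.pending.append(domain)
        return "allowed"

    def process_pending(self):
        # Stands in for the background categorization step.
        while self.pending:
            domain = self.pending.popleft()
            category = self.classify(domain)
            if category in CATEGORIES:
                self.deny_list[domain] = category


# Usage with a toy classifier (an assumption made purely for the example).
toy_classify = lambda d: "games" if "game" in d else None
site_filter = FirstVisitFilter({"known-bad.example": "gambling"}, toy_classify)
first = site_filter.handle_request("newgame.example")   # let through the first time
site_filter.process_pending()
second = site_filter.handle_request("newgame.example")  # now on the deny list
```

Because the deny list is shared, a site categorized after one student's first visit is blocked for every device that consults the same list, which is the coverage effect the article describes.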

Internally, PageScan is split into 3 components:

  • TextScan – The actual page crawling and scanning technology that visits websites and uses statistical models to determine whether the text in the site’s titles, meta information, and pages describes an adult site.
  • ImageScan – The second level of scanning, used when TextScan is unable to decide on a Good or Bad score, for example when a site has no text to scan or is in a language TextScan can’t handle today. Multiple images are downloaded from the site and scanned using proprietary algorithms that detect pornography in these images. This technology is also capable of detecting other categories like violence, drugs, etc., but we haven’t productized that capability yet. ImageScan can only mark a site as Bad; if it can’t, we rely on the third level of scanning.
  • 3rd-Party-Scan – Best-in-class 3rd-party URL categorization engines, available through paid subscriptions to companies focused exclusively on URL categorization. This stage handles sites that TextScan and ImageScan were unable to classify decisively. These subscriptions are too expensive to be our first line of defense against new sites, so they come into the picture only when Securly is not able to decisively catch adult sites algorithmically first.
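
The three levels above form a fallback chain, which can be sketched as below. This is an illustrative sketch under stated assumptions, not PageScan's actual code: the `Verdict` enum, the `page_scan` function, and the three stub scanners are hypothetical names, and each real stage is a proprietary or third-party system whose interface is not described here. The sketch assumes a scanner returns `None` when it cannot decide, and that the third-party stage always returns a verdict.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Verdict(Enum):
    GOOD = "good"
    BAD = "bad"


# Assumed interface: a scanner returns a Verdict, or None if it cannot decide.
Scanner = Callable[[str], Optional[Verdict]]


def page_scan(url: str, text_scan: Scanner, image_scan: Scanner,
              third_party_scan: Scanner) -> Tuple[Optional[Verdict], str]:
    """Return (verdict, name of the stage that produced it)."""
    verdict = text_scan(url)
    if verdict is not None:
        # TextScan reached a Good or Bad score, so no further scanning is needed.
        return verdict, "TextScan"
    if image_scan(url) is Verdict.BAD:
        # ImageScan can only mark a site as Bad.
        return Verdict.BAD, "ImageScan"
    # Neither earlier stage was decisive: fall back to the paid engines.
    return third_party_scan(url), "3rd-Party-Scan"


# Stub scanners with made-up domains, purely for illustration.
def text_stub(url):
    return Verdict.GOOD if url == "news.example" else None

def image_stub(url):
    return Verdict.BAD if url == "pics.example" else None

def third_party_stub(url):
    return Verdict.BAD if url == "proxy.example" else Verdict.GOOD

results = {u: page_scan(u, text_stub, image_stub, third_party_stub)
           for u in ["news.example", "pics.example", "proxy.example"]}
```

Ordering the stages from cheapest to most expensive means the paid engines are consulted only for the sites the in-house scanners could not settle, which matches the cost reasoning given for the third level.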

Here’s the approximate sequence of events inside PageScan in the form of a simple flowchart. Almost all web-filtering companies would have something comparable in place.

[Figure: PageScan flowchart]
