How are Python modules used for web crawling?

April 7, 2020
Last updated: February 3, 2024

Imagine that search engines, like Google, never existed. How would you find what you need across more than 4.2 billion web pages? Web crawlers are programs written to browse the internet, gather information, and index and parse the collected data, to facilitate quick searches. Crawlers are thus a smart solution to large data sets and a catalyst for major advancements in the field of cyber security.

In this article, we will learn:

  1. What is crawling?
  2. Applications of crawling
  3. Python modules used for crawling
  4. Use-case: Fetching downloadable URLs from YouTube using crawlers
  5. How do CloudSEK Crawlers work?

[Image: web crawler]

What is crawling?

Crawling refers to the process of scraping or extracting data from websites using web crawlers. For instance, Google uses spider bots (crawlers) to read the content of billions of web pages and posts. It then gathers data from these sites and arranges it in the Google Search index.

Basic stages of crawling (a minimal sketch follows this list):
  1. Scrape data from the source
  2. Parse the collected data
  3. Clean the data of any noise or duplicate entries
  4. Structure the data as per requirement
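
As a rough illustration of these four stages, here is a minimal sketch using the requests and BeautifulSoup libraries (both introduced below); the target URL and the h2 selector are placeholders, not part of the original article:

# crawl_stages_sketch.py - illustrative only; URL and selector are placeholders
import requests
from bs4 import BeautifulSoup

def crawl(url):
    # 1. Scrape data from the source
    html = requests.get(url, timeout=10).text
    # 2. Parse the collected data
    soup = BeautifulSoup(html, 'html.parser')
    headings = [tag.get_text(strip=True) for tag in soup.select('h2')]
    # 3. Clean the data of noise and duplicate entries
    unique_headings = sorted(set(h for h in headings if h))
    # 4. Structure the data as per requirement
    return [{'source': url, 'heading': h} for h in unique_headings]

print(crawl('https://example.com'))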

 

Applications of crawling

Organizations crawl and scrape data from web pages for various reasons that may benefit them or their customers. Here are some lesser-known applications of crawling:

  • Comparing data for market analysis
  • Monitoring data leaks
  • Preparing data sets for Machine Learning algorithms
  • Fact-checking information on social media

 

Python modules used for crawling

  • Requests – Allows you to send HTTP requests to web pages
  • BeautifulSoup – A Python library that extracts data from HTML and XML files and parses their elements into the required format
  • Selenium – An open-source testing suite for web applications. It can also drive a browser to perform actions and retrieve data (a minimal headless example follows this list).
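
As a hedged illustration of the Selenium workflow (assuming Chrome and a matching chromedriver are installed; the URL is a placeholder):

# selenium_sketch.py - assumes Chrome and chromedriver are installed and on PATH
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')           # run without opening a browser window
driver = webdriver.Chrome(options=options)
driver.get('https://example.com')            # placeholder URL
html = driver.page_source                    # HTML after JavaScript has executed
driver.quit()
print(len(html), 'characters fetched')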

 

Use-case: Fetching downloadable URLs from YouTube using crawlers

A single YouTube video may have several downloadable URLs, depending on its content, resolution, bitrate, range, and VR/3D support. Here is sample API and CLI code to get the downloadable URLs of a YouTube video along with their itags:

 

Project structure 

youtube
|
|---- app.py
|---- cli.py
`---- core.py

The project will contain three files:

app.py: For the API interface, using the Flask micro framework

cli.py: For the command-line interface, using the argparse module

core.py: Contains all the core (common) functionality, which acts as helper functions for app.py and cli.py.

# youtube/app.py
import flask
from flask import jsonify, request

import core

app = flask.Flask(__name__)
app.config["DEBUG"] = True


@app.route('/', methods=['GET'])
def get_downloadable_urls():
    # Expect the video URL as a query parameter: /?url=<youtube-url>
    if 'url' not in request.args:
        return "Error: No url field provided. Please specify a YouTube URL."
    url = request.args['url']
    urls = core.get_downloadable_urls(url)
    return jsonify(urls)


app.run()

The Flask interface code to get downloadable URLs through the API.

Request url - localhost:<port>/?url=https://www.youtube.com/watch?v=FIVPlraNgXs

# youtube/cli.py
import argparse

import core

# Parse the YouTube URL from the command line
my_parser = argparse.ArgumentParser(description='Get youtube downloadable video from url')
my_parser.add_argument('-u', '--url', metavar='', required=True, help='youtube url')
args = my_parser.parse_args()

urls = core.get_downloadable_urls(args.url)
print(f'Got {len(urls)} urls\n')
for index, url in enumerate(urls, start=1):
    print(f'{index}. {url}\n')

Code snippet to get downloadable URLs through the command-line interface (using argparse to parse command-line arguments).

Command line interface - python cli.py -u 'https://www.youtube.com/watch?v=aWPYw7iVBg0'

# youtube/core.py
import json
import re

import requests


def get_downloadable_urls(url):
    # Fetch the raw page source of the YouTube watch page
    html = requests.get(url).text
    # Extract the embedded ytplayer.config JSON object
    RE = re.compile(r'ytplayer[.]config\s*=\s*(\{.*?\});')
    conf = json.loads(RE.search(html).group(1))
    player_response = json.loads(conf['args']['player_response'])
    # streamingData lists the downloadable formats
    data = player_response['streamingData']
    return [{'itag': frmt['itag'], 'url': frmt['url']} for frmt in data['adaptiveFormats']]

This is the core (common) function used by both the API and the CLI interfaces.

The execution of these commands will:

  1. Take a YouTube URL as an argument
  2. Gather the page source using the Requests module
  3. Parse it and extract the streaming data
  4. Return response objects containing each url and itag

How to use these URLs?

  • Build your own YouTube downloader (web app)
  • Build an API to download YouTube videos (a minimal download sketch follows this list)
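
As a hedged sketch of the first option, the snippet below reuses core.py to stream one of the returned URLs to disk; the chosen itag (251, the audio/webm format in the sample result below) and the output filename are assumptions:

# download_sketch.py - illustrative only; itag 251 and the filename are assumptions
import requests

import core

urls = core.get_downloadable_urls('https://www.youtube.com/watch?v=aWPYw7iVBg0')
stream = next(u for u in urls if u['itag'] == 251)   # pick a format by its itag
with requests.get(stream['url'], stream=True) as resp, open('audio.webm', 'wb') as out:
    for chunk in resp.iter_content(chunk_size=8192):
        out.write(chunk)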

Sample result 

[{
    'itag': 251,
    'url': 'https://r2---sn-gwpa-h55k.googlevideo.com/videoplayback?expire=1585225812&ei=9Et8Xs6XNoHK4-EPjfyIiA8&ip=157.46.68.124&id=o-AGeDi3DVtAbmT5GiuGsDU7-NPLk23fOXNnY16gGQcHWu&itag=251&source=youtube&requiressl=yes&mh=Av&mm=31%2C26&mn=sn-gwpa-h55k%2Csn-cvh76ned&ms=au%2Conr&mv=m&mvi=1&pl=18&initcwndbps=112500&vprv=1&mime=audio%2Fwebm&gir=yes&clen=14933951&dur=986.761&lmt=1576518368612802&mt=1585204109&fvip=2&keepalive=yes&fexp=23882514&c=WEB&txp=5531432&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cvprv%2Cmime%2Cgir%2Cclen%2Cdur%2Clmt&sig=ADKhkGMwRAIgK4L4VVHAlWMPVPEcmdkhnb2u8UM6eYhFz16kGruxZjUCIFXZJM9ejVK7OZJFqx7YwBqa3CrDvVakuU86vcIyMv-a&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=ABSNjpQwRAIgKBhJytjv73-c7eMWbVkb-X8_rNb7_xApZvaPfw7wGcMCIHqJ405fQ3Kr-e_5fV8gokMUNi0rrrLG8T85sLGTQ17W'
}]

What is an itag?

An itag gives us more details about the video, such as the type of content, resolution, bitrate, range, and VR/3D. A comprehensive list of YouTube format code itags can be found online.

How do CloudSEK Crawlers work?

 

[Image: CloudSEK crawlers workflow]

 

CloudSEK’s digital risk monitoring platform, XVigil, scours the internet, across the surface web, deep web, and dark web, to automatically detect threats and alert customers. After configuring a list of keywords suggested by the clients, CloudSEK Crawlers (an illustrative sketch of this pipeline pattern follows the list below):

  1. Fetch data from various sources on the internet
  2. Push the gathered data to a centralized queue
  3. ML classifiers sort the data into threats and non-threats
  4. Threats are immediately reported to clients as alerts, via XVigil. Non-threats are simply ignored.
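
Purely as an illustrative sketch of this kind of fetch, queue, and classify pipeline (this is not CloudSEK's actual implementation; the example sources and the is_threat classifier stand-in are hypothetical):

# pipeline_sketch.py - illustrative pattern only, not CloudSEK's implementation
import queue

crawl_queue = queue.Queue()  # stands in for a centralized queue

def is_threat(item):
    # Hypothetical stand-in for an ML classifier
    return 'paste' in item['source']

def crawl_sources(sources):
    # Steps 1-2: fetch data from each source and push it to the queue
    for source in sources:
        crawl_queue.put({'source': source, 'data': f'content fetched from {source}'})

def classify_and_alert():
    # Steps 3-4: classify queued items; report threats, ignore non-threats
    while not crawl_queue.empty():
        item = crawl_queue.get()
        if is_threat(item):
            print('ALERT:', item['source'])

crawl_sources(['https://example-forum.com', 'https://example-paste-site.com'])
classify_and_alert()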
