August 12, 2023

Web Scraping and Security

Why?

I am obsessed with searching and browsing the Internet these days. However, browsing everything myself gets tiresome, so I decided to try scraping the information instead.

Scraping Videos

When scraping a website, the first thing to do is inspect the page and find the HTML around the information you care about. Open the developer tools and search for the relevant elements.

Devtools is my closest friend when browsing the Internet and will always be the best tool.

For example, a YouTube video element has a src field like

<video tabindex="-1" class="video-stream html5-main-video" controlslist="nodownload" style="width: 725px; height: 544px; left: 0px; top: 0px;" src="blob..."></video>

The blob URL obfuscates the source, which prevents downloading the video directly. However, the source is often delivered through an m3u8 file. An m3u8 file is usually a playlist listing many video segments, which are combined together during playback.
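To make the playlist idea concrete, here is a minimal sketch of pulling the segment URLs out of an m3u8 file. The sample playlist, the example.com URL, and the `list_segments` helper are made up for illustration; real playlists can also point at other playlists (one per quality level), which this sketch does not handle.

```python
from urllib.parse import urljoin


def list_segments(playlist_text, base_url):
    """Return absolute URLs for every entry listed in an m3u8 playlist."""
    segments = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments; everything else is a URI
        if line and not line.startswith("#"):
            # Relative entries are resolved against the playlist's own URL
            segments.append(urljoin(base_url, line))
    return segments


# Hypothetical playlist with two segments
SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
segment0.ts
#EXTINF:10.0,
segment1.ts
#EXT-X-ENDLIST
"""

segments = list_segments(SAMPLE_PLAYLIST, "https://example.com/video/index.m3u8")
for url in segments:
    print(url)
```

Each printed URL is one piece of the video; a downloader fetches them in order and joins them.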

The m3u8 file shows up in the Network tab, so go there, reload the page, and look for the m3u8 request. yt-dlp can download not only from YouTube but from nearly all websites (including Google Drive files), as well as direct mp4 or m3u8 URLs. Or you can just play the m3u8 directly in mpv if you are online.

Bilibili appears to use m4s segments served through Akamai, with URLs like https://upos-hz-mirrorakam.akamaized.net/.... Most websites are extremely easy to download from. YouTube is more complicated: sometimes you need a clean IP to download YouTube videos with yt-dlp, because YouTube blocks data center VPN exit nodes. Just use a free Colab or Kaggle notebook with yt-dlp (then perhaps save the result to Google Drive), because they run in Google Cloud.

Network Sniffing with Selenium

I can use Selenium to automatically sniff the video files.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import json
import time

# Enable Chrome's performance log so network events get recorded
options = Options()
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})

driver = webdriver.Chrome(options=options)
driver.get("https://www.instagram.com/p/CvKIrhaI_MO/")
driver.implicitly_wait(10)
time.sleep(5)  # give the page time to request its media
logs = driver.get_log("performance")

m3u8_urls = []
for entry in logs:
    # Each entry wraps a DevTools event as a JSON string
    log_data = json.loads(entry["message"])["message"]
    if log_data["method"] == "Network.requestWillBeSent":
        urlfind = log_data["params"]["request"]["url"]
        # print(urlfind)
        if urlfind.endswith(".m3u8") or (".mp4" in urlfind):
            m3u8_urls.append(urlfind)
for url in m3u8_urls:
    print(f"Extracted URL: {url}")
driver.quit()
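The filtering step can be pulled into a small function so it can be tried without launching a browser. This is only a sketch: `extract_media_urls` and `fake_entry` are hypothetical helpers, and the sample entries mimic the shape of Chrome performance-log entries with example.com URLs rather than coming from a real session.

```python
import json


def extract_media_urls(entries):
    """Return request URLs that look like m3u8 playlists or mp4 files."""
    urls = []
    for entry in entries:
        message = json.loads(entry["message"])["message"]
        if message.get("method") != "Network.requestWillBeSent":
            continue
        url = message["params"]["request"]["url"]
        # Same test as the loop above: .m3u8 suffix, or .mp4 anywhere in the URL
        if url.endswith(".m3u8") or ".mp4" in url:
            urls.append(url)
    return urls


def fake_entry(method, url):
    """Build a log entry shaped like the ones driver.get_log('performance') returns."""
    event = {"method": method, "params": {"request": {"url": url}}}
    return {"message": json.dumps({"message": event})}


sample_entries = [
    fake_entry("Network.requestWillBeSent", "https://example.com/index.html"),
    fake_entry("Network.requestWillBeSent", "https://example.com/v/master.m3u8"),
    fake_entry("Network.requestWillBeSent", "https://example.com/v/clip.mp4?token=abc"),
    fake_entry("Network.responseReceived", "https://example.com/v/other.mp4"),
]

media_urls = extract_media_urls(sample_entries)
for url in media_urls:
    print(f"Extracted URL: {url}")
```

With a real driver, passing `driver.get_log("performance")` to the function should give the same result as the inline loop.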

Other Free Media

There is a GitHub repo, FreeMediaHeckYeah, plus a piracy subreddit and Rutracker, which cover nearly all the resources I need. Every movie behind a paywall can be found there, so I never use Netflix, Amazon Prime, or anything else. As for music, I just use an ad blocker and YouTube Music. I never paid for this stuff anyway.

Tor Crawling

I never really need Tor. It's mostly used for illegal stuff. But maybe it is useful if we need to access or crawl onion websites, since they are painfully slow to load by hand. For normal websites, using a proxy works the same way anyway.

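from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

options = Options()

# Route traffic through the local Tor SOCKS5 proxy (Tor's default port is 9050)
options.set_preference("network.proxy.type", 1)
options.set_preference("network.proxy.socks", "127.0.0.1")
options.set_preference("network.proxy.socks_port", 9050)
options.set_preference("network.proxy.socks_version", 5)
options.set_preference("network.proxy.socks_remote_dns", True)

browser = Firefox(options=options)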

We can then access onion sites:

browser.get("http://xmh57jrknzkhv6y3ls3ubitzfqnkrwxhopf5aygthi7d6rplyvk3noyd.onion/")

Brute-Forcing Passwords

Since I can change my IP, I could make many attempts to brute-force a password by sending many requests. But is it useful? What is the point of risking severe consequences for someone else's misfortune? Besides, passwords are getting stronger these days. Even a simple 10-digit numeric password has 10 billion possible values, so guessing it can take far more than 1 billion tries, very likely triggering some defense on the website.
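The arithmetic behind that claim is easy to check. The sketch below counts the possible passwords for a given alphabet and length, then estimates how long trying all of them would take. The rate of 100 attempts per second is purely an assumption for illustration, and `keyspace` and `worst_case_days` are made-up helper names.

```python
SECONDS_PER_DAY = 86_400


def keyspace(alphabet_size, length):
    """Number of distinct passwords of exactly this length."""
    return alphabet_size ** length


def worst_case_days(candidates, attempts_per_second):
    """Days needed to try every candidate at a constant rate."""
    return candidates / attempts_per_second / SECONDS_PER_DAY


digits_only = keyspace(10, 10)       # digits 0-9, length 10
lower_and_digits = keyspace(36, 10)  # a-z plus 0-9, length 10

print(digits_only)
# Assumed rate: 100 attempts per second against a login form
print(round(worst_case_days(digits_only, 100), 1))
print(lower_and_digits // digits_only)
```

Even at that assumed rate, the digits-only case takes years in the worst case, and adding lowercase letters multiplies the search space by several hundred thousand.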

Still, it's best if there can be a difficult captcha for each password attempt.
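As a rough sketch of that idea from the website's side, a login handler could track failed attempts per account and demand a captcha once a threshold is reached. `CaptchaGate` is a hypothetical helper, not part of any framework; setting `free_failures=0` would require a captcha on every attempt, as suggested above.

```python
class CaptchaGate:
    """Hypothetical helper: decide when a login attempt must solve a captcha."""

    def __init__(self, free_failures=3):
        self.free_failures = free_failures
        self.failures = {}  # account name -> consecutive failed attempts

    def needs_captcha(self, account):
        return self.failures.get(account, 0) >= self.free_failures

    def record(self, account, success):
        if success:
            # A successful login resets the counter
            self.failures.pop(account, None)
        else:
            self.failures[account] = self.failures.get(account, 0) + 1


gate = CaptchaGate(free_failures=3)
for _ in range(3):
    gate.record("alice", success=False)

print(gate.needs_captcha("alice"))  # True: three failures reached the threshold
print(gate.needs_captcha("bob"))    # False: no failed attempts recorded
```

Counting per account rather than per IP matters here, because the attacker in this scenario can rotate IPs.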

Form Problem

You can use Selenium to flood an anonymous form or "leave a message" section with random messages of your choosing. But this is quite futile for the attacker, serving perhaps only to intimidate.

Email Problem

An attacker can subscribe the victim's email address to mailing lists so that the victim's inbox is flooded with hundreds of emails. This way the attacker doesn't need to send emails personally and basically just uses various email services to bomb the victim. But again, there is no point or financial gain for the attacker.
