Why Choose Cloudflare R2?
The key idea is that Cloudflare R2 has low storage fees (about $15 per TB per month) and no egress fees. Egress fees are scary when you serve videos like this: AWS or GCP charge on the order of $100 per TB outbound. Cloudflare, however, charges nothing for egress.
See https://getdeploying.com/reference/data-egress
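To make the difference concrete, here is a back-of-the-envelope comparison in Python. All the prices are illustrative assumptions, not quotes, and the 1 TB/month of downloads is an invented workload:

```python
# Rough monthly cost for hosting a 400 GB video library that serves
# 1 TB of downloads per month. Assumed rates: ~$15/TB-month storage
# on R2 with $0 egress, vs ~$23/TB-month storage and ~$100/TB egress
# on an AWS/GCP-style provider.
def monthly_cost(storage_tb, egress_tb, storage_rate, egress_rate):
    """Total monthly bill: storage plus outbound transfer."""
    return storage_tb * storage_rate + egress_tb * egress_rate

r2 = monthly_cost(0.4, 1.0, storage_rate=15, egress_rate=0)
aws = monthly_cost(0.4, 1.0, storage_rate=23, egress_rate=100)
print(f"R2: ${r2:.2f}/mo, AWS-style: ${aws:.2f}/mo")
```

The egress term dominates as soon as the library actually gets watched.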
Some Interesting Commands
Note: these commands may stop working in the future as YouTube and yt-dlp change.
```sh
# Download videos matching the criteria
yt-dlp -j --flat-playlist https://www.youtube.com/@YandexforML |
  jq -r 'select(.view_count != null and .view_count > 50000 and .duration != null and .duration > 300) | "\(.view_count) https://youtu.be/\(.id)"' |
  sort -rn | cut -d' ' -f2- | xargs -I {} yt-dlp {}

# Same, but capped at 720p
yt-dlp -j --flat-playlist https://www.youtube.com/@YandexforML |
  jq -r 'select(.view_count != null and .view_count > 50000 and .duration != null and .duration > 300) | "\(.view_count) https://youtu.be/\(.id)"' |
  sort -rn | cut -d' ' -f2- |
  xargs -I {} yt-dlp -f 'bestvideo[height<=720][ext=webm]+bestaudio[ext=webm]/bestvideo[height<=720][ext=mp4]+bestaudio[ext=m4a]/best[height<=720][ext=webm]/best[height<=720][ext=mp4]/best[ext=webm]/best[ext=mp4]' {}

# List every video URL on a channel
yt-dlp -j --flat-playlist https://www.youtube.com/@vdud |
  jq -r '.id' |
  sed 's_^_https://youtu.be/_'

# Views over 100k and length more than 5 mins
yt-dlp -j --flat-playlist https://www.youtube.com/@vdud |
  jq -r 'select(.view_count != null and .view_count > 100000 and .duration != null and .duration > 300) | "\(.view_count) https://youtu.be/\(.id)"' |
  sort -rn |
  cut -d' ' -f2-

# Non-live videos only
yt-dlp -j --flat-playlist https://www.youtube.com/@yakutia24tv |
  jq -r 'select(.live_status == null) | .id' |
  sed 's_^_https://youtu.be/_'

# Sort by views
yt-dlp -j --flat-playlist https://www.youtube.com/@NadinStrelets |
  jq -r 'select(.view_count != null) | "\(.view_count) https://youtu.be/\(.id)"' |
  sort -rn |
  cut -d' ' -f2-

# Views over 50k and length more than 5 mins
yt-dlp -j --flat-playlist https://www.youtube.com/@NadinStrelets |
  jq -r 'select(.view_count != null and .view_count > 50000 and .duration != null and .duration > 300) | "\(.view_count) https://youtu.be/\(.id)"' |
  sort -rn |
  cut -d' ' -f2-
```
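If jq gets unwieldy, the same filter is easy to express in Python. A sketch, assuming yt-dlp's `-j --flat-playlist` output (one JSON object per line with `view_count`, `duration`, and `id` fields):

```python
import json

def filter_videos(jsonl, min_views=50_000, min_duration=300):
    """Replicate the jq filter: keep videos above the view and
    duration thresholds, sorted by views (descending)."""
    kept = []
    for line in jsonl.splitlines():
        v = json.loads(line)
        if (v.get("view_count") or 0) > min_views and (v.get("duration") or 0) > min_duration:
            kept.append((v["view_count"], f"https://youtu.be/{v['id']}"))
    return [url for _, url in sorted(kept, reverse=True)]

sample = '\n'.join([
    '{"id": "aaa", "view_count": 120000, "duration": 600}',
    '{"id": "bbb", "view_count": 10000, "duration": 600}',
    '{"id": "ccc", "view_count": 90000, "duration": 120}',
])
print(filter_videos(sample))  # only "aaa" clears both thresholds
```

You can then pipe the surviving URLs to yt-dlp as in the shell versions above.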
Hoarding Videos
So basically you need a clean IP that isn't banned by YouTube, like one from Vast.ai.
Hoarding went like this: I rent a Vast.ai instance and run yt-dlp
to download entire channels, such as Надежда Стрелец, Ляйсан Утяшева, Yandex ML, Bi-2, Karna.val, Gavrilina, HypeHouseRu, etc., as well as full music channels like Abel Korzeniowski, Einaudi, Max Richter, etc.
Then I run Whisper plus Helsinki-NLP to generate dual subtitles, use ffmpeg
to write them into the files, and upload everything to my S3. I self-hosted Alist as a frontend to the channels.
In total I hoarded about 400 GB.
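For the subtitle step, here is a sketch of how the ffmpeg call can be assembled: mux both subtitle tracks into an MKV without re-encoding. The file names are hypothetical, and muxing (rather than burning in) is my assumption about what "write" means here:

```python
import subprocess

def mux_dual_subs(video, sub_orig, sub_translated, out):
    """Build an ffmpeg command that muxes two subtitle tracks
    (original + translation) into an MKV without re-encoding."""
    cmd = [
        "ffmpeg", "-i", video, "-i", sub_orig, "-i", sub_translated,
        "-map", "0", "-map", "1", "-map", "2",   # keep all streams from all inputs
        "-c", "copy", "-c:s", "srt",             # copy audio/video, store subs as srt
        "-metadata:s:s:0", "language=rus",
        "-metadata:s:s:1", "language=eng",
        out,
    ]
    return cmd

cmd = mux_dual_subs("ep1.mp4", "ep1.ru.srt", "ep1.en.srt", "ep1.mkv")
# subprocess.run(cmd, check=True)  # uncomment to actually run ffmpeg
```

Run it per file after Whisper and the Helsinki-NLP translation have produced the two `.srt` files.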
Creating a Dictionary
I fed the Wiktionary Mixed Web list to an LLM to generate Russian-only explanations for each word and compiled the results into an 860-page PDF. Basically, iterate this prompt over the words with an API:
```
Что означает {}? Дайте объяснение только на очень простом русском языке. Дайте синонимы.
Дайте 2 простых примера предложений.

**Объяснение**: [простое объяснение слова без слова спереди].
**Синонимы**: [список синонимов].
**Простые предложения**: «[предложение 1]» «[предложение 2]».

Используйте этот формат, чтобы ответ всегда возвращался в заданной структуре.
```

(In English: "What does {} mean? Explain only in very simple Russian. Give synonyms. Give 2 simple example sentences," followed by a fixed output template so the answer always comes back in the same structure.)
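The loop itself is trivial. A sketch, where `ask_llm` is a stand-in for whatever chat-completion API you use (it is a placeholder, not a real client):

```python
# The prompt template from above; {} is filled with each word.
PROMPT = ("Что означает {}? Дайте объяснение только на очень простом "
          "русском языке. Дайте синонимы. Дайте 2 простых примера предложений.")

def build_prompt(word):
    return PROMPT.format(word)

def make_dictionary(words, ask_llm):
    """One API call per word; returns word -> generated entry."""
    return {w: ask_llm(build_prompt(w)) for w in words}

# Dummy backend for illustration:
entries = make_dictionary(["дом", "идти"], lambda p: "(ответ модели)")
```

The resulting entries can then be concatenated and typeset into the PDF.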
There are some other word lists too, like Serge Sharoff's frequency lists.
Smotrim
Format: m3u8
Language: ru
- There are different media types, including podcasts, videos, and brand (I think that means movies and TV shows)
- Most videos follow the format https://smotrim.ru/video/NUMBER
- Lengths vary, but most TV show episodes are 40-60 minutes
- Entering https://smotrim.ru/video/NUMBER sometimes returns a video and sometimes a 404. The numbers range from 1 to at least 2848324
- Many videos are protected and show "Произошла сетевая ошибка" ("a network error occurred") in the frontend without a Russian IP, but the m3u8 is still fetchable, and you can play it easily with `mpv` (backed by `yt-dlp`) straight from the URL, for example: `mpv https://smotrim.ru/video/2228159`. Some movies are locked behind a paywall (but videos seem to be free as long as your IP is in Russia)
Example URLs:
https://smotrim.ru/video/2820848
https://smotrim.ru/video/2847857
The first season of the movie Ekaterina (2014):
https://smotrim.ru/video/1146653
https://smotrim.ru/video/1146654
https://smotrim.ru/video/1146655
https://smotrim.ru/video/1146817
Then it went to
https://smotrim.ru/video/2228158
https://smotrim.ru/video/2228159
...
https://smotrim.ru/video/2228163
So the IDs of a single series are not contiguous, which makes raw sequential scraping harder.
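Since valid IDs are scattered, scraping means probing. A sketch; the HEAD probe is an assumption (fall back to GET if the server doesn't honor it), and you'd want a Russian IP plus rate limiting:

```python
import urllib.error
import urllib.request

def probe(video_id, timeout=10):
    """Return True if https://smotrim.ru/video/<id> resolves, False on 404."""
    url = f"https://smotrim.ru/video/{video_id}"
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False

def candidate_urls(start, stop):
    """Generate the candidate URLs for a sequential sweep."""
    return [f"https://smotrim.ru/video/{i}" for i in range(start, stop)]
```

Sweep a range with `candidate_urls`, keep the IDs where `probe` succeeds, and hand those URLs to yt-dlp.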
Douban Discussion
Douban rate-limits by IP (it demands login after a while) if you hit it too hard, so try cycling through many different IPs and using multiple threads.
https://github.com/jimchen2/archived-scripts/tree/master/douban-scraping-main
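One way to do the IP cycling, as a sketch: rotate a proxy pool across a thread pool. The proxy URLs are placeholders, and `fetch` is whatever HTTP call you actually use (requests with `proxies=`, one Selenium driver per proxy, ...):

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, proxies, fetch, workers=8):
    """Fetch many pages in parallel, rotating through a proxy pool so
    no single IP hits Douban's login wall. fetch(url, proxy) does the
    actual HTTP work."""
    pool = itertools.cycle(proxies)
    with ThreadPoolExecutor(max_workers=workers) as ex:
        # Proxies are assigned round-robin at submit time.
        futures = [ex.submit(fetch, url, next(pool)) for url in urls]
        return [f.result() for f in futures]

# Dummy usage: record which proxy each URL was assigned.
urls = [f"https://www.douban.com/group/topic/{i}/" for i in range(4)]
out = fetch_all(urls, ["http://p1:8080", "http://p2:8080"], lambda u, p: (u, p))
```

Results come back in input order, so downstream parsing stays simple.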
Discussion Table:
Douban groups have a main discussion table.
Iterate through the pages:

```python
for start in range(0, 175, 25):
    # Replace [] with the group id
    url = f"https://www.douban.com/group/[]/discussion?start={start}"
    driver.get(url)
    discussions_table = driver.find_element(By.CSS_SELECTOR, "table.olt")
```
After that:

```python
topic_id_pattern = re.compile(r'/topic/(\d+)/')
rows = soup.find_all('tr')
for row in rows:
    title_cell = row.find('td', class_='title')
    title = title_cell.find('a').get_text(strip=True)
    link = title_cell.find('a')['href']
    # Extract the topic ID from the URL using the regular expression
    topic_id_match = topic_id_pattern.search(link)
    topic_id = topic_id_match.group(1) if topic_id_match else 'Unknown'
    author = row.find('td', nowrap='nowrap').find('a').get_text(strip=True)
    replies = row.find('td', class_='r-count').get_text(strip=True)
    last_post_time = row.find('td', class_='time').get_text(strip=True)
```
For each discussion there is a topic page (and beyond the main page there can be many more comment pages):

```python
topic_content = WebDriverWait(driver, 100).until(
    EC.presence_of_element_located((By.CLASS_NAME, "topic-content"))
)
```
Then you can try to extract the user who posted the topic and the post time:

```python
user_link = soup.find('a', href=True)
user_href = user_link['href'] if user_link else 'No link found'
user_img_src = soup.find('img', class_='pil')['src'] if soup.find('img', class_='pil') else 'No image found'
user_img_alt = soup.find('img', class_='pil')['alt'] if soup.find('img', class_='pil') else 'No alt text found'
```
Comments/Replies
Use the total reply count to generate the paginated URLs:

```python
if replies >= 100:
    pages = replies // 100
    for i in range(1, pages + 1):
        links.append(f"{doc['Link']}?start={i * 100}")
```
Then extract the replies:

```python
topic_replies = WebDriverWait(driver, 300).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "topic-reply"))
)
comments_data = []
# Iterate through each comment-item within topic-reply
for reply in topic_replies:
    comment_items = reply.find_elements(By.CLASS_NAME, "comment-item")
    for item in comment_items:
        comments_data.append(item.get_attribute('innerHTML'))
```
From that HTML you can then extract the reply content, timestamp, avatar image, and the commenter's Douban homepage.
You can choose whether or not to "localize" and scrape the media (mainly images and GIFs) from Douban; it isn't very large (I scraped 170k comments and there were only 4 GB of media).
1TV News
Format: m3u8, extremely easy to crawl with yt-dlp
Language: ru
- There are multiple streams a day, with the 21:00 issue generally being the longest.
- They follow the format https://www.1tv.ru/news/issue/YYYY-MM-DD/HH:00#N
- Each is 30 minutes to 3 hours
- The format dates back to at least 2010, but older dates need some tweaking: the site doesn't concatenate the segments by default, so you have to fetch videos #1 through #13 or #17, and so on. Since 2019 you can get the full-length video by entering https://www.1tv.ru/news/issue/YYYY-MM-DD and waiting for a redirect
- Some Olympic content is geo-protected
Example URLs:
https://www.1tv.ru/news/issue/2024-07-11/21:00
https://www.1tv.ru/news/issue/2024-08-09/21:00
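Generating candidate URLs for a date range is straightforward; you can then feed them to yt-dlp. Defaulting to 21:00 is just my assumption that you want the longest issue:

```python
from datetime import date, timedelta

def issue_urls(start, end, hour="21:00"):
    """Candidate 1TV news-issue URLs, one per day, inclusive of both
    endpoints (the 21:00 issue is usually the longest)."""
    urls = []
    d = start
    while d <= end:
        urls.append(f"https://www.1tv.ru/news/issue/{d.isoformat()}/{hour}")
        d += timedelta(days=1)
    return urls

urls = issue_urls(date(2024, 7, 11), date(2024, 7, 13))
```

Pipe the list to `xargs -I {} yt-dlp {}` as with the YouTube commands earlier.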
VK
Get a free VK API key from VK for Business: register an app, then go to Access tokens > Service token.
Example: https://vk.com/tutberidze.eteri
Refer to: https://dev.vk.com/en/method/users.get
API url example: https://api.vk.com/method/users.get?user_ids=tutberidze.eteri&fields=photo_100,bdate,city,status&access_token=SECRET_KEY&v=5.131
You can change the fields parameter to request other information.
API example response:
```json
{
  "response": [
    {
      "id": 643779055,
      "bdate": "24.2.1974",
      "city": {
        "id": 1,
        "title": "Москва"
      },
      "status": "Coaching for life 🎗Заслуженный тренер России по фигурному катанию на коньках 🎫Мастер Спорта СССР",
      ...
    }
  ]
}
```
For VK groups: https://vk.com/tsiskaridzenikolayofficial
Refer to: https://dev.vk.com/en/method/groups.getById
API url example: https://api.vk.com/method/groups.getById?group_ids=tsiskaridzenikolayofficial&fields=description,city,photo_100&access_token=ACCESS_TOKEN&v=5.131
API example response:
```json
{
  "response": [
    {
      "id": 211218365,
      "city": {
        "id": 1,
        "title": "Москва"
      },
      "description": "Официальная страница\nНиколая Цискаридзе\n\nРегистрация в перечне РКН: https://gosuslugi.ru/snet/67927d41ee896061c9ca7145\n\nИСКУССТВО и ЖИЗНЬ\n\nПоказывать можно только зрячим. \nПеть песню — только тем, кто слышит. \nДари себя тому, кто будет благодарен, \nКто понимает, любит и ценит\nОмар Хайям\n\nПо всем вопросам tsiskaridzenikolay@gmail.com",
      "name": "Николай Цискаридзе",
      ...
    }
  ]
}
```
Posts: https://dev.vk.com/en/method/wall.get
API Example: https://api.vk.com/method/wall.get?domain=tutberidze.eteri&count=20&access_token=ACCESS_TOKEN&v=5.131 (note: owner_id must be numeric; screen names go in the domain parameter)
API example response:
```json
{
  "response": {
    "count": 74,
    "items": [
      {
        "inner_type": "wall_wallpost",
        ...
      }
    ]
  }
}
```
Each Item:
```json
{
  ...
  "attachments": [
    ...
  ],
  "date": 1741546333,
  "edited": 1741546346,
  ...
  "text": "Приглашаю вас на ледовое шоу #TeamTutberidze! \n\nВас ждут новые истории и легендарные программы в исполнении одних из лучших фигуристов в мире.\n\nЖду встречи с каждым из вас в 11 городах!\n\nСсылка на билеты в шапке профиля.",
  "views": {
    "count": 1601
  }
}
```
Photo Attachments:
```json
{
  "type": "photo",
  "photo": {
    ...
    "sizes": [
      {
        "height": 75,
        "type": "s",
        "width": 75,
        "url": "https://sun6-21.userapi.com/s/v1/ig2/GYs-E55vvY2132z9DgSoTvPO0Sk0Iely8Tp5RSZZB_wEMwfsZCWaM_pJUHuZm4W0l_NaaRQGT-e5m1Y3dOfrxqMx.jpg?quality=95&as=32x32,48x48,72x72,108x108,160x160,240x240,360x360,480x480,540x540,640x640,720x720,1080x1080&from=bu&cs=75x75"
      },
      ...
    ],
    "orig_photo": {
      ...
```
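All three methods follow the same URL shape, so a small helper covers them. A sketch: `http_get` is whatever HTTP layer you use, and the token value is a placeholder:

```python
import json
import urllib.parse

API = "https://api.vk.com/method"

def vk_url(method, token, **params):
    """Build a VK API request URL (API version 5.131)."""
    params |= {"access_token": token, "v": "5.131"}
    return f"{API}/{method}?{urllib.parse.urlencode(params)}"

def vk_call(method, token, http_get, **params):
    """Call a VK method; http_get(url) -> str does the HTTP work.
    VK returns errors in-band, so check for them explicitly."""
    data = json.loads(http_get(vk_url(method, token, **params)))
    if "error" in data:
        raise RuntimeError(data["error"].get("error_msg", "VK API error"))
    return data["response"]

# wall.get for a screen name goes through `domain`, since owner_id
# must be numeric.
url = vk_url("wall.get", "TOKEN", domain="tutberidze.eteri", count=20)
```

Injecting `http_get` keeps the helper testable offline and lets you swap in a proxy-aware client later.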
Dzen
I didn't find any APIs for Dzen. It now belongs to VK (formerly Mail.ru Group, I think); previously it was Yandex Zen.
Example: https://dzen.ru/a/Z9khBocwVgufmUKF
Set up Selenium, and you need a Russian IP, or you'll be blocked by a captcha.
- Title: the og:title meta tag
- Date: find the meta tag with datePublished, something like `<meta itemprop="datePublished" content="2024-01-15">`
- Text HTML: find the tag `<div ... aria-label="Статья 1" ...>`, and inside it find `<div ... itemprop="articleBody" ...>`; grab it and re-process the image blocks, i.e. tags like `<div ... data-block-type="image" ...>`
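These selectors translate directly to BeautifulSoup. A sketch under the assumption that the markup looks as described above (real Dzen pages may differ):

```python
from bs4 import BeautifulSoup

def parse_article(html):
    """Pull title, publish date, and article body HTML from a Dzen
    article page using the meta tags and itemprop attributes."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("meta", property="og:title")
    date = soup.find("meta", itemprop="datePublished")
    body = soup.find("div", itemprop="articleBody")
    return {
        "title": title["content"] if title else None,
        "date": date["content"] if date else None,
        "body_html": str(body) if body else None,
    }

# Minimal fabricated page for illustration:
sample = '''<html><head>
<meta property="og:title" content="Заголовок">
<meta itemprop="datePublished" content="2024-01-15">
</head><body><div itemprop="articleBody"><p>Текст</p></div></body></html>'''
info = parse_article(sample)
```

Feed it the page source from Selenium's `driver.page_source` after the article loads.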
Accounts:
Example: https://dzen.ru/tourister?tab=articles
- Grab og:title, og:image, og:description
- Look for divs with id="zen-row-xxx", like id="zen-row-17" or id="zen-row-22"