September 23, 2024 · 1365 words

Web Hoarding Attempts

Why Choose Cloudflare R2?

The key idea is that Cloudflare R2 has low storage fees ($10 per TB per month) and no egress fees. Egress fees are scary when you consume videos like this: AWS or GCP charge around $100 per TB outbound, while Cloudflare charges nothing for egress.

See https://getdeploying.com/reference/data-egress
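To make the difference concrete, a rough back-of-the-envelope calculation using the prices quoted above (the rates are approximate; storage price is held constant to isolate the egress difference):

```python
# Rough monthly cost of hoarding video, using the per-TB figures quoted above.
STORAGE_PER_TB = 10.0      # USD per TB-month (R2's storage price)
AWS_EGRESS_PER_TB = 100.0  # rough AWS/GCP egress price, USD per TB
R2_EGRESS_PER_TB = 0.0     # R2 charges nothing for egress

def monthly_cost(stored_tb, egress_tb, egress_rate):
    """Storage cost plus egress cost for one month."""
    return stored_tb * STORAGE_PER_TB + egress_tb * egress_rate

# 0.4 TB hoarded and streamed through once a month:
print(monthly_cost(0.4, 0.4, R2_EGRESS_PER_TB))   # R2: storage only
print(monthly_cost(0.4, 0.4, AWS_EGRESS_PER_TB))  # with AWS-style egress
```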

Some Interesting Commands

Note: These commands may not work in the future

# Download videos matching the criteria
yt-dlp -j --flat-playlist https://www.youtube.com/@YandexforML |
    jq -r 'select(.view_count != null and .view_count > 50000 and .duration != null and .duration > 300) | "\(.view_count) https://youtu.be/\(.id)"' |
    sort -rn | cut -d' ' -f2- | xargs -I {} yt-dlp {}

# 720p
yt-dlp -j --flat-playlist https://www.youtube.com/@YandexforML |
    jq -r 'select(.view_count != null and .view_count > 50000 and .duration != null and .duration > 300) | "\(.view_count) https://youtu.be/\(.id)"' |
    sort -rn | cut -d' ' -f2- | xargs -I {} yt-dlp -f 'bestvideo[height<=720][ext=webm]+bestaudio[ext=webm]/bestvideo[height<=720][ext=mp4]+bestaudio[ext=m4a]/best[height<=720][ext=webm]/best[height<=720][ext=mp4]/best[ext=webm]/best[ext=mp4]' {}

yt-dlp -j --flat-playlist https://www.youtube.com/@vdud | jq -r '.id' | sed 's_^_https://youtu.be/_'

# views over 100k and length more than 5 mins
yt-dlp -j --flat-playlist https://www.youtube.com/@vdud |
jq -r 'select(.view_count != null and .view_count > 100000 and .duration != null and .duration > 300) | "\(.view_count) https://youtu.be/\(.id)"' |
sort -rn |
cut -d' ' -f2-

# non-live
yt-dlp -j --flat-playlist https://www.youtube.com/@yakutia24tv |
jq 'select(.live_status == null) | .id' -r |
sed 's_^_https://youtu.be/_'


# sort by views
yt-dlp -j --flat-playlist https://www.youtube.com/@NadinStrelets |
    jq -r 'select(.view_count != null) | "\(.view_count) https://youtu.be/\(.id)"' |
    sort -rn |
    cut -d' ' -f2-


# views over 50k and length more than 5 mins
yt-dlp -j --flat-playlist https://www.youtube.com/@NadinStrelets |
jq -r 'select(.view_count != null and .view_count > 50000 and .duration != null and .duration > 300) | "\(.view_count) https://youtu.be/\(.id)"' |
sort -rn |
cut -d' ' -f2-

Hoarding Videos

So basically you need a clean IP, like on Vast.ai, that isn't banned by YouTube.

Hoarding went like this: I go on Vast.ai and run yt-dlp to download entire channels, such as Надежда Стрелец, Ляйсан Утяшева, Yandex ML, Bi-2, Karna.val, Gavrilina, HypeHouseRu, etc., as well as full music channels like Abel Korzeniowski, Einaudi, Max Richter, etc.

Then I run Whisper plus Helsinki-NLP translation models to generate dual-language subtitles, write them into the video with ffmpeg, and put the result on my S3 storage. I self-hosted Alist as a frontend to the channels.
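The ffmpeg step can be sketched like this, assuming Whisper has already produced a merged dual-language .srt file (the file names are made up; the `subtitles` filter is real ffmpeg):

```python
def burn_subtitles_cmd(video, srt, out):
    """Build an ffmpeg command that hard-codes the .srt into the video frames."""
    return [
        "ffmpeg", "-i", video,
        "-vf", f"subtitles={srt}",  # ffmpeg's subtitles filter renders the .srt
        "-c:a", "copy",             # keep the original audio stream untouched
        out,
    ]

cmd = burn_subtitles_cmd("talk.webm", "talk.dual.srt", "talk.sub.webm")
print(" ".join(cmd))
# Execute with: subprocess.run(cmd, check=True)
```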

I hoarded 400 GB total.

Creating a Dictionary

I fed the Wiktionary Mixed Web word list to an LLM to generate Russian-only explanations for each word and compiled the result into an 860-page PDF. Basically, iterate this prompt over the words with an API:

Что означает {}? Дайте объяснение только на очень простом русском языке. Дайте синонимы.
Дайте 2 простых примера предложений.

**Объяснение**: [простое объяснение слова без слова спереди].
**Синонимы**: [список синонимов].
**Простые предложения**: «[предложение 1]» «[предложение 2]».

Используйте этот формат, чтобы ответ всегда возвращался в заданной структуре.

(The prompt asks: what does {} mean, explained only in very simple Russian, with synonyms and two simple example sentences, always returned in the fixed **Объяснение**/**Синонимы**/**Простые предложения** structure.)
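The iteration itself is only a few lines; a sketch in which `ask_llm` is a placeholder for whatever API client you use, not a real library call:

```python
# Iterate the prompt over a word list; ask_llm stands in for your API client.
PROMPT = (
    "Что означает {}? Дайте объяснение только на очень простом русском языке. "
    "Дайте синонимы. Дайте 2 простых примера предложений."
)

def build_prompt(word):
    return PROMPT.format(word)

def explain_all(words, ask_llm):
    """Return {word: explanation}, ready to be compiled into the PDF."""
    return {w: ask_llm(build_prompt(w)) for w in words}

# Dummy client, just to show the shape:
print(explain_all(["дом"], lambda prompt: "объяснение...")["дом"])
```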

There are some other word lists as well, like Serge Sharoff's frequency lists.

Smotrim

Format: m3u8
Language: ru

  1. There are different media types, including podcasts, videos, and brand (I think this means movies and TV shows)
  2. Most videos follow the format of https://smotrim.ru/video/NUMBER
  3. Lengths vary, but most TV show episodes are 40-60 minutes
  4. By entering https://smotrim.ru/video/NUMBER you sometimes get a video and sometimes a 404. The numbers range from 1 to at least 2848324
  5. Many videos are protected and show "Произошла сетевая ошибка" ("A network error occurred") without a Russian IP in the frontend, but the m3u8 is still fetchable, and you can use mpv with yt-dlp to play it from the URL, for example: mpv https://smotrim.ru/video/2228159. Some movies are locked behind a paywall (but videos seem free as long as your IP is in Russia)

Example URL:

https://smotrim.ru/video/2820848
https://smotrim.ru/video/2847857

The movie Ekaterina(2014) first season:

https://smotrim.ru/video/1146653
https://smotrim.ru/video/1146654
https://smotrim.ru/video/1146655
https://smotrim.ru/video/1146817

Then it went to

https://smotrim.ru/video/2228158
https://smotrim.ru/video/2228159
...
https://smotrim.ru/video/2228163

This makes raw sequential scraping harder.
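One way around the holes is to generate candidate IDs and probe each one, keeping only the URLs that resolve. A stdlib-only sketch (the actual probing needs network access and a Russian IP, so it is left commented out):

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

BASE = "https://smotrim.ru/video/{}"

def candidate_urls(start, stop):
    """Candidate video URLs over an ID range (many will be 404 holes)."""
    return [BASE.format(i) for i in range(start, stop)]

def exists(url):
    """True if the page resolves, False on a 404."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=10) as r:
            return r.status == 200
    except HTTPError as e:
        if e.code == 404:
            return False
        raise

urls = candidate_urls(2228158, 2228164)
print(urls[0])
# live = [u for u in urls if exists(u)]  # network-dependent, not run here
```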

Douban Discussion

Douban restricts your IP (requiring login after a while) if you use it too much, so you can cycle through many different IPs and use multiple threads.

https://github.com/jimchen2/archived-scripts/tree/master/douban-scraping-main

Discussion Table:

Douban's groups have a main discussion table.

Iterate through pages

for start in range(0, 175, 25):
    url = f"https://www.douban.com/group/[]/discussion?start={start}"
    discussions_table = driver.find_element(By.CSS_SELECTOR, "table.olt")

After that:

    topic_id_pattern = re.compile(r'/topic/(\d+)/')

    rows = soup.find_all('tr')
    for row in rows:
        title_cell = row.find('td', class_='title')
        title = title_cell.find('a').get_text(strip=True)
        link = title_cell.find('a')['href']

        # Extract the topic ID from the URL using the regular expression
        topic_id_match = topic_id_pattern.search(link)
        topic_id = topic_id_match.group(1) if topic_id_match else 'Unknown'

        author = row.find('td', nowrap='nowrap').find('a').get_text(strip=True)

        replies = row.find('td', class_='r-count').get_text(strip=True)

        last_post_time = row.find('td', class_='time').get_text(strip=True)

For each discussion there is a topic page (or the main page, there are many more comment pages)

        topic_content = WebDriverWait(driver, 100).until(
            EC.presence_of_element_located((By.CLASS_NAME, "topic-content"))
        )

Then you can try to extract the user who posted the topic and the post time

    user_link = soup.find('a', href=True)
    user_href = user_link['href'] if user_link else 'No link found'
    user_img = soup.find('img', class_='pil')
    user_img_src = user_img['src'] if user_img else 'No image found'
    user_img_alt = user_img['alt'] if user_img else 'No alt text found'

Comments/Replies

Use the total reply count to generate the comment-page URLs

        if replies >= 100:
            pages = replies // 100
            for i in range(1, pages + 1):
                links.append(f"{doc['Link']}?start={i * 100}")

Then extract the replies

        topic_replies = WebDriverWait(driver, 300).until(
            EC.presence_of_all_elements_located((By.CLASS_NAME, "topic-reply"))
        )

        comments_data = []
        # Iterate through each comment-item within topic-reply
        for reply in topic_replies:
            comment_items = reply.find_elements(By.CLASS_NAME, "comment-item")
            for item in comment_items:
                comments_data.append(item.get_attribute('innerHTML'))

Then you can extract the reply content, time, avatar image, and Douban homepage link from that.

You can choose whether or not to "localize" and scrape the media (e.g. images, mainly GIFs) from Douban, which isn't very large (I scraped 170k comments and there were only 4 GB of media).

1TV News

Format: m3u8, extremely easy to crawl with yt-dlp
Language: ru

  1. There are multiple streams a day, with the 21:00 issue generally being the longest.
  2. They follow the format of https://www.1tv.ru/news/issue/YYYY-MM-DD/HH:00#N
  3. Each issue is 30 minutes to 3 hours
  4. The format dates back to at least 2010, but you need some tweaking for older dates: they don't concatenate the videos by default, so you fetch parts #1 through #13 or #17 and so on. Since 2019 you can get the full-length video by entering https://www.1tv.ru/news/issue/YYYY-MM-DD and waiting for a redirect
  5. Some Olympic content is geo-protected

Example URL:

https://www.1tv.ru/news/issue/2024-07-11/21:00
https://www.1tv.ru/news/issue/2024-08-09/21:00
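Since the URL pattern is fixed, generating a date range of issue URLs to feed to yt-dlp is straightforward; a sketch:

```python
from datetime import date, timedelta

def issue_urls(start, end, hour="21:00"):
    """One evening-issue URL per day in [start, end]."""
    urls = []
    d = start
    while d <= end:
        urls.append(f"https://www.1tv.ru/news/issue/{d.isoformat()}/{hour}")
        d += timedelta(days=1)
    return urls

for url in issue_urls(date(2024, 7, 11), date(2024, 7, 13)):
    print(url)  # pipe these into: xargs -I {} yt-dlp {}
```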

Vk

Get a free VK API key from VK for Business. Register an app, then get Access tokens > Service token.

Example: https://vk.com/tutberidze.eteri

Refer to: https://dev.vk.com/en/method/users.get

API url example: https://api.vk.com/method/users.get?user_ids=tutberidze.eteri&fields=photo_100,bdate,city,status&access_token=SECRET_KEY&v=5.131

You can change the fields for other information
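Assembling such a call with the stdlib (the method, fields, and version match the example above; the token is of course a placeholder):

```python
from urllib.parse import urlencode

def vk_method_url(method, token, **params):
    """Build a VK API URL like the users.get example above."""
    params.update(access_token=token, v="5.131")
    return f"https://api.vk.com/method/{method}?{urlencode(params)}"

url = vk_method_url("users.get", "SECRET_KEY",
                    user_ids="tutberidze.eteri",
                    fields="photo_100,bdate,city,status")
print(url)
# Fetch with urllib.request.urlopen(url) and parse the JSON "response" field.
```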

API example response:

{
  "response": [
    {
      "id": 643779055,
      "bdate": "24.2.1974",
      "city": {
        "id": 1,
        "title": "Москва"
      },
      "status": "Coaching for life 🎗Заслуженный тренер России по фигурному катанию на коньках 🎫Мастер Спорта СССР",
......
    }
  ]
}

For VK groups: https://vk.com/tsiskaridzenikolayofficial

Refer to: https://dev.vk.com/en/method/groups.get

API url example: https://api.vk.com/method/groups.getById?group_ids=tsiskaridzenikolayofficial&fields=description,city,photo_100&access_token=ACCESS_TOKEN&v=5.131

API example response:

{
    "response": [
      {
        "id": 211218365,
        "city": {
          "id": 1,
          "title": "Москва"
        },
        "description": "Официальная страница\nНиколая Цискаридзе\n\nРегистрация в перечне РКН: https://gosuslugi.ru/snet/67927d41ee896061c9ca7145\n\nИСКУССТВО и ЖИЗНЬ\n\nПоказывать можно только зрячим. \nПеть песню — только тем, кто слышит. \nДари себя тому, кто будет благодарен, \nКто понимает, любит и ценит\nОмар Хайям\n\nПо всем вопросам tsiskaridzenikolay@gmail.com",
        "name": "Николай Цискаридзе",
......
      }
    ]
  }

Posts: https://dev.vk.com/en/method/wall.get

API Example: https://api.vk.com/method/wall.get?owner_id=tutberidze.eteri&count=20&access_token=ACCESS_TOKEN&v=5.131

API example response:

{
    "response": {
      "count": 74,
      "items": [
        {
          "inner_type": "wall_wallpost",
  ...
    }
}

Each Item:

{
...
    "attachments": [
...
    ],
    "date": 1741546333,
    "edited": 1741546346,
 ...
    "text": "Приглашаю вас на ледовое шоу #TeamTutberidze! \n\nВас ждут новые истории и легендарные программы в исполнении одних из лучших фигуристов в мире.\n\nЖду встречи с каждым из вас в 11 городах!\n\nСсылка на билеты в шапке профиля.",
    "views": {
      "count": 1601
    }
  },

Photo Attachments:

      {
        "type": "photo",
        "photo": {
         ...
          "sizes": [
            {
              "height": 75,
              "type": "s",
              "width": 75,
              "url": "https://sun6-21.userapi.com/s/v1/ig2/GYs-E55vvY2132z9DgSoTvPO0Sk0Iely8Tp5RSZZB_wEMwfsZCWaM_pJUHuZm4W0l_NaaRQGT-e5m1Y3dOfrxqMx.jpg?quality=95&as=32x32,48x48,72x72,108x108,160x160,240x240,360x360,480x480,540x540,640x640,720x720,1080x1080&from=bu&cs=75x75"
            },
         ......
          ],
          "orig_photo": {
        ......
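When downloading, you generally want the largest rendition from `sizes`; picking it by pixel area is one way (the sample data below is made up to mirror the shape above):

```python
def best_size(sizes):
    """Pick the largest rendition from a VK photo 'sizes' list."""
    return max(sizes, key=lambda s: s["width"] * s["height"])

sizes = [
    {"type": "s", "width": 75, "height": 75, "url": "https://example.com/s.jpg"},
    {"type": "x", "width": 604, "height": 453, "url": "https://example.com/x.jpg"},
]
print(best_size(sizes)["url"])
```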

Dzen

I didn't find any APIs for Dzen. I think it was sold to Mail.ru (it was previously Yandex Zen).

Example: https://dzen.ru/a/Z9khBocwVgufmUKF

Set up Selenium, and you need a Russian IP, or else you'll be blocked by a captcha.

  • Title: the og:title meta tag
  • Date: find the meta tag with datePublished, e.g. <meta itemprop="datePublished" content="2024-01-15">
  • Text HTML: find the tag <div ... aria-label="Статья 1" ...>; inside it, find <div ... itemprop="articleBody" ...>. Grab it and re-process the image blocks, i.e. tags like <div ... data-block-type="image" ...>
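Extracting the title and date from the page head needs no extra libraries; a sketch with the stdlib parser (the sample HTML is made up, but mirrors the meta tags described above):

```python
from html.parser import HTMLParser

class MetaGrabber(HTMLParser):
    """Collect og:title and datePublished from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        if a.get("property") == "og:title":
            self.meta["title"] = a.get("content")
        elif a.get("itemprop") == "datePublished":
            self.meta["date"] = a.get("content")

grabber = MetaGrabber()
grabber.feed('<meta property="og:title" content="Заголовок статьи">'
             '<meta itemprop="datePublished" content="2024-01-15">')
print(grabber.meta)
```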

Accounts:

Example: https://dzen.ru/tourister?tab=articles

  • Grab og:title, og:image, og:description
  • Look for divs with id="zen-row-xxx", like id="zen-row-17" or id="zen-row-22"
