🌐 How to archive an entire web page in a single HTML file

Use SingleFile to archive an entire web page into a single HTML file.

Browser extension

SingleFile is a web extension (and CLI tool) compatible with Chrome, Firefox (Desktop and Mobile), Microsoft Edge, Vivaldi, Brave, Waterfox, Yandex browser, and Opera.

It saves a complete web page, including its resources, into a single HTML file.

CLI utility

You can also use the command-line utility, which launches a headless browser to load and save the page.

Install npm.

$ sudo apt install npm

Install puppeteer.

$ npm install puppeteer

Install SingleFile.

$ npm install "gildas-lormeau/SingleFile#master"

Make sure the installed executables are on your PATH.

Update your .bashrc file accordingly.

$ export PATH=$PATH:~/node_modules/.bin/
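The export above only affects the current shell session. As a minimal sketch of making it persistent, the function below appends the line to a startup file only when it is not already present. The function name add_npm_bin_to_path is hypothetical, and the sketch assumes the packages were installed from the home directory, so the executables live in ~/node_modules/.bin.

```shell
# Hypothetical helper: append the PATH line to a startup file exactly once.
# $1: startup file to update (defaults to ~/.bashrc)
add_npm_bin_to_path() {
    rc_file="${1:-$HOME/.bashrc}"
    # Single quotes keep $PATH and $HOME literal, so they expand at login time.
    path_line='export PATH="$PATH:$HOME/node_modules/.bin"'
    touch "$rc_file"
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -Fxq "$path_line" "$rc_file"; then
        printf '%s\n' "$path_line" >> "$rc_file"
    fi
}
```

Calling add_npm_bin_to_path without arguments updates ~/.bashrc; open a new terminal or source the file afterwards for the change to take effect.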

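Display the help information. The labels in the output below are in Polish because the command was run on a system with a Polish locale.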

$ single-file --help
single-file [url] [output]

Save a page into a single HTML file.

Pozycyjne:
  url     URL or path on the filesystem of the page to save  [ciąg znaków]
  output  Output filename  [ciąg znaków]

Opcje:
  --help                                  Pokaż pomoc  [boolean]
  --version                               Pokaż numer wersji  [boolean]
  --back-end                              Back-end to use  [dostępne: "jsdom", "puppeteer", "webdriver-chromium", "webdriver-gecko", "puppeteer-firefox", "playwright-firefox", "playwright-chromium"] [domyślny: "puppeteer"]
  --block-mixed-content                   Block mixed contents  [boolean] [domyślny: false]
  --browser-server                        Server to connect to (puppeteer only for now)  [ciąg znaków] [domyślny: ""]
  --browser-headless                      Run the browser in headless mode (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [domyślny: true]
  --browser-executable-path               Path to chrome/chromium executable (puppeteer, webdriver-gecko, webdriver-chromium)  [ciąg znaków] [domyślny: ""]
  --browser-width                         Width of the browser viewport in pixels  [liczba] [domyślny: 1280]
  --browser-height                        Height of the browser viewport in pixels  [liczba] [domyślny: 720]
  --browser-load-max-time                 Maximum delay of time to wait for page loading in ms (puppeteer, webdriver-gecko, webdriver-chromium)  [liczba] [domyślny: 60000]
  --browser-wait-delay                    Time to wait before capturing the page in ms  [liczba] [domyślny: 0]
  --browser-wait-until                    When to consider the page is loaded (puppeteer, webdriver-gecko, webdriver-chromium)  [dostępne: "networkidle0", "networkidle2", "load", "domcontentloaded"] [domyślny: "networkidle0"]
  --browser-wait-until-fallback           Retry with the next value of --browser-wait-until when a timeout error is thrown  [boolean] [domyślny: true]
  --browser-debug                         Enable debug mode (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [domyślny: false]
  --browser-script                        Path of a script executed in the page (and all the frames) before it is loaded  [tablica] [domyślny: []]
  --browser-stylesheet                    Path of a stylesheet file inserted into the page (and all the frames) after it is loaded  [tablica] [domyślny: []]
  --browser-args                          Arguments provided as a JSON array and passed to the browser (puppeteer, webdriver-gecko, webdriver-chromium)  [ciąg znaków] [domyślny: ""]
  --browser-start-minimized               Minimize the browser (puppeteer)  [boolean] [domyślny: false]
  --browser-cookie                        Ordered list of cookie parameters separated by a comma: name,value,domain,path,expires,httpOnly,secure,sameSite,url (puppeteer, webdriver-gecko, webdriver-chromium, jsdom)  [tablica] [domyślny: []]
  --browser-cookies-file                  Path of the cookies file formatted as a JSON file or a Netscape text file (puppeteer, webdriver-gecko, webdriver-chromium, jsdom)  [ciąg znaków] [domyślny: ""]
  --compress-CSS                          Compress CSS stylesheets  [boolean] [domyślny: false]
  --compress-HTML                         Compress HTML content  [boolean] [domyślny: true]
  --crawl-links                           Crawl and save pages found via inner links  [boolean] [domyślny: false]
  --crawl-inner-links-only                Crawl pages found via inner links only if they are hosted on the same domain  [boolean] [domyślny: true]
  --crawl-no-parent                       Crawl pages found via inner links only if their URLs are not parent of the URL to crawl  [boolean]
  --crawl-load-session                    Name of the file of the session to load (previously saved with --crawl-save-session or --crawl-sync-session)  [ciąg znaków]
  --crawl-remove-url-fragment             Remove URL fragments found in links  [boolean] [domyślny: true]
  --crawl-save-session                    Name of the file where to save the state of the session  [ciąg znaków]
  --crawl-sync-session                    Name of the file where to load and save the state of the session  [ciąg znaków]
  --crawl-max-depth                       Max depth when crawling pages found in internal and external links (0: infinite)  [liczba] [domyślny: 1]
  --crawl-external-links-max-depth        Max depth when crawling pages found in external links (0: infinite)  [liczba] [domyślny: 1]
  --crawl-replace-urls                    Replace URLs of saved pages with relative paths of saved pages on the filesystem  [boolean] [domyślny: false]
  --crawl-rewrite-rule                    Rewrite rule used to rewrite URLs of crawled pages  [tablica] [domyślny: []]
  --dump-content                          Dump the content of the processed page in the console ('true' when running in Docker)  [boolean] [domyślny: false]
  --emulate-media-feature                 Emulate a media feature. The syntax is <name>:<value>, e.g. "prefers-color-scheme:dark" (puppeteer)  [tablica]
  --error-file  [ciąg znaków]
  --filename-template                     Template used to generate the output filename (see help page of the extension for more info)  [ciąg znaków] [domyślny: "{page-title} ({date-iso} {time-locale}).html"]
  --filename-conflict-action              Action when the filename is conflicting with existing one on the filesystem. The possible values are "uniquify" (default), "overwrite" and "skip"  [ciąg znaków] [domyślny: "uniquify"]
  --filename-replacement-character        The character used for replacing invalid characters in filenames  [ciąg znaków] [domyślny: "_"]
  --filename-max-length                   Specify the maximum length of the filename  [liczba] [domyślny: 192]
  --filename-max-length-unit              Specify the unit of the maximum length of the filename ('bytes' or 'chars')  [ciąg znaków] [domyślny: "bytes"]
  --group-duplicate-images                Group duplicate images into CSS custom properties  [boolean] [domyślny: true]
  --http-header                           Extra HTTP header (puppeteer, jsdom)  [tablica] [domyślny: []]
  --include-BOM                           Include the UTF-8 BOM into the HTML page  [boolean] [domyślny: false]
  --include-infobar                       Include the infobar  [boolean] [domyślny: false]
  --load-deferred-images                  Load deferred (a.k.a. lazy-loaded) images (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [domyślny: true]
  --load-deferred-images-max-idle-time    Maximum delay of time to wait for deferred images in ms (puppeteer, webdriver-gecko, webdriver-chromium)  [liczba] [domyślny: 1500]
  --load-deferred-images-keep-zoom-level  Load deferred images by keeping zoomed out the page  [boolean] [domyślny: false]
  --max-parallel-workers                  Maximum number of browsers launched in parallel when processing a list of URLs (cf --urls-file)  [liczba] [domyślny: 8]
  --max-resource-size-enabled             Enable removal of embedded resources exceeding a given size  [boolean] [domyślny: false]
  --max-resource-size                     Maximum size of embedded resources in MB (i.e. images, stylesheets, scripts and iframes)  [liczba] [domyślny: 10]
  --move-styles-in-head                   Move style elements outside the head element into the head element  [boolean] [domyślny: false]
  --remove-frames                         Remove frames (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [domyślny: false]
  --remove-hidden-elements                Remove HTML elements which are not displayed  [boolean] [domyślny: true]
  --remove-unused-styles                  Remove unused CSS rules and unneeded declarations  [boolean] [domyślny: true]
  --remove-unused-fonts                   Remove unused CSS font rules  [boolean] [domyślny: true]
  --remove-imports                        Remove HTML imports  [boolean] [domyślny: true]
  --remove-scripts                        Remove JavaScript scripts  [boolean] [domyślny: true]
  --remove-audio-src                      Remove source of audio elements  [boolean] [domyślny: true]
  --remove-video-src                      Remove source of video elements  [boolean] [domyślny: true]
  --remove-alternative-fonts              Remove alternative fonts to the ones displayed  [boolean] [domyślny: true]
  --remove-alternative-medias             Remove alternative CSS stylesheets  [boolean] [domyślny: true]
  --remove-alternative-images             Remove images for alternative sizes of screen  [boolean] [domyślny: true]
  --save-original-urls                    Save the original URLS in the embedded contents  [boolean] [domyślny: false]
  --save-raw-page                         Save the original page without interpreting it into the browser (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [domyślny: false]
  --urls-file                             Path to a text file containing a list of URLs (separated by a newline) to save  [ciąg znaków]
  --user-agent                            User-agent of the browser (puppeteer, webdriver-gecko, webdriver-chromium)  [ciąg znaków]
  --user-script-enabled                   Enable the event API allowing to execute scripts before the page is saved  [boolean] [domyślny: true]
  --web-driver-executable-path            Path to Selenium WebDriver executable (webdriver-gecko, webdriver-chromium)  [ciąg znaków] [domyślny: ""]
  --output-directory                      Path to where to save files, this path must exist.  [ciąg znaków] [domyślny: ""]
  --accept-headers  [domyślny: {"font":"application/font-woff2;q=1.0,application/font-woff;q=0.9,*/*;q=0.8","image":"image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8","stylesheet":"text/css,*/*;q=0.1","script":"*/*","document":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"}]

Run a test command that prints the processed page to the console instead of saving it.

$ single-file --back-end puppeteer --browser-executable-path /snap/bin/chromium  https://www.debian.org --dump-content
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html lang="pl"><!--
 Page saved with SingleFile 
 url: https://www.debian.org 
 saved date: Wed Mar 23 2022 00:15:53 GMT+0100 (czas środkowoeuropejski standardowy)
--><head><meta charset="utf-8">
  
  <title>Debian -- The Universal Operating System </title>
  <link rel="author" href="mailto:webmaster@debian.org">
  <meta name="Description" content="Debian to system operacyjny i dystrybucja Wolnego Oprogramowania. Opiekuje się nią wielu użytkowników, którzy poświęcają jej swój czas i wysiłek.">
  <meta name="Generator" content="WML 2.12.2">
  <meta name="Modified" content="2022-03-02 07:57:56">
  <meta name="viewport" content="width=device-width">
  <meta name="mobileoptimized" content="300">
  <meta name="HandheldFriendly" content="true">
[...]
Last Built: śro, 2. mar 2022r, 07:57:56 UTC
  <br>
  Copyright © 1997-2022
 <a href="https://www.spi-inc.org/">SPI</a> i inni; Zobacz <a href="https://www.debian.org/license" rel="copyright">warunki umowy</a><br>
  Debian jest zarejestrowanym <a href="https://www.debian.org/trademark">znakiem handlowym</a> Software in the Public Interest, Inc.
</p>
</div>
<!--/UdmComment-->
</div> <!-- end footer -->

</body></html>

Save the web page.

$ single-file --back-end puppeteer --browser-executable-path /snap/bin/chromium  https://www.debian.org
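The usage line in the help output shows an optional second positional argument, output, for the file name. As a rough sketch, assuming you prefer a predictable name over the default template, the hypothetical helper url_to_filename (not part of SingleFile) derives one from the URL.

```shell
# Hypothetical helper: turn a URL into a file name that is safe on most filesystems.
# $1: URL of the page to save
url_to_filename() {
    name="${1#*://}"    # drop the scheme, e.g. "https://"
    name="${name%/}"    # drop one trailing slash, if any
    # Replace every character outside A-Z, a-z, 0-9, ".", "_" and "-" with "_".
    name="$(printf '%s' "$name" | tr -c 'A-Za-z0-9._-' '_')"
    printf '%s.html\n' "$name"
}

# Possible usage, with the same back-end options as in this article:
#   url=https://www.debian.org
#   single-file --back-end puppeteer --browser-executable-path /snap/bin/chromium "$url" "$(url_to_filename "$url")"
```

With this helper, https://www.debian.org maps to www.debian.org.html.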

List the archived web page.

$ ls *.html
'Debian -- The Universal Operating System (2022-03-22 00_18_11).html'

Open the archived web page.

$ chromium 'Debian -- The Universal Operating System (2022-03-22 00_18_11).html'
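Because the default file name contains a timestamp, typing it by hand is error-prone. The sketch below picks the most recently modified .html file instead. The helper latest_html is hypothetical, and it assumes file names do not contain newlines.

```shell
# Hypothetical helper: print the most recently modified .html file in a directory.
# $1: directory to search (defaults to the current directory)
latest_html() {
    # ls -t sorts by modification time, newest first; head keeps the first entry.
    ls -t "${1:-.}"/*.html 2>/dev/null | head -n 1
}

# Possible usage:
#   chromium "$(latest_html)"
```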

Alternatively, create a list of URLs and save them all in one run.

$ cat <<EOF | tee /tmp/urls.txt
https://linux.com
https://debian.org
EOF
$ single-file --back-end puppeteer --browser-executable-path /snap/bin/chromium --urls-file=/tmp/urls.txt

Verify the result:

$ ls *.html
'Debian -- The Universal Operating System (2022-03-22 00_18_11).html'
'Debian -- The Universal Operating System (2022-03-22 00_29_31).html'
'Linux.com - News For Open Source Professionals (2022-03-22 00_29_32).html'
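According to the help output, --urls-file expects one URL per line. For longer, hand-maintained lists, a small cleanup step may help avoid saving the same page twice. This is only a sketch: clean_url_list is a hypothetical helper, and it keeps only lines that start with http:// or https://.

```shell
# Hypothetical helper: normalize a URL list before passing it to --urls-file.
# $1: input file, $2: output file
clean_url_list() {
    # 1. trim leading and trailing whitespace
    # 2. keep only lines that look like http(s) URLs (drops blanks and comments)
    # 3. drop duplicates while preserving the original order
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' "$1" \
        | grep -E '^https?://' \
        | awk '!seen[$0]++' > "$2"
}

# Possible usage:
#   clean_url_list /tmp/urls.txt /tmp/urls.clean.txt
#   single-file --back-end puppeteer --browser-executable-path /snap/bin/chromium --urls-file=/tmp/urls.clean.txt
```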

 
