Use SingleFile to archive an entire web page as a single HTML file.
Browser extension
SingleFile is a web extension (and a CLI tool) compatible with Chrome, Firefox (desktop and mobile), Microsoft Edge, Vivaldi, Brave, Waterfox, Yandex Browser, and Opera.
It saves a complete web page as a single HTML file.
CLI utility
You can also use the command-line utility, which launches a headless browser.
Install npm.
$ sudo apt install npm
Install puppeteer.
$ npm install puppeteer
Install SingleFile.
$ npm install "gildas-lormeau/SingleFile#master"
Make sure that PATH includes the directory with the installed executables.
To make the change permanent, add the same line to your .bashrc file.
$ export PATH=$PATH:~/node_modules/.bin/
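The .bashrc update mentioned above can be done without creating duplicate entries on repeated runs. The following is a minimal sketch; append_once is a hypothetical helper name introduced here for illustration, not something provided by npm or SingleFile.

```shell
# append_once FILE LINE: add LINE to FILE unless an identical line already exists.
# (append_once is a hypothetical helper name used only in this sketch.)
append_once() {
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string.
    # If FILE does not exist yet, grep fails and the line is written to a new file.
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}
```

For example, append_once ~/.bashrc 'export PATH=$PATH:~/node_modules/.bin/' adds the line once; the single quotes keep $PATH unexpanded so that it is evaluated each time .bashrc is read.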
Display the help information. The labels in the output below are in Polish because the command was run under a Polish locale.
$ single-file --help
single-file [url] [output]

Save a page into a single HTML file.

Pozycyjne:
  url     URL or path on the filesystem of the page to save  [ciąg znaków]
  output  Output filename  [ciąg znaków]

Opcje:
  --help  Pokaż pomoc  [boolean]
  --version  Pokaż numer wersji  [boolean]
  --back-end  Back-end to use  [dostępne: "jsdom", "puppeteer", "webdriver-chromium", "webdriver-gecko", "puppeteer-firefox", "playwright-firefox", "playwright-chromium"] [domyślny: "puppeteer"]
  --block-mixed-content  Block mixed contents  [boolean] [domyślny: false]
  --browser-server  Server to connect to (puppeteer only for now)  [ciąg znaków] [domyślny: ""]
  --browser-headless  Run the browser in headless mode (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [domyślny: true]
  --browser-executable-path  Path to chrome/chromium executable (puppeteer, webdriver-gecko, webdriver-chromium)  [ciąg znaków] [domyślny: ""]
  --browser-width  Width of the browser viewport in pixels  [liczba] [domyślny: 1280]
  --browser-height  Height of the browser viewport in pixels  [liczba] [domyślny: 720]
  --browser-load-max-time  Maximum delay of time to wait for page loading in ms (puppeteer, webdriver-gecko, webdriver-chromium)  [liczba] [domyślny: 60000]
  --browser-wait-delay  Time to wait before capturing the page in ms  [liczba] [domyślny: 0]
  --browser-wait-until  When to consider the page is loaded (puppeteer, webdriver-gecko, webdriver-chromium)  [dostępne: "networkidle0", "networkidle2", "load", "domcontentloaded"] [domyślny: "networkidle0"]
  --browser-wait-until-fallback  Retry with the next value of --browser-wait-until when a timeout error is thrown  [boolean] [domyślny: true]
  --browser-debug  Enable debug mode (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [domyślny: false]
  --browser-script  Path of a script executed in the page (and all the frames) before it is loaded  [tablica] [domyślny: []]
  --browser-stylesheet  Path of a stylesheet file inserted into the page (and all the frames) after it is loaded  [tablica] [domyślny: []]
  --browser-args  Arguments provided as a JSON array and passed to the browser (puppeteer, webdriver-gecko, webdriver-chromium)  [ciąg znaków] [domyślny: ""]
  --browser-start-minimized  Minimize the browser (puppeteer)  [boolean] [domyślny: false]
  --browser-cookie  Ordered list of cookie parameters separated by a comma: name,value,domain,path,expires,httpOnly,secure,sameSite,url (puppeteer, webdriver-gecko, webdriver-chromium, jsdom)  [tablica] [domyślny: []]
  --browser-cookies-file  Path of the cookies file formatted as a JSON file or a Netscape text file (puppeteer, webdriver-gecko, webdriver-chromium, jsdom)  [ciąg znaków] [domyślny: ""]
  --compress-CSS  Compress CSS stylesheets  [boolean] [domyślny: false]
  --compress-HTML  Compress HTML content  [boolean] [domyślny: true]
  --crawl-links  Crawl and save pages found via inner links  [boolean] [domyślny: false]
  --crawl-inner-links-only  Crawl pages found via inner links only if they are hosted on the same domain  [boolean] [domyślny: true]
  --crawl-no-parent  Crawl pages found via inner links only if their URLs are not parent of the URL to crawl  [boolean]
  --crawl-load-session  Name of the file of the session to load (previously saved with --crawl-save-session or --crawl-sync-session)  [ciąg znaków]
  --crawl-remove-url-fragment  Remove URL fragments found in links  [boolean] [domyślny: true]
  --crawl-save-session  Name of the file where to save the state of the session  [ciąg znaków]
  --crawl-sync-session  Name of the file where to load and save the state of the session  [ciąg znaków]
  --crawl-max-depth  Max depth when crawling pages found in internal and external links (0: infinite)  [liczba] [domyślny: 1]
  --crawl-external-links-max-depth  Max depth when crawling pages found in external links (0: infinite)  [liczba] [domyślny: 1]
  --crawl-replace-urls  Replace URLs of saved pages with relative paths of saved pages on the filesystem  [boolean] [domyślny: false]
  --crawl-rewrite-rule  Rewrite rule used to rewrite URLs of crawled pages  [tablica] [domyślny: []]
  --dump-content  Dump the content of the processed page in the console ('true' when running in Docker)  [boolean] [domyślny: false]
  --emulate-media-feature  Emulate a media feature. The syntax is <name>:<value>, e.g. "prefers-color-scheme:dark" (puppeteer)  [tablica]
  --error-file  [ciąg znaków]
  --filename-template  Template used to generate the output filename (see help page of the extension for more info)  [ciąg znaków] [domyślny: "{page-title} ({date-iso} {time-locale}).html"]
  --filename-conflict-action  Action when the filename is conflicting with existing one on the filesystem. The possible values are "uniquify" (default), "overwrite" and "skip"  [ciąg znaków] [domyślny: "uniquify"]
  --filename-replacement-character  The character used for replacing invalid characters in filenames  [ciąg znaków] [domyślny: "_"]
  --filename-max-length  Specify the maximum length of the filename  [liczba] [domyślny: 192]
  --filename-max-length-unit  Specify the unit of the maximum length of the filename ('bytes' or 'chars')  [ciąg znaków] [domyślny: "bytes"]
  --group-duplicate-images  Group duplicate images into CSS custom properties  [boolean] [domyślny: true]
  --http-header  Extra HTTP header (puppeteer, jsdom)  [tablica] [domyślny: []]
  --include-BOM  Include the UTF-8 BOM into the HTML page  [boolean] [domyślny: false]
  --include-infobar  Include the infobar  [boolean] [domyślny: false]
  --load-deferred-images  Load deferred (a.k.a. lazy-loaded) images (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [domyślny: true]
  --load-deferred-images-max-idle-time  Maximum delay of time to wait for deferred images in ms (puppeteer, webdriver-gecko, webdriver-chromium)  [liczba] [domyślny: 1500]
  --load-deferred-images-keep-zoom-level  Load deferred images by keeping zoomed out the page  [boolean] [domyślny: false]
  --max-parallel-workers  Maximum number of browsers launched in parallel when processing a list of URLs (cf --urls-file)  [liczba] [domyślny: 8]
  --max-resource-size-enabled  Enable removal of embedded resources exceeding a given size  [boolean] [domyślny: false]
  --max-resource-size  Maximum size of embedded resources in MB (i.e. images, stylesheets, scripts and iframes)  [liczba] [domyślny: 10]
  --move-styles-in-head  Move style elements outside the head element into the head element  [boolean] [domyślny: false]
  --remove-frames  Remove frames (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [domyślny: false]
  --remove-hidden-elements  Remove HTML elements which are not displayed  [boolean] [domyślny: true]
  --remove-unused-styles  Remove unused CSS rules and unneeded declarations  [boolean] [domyślny: true]
  --remove-unused-fonts  Remove unused CSS font rules  [boolean] [domyślny: true]
  --remove-imports  Remove HTML imports  [boolean] [domyślny: true]
  --remove-scripts  Remove JavaScript scripts  [boolean] [domyślny: true]
  --remove-audio-src  Remove source of audio elements  [boolean] [domyślny: true]
  --remove-video-src  Remove source of video elements  [boolean] [domyślny: true]
  --remove-alternative-fonts  Remove alternative fonts to the ones displayed  [boolean] [domyślny: true]
  --remove-alternative-medias  Remove alternative CSS stylesheets  [boolean] [domyślny: true]
  --remove-alternative-images  Remove images for alternative sizes of screen  [boolean] [domyślny: true]
  --save-original-urls  Save the original URLS in the embedded contents  [boolean] [domyślny: false]
  --save-raw-page  Save the original page without interpreting it into the browser (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [domyślny: false]
  --urls-file  Path to a text file containing a list of URLs (separated by a newline) to save  [ciąg znaków]
  --user-agent  User-agent of the browser (puppeteer, webdriver-gecko, webdriver-chromium)  [ciąg znaków]
  --user-script-enabled  Enable the event API allowing to execute scripts before the page is saved  [boolean] [domyślny: true]
  --web-driver-executable-path  Path to Selenium WebDriver executable (webdriver-gecko, webdriver-chromium)  [ciąg znaków] [domyślny: ""]
  --output-directory  Path to where to save files, this path must exist.  [ciąg znaków] [domyślny: ""]
  --accept-headers  [domyślny: {"font":"application/font-woff2;q=1.0,application/font-woff;q=0.9,*/*;q=0.8","image":"image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8","stylesheet":"text/css,*/*;q=0.1","script":"*/*","document":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"}]
Run a test command that prints the processed page to the console.
$ single-file --back-end puppeteer --browser-executable-path /snap/bin/chromium https://www.debian.org --dump-content
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="pl"><!--
 Page saved with SingleFile
 url: https://www.debian.org
 saved date: Wed Mar 23 2022 00:15:53 GMT+0100 (czas środkowoeuropejski standardowy)
--><head><meta charset="utf-8">
<title>Debian -- The Universal Operating System </title>
<link rel="author" href="mailto:webmaster@debian.org">
<meta name="Description" content="Debian to system operacyjny i dystrybucja Wolnego Oprogramowania. Opiekuje się nią wielu użytkowników, którzy poświęcają jej swój czas i wysiłek.">
<meta name="Generator" content="WML 2.12.2">
<meta name="Modified" content="2022-03-02 07:57:56">
<meta name="viewport" content="width=device-width">
<meta name="mobileoptimized" content="300">
<meta name="HandheldFriendly" content="true">
[...]
Last Built: śro, 2. mar 2022r, 07:57:56 UTC <br>
Copyright © 1997-2022 <a href="https://www.spi-inc.org/">SPI</a> i inni; Zobacz <a href="https://www.debian.org/license" rel="copyright">warunki umowy</a><br>
Debian jest zarejestrowanym <a href="https://www.debian.org/trademark">znakiem handlowym</a> Software in the Public Interest, Inc. </p>
</div> <!--/UdmComment-->
</div> <!-- end footer -->
</body></html>
Save the web page to a file.
$ single-file --back-end puppeteer --browser-executable-path /snap/bin/chromium https://www.debian.org
List the archived web page.
$ ls *.html
'Debian -- The Universal Operating System (2022-03-22 00_18_11).html'
Open the archived web page.
$ chromium 'Debian -- The Universal Operating System (2022-03-22 00_18_11).html'
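The default filename template produces timestamped names in the current directory. As a rough sketch, the save step can be wrapped in a small function that writes into a dedicated directory instead. The options used are taken from the help output shown earlier; archive_page and the SINGLE_FILE variable are hypothetical names introduced only for this example, and the one-second wait delay is an arbitrary assumption.

```shell
# archive_page URL DIR: save URL into DIR with single-file.
# archive_page and SINGLE_FILE are hypothetical names used only in this sketch;
# SINGLE_FILE lets you point at a different executable (default: single-file).
archive_page() {
    # The help text says the --output-directory path must exist, so create it first.
    mkdir -p "$2" || return 1
    "${SINGLE_FILE:-single-file}" \
        --back-end puppeteer \
        --output-directory "$2" \
        --filename-conflict-action overwrite \
        --browser-wait-delay 1000 \
        "$1"
}
```

Calling archive_page https://www.debian.org ~/archive would then save the page under ~/archive. If puppeteer cannot find a browser on its own, the --browser-executable-path option used in the commands above would need to be added to the function.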
Alternatively, create a list of URLs, one per line, and pass it to single-file.
$ cat <<EOF | tee /tmp/urls.txt
https://linux.com
https://debian.org
EOF
$ single-file --back-end puppeteer --browser-executable-path /snap/bin/chromium --urls-file=/tmp/urls.txt
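Because the default conflict action is "uniquify", a URL that appears twice in the list is saved twice under different names. A minimal sketch for building the list without blank lines or repeated entries follows; write_url_list is a hypothetical helper name, not part of SingleFile.

```shell
# write_url_list FILE URL...: write the given URLs to FILE, one per line,
# dropping empty entries and repeated URLs (the first occurrence is kept).
# (write_url_list is a hypothetical helper name used only in this sketch.)
write_url_list() {
    list_file=$1
    shift
    # NF skips empty lines; seen[] remembers URLs that were already printed.
    printf '%s\n' "$@" | awk 'NF && !seen[$0]++' > "$list_file"
}
```

For example, write_url_list /tmp/urls.txt https://linux.com https://debian.org https://linux.com produces a two-line file that can be passed to --urls-file.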
Verify the result:
$ ls *.html
'Debian -- The Universal Operating System (2022-03-22 00_18_11).html'
'Debian -- The Universal Operating System (2022-03-22 00_29_31).html'
'Linux.com - News For Open Source Professionals (2022-03-22 00_29_32).html'
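Beyond listing the files, a quick content check is possible. The --dump-content output earlier shows that the saved page starts with a comment containing "Page saved with SingleFile". The sketch below looks for that marker, assuming the comment is written with the options you used; is_singlefile_archive is a hypothetical helper name.

```shell
# is_singlefile_archive FILE: succeed if FILE contains the marker comment
# seen in the --dump-content output ("Page saved with SingleFile").
# (is_singlefile_archive is a hypothetical helper name used only in this sketch.)
is_singlefile_archive() {
    grep -q -e 'Page saved with SingleFile' -- "$1"
}
```

A loop such as for f in *.html; do is_singlefile_archive "$f" || echo "no SingleFile marker: $f"; done would report any file in the current directory that lacks the marker.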