warc convert warc

Convert WARC file into WARC file

Synopsis

The WARC to WARC converter can be used to reorganize, convert or repair WARC-records. This is an experimental feature.

warc convert warc FILE/DIR ... [flags]

Options

      --close-input-file-hook string    command to run after closing each input file; the command receives these environment variables:
                                        	WARC_COMMAND contains the subcommand name
                                        	WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
                                        	WARC_FILE_NAME contains the file name of the input file
                                        	WARC_ERROR_COUNT contains the number of errors found if the file was validated and the validation failed
      --close-output-file-hook string   command to run after closing each output file; the command receives these environment variables:
                                        	WARC_COMMAND contains the subcommand name
                                        	WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
                                        	WARC_FILE_NAME contains the file name of the output file
                                        	WARC_SIZE contains the size of the output file
                                        	WARC_INFO_ID contains the ID of the output file's WARCInfo-record if created
                                        	WARC_SRC_FILE_NAME contains the file name of the input file if the output file is generated from an input file
                                        	WARC_HASH contains the hash of the output file if computed
                                        	WARC_ERROR_COUNT contains the number of errors found if the file was validated and the validation failed
  -z, --compress                        enable gzip compression for WARC output files (default true)
      --compression-level int           gzip compression level (1-9, -1 uses the gzip library default) (default -1)
  -c, --concurrency int                 number of input files to process in parallel (default 6)
  -C, --concurrent-writers int          maximum number of WARC files written concurrently.
                                        This may create at least this many output files even with a single input file. (default 16)
      --continue-on-error               continue processing remaining files and directories after errors
      --default-date string             fallback date used when records are missing WARC-Date metadata (time is set to 12:00 UTC) (default "2026-3-25")
      --file-size string                maximum size of each WARC output file (default "1GB")
      --flush                           sync each WARC file to disk after every record
  -f, --force                           continue iterating even when record read errors occur
      --ftp-pool-size int32             size of the FTP connection pool (default 1)
  -h, --help                            help for warc
      --index-dir string                directory used to store index data (default "/home/runner/.cache/warchaeology/warc")
  -i, --input-file string               input filesystem source; default is the local OS filesystem
                                        Legal values:
                                        	/path/to/archive.( tar | tar.gz | tgz | zip | wacz )
                                        	ftp://user/pass@host:port
                                        
  -k, --keep-index                      keep index files in --index-dir after the run so later runs can continue from them
      --lax-host-parsing                allow lenient host parsing in URL parsing
      --lenient                         minimize validation for faster, more permissive parsing
  -l, --limit int                       maximum number of records to process; ignored when --nth is set
      --min-disk-free string            minimum free disk space required to continue writing WARC output (default "1GB")
      --name-generator string           name generator strategy.
                                        With 'identity', the input filename is reused for output (prefix/suffix may still change),
                                        and exactly one output file is created per input file. (default "default")
  -K, --new-index                       start with a fresh index by deleting any existing index in --index-dir at startup
  -n, --nth int                         process only the n-th record after filtering
  -o, --offset int                      start processing at this byte offset in the input file (default: 0)
      --one-to-one                      write each input file to exactly one output file.
                                        Equivalent to: --concurrent-writers=1 --file-size=0 --name-generator=identity
      --open-input-file-hook string     command to run before opening each input file; the command receives these environment variables:
                                        	WARC_COMMAND contains the subcommand name
                                        	WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
                                        	WARC_FILE_NAME contains the file name of the input file
      --open-output-file-hook string    command to run before opening each output file; the command receives these environment variables:
                                        	WARC_COMMAND contains the subcommand name
                                        	WARC_HOOK_TYPE contains the hook type (OpenInputFile, CloseInputFile, OpenOutputFile, CloseOutputFile)
                                        	WARC_FILE_NAME contains the file name of the output file
                                        	WARC_SRC_FILE_NAME contains the file name of the input file if the output file is generated from an input file
  -w, --output-dir string               output directory for generated WARC files (must already exist) (default ".")
  -p, --prefix string                   filename prefix for generated WARC files
  -r, --recursive                       walk input directories recursively
  -R, --repair                          attempt to repair malformed records when possible
      --source-file-list string         path to a file listing input paths, one per line
      --strict                          fail on the first validation error
      --subdir-pattern string           pattern used to create output subdirectories.
                                        Use '/' to separate subdirectories on all platforms.
                                        Supported tokens: {YYYY}, {YY}, {MM}, {DD}.
                                        The WARC-Date of each record is used, so one input file may be split across subdirectories.
                                        With --name-generator=identity, only the first record date is used per input file.
      --suffixes strings                only process files with these suffixes (default [.warc,.warc.gz])
  -s, --symlinks                        follow symbolic links while walking
      --tmp-dir string                  directory used for temporary files (default "/tmp")
      --warc-version string             WARC version used for generated files (default "1.1")

Options inherited from parent commands

      --config string       path to config file; if unset, searches standard XDG config locations and the current directory for config.yaml
  -O, --log-file string     log output destination ('-' for stderr) (default "-")
      --log-format string   log output format (text or json) (default "text")
      --log-level string    minimum log level (debug, info, warn, error) (default "info")

warc convert warc​

Synopsis​

Options​

Options inherited from parent commands​

SEE ALSO​

warc convert warc

Synopsis

Options

Options inherited from parent commands

SEE ALSO