Sometimes you keep forgetting a few options for a common command and start googling around. Recently, I was stuck trying to make wget skip SSL certificate validation and couldn't remember what the command was… So, here it is:
wget --no-check-certificate https://kernelcraft.com/foo/abc.tar.gz
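If you find yourself typing that flag all the time, GNU wget also reads a ~/.wgetrc startup file; a minimal sketch, assuming you are fine with disabling certificate checks for every run:
echo 'check_certificate = off' >> ~/.wgetrc   # applies to all future wget runs; use with care
wget https://kernelcraft.com/foo/abc.tar.gz   # no --no-check-certificate needed now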
Thought of adding a few more. So, other useful options for wget are as follows:
If you want to avoid using a proxy when downloading files, even if the appropriate *_proxy environment variable is defined, then use wget with the --no-proxy option as follows:
wget --no-proxy http://kernelcraft.com/foo/abc.tar.gz
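To see the difference, here is a quick sketch (proxy.example.com:3128 is just a placeholder proxy, not a real host):
export http_proxy="http://proxy.example.com:3128/"     # hypothetical proxy, for illustration
wget http://kernelcraft.com/foo/abc.tar.gz             # goes through the proxy
wget --no-proxy http://kernelcraft.com/foo/abc.tar.gz  # ignores the proxy, connects directly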
In case you want to set an HTTPS proxy when downloading a file, you can export the https_proxy variable as follows:
export https_proxy="http://USERNAME:PASSWORD@Server-Name:PORT-NUMBER/"
wget https://kernelcraft.com/foo/abc.tar.gz
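If you only need the proxy for a single download, you can also scope the variable to just that one command instead of exporting it (same placeholder credentials as above):
https_proxy="http://USERNAME:PASSWORD@Server-Name:PORT-NUMBER/" wget https://kernelcraft.com/foo/abc.tar.gz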
Another way to do the above is to set the proxy URL without credentials and pass the username and password on the wget command line, as follows:
export https_proxy="https://server1.kernelcraft.com:3128/"
wget --http-user="USER" --http-password="PASSWORD" https://kernelcraft.com/foo/abc.tar.gz
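One caveat: --http-user and --http-password authenticate you to the web server itself. If it is the proxy that is asking for credentials, wget has the analogous --proxy-user and --proxy-password options; a sketch with the same placeholder proxy:
export https_proxy="https://server1.kernelcraft.com:3128/"
wget --proxy-user="USER" --proxy-password="PASSWORD" https://kernelcraft.com/foo/abc.tar.gz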
So, getting back to basics: wget with just a URL will pull the first page it hits, e.g. index.html, index.php, etc.
If you want wget to slurp down a site and grab everything in its vicinity, use the '-m' (mirror) option as follows:
wget -m http://www.yahoo.com/
The above command will save every file it gets hold of into a directory structure mirroring the site.
You’ll probably want to pair -m with -c (which tells Wget to continue partially-complete downloads) and -b (which tells Wget to fork to the background, logging to wget-log).
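Putting those together, a typical mirroring session might look like this (tail just lets you watch the background log):
wget -mbc http://www.yahoo.com/   # mirror, resume partial files, run in background
tail -f wget-log                  # watch progress; Ctrl-C stops tail, not wget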
If you want to grab everything in a specific directory (say, the 'edu' directory on the Yahoo web site), use the -np (no-parent) flag:
wget -mbc -np http://yahoo.com/edu
This tells Wget not to go up the directory tree, only downwards.
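Since -m implies an infinite recursion depth, you may also want to cap how deep Wget follows links with -l. A sketch using plain recursive mode (-r) so the limit clearly applies:
wget -rbc -np -l 2 http://yahoo.com/edu   # recurse at most 2 levels, never above /edu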
Wget with user agents and the robots.txt file:
By default, Wget plays nicely with a website’s robots.txt. This can lead to situations where Wget won’t grab anything, since the robots.txt disallows Wget.
To avoid this, first try the --user-agent option:
wget -mbc --user-agent="" http://website.com/
This instructs Wget to not send any user agent string at all. Another option for this is:
wget -mbc -e robots=off http://website.com/
…which tells Wget to ignore robots.txt directives altogether.
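Combining the two, and adding a polite delay between requests with --wait so you do not hammer the server, a sketch might look like:
wget -mbc -e robots=off --user-agent="" --wait=2 http://website.com/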
Wget to search for mirrors:
In case you want to get a list of mirrors providing rsync connectivity for the Mageia 3 release, you can use the following wget command:
url=http://mirrors.mageia.org/api/mageia.3.i586.list; wget -q ${url} -O - | grep rsync:
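Building on that one-liner, a small sketch that picks out just the first rsync mirror from the list (assuming, as the grep suggests, that the API returns one mirror URL per line):
url=http://mirrors.mageia.org/api/mageia.3.i586.list
wget -q ${url} -O - | grep rsync: | head -n 1   # print only the first rsync mirror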