A few wget options I can think of…

Sometimes you keep forgetting a few options for a common command and start googling around. Recently, I was stuck trying to make wget bypass SSL certificate validation and couldn't remember the option… So, here it is:

wget --no-check-certificate https://kernelcraft.com/foo/abc.tar.gz

Thought I'd add a few more. So, other useful options for wget are as follows:

If you want to avoid using a proxy when downloading files, even if the appropriate *_proxy environment variable is defined, then use wget with the --no-proxy option as follows:

wget --no-proxy http://kernelcraft.com/foo/abc.tar.gz

If you want to set an HTTPS proxy when downloading a file, you can export it beforehand as follows:

export https_proxy="http://USERNAME:PASSWORD@Server-Name:PORT-NUMBER/"
wget https://kernelcraft.com/foo/abc.tar.gz
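As a hedged sketch, the proxy URL is usually assembled as scheme://user:password@host:port/ — the credentials and host below are made-up placeholders, not a real proxy:

```shell
# All values here are hypothetical placeholders -- substitute your own.
PROXY_USER="alice"
PROXY_PASS="s3cret"
PROXY_HOST="proxy.example.com"
PROXY_PORT="3128"

# wget (like most tools) accepts an http:// URL in https_proxy.
export https_proxy="http://${PROXY_USER}:${PROXY_PASS}@${PROXY_HOST}:${PROXY_PORT}/"
echo "$https_proxy"
```

Keeping the pieces in variables makes it easy to drop the credentials later and pass them on the command line instead.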

Another way to do the above is to leave the credentials out of the proxy URL and pass them on the wget command line instead, using the --proxy-user and --proxy-password options (--http-user and --http-password, by contrast, authenticate to the target server rather than the proxy):

export https_proxy="https://server1.kernelcraft.com:3128/"
wget --proxy-user="USER" --proxy-password="PASSWORD" https://kernelcraft.com/foo/abc.tar.gz

So, getting back to the basics: wget with just a URL will pull the first page it hits, e.g. index.html, index.php, etc.

If you want wget to slurp down everything in its vicinity, then use it with the ‘-m’ (mirror) option as follows:

wget -m http://www.yahoo.com/

The above command will save all the files it gets hold of within a directory structure that mirrors the site.

You’ll probably want to pair -m with -c (which tells Wget to continue partially-complete downloads) and -b (which tells wget to fork to the background, logging to wget-log).
If you want to grab everything in a specific directory – say, the ‘edu’ directory on the Yahoo web site – use the -np flag:

wget -mbc -np http://yahoo.com/edu

This will tell wget not to go up the directory tree, only downwards.

Wget with User-Agents and Robots.txt file:

By default, Wget plays nicely with a website’s robots.txt. This can lead to situations where Wget won’t grab anything, since the robots.txt disallows Wget.
To avoid this, first try using the --user-agent option:

wget -mbc --user-agent="" http://website.com/

This instructs wget not to send any user agent string at all. Another option is:

wget -mbc -e robots=off http://website.com/

…which tells Wget to ignore robots.txt directives altogether.
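One caveat worth hedging: once you ignore robots.txt, the politeness burden is on you. wget's --wait and --random-wait options space out requests; the sketch below only assembles and echoes the command line (website.com is a placeholder) rather than running it:

```shell
# -m mirror, -b background, -c continue, robots=off to ignore robots.txt;
# --wait=2 pauses between retrievals and --random-wait varies that pause.
WGET_OPTS="-mbc -e robots=off --wait=2 --random-wait"

# Echo rather than execute, since website.com is only a placeholder.
echo wget $WGET_OPTS http://website.com/
```

Swap the echo for the real command once you point it at an actual site.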

Using wget to search for mirrors:

If you want to get a list of mirrors providing rsync connectivity for the Mageia 3 release, you can use the following wget command:

url=http://mirrors.mageia.org/api/mageia.3.i586.list; wget -q ${url} -O - | grep rsync:
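To show what the grep step does without hitting the network, here is the same filter run over a few made-up mirror entries (only the rsync:// line survives; the real list comes from the mirrors.mageia.org URL above):

```shell
# Sample entries are invented for illustration. grep keeps only the lines
# containing "rsync:", exactly as in the one-liner above.
printf '%s\n' \
  'http://mirror1.example.com/mageia/' \
  'rsync://mirror2.example.com/mageia/' \
  'ftp://mirror3.example.com/mageia/' | grep rsync:
```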
