Scalable Web Services

SQUID - HTTP Proxy Cache Server

This is my cache:

4.03 million requests a day (this is one box).
That was last week. This week it handled 5.6 million requests on Tuesday and 6 million on Friday; 2.5 million of those requests resulted in cache hits.
This time last year: 1.02 million requests.
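To put those daily totals in perspective, here is a rough sketch of the arithmetic. It assumes the 2.5 million hits belong to Friday's 6 million requests (the figures above do not say so explicitly) and that load is spread evenly over the day, which it certainly is not in practice. The function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(requests_per_day):
    """Average requests per second, assuming perfectly even load."""
    return requests_per_day / SECONDS_PER_DAY


def hit_ratio(hits, requests):
    """Fraction of requests answered from the cache."""
    return hits / requests


# Assumption: the 2.5 million hits were counted against Friday's 6 million.
friday_requests = 6_000_000
friday_hits = 2_500_000

print(f"average rate: {average_rate(friday_requests):.1f} requests/second")
print(f"hit ratio:    {hit_ratio(friday_hits, friday_requests):.1%}")
print(f"growth over last year: {friday_requests / 1_020_000:.1f}x")
```

Peak rates will be well above the daily average, so size the box for the busy hour rather than for this number.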

OK, so now let's build something completely from source...

tar -zxvf squid-2.4-200104262300-src.tar.gz
cd squid-2.4-200104262300
less INSTALL
./configure --prefix=(location you want it to install)
make all
make install

Let's take a look at the configuration file.

cd (prefix(installed location))/etc/
emacs -nw squid.conf

Some things that obviously have to be changed...

#  TAG: cache_mem       (bytes)
#       NOTE: THIS PARAMETER DOES NOT SPECIFY THE MAXIMUM PROCESS
#       SIZE.  IT PLACES A LIMIT ON ONE ASPECT OF SQUID'S MEMORY
#       USAGE.  SQUID USES MEMORY FOR OTHER THINGS AS WELL.
#       YOUR PROCESS WILL PROBABLY BECOME TWICE OR THREE TIMES
#       BIGGER THAN THE VALUE YOU PUT HERE
#
#       'cache_mem' specifies the ideal amount of memory to be used
#       for:
#               * In-Transit objects
#               * Hot Objects
#               * Negative-Cached objects
#
#       Data for these objects are stored in 4 KB blocks.  This
#       parameter specifies the ideal upper limit on the total size of
#       4 KB blocks allocated.  In-Transit objects take the highest
#       priority.
#
#       In-transit objects have priority over the others.  When
#       additional space is needed for incoming data, negative-cached
#       and hot objects will be released.  In other words, the
#       negative-cached and hot objects will fill up any unused space
#       not needed for in-transit objects.
#
#       If circumstances require, this limit will be exceeded.
#       Specifically, if your incoming request rate requires more than
#       'cache_mem' of memory to hold in-transit objects, Squid will
#       exceed this limit to satisfy the new requests.  When the load
#       decreases, blocks will be freed until the high-water mark is
#       reached.  Thereafter, blocks will be used to store hot
#       objects.
#
#Default:
# cache_mem 8 MB

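The amount of memory the cache will use for in-transit, hot, and negative-cached objects. It is not a cap on the total process size, which will probably be two or three times larger. Adjust it to the scale of the server.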

#  TAG: maximum_object_size     (bytes)
#       Objects larger than this size will NOT be saved on disk.  The
#       value is specified in kilobytes, and the default is 4MB.  If
#       you wish to get a high BYTES hit ratio, you should probably
#       increase this (one 32 MB object hit counts for 3200 10KB
#       hits).  If you wish to increase speed more than your want to
#       save bandwidth you should leave this low.
#
#       NOTE: if using the LFUDA replacement policy you should increase
#       this value to maximize the byte hit rate improvement of LFUDA!
#       See replacement_policy below for a discussion of this policy.
#
#Default:
# maximum_object_size 4096 KB

The maximum size of an object that will be saved on disk. In most cases you can crank this value down.

#  TAG: maximum_object_size_in_memory   (bytes)
#        Objects greater than this size will not be attempted to kept in
#        the memory cache. This should be set high enough to keep objects
#        accessed frequently in memory to improve performance whilst low
#        enough to keep larger objects from hoarding cache_mem .
#
#Default:
# maximum_object_size_in_memory 8 KB

Keep this value low only if the amount of cache memory available is small. If it's larger, 64 KB or even 128 KB might be more appropriate.

#  TAG: ipcache_size    (number of entries)
#  TAG: ipcache_low     (percent)
#  TAG: ipcache_high    (percent)
#       The size, low-, and high-water marks for the IP cache.
#
#Default:
# ipcache_size 1024
# ipcache_low 90
# ipcache_high 95

Increasing the size of the IP cache will increase the speed of DNS lookups considerably on a heavily used cache.

#  TAG: fqdncache_size  (number of entries)
#       Maximum number of FQDN cache entries.
#
#Default:
# fqdncache_size 1024

Likewise for the fully qualified domain name (FQDN) cache.

#  TAG: cache_dir
#       Usage:
#
#       cache_dir Type Directory-Name Fs-specific-data [options]
#
#       You can specify multiple cache_dir lines to spread the
#       cache among different disk partitions.
#
#       Type specifies the kind of storage system to use.  Most
#       everyone will want to use "ufs" as the type.  If you are using
#       Async I/O (--enable async-io) on Linux or Solaris, then you may
#       want to try "aufs" as the type.  Async IO support may be
#       buggy, however, so beware.
#
#       'Directory' is a top-level directory where cache swap
#       files will be stored.  If you want to use an entire disk
#       for caching, then this can be the mount-point directory.
#       The directory must exist and be writable by the Squid
#       process.  Squid will NOT create this directory for you.
#Default:
# cache_dir ufs /home/joelja/scratch/hold/squid/cache 100 16 256

Cache directories are where the cached data is stored. A cache directory may be a subdirectory on a disk or the root of a mounted filesystem, and you can have more than one. You will probably want to specify a size appropriate to the amount of cache you intend to have. If the filesystems are mounted async or with softupdates enabled, or you are on Linux, aufs is probably the storage type you want to set.

#  TAG: ftp_user
#       If you want the anonymous login password to be more informative
#       (and enable the use of picky ftp servers), set this to something
#       reasonable for your domain, like wwwuser@somewhere.net
#
#       The reason why this is domainless by default is that the
#       request can be made on the behalf of a user in any domain,
#       depending on how the cache is used.
#       Some ftp server also validate that the email address is valid
#       (for example perl.com).
#
#Default:
# ftp_user Squid@

I generally change it to something more descriptive for the benefit of the FTP server operators (squid@hostname.domainname).

#  TAG: ftp_list_width
#       Sets the width of ftp listings. This should be set to fit in
#       the width of a standard browser. Setting this too small
#       can cut off long filenames when browsing ftp sites.
#
#Default:
# ftp_list_width 32

A list width of 32 means that longish filenames get chopped off. I generally set it to fifty or sixty.

#  TAG: dns_nameservers
#       Use this if you want to specify a list of DNS name servers
#       (IP addresses) to use instead of those given in your
#       /etc/resolv.conf file.
#
#       Example: dns_nameservers 10.0.0.1 192.172.0.4
#
#Default:
# none

If you want to use a list of DNS servers other than the ones in your /etc/resolv.conf (for example, if you run named locally), you can fill them in here.

# ACCESS CONTROLS
# -----------------------------------------------------------------------------

#  TAG: acl
#       Defining an Access List
#
#       acl aclname acltype string1 ...
#       acl aclname acltype "file" ...
#
#       when using "file", the file should contain one item per line
#
#       acltype is one of src dst srcdomain dstdomain url_pattern
#               urlpath_pattern time port proto method browser user
#
#       By default, regular expressions are CASE-SENSITIVE.  To make
#       them case-insensitive, use the -i option.
#
#       acl aclname src      ip-address/netmask ... (clients IP address)
#       acl aclname src      addr1-addr2/netmask ... (range of addresses)
#       acl aclname dst      ip-address/netmask ... (URL host's IP address)
#       acl aclname myip     ip-address/netmask ... (local socket IP address)
#
#       acl aclname srcdomain   .foo.com ...    # reverse lookup, client IP
#       acl aclname dstdomain   .foo.com ...    # Destination server from URL
#       acl aclname srcdom_regex [-i] xxx ...   # regex matching client name
#       acl aclname dstdom_regex [-i] xxx ...   # regex matching server
#         # For dstdomain and dstdom_regex  a reverse lookup is tried if a IP
#         # based URL is used. The name "none" is used if the reverse lookup
#         # fails.
#
#Examples:
#acl myexample dst_as 1241
#acl password proxy_auth REQUIRED
#acl fileupload req_mime_type -i ^multipart/form-data$
#
#Recommended minimum configuration:
acl all src 0.0.0.0/0.0.0.0
acl manager proto cache_object
acl localhost src 127.0.0.1/255.255.255.255
acl SSL_ports port 443 563
acl Safe_ports port 80          # http
acl Safe_ports port 21          # ftp
acl Safe_ports port 443 563     # https, snews
acl Safe_ports port 70          # gopher
acl Safe_ports port 210         # wais
acl Safe_ports port 1025-65535  # unregistered ports
acl Safe_ports port 280         # http-mgmt
acl Safe_ports port 488         # gss-http
acl Safe_ports port 591         # filemaker
acl Safe_ports port 777         # multiling http
acl CONNECT method CONNECT
#  TAG: http_access
#       Allowing or Denying access based on defined access lists
#
#       Access to the HTTP port:
#       http_access allow|deny [!]aclname ...
#
#       NOTE on default values:
#
#       If there are no "access" lines present, the default is to deny
#       the request.
#
#       If none of the "access" lines cause a match, the default is the
#       opposite of the last line in the list.  If the last line was
#       deny, then the default is allow.  Conversely, if the last line
#       is allow, the default will be deny.  For these reasons, it is a
#       good idea to have an "deny all" or "allow all" entry at the end
#       of your access lists to avoid potential confusion.
#
#Default:
# http_access deny all
#
#Recommended minimum configuration:
#
# Only allow cachemgr access from localhost
http_access allow manager localhost
http_access deny manager
# Deny requests to unknown ports
http_access deny !Safe_ports
# Deny CONNECT to other than SSL ports
http_access deny CONNECT !SSL_ports
#
# INSERT YOUR OWN RULE(S) HERE TO ALLOW ACCESS FROM YOUR CLIENTS
#
# And finally deny all other access to this proxy
http_access deny all

At a minimum you'll need one additional acl line and one more http_access line in order to get the basic functionality out of it. Something like:

acl mynetwork src base.ip.address/netmask

http_access allow mynetwork

http_access deny all should be the last line of the http_access rules.
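The reason the order matters is that the http_access list is checked top to bottom and the first line whose ACLs all match decides the request. The following is a simplified model of that behaviour, not Squid's actual code: ACLs are reduced to plain names, and the caller states which names a request matches. The rule list mirrors the recommended minimum configuration plus the mynetwork line; check_access and RULES are names invented for this sketch.

```python
def acl_matches(name, matching_acls):
    """A leading '!' negates the ACL, as in 'deny !Safe_ports'."""
    if name.startswith("!"):
        return name[1:] not in matching_acls
    return name in matching_acls


def check_access(rules, matching_acls):
    """Return 'allow' or 'deny' for a request.

    rules: list of (action, [acl names]) in configuration order.
    matching_acls: set of ACL names this request matches.
    """
    if not rules:
        return "deny"  # no access lines at all: deny
    for action, acl_names in rules:
        # Every ACL on the line must match for the line to apply.
        if all(acl_matches(n, matching_acls) for n in acl_names):
            return action
    # Nothing matched: the default is the opposite of the last line.
    return "allow" if rules[-1][0] == "deny" else "deny"


RULES = [
    ("allow", ["manager", "localhost"]),
    ("deny", ["manager"]),
    ("deny", ["!Safe_ports"]),
    ("deny", ["CONNECT", "!SSL_ports"]),
    ("allow", ["mynetwork"]),
    ("deny", ["all"]),
]

# A client on our network fetching from port 80.
print(check_access(RULES, {"all", "mynetwork", "Safe_ports"}))
# An outside client fetching from port 80.
print(check_access(RULES, {"all", "Safe_ports"}))
```

If the allow mynetwork line were placed after deny all, it would never be reached, because every request matches all.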

#  TAG: icp_access
#       Allowing or Denying access to the ICP port based on defined
#       access lists
#
#       icp_access  allow|deny [!]aclname ...
#
#       See http_access for details
#
#Default:
# icp_access deny all
#
#Allow ICP queries from eveyone
icp_access allow all

Configure who you want to accept ICP queries from.

#  TAG: miss_access
#       Use to force your neighbors to use you as a sibling instead of
#       a parent.  For example:
#
#               acl localclients src 172.16.0.0/16
#               miss_access allow localclients
#               miss_access deny  !localclients
#
#       This means that only your local clients are allowed to fetch
#       MISSES and all other clients can only fetch HITS.
#
#       By default, allow all clients who passed the http_access rules
#       to fetch MISSES from us.
#
#Default setting:
# miss_access allow all

Configure who you want to allow to fetch cache misses through you.

#  TAG: cache_mgr
#       Email-address of local cache manager who will receive
#       mail if the cache dies.  The default is "webmaster."
#
#Default:
# cache_mgr webmaster

Configure an email address for the cache manager.

#  TAG: cache_effective_user
#  TAG: cache_effective_group
#
#       If the cache is run as root, it will change its effective/real
#       UID/GID to the UID/GID specified below.  The default is to
#       change to UID to nobody and GID to nogroup.
#
#       If Squid is not started as root, the default is to keep the
#       current UID/GID.  Note that if Squid is not started as root then
#       you cannot set http_port to a value lower than 1024.
#
#Default:
# cache_effective_user nobody
# cache_effective_group nogroup

You may want to create a distinct user for the cache instead of using nobody. If started by root, Squid will assume the identity of the user specified here.

#  TAG: visible_hostname
#       If you want to present a special hostname in error messages, etc,
#       then define this.  Otherwise, the return value of gethostname()
#       will be used. If you have multiple caches in a cluster and
#       get errors about IP-forwarding you must set them to have individual
#       names with this setting.
#
#Default:
# none

If you have a cluster of cache boxes you may want to set the visible hostname to something uniform on all of them.

#Default:
# announce_period 0
#
#To enable announcing your cache, just uncomment the line below.
#announce_period 1 day

#  TAG: announce_host
#  TAG: announce_file
#  TAG: announce_port
#       announce_host and announce_port set the hostname and port
#       number where the registration message will be sent.
#
#       Hostname will default to 'tracker.ircache.net' and port will
#       default default to 3131.  If the 'filename' argument is given,
#       the contents of that file will be included in the announce
#       message.
#
#Default:
# announce_host tracker.ircache.net
# announce_port 3131

Announcing your cache allows other people looking for cache peers to find you. Depending on your situation and the application for the cache, this may or may not be useful.

# MISCELLANEOUS
# -----------------------------------------------------------------------------

#  TAG: logfile_rotate
#       Specifies the number of logfile rotations to make when you
#       type 'squid -k rotate'.  The default is 10, which will rotate
#       with extensions 0 through 9.  Setting logfile_rotate to 0 will
#       disable the rotation, but the logfiles are still closed and
#       re-opened.  This will enable you to rename the logfiles
#       yourself just before sending the rotate signal.
#
#       Note, the 'squid -k rotate' command normally sends a USR1
#       signal to the running squid process.  In certain situations
#       (e.g. on Linux with Async I/O), USR1 is used for other
#       purposes, so -k rotate uses another signal.  It is best to get
#       in the habit of using 'squid -k rotate' instead of 'kill -USR1
#       <pid>'.
#
#Default:
# logfile_rotate 10

When Squid rotates the logs, this value controls how many generations of logs it keeps. Depending on how often you rotate the logs and how fast they grow, you may want to tweak the number of logs kept. On my primary box I have it set to 2 and rotate the logs daily.
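To make the effect of the setting concrete, here is a small model of one rotation. It is only a sketch of the behaviour described in the comment above (extensions 0 through N-1, with 0 meaning no renaming), not what Squid runs internally; the rotate function and the dictionary standing in for the log directory are invented for illustration.

```python
def rotate(logs, base, keep):
    """Model one 'squid -k rotate' pass.

    logs: dict mapping filename -> contents (stands in for the log directory).
    base: name of the live log, e.g. 'access.log'.
    keep: the logfile_rotate value.
    Returns a new dict; the input is left untouched.
    """
    rotated = dict(logs)
    if keep == 0:
        # Rotation disabled: files are only closed and re-opened.
        return rotated
    # The oldest generation falls off the end.
    rotated.pop(f"{base}.{keep - 1}", None)
    # Shift the remaining generations up by one, oldest first.
    for n in range(keep - 2, -1, -1):
        old = f"{base}.{n}"
        if old in rotated:
            rotated[f"{base}.{n + 1}"] = rotated.pop(old)
    # The live log becomes generation 0 and a fresh log is started.
    if base in rotated:
        rotated[f"{base}.0"] = rotated.pop(base)
    rotated[base] = ""
    return rotated


# logfile_rotate 2 with a daily rotation keeps two days of history.
before = {"access.log": "wed", "access.log.0": "tue", "access.log.1": "mon"}
for name, contents in sorted(rotate(before, "access.log", 2).items()):
    print(name, repr(contents))
```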

#  TAG: append_domain
#       Appends local domain name to hostnames without any dots in
#       them.  append_domain must begin with a period.
#
#Example:
# append_domain .yourdomain.com

When the cache receives a request for a hostname that isn't fully qualified, it can attempt to complete it if you fill in this value.
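The rule is simple enough to write down. This is a sketch of the behaviour the comment describes (hostnames without any dots get the domain appended), with a made-up function name; it is not taken from the Squid source.

```python
def complete_hostname(host, append_domain):
    """Append the local domain to hostnames that contain no dots."""
    if not append_domain.startswith("."):
        raise ValueError("append_domain must begin with a period")
    if "." in host:
        return host  # already qualified, leave it alone
    return host + append_domain


print(complete_hostname("www", ".yourdomain.com"))
print(complete_hostname("www.example.org", ".yourdomain.com"))
```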

#  TAG: memory_pools    on|off
#       If set, Squid will keep pools of allocated (but unused) memory
#       available for future use.  If memory is a premium on your
#       system and you believe your malloc library outperforms Squid
#       routines, disable this.
#
#Default:
# memory_pools on

On a box with a small amount of memory you will want to turn this off.

#  TAG: forwarded_for   on|off
#       If set, Squid will include your system's IP address or name
#       in the HTTP requests it forwards.  By default it looks like
#       this:
#
#               X-Forwarded-For: 192.1.2.3
#
#       If you disable this, it will appear as
#
#               X-Forwarded-For: unknown
#
#Default:
# forwarded_for on

If you turn this off your users may have more anonymity, but it may break some applications that are heavily dependent on the client IP address.

#  TAG: log_icp_queries on|off
#       If set, ICP queries are logged to access.log. You may wish
#       do disable this if your ICP load is VERY high to speed things
#       up or to simplify log analysis.
#
#Default:
# log_icp_queries on

In clusters of cache servers you may not want to log ICP queries in the access log because of the large number of them.

#  TAG: store_avg_object_size   (kbytes)
#       Average object size, used to estimate number of objects your
#       cache can hold.  See doc/Release-Notes-1.1.txt.  The default is
#       13 KB.
#
#Default:
# store_avg_object_size 13 KB

In my cache the actual average object size is around 40K Bytes.
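The difference matters because the average object size feeds the estimate of how many objects the cache can hold. A rough sketch of that estimate, using the 100 MB from the default cache_dir line shown earlier; the function name is invented, and Squid's own accounting is more involved than a single division.

```python
def estimated_object_count(cache_size_mb, avg_object_kb):
    """Rough number of objects that fit in a cache of the given size."""
    return (cache_size_mb * 1024) // avg_object_kb


# Default cache_dir size with the default 13 KB average.
print(estimated_object_count(100, 13))
# Same size with the roughly 40 KB average seen on this cache.
print(estimated_object_count(100, 40))
```

With the larger average the same disk space holds about a third as many objects, so the default overestimates the object count on a cache like this one.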

#  TAG: buffered_logs   on|off
#       Some log files (cache.log, useragent.log) are written with
#       stdio functions, and as such they can be buffered or
#       unbuffered.  By default they will be unbuffered. Buffering them
#       can speed up the writing slightly (though you are unlikely to
#       need to worry).
#
#Default:
# buffered_logs off

Buffering the logfile writes will use slightly more memory but reduce CPU and disk churn slightly.

Having adjusted those values to suit us, we are ready to run Squid. The binary lives under the prefix you gave to configure (/usr/local/squid by default, /u/squid on my box).
Run /u/squid/bin/squid -z
This will create the cache directory structure.
Then run /u/squid/bin/squid
That should launch a working Squid.
If it fails to start, run it in debug mode: squid -d 1

Performance Tweaking of FreeBSD

Softupdates

Soft updates is (now) built into the generic kernel of FreeBSD. 4.3 will allow you to select filesystems to run with softupdates during the install; 4.2 won't. So now all you need to do is enable them on a per-filesystem basis.

Why is softupdates important?

One of the most serious bottlenecks in Squid is the creation, reading, and replacement of files on disk.
A high-end proxy server must be able to serve several hundred connections per second, some of which will replace objects currently in the cache.
How many create, write, destroy operations can you do per second on a filesystem?
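One way to get a feel for the answer is to measure it. The script below is a crude micro-benchmark, not a Squid workload: it creates, writes, and deletes small files in a scratch directory and reports cycles per second. The 13 KB payload matches the default store_avg_object_size; the function name and the file count are arbitrary choices. It does not force data to disk, so it mostly exercises the metadata path.

```python
import os
import tempfile
import time


def create_write_delete_rate(directory, count=1000, size=13 * 1024):
    """Return create/write/delete cycles per second in `directory`."""
    payload = b"x" * size
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, f"obj{i:06d}")
        with open(path, "wb") as f:
            f.write(payload)
        os.remove(path)
    elapsed = time.perf_counter() - start
    return count / elapsed


if __name__ == "__main__":
    # Point this at a directory on the filesystem you want to test.
    with tempfile.TemporaryDirectory() as scratch:
        rate = create_write_delete_rate(scratch)
        print(f"{rate:.0f} create/write/delete cycles per second")
```

Running it against the same filesystem before and after enabling softupdates gives a rough before-and-after comparison.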

Configuring Softupdates

Boot into single-user mode: boot -s
Make sure the filesystem you want to enable softupdates on is unmounted.
Run the following command on its mountpoint (/u in this case): tunefs -n enable /mountpoint
Then reboot.

DiskD and Squid

What is DiskD? DiskD is a feature new to Squid 2.4. It creates a child process for each cache filesystem in order to keep the Squid cache process from blocking on writes. In the 2nd NLANR cache bakeoff this resulted in a 4-fold improvement in the performance of the Squid boxes on FreeBSD.

Configuring The kernel for DiskD

What does DiskD require from the kernel?

Sys V message queue support
Shared memory support

FreeBSD has both on by default; however, the parameters need to be tweaked.

For SYSVMSG
- options MSGMNB=16384
- options MSGMNI=41
- options MSGSEG=2049
- options MSGSSZ=64
- options MSGTQL=512
For Shared Memory
- options SHMSEG=16
- options SHMMNI=32
- options SHMMAX=2097152
- options SHMALL=4096

Then configure and recompile your kernel.

Setting up clients

manually

via autoconf files

via wpad

Squid and WebCache Related resources

WPAD and autoconfiguration related standards work

FreeBSD handbook, building a custom kernel

Last modified: Sun May 6 04:34:28 PDT 2001