TSYS Systems

This article provides a high-level overview of the systems architecture that supports TSYS/Redwood Group. Other articles go into more depth on specific systems.

The architecture was designed to:

  • meet the highest levels of information assurance and reliability (at a single site)
  • support (up to) Top Secret, non-production R&D (SBIR/OTA) contract work, performed by US citizens only, for the United States Departments of Defense/Energy/State by various components of TSYS Group.

Virtual Machines: Redundant (mix of active/passive and active/active)

With the exception of R&D product development (which involves hardware/IoT products), we are 99.9% virtualized.

Exceptions to virtualized infrastructure:

  • Raspberry Pi providing stratum 0 time (via HAT) and server room badge reader functionality (via USB badge reader and lock relay)
  • intermediate CA HSM passed through to a VM on vm3
  • UPS units connected to vm3 via USB/serial

Any further exceptions to virtual infra require CEO/board approval and extensive justification.

Networking

  • Functions
    • TFTP server
    • DHCP server
    • HAProxy (port 443 terminates here)
    • Dev/QA/prod core routing/firewall
    • (multi provider) WAN edge routing/firewall
    • Static/dynamic routing
    • Inbound/outbound SMTP handling
    • Caching/scanning web proxy (scanning via ClamAV)
    • Suricata IDS/IPS

All of the above is provided on an active/passive basis via CARP virtual IPs, with sub-2ms failover.
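
To make that failover behaviour concrete, here is a minimal sketch (in Python; the VIP address, port, and probe interval are placeholders, not values from this article) that repeatedly opens TCP connections to the shared CARP VIP and reports any reachability gap a client would see during a failover:

```python
# Minimal sketch: probe a CARP virtual IP and report connectivity gaps.
# The VIP address and port below are placeholders, not real values from this doc.
import socket
import time

VIP = "10.251.30.1"   # hypothetical CARP VIP
PORT = 443            # HAProxy terminates 443 on the routers
INTERVAL = 0.5        # seconds between probes

def reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

last_ok = None
while True:
    now = time.monotonic()
    if reachable(VIP, PORT):
        if last_ok is not None and now - last_ok > INTERVAL * 2:
            print(f"VIP recovered after ~{now - last_ok:.1f}s gap")
        last_ok = now
    else:
        print("VIP unreachable")
    time.sleep(INTERVAL)
```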

  • Machines

    | VM Name | VM ID | VM Host | Storage Enclosure | Storage Array |
    |---|---|---|---|---|
    | pfv-core-rtr01 | 120 | vm1 | stor2 | tier2vm |
    | pfv-core-rtr02 | xx | vm3 | stor1 | s1-wwwdb |

DNS/NTP (user/server facing)

We do not expose the core domain controllers (dc2/dc3) directly to users or servers. Everything flows through Pi-hole. Firewall rules allow DNS ONLY to pihole1/pihole2; no other DNS is allowed. pihole1/pihole2 are only allowed to relay to the core DCs, and the DCs in turn are allowed to relay to the internet (8.8.8.8).

This blocks the vast majority of spyware/trackerware/malware/C2 traffic etc. (using the Pi-hole blocklists). DNS filtering is the first line of defense against attackers, and it produces far fewer false positives during log review.
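
As a quick check that the policy is actually enforced, a sketch along these lines (assuming the dnspython package; the resolver IPs are placeholders) can confirm that the Pi-holes answer queries while direct queries to outside resolvers are dropped by the firewall:

```python
# Minimal sketch using dnspython: confirm only the Pi-holes answer DNS.
# Resolver IPs below are placeholders, not real addresses from this doc.
import dns.exception
import dns.resolver

ALLOWED = ["10.251.30.11", "10.251.30.12"]   # pihole1 / pihole2 (hypothetical IPs)
BLOCKED = ["8.8.8.8", "1.1.1.1"]             # should be unreachable from clients

def can_resolve(server: str, name: str = "example.com") -> bool:
    """Return True if 'server' answers an A query within 2 seconds."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    try:
        resolver.resolve(name, "A", lifetime=2)
        return True
    except dns.exception.DNSException:
        return False

for server in ALLOWED:
    print(f"{server}: {'OK' if can_resolve(server) else 'FAIL (expected to answer)'}")
for server in BLOCKED:
    print(f"{server}: {'FAIL (should be blocked)' if can_resolve(server) else 'blocked as expected'}")
```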

  • Functions

    • DNS (with ad filtering) (Pi-hole)
    • NTP
  • Machines

    | VM Name | VM ID | VM Host | Storage Enclosure | Storage Array |
    |---|---|---|---|---|
    | pihole1 | 101 | vm3 | stor1 | s1-wwwdb |
    | pihole2 | 103 | vm1 | stor2 | tier2vm |

Database layer

All the data for all the things. Everything is clustered, on a shared service model (see the connectivity sketch after the machine list below).

  • Functions

    • MySQL (Galera)
    • PostgreSQL (Patroni)
    • etcd
    • MQTT broker
    • RabbitMQ
    • Elasticsearch
    • Longhorn
    • k3s control plane
  • Machines

    | VM Name | VM ID | VM Host | Storage Enclosure | Storage Array |
    |---|---|---|---|---|
    | db1 | 125 | vm4 | stor1 | s1-wwwdb |
    | db2 | 126 | vm5 | stor2 | tier2vm |
    | db3 | 127 | vm1 | stor2 | tier2vm |
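
As referenced above, here is a minimal connectivity sketch (assuming the PyMySQL package; hostnames and credentials are placeholders) showing how a client might fail over across the Galera nodes and sanity-check the cluster size:

```python
# Minimal sketch using PyMySQL: try each Galera node in turn, then check
# that the cluster sees all three members. Hosts/credentials are placeholders.
import pymysql

NODES = ["db1", "db2", "db3"]          # hypothetical hostnames
USER, PASSWORD = "app", "secret"       # placeholders

def connect_any(nodes):
    """Return a connection to the first reachable Galera node."""
    for host in nodes:
        try:
            return pymysql.connect(host=host, user=USER, password=PASSWORD, connect_timeout=3)
        except pymysql.MySQLError:
            continue
    raise RuntimeError("no Galera node reachable")

conn = connect_any(NODES)
with conn.cursor() as cur:
    cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'")
    _, size = cur.fetchone()
    print(f"Galera cluster size: {size} (expected 3)")
conn.close()
```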

Web/bizops/IT control plane application layer

All the websites for TSYS/Redwood Group live on this infra. Traffic is served via HAProxy (active/passive on the core routers) to an active/active pair (each node running 50% of the workload, with either node capable of carrying 100% during node maintenance).
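
As a rough illustration, a sketch like this (assuming HAProxy's runtime API admin socket is enabled; the socket path is a placeholder) can be run on the active router to confirm both www backends are reported UP:

```python
# Minimal sketch: query HAProxy's runtime API ("show stat") over its admin
# socket and print each backend server's status. Socket path is a placeholder.
import csv
import io
import socket

SOCKET_PATH = "/var/run/haproxy.sock"   # hypothetical admin socket location

def show_stat(path: str) -> str:
    """Send 'show stat' to the HAProxy admin socket and return the CSV reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(path)
        s.sendall(b"show stat\n")
        chunks = []
        while data := s.recv(4096):
            chunks.append(data)
    return b"".join(chunks).decode()

raw = show_stat(SOCKET_PATH).lstrip("# ")
for row in csv.DictReader(io.StringIO(raw)):
    if row.get("svname") not in ("FRONTEND", "BACKEND"):
        print(f"{row['pxname']}/{row['svname']}: {row['status']}")
```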

  • Functions

    • All brand properties
    • Data repository (discourse)
    • IT control plane (job clustering/monitoring/alerting/SIEM, etc.)
    • Business operations (marketing/sales/finance/etc.)
    • Apache server (for non-dockerized applications)
    • k3s worker nodes (we are moving all workloads to Docker containers with Longhorn PVCs)
  • Machines

    | VM Name | VM ID | VM Host | Storage Enclosure | Storage Array |
    |---|---|---|---|---|
    | www1 | 123 | vm5 | stor2 | tier2vm |
    | www2 | 124 | vm4 | stor1 | s1-wwwdb |

Line of business application layer

  • Functions

    • Guacamole (serving up rackrental customer workloads, also developer workstations)
    • Webmail (for a number of our domains, we don't use Office 365)
  • Machines

    | VM Name | VM ID | VM Host | Storage Enclosure | Storage Array |
    |---|---|---|---|---|
    | tsys-dc-02 | 129 | vm5 | stor2 | tier2vm |
    | tsys-dc-03 | 130 | vm4 | stor1 | s1-wwwdb |

Network Security Monitoring

We will be using Security Onion in some fashion. We are evaluating it alongside OpenVAS/Lynis/Graylog as a SIEM/scanner stack. More to follow soon. It will be a distributed, highly available setup.

Virtual Machines: Non-Redundant

VPN

You'll notice VPN missing from the redundant networking list. A few comments on that:

  • We employ a zero trust access model for the vast majority of systems.
  • We heavily utilize web interfaces/APIs for just about all systems/functionality and secure access via 2FA/Univention Corporate Server ("AD") and a zero trust model.
  • We do have our R&D systems behind the VPN for direct SSH access (as opposed to access through various abstraction layers).
  • We utilize WireGuard (via the Ansible setup provided by Trail of Bits' Algo). We don't have a redundant WireGuard setup, just a single small Ubuntu VM. It has worked incredibly well, and the occasional 90 seconds or so of downtime for kernel patching is acceptable. (See the monitoring sketch after the table below.)
  • Due to ITAR and other regulations, we utilize a VPN for access control. We may in the future, upon appropriate review and approval, set up HAProxy with SSH SNI certificates to route connections to R&D systems directly.

| VM Name | VM ID | VM Host | Storage Enclosure | Storage Array |
|---|---|---|---|---|
| pfv-vpn | 106 | vm3 | stor2 | tier2vm |
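
As referenced above, a minimal monitoring sketch (run on the VPN VM itself; the interface name and staleness threshold are assumptions) could use the wg CLI to flag peers that have not completed a handshake recently:

```python
# Minimal sketch: parse `wg show <iface> latest-handshakes` and flag stale peers.
# Interface name and staleness threshold are assumptions, not values from this doc.
import subprocess
import time

INTERFACE = "wg0"        # hypothetical interface name
STALE_AFTER = 15 * 60    # seconds without a handshake before we complain

out = subprocess.run(
    ["wg", "show", INTERFACE, "latest-handshakes"],
    capture_output=True, text=True, check=True,
).stdout

now = time.time()
for line in out.strip().splitlines():
    pubkey, ts = line.split()
    age = now - int(ts)
    if int(ts) == 0:
        print(f"{pubkey[:16]}...  never handshaken")
    elif age > STALE_AFTER:
        print(f"{pubkey[:16]}...  stale ({age/60:.0f} min since last handshake)")
```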

Physical Surveillance

We can take 90 seconds of downtime for occasional kernel patching and not be processing the surveillance feeds for a bit. Everyone knows that criminals just loop the footage anyway...

| VM Name | VM ID | VM Host | Storage Enclosure | Storage Array |
|---|---|---|---|---|
| pfv-nvr | 104 | vm5 | stor2 | tier2vm |

Building automation

We can take 90 seconds of downtime for occasional kernel patching and wait to turn on a light or whatever.

| VM Name | VM ID | VM Host | Storage Enclosure | Storage Array |
|---|---|---|---|---|
| HomeAssistant | 116 | vm3 | stor2 | tier2vm |

Sipwise

We can take 90 seconds of downtime for occasional kernel patching, and have the phones "stop ringing" for that long.

| VM Name | VM ID | VM Host | Storage Enclosure | Storage Array |
|---|---|---|---|---|
| sipwise | 105 | vm4 | stor1 | s1-wwwdb |

Online CA (Intermediate to offline root)

We can take 90 seconds of downtime for occasional kernel patching.

We serve the CRL and other "always on" SSL-related bits via the Cloudflare SSL toolkit (CFSSL) in Docker on the web/app layer over HTTP(S), and it's fully redundant.
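
As a sanity check on that published CRL, a minimal sketch (assuming the requests and cryptography packages; the CRL URL is a placeholder) could verify that nextUpdate has not lapsed:

```python
# Minimal sketch: download the published CRL and warn if it is past its
# nextUpdate time. The URL is a placeholder, not a real endpoint from this doc.
import datetime

import requests
from cryptography import x509

CRL_URL = "https://pki.example.com/intermediate.crl"   # hypothetical URL

resp = requests.get(CRL_URL, timeout=10)
resp.raise_for_status()
crl = x509.load_der_x509_crl(resp.content)   # assumes a DER-encoded CRL

next_update = crl.next_update               # naive UTC datetime
now = datetime.datetime.utcnow()
if next_update < now:
    print(f"CRL expired: nextUpdate was {next_update}")
else:
    print(f"CRL valid, next update due {next_update}")
```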

This VM is only used occasionally to issue long lived certs or perform needed maintenance.

It could be down for weeks/months without issue.

It uses XCA for administration and talks to the DB cluster. It is locked to vm3 because we pass through a Nitrokey HSM, which works wonderfully.

| VM Name | VM ID | VM Host | Storage Enclosure | Storage Array |
|---|---|---|---|---|
| pfv-ca | 131 | vm3 | stor1 | s1-wwwdb |

Operations/administration/management (OAM)

These are the back office IT bits.

  • Functions

    • LibreNMS (monitoring/alerting/long term metrics)
    • Netdata (central dashboard)
    • upsd (central dashboard)
    • Rundeck (internal orchestration only)
    • ssh-audit
    • Lynis
    • crash dump server
    • OpenVAS
    • etc.
  • Machines

    | VM Name | VM ID | VM Host | Storage Enclosure | Storage Array |
    |---|---|---|---|---|
    | pfv-toolbox | 121 | vm3 | stor2 | tier2vm |

Storage Infrastructure

  • We keep it very simple and utilize TrueNAS Core on Dell PowerEdge 2950 hardware with 32 GB RAM.
  • We run zero plugins.
  • We have a variety of pools set up and served out over NFS to the 10.251.30.0/24 network.
  • No Samba, just NFS.
  • We utilize the built-in snapshots/replication for retention/backup (see the sketch below).
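
As a minimal illustration of keeping an eye on that snapshot-based retention (the age threshold is an assumption; it shells out to the standard zfs CLI on the TrueNAS host):

```python
# Minimal sketch: list ZFS snapshots (parseable output, creation as epoch
# seconds) and warn about datasets whose newest snapshot is older than a day.
import subprocess
import time
from collections import defaultdict

MAX_AGE = 24 * 3600   # assumed freshness threshold, not from this doc

out = subprocess.run(
    ["zfs", "list", "-H", "-p", "-t", "snapshot", "-o", "name,creation"],
    capture_output=True, text=True, check=True,
).stdout

newest = defaultdict(int)
for line in out.strip().splitlines():
    name, creation = line.split("\t")
    dataset = name.split("@", 1)[0]
    newest[dataset] = max(newest[dataset], int(creation))

now = time.time()
for dataset, ts in sorted(newest.items()):
    age_h = (now - ts) / 3600
    status = "OK" if now - ts <= MAX_AGE else "STALE"
    print(f"{status:5s} {dataset}: newest snapshot {age_h:.1f}h old")
```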

Virtualization Infrastructure

  • We keep it very simple and utilize Proxmox on a mix of:
    • Dell OptiPlex (i3/i7) (all with 32 GB RAM)
    • Dell PowerEdge (dual socket, quad-core Xeon) (all with 32 GB RAM)
    • Dell Precision (i7) (16 GB RAM) (with an NVIDIA Quadro card passed through to a KVM guest, either Windows 10 or Ubuntu Server 20.04 depending on what we need to do)
  • We run the nodes with a single power supply and a single OS drive.

VM node failure is expected (we keep the likelihood low by booting the nodes from thumb drives, with syslog set to log only to the virtualized logging infra so the drives see minimal writes), and we handle the downtime via the redundancy outlined above: virtual machines spread across hypervisors/arrays/enclosures, with redundancy at the application level.

Restoring a virtualization node would take maybe 30 minutes (plug in a new thumb drive, re-install, join the cluster).

In the meantime the VM will have auto-migrated to another node using Proxmox HA functionality (if it's an SPOF VM).
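
As a small illustration of checking node and HA status from the Proxmox API (assuming the proxmoxer package; the host and credentials are placeholders):

```python
# Minimal sketch using proxmoxer: list cluster node status and HA-managed
# resources. The host and credentials below are placeholders.
from proxmoxer import ProxmoxAPI

proxmox = ProxmoxAPI(
    "vm1.example.internal",          # hypothetical cluster node
    user="root@pam", password="secret", verify_ssl=False,
)

# Node status: online/offline as seen by the cluster.
for node in proxmox.nodes.get():
    print(f"node {node['node']}: {node['status']}")

# HA-managed resources and their requested state (started/stopped/etc.).
for res in proxmox.cluster.ha.resources.get():
    print(f"HA resource {res['sid']}: state={res.get('state', 'unknown')}")
```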

Overall system move to production status

HostnameOSSECRundeckNetdatalibrenms monlibrenms logDNS(x)DPNTPSlackLyrisSCAPAuditdOpenVASoxidized
Pfv-vmsrv-01YYYYYYYYN/A
Pfv-vmsrv-02YYYYYYYYN/A
Pfv-vmsrv-03YYYYYYYYN/A
Pfv-vmsrv-04YYYYYYYYN/A
Pfv-vmsrv-06YYYYYYYYN/A
Pfv-time1YYYYYYYN/A
Pfv-stor1N/AN/AN/AYYYxN/AN/AN/A
Pfv-stor2N/AN/AN/AYYYxN/AN/AN/A
Pfv-consrv01N/AN/AN/AYYYYxN/AN/AN/A
Pfv-core-sw01N/AN/AN/AYYYYxN/AN/A
Pfv-core-ap01N/AN/AN/AYN/AYYxN/AN/A
Pfv-lab-sw01N/AN/AN/AYYYx
Pfv-lab-sw02N/AN/AN/AYYYYx
Pfv-lab-sw03N/AN/AN/AYYYx
Pfv-lab-sw04N/AN/AN/AYYYYx
3dpsrvYYYYYYN/AYN/A
Pfv-core-rtr01N/AN/AN/AYYYYxN/AN/A
Pfv-core-rtr02N/AN/AN/AYYYYxN/AN/A
tsys-dc-01YYYYYYY
tsys-dc-02YYYYYYY
tsys-dc-03YYYYYYY
Tsys-dc-04YYYYYYYN/A
pihole1YYYYYYYN/A
pihole2YYYYYYYN/A
pfv-toolboxYYYYYYYN/A
caYYYYYYYN/A
www1YYYYYYY
www2YYYYYYY
www3YYYYYYY
db1YYYYYYY
db2YYYYYYY
db3YYYYYYY