CloudSoda provides seamless data orchestration between file-based and object storage environments. Its DataIntell functionality delivers storage analytics, visualizations, and insights to optimize storage usage. This paper outlines CloudSoda's key components, data integrity methods, access requirements, and the network settings needed to scan objects and files in the CloudSoda system.
CloudSoda Components
CloudSoda's key components are the Controller and Agents.
Controller
- Deployment Options: Can be deployed on-premises, in a private cloud, or hosted by CloudSoda (optional).
- UI and API Management: Hosts the web-based user interface and GraphQL/REST APIs for configuration, reporting, and automation.
- Centralized Coordination: Coordinates data scans, reporting, lifecycle policies, and transfer workflows.
- System Monitoring: Tracks all data intelligence activity and orchestration jobs, with audit logs and dashboards.
- Security and Access: Manages users, RBAC (Role-Based Access Control), and authentication integrations (e.g., Microsoft Entra ID).
Agent
- Flexible Installation: Runs on any Linux, macOS, or Windows system, including servers, VMs, and other computers.
- Direct Communication: Requires a direct network connection to the Controller.
- Storage Access: Connects to SMB, NFS, S3, and other supported storage systems to index data and perform operations.
- Metadata Indexing: Scans and indexes file and folder metadata, including attributes like size, ownership, access/modification dates, and more.
- Data Movement: Handles peer-to-peer transfers, including LAN-local and WAN-optimized workflows.
- Policy Execution: Enforces lifecycle actions such as tiering, replication, archival, or deletion based on defined rules.
Data Transfer Integrity
Within the customer environment, CloudSoda ensures secure and accurate data transfers using checksum validation. The validation method differs between the two transfer types: file-to-file and file-to-object.
File-to-File Transfers
- Generates MD5 checksums at the source and destination.
- Uses a temporary file to prevent overwriting.
- Retries up to three times if validation fails.
When a file-to-file data transfer is initiated, CloudSoda generates an MD5 checksum (1) of the source file. To avoid overwriting an existing file, the source file is transferred to a temporary file on the target system. CloudSoda then generates an MD5 checksum of the target temp file and compares the two checksums. If they match, CloudSoda renames the temp file and updates the attributes on the target file, ensuring a complete transfer without data corruption. If they do not match, CloudSoda retries the transfer (up to three times) before marking it as failed.
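The sequence above can be sketched in a few lines of Python. This is an illustrative sketch only, not CloudSoda's implementation: the function names `md5_of` and `verified_copy`, the `.part` suffix, and the reading of "up to three times" as three attempts in total are all assumptions made for the example.

```python
import hashlib
import os
import shutil
import tempfile

MAX_ATTEMPTS = 3  # assumption: "up to three times" read as three attempts in total


def md5_of(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_copy(source, target):
    """Copy source to a temp file next to target, verify MD5, then rename.

    Returns the attempt number that succeeded; raises OSError if every
    attempt ends in a checksum mismatch.
    """
    source_md5 = md5_of(source)
    target_dir = os.path.dirname(os.path.abspath(target))
    for attempt in range(1, MAX_ATTEMPTS + 1):
        # Hypothetical temp-file naming; the target itself is never written directly.
        fd, temp_path = tempfile.mkstemp(dir=target_dir, suffix=".part")
        os.close(fd)
        try:
            shutil.copyfile(source, temp_path)
            if md5_of(temp_path) == source_md5:
                os.replace(temp_path, target)   # rename over the target
                shutil.copystat(source, target)  # then carry over timestamps and mode
                return attempt
        finally:
            # After a successful rename the temp path no longer exists.
            if os.path.exists(temp_path):
                os.remove(temp_path)
    raise OSError(f"checksum mismatch after {MAX_ATTEMPTS} attempts: {source}")
```

Creating the temp file in the target's own directory keeps the final rename on a single volume, so an existing target is replaced only after the checksums match.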
File-to-Object Transfers
- Uses S3 ETag validation for AWS and other S3-compatible storage.
- Retries up to 40 times with exponential backoff if an upload fails.
- Google Cloud transfers use CRC32C for validation.
- Azure Blob storage validation relies on byte count and metadata MD5.
When a source file is transferred to an S3 object, CloudSoda performs the same transfer operation as in the file-to-file scenario, but validation is based on the S3 ETag (2) mechanism. AWS, or the S3-compatible object store, generates an ETag once the upload completes. CloudSoda validates that ETag by recomputing it from the source data and the number of parts used in the upload. If the ETag does not match or the upload fails, CloudSoda retries the transfer (up to 40 times) with exponential backoff before marking it as failed. When an object is downloaded, CloudSoda compares the object's ETag to the local temp file's MD5 checksum before completing the object-to-file transfer. This validation confirms that the file and the object are identical and that no data corruption occurred during the transfer.
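As a rough illustration of the ETag check, the sketch below recomputes the value S3 documents for multipart uploads: the MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count. It is a sketch under stated assumptions rather than CloudSoda's code. It assumes fixed-size parts and an object whose ETag is MD5-based (for example, not SSE-KMS encrypted). The `upload` callable, the 0.5-second base delay, and the 60-second cap are hypothetical; only the limit of 40 comes from the text above.

```python
import hashlib
import time

MAX_ATTEMPTS = 40  # upper bound stated in the text


def expected_s3_etag(path, part_size=None):
    """Recompute the ETag S3 is expected to report for a local file.

    part_size=None models a single PUT, where the ETag is the hex MD5 of
    the object. Otherwise the file is treated as a multipart upload with
    fixed-size parts, and the result has the form "<md5-of-md5s>-<parts>".
    """
    whole = hashlib.md5()
    part_digests = []
    read_size = part_size or 8 * 1024 * 1024
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(read_size), b""):
            whole.update(chunk)
            part_digests.append(hashlib.md5(chunk).digest())
    if part_size is None:
        return whole.hexdigest()
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


def backoff_delays(attempts=MAX_ATTEMPTS, base=0.5, cap=60.0):
    """Yield one delay per attempt: base * 2**n seconds, capped (assumed values)."""
    for n in range(attempts):
        yield min(cap, base * 2 ** n)


def upload_with_validation(upload, path, part_size, sleep=time.sleep):
    """Call upload(path, part_size) until the returned ETag matches.

    `upload` is a hypothetical callable that returns the ETag reported by
    the object store and raises OSError on a failed transfer.
    """
    expected = expected_s3_etag(path, part_size)
    for delay in backoff_delays():
        try:
            reported = upload(path, part_size)
        except OSError:
            reported = None  # a failed upload is retried like a mismatch
        # S3 returns the ETag wrapped in double quotes.
        if reported is not None and reported.strip('"') == expected:
            return expected
        sleep(delay)
    raise RuntimeError(f"upload failed validation after {MAX_ATTEMPTS} attempts")
```

Because the multipart ETag depends on where the part boundaries fall, the validator needs the same part size that the upload used, which is why the text mentions the number of parts.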
Google Cloud and Azure Blob Transfers
For Google Cloud Storage transfers:
- CloudSoda uses the same data validation method as S3.
- Instead of an ETag, it uses a CRC32C hash.
For Azure Blob Storage transfers:
- Azure does not provide a hash for multi-part uploads.
- CloudSoda verifies the byte count to ensure the upload is correct.
- After the upload, CloudSoda sets an MD5 hash as metadata for future validation.
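For readers unfamiliar with these checks, here is a minimal pure-Python sketch of the values involved. It is not CloudSoda's code. The CRC32C routine is the standard Castagnoli CRC, and Google Cloud Storage reports it as base64 over the big-endian 4-byte value. For Azure, the sketch assumes the MD5 is stored in the base64 form used by the Content-MD5 convention; the article does not say which metadata field or encoding CloudSoda uses. All function names are hypothetical.

```python
import base64
import hashlib
import struct


def _build_crc32c_table():
    """Build the lookup table for CRC32C (reflected polynomial 0x82F63B78)."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _build_crc32c_table()


def crc32c(data, crc=0):
    """Return the CRC32C of data; pass a previous result as crc to continue."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def gcs_crc32c_b64(data):
    """Encode the CRC32C the way Google Cloud Storage reports it."""
    return base64.b64encode(struct.pack(">I", crc32c(data))).decode("ascii")


def azure_upload_ok(local_size, remote_size):
    """Byte-count check used when no service-side hash is available."""
    return local_size == remote_size


def md5_b64(data):
    """Base64 MD5 digest (assumed encoding for the post-upload metadata)."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
```

A pure-Python CRC is slow on large files; it is used here only to keep the example free of third-party dependencies.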
Data Movement and Transfer Protocols
The Controller manages data movement between Agents in the CloudSoda system, while the Agents perform the actual data transfers. Encryption in transit depends on the transfer type:
- SMB/NFS transfers: These protocols do not encrypt traffic by default.
- Cloud or object storage transfers: CloudSoda relies on the cloud provider’s SDK. (3)
- AWS and Azure transfers: The AWS and Azure SDKs use TLS v1.2 (4) (5) (6) for secure data transfer over the WAN.
- Google Cloud Storage transfers: The Google SDK uses TLS 1.3 for secure data movement into object storage. (7) (8) (9)
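An administrator who wants to confirm which TLS version a storage endpoint negotiates from a given host can do so with Python's standard `ssl` module. The sketch below is a generic client-side check, not part of CloudSoda or of any cloud SDK, and the function names are hypothetical.

```python
import socket
import ssl


def strict_tls_context(minimum=ssl.TLSVersion.TLSv1_2):
    """Return a client context that refuses anything older than `minimum`."""
    context = ssl.create_default_context()  # certificate and hostname checks stay on
    context.minimum_version = minimum
    return context


def negotiated_tls_version(host, port=443, timeout=5.0):
    """Connect to host:port and return the negotiated protocol name."""
    context = strict_tls_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version()
```

Calling `negotiated_tls_version` with the hostname of a storage endpoint returns a string such as "TLSv1.2" or "TLSv1.3", or raises `ssl.SSLError` if the server offers only older versions.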
The CloudSoda Agent can be installed on Linux, Windows, or macOS. It accesses all files available to the host OS, including local volumes, DAS, and SMB/NFS-mounted storage. The Agent supports SMB mounts via an SMB accessor for direct access.
CloudSoda Network (Software Access and Port/Firewall Settings)
The CloudSoda Controller can be deployed on-premises, in a customer cloud, or in a CloudSoda-hosted cloud. Figure 1 illustrates the CloudSoda network diagram, showing the ports and software packages used for communication with external services. For a detailed list of networking ports and services, see Appendix A.
Figure 1 - CloudSoda Network Diagram
The CloudSoda Controller utilizes designated ports and network connections to perform four key functions:
Administration
- Deploys, upgrades, and monitors the CloudSoda system.
- Provides benefits such as easy installation, real-time updates, and simplified troubleshooting.
- Collects system health data to help customers optimize performance.
Data Movement and Scans
- Scans storage systems and manages file and object transfers via CloudSoda Agents.
Management
- Creates and manages the mesh network.
- Upgrades and monitors CloudSoda Agents.
Web UI and API
- Provides both a graphical UI and an API for platform access.
- Requires port 443 for web UI and API access.
- Port 80 can be used, but connections are automatically upgraded to HTTPS on port 443.
CloudSoda Communication (Controller / Agents / Storage)
The CloudSoda Agent communicates with the CloudSoda Controller using two outbound UDP ports (7498 and 7499), with all data encrypted. For NFS/SMB communication, the Agent requires access to the ports listed in Figure 2.
When accessing public or private cloud storage, no firewall exceptions are needed.
- CloudSoda uses ports 80 and 443 to connect to cloud storage targets.
- All data transfers are encrypted.
Figure 2 - Controller with Agents
Data Transfers via CloudSoda Control Plane and Agent Mesh
CloudSoda's encrypted Agent-to-Agent transfer is initiated when the source and target storage are located on different Agents. To enable these transfers, CloudSoda creates an Agent mesh, where all Agents attempt to connect with each other.
- Agents require dynamic UDP ports (30000 - 65535) open for both ingress and egress.
- A unique UDP port can be hard-coded for port forwarding or specific networking needs.
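When pinning a single UDP port for the mesh, it can help to confirm that the port falls inside the documented range and is free on the Agent host. The sketch below is a local pre-check only, with hypothetical function names. It does not verify firewall or NAT rules, and it does not show how the port is configured in CloudSoda.

```python
import socket

MESH_PORT_MIN = 30000  # range taken from the text above
MESH_PORT_MAX = 65535


def in_mesh_range(port):
    """Return True if port lies in the documented dynamic UDP range."""
    return MESH_PORT_MIN <= port <= MESH_PORT_MAX


def udp_port_is_free(port, host="0.0.0.0"):
    """Return True if a UDP socket can currently be bound to host:port."""
    if not in_mesh_range(port):
        raise ValueError(f"port {port} is outside {MESH_PORT_MIN}-{MESH_PORT_MAX}")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True
```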
Agents use the CloudSoda control plane to connect via LAN routes, VPN tunnels, and WAN.
- To access the control plane, an Agent must reach: https://controlplane.sna.cloudsoda.io/.
- If the control plane is unavailable, neither NAT traversal nor private network bridging is possible.
See Figure 3 for details.
Figure 3 - Data Transfers via CloudSoda Control Plane and Agent Mesh
Data Intelligence Module
The Data Intelligence module provides visibility into your unstructured data by analyzing indexed metadata, not live file content. All metadata is collected by CloudSoda Agents, which scan supported storage systems without disrupting active workloads.
How it works
- Agents connect to storage (on-prem, cloud, or hybrid) and extract metadata from files and objects.
- Only metadata is indexed; no file content is read or opened.
- The module displays insights using this indexed data, enabling safe, fast analysis at scale.
Indexed metadata typically includes
- File name and full path
- File extension
- Creation, access, and modification dates
- File size and owner/group
- Storage-specific attributes (e.g., storage class or tier for cloud buckets)
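To make the idea of a metadata-only scan concrete, the following sketch walks a directory tree and collects the attributes listed above using `os.stat`, without ever opening a file. It illustrates the general technique and is not CloudSoda's scanner; the function name `scan_metadata` and the record layout are assumptions. Creation time is reported only where the platform exposes `st_birthtime`.

```python
import os
from datetime import datetime, timezone


def _iso(timestamp):
    """Format a POSIX timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def scan_metadata(root):
    """Yield one metadata record per file under root, never reading file content."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path, follow_symlinks=False)
            birth = getattr(info, "st_birthtime", None)  # not available everywhere
            yield {
                "name": name,
                "path": os.path.abspath(path),
                "extension": os.path.splitext(name)[1].lower(),
                "size": info.st_size,
                "owner_uid": info.st_uid,  # 0 on Windows
                "group_gid": info.st_gid,  # 0 on Windows
                "accessed": _iso(info.st_atime),
                "modified": _iso(info.st_mtime),
                "created": _iso(birth) if birth is not None else None,
            }
```

Records like these can then be grouped by extension, owner, or age to produce usage and growth reports.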
Important behaviors and safeguards
- Cloud bucket scans do not issue "Read" operations, so they won't trigger object retrieval from deep archival tiers like AWS Glacier Deep Archive.
- Scanning is non-intrusive: files remain untouched, and usage costs are not affected.
- The module is designed for metadata intelligence only; content indexing or deep inspection (e.g., full-text search or content classification) is not performed.
- Role-based access control (RBAC) is enforced, ensuring users only see metadata for storage and files they're authorized to access.
This allows you to explore storage usage, track growth trends, identify redundancy, and support governance policies—without introducing risk or cost to your environment.
References
1. MD5
2. How can I check the integrity of an object uploaded to Amazon S3?
3. AWS SDK for Go
4. Security for this AWS Product or Service
5. Azure / azure-storage-blob-go
6. Configure Transport Layer Security (TLS) for a client application
7. Storage package - cloud.google.com/go/storage - Go Packages
8. gsutil tool | Cloud Storage
9. TLS 1.3 is now on by default for Google Cloud services
Appendix A - CloudSoda Network Requirements
CloudSoda's cloud-based services rely on load balancing and regional routing, so their IP addresses are dynamic and access should be allowed by hostname. These are the required ports and endpoints:
Container Image Pulling
- Ports: TCP 80, 443
- Endpoints:
us-west1-docker.pkg.dev
registry-1.docker.io
DNS Services
- Ports: UDP 53, TCP 443
- URLs:
https://route53.amazonaws.com
https://acme-v02.api.letsencrypt.org/directory
Control Plane
- Ports: TCP 80, 443, 3080
- URL:
https://cloudsoda.teleport.sh
Location & Price Book Service
- Port: TCP 50051
- URLs:
https://compass.cloudsoda.io
https://books.cloudsoda.io
Exception Monitoring
- Ports: TCP 80, 443
- URL:
https://api.rollbar.com
DataDog Monitoring
- Ports: TCP 443, 10516, 10255, 10250, UDP 123
- URLs:
trace.agent.datadoghq.com
process.datadoghq.com
agent-intake.logs.datadoghq.com
agent-http-intake.logs.datadoghq.com
orchestrator.datadoghq.com
app.datadoghq.com
*.agent.datadoghq.com (10)
The following endpoints are necessary only for installation. We recommend temporarily allowing all HTTPS traffic out to ensure a smooth installation.
- https://get.k3s.io
- https://k3s-ci-builds.s3.amazonaws.com
- https://update.k3s.io
- https://cloudsoda.teleport.sh
- https://charts.releases.teleport.dev
- https://cannery.cloudsoda.io
- https://github.com
- https://raw.githubusercontent.com
- https://api.github.com
- https://cloudsodainitscripts-usw2-p-1.s3.amazonaws.com
- https://docs.datadoghq.com/agent/guide/network/?tab=agentv6v7
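Before installation, outbound reachability of these hosts can be checked from the target machine. The script below is a minimal sketch that uses only the Python standard library. It tests whether a TCP connection can be opened and says nothing about TLS inspection or proxy behavior. The endpoint list is copied from the list above, and the function names are hypothetical.

```python
import socket
from urllib.parse import urlsplit

INSTALL_ENDPOINTS = [
    "https://get.k3s.io",
    "https://k3s-ci-builds.s3.amazonaws.com",
    "https://update.k3s.io",
    "https://cloudsoda.teleport.sh",
    "https://charts.releases.teleport.dev",
    "https://cannery.cloudsoda.io",
    "https://github.com",
    "https://raw.githubusercontent.com",
    "https://api.github.com",
    "https://cloudsodainitscripts-usw2-p-1.s3.amazonaws.com",
]


def host_and_port(url):
    """Split a URL into (hostname, port), defaulting to 443 or 80 by scheme."""
    parts = urlsplit(url)
    default = 443 if parts.scheme == "https" else 80
    return parts.hostname, parts.port or default


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(endpoints=INSTALL_ENDPOINTS):
    """Return a dict mapping each endpoint URL to its reachability."""
    return {url: can_connect(*host_and_port(url)) for url in endpoints}
```

Calling `preflight()` on the installation host returns a mapping of each URL to `True` or `False`; any `False` entry points to a firewall or DNS rule to review before installing.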