A Pipeline defines how your data will be wrangled when running a Job.
Sample data can be used when creating a Pipeline. If this sample data roughly represents the data you expect to pass through the Pipeline (in a Job), then Segna’s suggestions (how to map columns, clean data, and format data) when running a Job will be better from the start and will improve faster over time.
Sample data does NOT need to be exactly what you expect to see.
A destination for the output of the Pipeline needs to be specified during creation. Currently, Segna supports outputting data to data lakes (e.g. S3, Google Cloud Storage) as well as databases (e.g. PostgreSQL, MySQL, Athena). You can add multiple outputs for a single file, should you want to replicate data across multiple databases or data lakes. If you think a connector is missing, please reach out: it’s likely that the connector you want is currently in development or testing!
The output schema is the list of fields that you want to appear in the output data.
Fields can be added by clicking the “Add Field” button. Every field added to the Pipeline must be present in the output data of every Job that uses the Pipeline. Don’t worry: if the input data doesn’t match the Pipeline schema, Segna will help you map it to the schema!
Note that fields specified on the Pipeline cannot be removed dynamically on a Job.
Selecting “Allow for future additional fields” allows you to add additional fields dynamically to the output data of each Job.
Deselecting it ensures that the schema of the output data is fixed and always reflects what is on the Pipeline.
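The field rules above can be sketched in a few lines of Python. This is an illustration only, not Segna’s implementation: the function name `resolve_output_fields` and its parameters are hypothetical, and dropping extra fields when the option is deselected is an assumption about how a fixed schema would be enforced.

```python
from typing import List


def resolve_output_fields(
    pipeline_fields: List[str],
    job_fields: List[str],
    allow_additional: bool,
) -> List[str]:
    """Return the output field list for a Job (hypothetical helper)."""
    job_set = set(job_fields)
    # Fields defined on the Pipeline cannot be removed on a Job.
    missing = [f for f in pipeline_fields if f not in job_set]
    if missing:
        raise ValueError(f"Job is missing Pipeline fields: {missing}")

    pipeline_set = set(pipeline_fields)
    extras = [f for f in job_fields if f not in pipeline_set]
    if allow_additional:
        # "Allow for future additional fields" selected: keep Job extras.
        return list(pipeline_fields) + extras
    # Deselected: the output schema mirrors the Pipeline exactly
    # (assumption: extras are dropped rather than rejected).
    return list(pipeline_fields)
```

For example, with Pipeline fields `["id", "name"]` and Job fields `["id", "name", "notes"]`, the sketch keeps `notes` only when additional fields are allowed.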
There are 4 data types available:
- Number - floats, integers
- Datetime - dates, times, date-times
- Category - text with a limited number of unique values
- Rich text - anything else
If a field is set to the Number data type, any values that cannot be safely cast to an integer or a float are nullified.
If a field is set to the Datetime data type, any values that cannot be safely cast to a date, a time, or a date-time are nullified.
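To make the nullifying behaviour concrete, here is a minimal Python sketch. It is not Segna’s code: the helper names and the list of accepted datetime formats (`ASSUMED_FORMATS`) are assumptions chosen for illustration, and Segna’s own parsing is likely more flexible.

```python
from datetime import datetime
from typing import Optional, Union

# Hypothetical set of accepted input formats, for illustration only.
ASSUMED_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%H:%M:%S")


def to_number_or_none(value: object) -> Optional[Union[int, float]]:
    """Cast to int or float; return None when the cast is not safe."""
    text = str(value).strip()
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return None


def to_datetime_or_none(value: object) -> Optional[datetime]:
    """Parse with the assumed formats; return None when none match."""
    text = str(value).strip()
    for fmt in ASSUMED_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


numbers = [to_number_or_none(v) for v in ["42", " 3.5 ", "abc"]]
dates = [to_datetime_or_none(v) for v in ["2020-01-21", "not a date"]]
```

In this sketch, `"abc"` and `"not a date"` become `None`, while the other values are cast to their target types.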
Selecting Category as a data type instead of Rich text.
If the data type of a field is Datetime, you can select the desired output timezone in the form of a unit. The timezone of an input column in a Job is then converted to the timezone specified on the corresponding output field of the Pipeline, so conversions happen automatically and you don’t have to worry about timezones.
If the data type of a field is Datetime, you can also specify the desired datetime format, e.g. 12:34 p.m 21/Jan/2020.
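As a rough illustration of timezone conversion followed by formatting, the sketch below uses Python’s standard `datetime` module. The function `convert_and_format`, the use of a fixed UTC offset to express the output timezone, and the `strftime` pattern are all assumptions made for this example; they do not describe how Segna represents timezones or formats internally.

```python
from datetime import datetime, timedelta, timezone


def convert_and_format(value: datetime, output_offset_hours: float, fmt: str) -> str:
    """Convert an aware datetime to a fixed UTC offset, then format it."""
    if value.tzinfo is None:
        # Without a source timezone the conversion would be ambiguous.
        raise ValueError("input datetime must be timezone-aware")
    target = timezone(timedelta(hours=output_offset_hours))
    return value.astimezone(target).strftime(fmt)


# Input value recorded in UTC; the output field is assumed to use UTC+12.
source = datetime(2020, 1, 21, 0, 34, tzinfo=timezone.utc)
formatted = convert_and_format(source, 12, "%I:%M %p %d/%b/%Y")
```

With Python’s default locale, `formatted` is `12:34 PM 21/Jan/2020`. Note that `strftime` renders the meridiem as `PM` rather than `p.m`, so the exact spelling in the example above would need custom handling.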