Mapping_lineages
Function mapping_lineages (lineage_cmut,alias_df)
Function chunk_lineage(lineage_cmut,alias_df)
1. Chunks lineages and sublineages by the first character of the Pangolin
lineage string. One element of the input data frame is considered at a time;
all members of the clade to which it belongs are collected into a chunk
(temp_df) and passed to the downstream processing.
2. If no other lineage shares the same first character but the input
data frame still has entries, the lineage is mapped to itself and
stored in alias_df.
3. Else, if there are no more entries in lineage_cmut, the lineage is
mapped to itself and the function returns.
4. Else, the chunk stored in temp_df is passed to long_sublineage.
End
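
The chunking step can be sketched in R as follows. This is only a minimal illustration, not the actual function body: the column name lineage is taken from search_df$lineage used further down, and the example lineage names are made up.

```r
# Hypothetical input: one row per Pangolin lineage (values are made up).
lineage_cmut <- data.frame(
  lineage = c("B.1.1.7", "B.1", "AY.4", "B", "AY.4.2"),
  stringsAsFactors = FALSE
)

# Take the first row as the element under consideration and collect every
# lineage whose name starts with the same character into temp_df.
first_char <- substr(lineage_cmut$lineage[1], 1, 1)
in_chunk <- substr(lineage_cmut$lineage, 1, 1) == first_char

temp_df <- lineage_cmut[in_chunk, , drop = FALSE]
lineage_cmut <- lineage_cmut[!in_chunk, , drop = FALSE]

print(temp_df$lineage)       # members of the current chunk
print(lineage_cmut$lineage)  # lineages left for later rounds
```

Removing the chunk from lineage_cmut right away is one way to make the "no more entries" check in step 3 a simple nrow(lineage_cmut) == 0 test.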
Function long_sublineage(temp_df,lineage_cmut,
alias_df,alias_df_temp)
1. Finds the sublineage with the longest name (character string)
and stores it in longlineage_df.
2. If several lineages share the longest name length,
all of them are stored in longlineage_df.
3. If the lineage name is only one character long, it is the
parental lineage: it is mapped to itself, removed from
temp_df and stored in alias_df.
END
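
A minimal R sketch of the "longest name" selection, assuming that length means the number of characters in the lineage name (nchar) and that temp_df has a lineage column. The example names are made up.

```r
# Hypothetical chunk of lineages sharing the first character "B".
temp_df <- data.frame(
  lineage = c("B", "B.1", "B.1.1.7", "B.1.1.1", "B.1.177"),
  stringsAsFactors = FALSE
)

# Length of each lineage name in characters.
name_len <- nchar(temp_df$lineage)

# Keep every lineage that ties for the longest name (steps 1 and 2).
longlineage_df <- temp_df[name_len == max(name_len), , drop = FALSE]

# A one-character name marks the root (parental) lineage of the chunk (step 3).
root_lineage <- temp_df$lineage[name_len == 1]

print(longlineage_df$lineage)
print(root_lineage)
```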
Function match_merge(longlineage_df,temp_df,
lineage_cmut,alias_df,alias_df_temp)
1. Iterates through longlineage_df: forms a pattern
from the first element and looks for
neighbours in longlineage_df based on the Jaccard value
returned by find_jaccard.
   1. If neighbours are found, their mutations are
      combined (union).
   2. Checks whether these neighbours are the parental lineage of
      some other lineage in alias_df_temp.
   3. Checks whether temp_df contains a parental lineage for
      the neighbours.
   4. If a parental lineage is found and its name is longer than one
      character, the mutations of the neighbours and of the parental
      lineage are combined again (union) and stored in place of the
      parental lineage's mutations in temp_df. The neighbours are
      mapped to this parent and stored in alias_df_temp, since
      further mapping is still possible. The parent also becomes the
      parental lineage of the sublineages that had these neighbours
      as their parental lineage in alias_df_temp. The neighbours are
      then removed from longlineage_df and the loop moves on to the
      next iteration.
   5. Else, if the parental lineage name is exactly one character
      long, everything that the previous point writes to
      alias_df_temp is written to alias_df instead. The mutations
      are left unchanged, since this is the ultimate parental
      lineage and there is nothing further to map to.
   6. If no parental lineage is found, the neighbours are mapped
      to the pattern, which is the name of the neighbours without
      the last character. This pattern concatenated with x becomes
      the parental lineage of the neighbours. It also becomes the
      parental lineage of those sublineages that had the
      neighbours as their parental lineage.
2. If no neighbours are found
   1. The code directly starts looking for the parental
      lineage of the element being considered.
   2. If a parental lineage is found and its name is longer than
      one character, the element in hand is mapped to it and the
      Jaccard value is stored in alias_df_temp. The mutations of
      the element and of the parental lineage are combined and
      stored in place of the parental mutations in temp_df.
   3. Sublineages that have the lineage in hand as their parental
      lineage in alias_df_temp are remapped to the newly found
      parental lineage.
   4. If the parental lineage name is exactly one character long,
      points 2 and 3 are repeated, except that alias_df is used
      instead of alias_df_temp and the mutations are left
      unchanged.
3. If neither a parental lineage nor neighbours are found
   1. The lineage being considered is mapped to
      itself.
   2. Sublineages in alias_df_temp that have the lineage being
      considered as their parental lineage keep that mapping;
      they are transferred to alias_df with no changes.
Once longlineage_df has been fully processed, long_sublineage is called
if temp_df still has entries; otherwise chunk_lineage is called.
END
End
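
Three building blocks of match_merge (the union of mutations, the fallback parent name built from the pattern plus x, and the re-pointing of sublineages in alias_df_temp) can be sketched in R as below. The helper names, the alias column and the comma-separated mutation format are assumptions made for this illustration; they are not taken from the original code.

```r
# Assumption: mutations are stored as one comma-separated string per lineage.
split_mutations <- function(x) strsplit(x, ",", fixed = TRUE)[[1]]

# Union of any number of mutation strings, returned in the same format.
union_mutations <- function(...) {
  sets <- lapply(list(...), split_mutations)
  paste(sort(unique(unlist(sets))), collapse = ",")
}

# Fallback parent when no parental lineage exists: the neighbour name
# without its last character, followed by "x".
fallback_parent <- function(lineage) {
  paste0(substr(lineage, 1, nchar(lineage) - 1), "x")
}

# Made-up neighbours and mutation strings.
neighbours <- c("B.1.1.7", "B.1.1.1")
merged <- union_mutations("S:N501Y,S:D614G", "S:D614G,S:P681H")
new_parent <- fallback_parent(neighbours[1])

# Hypothetical temporary mapping: lineage -> current parent (alias).
alias_df_temp <- data.frame(
  lineage = c("B.1.1.7.1", "B.1.177.5"),
  alias = c("B.1.1.7", "B.1.177"),
  stringsAsFactors = FALSE
)

# Sublineages that pointed at a neighbour now point at the new parent.
alias_df_temp$alias[alias_df_temp$alias %in% neighbours] <- new_parent

print(merged)
print(alias_df_temp)
```

Here the two neighbours collapse into the placeholder parent B.1.1.x, and the sublineage that was mapped to B.1.1.7 follows it, while the unrelated mapping to B.1.177 is left alone.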
Function find_jaccard(pat,search_df,pat_mutations=0)
search_lineage_loc<-grep(pat,search_df$lineage)
1. pat_mutations==0 means the function is looking for
neighbours. Otherwise the function is overloaded
to look for the parental lineage.
2. If pat_mutations==0, length(search_lineage_loc) must be
greater than 1, i.e. there are neighbours other than the lineage in hand.
3. If no neighbours are found, the function returns
neighbours="0", jaccard_value=-1, neighbour_loc=0.
4. Apart from point 2, the function behaves the same
in both modes and returns the Jaccard value,
neighbours and neighbour_loc.
End
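
For reference, the Jaccard value of two mutation sets is the size of their intersection divided by the size of their union. The R sketch below shows that computation together with the candidate lookup. The jaccard helper, the mutations column and the example data are assumptions for illustration, not the body of find_jaccard.

```r
# Jaccard index of two character vectors treated as sets.
jaccard <- function(a, b) {
  union_size <- length(union(a, b))
  if (union_size == 0) return(0)
  length(intersect(a, b)) / union_size
}

# Made-up search table; mutations are assumed to be comma-separated strings.
search_df <- data.frame(
  lineage = c("B.1.1.7", "B.1.1.1", "B.1.177"),
  mutations = c("S:D614G,S:N501Y,S:P681H", "S:D614G,S:N501Y", "S:A222V,S:D614G"),
  stringsAsFactors = FALSE
)

pat <- "B.1.1."

# grep() treats pat as a regular expression, so "." matches any character
# and "B.1.177" is returned as well.
regex_loc <- grep(pat, search_df$lineage)

# A literal prefix match only returns the two B.1.1.* lineages.
prefix_loc <- which(startsWith(search_df$lineage, pat))

sets <- strsplit(search_df$mutations, ",", fixed = TRUE)
jaccard_value <- jaccard(sets[[1]], sets[[2]])  # 2 shared / 3 total

print(regex_loc)
print(prefix_loc)
print(jaccard_value)
```

If only literal prefixes are meant to count as neighbours, startsWith (or an escaped, anchored pattern) avoids the extra regex matches; whether that is the intended behaviour depends on the original design.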
Function Find_parental(pat,parental_df)
1. Recursively searches with the pattern
until it finds a parental lineage that satisfies
the conditions.
2. The pattern is shortened on every iteration.
End
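
The recursive shortening can be sketched in R as below. The notes above do not spell out the conditions, so this sketch assumes the simplest one (an exact name match in parental_df$lineage) and drops one character per call; the function name find_parental_sketch is hypothetical.

```r
# Recursively shorten the pattern until it equals a lineage in parental_df.
# Returns NA when the pattern is exhausted without a match.
find_parental_sketch <- function(pat, parental_df) {
  if (nchar(pat) == 0) return(NA_character_)
  hit <- parental_df$lineage[parental_df$lineage == pat]
  if (length(hit) > 0) return(hit[1])
  # Drop the last character and try again.
  find_parental_sketch(substr(pat, 1, nchar(pat) - 1), parental_df)
}

# Made-up candidate parents.
parental_df <- data.frame(
  lineage = c("B", "B.1", "B.1.177"),
  stringsAsFactors = FALSE
)

found <- find_parental_sketch("B.1.1.", parental_df)
missing <- find_parental_sketch("AY.4", parental_df)

print(found)
print(missing)
```

Starting from the pattern B.1.1., the search tries B.1.1, B.1. and then B.1, which is present, so B.1 is returned; AY.4 runs out of characters and yields NA.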
Question on the threshold